Abstract of thesis entitled

Understanding the Pathogenic Fungus Penicillium marneffei: A Computational Genomics Perspective

by James J. Cai

for the degree of Doctor of Philosophy

at The University of Hong Kong

in May 2006

Penicillium marneffei, a thermally dimorphic fungus that alternates between a filamentous and a yeast growth form in response to changes in its environmental temperature, has become an emerging fungal pathogen endemic in Southeast Asia. Defining the genomics of P. marneffei will provide a better understanding of the fungus.

This thesis reports the draft sequence of the P. marneffei genome, assembled from 6.6× coverage of the genome obtained through whole-genome shotgun sequencing. The 31 Mb genome obtained from the assembly contains 10,060 protein-coding genes. The complete mitochondrial genome is 35 kb long, and its gene content and gene order are very similar to those of Aspergillus. An annotation system and the P. marneffei genome database (PMGD) were developed to allow a preliminary annotation of the sequences and to provide an intuitive graphical interface that gives curators and users ready access to the annotation and the underlying evidence. A Matlab-based software package, MBEToolbox, was developed for data analysis in phylogenetics and comparative genomics. A well-designed and structured annotation system and powerful sequence analysis software are essential requirements for the success of large-scale genome analysis projects.

Analysis of the gene set of P. marneffei provided insights into the adaptations required by a fungus to cause disease. The genome encodes a diverse set of putative virulence genes such as proteinase, phospholipase, metacaspase and agglutinin, which may enable the fungus to adhere to, colonise and invade the host, adapt to the tissue environment, and avoid the host’s humoral and cellular defences of the innate and adaptive immune responses. A gene cluster involved in biosynthesis of melanin, a known virulence factor in some other pathogenic fungi, was also identified in the genome, indicating that P. marneffei may produce melanin or melanin-like immunosuppressive compounds that protect the fungus against immune effector cells. More interestingly, the P. marneffei genome contains more intragenic tandem repeats (IntraTRs) than those of other fungi. These IntraTRs encoding repeat domains/motifs may create quantitative variation in surface proteins, allowing the fungus to ‘disguise’ itself to slip past the vigilant defences of the host immune system. The genome sequence of P. marneffei also revealed a number of genes associated with mating processes and sexual development, suggesting an unidentified sexual cycle in the fungus.

The extent and evolutionary patterns of duplicate genes in P. marneffei and other ascomycetes were compared. All ascomycetes show a certain degree of redundancy (though its extent can vary considerably), which may provide the foundation for the specialisation of fungal genes and form the basis for fungal diversification. An inverse relationship between the lineage specificity of a gene and its evolutionary rate was also discovered, implying that an accelerated evolutionary rate may be responsible for the emergence of lineage-specific genes.

The genome sequence of P. marneffei has provided our first glimpse

into the genomic basis of the physiology of the dimorphic filamentous

fungus.

Understanding the Pathogenic Fungus Penicillium marneffei: A Computational Genomics Perspective

BY

James J. Cai

M.D., Henan Medical University, 1996

M.S., University of New South Wales, 2001

THESIS

Submitted in partial fulfillment of the requirements

for the degree of Doctor of Philosophy

at The University of Hong Kong

May 2006

To Yan

“Any living cell carries with it the experiences of a billion

years of experimentation by its ancestors.”

Max Delbrück (1949)

DECLARATION

I declare that this thesis represents my own work, except where due

acknowledgement is made, and that it has not been previously included

in a thesis, dissertation or report submitted to this University or to any

other institution for a degree, diploma or other qualifications.

Signature:

Date:

ACKNOWLEDGEMENTS

First of all, special thanks go to my principal supervisor, Professor Kwok-yung Yuen, for his enthusiasm and support during the course of my study. My heartfelt thanks to Dr. David K. Smith and Dr. Xuhua Xia, who introduced me to the fascinating world of bioinformatics and molecular evolution.

Thanks to my friends and colleagues for their moral support and technical assistance over the past four years, especially Dr. Patrick Woo, Dr. Sussana Lau, and Jade, Huang Yi, Ken, Haw, Candy, Rachel ... I am also grateful to my external mentor Dr. Gavin Huttley and fellow colleagues Peter, Ray, Helen and Brett at the Australian National University.

Finally, I am very grateful to my wife and my parents. Without

their support, this work would not have been possible.

TABLE OF CONTENTS

Declaration i

Acknowledgements ii

List of Figures x

List of Tables xii

Abbreviations xiv

Glossary xviii

Introduction 1

Chapter 1: The draft genome sequence of Penicillium

marneffei 4

1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . 5

1.2.1 General fungal biology . . . . . . . . . . . . . . . . 5

1.2.2 P. marneffei, as an important fungal pathogen . . 7

1.2.3 Penicilliosis marneffei . . . . . . . . . . . . . . . . 13

1.2.4 Fungal genome projects . . . . . . . . . . . . . . . 20

1.3 Materials and Methods . . . . . . . . . . . . . . . . . . . . 23

1.3.1 Strain and DNA preparation . . . . . . . . . . . . 23

1.3.2 Library construction, shotgun sequencing . . . . . 24

1.3.3 Sequence assembly . . . . . . . . . . . . . . . . . . 24

1.3.4 Data release . . . . . . . . . . . . . . . . . . . . . . 24

1.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

1.4.1 Assembly and general characteristic . . . . . . . . 25

1.4.2 Genome architecture and co-linearity . . . . . . . . 29

1.4.3 Gene duplications (multigene families) and com-

parisons . . . . . . . . . . . . . . . . . . . . . . . . 30

1.4.4 Interspecies proteome comparison . . . . . . . . . . 31

1.4.5 Lineage-specific genes . . . . . . . . . . . . . . . . 33

1.4.6 Cell signalling and morphogenesis . . . . . . . . . 35

1.4.7 Potential mating ability . . . . . . . . . . . . . . . 35

1.4.8 Putative virulence genes . . . . . . . . . . . . . . . 35

1.4.9 Cell wall antigens and biosynthetic genes . . . . . 35

1.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

Chapter 2: Penicillium marneffei genome database and

annotation pipeline 40

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 40


2.2.1 Methods for predicting protein function . . . . . . 42

2.2.2 Software/database systems for protein function pre-

diction . . . . . . . . . . . . . . . . . . . . . . . . . 44

2.2.3 The art of gene finding . . . . . . . . . . . . . . . . 47

2.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 50

2.3.1 Annotation pipeline . . . . . . . . . . . . . . . . . 50

2.3.2 Assembly process . . . . . . . . . . . . . . . . . . . 53

2.3.3 Gene finding . . . . . . . . . . . . . . . . . . . . . 55

2.3.4 Database and databank to store results . . . . . . 57

2.3.5 Perl source code collection . . . . . . . . . . . . . . 58

2.3.6 Genome browser configuration . . . . . . . . . . . 58

2.3.7 Synteny identification . . . . . . . . . . . . . . . . 59

2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

2.4.1 Statistics of assembly . . . . . . . . . . . . . . . . 60

2.4.2 Genome size estimation . . . . . . . . . . . . . . . 61

2.4.3 Accuracy of gene finding . . . . . . . . . . . . . . . 63

2.4.4 Combination of gene finding . . . . . . . . . . . . . 63

2.4.5 Database and databank to store results . . . . . . 65

2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

Chapter 3: Mitochondrial genome of Penicillium marn-

effei 69

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 69


3.2.1 Library construction and sequence assembly . . . . 72

3.2.2 Mitochondrial DNA sequence annotation . . . . . 72

3.2.3 Phylogenetic analysis . . . . . . . . . . . . . . . . . 73

3.2.4 Mitochondrial DNA sequences in nuclear genome . 73

3.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . 74

3.3.1 Gene content and genome organisation . . . . . . . 74

3.3.2 Protein coding genes . . . . . . . . . . . . . . . . . 74

3.3.3 Genetic code and codon usage . . . . . . . . . . . 81

3.3.4 tRNA genes . . . . . . . . . . . . . . . . . . . . . . 81

3.3.5 Other RNA genes . . . . . . . . . . . . . . . . . . 81

3.3.6 Group I introns . . . . . . . . . . . . . . . . . . . . 84

3.3.7 Mitochondrial DNA sequences in nuclear genome . 85

Chapter 4: Genomic evidence for the presence of melanin

biosynthesis gene cluster in Penicillium marn-

effei 88

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 88


4.2.1 Potential virulence factors . . . . . . . . . . . . . . 90

4.2.2 Genomic approaches in identification of virulence

factors . . . . . . . . . . . . . . . . . . . . . . . . . 95


4.3.1 Identification of melanin biosynthesis genes in P.

marneffei . . . . . . . . . . . . . . . . . . . . . . . 96

4.3.2 Multiple alignments and phylogenetic analyses . . 97


4.4.1 Melanin gene cluster present in P. marneffei . . . 97

4.4.2 Disrupted aflatoxin biosynthesis gene cluster in P.

marneffei . . . . . . . . . . . . . . . . . . . . . . . 101

4.4.3 Absence of penicillin biosynthesis genes in P. marn-

effei . . . . . . . . . . . . . . . . . . . . . . . . . . 103

Chapter 5: Mating abilities in Penicillium marneffei 105

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 105


5.2.1 Mating in hemiascomycete yeasts . . . . . . . . . . 108

5.2.2 Mating in filamentous ascomycetes . . . . . . . . . 109



5.4.1 Homologs of known sexual genes . . . . . . . . . . 114

5.4.2 Mating type genes . . . . . . . . . . . . . . . . . . 116

5.4.3 Mating pheromone precursor genes . . . . . . . . . 120

5.4.4 Mating pheromone processing genes . . . . . . . . 123

5.4.5 Mating pheromone receptor and other GPCRs . . 126

Chapter 6: Exploring the genetic components associated

with the dimorphism of Penicillium marnef-

fei 128

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 128


6.2.1 Sequence similarity . . . . . . . . . . . . . . . . . 130

6.2.2 Phylogenetic Analysis . . . . . . . . . . . . . . . . 131


6.3.1 Perception of external stimuli by cellular sensors . 132

6.3.2 Transduction of biochemical signal . . . . . . . . . 134

6.3.3 Alteration of the genomic expression . . . . . . . . 136

6.3.4 Structural reorganization towards the morphologi-

cal change . . . . . . . . . . . . . . . . . . . . . . 141

Chapter 7: Intragenic tandem repeats in Penicillium marn-

effei and other ascomycetes 144

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 144


7.2.1 Identification of coding tandem repeats . . . . . . 146

7.2.2 Sequence analysis . . . . . . . . . . . . . . . . . . . 146


Chapter 8: Extent and evolutionary pattern of duplicate

genes in Penicillium marneffei and other as-

comycetes 155

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 156



8.3.1 Sequences and gene families . . . . . . . . . . . . . 160

8.3.2 Estimation of substitution rate . . . . . . . . . . . 161

8.3.3 Relative rate test . . . . . . . . . . . . . . . . . . . 162

8.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

8.4.1 Extent of gene duplication in ascomycetes . . . . . 163

8.4.2 Age distribution of duplicate genes . . . . . . . . . 164

8.4.3 Selective constraint between paralogs . . . . . . . . 168

8.4.4 Ka/Ks between paralogs and orthologs . . . . . . 169

8.4.5 Relative evolutionary rate between paralogs . . . . 170

8.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

8.5.1 Gene duplication in ascomycetes is highly diverse . 173

8.5.2 Different selective constraints in yeasts and fila-

mentous ascomycetes . . . . . . . . . . . . . . . . . 176

8.5.3 Majority of paralogous genes evolve symmetrically 178

Chapter 9: Accelerated evolutionary rate may be responsible for the emergence of lineage-specific genes 180

9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 181



9.3.1 Sequences and data sets . . . . . . . . . . . . . . . 185

9.3.2 Identification of orthologs . . . . . . . . . . . . . . 188

9.3.3 Classification of genes into LS groups . . . . . . . 188

9.3.4 Divergence Times . . . . . . . . . . . . . . . . . . . 189

9.3.5 Estimation of substitution rates and statistical analy-

ses . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

9.3.6 Detection of rate variability across species - Rela-

tive Divergence Score (RDS) . . . . . . . . . . . . 190

9.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

9.4.1 Evolutionary rate differences among LS groups . . 191

9.4.2 Evolutionary rate-related factors of genes belong-

ing to different LS groups . . . . . . . . . . . . . . 196

9.4.3 Linear regression of divergence time and relative

divergence score (RDS) . . . . . . . . . . . . . . . 201

9.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

Chapter 10: MBEToolbox: a Matlab toolbox for sequence

data analysis in molecular biology and evo-

lution 205

10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 205


10.2.1 Probabilistic DNA substitution models . . . . . . . 206

10.2.2 Maximum likelihood estimation . . . . . . . . . . . 210

10.2.3 Elements of phylogenetic theory . . . . . . . . . . 211

10.2.4 Programs used for phylogenetic analyses . . . . . . 214

10.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 216

10.3.1 Input data and formats . . . . . . . . . . . . . . . 216

10.3.2 Sequence Manipulation and Statistics . . . . . . . 217

10.3.3 Evolutionary Distances . . . . . . . . . . . . . . . 217

10.3.4 Phylogeny Inference . . . . . . . . . . . . . . . . . 219

10.3.5 Combination of functions . . . . . . . . . . . . . . 222

10.3.6 Graphics and GUI . . . . . . . . . . . . . . . . . . 222


10.4.1 Vectorisation simplifies programming . . . . . . . . 223

10.4.2 Extensibility . . . . . . . . . . . . . . . . . . . . . 226

10.4.3 Comparison with other toolboxes . . . . . . . . . . 226

10.4.4 A novel enhanced window analysis . . . . . . . . . 227

10.4.5 Limitations . . . . . . . . . . . . . . . . . . . . . . 230

Chapter 11: Concluding remarks 231

Bibliography 234

LIST OF FIGURES

Figure Number Page

1.1 P. marneffei mould and yeast culture . . . . . . . . . . . 7

1.2 Dimorphic switching of P. marneffei . . . . . . . . . . . . 8

1.3 Phylogenetic tree showing the relationships of P. marneffei

to other fungi . . . . . . . . . . . . . . . . . . . . . . . . . 28

1.4 Microsyntenies containing pheromone precursor loci from

four fungi . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

1.5 Triple proteome comparison between P. marneffei, S. cere-

visiae and A. fumigatus . . . . . . . . . . . . . . . . . . . 32

1.6 Putative MAPK signalling pathway in P. marneffei . . . 34

2.1 Flowchart of annotation pipeline for P. marneffei genome 51

2.2 PMGD genome browser . . . . . . . . . . . . . . . . . . . 60

2.3 Database schema of PMGD . . . . . . . . . . . . . . . . . 66

3.1 Fungal respiratory pathways . . . . . . . . . . . . . . . . . 71

3.2 Physical map of P. marneffei mitochondrial DNA . . . . 75

3.3 Comparison of gene order between mitochondrial DNAs . 78

3.4 Phylogenetic distribution of group I and group II introns . 80

3.5 28 tRNAs encoded in the mitochondrial genome of P.

marneffei . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

3.6 Secondary structures of two representative group I introns 84

4.1 P. marneffei abr1 gene Cu-oxidase domain homologues . 100

4.2 Melanin gene cluster in P. marneffei and A. fumigatus . . 102

5.1 Comparison of the mating-type loci in P. marneffei and

other fungi . . . . . . . . . . . . . . . . . . . . . . . . . . 113

5.2 Comparison of the alpha1 domain of MAT proteins of filamentous ascomycetes . . . . . . . . . . . . . . . . . . . . 116

5.3 Gene organisation around the MAT locus . . . . . . . . . 117

5.4 P. marneffei biogenesis of the a-factor pheromones . . . . 121

6.1 Phylogenetic tree of fungal GPCR family genes . . . . . . 133

6.2 P. marneffei genes in cAMP pathway . . . . . . . . . . . 135

7.1 Amino acid composition in intragenic tandem repeats . . 153

8.1 Frequency distribution of Ks . . . . . . . . . . . . . . . . 166

8.2 Log-log plots of Ka vs. Ks for duplicate gene pairs . . . . 167

9.1 LS classification based on phylogenetic profiles of genes . 186

9.2 Divergence of nonsynonymous substitution rate in LS groups 192

9.3 Dependence of log gene expression level and substitution

rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

9.4 Linear regression analysis of divergence time and RDS . . 195

10.1 Relationship of GTR class DNA substitution models . . . 209

10.2 Log-likelihood of evolutionary distance . . . . . . . . . . . 221

10.3 MBEToolbox GUI . . . . . . . . . . . . . . . . . . . . . . 224

10.4 Comparison between sliding window and enhanced sliding

window methods . . . . . . . . . . . . . . . . . . . . . . . 228

LIST OF TABLES

Table Number Page

1.1 General features of the P. marneffei genome . . . . . . . 25

1.2 Comparison of genome statistics of several fungi . . . . . 27

1.3 Putative virulence genes . . . . . . . . . . . . . . . . . . . 36

1.4 Cell wall antigens and biosynthetic genes predicted in P.

marneffei . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.1 Commonly used domain databases . . . . . . . . . . . . . 48

2.2 Summary of assembly statistics . . . . . . . . . . . . . . . 61

3.1 Gene content of P. marneffei mitochondrial genome . . . 76

3.2 Codon usage in protein-coding genes of P. marneffei mi-

tochondrial genome . . . . . . . . . . . . . . . . . . . . . . 82

3.3 Presence of mitochondrial DNA fragments in nuclear genomes 85

3.4 P. marneffei mitochondrial DNA sequences present in nu-

clear genome . . . . . . . . . . . . . . . . . . . . . . . . . 86

4.1 Major dimorphic fungal pathogens . . . . . . . . . . . . . 95

4.2 Putative gene products related to melanin biosynthesis in

P. marneffei . . . . . . . . . . . . . . . . . . . . . . . . . 99

5.1 Mating strategies adopted by ascomycetous fungi . . . . . 110

5.2 Pheromone-processing enzymes encoded by the putative

P. marneffei genes . . . . . . . . . . . . . . . . . . . . . . 122

6.1 GPCR family in P. marneffei and A. nidulans . . . . . . 132

6.2 Homologous genes related to signal transduction in fila-

mentous growth . . . . . . . . . . . . . . . . . . . . . . . . 137

7.1 P. marneffei genes containing intragenic tandem repeats . 147

7.2 Comparison of genome size and base in repeats . . . . . . 152

8.1 Distribution of multigene families in fungi . . . . . . . . . 163

8.2 Large multigene families in fungi . . . . . . . . . . . . . . 165

8.3 Ka/Ks ratio for recently diverged paralogs . . . . . . . . . 169

8.4 Amino-acid substitution rates versus Ka/Ks ratios in two

copies of duplicate genes . . . . . . . . . . . . . . . . . . . 172

9.1 Genomic sequence sources . . . . . . . . . . . . . . . . . . 185

9.2 Average Ka, Ks and Ka/Ks among LS classes . . . . . . . 197

9.3 Correlation and partial correlation between LS and other

factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

9.4 Regression analyses on predicted S. cerevisiae-S. mikatae orthologs . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

ABBREVIATIONS AND SYMBOLS

aa Amino acid

AIDS Acquired Immunodeficiency Syndrome

ADHoRe Automatic Detection of Homologous Regions

BLAST Basic Local Alignment Search Tool

BLOSUM BLOcks SUbstitution Matrix

bp Base pairs

CDS Nucleotide coding sequence

DBMS Database management system

DDC Duplication-degeneration-complementation (model)

EST Expressed Sequence Tag

FASTA Fast-All (pronounced fast-aye) a program for pairwise sequence

alignment

FGI Fungal Genome Initiative

GFF ‘Gene-Finding Format’ or ‘General Feature Format’

GO Gene Ontology

GOLD Genomes OnLine Database

GPCR G Protein-Coupled Receptor

GTR General Time Reversible model

GUI Graphical User Interface

HAART Highly Active Anti-Retroviral Therapy

HMM Hidden Markov Model

HKU CC Computer Centre, University of Hong Kong

ITR Intragenic Tandem Repeat

Ka Nonsynonymous substitution rate

Ks Synonymous substitution rate

LS Lineage specificity

MAPK Mitogen-activated protein kinase

Mb Megabases

MBEToolbox Molecular biology and evolution toolbox

MCMC Markov-chain Monte Carlo

MDD Maximal dependence decomposition

MFS Major facilitator superfamily

MIPS Munich Information Center for Protein Sequences

TF Transcription Factor

TNF Tumor Necrosis factor

MIT Massachusetts Institute of Technology

MLMT Multilocus microsatellite typing system

NCBI National Center for Biotechnology Information

RDS Relative Divergence Score

ORF Open Reading Frame

PAUP* Phylogenetic Analysis Using Parsimony, *and other methods

(pronounced pop star)

PFGE Pulsed-field gel electrophoresis

PHYLIP PHYLogenetic Inference Package

PMGD P. marneffei genome database

REV General reversible process model

RIP Repeat-induced point

SAGE Serial Analysis of Gene Expression

SGD Saccharomyces Genome Database in Stanford Genomic Resources

Swiss-Prot a curated protein sequence database which strives to pro-

vide a high level of annotation (such as the description of the func-

tion of a protein, its domains structure, post-translational modi-

fications, variants, etc.), a minimal level of redundancy and high

level of integration with other databases.

TIGR The Institute for Genomic Research

TrEMBL a computer-annotated supplement of Swiss-Prot that contains

all the translations of EMBL nucleotide sequence entries not yet

integrated in Swiss-Prot.

UML Unified Modelling Language

UCSC University of California, Santa Cruz

URF unidentified reading frame

UTR Untranslated transcriptional region

WGS Whole-genome shotgun

HMG high mobility group motif

GLOSSARY

ADDITIVE TREE: A phylogenetic tree in which the distance between

any two terminal nodes is equal to the sum of the branch lengths

connecting them.

BOOTSTRAP: A statistical technique using resampling with replace-

ment.

BRANCH: The graphical representation of an evolutionary relation-

ship in a phylogenetic tree.

CODON: A triplet of adjacent nucleotides in mRNA that either codes

for an amino acid carried by a specific tRNA or specifies the ter-

mination of the translation process.

CODON USAGE: The frequency with which members of a codon family

are used in protein-coding genes.

COMPLEMENTARY DNA (CDNA): DNA synthesised from an RNA tem-

plate by the enzyme reverse transcriptase.

CONCERTED EVOLUTION: Maintenance of homogeneity of nucleotide

sequences among members of a gene family in a species, although

the nucleotide sequences change over time.

CONSENSUS SEQUENCE: A sequence that represents the most preva-

lent nucleotide or amino acid at each site in a number of homologous

sequences.

CONSERVATIVE SUBSTITUTION: The substitution of an amino acid by

another with similar chemical properties.

CONSTANT SITE OR CONSTANT REGION: A site or region within the

DNA that is occupied by the same nucleotide in all homologous

sequences under comparison.

CONVERGENCE: The independent evolution of similar genetic or phe-

notypic traits.

CONVERGENT SUBSTITUTION: The substitution of two different nu-

cleotides by the same nucleotide at the same nucleotide site in two

homologous sequences.

DETERMINISTIC PROCESS: A process, the outcome of which can be

predicted exactly from knowledge of initial conditions.

DIRECTIONAL SELECTION: A selective regime that changes the fre-

quency of an allele in a specific direction, either toward fixation or

toward elimination.

DIVERGENCE: The differences between two homologous sequences due

to the independent accumulation of genetic changes in each lineage.

DOMAIN: A well-defined region within a protein that can perform a

specific function. May not consist of a continuous stretch of amino

acids, although it almost always consists of amino acids that are

adjacent to each other as far as the tertiary structure of the protein

is concerned.

DUPLICATION: The presence or the creation of two copies of a DNA

segment in the genome.

EUKARYOTE: An organism having a true nucleus and membranous organelles. One of the three primary lines of descent in the living world.

EXON: A DNA segment of a gene, the transcript of which appears in

the mature RNA molecule.

FIXATION PROBABILITY: The probability that a particular allele will

become fixed in a population.

FIXATION TIME: The time it takes for a mutant allele to become fixed

in a population.

FLANKING SEQUENCE: Untranscribed sequences at the 5’ or 3’ termi-

nal of transcribed genes.

FOURFOLD DEGENERATE SITE: A nucleotide site within a codon at

which all possible substitutions are synonymous. For example, in

the codon CCT, the third site is fourfold degenerate because CCT,

CCC, CCA and CCG are all codons for proline.
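The CCT example above can be checked mechanically. The following is a minimal sketch, assuming the standard nuclear genetic code (mitochondrial codes differ); the function name `is_fourfold_degenerate` is hypothetical and is not part of MBEToolbox.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons are enumerated in TCAG order
# (first base varies slowest, third base fastest). '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_fourfold_degenerate(codon: str, position: int) -> bool:
    """Return True if all four bases at `position` (0-based) give the same amino acid."""
    encoded = {
        CODON_TABLE[codon[:position] + base + codon[position + 1:]]
        for base in BASES
    }
    return len(encoded) == 1


print(is_fourfold_degenerate("CCT", 2))  # True: CCT, CCC, CCA, CCG all encode proline
print(is_fourfold_degenerate("CCT", 0))  # False: TCT, CCT, ACT, GCT encode different amino acids
```

The same table could be swapped for a mitochondrial code by changing only the `AMINO_ACIDS` string.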

FUNCTIONAL CONSTRAINT (SELECTIVE CONSTRAINT): The degree of

intolerance characteristic of a site or a locus toward nucleotide sub-

stitutions.

GENE CONVERSION: A nonreciprocal recombination process resulting

in a sequence becoming identical with another.

GENE DIVERSITY: A measure of genetic variability in a population.

The mean expected heterozygosity per locus in a population.

GENE DUPLICATION: Generally, the production of two copies of a

DNA sequence. Specifically, the duplication of an entire gene se-

quence.

GENETIC DISTANCE: Broadly, any of several measures of the degree

of genetic difference between individuals, populations, or species.

In reference to molecular evolution, a measure of the number of

nucleotide substitutions per nucleotide site between two homolo-

gous DNA sequences that have accumulated since the divergence

between the sequences.

INFERRED TREE: A phylogenetic tree based on empirical data per-

taining to extant taxa.

INFORMATIVE SITE (DIAGNOSTIC POSITION): A site that is used to

choose the most-parsimonious tree from among all the possible phy-

logenetic trees. In molecular evolution, a site where there are at

least two different kinds of nucleotides or amino acids, and each of

them is represented in at least two sequences.
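The molecular definition above translates directly into a column-wise test. This is an illustrative sketch only, assuming an ungapped alignment of equal-length sequences; the helper names `is_informative` and `informative_sites` are hypothetical.

```python
from collections import Counter


def is_informative(column: str) -> bool:
    """True if at least two different states each occur in at least two sequences."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def informative_sites(alignment: list[str]) -> list[int]:
    """Return the 0-based indices of informative columns in an ungapped alignment."""
    return [
        i for i, column in enumerate(zip(*alignment))
        if is_informative("".join(column))
    ]


alignment = ["AAGT", "AAGC", "ATCA", "ATCG"]
print(informative_sites(alignment))  # [1, 2]
```

In this toy alignment the first column is constant and the last column has four singleton states, so neither helps to discriminate between trees.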

LIKELIHOOD RATIO TEST: A statistical test of the goodness-of-fit be-

tween two models. A relatively more complex model is compared

to a simpler model to see if it fits a particular dataset significantly

better.

LINEAGE: A linear evolutionary sequence from an ancestral species

through all intermediate species to a particular descendant species.

MAXIMUM LIKELIHOOD: A statistical procedure of finding the value

of one or more parameters for a given statistic which makes the

known likelihood distribution a maximum.

ORTHOLOGOUS LOCUS: A gene that has evolved directly from an ancestral locus. Homologous genes are genes that share a common evolutionary ancestor.

PARALOGOUS LOCUS: A gene that originated by duplication and then

diverged from the parent copy by mutation and selection or drift.

PATTERN OF SUBSTITUTION (SUBSTITUTION SCHEME): The relative fre-

quency with which a nucleotide or an amino acid changes into an-

other during evolution.

POSITIVE SELECTION: Selection for an advantageous mutant allele.

POSTERIOR PROBABILITY: The probability of a parameter value in-

ferred from an analysis.

RELATIVE-RATE TEST: A calibration-free test for checking the constancy of the rate of nucleotide substitutions in different lineages during their evolution, thus determining whether or not the molecular clock operates at the same rate among different lineages.

ROOTED TREE: A phylogenetic tree that specifies ancestral and de-

scendant species, thus indicating the direction of the evolutionary

path.

SENSE CODON: A codon specifying an amino acid.

SEQUENCE DIVERGENCE (DIVERGENCE): The differences between two

homologous sequences due to the independent accumulation of ge-

netic changes in each lineage.

STOCHASTIC PROCESS: A process, the outcome of which cannot be

predicted exactly from knowledge of initial conditions. However,

given the initial conditions, each of the possible outcomes of the

process can be assigned a certain probability.

SYNTENY: A pair of genomes in which at least some of the genes are

located at similar map positions.

TANDEM DUPLICATION: A duplication, the products of which reside

in close proximity to each other on the chromosome.

TRANSITION: The substitution of a purine for a purine or a pyrimidine

for a pyrimidine.

TRANSVERSION: The substitution of a purine for a pyrimidine or vice

versa.
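The last two definitions can be illustrated by counting both kinds of change between two aligned sequences. This is a minimal sketch, assuming equal-length DNA sequences; positions holding anything other than A, C, G or T are skipped, and the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a: str, b: str) -> str:
    """Classify a pair of aligned bases as identical, a transition or a transversion."""
    if a == b:
        return "identical"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_substitutions(seq1: str, seq2: str) -> dict[str, int]:
    """Count transitions and transversions between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        kind = classify(a, b)
        if kind != "identical":
            counts[kind] += 1
    return counts


print(count_substitutions("ACGTAC", "GCTTAT"))
# {'transition': 2, 'transversion': 1}
```

Counts like these are raw observed differences; they are not corrected for multiple substitutions at the same site.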

INTRODUCTION

Penicillium marneffei is a dimorphic fungus that intracellularly infects the reticuloendothelial system of humans and bamboo rats. Endemic in Southeast Asia, it infects 10% of AIDS patients in this region [365, 201, 182, 50, 348, 350]. Complete genome sequencing of various organisms has accelerated rapidly in recent years, offering another path to gene discovery. This thesis presents the sequence of the P. marneffei genome, together with related studies from the perspectives of comparative and evolutionary genomics. These studies will throw light on the molecular mechanisms of virulence of this important pathogenic fungus.

Chapter 1 gives an overview of the P. marneffei genome, including sequence statistics, gene content and prediction of gene function. Chapter 2 describes the organisation and implementation of the genome database of the P. marneffei genome project. The complete mitochondrial genome of P. marneffei is reported in Chapter 3. The gene content and gene order of the P. marneffei mitochondrial genome are highly similar to those of Aspergillus, further confirming their close phylogenetic relationship. This provides the basis for comparative genomic studies between P. marneffei and Aspergillus species.

This is followed by Chapter 4, which reports the presence of an important virulence gene cluster, the melanin biosynthesis gene cluster, in the P. marneffei genome. Since melanin is a highly toxic natural product produced by some species of Aspergillus that are phylogenetically close to P. marneffei, this finding is also valuable in revealing the evolutionary origin of this gene cluster.

Mating of P. marneffei has not yet been observed in nature or under defined laboratory conditions. The lack of a sexual stage impairs the utility of experimental fungal genetics. By using genome sequence information, however, we found evidence of the potential mating ability of P. marneffei (Chapter 5). It suggests that P. marneffei, like other pathogenic fungi, may limit access to the sexual cycle to generate a population structure that is in part clonal but which retains the ability to undergo a sexual cycle in response to challenging conditions in the environment or in the host. Chapter 6 offers a systematic exploration of genetic components in the P. marneffei genome that may be responsible for morphogenetic processes, mainly through sequence analysis in the context of comparative genomics. Chapter 7 reports an interesting phenomenon: tandemly repeated DNA sequences occur frequently in the genome of P. marneffei, not only in noncoding regions but also in protein-coding (i.e. intragenic) regions. These highly dynamic genomic components provide a clue to how the pathogenic fungus adapts to the host immune system.

Chapter 8 introduces a systematic test of the extent of duplicate

genes in major ascomycetes. We observed significant variation within

ascomycetes in the extent of gene duplication. The age distribution of gene

duplications tentatively suggests that the P. marneffei genome has

experienced large-scale duplication twice. We argue that different extents

and evolutionary patterns of duplicate genes in ascomycetes might be

associated with the great genotypic and phenotypic differences among

ascomycetes. Chapter 9 tackles the question of the origin of species-specific

genes. A statistically significant correlation between accelerated

evolutionary rate and the degree of lineage specificity is confirmed. This

correlation is independent of many confounding factors, like gene essen-

tiality and expression level. This finding helps to explain the origin of

P. marneffei-specific genes, which account for about one third of all P. marneffei

genes.

Finally, Chapter 10 introduces the software package, developed in a

high-performance scientific computer language, for sequence data manipulation

and analysis, which was used successfully throughout the

whole genome project.

Publications arising from this thesis are:

1. Cai JJ, Liu B, Woo PC, Lau SKP, Wong SS, Zhen H, Yuen KY (In

preparation) Genomic evidence for the presence of melanin biosyn-

thesis gene cluster in the thermal dimorphic fungus Penicillium

marneffei

2. Cai JJ, Woo PCY, Lau SKP, Smith DK and Yuen KY (2006) Ac-

celerated evolutionary rate may be responsible for the emergence

of lineage-specific genes in Ascomycota. Journal of Molecular Evolution,

in press

3. Cai JJ, Smith DK, Xia X and Yuen KY (2005) MBEToolbox: a

MATLAB™ toolbox for sequence data analysis in molecular biology

and evolution. BMC Bioinformatics, 6:64

4. Woo PC, Zhen H, Cai JJ, Yu J, Lau SKP, Wang J, Teng JLL,

Wong SS, Tse RH, Chen R, Yang H, Liu B and Yuen KY (2003) The

mitochondrial genome of the thermal dimorphic fungus Penicillium

marneffei is more closely related to those of molds than yeasts.

FEBS Letters, 555 (3): 469-77

5. Yuen KY, Pascal G, Wong SS, Glaser P, Woo PC, Kunst F, Cai JJ,

Cheung EY, Medigue C, Danchin A (2003) Exploring the Penicil-

lium marneffei genome. Archives of Microbiology, 179 (5): 339-53

I have tried to explicitly acknowledge where the other authors’ ideas

have contributed significantly to the present work.

Chapter 1

THE DRAFT GENOME SEQUENCE OF

PENICILLIUM MARNEFFEI

This chapter describes basic features of the genome of Penicillium

marneffei, such as genome assembly, gene content and some comparative

results, in an attempt to give an overall impression of the genome. More

detailed and complete analyses of some topics may be found in the

corresponding chapters.

1.1 Introduction

Although fungi pose little threat to people with healthy immune systems,

they can cause fatal infections in immunocompromised individuals.

Penicillium marneffei is the most important thermal dimorphic fungus

causing respiratory, skin and systemic mycosis in Southeast Asia [365,

201, 182, 50, 348, 350]. It was discovered in 1956 in hepatic abscesses of the

Chinese bamboo rat Rhizomys sinensis, and only 18 cases of human disease

were reported (in HIV-negative patients) until 1985 [66]. The appearance

of the HIV pandemic, especially in Southeast Asian countries, saw the

emergence of the infection as an important opportunistic mycosis in this

group of immunocompromised patients. About 10% of AIDS patients in

Hong Kong are infected with P. marneffei [346]. In northern Thailand,

penicilliosis is the third most common indicator disease of AIDS following

tuberculosis and cryptococcosis [300].

Genome sequencing of P. marneffei will increase our understanding of the

molecular biology and biochemical mechanisms underlying the pathogenicity of

this fungus. Despite its medical importance and its unusual thermal di-

morphism, our understanding of gene organisation in P. marneffei was

limited. To my knowledge, only one cell wall mannoprotein gene has

been characterised and successfully used in serodiagnosis and prevention

of this infection [38,37,347]. As a ‘pilot study’ for this genome project, an

analysis of 2303 random sequence tags was performed [364],

which laid the foundation for the complete genomic sequencing

project of this fungus. In 2002, the complete genome sequencing project

of P. marneffei was initiated, and we now have approximately 6.6× coverage

of the genome, which includes a contig that contains the complete

sequence of the mitochondrial genome. The sequencing of its genome

paves the way for the development of novel methods for detecting, pre-

venting and treating this infection.

1.2 Literature Review

In this section I will first recap some basic concepts and terminologies

in fungal biology, and then review some clinical aspects, including the

diagnosis and management of P. marneffei infection. Finally, I will give

a survey of the recent advances in fungal genome projects.

1.2.1 General fungal biology

Fungi are a large and diverse group of eukaryotes characterised by their

absorptive mode of nutrition, i.e., digesting food outside of their bodies.

Modern taxonomists place fungi in their own kingdom, on equal footing

with plants and animals, sometimes called “The Fifth Kingdom”. They

include moulds, yeasts, and mushrooms. Most fungi are multicellular,

but some, the yeasts, are simple unicellular organisms. Fungi are plastic,

having a diversity of forms which influence the manner of function, and

a range of dispersal mechanisms enabling various approaches to survival

over time. Nevertheless, diverse fungi have some basic structures in

common.

A fungal organism consists of a mass of threadlike filaments called

hyphae, which combine to make up the fungal mycelium. Each hypha is

composed of a chain of fungal cells, a continuous cytoplasm with many

nuclei. The hypha is surrounded by a plasma membrane and a cell wall

made of the polysaccharide chitin. The hyphae in a fungus branch off from one

another to form the mycelium, and are all ultimately connected to the

original hypha. Septa are barriers across the filament. Septa form in all

filamentous fungi, either adventitiously or, in most members of the

Ascomycota and Basidiomycota, at regular intervals along the hypha.

Different methods of reproduction have been adopted by different types

of fungus. For example, yeasts reproduce mitotically, while moulds have

much more complex life cycles involving distinct phases, including diploid

and haploid phases.

Fungi are often directly involved in our lives. Some fungi are in-

deed parasitic, and cause devastating plant infections. Serious agricul-

tural pests, parasitic fungi such as the rusts and the smuts can ruin

entire crops, especially affecting cereals such as wheat and corn. Only

about 50 species are known to harm animals. Many medical applications

of fungi have been discovered, of which antibiotic production by fungi

is the most important. The first among these antibiotics is penicillin,

possibly the most important non-genetic medical breakthrough of the last

century. Approximately 75% of all described fungi belong to the

Ascomycota. Among them are some famous ones, such as Saccharomyces

cerevisiae, the baker’s yeast; Penicillium chrysogenum, the producer of

penicillin; Neurospora crassa, the “one-gene-one-enzyme” organism;

Aspergillus flavus, the producer of aflatoxin; and Candida albicans, the cause of

thrush.

Figure 1.1: P. marneffei mould (A) and yeast (B) culture. Courtesy of Prof. KY Yuen, Microbiology, HKU.

1.2.2 P. marneffei, as an important fungal pathogen

Mycology

The fungus grows well on Sabouraud dextrose agar. When grown

at 25°C, the fresh culture appears similar to other Penicillium species,

with rapidly growing greenish-silver mycelial colonies. The reverse side

is usually of a beige colour. One of the most characteristic features is the

production of a soluble red pigment that diffuses into the medium. Of all

the Penicillium species, only P. marneffei, P. citrinum, P. janthinellum,

P. purpurogenum, and P. rubrum produce diffusible red pigments. The

other Penicillium species are generally not associated with human infec-

tions nor do they display dimorphism. In contrast to a room temperature

culture, the fungus assumes a yeast form at 37°C, whether in culture or

in vivo. Colonies at 37°C are glabrous and beige-coloured and do not

produce any red pigment (Fig. 1.1). The dimorphic growth of the fungus,

as a yeast-like fungus at 37°C and as a mould in culture at temperatures

below 30°C, is illustrated in Fig. 1.2.

Microscopically, the mycelial form resembles other Penicillium species

with conidiophores bearing biverticillate penicilli, each penicillus being

composed of four to five metulae with smooth-walled conidia. The

Figure 1.2: Dimorphic switching of P. marneffei. The diagram was obtained from the website of the Department of Genetics, University of Melbourne.

yeast forms are ovoid or elongated measuring 2–3 µm × 2–6.5 µm. Sim-

ilar forms are also observed in tissue samples obtained from patients,

which may be seen within macrophages or extracellularly. In contrast to

other yeasts, the yeast cells of P. marneffei divide not by budding, but

by fission, with the result that a transverse septum is often seen in the di-

viding cell. This helps to differentiate P. marneffei from other dimorphic

fungi in histological sections, especially Histoplasma capsulatum.

Ecology and epidemiology

P. marneffei is geographically restricted to Southeast Asia. Cases

have been reported mostly from northern Thailand, southwestern China

(e.g., around the Guangxi Province), Hong Kong, Taiwan, Singapore,

Malaysia, and the Philippines.

The ecology and possible environmental reservoirs of P. marneffei were

first investigated in 1986 by Deng et al. [67]. In the Guangxi Province

of the People’s Republic of China, it was found that P. marneffei

could be isolated from the internal organs of 18 out of 19 bamboo rats

belonging to the species Rhizomys pruinosus. The findings of Deng et al.

were confirmed by a subsequent study by Li et al. [195], in which Rhizomys

pruinosus senex bamboo rats in the Guangxi Province were studied: 93.1%

of the wild bamboo rats carried P. marneffei in the internal organs. The

fungus was most commonly isolated from the lungs (87.5%), followed by

the liver (56.3%), spleen (56.3%) and mesenteric lymph nodes (50%).

The association between P. marneffei and bamboo rats has also been

noted in Thailand, another country where the infection is endemic. In two

studies by Ajello et al. [3] and Chariyalertsak et al. [47], P. marneffei

was recovered from various species of bamboo rats, including Cannomys

badius, Rhizomys pruinosus, and R. sumatrensis. The distribution of the

fungus in the internal organs was similar to previous studies, with the

highest prevalence in the lungs followed by the liver.

The consistency of these findings suggests that inhalation of the (pre-

sumably) infective conidia could be an important mode of transmission.

The occurrence of the fungus in the liver could be a result of the propen-

sity of the fungus to invade the reticuloendothelial system. It has been

suggested that bamboo rats, like human victims, probably acquired the

infection from a common environmental source. The possible link to en-

vironmental factors is demonstrated by two studies from northern Thai-

land which showed a significant clustering of cases of penicilliosis marn-

effei during the rainy season [45,46]. A recent history of occupational or

other forms of exposure to soil is also a significant risk factor. Importantly,

exposure to or consumption of bamboo rats was not a risk factor

for infection. The exact mode of transmission of the fungus and its natural

habitat remain unsettled.

Although P. marneffei is a naturally occurring sylvatic infection in

a high proportion of bamboo rat species [67], it is not known whether

bamboo rats are (1) an obligate stage in P. marneffei’s life cycle or (2) a

zoonotic focus for human infection. Furthermore, it is not known whether

all lineages of P. marneffei are equally infectious to bamboo rats and hu-

mans or rather represent a subset of a wider, more genetically diverse

population. In order to address these questions, four groups of investiga-

tors reported the use of various molecular typing techniques in the differ-

entiation of P. marneffei strains. Vanittanakom et al. [323] first reported

in 1996 the use of restriction endonuclease analysis for epidemiological

typing of strains isolated in Thailand. Hsueh et al. noted an increase

in the incidence of P. marneffei infection in Taiwan in the 1990s [134].

Antifungal susceptibility, chromosomal DNA restriction fragment-length

polymorphism types, and randomly amplified polymorphic DNA patterns

recognised 8 strain types out of 20 isolates. Trewatcharegon et al., on

the other hand, used pulsed-field gel electrophoresis (PFGE) with NotI

digestion for strain differentiation [316]. Fisher et al. [88] used a multilocus

microsatellite typing (MLMT) system, an accurate and reproducible

method of characterising the genetic diversity of eukaryotic pathogens that

have low levels of genetic variation. They observed high genetic diversity

and extensive spatial structure among clinical isolates, revealing

spatially structured P. marneffei populations [88]. In a further study, again

based on MLMT results, Fisher et al. [89] showed that different

clones of the fungus are found in different environments, while all the samples

from any given location were genetically very similar. This led them to

the conclusion that the fungus becomes highly adapted to its local

environment, making it highly successful there, but stopping it spreading

to other areas. This may explain why P. marneffei is endemic only to a relatively

small area of Southeast Asia.

Immunobiology

As with most other pathogens, the availability of iron is crucial to the survival

of P. marneffei in the human host. Studies by Taramelli et al. showed

that the antifungal activity of macrophages is markedly suppressed in the

presence of iron overload and that iron chelators inhibit the extracellular

growth of P. marneffei [306].

The route of transmission and infection of P. marneffei is unknown at

the moment. However, it is generally believed that inhalation of the coni-

dia is a likely route, in line with the mode of infection for other moulds.

The attachment of P. marneffei conidia to host cells and tissues is the

first step in the establishment of an infection. The conidia-host interaction

may occur via adhesion to the extracellular matrix proteins laminin

and fibronectin via a sialic acid-dependent process. Using immunofluores-

cence microscopy, Hamilton et al. demonstrated that fibronectin binds to

the conidia surface and to phialides, but not to hyphae [122]. The inves-

tigators suggested that there could be a common receptor for the binding

of fibronectin and laminin on the surface of P. marneffei [123,122].

The interaction between human leukocytes and heat-killed yeast-phase

P. marneffei has been studied by Rongrungruang et al. [269]. Their data

suggested that monocyte-derived macrophages phagocytose P. marneffei

even in the absence of opsonisation and the major receptor(s) recognising

P. marneffei could be a glycoprotein with N-acetyl-beta-D-glucosaminyl

groups. P. marneffei stimulates the respiratory burst of macrophages

regardless of whether opsonins are present, but tumour necrosis factor-α

secretion is stimulated only in the presence of opsonins. The authors thus

speculated that the ability of unopsonised fungal cells to infect mononu-

clear phagocytes in the absence of TNF-α production is a possible viru-

lence mechanism.

Although P. marneffei is capable of infecting and replicating inside

mononuclear macrophages, it is also evident that macrophages do possess

antifungal activities. The fungicidal activity of macrophages is likely to

involve the generation of reactive nitrogen intermediates, as described

by Kudeken et al. [180]. In addition to macrophages, the neutrophils

also exhibit antifungal properties. The fungicidal activity of neutrophils

is significantly increased in the presence of proinflammatory cytokines,

especially GM-CSF, G-CSF and IFN-γ. In addition to GM-CSF, G-CSF

and IFN-γ, other cytokines such as TNF-α and IL-8 are capable of en-

hancing the neutrophil’s inhibitory effects on germination of P. marneffei

conidia. The strongest effect was observed with GM-CSF [179]. Coni-

dia are, however, generally not susceptible to killing by phagocytes. The

fungicidal activity exhibited by neutrophils is believed to be independent

of superoxide anion and to act through exocytosis of granular enzymes [181].

Recently, Koguchi et al. demonstrated that osteopontin (secreted by

monocytes) could be involved in IL-12 production by peripheral blood

mononuclear cells during infection by P. marneffei, and the production

of osteopontin is also regulated by GM-CSF [171]. It is also likely that

the mannose receptor is involved as a signal-transducing receptor for trig-

gering the secretion of osteopontin by P. marneffei-stimulated peripheral

blood mononuclear cells.

Molecular biology

The mechanism of thermal dimorphism and morphogenesis in P. marneffei

is not fully understood. However, studies by Borneman et al. have started to

provide important information in this area [18,19]. It was shown that the

homologue of the Aspergillus nidulans abaA gene is involved in the regulation

of the cell cycle and morphogenesis in P. marneffei [18]. An STE12

homologue of P. marneffei (stlA gene) was subsequently shown to be able

to complement the sexual defect of an A. nidulans steA mutant [19]. A

hitherto unknown sexual stage of P. marneffei is therefore postulated to

be present.

Other genes which are involved in the growth and development of

P. marneffei have been described recently. A CDC42 homologue (cflA

gene) was shown to be required for polarisation and determination of cor-

rect cell shape during yeast-like growth, and for the separation of yeast

cells [22]. Deletion of the homologue of the Aspergillus nidulans stuA gene in

P. marneffei showed that the gene is required for metula and phialide for-

mation during conidiation but is not required for dimorphic growth [20].

No vaccine is currently available for P. marneffei. Some recent studies

showed that vaccine development is potentially feasible. The P. marnef-

fei mannoprotein Mp1p (encoded by the MP1 gene) has been tested in a

mouse model as a potential vaccine candidate [347]. The relative efficacies

of an intramuscular MP1 DNA vaccine, an oral mucosal MP1 DNA vaccine using

a live-attenuated Salmonella typhimurium carrier, and an intraperitoneal

recombinant Mp1p protein vaccine were compared. The intramuscular MP1

DNA vaccine appeared to give the best protection against P. marneffei.

1.2.3 Penicilliosis marneffei

Clinical features

Penicilliosis marneffei manifests clinically as a progressive systemic febrile

illness as a result of infiltration and inflammation of the reticuloendothe-

lial system by the yeast stage of P. marneffei. Common clinical fea-

tures include systemic symptoms of fever, weight loss, anaemia, and those

due to local organ involvement such as pulmonary syndrome, chest radi-

ographic infiltrate, lymphadenopathy, hepatosplenomegaly, molluscum-

contagiosum-like skin lesions, osteolytic bone lesions, arthritis, subcuta-

neous abscesses and even endophthalmitis. Almost all organs could be

involved in severe disseminated disease.

In immunocompetent hosts, the tissue damage is mainly associated

with granulomatous inflammation with multinucleated giant cells, lym-

phocytes, and neutrophils. A suppurative inflammation dominated by

neutrophils resulting in abscess formation can be present. In immuno-

suppressed hosts, an anergic and necrotising reaction is found with diffuse

infiltration of macrophages engorged with yeast cells.

Underlying immunosuppression could be found in 80% of penicilliosis

patients. The commonest underlying disease is AIDS. P. marneffei is

second only to Cryptococcus neoformans as the commonest opportunis-

tic fungal pathogen in AIDS patients in Southeast Asian countries like

Thailand.

Infections in non-HIV-infected patients have also been described, pri-

marily among immunocompromised patients and less frequently in pa-

tients without any known underlying diseases. Reported cases of non-

HIV-associated penicilliosis marneffei had occurred in patients with al-

coholism, tuberculosis, systemic lupus erythematosus, patients receiving

corticosteroid or other forms of immunosuppressive therapy, and even

patients without any apparent underlying disease. Manifestations of the

infection included lymphadenopathy, osteomyelitis and septic arthritis,

pulmonary infection, and disseminated infection with multi-organ in-

volvement.

A comparison of the clinical manifestations of penicilliosis in HIV-positive

and HIV-negative patients has been published recently [349]. Of the 15

patients who had culture-documented P. marneffei infection, 8 (53.3%)

were HIV positive and 7 (46.7%) were HIV negative. The HIV-infected

patients had a higher incidence of fungaemia than

the non-HIV-infected patients (50% vs. 28.6%) while the latter group fre-

quently required tissue biopsies for confirmation of the infection. There

was a significant delay in establishing the diagnosis in non-HIV-infected

patients when compared with HIV-infected patients (median delay of 5.5

weeks vs. 1 week, P < 0.01). Most of the non-HIV patients (85.7%)

had underlying immunocompromising conditions including haematological

malignancies and autoimmune diseases requiring the use of corticosteroids

or cytotoxic chemotherapy, as well as diabetes mellitus. In both

categories, pulmonary involvement was the commonest manifestation on

initial presentation, followed by pyrexia of unknown origin and cutaneous

manifestation.

Diagnosis

Fungal culture The infection itself is relatively amenable to antifun-

gal therapy and a cure is potentially possible. Early recognition of the

infection is therefore essential for timely initiation of effective therapy.

Conventional fungal culture remains the diagnostic test of choice in

most settings. The fungus may be cultivated from appropriate clinical

specimens in most cases, such as blood cultures, skin lesions, and respiratory

tract specimens. In AIDS patients with high levels of fungaemia,

it has been occasionally reported that a direct smear of the peripheral

blood may reveal the fungus. In HIV-positive patients, fungaemia could

be detected in at least 55% of the patients in previous reports.

Unfortunately, fungal culture suffers from the drawbacks of a long

turnaround time and the occasional need for invasive tissue biopsies

to obtain a satisfactory specimen. In a series of HIV-infected patients

from Hong Kong, 50% of them had documented fungaemia [349].

The yeast form of P. marneffei may be stained by the methenamine

silver or periodic acid-Schiff stains in tissue sections. When the cen-

tral septation of the yeast cell is seen in the histopathological section,

this offers clues to the diagnosis of penicilliosis. Pierard et al. reported

that the monoclonal antibody EB-A1 against the galactomannan of Aspergillus

species may also be used to detect P. marneffei in formalin-fixed,

paraffin-embedded tissues [249].

Serology A number of studies have aimed at detecting fungal antibodies

and/or antigens in the serum and body fluids of infected patients. In

earlier studies, culture filtrates or whole cell extracts were used as

antigens. P. marneffei was cultured in liquid media, and the culture fil-

trate was concentrated to immunise rabbits. The culture filtrate and the

anti-P. marneffei rabbit sera were incorporated in an immunodiffusion

test to detect antibody or antigens respectively [277,333,144].

In 1994, an indirect immunofluorescent antibody test for serodiagnosis

of P. marneffei infection was reported, using the yeast-hyphae (represent-

ing tissue multiplication phase) or the germinating conidia (representing

initial tissue invasion phase) as antigens [365]. None of the eight sera

from culture-documented patients tested at 1 : 10 dilution gave a posi-

tive result for IgM. High IgG titres (of the respective phases, geometric

mean 1 : 905 and 1 : 1280) were found in all eight penicilliosis marneffei

patients, in contrast to that obtained from 78 healthy controls (with a

respective geometric mean of 1 : 1.34 and 1 : 2.14). Sera from patients

with cryptococcosis (n = 2) or candidaemia (n = 2) did not show cross-

reactivity (IgG titre < 1 : 40, which is similar to that of the healthy con-

trols). Overall, the IgG titre was higher than IgA for the cases but there

was little difference in using the germinating conidia or the yeast-hyphae

form as the testing antigen. Moreover, IgA could not be detected in two

out of eight positive cases. Three HIV patients with culture-documented

penicilliosis marneffei tested positive (IgG titres 1 : 80 to 1 : 160).

An IgG titre > 1 : 80 is suggestive of penicilliosis marneffei.
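
As an illustration of the serological figures above, a geometric mean titre is the average of reciprocal dilutions taken on a logarithmic scale, which suits a doubling dilution series. The following Python sketch uses hypothetical titres (not the study data); the function names and the cut-off argument are assumptions made for illustration, and only the 1 : 80 threshold comes from the text.

```python
from statistics import geometric_mean

# Hypothetical reciprocal IgG titres (1 : 640 is stored as 640).
# Illustrative values only; they are not taken from the study.
example_titres = [640, 1280, 1280, 2560]


def geometric_mean_titre(reciprocal_titres):
    """Geometric mean of reciprocal titres, i.e. exp(mean(log(titre)))."""
    return geometric_mean(reciprocal_titres)


def is_suggestive(reciprocal_titre, cutoff=80):
    """Apply the 'IgG titre > 1 : 80' rule quoted in the text."""
    return reciprocal_titre > cutoff


gmt = geometric_mean_titre(example_titres)
print(f"Geometric mean titre: 1 : {gmt:.0f}")
print("1 : 160 suggestive?", is_suggestive(160))
print("1 : 80 suggestive?", is_suggestive(80))
```

Because the rule is a strict inequality, a titre of exactly 1 : 80 is not flagged in this sketch.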

In 1996 Kaufman et al. developed a latex agglutination test to detect

antigenaemia, where polystyrene beads were coated with rabbit anti-P.

marneffei globulin, obtained from rabbits immunised with yeast culture

filtrate [160]. Of the 17 P. marneffei culture-positive HIV patients,

77% tested positive.

Desakorn et al. later used purified hyperimmune IgG, from rabbits

immunised with yeast cells, in an enzyme-linked immunosorbent assay

(ELISA) to quantitate P. marneffei yeast antigens in urine samples [69].

All urine samples from 33 P. marneffei culture-positive HIV patients

tested positive, with a median titre of 1 : 20.

Jeavons et al. characterised and purified three cytoplasmic yeast anti-

gens of 50-, 54- and 61-kDa, which were found respectively in 48, 71

and 85% of serum samples from 21 P. marneffei culture-positive pa-

tients [146]. Chongtrakool et al. isolated a 38-kDa antigen partially-

purified from yeast culture filtrate, where 45% of P. marneffei culture-

positive HIV patients (n = 51), 17% of HIV positive asymptomatic pa-

tients (n = 262) and 25% of other fungal culture-positive HIV patients

(n = 67) had developed antibodies against this antigen [54].

PCR The detection of P. marneffei genomic DNA in clinical specimens

has also been reported. LoBuglio and Taylor used primers PM2

and PM4 to amplify a 347 bp fragment of the internal transcribed spacer

region between 18S rDNA and 5.8S rDNA [202]. On the other hand

Vanittanakom et al. used a PCR-Southern hybridisation format, where

primers RRF1 and RRH1 were used to amplify a 631 bp fragment of

the 18S rDNA, followed by hybridisation with a P. marneffei -specific 15-

oligonucleotide probe [324]. Recently Vanittanakom et al. described a

nested PCR assay which might prove useful in the detection of P. marn-

effei and identification of young fungal cultures [325].

Mp1p The first gene cloned from P. marneffei was the MP1 gene [37].

Serum from guinea pigs immunised with P. marneffei yeast cells was used

to screen the cDNA library of P. marneffei. The MP1 gene, which encodes

an abundant antigenic cell wall mannoprotein in P. marneffei, was

subsequently cloned. MP1 is a unique gene without homologues in

sequence databases. It codes for a protein, Mp1p, of 462 amino acid

residues, with a few sequence features that are present in several cell wall

proteins of Saccharomyces cerevisiae and Candida albicans. It contains

two putative N-glycosylation sites, a serine- and threonine-rich region for

O-glycosylation, a signal peptide, and a putative glycosylphosphatidyli-

nositol attachment signal sequence. Specific anti-Mp1p antibody was

generated with recombinant Mp1p protein purified from Escherichia coli

to allow further characterisation of Mp1p. Western blot analysis with

anti-Mp1p antibody revealed that Mp1p produces dominant bands with

molecular masses of 58 and 90 kDa and that it belongs to a group of cell

wall proteins that can be readily removed from yeast cell surfaces by glu-

canase digestion. In addition, Mp1p is an abundant yeast glycoprotein

and has high affinity for concanavalin A, a characteristic indicative of a

mannoprotein. Furthermore, ultrastructural analysis with immunogold

staining indicated that Mp1p is present in the cell walls of the yeast, hy-

phae, and conidia of P. marneffei. Finally, it was observed that infected

patients develop a specific antibody response against Mp1p, suggesting

that this protein represents a good cell surface target for host humoral

immunity.

The antibody response of penicilliosis patients to Mp1p was studied

in two subsequent studies [38, 39]. An ELISA-based antibody test with

purified Mp1p was produced. Evaluation of the test with guinea pig sera

against P. marneffei and other pathogenic fungi indicated that this assay

was specific for P. marneffei. Clinical evaluation revealed that high levels

of specific antibody were detected in two immunocompetent penicilliosis

patients. Furthermore, approximately 80% (14 of 17) of the documented

penicilliosis patients with human immunodeficiency virus tested positive

for the specific antibody. No false-positive results were found for serum

samples from 90 healthy blood donors, 20 patients with typhoid fever,

and 55 patients with tuberculosis, indicating a high specificity of the test.

Thus, this ELISA-based test for the detection of anti-Mp1p antibody can

be of significant value as a diagnostic for penicilliosis.

In vitro, Mp1p is found to be secreted into the cell culture super-

natant at a level that can be detected by Western blotting. A sensitive

ELISA developed with antibodies against Mp1p was capable of detecting

this protein from the cell culture supernatant of P. marneffei at 10^4

cells/mL. The anti-Mp1p antibody is specific since it fails to react with

any protein from lysates of Candida albicans, Histoplasma capsulatum, or

Cryptococcus neoformans by Western blotting. In addition, this Mp1p

antigen-based ELISA is also specific for P. marneffei since the cell cul-

ture supernatants of the other three fungi gave negative results. Finally,

a clinical evaluation of sera from penicilliosis patients indicates that 17

of 26 (65%) patients are Mp1p antigen test positive. Furthermore, an

Mp1p antibody test was performed with these serum specimens. The

combined antibody and antigen tests for P. marneffei carry a sensitivity

of 88% (23 of 26), with a positive predictive value of 100% and a negative

predictive value of 96%. The specificities of the tests are high since none

of the 85 control sera was positive by either test.
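
The figures in this paragraph can be checked from the quoted counts. The sketch below assumes a standard two-by-two table in which the 26 penicilliosis patients form the diseased group (23 positive by either test) and the 85 control sera form the non-diseased group (none positive); the function name is illustrative.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, PPV and NPV from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Counts quoted in the text: 23 of 26 patients positive by either test,
# and 0 of 85 control sera positive.
metrics = diagnostic_metrics(tp=23, fn=3, tn=85, fp=0)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

With these counts the sensitivity is 23/26 (about 88%), the positive predictive value is 23/23 (100%) and the negative predictive value is 85/88 (about 96.6%), in line with the values quoted above.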

The value of antigen (Mp1p) and antibody (anti-Mp1p) detection in

the diagnosis of penicilliosis marneffei is best evaluated by comparing the

results in patients with or without underlying HIV infection. In a study

involving eight HIV positive and seven non-HIV penicilliosis marneffei

patients, the HIV positive patients tended to have a higher antigen titre

and a lower antibody titre, while the converse is true in the HIV negative

patients. This presumably is due to impaired antibody production as a

result of the underlying immune defects associated with HIV infection

and a higher fungal load in this group of patients. Concomitant testing

of the serum antigen and antibody levels could therefore improve the

diagnostic yield of serology in immunocompromised patients.

When serial serum samples were available for the HIV-positive pa-

tients, it was found that the serum antigen and antibody titres against

P. marneffei were elevated as early as 30 days before the day of posi-

tive cultures. The titres of both serum antigen and antibody dropped

with the initiation of amphotericin B therapy and itraconazole prophy-

laxis. Upon subsequent follow up, there was no clinical and mycological

evidence of relapse and this was associated with a persistently negative

serum antigen and antibody ELISA.

Treatment

In vitro, P. marneffei is susceptible to itraconazole and amphotericin

B, while the susceptibility to fluconazole and 5-fluorocytosine is less uni-

form [301]. The recommended antifungal regimen to date consists of two

weeks of intravenous amphotericin B (0.6 mg/kg/d) followed by ten weeks

of oral itraconazole (400 mg/d), which resulted in clinical and microbiological

cure in 97.3% of the patients. Long term secondary prophylaxis

has also been suggested to reduce the relapse rate [290,302]. With wider

use of HAART for HIV infection, it has been suggested that long term

antifungal prophylaxis may not be necessary. The highly active anti-

retroviral therapy (HAART) has been shown to reduce the incidence of

many opportunistic infections in AIDS patients, including invasive fungal

infection. There is, however, currently no specific cut-off value of

CD4 cell count that can be used to guide the use of secondary antifungal

prophylaxis [140]. One recent interesting observation is that several 4-

aminoquinoline agents including chloroquine were found to be able to

inhibit the growth of P. marneffei inside macrophages. The activity of

chloroquine on P. marneffei is postulated to be due to an increase in the

intravacuolar pH and a disruption of pH-dependent metabolic processes.

This finding could be of value in the chemotherapy or chemoprophylaxis

of penicilliosis marneffei [307].

1.2.4 Fungal genome projects

Genomics has only just started to have an impact on biological and medical research,

although modern molecular genetics has been at the center of the biomedical

revolution in research since the 1980s. The ability to study

whole genome sequences provides a new tool for biomedical research.

At the time of writing, there are about 317 completed

and published genome sequence projects and 549 eukaryotic and 802

prokaryotic ongoing projects (data from the Genomes OnLine Database

(GOLD) at http://www.genomesonline.org/). Current estimates suggest

at least 2 million fungal species, of which only some 50,000 to 70,000

have been documented, and the genomes of merely a few

have been completed.

S. cerevisiae was the first eukaryote to have its genome fully se-

quenced. In 1996 the work was completed by many different laboratories

and organisations. Its genome contains ≈6,000 genes on 16 chromosomes.

At the time that genome sequence was published, only 43.3% of the

yeast genes were classified as ‘functionally characterised’, i.e., having ex-

perimentally well-investigated properties, being members of well-defined

protein families, or displaying strong homology to proteins with known

biochemical functions. Despite this limitation, it is the most well studied

fungus and serves as the most important model organism for fungal

genetics. The all-against-all matching of the yeast genome had been

accomplished and duplication patterns within the genome have been in-

vestigated in a systematic way. Such a view of the genome’s architecture,

based on an exhaustive intra-genomic sequence comparison, revealed that

whole genome duplication seems to have had an important influence on

the evolutionary development of S. cerevisiae [220].

The S. pombe genome [354] contains the smallest number of protein-

coding genes yet recorded for a eukaryote: 4,824. Centromere structure

has been well studied in S. pombe: the centromeres are between 35 and

110 kb and contain related repeats including a highly conserved 1.8-kb

element. More introns (of which there are 4,730) are found than in S.

cerevisiae. Some 43% of the genes contain introns. Some homologs of

human disease genes, such as cancer related genes, have been identified.

Comparative study identified highly conserved genes important for eu-

karyotic cell organisation including those required for the cytoskeleton,

compartmentation, cell-cycle control, proteolysis, protein phosphoryla-

tion and RNA splicing, which may have originated with the appearance

of eukaryotic life. In contrast, few similarly conserved genes that are

important for multicellular organisation were identified. The lesson from

studying S. pombe genome is that the transition from prokaryotes to eu-

karyotes required more new genes than did the transition from unicellular

to multicellular organisation.

The N. crassa genome has been reported recently [101]. The genome

is assembled from genomic data of more than 20-fold sequence coverage

of the genome. It has the largest genome size (39.9 Mb) and gene number

(10,082 protein-coding genes) among all published fungal genomes so

far. On average, there is one gene per 3.7 kilobases (kb) and

1.7 short introns (134 bp on average) per gene. The Neurospora

genome comprises a small number of repetitive elements, a low degree of

segmental duplications and very few paralogous genes. Neurospora genes

are highly divergent: of the predicted proteins, 41% have no significant

matches to known proteins. Many genes have predicted products likely

to be involved in determining hyphal growth and multicellular developmental

structures in Neurospora, as well as in catabolism, chemical

detoxification and stress-defense mechanisms. It has also been noted

that for some Neurospora genes the only known homologs are found in

prokaryotes [216], indicating that occupation of similar ecological niches

has resulted in conservation of genes for substrate degradation and sec-

ondary metabolism.

Magnaporthe grisea, one of the most devastating agricultural pathogens

in the world, has been sequenced [64]. The fungus causes blast disease in

rice, a scourge that destroys enough rice crops to feed 60 million people

annually. The pathogen’s remarkable ability to overcome plant defences

has stymied efforts to fight the disease. Analysis of its predicted gene set

provides an insight into the adaptations required by a fungus to cause

disease. The M. grisea genome encodes a large and diverse set of se-

creted proteins, including those defined by unusual carbohydrate-binding

domains. This fungus also possesses an expanded family of G-protein-

coupled receptors, several new virulence-associated genes and large suites

of enzymes involved in secondary metabolism. Together with the draft

rice genome sequences published earlier this year, the new information

will help researchers develop better and cheaper methods of protecting

plants than the currently available fungicides.

Recently, the C. albicans and C. neoformans genomes were reported

[148, 203], enabling a comparison between these divergent fungi. More-

over, high-quality draft sequences of A. nidulans and A. fumigatus are

already in the public domain, and others, such as Ustilago maydis, are

likely to be available soon. Other genome sequencing projects of patho-

genic fungi are also under way or will soon be started (for instance,

Pneumocystis carinii).

1.3 Materials and Methods

Strain and DNA preparation for the P. marneffei genome project were carried out by colleagues

in the Department of Microbiology, University of Hong Kong.

Library construction and shotgun sequencing were carried out by Beijing

Genomics Institute (BGI).

1.3.1 Strain and DNA preparation

P. marneffei strain PM1 was isolated from an HIV-negative patient suffering

from culture-documented penicilliosis in Hong Kong. The arthroconidia

(“yeast form”) of PM1 were used throughout the DNA sequencing

experiments. Genomic DNA, including mitochondrial DNA, was prepared

from the arthroconidia purified at 37 °C. A single colony of the

fungus grown on Sabouraud dextrose agar at 37 °C was inoculated into

yeast peptone broth and incubated in a shaker at 30 °C for 3 days. Cells

were cooled in ice for 10 min, harvested by centrifugation at 2000g for

10 min, washed twice and re-suspended in ice cold 50 mmol EDTA/l

buffer (pH 7.5). 20 mg novazym/ml was added and the suspension was incubated at 37 °C for

one hour, followed by digestion in a mixture of 1 mg proteinase K/ml,

1% N-lauroylsarcosine, and 0.5 mol EDTA/l pH 9.5 at 50 °C for 2 hours.

Genomic DNA was then extracted with phenol and phenol-chloroform, and finally

precipitated and washed in ethanol. After digestion with RNase A,

the DNA was precipitated with ethanol a second time, washed with 70% ethanol,

air-dried and dissolved in 500 µl of TE (pH 8.0).

1.3.2 Library construction, shotgun sequencing

Two genomic DNA libraries were made in pUC18 carrying insert sizes

from 2.0 – 3.0 kb and 7.5 – 8.0 kb, respectively. DNA inserts were pre-

pared by physical shearing using the sonication method. The genome

sequence was assembled from deep whole-genome shotgun (WGS) cov-

erage obtained by paired-end sequencing from a variety of clone types,

i.e., all inserts were sequenced from both ends to generate paired reads.

A total of about 190.3 Mb of sequence data, which is equivalent to approximately

6.6× coverage of the genome, was generated by random

shotgun sequencing.

1.3.3 Sequence assembly

The Phred/Phrap/Consed package was used for base calling, contig assembly

and quality assessment [83, 84, 112]. Contigs were ordered into scaffolds

by the scaffold building program, Bambus [255]. Refer to Chapter 2 for

more detailed descriptions of annotation procedure and genome database

construction.

1.3.4 Data release

Sequence data generated by the project were released continuously and

were available for searching using the on-site BLAST server and down-

loading by FTP with access restriction. The annotated sequences are

Table 1.1: General features of the P. marneffei genome.

Feature                           Value
Assembly size (excluding gaps)    28.98 Mb
Estimated genome size             ∼31 Mb
GC content overall                47%
GC content (coding)               50%
Protein coding genes              10,060
tRNAs                             110
% coding                          62%
Average gene size                 1,753 bp
Average intergenic distance       1,051 bp
Average intron size               111 bp
Average exon size                 380 bp

available for browsing and downloading from web interface of P. marn-

effei Genome Database (PMGD), http://www.pmarneffei.hku.hk. At

present, PMGD contains 10,060 protein-coding genes.

1.4 Results

1.4.1 Assembly and general characteristics

Using a pure whole genome shotgun approach, we sequenced the P. marn-

effei genome to 6.6× coverage. The net length of assembled contigs

totalled 28.98 Mbp. Genome statistics are presented in Table 1.1.

Genome sequence

The P. marneffei genome size was estimated at ∼31 Mb (see Section 2.4.2),

which is similar to that of Magnaporthe (∼ 30 Mbp), larger than that

of S. cerevisiae and S. pombe (both about 12 Mbp), but smaller than

Neurospora (greater than 40 Mbp). The resulting assembly consists of

2,911 sequence contigs with a total length of 28,977,603 bp. Contigs

were ordered into 273 supercontigs (i.e., scaffolds) with a total length

of 28.42 Mbp (excluding gaps between contigs). Most of the assembly

(98.35%) is contained in the contigs. Given the high sequencing cover-

age, the assembly represents the vast majority (> 95%) of the genome,

as theoretically assessed by the Lander-Waterman model [186]. The mi-

tochondrial genome (35 kb, circular) has been completely sequenced and

assembled (see Chapter 3 for details).
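The coverage figure and the theoretical completeness can be checked with a few lines of arithmetic. This is a minimal sketch of the idealised Lander-Waterman expectation, in which the expected fraction of the genome covered by at least one read at coverage c is 1 - e^(-c). It assumes uniformly random reads and ignores repeats, cloning bias and minimum overlap requirements, so it gives an upper bound rather than a description of the actual assembly. The function names are illustrative.

```python
import math

def coverage(total_bases, genome_size):
    """Sequencing depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size

def expected_fraction_covered(c):
    """Lander-Waterman expectation for the fraction of bases covered at depth c."""
    return 1.0 - math.exp(-c)

# 190.3 Mb of reads, compared with the assembly size and the estimated genome size
c_assembly = coverage(190.3e6, 28.98e6)
c_genome = coverage(190.3e6, 31e6)

print(f"coverage relative to the 28.98 Mb assembly: {c_assembly:.2f}x")
print(f"coverage relative to the 31 Mb estimate:    {c_genome:.2f}x")
print(f"expected fraction covered at 6.6x:          {expected_fraction_covered(6.6):.4f}")
```

The quoted 6.6× coverage corresponds to dividing by the assembly size; dividing by the estimated 31 Mb genome gives a somewhat lower depth. In either case the idealised model predicts that well over 95% of the genome is sampled.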

Genes

A total of 10,060 protein-coding genes (9,257 (92%) longer than 100

amino acids) were predicted. This, again, is similar to that of Magnaporthe

and less than that of Neurospora, and constitutes nearly twice as

many genes as in S. cerevisiae (about 6,300) and S. pombe (about 4,800),

and nearly as many as in D. melanogaster (about 14,300). The average

gene density is one gene per 2.8 kb. The average gene length of 1.75 kb

is slightly longer than the 1.67 kb average gene length for Magnaporthe

and the 1.40 kb for both S. cerevisiae and S. pombe. The protein-coding

sequence is predicted to occupy 62.1% (51.2% excluding introns) of the sequenced

portion of the P. marneffei genome, compared with 71% in S. cerevisiae

(70.5% excluding introns) and 60.2% in S. pombe (57% excluding introns)

(Table 1.2). An estimated total of 28,180 introns are distributed among

91% of P. marneffei genes, with 34 being the largest number of introns

found within a single gene. Introns varied from 15 to 1,617 nucleotides

long, with a mean length of 111 nucleotides. The telomere tandem re-

peat identified is TTAGGG. Several predicted genes that encode conserved

telomere and centromere proteins, such as telomere-associated helicases,

were identified, but telomere and centromere sequences have remained

elusive. Although the complete genomes of A. fumigatus and A.

nidulans have not been published, high-quality drafts of their genomes are

available. Preliminary analyses reveal that most of the above statistics

about gene number and gene density of P. marneffei are similar to those

of Aspergillus. This result is consistent with our understanding of the phylogenetic

relationship between them, as obtained by small ribosomal RNA

sequences (Section 1.4.1) and mitochondrial comparison (Chapter 3).

Table 1.2: Comparison of P. marneffei genome statistics to those of other fungi. PM - P. marneffei, AN - A. nidulans, MG - M. grisea, NC - N. crassa, SC - S. cerevisiae, and SP - S. pombe.

                                        PM       AN       MG       NC       SC       SP
Genome size (Mb)                        31       31       30       43       12       12
Gene number                         10,060    9,457   11,108   10,620    6,300    4,800
Gene coverage                        62.1%    59.2%    48.2%    44.5%    71.0%    60.2%
Gene coverage (excluding introns)    51.2%    50.6%    40.5%    37.6%    70.5%    57.0%

Ribosomal RNA and tRNA

Copies of the large rRNA tandem repeat containing the 18S, 5.8S and

25S rRNA genes are present in the P. marneffei genome. Ribosomal RNA sequences

from P. marneffei and other fungi were used to construct a phylogeny.

18S rRNA sequences from 43 species of Ascomycetes

were obtained from Ribosomal Database Project II Release

8.1 (http://rdp.cme.msu.edu/html/). The phylogenetic relationship

is presented in Fig. 1.3. The neighbour-joining method of tree recon-

struction, implemented in MBEToolbox (Chapter 10), was used. Align-

ment replicates for bootstrapping were generated by using Phylip [86].

The result suggests that P. marneffei is likely to be an anamorph of a Talaromyces

species. This substantiates the observation that the spacer

regions of the rRNA loci are highly similar to those found in Talaromyces

species [158,330]. Indeed the sequence is almost identical with that of T.

flavus and T. bacillisporus (Fig. 1.3). It is also very similar to that of

Chromocleista cinnabarina, a soil fungus that produces a red pigment, as

does P. marneffei. A total of 110 tRNA genes were identified, including

69 (63%) with introns.

[Figure 1.3: tree image not recoverable from the extracted text; branch order and bootstrap values are omitted. Scale bar: 0.01. Taxa shown, with GenBank accession numbers: Clavispora lusitaniae [M55526]; Pichia anomala [D86914]; Candida tropicalis [M55527]; Zygosaccharomyces rouxii [X58057]; Saccharomyces cerevisiae [Z75578]; Torulaspora delbrueckii [X53496]; Schizosaccharomyces pombe [X58056]; Saitoella complicata [D12530]; Protomyces inouyei [D11377]; Taphrina populina [D14165]; Taphrina deformans [U00971]; Taphrina wiesneri [D12531]; Chaetomium elatum [M83257]; Neurospora crassa [X04971]; Podospora anserina [X54864]; Microascus cirrosus [M89994]; Pseudallescheria boydii [U43913]; Ophiostoma ulmi [M83261]; Leucostoma persoonii [M83259]; Aureobasidium pullulans [M55639]; Pleospora rudis [U00975]; Thermoascus crustaceus [M83263]; Penicillium verruculosum [AF510496]; Penicillium marneffei; Talaromyces flavus [M83262]; Talaromyces bacillisporus [D14409]; Chromocleista cinnabarina [AB003952]; Byssochlamys nivea [M83256]; Eurotium rubrum [U00970]; Aspergillus fumigatus [M55626]; Aspergillus flavus [D63696]; Monascus purpureus [M83260]; Eupenicillium javanicum [U21298]; Penicillium notatum [M55628]; Penicillium chrysogenum [AF548086]; Penicillium commune [AF236103]; Penicillium expansum [AF218786]; Penicillium allii [AF218787]; Histoplasma capsulatum [Z75306]; Blastomyces dermatitidis [M55624]; Paracoccidioides brasiliensis [AF227151]; Coccidioides immitis [M55627]; Eremascus albus [M83258]; Ascosphaera apis [M83264].]

Figure 1.3: Phylogenetic tree showing the relationships of P. marneffei to other Penicillium and Talaromyces species. The tree was inferred from 18S rRNA data by the neighbour-joining method and bootstrap values calculated from 1000 trees. The scale bar indicates the estimated number of substitutions per 100 bases using the Jukes-Cantor correction. Names and accession numbers are given as cited in the GenBank database.
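The Jukes-Cantor correction named in the caption converts the observed proportion of differing sites p between two aligned sequences into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The sketch below is an illustration only: the tree in this thesis was built with MBEToolbox, and the Python function names and the handling of gaps (positions with "-" or "N" are skipped) are assumptions made for the example.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Positions where either sequence has a gap ('-') or 'N' are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared

def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion of differences p."""
    if p >= 0.75:
        raise ValueError("distance undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Toy alignment: one difference in eight comparable sites
p = p_distance("ACGTACGT", "ACGTACGA")
d = jukes_cantor(p)
print(f"p = {p:.3f}, Jukes-Cantor d = {d:.4f}")
```

A matrix of such pairwise distances is the input that a neighbour-joining implementation would take; the correction matters more as sequences diverge, because multiple substitutions at the same site are increasingly likely.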

1.4.2 Genome architecture and co-linearity

Identification of syntenies conserved between species is valuable for tracing

the evolutionary events that affect genomes. However, little information

about synteny among chromosome segments (or contigs) is known

for filamentous ascomycetes. Analysis of orthologous genes among P.

marneffei, A. nidulans and A. fumigatus, revealed extensive regions of

conserved synteny, as well as a considerable extent of genome reorganisa-

tion that has occurred in this phylum. There are 1,340 regions containing

four or more genes that were found to be co-linear between P. marneffei

and A. nidulans. A total of 3,188 P. marneffei genes are in these regions.

There are 1,273 regions between P. marneffei and A. fumigatus, contain-

ing 3,716 P. marneffei genes. The largest syntenic cluster contains 27

gene pairs, appearing in P. marneffei and A. nidulans.
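The idea of a co-linear region can be made concrete with a small sketch. The exact criteria used in this analysis are not specified here, so the following makes simplifying assumptions: a region is a run of adjacent genes on one contig of genome A whose orthologs are also adjacent, in a consistent orientation, on a single contig of genome B, and no intervening genes are tolerated. All gene and contig identifiers are hypothetical.

```python
def colinear_blocks(order_a, orthologs, position_b, min_genes=4):
    """Find runs of at least min_genes adjacent genes in genome A whose
    orthologs are adjacent, in a consistent direction, on one contig of B.

    order_a    -- gene IDs in order along one contig of genome A
    orthologs  -- dict mapping A gene ID -> B gene ID
    position_b -- dict mapping B gene ID -> (contig, index along contig)
    """
    blocks, run, direction = [], [], 0

    def flush():
        # Keep the current run only if it is long enough
        if len(run) >= min_genes:
            blocks.append(list(run))

    for gene in order_a:
        hit = orthologs.get(gene)
        pos = position_b.get(hit) if hit is not None else None
        if pos is None:                      # no ortholog: the run ends here
            flush()
            run, direction = [], 0
            continue
        if run:
            prev_contig, prev_index = position_b[orthologs[run[-1]]]
            step = pos[1] - prev_index
            if pos[0] == prev_contig and step in (1, -1) and direction in (0, step):
                run.append(gene)             # extends the current run
                direction = step
                continue
            flush()                          # adjacency broken: close the run
        run, direction = [gene], 0
    flush()
    return blocks

# Hypothetical example: a1..a4 are co-linear with b1..b4; a6, a7 are inverted
order_a = ["a1", "a2", "a3", "a4", "a5", "a6", "a7"]
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4", "a6": "b9", "a7": "b8"}
position_b = {"b1": ("c1", 0), "b2": ("c1", 1), "b3": ("c1", 2), "b4": ("c1", 3),
              "b8": ("c2", 0), "b9": ("c2", 1)}
blocks = colinear_blocks(order_a, orthologs, position_b)
print(blocks)
```

The function would be called once per contig of genome A. Practical synteny tools usually tolerate a limited number of intervening or missing genes, which this strict version does not.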

Melanin-biosynthesis gene cluster

One of the interesting examples of the syntenic segments conserved be-

tween P. marneffei and Aspergillus spp. is the melanin biosynthesis gene

cluster. This six-gene cluster, spanning ∼ 19 kb, which participates in

DHN-melanin biosynthesis [24, 187, 317, 318], is found in P. marneffei,

and is syntenic in A. fumigatus (Chapter 4).

Pheromone precursor gene loss

Syntenic regions reveal evolutionary events, like gene loss, which are dif-

ficult to identify by other methods. One of the examples is the loss

of known mating pheromone precursor genes. Figure 1.4 shows the mi-

crosyntenies among pheromone precursor loci from P. marneffei, A. nidu-

lans, A. fumigatus and N. crassa. The pheromone precursor gene has

been identified in all these species (highlighted in green) except for P.

marneffei. The hypothetical locations of P. marneffei pheromone pre-

cursor genes before loss are indicated by triangles in the figure.

Figure 1.4: Microsyntenies containing pheromone precursor loci from P. marneffei, A. nidulans, A. fumigatus and N. crassa. The pheromone precursor genes have been highlighted in green. The hypothetical locations of P. marneffei pheromone precursor genes before gene loss are indicated by triangles.

1.4.3 Gene duplications (multigene families) and comparisons

Among all predicted P. marneffei genes (10,060 in total, 9,541 of them longer

than 100 bp), 1,335 belong to 428 multigene families, each of which contains

more than one homologous member. The largest gene family consists

of 34 genes. The most expanded gene families include MFS multidrug

transporter, dehydrogenase/reductase and hexose transporter, as well as

pepsin-type protease (see Table 8.2 on page 165). Comparisons of contig/supercontig

sequences and searches for tracts of conserved gene order

reveal little evidence for large-scale duplications in P. marneffei. The

incomplete genome sequences and unordered contigs obviously impair

the detection. Notably, the result is inconsistent with the

other line of evidence presented in Chapter 8, in which the histogram of

synonymous substitution rates of P. marneffei duplicate gene pairs suggests

that two large-scale gene duplications probably happened. Compared

to S. cerevisiae, which has undergone a whole-genome duplication (i.e., the largest

possible gene duplication), P. marneffei has a relatively small number of recently

duplicated gene pairs. However, the age distribution of duplicate genes in P.

marneffei at the first peak (see Chapter 8 for details) shows a pattern

similar to that in S. cerevisiae, which might suggest that duplicate

genes in P. marneffei originated through one or two episodic,

large-scale gene duplications.

1.4.4 Interspecies proteome comparison

The comparison of genomic sequences of two or more species can

highlight how evolution shapes genome structure and

content, and reveal specific sequences that have been conserved, as well

as those that have been invented throughout evolution. I conducted such

a comparative analysis of proteome sequences between P. marneffei and

A. fumigatus and S. cerevisiae. The analysis started by defining ortholog

or paralog pairs among proteomes. Two genes are said to be paralogous

if they are derived from a duplication event, but orthologous if they are

derived from a speciation event. Determining orthologs is an important step

in assessing the relationship between genomes. This was performed us-

ing the BLAST comparison tool. BLASTP was used to compare the

sequences of proteins encoded by genes of one genome against those from

the other genomes. Protein sequences, instead of nucleotide sequences,

were compared because protein sequences remain conserved much longer,

on an evolutionary time scale and can therefore reveal much older relationships.

The lower the E-value, the greater the chance

that two proteins are homologous, that is, derived from a common ancestral

protein and therefore likely to share a function. E-values have been

shown to be an accurate indication for the ratio of false positives to true

positives of homologous relationships. Genes g and h were considered or-

thologues if h is the best BLASTP hit for g and vice versa, with E-value

less than or equal to 1e-10.
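The reciprocal best hit criterion can be sketched in a few lines. The example assumes that BLASTP results have already been parsed into (query, subject, E-value) tuples; the parsing step, the protein identifiers and the function names are hypothetical, and ties are resolved in favour of the first hit encountered.

```python
def best_hits(hits, max_evalue=1e-10):
    """Map each query to its lowest-E-value subject among hits passing the cutoff.

    hits -- iterable of (query, subject, evalue) tuples
    """
    best = {}
    for query, subject, evalue in hits:
        if evalue > max_evalue:
            continue                          # fails the E-value cutoff
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a, max_evalue=1e-10):
    """Return sorted (g, h) pairs where h is the best hit of g and vice versa."""
    forward = best_hits(a_vs_b, max_evalue)
    backward = best_hits(b_vs_a, max_evalue)
    return sorted((g, h) for g, h in forward.items() if backward.get(h) == g)

# Hypothetical hits: P. marneffei proteins (pm*) against A. fumigatus (af*)
pm_vs_af = [("pm1", "af1", 1e-50), ("pm1", "af2", 1e-20),
            ("pm2", "af2", 1e-30), ("pm3", "af3", 1e-5)]
af_vs_pm = [("af1", "pm1", 1e-48), ("af2", "pm1", 1e-25),
            ("af2", "pm2", 1e-22), ("af3", "pm3", 1e-6)]
pairs = reciprocal_best_hits(pm_vs_af, af_vs_pm)
print(pairs)
```

In this toy data only pm1 and af1 qualify: pm2 points to af2, but af2 prefers pm1, and the pm3/af3 hits fail the cutoff. Reciprocal best hits are a conservative proxy for orthology and can miss true orthologs when recent duplications are present.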

The translated ORF sequences of S. cerevisiae were obtained from

the Saccharomyces Genome Database (SGD) at http://www.yeastgenome.org/.

The predicted peptides of A. fumigatus were downloaded from the

FTP service at the A. fumigatus genome project in the Sanger Institute

(http://www.sanger.ac.uk/). The result of the proteome comparison

is given in Fig. 1.5.

Figure 1.5: Graphical representation of a triple proteome comparison between P. marneffei, S. cerevisiae and A. fumigatus.

1.4.5 Lineage-specific genes

We identified many genes only present in P. marneffei or its closely re-

lated fungal species, namely lineage-specific genes. At the most extreme,

some genes are present in P. marneffei exclusively. These genes are of

particular interest because they may be determinants of characteristic

features of the fungus. A total of 1,447 genes whose proteins lack significant

matches to known proteins from public databases (TBLASTN E-value cutoff

10^-10) were found. This reflects the fact that genome projects for Penicillium and closely

related fungi are still at an early stage, and the diversity of

fungal genes remains to be explored. Furthermore, 2,506 proteins do

not have significant matches to genes in either the sequenced yeasts

or A. nidulans. A novel theory about the emergence of lineage- or

species-specific genes is given in Chapter 9. Briefly speaking, the accel-

erated evolutionary rate, one of the most characterised properties of a

lineage-specific gene, may be responsible for the gene’s emergence.

In addition to the lineage-specific genes, many fungal specific domains

have been identified. These include the cell wall antigen MP1 domain that was

first described in the cell wall antigen Mp1p encoded in P. marneffei [347].

Mp1p contains two self-conserved regions, namely CR1 and CR2,

which form a new conserved domain family that has not been described

in conserved domain databases, such as Pfam and ProDom. The genome

sequence reveals more than 12 P. marneffei genes containing at least one

MP1 domain. That is to say, the genes encoding MP1-containing proteins

have been expanded in the P. marneffei genome. Such an expansion is not

so extensive in A. fumigatus and A. nidulans, although at least two MP1-containing

proteins, afmp1 and afmp2 (GenBank Acc.: AAG09624 and

AAR22399), were discovered in the A. fumigatus genome.

Figure 1.6: Putative MAPK signalling pathway in P. marneffei. Overview of major intracellular signalling pathways in P. marneffei. Common genes between S. cerevisiae and P. marneffei are marked with asterisks. Names of S. cerevisiae genes are presented. The P. marneffei genes are in parentheses. Created by using GenMAPP v2.0, a free program for visualising genes on biological pathways.

1.4.6 Cell signalling and morphogenesis

The sequences encoding proteins that act on well-studied signalling path-

ways, including mitogen-activated protein kinases (MAPK) and cyclic

AMP-dependent protein kinase, as well as small GTPases of the Ras

family, are readily recognised in the P. marneffei genome. Figure 1.6

compares the MAPK signalling pathways of S. cerevisiae and

P. marneffei.

1.4.7 Potential mating ability

Traditionally, P. marneffei is considered as an asexual (anamorph) as-

comycete that lacks an apparent sexual (teleomorph) stage in its life cycle

and seems to reproduce only mitotically [44, 104]. Recent genetic stud-

ies, however, suggest it may have an unidentified sexual cycle. Except

for the pheromone precursor gene, the whole set of sex-related genes in

P. marneffei genome was identified, which demonstrates the potential

mating ability of this important thermally dimorphic fungus (Chapter

5).

1.4.8 Putative virulence genes

What makes a fungus a pathogen is an old question. The P. marneffei

genome sequence has revealed many proteins and systems with functions

that have previously been found to be important in pathogenic fungi. For

example, proteins such as phospholipases and proteinases are involved in

direct host cell damage and lysis. A review of fungal virulence factors

is given in Section 4.2. A few identified putative virulence factors are presented

in Table 1.3.

Table 1.3: Putative virulence genes

Gene        Acc. No.   BLAST hit                                                       E value
Proteinase
  Pm47.49   P87184     Intracellular vacuolar serine proteinase precursor              0
  Pm61.35   Q96WN2     Lon proteinase                                                  0
  Pm109.24  P25375     Saccharolysin (EC 3.4.24.37) (Protease D) (Proteinase yscD)     1e-159
  Pm61.50   Q6FX66     YCL057w PRD1 proteinase yscD                                    1e-158
  Pm88.30   Q64HW0     Aspartyl proteinase                                             1e-122
  Pm66.31   P32379     Proteasome component PUP2 (EC 3.4.25.1)                         3e-98
  Pm13.58   Q871P4     Related to ubiquitin-specific proteinase UBP1                   6e-97
Phospholipase
  Pm1.261   Q769K2     N-acyl-phosphatidylethanolamine-hydrolysing phospholipase D     6e-61
  Pm103.31  Q874F2     Phospholipase D                                                 1e-156
  Pm16.57   Q6U820     Lysophospholipase (EC 3.1.1.5)                                  0
  Pm167.18  Q877A5     Phospholipase (Fragment)                                        2e-51
  Pm182.7   Q76H92     Phospholipase A2                                                3e-27
  Pm22.27   Q9P866     Candida albicans Phosphatidylinositol phospholipase C           4e-44
Metacaspase
  Pm112.34  Q8J140     Metacaspase                                                     1e-91
  Pm205.1   Q8J140     Metacaspase                                                     3e-58
Agglutinin
  Pm113.29  Q9P5P9     related to A-agglutinin core protein AGA1                       1e-24
  Pm10.4    P11219     Lectin precursor (Agglutinin)                                   5e-09
  Pm2.195   Q8CMU7     Streptococcal hemagglutinin protein                             3e-07
  Pm28.53   Q7N911     Similar to hemagglutinin/hemolysin-related protein              0.00005
Toxin
  Pm21.30   A45086     HC-toxin synthetase - fungus (Cochliobolus carbonum)            0
  Pm21.31   Q9UVN5     AM-toxin synthetase                                             0
  Pm71.10   Q9UVN5     AM-toxin synthetase                                             0
  Pm71.39   Q9UVN5     AM-toxin synthetase                                             0
  Pm137.4   Q9UVN5     AM-toxin synthetase                                             0
  Pm151.1   A45086     HC-toxin synthetase - fungus (Cochliobolus carbonum)            0
  Pm112.24  Q96WL1     Aflatoxin efflux pump Aflt                                      1e-141

1.4.9 Cell wall antigens and biosynthetic genes

The cell wall of a fungus maintains the structural integrity of the cell,

protects the fungus against the defence mechanism of the host and harbours

most of the fungal antigens. It consists of a polymer of α and

β(1,3)-glucans, chitin, galactomannan and β(1,3)(1,4)-glucan embedding

protein antigens including the adhesins. The cell wall is synthesised and

continuously remodelled by enzymes including synthases, transglycosidases

and glycosyl hydrolases. All these are absent in human cells and are thus

ideal targets for anti-fungal agents and immunisation. Previous studies

have shown that the specific monoclonal antibody against the galactofu-

rane side chain of galactomannan antigen of A. fumigatus can react with

the cell wall of P. marneffei and can be used to detect the presence of

antigenaemia or antigenuria in patients suffering from penicilliosis marneffei

[363]. An ortholog of one of the known P. marneffei cell wall antigen

genes, MP1, is present in A. fumigatus. Within P. marneffei, homologs

of a number of Aspergillus genes encoding similar biosynthetic enzymes

and cell wall antigens have been identified (Table 1.4).

Table 1.4: Cell wall antigens and biosynthetic genes predicted in P. marneffei.

Aspergillus gene                  Acc. No.     Pm gene     E value
CHSs
  Class I CHSA                    AAB33397     Pm14.101    e-107
  Class II CHSB                   AAB33398     Pm132.15    5e-097
  Class III CHSG                  AAB07678     Pm110.5     0
  Class IV CHS F                  AAB33402     Pm87.22     6e-064
  Class V CHSE                    CAA70736     Pm38.37     0
  Class VI CHSD                   AAB33400     Pm223.4     e-051
β(1,3)-glucan synthase
  FKS1                            AAB58492     Pm120.1     0
  RHO1                            AAG12155     Pm203.6     5e-099
α(1,3)-glucan synthase
  AGS1                            AAL28129     Pm162.3     0
  AGS2                            AAL18964     Pm66.50     0
β(1,3)-glucanosyl transferases
  GEL 1                           AAC35942     Pm221.6     e-154
  GEL 2                           AAF40139     Pm94.24     e-123
  GEL 3                           AAF40140     Pm119.10    e-124
Mannosyl transferases
  MNN9                            Afu2g01450   Pm207.2     5e-097
  PIG-M                           Afu7g01300   Pm90.41     2e-063
Chitinases
Endo-β(1,3)-glucanases
  Engl1                           AAF13033     Pm5.32      0

1.5 Discussion

This is the initial analysis of the genome of a thermally dimorphic fungus.

Although P. marneffei has not been studied intensively, the analysis

of the genome sequence has provided many new insights into a variety

of gene functions and cellular processes, including cell wall components,

signalling pathways, secondary metabolism and mating ability.

Comparisons of the genome of P. marneffei with those of other pathogenic

and nonpathogenic fungi have also uncovered surprising similarities and

differences, providing a new perspective on the molecular underpinnings

of these lifestyles. The analysis of P. marneffei-specific genes might allow

researchers to begin to gain insights into the transition from mould to

yeast growth. Furthermore, the genome sequence has revealed different

patterns of gene duplication in P. marneffei and other ascomycetes,

which might be linked with their divergent biological characteristics. The

apparent lack of a pheromone precursor locus in P. marneffei may provide

an explanation of its asexual life style. However, the fungus may indeed

undergo a yet undetected sexual cycle, which is supported by the findings

of homologs of many mating genes. Finally, one of the most interesting

findings is the abundant intragenic tandem repeats in the coding regions

of the genome. This finding provides a possible mechanism to explain

how the fungus can change its surface coat and thereby evade detection

by the host’s natural defences (see Chapter 7).

The draft genome sequence of P. marneffei presented in this chapter represents a first attempt to understand the genetic basis of the physiology of this special Penicillium species. This first glimpse may be expanded as many other fungal genomes are generated by fungal genome sequencing projects that are ongoing or planned. This new era in fungal biology promises to yield insights into this important group of organisms, as well as to provide a deeper understanding of the fundamental cellular processes common to all eukaryotes.


Chapter 2

PENICILLIUM MARNEFFEI GENOME DATABASE

AND ANNOTATION PIPELINE

The draft genome of Penicillium marneffei has been obtained (Chap-

ter 1). The huge amount of sequence data needs efficient analysis in order

to extract valuable information. A computer-based analysis system tai-

lored for the genome is required. Such a sequence data management

system with a number of peripheral applications has been developed to

solve this problem.

2.1 Introduction

The rapidly growing amount of genome information for P. marneffei needs to be adequately processed, annotated and interpreted. Computational annotation systems that provide tools and algorithms can facilitate this process and advance our understanding of the genome sequences. For such systems to be developed and refined, data must be easily accessible and amenable to analysis. The analysed data must be fed back into the loop so that they can be re-analysed, refined and verified, and so that new hypotheses can be built. This is the issue of data management. Good data management practices are essential to users of genomic data.

This chapter is concerned with two aspects: (1) construction of the PMGD (P. marneffei genome database) system, and (2) the issues relevant to the development of the annotation pipeline. Many steps are involved in these two aspects. Among them, prediction of protein function is one of the most critical in genome information processing, and function prediction therefore forms the central part of the annotation pipeline. Since P. marneffei genetics has not been well established, most of the proteins derived from its genome will be entirely unknown to biologists, and more than ten thousand unknown proteins will undergo function prediction. Different methods of protein function prediction have been developed (see Literature Review). Briefly, these methods can be categorised into two major groups: homology-based methods and nonhomology-based methods. The former depend on detectable homology between an unknown protein and characterised proteins in databases. The latter are based on various kinds of contextual functional information, which are collected and integrated around the protein in order to assign it a putative function in an indirect way [218]. However, none of these methods can guarantee a ‘one-stop’ solution that is particularly successful in P. marneffei gene function prediction. Hence, the newly developed annotation pipeline integrates several currently used methods, but it is by no means a mere collection of methodologies: each method was tailored before it was integrated, in order to give its maximum predictive power with respect to the features of fungal proteins.

In the next section, I will first review the underlying principles behind the methodologies used for predicting the function of unknown proteins. I will then examine a few protein function prediction systems implemented by several research groups, before pointing out some additional considerations regarding the further development of similar systems. Note that the topic of protein function prediction is a broad one and could be broken down into subtopics in many different ways. I have tried to organise them in a flow from theory to application as smoothly as possible. Even so, the content of the sections may overlap slightly; some key concepts, such as sequence alignment algorithms, may be mentioned more than once in different sections.

2.2 Literature review


In this literature review I will first examine the most widely used methods in protein function prediction. I will then survey the software and database systems currently available, highlighting their strengths and shortcomings. Possible directions for further research are addressed at the end of the review.

2.2.1 Methods for predicting protein function

Based on the underlying principle, the methods of protein function pre-

diction can be categorised into two major groups: homology-based meth-

ods and nonhomology-based methods [17,217,142].

Homology-based methods

Homology-based annotation relies on sequence similarity between the query protein and a well characterised protein. If two proteins are highly similar in sequence, they possibly share the same function. The rationale behind this extrapolation is that similarity in sequence is a sufficiently reliable indicator of functional similarity. This is reasonable, but counter-examples are not rare. For instance, in the presence of domains that are shared by numerous proteins [74], choosing the first or the best hit may not be optimal, and the multi-domain organisation of proteins can lead to incorrectly annotated database entries. Despite such criticisms, homology-based methods are by far the most widely used. To calculate similarities or distances with sequences of known proteins, pairwise similarities are computed using the rigorous dynamic programming algorithm [292], or heuristic algorithms such as FASTA [245] and BLAST [6].
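As an illustration of what a rigorous dynamic programming comparison involves, the following is a minimal sketch of a local alignment score in the style of Smith-Waterman. It is assumed here to be representative of the class of algorithm cited above; the function name and the match, mismatch and gap scores are illustrative choices, not values used in this work (protein searches would use a substitution matrix instead of fixed scores).

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Illustrative scoring only: a fixed match/mismatch score and a linear
    gap penalty stand in for a real substitution matrix.
    """
    prev = [0] * (len(b) + 1)   # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

The algorithm fills a matrix of size len(a) × len(b), so its cost grows with the product of the sequence lengths; this is the reason heuristic programs such as FASTA and BLAST are preferred for database-scale searches.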

Besides whole-protein similarity comparison, detecting motifs or domains shared among proteins gives additional information about function. A motif is a simple combination of a few consecutive secondary structure elements with a specific geometric arrangement (e.g., helix-loop-helix). Some, but not all, motifs are associated with a specific biological function. A domain is the fundamental unit of structural folding and evolution. It may combine several secondary structure elements and motifs, not necessarily contiguous. A domain can fold independently into a stable 3D structure, and it has a specific function. A variety of mathematical representations of protein motifs and domains have been developed and used for detecting and storing them, such as regular expressions, position-specific scoring matrices [97], hidden Markov models [57], probabilistic suffix trees [15], and sparse Markov transducers [81].
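The simplest of these representations, the regular expression, can be sketched as follows. The pattern used is the well-known PROSITE-style N-glycosylation pattern N-{P}-[ST]-{P}, chosen only as an example; it is not one of the motifs analysed in this thesis, and the function name is hypothetical.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P} written as a regular expression.
# The lookahead lets overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence: str, pattern: re.Pattern = N_GLYCOSYLATION):
    """Return (1-based start, matched text) for every match in sequence."""
    return [(m.start() + 1, m.group(1))
            for m in pattern.finditer(sequence.upper())]
```

A regular expression either matches or does not; position-specific scoring matrices and hidden Markov models replace this yes/no decision with a score, which is why they detect more divergent family members.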

Nonhomology-based methods

Although homology-based annotation has been widely successful in extending knowledge from the small set of experimentally characterised proteins to the tens of thousands of proteins found in genome sequencing projects, a fundamental limitation of this method is that a well characterised reference protein must be found based on sequence similarity; otherwise, no putative function can be assigned to the unknown protein. According to the data that we currently have, 30-40% of proteins have no clear sequence homolog in today’s most up-to-date protein databases. Another recently finished fungal genome sequencing project encountered the same problem [101].

Nonhomology-based methods, also called context-based function prediction, are complementary to homology-based function prediction. Phylogenetic profiles, domain fusion and gene neighbourhood are examples of these methods. Pellegrini et al. [248] presented the phylogenetic profile method, based on the assumption that proteins that function together in a pathway or structural complex are likely to evolve in a correlated fashion. If proteins A and B tend to be either preserved or eliminated together in a new species, we can expect that they are functionally linked. In this case, if we know the function of protein A, we can predict the function of protein B with respect to this functional linkage. Phylogenetic profiling could be useful in predicting the function of uncharacterised proteins in P. marneffei, especially as more and more fungal species are sequenced. For the time being, however, this method has to be performed manually because no free software is available to assist in automating the analysis.
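The idea behind phylogenetic profiling can be sketched in a few lines. In this minimal example each protein is represented by a string of 1s and 0s recording its presence or absence in a fixed list of genomes, and two proteins are considered linked when their profiles differ in at most a chosen number of genomes. The function names, the threshold and the profile encoding are assumptions made for illustration, not the procedure of Pellegrini et al.

```python
def profile_distance(p: str, q: str) -> int:
    """Number of genomes in which two presence/absence profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(a != b for a, b in zip(p, q))


def linked_proteins(profiles: dict, query: str, max_distance: int = 0) -> list:
    """Names of proteins whose profile lies within max_distance of the query."""
    target = profiles[query]
    return sorted(
        name for name, prof in profiles.items()
        if name != query and profile_distance(prof, target) <= max_distance
    )
```

With the hypothetical profiles {"A": "1101", "B": "1101", "C": "0110", "D": "1100"}, protein B shares the exact profile of A, so a known function for A would suggest a functional link for B.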

2.2.2 Software/database systems for protein function prediction

Over the past decades, through close cooperation between biological scientists and software engineers, a wide range of software and database systems have been developed. As the following review shows, some of them rely mainly on one of the methods mentioned above as their predictive tool, while others try to integrate more than one method in order to give a more comprehensive annotation of unknown proteins.

Systems for automatic function assignment

A group of software systems, such as PEDANT, GeneQuiz and Bio-Dictionary, attempts to accelerate the task of human experts by providing detailed and exhaustive information for function assignment.

PEDANT (http://pedant.gsf.de) is a software system for completely automatic and exhaustive analysis of protein sequence sets, from individual sequences to complete genomes [96]. It was launched in 1996 and is one of the earliest such systems. It has been used extensively at MIPS, a Europe-based bioinformatics institute. According to its developers, it is fully integrated with a sequence database system and provides access to a broad range of biological information through a hierarchically organised, controlled vocabulary. Like some other similar systems, the whole system has since been commercialised, which limits its popularity.

The GeneQuiz analysis server is open to public use and accepts anonymous protein sequences through GQserve [7]. It is composed of several major modules: GQupdate maintains integrated, up-to-date, non-redundant protein and nucleotide sequence databases, as well as databases of protein structures and motifs, thereby keeping the target databases current; GQsearch performs database searches for queries, applies a variety of sequence analysis tools to the query sequence, and parses and stores the results in a common format; GQbrowse allows browsing and querying of results. These modules are general engineering components that do not differ in principle from those of other database systems. It is the GQreason module that embodies the most critical know-how of the whole system. This module analyses the results and makes intelligent guesses, assigning to the query a specific function, a general functional class, and a reliability estimate.

Bio-Dictionary [264] employs a weighted, position-specific scoring scheme and uses a complete collection of amino acid patterns (referred to as seqlets) to determine, in a single pass, all local and global similarities between the query and any protein already present in a public database. The most distinctive feature of Bio-Dictionary is its use of seqlets that completely cover the natural sequence space of proteins in the currently available public databases. As its developers claim, the seqlets contained in this collection can capture both functional and structural signals that have been reused during evolution, both within and across families of related proteins. With this capacity, seqlets are ideal elements for use in the context of protein annotation.

Classification system

An unknown protein cannot always be readily assigned a definite functional description. In such cases, protein classification can help to elucidate the function of the new protein. Comparing a protein sequence with a database of protein families is more effective than a standard database search. Generally, conserved proteins are classified according to their homologous relationships. Each protein group is composed of a set of “seed” proteins, which is represented as multiple alignments, regular expression profiles or HMMs. Protein classification is useful in structure and function prediction, and is especially important in large-scale annotation efforts.

As of 2001, Clusters of Orthologous Groups of proteins (COGs) had been delineated by comparing protein sequences encoded in 43 complete genomes, representing 30 major phylogenetic lineages [308]. The system has since been updated to include more complete genomes representing broader lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. One problem with the COGs system is that it is not fully open to the public: batch application of COGnitor, the key component of the system used to fit new proteins into the COGs, can only be accessed inside the NCBI. Another issue to be taken into account is that COGs does not discriminate paralogs (genes from the same genome which are related by duplication) from orthologs (genes in different species that evolved from the same ancestral gene). Orthologs typically have the same function, allowing transfer of functional information from one member to an entire COG. In contrast, paralogs are functionally diverse proteins whose genes duplicated after speciation, although high sequence similarity is normally preserved in paralogs. A system like COGs can therefore only be used as a classification system for automatically yielding a number of functional predictions for poorly characterised genomes. The COGs system is of limited usefulness in the P. marneffei genome project because its current version contains few fungal genomes. Other database systems, such as Systers [177], iProClass [135] and ProtoMap [362], have the same shortcoming as COGs. They are better treated as protein information storage and retrieval systems than as active protein function prediction systems.


Protein domain databases

A list of commonly used protein domain databases is given in Table 2.1. Two of them, Pfam and InterPro, have been used in PMGD.

Pfam (http://www.sanger.ac.uk/Software/Pfam) is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families [13]. For each protein family, Pfam allows users to inspect multiple alignments, view protein domain architectures, examine species distribution, and so on. Pfam is built from fixed releases of Swiss-Prot and TrEMBL. In the current version, 18.0 (2005), 75% of protein sequences in Swiss-Prot and TrEMBL have at least one match to Pfam.

InterPro (http://www.ebi.ac.uk/interpro) is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences. It provides an integrated view of the commonly used signature databases, such as PROSITE, PRINTS, SMART, Pfam and ProDom, and has an intuitive interface for text- and sequence-based searches. The latest release, 11.0, contains 12,294 entries and covers 77.5% of UniProt proteins. InterProScan is a tool that combines the different protein signature recognition methods native to the InterPro member databases into one resource, with look-up of the corresponding InterPro and GO annotation.

2.2.3 The art of gene finding

The last 20 years have witnessed significant development of computational methodology for finding genes and other functional sites in genomic DNA. Two major classes of computational approaches are commonly used to detect genes in genomic sequences: homology-based approaches and ab initio gene-finding algorithms. The former are relatively straightforward, focusing on the search for homologous relationships with the content and structure of known genes. If a region of sequence is similar to the sequence of an identified gene, this is highly suggestive, though not necessarily conclusive, of a gene. The most common program for such comparisons is probably BLAST.

Table 2.1: Commonly used domain databases.

Database   Method        Data type   URL
Prosite    Semi-Manual   Motif       www.expasy.ch/prosite/
Pfam       Semi-Auto     Domain      www.sanger.ac.uk/Software/Pfam/
Blocks     Full-Auto     Motif       www.blocks.fhcrc.org/
ProDom     Full-Auto     Domain      prodes.toulouse.inra.fr/prodom
Prints     N/A           Motif       www.bioinf.man.ac.uk/PRINTS/
Domo       Full-Auto     Domain      www.infobiogen.fr/services/domo/
InterPro   N/A           Motif       www.ebi.ac.uk/interpro/
Smart      Semi-Auto     Domain      smart.embl-heidelberg.de/
eMotif     Full-Auto     Motif       dna.stanford.edu/identify

Next I will review some issues related to ab initio gene-finding algorithms. Generalised hidden Markov models (GHMMs) appear to be approaching acceptance as a de facto standard for state-of-the-art ab initio gene finding, as evidenced by the recent proliferation of GHMM implementations, including GenScan [30] and FGENESH (Softberry). At the time of writing, neither GenScan nor FGENESH is open source, and no detailed information about their underlying algorithms and implementations is available. According to the general description of the algorithm, GenScan uses a training set to estimate the HMM parameters, and then returns the exon structure using the maximum likelihood approach standard to many HMM algorithms (the Viterbi algorithm). The generalised HMM that GenScan uses consists of a number of states modelling the various parts of a gene. These states include 5’ splice site, 3’ splice site, internal coding exon, start exon, and terminal exon. The final gene structure predicted by GenScan is the maximum probability path through the HMM. FGENESH is also HMM-based, with an algorithm similar to that of GenScan [30]; it differs in that, in its model of gene structure, a signal term (such as a splice site or start site score) has some advantage over a content term (such as coding potential), reflecting the biological significance of the signals.

No matter what algorithm a gene-finding program implements, several basic types of signal must be detected. The signals (or functional sites in genomic DNA) that researchers have sought to recognise are splice sites, start and stop codons, branch points, promoters and terminators of transcription, polyadenylation sites, ribosomal binding sites, topoisomerase II binding sites, topoisomerase I cleavage sites, and various transcription factor binding sites [108]. From the point of view of information science, two basic types of information are used: (1) “signals” in the sequence, such as splice sites; and (2) “content” statistics, such as codon bias. Among signal measures, the splice junctions (the donor and acceptor sites) are the most important features to be identified. The most common methods for this have been “weight matrix” based. Other methods, such as consensus, maximal dependence decomposition (MDD) and neural network based methods, are also used. Other signals, such as start and stop codons, TATA boxes, transcription factor (TF) binding sites, and CpG islands, are also useful in predicting protein-coding regions. Content measures, such as codon bias, periodicities and asymmetries of coding regions, help to distinguish coding from noncoding regions. Fairly long exons are easy to identify, whereas short ones remain difficult. Neural networks have also been used to distinguish coding from noncoding sequences.
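The notion of a maximum probability path through an HMM, as used by the HMM-based gene finders described above, can be made concrete with a toy example. The sketch below runs the Viterbi algorithm on a two-state model (noncoding and coding) whose probabilities are invented for illustration. It is an ordinary HMM, not the generalised HMM with explicit state durations and splice-site states used by GenScan, and it should be read only as a demonstration of the decoding step.

```python
import math

STATES = ("N", "C")  # toy labels: N = noncoding, C = coding
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
# Invented emission probabilities: AT-rich noncoding, GC-rich coding.
EMIT = {
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}


def viterbi(seq: str) -> str:
    """Return the most probable state path for seq under the toy HMM."""
    if not seq:
        return ""
    # Log probabilities avoid numerical underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for base in seq[1:]:
        pointers, new_score = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            pointers[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

For the input ATATATGCGCGC the decoded path switches from N to C at the composition boundary, which is the toy analogue of placing an exon boundary.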

Recently, homology-based approaches have been incorporated into ab initio gene-finding algorithms. GenomeScan, for example, combines two sources of information: probabilistic models of exons and introns, and sequence similarity information [361]. It is an extension of the GenScan program, predicting gene structures that have at least one exon with supporting evidence from an existing protein sequence. The major disadvantage of this method is the requirement for a close homolog. It is often the case that homologs are unknown or remote, in which case this system would be inappropriate.

Although the programs for gene structure prediction have greatly improved in the last decade, even the best cannot autonomously detect all genes and genomic elements, and their predictions have to be supported by experimental analysis. The programs still produce a considerable proportion of incorrect and missed exons, and they concentrate only on the detection of coding exons, while 5’ and 3’ UTRs, promoter elements, and polyA sites often remain undetected. The elucidation of complex genome organisation, such as nested and overlapping genes or alternative splicing, has not yet been considered by any of the programs [267].

2.3 Implementation

The overall objective of PMGD is to design and implement a distributed information framework that provides services, tools and infrastructure for high-quality analysis and annotation of large amounts of diverse genomic data. The whole system starts with the assembly of sequences and ends with the web interface for output of all processed information. The requirements of an update depend on the genomic data sources to be updated, so PMGD was designed to be modular and configurable, so that adding new sequence data is as straightforward as possible.

2.3.1 Annotation pipeline

The general strategy applied to the analysis of all contigs is diagrammed in Fig. 2.1. It uses standard published procedures for sequence comparison as well as sh/bash shell scripts and Perl scripts specifically developed for this work (see Section 2.3.5). The procedure involves the following major steps:

Figure 2.1: Flowchart of annotation pipeline for P. marneffei genome. [Flowchart boxes: Sequence Data Files; Consed/BAMBUS; Contigs (2911); Scaffolds (273); FGENESH, GenScan, HmmGene; Best Gene Prediction; Predicted Genes (10,060); BLASTX Search; Tandem Repeat Finder; Other Nucleotide Analyses; BLASTP Search; Domain Identification; Other Protein Analyses; Annotation; Gene Structure & Functional Annotation; Relational Database Storing; PMGD Website Interface.]

Step 1: contig assembly

Contigs were assembled from the sequence electropherograms using Phred/Phrap with default options except where otherwise indicated (for details, see Section 2.3.2).

Step 2: comparisons of contigs to sequence databases

Comparisons of all contigs with fungal DNA sequences were performed using BLASTN (default parameters) to search for rDNA, plasmid or mitochondrial DNA sequences. The contigs were also compared to all known proteins in GenBank (release 131) using ungapped BLASTX, with significant hits indicating potential exons. The searches were made using the seg filter and the PAM250 substitution matrix. The searches against mitochondrial sequences were made using the filamentous fungal mitochondrial genetic code. To facilitate visual inspection of the alignments, I developed the blast2html script, which converts regular BLAST output to HTML format. A graph is inserted above the descriptive lines, showing alignments coloured according to their similarity score with the contig or protein query. Note that BLASTX hits can often indicate the approximate location of many, but not all, coding exons and do not accurately delineate exon boundaries, so the BLASTX search in this step provides only preliminary coding information.

Step 3: identification of genetic elements

This step identifies protein-coding genes and other genetic elements. Different gene-finding programs were evaluated, and the best one was then used as the primary gene-finding program (for details, see Section 2.3.3). In addition to the protein-coding genes, tRNAs were identified using the tRNAscan-SE program [207] (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/).

Step 4: BLAST comparisons to protein sequences

After predicted proteins had been obtained, they were compared with the non-redundant NCBI protein database using BLASTP version 2.0.10 with the seg filter and the PAM250 substitution matrix. All predicted genes were searched against the Pfam set of hidden Markov models using the HMMER program, and against InterPro using a modified InterProScan running locally on the bioinfo server.

Step 5: Data storing and PMGD web interface

Before the annotation data were loaded into the database system, information from the various software programs was integrated and the results were converted into either GenBank or GFF format (see below). A manual validation step was introduced at this stage. The data storage procedure is described in Section 2.3.4.


2.3.2 Assembly process

The Phred/Phrap/Consed package (version 0.99.03.19) is one of the most frequently used software sets for trace file base calling, contig assembly and contig editing [83,84,112].

Base calling

The purpose of base calling is to determine the nucleotide sequence on the basis of the multi-colour peaks in the sequence trace. Because traces (and regions within a trace) are of variable quality, the fidelity of “called” nucleotides is also variable. The accuracy of each called base is measured by its base quality value. Phred takes a trace file as input and provides these base quality values to help evaluate sequence accuracy realistically. It computes a probability p of an error in the base call at each position, and converts this to a quality value q using the transformation q = −10 × log10(p). Thus a quality of 30 corresponds to an error probability of 1/1000, a quality of 40 to an error probability of 1/10000, etc.
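The transformation between error probability and quality value stated above can be checked numerically with a short sketch. The function names are hypothetical; only the formula q = −10 × log10(p) comes from the text.

```python
import math


def error_prob_to_quality(p: float) -> float:
    """Quality value q = -10 * log10(p) for an error probability 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def quality_to_error_prob(q: float) -> float:
    """Inverse transformation: p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)
```

Because the scale is logarithmic, every increase of 10 in the quality value corresponds to a tenfold decrease in the estimated error probability.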

Vector clipping

The cross_match alignment program is used to compare each read in the FASTA-format file generated by base calling to a FASTA database of cloning and sequencing vectors, vector.fasta. The sequence of the cloning vector used (the pUC18 plasmid sequence in our case) was added to the vector sequence database. On the bioinfo server, the vector sequence database is located at /db/univec/UNIVEC/UniVec or /pgm1/phrap/vector.seq. An example command line for clipping CLONE.fasta is:

% cross_match -minmatch 12 -penalty -2 -minscore 20 -screen CLONE.fasta /db/univec/UNIVEC/UniVec

The -screen option tells cross_match to produce another FASTA file, CLONE.fasta.screen, nearly identical to CLONE.fasta, except that recognised vector sequences are replaced by X (or x, according to the original capitalisation).

Sequence assembly

The vector-clipped reads are assembled to reconstruct the clone sequence using the Phrap sequence assembler. The program takes as input a FASTA-format file of sequence fragments and a companion base quality file, and constructs the contig sequence as a mosaic of the highest-quality parts of the reads. The assembly program is run using the command line:

% phrap -new_ace CLONE.fasta.screen > phrap.out

As a result, Phrap creates a number of files. The most important ones are CLONE.fasta.screen.contigs (the assembly consensus sequence in FASTA format), CLONE.fasta.screen.contigs.qual (the assembly consensus base quality values assigned by Phrap), and CLONE.fasta.screen.ace (a complicated-looking file that enables one to view the result of the assembly in the Consed assembly viewer/editor program).

In the file CLONE.fasta.screen.contigs.qual, Phrap provides quality information about the assembly (i.e., quality values for the contig sequence) by generating its own quality measures (based on read-read confirmation). This process appears to be rule-based, although few references describe it. For example, if all input quality values (given by Phred) are relatively small (less than 15), Phrap assumes that they do not correspond to error probabilities and attempts to rescale them so that the largest quality value is approximately 30; in contrast, if the input quality values are relatively high (≥ 40), Phrap may give a base in the contig (the consensus of more than one base from the reads) a higher quality value, such as 90. After contig assembly, for a contig of length n, the average quality value is given by:

\[ \bar{q} = \frac{1}{n} \sum_{i=1}^{n} q_i \]

where q_i is the quality value of base i in the contig.
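The average quality value per contig can be computed directly from quality data, as in the minimal sketch below. It assumes that the quality file follows the common FASTA-like layout, in which each contig has a header line beginning with ">" followed by lines of whitespace-separated integer quality values; this layout and the function name are assumptions for illustration.

```python
def mean_contig_quality(qual_text: str) -> dict:
    """Map each contig name to the mean of its base quality values."""
    totals = {}   # contig name -> [sum of quality values, number of bases]
    name = None
    for line in qual_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The contig name is taken as the first word of the header.
            name = (line[1:].split() or ["unnamed"])[0]
            totals[name] = [0, 0]
        elif name is not None:
            values = [int(v) for v in line.split()]
            totals[name][0] += sum(values)
            totals[name][1] += len(values)
    return {n: s / c for n, (s, c) in totals.items() if c > 0}
```

The text of a quality file can be passed in after reading it with open(path).read(); contigs without any quality values are omitted from the result.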

2.3.3 Gene finding

One of the main aims of the annotation pipeline is to aid in the identification of protein-coding genes. This can be done by using a gene-finding program to predict gene models (ab initio gene finding), or by predicting possible genes based on the similarity of the sequence to other sequences, particularly other identified sequences. I used both of these approaches, as follows. Ab initio gene predictions were performed using FGENESH (Softberry). The automated gene prediction pipeline was hosted on the bioinfo server at the Computer Center, HKU. The original predictions were manually refined with assistance from GenomeScan, another gene prediction program, which combines sequence similarity and exon-intron composition (i.e., the two distinct types of evidence used by these classes of methods) into one integrated algorithm.

Evaluation of gene recognition accuracy

The predictive accuracy of a gene-finding program is evaluated by comparing the exons predicted by the program with the actual coding exons at the nucleotide level and the exon level [31]. For nucleotide-level accuracy, define the values TP (true positives), TN (true negatives), FP (false positives), and FN (false negatives) as follows: TP = the number of coding nucleotides predicted as coding; TN = the number of noncoding nucleotides predicted as noncoding; FP = the number of noncoding nucleotides predicted as coding; FN = the number of coding nucleotides predicted as noncoding. Sensitivity is then defined as the proportion of coding nucleotides that are correctly predicted as coding:

\[ Sn = \frac{TP}{TP + FN}, \]

and specificity as the proportion of nucleotides predicted as coding that are actually coding:

\[ Sp = \frac{TP}{TP + FP}. \]

For exon-level accuracy, the formulas for exon-level sensitivity (ESn) and specificity (ESp) are:

\[ ESn = \frac{TE}{AE}, \qquad ESp = \frac{TE}{PE}, \]

where TE (true exons) is the number of exactly predicted exons and AE and PE are the numbers of annotated and predicted exons, respectively.
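The four measures defined above can be computed as in the following sketch. Coding nucleotides are represented as sets of positions and exons as (start, end) pairs; these representations and the function names are illustrative assumptions, and the convention of returning 0.0 when a denominator is zero is a choice made here, not part of the definitions.

```python
def nucleotide_accuracy(true_coding: set, predicted_coding: set) -> tuple:
    """Return (Sn, Sp) from sets of coding nucleotide positions."""
    tp = len(true_coding & predicted_coding)   # coding, predicted coding
    fn = len(true_coding - predicted_coding)   # coding, predicted noncoding
    fp = len(predicted_coding - true_coding)   # noncoding, predicted coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


def exon_accuracy(annotated: set, predicted: set) -> tuple:
    """Return (ESn, ESp); an exon is true only if both boundaries match."""
    te = len(annotated & predicted)
    esn = te / len(annotated) if annotated else 0.0
    esp = te / len(predicted) if predicted else 0.0
    return esn, esp
```

Note that a predicted exon whose boundary is off by a single nucleotide still contributes to nucleotide-level accuracy but counts as wrong at the exon level, which is why exon-level values are usually the lower of the two.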

Combining predictions from two gene-finding programs

Gene-finding programs are still unable to provide automatic gene discovery with the desired accuracy. The benefits of combining predictions from more than one existing gene prediction program have been explored [268]. Therefore, methods for combining predictions from two programs, GenScan and HMMgene, were used in the prediction of P. marneffei genes, in an attempt to improve the exon-level accuracy of gene finding by identifying more probable exon boundaries and by eliminating false positive exon predictions. The scripts implementing these methods were obtained from http://www.cs.ubc.ca/labs/beta/genefinding/. Note that at the time this combined prediction study was conducted, the gene-finding program FGENESH was not yet available. A later retrospective test was, however, conducted in which FGENESH was combined with either GenScan or HMMgene.
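The details of the combination methods are not given here, so the following sketch shows only the two simplest set-based ways of combining exon predictions from two programs. It is an assumption-laden illustration, not a reproduction of the scripts cited above: exons are (start, end) pairs on a single strand, and the function names are hypothetical.

```python
def exon_intersection(a: set, b: set) -> list:
    """Exons predicted with identical boundaries by both programs.

    Keeping only these tends to remove false positive exons, at the
    cost of discarding true exons found by just one program.
    """
    return sorted(a & b)


def exon_union(a: set, b: set) -> list:
    """Exons predicted by at least one program.

    This recovers more true exons but also keeps every false positive.
    """
    return sorted(a | b)
```

The trade-off between the two functions mirrors the trade-off between exon-level specificity and sensitivity.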


2.3.4 Database and databank to store results

The first step in database design is to decide what the database will be
used for and how users will interact with it. Once these are defined, the
data to be stored and how these data are associated with one another
are defined. This is done using a conceptual data model. The model is
independent of how the information will be stored in the final, physical
implementation on the computer. Entities, such as gene, contig and gene
product, are defined that informally represent concepts from the real
world. The relationships between these concepts were also defined: for
example, a contig may contain more than one gene, and generally one gene
produces one gene product. A formal language, the Unified Modelling
Language (UML), was used for specifying both use cases and conceptual
data models.
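As an illustration only, the conceptual entities named above (contig, gene and gene product) and the relationships between them can be sketched as Python data classes. This is not the PMGD schema itself: the class layout, the choice of attributes and the coordinate check are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeneProduct:
    gene_product_no: int   # hypothetical identifier
    description: str


@dataclass
class Gene:
    gene_no: int           # hypothetical identifier
    gene_name: str
    start: int             # 1-based start on the contig (assumed convention)
    end: int               # 1-based inclusive end on the contig
    product: Optional[GeneProduct] = None   # generally one product per gene

    def length(self) -> int:
        return self.end - self.start + 1


@dataclass
class Contig:
    contig_no: int         # hypothetical identifier
    contig_name: str
    length: int
    genes: List[Gene] = field(default_factory=list)   # one contig, many genes

    def add_gene(self, gene: Gene) -> None:
        # Reject genes whose coordinates fall outside the contig.
        if not (1 <= gene.start <= gene.end <= self.length):
            raise ValueError("gene coordinates outside contig")
        self.genes.append(gene)
```

A conceptual model of this kind says nothing about storage; the same entities could be mapped onto relational tables, flat files or in-memory objects.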

The next step is the physical implementation of the data model, for which
a database management system (DBMS) has to be selected. Here I used
Microsoft Access, a relational database manager running on a Windows
operating system. It is available in our departmental facilities and is
powerful and efficient enough for medium-size database management. It
has straightforward Web-publication capabilities and intuitive
capabilities for building graphic user interfaces. Administrators of the
database work through the application interface, while users interact
with the database through a web interface. Physical implementation of the
conceptual data model was mediated by the database schema (Fig. 2.3).

Large-scale data that are to be made accessible to the community
should be well curated, annotated and documented, and appropriately
formatted for publication. At present, no universally accepted standard
data format exists for genomics data. Here, I adopted GFF
(http://www.sanger.ac.uk/Software/formats/GFF) and GenBank format to
transfer information to and from public databases and applications. The
database was populated using Perl scripts written with ActiveState Perl
Version 5.6 for Windows (downloaded from http://www.activestate.com)
and the Bioperl modules (obtained from http://www.bioperl.org).
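To make the role of GFF as an exchange format concrete, the following Python sketch parses one GFF line into a record. The database in this study was populated with Perl and Bioperl, not with this code; the sketch assumes the tab-separated layout of GFF version 2 (seqname, source, feature, start, end, score, strand, frame, then an optional attributes column), and the names GffRecord and parse_gff_line are hypothetical.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]   # None when the column holds "."
    strand: str
    frame: str
    attributes: str


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one GFF line; return None for blank and comment lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    attributes = fields[8] if len(fields) > 8 else ""
    return GffRecord(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand, frame, attributes,
    )
```

Raising an error on malformed lines, rather than skipping them silently, makes a change in an upstream tool's output format visible immediately.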

2.3.5 Perl source code collection

In the annotation pipeline, a sequence of analysis steps, each using a
different tool, must be carried out one after the other. The challenge was
the absence of defined standards for the input and output of the different
tools. Because there is no explicit ‘contract’ between the various tools
as to what input and output formats each will support, at any time one
of the tools in the pipeline may change the format of its input or output,
breaking the system. To connect multiple tools together ‘smoothly’
and ‘robustly’, special ‘glue code’ was written, mostly in Perl.
The collection of Perl scripts, organised into several modules, is available
at the PMGD website.

2.3.6 Genome browser configuration

Visualisation of genomic information is not merely for aesthetic
purposes. It is of practical use, because a graphical view conveys more
meaning to people than reading the raw ‘cipher text’ of sequence and
annotation files. Three of the most prominent genome browsers are the
Ensembl Genome Browser (http://www.ensembl.org/) by the European
Bioinformatics Institute and the Sanger Institute, the Map Viewer
(http://www.ncbi.nlm.nih.gov/mapview/) by the National Center for
Biotechnology Information and the UCSC Genome Browser
(http://genome.ucsc.edu/) by the University of California Santa Cruz
Genome Bioinformatics Group. They are highly specialised for their
particular data types and information. Most genome browsers can work
either online or offline. They are usually developed in Perl, Java or
other high-level languages.

PMGD incorporates two free but powerful genome browsers, Argo
(a Java applet; Fig. 2.2) and GBrowse (Generic Genome Browser), in
order to organise and annotate genomic data. GBrowse
(http://www.gmod.org/) combines a database and interactive web pages for
manipulating and displaying annotations on genomes. Setting it up requires
three steps: installation, configuration and customisation. Installation
is an easy walk-through following the instructions. Configuration and
customisation are both done through a configuration file. The host
machine is equipped with a Pentium III processor with a clock speed of
800 MHz and 128 MB of main memory. ActivePerl, BioPerl and the Apache web
server must be installed. There is an advanced option for choosing
between an ‘in-memory’ database and the relational database MySQL
for storing the sequence and annotation information. For a genome of the
size of P. marneffei, the ‘in-memory’ architecture is sufficient.
Sequence files (in FASTA format) and annotation files (in GFF
format) are stored under ‘$HTDOCS/gbrowse/databases’ in the Apache web
server directory. The configuration file (.conf) defining the
settings is stored in ‘$CONF/gbrowse.conf’. GBrowse is highly
customisable. For example, administrators can use different colours or
shapes to represent exons, introns and other genetic elements. More
sophisticated functions, such as the display of different reading frames,
transcription profiles, ESTs and alignments, are also provided.
Administrators are free to customise the browser by switching these
functions on or off and altering the default settings, so that it better
fits the purposes of a particular database.

2.3.7 Synteny identification

To perform synteny analyses, amino acid identity between P. marneffei
and A. nidulans (or other fungi) was first determined by comparing
the predicted proteins from each fungus using BLASTP. Putative
ortholog pairs were predicted using the INPARANOID program [261].
Putative ortholog pairs were aligned using ClustalW and the amino acid per
cent identity for each pair was calculated. If the alignment spanned 60%
of both genes and the alignment score was within 80% of the top score
for either of the pair of genes, then the pair was accepted. Using these
putative ortholog pairs, supercontigs were compared with the ADHoRe
program [322] (r² cutoff = 0.8, maximum gap size = 35 genes, minimum
number of pairs = 3). Results were filtered such that the maximum
probability for a segment to be generated by chance was < 0.01.

Figure 2.2: PMGD genome browser.
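The acceptance rule for putative ortholog pairs in Section 2.3.7 can be written as a small predicate. The sketch below is one reading of that rule, not the code used in this study: "spanned 60% of both genes" is taken to mean that the aligned length is at least 60% of each protein's length, and "within 80% of the top score for either of the pair" is taken to mean that the pair's score is at least 80% of the best score of at least one of the two genes. The function and parameter names are hypothetical.

```python
def accept_ortholog_pair(aligned_len, len_a, len_b,
                         score, top_score_a, top_score_b,
                         min_coverage=0.6, min_score_fraction=0.8):
    """Return True if a putative ortholog pair passes both filters.

    aligned_len      length of the pairwise alignment (residues)
    len_a, len_b     lengths of the two proteins
    score            alignment score of this pair
    top_score_a/b    best alignment score observed for each gene
    """
    # The alignment must span the required fraction of BOTH genes.
    covers_both = (aligned_len >= min_coverage * len_a and
                   aligned_len >= min_coverage * len_b)
    # The score must be close to the top score of EITHER gene (assumed reading).
    near_top = (score >= min_score_fraction * top_score_a or
                score >= min_score_fraction * top_score_b)
    return covers_both and near_top
```

Keeping the two thresholds as parameters makes it easy to test how sensitive the downstream synteny blocks are to the choice of 60% and 80%.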

2.4 Results

2.4.1 Statistics of assembly

As mentioned in Section 1.3.2, all inserts were sequenced from both ends
to generate paired reads. These paired sequence fragments were assembled
using the Phrap package of assembly tools [84], yielding a draft
assembly. Of the reads, 98.35% were assembled, and the assembled sequence
was reconstructed in 273 supercontigs (2,911 contigs). The longest contig
is 178,730 bp and the longest supercontig is 729,276 bp. The fidelity of
the assembly is supported by the high proportion (80.50%) of plasmid-end
pairs preserved in contigs and scaffolds. The net length of assembled
contigs totalled 28.98 Mbp, including the mitochondrial genome of
∼35 kbp (Table 2.2).

Table 2.2: Summary of assembly statistics.

Features                                    Value

Read
  Total Number of Reads Sequenced           315,580
  Number of Bases in Total Reads            173,664,505 bp
  Average Read Length                       550.20
  Number of Confirmed Reads (by Phrap)      310,365
  Fraction of Reads Assembled               98.35%
  Fraction of Reads Paired in Assembly      80.50%
  Number of Bases Used in Assembly          170,951,774 bp
  Average Shotgun Coverage                  6.6 fold (Phrap report)

Contig
  Total Number of Contigs                   2,911
  Number of Bases in Contigs                28,977,603 bp
  Longest Contig                            178,730 bp
  Average Length of Contigs                 9,955 bp

Supercontig (scaffold)
  Total Number of Supercontigs              273
  Number of Bases in Supercontigs           28,421,390 bp
  Longest Supercontig                       729,276 bp
  Average Length of Supercontigs            104,110 bp

2.4.2 Genome size estimation

The genome size was approximated from the draft assembly by estimating
the size of the gaps between contigs and between scaffolds. As shown in
Table 2.2, the total number of bases is 28.42 Mb in supercontigs and
28.98 Mb in contigs. These figures do not include gaps. Within a
supercontig, the gaps between its constituent contigs are called
within-supercontig gaps. As mentioned in Section 1.3.2, two sequencing
clone libraries were constructed, carrying insert sizes of 2.0 – 3.0 kb
and 7.5 – 8.0 kb, respectively. For each pair of reads lying in contigs
adjacent to a gap, the library of origin was identified, so the size of
the gap between adjacent contigs in a supercontig can be derived from the
size of the clones spanning the gap. When estimated gap sizes are
included, the total physical length of all scaffolds is estimated to be
29.8 – 30.5 Mb. Between supercontigs there are so-called
between-supercontig gaps. The size of these gaps is hard to estimate since
no spanning clones are available. In addition, these gaps include
difficult-to-sequence regions of the genome, including the ribosomal DNA
(rDNA) repeats, centromeres and telomeres. Taking these considerations
into account, the genome size is estimated to be ∼31 Mb.

When sequencing is still at a stage of relatively low coverage, there
is a ‘dynamic’ way to estimate genome size by applying the
Lander-Waterman mathematical model. Assuming there is no cloning bias, the
DNA fragments generated in the shotgun sequencing process are located
around the chromosome according to a Poisson distribution [92]. The
unsequenced fraction of a genome (double-strand) is:

$$p = e^{-nw/L},$$

where n is the number of reads, w is the average length of reads and L
is the length of the genome. For a 20 Mb genome, about 120,000 reads of
500 bp would be required to produce, in theory, about 95% (p = 0.05)
coverage.

The number of unsequenced regions on both strands equals the number of
contigs, N, which can be calculated as:

$$N = n e^{-nw/L}.$$

For the sequence data used in this calculation (about 60 Mb of reads),
there were 119,744 reads in total with a mean length of 511 bp. Assembly
with Phrap generated 13,861 contigs. Therefore, n = 119,744, w = 511 and
N = 13,861, and the genome size can be calculated as follows:

$$L = -\frac{nw}{\ln(N/n)} \approx 28{,}377{,}000 \text{ bp}.$$

In practice, the number of contigs is higher than the theoretical
expectation, because when assembling fragments Phrap needs an overlap of
nucleotides to link two reads together. These overlap regions do not
contribute to the actual coverage but are counted in the calculation as
if they did. Another factor is the bias due to cloning difficulties [186].
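The genome size estimate in Section 2.4.2 can be reproduced with a few lines of Python. This is a sketch of the arithmetic only, using the symbols n, w, N and L from the text; the function names are hypothetical and no correction is made for the read overlap or cloning bias mentioned above.

```python
import math


def unsequenced_fraction(n, w, L):
    """Expected unsequenced fraction p = exp(-n*w/L)."""
    return math.exp(-n * w / L)


def expected_contigs(n, w, L):
    """Expected number of contigs N = n * exp(-n*w/L)."""
    return n * math.exp(-n * w / L)


def estimate_genome_size(n, w, N):
    """Solve N = n * exp(-n*w/L) for L, giving L = -n*w / ln(N/n)."""
    if not 0 < N < n:
        raise ValueError("requires 0 < N < n")
    return -n * w / math.log(N / n)


if __name__ == "__main__":
    # 120,000 reads of 500 bp on a 20 Mb genome: p = exp(-3), about 0.05.
    print(round(unsequenced_fraction(120_000, 500, 20_000_000), 4))
    # Values reported in the text: n = 119,744, w = 511, N = 13,861.
    print(round(estimate_genome_size(119_744, 511, 13_861)))
```

Because N enters only through ln(N/n), the estimate is fairly insensitive to moderate errors in the contig count, which is consistent with the result lying close to the assembly-based estimate despite the excess of contigs expected in practice.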

2.4.3 Accuracy of gene finding

The purpose of evaluating gene recognition accuracy is to select the
best gene-finding program. A test data set composed of 103 Penicillium
protein-coding genes that contain multiple exons was built. Our
results show that FGENESH gives the most accurate prediction overall.
With it, we can identify ∼90% of coding nucleotides with 12% false
positives. It provides sensitivity (Sn) = 96% and specificity (Sp) = 89%
at the base level, Sn = 92% and Sp = 84% at the exon level and Sn =
85% and Sp = 67% at the gene level.

2.4.4 Combination of gene finding

Gene recognition accuracy may be improved by combining predictions

from two gene-finding programs. Rogic et al. [268] implemented a series

of algorithms combining gene prediction from two existing gene finding

systems, GenScan and HMMgene. The combined algorithms were tested

on the HMR195 sequence dataset and generated improved accuracy at

both the nucleotide and exon levels, where the average improvement was

7.9% compared to the best result obtained by GenScan or HMMgene

alone.

In order to identify the most accurate gene prediction system for P.
marneffei, I conducted an evaluation study to compare GenScan, HMMgene
and the combined gene prediction system based on them. The improvement in
accuracy obtained with the combined algorithm in Rogic’s study was not
observed in our study, in which we used a dataset of 103 sequences with
known genes from Penicillium species. Our results show that GenScan tends
to give a significantly better prediction than either of the other
systems. At the nucleotide level, the sensitivity was 95% for GenScan,
89% for HMMgene and 92% for the combined algorithm.

Two considerations arise from the discouraging result obtained when the
combined algorithm was applied to the dataset from Penicillium species.
Firstly, the different performance of the combined algorithm in our study
and in Rogic’s is most likely caused by the difference in organisms. The
dataset HMR195 used in Rogic’s study is composed of 195 human, mouse and
rat sequences. Secondly, if two systems generate consistent predictions
(whether good or bad), then combining them will not give better results.
For the human and rodent dataset, GenScan and HMMgene performed
differently, but neither of them was always superior to the other. When
GenScan and HMMgene were applied to our dataset of sequences from
Penicillium species, however, GenScan consistently generated
significantly better results than HMMgene. Combining gene-finding systems
does not help if one system is always superior.

As mentioned, FGENESH was not available at the time the combination test
was conducted. A later retrospective test indicated that no improvement
could be obtained by combining FGENESH with either GenScan or HMMgene
(data not shown). Consequently, we decided to use FGENESH alone to
perform the gene prediction for this project.

2.4.5 Database and databank to store results

The physical deployment of the P. marneffei genome database differs from
that of the annotation pipeline, which is hosted on a SUN Solaris server
at the Computer Center, HKU. PMGD is located on a Windows 2000 based
system at the Department of Microbiology, HKU, which is accessible as a
workstation for administrators and as a web service system for general
users.

2.5 Discussion

Nowadays high-throughput DNA sequencing offers a rapid and cost-effective
approach to obtaining the most important and relevant of all genetic
information: the complete DNA sequence of an organism. As the quantity of
data increases for a genome project such as that of P. marneffei,
researchers have to become more sophisticated about data management
issues. This study developed such a system for the P. marneffei genome
project. The system performs the semi-automatic tasks of assembly
analysis, gene prediction and analysis, and extragenic region analysis.
In order to be compatible with the computer systems available at the
Department of Microbiology, HKU, the system was designed to span multiple
working environments and to integrate several public-domain and newly
developed software programs capable of dealing with several types of
databases. Our PMGD solution provides a feasible way to handle the
information and to manage large quantities of data internally or for
public use. The genome sequence was searched against the public protein
databases using BLAST. Genes were predicted using FGENESH and adjusted
manually by reference to GenomeScan. FGENESH was selected as the best
predictor from a number of gene-calling programs validated against a test
set of 103 previously characterised Penicillium protein-coding genes.

Ab initio gene finding is challenging in P. marneffei, for two reasons:
1) the lack of a training dataset. Training a gene-finding program
normally requires more than 300 genes in order to reach statistical
power. However,

for P. marneffei we do not have enough characterised genes; 2) the lack
of cDNA, which is very useful for confirming initial gene predictions.
Identifying genes that lack an available cDNA sequence requires other
methods, such as interspecies homolog searches. We do have a small number
of RST sequences available [364], but, owing to their poor sequence
quality, they are not helpful. Our solution to this problem is to apply a
pre-existing gene-finding program, namely FGENESH. Generally speaking, if
one uses a pre-existing gene-finding program on a newly sequenced
organism, one expects inaccurate predictions. However, our evaluation
shows that FGENESH trained with an A. nidulans dataset produced
satisfactory results when applied to P. marneffei. This is due to the
close phylogenetic relationship between the two species. We also tried to
combine predictions made by more than one gene prediction system, an
approach that has been proposed to improve gene prediction accuracy
significantly. Unfortunately, because FGENESH is markedly better than
the other gene-finding programs available, we did not observe such an
improvement after combination.

Figure 2.3: Database schema of PMGD. [Schema diagram; tables shown
include GENE, CONTIG, SCAFFOLD, PROTEIN, PROTEIN_INFO, GENE_PRODUCT,
GENE_FUNCTION, FUNCTION_EVIDENCE, GENE_ALIAS, HOMOLOG, ORTHOLOG, BLASTP,
BLASTX, BLAST_PROGRAM, INTERPRO, PATHWAY, GO_EVIDENCE, REFERENCE,
ABSTRACT, AUTHOR, JOURNAL and CATEGORY.]

Further directions can be envisaged based on the current stage of
the system. Firstly, one of the striking characteristics of the genomes of
eukaryotic organisms is the existence of multigene families. This
confounds the identification of orthologous relationships among genes in
interspecies comparisons. In order to discriminate between orthologs and
paralogs, more sophisticated algorithms are required. These algorithms
should take phylogenetic information into account and integrate it into
the protein prediction system. Secondly, when assigning a function to a
protein, a controlled vocabulary should be used across all organisms.
The recent development of the Gene Ontology [9] project has produced a
dynamic controlled vocabulary that can cope with the ever accumulating
and changing knowledge of gene and protein functions. Thirdly, the
further function prediction systems develop, the more important the
evaluation of their accuracy becomes. Iliopoulos (2002) established a
scoring scheme to measure the performance of prediction systems [143].
Despite this, considerable concerns are still raised regarding the
accuracy of assignment and the reproducibility of methodologies. The
evaluation of the performance of these systems is missing at this stage.

In summary, modern biology has created an information explosion.
The areas of whole-genome sequencing and functional genomics have
produced a prodigious amount of data, and the P. marneffei genome project
is no exception. This study provided a solution by offering an annotation
pipeline that links various biological software programs in a systematic
way, as well as a state-of-the-art database management system for storing
and retrieving biological sequence data. It has been successfully applied
to the day-to-day work of annotation for the most important thermally
dimorphic fungus.

Chapter 3

MITOCHONDRIAL GENOME OF PENICILLIUM

MARNEFFEI

The work described in this chapter is very closely based on a paper
I have published with colleagues [353].

3.1 Introduction

Mitochondria are the power centres of the cell. They are generally the
major sites of aerobic respiration and energy production in fungi,
providing the energy a cell needs to move, divide, produce secretory
products and contract. They are small, oval-shaped, membrane-bound
organelles, about the size of a bacterium, surrounded by a highly
specialised double membrane. The outer membrane is fairly smooth, but the
inner membrane, where oxidative phosphorylation takes place, is highly
convoluted. The two membranes define two compartments, the intermembrane
space and the matrix. The reactions of the citric acid cycle and fatty
acid oxidation occur in the matrix.

Mitochondria maintain their own genomes, and a number of mitochondrial
genome sequences have now become available. At present, the NCBI
organelle genome resource maintains a collection of 350 completed
mitochondrial genomes from different organisms, including 256 metazoans,
15 fungi, 9 plants and 22 others. The number is subject to change
as sequencing efforts advance. The gene content of mitochondrial genomes
is generally well conserved. In metazoans, for example, the mitochondrial
genomes are generally circular, about 16 kb long, and encode three primary
transcript types (13 proteins used for energy production, two rRNAs and
22 tRNAs). The homologous genes existing in the mitochondria of plants,
protists, fungi and animals, and in the genomes of prokaryotes, make it
possible to undertake inter-species gene comparisons. Next, I review the
major components of the respiratory pathway of fungal mitochondria.

The common and invariant feature of respiratory pathways of mi-

tochondria is production of ATP coupled to electron transport. The

respiratory chain begins with electrons being transferred from NADH to

complex I (NADH:ubiquinone oxidoreductase) or from the tricarboxylic

acid cycle intermediate succinate to complex II (succinate:ubiquinone

oxidoreductase). Electrons are transferred via ubiquinones, complex III

(ubiquinol:cytochrome c oxidoreductase), cytochrome c, complex IV (cy-

tochrome c oxidase) and finally to molecular oxygen to give water (Fig.

3.1).

Complex I is composed of peptides encoded by both nuclear and
mitochondrial genes (more than 25 nuclear genes and seven mitochondrially
encoded genes, nad1, 2, 3, 4, 4L, 5 and 6), forming a large multisubunit
complex that spans the inner mitochondrial membrane. Note that a
few fungi, such as Saccharomyces cerevisiae and Schizosaccharomyces pombe,
lack complex I, and many fungi have additional components, such as
alternative NADH dehydrogenases and/or an alternative terminal oxidase
(see review [152]). Complex III contains nine subunits, of which only
apocytochrome b is encoded in the mitochondrion. Between complexes III
and IV, cytochrome c resides in the intermembrane space and passes
electrons. Cytochrome c is encoded by the nuclear cyc-1 gene. Complex IV
contains 7-8 polypeptides, of which three are encoded in the
mitochondrion (cox1, cox2 and cox3). It is the terminal oxidase of the
standard respiratory pathway. Complex V is the mitochondrial ATP synthase,
two of whose subunits are encoded by the mitochondrial genes atp6 and
atp8.

Since several mitochondrial complexes have subunits encoded in both the
mitochondrial and nuclear genomes, the coordinated expression of genes
encoded in the nucleus and the mitochondrion is critical for
mitochondrial function. These complexes include not only the large
respiratory complexes mentioned above, but also the translational
machinery, which involves nuclear-encoded polypeptides and
mitochondrially encoded rRNAs and tRNAs, and so on [240]. Therefore,
both the nuclear and mitochondrial genomes contribute essential subunit
polypeptides to important mitochondrial proteins, and they collaborate in
the synthesis and assembly of these proteins (for review, see [256]).

Figure 3.1: Fungal respiratory pathways. The diagram was downloaded
from http://pages.slu.edu/faculty/kennellj

In this chapter I report the complete sequence of the mitochondrial
genome of Penicillium marneffei, the first complete mitochondrial
DNA sequence of a thermally dimorphic fungus. This 35 kb mitochondrial
genome contains the genes encoding ATP synthase subunits 6, 8 and 9
(atp6, atp8 and atp9), cytochrome oxidase subunits I, II and III (cox1,
cox2 and cox3), apocytochrome b (cob), reduced nicotinamide adenine
dinucleotide ubiquinone oxidoreductase subunits (nad1, nad2, nad3, nad4,
nad4L, nad5 and nad6), the ribosomal protein of the small ribosomal
subunit (rps), 28 tRNAs, and the small and large ribosomal RNAs. Analysis
of gene content, gene order and gene sequences revealed that the
mitochondrial genome of P. marneffei is more closely related to those of
moulds than to those of yeasts.


3.2.1 Library construction and sequence assembly

The P. marneffei mitochondrial genome was sequenced as part of the
P. marneffei whole genome sequencing project, as described in Chapters
1 and 2. A genomic DNA (including mitochondrial DNA) library was
made in pUC18, carrying insert sizes from 2.0 to 8.0 kb. DNA inserts
were prepared by physical shearing using the sonication method. The
work above was done by my colleagues in the Department of Microbiology,
HKU and the Beijing Genome Institute. I used the Phred/Phrap/Consed
software package for base calling, contig assembly and assembly quality
assessment [83, 84, 112]. The complete mitochondrial DNA genome
was generated from the assembly of 467 successful sequence reads (100 bp
at Phred value Q20 [112,243]), which corresponded to an overall
mitochondrial genome coverage of about 7×.

3.2.2 Mitochondrial DNA sequence annotation

The putative ORFs in P. marneffei mitochondrial DNA were identified
using Artemis, a free sequence viewer and annotation tool, with the
mould genetic code. Genes in which the putative ORFs were located
were functionally assigned through BLASTP searches against fungal
mitochondrion-encoded proteins available in the GenBank database.
Introns and rRNAs were mainly identified by BLASTN pairwise comparison
of P. marneffei mitochondrial DNA with the mitochondrial DNAs of
Aspergillus nidulans, Neurospora crassa, Saccharomyces cerevisiae (Acc.
NC_001224), Schizosaccharomyces pombe (Acc. NC_001326), Podospora
anserina (Acc. NC_001329), Allomyces macrogynus (Acc. NC_001715),
Pichia canadensis (Acc. NC_001762), Candida albicans (Acc. NC_002653),
Yarrowia lipolytica (Acc. NC_002659), and Candida glabrata (Acc.
NC_004691) [29, 91, 101, 354, 175, 262]. The BLASTN results were viewed
through ACT, a DNA sequence comparison viewer based on Artemis [40], and
exon and intron boundaries were adjusted manually. The tRNAs were
predicted by tRNAscan-SE 1.21 [207]. The core structures of the group
I introns were inferred by the program CITRON [200].

3.2.3 Phylogenetic analysis

Phylogenetic analysis was performed by using MBEToolbox as described

in Chapter 10. The 11 genes that encode subunits of respiratory chain

complexes (cox1, cox2, cox3, cob, nad1, nad2, nad3, nad4, nad4L, nad5,

and nad6 ) and the three that encode ATPase subunits (atp6, atp8, and

atp9 ) in the P. marneffei mitochondrial genome and the corresponding

genes in 24 other fungi with completed mitochondrial genomes were used

to determine the phylogenetic relationships of P. marneffei to the other

fungi. Phylogenetic trees were constructed using unambiguously aligned

portions of concatenated amino acid sequences of these 14 protein cod-

ing genes by the maximum likelihood method in the Phylip package [86].

The corresponding nad genes are not present in Schizosaccharomyces

japonicus, Schizosaccharomyces octosporus, S. pombe, C. glabrata, Sac-

charomyces castellii, Saccharomyces servazzii, and S. cerevisiae, and the

maximum likelihood method is not as sensitive to a lack of sequence in-

formation as the distance methods. A total of 3,462 amino acid positions

were included in the analysis.

3.2.4 Mitochondrial DNA sequences in nuclear genome

Fragments of mitochondrial DNA sequences were searched for in the corresponding nuclear genomes in P. marneffei, A. nidulans, N. crassa, S. cerevisiae, and S. pombe. For each fungus, the corresponding mitochondrial DNA sequence was used as the query sequence to search against its own nuclear genome, using a published method for S. cerevisiae [262]. The mitochondrial and genomic DNA sequences of A. nidulans and N. crassa were downloaded from the A. nidulans Database (http://www-genome.wi.mit.edu/annotation/fungi/aspergillus/) and N. crassa Database (http://www-genome.wi.mit.edu/annotation/fungi/neurospora/) respectively, and those of S. cerevisiae and S. pombe were obtained from GenBank. For P. marneffei, the 6.6× coverage of genomic DNA sequences was generated by our own whole genome sequencing project.

3.3 Results and Discussion

3.3.1 Gene content and genome organisation

The mitochondrial DNA of P. marneffei is a circular DNA molecule of 35,438 bp (Fig. 3.2). The overall G+C content is 25%, and that of the protein-coding genes is 24%. The genome encodes 28 tRNAs, the small and the large subunit rRNAs, the ribosomal protein of the small ribosomal subunit, 11 genes encoding subunits of respiratory chain complexes, and the three ATPase subunits (Table 3.1). All genes are encoded by the same DNA strand. Structural genes occupy 63.6% of the genome (40.5% corresponds to protein coding exons, 5.9% to the 28 tRNA genes, and 17.3% to the rRNA subunits), intergenic spacers of 14-372 bp occupy 8.8%, and the 11 introns occupy 32.4%.
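As an illustration of how a G+C percentage such as the one quoted above is obtained, the following Python sketch counts G and C among the unambiguous bases of a sequence. The function name and the toy sequence are hypothetical; ignoring ambiguous symbols such as N is an assumption of this sketch rather than a description of the procedure used in this study.

```python
def gc_content(sequence: str) -> float:
    """Return the G+C content of a DNA sequence as a percentage.

    Only unambiguous bases (A, C, G, T) are counted; other symbols are
    skipped, which is a simplifying assumption of this sketch.
    """
    bases = [base for base in sequence.upper() if base in "ACGT"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    gc_count = sum(1 for base in bases if base in "GC")
    return 100.0 * gc_count / len(bases)


# Toy AT-rich fragment, not taken from the P. marneffei sequence.
toy_percent = gc_content("ATTAGCATTA")
```

Applied to the full mitochondrial sequence, or to the concatenated protein-coding exons, the same function would give the genome-wide and coding-region values respectively.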

3.3.2 Protein coding genes

The P. marneffei mitochondrial genome contains 15 protein coding genes. These include genes encoding ATP synthase subunits 6, 8, and 9 (atp6, atp8, and atp9), the cytochrome oxidase subunits I, II, and III (cox1, cox2, and cox3), apocytochrome b (cob), the reduced nicotinamide adenine dinucleotide ubiquinone oxidoreductase subunits (nad1, nad2, nad3, nad4, nad4L, nad5, and nad6), and the ribosomal protein of the small ribosomal subunit (rps). This set of protein coding genes is exactly the same as that in the A. nidulans mitochondrial genome. Furthermore, the gene order of the protein genes is the same as that in the A. nidulans mitochondrial genome, except for the atp9 gene, which is located between the cox1 and nad3 genes in the A. nidulans mitochondrial genome, but between the nad2 and cob genes in the P. marneffei mitochondrial genome (Fig. 3.3).

Figure 3.2: Physical map of P. marneffei mitochondrial DNA (35,438 bp). The map is based on an annotation of the reverse complement of Assembly 3 of the P. marneffei mitochondrial sequence determined by the P. marneffei Sequencing Project at the University of Hong Kong in collaboration with Beijing Genomics Institute of Chinese Academy of Sciences. Numbers in the inner circle are in kb. The sequence is numbered from the unique restriction enzyme ClaI site (AT|CGAT) (0/35.4), which is located just upstream to the nad4L gene and downstream to the cox2 gene. Exons are shown in black, introns in white, and intronic ORFs in gray.

Table 3.1: Gene content of P. marneffei mitochondrial genome. * Exact start codon could not be determined merely through sequence comparison.

Genetic element  Localisation (nt)                   Size (bp)  Size (aa)  Start  Stop
nad4L            26-295                              270        89         ATG    TAA
nad5             295-2271                            1977       658        ATG    TAA
nad2             2289-4028                           1740       579        TTA    TAA
atp9             4216-4440                           225        74         ATG    TAA
cob              Join: (4706-5098, 6270-7037)        2332       386        ATG    TAA
cob-i1-ORF       5099-5965                           867        288        TTG*   TAA
nad1             Join: (7532-8179, 8650-9081)        1550       359        ATA    TAA
nad4             9253-10716                          1464       487        ATG    TAA
atp8             10945-11091                         147        48         ATG    TAG
atp6             11158-11928                         771        256        ATG    TAA
rns              12341-13721                         1381
nad6             14053-14637                         585        194        ATG    TAA
URF1             14722-15177                         456        151        ATG    TAA
cox3             15352-16161                         810        269        ATG    TAA
rnl              Join: (17165-19688, 21361-21902)    4738
rps              19987-21252                         1266       421        ATG    TAA
cox1             Join: (23339-23718, 24994-25099,    9821       561        ATT    TAA
                 26298-26641, 27740-27875,
                 29012-29201, 30504-30553,
                 31652-31806, 32835-33159)
cox1-i1-ORF      23720-24622                         903        300        AAA*   TAA
cox1-i2-ORF      25100-26200                         1101       366        AAA*   TAA
cox1-i3-ORF      26643-27647                         1005       334        AAA*   TAA
cox1-i4-ORF      27876-28928                         1053       350        TGA*   TAA
cox1-i5-ORF      29204-30043                         840        279        TTA*   TAA
cox1-i6-ORF      30554-31384                         831        276        ACA*   TAA
cox1-i7-ORF      31808-32629                         821        273        AGA*   TAG
URF2             33223-33660                         438        145        ATT    TAA
nad3             33955-34362                         408        135        ATG    TAA
cox2             34591-35346                         756        251        ATG    TAA
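The size columns of Table 3.1 for the split genes can be checked from the exon coordinates alone. The Python sketch below is illustrative only: the helper names are hypothetical, and it assumes that the bp column is the full span from the first to the last exon base (introns included) and that the aa column excludes the stop codon, which is how the numbers in the table appear to be related.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # 1-based, inclusive (start, end) of one exon


def parse_join(location: str) -> List[Segment]:
    """Parse a string such as '4706-5098, 6270-7037' into exon segments."""
    segments = []
    for part in location.split(","):
        start, end = part.strip().split("-")
        segments.append((int(start), int(end)))
    return segments


def gene_span(segments: List[Segment]) -> int:
    """Length from the first exon base to the last, introns included."""
    return segments[-1][1] - segments[0][0] + 1


def protein_length(segments: List[Segment]) -> int:
    """Number of amino acids, assuming the last codon is the stop codon."""
    coding = sum(end - start + 1 for start, end in segments)
    if coding % 3 != 0:
        raise ValueError("exon lengths do not sum to a multiple of three")
    return coding // 3 - 1


# Coordinates copied from Table 3.1.
cob = parse_join("4706-5098, 6270-7037")
cox1 = parse_join(
    "23339-23718, 24994-25099, 26298-26641, 27740-27875, "
    "29012-29201, 30504-30553, 31652-31806, 32835-33159"
)
```

For cob, the two exons sum to 1,161 bp, that is 387 codons or 386 amino acids plus a stop codon, while the span including the intron is 2,332 bp, in agreement with the table.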

Concatenated amino acid sequences of the 14 protein coding genes in the mitochondrial genomes of P. marneffei and 24 other fungi were used for phylogenetic tree construction. The closest relatives of P. marneffei were A. nidulans and other moulds, such as P. anserina, N. crassa, Hypocrea jecorina, and Verticillium lecanii (Fig. 3.4). On the other hand, the yeasts, such as the Saccharomyces species, Schizosaccharomyces species, Candida species, and P. canadensis, were more distantly related to P. marneffei. This implies that the mitochondrial genome of P. marneffei is phylogenetically more closely related to those of moulds than to those of yeasts. This is in line with our previous observation, and with results published by others, that when the chromosomal 18S rRNA genes or the internal transcribed spacers and 5.8S rRNA genes (ITS1-5.8S-ITS2) and mitochondrial small subunit rRNA genes were used for phylogenetic tree construction, the closest neighbours of P. marneffei, besides the other Penicillium species, were the Aspergillus species as well as other moulds [202, 364]. Furthermore, the same gene content and almost the same gene order in the mitochondrial genomes of P. marneffei and A. nidulans also imply that the mitochondrial genome is probably not related to the unique characteristic of thermal dimorphism of P. marneffei. Interestingly, MP1, the gene that encodes an abundant and highly immunogenic protein in P. marneffei, has known homologues only in A. nidulans, A. fumigatus, and A. flavus, but not in other fungi [37,39,38,363,43,351,352].

Figure 3.3: Gene content and order comparison between P. marneffei mitochondrial DNA and A. nidulans mitochondrial DNA. Protein and rRNA genes are shown separately from tRNA genes. The only exonic gene that has undergone gene rearrangement is atp9, which is highlighted in black background.

Figure 3.4: Phylogenetic relationships of P. marneffei to other fungi and distribution of group I and group II introns in the corresponding fungi. Maximum likelihood tree showing phylogenetic relationships of P. marneffei to other fungi and distribution of group I and group II introns in the corresponding fungi. The tree was constructed using unambiguously aligned portions of concatenated amino acid sequences of the 14 protein-coding genes (atp6, atp8, atp9, cob, cox1, cox2, cox3, nad1, nad2, nad3, nad4, nad4L, nad5 and nad6). A total of 3,462 amino acid positions were used for the inference with ProML [86]. In the intron distribution panel, which covers rnl and the 14 protein-coding genes, separate symbols mark group I introns with an intronic ORF, group I introns without an intronic ORF, and group II introns; genes that are not present are crossed out. Sequences were obtained from GenBank: Allomyces macrogynus (NC_001715), Aspergillus nidulans (CAA32799, CAA33481, AAA99207, AAA31737, CAA25707, AAA31736, CAA23994, P15956, CAA23995, CAA33116), Candida albicans (NC_002653), Candida glabrata (NC_004691), Cryptococcus neoformans var. grubii (NC_004336), Harpochytrium sp. JEL105 (NC_004623), Harpochytrium sp. JEL94 (NC_004760), Hyaloraphidium curvatum (NC_003048), Hypocrea jecorina (NC_003388), Monoblepharella sp. JEL15 (NC_004624), Neurospora crassa (CAA24041, CAA32799, AAA31961, CAA27029, CAA27418, AAA66053, AAA31959), P. marneffei (present study), Pichia canadensis (NC_001762), Podospora anserina (NC_001329), Rhizophydium sp. 136 (NC_003053), Saccharomyces castellii (NC_003920), Saccharomyces cerevisiae (NC_001224), Saccharomyces servazzii (NC_004918), Schizophyllum commune (NC_003049), Schizosaccharomyces japonicus (NC_004332), Schizosaccharomyces octosporus (NC_004312), Schizosaccharomyces pombe (NC_001326), Spizellomyces punctatus (NC_003052, NC_003061 and NC_003060), Verticillium lecanii (NC_004514), Yarrowia lipolytica (NC_002659). Some sequences of A. nidulans were downloaded from Fungal Mitochondrial Genome Project (http://megasun.bch.umontreal.ca/People/lang/FMGP/FMGP.html), and some sequences of N. crassa were downloaded from http://pages.slu.edu/faculty/kennellj/genbank.html. The scale bar (0.1) indicates the branch lengths that were scaled in terms of expected numbers of amino acid substitutions.

3.3.3 Genetic code and codon usage

Since the mitochondrial genome of P. marneffei is phylogenetically closely related to those of moulds and its gene content is the same as that of A. nidulans, the genetic code of the mitochondrial genome of P. marneffei is assumed to be the same as that of A. nidulans.

There is a strong codon usage bias in exonic ORFs in the mitochondrial genome of P. marneffei towards codons ending in A or T (Table 3.2). In fact, eight codons (CTC, CTG, ACG, TGC, TGG, CGC, CGG, and GGC) were not used at all, five codons (GTC, TCC, TCG, ACC, and AGG) were used only once, and nine codons (ATC, CCG, GCC, GCG, CAC, CAG, AAG, GAC, and GGG) were used 2 to 10 times, in exonic ORFs. Moreover, this codon usage bias is also evident in the use of stop codons, where TAA is used as the stop codon in 14 genes, but TAG is used in only one gene.
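A codon usage tally of the kind summarised above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script used for Table 3.2: the function names are hypothetical, the input is assumed to be an in-frame coding sequence whose length is a multiple of three, and the toy sequence is invented.

```python
from collections import Counter


def count_codons(cds: str) -> Counter:
    """Count in-frame codons of a coding sequence (stop codon included)."""
    seq = cds.upper()
    if len(seq) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of three")
    return Counter(seq[i:i + 3] for i in range(0, len(seq), 3))


def at_ending_fraction(counts: Counter) -> float:
    """Fraction of codons whose third position is A or T."""
    total = sum(counts.values())
    at_ending = sum(n for codon, n in counts.items() if codon[-1] in "AT")
    return at_ending / total


# Invented example: ATG TTA TTT GGT TAA
toy_counts = count_codons("ATGTTATTTGGTTAA")
toy_fraction = at_ending_fraction(toy_counts)
```

Summing the counters over all exonic ORFs would give per-codon totals comparable to the "Genes" column of Table 3.2, and the third-position fraction gives a single number for the A or T bias.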

3.3.4 tRNA genes

Twenty-eight tRNA genes were identified in the P. marneffei mitochondrial genome (Fig. 3.5). These are all located on the same DNA strand as the other genes. The set of mitochondrial tRNAs in P. marneffei is similar in type to that in A. nidulans. Furthermore, the sequences of the mitochondrial tRNA genes of P. marneffei are fairly well conserved relative to those of A. nidulans, especially within the two tRNA gene clusters shared by the two species (Fig. 3.3).

3.3.5 Other RNA genes

The genes that encode the 23S and 16S ribosomal RNAs of the large and small subunits of the ribosome (rnl and rns) were identified. Furthermore, a gene (rps) that encodes the ribosomal protein of the small ribosomal subunit was identified within the intron of rnl (Table 3.1 and Fig. 3.6); this gene is also present in the A. nidulans mitochondrial genome.

Table 3.2: Codon usage in protein-coding genes of P. marneffei mitochondrial genome. Numbers indicate the total numbers of codons in either identified protein coding genes or ORFs (including free-standing URFs, intronic ORFs and RPS).

Codon  AA  Genes  ORFs    Codon  AA  Genes  ORFs
TTT    F   307    143     TCT    S   160    93
TTC    F   66     13      TCC    S   1      5
TTA    L   572    250     TCA    S   105    45
TTG    L   26     33      TCG    S   1      13
CTT    L   49     42      CCT    P   119    35
CTC    L   0      6       CCC    P   4      2
CTA    L   20     24      CCA    P   25     20
CTG    L   0      4       CCG    P   4      3
ATT    I   182    134     ACT    T   121    78
ATC    I   10     12      ACC    T   1      7
ATA    I   326    162     ACA    T   105    45
ATG    M   112    38      ACG    T   0      4
GTT    V   132    74      GCT    A   144    49
GTC    V   1      3       GCC    A   4      7
GTA    V   131    70      GCA    A   81     35
GTG    V   18     5       GCG    A   7      3
TAT    Y   191    180     TGT    C   24     21
TAC    Y   32     27      TGC    C   0      4
TAA    *   14     9       TGA    W   56     37
TAG    *   1      1       TGG    W   0      5
CAT    H   76     47      CGT    R   10     24
CAC    H   8      7       CGC    R   0      1
CAA    Q   83     75      CGA    R   0      1
CAG    Q   5      7       CGG    R   0      2
AAT    N   196    277     AGT    S   123    90
AAC    N   11     30      AGC    S   15     8
AAA    K   101    347     AGA    R   78     94
AAG    K   6      18      AGG    R   1      9
GAT    D   97     112     GGT    G   188    94
GAC    D   3      11      GGC    G   0      1
GAA    E   89     133     GGA    G   92     32
GAG    E   21     21      GGG    G   6      13

Figure 3.5: 28 tRNAs encoded in the mitochondrial genome of P. marneffei. Predicted clover-leaf structures of the 28 tRNAs encoded in the mitochondrial genome of P. marneffei. Anticodons are underlined and the corresponding amino acids are indicated. tRNAs are listed according to the order of their positions in the map in Fig. 3.2.

Figure 3.6: Predicted secondary structures of two representative group I introns. Group I introns, PmRnl.1 and PmCox1.3, of rnl and cox1 genes respectively, in P. marneffei. The exon/intron boundaries are represented by dotted lines. Base pairs are depicted by bars. The corresponding sizes of nucleotides not shown are indicated in bp. RPS5 gene is depicted by square box. The numbers correspond to the coordinates in the mitochondrial genome.

3.3.6 Group I introns

In P. marneffei, the cox1 gene contains seven introns (PmCox1.1, PmCox1.2, PmCox1.3, PmCox1.4, PmCox1.5, PmCox1.6, and PmCox1.7), while the cob gene, nad1 gene, and rnl gene contain one intron each (PmCob1.1, PmNad1.1, and PmRnl1.1 respectively). Each intron in the cox1, nad1, and rnl genes contains an ORF. The ORF in the rnl gene encodes the rps gene. The predicted secondary structures of two representative group I introns are depicted in Fig. 3.6. In both introns, the upstream exons end with a T and the introns end with a G, typical for most group I introns.

Table 3.3: Presence of mitochondrial DNA fragments in nuclear genomes. 'Nuc no.', number of mtDNA fragments in nuclear genomes; 'Mt size', size of mitochondrial genomes (kb); 'Nuc size', size of nuclear genome (Mb); 'Ratio', ratio of sizes of mitochondrial to nuclear genome (kb/Mb).

Fungus         Nuc no.  Mt size  Nuc size  Ratio
P. marneffei   10       35.4     ~29.5     ~1.20
A. nidulans    17       ~33.2    ~31.0     ~1.07
N. crassa      21       ~64.8    ~43.0     ~1.51
S. cerevisiae  34       85.7     12.1      7.08
S. pombe       21       19.4     13.8      1.41

A comparison of the distribution of group I and group II introns in the

14 protein coding genes and rnl gene in the P. marneffei mitochondrial

genome and that in the corresponding genes in the other 24 fungi is

shown in Fig. 3.4. As a whole, the distribution of these introns in the

genes encoded in the mitochondrial genome of P. marneffei concurs with

those of the other fungi. The cox1 gene, the gene that contains the

largest number of self-splicing introns in other mitochondrial genomes,

is also the gene that contains the largest number of self-splicing introns

in the P. marneffei genome. The cob and nad1 genes, the genes that

also contain significant numbers of self-splicing introns, also possess one

group I intron each in the P. marneffei mitochondrial genome.

3.3.7 Mitochondrial DNA sequences in nuclear genome

The presence of mitochondrial DNA sequence fragments in the corresponding nuclear genomes of P. marneffei, A. nidulans, N. crassa, S. cerevisiae, and S. pombe was compared (Table 3.3). By using the same method of sequence similarity comparison used for S. cerevisiae [262], only 10 mitochondrial DNA sequence fragments were detected in the 4× coverage nuclear genome sequences of P. marneffei, which represent 95% of the genome (Table 3.4). This number of mitochondrial DNA sequence fragments in the nuclear genome, as well as the ratio of mitochondrial to nuclear genome size, was comparable to those found in A. nidulans, N. crassa, and S. pombe (Table 3.3). On the other hand, the number of mitochondrial DNA sequence fragments in the nuclear genome of S. cerevisiae was 34, which was higher than in any of the other fungi (10 to 21 fragments). Although the relatively high ratio of mitochondrial to nuclear genome size of S. cerevisiae may partly explain this phenomenon, further studies would be necessary to elucidate the difference in the significance of these mitochondrial DNA fragments in the nuclear genomes for the different fungi.

Table 3.4: P. marneffei mitochondrial DNA sequences present in nuclear genome.

No.  Coordinates   Size (bp)  Location     E-value
1    9031..9069    39         nad1         9e-08
2    10182..10201  20         nad4         1e-03
3    11622..11697  76         atp6         2e-15
4    13445..13465  21         rrs          2e-04
5    15158..15177  20         nad6 – cox3  1e-03
6    18757..18776  20         rnl          1e-03
7    25168..25187  20         cox1         1e-03
8    31197..31216  20         cox1         1e-03
9    32560..32580  21         cox1         2e-04
10   34510..34529  20         nad3 – cox2  1e-03
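The "Ratio" column of Table 3.3 is simply the mitochondrial genome size in kb divided by the nuclear genome size in Mb. The short Python sketch below recomputes it from the values in the table; the dictionary layout and function name are hypothetical, and the approximate sizes marked with a tilde in the table are entered as plain numbers.

```python
# (fragments in nuclear genome, mitochondrial size in kb, nuclear size in Mb),
# copied from Table 3.3; sizes given as approximate there are used as-is.
table_3_3 = {
    "P. marneffei": (10, 35.4, 29.5),
    "A. nidulans": (17, 33.2, 31.0),
    "N. crassa": (21, 64.8, 43.0),
    "S. cerevisiae": (34, 85.7, 12.1),
    "S. pombe": (21, 19.4, 13.8),
}


def size_ratio(mt_kb: float, nuc_mb: float) -> float:
    """Ratio of mitochondrial (kb) to nuclear (Mb) genome size."""
    return mt_kb / nuc_mb


ratios = {
    fungus: round(size_ratio(mt_kb, nuc_mb), 2)
    for fungus, (_, mt_kb, nuc_mb) in table_3_3.items()
}

for fungus, (fragments, _, _) in table_3_3.items():
    print(f"{fungus:14s} fragments={fragments:2d} ratio={ratios[fungus]:.2f}")
```

The recomputed values agree with the table to two decimal places and make the contrast explicit: the S. cerevisiae ratio is several times that of the other four fungi, whose ratios all lie between about 1.1 and 1.5.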

In conclusion, among the known mitochondrial genomes of fungi, the P. marneffei mitochondrial genome has an intermediate size. The replication origin of the P. marneffei mitochondrial genome is unknown. Despite the distinct biological property of thermal dimorphism in P. marneffei, its mitochondrial genome is much more closely related to those of moulds, especially to that of A. nidulans, than to those of yeasts. The set of protein coding genes in the P. marneffei mitochondrial genome is exactly the same as that in the A. nidulans mitochondrial genome. Except for the atp9 gene, the gene order of the protein genes is also the same as that in the A. nidulans mitochondrial genome. Furthermore, when concatenated amino acid sequences of 14 protein coding genes in the mitochondrial genomes of P. marneffei and 24 other fungi were used for phylogenetic tree construction, the closest relatives of P. marneffei were A. nidulans and other moulds, whereas the yeasts were more distantly related.

Chapter 4

GENOMIC EVIDENCE FOR THE PRESENCE OF

MELANIN BIOSYNTHESIS GENE CLUSTER IN


In this chapter, I first review fungal virulence factors and their identification by genomic approaches, and then present genomic evidence for the presence of melanin biosynthesis genes in Penicillium marneffei.

4.1 Introduction

In Chapter 3, when I compared the mitochondrial genome of P. marneffei

to those of other fungi, it was observed that the mitochondrial genome

of P. marneffei is much more closely related to those of moulds, espe-

cially to that of Aspergillus nidulans, than to yeasts. The set of protein

coding genes in the P. marneffei mitochondrial genome is exactly the

same as that in the A. nidulans mitochondrial genome. Except for the

atp9 gene, the gene order of the protein genes is also the same as that

in the A. nidulans mitochondrial genome. Furthermore, the amino acid

sequence identity between the mitochondrial genes of P. marneffei and

those of A. nidulans is significantly higher than those between the mi-

tochondrial genes of P. marneffei and those of Neurospora crassa, Can-

dida albicans, Saccharomyces cerevisiae, and Schizosaccharomyces pombe.

This evidence of close relationships between P. marneffei and Aspergillus

species has prompted a further search for previously undiscovered charac-

teristics in P. marneffei based on our knowledge of the various Aspergillus

species.

Melanins are negatively charged pigments of high molecular weight with hydrophobic surfaces. They are formed by the oxidative polymerisation of phenolic and/or indolic compounds [341]. They are carcinogens that are widespread in agricultural products and food. They are mainly produced by various Aspergillus species, like A. parasiticus and A. flavus, and less frequently, also by A. nomius, A. pseudotamarii, and A. bombycis [170]. Since melanin is made by these important pathogenic fungi and has been implicated in the pathogenesis of a number of fungal infections, it would be of interest to investigate whether P. marneffei could synthesise melanin or melanin-like compounds.

Here, after the literature review, I report the progress in identifying a gene cluster in P. marneffei, spanning 19 kb, which contains homologues of six genes. All six of these genes, which form a cluster in A. fumigatus, have been shown to be involved in DHN-melanin biosynthesis [24, 187, 317, 318]. These genes are alb1, arp1, arp2, abr1, and abr2, encoding a polyketide synthase, a scytalone dehydratase, a hydroxynaphthalene reductase, a putative protein possessing two signatures of multicopper oxidases, and a laccase respectively, as well as ayg1, which is of unknown function. The order of the genes in the clusters of the two fungi differs slightly. These findings indicate that P. marneffei can potentially produce melanin or melanin-like compounds. Since melanin is an important virulence factor in other pathogenic fungi, this pigment may have a similar role to play in the pathogenesis of penicilliosis.


Most fungi cannot survive in the environment provided by human tissue and therefore are not pathogenic. Amongst the more than 100,000 fungal species that have been described, only a handful are pathogens. The pathogenic fungi are divided into two classes, primary pathogens and opportunistic pathogens. Primary pathogenic fungi, e.g., Coccidioides immitis and Histoplasma capsulatum, are "professional" pathogens which are adapted to live inside healthy mammalian and human tissue, causing disease not only in immunocompromised patients but also in healthy people. Opportunistic fungi may have an environmental reservoir or exist as commensals in a healthy host. Some examples include Candida species, C. neoformans and A. fumigatus. These fungi are able to grow and invade host tissue only when they can take advantage of an immunocompromised host. However, the incidence of life-threatening mycoses caused by opportunistic fungal pathogens has increased dramatically in recent years, and they have become the major cause of fungal infections. The infections caused by pathogenic fungi can be superficial, subcutaneous or systemic. Superficial infection is localised to the skin, the hair, and the nails; subcutaneous infection is confined to the dermis, subcutaneous tissue or adjacent structures; systemic infection refers to deep infections of the internal organs.

4.2.1 Potential virulence factors

A virulence factor in a fungus literally refers to any factor that the fungus possesses that increases its virulence in the host. For instance, if a gene or a protein is essential for growth in vivo but its deletion does not affect mycelial growth in vitro, it is considered a virulence factor [189]. The concept of a virulence factor differs between primary pathogens and opportunistic pathogens, and it is relatively difficult to define precisely when dealing with the latter, as pointed out by [128]. For most fungal pathogens, few virulence factors that contribute to their pathogenicity have been reported.

Although the mechanisms of fungal pathogenicity remain poorly understood, the development of a fungal infection must satisfy several conditions. The fungus must first be able to adhere to the host tissues. The fungus must colonise the host and invade the host tissue. Once the fungus has invaded the host tissue, it must be able to adapt to the tissue environment. Probably most importantly, the fungus must be able to avoid the host's cellular defences.

Adherence to host tissues

Adherence factors are essential for fungal pathogens to attach themselves to host tissue and to resist physical clearing of the infectious agent. For example, C. immitis, Aspergillus species, H. capsulatum and Cryptococcus neoformans all infect via the bronchial route and must have specific adaptations in order to avoid effective clearance from a host's lungs. Adherence is dependent on a variety of factors, including surface glycoproteins, fungal cell surface hydrophobicity, pH, temperature, and, of course, the phenotype of the organism. Adhesins are biomolecules that promote the adherence of fungi to host cells or host-cell ligands and that bind to several extracellular matrix proteins of mammalian cells, such as fibronectin, laminin, fibrinogen and collagen Type I and IV.

Amongst the many studies that have shown the association of adherence and fungal pathogenesis, the studies on adhesion in C. albicans are the most extensive. Candida species express several cell surface proteins termed adhesins which actively promote binding to host cells. These include a lectin-like protein that recognises sugar residues of epithelial cell surface glycoproteins, and a complement receptor-like protein, CR3, which may play a role in adherence to endothelial cells. Several adherence-promoting molecules or adhesins of C. albicans regulate attachment, invasion, and dissemination of the fungus [36,157].

Als1p (agglutinin-like sequence) of C. albicans is a member of a family of seven glycosylated proteins with similarity to the S. cerevisiae α-agglutinin protein that is required for cell-cell recognition during mating. Als1p is essential for virulence in a hematogenously disseminated murine model [98].

HWP1 is a hyphal- and germ-tube-specific outer surface mannopro-

92

tein that binds C. albicans hyphae to human buccal epithelial cells [319].

The null mutant was less virulent than parental or single-gene-deleted

strains in a hematogenously disseminated murine model. The yeast ger-

minated less readily in the kidneys of infected mice and caused less en-

dothelial cell damage [319]. C. albicans binds to several ECM ligands,

including FN, laminin and collagens I and IV. C. albicans expresses an

integrin-like protein INT1 which is 25% identical to a non-repeat region

of the fibrinogen-binding protein, ClfA, of Staphylococcus aureus. Strains

of C. albicans deleted in INT1 were less virulent and adhered less readily

to an epithelial cell line [102]. Strains of C. albicans deleted in the 1,2-

mannosyltransferase gene (MNT1) are less able to adhere in vitro and are

avirulent. Mnt1p is a type II membrane protein that is required for both
O- and N-mannosylation in fungi and was found to be required for adherence
to an epithelial cell line [34].

Adhesins of other medically important fungi, such as Blastomyces
dermatitidis (a dimorphic fungal pathogen that infects the host through
inhalation of conidia [276]), have also been characterised. One example is
WI-1, a 120-kDa surface protein adhesin on B. dermatitidis that binds CD18
and CD14 receptors on human macrophages [232]. Hogan et al. [133]
cloned the WI-1 adhesin gene and found a total of 30 highly conserved
repeats of a 24-amino acid sequence. The repeat sequence is similar to
invasin, an adhesion-promoting protein of Yersiniae [169].

Invasion

Invasion is required for the development of deep mycoses in the internal

tissues of the body. The process is probably aided by hydrolytic enzymes,

such as proteinases and lipases, and in the case of dermatophytes,
keratinases. Secretion of extracellular enzymes, such as phospholipase, has been
proposed as one of the virulence mechanisms used by bacteria, parasites,
and pathogenic fungi in overcoming host defence mechanisms. The role of

extracellular phospholipase as a potential virulence factor in pathogenic
fungi, including C. albicans, C. neoformans and A. fumigatus, has been
reported. Of the four candidal phospholipases (PLA, PLB, PLC and PLD),
only phospholipase B, encoded by PLB1, has been shown to contribute to
virulence: C. albicans null mutants constructed by targeted gene
disruption, which failed to secrete phospholipase B, showed attenuated
virulence in two clinically relevant murine models of candidiasis. Initial
data suggest that direct host cell damage and lysis are the main virulence
mechanisms.

The secretion of lytic and degradative enzymes is also of obvious im-

portance to the invasion of host tissues. Those necrotic enzymes secreted

by fungi can break down structural barriers and play an important role

in mediating host tissue invasion. The most extensively studied example

is the SAP gene family in C. albicans [294]. At least nine proteins comprise
the family of secreted aspartyl proteinases. In guinea pig and murine
models of invasive disease, deletions in sap1-6 attenuated virulence. The
SAP genes have been shown to be differentially expressed according to
the growth phase and phenotype of the organism: SAP2 mRNA was the
dominant transcript in the yeast phase, whereas SAP4, SAP5 and SAP6
transcripts were observed only at neutral pH during the serum-induced
yeast-to-hyphal transition. The order of expression, SAP1 and SAP2 followed
sequentially by SAP8, SAP6 and SAP3, was correlated with tissue invasion, i.e.,
early invasion (SAP1, SAP2), extensive penetration (SAP8) and extensive
hyphal growth (SAP6). These data indicate that members of the SAP
gene family may have distinct roles in the colonisation and invasion of
the host [63].

Growth at elevated temperature/Thermotolerance

Thermotolerance is one of the most obvious factors leading to pathogenesis.
The ability to grow at body temperature (37 °C) and within the fever range
(38–42 °C) is important to systemic infection. The majority of fungi have an

optimum growth temperature of 25 to 30 °C, and may grow only weakly
or not at all at 37 °C. The first genome-wide analysis of the
temperature-regulated transcriptome of C. neoformans was carried out by
Steen et al. [296]. They identified sets of genes with higher transcript
levels at 25 °C or 37 °C, respectively.

Morphology/Morphogenesis

There is a growing body of evidence linking morphogenesis and virulence.

Changes in morphologies are advantageous for fungal pathogens. It has

been demonstrated that fungal hyphae can exert significant tip pressure

for penetration [224]. Many fungi adopt this morphological change and
develop virulence. Filamentous fungi (such as Aspergillus species) tend
to form branched hyphae in the lung. C. neoformans, a unique encapsulated
yeast, is coated with a polysaccharide capsule. The capsule

is a potent inhibitor of macrophage phagocytosis, which is an important

factor in the resistance to C. neoformans infection.

The most remarkable ability shared among the dimorphic fungi, such
as B. dermatitidis, C. immitis, H. capsulatum, Paracoccidioides brasiliensis
and Sporothrix schenckii, is to switch between two distinct forms: yeast
and mould. The dimorphic fungi normally exist as non-pathogenic forms
(usually filamentous mycelia) in the environment and convert into
pathogenic forms (yeast) in the tissues of a host. This process is
reversible, although the trigger for conversion is not known and may differ
amongst fungi. The importance of the yeast cell, as an invasive morphology,
for dimorphic fungi has been reviewed by Gow et al. [113, 114]. As
shown in Table 4.1, most dimorphic mycelial pathogens invade tissues of
a host as yeast cells. Yeast cells are regarded as better adapted for
dissemination within the host circulatory system and avoidance of immune
capture. Note that although the opportunistic pathogens C. albicans
and Candida tropicalis show dimorphic growth, these Candida species

Table 4.1: Major dimorphic fungal pathogens and their characteristic
morphologies in infectious disease. Taken from [114].

Fungal species                   Form in diseased tissue
Blastomyces dermatitidis         Budding yeasts
Candida albicans                 (Pseudo)hyphae, budding yeasts
Candida tropicalis               Yeast and pseudohyphae
Coccidioides immitis             Endosporulating spherules
Cryptococcus neoformans          Budding capsulate yeasts
Histoplasma capsulatum           Budding yeasts
Paracoccidioides brasiliensis    Budding yeasts
Penicillium marneffei            Yeasts undergoing binary fission
Sporothrix schenckii             Budding yeasts
Wangiella dermatitidis           Budding yeasts

mainly form pseudohyphae and are therefore not regarded as true dimorphic
fungi. Nevertheless, conversion to pseudohyphae has long been
regarded as essential for tissue invasion by Candida species.

4.2.2 Genomic approaches in identification of virulence factors

In practice, combining several of the following techniques has great
potential for elucidating biological systems in detail.

Mining whole genome sequences and fishing for virulence factors

The sequencing of the genome of the budding yeast, S. cerevisiae, was a
landmark in genomics. Since then, progress has been made in sequencing
whole fungal genomes. The second complete sequence of a fungal genome,
that of S. pombe, was published in 2002 [354]. The genome sequences of the
filamentous fungi A. nidulans, A. fumigatus, N. crassa and Ashbya gossypii
are nearing completion (see also Section 1.2.4). In 2002, while still at an
early stage, the Fungal Genome Initiative (FGI), a genome sequencing
program of the National Human Genome Research Institute, USA, proposed to
sequence 15 fungi selected on the basis of medical, scientific and
commercial criteria. FGI will apply deep-shotgun sequencing approaches
(sequencing coverage > 10×) in order to finish all sequencing work quickly.
If fully funded, it will produce a large amount of valuable information for
detailed comparative genomic analysis across the fungal taxa.

The genome sequences have an immediate impact on conventional fungal
genetics by eliminating the years of effort previously associated with gene
discovery. Traditional genetic and biochemical approaches to gene discovery
in fungi suffer from many limitations, such as poor efficiencies of
transfer, lack of stable extrachromosomal elements and poor growth in
the laboratory. With the genomic sequence in hand, one can bypass these
limitations by using genomic approaches, which permit rapid identification
of novel genes. Therefore, obtaining genome sequences from pathogenic
fungi is one of the most efficient steps in the identification of potential
targets for therapeutic intervention and vaccination.

Other genomic approaches

Current genomic approaches can be categorised into three groups:
mutagenic-based, nucleotide-based and protein-based [206]. The
mutagenic-based techniques include signature-tagged mutagenesis and the
construction of mutant libraries. Microarray analysis and serial analysis
of gene expression (SAGE), for example, belong to the nucleotide-based
techniques. The two-hybrid system, protein arrays and 2D-PAGE expression
analysis are examples of protein-based techniques.


4.3.1 Identification of melanin biosynthesis genes in P. marneffei

To identify melanin biosynthesis genes in the P. marneffei genome, protein
sequences of melanin biosynthesis genes of Aspergillus were downloaded
from GenBank. The downloaded protein sequences were used
as queries against the P. marneffei genome. The comparison was conducted

using the NCBI TBLASTN program version 2.0 with the BLOSUM62
scoring matrix [6]. The E-value cutoff used to assign homologues was
1 × 10⁻²⁰. The contigs in the P. marneffei genome that contained
homologues were extracted and annotated manually. Predicted peptides
were compared to the amino acid sequences of their corresponding
query proteins using NCBI BLAST2SEQ
(http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html). The “expect value”
statistics were calculated based on the size of the NCBI non-redundant
protein database. Conserved domains/motifs were identified using InterPro
release 5.1 [367].
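The filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used: it assumes the TBLASTN hits are available in the common 12-column tab-separated layout, the query and contig names and all numbers in the example are invented, and the function name filter_hits is hypothetical. Hits with an E-value at or below the cutoff are kept; whether the original analysis treated the cutoff as inclusive is not stated.

```python
import csv
import io

# Hypothetical TBLASTN hits in a 12-column tab-separated layout
# (query, subject, %identity, alignment length, mismatches, gap opens,
#  q.start, q.end, s.start, s.end, E-value, bit score).
# All names and numbers below are invented for illustration.
EXAMPLE_HITS = """\
queryA\tcontig_0012\t60.0\t528\t190\t6\t20\t547\t10450\t12033\t0.0\t650
queryB\tcontig_0012\t57.0\t403\t170\t2\t1\t403\t14010\t15218\t1e-140\t495
queryB\tcontig_0407\t31.0\t120\t80\t3\t15\t134\t220\t579\t3e-08\t52.1
"""

E_VALUE_CUTOFF = 1e-20  # cutoff stated in the text above


def filter_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return {query: [(contig, e_value), ...]} for hits with E-value <= cutoff."""
    kept = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        query, contig, e_value = row[0], row[1], float(row[10])
        if e_value <= cutoff:
            kept.setdefault(query, []).append((contig, e_value))
    return kept


if __name__ == "__main__":
    for query, hits in filter_hits(EXAMPLE_HITS).items():
        print(query, hits)
```

The contigs named in the retained hits are the ones that would then be extracted for manual annotation.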

4.3.2 Multiple alignments and phylogenetic analyses

Multiple alignments of amino acid sequences were performed using the

program ClustalX 1.81 [311]. Initial pairwise alignments were performed
using the Blosum62 protein weight matrix and adjustments to
the alignments were made manually. Graphic presentations of the
alignments and consensus sequences were produced using the program
BOXSHADE 3.21 (http://www.ch.embnet.org/software/BOX_form.html).
Regions of ambiguous alignment were removed using the GeneDoc program
(http://www.psc.edu/biomed/genedoc). Phylogenetic trees were

inferred by the neighbour-joining method [273]. Bootstrap resampling

with 1000 pseudoreplicates was carried out to assess support for each

individual branch.
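The bootstrap step can be illustrated with a minimal Python sketch. It shows only the resampling of alignment columns with replacement; tree inference from each pseudoreplicate (neighbour-joining in this chapter) is not reproduced here. The toy alignment, the function names and the fixed random seed are assumptions made for the example.

```python
import random

# Hypothetical toy alignment: equal-length aligned sequences keyed by name.
ALIGNMENT = {
    "seq1": "MKTLAV",
    "seq2": "MKSLAV",
    "seq3": "MRTLGV",
}


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates=1000, seed=0):
    """Yield n_replicates pseudoreplicate alignments (1000 by default)."""
    rng = random.Random(seed)
    for _ in range(n_replicates):
        yield bootstrap_replicate(alignment, rng)


if __name__ == "__main__":
    for replicate in bootstrap_replicates(ALIGNMENT, n_replicates=3):
        print(replicate)
```

Support for a branch would then be taken as the fraction of pseudoreplicate trees in which that branch appears.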


4.4.1 Melanin gene cluster present in P. marneffei

Secondary metabolism, the production of compounds not essential for

growth in culture, is thought to be integrally intertwined with develop-

ment in fungi. These events, usually induced by nutrient, biosynthesis

or addition of an inducer, and/or by a growth rate decrease, generate

signals which effect a cascade of regulatory events resulting in chemical

differentiation (secondary metabolism) and morphological differentiation

(morphogenesis). Microbial secondary metabolites have a major effect on

the health, nutrition and economics of our society. They include antibi-

otics, pigments, toxins, effectors of ecological competition and symbiosis,

pheromones, enzyme inhibitors, immunomodulating agents, receptor an-

tagonists and agonists, pesticides, antitumor agents and growth promot-

ers of animals and plants. Among them, fungal secondary metabolites

are of intense interest due to their pharmaceutical (antibiotics) and/or

toxic (mycotoxins) properties. Unlike primary metabolism, the pathways

of secondary metabolism are still not understood to a great degree and

thus provide opportunities for basic investigations of enzymology, con-

trol and differentiation. Recently tremendous progress has been made

in understanding the genes that are associated with production of var-

ious fungal secondary metabolites. For example, work with Aspergillus

species has revealed a link between asexual reproduction and the production
of toxic secondary metabolites. One of the best-studied fungal
secondary metabolic processes is the biosynthesis of melanin.

Based on the principle of similarity search, we took advantage of
the whole genome sequence to determine whether this important
genetic capacity is present in P. marneffei. Six genes are known to be
involved in DHN-melanin biosynthesis in A. fumigatus: abr2, abr1, ayg1,
arp2, arp1, and alb1 [318]. The functions or gene products of these genes
are given in Table 4.2; note that the function of ayg1 is unknown. All these
genes are available from GenBank, and the gene order has been determined by
a previous genetic study [318] and further confirmed by the A. fumigatus
genome project. The gene order is abr2-abr1-ayg1-arp2-arp1-alb1 (Fig. 4.2).

Table 4.2: Putative gene products related to melanin biosynthesis in
P. marneffei.

Af protein (Acc. No.)  Function                                    Pm protein  Length (aa), Af/Pm  E-value  Identity/Positive (%)  Overlap length (aa)
abr1 (AAF03353)        brown 1                                     pm-abr1     664/555             0.0      60/77                  528
ayg1 (AAF03354)        yellowish-green 1                           pm-ayg1     406/403             e-140    57/71                  403
arp2 (AAF03314)        1,3,6,8-tetrahydroxynaphthalene reductase   pm-arp2     273/275             8e-95    63/74                  254
arp1 (AAC49843)        scytalone dehydratase                       pm-arp1     168/208             2e-81    77/91                  160
abr2 (AAF03349)        brown 2                                     pm-abr2     587/526             0.0      55/73                  505
alb1 (AAC39471)        polyketide synthase                         pm-alb1     2146/1568           0.0      59/71                  1639

When the amino acid sequences of proteins encoded by these 6 genes
were used as queries against the P. marneffei genome, significant hits were
obtained for all 6 proteins. When the predicted peptides of the
corresponding contigs were compared to the amino acid sequences of the
corresponding query proteins, the E-values of the 6 comparisons ranged from
5E-13 to 0 (Table 4.2), indicating high levels of similarity between the P.
marneffei proteins and the A. fumigatus proteins. In A. fumigatus, abr1
encodes a multicopper oxidase and abr2 encodes a laccase. We detected
weak sequence similarity (60% alignable overlap with 30% amino-acid
positive similarity) between the two genes at the amino-acid level. This
weak similarity suggests that the two genes are paralogs that originated
from a gene duplication. In addition, we collected abr1 and abr2 homologs
from some other fungal species and constructed a multiple alignment of the
gene family (Fig. 4.1), which gives information about how the gene family
has diverged.

Figure 4.1: P. marneffei abr1 gene Cu-oxidase domain homologues. Alignment of partial amino acid sequences of Cu-oxidase domains of ascomycetes.

More importantly, the synthases of secondary metabolism are often

coded by clustered genes on chromosomal DNA. It has been suggested

that such an organisation of genes may allow coordinated regulation of

the pathway [337]. The 6 melanin biosynthesis genes are located in a gene
cluster in P. marneffei (Fig. 4.2). The gene order is largely conserved when

compared to that of A. fumigatus. In P. marneffei, abr1-ayg1-arp2-arp1
are located in one contig, and abr2 and alb1 in two other contigs. Scaffolding
suggests that these 3 contigs belong to a single scaffold. Within this
scaffold, the 3 contigs are ordered one after another, i.e. uninterrupted
by other contigs. Therefore, the gene order in P. marneffei can be inferred
as abr1-ayg1-arp2-arp1-abr2-alb1. This placement was supported by
5 and 6 pairs of forward-reverse paired reads, respectively, in the 2 gaps
between the 3 contigs; it is therefore likely that the 6 genes are correctly
ordered and that the length of this gene cluster can be closely approximated.
As shown in Fig. 4.2, the 6 genes span over 35 kb of the P. marneffei
genome, which is about twice the length in A. fumigatus (19 kb). The
majority of this difference is due to a gene-free region of > 15 kb between
abr2 and alb1 (Fig. 4.2). Comparing the gene order in the two fungi, the
only change is that abr2 has moved from the beginning of the cluster
(as in A. fumigatus) to after arp1 in P. marneffei. In addition, the
direction of alb1 is reversed. The tendency of genes for enzymes of certain

metabolic pathways to be clustered in filamentous fungi has been noted

previously [161]. Generally these gene clusters encode optional pathways

for nutrient utilisation (e.g., the optional carbon source, quinate) [107]

or for synthesis of secondary metabolites (e.g., the mycotoxin, sterigma-

tocystin) [28]. Unlike the clustering of genes as operons in prokaryotes,

clusters of similar genes in fungi are not cotranscribed, nor has any vital

regulatory function for clustering been established [161]. Thus the rea-

son for the existence of gene clusters in filamentous fungi has not been

resolved.
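The gene-order comparison reported above can be reproduced with a short Python sketch. It takes the two gene orders given in the text and uses a longest common subsequence (a standard dynamic-programming technique chosen here for illustration, not necessarily how the comparison was originally made) to find the largest set of genes whose relative order is conserved; genes outside that set are reported as relocated. The function names are hypothetical, and gene orientation, such as the reversal of alb1, is not modelled.

```python
# Gene orders as stated in the text (orientation is ignored in this sketch).
AF_ORDER = ["abr2", "abr1", "ayg1", "arp2", "arp1", "alb1"]  # A. fumigatus
PM_ORDER = ["abr1", "ayg1", "arp2", "arp1", "abr2", "alb1"]  # P. marneffei (inferred)


def longest_common_subsequence(a, b):
    """Return one longest subsequence of genes occurring in the same order in a and b."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Trace back through the table to recover the subsequence itself.
    i, j, result = len(a), len(b), []
    while i and j:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


def relocated_genes(reference, other):
    """Genes of the reference order that fall outside the conserved ordering."""
    conserved = set(longest_common_subsequence(reference, other))
    return [gene for gene in reference if gene not in conserved]


if __name__ == "__main__":
    print("conserved order:", longest_common_subsequence(AF_ORDER, PM_ORDER))
    print("relocated:", relocated_genes(AF_ORDER, PM_ORDER))
```

For these two orders the conserved ordering contains five genes and the only relocated gene is abr2, in agreement with the description above.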

A. fumigatus:  abr2  abr1  ayg1  arp2  arp1  alb1
P. marneffei:  abr1  ayg1  arp2  arp1  abr2  alb1
(scale bar: 5 kb)

Figure 4.2: Comparison of the melanin gene clusters of P. marneffei and A. fumigatus.

4.4.2 Disrupted aflatoxin biosynthesis gene cluster in P. marneffei

With the possible exception of the penicillin metabolic cluster, the most
thoroughly examined fungal secondary metabolite gene clusters are those
involved in mycotoxin biosynthesis, particularly the aflatoxin (AF) and
sterigmatocystin (ST) biosynthetic clusters found in several Aspergillus

species [28]. These clusters contain a total of 23 genes involved in
aflatoxin biosynthesis and other related functions (including 20 genes that
encode enzymes, two genes that encode regulatory proteins, and one gene
that encodes an efflux transport protein) in Aspergillus species. No
sequence information for cypA, norB, and ordB was available from GenBank
at the time of analysis. The sequences of the remaining 20 genes,
including 17 genes that encode enzymes (hexA, hexB, pksA, nor-1, avnA,
adhA, norA, avfA, cypX, estA, vbs, ver1, moxY, verB, omtB, omtA, and
ordA), the two regulatory genes (aflR and aflJ) and the one transport gene
(aflT), were downloaded. When the amino acid sequences of these proteins
were used as queries to search against the P. marneffei genome,
significant hits (TBLASTN E-value cutoff 1.0e-10) were obtained for all 20
proteins. When the predicted peptides of the corresponding contigs were
compared to the amino acid sequences of the corresponding query proteins,
the BLASTP E-values of these comparisons ranged from 5.0e-13
to 0 (data not shown), indicating high levels of similarity between the P.
marneffei proteins and the Aspergillus proteins. It is notable that the
putative gene products of omtA and ordA, which are responsible for the
last step in the conversion of ST to AF, were found in P. marneffei to have
high similarity with their counterparts in A. parasiticus.

Despite putative homologues of the Aspergillus genes in the aflatoxin

biosynthesis pathway being present in the P. marneffei genome, these

genes do not form a cluster as they do in Aspergillus. This contradicts the

general trend that genes involved in fungal secondary metabolism usually

appear as a cluster, as in the A. flavus and A. parasiticus genomes.

Since almost all of these genes in the P. marneffei genome were located
on different contigs, the homologs we identified might be involved in the
production of other, unknown secondary metabolites instead of
aflatoxin. Alternatively, major rearrangement of the genes of the aflatoxin
biosynthesis gene cluster may have occurred in P. marneffei during
evolution, which might affect its ability to produce aflatoxins and the
amount produced.
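The reasoning that the homologs do not form a cluster rests on where the best hits fall in the assembly. A minimal Python sketch of such a check is given below. The gene names are taken from the list above, but the contig assignments are invented purely for illustration and are not the results of this study; the function names are likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping of query gene -> contig carrying its best hit.
# Contig identifiers are invented; they are not data from this thesis.
BEST_HIT_CONTIG = {
    "pksA": "contig_0101",
    "nor-1": "contig_0733",
    "ver1": "contig_0101",
    "omtA": "contig_0290",
    "aflR": "contig_0518",
}


def genes_by_contig(best_hit_contig):
    """Group query genes by the contig on which their best hit was found."""
    groups = defaultdict(list)
    for gene, contig in best_hit_contig.items():
        groups[contig].append(gene)
    return dict(groups)


def largest_colocated_fraction(best_hit_contig):
    """Fraction of all query genes found on the single most populated contig."""
    groups = genes_by_contig(best_hit_contig)
    return max(len(genes) for genes in groups.values()) / len(best_hit_contig)


if __name__ == "__main__":
    for contig, genes in genes_by_contig(BEST_HIT_CONTIG).items():
        print(contig, genes)
    print("largest co-located fraction:", largest_colocated_fraction(BEST_HIT_CONTIG))
```

A fraction close to 1 would be expected for an intact cluster lying on one contig, whereas a low value is consistent with dispersed homologs. Because a cluster may be split across adjacent contigs in a draft assembly, scaffold information would also need to be considered, as was done for the melanin cluster.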

4.4.3 Absence of penicillin biosynthesis genes in P. marneffei

Genomic sequence provides evidence for the presence of genetic components,
such as the melanin biosynthesis gene cluster. It also provides evidence
for the absence of some important genetic components, which is equally
valuable. The beta-lactam antibiotic penicillin, one of the most commonly
used antibiotics for the therapy of infectious diseases, is produced as an
end product by some filamentous fungi, such as

Penicillium chrysogenum. Penicillin biosynthesis is catalysed by three

enzymes which are encoded by the following three genes: acvA (pcbAB),

ipnA (pcbC ) and aatA (penDE ), which are organised in a gene cluster.

Although the production of secondary metabolites, such as penicillin,

is not essential for the direct survival of the producing organisms, sev-

eral studies indicated that penicillin biosynthesis genes are controlled by

a complex regulatory network, e.g., by the ambient pH, carbon source,

amino acids, nitrogen etc. Most notably, this gene cluster is present in

A. nidulans which is a penicillin producer.

In conclusion, the identification of the coding capacity for a set of

proteins that could be involved in melanin biosynthesis has been reported

here. The presence of these homologues suggests the potential ability for

the biosynthesis of melanin or melanin-like substances in P. marneffei.

Since melanin is a well-defined fungal virulence factor, it is reasonable to

infer that it is also a virulence factor in P. marneffei, although experimental

confirmation is required. In addition, despite putative homologues of the

Aspergillus genes in the aflatoxin biosynthesis pathway being present in

the P. marneffei genome, these genes do not form a cluster as they do in

Aspergillus. They might be involved in the production of other unknown

secondary metabolites.

Chapter 5

MATING ABILITIES IN PENICILLIUM MARNEFFEI

Penicillium marneffei was believed to be asexual, but the genome

sequence analysis suggests that the fungus maintains the genetic capa-

bility for sexual reproduction. If confirmed, this raises the potential for

developing powerful genetic tools for the organism, with far reaching im-

plications for its genetic study and disease control.

5.1 Introduction

The most distinctive feature of Penicillium marneffei is the
temperature-dependent dimorphic switch. At 25 °C, P. marneffei exhibits
true filamentous growth, while at 37 °C it undergoes a dimorphic transition to

produce uninucleate yeast cells that divide by fission. The control of this

“dramatic” developmental process is of interest because it is required for

pathogenicity and may therefore provide a target for controlling infec-

tion. Fungal dimorphic growth and mating are regulated by common

signal transduction pathways, such as the mitogen-activated protein ki-

nase pathway and the nutrient sensing cAMP pathway. Studies of devel-

opment in many fungi have converged to define these conserved pathways,

which are organised in different ways to regulate filamentation, mating

and virulence, in different fungi as they adapt to unique environmental

challenges [192]. Given such a common regulatory mechanism, it is not so

surprising to find an association between the mating process and virulence

in some fungi. For example, a MATα strain of Cryptococcus neoformans

is 30-fold more prevalent in the environment and 40-fold more prevalent

in infections than a MATa strain [183, 193]. Candida albicans utilises a

number of the same genes for both mating and pathogenesis. The mating

pheromone of C. albicans elicits an over-expression of a set of virulence

genes in recipient cells [16]. Proteins encoded by these genes were previ-

ously shown to be required for virulence in a mouse model of disseminated

candidiasis. Therefore, it is of particular interest to understand the P.

marneffei mating system, which may be parallel to dimorphic develop-

ment and pathogenesis of this medically important fungus.

Traditionally, P. marneffei has been considered an asexual (anamorph)
ascomycete that lacks an apparent sexual (teleomorph) stage in its life
cycle and seems to reproduce only mitotically [44, 104]. Recent genetic
studies, however, suggest it may have an unidentified sexual cycle [20,19].
Two homologs of the Aspergillus nidulans steA and stuA genes, stlA and
stuA, have been cloned from P. marneffei [20, 19]. Both steA and stuA

are involved in controlling mating in the sexual homothallic A. nidulans.

The stlA gene displays no role in vegetative growth, asexual develop-

ment, or dimorphic switching in P. marneffei and is able to complement

the sexual defect of an steA mutant of A. nidulans [19]. The P. marneffei
stuA gene encodes a basic helix-loop-helix (bHLH) protein of the
APSES family and is thought to regulate both dimorphic growth and
mating or asexual sporulation. Loss of stuA from P. marneffei had
no obvious effect on dimorphic growth, and P. marneffei stuA is able
to complement the conidiation defect of an A. nidulans stuA mutant [20].
Moreover, the P. marneffei tupA gene, a homolog of rcoA, is able to
complement both the asexual and sexual development phenotypes of an A.
nidulans rcoA deletion mutant [315]. This indicates that the sexual
function of tupA has been retained in P. marneffei. Although the presence of
these highly conserved P. marneffei homologs of A. nidulans genes
suggests the presence of an undiscovered mating system in P.
marneffei, the mating process needs a comprehensive network of genes
to function coordinately. Therefore, the finding of a complete mating

gene repository in P. marneffei would be a stronger piece of evidence to

support the presence of a sexual stage for the fungus.

Now the genome sequence information has enabled us to conduct a

search for mating-related genes in the P. marneffei genome in order to

reveal the potential mating system in this important dimorphic fungal

pathogen. Similar studies have been carried out in C. albicans, which

was thought to be constitutively diploid and to reproduce only asexually

[138]. The complete genome sequence suggested that a mating system exists
in C. albicans, following the identification of numerous highly conserved
homologs of S. cerevisiae mating genes [190, 259, 272]. Eventually, two
research groups demonstrated that C. albicans can be induced to
mate under certain conditions [139,213].

The sexual cycle introduces valuable genetic tools for fungal study.

If a fungus has a sexual cycle, we can screen for mutants arising from
recombination events during meiosis and gamete formation, followed by
zygote formation. In the case of P. marneffei, the absence of a sexual stage
has handicapped biological studies of this fungus. The genome sequence
analysis reported in this chapter, however, provides encouraging
information: many homologs of sexual cycle-related genes have been
identified in the P. marneffei genome, suggesting a potential mating
ability in this important pathogenic fungus, although a sexual state has
not been reported. In practical terms, this discovery might open the door
to simple and efficient procedures for obtaining sexual recombinants of
P. marneffei that will be useful for genetic analyses of pathogenicity and
other traits.


Studies on mating type in fungi have been helpful for the understanding
of many eukaryotic regulatory pathways, including cell cycle regulation,
cellular and nuclear identity, and signal transduction. Most ascomycetes
have only two different mating types; their MAT locus encodes
transcription factors that regulate mating-type–specific genes involved in
pheromone production, pheromone sensing, and signal transduction [94].

Some ascomycetes are asexual, while many others have adopted different

reproductive strategies: heterothallic, homothallic, and, less frequently,

pseudohomothallic (Table 5.1). For homothallic species, homokaryotic

haploid strains are self-fertile and complete the sexual cycle without seek-

ing a mate. This diversity is so extensive that even species within the

same genus, such as Neurospora, adopt either homothallic or heterothallic

modes. More strikingly, in a recent study, researchers discovered that the

heterothallic C. neoformans α cells can sexually reproduce via fruiting,

without fusing with a partner of the opposite mating type.

5.2.1 Mating in hemiascomycete yeasts

The mating-type locus has been well studied in the ascomycete S. cerevisiae.
The two haploid cell types of S. cerevisiae are determined by their MAT loci,
designated α and a. A pheromone-mediated fusion process creates
a diploid cell (a/α), which then, under starvation conditions, can undergo
meiosis with the formation of four haploid cells, two of which

are a, two are α. Each α and a mating-type locus contains two diver-

gently transcribed genes: a1, a2 and α1, α2, respectively. The a1 and

α2 proteins are transcriptional repressors (when both are present) and

both contain a homeodomain DNA-binding motif [284]. The α1 protein

has been shown to be a transcription activator [278] but its DNA-binding

domain (the α-box) has yet to be characterised in detail. The function of

a2 is unknown. The a1 and α1 proteins are encoded by totally dissimilar

sequences of 642 and 747 bp, respectively, while a2 and α2 sequences

have partial similarity [227, 299]. S. cerevisiae is basically
heterothallic; however, a homothallic breeding system can be achieved
through mating-type switching, in which an S. cerevisiae α haploid cell can
switch to the opposite mating type a, or vice versa [132]. This is caused by gene

conversion between the MAT locus and two MAT-like loci during cellular

division of haploid cells [120]. The molecular basis of the gene conver-

sion is the presence of two MAT-like cassettes, HMR and HML. Normally

they are transcriptionally repressed through silencing by the formation of

a specialised compacted chromatin structure. They are both surrounded

by “silencers,” short specific sequences that are binding sites for DNA-

binding proteins and are also involved in transcriptional activation and

DNA replication (for recent reviews, see [105, 117]). Moreover, haploid-

specific gene products, such as the HO endonuclease, are involved in

repression of meiosis and mating-type switching [120].

5.2.2 Mating in filamentous ascomycetes

These mating systems include many conserved components, such as gene

regulatory polypeptides and pheromone/receptor signal transduction cas-

cades, as well as conserved processes, like self-nonself recognition and

controlled nuclear migration. The mating systems in filamentous as-

comycetes share similar components and processes with those in yeasts

but they exhibit many unique properties. First, the sequence dissimi-

larity between two alternate mating-type alleles is more pronounced in

filamentous ascomycetes. Usually they consist of unrelated and unique

sequences. Second, the mating-type switching mechanism of filamentous

ascomycetes is unknown but different from that of yeast. Filamentous

ascomycetes exhibit great stability of the mating type, which might be

due to the lack of additional copies of mating-type sequences outside

the mating-type locus. The additional copies of the mating-type locus in

yeasts are usually silent copies facilitating mating type switching through

gene conversion.

Table 5.1: Mating strategies adopted by ascomycetous fungi, the presence
of mating type genes and the ability to switch between mating types.

Species                  Mating strategy                           Mating type gene  Switching
S. cerevisiae            Homothallic                               Y                 Y
C. glabrata              Asexual?                                  Y                 NA
Kluyveromyces lactis     Heterothallic, some homothallic strains   Y                 Y
Kluyveromyces waltii     Homothallic                               Y                 Y
Ashbya gossypii          Asexual?                                  Y                 Y
Debaryomyces hansenii    Homothallic                               Y                 Y
Yarrowia lipolytica      Heterothallic                             Y                 Y
Neurospora crassa        Homothallic                               Y                 NA
Podospora anserina       Pseudohomothallic                         Y                 NA
Bipolaris sacchari       Asexual                                   Y                 NA
Neurospora intermedia    Heterothallic                             Y                 NA
S. almonella             Heterothallic                             Y                 Y
C. neoformans            Heterothallic                             Y                 N

Among filamentous ascomycetes, the structure of the components and
genetic arrangements of their mating type loci vary greatly. Neurospora
crassa and Podospora anserina are two representative ascomycetes in
which the mating systems have been well characterised by molecular analyses.
In N. crassa, mat a-1 and mat A-1 are the two genes responsible for a
and A mating specificity, respectively. Two additional genes, mat A-2
and mat A-3, with opposite orientations, are present in the region adjacent
to mat A-1. In P. anserina, FPR1 is the only gene present in the
mat+ idiomorph and is sufficient to induce fertilisation; in contrast, FMR1
and two additional genes, SMR1 and SMR2, are required for the mat-
strain to develop perithecia to maturity.

Heterothallic species require a partner for mating, whereas homothal-

lic species are able to self-mate. The difference between heterothallic

species and homothallic species is not due to the presence or absence of

mating-type genes. Sequences similar to mating types have been identi-

fied and functionally characterised in all the species tested, whether they

are heterothallic or homothallic. Mating type genes are even present in

asexual species, for example, asexual Bipolaris sacchari has a homolog of

the MAT-2 gene of the related species C. heterostrophus. The process of

sexual development is identical in homothallic and heterothallic species.

Homothallic filamentous ascomycetes, even though their individual nuclei
contain the information for both mating types, could be functionally
heterothallic through a proposed mechanism allowing alternate expression
of either mating type.

Mating may serve as a model for the study of developmental genetics

and could help in elucidating regulatory mechanisms of multicellularity

and sexual dimorphism. Mating systems are divergent in ascomycetes.

The presence of mating-type genes does not determine the mode of sexual

reproduction. Because the changes in modes of sexual reproduction are

frequent and disruption of sexual function is tolerated in ascomycetous

fungi, the presence or absence of particular genetic components involved

in the mating system is not necessarily a good indicator of the
reproductive mode a fungus has adopted.


Protein sequences of fungal sex-related genes downloaded from GenBank
were used as queries against the P. marneffei genome sequences. The
comparison was conducted using the NCBI TBLASTN program 2.0 with
the BLOSUM62 scoring matrix [6]. The E-value cutoff used to assign
homologues was 1.0e-20. The contigs of the P. marneffei genome that
contained homologues were extracted and annotated manually. Each
annotated gene was assigned a sequential locus number of the form Pm##
that identifies it uniquely. Each gene also has a version attribute
(so loci are in fact displayed as Pm##.version). Predicted peptides
were compared to the amino acid sequences of their corresponding
query proteins using NCBI BLAST2SEQ
(http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html). Expect values
were calculated based on the size of the NCBI non-redundant protein
database. Conserved domains/motifs were identified using InterPro
release 5.1 [367]. Multiple alignments of amino acid sequences were
performed using the program ClustalX 1.81 [311] and adjusted manually.
Graphic presentations of the alignments and consensus sequences were
produced using the program BOXSHADE 3.21
(http://www.ch.embnet.org/software/BOX_form.html).
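The E-value filtering step described above can be illustrated with a minimal Python sketch. It assumes that the TBLASTN hits have been exported as tab-separated text with three columns (query, contig, E-value); this layout, the function names and the example hits are hypothetical and are not taken from the annotation pipeline used in this work. Accepting a hit that lies exactly at the cutoff is likewise an assumption.

```python
import csv
import io

E_VALUE_CUTOFF = 1.0e-20  # cutoff stated in the text

# Hypothetical hits (query, contig, E-value); the values are made up for illustration.
EXAMPLE_HITS = (
    "query_1\tcontig_a\t3e-45\n"
    "query_1\tcontig_b\t2e-05\n"
    "query_2\tcontig_c\t1e-20\n"
    "query_3\tcontig_a\t0.0\n"
)


def filter_hits(handle, cutoff=E_VALUE_CUTOFF):
    """Return (query, contig, evalue) tuples whose E-value is at or below the cutoff."""
    kept = []
    for query, contig, evalue in csv.reader(handle, delimiter="\t"):
        value = float(evalue)
        if value <= cutoff:  # assumption: a hit exactly at the cutoff is kept
            kept.append((query, contig, value))
    return kept


def contigs_with_homologues(hits):
    """Collect the distinct contigs that carry at least one accepted hit."""
    return sorted({contig for _, contig, _ in hits})


hits = filter_hits(io.StringIO(EXAMPLE_HITS))
print(contigs_with_homologues(hits))
```

The contigs returned by such a filter would then be the ones passed on to manual annotation.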

In addition to the degree of sequence similarity, several lines of
supplementary information were used to further support gene homology.
These include: (i) conserved positions of intron(s) between homologs,
which argue for a common ancestor of the genes studied; (ii) phylogenetic
trees constructed from aligned genes, so that the closest homolog can be
identified when paralogous genes are present; (iii) features
characteristic of the family to which a gene belongs.

Phylogenetic trees were inferred by the neighbour-joining method





[Figure 5.1 (schematic not reproduced). Loci shown: A. fumigatus, AfMAT-2 (Af59.m09249); A. nidulans, AnMAT-1 (AY339600/AN2755.2) and AnMAT-2 (AF508279/AN4734.2*); P. marneffei, PmMAT-1 (Pm1.126); N. crassa, mat a-1 (M54787) and mat A-1, mat A-3, mat A-2; S. cerevisiae, MATalpha2, MATalpha1, MATa1; S. pombe, mat1-P or mat1-M, mat2-P, mat3-M. Other labels: HMG box; alpha box; 15 kb; 11 kb; Chromosome 3; Chromosome 6.]

Figure 5.1: Comparison of the mating-type loci in P. marneffei and other fungi. Boxes interrupted by gaps represent the coding sequences of the genes and the introns, respectively. Arrows indicate the directions of genes. Dashed lines indicate that the genes linked together are present in the genome of the same isolate. Symbols: dark-gray bar, conserved HMG-box domain; light-gray bar, conserved alpha-box motif.

[273]. Genetic distances between protein sequences were estimated using
the WAG amino-acid substitution model [342] implemented in MBEToolbox
(Chapter 10).
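For readers unfamiliar with the neighbour-joining method, the following Python sketch shows the core of the algorithm on a small distance matrix. It is a generic textbook formulation written for illustration and is not the MBEToolbox implementation. The taxon names and distances are hypothetical, and the distances are chosen to fit a simple tree rather than being estimated under the WAG model.

```python
def neighbour_joining(names, dist):
    """Build an unrooted tree by neighbour joining.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a list of (node, neighbour, branch_length) edges.
    """
    nodes = list(names)
    d = dict(dist)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all others.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Choose the pair that minimises the Q criterion.
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                a, b = nodes[x], nodes[y]
                q = (n - 2) * d[frozenset((a, b))] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        d_ab = d[frozenset((a, b))]
        len_a = 0.5 * d_ab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        new = f"node{next_id}"
        next_id += 1
        edges.append((a, new, len_a))
        edges.append((b, new, len_b))
        # Distances from the new internal node to the remaining nodes.
        for k in nodes:
            if k not in (a, b):
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((a, k))] + d[frozenset((b, k))] - d_ab
                )
        nodes = [k for k in nodes if k not in (a, b)] + [new]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Hypothetical distances that fit the tree ((A:1, B:2):1, (C:3, D:4)).
pairs = {("A", "B"): 3, ("A", "C"): 5, ("A", "D"): 6,
         ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 7}
distances = {frozenset(k): float(v) for k, v in pairs.items()}
tree = neighbour_joining(["A", "B", "C", "D"], distances)
for edge in tree:
    print(edge)
```

Because the example distances are exactly additive, the recovered branch lengths match those of the generating tree. With real protein distances the fit is only approximate.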


The close relationship between the genera Penicillium and Aspergillus
has been well established from various sources of evidence. It is further
supported by our recent comparative study of the mitochondrial genome
of P. marneffei and those of other fungi (Chapter 3). This relationship
has prompted the search for previously undiscovered characteristics in
P. marneffei based on our knowledge of the various Aspergillus species.


5.4.1 Homologs of known sexual genes

With respect to the potential mating system of P. marneffei, A. nidu-

lans is of particular interest as this model species has two distinctive

reproductive developmental processes: sexual and asexual development.

We used a set of empirically selected A. nidulans genes involved in sex-

ual development as queries to identify their homologs in P. marneffei.

These genes are veA, medA, tubB, phoA and nsdD. The veA gene was
first shown to mediate the light response as early as 1965 [156]. It was
later found to be required for cleistothecium and ascospore formation as
well [159]. The veA1 mutant is unable to develop sexual structures, and
its asexual sporulation is increased [164], implying that the veA gene
plays a key role in activating sexual development and/or inhibiting
asexual development. A. nidulans medA (GenBank Acc.: AAC31205)
encodes a transcriptional regulator of sexual and

asexual reproduction. tubB, one of two genes encoding alpha-tubulin, is

involved in the processes of karyogamy and meiosis I [167, 168], but it

is not required for vegetative growth or asexual reproduction, nor is it

required for the initiation or early stages of sexual differentiation. The

gene nsdD encodes a GATA-type transcription factor that functions in

activating sexual development [124]. The gene phoA [33], like stuA [222],

is involved in the biosynthesis of tryptophan and has been identified as

being involved in sexual development [77,314,355].

As in A. nidulans veA, the predicted P. marneffei veA contains one

intron with conserved boundaries. The predicted P. marneffei MedA

(741 aa) shows 49% amino acid identity to A. nidulans MedA (600 aa)

within an alignable region of 555 aa. The predicted P. marneffei tubB

and phoA are highly conserved, sharing 83 and 80% identical amino acid

residues with A. nidulans tubB and phoA, respectively. The predicted P.

marneffei NsdD consists of 385 amino acid residues and, like A. nidulans

NsdD, is rich in proline (13.8 and 11.3%) and serine (13.8 and 13.4%).


Both have the type IVb C-X2-C-X18-C-X2-C zinc finger DNA-binding

domains at their C-termini.
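The two sequence features mentioned for NsdD, residue composition and the C-X2-C-X18-C-X2-C spacing, are simple to check computationally. The Python sketch below is illustrative only: the function names are hypothetical, and the toy sequence is invented so that it contains one motif. It is not the NsdD sequence.

```python
import re

# C-X2-C-X18-C-X2-C: four cysteines separated by 2, 18 and 2 arbitrary residues.
ZINC_FINGER = re.compile(r"C.{2}C.{18}C.{2}C")


def residue_percent(seq, residue):
    """Percentage of positions in seq occupied by the given residue."""
    return 100.0 * seq.count(residue) / len(seq)


def find_zinc_fingers(seq):
    """Return (1-based start, matched segment) for each motif occurrence."""
    return [(m.start() + 1, m.group()) for m in ZINC_FINGER.finditer(seq)]


# Invented toy sequence (36 residues) containing one motif.
toy = "MPSSPPSA" + "C" + "AA" + "C" + "G" * 18 + "C" + "AA" + "C" + "KR"

print(round(residue_percent(toy, "P"), 1), round(residue_percent(toy, "S"), 1))
print(find_zinc_fingers(toy))
```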

We also identified homologs of two inhibitors of sexual processes, lsdA

and rosA, in P. marneffei. LsdA is expressed abundantly at the late
sexual developmental stage of A. nidulans. Disruption of lsdA causes the
preferential formation of sexual structures even under conditions,
such as high salt concentration, in which sexual development in the wild

type is inhibited [191]. Hence, the lsdA gene inhibits sexual development

in the presence of sex-inhibiting environmental signals. Under low-carbon

conditions and in submersed culture, A. nidulans RosA is also a repressor

of sexual development initiation [331]. The predicted P. marneffei lsdA

encodes a 350 amino acid polypeptide, which when compared to the

356 amino-acid A. nidulans lsdA, shares 43% identical and 60% similar

amino-acid residues. The predicted P. marneffei RosA exhibits 57%

amino acid identity to A. nidulans RosA. The position of the larger intron
of P. marneffei rosA is the same as that in the orthologs of A. nidulans,
Sordaria brevicollis and N. crassa. At the N terminus of P. marneffei
RosA, the highly conserved Zn(II)2Cys6 motif, a putative bipartite
nuclear localisation signal and a DNA-binding domain are predicted.

In summary, although studies of the molecular mechanism controlling

sexual development in filamentous fungi are very limited, several sexual

genes that have been identified, isolated and characterised from A. nidu-

lans enable us to find their homologs in P. marneffei. This finding is in

line with the other two genes mentioned above, stuA [222] and steA [19],

that have been experimentally characterised in both A. nidulans and P.

marneffei, revealing the functional exchangeability between correspond-

ing homologs. The presence of these faithful homologs suggests that

sexual development is potentially possible in P. marneffei. However,
this inference is weakened by the fact that many sexual genes may
function not only in sexual development but also


Figure 5.2: Comparison of the alpha1 domain of MAT proteins of filamentous ascomycetes. The amino acid sequence alignments are as follows: putative P. marneffei, MAT-1 (Pm1.126); putative A. nidulans, MAT-1 (AN2755.2); N. crassa, mat A-1; Paecilomyces tenuipes, MAT1-1-1; Gibberella fujikuroi, MAT-1-1; Alternaria alternata, MAT-1; Pyrenopeziza brassicae, alpha-1 domain protein (CAA06844.1); Gibberella zeae, MAT1-1-1; Fusarium oxysporum, MAT-1; Cochliobolus ellisii, MAT-1; Podospora anserina, FMR1. The arrow indicates the conserved position of introns.

in other processes, like secondary metabolism. Hence, homologous sexual

genes in P. marneffei might be responsible for other processes that are

not related to sexual development. Further evidence is therefore needed
before a conclusion can be drawn.

5.4.2 Mating type genes

Fungi are capable of sexual reproduction by using either heterothallic

(self-sterile) or homothallic (self-fertile) mating strategies. In most as-

comycetes, mating ability is controlled by a single mating type locus,

MAT, with two alternate forms (MAT-1 and MAT-2) called idiomorphs.

MAT-1 and/or MAT-2 mediate not only mating, but also several other

key processes, including secretion of and response to pheromones and

vegetative incompatibility. In heterothallic ascomycetes, these alternate

idiomorphs reside in different nuclei. In contrast, most homothallic as-

comycetes carry both MAT-1 and MAT-2 in a single nucleus, usually

closely linked.

A. nidulans is a homothallic ascomycete. A. nidulans MAT-2 (AnMAT -


[Figure 5.3 (gene map not reproduced). Genes shown: P. marneffei, Pm1.124, Pm1.125, PmMAT-1 (Pm1.126), Pm1.127, Pm1.128, Pm1.129; A. nidulans (contigs 47 and 27), AN2753.2, AN2754.2, AnMAT-1 (AN2755.2), AN2756.2, AN4732.2, AN4733.2, AnMAT-2 (AN4734.2), AN4735.2, AN4736.2, AN4737.2; A. fumigatus, Af59.m09246, Af59.m09247, Af59.m09248, AfMAT-2 (Af59.m09249), Af59.m09250, Af59.m09500. Relationship key: is neighbor; is homolog. Labelled flanking genes: DNA lyase; cytoskeleton assembly control protein.]

Figure 5.3: Gene organisation around the MAT locus of A. nidulans and the putative MAT loci of P. marneffei and A. fumigatus. AnMAT-1 and AnMAT-2 are A. nidulans MAT-1 and MAT-2, locating on contig 47 and 27 of A. nidulans unfinished genome, respectively.

2) have been previously characterised using ‘classic’ molecular biological

techniques [76], while A. nidulans MAT-1 (AnMAT-1, GenBank Acc.
BK001307) has been found by similarity searching [76]. In the MIT A.
nidulans genome database, two annotated genes on different contigs,
AN2755.2 and AN4734.2, correspond to AnMAT-1 and AnMAT-2,
respectively. Note that AN4734.2 differs slightly from AnMAT-2 (GenBank
Acc. AF508279), simply because the sequences come from different
isolates of A. nidulans. In contrast to A. nidulans, only MAT-2 has been
identified by genome analyses of A. fumigatus [253,326]. AfMAT-2
encodes a regulatory protein with a high mobility group (HMG)
DNA-binding domain [320], which is the characteristic feature of MAT-2
genes. No homologue of the MAT-1 gene sequence of any of the tested
fungi was found in the TIGR A. fumigatus genomic database. This
suggests that A. fumigatus is perhaps a heterothallic rather than a
homothallic ascomycete (as all homothallic euascomycetes so far analysed
contain either only MAT-1 or both MAT-1 and MAT-2 [252]), and that
the genome sequence was from a MAT-2 strain.

Using this pair of Aspergillus species that are closely related to P.

marneffei, the homothallic A. nidulans and the possibly heterothallic A.

fumigatus as models we undertook a series of MAT searches to determine

whether P. marneffei has a hypothetical MAT locus, and if so, whether P.

marneffei carries both MAT1-1 and MAT1-2 genes. Through BLAST

searches, we identified a putative mating-type (PmMAT ) locus in P.

marneffei, containing a conserved homolog of the A. nidulans MAT-1

(AnMAT -1), which is denoted as PmMAT -1 hereafter. The PmMAT -1

gene encodes a putative 348 amino acid polypeptide that shares 38%
similarity with AnMAT-1 (361 aa) over its full length, and exhibits 59,
60, 61 and 60% similarity to the alpha-box domains of AnMAT-1, P.
brassicae MAT-1, G. fujikuroi MAT-1 and P. anserina MAT-1,
respectively. More importantly,

the intron boundaries are conserved between the putative PmMAT -1 and

other fungal MAT -1 genes (Fig. 5.2).

Despite extensive genome sequence searches, we could not identify a
MAT-2-like gene in P. marneffei. Having only one mating-type gene is
similar to the situation in A. fumigatus, where, conversely, MAT-1
cannot be found. The other mating-type gene, P. marneffei MAT-2 or A.
fumigatus MAT-1, might be present in other isolates, as observed in the
asexual species Fusarium culmorum [163]. Alternatively, the other
putative mating-type gene could have become extinct, as observed in C.
neoformans populations and Ophiostoma novo-ulmi [356].

The former explanation seems more plausible after we identified pu-

tative mating-type loci in P. marneffei and A. fumigatus, which show

similarity to A. nidulans MAT-2 and MAT-1 regions, respectively. We

compared flanking genes of two mating-type loci to each other, as well as

to corresponding A. nidulans MAT-2 or MAT-1 regions (Fig. 5.3). Strik-

ing patterns were observed in the organisation of flanking genes where

several syntenies were identified. Comparing P. marneffei to A. fumi-


gatus, PmMAT-1 (Pm1.126) and AfMAT-2 (Af59.m09249) are oriented

differently, upstream of a hypothetical gene (Pm1.127 and Af59.m09250

respectively). The mating-type gene and its following gene occupy a

unique region of ∼5 kb in both P. marneffei and A. fumigatus. No sig-

nificant similarity at the amino-acid or nucleotide level can be detected

between the two regions. Three pairs of homologous genes flank the two
regions: the first pair encodes a homologue of the S. cerevisiae
SLA2-like cytoskeleton assembly control protein, and the other two
encode a putative DNA lyase and a protein of the cytochrome c oxidase
subunit VIa family. It therefore seems likely that the non-homologous
regions in P. marneffei and A. fumigatus are the mating-type loci of
their respective idiomorphs. The mating-type locus of the other
idiomorph might be found in other isolates. This suggests that P.
marneffei and A. fumigatus are heterothallic fungi.
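The reasoning above, conserved flanking genes surrounding a non-homologous core, can be sketched in a few lines of Python. The gene identifiers and homology assignments below are hypothetical placeholders rather than the actual loci of Fig. 5.3, and the function names are introduced only for this illustration.

```python
def shared_flanks(region_a, region_b, homolog_pairs):
    """Return the homologous gene pairs (a, b) present in both regions, in region_a order."""
    pairs = {frozenset(p) for p in homolog_pairs}
    return [(a, b) for a in region_a for b in region_b
            if frozenset((a, b)) in pairs]


def unmatched_core(region, shared_genes):
    """Genes lying between the outermost shared flanking genes that have no homolog."""
    idx = [i for i, gene in enumerate(region) if gene in shared_genes]
    if not idx:
        return []
    return [g for g in region[min(idx):max(idx) + 1] if g not in shared_genes]


# Hypothetical gene orders for two species; names are placeholders.
region_a = ["a_flank1", "a_flank2", "a_mat", "a_next", "a_flank3"]
region_b = ["b_flank1", "b_flank2", "b_mat", "b_next", "b_flank3"]
homologs = [("a_flank1", "b_flank1"),
            ("a_flank2", "b_flank2"),
            ("a_flank3", "b_flank3")]

shared = shared_flanks(region_a, region_b, homologs)
core_a = unmatched_core(region_a, {a for a, _ in shared})
core_b = unmatched_core(region_b, {b for _, b in shared})
print(shared)
print(core_a, core_b)
```

In this toy case the unmatched cores are the candidate idiomorphic regions, each consisting of a mating-type gene and the gene that follows it.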

Taken together with N. crassa, we now have the schematic organ-

isation of mating-type loci from four filamentous fungi, whose genome

sequences are completed or almost completed (Fig. 5.1). To compare

them with those from yeasts, we note that the mating-type DNA regions

of filamentous fungi are generally larger than in S. cerevisiae [10] or in

S. pombe [162]. In fission yeast S. pombe, the mating-type region com-

prises three linked loci, mat1, mat2 and mat3, which occupy about 30

kb of DNA on chromosome II [14]. The mat1 locus determines the cell

type, depending on whether it has P (for plus) or M (for minus) infor-

mation. mat2-P and mat3-M loci are transcriptionally silent and act as

donors of information for switching mat1 DNA by the process of gene

conversion. There is no similar arrangement of such mating-type regions

in P. marneffei ; however, it is noteworthy that there are other genes,

such as Pm6.88 or AN1962.2, in P. marneffei or A. nidulans, having

similarity to the HMG mating-type genes. They are not ‘true’ MAT-

2 family mating-type genes because they do not contain the intron with


conserved positions and some other conserved motifs, which are only seen

in the MAT -2 gene. Also they are not located at the MAT locus, unlike

other filamentous fungi, such as N. crassa, which may have an additional

HMG gene at the MAT-1 idiomorph involved in fertility. These extra
HMG genes cannot be silent copies of MAT genes of the kind seen in
the yeasts. However, they may in theory play some role in fertility,
which will need experimental investigation [Dr Paul S. Dyer, personal
communication].

Finally, the detection of mating type genes, which play roles in sexual

signalling between compatible heterothallic isolates, yet are present in a

‘selfing’ fungus like A. nidulans, is noteworthy in itself. As suggested
by Dyer [76], this observation can be explained either by the evolution
of heterothallic species towards a homothallic form or vice versa. Taking
our observations from the P. marneffei genome into account, we consider
the former interpretation more plausible, i.e., homothallic A. nidulans
originated from a heterothallic common ancestor of Penicillium and
Aspergillus.

5.4.3 Mating pheromone precursor genes

The nucleotide sequence and deduced amino acid sequence of the pheromone

precursor gene from several fungi have been used to search the P. marn-

effei genome. After intensive searches, however, no significant similarity

was found (BLAST E-value cutoff = 10). As mentioned in a previous

section (Section 1.4.2), syntenic comparisons suggest that the original
mating pheromone precursor loci may have been lost in P. marneffei. However,

we cannot exclude the possibility that P. marneffei mating pheromone

precursor genes are so highly specific that they are too divergent to be

detected by similarity searches.


[Figure 5.4 (pathway diagram not reproduced). Steps and components shown, each S. cerevisiae protein followed by the predicted P. marneffei homologue:
C-terminal CAAX modification: farnesylation, Ram1p (Pm6.49) and Ram2p (Pm60.30); AXX proteolysis, Ste24p (Pm60.4) and Rce1p (Pm96.20); carboxylmethylation, Ste14p (Pm92.26).
N-terminal processing: P1->P2 proteolysis, Ste24p (Pm60.4); P2->M proteolysis, Axl1p (no match) and Ste23p (Pm134.14).
Export: Ste6p (Pm125.22).]

Figure 5.4: Predicted P. marneffei homologues of the genes involved in the biogenesis of the a-factor pheromones in S. cerevisiae. The a-factor biosynthetic intermediates and the components of the a-factor biogenesis machinery are shown (see the text for more information). Several of the a-factor intermediates can be directly visualised by SDS-PAGE and are designated P0, P1, P2, and M [49]. The a-factor precursor contains an N-terminal extension, a mature portion, and a C-terminal CAAX motif, as indicated at top. During a-factor biogenesis, the unmodified a-factor precursor (P0) undergoes C-terminal modification (prenylation, proteolytic cleavage of AAX, and carboxylmethylation) to yield the fully C-terminally modified species P1. Next, N-terminal proteolytic processing occurs in two distinct steps, the first (P1→P2) cleavage removing seven residues from the N-terminal extension to yield the P2 species, and the second (P2→M) cleavage generating mature a-factor, which is exported from the cell. The corresponding components predicted from P. marneffei have been given. Among them, AXL1 has not been identified.


Table 5.2: Pheromone-processing enzymes encoded by the putative P. marneffei genes, as shown by a BLAST search of the P. marneffei genome.

Sc protein (aa)  Function                                          Pm protein (aa)   E-value  Identity in overlap  Similarity in overlap
Kex1p (729)      Carboxypeptidase; α-factor processing             Pm76.8 (672)      4e-057   124/350 (35%)        183/350 (52%)
Kex2p (814)      Endoprotease; α-factor processing                 Pm6.3 (813)       1e-154   302/774 (39%)        428/774 (55%)
Ste13p (931)     Dipeptidyl aminopeptidase; α-factor processing    Pm10.77 (899)     1e-128   263/787 (33%)        399/787 (50%)
Ram2p (316)      CaaX farnesyltransferase α subunit;               Pm60.30 (350)     5e-051   124/354 (35%)        177/354 (50%)
                 a-factor modification
Ram1p (431)      CaaX farnesyltransferase β subunit;               Pm6.49 (635)      6e-050   114/329 (34%)        157/329 (47%)
                 a-factor modification
Rce1p (315)      CaaX protease; a-factor C-terminal processing     Pm96.20 (333)     3e-025   79/263 (30%)         132/263 (50%)
Ste14p (239)     Prenylcysteine carboxylmethyltransferase          Pm92.26 (259)     1e-034   61/134 (45%)         87/134 (64%)
Ste24p (453)     CaaX prenyl protease; N- and C-terminal           Pm60.4 (456)      1e-115   202/446 (45%)        274/446 (61%)
                 a-factor processing
Ste23p (988)     Metalloprotease involved, with homolog Axl1p,     Pm134.44 (1012)   0.0      369/947 (38%)        562/947 (59%)
                 in N-terminal processing of pro-a-factor to
                 the mature form
Ste6p (1290)     ATP-dependent multidrug efflux pump of a-factor   Pm125.22 (1262)   1e-127   335/1280 (26%)       580/1280 (45%)


5.4.4 Mating pheromone processing genes

The production of pheromones has provided important insights into pro-

protein processing in eukaryotic cells. The system has been well char-

acterised in S. cerevisiae (for review, see [62]). A budding yeast cell

produces either a-factor or α-factor corresponding to its mating type.

Either a- or α-factor is synthesised as a precursor that undergoes multiple
maturation steps to generate its mature form. A number of S. cerevisiae

pheromone processing genes have been cloned and characterised [32]. We

used the protein sequences of all these genes in a BLAST search to iden-

tify pheromone-processing genes encoding putative homologous proteins

in P. marneffei. For all the query S. cerevisiae proteins, except Axl1p,

the corresponding P. marneffei homologs with high levels of amino-acid

similarity have been identified (Table 5.2). Hence, P. marneffei ap-

pears capable of synthesising/processing mating pheromones although

the pheromone precursor gene has not been identified by searching for

known pheromone precursor genes.
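The identity and similarity percentages listed in Table 5.2 follow directly from the match counts over the overlap length. The Python sketch below recomputes them for three rows of the table. The helper name is hypothetical, and the use of integer truncation rather than rounding is an assumption, made because it reproduces the tabulated values (for example, 399/787 is listed as 50%).

```python
def overlap_percent(fraction):
    """Convert a 'matches/length' string to an integer percentage (truncated)."""
    matches, length = (int(part) for part in fraction.split("/"))
    return (100 * matches) // length


# Identity and similarity fractions copied from three rows of Table 5.2.
rows = {
    "Kex1p": ("124/350", "183/350"),
    "Ste13p": ("263/787", "399/787"),
    "Ste14p": ("61/134", "87/134"),
}

for name, (identity, similarity) in rows.items():
    print(name, overlap_percent(identity), overlap_percent(similarity))
```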

Genes involved in the processing of α-factor and a-factor are different.

In the case of α-factor, the maturation requires signal cleavage, glycosy-

lation and proteolytic processing by three peptidases encoded by KEX2,

KEX1 and STE13. The S. cerevisiae KEX2 gene encoding kexin belongs

to the prohormone convertase family, which has been identified in many

species. The S. cerevisiae Kex2p is membrane-bound and cleaves pep-

tide substrates at both Lys-Arg and Arg-Arg sites [26, 100]. A previous

study has shown that mutant Kex2p enzyme molecules lacking as many

as 200 C-terminal residues still retained protease activity. Although not
essential for enzymatic activity, the C-terminal cytoplasmic tail contains
a localisation signal that targets Kex2p to a late compartment of
the Golgi complex. The predicted P. marneffei Kex2p shows high overall
similarity (55%) to S. cerevisiae Kex2p, but similarity at the C-terminal
residues is slightly lower. Hence, the predicted P. marneffei Kex2p
possibly has protease activity but may be localised differently. The S.

cerevisiae KEX1 encoding carboxypeptidase cleaves the Lys-Arg residues

exposed at the C-terminus of α-factor precursor following digestion with

the kexin [60, 70, 188]. Like Kex2p, the C-terminal residues of S. cere-

visiae Kex1p are not highly conserved in P. marneffei, also suggesting

a difference in peptide localisation between species. P. marneffei is pre-

dicted to have a homolog of S. cerevisiae Ste13p, a type IV dipeptidyl

aminopeptidase that trims N-terminal x-Ala dipeptides of the α-factor

precursors [154].
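As an illustration of the Kex2p site specificity described above, the following Python sketch lists the Lys-Arg and Arg-Arg pairs in a peptide. The function name and the toy sequences are hypothetical. The sketch only locates candidate dibasic sites and does not model whether a given site is actually cleaved.

```python
import re

# Lookahead so that overlapping pairs (for example KR followed by RR) are all reported.
DIBASIC = re.compile(r"(?=(KR|RR))")


def dibasic_sites(seq):
    """Return (1-based position of the first residue, pair) for each KR or RR pair."""
    return [(m.start() + 1, m.group(1)) for m in DIBASIC.finditer(seq.upper())]


# Invented toy peptides, not real pheromone precursors.
print(dibasic_sites("AAKRGGRRAAKA"))
print(dibasic_sites("AKRRA"))
```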

a-factor undergoes three major maturation stages: C-terminal mod-

ification, N-terminal modification, and export [49], which involve genes

RAM2, RAM1, RCE1, STE14, STE24/AFC1, STE23, AXL1 and STE6

(Fig. 5.4). The S. cerevisiae RAM2 and RAM1 genes encode the α

and β subunits of farnesyltransferase (FTase), respectively [129]. FTase

catalyses the addition of 15-carbon (farnesyl) groups to a-factor des-

tined for cell membranes [260]. RAM2 and RAM1 are conserved genes

that have mammalian counterparts. RAM2 is essential to the viabil-

ity of C. albicans, while RAM1 is essential to C. neoformans, indicating

that protein prenylation is an indispensable cellular process in these op-

portunistic yeast pathogens. The predicted P. marneffei Ram1p shows

high levels of similarity to S. cerevisiae Ram1p (51 %) and to mam-

malian protein farnesyltransferase β subunits (e.g. 55 % similarity to

rat fntb). The predicted P. marneffei Ram2p shows 50 % similarity to

S. cerevisiae Ram2p, with both containing at least three PPTA (Pfam

acc. PF01239) domains at their N-termini. The S. cerevisiae RCE1

encodes an AAX prenyl protease [21]. The sequence of RCE1 contains

three potential transmembrane domains but there are no other defining

features and no significant similarity with other proteins, hence it may

belong to a novel superfamily [247]. The predicted P. marneffei Rce1p,

which is 50% similar, also contains multiple potential transmembrane


domains. More importantly, the three putative zinc-binding residues

(E156A, H184A, H248A) and Cys (C251) are all conserved. Mutating

each of these residues inactivates the protease [72]. The S. cerevisiae

STE14 encodes a carboxyl methyltransferase that methylates a-factor.

The predicted P. marneffei Ste14p, containing multiple predicted trans-

membrane spans, shares 64% similarity with S. cerevisiae Ste14p. The

S. cerevisiae Ste24p, a membrane-associated metalloprotease, is required

for the first step of N-terminal processing of a-factor [99]. The predicted

P. marneffei Ste24p shows 60% similarity to its counterpart. Like S.

cerevisiae Ste24p, P. marneffei Ste24p (at position 299 to 303) has a Zn-

dependent metalloprotease motif (HEXXH) [304]. It also matches the

larger consensus sequence characteristic of neutral Zn metalloproteases,

and contains multiple predicted transmembrane regions. Unlike S. cere-

visiae Ste24p, however, the C-terminal di-lysine motif, KKXX (K is Lys)

is replaced with KXXX in P. marneffei Ste24p. Our analysis reveals that

the predicted Ste24p homologs in A. fumigatus (AF58.m07859) and N.

crassa (NCU03637.2) also have the replacement of the di-lysine motif.

Since the di-lysine motif at the C-terminus of many proteins facilitates

their retrieval from the Golgi complex to the ER [310], it could sug-

gest that Ste24p in S. cerevisiae is localised to the ER, but this is not

the case in P. marneffei or the other two filamentous fungi. The S. cere-

visiae metalloprotease Ste23p, a member of the insulin-degrading enzyme

family, is involved in N-terminal processing of pro-a-factor to the mature

form. Axl1p is a paralog to Ste23p. In S. cerevisiae, Ste23p and Axl1p

proteins show 22% identity and 39% similarity throughout their entire

length and Ste23p performs a role at least partially redundant with that

of Axl1p in a-factor processing [1]. In P. marneffei, I identified a pu-

tative homolog of Ste23p but not Axl1p. P. marneffei Ste23p is highly

conserved, showing 59% similarity to S. cerevisiae Ste23p. We argue that

since STE23 genes are present in S. cerevisiae and P. marneffei while


AXL1 is present in S. cerevisiae only, it is possible that AXL1 was cre-

ated by duplication of the gene STE23 after the separation of the two

species. Moreover, S. cerevisiae STE23 and AXL1 may be an example of

duplicate genes that undergo subfunctionalisation, through which Axl1p

gains a new role in controlling the axial budding pattern of haploid cells

while retaining partial STE23 functions in processing a-factor. Finally,

unlike α-factor, which is exported in MATα cells via the classical
secretion pathway, a-factor is pumped out of the cell by the MATa
cell-specific protein Ste6p. The homolog of Ste6p was identified in P.
marneffei, with multiple transmembrane domains and two ATP binding
domains.
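Two of the Ste24p features discussed in this section, the HEXXH metalloprotease motif and the C-terminal di-lysine signal, are short patterns that can be checked directly. The Python sketch below is illustrative only: the function names are hypothetical and the toy sequences are invented rather than taken from any Ste24p homolog.

```python
import re

HEXXH = re.compile(r"HE..H")  # His-Glu-X-X-His


def hexxh_positions(seq):
    """Return the 1-based start positions of HEXXH motifs."""
    return [m.start() + 1 for m in HEXXH.finditer(seq.upper())]


def c_terminal_lysine_motif(seq):
    """Classify the last four residues as 'KKXX', 'KXXX' or 'none'."""
    tail = seq[-4:].upper()
    if len(tail) < 4:
        return "none"
    if tail.startswith("KK"):
        return "KKXX"
    if tail.startswith("K"):
        return "KXXX"
    return "none"


# Invented toy sequences that differ only in the C-terminal tail.
with_dilysine = "MAAHEAAHGGKKAL"
without_dilysine = "MAAHEAAHGGKAAL"

print(hexxh_positions(with_dilysine), c_terminal_lysine_motif(with_dilysine))
print(hexxh_positions(without_dilysine), c_terminal_lysine_motif(without_dilysine))
```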

5.4.5 Mating pheromone receptor and other GPCRs

In S. cerevisiae, a or α-factor binds to cell-type-specific receptors encoded

by STE2 or STE3. STE2 is expressed in a cells and is recognised by α-

factor, and STE3 is expressed in α cells and recognised by a-factor. The

binding is essential for signalling mating process between haploid cells.

In A. nidulans, Han et al. [125] identified nine genes, gprA to gprI,
belonging to the GPCR family. Among them, gprA and gprB are putative
orthologs of STE2 and STE3. gprD is similar to the yeast glucose-sensing
Gpr1p [176] and plays a key role in coordinating hyphal growth and
sexual development. Using these A. nidulans GPCRs as query genes, I
identified 7 P. marneffei GPCRs closely related to them. A phylogeny
reconstructed from a collection of fungal GPCRs indicates several
distinct families. The seven P. marneffei GPCRs are distributed across
all these sub-divisions. They all contain multiple predicted
transmembrane domains, which is one of the characteristic features of
GPCRs. Han et al. [125] also claimed that 7

putative GPCRs have been found in A. fumigatus genome. It would be

interesting to re-analyse this gene family when gene sequences from all

these three genomes of closely related species become available.

Our results indicate that P. marneffei might have a recent evolu-


tionary history of sexual recombination and might have the potential for

sexual reproduction. The possible presence of a sexual cycle is highly

significant for the population biology and disease management of the

species.


Chapter 6

EXPLORING THE GENETIC COMPONENTS

ASSOCIATED WITH THE DIMORPHISM OF


Penicillium marneffei accommodates both complex asexual develop-

ment and dimorphic switching programs, hence becomes a valuable sys-

tem for the study of morphogenesis and pathogenicity. The study of

the morphogenetic programs of P. marneffei has been recently greatly

facilitated by the development of molecular genetic techniques, but we

are only beginning to uncover some determinants which control these

events, and the comprehensive picture still remains blurred. This chapter
contributes to the thesis by offering a systematic exploration of genetic

components that may be responsible for the morphogenetic processes in

the genome of P. marneffei, mainly through sequence analysis in a con-

text of comparative genomics. This will provide insights into the biology

of P. marneffei and its pathogenic capacity.

6.1 Introduction

Dimorphism, the ability to switch between a cellular yeast form and a

filamentous form, is a common morphogenetic feature in many fungi, de-

spite their enormous diversity in size and shape. The change of growth

form is believed to be effected by an altered programme of gene expres-

sion, which is induced by a wide range of metabolic and environmental

factors. In Saccharomyces, it is starvation for nitrogen, in Candida, it is

serum (among other things); in Ustilago, it is a putative molecular signal

from the host plant; and in P. marneffei, it is apparently temperature.


Note that environment conditioned dimorphism is reversible.

The yeast form is characterised by round or ovoid unicellular organisms
that divide mitotically, either by budding or fission, to form two
independent daughters. Filamentous or mould forms are more complex
multicellular structures. The filaments are characterised by long,
thin, parallel-walled tubes, growing by apical extension, with occasional
branching at an angle from the original direction of growth. In contrast
to yeast, filamentous cells do not separate after nuclear division but,
rather, form septa between cellular units that remain physically
associated with the mother cell.

There is a growing body of evidence suggesting that morphogenesis is a crucial determinant of fungal pathogenicity in both plants and animals. In Magnaporthe grisea, for example, MAPK and cAMP signalling promote the formation of a highly specialized infection structure, the appressorium, which is essential for invasion of the host [223]. Most dimorphic fungal pathogens, including P. marneffei, Blastomyces dermatitidis, Coccidioides immitis, Histoplasma capsulatum and Paracoccidioides brasiliensis, typically enter the body via the lungs as spores or, possibly, mycelial fragments, and grow in the yeast form in the body. Pathogenic

Cryptococcus neoformans has been shown to form self-fertilising, diploid

strains that are thermally dimorphic [286]. Aspergillus fumigatus spores

establish invasive disease in lung tissue exclusively by hyphal develop-

ment.

Because of the prevalence of dimorphism among human pathogenic

fungi, it is of interest and importance to identify the molecules neces-

sary for the morphologic switch. However, the mechanism of thermal di-

morphism of P. marneffei remains unknown. Nevertheless, since fungal

dimorphism has been seen by many investigators as a useful model of dif-

ferentiation in eukaryotic systems, significant progress has been achieved

in the study of fungal morphogenesis in other fungi. This chapter therefore begins by reviewing the progress (especially experimental developments) achieved in recent years in the field of fungal genetics. These developments have suggested models and hypotheses for understanding the regulation of the molecular mechanisms involved in fungal differentiation. Comparative sequence analysis is then adopted to explore the genetic components that may be involved in the morphogenesis of P. marneffei. Specifically, we would like to know whether P. marneffei possesses specific (probably temperature-sensitive) cellular sensors to detect external stimuli, unique signal transduction pathways that translate the external stimuli into biochemical messages that alter gene expression levels, or an enhanced ability for structural reorganization resulting in the morphological change.

It is noteworthy that the comparative genomics approach adopted in this chapter is limited by the lack of genome sequence information from other true dimorphic fungi. Moreover, even if the genome sequences of Blastomyces dermatitidis, Coccidioides immitis, Histoplasma capsulatum or Paracoccidioides brasiliensis had been available, the approach might still have been handicapped by the large genetic distance between P. marneffei and these divergent species. The following analysis is therefore mainly limited to the comparison between P. marneffei and Aspergillus species.


6.2.1 Sequence similarity

To identify homologous genes in the P. marneffei genome, protein se-

quences derived from target genes were used as queries to the P. marneffei

genome. Sequence similarity searches were performed using BLASTP or

PSI-BLAST against selected fungal genomes downloaded from GenBank.

The searches were also performed against an in-house database composed of whole-genome sequences of several fungal species from finished and ongoing sequencing projects. The comparison was conducted using the BLOSUM62 scoring matrix [6]. The E-value cutoff used to assign homologues was 1e-5, unless otherwise stated. Conserved domains/motifs were identified using InterPro release 5.1 [367].
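The E-value cutoff described above can be applied mechanically to tabular BLAST output. The following is a minimal sketch rather than the pipeline used in this thesis. It assumes the standard 12-column tabular format written by NCBI BLAST+ (-outfmt 6); the function name filter_hits and the identifiers and scores in the example records are hypothetical.

```python
import csv
import io

# Column order of the default BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, max_evalue=1e-5):
    """Yield hits (as dicts) whose E-value is at or below max_evalue."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            yield hit


# Hypothetical example records: one strong hit and one weak hit.
example = (
    "query1\tsubjA\t62.1\t480\t170\t5\t300\t779\t250\t729\t1e-150\t520\n"
    "query1\tsubjB\t28.0\t90\t60\t3\t10\t99\t5\t94\t0.002\t35.4\n"
)
hits = list(filter_hits(io.StringIO(example)))
print([h["sseqid"] for h in hits])
```

Only the first record passes the 1e-5 threshold; a different cutoff can be supplied through max_evalue for analyses where another value is stated.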

6.2.2 Phylogenetic Analysis

Protein sequences were aligned using PROBCONS [71] and columns of low conservation were removed manually. Phylogenetic trees were inferred

by the neighbour-joining method [273]. The alignments were also used

to infer maximum-likelihood trees. The maximum-likelihood trees were

constructed using the PHYLIP package [86], applying the JTT substi-

tution model with a gamma distribution (alpha = 0.5) of rates over

four categories of variable sites. In general, the maximum-likelihood and

neighbour-joining trees were congruent.
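To illustrate the neighbour-joining principle [273], the sketch below clusters a small distance matrix. It is an illustrative pure-Python implementation that returns only the tree topology (branch lengths are omitted), and it is not the software used for the analyses in this chapter. The function name, taxon names and distances are hypothetical.

```python
def neighbour_joining(names, dist):
    """Return the tree topology as nested tuples.

    dist is a symmetric dict-of-dicts of pairwise distances keyed by name.
    """
    nodes = list(names)
    d = {a: dict(dist[a]) for a in nodes}
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all other current nodes.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        new = (a, b)
        d[new] = {}
        # Distance from the new internal node to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    return tuple(nodes)


# Hypothetical five-taxon distance matrix.
pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
         ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
         ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
names = ["a", "b", "c", "d", "e"]
dist = {x: {} for x in names}
for (x, y), value in pairs.items():
    dist[x][y] = dist[y][x] = float(value)

tree = neighbour_joining(names, dist)
print(tree)
```

In this example the pairs (a, b) and (d, e) are joined as sister taxa. Ties in Q are broken by input order, which a production implementation would handle more carefully.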


It has long been assumed that morphogenesis and virulence are associated in dimorphic fungi, as one morphotype exists in the environment or during commensalism, and another within the host during the invasive process. For instance, P. marneffei lives outside the host as an environmental saprophytic mould. Its primary infectious form may be conidia or mycelial fragments aerosolised from disturbed soil or animal excreta. After entering the host via the respiratory route upon inhalation, the cells rapidly convert to the yeast form. The same is true of other dimorphic fungi, such as B. dermatitidis, C. immitis, H. capsulatum and P. brasiliensis. From the perspective of the fungal cell, the phenomenon of dimorphic switching can be divided into four interwoven events [275]: (i) perception of external stimuli by cellular sensors; (ii) transduction of the biochemical signal; (iii) alteration of genomic expression; and (iv) structural reorganization towards the morphological change.


6.3.1 Perception of external stimuli by cellular sensors

Table 6.1: GPCR family in P. marneffei and A. nidulans. Ortholog relationship supported by synteny; when knocked out, no phenotypic changes. Abbreviations: Pm - P. marneffei, An - A. nidulans, Sc - S. cerevisiae, Af - A. fumigatus, and Sp - S. pombe.

Family  An gene           Pm gene    Sc/Af homolog  Sp homolog
1       gprA (AN2520.2)   Pm198.6    Ste2
2       gprB (AN7743.2)   Pm20.41    Ste3           Map3
3       gprC (AN3765.2)
        gprD (AN3387.2)   Pm14.37    Gpr1           Git3
        gprE (AN9199.2)
4       gprF (AN5721.2)   Pm105.27   AF54.m07020    Stm1
        gprG (AN5720.2)   Pm34.71
5       gprH (AN8262.2)   Pm58.4     AF53.m04209
        gprI (AN8348.2)   Pm31.53

Limited information about cellular sensors that detect external stim-

uli (especially temperature) is available for ascomycetes. Among known

receptors, G protein-coupled receptors (GPCRs) are key components of

heterotrimeric G protein-mediated signalling pathways. The receptors

detect environmental signals and confer rapid cellular responses. The GPCR family has expanded in Aspergillus nidulans: recent analyses of its genome identified nine genes (gprA to gprI) predicted to encode seven-transmembrane-spanning GPCRs [125]. Among them, the gprD gene was found to play a central role in coordinating hyphal growth and sexual development. Deletion of gprD causes extremely restricted hyphal growth, delayed conidial germination and uncontrolled activation of sexual development, resulting in a small colony covered by sexual fruiting bodies. We identified seven P. marneffei GPCRs closely related to A. nidulans GPCRs (Table 6.1). The phylogenetic tree of fungal GPCR family genes (Fig. 6.1) supports the assignment of these putative P. marneffei GPCRs to their corresponding sub-families.


Figure 6.1: Phylogenetic tree of fungal GPCR family genes. Classification of fungal GPCR families was carried out by analyses of P. marneffei Pm198.6, Pm20.41, Pm14.37, Pm105.27, Pm34.71, Pm58.4 and Pm31.53, A. nidulans GprA to GprI, A. fumigatus Af54.m07020 and Af53.m04209, Saccharomyces cerevisiae Ste2p, Ste3p and Gpr1p, Schizosaccharomyces pombe Mam2p, Map3p, Git3p and Stm1p, and Dictyostelium discoideum cAR1p and crlAp (GenBank Acc.: AAO62367) using PROBCONS [71]. Algorithm parameters: Gaps/Missing data - Pairwise Deletion; Distance method - Amino Gamma Model [Pairwise distances]; Tree making method - Neighbour-joining.


6.3.2 Transduction of biochemical signal

Studies combining the powerful genetic and genomics tools available in

fungi (mainly in Saccharomyces) have revealed three pathways that cou-

ple afferent signals to the dimorphic switch. Although many different

signals can induce filamentous development, the strategies for connect-

ing the external signal to the change in cell differentiation are broadly

conserved among the fungi. For example, studies show that distantly related fungi (Saccharomyces, an ascomycete, and Cryptococcus, a basidiomycete) use common STE12 family members to form filamentous structures in response to nitrogen starvation, sharing a high degree of conservation in the regulatory pathways that control filamentous growth.

Studies on signalling filamentous growth in S. cerevisiae have revealed

that four genes of the MAPK pathway that signals the mating pheromone

response are also required for filamentous growth of diploid cells and the

invasive growth of haploid cells (Fig. 1.6). These four genes are STE20,

STE11 and STE7, which encode three protein kinases that act in se-

quence, and STE12, which acts as a transcription factor at the terminus

of both pathways. In Fig. 1.6, all four genes are marked with asterisks, indicating that their orthologs in P. marneffei have been identified (see also Table 6.2). The STE20 homolog from P. marneffei, pakA (GenBank Acc. AY621630; Pm80.15), is known to be essential during yeast but not hyphal growth (Boyce KJ et al., personal communication). The STE12 homolog, stlA, has been cloned [19].

The P. marneffei stlA gene together with the A. nidulans steA and C.

neoformans STE12alpha genes form a distinct subclass of STE12 ho-

mologs that have a C2H2 zinc-finger motif in addition to the homeobox

domain that defines STE12 genes. The stlA gene had no detectable func-

tion on vegetative growth, asexual development, or dimorphic switching

in P. marneffei. However, stlA complements the sexual defect of an A. nidulans steA mutant [19]. These data suggest that although members of the STE12 family of regulators are involved in controlling both mating and yeast-hyphal transitions in a number of fungi, stlA in P. marneffei may only play a role in controlling mating processes (see also Chapter 5) but not dimorphic switching. There may be as yet undetected compensatory genes or pathways responsible for dimorphic switching.

Figure 6.2: P. marneffei genes in cAMP pathway. Diagram labels: Ras2p (Pm85.8), Gpa2p (Pm51.59), Cyr1p (Pm7.24), Pde2p (Pm146.17), Bcy1p (Pm33.83), Tpk1p, 2p, 3p (Pm18.86, Pm47.4, Pm19.3), PKA (r), PKA (c), ATP, cAMP, AMP.

Another pathway controlling filamentation in Saccharomyces is the cAMP pathway (Fig. 6.2). Ras2p and Gpa2p are regulators of cAMP levels, acting upstream of the adenylate cyclase Cyr1p, which in turn regulates formation of cAMP. This process inactivates the cAMP-dependent protein kinase (protein kinase A, PKA), leading to enhanced filamentous growth in Saccharomyces. Homologs of all genes involved in this pathway have been identified in P. marneffei (Fig. 6.2 and Table 6.2).

Another regulator implicated in Saccharomyces filamentation is the zinc-finger transcription factor Rim1p. It is activated by a proteolytic cleavage that depends on several other RIM genes (RIM8, RIM9, RIM13). The Rim1p homolog in Aspergillus nidulans, PacC, is also regulated by such a proteolytic mechanism. Again, homologs of all these RIM genes were identified in P. marneffei, suggesting that this regulatory pathway exists in the fungus.

Because signal transduction pathways have been well elucidated in

Saccharomyces, the yeast has been used as a reference library for the

analysis of conserved signalling pathways. However, even the most detailed analyses in S. cerevisiae can only provide stepping stones towards explaining key morphological features in more complex, multicellular filamentous fungi. These mould-specific features may

include polarized hyphal growth, septation, establishment of multinucle-

ate cellular compartments, cell type-specific gene expression, and sub-

cellular localization of proteins. Furthermore, protein networks of other

fungi may even differ in their regulation of similar morphological tasks.

Hence, further studies toward an understanding of these differences on

the molecular level will remain an important task in functional analy-

ses, particularly of organisms, like P. marneffei, whose genomes will be

completely sequenced in the near future.

6.3.3 Alteration of the genomic expression

Elevated temperature is apparently the major environmental stimulus that causes P. marneffei to undergo the mycelium-to-yeast transformation. However, the influence of elevated temperature on the overall gene expression of P. marneffei has not been studied. Nevertheless, since survival at elevated temperatures, i.e. thermotolerance, is a trait critical to the ability of many fungal pathogens to thrive in host infections, a number of studies have been conducted in other fungi. For example, two genes have been implicated in growth at elevated temperatures in C. neoformans. Gene RAS1 (encoding a small GTP-binding

protein) regulates filamentation, mating and growth at high tempera-

ture [5]. Gene CNA1 (encoding calcineurin) is required for C. neofor-

mans virulence and may define signal transduction elements required

for fungal pathogenesis [236]. Homologs of both genes can be identified

within the P. marneffei genome. The P. marneffei homolog of C. neoformans RAS1, Pm85.8, is a known P. marneffei gene (rasA, GenBank Acc. AY232652). It has been confirmed experimentally to act upstream of CflA (Cdc42) to regulate germination of spores and polarized growth of both hyphal and yeast cells, while also exhibiting CflA-independent activities [23]. For CNA1, the putative homologous gene, Pm119.15, encodes a highly conserved calcineurin peptide sequence (557 aa long; 74% aa identity within an alignable region of 485 aa).

Table 6.2: Homologous genes related to signal transduction in filamentous growth.

Sc gene         Pm gene                    Function/product

MAPK pathway
STE20 (CST20)   Pm80.15                    Signal transducing kinase of the PAK family, involved in pheromone response and pseudohyphal/invasive growth pathways
STE11           Pm129.8                    MAP kinase kinase kinase in the filamentous growth pathway
STE7 (HST7)     Pm161.15                   Serine/threonine/tyrosine protein kinase of MAP kinase kinase family
STE12 (CPH1)    Pm201.2 (stlA)             Ortholog to AN2290.2 (SteA). Members of the STE12 family of regulators are involved in controlling mating and yeast-hyphal transitions in a number of fungi
TEC1            Pm109.16 (abaA)            Transcription factor participates in two developmental programmes: conidiation and dimorphic growth
KSS1            Pm41.61                    MAP kinase dedicated to filamentation pathway
FUS3            Pm8.42                     MAP kinase dedicated to pheromone response pathway

cAMP pathway
PDE2            Pm146.17                   cAMP phosphodiesterase, component of the cAMP-dependent protein kinase signaling system
RAS2            Pm85.8                     Regulator of cAMP levels
GPA2            Pm51.59                    G protein alpha subunit homologue
CYR1            Pm7.24                     Adenylate cyclase, required for cAMP production and cAMP-dependent protein kinase signalling
BCY1            Pm33.83                    Regulatory subunit of the cyclic AMP-dependent protein kinase (PKA)
TPK1, 2, 3      Pm18.86, Pm47.4, Pm19.3    Subunit of cytoplasmic cAMP-dependent protein kinase; promotes vegetative growth in response to nutrients; inhibits filamentous growth

RIM1 related
RIM1            Pm20.42                    Rim1p is homologous to the Aspergillus nidulans transcription factor PacC, which is also regulated by proteolysis
RIM8            Pm148.7                    Protein of unknown function, involved in the proteolytic activation of Rim101p in response to alkaline pH; has similarity to A. nidulans PalF
RIM9            Pm26.50                    Involved in the proteolytic activation of Rim101p in response to alkaline pH; has similarity to A. nidulans PalI
RIM13           Pm146.2                    Calpain-like protease involved in proteolytic activation of Rim101p in response to alkaline pH; has similarity to A. nidulans PalB

In addition to these analyses of individual gene functions, Steen et al. have initiated a genome-wide analysis of the response of C. neoformans to host temperature [296]. This analysis revealed differences in the levels of responsiveness of serotype A and D strains to growth at 25°C versus 37°C, with changes in transcript levels for histone genes, stress-related genes, and genes encoding translation components. Nunes et al. [234] used a Paracoccidioides brasiliensis biochip to monitor gene expression at several time points of the mycelium-to-yeast morphological shift. Their results revealed a total of 2,583 genes that displayed statistically significant modulation in at least one experimental time point. Among the identified genes, some encoded enzymes involved in amino acid catabolism, signal transduction, protein synthesis, cell wall metabolism, genome structure, oxidative stress response, growth control, and development. In particular, the gene 4-HPPD, encoding 4-hydroxyl-phenyl pyruvate dioxygenase, is highly overexpressed during mycelium-to-yeast differentiation, and inhibition of its product has been shown to block growth and differentiation of the pathogenic yeast phase of the fungus in vitro [234]. Two copies of 4-HPPD, Pm48.10 and Pm14.48, were identified in the P. marneffei genome.

Neither C. neoformans nor P. brasiliensis is phylogenetically closely related to P. marneffei. Comparison of gene expression patterns with the much more closely related Aspergillus species may be more meaningful. Information about A. fumigatus gene expression during metabolic adaptation to higher temperatures became available recently [233]. Nierman et al. examined gene expression throughout a time course upon a shift of growth temperature from 30°C to 37°C and 48°C [233]. A total of 1,926 temperature shift-responsive genes were identified. Comparative data also indicate that high temperature responses in A. fumigatus differ from the general stress response in yeast. We compared these genes against the P. marneffei genome in order to identify their homologs. Among the 1,926 genes, 1,032 (54%) have homologs in P. marneffei, i.e., a majority of A. fumigatus temperature shift-responsive genes are present in P. marneffei. Here the set of homologs was defined by identifying unique pairwise reciprocal best hits with at least 40% similarity in protein sequence and less than 20% difference in length. This result suggests that the genetic components of P. marneffei may not differ much from those for general high temperature responses in A. fumigatus.
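The homolog definition used here (unique pairwise reciprocal best hits, at least 40% similarity, less than 20% difference in length) can be sketched as follows. This is an illustration under stated assumptions, not the code used in this study: hits are given as (query, subject, score) tuples, the length difference is measured relative to the longer protein (the text does not specify the denominator), tied best hits are not treated specially, and all gene names and values in the example are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, similarity, lengths,
                         min_similarity=40.0, max_length_diff=0.20):
    """Return sorted (a, b) pairs that are reciprocal best hits and pass
    the similarity and length-difference filters."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    pairs = []
    for a, b in forward.items():
        if reverse.get(b) != a:
            continue  # not reciprocal
        if similarity[(a, b)] < min_similarity:
            continue
        len_a, len_b = lengths[a], lengths[b]
        # Assumption: difference is relative to the longer sequence.
        if abs(len_a - len_b) / max(len_a, len_b) >= max_length_diff:
            continue
        pairs.append((a, b))
    return sorted(pairs)


# Hypothetical search results (scores are arbitrary bit scores).
a_vs_b = [("Af1", "Pm1", 500), ("Af1", "Pm2", 200),
          ("Af2", "Pm2", 300), ("Af3", "Pm3", 250)]
b_vs_a = [("Pm1", "Af1", 495), ("Pm2", "Af1", 210),
          ("Pm2", "Af2", 190), ("Pm3", "Af3", 250)]
similarity = {("Af1", "Pm1"): 74.0, ("Af3", "Pm3"): 55.0}
lengths = {"Af1": 557, "Pm1": 540, "Af2": 410, "Pm2": 400,
           "Af3": 300, "Pm3": 200}

pairs = reciprocal_best_hits(a_vs_b, b_vs_a, similarity, lengths)
print(pairs)
```

In this example Af2/Pm2 is rejected because the best hit of Pm2 is Af1, and Af3/Pm3 is rejected because the two proteins differ in length by a third, leaving only Af1/Pm1.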

The experiments mentioned above identified the temperature shift-

responsive genes that may play a role in the structural or metabolic

changes that take place during morphogenesis or may be necessary for

colonisation and survival in the host. However, inferring a role in the temperature response for P. marneffei genes solely from their homology to temperature shift-responsive genes in other fungi may not be reliable. Moreover, very few genetic determinants have been shown to be directly involved in phase transition and/or pathogenicity. Further studies of gene expression in P. marneffei are necessary to resolve these problems.

In addition to revealing the overall gene expression pattern, under-

standing the transcriptional mechanisms which control the dimorphic

program is also important. Some of the transcription factors within known pathways have been mentioned above. Here I summarise further studies that identified several other transcription factors which control conidiation and dimorphic switching in P. marneffei. The P. marneffei abaA gene

(Pm109.16) encoding an ATTS/TEA DNA-binding domain transcrip-

tional regulator regulates cell cycle events and morphogenesis in both

filamentous and yeast growth [18]. The stuA gene (Pm107.14) encod-

ing a basic helix-loop-helix transcription factor may control processes

that require budding but not those that require fission as in dimorphic

growth in P. marneffei [20]. TATA-binding protein (TBP) is a general

transcription factor required for initiation of transcription in eukaryotes.

The TBP encoding gene, Tbp (Pm19.17), has been cloned and character-

ized in P. marneffei [254]. Tbp is essential for P. marneffei filamentous

growth, but plays a less significant role in growth and development dur-

ing the yeast phase. Furthermore, it has been shown that transcriptional

regulation in S. cerevisiae appears to be mechanistically bipolar, i.e.,

TATA box-containing genes are predominantly involved in responses to

stress, whereas TATA-less genes are mainly associated with constitutive

housekeeping functions [12]. Only 20% of yeast genes contain a TATA box [12]. It is therefore of interest to see whether TATA-less promoters are also present in P. marneffei, which would suggest a need to balance inducible stress-related responses with constitutive housekeeping functions, or reflect the difference in the regulatory basis for growth and development of the two morphological forms [254].

6.3.4 Structural reorganization towards the morphological change

It is reasonable to speculate that the mycelium-to-yeast transformation of

P. marneffei is an active process triggered by a shift in temperature. The

fungus undergoes a ‘drastic’ structural reorganisation associated with this

active process. We assume this process may be linked with a number of

phenotypic changes like those characteristic of apoptosis or programmed

cell death. Indeed, programmed cell death has been observed in both A.

fumigatus [225] and A. nidulans [313]. The metazoan upstream apoptotic machinery is absent in fungi, whereas the downstream effectors and regulators, both caspase-dependent and caspase-independent, seem to be present in A. fumigatus [225]. As in animal apoptotic cells, caspase activities are involved in the self-activated proteolysis of fungal mycelium. Searches of the P. marneffei genome revealed three genes (Pm105.4, Pm112.34 and Pm205.1) encoding metacaspase proteins that could be responsible for the caspase-like activities. Only two copies of these genes were identified in the A. nidulans genome. The searches also found a single gene (Pm93.8) encoding a poly(ADP-ribose) polymerase (PARP) protein, a homologue of

the key participant of caspase-independent apoptosis in mammals. PARP

is one of the known target proteins inactivated by caspase degradation in

animal cells. PARP activity was demonstrated previously in A. nidulans

during sporulation-induced apoptosis. PARP is absent in S. cerevisiae

but present in Aspergillus. The presence of these proteins in P. marnef-

fei and Aspergillus is indicative of the PARP-dependent programmed cell

death pathway. In addition, homologs of mammalian apoptotic protein

AMID are found in P. marneffei and A. fumigatus, but not in unicellular

yeasts such as S. cerevisiae, further suggesting that mechanisms of cell

death appear to be more complex in filamentous fungi.

Analysis of the cell wall of P. marneffei is fundamental to understanding its morphological transformation. In the mould form, the hyphal cell wall is essential for P. marneffei to penetrate solid nutrient substrates. In the yeast form, a transformed cell wall is essential to resist host cell defence reactions. The cell wall protects P. marneffei against aggressive human defence reactions, harbours most of the fungal antigens, and represents a potential drug target. Therefore, an understanding of cell wall biosynthesis pathways is important. We speculate that, as in many other filamentous fungi, the cell wall of P. marneffei is built from polysaccharide constituents including alpha(1,3)- and beta(1,3)-glucans, chitin, galactomannan, and beta(1,3),(1,4)-glucan. The corresponding structural genes, together with genes encoding a number of enzymes responsible for their biosynthesis and remodelling (including synthases, transglycosidases and glycosyl hydrolases), were identified in the P. marneffei genome (provided on the PMGD website: www.pmarneffei.hku.hk). One of the known differences between the yeast cell wall and the mycelial cell wall is that the beta(1,6)-glucan and peptidomannan present in yeast cell walls are missing in A. fumigatus [233]. Beta(1,6)-glucan is a key component of the yeast cell wall, interconnecting cell wall proteins, beta(1,3)-glucan, and chitin. The yeast genes KRE5, KRE6 and SKN1 are predicted to encode paralogous proteins that participate in the assembly of beta(1,6)-glucan. Homologs of these three genes (Pm76.37, Pm104.21 and Pm34.5) were identified in the P. marneffei genome, as well as in the A. fumigatus genome. It seems likely that the specific inventory of cell wall biosynthetic genes in the P. marneffei genome determines the specific polymer organization of its cell wall, but further analysis is needed for confirmation.

As a general feature of development in eukaryotes, only a small pro-

portion of the genome is associated with any particular morphogenetic

process. In yeast for example, only 21-75 of the estimated 6,000 genes

were assumed to be specific to meiosis and ascospore formation. This

is also the case in P. marneffei. Therefore, the study of morphogenesis should emphasise the regulation of morphogenetic genes through differential expression or activity, rather than large-scale replacement of one set of gene products by another. Gene expression studies in P. marneffei are still lacking. Nevertheless, the findings in this chapter offer new interpretive clues to the mechanisms of fungal virulence and dimorphism. First, the signalling systems that control dimorphism may be conserved between P. marneffei and related fungi. That is to say, many fungal species contain orthologous genes specifying the same pathways. Presumably, only subtle quantitative differences in the inputs and outputs of each pathway generate the different morphologies and behavioural characteristics. Second, dimorphism in P. marneffei may be

controlled by multiple signalling pathways. As in Saccharomyces, at least

three parallel pathways control the switch to filamentous growth. How

the fungus integrates the information from different pathways to effect a

change in cell type is not known.

In summary, morphogenesis is an essential developmental event, pro-

moting host invasion and evasion by dimorphic fungi. Prevention of this

event may hold the key to control of infections by these fungi. Under-

standing the molecular mechanisms for the morphologic switch could lead

to new drug or vaccine targets that block the earliest events in coloniza-

tion or infection.


Chapter 7

INTRAGENIC TANDEM REPEATS IN PENICILLIUM

MARNEFFEI AND OTHER ASCOMYCETES

Tandemly repeated DNA sequences occur frequently in the genomes of

organisms. Although their function and origin are not truly understood,

these highly dynamic genomic components may provide the most insights

into how a pathogenic fungus adapts to the host immune system.

7.1 Introduction

A tandem repeat (TR) is defined to be two or more adjacent copies of

the same sequence of nucleotides and may result from tandem duplica-

tion event(s). Over time, individual copies within a TR may undergo

additional, uncoordinated mutations so that typically, only approximate

tandem copies are present. The number of adjacent copies in a TR can

be variable. Lengths of TRs range from a few tens of base pairs (micro- and mini-satellites) to megabases (larger satellite repeats).
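As a concrete illustration of this definition, the sketch below reports exact tandem repeats in a nucleotide sequence using a regular-expression backreference. It is a deliberate simplification: it detects only perfect copies, whereas real TRs are usually approximate, and it is not the ETANDEM procedure applied later in this chapter. The function name, default parameters and example sequence are hypothetical.

```python
import re


def find_exact_tandem_repeats(seq, min_unit=3, min_copies=3):
    """Return (start, unit, copies) for each exact tandem repeat in seq.

    A repeat is a unit of at least min_unit nucleotides that occurs in at
    least min_copies adjacent, identical copies.
    """
    # Group 1 captures the shortest candidate unit; \1 demands further copies.
    pattern = re.compile(r"([ACGT]{%d,}?)\1{%d,}" % (min_unit, min_copies - 1))
    repeats = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        repeats.append((match.start(), unit, copies))
    return repeats


# Hypothetical sequence: four copies of ACG flanked by TT.
repeats = find_exact_tandem_repeats("TTACGACGACGACGTT")
print(repeats)
```

Backtracking regular expressions can be slow on long sequences, so a sketch like this is only suitable for short examples; dedicated repeat finders score approximate copies instead.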

Genomes, particularly those of eukaryotes, contain a large number of TRs. For example, 10% or more of the human genome is composed of TRs. Simple sequence repeats are fairly abundant in plant genomes, occurring approximately once in every 6 kb [258]. TRs are of biological importance for many reasons. First, they cause human diseases, including fragile-X mental retardation, Huntington's disease and myotonic dystrophy [288], which result from a dramatic expansion in the number of copies of a trinucleotide pattern. Second, they play a variety of regulatory and evolutionary roles. The repeats may interact with transcription factors, alter the structure of the chromatin, or act as protein binding sites [121, 208]. Third, they are important laboratory and analytic tools. They have been applied in linkage analysis and DNA fingerprinting [78, 340], since the number of copies of a specific TR is often polymorphic in the population. Last but not least, TRs play an apparent role in the development of immune system cells in humans. Du et al. [75] showed that breakpoints of immunoglobulin switch recombination, which occur between pairs of switch regions located upstream of the constant heavy chain genes, cluster to a defined subregion in three TRs.

The most interesting feature of TRs is their association with the functional variability of a gene product. Most TRs are in intergenic regions, but some are in coding sequences or pseudogenes. Verstrepen et

al. [328] showed that in the genome of Saccharomyces cerevisiae, most

genes containing intragenic TRs (IntraTRs) encode cell-wall proteins.

The presence of IntraTRs facilitates recombination in the gene or between

the gene and a pseudogene. The result of this increased frequency of re-

combination events is an expansion or contraction of the gene size. More

importantly, this size variation creates quantitative alterations in pheno-

types (e.g., adhesion, flocculation or biofilm formation). The variation of

the fungal cell surface allows fungal microbes to ‘disguise’ themselves in

order to evade the host immune system’s defences.

Inspired by the finding of Verstrepen et al. [328], the aim of this

chapter is to reveal the composition of IntraTRs from the genomes of

Penicillium marneffei, as well as other related species. Using computer

programs, we searched for both long and short repeated sequences within

protein-coding regions in P. marneffei and related Ascomycetes. Com-

parison of observed frequencies with expected values reveals that repeats

are enriched in the P. marneffei genome.



7.2.1 Identification of coding tandem repeats

The previously described methodology [328] was applied to find Intra-

TRs in P. marneffei genome and other fungal genomes, using the EM-

BOSS ETANDEM software [263] to screen the sequences. The ETAN-

DEM threshold score was set to 20. All known and predicted genes were

scanned for long (at least 40 nucleotides (nt)) or short (3-39 nt) repeats. Here a sequence was considered to be an intragenic repeat if it met two conditions: (i) repeat conservation was at least 85%; and (ii) the number of repeats was at least 20 for trinucleotide repeats, 16 for repeats between 4 and 10 nt, 10 for repeats between 11 and 39 nt and 3 for repeats of at least 40 nt.
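The two acceptance conditions can be written as a small filter. The sketch below only encodes the thresholds stated above and assumes that the repeat unit length, copy number and percent conservation have already been obtained from the repeat-finding step. The function names are hypothetical, and the conservation values in the example are made up.

```python
def min_copies_required(unit_len):
    """Minimum copy number for a repeat unit of the given length (nt)."""
    if unit_len < 3:
        return None   # units shorter than 3 nt are not considered
    if unit_len == 3:
        return 20     # trinucleotide repeats
    if unit_len <= 10:
        return 16     # 4-10 nt
    if unit_len <= 39:
        return 10     # 11-39 nt
    return 3          # long repeats, at least 40 nt


def is_intragenic_repeat(unit_len, copies, conservation, min_conservation=85.0):
    """Apply condition (i), conservation, and condition (ii), copy number."""
    required = min_copies_required(unit_len)
    return (required is not None
            and copies >= required
            and conservation >= min_conservation)


# Unit sizes and counts in the style of Table 7.1; conservation is made up.
print(is_intragenic_repeat(228, 3, 90.0))   # long repeat, three copies
print(is_intragenic_repeat(3, 19, 100.0))   # one copy short for a trinucleotide
print(is_intragenic_repeat(66, 5, 80.0))    # conservation below 85%
```

Keeping the thresholds in one function makes it straightforward to rerun the screen with different cutoffs.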

7.2.2 Sequence analysis

Position-specific iterated BLAST (PSI-BLAST) [6] was used to search

publicly available microbial genome sequences, GenBank, or EMBL. GenBank and EMBL were accessed through the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/) and the Oxford University Bioinformatics Centre, respectively. Protein domain determinations were made with the NCBI Conserved Domain Search. The

MBEToolbox package (Chapter 10) was used for nucleotide and amino

acid sequence analysis and alignments.


One of the ultimate goals of sequence analysis is to accurately iden-

tify candidate virulence genes that confer pathogenicity to P. marneffei.

General comparative analyses, such as ortholog prediction and species-

specific gene detection, are valuable, but not very specific. That is to say,

these methods give too many candidate genes. To narrow these candidate


Table 7.1: P. marneffei genes containing intragenic tandem repeats. Column “size” is the length of the repeat unit, “count” is the occurrence of the repeat unit. The total length of repeat units therefore equals size × count. Sequence identity (%) of repeat unit is greater than 80%. Consensus sequences of the repeat unit for each gene are available in PMGD. * indicates the gene contains more than one type of repeat. Genes are ordered by the size of repeat unit. The last 12 genes contain short repeats, the rest contain long repeats.

Pm gene    Size Count  Putative function
Pm6.47      228     3  Polyubiquitin, similar to S. cerevisiae UBI4 (YLL039C)
Pm27.95     171     5  Unknown function
Pm78.37*    165     3  Unknown function
Pm54.4      147     3  Streptococcal protective antigen (Q8NZA4)
Pm71.41     144     5  Unknown function
Pm133.2     141     3  Unknown function
Pm1.199     126    12  Homologous to AN7363.2, AN3547.2 and AN8457.2
Pm12.139    126     9  Putative ATP/GTP binding protein
Pm14.111    126     4  O-acetylhomoserine (Thiol)-lyase (CYSD_EMENI)
Pm30.75     126     7  Beta transducin-like protein HET-E2C*4 (Q8X1P4)
Pm35.44     126    11  Beta transducin-like protein HET-E2C (Q8X1P5)
Pm94.31     126     8  Putative ATP/GTP binding protein (Q6TMU6)
Pm210.2     126     9  Beta transducin-like protein HET-D2Y (Q8X1P2)
Pm39.56     120     3  Unknown function
Pm183.10    117     3  Casein kinase I homolog hhp1 (HHP1_SCHPO)
Pm54.56*    108     6  Pedal peptide precursor protein (O01387)
Pm12.114    102     3  Unknown function
Pm161.1     102     3  Phosphorylase (Q8TK58)
Pm77.10      99     5  KIAA1223 protein (Q8TB46)
Pm226.4*     99     7  Ankyrin 2 (Q9NCP8)
Pm209.2      96     3  Beta transducin-like protein HET-E4S (Q8X1P6)
Pm44.53      81     3  Related to transport protein USO1 (Q873K7)
Pm163.5      78     9  Erythrocyte binding protein 3 [Plasmodium falciparum] (Q7K5Q6)
Pm42.29      72     3  Phenol 2-monooxygenase (Q8X0B1)
Pm117.16*    72     5  Unknown function
Pm31.1       66     5  Unknown function
Pm34.34      66     5  Chitinase (Q873Y0)
Pm54.65      66     4  Extensin class I (cell wall hydroxyproline-rich glycoprotein) [Plasmodium falciparum] (Q09082)
Pm78.42      66     3  Chitinase 4 (Q7ZA41)
Pm118.4      63     5  Unknown function
Pm40.30      60     3  PAAA motif protein, similar to microfilament and actin filament cross-linker protein [Pan troglodytes]
Pm64.14      60     8  Zonadhesin – [Mouse]; PT repeat protein family (EAL93999) [Aspergillus fumigatus]
Pm95.32      60     3  Related to mannosyltransferase ALG2 (Q8X0H8)
Pm194.2      60     3  Retrovirus-related Pol polyprotein from transposon TNT 1-94 (POLX_TOBAC)
Pm41.72      54     5  Unknown function
Pm166.6      54     3  Unknown function
Pm48.11      48     5  Similar to S. cerevisiae YJR054W (Q6CXI0)
Pm78.3       48     4  Telomere-linked helicase 1 (Q8J216)
Pm173.14     48     4  Telomere-linked helicase 1 (Q8J216)
Pm194.1      48     4  Telomere-associated recQ-like helicase (O13400)
Pm194.5      48     5  Polymerase (Q9C435)
Pm224.1      48     3  Telomere-linked helicase 1 (Q8J216)
Pm224.2      48     5  Telomere-linked helicase 1 (Q8J216)
Pm230.1      48     5  Telomere-linked helicase 1 (Q8J216)
Pm234.1      48     5  Telomere-linked helicase 1 (Q8J216)
Pm236.2      48     4  DWIQ motif containing hypothetical protein (NP_702011) PF14_0123 [Plasmodium falciparum]
Pm236.3      48     7  Telomere-linked helicase 1 (Q8J216)
Pm247.2      48     5  Telomere-linked helicase 1 (Q8J216)
Pm108.33     45     4  Unknown function
Pm8.109      42     3  ATPase, AAA family
Pm40.29      42     3  Unknown function
Pm40.31      42     4  H7H motif in multiple proteins of Plasmodium
Pm52.29      42     3  Mitochondrial chaperone BCS1 (BCS1_XENLA)
Pm210.1      42     4  Unknown function
Pm173.16     24    10  Unknown function
Pm36.21      12    11  Unknown function
Pm1.35        6    25  Transcription initiation factor TFIID subunit 12 (TAF12_YEAST)
Pm1.28        3    24  Unknown function
Pm3.168       3    28  Putative cell wall protein FLO11p (Q7Z884)
Pm5.75        3    25  Dynamin binding protein, TUBA; DNMBP_MOUSE (Q6TXD4)
Pm14.75       3    29  Unknown function
Pm22.8        3    22  Unknown function
Pm67.24       3    22  Related to heat shock transcription factor HSF21 (Q9P554)
Pm76.36       3    21  Unknown function
Pm85.21       3    30  Unknown function
Pm138.7       3    24  Oxygenase-like protein (Q93M01)

genes down to a manageable number, genes that contain IntraTRs were

carefully investigated. This is because IntraTRs have been suggested to

generate functional variability in S. cerevisiae, and variation in IntraTR

number provides the functional diversity of cell surface antigens that, in

fungi and other pathogens, allows rapid adaptation to the environment

and evasion of the host immune system [328]. In S. cerevisiae, a total of

44 such genes with known functions have been identified.

These genes show unexpected functional similarities: 62% of those with conserved

long repeats encode cell-wall proteins [328].

A total of 66 P. marneffei genes that contain IntraTR(s) were identified

(Table 7.1). Nearly one third of these genes are of unknown function,

i.e., no putative homologs were detected by the extensive

PSI-BLAST search against GenPept databases, and no putative conserved

domains were detected. These genes may be P. marneffei-specific.

The remaining two thirds of them, whose putative homologs can be found,

are genes with assigned functions. Nine of these genes, namely, Pm78.3,

Pm173.14, Pm224.1, Pm224.2, Pm230.1, Pm234.1, Pm236.3, Pm247.2,

and Pm194.1, are homologs of the Magnaporthe grisea telomere-linked

helicase 1 (TLH1) gene. Genetic mapping showed that most members

of the TLH gene family are tightly linked to the telomeres and located

within 10 kb from the telomeric repeat. Similar helicase gene families

are also present in the chromosome ends of Saccharomyces cerevisiae

and Ustilago maydis, which suggests the initial association of helicase

genes with fungal telomeres might date back to the very early stages of

the fungal evolution [103]. Four genes, Pm210.2, Pm30.75, Pm35.44, and

Pm209.2 are homologs of beta transducin-like protein genes, most closely

similar to Podospora anserina het-d2y, het-e2c, het-e2c*4 and het-e4s, re-

spectively. These genes are involved in vegetative incompatibility, which

prevents a viable heterokaryotic cell from being formed by the fusion of

filaments from two different wild-type strains. In P. anserina, such in-

compatibility is always the consequence of at least one genetic difference

in het genes, specifically het-e and het-d. These loci control heterokaryon

viability through genetic interactions with alleles of the unlinked het-c

locus [82]. Other interesting homologs include streptococcal protective

antigen, chitinase, extensin, zonadhesin, and erythrocyte binding protein

(Table 7.1).

For further experimental studies, such as DNA typing, only those genes

that are most likely to be responsible for P. marneffei’s pathogenic

adaptation should be selected. The selection process involves multi-step

filtering. The underlying rationale is that a candidate virulence gene has to

be (1) P. marneffei-specific (without orthologs, or with orthologs containing no

similar IntraTR), and (2) functionally known to be related to intracellular

adaptation or otherwise completely functionally unknown. Moreover, in

order to conduct a PCR-based IntraTR length polymorphism study, the

constraint of the length of target DNA in PCR reactions has to be taken

into account. After the multi-step filtering and investigating the lengths

of IntraTR and introns of these genes, two genes, Pm40.30 (745 bp) and

Pm40.31 (733 bp), were selected for further polymorphism study. The

lengths of IntraTRs plus introns of the two genes are 234 and 277 bp,

respectively. What makes these two genes special is their BLAST analysis

results. The top PSI-BLAST hit of Pm40.30 against the NCBI NRProt database

is a hypothetical chimpanzee protein containing multiple PAAA motifs,

while the top hit of Pm40.31 is a hypothetical histidine-rich motif-containing

protein from Plasmodium falciparum. Although the function of this

hypothetical protein is unknown, it is still noteworthy

that another histidine-rich protein PfHRP2, encoded by P. falciparum

gene HRP-2, is indeed responsible for intracellular adaptation of this

parasite [11]. PfHRP2 binds heme, playing a role in hemoglobin prote-

olysis, which is the primary nutrient source of the erythrocytic growth

stage of P. falciparum [52].

The relative abundances of IntraTRs in different fungi were compared.

Table 7.2 shows the genome size, G, bases in repeat regions, B,

and number of genes containing repeats, N, for several fungi. When

all diploid and haploid species are taken together, the two diploid

fungi, S. cerevisiae and C. albicans, show higher B/G ratios. It appears

that genomes of diploid species may accommodate more bases located in

IntraTR regions, as much as 3 times higher. Among haploid fungi, P.

marneffei shows the highest B/G ratio, i.e., its fraction of bases belonging to

repeat regions is higher than that of any other haploid fungus. We argue that the

relatively more abundant IntraTRs in P. marneffei might be responsible

for its immuno-escaping mechanism, which enables the fungal pathogen

to survive within its host. Finally, note that B/N ratios remain largely

constant across different species, i.e., the average number of repeat bases

within each repeat-containing gene is similar.

Table 7.2: Comparison of genome size and bases in repeats. Abbreviations: Pm, P. marneffei; Af, Aspergillus fumigatus; An, Aspergillus nidulans; Sc, Saccharomyces cerevisiae; Ca, Candida albicans; Mg, Magnaporthe grisea; Nc, Neurospora crassa.

                                         Pm      Af      An      Sc      Ca      Mg      Nc
Diploid                                  No      No      No     Yes     Yes      No      No
Genome size (Mb), G                      30      28      30      12      16      39      40
Bases in repeat regions (bp), B      23,814  12,687  16,820  29,664  34,662  16,933  22,101
No. of genes containing repeats, N       66      33      31      69      82      62     121
B/G ratio                               794     453     561   2,472   2,166     434     553
B/N ratio                               361     384     543     430     423     273     183
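The two ratios in Table 7.2 follow directly from G, B and N. The short Python sketch below recomputes them as a consistency check; the dictionary simply transcribes the table, and the variable and function names are illustrative only.

```python
# Values transcribed from Table 7.2:
# genome size G (Mb), bases in repeat regions B (bp), repeat-containing genes N.
table_7_2 = {
    "Pm": (30, 23814, 66),
    "Af": (28, 12687, 33),
    "An": (30, 16820, 31),
    "Sc": (12, 29664, 69),
    "Ca": (16, 34662, 82),
    "Mg": (39, 16933, 62),
    "Nc": (40, 22101, 121),
}


def repeat_ratios(genome_mb, repeat_bp, n_genes):
    """Return (B/G in bp per Mb, B/N in bp per gene), rounded to integers."""
    return round(repeat_bp / genome_mb), round(repeat_bp / n_genes)


ratios = {species: repeat_ratios(*values) for species, values in table_7_2.items()}

# Species with the highest B/G among those listed as non-diploid in the table.
haploid = ["Pm", "Af", "An", "Mg", "Nc"]
top_haploid = max(haploid, key=lambda s: ratios[s][0])
```

The recomputed values agree with the tabulated ratios, and P. marneffei has the highest B/G ratio among the haploid species.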

The amino acid composition of a protein is the mole percent of the

different amino acids in its sequence. It is usually conserved among the

same proteins of different species. Here we performed a

cross-species comparison of the amino acid composition of IntraTRs (Fig. 7.1). The

two yeasts show a different visual pattern from that of the moulds.

S. cerevisiae and C. albicans use much more threonine and/or serine

residues than any other amino acid, while in moulds the patterns are

more varied. Serine is used most in P. marneffei and A. fumigatus,

alanine in A. nidulans, glycine in N. crassa and isoleucine in M. grisea.

Phenylalanine, valine and tryptophan are rarely used in all

species. The overall patterns of P. marneffei, A. nidulans and A. fumi-

gatus are similar to each other. The result shows that the differences

among amino acid composition are associated with the phylogenetic dis-

tances among species. This suggests that the amino acid composition of

IntraTR is not subject to neutral mutation but under the constraint of

a certain level of selection.
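The composition compared here is a simple per-residue tally over the translated repeat regions. The sketch below shows one way to compute it in Python; it is an illustration only, and the function name and the example peptides are hypothetical rather than taken from the data set.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(repeat_peptides):
    """Mole percent of each of the 20 amino acids over all repeat peptides."""
    counts = Counter()
    for peptide in repeat_peptides:
        # Ignore gaps, stops and ambiguity codes.
        counts.update(aa for aa in peptide.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical serine/threonine-rich repeat peptides.
example = ["STSTSSTA", "SSTTSPSA"]
comp = composition(example)
most_used = max(comp, key=comp.get)
```

For the example peptides, serine accounts for half of all residues, which mirrors the serine-rich pattern described for the yeasts.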

Figure 7.1: Amino acid composition in intragenic tandem repeats (one horizontal bar-chart panel per species: Af, An, Nc, Mg, Pm, Ca, Sc). Fungal species are: Af, A. fumigatus; Pm, P. marneffei; An, A. nidulans; Sc, S. cerevisiae; Ca, C. albicans; Nc, N. crassa; Mg, M. grisea. For each subplot, the x axis is the occurrence/frequency of each amino acid and the y axis lists the amino acids from top to bottom: A - Alanine, C - Cysteine, D - Aspartic Acid, E - Glutamic Acid, F - Phenylalanine, G - Glycine, H - Histidine, I - Isoleucine, K - Lysine, L - Leucine, M - Methionine, N - Asparagine, P - Proline, Q - Glutamine, R - Arginine, S - Serine, T - Threonine, V - Valine, W - Tryptophan, and Y - Tyrosine.

The cell surfaces of microorganisms show distinctive properties which

can be recognised by the host immune system. Many microorganisms

have the ability to switch their cell-surface molecules, a tactic that per-

mits them to elude the immune system and adhere to diverse materials

and cells (for review, see [329]). The human immune system poses

challenges to P. marneffei, which might have characteristic cell-surface

molecules that are recognized by dedicated phagocytic cells. Recent studies

linked the diversity of cell surface molecules to the variation in

IntraTR number. The persistence of a large number of IntraTRs in the

P. marneffei genome suggests that there is a compensating benefit. We

therefore propose that variation in IntraTR number provides the func-

tional diversity of cell surface antigens in P. marneffei, allowing rapid

adaptation to the environment and evasion of the host immune system.


Chapter 8

EXTENT AND EVOLUTIONARY PATTERN OF

DUPLICATE GENES IN PENICILLIUM MARNEFFEI

AND OTHER ASCOMYCETES

Gene duplication and subsequent divergence have long been believed

to be of importance for the functional novelty and complexity of organ-

isms. The extent and evolutionary patterns of duplicate genes (paralogs)

have long been studied in higher eukaryotes, but not in lower eukary-

otes such as fungi. In this chapter, gene-coding sequences in genomes

from Penicillium marneffei, together with those from other ascomycetes,

Saccharomyces cerevisiae, Schizosaccharomyces pombe, Candida albicans,

Aspergillus nidulans and Neurospora crassa, are used to identify multi-

gene families. The number of synonymous substitutions per synonymous

site, Ks, and the number of nonsynonymous substitutions per nonsyn-

onymous site, Ka, are calculated to measure the time (or relative fre-

quency) of duplication as well as the selective constraint on gene pairs.

The evolutionary rates of duplicate gene pairs are measured by applying

the codon substitution model, which is more sensitive than traditional

models [111]. A large variation in the extent of gene duplication in these

species was found (percentage of genes in multigene families ranged from

23.6% in S. cerevisiae to 8.0% in N. crassa). The age distribution of

the gene duplications tentatively suggests that the P. marneffei genome

may have experienced two rounds of large-scale duplication. Paralogs

in filamentous ascomycetes (but not paralogs in yeast ascomycetes)

were also found to be under weaker functional constraint than

orthologs. Analysis of the divergence of evolutionary rates in S. cerevisiae

and C. albicans revealed that 17.8% of gene pairs show an asymmetric

divergence pattern in amino-acid substitutions. However, there is no evidence

to show that this asymmetry is associated with positive selection. I

speculate that the different extent and evolutionary pattern of duplicate

genes in these ascomycetes might be associated with their genotypical

and phenotypical differences.

8.1 Introduction

In the early 1970s Ohno proposed in his book that gene duplication is a

major evolutionary source of gene innovation [237]. By this he meant that

the creation of a paralog of a gene through duplication (by many possible

means) results in one of the duplicates being functionally redundant. This

redundant copy may mutate more freely without affecting the overall

fitness of the organism, and thus is more likely to become a gene with a

novel function. Biologists now generally accept the view that, by

creating sets of gene paralogs, gene duplication plays an important role in

the adaptation of organisms to their environment and in the origin of the

phenotypic diversity of organismal evolution [210]. Nowadays, with the

completion of several eukaryotic genome projects, it is well known that

one of the characteristics of eukaryotic genomes is the presence of dupli-

cate genes, forming numerous gene families [287]. More than a third of a

typical eukaryotic genome consists of gene families [115,287,345]. Whole

genome duplication(s) during the earlier evolution of the vertebrate lin-

eage have been proposed to account for the presence of extensive gene

duplications in most of the vertebrate genomes [209,221,287].

The extent of gene families in one organism is firstly determined by

the frequency and magnitude of gene duplication events, and secondly de-

termined by the subsequent evolutionary fates of gene pairs following the

duplication events. This may be better understood through comparative

studies of sequence divergence in duplicate genes in different genomes.


However, until recently few studies have been conducted in the limited

number of representative organisms available [68,174,210], because such

kinds of inter-genomic comparisons rely on the availability of complete

genome sequence from multiple organisms.

In this study, I compare the extent and evolutionary pattern of duplicate

genes in the phylum Ascomycota, using the complete sets of protein-coding

genes in the fungi Saccharomyces cerevisiae [110],

romyces pombe [354], Candida albicans, Aspergillus nidulans, Penicillium

marneffei and Neurospora crassa [101]. These fungal species display dif-

ferent life styles and phenotypic characteristics. The brewer’s yeast, S.

cerevisiae, and fission yeast, S. pombe, have a life cycle characterized

by a unicellular thallus that reproduces by budding and fission respec-

tively. Filamentous ascomycetes, N. crassa and A. nidulans, grow hyphae

apically and branch laterally. P. marneffei shows dimorphic switching

between mould and yeast forms of growth under different temperatures.

It is of interest to know how gene duplication shaped their gene reposi-

tories leading to novel genes conferring novel adaptive functions in these

fungi.

In practice I used nucleotide alignments of duplicate genes to calculate

two key parameters of molecular evolution: the number of synonymous

(silent) substitutions per synonymous site, Ks, and the number of non-

synonymous (amino-acid replacement) substitutions per nonsynonymous

site, Ka. Ks provides a crude measure of the time since duplication for

each gene pair, assuming that Ks increases approximately linearly with time.

The ratio Ka/Ks provides a measure of the selection pressure to which a

gene pair is being subjected. Generally speaking, a Ka/Ks ratio of 1

means that the duplicate genes are under few or no selective constraints

(i.e., amino acid replacement substitutions occur at the same rate as

synonymous substitutions). A Ka/Ks ratio > 1, which is strong evidence

for positive selection, indicates that replacement substitutions occur at

a rate higher than that expected by chance, so advantageous mutations

have occurred during sequence divergence. In contrast, a Ka/Ks ratio < 1

is consistent with ‘purifying selection’. That is to say, some amino-acid

replacement substitutions have been purged by natural selection because

of their deleterious effects [48]. Another evolutionary pattern that has

attracted great interest is the asymmetry of evolutionary rates between the

two copies of a duplicated gene pair, i.e., one copy evolves faster than the

other. Intensive studies on this pattern in different organisms have

shown a wide range in estimates of the proportion of duplicate gene pairs

that show asymmetric evolution [59,68,137,174,265,321,370,371].
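The interpretation of the Ka/Ks ratio given above can be summarised in a few lines of Python. This is only an illustrative sketch: the function name and the tolerance band around 1 are assumptions introduced here, and in practice any departure from 1 would need a statistical test rather than a fixed cut-off.

```python
def selection_regime(ka: float, ks: float, tol: float = 0.05) -> str:
    """Classify a gene pair by its Ka/Ks ratio.

    tol is a hypothetical tolerance around 1.0, used only for illustration.
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form the Ka/Ks ratio")
    omega = ka / ks
    if omega > 1.0 + tol:
        return "positive selection"
    if omega < 1.0 - tol:
        return "purifying selection"
    return "few or no selective constraints"


# Hypothetical (Ka, Ks) estimates for three gene pairs.
examples = [(0.05, 0.50), (0.30, 0.30), (0.90, 0.40)]
regimes = [selection_regime(ka, ks) for ka, ks in examples]
```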

Since the completion of whole genome sequence of S. cerevisiae [110],

a number of studies have involved the identification of multigene families

in this model eukaryotic genome. The resulting numbers of multigene

families in S. cerevisiae reported by Rubin et al. [270] are higher than

those reported by Friedman and Hughes [95] (1858 compared to 1440).

This is because the former study used the simple criterion, BLAST

E-value of 10⁻⁶, while the latter used the much stricter search with E =

10⁻⁵⁰. However, using a single statistical score (such as the E value given

by a BLAST search or a related score) without specifying the proportion

of alignable regions may put two non-homologous proteins into the same

family due to domain sharing [118]. Hence in this study, in order to

obtain a reasonable estimate, I adopted a relatively stringent definition

in which the lengths of gene-encoding proteins are taken into account,

instead of relying on E-values only.


The ability to adapt to changing environments and to exploit new niches

has a great influence on the success of an organism [210]. This ability is

associated with new genes or genes with new functions [219]. Gene dupli-

cations are traditionally considered to be a major evolutionary source of


new protein functions. After duplication, the fate of the resulting copy of

a gene is of great interest. At least three hypotheses have been proposed,

as follows:

Nonfunctionalisation The classical view pioneered by Susumu Ohno

[237] holds that a duplicate gene produces two functionally redundant,

paralogous genes and thereby frees one of them from selective constraints.

The duplicate gene may be degraded to a pseudogene by mutational

inactivation and finally could be removed from the genome by deletion

[237,238]. This is the most likely outcome of duplicate genes [237,68].

Neofunctionalisation The duplicate gene may avoid redundancy by

assuming a novel function, i.e., the redundant copy may be modified and

in time assume a new role [237, 166, 336, 334, 298, 212]. Since this un-

constrained paralog is free to accumulate neutral mutations, there is the

possibility of fixation of mutations that may lead to a new function. This

prediction was supported by studies on isozyme spectra of polyploidy in a

number of organisms (reviewed in [196]). Of course, mutational time is a

deciding factor, since copies need sufficient modifications to assume roles

different from their parents, assuming that they are initially of neutral

fitness. Thus, the deletion rate is of great importance to gene innovation

by being sufficiently slow to give copies time to diverge.

Both the hypotheses above assume one copy of a duplicate gene pair

is free to evolve, while the other remains under selective pressure. This

has been challenged in work by Kondrashov et al. [174] and Lynch and

Conery [212], who show that paralogs do not seem to have experienced

any extensive period of neutral evolution. Kondrashov et al. [174] pro-

posed that paralogs avoid neutrality through gene amplification, followed

by a period of either relaxed or positive selection. They also observed

that paralogs evolve faster than their corresponding orthologs. Again,

this could be due to relaxed or positive selection. Furthermore, a study


of 17 pairs of duplicate genes in the tetraploid frog Xenopus laevis has

shown that both copies were subject to purifying selection, contrary to

the notion of neutrality of one of the copies [137]. The failure of em-

pirical research to support Ohno’s model has led to the proposal of an

alternative hypothesis – subfunctionalisation.

Subfunctionalisation The third hypothesis, ‘subfunctionalisation’ or

the duplication-degeneration-complementation (DDC) model [90],

proposes that duplicate genes come under selective pressure and are

retained by losing separate subfunctions from a multifunctional ancestral

gene. Redundant material is discarded through degradation [90]. It also

states that duplicate genes are initially redundant in function and,

accordingly, a duplication event is selectively neutral. But it differs from

the other hypotheses in that successfully retained subdomains can be reused for

a subset of the original functions or even other new or related purposes [90].

As a result, the two genes can be said to belong to a family, being related

by sequence similarity, if not by function. Naturally, this relationship

will decrease with time until no discernible similarity can be observed in

regions of low conservation. A large number of observations support this

model, although mostly in diploid or polyploid eukaryotes.


8.3.1 Sequences and gene families

For each organism, other than P. marneffei, the complete sets of available

putative amino-acid sequences and coding DNA sequences were

downloaded from genomic databases as follows: for S. cerevisiae,

http://genome-www.stanford.edu/Saccharomyces; for S. pombe,

http://www.genedb.org/genedb/pombe (Schizosaccharomyces pombe GeneDB); for

C. albicans, http://genolist.pasteur.fr/CandidaDB/ (CandidaDB Data

Release R1 Dec 17, 2001; this genome database was created by the

EU-funded consortium Galar Fungail by performing independent annotation

of assembly 19 sequence data obtained from the Stanford Genome

Technology Centre, http://www-sequence.stanford.edu/group/candida);

for A. nidulans, http://www.broad.mit.edu/annotation/fungi/aspergillus/

(Aspergillus nidulans Database); and for N. crassa,

http://www-genome.wi.mit.edu/annotation/fungi/neurospora (Neurospora crassa

Database release 3: 02.12.2002). All protein sequences that were annotated

as known or suspected pseudogenes and those proteins encoded by mi-

tochondrial genomes were removed. Gene families in each genome were

identified by using BLASTCLUST (at least 30% identical residues, aligned

over at least 80% of their lengths). BLASTCLUST applies the single-linkage

algorithm. For documentation on its use, see

ftp://ftp.ncbi.nlm.nih.gov/blast/documents/README.bcl. The clusters were used to

identify and count duplication events (although not all pairs of genes in

a cluster are homologous to each other). Throughout the analysis, the

same criteria were applied in searching for orthologs of genes from all

other species; that is to say, orthologs were predicted by BLASTP search

for interspecies genes with > 30% identical residues and an alignable region

covering at least 80% of their lengths.
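Single-linkage clustering places two genes in the same family whenever a chain of qualifying pairwise hits connects them, which is why not every pair within a cluster need pass the thresholds directly. The following Python sketch illustrates the idea with a union-find structure. It is not BLASTCLUST itself: the function name, the input format (gene identifiers plus tuples of identity and coverage percentages) and the example hits are assumptions made for illustration.

```python
def single_linkage_families(genes, hits, min_identity=30.0, min_coverage=80.0):
    """Group genes into families by single linkage over qualifying hits.

    hits: iterable of (gene_a, gene_b, percent_identity, percent_coverage).
    """
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent links to the root, halving the path as we go.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, identity, coverage in hits:
        if identity >= min_identity and coverage >= min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    families = {}
    for g in genes:
        families.setdefault(find(g), []).append(g)
    # Largest families first, members sorted for reproducible output.
    return sorted((sorted(m) for m in families.values()),
                  key=lambda m: (-len(m), m))


genes = ["g1", "g2", "g3", "g4", "g5"]
hits = [
    ("g1", "g2", 45.0, 90.0),  # passes both thresholds
    ("g2", "g3", 35.0, 85.0),  # passes both thresholds
    ("g1", "g3", 25.0, 90.0),  # identity too low
    ("g4", "g5", 60.0, 70.0),  # coverage too low
]
families = single_linkage_families(genes, hits)
```

Here g1 and g3 end up in the same family through g2 even though their direct hit fails the identity threshold, while g4 and g5 remain singletons.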

8.3.2 Estimation of substitution rate

Gene families with sequences similar to known transposable elements

were removed at this point and excluded from the rest of analysis. Paralo-

gous protein sequences were aligned using ClustalW version 1.82 with the

default parameters (PAM matrix; gap opening penalty = 10.0; gap

extension penalty = 0.2). The corresponding nucleotide-sequence alignments

were derived by substituting the respective coding sequences for the

protein sequences using MBEToolbox (Chapter 10). Ks and Ka were

calculated by the method of maximum-likelihood, which is implemented

in the CODEML program of the PAML package version 3.13d [359].
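Deriving a nucleotide alignment from a protein alignment amounts to replacing each aligned residue with its codon and each gap with three gap characters. The sketch below shows this step in Python for a single sequence. It is an illustration only and does not reproduce the MBEToolbox routine; the function name is hypothetical, and a trailing stop codon in the coding sequence is assumed to be droppable.

```python
def protein_to_codon_alignment(aligned_protein: str, cds: str) -> str:
    """Thread a coding sequence onto its gapped protein sequence."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for c in aligned_protein if c != "-")
    # Accept an exact match, or one extra codon (assumed to be the stop codon).
    if len(codons) not in (n_residues, n_residues + 1):
        raise ValueError("coding sequence does not match the protein length")
    codon_iter = iter(codons)
    return "".join("---" if c == "-" else next(codon_iter)
                   for c in aligned_protein)


# Hypothetical aligned peptide with one gap, and its coding sequence.
aligned = protein_to_codon_alignment("M-KL", "ATGAAACTG")
```

Applying the function to every sequence in a protein alignment yields a codon alignment in which gaps fall only between codons, as required for estimating Ks and Ka.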


Following the procedure described in Zhang et al. [371], the pair of

duplicate genes with the smallest value of Ks was picked within each family. This

process was repeated for the remaining genes within the family until

no more gene pairs could be picked. The process was implemented

by ad hoc scripts in Perl.
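The pair-picking procedure is a greedy selection: take the pair with the smallest Ks, set both genes aside, and repeat on what remains. A minimal Python sketch is given below. It is not the original Perl script; the function name and the input format (a mapping from gene pairs to Ks values) are assumptions, and the Ks values are invented for the example.

```python
def pick_independent_pairs(ks_by_pair):
    """Greedily pick non-overlapping gene pairs in order of increasing Ks.

    ks_by_pair: dict mapping (gene_a, gene_b) tuples to Ks for one family.
    """
    used = set()
    picked = []
    # Scanning pairs in increasing Ks and skipping any pair that shares a gene
    # with an earlier pick is equivalent to repeatedly taking the minimum-Ks
    # pair among the remaining genes.
    for (a, b), ks in sorted(ks_by_pair.items(), key=lambda item: item[1]):
        if a not in used and b not in used:
            picked.append((a, b, ks))
            used.update((a, b))
    return picked


# Hypothetical Ks values for all pairs in a four-gene family.
family = {
    ("a", "b"): 0.1, ("a", "c"): 0.2, ("b", "c"): 0.3,
    ("c", "d"): 0.8, ("a", "d"): 0.9, ("b", "d"): 1.0,
}
pairs = pick_independent_pairs(family)
```

In this example the family of four genes yields two independent pairs, (a, b) and then (c, d).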

To plot Ka versus Ks, pairs with Ks > 5.0 or Ka > 5.0 were elim-

inated because such high sequence divergence is often associated with

problems like difficulty in alignment, different codon usage biases or

nucleotide compositions in the different sequences. Ks is known to be

strongly distorted by codon usage bias [283]. The codon adaptation index

(CAI) [282] was used as a measure of codon bias. I therefore calculated

average values of CAI for all gene pairs and excluded those with average

CAI > 0.5 from the analysis.
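The exclusion rules of this paragraph can be stated as a single predicate. The Python sketch below is illustrative only; the function and parameter names are hypothetical, and the CAI values of the two genes are assumed to have been computed beforehand.

```python
def keep_pair(ka, ks, cai_a, cai_b, max_rate=5.0, max_mean_cai=0.5):
    """Return True if a gene pair passes the divergence and codon-bias filters."""
    if ks > max_rate or ka > max_rate:
        return False  # too divergent for reliable alignment and estimation
    mean_cai = (cai_a + cai_b) / 2.0
    return mean_cai <= max_mean_cai  # drop pairs with strong codon usage bias


# Hypothetical pairs: (Ka, Ks, CAI of copy 1, CAI of copy 2)
pairs = [(0.2, 0.8, 0.30, 0.40), (0.4, 6.2, 0.20, 0.20), (0.1, 0.3, 0.70, 0.60)]
kept = [p for p in pairs if keep_pair(*p)]
```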

8.3.3 Relative rate test

The relative evolutionary rate test aims to compare the substitution rates

of two sequences or two groups of sequences. Here it was applied to

compare the evolutionary rate of two copies of a duplicate gene pair.

In the test I only used recently duplicated genes (i.e., duplicate genes with

Ks < 0.5). These ‘young’ duplicates have fewer multiple substitutions,

and their rates can therefore be estimated more accurately than those of older ones.

In addition, very young duplicates (Ks < 0.05) were excluded because

they have too few substitutions for the statistical test to reach significance [199].

In order to apply the relative rate test, I obtained outgroup sequences

for these young gene pairs. Each relative rate test was based on one gene

pair and its outgroup, forming a triplet. Selection of the outgroup was done

by using the method described in Conant and Wagner [59]. When more

than one outgroup sequence was available, either from the same genome

or from other genomes, triplets of genes closest to each other in syn-

onymous divergence rate, Ks, were chosen. I used two likelihood ratio

Table 8.1: Distribution of multigene families in fungi. Abbreviations: SC - S. cerevisiae; SP - S. pombe; CA - C. albicans; AN - A. nidulans; PM - P. marneffei; NC - N. crassa.

Family size                                      SC     SP     CA     AN     PM     NC
1                                              4500   4104   5276   7887   8725   9274
2                                               390    229    188    320    291    198
3                                                54     34     41     84     64     43
4                                                23     18     29     38     26     22
5                                                11      4      8     17     10      5
6-10                                             17     18     24     29     29     15
11-20                                             7      4      2      9      3      3
>20                                               2      0      2      5      5      1
Number of multigene families (size >= 2)        504    307    294    502    428    287
Total genes used in the analysis               5889   4939   6165   9541  10060  10082
Number of genes in families                    1389    835   1189   1654   1335    808
Number of young duplicate gene pairs (Ks < 0.5) 165     51     50     43     52     10

(LR) tests to test for asymmetric divergence in both amino-acid and

codon substitution rates. The codon substitution rate was estimated using the codon

substitution model described by Goldman and Yang [111]. To do the LR test,

two models were applied to the data: model 0 constrains the amino-acid

or codon substitution rates to be equal in the two sequences; and model

1 assumes the rates are free parameters (hence they could be unequal to

each other in the two sequences). Maximum likelihood values ML0 and ML1

from the two models were collected and the likelihood ratios were

calculated as LR = 2(ln(ML1) − ln(ML0)). LR was then compared against

the χ² distribution with one degree of freedom, as detailed by Yang [358].
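The likelihood ratio test can be sketched in a few lines of Python. The example assumes that the log-likelihoods of the constrained (equal-rate) and free (unequal-rate) models have already been obtained, for instance from CODEML output; the function name and the numerical values are hypothetical. For one degree of freedom the upper tail of the χ² distribution has the closed form erfc(√(x/2)), so no statistics library is needed.

```python
import math


def lr_test(lnl_constrained: float, lnl_free: float):
    """Likelihood ratio statistic and p-value (chi-square, 1 degree of freedom)."""
    lr = 2.0 * (lnl_free - lnl_constrained)
    lr = max(lr, 0.0)  # guard against tiny negative values from optimisation noise
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value


# Hypothetical log-likelihoods for one gene triplet.
lr, p = lr_test(lnl_constrained=-1000.0, lnl_free=-998.0)
asymmetric = p < 0.05
```

With these values the statistic is 4.0 and the p-value is about 0.046, so the equal-rate model would be rejected at the 5% level.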

8.4 Results

8.4.1 Extent of gene duplication in ascomycetes

As shown in Table 8.1, 1,389 (23.6%) of 5,889 genes in S. cerevisiae belong

to multigene families (including at least two genes), 16.9% in S. pombe,


19.3% in C. albicans, 17.3% in A. nidulans, 13.3% in P. marneffei, and

only 8.0% in N. crassa.

When comparing the number of young duplicates, I found that 23.8% of genes in

multigene families belong to young duplicate pairs (Ks < 0.5) in S. cerevisiae, 12.2% in S. pombe, 8.4%

in C. albicans, 5.2% in A. nidulans, 7.8% in P. marneffei, and only 2.5%

in N. crassa (Table 8.1).
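Both sets of percentages can be recomputed from the last three rows of Table 8.1. The Python sketch below does so as a consistency check. Note one inference that is not stated explicitly in the text: the percentages of young duplicates are reproduced when each young pair is counted as two genes and divided by the number of genes in families. The names used are illustrative only.

```python
# Values transcribed from Table 8.1:
# (total genes, genes in multigene families, young duplicate pairs with Ks < 0.5)
table_8_1 = {
    "SC": (5889, 1389, 165),
    "SP": (4939, 835, 51),
    "CA": (6165, 1189, 50),
    "AN": (9541, 1654, 43),
    "PM": (10060, 1335, 52),
    "NC": (10082, 808, 10),
}


def duplication_summary(total_genes, genes_in_families, young_pairs):
    """Return (% of genes in families, % of family genes in young pairs)."""
    pct_in_families = 100.0 * genes_in_families / total_genes
    # Assumption: each young pair contributes two genes to the numerator.
    pct_young = 100.0 * 2 * young_pairs / genes_in_families
    return round(pct_in_families, 1), round(pct_young, 1)


summary = {sp: duplication_summary(*row) for sp, row in table_8_1.items()}
```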

Apparently S. cerevisiae contains more multigene families and more

recently duplicated genes than any other fungus in this analysis. This

is in concordance with an earlier study [345]. Whole-genome

duplication approximately 10⁸ years ago was proposed as an explanation for the

presence of many duplicate genes [279]. S. pombe, C. albicans, A. nidu-

lans and P. marneffei contain moderate numbers of duplicated genes to

roughly the same extent as each other. Very few duplicated genes are

present in the N. crassa genome. This low number of duplicate genes is

consistent with results reported previously [101,231].

Table 8.2 lists the largest multigene families in each species.

S. cerevisiae contains a large number of transposable

elements, which play an important role in creating duplications

in the yeast genome [366]. The top multigene families of S. cerevisiae include

a group of proteins, seripauperins, whose function(s) remain poorly

understood [332]. A comparable number of predicted sugar transporters is

found in N. crassa and S. cerevisiae. Transporter and reductase gene

families are expanded in filamentous fungi. Interestingly, P. marneffei

has a large gene family of 24 putative pepsin-like proteases, which is not

so substantial in the other fungi studied here.

8.4.2 Age distribution of duplicate genes

In general, we assume Ks increases approximately linearly with time

because synonymous substitutions do not alter the amino-acid sequence

and therefore there will be lower constraint due to natural selection [212].

Table 8.2: Large multigene families in fungi.

Fungi           Size of family   Function/Product
S. cerevisiae         20         Hexose transporter
                      20         Seripauperins
                      17         Amino acid permease
                      15         GTP-binding protein
                      13         Helicase
S. pombe              20         Multidrug resistance protein
                      17         GTP-binding protein
                      12         Amino acid permease
                      11         Retrotransposable element
                      10         Protein kinase
C. albicans           23         Unknown proteins
                      21         Amino acid permease
                      13         GTP-binding protein
                      11         Ferric reductase transmembrane component
                       9         Unknown proteins
A. nidulans           61         Hexose transporter
                      42         Putative transporter
                      36         Oxidoreductase
                      28         Multidrug resistance protein
                      21         Aldehyde dehydrogenase
P. marneffei          34         MFS multidrug transporter
                      31         Short chain dehydrogenase/reductase family
                      27         Hexose transporter protein
                      24         Pepsin-type protease
                      23         Major facilitator superfamily
N. crassa             21         Oxidoreductase
                      17         Phosphoethanolamine N-methyltransferase
                      16         Hexose transporter
                      11         Aldehyde dehydrogenase
                      10         Endoglucanase

[Figure 8.1: six histograms of the number of duplicate gene pairs
against Ks, one panel per species. Panel statistics as printed on the
plots: C. albicans, N = 198, mean = 2.09, SD = 1.63; P. marneffei,
N = 174, mean = 2.16, SD = 1.72; A. nidulans, N = 142, mean = 2.47,
SD = 1.82; S. pombe, N = 123, mean = 1.16, SD = 1.30; S. cerevisiae,
N = 313, mean = 1.39, SD = 1.65; N. crassa, N = 48, mean = 2.47,
SD = 1.49.]

Figure 8.1: Frequency distribution of Ks. Frequency distribution of
duplicate gene pairs as a function of the number of synonymous
substitutions per synonymous site (Ks). Arrow indicates the second peak
in P. marneffei.

[Figure 8.2: six log-log scatter plots of Ka (vertical axis) against Ks
(horizontal axis), with both axes running from 0.01 to 1. Panels:
S. cerevisiae, S. pombe, C. albicans, N. crassa, A. nidulans and
P. marneffei.]

Figure 8.2: Log-log plots of Ka vs. Ks for duplicate gene pairs. Log-log
plots of the number of nonsynonymous substitutions per nonsynonymous
site (Ka) vs. the number of synonymous substitutions per synonymous
site (Ks) for duplicate gene pairs. Each point represents a single pair
of gene duplications. Points below the diagonal (Ka < Ks) imply the genes
have been subjected to purifying selection against amino acid changes.
Open points denote orthologous gene pairs.

If this assumption largely holds, the distribution of Ks can be used as
an indicator of the distribution of duplication events along a time
scale. I plotted the frequency distribution of pairs of duplicate genes
as a function of Ks in Fig. 8.1. An obvious pattern found in all species
is that most gene duplicates are young and the density of duplicates
drops off with increasing Ks. The distribution for C. albicans is
comparatively flat: apart from a peak around Ks = 0.2, the gene pairs
are spread fairly evenly over Ks. This may indicate that small-scale
gene duplications happened persistently during the course of evolution.

For P. marneffei, there are two peaks in the plot. The first is a high
peak in the age distribution centred around Ks = 0.1, indicating that
there are a large number of gene pairs of a similar recent age; the
second corresponds to a low, broad region from Ks = 2.0 to 4.5. I
speculate that the second peak is a trace of ancient gene duplication
events on a relatively large scale. This proposed ancient duplication
would have created many duplicate gene pairs. After such a long
evolutionary time, most of these gene pairs would be expected to have
mutated and become divergent. Only some pairs retain some degree of
similarity, which gives rise to the second peak. This dual-peak pattern
is not readily observed in other fungal species, except for N. crassa,
whose second peak might result from gene duplication prior to the
development of repeat-induced point mutation (see below).

8.4.3 Selective constraint between paralogs

As mentioned in the Introduction, Ka/Ks is used as a measure of
selective constraint between two copies of duplicate genes. The smaller
the Ka/Ks value, the stronger the selective constraint (purifying
selection) acting on the two copies. Table 8.3 gives the estimated Ka/Ks
values in different fungi.

Comparison of Ka/Ks values for different fungi revealed that the
strength of selection is generally similar among yeasts (i.e., S.
cerevisiae, S. pombe and C. albicans), and among moulds (i.e., A.
nidulans, P. marneffei and N. crassa). There is a substantial difference
in Ka/Ks between yeasts and moulds. Purifying selection is strongest
among the S. cerevisiae paralogs and weakest among the A. nidulans
paralogs. Mould paralogs show significantly weaker functional
constraints, indicated by larger values of Ka/Ks, than those in yeasts
(Student’s t-tests for pairwise comparisons).

Table 8.3: Ratio of nonsynonymous to synonymous substitution rates
(Ka/Ks) for recently diverged paralogs (0.05 < Ks < 0.5).

Fungi           No. of gene pairs   Ka/Ks (mean ± SD)
S. cerevisiae          89            0.134 ± 0.166
S. pombe               22            0.148 ± 0.234
C. albicans            34            0.245 ± 0.224
A. nidulans            12            0.491 ± 0.214
P. marneffei           29            0.456 ± 0.231
N. crassa               9            0.359 ± 0.276
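To give a feel for the size of the yeast versus mould difference, the sketch below computes a two-sample t statistic directly from the summary statistics in Table 8.3. It is only an approximation of the pairwise comparisons reported above: the thesis tests were presumably run on the per-pair Ka/Ks values, whereas this sketch uses rounded means and SDs and swaps in Welch's unequal-variance form of the t-test. The function name is my own.

```python
import math

# (number of gene pairs, mean Ka/Ks, SD) transcribed from Table 8.3
table_8_3 = {
    "S. cerevisiae": (89, 0.134, 0.166),
    "S. pombe": (22, 0.148, 0.234),
    "C. albicans": (34, 0.245, 0.224),
    "A. nidulans": (12, 0.491, 0.214),
    "P. marneffei": (29, 0.456, 0.231),
    "N. crassa": (9, 0.359, 0.276),
}


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom.

    Each argument is an (n, mean, sd) tuple of summary statistics.
    """
    n1, m1, s1 = sample_a
    n2, m2, s2 = sample_b
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


t, df = welch_t(table_8_3["P. marneffei"], table_8_3["S. cerevisiae"])
print(f"P. marneffei vs S. cerevisiae: t = {t:.2f}, df = {df:.1f}")
```

The resulting t and df can be looked up in a t table or passed to a statistics package to obtain a p-value.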

8.4.4 Ka/Ks between paralogs and orthologs

Ka/Ks is also used to estimate the selective constraints acting on or-

thologs. I therefore also characterised rates of synonymous and nonsyn-

onymous substitution of orthologs for each genome. By plotting Ka as

a function of Ks and superimposing data from paralogs onto those from

orthologs, we can get an overall view of how natural selection acts on two

groups of comparisons (Fig. 8.2).

In all species, overall Ka values are much smaller than Ks values,
which implies that the vast majority of duplicate genes are subject to
purifying selection. In C. albicans, A. nidulans and P. marneffei, gene
pairs with smaller Ks tend to gather around the diagonal line
(Ka/Ks = 1) and gene pairs with larger Ks tend to fall away from the
line. In these three species, recent duplicates thus appear to tolerate
more amino-acid replacement substitutions than older duplicates.

In mould species, the purifying selection acting on paralogs is weaker
than that acting on orthologs with the same level of sequence
divergence. As shown in Fig. 8.2, at the same level of Ks, most of the
open points lie below the clusters of closed points; that is to say, Ka
in paralogs is generally larger than that in orthologs in A. nidulans,
P. marneffei and N. crassa. On the other hand, there is no difference in
overall Ka/Ks between paralogs and orthologs in the yeasts S.
cerevisiae, S. pombe and C. albicans.

8.4.5 Relative evolutionary rate between paralogs

The two copies of a paralog pair may evolve at different rates. If most

paralog pairs evolve in such an asymmetric way, it may indicate that

Ohno’s neofunctionalisation theory is plausible. Therefore, as mentioned,

many studies on the relative evolutionary rates between paralogs have

been conducted. However, these studies have led to different conclusions.

Two critical aspects responsible for the success of such analyses are the

sensitivity of methods and the appropriateness of the outgroup used.

Here I used a method that incorporates a codon-based model. Gen-

erally speaking, methods relying on codon-based models (for example,

[111, 226]) are more sensitive than nucleotide-based tests and amino-

acid based tests, because, in the latter two, one cannot distinguish be-

tween silent substitutions and amino-acid replacement substitutions [59].

Codon-based models, however, take into account the ratio between the rates

of nonsynonymous and synonymous substitutions which gives a more di-

rect measure of the strength of selection or functional constraints on the

gene.

The major issue in choosing an outgroup is that the potential outgroup

cannot be too distant evolutionarily from the paralogs being studied, oth-

erwise, saturation in synonymous sites for many genes will interfere with

the power of the statistical test. To avoid this influence, Kondrashov et

al. [174] used a within-genome approach, since their study included four

highly diverged eukaryotic organisms, S. cerevisiae, A. thaliana, C. ele-

gans and D. melanogaster. By using the within-genome approach, they

identified outgroups of S. cerevisiae paralogs within the S. cerevisiae

genome itself. In addition, they required that the two paralogs be closer

in amino-acid sequence to each other than to the outgroup. This extra

condition, which probably led to an underestimate of asymmetric
divergence, was criticised by Conant and Wagner [59], who adopted a similar

within-genome approach in multiple eukaryotes.

In the selection of gene duplicates and their outgroups, I adopted a

method similar to that of Conant and Wagner [59]. The only modification

made was the search of all fungal genomes for outgroups, instead of using

the within-genome approach.

I identified a total of 163 triplets (composed of two paralogs and one

corresponding outgroup) which included 101 triplets based on paralogs

from S. cerevisiae, 6 from S. pombe, 50 from C. albicans, 2 from A.

nidulans, 3 from P. marneffei, and 1 from N. crassa.

Because the majority of triplets are from S. cerevisiae and C. albi-

cans, the following analysis has no power to distinguish differences among

species. Instead it can only be considered as a comprehensive analysis

dealing with the subject of ascomycetes as a whole.

I adopted the model of Goldman and Yang [111] (see Methods) in

the comparison of the relative rates in amino-acid substitution between

each of the paralogs. The result shows that, of a total of 163 analysed

gene pairs from the ascomycetes, 29 (17.8%) evolve at a significantly

(p < 0.05) different rate (Table 8.4). This figure includes 12 (11.9%) of

101 triplets in S. cerevisiae and 17 (32.7%) of 52 in C. albicans. In the

majority of cases, both paralogs evolved at approximately the same rate,

under a similar level of purifying selection.

In order to examine whether the Ka/Ks ratio is the factor causing
asymmetry in evolutionary rates between paralogs, I estimated the
asymmetry of Ka/Ks ratios between the two paralogs. A 2 × 2 χ² test
failed to reject the null hypothesis that a difference in Ka/Ks ratio
between paralogs is independent of a difference in their amino-acid
substitution rates (Table 8.4). That is to say, no association was
detected between different Ka/Ks ratios and different amino-acid
substitution rates.

Table 8.4: Amino-acid substitution rates versus Ka/Ks ratios in two
copies of duplicate genes. Columns show gene pairs with different or
equal amino-acid substitution rates between two paralogs; rows show
gene pairs with different or equal Ka/Ks ratios between two paralogs.

                         Different Ka   Equal Ka   Total
Different Ka/Ks ratio          3           10        13
Equal Ka/Ks ratio             26          124       150
Total                         29          134       163
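The independence test on Table 8.4 can be reproduced approximately as follows. This is a sketch under stated assumptions: it uses the Pearson χ² statistic for a 2 × 2 table without continuity correction (the text does not say whether a correction was applied), and the p-value again uses the one-degree-of-freedom closed form erfc(sqrt(x/2)). The function name is my own.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 d.f.) for the table

        a  b
        c  d

    computed without continuity correction.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Cell counts from Table 8.4: rows are different/equal Ka/Ks ratio,
# columns are different/equal amino-acid substitution rate (Ka).
chi2, p = chi_square_2x2(3, 10, 26, 124)
print(f"chi-square = {chi2:.3f}, p = {p:.3f}")
```

With these counts the statistic is small and the p-value is far above 0.05, in line with the failure to reject independence. One expected cell count (13 × 29 / 163, about 2.3) is below 5, so an exact test might be preferred as a cross-check.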

8.5 Discussion

This study took advantage of the availability of genome sequences of P.
marneffei and five other ascomycetes, S. cerevisiae, S. pombe, C.
albicans, A. nidulans and N. crassa. It also relied on the recent
development of methods to analyse selective constraints on duplicate
genes in each genome. Given the considerable phenotypic variation
between the two distinct groups of ascomycetes, yeasts and moulds, I
speculated that gene duplication may play an evolutionary role at
different levels and that the selection patterns of duplicate genes may
differ. To my knowledge, no similar analysis has been conducted in
fungi, despite several genome-level studies on gene duplications using
S. cerevisiae as one of their model eukaryotic organisms [95].

8.5.1 Gene duplication in ascomycetes is highly diverse

Most genomes show a certain degree of redundancy caused by single-

gene duplication, chromosomal segment duplication or complete genome

duplication (through polyploidisation). So do the ascomycetes I studied.

S. cerevisiae S. cerevisiae has the largest amount of gene redundancy
among all the ascomycetes I analysed. Previous studies have revealed
that its genome contains approximately 55 large duplicated chromosomal
regions [345]. It has been widely accepted that the duplicated regions
found in the modern Saccharomyces species are probably the result of
a whole-genome duplication (tetraploidisation) approximately 10^8 years
ago [95, 250, 279, 280, 345]. This proposed genome duplication might
coincide with the origin of the ability to grow under anaerobic
conditions, one of the most striking physiological differences between
S. cerevisiae and other yeasts.

S. pombe S. pombe and S. cerevisiae have been separated for as long

as 420 million years [289]. Comparing the two yeasts, S. pombe has

fewer gene duplications than S. cerevisiae, which may account in part

for the smaller genome size. Transposable elements exist in the S. pombe

genome. However, their proportion is low compared to S. cerevisiae.

Using phylogenetic analysis, Hughes and Friedman [136] suggested that
parallel gene duplication appears to have played a role in the
independent origin of similar adaptations in the two unicellular fungi,
S. pombe and S. cerevisiae. That is to say, gene duplications have
occurred independently in the same gene families in S. pombe and S.
cerevisiae; S. pombe has adapted to a similar unicellular lifestyle
without polyploidisation.

C. albicans The age distribution of relatively young duplicate genes
(Ks < 5) in C. albicans (Fig. 8.1) suggests that duplication events
have probably occurred continuously during the course of evolution in
this yeast. In neither S. pombe nor C. albicans has any evidence
suggesting polyploidisation, such as duplicated genomic blocks, so far
been found. Hence, genome duplication of the kind that happened in S.
cerevisiae, which may represent an extreme adaptive strategy for
providing genetic raw material for the functional divergence of novel
genes, does not appear to have occurred in C. albicans.

A. nidulans A. nidulans contains a relatively large number of recently
duplicated gene pairs: 43 in total with Ks < 0.5. The age distribution
of duplicate genes (Ks < 5) in A. nidulans displays a high peak at
Ks = 0.1 to 0.2 and shows a pattern similar to that in S. cerevisiae
(Fig. 8.1). However, S. cerevisiae has undergone genome duplication,
and the extensive duplicated blocks in its genome are traces of the
proposed ancient tetraploidy that remain detectable after widespread
deletion of superfluous duplicate genes and sequence divergence. Most
gene pairs in these duplicated regions are believed to have been
produced simultaneously or within a narrow time frame [95]. Based on
the similar patterns of age distribution of gene pairs in A. nidulans
and S. cerevisiae, I tentatively propose that duplicate genes in A.
nidulans originated through one or more episodic, large-scale gene
duplications in a relatively short period of time. What is uncertain is
whether such a peak of gene duplication over the course of evolution
implies a polyploidisation event in A. nidulans. As noted by Friedman
and Hughes [95], a peak of gene duplication need not imply a
polyploidisation event. Therefore, it would be interesting to know how
many duplicated blocks are present within and between A. nidulans
chromosomes once the genome sequencing of A. nidulans is completely
finished.

P. marneffei Slightly fewer genes in P. marneffei belong to multigene
families than in A. nidulans. However, 52 pairs are young duplicate
genes, compared with 43 in A. nidulans. The overall extent of gene
duplication is therefore similar in these two closely related species.
The pattern of the Ks histogram is broadly similar to those of A.
nidulans and S. cerevisiae. One difference is the dual-peak pattern,
which seemingly implies that, besides the modern duplications, there
was an ancient large-scale duplication. The modern peak is at a similar
location, Ks = 0.1, to that of A. nidulans and other fungi, but on a
smaller scale (fewer than 25% of genes belong to this peak) than that
of A. nidulans. In contrast, the second peak at Ks = 2.0 to 4.0 is more
apparent than in other fungi except N. crassa. More evidence is needed,
though, before any solid conclusion can be reached.

N. crassa N. crassa exhibits much greater morphological and
developmental complexity than the yeasts. Its genome is approximately
three times the size of the S. cerevisiae genome, and accordingly it
has a much larger protein count than the yeasts. However, the paucity
of duplicate genes in N. crassa is obvious: (1) the number of multigene
families in N. crassa is much smaller than that in yeast, and (2) the
number of gene pairs with a small Ks (0.05 < Ks < 0.5) in N. crassa is
much smaller than those in unicellular yeasts (Table 8.1). An
extraordinary feature of N. crassa, repeat-induced point (RIP) mutation
[219], has been suggested to play a major role in preventing gene
innovation through gene duplication and to be responsible for this
paucity. RIP, acting as a defence against mobile DNA [219], can detect
and mutate both copies of a sequence duplication. In fact, RIP is so
efficient that all gene duplications remaining in the N. crassa genome
have been proposed to have arisen and become fixed before the emergence
of the RIP mechanism. Examples of multigene families that may have
‘survived’ RIP include hexose transporters and cellulases (Table 8.2).
N. crassa may have other mechanisms of gene innovation, since gene
duplication has rarely occurred in its genome.

Ascomycetes display a wide variation in the number of gene duplication
events. This may have provided the foundation for the specialisation of
a number of genes and their corresponding proteins, and formed the basis
for diversification. Amplification of their genetic material might
increase their fitness in adapting to the environment. Examples include
genes for the yeast hexose transporters, which increase fitness at low
glucose concentrations; genes for N. crassa cellulases, which allow
growth on decaying plant material; and genes for cytochrome P450 and
efflux systems involved in detoxification.

8.5.2 Different selective constraints in yeasts and filamentous ascomycetes

There are different models, such as the classical model and the
duplication-degeneration-complementation (DDC) model, to explain the
creation of novel genes by gene duplication. The classical model
emphasises that one copy is neutral and free to evolve while the other
remains under selective pressure. The DDC model [90] explains
sub-functional divergence after a gene has been duplicated. According
to the DDC model, the two gene copies acquire complementary
loss-of-function mutations in independent sub-functions. Thus both
genes are required to produce the full complement of functions of the
single ancestral gene. Both the classical model and the DDC model
predict a period immediately following duplication when the genome
should be able to tolerate a high degree of nonsynonymous substitution
in one member of a duplicate pair, because the other member is still
functioning at full strength.

Comparing Ka with Ks in each genome, I found a common pattern

in all fungi which is in partial agreement with these theoretical expec-

tations. First, in either filamentous fungi or yeasts, purifying selection

was dominant against amino acid changes in paralogous genes. This

confirms the earlier observation that paralogs evolve under purifying se-

lection [211], which challenges the classical model but supports the DDC

model. Second, recent duplicates with smaller Ks appear to tolerate

more replacement amino-acid substitutions than older duplicates, which

is compatible with both models.

I also found two patterns exclusive to filamentous fungi. The first
finding is that there are significantly (p < 0.01) higher values of the
Ka/Ks ratio in paralogs in moulds than in those in yeasts with a
similar level of divergence (Table 8.3). Filamentous fungi show greater
morphological and developmental complexity than do yeasts, and their
genomes are normally larger. As gene duplication is a source of novel
protein functions, the bigger genome size may partially result from
frequent gene duplications, which provide a basis for divergence and
result in an increase in novel genes through neofunctionalisation, or
an increase in gene number through subfunctionalisation. Therefore, the
higher value of the Ka/Ks ratio in paralogs in moulds may imply that,
at a similar stage after duplication, gene pairs in filamentous fungi
have faster evolutionary rates than those in yeasts. Either positive
selection or relaxed functional constraint can cause a higher value of
the Ka/Ks ratio. Few gene pairs in moulds are actually found to be
under positive selection when Ka/Ks > 1 is used as the indicator of
positive selection. Thus, a slightly elevated Ka relative to Ks,
consistent with relaxed functional constraint, accounts for the larger
value of Ka/Ks in gene pairs in moulds.

Another interesting finding is that paralogs in A. nidulans, P. marn-

effei and N. crassa appear to be under weaker functional constraint than

orthologs at the same age. In other words, orthologs in moulds expe-

rience stronger functional constraints than paralogs. Natural selection

seems to allow paralogs in these three filamentous fungi to mutate with

less constraint, which may lead to more advantageous mutations. This

phenomenon was first observed in eukaryotes [174] but it has not been

reported in fungi. Note that this trend is not observed in the unicellular

yeasts, S. cerevisiae, S. pombe, and C. albicans. Therefore, it is suggested

that elevated functional constraint in orthologs or weaker functional con-

straint in paralogs is a more common feature in the evolutionary pattern

of multicellular eukaryotes.

8.5.3 Majority of paralogous genes evolve symmetrically

Estimation of asymmetric evolutionary rates was conducted mainly on
paralogs from S. cerevisiae and C. albicans, so the result should not
be applied to other species. Of a total of 163 analysed gene pairs, 29
(17.8%) evolve at significantly (p < 0.05) different rates (Table 8.4).
Therefore, in the majority of cases, at least in S. cerevisiae and C.
albicans, both paralogs evolved at approximately the same rate, under
similar levels of purifying selection.

Several similar studies have been done in S. cerevisiae and in several
other eukaryotes. Some concluded that both copies of a duplicate gene
typically evolved at the same rate [137,174,265], whereas others
suggested that asymmetric divergence between two paralogs is not
uncommon. Because different organisms were used in those studies and
different methods with varying sensitivities were applied, it is hard
to compare the data in this study with others directly. For instance,
Kondrashov et al. [174] selected 15 S. cerevisiae gene triplets and,
using a distance-based method, found no paralogs showing different
rates. In another study, Conant and Wagner [59] identified six of 22
(27%) gene triplets in S. cerevisiae, and three (21%) of 14 in S.
pombe, that showed asymmetry in Ka, using a codon-based model following
Muse and Gaut [226].

An asymmetric evolutionary rate is not always associated with an

asymmetric evolutionary constraint, as indicated by Ka/Ks. Moreover,

no simple dependence between evolutionary rate and gene function is

observed (data not shown). This finding is inconsistent with Zhang’s

finding in young paralogs of human genes [371], that genes with different

Ka/Ks ratios tend to evolve at different rates, suggesting that different

functional constraints might be largely responsible for the unequal evo-

lutionary rates. The incongruence may be again due to the difference in

species used in the studies.

In conclusion, this chapter reports the variation in the extent of gene

duplications in ascomycetes. The age distribution of gene duplications

tentatively suggests that the P. marneffei genome has experienced a

recent as well as an ancient large-scale duplication. Analysis of the di-

vergence of evolutionary rates in S. cerevisiae and C. albicans revealed

that less than 20% of gene pairs in these two yeasts show asymmetric

divergence patterns in amino-acid substitutions. I speculate that the dif-

ferent extent and evolutionary pattern of duplicate genes in ascomycetes

might be associated with their genotypical and phenotypical differences.

Chapter 9

ACCELERATED EVOLUTIONARY RATE MAY BE

RESPONSIBLE FOR THE EMERGENCE OF

LINEAGE-SPECIFIC GENES

Once the genome of Penicillium marneffei became available, genes could
be predicted and annotated. Hundreds of these predicted genes lack
homology to any known gene. They are species-specific genes, also called
“orphan” genes. Where do these genes come from? This is still a
mystery. One suggestion has been that most orphan genes evolve so
rapidly that similarity to other genes cannot be traced beyond a certain
evolutionary distance. This can be tested by examining the divergence
rates of genes with different degrees of lineage specificity. Here the
lineage specificity (LS) of a gene describes the phylogenetic
distribution of that gene’s orthologs in related species. Highly
lineage-specific genes will be distributed in fewer species in a
phylogeny.

In this chapter, I used the complete genomes of seven ascomycetes

and two animals to define several levels of LS, such as, Eukaryotes-core,

Ascomycota-core, Euascomycetes-specific, Hemiascomycetes-specific,
Aspergillus-specific and Saccharomyces-specific. The rates of gene
evolution in groups with higher LS are compared with those in groups
with lower LS.

Molecular evolutionary analyses indicate a significant increase in nonsyn-

onymous nucleotide substitution rates in genes with higher LS. Multiple

regression analyses suggest that LS is significantly correlated with the

evolutionary rate of the gene. This correlation is stronger than those of a

number of other factors that have been proposed as predictors of a gene’s

evolutionary rate, including the expression level of genes, gene essential-

ity or dispensability and the number of protein-protein interactions. The

significantly accelerated evolutionary rates of genes with higher LS may

reflect the influence of selection and adaptive divergence during the emer-

gence of orphan genes. These analyses suggest that accelerated rates of

gene evolution may be responsible for the origin of apparently orphan

genes.

This chapter is very closely based on a paper I have published with
colleagues [in press]. The original draft of the manuscript was revised
by Dr. David K. Smith of the Department of Biochemistry, HKU. A
preliminary version of this work was presented at the SMBE conference
on 17th June 2004.

9.1 Introduction

During annotation of genome sequences a substantial fraction of the puta-

tive genes are found to lack sequence similarity to any of the genes in pub-

lic databases. These genes or protein-coding regions have been referred

to as “orphan” genes. Some may have crucial organism-specific func-

tions, however, the origin and evolution of orphan genes remain poorly

understood. A proposed explanation of this problem has been that some

genes evolve so rapidly that their homologs cannot be discovered over

larger evolutionary distances. Although this has been supported by re-

cent findings in Drosophila that orphan genes evolve, on average, more

than three times faster than non-orphan genes [73], the influence of other

factors on the evolutionary rate of genes should be taken into account.

These factors include the expression level of genes [127,241], a gene’s

dispensability (the organism’s fitness after deletion of the gene) [178],

gene essentiality [343], gene duplication [150, 357], and the number of

protein-protein interactions involving the gene’s product [93, 335]. Due

to the inherently stochastic property of evolutionary rates, the influence

of many of these factors has proved difficult to confirm and their relative

importance also needs further elaboration.

In order to systematically examine the relationship between a gene’s

evolutionary rate and the origin of orphan genes, as well as to assess the

influence of other factors, we have devised a study based on the following

rationale. Orthologs of a gene usually have a particular phyletic distri-

bution in several related species, thus giving each gene a certain lineage

specificity (LS). Orphan genes represent the extreme of LS because they

are only present in one node of a phylogeny. In contrast, highly con-

served genes have a low degree of LS and are widely distributed, while a

range of different degrees of LS can be defined for other gene groups. If

an elevated evolutionary rate is the major cause of the origin of orphan

genes, one should find a correlation between evolutionary rate and LS.

Slower evolving genes should tend to be less lineage specific.

Studying the relationship between the evolutionary rate of genes and

LS may reveal the dynamic processes that lead to the origin of species

specific, or orphan, genes. It can also be tested whether the evolutionary

rate leading to the emergence of orphan genes is relatively constant or

highly variable. If genes become lineage-specific gradually, one might

expect a simple relationship (e.g., a linear relationship, perhaps after data

transformation) between divergence time and genetic distance, otherwise,

a more complex relationship would be expected.

To investigate these matters, the complete sets of predicted
protein-coding genes from Aspergillus fumigatus
(http://www.sanger.ac.uk/Projects/A_fumigatus/)
and Saccharomyces cerevisiae [110] were extracted. Orthologs of these
genes from five other ascomycotan fungi, Aspergillus nidulans
(http://www.broad.mit.edu/annotation/fungi/aspergillus/),
Schizosaccharomyces pombe [354], Candida albicans [65], Neurospora
crassa [101] and Saccharomyces mikatae, and two metazoans,
Caenorhabditis elegans [79] and Drosophila melanogaster [2], were also
obtained.
The fungi studied here represent three major classes of Ascomycetes:
Euascomycetes, Hemiascomycetes and Archaeascomycetes. The
Euascomycetes, which contain well over 90% of Ascomycota, include
Aspergillus and Neurospora. The Hemiascomycetes include the
Saccharomyces yeasts and Candida. The fission yeast, S. pombe, belongs
to the class Archaeascomycetes, whose members are only distantly
related to each other and are possibly remnants of an early radiation
of Ascomycota [289]. These fungi also represent two major fungal
morphological subdivisions, yeasts and moulds. Yeasts, like S.
cerevisiae, S. mikatae, C. albicans, as well as S. pombe, have life
cycles characterised by unicellular (occasionally dimorphic) growth. In
contrast, the filamentous ascomycota, A. nidulans, A. fumigatus and N.
crassa, predominantly grow as hyphal filaments. Despite such
morphological divergence, all of them share a relatively recent common
ancestor with respect to the rest of the eukaryotes. The phylogeny of
these ascomycota is clear and generally accepted, except for the
position of the ancient Schizosaccharomyces, S. pombe [289].

Genes from S. cerevisiae and A. fumigatus were classified, according

to their phylogenetic profiles, into several LS groups as follows: Eukaryote-

core, Ascomycota-core, Euascomycetes-specific, Hemiascomycetes-specific,

Aspergillus-specific and Saccharomyces-specific. Average nonsynonymous

substitution rates, Ka, of genes among LS groups were compared and

correlations between LS and several other factors, for example, gene ex-

pression level, gene dispensability and gene redundancy, were explored.

The relative importance of LS and other factors, in terms of the pre-

diction of a protein’s evolutionary rate, were evaluated and whether the

divergence rate is relatively constant over genes with similar degrees of

LS was tested.


From a gene-centric viewpoint, our understanding of evolutionary novelties is largely confined to the consequences of new gene creation. This phenomenon has recently attracted attention in genome studies, yet its mechanism remains a mystery. Some insights have been obtained, especially by studying newly created genes (i.e., young genes) [210, 257, 204]. A number of mechanisms that may be responsible for new gene origination have been proposed. These include gene duplication, exon shuffling, retroposition, lateral gene transfer and transposable element assimilation (for review, see [204]). Gene duplication has been reviewed in Chapter 8.

Here I focus only on the origination of exons, the basic units of genes. Once exons exist, exon shuffling (the recombination or exclusion of exons) is widely recognised as important in the generation of new genes [109, 244, 155]. Three possible processes have been proposed for the creation of new exons: (1) exaptation of transposable elements [27, 215, 230, 293], (2) exon duplication [172, 194], and (3) exonisation of intronic sequences [173].

Exaptation of transposable elements is a process in which a retroelement takes on a new function for a genome. It was first exemplified by the integration of an Alu element into the coding portion of the human decay-accelerating factor (DAF) gene [215]. In another example, an L1 retrotransposon insertion that provides a premature stop codon and polyadenylation sites is responsible for the generation of the secreted form of the human transmembrane protein attractin [305]. More recently, about 4% of human genes were found to contain transposable elements in their coding regions [230]. Exon duplication has been reported in searches of the human, fly and worm genomes, where about 10% of all genes contain tandemly duplicated exons. These exons are likely to be involved in mutually exclusive alternative splicing events, which might confer further evolutionary potential [194]. Exonisation of intronic sequences is the most easily conceived mechanism, but few examples of such a process have been reported [173]. Wang et al. [339] identified newly evolved exons by comparing ESTs against an outgroup to learn how new exons originate and evolve, and how often they appear. They claim that the new exon origination rate is about 2.71 × 10^-3 per gene per million years and that a much higher proportion of new exons than of old exons have Ka/Ks ratios > 1.

It is noteworthy that the gene origination processes mentioned above do not necessarily create new genes with novel functions; instead, they may yield new variants of existing genes [369]. Moreover, newly evolved genes often show elevated evolutionary rates driven by positive selection [205, 235, 147, 338, 369].


9.3.1 Sequences and data sets

Table 9.1: Genomic sequence sources.

Species            Web source for the sequence data
A. nidulans        www-genome.wi.mit.edu/annotation/fungi/aspergillus/
A. fumigatus       www.sanger.ac.uk/Projects/A_fumigatus
N. crassa          www-genome.wi.mit.edu/annotation/fungi/neurospora/
S. cerevisiae      genome-www.stanford.edu/Saccharomyces
S. mikatae         ftp://genome-ftp.stanford.edu/pub/yeast/data_download/sequence/fungal_genomes/S_mikatae
C. albicans        genolist.pasteur.fr/CandidaDB
S. pombe           www.genedb.org/genedb/pombe/index.jsp
C. elegans         www.sanger.ac.uk/Projects/C_elegans/wormpep
D. melanogaster    www.fruitfly.org

For each Ascomycotan, the complete set of available amino acid sequences and coding DNA sequences was downloaded from the repositories given in Table 9.1. All known or suspected pseudogenes and genes in mitochondrial genomes were removed. The S. mikatae dataset is derived from the ORF predictions of Cliften et al. [56].

[Figure 9.1: species tree with tips A. nidulans, A. fumigatus, N. crassa, S. cerevisiae, S. mikatae, C. albicans, S. pombe and animals; nodes labelled 1,458, 1,144, 1,085, 841, 670 and ~10; presence/absence profiles shown for the Eukaryotes-core, Ascomycota-core, Euascomycetes-specific, Hemiascomycetes-specific, Aspergillus-specific and Saccharomyces-specific groups.]

Figure 9.1: LS classification based on phylogenetic profiles of genes. Divergence times were adopted from Hedges and Kumar [131]. The divergence times between S. cerevisiae and S. mikatae and between A. fumigatus and A. nidulans are based on the estimates by Cliften et al. [56] and [87], respectively. A solid square (■) means the gene is present in the corresponding species; an open square (□) means it is absent.

Yeast gene expression data came from Cho et al. [51] who charac-

terised all mRNA transcript levels during the cell cycle of S. cerevisiae.

mRNA levels were measured at 17 time points at 10 min intervals, cover-

ing nearly two full cell cycles. The mean of these 17 numbers was taken

to produce one general time-averaged expression level for each protein.

Protein dispensability was assessed by the fitness effect of a single-

gene deletion, as measured by the average growth rate of the knockout

strain in several types of media. The results of assays of a nearly complete

set of single gene deletions in S. cerevisiae [297] were obtained, and the

data were manipulated following the method by Gu et al. [119]. Briefly,
the fitness value f_i is defined as r_i/r_pool, where r_i is the growth rate of
the strain with gene i deleted and r_pool is the pooled average growth rate
of different strains.
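The fitness definition above can be sketched in a few lines of Python. This is only an illustration: the gene names and growth rates are hypothetical, and r_pool is assumed here to be the plain mean over all supplied strains, which may differ from the pooling used by Gu et al. [119].

```python
def fitness_values(growth_rates):
    """Return f_i = r_i / r_pool for each single-gene deletion strain.

    growth_rates maps a gene name to the growth rate r_i of the strain
    lacking that gene. r_pool is assumed (for illustration) to be the
    arithmetic mean of all supplied growth rates.
    """
    if not growth_rates:
        return {}
    r_pool = sum(growth_rates.values()) / len(growth_rates)
    return {gene: r_i / r_pool for gene, r_i in growth_rates.items()}


# Hypothetical growth rates for three deletion strains.
example_fitness = fitness_values({"geneA": 0.9, "geneB": 1.0, "geneC": 1.1})
```

A value below 1 then marks a deletion that grows more slowly than the pooled average.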

Essential genes were taken from the dataset of the Saccharomyces Genome Deletion Project (http://www-sequence.stanford.edu/group/yeast_deletion_project/), which contains 1,106 essential genes. Although gene dispensability and gene essentiality are highly associated, they were treated as two separate variables in order to compare the results of each variable to previous studies.

A list of protein-protein interactions among S. cerevisiae proteins

was obtained from two integrated interaction databases, YEAST GRID

[25] and the yeast subset of DIP [274], and a number of major high-

throughput studies published to date [106]. The final non-redundant set

contains 252,011 interactions involving 5,698 proteins.

9.3.2 Identification of orthologs

Orthologs of the genes from S. cerevisiae and A. fumigatus in each other

and in other genomes studied here were identified by the automatic clus-

tering method INPARANOID [261]. Orthologs between the genomes

of two species are derived in this method from mutual best pairwise

BLASTP hits. A further reciprocal test was applied by requiring the

longest region of local sequence similarity between putative orthologs to

cover ≥ 80% of each sequence and to have ≥ 30% sequence identity in

this region. 113 pairs that did not pass this test were excluded. A gene

was considered as being absent from another genome if no sequence sim-

ilarity could be detected between the gene and the genes in that genome.

To define the level at which sequence similarity was not detectable, a
TBLASTN expectation (E) value of 1 × 10^-2 with respect to a fixed effective
search space (set to the size of the N. crassa genome) was used as a
cut-off.
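The additional reciprocal test on putative orthologs can be sketched as follows. The function name and its inputs are hypothetical; it assumes that the number of residues of each sequence covered by the longest local alignment, and the fractional identity within that region, have already been parsed from the BLASTP output.

```python
def passes_reciprocal_test(aligned_a, aligned_b, len_a, len_b, identity,
                           min_cover=0.80, min_identity=0.30):
    """Check the coverage and identity thresholds for one ortholog pair.

    aligned_a, aligned_b: residues of each sequence inside the longest
        region of local similarity (hypothetical, pre-parsed inputs).
    len_a, len_b: full lengths of the two protein sequences.
    identity: fractional sequence identity within the aligned region.
    """
    cover_a = aligned_a / len_a
    cover_b = aligned_b / len_b
    return (cover_a >= min_cover
            and cover_b >= min_cover
            and identity >= min_identity)
```

For example, a pair whose alignment spans 90 of 100 and 85 of 100 residues at 45% identity would pass, whereas the same pair at 25% identity would be excluded.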

Orthologs of fast-evolving genes may not be detected in their more dis-

tantly related genomes by the TBLASTN search used above. To address

this, ancestral sequence(s) were constructed (Collins et al. [58]), based

on the detected orthologs, using the maximum likelihood method imple-

mented in the PAML phylogenetic analysis package version 3.13d [359].

Ancestral sequences are expected to be less divergent from their pos-

sible orthologs in the more distant genomes and their reconstructions

were used to search, as above, for orthologs in the more distantly related

genomes. If potential orthologs were identified, the gene was excluded

from further analysis to avoid ambiguity in the assignment of genes to

LS groups.

9.3.3 Classification of genes into LS groups

Phylogenetic profiles, a gene table giving 1 (or 0) if a gene is present in (or

absent from) a genome, for the genes from S. cerevisiae and A. fumigatus,

were constructed based on the detected orthologs in the genomes studied.

The genes were then classified into the different LS groups, Eukaryotes-

core (present in all genomes studied), Ascomycota-core (present in all fun-

gal genomes), Hemiascomycetes-specific, Euascomycetes-specific, Saccharomyces-

specific and Aspergillus-specific (Fig. 9.1). The phylogenetic tree relating

the species was derived from [131].
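A minimal sketch of how a presence/absence profile might be mapped to an LS group is given below. It assumes a strict rule, namely that a gene is assigned to a group only when it is present in every genome of that group and absent from all others; the thesis does not spell out how patchy profiles were handled, so that rule and the identifier names are illustrative assumptions.

```python
ANIMALS = frozenset({"C. elegans", "D. melanogaster"})
EUASCOMYCETES = frozenset({"A. fumigatus", "A. nidulans", "N. crassa"})
HEMIASCOMYCETES = frozenset({"S. cerevisiae", "S. mikatae", "C. albicans"})
FUNGI = EUASCOMYCETES | HEMIASCOMYCETES | {"S. pombe"}

# Each LS group is described by the exact set of genomes carrying the gene.
LS_GROUPS = [
    ("Eukaryotes-core", FUNGI | ANIMALS),
    ("Ascomycota-core", FUNGI),
    ("Euascomycetes-specific", EUASCOMYCETES),
    ("Hemiascomycetes-specific", HEMIASCOMYCETES),
    ("Aspergillus-specific", frozenset({"A. fumigatus", "A. nidulans"})),
    ("Saccharomyces-specific", frozenset({"S. cerevisiae", "S. mikatae"})),
]


def classify_ls(profile):
    """Map a phylogenetic profile {species: 1 or 0} to an LS group name.

    Returns None when the profile matches no group exactly (assumed
    behaviour for patchy profiles, not taken from the thesis).
    """
    present = frozenset(sp for sp, flag in profile.items() if flag)
    for name, members in LS_GROUPS:
        if present == members:
            return name
    return None
```

Under this rule a gene found in both Aspergillus genomes but not in N. crassa is labelled Aspergillus-specific, while a gene found only in A. fumigatus and S. pombe is left unclassified.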

9.3.4 Divergence Times

Lineage divergence times are somewhat controversial [285]. In this work

divergence times were taken from [130] and [131]. These give the following

divergence times (Fig. 9.1): Animals vs Fungi, 1576 Mya; Fungi vs As-

comycetes, 1144 Mya; Saccharomyces and Candida vs Aspergillus, 1085

Mya; Candida vs Saccharomyces, 841 Mya; Neurospora vs Aspergillus,

670 Mya. Divergence times for S. cerevisiae vs S. mikatae and A. fumi-

gatus vs A. nidulans were taken as ∼10 Mya.

To convert LS into numeric form to calculate correlations with other

properties, the ratio of the time of the animal-fungi divergence to that

of the divergence of a lineage from its last common ancestor was used.

For example, the Eukaryotes-core value is 1 (1458/1458) while that of

Ascomycota-core is 1.27 (1458/1144). The final results were not sensitive

to changes in the divergence time estimates used for this category to

numeric conversion.
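The category-to-numeric conversion is a simple ratio, sketched below with the two values worked in the text. The function name is hypothetical, and only the Eukaryotes-core and Ascomycota-core cases are shown because those are the only ones the text evaluates explicitly.

```python
def ls_score(t_animal_fungi, t_lineage):
    """Numeric LS: animal-fungi divergence time over the lineage's own
    divergence time from its last common ancestor (both in Mya)."""
    if t_lineage <= 0:
        raise ValueError("divergence time must be positive")
    return t_animal_fungi / t_lineage


# Values from the worked example in the text.
eukaryotes_core = ls_score(1458, 1458)
ascomycota_core = ls_score(1458, 1144)
```

Younger lineages have smaller denominators and therefore larger scores, so the score increases with the degree of LS.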

9.3.5 Estimation of substitution rates and statistical analyses

The number of synonymous substitutions per synonymous site, Ks, and

the number of nonsynonymous substitutions per nonsynonymous site,

Ka, were estimated between A. fumigatus-A. nidulans ortholog pairs and

S. cerevisiae-S. mikatae ortholog pairs in the Euascomycetes and Hemi-

ascomycetes lineages respectively. For each ortholog pair, the ortholo-

gous protein sequences were aligned using ClustalW version 1.82 with the

default parameters. The corresponding nucleotide-sequence alignments
were derived by substituting the respective coding sequences for the
aligned protein sequences using MBEToolbox (Chapter 10 [35]). Ks and Ka
were then estimated by the maximum-likelihood method implemented in
the CODEML program of PAML [359].
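Deriving a codon alignment from a protein alignment amounts to replacing each aligned residue with its codon and each gap with three gap characters. The Python sketch below illustrates the idea only; it is not the MBEToolbox implementation (which is written in Matlab), and the function name and the handling of a trailing stop codon are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def thread_codons(aligned_protein, cds, gap="-"):
    """Back-thread a coding sequence onto its aligned protein sequence.

    aligned_protein: protein sequence containing gap characters.
    cds: the unaligned coding DNA sequence of the same gene.
    Returns the codon-level alignment string for this sequence.
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    # Tolerate a terminal stop codon that has no residue in the protein.
    if len(codons) == n_residues + 1 and codons[-1].upper() in STOP_CODONS:
        codons = codons[:-1]
    if len(codons) != n_residues:
        raise ValueError("CDS and protein lengths do not match")
    codon_iter = iter(codons)
    return "".join(gap * 3 if aa == gap else next(codon_iter)
                   for aa in aligned_protein)
```

Applying the function to both members of an ortholog pair yields two equal-length nucleotide strings that preserve the reading frame, as required for codon-based estimation of Ka and Ks.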

High apparent sequence divergence, as shown by high Ks or Ka values,

is often associated with problems such as difficulty in alignment, or dif-

ferences in codon usage bias or nucleotide composition in the sequences.

Ortholog pairs with Ks < 0.05 may include too few substitutions to

provide a statistically significant measure of change [371]. To accurately

measure the intensity of selective forces acting on a protein, only ortholog

pairs with Ka ≤ 2 and 0.05 ≤ Ks ≤ 2 were used. Similar results were

obtained when more relaxed cutoffs for Ka and Ks (≤ 5) were used (data

not shown). All known ribosomal protein genes were excluded from the

data set as their high level of conservation gives them substantially lower

average values of Ka, Ks and Ka/Ks than those for the rest of the genes.
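The rate cut-offs described above can be expressed as a small filter. The pair identifiers and rate values in the example are hypothetical; the default thresholds are those stated in the text.

```python
def keep_pair(ka, ks, ka_max=2.0, ks_min=0.05, ks_max=2.0):
    """Return True if an ortholog pair satisfies Ka <= 2 and 0.05 <= Ks <= 2."""
    return ka <= ka_max and ks_min <= ks <= ks_max


# Hypothetical (pair id, Ka, Ks) records.
records = [("pair1", 0.05, 1.40), ("pair2", 0.10, 0.01), ("pair3", 2.50, 1.00)]
retained = [pid for pid, ka, ks in records if keep_pair(ka, ks)]
```

Passing `ka_max=5.0, ks_max=5.0` gives the more relaxed upper cut-offs mentioned in the text.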

Statistical regression analyses were performed by referring to the pro-

cedure described by Rocha and Danchin. Since the linear regression

model works better with normal variables, the scatter plots of Ka by

other variables were examined to determine whether linear models are

reasonable for these variables. It was necessary to transform the values

of Ka, expression level and fitness of gene deletion into their logarithmic

forms to give a distribution closer to a normal distribution. For the same

reason, log(Ka) values were used in the correlation and partial correlation

analyses.

9.3.6 Detection of rate variability across species - Relative Divergence

Score (RDS)

To measure the degree of divergence of genes in a species away from orthologs in other species, TBLASTN comparisons for all proteins in the A. fumigatus or S. cerevisiae genomes were run against all DNA sequences in the 9 genomes studied here. The relative divergence score (RDS) was defined as $D_{A,B} = -\ln(S_{A,B}/S_{A,A})$, where $S_{A,B}$ is the TBLASTN bit score for the query protein from genome A and subject genome B, and $S_{A,A}$ is the self-self score of the query. Such scores range from 0 (identical proteins found in the subject genome) to infinity (no significant hit found). For genes belonging to each LS group, and to the relevant species at each divergence time point, 10,000 bootstrapped medians of random samples were taken from the RDS values of the genes. The mean of the bootstrapped medians was used as the estimated RDS of the LS group.
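The RDS calculation and the bootstrap summary can be sketched as below. The function names are hypothetical, and the sketch assumes that each bootstrap sample is drawn with replacement and has the same size as the original set of values, which the text does not state explicitly.

```python
import math
import random
import statistics


def rds(score_ab, score_aa):
    """Relative divergence score D = -ln(S_AB / S_AA).

    A missing or non-positive hit score is treated as 'no significant
    hit' and gives an infinite score.
    """
    if score_ab is None or score_ab <= 0:
        return math.inf
    return -math.log(score_ab / score_aa)


def bootstrap_median(values, n_boot=10000, seed=0):
    """Mean of the medians of n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(values)
    medians = [statistics.median(rng.choices(values, k=n))
               for _ in range(n_boot)]
    return statistics.mean(medians)
```

Note that a group containing many genes without hits (infinite RDS) can produce an infinite median, so such values may need to be handled separately before resampling.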

9.4 Results

9.4.1 Evolutionary rate differences among LS groups

The Ascomycotan fungi used in this study represent two distinct fun-

gal groups: Euascomycetes (A. nidulans, A. fumigatus and N. crassa)

and Hemiascomycetes (S. cerevisiae, S. mikatae and C. albicans) and

the more distantly related fission yeast, S. pombe. Data from the two

groups, Euascomycetes and Hemiascomycetes, were processed separately.

For the Euascomycetes sequences, we predicted 6,432 A. fumigatus-A.

nidulans orthologs and calculated the nonsynonymous substitution rate,

Ka, and the synonymous substitution rate, Ks, for each gene pair. We

then classified the predicted orthologs into the following groups: (1)

Eukaryotes-core, (2) Ascomycota-core, (3) Euascomycetes-specific and

(4) Aspergillus-specific, according to the phylogenetic profiles of A. fu-

migatus genes. The Hemiascomycetes sequences gave 3,707 pairs of

S. cerevisiae-S. mikatae orthologs which were processed similarly and

classified into four groups: (1) Eukaryotes-core, (2) Ascomycota-core,

(3) Hemiascomycetes-specific and (4) Saccharomyces-specific. Thus, LS

groups from (1) to (4) represent increasingly more recent times of origin.

Filtering steps of (1) removing ortholog pairs with Ks, Ka > 2 or Ks < 0.05, (2) excluding ribosomal proteins, and (3) eliminating genes where possible similarity to a reconstructed ancestral sequence was found, were applied to the data set. Step 3 removed only 3 gene pairs, 2 in the Hemiascomycetes lineage and 1 in the Euascomycetes lineage, which may be due to either the limits of the ancestral reconstruction method or the relatively conservative criteria adopted in defining orthologs. Final sets of 183 A. fumigatus-A. nidulans ortholog pairs and of 359 S. cerevisiae-S. mikatae ortholog pairs were obtained. The mean Ka, Ks and Ka/Ks of the ortholog pairs in each LS group are given in Table 9.2.

[Figure 9.2: box plots of Ka by LS group. (A) Eukaryotes-core (N = 113), Ascomycota-core (N = 27), Euascomycetes-specific (N = 22), Aspergillus-specific (N = 21). (B) Eukaryotes-core (N = 17), Ascomycota-core (N = 23), Hemiascomycetes-specific (N = 22), Saccharomyces-specific (N = 297).]

Figure 9.2: Divergence of nonsynonymous substitution rate in LS groups. The edges of the boxes indicate the upper and lower quartiles. The line at the centre of the box indicates the median and the edges of the whiskers represent the limits of 1.5 times the upper or lower inter-quartile ranges. The circle (○) indicates cases with values between 1.5 and 3 box lengths from the upper or lower edge of the box. The number of the gene pairs (N) is given. (A) A. fumigatus-A. nidulans orthologs. (B) S. cerevisiae-S. mikatae orthologs.

Genes that are distributed in the more specific lineages tend to have

higher Ka values than more widely distributed genes. Box plots of the

distribution of the Ka values for the Aspergillus and Saccharomyces genes

are shown in Fig. 9.2 (A and B, respectively). In both the Aspergillus

and Saccharomyces gene sets, average Ka increases with the degree of LS

with significant among-group variation as measured by a Kruskal-Wallis

test (Aspergillus, P < 0.001; Saccharomyces, P < 0.001). Moreover, as

expected, Ka is consistently smaller than Ks within all LS groups, which

suggests the operation of purifying (negative) selection or functional con-

straints.

The ratio Ka/Ks (i.e., the rate of nonsynonymous substitutions cor-

rected for neutral rates) showed a trend similar to Ka, namely, the values

of Ka/Ks for genes of high LS (e.g., Aspergillus-specific or Euascomycetes-

specific genes) are significantly higher than those for genes of low LS (e.g.,

Eukaryotes-core or Ascomycota-core genes). The differences among the

rates of sequence divergence for different LS groups are more pronounced

for Ka than for Ks, which suggests that the acceleration of a gene’s di-

vergence rate may be mainly caused by more relaxed purifying selection

against amino acid replacement. Functions of representative genes in dif-

ferent LS groups were also examined. Largely, the functions of highly

lineage-specific genes are poorly characterised or simply unknown.

[Figure 9.3: scatter plots with Log(EXP) on the x-axis and (A) Log(Ka) or (B) Log(Ks) on the y-axis; series: Saccharomyces-specific, Hemiascomycetes-specific, Ascomycota-core, Eukaryotes-core and all genes.]

Figure 9.3: Dependence of log gene expression level, Log(EXP), and substitution rate. (A) log non-synonymous substitution rate, log(Ka). (B) log synonymous substitution rate, log(Ks).

[Figure 9.4: plots of -ln(D), where D is the relative dissimilarity score (y-axis), against divergence time in Myr (x-axis). (A) Series: Euascomycetes-specific, Ascomycota-core and Eukaryotes-core; fitted lines labelled R2 = 0.9518 and R2 = 0.9429. (B) Series: Hemiascomycetes-specific, Ascomycota-core and Eukaryotes-core; fitted lines labelled R2 = 0.9544 and R2 = 0.939.]

Figure 9.4: Linear regression analysis of divergence time and RDS. (A) LS of A. fumigatus-A. nidulans genes. (B) LS of S. cerevisiae-S. mikatae genes.


9.4.2 Evolutionary rate-related factors of genes belonging to different

LS groups

The correlation between Ka and LS may be confounded by other factors.

For S. cerevisiae-S. mikatae orthologs, bivariate correlations were used

to compute the pairwise associations between Ka and LS and potentially

confounding factors. These factors include the expression level of genes,

the dispensability or essentiality of a gene, gene duplication and the num-

ber of protein-protein interactions of the gene product. The results are

summarised in the upper triangle of Table 9.3. The coefficient for correlation
between log(Ka) and LS is 0.584 (Pearson's R, P < 0.01, Table
9.3), which is higher than that between log(Ka) and any other factor or
that between any two other factors.

Log gene expression level correlates negatively with log(Ka) (R = -0.382,
P < 0.01, Table 9.3, Fig. 9.3). This is consistent with previous
studies which showed a correlation between Ka and gene expression
level [127, 241]. A correlation between Ka and gene essentiality has long

been proposed [343] but remains controversial [141,149]. The correlation

between log(Ka) and gene essentiality was found to be weak, albeit sig-

nificant (R = -0.163, P < 0.01), and essential genes have a lower mean

Ka (0.081, median 0.081) compared to that for non-essential genes (mean

0.136; median 0.110) (Mann-Whitney U test, P = 0.004).

Our data show a weak correlation between log(Ka) and gene dispens-

ability (R = 0.186, P < 0.001, Table 9.3), which is at a similar magnitude

to that of gene essentiality. This result is consistent with that recently

reported by Hirsh and Fraser. This correlation remains significant af-

ter controlling for gene expression levels (partial R = 0.240, P < 0.01),

suggesting the independent nature of gene dispensability as a factor.

Gene duplication has been shown to play a role in influencing gene divergence rates [119, 150, 357]. Genes were classified as duplicate genes if they belonged to any multigene family and as singletons otherwise. The mean Ka of 0.097 (median 0.049) for duplicate genes was significantly smaller than the mean of 0.138 (median 0.114) for singleton genes (Mann-Whitney U test, P < 0.001). The same pattern was observed in the different LS groups, with the exception of the Ascomycota-core group.

Table 9.2: Average Ka, Ks and Ka/Ks among LS classes. * A Kruskal-Wallis test reveals significant rate heterogeneity of average Ka or average Ka/Ks of genes in different LS groups in both the Euascomycetes branch and the Hemiascomycetes branch, P < 0.001. § A Kruskal-Wallis test reveals no significant rate heterogeneity of average Ks of genes in different LS groups in both the Euascomycetes branch and the Hemiascomycetes branch, P > 0.01.

LS class                    Number of    Ka*              Ks§              Ka/Ks*
                            gene pairs   mean (SD)        mean (SD)        mean (SD)
A. fumigatus-A. nidulans (Euascomycetes branch)
Eukaryotes-core             113          0.051 (0.032)    1.431 (0.441)    0.039 (0.027)
Ascomycota-core             27           0.126 (0.069)    1.577 (0.329)    0.080 (0.042)
Euascomycetes-specific      22           0.198 (0.118)    1.436 (0.490)    0.155 (0.091)
Aspergillus-specific        21           0.293 (0.136)    1.263 (0.567)    0.261 (0.127)
S. cerevisiae-S. mikatae (Hemiascomycetes branch)
Eukaryotes-core             17           0.018 (0.021)    0.586 (0.213)    0.029 (0.026)
Ascomycota-core             23           0.031 (0.030)    0.639 (0.172)    0.047 (0.040)
Hemiascomycetes-specific    22           0.072 (0.037)    0.839 (0.284)    0.091 (0.045)
Saccharomyces-specific      297          0.131 (0.100)    0.830 (0.329)    0.165 (0.130)

Table 9.3: Correlation (Pearson's R) (upper triangle) and partial correlation after controlling for log(Ks) (lower triangle). Abbreviations: Ka: nonsynonymous substitution rate; LS: lineage specificity; EXP: expression level; FIT: fitness effect (gene dispensability); ESS: gene essentiality; DUP: duplicated (or not) gene; INT: number of interactions. Among them, Ka, Ks, EXP and FIT are in their log forms.

       Ka       LS       EXP      FIT      ESS      DUP      INT      Ks
Ka     –        0.584    -0.382   0.186    -0.163   0.257    -0.308   0.429
LS     0.582    –        -0.271   0.195    -0.263   0.324    -0.428   0.185
EXP    -0.294   -0.161   –        -0.037   0.076    -0.113   0.197    -0.165
FIT    0.240    0.192    -0.049   –        0.032    -0.116   -0.159   -0.048
ESS    -0.018   -0.146   -0.091   0.033    –        0.020    0.243    -0.087
DUP    0.215    0.312    -0.065   -0.106   0.028    –        -0.163   0.160
INT    -0.253   -0.379   0.123    -0.175   -0.007   -0.111   –        -0.128

Ka has been shown to be positively correlated with Ks in several species [116, 214, 239, 344]. Such a correlation, which may confound correlations between log(Ka) and LS or other factors, was observed here for log(Ka) and log(Ks) (R = 0.429, P < 0.01, Table 9.3). To examine the influence of the correlation of Ka with Ks on other factors, partial correlation coefficients between log(Ka) and other variables were calculated while holding the value of log(Ks) constant. The results are given in the lower triangle of Table 9.3 and indicate that, after controlling for log(Ks), log(Ka) remains significantly correlated with LS. There is little change in the value of the coefficient with or without controlling for log(Ks) (partial $R_{\log(K_a),LS \mid \log(K_s)} = 0.582$ compared with $R_{\log(K_a),LS} = 0.584$). Thus, Ka is correlated with LS independently of Ks.
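For reference, a first-order partial correlation can be computed from three pairwise Pearson coefficients using the standard formula, sketched below. This is a textbook formula rather than a description of the statistical software used in the thesis, and values obtained by feeding it rounded coefficients need not reproduce the tabulated partial correlations exactly, since the handling of missing data is not described.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after controlling for z.

    r_xy, r_xz, r_yz: pairwise Pearson correlation coefficients.
    """
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0.0:
        raise ValueError("z is perfectly correlated with x or y")
    return (r_xy - r_xz * r_yz) / denom
```

Here x would be log(Ka), y a factor such as LS, and z log(Ks); when z is uncorrelated with both x and y, the partial correlation equals the plain correlation.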

A decrease in the absolute value of the correlation coefficient was observed between log(Ka) and expression level when controlling for log(Ks) ($|R_{\log(K_a),\log(EXP) \mid \log(K_s)}| = 0.294$ and $|R_{\log(K_a),\log(EXP)}| = 0.382$). This suggests Ks might be a confounding factor for gene expression level in determining Ka. Figure 9.3 plots the relationship of log expression level with log(Ka) (Fig. 9.3A) and with log(Ks) (Fig. 9.3B) showing the values for the Saccharomyces gene lineage groups. The more consistent relationship of log expression value with log(Ks) among the genes can be seen.

Table 9.4: Results of the regression analyses on 359 predicted S. cerevisiae-S. mikatae orthologs. ¶ R2 is the proportion of variation in the dependent variable explained by the regression model constructed from the individual variable. The values indicate the independent contribution of each variable to explain the global variance of log(Ka). * Order of variables entered into the model at each step. ** t statistics can indicate the relative importance of each variable in the model.

Variable      Indep. contribution   Entry    Unstd. coeff.      Std. coeff.   t**       P
              (R2)¶                 order*   (B) ± 1 SE         (β)
Included variables
(Constant)    –                     –        -1.149 ± 0.113     –             -10.148   < 0.0001
LS            0.341                 1        0.048 ± 0.004      0.562         11.676    < 0.0001
log(EXP)      0.164                 2        -0.197 ± 0.038     -0.247        -5.124    < 0.0001
Excluded variables
log(FIT)      0.035                 3                           0.087         1.836     > 0.1
DUP           0.066                 4                           0.070         1.399     > 0.1
ESS           0.027                 5                           0.038         0.787     > 0.1
INT           0.095                 6                           -0.028        -0.546    > 0.1

Linear multiple regression was used to further examine the effect of

multiple factors on log(Ka). Gene essentiality and gene redundancy were

recoded to be quantitative variables by using two sets of binary variables

(essential = 1 and non-essential = 0; duplicated gene = 1 and singleton

gene = 0). A forward stepwise regression model was used to examine

the contribution of each independent variable to the regression. The

regression model defines log(Ka) as a function of LS (XLS), log expression

level (log(Xexp)), log fitness effect of gene deletion (log(Xfit)), essentiality

(Xess), gene duplication (Xdup), and the number of protein interactions

(Xint):

$$\log(K_a) = \beta_0 + \beta_{LS} X_{LS} + \beta_{exp} \log(X_{exp}) + \beta_{fit} \log(X_{fit}) + \beta_{ess} X_{ess} + \beta_{dup} X_{dup} + \beta_{int} X_{int}$$

Table 9.4 gives the results of the modelling procedure. The final model

gives a global R2 of 0.436 (P < 0.001). That is, nearly one half of the

variation in log(Ka) is explained by this model. During the construction

of the final model, the predictors most highly correlated with log(Ka),

LS and the expression level, were kept. The remaining variables, which

have minor roles in overall regression with log(Ka), were excluded from

the final model (Table 9.4). The standardised coefficients were examined

to determine the relative importance of the significant predictors. LS

contributes more to the model than does the expression level, as shown

by its larger absolute standardised coefficient of 0.562 and t statistic of

11.676, when compared with values of 0.247 and 5.124, respectively, for

expression level. This analysis suggests that LS is the most relevant

predictor of the rate of protein divergence.

9.4.3 Linear regression of divergence time and relative divergence score

(RDS)

To relate the group divergence times and RDS a linear regression for

each LS group was performed (Fig. 9.4). An increasing linear trend of

RDS with divergence time was observed in each LS group, suggesting

that genes diverge from other species at an approximately constant rate.

Groups with higher LS have greater slopes than those with lower LS, in-

dicating that genes with higher LS evolve faster than those with lower LS.

This trend would still be apparent if different divergence time estimates

were used.
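The per-group fit of RDS against divergence time is an ordinary least-squares line; a minimal sketch is shown below. The function name is hypothetical and the RDS values in the example are invented purely for illustration; they are not results from this study.

```python
def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept, with R^2."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 of a simple linear regression equals the squared correlation.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Illustrative divergence times (Mya) and made-up RDS estimates.
times = [10, 670, 1144, 1458]
made_up_rds = [0.1, 0.8, 1.3, 1.7]
slope, intercept, r_squared = linear_fit(times, made_up_rds)
```

Comparing the fitted slopes between LS groups then gives the comparison of divergence rates described above.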

9.5 Discussion

The phylogenetic distribution of a gene has been suggested to be of bi-

ological importance. For example, genes with the same phylogenetic

distribution may have linked functions [8, 218]. Lineage specificity (LS)

is a form of phylogenetic distribution whereby genes are found only in

a group of species that diverge from a certain point in a species tree.

Orphan genes, those identified from only one species, are the extremes of

lineage specificity. How these orphan and lineage specific genes arose is

still an open question.

Three possibilities are generally proposed [73]. One is that genes in a

lineage originate from a lineage ancestral gene formed by the recombina-

tion of exons from other genes or from random ORFs. These genes might

show similarity to the original exons and so not necessarily be considered

orphans or lineage specific. In the case of formation from random ORFs

it is unlikely that such a protein would be functional. A second option is

gene loss [8, 178]. However it is relatively unlikely that a gene would be

lost in all but one lineage [73] and this may not explain most orphan or

lineage specific genes. The third option, which is examined here, is that

some genes evolve at a rapid rate and so can no longer be recognised as

orthologs of the genes they diverged from after a certain time span.

If accelerated rates of evolution lead to the creation of orphan or

lineage specific genes, then it follows that genes with a high degree of LS

should show higher rates of evolution than genes with lower degrees of LS.

This hypothesis has been tested here with respect to the Ascomycotan

fungi. If LS arose through widespread gene loss or from creation of new

genes from recombination of exons or ORFs there is no reason to expect

accelerated evolutionary rates or a trend in evolutionary rate with respect

to the degree of LS.

The evolutionary rate of genes in Ascomycotan fungi that have dif-

ferent degrees of LS were compared and revealed a significant, strong

correlation between LS and the evolutionary rates of the genes. A trend

that genes with narrow phylogenetic distributions (high LS) tend to have

elevated evolutionary rates when compared with more ubiquitous genes

(low LS) was observed. This is consistent with the hypothesis that accel-

eration of the evolutionary rate is largely responsible for the formation

of lineage specific genes.

However, the rate of gene evolution is one of the most important pa-

rameters in molecular evolution. Correlations between the rate of gene

evolution and many properties of genes, including their phylogenetic dis-

tribution have been explored by several studies. As noted in the In-

troduction, the evolutionary rate has been associated with expression

level [127, 241], gene dispensability [178], essentiality [343] or morbidity,

gene duplication, gene loss [178] and protein-protein interactions [93,335].

Not all these studies have been in agreement e.g., [93,151]. These factors

may influence the apparent correlation of LS with evolutionary rate.

All pair-wise correlations of these factors with LS, Ka and Ks were

examined to investigate the influence of these factors on the relationship

between LS and Ka. The strongest correlation observed was that of LS

with log(Ka); however, log(Ka) also correlated highly with log(Ks).
Correlations of log(Ka) with LS and the other factors were then calculated

after controlling for log(Ks). Again the correlation of LS with log(Ka)

was the strongest and similar to that without controlling for log(Ks).

With one exception, both LS and log(Ka) showed significant but low

correlations to all other factors. As log(Ka) showed the strongest corre-

lation with LS in both cases it seems clear that the evolutionary rate has

a considerable, though not unique, influence on the origin of LS.

Further examination of this was undertaken with a stepwise regression
analysis of the factors likely to influence Ka. In the final regression

model, which explained close to half the variation in log(Ka), only the

parameters LS and log expression level were kept, with LS making the

larger individual contribution. The other parameters investigated did

not make significant contributions to the regression model. This again

indicated the role of evolutionary rate on LS.

Another approach used the relative divergence score (RDS) which

measures the divergence of a gene from its orthologs in other genomes

as a ratio of the TBLASTN score with its orthologs to the maximal (or

self-self) score. This provides another view of the degree of divergence

within a lineage and, when matched to divergence times, allows an ex-

amination of the evolutionary rate as the degree of LS increases. Within

each LS group a reasonably constant rate of evolution was seen since the

appearance of the LS group. Groups with low LS show lower RDS values

and evolutionary rates than groups with higher LS, consistent with the

evolutionary rate being a major determinant of LS. Allowing for errors

in the determination of divergence times this trend will still hold.

Genes with a certain degree of LS may have arisen from duplication

followed by acquisition of a lineage specific function [73] or simply have

diverged from a common ancestor to the extent that they cannot be

recognised as orthologs across lineages. Our findings support the idea

that genes destined to have high levels of LS will have higher evolution-

ary rates. It should be noted that Ka is a measurement of the average

nonsynonymous substitution rate along the whole length of a gene. Al-

though highly lineage-specific genes had higher average Ka, the extent

to which region-specific or site-specific contributions to Ka affect this

was not examined. Further research could be directed to evaluate such

region- or site-specific effects on the rate of protein divergence, especially,

for instance, for genes that have high LS but low evolutionary rates or

vice versa.

For ascomycotan fungi, our findings show that the degree of LS cor-

relates with the evolutionary rate and indicate that an elevated evolu-

tionary rate may be a major cause of the development of lineage specific

genes.

Chapter 10

MBETOOLBOX: A MATLAB TOOLBOX FOR

SEQUENCE DATA ANALYSIS IN MOLECULAR

BIOLOGY AND EVOLUTION

This chapter is closely based on a paper I have published [35]. The
original draft of the manuscript was revised by Dr. David K. Smith of
the Department of Biochemistry, HKU.

10.1 Introduction

Matlab is a high-performance language for technical computing that
integrates computation, visualization and programming in an easy-to-use
environment. It is widely used in areas such as mathematics and
computation, algorithm development, data acquisition, modelling,
simulation, and scientific and engineering graphics. However, few freely
available Matlab functions are designed specifically for sequence data
analysis in molecular biology and evolution. I have developed a Matlab
toolbox, called MBEToolbox, to fill this gap by offering efficient
implementations of the functions most needed in molecular biology and
evolution. It can be used to manipulate aligned sequences, calculate
evolutionary distances, estimate synonymous and nonsynonymous
substitution rates, and infer phylogenetic trees. Moreover, it provides
an extensible functional framework for more specialized needs in
exploring and analysing aligned nucleotide or protein sequences from an
evolutionary perspective. All functions in the toolbox are accessible
from the command line for seasoned Matlab users, and a graphical user
interface is also provided, which may be especially useful for
non-specialist end users. Through its application during the Penicillium
marneffei genome project, MBEToolbox has proved to be a useful tool
that can aid in the exploration, interpretation and visualization of
data in molecular biology and evolution. The software is publicly
available at http://web.hku.hk/~jamescai/mbetoolbox/.


10.2.1 Probabilistic DNA substitution models

In this section I will discuss probability models, more specifically
Markov models. (Other types of models, e.g., deterministic models, also
exist.) Markov models can be discrete or continuous with regard to time.
Discrete time models are called Markov chains, whereas continuous time
models are usually called Markov processes. The mathematical notation
used in this section is: R, intrinsic rate matrix; Q, (instantaneous)
transition rate matrix; P, transition probability matrix; X, divergence
matrix; Π, matrix of base frequencies; and t, time or evolutionary
distance.

The molecular evolution of sequences is generally modelled under a
hypothesised phylogeny, i.e., sequence evolution is modelled along each
branch of a phylogenetic tree. This is done with a continuous time
Markov process, more specifically a finite, aperiodic and irreducible
one (referred to here simply as a Markov process). A Markov process has
a defined state space, e.g., {A, C, G, T}, and the (instantaneous)
transition rate between states is given by an n × n transition rate
matrix, Q, where $Q_{ij} > 0$ for all $i \neq j$ and
$Q_{ii} = -\sum_{j \neq i} Q_{ij}$. Amino acid models have n = 20, while
nucleotide models have n = 4, e.g.:

$$
Q = \begin{pmatrix}
-1.218 & 0.504 & 0.336 & 0.378 \\
0.126 & -0.882 & 0.252 & 0.504 \\
0.168 & 0.504 & -1.050 & 0.378 \\
0.126 & 0.672 & 0.252 & -1.050
\end{pmatrix}
$$

$Q_{ij}$ indicates the rate of going from state i to state j. Since the
total instantaneous rate is zero, each row sums to zero. For a specified
time interval, t, we can calculate the transition probability matrix
from $P(t) = e^{Qt}$, e.g.:

$$
P(t) = \begin{pmatrix}
0.6883 & 0.1308 & 0.0828 & 0.0981 \\
0.0327 & 0.7783 & 0.0654 & 0.1236 \\
0.0414 & 0.1308 & 0.7297 & 0.0981 \\
0.0327 & 0.1647 & 0.0654 & 0.7372
\end{pmatrix}
$$

Here t = 0.33 and the exponential is the matrix exponential. In Matlab,
this is computed using a scaling and squaring algorithm with a Padé
approximation. In P, each row sums to one, since the total probability
over the time interval is one. If the Markov process is run for a
sufficiently long time, the probabilities P(t) converge on a stationary
distribution such that, for all pairs (i, k) of starting states and
every end state j, $P_{ij}(t) = P_{kj}(t)$. That is, the probability of
the end state is independent of the starting state. Here we will limit
our discussion to cases where the overall rate of change from state i to
state j is the same as the rate from j to i, a constraint that defines
the models said to be time-reversible. The models used in phylogenetic
inference to date are almost exclusively subsets of this class.
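As an illustration of the calculation P(t) = e^{Qt}, the following Python sketch exponentiates the example rate matrix. It is not MBEToolbox code: the function names are hypothetical, and it uses scaling and squaring with a truncated Taylor series as a simple stand-in for the Padé approximant used by Matlab. The printed values depend on how Q and t are scaled, so they need not reproduce the matrix quoted above digit for digit.

```python
# Sketch only: P(t) = exp(Q t) by scaling and squaring with a Taylor series.
# (Matlab's expm uses a Pade approximant; the Taylor series is a simplification.)

def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(m, squarings=10, terms=12):
    """Approximate exp(m): scale m down, sum a Taylor series, square back up."""
    n = len(m)
    scale = 2.0 ** squarings
    small = [[x / scale for x in row] for row in m]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def transition_probabilities(q, t):
    """Transition probability matrix P(t) for rate matrix q and time t."""
    return mat_exp([[x * t for x in row] for row in q])

# Example rate matrix from the text (rows and columns ordered A, C, G, T).
Q = [[-1.218, 0.504, 0.336, 0.378],
     [0.126, -0.882, 0.252, 0.504],
     [0.168, 0.504, -1.050, 0.378],
     [0.126, 0.672, 0.252, -1.050]]

P = transition_probabilities(Q, 0.33)
P_long = transition_probabilities(Q, 50.0)   # a "sufficiently long" time

for row in P:
    print(" ".join(f"{x:.4f}" for x in row))
```

Every row of `P` sums to one, and for this Q every row of `P_long` approaches the same vector (0.1, 0.4, 0.2, 0.3), which is the stationary distribution referred to in the text.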

The transition rate matrix, Q, can be decomposed into an intrinsic rate
matrix, R, and Π, such that:

$$Q = R\Pi$$

where R is symmetric, Q is constructed as indicated above, and Π holds
the equilibrium base frequencies (as a diagonal matrix). The rates in R
at which each state is replaced by each alternative state, and the
methods for calculating or estimating Π, are set differently in
different situations; hence, different DNA substitution models exist. I
will start with the most general model of nucleotide substitution
considered here, the general time reversible model (GTR), also called
REV. The instantaneous rate matrix for the REV model is:

$$
R_{REV} = \begin{pmatrix}
- & \mu a & \mu b & \mu c \\
\mu a & - & \mu d & \mu e \\
\mu b & \mu d & - & \mu f \\
\mu c & \mu e & \mu f & -
\end{pmatrix}
$$

In this matrix, the rows (and columns) correspond to the bases A, C, G
and T, respectively. The factor µ represents the mean instantaneous
rate. This rate is modified by the relative rate parameters a, b, c, d,
e and f, which correspond to each possible transformation between two
bases. To construct Q_REV, we form the product RΠ, where
Π = diag(πA, πC, πG, πT) holds the frequency parameters of the four
bases. The diagonal elements of Q are always chosen so that the row
sums are zero (i.e., stationarity).

Many other models, all belonging to the GTR class, have been described.
They are usually designated by the initial letters of the authors' last
names and the year of publication. Their relationships are illustrated
in Fig. 10.1. The κ parameter represents the ratio of the instantaneous
rate of transition-type substitutions to that of transversion-type
substitutions. It takes the value 1.0 for models in which all
substitutions are taken to occur at the same rate (i.e., the JC and F81
models). In the K2P and HKY models, the rate of transversions is β, with
the rate of transitions being determined as α = κβ.

[Figure 10.1 diagram: JC (πA=πC=πG=πT; α=β), K2P (πA=πC=πG=πT; α≠β),
F81 (α=β), HKY85 (πA≠πC≠πG≠πT; α≠β) and GTR/REV (a, b, c, d, e, f),
linked by arrows labelled "Allow transition/transversion bias" and
"Allow base frequencies to vary".]

Figure 10.1: Relationship of GTR class DNA substitution models

JC model The JC model was described by Jukes & Cantor in 1969 [153] and
is the most restrictive model. It assumes that the base frequencies are
all equal and that the instantaneous rate of substitution is the same
for all possible changes. When this model is selected, the base
frequencies (πA, πC, πG, πT) are all set to 0.25 and the relative rates
a, b, c, d, e and f are all set to 1.0. The only free parameter under
this model is µt.

F81 model The F81 model was described by Felsenstein (1981) [85].

It is like the JC model in assuming that all possible changes occur at

the same rate, but allows the base frequencies to be unequal. If the base

frequencies are all set to 0.25, this model is equivalent to the JC model.

Under this model the base frequency parameters are free to vary, but
the κ parameter is fixed at 1.0.

K2P model The K2P model was described by Kimura in 1980 [165]. It is
like the JC model in assuming equal base frequencies, but allows the
rate of transition-type substitutions to differ from the rate of
transversion-type substitutions. The ratio of these two instantaneous
rates is κ. Two parameters, κ and µt, are free to vary under this model,
while the base frequency parameters are forced to be equal. When
κ = 1.0, the K2P model is identical to the JC model.

HKY model The Hasegawa, Kishino and Yano (HKY) model [126] allows
different rates for transitions and transversions as well as unequal
base frequencies. The parameters required by this model are the
transition to transversion ratio, κ, and the base frequencies. If the
base frequencies are uniform, the HKY model reduces to the K2P model.
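The nesting of these four models can be made concrete with a short Python sketch. It is an illustration under stated assumptions and is not the toolbox's own model-defining code: `make_q` and the four wrappers are hypothetical names, the bases are ordered A, C, G, T, and Q is rescaled to a mean rate of one substitution per unit time.

```python
def make_q(kappa, pi):
    """Build an HKY-style rate matrix Q for bases ordered A, C, G, T.

    Off-diagonal Q[i][j] is pi[j] for a transversion and kappa * pi[j]
    for a transition (A<->G, C<->T). Each diagonal entry makes its row
    sum to zero, and Q is rescaled so that the mean rate is 1.
    """
    transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}
    n = 4
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = (kappa if (i, j) in transitions else 1.0) * pi[j]
        q[i][i] = -sum(q[i])          # diagonal is still 0.0 at this point
    mean_rate = -sum(pi[i] * q[i][i] for i in range(n))
    return [[x / mean_rate for x in row] for row in q]

EQUAL = [0.25, 0.25, 0.25, 0.25]

def jc():
    return make_q(1.0, EQUAL)         # equal frequencies, kappa = 1

def k2p(kappa):
    return make_q(kappa, EQUAL)       # equal frequencies, free kappa

def f81(pi):
    return make_q(1.0, pi)            # free frequencies, kappa = 1

def hky(kappa, pi):
    return make_q(kappa, pi)          # free frequencies, free kappa

example = hky(4.0 / 3.0, [0.1, 0.4, 0.2, 0.3])
```

With κ = 4/3 and π = (0.1, 0.4, 0.2, 0.3), this construction agrees with the example matrix Q of Section 10.2.1 to within about 0.001 per entry, and setting κ = 1 or equal frequencies recovers the more restricted models.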

10.2.2 Maximum likelihood estimation

Maximum likelihood estimation (MLE) is a popular statistical method

used to make inferences about parameters of the underlying probability

distribution of a given data set. Given a set of observations, the method

of maximum likelihood finds the parameters of a model that are most

consistent with these observations.

Here I use a simple and general example to explain the philosophy of
MLE. Suppose n data points, X1, X2, ..., Xn, are drawn from a given
discrete probability distribution D with known probability mass function
fD and distributional parameter θ. The probability associated with each
observation may be computed as:

$$P(x) = f_D(x \mid \theta)$$

where x ∈ {x1, x2, ..., xn}. Although we know that our data come from
the distribution D, we may not know the value of the parameter θ. This
is usually the case when we sample data points experimentally in order
to estimate a parameter, such as θ, of a distribution. The question is:
how should we estimate θ?

MLE provides a general technique for seeking an estimate of the value

of θ from the sample. We maximise the likelihood of the observed data

set over all possible values of θ, i.e., seeking the most likely value of the

parameter θ.

We define the likelihood mathematically:

$$\mathrm{lik}(\theta) = \prod_{i=1}^{n} f_D(x_i \mid \theta)$$

MLE seeks the value θ which maximises this likelihood function over all

possible θ. MLE methods are versatile and apply to most models and to

different types of data.
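A minimal worked example may help here. The Python sketch below assumes that D is a Bernoulli distribution (outcomes 0 or 1 with success probability θ) and uses made-up observations; it scans a grid of candidate θ values and keeps the one with the highest log-likelihood. The function names and the grid resolution are choices of this illustration.

```python
import math

def log_likelihood(theta, data):
    """ln lik(theta) = sum_i ln f_D(x_i | theta) for Bernoulli data (0 or 1)."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

def grid_mle(data, steps=1000):
    """Return the grid value of theta in (0, 1) with the highest log-likelihood."""
    candidates = [i / steps for i in range(1, steps)]
    return max(candidates, key=lambda theta: log_likelihood(theta, data))

# Made-up observations: 7 successes in 10 draws.
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
theta_hat = grid_mle(data)
print(theta_hat)
```

For Bernoulli data the maximum can also be found analytically as the sample mean, so the grid search is only a stand-in for the numerical optimisation needed when no closed form exists.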

The general principle of MLE has been applied to many aspects of
phylogenetics, such as phylogenetic parameter estimation and optimal
tree searching [41, 85]. Generally, the likelihood of observing a

given set of data is maximised for each topology, and the topology that

gives the highest maximum likelihood is chosen as the final tree. In this

case, however, the parameters to be considered are not the topologies but

the branch lengths for each topology, and the likelihood is maximised to

estimate branch lengths rather than the topology. The problem with

phylogenetic inference based on the optimisation principle is that it is

very time-consuming, because the number of possible topologies is very

large for a sizable number of nucleotide sequences (> 15) and an enor-

mous amount of computational time is required to find the optimal tree.

Calculating MLEs in phylogeny usually requires specialised software
that solves complex non-linear equations by numerical optimisation.
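To give a sense of scale for the search problem described above: the number of distinct unrooted bifurcating topologies for n labelled sequences is (2n - 5)!!, and the number of rooted ones is (2n - 3)!!. The short Python sketch below (function names are hypothetical) evaluates these counts.

```python
def unrooted_topologies(n):
    """Number of unrooted bifurcating trees for n >= 3 labelled taxa: (2n-5)!!."""
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

def rooted_topologies(n):
    """Number of rooted bifurcating trees for n >= 2 labelled taxa: (2n-3)!!."""
    return unrooted_topologies(n + 1)

for n in (4, 10, 15):
    print(n, unrooted_topologies(n))
```

For 15 sequences there are already 7,905,853,580,625 unrooted topologies, which is why exhaustive evaluation is replaced by heuristic searches.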

10.2.3 Elements of phylogenetic theory

The purpose of the remainder of this section is to explain how
phylogenetic trees may be constructed from the analysis of nucleotide
and protein sequences. Such analyses enable the evolutionary
relationships among species or genes to be deduced. I will review basic
concepts of phylogenetic theory, such as the phylogenetic tree and the
calculation of the likelihood of a phylogeny given a substitution model.
Then I will introduce some of the most commonly used software packages
for phylogenetic analyses, with their advantages and shortcomings.

Phylogenetic trees

We usually describe the evolution of either genes or species with a
sketch of a tree-like structure, which represents the hierarchical
relationships among the species or genes arising through evolution. Such
a structure is called a phylogenetic tree. In a rooted tree, the root is
the common ancestor of all the nodes: in an evolutionary tree of
species, the ancestral species is located at the root and the
contemporary species are the leaves. The topology of the tree, i.e., its
branching pattern, defines the phylogenetic relationships among the
nodes. When data for the ancestors are missing, the phylogenetic trees
produced are unrooted; these are only schematic trees comprising a set
of nodes linked together by branches. The location of the common
ancestor of all the species or genes under study cannot be identified in
an unrooted tree.

The string representation of a tree, following the Newick standard, is
usually used. It uses the recursive definition of a tree to represent
phylogenies in a computer-readable form with nested parentheses. For
example, a tree can be written:

(outgroup, neurospora, (penicillium, aspergillus));

However, one must be aware that this representation is not unique; the
following one works as well:

(penicillium,(outgroup,neurospora),aspergillus);

When an outgroup is provided, the rooted representation is:

(outgroup,(neurospora,(penicillium,aspergillus)));

In addition to the branch topology, the branch lengths in phylogeny

are also important to specify a particular tree. The lengths of branches

represent the evolutionary distances between two consecutive nodes.
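The recursive structure of the Newick format can be illustrated with a small Python sketch. It is a simplified reader written for this illustration only (names are hypothetical): it handles leaf names and nested parentheses but not branch lengths, internal labels or quoted names. Two strings describe the same unrooted topology when they yield the same set of non-trivial bipartitions of the leaves.

```python
def parse_newick(text):
    """Parse a Newick string without branch lengths into nested tuples."""
    text = "".join(text.split())              # drop all whitespace
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def subtree():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                          # consume '('
            children = [subtree()]
            while text[pos] == ",":
                pos += 1                      # consume ','
                children.append(subtree())
            if text[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1                          # consume ')'
            return tuple(children)
        start = pos
        while text[pos] not in ",();":
            pos += 1
        if pos == start:
            raise ValueError("missing leaf name")
        return text[start:pos]

    tree = subtree()
    if text[pos] != ";":
        raise ValueError("unbalanced parentheses")
    return tree

def leaves(tree):
    """Set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def splits(tree):
    """Non-trivial leaf bipartitions, which identify the unrooted topology."""
    everything = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = everything - side
        if len(side) > 1 and len(other) > 1:
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found

t1 = parse_newick("(outgroup, neurospora, (penicillium, aspergillus));")
t2 = parse_newick("(penicillium,(outgroup,neurospora),aspergillus);")
t3 = parse_newick("(outgroup,(neurospora,(penicillium,aspergillus)));")
print(splits(t1) == splits(t2) == splits(t3))
```

The reader raises `ValueError` when the parentheses do not balance, which is a cheap sanity check to run before passing a tree string to other software.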

Phylogeny reconstruction

The data used for phylogeny reconstruction are not limited to
nucleotide and amino acid sequences; protein structures or exon-intron
structures can also be used for this purpose. However, I will limit the
following discussion to nucleotide and amino acid sequences. It is
important to note that most phylogeny-building methods require a
multiple alignment of the sequences. Sequence alignment is one of the
most important problems in bioinformatics; much effort has been put into
improving its efficiency and accuracy, and the area is still developing
actively.

Once a multiple alignment has been obtained, three different methods are
commonly used to construct a phylogeny: the distance matrix method, the
maximum parsimony method and the maximum likelihood method. A good
review of all these methods can be found in [199].

Maximum parsimony infers a phylogenetic tree by minimising the

total number of evolutionary steps required to explain a given set of data,

or in other words by minimising the total tree length. It is a
character-based method: the input data are in the form of “characters”
for a range of taxa. Besides a protein or nucleotide residue, a
character could

be a binary value for the presence or absence of a feature (such as the

presence of a tail). Maximum parsimony is a very simple approach, and

is popular for this reason. However, it is not always very accurate.

Maximum likelihood evaluates a hypothesis about evolutionary history in
terms of the probability that the proposed model and the hypothesised
history would give rise to the observed data set.

Central to likelihood-based methods is the likelihood function (for a
general description, see Section 10.2.2):

Likelihood = f(Data|T, l, θ)

where T is the topology, l is the set of branch lengths of the given
tree and θ denotes the parameters of the substitution model.

The topology with the highest maximum likelihood is chosen. The
advantages of maximum likelihood methods over other methods are that
they may have lower variance (being least affected by sampling error),
tend to be robust to violations of the assumptions of the evolutionary
model, are statistically well founded, can evaluate different tree
topologies statistically and use all of the sequence information. The
disadvantages are that they are very computationally intensive (slow)
and that the result depends on the model of evolution.

Computation of likelihood of phylogeny

Substitution models describe the way sequences evolve in time by
nucleotide replacement. The most commonly used Markov models of DNA
substitution have been reviewed in Section 10.2.1.

10.2.4 Programs used for phylogenetic analyses

A few selected programs are introduced below; they are representative
of those most commonly used in phylogenetic analyses.

PAUP* - http://paup.csit.fsu.edu/ is an integrated and user-

friendly package. Many distinct models of nucleotide substitution are

available (all possible submodels of the GTR + Γ + inv sites model). It

does not allow analyses of protein sequences using parametric approaches.

Tree-Puzzle - http://www.tree-puzzle.de/ reconstructs phyloge-

netic trees from molecular sequence data by maximum likelihood. It

implements a fast tree search algorithm, quartet puzzling, that allows

analysis of large data sets and automatically assigns estimations of sup-

port to each internal branch. It also computes pairwise maximum likeli-

hood distances as well as branch lengths for user specified trees.

Mesquite - http://mesquiteproject.org/mesquite/mesquite.html is an
extensible and modular program for a variety of evolutionary analyses.
It is written in Java and is therefore platform-independent. At this
point Mesquite is of limited usefulness because it is a modular set of
programs to which specific applications must be added. However, it does
implement one- and two-parameter models of evolution for ancestral state
reconstruction.

MrBayes - http://morphbank.ebc.uu.se/mrbayes/ is a program for Markov
chain Monte Carlo analysis of phylogeny. It implements a limited set of
submodels of the GTR + Γ + inv sites model. The current version allows
the use of mixed models (e.g., distinct GTR + Γ + inv sites submodels
for 1st, 2nd and 3rd codon positions or for different genes). A number
of protein models, using parameters estimated from large-scale analyses
of protein databases, are also available. It is the only package known
to implement the covarion model.

PAML - http://abacus.gene.ucl.ac.uk/software/paml.html is a package of
programs for phylogenetic analyses of DNA or protein sequences using
maximum likelihood. It contains a modular set of programs for a flexible
range of likelihood analyses (submodels of the GTR + Γ + inv sites
model, amino acid models and codon-based models). It is not designed for
tree searches, but its flexibility makes it ideal for analyses of the
evolutionary process and estimation of evolutionary parameters. PAML has
a simulator module called “evolver” that is also quite flexible.

PHYLIP - http://evolution.genetics.washington.edu/phylip.html is a
modular set of programs for various types of phylogenetic analyses
(including likelihood analyses of DNA and proteins). It implements a
heuristic tree space search algorithm, which is faster than PAML, but
does not search as rapidly or as extensively as PAUP*.

10.3 Implementation

MBEToolbox is written in the Matlab language and has been tested on

the Windows platform with Matlab version 6.1.0. The main functions

implemented are: sequence manipulation, computation of evolutionary

distances derived from nucleotide-, amino acid- or codon-based substi-

tution models, phylogenetic tree construction, sequence statistics and

graphics functions to visualize the results of analyses. Although it imple-

ments only a small fraction of the multiplicity of existing methods used

in molecular evolutionary analyses, interested users can easily extend the

toolbox.

10.3.1 Input data and formats

MBEToolbox requires a single ASCII file containing the nucleotide or

amino acid sequence alignment in either Phylip [86], ClustalW [312]

or Fasta format. The toolbox also provides a built-in ClustalW [312]
interface for use when an unaligned sequence file is provided.
Protein-coding DNA

sequences can be automatically aligned based on the corresponding pro-

tein alignment with the command alignseqfile.

After input, in common with the MathWorks bioinformatics tool-

box, MBEToolbox represents the alignment as a numeric matrix with

every element standing for a nucleic or amino acid character. Nucleotides

A, C, G and T are converted to integers 1 to 4, and the 20 amino acids are

converted to integers 1 to 20. A header, containing information about the

names and type of the sequences as well as the relevant genetic code for

protein-coding nucleotides, is attached to the alignment matrix to form a

Matlab structure. An example alignment structure, aln, in Matlab code

follows:

aln =
        seqtype: 2
    geneticcode: 1
       seqnames: 1xn cell
            seq: [nxm double]

where n is the number of sequences and m is the length of the aligned

sequences. The type of sequence is denoted by 1, 2 or 3 for sequences

of non-coding nucleotides, protein coding nucleotides and amino acids,

respectively.
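The numeric representation described above can be mimicked in a few lines of Python. This is a sketch only and not how the toolbox reads files: `encode_alignment` is a hypothetical helper, a dictionary stands in for the Matlab structure, and coding any character other than A, C, G or T as 5 is an assumption of this illustration.

```python
NT_CODES = {"A": 1, "C": 2, "G": 3, "T": 4}

def encode_alignment(names, sequences, seqtype=1, geneticcode=1):
    """Return a dict with the same fields as the aln structure in the text."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # Gaps and ambiguity codes are coded 5 here (an assumption of this sketch).
    seq = [[NT_CODES.get(ch, 5) for ch in s.upper()] for s in sequences]
    return {"seqtype": seqtype, "geneticcode": geneticcode,
            "seqnames": list(names), "seq": seq}

aln = encode_alignment(["seq1", "seq2"], ["ATGGCC", "ATGGCT"], seqtype=2)
first_sequence = aln["seq"][0]                   # one row of the matrix
some_columns = [row[1:4] for row in aln["seq"]]  # columns 2 to 4 (1-based)
```

Note that Python indexes from 0 whereas Matlab indexes from 1, so row and column numbers differ by one between this sketch and the Matlab expressions used in the toolbox.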

10.3.2 Sequence Manipulation and Statistics

The alignment structure, aln, can be manipulated using the Matlab lan-

guage. For example, aln.seq(x,:) will extract the xth sequence from

the alignment, while aln.seq(:,[i:j]) will extract columns i to j from

the alignment. Users may easily extract more specific positions by us-

ing functions developed in the toolbox, such as extractpos(aln,3) or

extractdegeneratesites to obtain the third codon positions or fourfold

degenerate sites, respectively. For each sequence, some basic statistics

such as the nucleotide composition (ntcomposition) and GC content,

can be reported. Other functions include the calculation of the relative

synonymous codon usage (RSCU) and the codon adaptation index (CAI),

counts of segregating sites, taking the reverse complement or translating

a sequence, and determining the sequence complexity.
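The kind of operation performed by functions such as extractpos and the GC content calculation can be sketched in Python as follows. The two helper functions are hypothetical stand-ins written for this illustration; they operate on rows of integer codes (A, C, G, T as 1 to 4) and assume that the alignment starts in reading frame.

```python
def extract_codon_position(seq_rows, position):
    """Keep the columns at codon position 1, 2 or 3 of an in-frame alignment."""
    if position not in (1, 2, 3):
        raise ValueError("position must be 1, 2 or 3")
    return [row[position - 1::3] for row in seq_rows]

def gc_content(row):
    """Fraction of C (2) and G (3) among the bases coded 1 to 4 in one row."""
    bases = [x for x in row if 1 <= x <= 4]
    if not bases:
        return 0.0
    return sum(1 for x in bases if x in (2, 3)) / len(bases)

seq = [[1, 4, 3, 3, 2, 2, 4, 4, 1],    # ATG GCC TTA
       [1, 4, 3, 3, 2, 3, 4, 4, 3]]    # ATG GCG TTG
third = extract_codon_position(seq, 3)
gc3 = [gc_content(row) for row in third]   # GC content at third positions
```

Chaining the two helpers in this way mirrors how small building blocks can be combined into a new statistic.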

10.3.3 Evolutionary Distances

The evolutionary distance is one of the important measures in molecu-

lar evolutionary studies. It is required to measure the diversity among

sequences and to infer distance-based phylogenies. MBEToolbox con-

tains a number of functions to calculate evolutionary distances based

on the observed number of differences. The formulae used in these

functions are analytical solutions of a variety of Markov substitution

models, such as JC69 [153], K2P [165], F84 [86], HKY [126] (see [229]

for detail). Given the stationarity condition, the most general form of

Markov substitution models is the General Time Reversible (GTR or

REV) model [185, 309, 266, 358]. There is no analytical formula to cal-

culate the GTR distance directly. A general method, described by Ro-

driguez et al. [266], has been implemented here. In this method a matrix

F, where Fij denotes the proportion of sites for which sequence 1 (s1) has

an i and sequence 2 (s2) has a j, is formed. The GTR distance between

s1 and s2 is then given by

$$d = -\mathrm{tr}\left(\Pi \log\left(\Pi^{-1} F\right)\right)$$

where Π denotes the diagonal matrix with values of nucleotide equilib-

rium frequencies on the diagonal, and tr(A) denotes the trace of matrix

A. The above formula can be expressed in Matlab syntax directly as:

>> d=-trace(PI*logm(inv(PI)*F))

MBEToolbox also calculates the gamma distribution distance and the

LogDet distance [295] (i.e., Lake’s paralinear distance [184]).
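As an example of the analytical formulae mentioned above, the JC69 distance is d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites, and the K2P distance is d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences. The Python sketch below evaluates both for a pair of aligned sequences; the function names are hypothetical and sites containing anything other than A, C, G or T are skipped, which is an assumption of this illustration.

```python
import math

PURINES = {"A", "G"}

def count_differences(s1, s2):
    """Return (sites compared, transitions, transversions) for two aligned sequences."""
    n = ts = tv = 0
    for a, b in zip(s1.upper(), s2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue                      # skip gaps and ambiguity codes
        n += 1
        if a != b:
            if (a in PURINES) == (b in PURINES):
                ts += 1                   # purine<->purine or pyrimidine<->pyrimidine
            else:
                tv += 1
    return n, ts, tv

def jc69_distance(s1, s2):
    """Jukes-Cantor distance from the proportion of differing sites."""
    n, ts, tv = count_differences(s1, s2)
    p = (ts + tv) / n
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def k2p_distance(s1, s2):
    """Kimura two-parameter distance from transition and transversion proportions."""
    n, ts, tv = count_differences(s1, s2)
    p, q = ts / n, tv / n
    return -0.5 * math.log((1.0 - 2.0 * p - q) * math.sqrt(1.0 - 2.0 * q))

d_jc = jc69_distance("AAAAAAAAAA", "GAAAAAAAAA")
d_k2p = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
```

Both formulae become undefined when the sequences are too divergent (the argument of the logarithm is then zero or negative), so a practical implementation would need to guard against that case.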

For alignments of codons, the toolbox provides calculation or esti-

mation of the synonymous (Ks) and non-synonymous (Ka) substitution

rates by the counting method of Nei and Gojobori [228], the degenerate

methods of Li, Wu and Luo [198] and the method of Li or Pamilo and

Bianchi [197, 242], as well as the maximum likelihood method through

PAML [360]. All these methods for calculating Ks and Ka require that

the input sequences are aligned in the appropriate reading frame, which

can be performed by the function alignseqfile. Unresolved codon sites

will be removed automatically. In addition, several quantities, includ-

ing the number of substitutions per site at only synonymous sites, at

only non-synonymous sites, at only four-fold-degenerate sites, or at only

zero-fold-degenerate sites can be calculated. The outputs of these
calculations are distance matrices, which can be exported to text or
Excel files or used directly in further operations.

10.3.4 Phylogeny Inference

Two distance-based tree creation algorithms, Unweighted Pair Group

Method with Arithmetic mean (UPGMA) and neighbour-joining (NJ)

[273] are provided and trees from these methods can be displayed or ex-

ported. Maximum parsimony and maximum likelihood algorithms can

be applied to nucleotide or amino acid alignments through an interface

to the Phylip package [86]. As properly implemented maximum likeli-

hood methods are the best vehicles for statistical inference of evolution-

ary relationships among species from sequence data, several maximum

likelihood functions have been explicitly implemented in MBEToolbox.

These functions allow users to incorporate various evolutionary models,

estimate parameters and compare different evolutionary trees.

The simplest case of estimation of the evolutionary distance between

two sequences, s1 and s2, can be considered as the estimation of the

branch length (the number of substitutions along a branch) separating

ancestor and descendent nodes. Branch lengths, relative to a calibrated

molecular clock, can reveal the time interval for this separation. A con-

tinuous time Markov process is generally used to model evolution along

the branch from s1 to s2. A transition rate matrix, Q, is used to indicate

the rate of changing from one state to another. For a specified time in-

terval or distance, t, the transition probability matrix is calculated from

$P(t) = e^{Qt}$. If there are N sites, the full likelihood is

$$L = \prod_{i=1}^{N} \pi_{s^1_i} \, P(s^1_i \to s^2_i, t)$$

In this equation, $s^1_i$ and $s^2_i$ are the ith bases of sequences 1
and 2 respectively; $\pi_{s^1_i}$ is the expected frequency of base
$s^1_i$.

In MBEToolbox, to calculate the likelihood, L, at a given time interval

(or distance) t, we have to specify a substitution model by using an appro-

priate model defining function, such as modeljc, modelk2p or modelgtr

for non-coding nucleotides, modeljtt or modeldayhoff for amino acids,

or modelgy94 for codons. These functions return a model structure com-

posed of an instantaneous rate matrix, R, and an equilibrium frequency

vector, pi which give Q, (Q=R*diag(pi)). Once the model is specified,

the function likelidist(t,model,s1,s2) can calculate the log likeli-

hood of the alignment of the two sequences, s1 and s2, with respect to

the time or distance, t, under the substitution model, model.

In most cases we wish to estimate t instead of calculating L as a func-

tion of t, so the function optimlikelidist(model,s1,s2) will search for

the t that maximises the likelihood by using the Nelder-Mead simplex (di-

rect search) method, while holding the other parameters in the model at

fixed values. This constraint can be relaxed by allowing every parameter

in the model to be estimated by functions, such as optimlikelidistk2p,

that can estimate both t and the model’s parameters. Figure 10.2(a and

b) illustrates the estimation of the evolutionary distance between two

ribonuclease genes through the fixed- and free-parameter K2P models,

respectively. When the K2P model’s parameter, kappa, is fixed, the re-

sult and trace of the optimisation process is illustrated by the graph of

L and t (Fig. 10.2a). When kappa is a free parameter, a surface shows

the result and trace of the optimisation process (Fig. 10.2b).
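The idea of searching for the distance that maximises the likelihood can be sketched in Python for the simplest case. This illustration is not the toolbox's likelidist or optimlikelidist: it assumes the JC69 model, for which P(t) has a closed form, uses made-up sequences, and replaces the Nelder-Mead simplex method with a one-dimensional golden-section search. All function names are hypothetical.

```python
import math

def jc_log_likelihood(t, s1, s2):
    """ln L(t) = sum_i ln[pi(s1_i) * P(s1_i -> s2_i, t)] under JC69 (pi = 1/4)."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e
    p_diff = 0.25 - 0.25 * e
    return sum(math.log(0.25 * (p_same if a == b else p_diff))
               for a, b in zip(s1, s2))

def golden_section_max(f, lo, hi, tol=1e-8):
    """Locate the maximum of a unimodal function f on the interval [lo, hi]."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - ratio * (b - a)
    d = a + ratio * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b = d                         # maximum lies in [a, d]
        else:
            a = c                         # maximum lies in [c, b]
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
    return (a + b) / 2.0

s1 = "ATGGCCAAGTTCGACCTGAT"               # made-up sequences for illustration
s2 = "ATGGCTAAGTTTGACCTAAT"

t_hat = golden_section_max(lambda t: jc_log_likelihood(t, s1, s2), 1e-6, 5.0)

# For JC69 the maximum likelihood distance also has a closed form.
p = sum(a != b for a, b in zip(s1, s2)) / len(s1)
t_closed_form = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(t_hat, t_closed_form)
```

Under JC69 the numerical optimum should coincide with the analytical distance, which makes this a convenient check on any optimiser before it is applied to models without a closed-form solution.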

When calculating the likelihood of a phylogenetic tree, where s1 and

s2 are two (descendant) nodes in a tree joined to an internal (ancestor)

node, sa, we must sum over all possible assignments of nucleotides to sa

to get the likelihood of the distance between s1 and s2. Consequently,

the number of possible combinations of nucleotides becomes too large to

be enumerated for even moderately sized trees. The pruning algorithm introduced by Felsenstein [85] takes advantage of the tree topology to evaluate the summation in a computationally efficient (but mathematically equivalent) manner. This and a simple and elegant mapping from a ‘parentheses’ encoding of a tree to the matrix equation for calculating the likelihood of a tree, developed in the Matlab software, PhylLab [271], have been adopted in likelitree.

Figure 10.2: Log-likelihood of evolutionary distance. (a) Likelihood as function of K2P distance. Distance is estimated by maximising likelihood of the alignment when the bias of transition and transversion, kappa, is fixed. (b) Likelihood as function of distance and kappa. Both distance and kappa are numerically optimised simultaneously to give maximum likelihood. The maximum likelihood peaks are marked with *. The two sequences used are coding regions of two mammalian ribonuclease genes, enc, of 474 bp. Axes: distance (substitutions/site) and, in (b), kappa, against ln(Likelihood).
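The pruning idea can be illustrated at a single site. The following is a minimal sketch, not the likelitree or PhylLab implementation: it assumes the JC69 model and a hypothetical three-tip tree in which s1 and s2 join at sa, and sa joins a third tip, s3, at the root. The branch lengths and tip states are invented for illustration. The pruned likelihood is checked against the explicit sum over all assignments of nucleotides to the two internal nodes.

```matlab
% JC69 transition probability matrix for a branch of length t
jc = @(t) (0.25 - 0.25*exp(-4*t/3))*ones(4) + exp(-4*t/3)*eye(4);

% Tip likelihood vectors: 1 for the observed state, 0 otherwise
I4  = eye(4);
tip = @(state) I4(:, state);

% Hypothetical observed states (1=A, 2=C, 3=G, 4=T) and branch lengths
x1 = 1;  x2 = 2;  x3 = 1;
t1 = 0.1;  t2 = 0.2;  ta = 0.05;  t3 = 0.3;

% Pruning: conditional likelihoods are passed from the tips to the root
La = (jc(t1)*tip(x1)) .* (jc(t2)*tip(x2));   % at the internal node sa
Lr = (jc(ta)*La)      .* (jc(t3)*tip(x3));   % at the root
L_prune = sum(0.25*Lr);                      % equal base frequencies

% Brute force: sum over all states of the root (r) and of sa (a)
P1 = jc(t1);  P2 = jc(t2);  Pa = jc(ta);  P3 = jc(t3);
L_brute = 0;
for r = 1:4
    for a = 1:4
        L_brute = L_brute + 0.25*Pa(r,a)*P1(a,x1)*P2(a,x2)*P3(r,x3);
    end
end
```

Both routes give the same site likelihood; the pruning route needs only one matrix-vector product per branch, which is why it scales to larger trees.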

10.3.5 Combination of functions

Basic operations can be combined to give more complicated functions.

A simple combination of the function to extract the fourfold degenerate

sites with the function to calculate GC content produces a new function

(countgc4) that determines the GC content at 4-fold degenerate sites

(GC4). A subfunction for calculating synonymous and nonsynonymous

differences between two codons, getsynnonsyndiff, can be converted

into a program for calculating codon volatility [251] with trivial effort.

Similarly, karlinsig, which returns Karlin’s genomic signature (the dinucleotide relative abundance or bias) for a given sequence, can be easily reformulated to estimate relative di-codon frequencies, which may be a new index of biological signals in a coding sequence. In addition, the menu-driven user interface, MBEGUI, is a good example of the power of combining basic MBEToolbox functions.
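The GC4 example above can be sketched as two composed steps: select the third positions of fourfold degenerate codons, then compute their GC content. This is an illustrative sketch for a single in-frame sequence, not the code of countgc4; the sequence and the variable names are hypothetical, and the list of codon families assumes the standard genetic code.

```matlab
% Hypothetical in-frame coding sequence (illustrative only)
seq = 'CTGGCCACGTTTGGAAAA';

% Step 1: extract fourfold degenerate sites.
% One codon per row; under the standard genetic code a codon is fourfold
% degenerate at its third position when its first two bases are one of:
codons = reshape(seq, 3, [])';
fam4   = {'CT','GT','TC','CC','AC','GC','CG','GG'};
is4    = ismember(cellstr(codons(:,1:2)), fam4);
third  = codons(is4, 3)';          % third-position bases at those codons

% Step 2: GC content of the extracted sites
gc4 = sum(third == 'G' | third == 'C') / numel(third);
```

For this sequence the fourfold degenerate codons are CTG, GCC, ACG and GGA, so three of the four extracted sites are G or C.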

10.3.6 Graphics and GUI

Good visualisation is essential for successful numerical model building.

Leveraging the rich graphics functionality of Matlab, MBEToolbox pro-

vides a number of functions that can be used to create graphic output,

such as scatterplots of Ks vs Ka, plots of the number of transitions and

transversions against genetic distance, sliding window analyses on a nu-

cleotide sequence and the Z-curve (a 3-dimensional curve representation

of a DNA sequence [372]). A simple menu-driven graphical user interface (GUI) has been developed by using GUIDE (Graphical User Interface Development Environment) in Matlab. The top menu includes File,

Sequences, Distances, Phylogeny, Graph, Polymorphism and Help sub-

menus (Fig. 10.3). It aids the usage of the most frequently required

functions so that users do not have to run any scripts or functions from

the Matlab command line in most cases.
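As an illustration of the kind of data behind such plots, the coordinates of the Z-curve mentioned above can be computed from cumulative base counts. This is a minimal sketch using the usual definition of the three components, not the toolbox's own plotting function; the sequence is hypothetical.

```matlab
% Hypothetical DNA sequence (illustrative only)
seq = 'ATGCGCGATTAACCGT';

% Cumulative counts of each base along the sequence
A = cumsum(double(seq == 'A'));
C = cumsum(double(seq == 'C'));
G = cumsum(double(seq == 'G'));
T = cumsum(double(seq == 'T'));

% Z-curve components at each position n
x = (A + G) - (C + T);   % purines versus pyrimidines
y = (A + C) - (G + T);   % amino versus keto bases
z = (A + T) - (C + G);   % weak versus strong hydrogen bonding

% plot3(x, y, z) would draw the three-dimensional curve
```

Because the example sequence contains four copies of each base, the curve starts at (1, 1, 1) for the initial A and returns to the origin at the last position.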


Only a few Matlab toolboxes or functions are freely available for data analysis, exploration, and visualisation of nucleotide and protein sequences. MBEToolbox is presented here to fulfil the most obvious needs in sequence manipulation, genetic distance estimation and phylogeny inference in the Matlab environment. Moreover, it is an extensible functional framework for formulating and solving problems in evolutionary data analysis; it facilitates the rapid construction of both general applications and special-purpose tools for computational biologists in a fraction of the time it would take to write a program in a scalar, non-interactive language such as C or FORTRAN.

Figure 10.3: MBEToolbox GUI. (a) Distances submenu; (b) Phylogeny submenu; and (c) Graph submenu.

10.4.1 Vectorisation simplifies programming

Matlab is a matrix language, which means it is designed for vector and matrix operations. Programming can be simplified and made more efficient by using algorithms that take advantage of vectorisation (converting for and while loops to the equivalent vector or matrix operations). The Matlab compiler in version 7.0 will automatically recognise and vectorise loops without recursion. An example of vectorisation is the calculation of Z-scores [246] for Smith-Waterman alignments [291] to give a measure of the significance of an alignment score against a background of scores from randomly generated sequences with the same composition and length. Hence, Z-scores are designed to overcome the bias due to the composition of the alignment and are usually calculated by comparing an actual alignment score with the scores obtained on a set of random sequences generated by a Monte-Carlo process. The Z-score is defined as:

Z(A, B) = (S(A,B)−mean)/standard deviation

where S(A,B) is the Smith-Waterman (S-W) score between two se-

quences A and B. The mean and standard deviation are taken from

realignments of the permuted sequences. The algorithm is implemented

as follows in Matlab with as few as 15 lines of code:

function [z,z_raw]=zscores(s1,s2,nboot)
% Z-score of the S-W score between s1 and s2, estimated
% from nboot random permutations of each sequence.
m1=length(s1);
m2=length(s2);
% Initialise two vectors holding the S-W scores of
% s1_rep and s2_rep, i.e., replicate (permuted) samples
% of sequences s1 and s2.
v_z1=zeros(1,nboot);
v_z2=zeros(1,nboot);
z_raw=smithwaterman(s1,s2);   % score of the actual alignment
for k=1:nboot
    s1_rep=s1(:,randperm(m1));            % shuffle s1, keep s2
    v_z1(1,k)=smithwaterman(s1_rep, s2);
    s2_rep=s2(:,randperm(m2));            % shuffle s2, keep s1
    v_z2(1,k)=smithwaterman(s1, s2_rep);
end
z1=(z_raw-mean(v_z1))./std(v_z1);
z2=(z_raw-mean(v_z2))./std(v_z2);
z=min(z1,z2);                 % report the smaller of the two Z-scores

where randperm(n) is a vector function returning a random permutation

of the integers from 1 to n and smithwaterman performs local alignment

by the standard dynamic programming technique.
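Vectorisation in the sense used at the start of this section, replacing a loop by equivalent vector operations, can be shown on a smaller task. The sketch below counts transitions and transversions between two aligned sequences, first with a loop and then with logical vectors. The sequences and variable names are hypothetical, and the sketch assumes gap-free sequences containing only A, C, G and T.

```matlab
% Two hypothetical aligned sequences of equal length (illustrative only)
s1 = 'ACGTACGTAC';
s2 = 'ATGCACGAAT';

isPurine = @(c) c == 'A' | c == 'G';

% Loop version: inspect one site at a time
ts_loop = 0;  tv_loop = 0;
for k = 1:length(s1)
    if s1(k) ~= s2(k)
        if isPurine(s1(k)) == isPurine(s2(k))
            ts_loop = ts_loop + 1;   % purine<->purine or pyrimidine<->pyrimidine
        else
            tv_loop = tv_loop + 1;   % purine<->pyrimidine
        end
    end
end

% Vectorised version: the same logic applied to whole vectors at once
differs   = s1 ~= s2;
sameClass = isPurine(s1) == isPurine(s2);
ts_vec = sum(differs &  sameClass);
tv_vec = sum(differs & ~sameClass);
```

Both versions give three transitions and one transversion for these sequences; the vectorised form is shorter and avoids explicit indexing.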

10.4.2 Extensibility

An important distinction between compiled languages with subroutine

libraries and interactive environments like Matlab is the ease with which

problems can be specified and solved in the latter. Moreover, Matlab

toolboxes are traditionally organised in a less object-oriented mode and,

consequently, functions are more independent of each other and easier to

combine and extend. Several examples were given in the Implementation

section.

10.4.3 Comparison with other toolboxes

Some other toolboxes have been developed in Matlab for bioinformatics

related analyses. These include PhylLab [271] and MatArray [327]

as well as the bioinformatics toolbox developed by MathWorks. Other

examples can be found at the link and file exchange maintained at Mat-

lab Central [42]. PhylLab is a molecular phylogeny toolbox which

also provides some functions for sequence and tree input and manipula-

tion. Its main focus is on creating a maximum likelihood tree based on

Bayesian principles using a Markov chain Monte Carlo method to com-

pute posterior parameter distributions. MatArray is focussed on the

analysis of gene expression data from microarrays and provides normali-

sation and clustering functions but does not address molecular evolution.

The bioinformatics toolbox from MathWorks provides a range of bioin-

formatics functions, including some related to molecular evolution.

MBEToolbox provides a much broader range of molecular evolution

related functions and phylogenetic methods than either the more specialised PhylLab project or the more general bioinformatics toolbox from

MathWorks. These extra functions include IO in Phylip format, sta-

tistical and sequence manipulation functions relevant to molecular evo-

lution (e.g. count segregating sites), evolutionary distance calculation

for nucleic and amino acid sequences, phylogeny inference functions and

graphic plots relevant to molecular evolution (e.g. Ka vs Ks). As such

it makes an important contribution to the bioinformatics analyses that

can be performed in the Matlab environment.

10.4.4 A novel enhanced window analysis

To test for the selective pressures in the different lineages of a phyloge-

netic tree, the nonsynonymous to synonymous rate ratio (Ka/Ks) is nor-

mally estimated [281, 4, 61]. Values of Ka/Ks = 1, > 1, or < 1 indicate

neutrality, positive selection, or purifying selection, respectively. How-

ever, Ks and Ka are measurements of average synonymous and nonsyn-

onymous substitutions per site along the whole length of the sequences.

Average Ks and Ka values give neither the pattern of intragenic fluc-

tuation of selective constraints, nor region- or site-specific information.

A sliding window method is usually adopted to examine the intragenic

pattern of the substitution rates and to test for the occurrence of signifi-

cant clusters of variant regions [55, 145, 80, 53]. Significant heterogeneity

in Ks would indicate that the neutral substitution rate varies across the

gene, whereas heterogeneity in Ka may indicate that selective constraints

vary along the gene. The results and accuracy of sliding window meth-

ods, either overlapping or non-overlapping, depend on both the size of

the window and the moving distance adopted. Large window lengths

may obliterate the details of patterns in Ks or Ka, whereas small window lengths usually result in larger statistical fluctuations. Hence, the resolution of a sliding window is usually limited.

Figure 10.4: Comparison between sliding window and enhanced sliding window methods. Sliding window analysis of Ks and Ka for the concatenated coding regions of two hepatitis C virus strains, HCV-JS and HCV-JT. The number of codons for the C, E1, E2, NS2, NS3, NS4, NS5A, and NS5B genes are 191, 192, 426, 217, 631, 315, 447, and 591, respectively. The different coding regions are separated by vertical lines. (a) illustrates the result of a normal sliding window analysis; (b) illustrates the result of the enhanced sliding window analysis. Beginnings and ends of regions poor in synonymous substitutions (slope < 0) are indicated by the arrows a and b (genes C and E1) and e and f (gene NS5B). A region rich in synonymous substitutions (slope > 0) in gene NS3 is indicated by arrows c and d. Axes: codon site against (a) substitution number per site and (b) transformed substitution number per site, for synonymous (syn) and nonsynonymous (nonsyn) substitutions.
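The dependence on window size and moving distance can be seen in a minimal sketch of a sliding window. The per-codon values below are hypothetical, and the window statistic is simply their mean; a real analysis would estimate Ks or Ka within each window with a distance method, which is not shown here.

```matlab
% Hypothetical per-codon substitution counts (illustrative only)
k = [0 1 0 0 2 1 0 0 0 1 1 0];

w    = 4;   % window size (codons)
step = 2;   % moving distance (codons); step < w gives overlapping windows

% Start position of every complete window
starts = 1:step:(numel(k) - w + 1);

% Mean value within each window
win = zeros(size(starts));
for i = 1:numel(starts)
    win(i) = mean(k(starts(i):starts(i) + w - 1));
end
```

With w = 4 and step = 2 the twelve codons yield five overlapping windows; a larger w smooths the profile further, while a smaller w follows individual codons more closely and fluctuates more.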

A mathematical formalism, similar to the Z’-curve [368], is introduced here to solve this problem. Consider a subsequence-based analysis of Ks or Ka. In the n-th step, count the cumulative number of Ks or Ka substitutions occurring from the first to the n-th nucleotide position in the gene sequences being inspected. Let K denote either Ks or Ka and K(n) denote the cumulative K at the n-th sequence position. K(n) is usually an approximately linear, monotonically increasing function of n. The points (n, K(n)), n = 1, 2, ..., N, are fitted by the least-squares method to a linear function, f(n) = βn, to give a straight line with slope β. We define

K′(n) = K(n) − βn

The two-dimensional curve of (K′(n) ∼ n) gives an alternative representation of the normal sliding window curve.

To compare these two curve representations, the example dataset of

Suzuki and Gojobori [303], which contains the coding regions of two

hepatitis C virus strains (HCV-JS - Genbank Acc.: D85516 and HCV-

JT - Genbank Acc.: D11168), was used. The entire coding sequence is

divided into eight regions (C, E1, E2, NS2, NS3, NS4, NS5A, NS5B).

Some of the coding regions have been combined as these short ORFs are

unlikely to yield meaningful Ks and Ka values. The reduction of Ks

in the C, E1 and NS5B regions, as well as its elevation in NS3, which

have been shown in previous studies [303], are not clear in a standard

sliding window representation (Fig. 10.4a). In contrast, a sharp increase in the (K′(n) ∼ n) curve (Fig. 10.4b) indicates an increase in K, while a drop in the curve indicates a decrease in K. This new method has

been implemented in the function plotSlidingKaKs. Since it is derived

from the sliding window method, it is called the enhanced sliding window

method.
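The transformation K′(n) = K(n) − βn can be sketched in a few lines. This is an illustration of the formula only, not the code of plotSlidingKaKs; the per-codon values are hypothetical, and β is obtained as the least-squares slope of a line through the origin.

```matlab
% Hypothetical per-codon substitution counts with a substitution-rich
% region in the middle (illustrative only)
k = [0 0 0 0 1 1 1 1 0 0 0 0];

n = 1:numel(k);
K = cumsum(k);                      % cumulative count K(n)

% Least-squares slope of the line f(n) = beta*n fitted to (n, K(n))
beta = sum(n .* K) / sum(n .^ 2);

% Transformed curve: falls where the local rate is below beta,
% rises where it is above beta
Kp = K - beta * n;
```

In this example the transformed curve falls over the first four codons, rises across the substitution-rich codons 5 to 8, and falls again afterwards, so the boundaries of the region appear as changes in the sign of the slope.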

10.4.5 Limitations

The current version of this toolbox lacks novel algorithms, yet it implements a variety of existing algorithms. There are some limitations in the practical use of MBEToolbox. First, although the toolbox provides many methods for sequence and evolutionary analyses, the full range of these features can only be accessed through the Matlab command line interface, as in the majority of Matlab packages. Second, some of the functions cannot handle ambiguous nucleotide or amino acid codes in the sequences. Future development of MBEToolbox will address these limitations.

In summary, the MBEToolbox project is an ongoing effort to provide an easy-to-use yet powerful analysis environment for molecular biology and evolution. Currently, it offers a solid set of frequently used functions to manipulate sequences, calculate genetic distances, infer phylogenetic trees and perform related analyses. MBEToolbox is a useful tool that should encourage evolutionary biologists to take advantage of Matlab. Moreover, it has been widely applied in data analysis in the Penicillium marneffei genome project, as mentioned on pages 73, 113, 146, 161 and 190.

Chapter 11

CONCLUDING REMARKS

In this last chapter I summarise the conclusions of the preceding chapters and give recommendations for future research. Chapter 1 presented the draft genome of the important thermally dimorphic fungus Penicillium marneffei. A number of features of the pathogenic fungus have been uncovered.

The similarity between the mitochondrial genomes of P. marneffei and nonpathogenic Aspergillus species (Chapter 3) suggests that P. marneffei is more closely related to moulds than to yeasts, which is consistent with the established classification. No direct association between mitochondrion-encoded genetic components and pathogenicity was observed. Moreover, the in silico evidence for the capability of melanin biosynthesis in P. marneffei (Chapter 4) should inspire further research towards the experimental elucidation of melanin’s role in fungal virulence. Based on these computational findings, gene knockout and in vivo animal survival analyses are being undertaken in our department. The possible presence of a sexual cycle in P. marneffei reported in Chapter 5 is highly significant for genetic studies of the fungus, since the sexual cycle could be a useful genetic tool for studying the way in which the fungus causes disease. On the other hand, if the fungus does reproduce sexually as part of its life cycle, it might evolve more rapidly to become resistant to antifungal drugs, because sex might create new strains with an increased ability to cause disease and infect humans. Chapter 6 explored our current knowledge of the genetic components related to fungal morphogenesis, with an emphasis on the molecular mechanism of dimorphic switching. More research is nevertheless required in the following directions: (i) perception of external stimuli by cellular sensors; (ii) transduction of the biochemical signal; (iii) alteration of genomic expression; and (iv) structural reorganisation towards the morphological change. The task of explaining dimorphic switching is still far from accomplished. The presence of over-abundant intragenic tandem repeats (IntraTRs) in the P. marneffei genome is a striking finding (Chapter 7). The IntraTRs may create quantitative alterations in phenotypes (e.g., adhesion, flocculation or biofilm formation). The variation resulting from such quantitative alterations of the fungal cell surface may have allowed the fungus to ‘disguise’ itself in order to slip past the host immune system’s vigilant defences. Many P. marneffei proteins contain tandemly repeated domains or motifs with some degree of homology to the Plasmodium erythrocyte-binding protein domain.

The area of gene and genome duplication and its evolutionary significance has attracted considerable attention from researchers in recent years. Chapter 8 represents a novel contribution to the field by presenting a description of gene duplication in five ascomycetes. We calculated the rates of synonymous and nonsynonymous substitution using a codon substitution model and reported large variation in the proportion of genes in multigene families across these fungi. We also suggest that paralogs of filamentous fungi are under less selective constraint than orthologs (although this does not hold for yeasts), that there is a lack of evidence for an association between asymmetry in rates of evolution and positive selection, and finally that different extents and consequences of gene duplication may explain some of the phenotypic variation among the ascomycetes. One new conclusion, that P. marneffei may have undergone a whole-genome duplication, is not solidly supported by the evidence presented so far; analysis of gene order information will be necessary to support the claim once the P. marneffei genome sequencing approaches completion. Moreover, at the time the analysis was performed, the Aspergillus genomes remained unpublished; the underlying data might change, and results from a premature analysis might be hard to reproduce or become obsolete. Therefore, no Aspergillus genome was included in the comparison; further analyses of this sort should overcome this limitation.

In addition, in Chapter 9 we analysed genes with various degrees of conservation among species, as measured by the lineage specificity (LS) of genes. We examined the correlations between evolutionary rate and LS, as well as several other related factors, such as expression, essentiality, and protein-protein interactions. We found that, in seven ascomycete genomes, the more lineage-specific a gene, the higher its evolutionary rate. This is taken as evidence for the hypothesis that orphan genes arise as a result of a higher rate of evolution. This general rule also helps to explain the origin of P. marneffei-specific genes.

Finally, two software products, the P. marneffei genome database and MBEToolbox for sequence data analysis, have been developed (Chapters 2 and 10). Together they cover two major aspects of bioinformatics, i.e., biological database management systems and algorithm development. They have been applied successfully throughout the whole genome project and proved to be efficient and sufficient.

In conclusion, the boom in fungal genome sequence data over the past few years came with high expectations for new insights into fungal biology and pathogen control strategies. In the case of P. marneffei, it has become evident that computational approaches can be used to decipher the genome and so derive biological meaning and infer evolutionary processes. This work paves the way for a systematic experimental study of the pathogenic fungus.

BIBLIOGRAPHY

[1] N. Adames, K. Blundell, M. N. Ashby, and C. Boone. Role of yeast insulin-degrading enzyme homologs in propheromone processing and bud site selection. Science, 270(5235):464–7, 1995.

[2] M. D. Adams, S. E. Celniker, R. A. Holt, C. A. Evans, J. D. Gocayne, P. G. Amanatides, S. E. Scherer, P. W. Li, R. A. Hoskins, R. F. Galle, R. A. George, S. E. Lewis, S. Richards, M. Ashburner, S. N. Henderson, G. G. Sutton, J. R. Wortman, M. D. Yandell, Q. Zhang, L. X. Chen, R. C. Brandon, Y. H. Rogers, R. G. Blazej, M. Champe, B. D. Pfeiffer, K. H. Wan, C. Doyle, E. G. Baxter, G. Helt, C. R. Nelson, G. L. Gabor, J. F. Abril, A. Agbayani, H. J. An, C. Andrews-Pfannkoch, D. Baldwin, R. M. Ballew, A. Basu, J. Baxendale, L. Bayraktaroglu, E. M. Beasley, K. Y. Beeson, P. V. Benos, B. P. Berman, D. Bhandari, S. Bolshakov, D. Borkova, M. R. Botchan, J. Bouck, P. Brokstein, P. Brottier, K. C. Burtis, D. A. Busam, H. Butler, E. Cadieu, A. Center, I. Chandra, J. M. Cherry, S. Cawley, C. Dahlke, L. B. Davenport, P. Davies, B. de Pablos, A. Delcher, Z. Deng, A. D. Mays, I. Dew, S. M. Dietz, K. Dodson, L. E. Doup, M. Downes, S. Dugan-Rocha, B. C. Dunkov, P. Dunn, K. J. Durbin, C. C. Evangelista, C. Ferraz, S. Ferriera, W. Fleischmann, C. Fosler, A. E. Gabrielian, N. S. Garg, W. M. Gelbart, K. Glasser, A. Glodek, F. Gong, J. H. Gorrell, Z. Gu, P. Guan, M. Harris, N. L. Harris, D. Harvey, T. J. Heiman, J. R. Hernandez, J. Houck, D. Hostin, K. A. Houston, T. J. Howland, M. H. Wei, C. Ibegwam, et al. The genome sequence of drosophila melanogaster. Science, 287(5461):2185–95, 2000.

[3] L. Ajello, A. A. Padhye, S. Sukroongreung, C. H. Nilakul, and S. Tantimavanic. Occurrence of penicillium marneffei infections among wild bamboo rats in thailand. Mycopathologia, 131(1):1–8, 1995.

[4] H. Akashi. Within- and between-species dna sequence variation and the ‘footprint’ of natural selection. Gene, 238:39–51, 1999.

[5] J. A. Alspaugh, L. M. Cavallo, J. R. Perfect, and J. Heitman. Ras1 regulates filamentation, mating and growth at high temperature of cryptococcus neoformans. Mol Microbiol, 36(2):352–65, 2000.

[6] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389–402, 1997.

[7] M. A. Andrade, N. P. Brown, C. Leroy, S. Hoersch, A. de Daruvar, C. Reich, A. Franchini, J. Tamames, A. Valencia, C. Ouzounis, and C. Sander. Automated genome sequence analysis and annotation. Bioinformatics, 15(5):391–412, 1999.

[8] L. Aravind, H. Watanabe, D. J. Lipman, and E. V. Koonin. Lineage-specific loss and divergence of functionally linked genes in eukaryotes. Proc Natl Acad Sci U S A, 97(21):11319–24, 2000.

[9] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. the gene ontology consortium. Nat Genet, 25(1):25–9, 2000.

[10] C. R. Astell, L. Ahlstrom-Jonasson, M. Smith, K. Tatchell, K. A. Nasmyth, and B. D. Hall. The sequence of the dnas coding for the mating-type loci of saccharomyces cerevisiae. Cell, 27(1 Pt 2):15–23, 1981.

[11] J. Baker, J. McCarthy, M. Gatton, D. E. Kyle, V. Belizario, J. Luchavez, D. Bell, and Q. Cheng. Genetic diversity of plasmodium falciparum histidine-rich protein 2 (pfhrp2) and its effect on the performance of pfhrp2-based rapid diagnostic tests. J Infect Dis, 192(5):870–7, 2005.

[12] A. D. Basehoar, S. J. Zanton, and B. F. Pugh. Identification and distinct regulation of yeast tata box-containing genes. Cell, 116(5):699–709, 2004.

[13] A. Bateman, L. Coin, R. Durbin, R. D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E. L. Sonnhammer, D. J. Studholme, C. Yeats, and S. R. Eddy. The pfam protein families database. Nucleic Acids Res, 32(Database issue):D138–41, 2004.

[14] D. H. Beach and A. J. Klar. Rearrangements of the transposable mating-type cassettes of fission yeast. Embo J, 3(3):603–10, 1984.

[15] G. Bejerano and G. Yona. Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17(1):23–43, 2001.

[16] R. J. Bennett and S. C. West. Ruvc protein resolves holliday junctions via cleavage of the continuous (noncrossover) strands. Proc Natl Acad Sci U S A, 92(12):5635–9, 1995.

[17] P. Bork, T. Dandekar, Y. Diaz-Lazcoz, F. Eisenhaber, M. Huynen, and Y. Yuan. Predicting function: from genes to genomes and back. J Mol Biol, 283(4):707–25, 1998.

[18] A. R. Borneman, M. J. Hynes, and A. Andrianopoulos. The abaa homologue of penicillium marneffei participates in two developmental programmes: conidiation and dimorphic growth. Mol Microbiol, 38(5):1034–47, 2000.

[19] A. R. Borneman, M. J. Hynes, and A. Andrianopoulos. An ste12 homolog from the asexual, dimorphic fungus penicillium marneffei complements the defect in sexual development of an aspergillus nidulans stea mutant. Genetics, 157(3):1003–14, 2001.

[20] A. R. Borneman, M. J. Hynes, and A. Andrianopoulos. A basic helix-loop-helix protein with similarity to the fungal morphological regulators, phd1p, efg1p and stua, controls conidiation but not dimorphic growth in penicillium marneffei. Mol Microbiol, 44(3):621–31, 2002.

[21] V. L. Boyartchuk, M. N. Ashby, and J. Rine. Modulation of ras and a-factor function by carboxyl-terminal proteolysis. Science, 275(5307):1796–800, 1997.

[22] K. J. Boyce, M. J. Hynes, and A. Andrianopoulos. The cdc42 homolog of the dimorphic fungus penicillium marneffei is required for correct cell polarization during growth but not development. J Bacteriol, 183(11):3447–57, 2001.

[23] K. J. Boyce, M. J. Hynes, and A. Andrianopoulos. The ras and rho gtpases genetically interact to co-ordinately regulate cell polarity during development in penicillium marneffei. Mol Microbiol, 55(5):1487–501, 2005.

[24] A. A. Brakhage, K. Langfelder, G. Wanner, A. Schmidt, and B. Jahn. Pigment biosynthesis and virulence. Contrib Microbiol, 2:205–15, 1999.

[25] B. J. Breitkreutz, C. Stark, and M. Tyers. The grid: the general repository for interaction datasets. Genome Biol, 4(3):R23, 2003.

[26] C. Brenner and R. S. Fuller. Structural and enzymatic characterization of a purified prohormone-processing enzyme: secreted, soluble kex2 protease. Proc Natl Acad Sci U S A, 89(3):922–6, 1992.

[27] J. Brosius and S. J. Gould. On “genomenclature”: a comprehensive (and respectful) taxonomy for pseudogenes and other “junk dna”. Proc Natl Acad Sci U S A, 89(22):10706–10, 1992.

[28] D. W. Brown, J. H. Yu, H. S. Kelkar, M. Fernandes, T. C. Nesbitt, N. P. Keller, T. H. Adams, and T. J. Leonard. Twenty-five coregulated transcripts define a sterigmatocystin gene cluster in aspergillus nidulans. Proc Natl Acad Sci U S A, 93(4):1418–22, 1996.

[29] T. A. Brown, R. B. Waring, C. Scazzocchio, and R. W. Davies. The aspergillus nidulans mitochondrial genome. Curr Genet, 9(2):113–7, 1985.

[30] C. Burge and S. Karlin. Prediction of complete gene structures in human genomic dna. J Mol Biol, 268(1):78–94, 1997.

[31] M. Burset and R. Guigo. Evaluation of gene structure prediction programs. Genomics, 34(3):353–67, 1996.

[32] H. Bussey. Proteases and the processing of precursors to secreted proteins in yeast. Yeast, 4(1):17–26, 1988.

[33] H. J. Bussink and S. A. Osmani. A cyclin-dependent kinase family member (phoa) is required to link developmental fate to environmental conditions in aspergillus nidulans. Embo J, 17(14):3990–4003, 1998.

[34] E. T. Buurman, C. Westwater, B. Hube, A. J. Brown, F. C. Odds, and N. A. Gow. Molecular analysis of camnt1p, a mannosyl transferase important for adhesion and virulence of candida albicans. Proc Natl Acad Sci U S A, 95(13):7670–5, 1998.

[35] J. J. Cai, D. K. Smith, X. Xia, and K. Y. Yuen. Mbetoolbox: a matlab toolbox for sequence data analysis in molecular biology and evolution. BMC Bioinformatics, 6(1):64, 2005.

[36] R. Calderone. Molecular pathogenesis of fungal infections. Trends Microbiol, 2(12):461–3, 1994.

[37] L. Cao, C. M. Chan, C. Lee, S. S. Wong, and K. Y. Yuen. Mp1 encodes an abundant and highly antigenic cell wall mannoprotein in the pathogenic fungus penicillium marneffei. Infect Immun, 66(3):966–73, 1998.

[38] L. Cao, K. M. Chan, D. Chen, N. Vanittanakom, C. Lee, C. M. Chan, T. Sirisanthana, D. N. Tsang, and K. Y. Yuen. Detection of cell wall mannoprotein mp1p in culture supernatants of penicillium marneffei and in sera of penicilliosis patients. J Clin Microbiol, 37(4):981–6, 1999.

[39] L. Cao, D. L. Chen, C. Lee, C. M. Chan, K. M. Chan, N. Vanittanakom, D. N. Tsang, and K. Y. Yuen. Detection of specific antibodies to an antigenic mannoprotein for diagnosis of penicillium marneffei penicilliosis. J Clin Microbiol, 36(10):3028–31, 1998.

[40] T. J. Carver, K. M. Rutherford, M. Berriman, M. A. Rajandream, B. G. Barrell, and J. Parkhill. Act: the artemis comparison tool. Bioinformatics, 21(16):3422–3, 2005.

[41] L. L. Cavalli-Sforza and A. W. Edwards. Phylogenetic analysis. models and estimation procedures. Am J Hum Genet, 19(3):Suppl 19:233+, 1967.

[42] MATLAB Central. Matlab central, 2005.

[43] C. M. Chan, P. C. Woo, A. S. Leung, S. K. Lau, X. Y. Che, L. Cao, and K. Y. Yuen. Detection of antibodies specific to an antigenic cell wall galactomannoprotein for serodiagnosis of aspergillus fumigatus aspergillosis. J Clin Microbiol, 40(6):2041–5, 2002.

[44] Y. F. Chan and T. C. Chow. Ultrastructural observations on penicillium marneffei in natural human infection. Ultrastruct Pathol, 14(5):439–52, 1990.

[45] S. Chariyalertsak, T. Sirisanthana, K. Supparatpinyo, and K. E. Nelson. Seasonal variation of disseminated penicillium marneffei infections in northern thailand: a clue to the reservoir? J Infect Dis, 173(6):1490–3, 1996.

[46] S. Chariyalertsak, T. Sirisanthana, K. Supparatpinyo, J. Praparattanapan, and K. E. Nelson. Case-control study of risk factors for penicillium marneffei infection in human immunodeficiency virus-infected patients in northern thailand. Clin Infect Dis, 24(6):1080–6, 1997.

[47] S. Chariyalertsak, P. Vanittanakom, K. E. Nelson, T. Sirisanthana, and N. Vanittanakom. Rhizomys sumatrensis and cannomys badius, new natural animal hosts of penicillium marneffei. J Med Vet Mycol, 34(2):105–10, 1996.

[48] D. Charlesworth, B. Charlesworth, and G. A. McVean. Genome sequences and evolutionary biology, a two-way interaction. Trends Ecol Evol, 16(5):235–242, 2001.

[49] P. Chen, S. K. Sapperstein, J. D. Choi, and S. Michaelis. Biogenesis of the saccharomyces cerevisiae mating pheromone a-factor. J Cell Biol, 136(2):251–69, 1997.

[50] C. S. Chim, C. Y. Fong, S. K. Ma, S. S. Wong, and K. Y. Yuen. Reactive hemophagocytic syndrome associated with penicillium marneffei infection. Am J Med, 104(2):196–7, 1998.

[51] R. J. Cho, M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, A. E. Gabrielian, D. Landsman, D. J. Lockhart, and R. W. Davis. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell, 2(1):65–73, 1998.

[52] C. Y. Choi, E. L. Schneider, J. M. Kim, I. Y. Gluzman, D. E. Goldberg, J. A. Ellman, and M. A. Marletta. Interference with heme binding to histidine-rich protein-2 as an antimalarial strategy. Chem Biol, 9(8):881–9, 2002.

[53] S. S. Choi and B. T. Lahn. Adaptive evolution of mrg, a neuron-specific gene family implicated in nociception. Genome Res, 13:2252–2259, 2003.

[54] P. Chongtrakool, S. C. Chaiyaroj, V. Vithayasai, S. Trawatcharegon, R. Teanpaisan, S. Kalnawakul, and S. Sirisinha. Immunoreactivity of a 38-kilodalton penicillium marneffei antigen with human immunodeficiency virus-positive sera. J Clin Microbiol, 35(9):2220–3, 1997.

[55] A. G. Clark and T. Kao. Excess nonsynonymous substitution at shared polymorphic sites among self-incompatibility alleles of solanaceae. Proc Natl Acad Sci USA, 88:9823–9827, 1991.

[56] P. Cliften, P. Sudarsanam, A. Desikan, L. Fulton, B. Fulton, J. Majors, R. Waterston, B. A. Cohen, and M. Johnston. Finding functional features in saccharomyces genomes by phylogenetic footprinting. Science, 301(5629):71–6, 2003.

[57] L. Coin, A. Bateman, and R. Durbin. Enhanced protein domain discovery by using language modeling techniques from speech recognition. Proc Natl Acad Sci U S A, 100(8):4516–20, 2003.

[58] L. J. Collins, A. M. Poole, and D. Penny. Using ancestral sequences to uncover potential gene homologues. Appl Bioinformatics, 2(3 Suppl):S85–95, 2003.

[59] G. C. Conant and A. Wagner. Asymmetric sequence divergence of duplicate genes. Genome Res, 13(9):2052–8, 2003.

[60] A. Cooper and H. Bussey. Characterization of the yeast kex1 gene product: a carboxypeptidase involved in processing secreted precursor proteins. Mol Cell Biol, 9(6):2706–14, 1989.

[61] KA Crandall, CR Kelsey, H Imamichi, HC Lane, and NP Salzman. Parallel evolution of drug resistance in hiv: failure of nonsynonymous/synonymous substitution rate ratio to detect selection. Mol Biol Evol, 16:372–382, 1999.

[62] J. Davey, K. Davis, M. Hughes, G. Ladds, and D. Powner. The processing of yeast pheromones. Semin Cell Dev Biol, 9(1):19–30, 1998.

[63] F. De Bernardis, S. Arancia, L. Morelli, B. Hube, D. Sanglard, W. Schafer, andA. Cassone. Evidence that members of the secretory aspartyl proteinase genefamily, in particular sap2, are virulence factors for candida vaginitis. J InfectDis, 179(1):201–8, 1999.

[64] R. A. Dean, N. J. Talbot, D. J. Ebbole, M. L. Farman, T. K. Mitchell, M. J. Orbach, M. Thon, R. Kulkarni, J. R. Xu, H. Pan, N. D. Read, Y. H. Lee, I. Carbone, D. Brown, Y. Y. Oh, N. Donofrio, J. S. Jeong, D. M. Soanes, S. Djonovic, E. Kolomiets, C. Rehmeyer, W. Li, M. Harding, S. Kim, M. H. Lebrun, H. Bohnert, S. Coughlan, J. Butler, S. Calvo, L. J. Ma, R. Nicol, S. Purcell, C. Nusbaum, J. E. Galagan, and B. W. Birren. The genome sequence of the rice blast fungus magnaporthe grisea. Nature, 434(7036):980–6, 2005.

[65] C. d’Enfert, S. Goyard, S. Rodriguez-Arnaveilhe, L. Frangeul, L. Jones, F. Tekaia, O. Bader, A. Albrecht, L. Castillo, A. Dominguez, J. F. Ernst, C. Fradin, C. Gaillardin, S. Garcia-Sanchez, P. de Groot, B. Hube, F. M. Klis, S. Krishnamurthy, D. Kunze, M. C. Lopez, A. Mavor, N. Martin, I. Moszer, D. Onesime, J. Perez Martin, R. Sentandreu, E. Valentin, and A. J. Brown. Candidadb: a genome database for candida albicans pathogenomics. Nucleic Acids Res, 33(Database issue):D353–7, 2005.

[66] Z. L. Deng and D. H. Connor. Progressive disseminated penicilliosis caused by penicillium marneffei. report of eight cases and differentiation of the causative organism from histoplasma capsulatum. Am J Clin Pathol, 84(3):323–7, 1985.

[67] Z. L. Deng, M. Yun, and L. Ajello. Human penicilliosis marneffei and its relation to the bamboo rat (rhizomys pruinosus). J Med Vet Mycol, 24(5):383–9, 1986.

[68] E. T. Dermitzakis and A. G. Clark. Differential selection after duplication in mammalian developmental genes. Mol Biol Evol, 18(4):557–62, 2001.

[69] V. Desakorn, M. D. Smith, A. L. Walsh, A. J. Simpson, D. Sahassananda, A. Rajanuwong, V. Wuthiekanun, P. Howe, B. J. Angus, P. Suntharasamai, and N. J. White. Diagnosis of penicillium marneffei infection by quantitation of urinary antigen by using an enzyme immunoassay. J Clin Microbiol, 37(1):117–21, 1999.

[70] A. Dmochowska, D. Dignard, D. Henning, D. Y. Thomas, and H. Bussey. Yeast kex1 gene encodes a putative protease with a carboxypeptidase b-like function involved in killer toxin and alpha-factor precursor processing. Cell, 50(4):573–84, 1987.

[71] C. B. Do, M. S. Mahabhashyam, M. Brudno, and S. Batzoglou. Probcons: Probabilistic consistency-based multiple sequence alignment. Genome Res, 15(2):330–40, 2005.

[72] J. M. Dolence, L. E. Steward, E. K. Dolence, D. H. Wong, and C. D. Poulter. Studies with recombinant saccharomyces cerevisiae caax prenyl protease rce1p. Biochemistry, 39(14):4096–104, 2000.

[73] T. Domazet-Loso and D. Tautz. An evolutionary analysis of orphan genes in drosophila. Genome Res, 13(10):2213–9, 2003.

[74] R. F. Doolittle. The multiplicity of domains in proteins. Annu Rev Biochem, 64:287–314, 1995.

[75] J. Du, Y. Zhu, A. Shanmugam, and A. L. Kenter. Analysis of immunoglobulin sgamma3 recombination breakpoints by pcr: implications for the mechanism of isotype switching. Nucleic Acids Res, 25(15):3066–73, 1997.

[76] P. S. Dyer, M. Paoletti, and D. B. Archer. Genomics reveals sexual secrets of aspergillus. Microbiology, 149(Pt 9):2301–3, 2003.

[77] S. E. Eckert, B. Hoffmann, C. Wanke, and G. H. Braus. Sexual development of aspergillus nidulans in tryptophan auxotrophic strains. Arch Microbiol, 172(3):157–66, 1999.

[78] A. Edwards, H. A. Hammond, L. Jin, C. T. Caskey, and R. Chakraborty. Genetic variation at five trimeric and tetrameric tandem repeat loci in four human population groups. Genomics, 12(2):241–53, 1992.

[79] C. elegans Sequencing Consortium. Genome sequence of the nematode c. elegans: a platform for investigating biology. Science, 282(5396):2012–8, 1998.

[80] T Endo, K Ikeo, and T Gojobori. Large-scale search for genes on which positive selection may operate. Mol Biol Evol, 13:685–690, 1996.

[81] E. Eskin, W. N. Grundy, and Y. Singer. Protein family classification using sparse markov transducers. Proc Int Conf Intell Syst Mol Biol, 8:134–45, 2000.

[82] E. Espagne, P. Balhadere, M. L. Penin, C. Barreau, and B. Turcq. Het-e and het-d belong to a new subfamily of wd40 proteins involved in vegetative incompatibility specificity in the fungus podospora anserina. Genetics, 161(1):71–81, 2002.

[83] B. Ewing and P. Green. Base-calling of automated sequencer traces using phred. ii. error probabilities. Genome Res, 8(3):186–94, 1998.

[84] B. Ewing, L. Hillier, M. C. Wendl, and P. Green. Base-calling of automated sequencer traces using phred. i. accuracy assessment. Genome Res, 8(3):175–85, 1998.

[85] J. Felsenstein. Evolutionary trees from dna sequences: a maximum likelihood approach. J Mol Evol, 17:368–376, 1981.

[86] J. Felsenstein. Phylip – phylogeny inference package (version 3.2). Cladistics, 5:164–166, 1989.

[87] Fungal Research Community FGI. Fungal genome initiative (http://www.broad.mit.edu/annotation/fungi/fgi/), 2002.

[88] M. C. Fisher, D. Aanensen, S. de Hoog, and N. Vanittanakom. Multilocus microsatellite typing system for penicillium marneffei reveals spatially structured populations. J Clin Microbiol, 42(11):5065–9, 2004.

[89] M. C. Fisher, W. P. Hanage, S. de Hoog, E. Johnson, M. D. Smith, N. J. White, and N. Vanittanakom. Low effective dispersal of asexual genotypes in heterogeneous landscapes by the endemic pathogen penicillium marneffei. PLoS Pathog, 1(2):e20, 2005.

[90] A. Force, M. Lynch, F. B. Pickett, A. Amores, Y. L. Yan, and J. Postlethwait. Preservation of duplicate genes by complementary, degenerative mutations. Genetics, 151(4):1531–45, 1999.

[91] F. Foury, T. Roganti, N. Lecrenier, and B. Purnelle. The complete sequence of the mitochondrial genome of saccharomyces cerevisiae. FEBS Lett, 440(3):325–31, 1998.

[92] C. M. Fraser and R. D. Fleischmann. Strategies for whole microbial genome sequencing and analysis. Electrophoresis, 18(8):1207–16, 1997.

[93] H. B. Fraser, D. P. Wall, and A. E. Hirsh. A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol Biol, 3(1):11, 2003.

[94] J. A. Fraser and J. Heitman. Evolution of fungal sex chromosomes. Mol Microbiol, 51(2):299–306, 2004.

[95] R. Friedman and A. L. Hughes. Gene duplication and the structure of eukaryotic genomes. Genome Res, 11(3):373–81, 2001.

[96] D. Frishman, M. Mokrejs, D. Kosykh, G. Kastenmuller, G. Kolesov, I. Zubrzycki, C. Gruber, B. Geier, A. Kaps, K. Albermann, A. Volz, C. Wagner, M. Fellenberg, K. Heumann, and H. W. Mewes. The pedant genome database. Nucleic Acids Res, 31(1):207–11, 2003.

[97] M. C. Frith, J. L. Spouge, U. Hansen, and Z. Weng. Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res, 30(14):3214–24, 2002.

[98] Y. Fu, G. Rieg, W. A. Fonzi, P. H. Belanger, J. E. Edwards, Jr., and S. G. Filler. Expression of the candida albicans gene als1 in saccharomyces cerevisiae induces adherence to endothelial and epithelial cells. Infect Immun, 66(4):1783–6, 1998.

[99] K. Fujimura-Kamada, F. J. Nouvet, and S. Michaelis. A novel membrane-associated metalloprotease, ste24p, is required for the first step of nh2-terminal processing of the yeast a-factor precursor. J Cell Biol, 136(2):271–85, 1997.

[100] R. S. Fuller, A. Brake, and J. Thorner. Yeast prohormone processing enzyme (kex2 gene product) is a ca2+-dependent serine protease. Proc Natl Acad Sci U S A, 86(5):1434–8, 1989.

[101] J. E. Galagan, S. E. Calvo, K. A. Borkovich, E. U. Selker, N. D. Read, D. Jaffe, W. FitzHugh, L. J. Ma, S. Smirnov, S. Purcell, B. Rehman, T. Elkins, R. Engels, S. Wang, C. B. Nielsen, J. Butler, M. Endrizzi, D. Qui, P. Ianakiev, D. Bell-Pedersen, M. A. Nelson, M. Werner-Washburne, C. P. Selitrennikoff, J. A. Kinsey, E. L. Braun, A. Zelter, U. Schulte, G. O. Kothe, G. Jedd, W. Mewes, C. Staben, E. Marcotte, D. Greenberg, A. Roy, K. Foley, J. Naylor, N. Stange-Thomann, R. Barrett, S. Gnerre, M. Kamal, M. Kamvysselis, E. Mauceli, C. Bielke, S. Rudd, D. Frishman, S. Krystofova, C. Rasmussen, R. L. Metzenberg, D. D. Perkins, S. Kroken, C. Cogoni, G. Macino, D. Catcheside, W. Li, R. J. Pratt, S. A. Osmani, C. P. DeSouza, L. Glass, M. J. Orbach, J. A. Berglund, R. Voelker, O. Yarden, M. Plamann, S. Seiler, J. Dunlap, A. Radford, R. Aramayo, D. O. Natvig, L. A. Alex, G. Mannhaupt, D. J. Ebbole, M. Freitag, I. Paulsen, M. S. Sachs, E. S. Lander, C. Nusbaum, and B. Birren. The genome sequence of the filamentous fungus neurospora crassa. Nature, 422(6934):859–68, 2003.

[102] C. A. Gale, C. M. Bendel, M. McClellan, M. Hauser, J. M. Becker, J. Berman, and M. K. Hostetter. Linkage of adhesion, filamentous growth, and virulence in candida albicans to a single gene, int1. Science, 279(5355):1355–8, 1998.

[103] W. Gao, C. H. Khang, S. Y. Park, Y. H. Lee, and S. Kang. Evolution and organization of a highly dynamic, subtelomeric helicase gene family in the rice blast fungus magnaporthe grisea. Genetics, 162(1):103–12, 2002.

[104] R. G. Garrison and K. S. Boyd. Dimorphism of penicillium marneffei as observed by electron microscopy. Can J Microbiol, 19(10):1305–9, 1973.

[105] S. M. Gasser and M. M. Cockell. The molecular biology of the sir proteins. Gene, 279(1):1–16, 2001.

[106] A. C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J. M. Rick, A. M. Michon, C. M. Cruciat, M. Remor, C. Hofert, M. Schelder, M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M. A. Heurtier, R. R. Copley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer, and G. Superti-Furga. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415(6868):141–7, 2002.

[107] R. F. Geever, L. Huiet, J. A. Baum, B. M. Tyler, V. B. Patel, B. J. Rutledge, M. E. Case, and N. H. Giles. Dna sequence, organization and regulation of the qa gene cluster of neurospora crassa. J Mol Biol, 207(1):15–34, 1989.

[108] M. S. Gelfand. Prediction of function in dna sequence analysis. J Comput Biol, 2(1):87–115, 1995.

[109] W. Gilbert, S. J. de Souza, and M. Long. Origin of genes. Proc Natl Acad Sci U S A, 94(15):7698–703, 1997.

[110] A. Goffeau, B. G. Barrell, H. Bussey, R. W. Davis, B. Dujon, H. Feldmann, F. Galibert, J. D. Hoheisel, C. Jacq, M. Johnston, E. J. Louis, H. W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin, and S. G. Oliver. Life with 6000 genes. Science, 274(5287):546, 563–7, 1996.

[111] N. Goldman and Z. Yang. A codon-based model of nucleotide substitution for protein-coding dna sequences. Mol Biol Evol, 11(5):725–36, 1994.

[112] D. Gordon, C. Abajian, and P. Green. Consed: a graphical tool for sequence finishing. Genome Res, 8(3):195–202, 1998.

[113] N. A. Gow. Candida albicans switches mates. Mol Cell, 10(2):217–8, 2002.

[114] N. A. Gow, A. J. Brown, and F. C. Odds. Fungal morphogenesis and host invasion. Curr Opin Microbiol, 5(4):366–71, 2002.

[115] D. Grant, P. Cregan, and R. C. Shoemaker. Genome organization in dicots: genome duplication in arabidopsis and synteny between soybean and arabidopsis. Proc Natl Acad Sci U S A, 97(8):4168–73, 2000.

[116] D. Graur. Amino acid composition and the evolutionary rates of protein-coding genes. J Mol Evol, 22(1):53–62, 1985.

[117] S. I. Grewal and D. Moazed. Heterochromatin and epigenetic control of gene expression. Science, 301(5634):798–802, 2003.

[118] Z. Gu, A. Cavalcanti, F. C. Chen, P. Bouman, and W. H. Li. Extent of gene duplication in the genomes of drosophila, nematode, and yeast. Mol Biol Evol, 19(3):256–62, 2002.

[119] Z. Gu, L. M. Steinmetz, X. Gu, C. Scharfe, R. W. Davis, and W. H. Li. Role of duplicate genes in genetic robustness against null mutations. Nature, 421(6918):63–6, 2003.

[120] J. E. Haber. Mating-type gene switching in saccharomyces cerevisiae. Annu Rev Genet, 32:561–99, 1998.

[121] H. Hamada, M. Seidman, B. H. Howard, and C. M. Gorman. Enhanced gene expression by the poly(dt-dg).poly(dc-da) sequence. Mol Cell Biol, 4(12):2622–30, 1984.

[122] A. J. Hamilton, L. Jeavons, S. Youngchim, and N. Vanittanakom. Recognition of fibronectin by penicillium marneffei conidia via a sialic acid-dependent process and its relationship to the interaction between conidia and laminin. Infect Immun, 67(10):5200–5, 1999.

[123] A. J. Hamilton, L. Jeavons, S. Youngchim, N. Vanittanakom, and R. J. Hay. Sialic acid-dependent recognition of laminin by penicillium marneffei conidia. Infect Immun, 66(12):6024–6, 1998.

[124] K. H. Han, K. Y. Han, J. H. Yu, K. S. Chae, K. Y. Jahng, and D. M. Han. The nsdd gene encodes a putative gata-type transcription factor necessary for sexual development of aspergillus nidulans. Mol Microbiol, 41(2):299–309, 2001.

[125] K. H. Han, J. A. Seo, and J. H. Yu. A putative g protein-coupled receptor negatively controls sexual development in aspergillus nidulans. Mol Microbiol, 51(5):1333–45, 2004.

[126] M Hasegawa, H Kishino, and T Yano. Dating of the human-ape splitting by a molecular clock of mitochondrial dna. J Mol Evol, 22:160–174, 1985.

[127] K. E. Hastings. Strong evolutionary conservation of broadly expressed protein isoforms in the troponin i gene family and other vertebrate gene families. J Mol Evol, 42(6):631–40, 1996.

[128] K. Haynes. Virulence in candida species. Trends Microbiol, 9(12):591–6, 2001.

[129] B. He, P. Chen, S. Y. Chen, K. L. Vancura, S. Michaelis, and S. Powers. Ram2, an essential gene of yeast, and ram1 encode the two polypeptide components of the farnesyltransferase that prenylates a-factor and ras proteins. Proc Natl Acad Sci U S A, 88(24):11373–7, 1991.

[130] D. S. Heckman, D. M. Geiser, B. R. Eidell, R. L. Stauffer, N. L. Kardos, and S. B. Hedges. Molecular evidence for the early colonization of land by fungi and plants. Science, 293(5532):1129–33, 2001.

[131] S. B. Hedges and S. Kumar. Genomic clocks and evolutionary timescales. Trends Genet, 19(4):200–6, 2003.

[132] I. Herskowitz. Fungal physiology. yeast branches out. Nature, 357(6375):190–1, 1992.

[133] L. H. Hogan, S. Josvai, and B. S. Klein. Genomic cloning, characterization, and functional analysis of the major surface adhesin wi-1 on blastomyces dermatitidis yeasts. J Biol Chem, 270(51):30725–32, 1995.

[134] P. R. Hsueh, L. J. Teng, C. C. Hung, J. H. Hsu, P. C. Yang, S. W. Ho, and K. T. Luh. Molecular evidence for strain dissemination of penicillium marneffei: an emerging pathogen in taiwan. J Infect Dis, 181(5):1706–12, 2000.

[135] H. Huang, W. C. Barker, Y. Chen, and C. H. Wu. iproclass: an integrated database of protein family, function and structure information. Nucleic Acids Res, 31(1):390–2, 2003.

[136] A. L. Hughes and R. Friedman. Parallel evolution by gene duplication in the genomes of two unicellular fungi. Genome Res, 13(6A):1259–64, 2003.

[137] M. K. Hughes and A. L. Hughes. Evolution of duplicate genes in a tetraploid animal, xenopus laevis. Mol Biol Evol, 10(6):1360–9, 1993.

[138] C. M. Hull and A. D. Johnson. Identification of a mating type-like locus in the asexual pathogenic yeast candida albicans. Science, 285(5431):1271–5, 1999.

[139] C. M. Hull, R. M. Raisner, and A. D. Johnson. Evidence for mating of the "asexual" yeast candida albicans in a mammalian host. Science, 289(5477):307–10, 2000.

[140] C. C. Hung, M. Y. Chen, S. M. Hsieh, W. H. Sheng, C. F. Hsiao, and S. C. Chang. Discontinuation of secondary prophylaxis for penicilliosis marneffei in aids patients responding to highly active antiretroviral therapy. Aids, 16(4):672–3, 2002.

[141] L. D. Hurst and N. G. Smith. Do essential genes evolve slowly? Curr Biol, 9(14):747–50, 1999.

[142] M. Huynen, B. Snel, W. Lathe, 3rd, and P. Bork. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res, 10(8):1204–10, 2000.

[143] I. Iliopoulos, S. Tsoka, M. A. Andrade, A. J. Enright, M. Carroll, P. Poullet, V. Promponas, T. Liakopoulos, G. Palaios, C. Pasquier, S. Hamodrakas, J. Tamames, A. T. Yagnik, A. Tramontano, D. Devos, C. Blaschke, A. Valencia, D. Brett, D. Martin, C. Leroy, I. Rigoutsos, C. Sander, and C. A. Ouzounis. Evaluation of annotation strategies using an entire genome sequence. Bioinformatics, 19(6):717–26, 2003.

[144] P. Imwidthaya, A. S. Sekhon, T. D. Mastro, A. K. Garg, and E. Ambrosie. Usefulness of a microimmunodiffusion test for the detection of penicillium marneffei antigenemia, antibodies, and exoantigens. Mycopathologia, 138(2):51–5, 1997.

[145] Y. Ina. Oden: a program package for molecular evolutionary analysis and database search of dna and amino acid sequences. Comput Appl Biosci, 10:11–12, 1994.

[146] L. Jeavons, A. J. Hamilton, N. Vanittanakom, R. Ungpakorn, E. G. Evans, T. Sirisanthana, and R. J. Hay. Identification and purification of specific penicillium marneffei antigens and their recognition by human immune sera. J Clin Microbiol, 36(4):949–54, 1998.

[147] M. E. Johnson, L. Viggiano, J. A. Bailey, M. Abdul-Rauf, G. Goodwin, M. Rocchi, and E. E. Eichler. Positive selection of a gene family during the emergence of humans and african apes. Nature, 413(6855):514–9, 2001.

[148] T. Jones, N. A. Federspiel, H. Chibana, J. Dungan, S. Kalman, B. B. Magee, G. Newport, Y. R. Thorstenson, N. Agabian, P. T. Magee, R. W. Davis, and S. Scherer. The diploid genome sequence of candida albicans. Proc Natl Acad Sci U S A, 101(19):7329–34, 2004.

[149] I. K. Jordan, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Res, 12(6):962–8, 2002.

[150] I. K. Jordan, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. Microevolutionary genomics of bacteria. Theor Popul Biol, 61(4):435–47, 2002.

[151] I. K. Jordan, Y. I. Wolf, and E. V. Koonin. No simple dependence between protein evolution rate and the number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC Evol Biol, 3(1):1, 2003.

[152] T. Joseph-Horne, D. W. Hollomon, and P. M. Wood. Fungal respiration: a fusion of standard and alternative components. Biochim Biophys Acta, 1504(2-3):179–95, 2001.

[153] T. H. Jukes and C. R. Cantor. Evolution of protein molecules. In H. N. Munro, editor, Mammalian Protein Metabolism, pages 21–132. Academic Press, New York, 1969.

[154] D. Julius, L. Blair, A. Brake, G. Sprague, and J. Thorner. Yeast alpha factor is processed from a larger precursor polypeptide: the essential role of a membrane-bound dipeptidyl aminopeptidase. Cell, 32(3):839–52, 1983.

[155] H. Kaessmann, S. Zollner, A. Nekrutenko, and W. H. Li. Signatures of domain shuffling in the human genome. Genome Res, 12(11):1642–50, 2002.

[156] E. Kafer. Origins of translocations in aspergillus nidulans. Genetics, 52(1):217–32, 1965.

[157] T. Kanbe and J. E. Cutler. Minimum chemical requirements for adhesin activity of the acid-stable part of candida albicans cell wall phosphomannoprotein complex. Infect Immun, 66(12):5812–8, 1998.

[158] R. Kappe, C. Fauser, C. N. Okeke, and M. Maiwald. Universal fungus-specific primer systems and group-specific hybridization oligonucleotides for 18s rdna. Mycoses, 39(1-2):25–30, 1996.

[159] N. Kato, W. Brooks, and A. M. Calvo. The expression of sterigmatocystin and penicillin genes in aspergillus nidulans is controlled by vea, a gene required for sexual development. Eukaryot Cell, 2(6):1178–86, 2003.

[160] L. Kaufman, P. G. Standard, M. Jalbert, P. Kantipong, K. Limpakarnjanarat, and T. D. Mastro. Diagnostic antigenemia tests for penicilliosis marneffei. J Clin Microbiol, 34(10):2503–5, 1996.

[161] N. P. Keller and T. M. Hohn. Metabolic pathway gene clusters in filamentous fungi. Fungal Genet Biol, 21(1):17–29, 1997.

[162] M. Kelly, J. Burke, M. Smith, A. Klar, and D. Beach. Four mating-type genes control sexual differentiation in the fission yeast. Embo J, 7(5):1537–47, 1988.

[163] Z. Kerenyi and L. Hornok. Structure and function of mating-type genes in fusarium species. Acta Microbiol Immunol Hung, 49(2-3):313–4, 2002.

[164] H. Kim, K. Han, K. Kim, D. Han, K. Jahng, and K. Chae. The vea gene activates sexual development in aspergillus nidulans. Fungal Genet Biol, 37(1):72–80, 2002.

[165] M. Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol, 16:111–120, 1980.

[166] M. Kimura and J. L. King. Fixation of a deleterious allele at one of two "duplicate" loci by mutation pressure and random drift. Proc Natl Acad Sci U S A, 76(6):2858–61, 1979.

[167] K. E. Kirk and N. R. Morris. The tubb alpha-tubulin gene is essential for sexual development in aspergillus nidulans. Genes Dev, 5(11):2014–23, 1991.

[168] K. E. Kirk and N. R. Morris. Either alpha-tubulin isogene product is sufficient for microtubule function during all stages of growth and differentiation in aspergillus nidulans. Mol Cell Biol, 13(8):4465–76, 1993.

[169] B. S. Klein, L. H. Hogan, and J. M. Jones. Immunologic recognition of a 25-amino acid repeat arrayed in tandem on a major antigen of blastomyces dermatitidis. J Clin Invest, 92(1):330–7, 1993.

[170] M. A. Klich, E. J. Mullaney, C. B. Daly, and J. W. Cary. Molecular and physiological aspects of aflatoxin and sterigmatocystin biosynthesis by aspergillus tamarii and a. ochraceoroseus. Appl Microbiol Biotechnol, 53(5):605–9, 2000.

[171] Y. Koguchi, K. Kawakami, S. Kon, T. Segawa, M. Maeda, T. Uede, and A. Saito. Penicillium marneffei causes osteopontin-mediated production of interleukin-12 by peripheral blood mononuclear cells. Infect Immun, 70(3):1042–8, 2002.

[172] F. A. Kondrashov and E. V. Koonin. Origin of alternative splicing by tandem exon duplication. Hum Mol Genet, 10(23):2661–9, 2001.

[173] F. A. Kondrashov and E. V. Koonin. Evolution of alternative splicing: deletions, insertions and origin of functional parts of proteins from intron sequences. Trends Genet, 19(3):115–9, 2003.

[174] F. A. Kondrashov, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. Selection in the evolution of gene duplications. Genome Biol, 3(2):RESEARCH0008, 2002.

[175] R. Koszul, A. Malpertuy, L. Frangeul, C. Bouchier, P. Wincker, A. Thierry, S. Duthoy, S. Ferris, C. Hennequin, and B. Dujon. The complete mitochondrial genome sequence of the pathogenic yeast candida (torulopsis) glabrata. FEBS Lett, 534(1-3):39–48, 2003.

[176] L. Kraakman, K. Lemaire, P. Ma, A. W. Teunissen, M. C. Donaton, P. Van Dijck, J. Winderickx, J. H. de Winde, and J. M. Thevelein. A saccharomyces cerevisiae g-protein coupled receptor, gpr1, is specifically required for glucose activation of the camp pathway during the transition to growth on glucose. Mol Microbiol, 32(5):1002–12, 1999.

[177] A. Krause, J. Stoye, and M. Vingron. The systers protein sequence cluster set. Nucleic Acids Res, 28(1):270–2, 2000.

[178] D. M. Krylov, Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Res, 13(10):2229–35, 2003.

[179] N. Kudeken, K. Kawakami, and A. Saito. Cytokine-induced fungicidal activity of human polymorphonuclear leukocytes against penicillium marneffei. FEMS Immunol Med Microbiol, 26(2):115–24, 1999.

[180] N. Kudeken, K. Kawakami, and A. Saito. Role of superoxide anion in the fungicidal activity of murine peritoneal exudate macrophages against penicillium marneffei. Microbiol Immunol, 43(4):323–30, 1999.

[181] N. Kudeken, K. Kawakami, and A. Saito. Mechanisms of the in vitro fungicidal effects of human neutrophils against penicillium marneffei induced by granulocyte-macrophage colony-stimulating factor (gm-csf). Clin Exp Immunol, 119(3):472–8, 2000.

[182] E. Y. Kwan, Y. L. Lau, K. Y. Yuen, B. M. Jones, and L. C. Low. Penicillium marneffei infection in a non-hiv infected child. J Paediatr Child Health, 33(3):267–71, 1997.

[183] K. J. Kwon-Chung and J. E. Bennett. Distribution of alpha and a mating types of cryptococcus neoformans among natural and clinical isolates. Am J Epidemiol, 108(4):337–40, 1978.

[184] J. A. Lake. Reconstructing evolutionary trees from dna and protein sequences: paralinear distances. Proc Natl Acad Sci USA, 91:1455–1459, 1994.

[185] C. Lanave, G. Preparata, C. Saccone, and G. Serio. A new method for calculating evolutionary substitution rates. J Mol Evol, 20:86–93, 1984.

[186] E. S. Lander and M. S. Waterman. Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics, 2(3):231–9, 1988.

[187] K. Langfelder, B. Jahn, H. Gehringer, A. Schmidt, G. Wanner, and A. A. Brakhage. Identification of a polyketide synthase gene (pksp) of aspergillus fumigatus involved in conidial pigment biosynthesis and virulence. Med Microbiol Immunol (Berl), 187(2):79–89, 1998.

[188] L. Latchinian-Sadek and D. Y. Thomas. Expression, purification, and characterization of the yeast kex1 gene product, a polypeptide precursor processing carboxypeptidase. J Biol Chem, 268(1):534–40, 1993.

[189] J. P. Latge and R. Calderone. Host-microbe interactions: fungi invasive human fungal opportunistic infections. Curr Opin Microbiol, 5(4):355–8, 2002.

[190] E. Leberer, D. Harcus, I. D. Broadbent, K. L. Clark, D. Dignard, K. Ziegelbauer, A. Schmidt, N. A. Gow, A. J. Brown, and D. Y. Thomas. Signal transduction through homologs of the ste20p and ste7p protein kinases can trigger hyphal formation in the pathogenic fungus candida albicans. Proc Natl Acad Sci U S A, 93(23):13217–22, 1996.

[191] D. W. Lee, S. Kim, S. J. Kim, D. M. Han, K. Y. Jahng, and K. S. Chae. The isda gene is necessary for sexual development inhibition by a salt in aspergillus nidulans. Curr Genet, 39(4):237–43, 2001.

[192] K. B. Lengeler, R. C. Davidson, C. D’Souza, T. Harashima, W. C. Shen, P. Wang, X. Pan, M. Waugh, and J. Heitman. Signal transduction cascades regulating fungal development and virulence. Microbiol Mol Biol Rev, 64(4):746–85, 2000.

[193] K. B. Lengeler, P. Wang, G. M. Cox, J. R. Perfect, and J. Heitman. Identification of the mata mating-type locus of cryptococcus neoformans reveals a serotype a mata strain thought to have been extinct. Proc Natl Acad Sci U S A, 97(26):14455–60, 2000.

[194] I. Letunic, R. R. Copley, and P. Bork. Common exon duplication in animals and its role in alternative splicing. Hum Mol Genet, 11(13):1561–7, 2002.

[195] J. C. Li, L. Q. Pan, and S. X. Wu. Mycologic investigation on rhizomys pruinous senex in guangxi as natural carrier with penicillium marneffei. Chin Med J (Engl), 102(6):477–85, 1989.

[196] W. H. Li. Rate of gene silencing at duplicate loci: a theoretical study and interpretation of data from tetraploid fishes. Genetics, 95(1):237–58, 1980.

[197] W. H. Li. Unbiased estimation of the rates of synonymous and nonsynonymous substitution. J Mol Evol, 36:96–99, 1993.

[198] W. H. Li, C. I. Wu, and C. C. Luo. A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol Biol Evol, 2:150–174, 1985.

[199] Wen-Hsiung Li. Molecular evolution. Sinauer Associates, Sunderland, Mass., 1997.

[200] F. Lisacek, Y. Diaz, and F. Michel. Automatic identification of group i intron cores in genomic dna sequences. J Mol Biol, 235(4):1206–17, 1994.

[201] C. Y. Lo, D. T. Chan, K. Y. Yuen, F. K. Li, and K. P. Cheng. Penicillium marneffei infection in a patient with sle. Lupus, 4(3):229–31, 1995.

[202] K. F. LoBuglio and J. W. Taylor. Phylogeny and pcr identification of the human pathogenic fungus penicillium marneffei. J Clin Microbiol, 33(1):85–9, 1995.

[203] B. J. Loftus, E. Fung, P. Roncaglia, D. Rowley, P. Amedeo, D. Bruno, J. Vamathevan, M. Miranda, I. J. Anderson, J. A. Fraser, J. E. Allen, I. E. Bosdet, M. R. Brent, R. Chiu, T. L. Doering, M. J. Donlin, C. A. D’Souza, D. S. Fox, V. Grinberg, J. Fu, M. Fukushima, B. J. Haas, J. C. Huang, G. Janbon, S. J. Jones, H. L. Koo, M. I. Krzywinski, J. K. Kwon-Chung, K. B. Lengeler, R. Maiti, M. A. Marra, R. E. Marra, C. A. Mathewson, T. G. Mitchell, M. Pertea, F. R. Riggs, S. L. Salzberg, J. E. Schein, A. Shvartsbeyn, H. Shin, M. Shumway, C. A. Specht, B. B. Suh, A. Tenney, T. R. Utterback, B. L. Wickes, J. R. Wortman, N. H. Wye, J. W. Kronstad, J. K. Lodge, J. Heitman, R. W. Davis, C. M. Fraser, and R. W. Hyman. The genome of the basidiomycetous yeast and human pathogen cryptococcus neoformans. Science, 307(5713):1321–4, 2005.

[204] M. Long, E. Betran, K. Thornton, and W. Wang. The origin of new genes: glimpses from the young and old. Nat Rev Genet, 4(11):865–75, 2003.

[205] M. Long and C. H. Langley. Natural selection and the origin of jingwei, a chimeric processed functional gene in drosophila. Science, 260(5104):91–5, 1993.

[206] M. C. Lorenz. Genomic approaches to fungal pathogenicity. Curr Opin Microbiol, 5(4):372–8, 2002.

[207] T. M. Lowe and S. R. Eddy. trnascan-se: a program for improved detection of transfer rna genes in genomic sequence. Nucleic Acids Res, 25(5):955–64, 1997.

[208] Q. Lu, L. L. Wallrath, H. Granok, and S. C. Elgin. (ct)n (ga)n repeats and heat shock elements have distinct roles in chromatin structure and transcriptional activation of the drosophila hsp26 gene. Mol Cell Biol, 13(5):2802–14, 1993.

[209] L. G. Lundin. Evolution of the vertebrate genome as reflected in paralogous chromosomal regions in man and the house mouse. Genomics, 16(1):1–19, 1993.

[210] M. Lynch and J. S. Conery. The evolutionary fate and consequences of duplicate genes. Science, 290(5494):1151–5, 2000.

[211] M. Lynch and J. S. Conery. The evolutionary demography of duplicate genes. J Struct Funct Genomics, 3(1-4):35–44, 2003.

[212] M. Lynch and A. Force. The probability of duplicate gene preservation by subfunctionalization. Genetics, 154(1):459–73, 2000.

[213] B. B. Magee and P. T. Magee. Induction of mating in candida albicans by construction of mtla and mtlalpha strains. Science, 289(5477):310–3, 2000.

[214] W. Makalowski and M. S. Boguski. Synonymous and nonsynonymous substitution distances are correlated in mouse and rat genes. J Mol Evol, 47(2):119–21, 1998.

[215] W. Makalowski, G. A. Mitchell, and D. Labuda. Alu sequences in the coding regions of mrna: a source of protein variability. Trends Genet, 10(6):188–93, 1994.

[216] G. Mannhaupt, C. Montrone, D. Haase, H. W. Mewes, V. Aign, J. D. Hoheisel, B. Fartmann, G. Nyakatura, F. Kempken, J. Maier, and U. Schulte. What’s in the genome of a filamentous fungus? analysis of the neurospora genome sequence. Nucleic Acids Res, 31(7):1944–54, 2003.

[217] E. M. Marcotte, M. Pellegrini, H. L. Ng, D. W. Rice, T. O. Yeates, and D. Eisenberg. Detecting protein function and protein-protein interactions from genome sequences. Science, 285(5428):751–3, 1999.

[218] E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, and D. Eisenberg. A combined algorithm for genome-wide prediction of protein function. Nature, 402(6757):83–6, 1999.

[219] A. McLysaght, K. Hokamp, and K. H. Wolfe. Extensive genomic duplication during early chordate evolution. Nat Genet, 31(2):200–4, 2002.

[220] H. W. Mewes, K. Albermann, M. Bahr, D. Frishman, A. Gleissner, J. Hani, K. Heumann, K. Kleine, A. Maierl, S. G. Oliver, F. Pfeiffer, and A. Zollner. Overview of the yeast genome. Nature, 387(6632 Suppl):7–65, 1997.

[221] A. Meyer and M. Schartl. Gene and genome duplications in vertebrates: the one-to-four (-to-eight in fish) rule and the evolution of novel gene functions. Curr Opin Cell Biol, 11(6):699–704, 1999.

[222] K. Y. Miller, T. M. Toennis, T. H. Adams, and B. L. Miller. Isolation and transcriptional characterization of a morphological modifier: the aspergillus nidulans stunted (stua) gene. Mol Gen Genet, 227(2):285–92, 1991.

[223] T. K. Mitchell and R. A. Dean. The camp-dependent protein kinase catalytic subunit is required for appressorium formation and pathogenesis by the rice blast pathogen magnaporthe grisea. Plant Cell, 7(11):1869–78, 1995.

[224] N. P. Money. Plant pathology. reverend berkeley’s devil. Nature, 411(6838):644, 2001.

[225] S. A. Mousavi and G. D. Robson. Oxidative and amphotericin b-mediated cell death in the opportunistic pathogen aspergillus fumigatus is associated with an apoptotic-like phenotype. Microbiology, 150(Pt 6):1937–45, 2004.

[226] S. V. Muse and B. S. Gaut. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol, 11(5):715–24, 1994.

[227] K. A. Nasmyth and K. Tatchell. The structure of transposable yeast mating type loci. Cell, 19(3):753–64, 1980.

[228] M. Nei and T. Gojobori. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol, 3:418–426, 1986.

[229] Masatoshi Nei and S. Kumar. Molecular evolution and phylogenetics. Oxford University Press, Oxford, UK, 2000.

[230] A. Nekrutenko and W. H. Li. Transposable elements are found in a large number of human protein-coding genes. Trends Genet, 17(11):619–21, 2001.

[231] M. A. Nelson, S. Kang, E. L. Braun, M. E. Crawford, P. L. Dolan, P. M. Leonard, J. Mitchell, A. M. Armijo, L. Bean, E. Blueyes, T. Cushing, A. Errett, M. Fleharty, M. Gorman, K. Judson, R. Miller, J. Ortega, I. Pavlova, J. Perea, S. Todisco, R. Trujillo, J. Valentine, A. Wells, M. Werner-Washburne, D. O. Natvig, et al. Expressed sequences from conidial, mycelial, and sexual stages of neurospora crassa. Fungal Genet Biol, 21(3):348–63, 1997.

[232] S. L. Newman, S. Chaturvedi, and B. S. Klein. The wi-1 antigen of blastomyces dermatitidis yeasts mediates binding to human macrophage cd11b/cd18 (cr3) and cd14. J Immunol, 154(2):753–61, 1995.


[233] W. C. Nierman, A. Pain, M. J. Anderson, J. R. Wortman, H. S. Kim, J. Arroyo, M. Berriman, K. Abe, D. B. Archer, C. Bermejo, J. Bennett, P. Bowyer, D. Chen, M. Collins, R. Coulsen, R. Davies, P. S. Dyer, M. Farman, N. Fedorova, T. V. Feldblyum, R. Fischer, N. Fosker, A. Fraser, J. L. Garcia, M. J. Garcia, A. Goble, G. H. Goldman, K. Gomi, S. Griffith-Jones, R. Gwilliam, B. Haas, H. Haas, D. Harris, H. Horiuchi, J. Huang, S. Humphray, J. Jimenez, N. Keller, H. Khouri, K. Kitamoto, T. Kobayashi, S. Konzack, R. Kulkarni, T. Kumagai, A. Lafton, J. P. Latge, W. Li, A. Lord, C. Lu, W. H. Majoros, G. S. May, B. L. Miller, Y. Mohamoud, M. Molina, M. Monod, I. Mouyna, S. Mulligan, L. Murphy, S. O’Neil, I. Paulsen, M. A. Penalva, M. Pertea, C. Price, B. L. Pritchard, M. A. Quail, E. Rabbinowitsch, N. Rawlins, M. A. Rajandream, U. Reichard, H. Renauld, G. D. Robson, S. Rodriguez de Cordoba, J. M. Rodriguez-Pena, C. M. Ronning, S. Rutter, S. L. Salzberg, M. Sanchez, J. C. Sanchez-Ferrero, D. Saunders, K. Seeger, R. Squares, S. Squares, M. Takeuchi, F. Tekaia, G. Turner, C. R. Vazquez de Aldana, J. Weidman, O. White, J. Woodward, J. H. Yu, C. Fraser, J. E. Galagan, K. Asai, M. Machida, N. Hall, B. Barrell, and D. W. Denning. Genomic sequence of the pathogenic and allergenic filamentous fungus aspergillus fumigatus. Nature, 438(7071):1151–6, 2005.

[234] L. R. Nunes, R. Costa de Oliveira, D. B. Leite, V. S. da Silva, E. dos Reis Marques, M. E. da Silva Ferreira, D. C. Ribeiro, L. A. de Souza Bernardes, M. H. Goldman, R. Puccia, L. R. Travassos, W. L. Batista, M. P. Nobrega, F. G. Nobrega, D. Y. Yang, C. A. de Braganca Pereira, and G. H. Goldman. Transcriptome analysis of paracoccidioides brasiliensis cells undergoing mycelium-to-yeast transition. Eukaryot Cell, 4(12):2115–28, 2005.

[235] D. I. Nurminsky, M. V. Nurminskaya, D. De Aguiar, and D. L. Hartl. Selective sweep of a newly evolved sperm-specific gene in drosophila. Nature, 396(6711):572–5, 1998.

[236] A. Odom, S. Muir, E. Lim, D. L. Toffaletti, J. Perfect, and J. Heitman. Calcineurin is required for virulence of cryptococcus neoformans. Embo J, 16(10):2576–89, 1997.

[237] S Ohno. Evolution by Gene Duplication. Springer-Verlag Inc., New York, 1970.

[238] T. Ohta. How gene families evolve. Theor Popul Biol, 37(1):213–9, 1990.

[239] T. Ohta. Synonymous and nonsynonymous substitutions in mammalian genes and the nearly neutral theory. J Mol Evol, 40(1):56–63, 1995.

[240] H. D. Osiewacz and E. Kimpel. Mitochondrial-nuclear interactions and lifespan control in fungi. Exp Gerontol, 34(8):901–9, 1999.

[241] C. Pal, B. Papp, and L. D. Hurst. Highly expressed genes in yeast evolve slowly. Genetics, 158(2):927–31, 2001.

[242] P. Pamilo and N. O. Bianchi. Evolution of the zfx and zfy genes: rates and interdependence between the genes. Mol Biol Evol, 10:271–281, 1993.

[243] B. Paquin and B. F. Lang. The mitochondrial dna of allomyces macrogynus: the complete genomic sequence from an ancestral fungus. J Mol Biol, 255(5):688–701, 1996.

[244] L. Patthy. Genome evolution and the evolution of exon-shuffling–a review. Gene, 238(1):103–14, 1999.

[245] W. R. Pearson. Rapid and sensitive sequence comparison with fastp and fasta. Methods Enzymol, 183:63–98, 1990.

[246] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proc Natl Acad Sci USA, 85:2444–2448, 1988.

[247] J. Pei and N. V. Grishin. Type ii caax prenyl endopeptidases belong to a novel superfamily of putative membrane-bound metalloproteases. Trends Biochem Sci, 26(5):275–7, 2001.


[248] M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A, 96(8):4285–8, 1999.

[249] G. E. Pierard, J. Arrese Estrada, C. Pierard-Franchimont, A. Thiry, and D. Stynen. Immunohistochemical expression of galactomannan in the cytoplasm of phagocytic cells during invasive aspergillosis. Am J Clin Pathol, 96(3):373–6, 1991.

[250] J. Piskur. Origin of the duplicated regions in the yeast genomes. Trends Genet, 17(6):302–3, 2001.

[251] J. B. Plotkin, J. Dushoff, and H. B. Fraser. Detecting selection using a single genome sequence of m. tuberculosis and p. falciparum. Nature, 428:942–945, 2004.

[252] S. Poggeler. Mating-type genes for classical strain improvements of ascomycetes. Appl Microbiol Biotechnol, 56(5-6):589–601, 2001.

[253] S. Poggeler. Genomic evidence for mating abilities in the asexual pathogen aspergillus fumigatus. Curr Genet, 42(3):153–60, 2002.

[254] S. Pongsunk, A. Andrianopoulos, and S. C. Chaiyaroj. Conditional lethal disruption of tata-binding protein gene in penicillium marneffei. Fungal Genet Biol, 42(11):893–903, 2005.

[255] M. Pop, D. S. Kosack, and S. L. Salzberg. Hierarchical scaffolding with bambus. Genome Res, 14(1):149–59, 2004.

[256] R. O. Poyton and J. E. McEwen. Crosstalk between nuclear and mitochondrial genomes. Annu Rev Biochem, 65:563–607, 1996.

[257] V. E. Prince and F. B. Pickett. Splitting pairs: the diverging fates of duplicated genes. Nat Rev Genet, 3(11):827–37, 2002.

[258] L. Ramsay, M. Macaulay, S. degli Ivanissevich, K. MacLean, L. Cardle, J. Fuller, K. J. Edwards, S. Tuvesson, M. Morgante, A. Massari, E. Maestri, N. Marmiroli, T. Sjakste, M. Ganal, W. Powell, and R. Waugh. A simple sequence repeat-based linkage map of barley. Genetics, 156(4):1997–2005, 2000.

[259] M. Raymond, D. Dignard, A. M. Alarco, N. Mainville, B. B. Magee, and D. Y. Thomas. A ste6p/p-glycoprotein homologue from the asexual yeast candida albicans transports the a-factor mating pheromone in saccharomyces cerevisiae. Mol Microbiol, 27(3):587–98, 1998.

[260] Y. Reiss, J. L. Goldstein, M. C. Seabra, P. J. Casey, and M. S. Brown. Inhibition of purified p21ras farnesyl:protein transferase by cys-aax tetrapeptides. Cell, 62(1):81–8, 1990.

[261] M. Remm, C. E. Storm, and E. L. Sonnhammer. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol, 314(5):1041–52, 2001.

[262] M. Ricchetti, C. Fairhead, and B. Dujon. Mitochondrial dna repairs double-strand breaks in yeast chromosomes. Nature, 402(6757):96–100, 1999.

[263] P. Rice, I. Longden, and A. Bleasby. Emboss: the european molecular biology open software suite. Trends Genet, 16(6):276–7, 2000.

[264] I. Rigoutsos, T. Huynh, A. Floratos, L. Parida, and D. Platt. Dictionary-driven protein annotation. Nucleic Acids Res, 30(17):3901–16, 2002.

[265] M. Robinson-Rechavi and V. Laudet. Evolutionary rates of duplicate genes in fish and mammals. Mol Biol Evol, 18(4):681–3, 2001.

[266] F. Rodriguez, J. L. Oliver, A. Marin, and J. R. Medina. The general stochastic model of nucleotide substitution. J Theor Biol, 142:485–501, 1990.


[267] S. Rogic, A. K. Mackworth, and F. B. Ouellette. Evaluation of gene-finding programs on mammalian sequences. Genome Res, 11(5):817–32, 2001.

[268] S. Rogic, B. F. Ouellette, and A. K. Mackworth. Improving gene recognition accuracy by combining predictions from two gene-finding programs. Bioinformatics, 18(8):1034–45, 2002.

[269] Y. Rongrungruang and S. M. Levitz. Interactions of penicillium marneffei with human leukocytes in vitro. Infect Immun, 67(9):4732–6, 1999.

[270] G. M. Rubin, M. D. Yandell, J. R. Wortman, G. L. Gabor Miklos, C. R. Nelson, I. K. Hariharan, M. E. Fortini, P. W. Li, R. Apweiler, W. Fleischmann, J. M. Cherry, S. Henikoff, M. P. Skupski, S. Misra, M. Ashburner, E. Birney, M. S. Boguski, T. Brody, P. Brokstein, S. E. Celniker, S. A. Chervitz, D. Coates, A. Cravchik, A. Gabrielian, R. F. Galle, W. M. Gelbart, R. A. George, L. S. Goldstein, F. Gong, P. Guan, N. L. Harris, B. A. Hay, R. A. Hoskins, J. Li, Z. Li, R. O. Hynes, S. J. Jones, P. M. Kuehl, B. Lemaitre, J. T. Littleton, D. K. Morrison, C. Mungall, P. H. O’Farrell, O. K. Pickeral, C. Shue, L. B. Vosshall, J. Zhang, Q. Zhao, X. H. Zheng, and S. Lewis. Comparative genomics of the eukaryotes. Science, 287(5461):2204–15, 2000.

[271] A. Rzhetsky and P. Morozov. Markov chain monte carlo computation of confidence intervals for substitution-rate variation in proteins. Pac Symp Biocomput, 6:203–214, 2001.

[272] C. Sadhu, D. Hoekstra, M. J. McEachern, S. I. Reed, and J. B. Hicks. A g-protein alpha subunit from asexual candida albicans functions in the mating signal transduction pathway of saccharomyces cerevisiae and is regulated by the a1-alpha 2 repressor. Mol Cell Biol, 12(5):1977–85, 1992.

[273] N. Saitou and M. Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 4(4):406–25, 1987.

[274] L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie, and D. Eisenberg. The database of interacting proteins: 2004 update. Nucleic Acids Res, 32(Database issue):D449–51, 2004.

[275] G. San-Blas. [dimorphic fungi: biochemical approach to their dimorphism]. Acta Cient Venez, 46(4):221–4, 1995.

[276] G. A. Sarosi and D. S. Serstock. Isolation of blastomyces dermatitidis from pigeon manure. Am Rev Respir Dis, 114(6):1179–83, 1976.

[277] A. S. Sekhon, J. S. Li, and A. K. Garg. Penicillosis marneffei: serological and exoantigen studies. Mycopathologia, 77(1):51–7, 1982.

[278] P. Sengupta and B. H. Cochran. Mat alpha 1 can mediate gene activation by a-mating factor. Genes Dev, 5(10):1924–34, 1991.

[279] C. Seoighe and K. H. Wolfe. Extent of genomic rearrangement after genome duplication in yeast. Proc Natl Acad Sci U S A, 95(8):4447–52, 1998.

[280] C. Seoighe and K. H. Wolfe. Updated map of duplicated regions in the yeast genome. Gene, 238(1):253–61, 1999.

[281] P. M. Sharp. In search of molecular darwinism. Nature, 385:111–112, 1997.

[282] P. M. Sharp and W. H. Li. The codon adaptation index–a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res, 15(3):1281–95, 1987.

[283] P. M. Sharp and W. H. Li. The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias. Mol Biol Evol, 4(3):222–30, 1987.

[284] J. C. Shepherd, W. McGinnis, A. E. Carrasco, E. M. De Robertis, and W. J. Gehring. Fly and frog homoeo domains show homologies with yeast mating type regulatory proteins. Nature, 310(5972):70–1, 1984.


[285] R. Shields. Pushing the envelope on molecular dating. Trends Genet, 20(5):221–2, 2004.

[286] R. A. Sia, K. B. Lengeler, and J. Heitman. Diploid strains of the pathogenic basidiomycete cryptococcus neoformans are thermally dimorphic. Fungal Genet Biol, 29(3):153–63, 2000.

[287] A. Sidow. Gen(om)e duplications in the evolution of early vertebrates. Curr Opin Genet Dev, 6(6):715–22, 1996.

[288] R. R. Sinden. Biological implications of the dna structures associated with disease-causing triplet repeats. Am J Hum Genet, 64(2):346–53, 1999.

[289] M. Sipiczki. Where does fission yeast sit on the tree of life? Genome Biol, 1(2):REVIEWS1011, 2000.

[290] T. Sirisanthana, K. Supparatpinyo, J. Perriens, and K. E. Nelson. Amphotericin b and itraconazole for treatment of disseminated penicillium marneffei infection in human immunodeficiency virus-infected patients. Clin Infect Dis, 26(5):1107–10, 1998.

[291] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J Mol Biol, 147:195–197, 1981.

[292] T. F. Smith, M. S. Waterman, and C. Burks. The statistical distribution of nucleic acid similarities. Nucleic Acids Res, 13(2):645–56, 1985.

[293] R. Sorek, G. Ast, and D. Graur. Alu-containing exons are alternatively spliced. Genome Res, 12(7):1060–7, 2002.

[294] P. Staib, M. Kretschmar, T. Nichterlein, H. Hof, and J. Morschhauser. Differential activation of a candida albicans virulence gene family during infection. Proc Natl Acad Sci U S A, 97(11):6102–7, 2000.

[295] M. A. Steel. Recovering a tree from the leaf colourations it generates under a markov model. Appl Math Lett, 7:19–32, 1994.

[296] B. R. Steen, T. Lian, S. Zuyderduyn, W. K. MacDonald, M. Marra, S. J. Jones, and J. W. Kronstad. Temperature-regulated transcription in the pathogenic fungus cryptococcus neoformans. Genome Res, 12(9):1386–400, 2002.

[297] L. M. Steinmetz, C. Scharfe, A. M. Deutschbauer, D. Mokranjac, Z. S. Herman, T. Jones, A. M. Chu, G. Giaever, H. Prokisch, P. J. Oefner, and R. W. Davis. Systematic screen for human disease genes in yeast. Nat Genet, 31(4):400–4, 2002.

[298] A. Stoltzfus. On the possibility of constructive neutral evolution. J Mol Evol, 49(2):169–81, 1999.

[299] J. N. Strathern, E. Spatola, C. McGill, and J. B. Hicks. Structure and organization of transposable mating type cassettes in saccharomyces yeasts. Proc Natl Acad Sci U S A, 77(5):2839–43, 1980.

[300] K. Supparatpinyo, C. Khamwan, V. Baosoung, K. E. Nelson, and T. Sirisanthana. Disseminated penicillium marneffei infection in southeast asia. Lancet, 344(8915):110–3, 1994.

[301] K. Supparatpinyo, K. E. Nelson, W. G. Merz, B. J. Breslin, Jr. Cooper, C. R., C. Kamwan, and T. Sirisanthana. Response to antifungal therapy by human immunodeficiency virus-infected patients with disseminated penicillium marneffei infections and in vitro susceptibilities of isolates from clinical specimens. Antimicrob Agents Chemother, 37(11):2407–11, 1993.

[302] K. Supparatpinyo, J. Perriens, K. E. Nelson, and T. Sirisanthana. A controlled trial of itraconazole to prevent relapse of penicillium marneffei infection in patients infected with the human immunodeficiency virus. N Engl J Med, 339(24):1739–43, 1998.


[303] Y. Suzuki and T Gojobori. Analysis of coding sequences. In M. Salemi and A. M. Vandamme, editors, The phylogenetic handbook: a practical approach to DNA and protein phylogeny, pages 283–311. Cambridge University Press, Cambridge, UK, 2003.

[304] A. Tam, W. K. Schmidt, and S. Michaelis. The multispanning membrane protein ste24p catalyzes caax proteolysis and nh2-terminal processing of the yeast a-factor precursor. J Biol Chem, 276(50):46798–806, 2001.

[305] W. Tang, T. M. Gunn, D. F. McLaughlin, G. S. Barsh, S. F. Schlossman, and J. S. Duke-Cohan. Secreted and membrane attractin result from alternative splicing of the human atrn gene. Proc Natl Acad Sci U S A, 97(11):6025–30, 2000.

[306] D. Taramelli, S. Brambilla, G. Sala, A. Bruccoleri, C. Tognazioli, L. Riviera-Uzielli, and J. R. Boelaert. Effects of iron on extracellular and intracellular growth of penicillium marneffei. Infect Immun, 68(3):1724–6, 2000.

[307] D. Taramelli, C. Tognazioli, F. Ravagnani, O. Leopardi, G. Giannulis, and J. R. Boelaert. Inhibition of intramacrophage growth of penicillium marneffei by 4-aminoquinolines. Antimicrob Agents Chemother, 45(5):1450–5, 2001.

[308] R. L. Tatusov, D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T. Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin, N. D. Fedorova, and E. V. Koonin. The cog database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res, 29(1):22–8, 2001.

[309] S. Tavare. Some probabilistic and statistical problems in the analysis of dna sequences. Lectures on Mathematics in the Life Sciences, 17:57–86, 1986.

[310] R. D. Teasdale and M. R. Jackson. Signal-mediated sorting of membrane proteins between the endoplasmic reticulum and the golgi apparatus. Annu Rev Cell Dev Biol, 12:27–54, 1996.

[311] J. D. Thompson, D. G. Higgins, and T. J. Gibson. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22(22):4673–80, 1994.

[312] JD Thompson, DG Higgins, and TJ Gibson. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucl Acids Res, 22:4673–4680, 1994.

[313] C. Thrane, U. Kaufmann, B. M. Stummann, and S. Olsson. Activation of caspase-like activity and poly (adp-ribose) polymerase degradation during sporulation in aspergillus nidulans. Fungal Genet Biol, 41(3):361–8, 2004.

[314] W. E. Timberlake. Molecular genetics of aspergillus development. Annu Rev Genet, 24:5–36, 1990.

[315] R. B. Todd, J. R. Greenhalgh, M. J. Hynes, and A. Andrianopoulos. Tupa, the penicillium marneffei tup1p homologue, represses both yeast and spore development. Mol Microbiol, 48(1):85–94, 2003.

[316] S. Trewatcharegon, S. Sirisinha, A. Romsai, B. Eampokalap, R. Teanpaisan, and S. C. Chaiyaroj. Molecular typing of penicillium marneffei isolates from thailand by noti macrorestriction and pulsed-field gel electrophoresis. J Clin Microbiol, 39(12):4544–8, 2001.

[317] H. F. Tsai, Y. C. Chang, R. G. Washburn, M. H. Wheeler, and K. J. Kwon-Chung. The developmentally regulated alb1 gene of aspergillus fumigatus: its role in modulation of conidial morphology and virulence. J Bacteriol, 180(12):3031–8, 1998.


[318] H. F. Tsai, M. H. Wheeler, Y. C. Chang, and K. J. Kwon-Chung. A developmentally regulated gene cluster involved in conidial pigment biosynthesis in aspergillus fumigatus. J Bacteriol, 181(20):6469–77, 1999.

[319] N. Tsuchimori, L. L. Sharkey, W. A. Fonzi, S. W. French, Jr. Edwards, J. E., and S. G. Filler. Reduced virulence of hwp1-deficient mutants of candida albicans and their interactions with host cells. Infect Immun, 68(4):1997–2002, 2000.

[320] B. G. Turgeon and O. C. Yoder. Proposed nomenclature for mating type genes of filamentous ascomycetes. Fungal Genet Biol, 31(1):1–5, 2000.

[321] Y. Van de Peer, J. S. Taylor, I. Braasch, and A. Meyer. The ghost of selection past: rates of evolution and functional divergence of anciently duplicated genes. J Mol Evol, 53(4-5):436–46, 2001.

[322] K. Vandepoele, Y. Saeys, C. Simillion, J. Raes, and Y. Van De Peer. The automatic detection of homologous regions (adhore) and its application to microcolinearity between arabidopsis and rice. Genome Res, 12(11):1792–801, 2002.

[323] N. Vanittanakom, Jr. Cooper, C. R., S. Chariyalertsak, S. Youngchim, K. E. Nelson, and T. Sirisanthana. Restriction endonuclease analysis of penicillium marneffei. J Clin Microbiol, 34(7):1834–6, 1996.

[324] N. Vanittanakom, W. G. Merz, N. Sittisombut, C. Khamwan, K. E. Nelson, and T. Sirisanthana. Specific identification of penicillium marneffei by a polymerase chain reaction/hybridization technique. Med Mycol, 36(3):169–75, 1998.

[325] N. Vanittanakom, P. Vanittanakom, and R. J. Hay. Rapid identification of penicillium marneffei by pcr-based detection of specific sequences on the rrna gene. J Clin Microbiol, 40(5):1739–42, 2002.

[326] J. Varga and B. Toth. Genetic variability and reproductive mode of aspergillus fumigatus. Infect Genet Evol, 3(1):3–17, 2003.

[327] D. Venet. Matarray: a matlab toolbox for microarray data. Bioinformatics, 19:659–660, 2003.

[328] K. J. Verstrepen, A. Jansen, F. Lewitter, and G. R. Fink. Intragenic tandem repeats generate functional variability. Nat Genet, 37(9):986–90, 2005.

[329] K. J. Verstrepen, T. B. Reynolds, and G. R. Fink. Origins of variation in the fungal cell surface. Nat Rev Microbiol, 2(7):533–40, 2004.

[330] P. E. Verweij, J. F. Meis, P. van den Hurk, J. Zoll, R. A. Samson, and W. J. Melchers. Phylogenetic relationships of five species of aspergillus and related taxa as deduced by comparison of sequences of small subunit ribosomal rna. J Med Vet Mycol, 33(3):185–90, 1995.

[331] K. Vienken, M. Scherer, and R. Fischer. The zn(ii)2cys6 putative aspergillus nidulans transcription factor repressor of sexual development inhibits sexual development under low-carbon conditions and in submersed culture. Genetics, 169(2):619–30, 2005.

[332] M. Viswanathan, G. Muthukumar, Y. S. Cong, and J. Lenard. Seripauperins of saccharomyces cerevisiae: a new multigene family encoding serine-poor relatives of serine-rich proteins. Gene, 148(1):149–53, 1994.

[333] M. A. Viviani, A. M. Tortorano, G. Rizzardini, T. Quirino, L. Kaufman, A. A. Padhye, and L. Ajello. Treatment and serological studies of an italian case of penicilliosis marneffei contracted in thailand by a drug addict infected with the human immunodeficiency virus. Eur J Epidemiol, 9(1):79–85, 1993.

[334] A. Wagner. The fate of duplicated genes: loss or new function? Bioessays, 20(10):785–8, 1998.

[335] A. Wagner. The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol, 18(7):1283–92, 2001.


[336] J. B. Walsh. How often do duplicated genes evolve new functions? Genetics, 139(1):421–8, 1995.

[337] J. D. Walton. Horizontal gene transfer and the evolution of secondary metabolite gene clusters in fungi: an hypothesis. Fungal Genet Biol, 30(3):167–71, 2000.

[338] W. Wang, F. G. Brunet, E. Nevo, and M. Long. Origin of sphinx, a young chimeric rna gene in drosophila melanogaster. Proc Natl Acad Sci U S A, 99(7):4448–53, 2002.

[339] W. Wang, H. Zheng, S. Yang, H. Yu, J. Li, H. Jiang, J. Su, L. Yang, J. Zhang, J. McDermott, R. Samudrala, J. Wang, H. Yang, J. Yu, K. Kristiansen, and G. K. Wong. Origin and evolution of new exons in rodents. Genome Res, 15(9):1258–64, 2005.

[340] J. L. Weber and P. E. May. Abundant class of human dna polymorphisms which can be typed using the polymerase chain reaction. Am J Hum Genet, 44(3):388–96, 1989.

[341] M. H. Wheeler and A. A. Bell. Melanins and their importance in pathogenic fungi. Curr Top Med Mycol, 2:338–87, 1988.

[342] S. Whelan and N. Goldman. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol, 18(5):691–9, 2001.

[343] A. C. Wilson, S. S. Carlson, and T. J. White. Biochemical evolution. Annu Rev Biochem, 46:573–639, 1977.

[344] K. H. Wolfe and P. M. Sharp. Mammalian gene evolution: nucleotide sequence divergence between mouse and rat. J Mol Evol, 37(4):441–56, 1993.

[345] K. H. Wolfe and D. C. Shields. Molecular evidence for an ancient duplication of the entire yeast genome. Nature, 387(6634):708–13, 1997.

[346] K. H. Wong and S. S. Lee. Comparing the first and second hundred aids cases in hong kong. Singapore Med J, 39(6):236–40, 1998.

[347] L. P. Wong, P. C. Woo, A. Y. Wu, and K. Y. Yuen. Dna immunization using a secreted cell wall antigen mp1p is protective against penicillium marneffei infection. Vaccine, 20(23-24):2878–86, 2002.

[348] S. S. Wong, H. Siau, and K. Y. Yuen. Penicilliosis marneffei–west meets east. J Med Microbiol, 48(11):973–5, 1999.

[349] S. S. Wong, K. H. Wong, W. T. Hui, S. S. Lee, J. Y. Lo, L. Cao, and K. Y. Yuen. Differences in clinical and laboratory diagnostic characteristics of penicilliosis marneffei in human immunodeficiency virus (hiv)- and non-hiv-infected patients. J Clin Microbiol, 39(12):4535–40, 2001.

[350] S. S. Wong, P. C. Woo, and K. Y. Yuen. Candida tropicalis and penicillium marneffei mixed fungaemia in a patient with waldenstrom’s macroglobulinaemia. Eur J Clin Microbiol Infect Dis, 20(2):132–5, 2001.

[351] P. C. Woo, C. M. Chan, A. S. Leung, S. K. Lau, X. Y. Che, S. S. Wong, L. Cao, and K. Y. Yuen. Detection of cell wall galactomannoprotein afmp1p in culture supernatants of aspergillus fumigatus and in sera of aspergillosis patients. J Clin Microbiol, 40(11):4382–7, 2002.

[352] P. C. Woo, K. T. Chong, A. S. Leung, S. S. Wong, S. K. Lau, and K. Y. Yuen. Aflmp1 encodes an antigenic cell wall protein in aspergillus flavus. J Clin Microbiol, 41(2):845–50, 2003.

[353] P. C. Woo, H. Zhen, J. J. Cai, J. Yu, S. K. Lau, J. Wang, J. L. Teng, S. S. Wong, R. H. Tse, R. Chen, H. Yang, B. Liu, and K. Y. Yuen. The mitochondrial genome of the thermal dimorphic fungus penicillium marneffei is more closely related to those of molds than yeasts. FEBS Lett, 555(3):469–77, 2003.


[354] V. Wood, R. Gwilliam, M. A. Rajandream, M. Lyne, R. Lyne, A. Stewart, J. Sgouros, N. Peat, J. Hayles, S. Baker, D. Basham, S. Bowman, K. Brooks, D. Brown, S. Brown, T. Chillingworth, C. Churcher, M. Collins, R. Connor, A. Cronin, P. Davis, T. Feltwell, A. Fraser, S. Gentles, A. Goble, N. Hamlin, D. Harris, J. Hidalgo, G. Hodgson, S. Holroyd, T. Hornsby, S. Howarth, E. J. Huckle, S. Hunt, K. Jagels, K. James, L. Jones, M. Jones, S. Leather, S. McDonald, J. McLean, P. Mooney, S. Moule, K. Mungall, L. Murphy, D. Niblett, C. Odell, K. Oliver, S. O’Neil, D. Pearson, M. A. Quail, E. Rabbinowitsch, K. Rutherford, S. Rutter, D. Saunders, K. Seeger, S. Sharp, J. Skelton, M. Simmonds, R. Squares, S. Squares, K. Stevens, K. Taylor, R. G. Taylor, A. Tivey, S. Walsh, T. Warren, S. Whitehead, J. Woodward, G. Volckaert, R. Aert, J. Robben, B. Grymonprez, I. Weltjens, E. Vanstreels, M. Rieger, M. Schafer, S. Muller-Auer, C. Gabel, M. Fuchs, A. Dusterhoft, C. Fritzc, E. Holzer, D. Moestl, H. Hilbert, K. Borzym, I. Langer, A. Beck, H. Lehrach, R. Reinhardt, T. M. Pohl, P. Eger, W. Zimmermann, H. Wedler, R. Wambutt, B. Purnelle, A. Goffeau, E. Cadieu, S. Dreano, S. Gloux, et al. The genome sequence of schizosaccharomyces pombe. Nature, 415(6874):871–80, 2002.

[355] J. Wu and B. L. Miller. Aspergillus asexual reproduction and sexual reproduction are differentially affected by transcriptional and translational mechanisms regulating stunted gene expression. Mol Cell Biol, 17(10):6191–201, 1997.

[356] Z. Yan, X. Li, and J. Xu. Geographic distribution of mating type alleles of cryptococcus neoformans in four areas of the united states. J Clin Microbiol, 40(3):965–72, 2002.

[357] J. Yang, Z. Gu, and W. H. Li. Rate of protein evolution versus fitness effect of gene deletion. Mol Biol Evol, 20(5):772–4, 2003.

[358] Z. Yang. Estimating the pattern of nucleotide substitution. J Mol Evol, 39:105–111, 1994.

[359] Z. Yang. Paml: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci, 13(5):555–6, 1997.

[360] Z Yang. Phylogenetic Analysis by Maximum Likelihood (PAML). Version 3.0. London: University College, 2000.

[361] R. F. Yeh, L. P. Lim, and C. B. Burge. Computational inference of homologous gene structures in the human genome. Genome Res, 11(5):803–16, 2001.

[362] G. Yona, N. Linial, and M. Linial. Protomap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res, 28(1):49–55, 2000.

[363] K. Y. Yuen, C. M. Chan, K. M. Chan, P. C. Woo, X. Y. Che, A. S. Leung, and L. Cao. Characterization of afmp1: a novel target for serodiagnosis of aspergillosis. J Clin Microbiol, 39(11):3830–7, 2001.

[364] K. Y. Yuen, G. Pascal, S. S. Wong, P. Glaser, P. C. Woo, F. Kunst, J. J. Cai, E. Y. Cheung, C. Medigue, and A. Danchin. Exploring the penicillium marneffei genome. Arch Microbiol, 179(5):339–53, 2003.

[365] K. Y. Yuen, S. S. Wong, D. N. Tsang, and P. Y. Chau. Serodiagnosis of penicillium marneffei infection. Lancet, 344(8920):444–5, 1994.

[366] M. Zagulski, B. Babinska, R. Gromadka, A. Migdalski, J. Rytka, J. Sulicka, and C. J. Herbert. The sequence of 24.3 kb from chromosome x reveals five complete open reading frames, all of which correspond to new genes, and a tandem insertion of a ty1 transposon. Yeast, 11(12):1179–86, 1995.

[367] E. M. Zdobnov and R. Apweiler. Interproscan–an integration platform for the signature-recognition methods in interpro. Bioinformatics, 17(9):847–8, 2001.

[368] C. T. Zhang, J. Wang, and R. Zhang. A novel method to calculate the g+c content of genomic dna sequences. J Biomol Struct Dyn, 19:333–341, 2001.


[369] J. Zhang, Y. P. Zhang, and H. F. Rosenberg. Adaptive evolution of a duplicated pancreatic ribonuclease gene in a leaf-eating monkey. Nat Genet, 30(4):411–5, 2002.

[370] L. Zhang, T. J. Vision, and B. S. Gaut. Patterns of nucleotide substitution among simultaneously duplicated gene pairs in arabidopsis thaliana. Mol Biol Evol, 19(9):1464–73, 2002.

[371] P. Zhang, Z. Gu, and W. H. Li. Different evolutionary patterns between young duplicate genes in the human genome. Genome Biol, 4(9):R56, 2003.

[372] R. Zhang and C. T. Zhang. Z curves, an intuitive tool for visualizing and analyzing the dna sequences. J Biomol Struct Dyn, 11:767–782, 1994.
