Abstract of thesis entitled
Understanding the Pathogenic Fungus Penicillium marneffei : A
Computational Genomics Perspective
by James J. Cai
for the degree of Doctor of Philosophy
at The University of Hong Kong
in May 2006
Penicillium marneffei, a thermally dimorphic fungus that alternates be-
tween a filamentous and a yeast growth form in response to changes in
its environmental temperature, has become an emerging fungal pathogen
endemic in Southeast Asia. Defining the genomics of P. marneffei will
provide a better understanding of the fungus.
This thesis reports the draft sequence of the P. marneffei genome,
assembled from 6.6× coverage of the genome through whole genome shotgun
sequencing. The 31 Mb genome obtained from the assembly contains
10,060 protein-coding genes. The complete mitochondrial genome is 35
kb long and its gene content and gene order are very similar to those of
Aspergillus. An annotation system and P. marneffei genome database
(PMGD) were developed to allow a preliminary annotation of the
sequences and provide an intuitive graphic interface to give curators and
users ready access to the annotation and the underlying evidence. A
Matlab-based software package, MBEToolbox, was also developed for data
analysis in phylogenetics and comparative genomics. A well-designed and
structured annotation system and powerful sequence analysis software
are essential requirements for the success of large-scale genome analysis
projects.
Analysis of the gene set of P. marneffei provided insights into the
adaptations required by a fungus to cause disease. The genome encodes
a diverse set of putative virulence genes such as proteinase, phospholi-
pase, metacaspase and agglutinin, which may enable the fungus to adhere
to, colonise and invade the host, adapt to the tissue environment, and
avoid the host’s humoral and cellular defences of the innate and adaptive
immune responses. A gene cluster involved in biosynthesis of melanin, a
known virulence factor in some other pathogenic fungi, was also identi-
fied in the genome, indicating that P. marneffei may produce melanin
or melanin-like immunosuppressive compounds that protect the fungus
against immune effector cells. More interestingly, the P. marneffei genome
contains more intragenic tandem repeats (IntraTRs) than other fungi.
These IntraTRs encoding repeat domains/motifs may create quantita-
tive variation in surface proteins, allowing the fungus to ‘disguise’ itself
to slip past the vigilant defences of the host immune system. The genome
sequence of P. marneffei also revealed a number of genes associated with
mating processes and sexual development, suggesting an unidentified sex-
ual cycle in the fungus.
The extent and evolutionary patterns of duplicate genes in P. marn-
effei and other ascomycetes were compared. All ascomycetes show a
certain degree of redundancy (though its extent can vary considerably),
which may provide the foundation for the specialisation of fungal genes
and form the basis for fungal diversification. An inverse relationship
between the lineage specificity of a gene and the gene’s evolutionary rate
was also discovered, implying that an accelerated evolutionary rate may be
responsible for the emergence of lineage-specific genes.
The genome sequence of P. marneffei has provided our first glimpse
into the genomic basis of the physiology of the dimorphic filamentous
fungus.
Understanding the Pathogenic Fungus Penicillium marneffei : A Computational
Genomics Perspective
BY
James J. Cai
M.D., Henan Medical University, 1996
M.S., University of New South Wales, 2001
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
at The University of Hong Kong
May 2006
To Yan
“Any living cell carries with it the experiences of a billion
years of experimentation by its ancestors.”
Max Delbrück (1949)
DECLARATION
I declare that this thesis represents my own work, except where due
acknowledgement is made, and that it has not been previously included
in a thesis, dissertation or report submitted to this University or to any
other institution for a degree, diploma or other qualifications.
Signature:
Date:
ACKNOWLEDGEMENTS
First of all, special thanks go to my principal supervisor, Professor
Kwok-yung Yuen, for his enthusiasm and support during
the course of my study. My heartfelt thanks to Dr. David K.
Smith and Dr. Xuhua Xia who introduced me to the fascinating
world of bioinformatics and molecular evolution.
Thanks to my friends and colleagues for their moral support
and technical assistance over the past four years, especially Dr.
Patrick Woo, Dr. Sussana Lau, and Jade, Huang Yi, Ken, Haw,
Candy, Rachel ... I am also grateful to my external mentor Dr.
Gavin Huttley and fellow colleagues Peter, Ray, Helen and Brett
at the Australian National University.
Finally, I am very grateful to my wife and my parents. Without
their support, this work would not have been possible.
TABLE OF CONTENTS
Declaration
Acknowledgements
List of Figures
List of Tables
Abbreviations
Glossary
Introduction
Chapter 1: The draft genome sequence of Penicillium marneffei
1.1 Introduction
1.2 Literature Review
1.2.1 General fungal biology
1.2.2 P. marneffei, as an important fungal pathogen
1.2.3 Penicilliosis marneffei
1.2.4 Fungal genome projects
1.3 Materials and Methods
1.3.1 Strain and DNA preparation
1.3.2 Library construction, shotgun sequencing
1.3.3 Sequence assembly
1.3.4 Data release
1.4 Results
1.4.1 Assembly and general characteristic
1.4.2 Genome architecture and co-linearity
1.4.3 Gene duplications (multigene families) and comparisons
1.4.4 Interspecies proteome comparison
1.4.5 Lineage-specific genes
1.4.6 Cell signalling and morphogenesis
1.4.7 Potential mating ability
1.4.8 Putative virulence genes
1.4.9 Cell wall antigens and biosynthetic genes
1.5 Discussion
Chapter 2: Penicillium marneffei genome database and annotation pipeline
2.1 Introduction
2.2 Literature Review
2.2.1 Methods for predicting protein function
2.2.2 Software/database systems for protein function prediction
2.2.3 The art of gene finding
2.3 Implementation
2.3.1 Annotation pipeline
2.3.2 Assembly process
2.3.3 Gene finding
2.3.4 Database and databank to store results
2.3.5 Perl source code collection
2.3.6 Genome browser configuration
2.3.7 Synteny identification
2.4 Results
2.4.1 Statistics of assembly
2.4.2 Genome size estimation
2.4.3 Accuracy of gene finding
2.4.4 Combination of gene finding
2.4.5 Database and databank to store results
2.5 Discussion
Chapter 3: Mitochondrial genome of Penicillium marneffei
3.1 Introduction
3.2 Materials and Methods
3.2.1 Library construction and sequence assembly
3.2.2 Mitochondrial DNA sequence annotation
3.2.3 Phylogenetic analysis
3.2.4 Mitochondrial DNA sequences in nuclear genome
3.3 Results and Discussion
3.3.1 Gene content and genome organisation
3.3.2 Protein coding genes
3.3.3 Genetic code and codon usage
3.3.4 tRNA genes
3.3.5 Other RNA genes
3.3.6 Group I introns
3.3.7 Mitochondrial DNA sequences in nuclear genome
Chapter 4: Genomic evidence for the presence of melanin biosynthesis gene cluster in Penicillium marneffei
4.1 Introduction
4.2 Literature Review
4.2.1 Potential virulence factors
4.2.2 Genomic approaches in identification of virulence factors
4.3 Materials and Methods
4.3.1 Identification of melanin biosynthesis genes in P. marneffei
4.3.2 Multiple alignments and phylogenetic analyses
4.4 Results and Discussion
4.4.1 Melanin gene cluster present in P. marneffei
4.4.2 Disrupted aflatoxin biosynthesis gene cluster in P. marneffei
4.4.3 Absence of penicillin biosynthesis genes in P. marneffei
Chapter 5: Mating abilities in Penicillium marneffei
5.1 Introduction
5.2 Literature Review
5.2.1 Mating in hemiascomycete yeasts
5.2.2 Mating in filamentous ascomycetes
5.3 Materials and Methods
5.4 Results and Discussion
5.4.1 Homologs of known sexual genes
5.4.2 Mating type genes
5.4.3 Mating pheromone precursor genes
5.4.4 Mating pheromone processing genes
5.4.5 Mating pheromone receptor and other GPCRs
Chapter 6: Exploring the genetic components associated with the dimorphism of Penicillium marneffei
6.1 Introduction
6.2 Materials and Methods
6.2.1 Sequence similarity
6.2.2 Phylogenetic Analysis
6.3 Results and Discussion
6.3.1 Perception of external stimuli by cellular sensors
6.3.2 Transduction of biochemical signal
6.3.3 Alteration of the genomic expression
6.3.4 Structural reorganization towards the morphological change
Chapter 7: Intragenic tandem repeats in Penicillium marneffei and other ascomycetes
7.1 Introduction
7.2 Materials and Methods
7.2.1 Identification of coding tandem repeats
7.2.2 Sequence analysis
7.3 Results and Discussion
Chapter 8: Extent and evolutionary pattern of duplicate genes in Penicillium marneffei and other ascomycetes
8.1 Introduction
8.2 Literature Review
8.3 Materials and Methods
8.3.1 Sequences and gene families
8.3.2 Estimation of substitution rate
8.3.3 Relative rate test
8.4 Results
8.4.1 Extent of gene duplication in ascomycetes
8.4.2 Age distribution of duplicate genes
8.4.3 Selective constraint between paralogs
8.4.4 Ka/Ks between paralogs and orthologs
8.4.5 Relative evolutionary rate between paralogs
8.5 Discussion
8.5.1 Gene duplication in ascomycetes is highly diverse
8.5.2 Different selective constraints in yeasts and filamentous ascomycetes
8.5.3 Majority of paralogous genes evolve symmetrically
Chapter 9: Accelerated evolutionary rate may be responsible for the emergence of lineage-specific genes
9.1 Introduction
9.2 Literature Review
9.3 Materials and Methods
9.3.1 Sequences and data sets
9.3.2 Identification of orthologs
9.3.3 Classification of genes into LS groups
9.3.4 Divergence Times
9.3.5 Estimation of substitution rates and statistical analyses
9.3.6 Detection of rate variability across species - Relative Divergence Score (RDS)
9.4 Results
9.4.1 Evolutionary rate differences among LS groups
9.4.2 Evolutionary rate-related factors of genes belonging to different LS groups
9.4.3 Linear regression of divergence time and relative divergence score (RDS)
9.5 Discussion
Chapter 10: MBEToolbox: a Matlab toolbox for sequence data analysis in molecular biology and evolution
10.1 Introduction
10.2 Literature Review
10.2.1 Probabilistic DNA substitution models
10.2.2 Maximum likelihood estimation
10.2.3 Elements of phylogenetic theory
10.2.4 Programs used for phylogenetic analyses
10.3 Implementation
10.3.1 Input data and formats
10.3.2 Sequence Manipulation and Statistics
10.3.3 Evolutionary Distances
10.3.4 Phylogeny Inference
10.3.5 Combination of functions
10.3.6 Graphics and GUI
10.4 Results and Discussion
10.4.1 Vectorisation simplifies programming
10.4.2 Extensibility
10.4.3 Comparison with other toolboxes
10.4.4 A novel enhanced window analysis
10.4.5 Limitations
Chapter 11: Concluding remarks
Bibliography
LIST OF FIGURES
Figure Number
1.1 P. marneffei mould and yeast culture
1.2 Dimorphic switching of P. marneffei
1.3 Phylogenetic tree showing the relationships of P. marneffei to other fungi
1.4 Microsyntenies containing pheromone precursor loci from four fungi
1.5 Triple proteome comparison between P. marneffei, S. cerevisiae and A. fumigatus
1.6 Putative MAPK signalling pathway in P. marneffei
2.1 Flowchart of annotation pipeline for P. marneffei genome
2.2 PMGD genome browser
2.3 Database schema of PMGD
3.1 Fungal respiratory pathways
3.2 Physical map of P. marneffei mitochondrial DNA
3.3 Comparison of gene order between mitochondrial DNAs
3.4 Phylogenetic distribution of group I and group II introns
3.5 28 tRNAs encoded in the mitochondrial genome of P. marneffei
3.6 Secondary structures of two representative group I introns
4.1 P. marneffei abr1 gene Cu-oxidase domain homologues
4.2 Melanin gene cluster in P. marneffei and A. fumigatus
5.1 Comparison of the mating-type loci in P. marneffei and other fungi
5.2 Comparison of the alpha1 domain of MAT proteins of filamentous ascomycetes
5.3 Gene organisation around the MAT locus
5.4 P. marneffei biogenesis of the a-factor pheromones
6.1 Phylogenetic tree of fungal GPCR family genes
6.2 P. marneffei genes in cAMP pathway
7.1 Amino acid composition in intragenic tandem repeats
8.1 Frequency distribution of Ks
8.2 Log-log plots of Ka vs. Ks for duplicate gene pairs
9.1 LS classification based on phylogenetic profiles of genes
9.2 Divergence of nonsynonymous substitution rate in LS groups
9.3 Dependence of log gene expression level and substitution rate
9.4 Linear regression analysis of divergence time and RDS
10.1 Relationship of GTR class DNA substitution models
10.2 Log-likelihood of evolutionary distance
10.3 MBEToolbox GUI
10.4 Comparison between sliding window and enhanced sliding window methods
LIST OF TABLES
Table Number
1.1 General features of the P. marneffei genome
1.2 Comparison of genome statistics of several fungi
1.3 Putative virulence genes
1.4 Cell wall antigens and biosynthetic genes predicted in P. marneffei
2.1 Commonly used domain databases
2.2 Summary of assembly statistics
3.1 Gene content of P. marneffei mitochondrial genome
3.2 Codon usage in protein-coding genes of P. marneffei mitochondrial genome
3.3 Presence of mitochondrial DNA fragments in nuclear genomes
3.4 P. marneffei mitochondrial DNA sequences present in nuclear genome
4.1 Major dimorphic fungal pathogens
4.2 Putative gene products related to melanin biosynthesis in P. marneffei
5.1 Mating strategies adopted by ascomycetous fungi
5.2 Pheromone-processing enzymes encoded by the putative P. marneffei genes
6.1 GPCR family in P. marneffei and A. nidulans
6.2 Homologous genes related to signal transduction in filamentous growth
7.1 P. marneffei genes containing intragenic tandem repeats
7.2 Comparison of genome size and base in repeats
8.1 Distribution of multigene families in fungi
8.2 Large multigene families in fungi
8.3 Ka/Ks ratio for recently diverged paralogs
8.4 Amino-acid substitution rates versus Ka/Ks ratios in two copies of duplicate genes
9.1 Genomic sequence sources
9.2 Average Ka, Ks and Ka/Ks among LS classes
9.3 Correlation and partial correlation between LS and other factors
9.4 Regression analyses on predicted S. cerevisiae-S. mikatae orthologs
ABBREVIATIONS AND SYMBOLS
aa Amino acid
AIDS Acquired Immunodeficiency Syndrome
ADHoRe Automatic Detection of Homologous Regions
BLAST Basic Local Alignment Search Tool
BLOSUM BLOcks SUbstitution Matrix
bp Base pairs
CDS Nucleotide coding sequence
DBMS Database management system
DDC Duplication-degeneration-complementation (model)
EST Expressed Sequence Tag
FASTA Fast-All (pronounced fast-aye) a program for pairwise sequence
alignment
FGI Fungal Genome Initiative
GFF ‘Gene-Finding Format’ or ‘General Feature Format’
GO Gene Ontology
GOLD Genomes OnLine Database
GPCR G Protein-Coupled Receptor
GTR General Time Reversible model
GUI Graphical User Interface
HAART Highly Active Anti-Retroviral Therapy
HMM Hidden Markov Model
HKU CC Computer Centre, University of Hong Kong
ITR Intragenic Tandem Repeat
Ka Nonsynonymous substitution rate
Ks Synonymous substitution rate
LS Lineage specificity
MAPK Mitogen-activated protein kinase
Mb Megabases
MBEToolbox Molecular biology and evolution toolbox
MCMC Markov-chain Monte Carlo
MDD Maximal dependence decomposition
MFS Major facilitator superfamily
MIPS Munich Information Center for Protein Sequences
TF Transcription Factor
TNF Tumor Necrosis factor
MIT Massachusetts Institute of Technology
MLMT Multilocus microsatellite typing system
NCBI National Center for Biotechnology Information
RDS Relative Divergence Score
ORF Open Reading Frame
PAUP* Phylogenetic Analysis Using Parsimony, *and other methods
(pronounced pop star)
PFGE Pulsed-field gel electrophoresis
PHYLIP PHYLogenetic Inference Package
PMGD P. marneffei genome database
REV General reversible process model
RIP Repeat-induced point
SAGE Serial Analysis of Gene Expression
SGD Saccharomyces Genome Database in Stanford Genomic Resources
Swiss-Prot a curated protein sequence database which strives to pro-
vide a high level of annotation (such as the description of the func-
tion of a protein, its domains structure, post-translational modi-
fications, variants, etc.), a minimal level of redundancy and high
level of integration with other databases.
TIGR The Institute for Genomic Research
TrEMBL a computer-annotated supplement of Swiss-Prot that contains
all the translations of EMBL nucleotide sequence entries not yet
integrated in Swiss-Prot.
UML Unified Modelling Language
UCSC University of California, Santa Cruz
URF unidentified reading frame
UTR Untranslated transcriptional region
WGS Whole-genome shotgun
HMG high mobility group motif
GLOSSARY
ADDITIVE TREE: A phylogenetic tree in which the distance between
any two terminal nodes is equal to the sum of the branch lengths
connecting them.
BOOTSTRAP: A statistical technique using resampling with replace-
ment.
BRANCH: The graphical representation of an evolutionary relation-
ship in a phylogenetic tree.
CODON: A triplet of adjacent nucleotides in mRNA that either codes
for an amino acid carried by a specific tRNA or specifies the ter-
mination of the translation process.
CODON USAGE: The frequency with which members of a codon family
are used in protein-coding genes.
COMPLEMENTARY DNA (CDNA): DNA synthesised from an RNA tem-
plate by the enzyme reverse transcriptase.
CONCERTED EVOLUTION: Maintenance of homogeneity of nucleotide
sequences among members of a gene family in a species, although
the nucleotide sequences change over time.
CONSENSUS SEQUENCE: A sequence that represents the most preva-
lent nucleotide or amino acid at each site in a number of homologous
sequences.
CONSERVATIVE SUBSTITUTION: The substitution of an amino acid by
another with similar chemical properties.
CONSTANT SITE OR CONSTANT REGION: A site or region within the
DNA that is occupied by the same nucleotide in all homologous
sequences under comparison.
CONVERGENCE: The independent evolution of similar genetic or phe-
notypic traits.
CONVERGENT SUBSTITUTION: The substitution of two different nu-
cleotides by the same nucleotide at the same nucleotide site in two
homologous sequences.
DETERMINISTIC PROCESS: A process, the outcome of which can be
predicted exactly from knowledge of initial conditions.
DIRECTIONAL SELECTION: A selective regime that changes the fre-
quency of an allele in a specific direction, either toward fixation or
toward elimination.
DIVERGENCE: The differences between two homologous sequences due
to the independent accumulation of genetic changes in each lineage.
DOMAIN: A well-defined region within a protein that can perform a
specific function. May not consist of a continuous stretch of amino
acids, although it almost always consists of amino acids that are
adjacent to each other as far as the tertiary structure of the protein
is concerned.
DUPLICATION: The presence or the creation of two copies of a DNA
segment in the genome.
EUKARYOTE: An organism having a true nucleus and membraneous
organelles. One of the three primary lines of descent in the living
world.
EXON: A DNA segment of a gene, the transcript of which appears in
the mature RNA molecule.
FIXATION PROBABILITY: The probability that a particular allele will
become fixed in a population.
FIXATION TIME: The time it takes for a mutant allele to become fixed
in a population.
FLANKING SEQUENCE: Untranscribed sequences at the 5’ or 3’ termi-
nal of transcribed genes.
FOURFOLD DEGENERATE SITE: A nucleotide site within a codon at
which all possible substitutions are synonymous. For example, in
the codon CCT, the third site is fourfold degenerate because CCT,
CCC, CCA and CCG are all codons for proline.
FUNCTIONAL CONSTRAINT (SELECTIVE CONSTRAINT): The degree of
intolerance characteristic of a site or a locus toward nucleotide sub-
stitutions.
GENE CONVERSION: A nonreciprocal recombination process resulting
in a sequence becoming identical with another.
GENE DIVERSITY: A measure of genetic variability in a population.
The mean expected heterozygosity per locus in a population.
GENE DUPLICATION: Generally, the production of two copies of a
DNA sequence. Specifically, the duplication of an entire gene se-
quence.
GENETIC DISTANCE: Broadly, any of several measures of the degree
of genetic difference between individuals, populations, or species.
In reference to molecular evolution, a measure of the number of
nucleotide substitutions per nucleotide site between two homolo-
gous DNA sequences that have accumulated since the divergence
between the sequences.
INFERRED TREE: A phylogenetic tree based on empirical data per-
taining to extant taxa.
INFORMATIVE SITE (DIAGNOSTIC POSITION): A site that is used to
choose the most-parsimonious tree from among all the possible phy-
logenetic trees. In molecular evolution, a site where there are at
least two different kinds of nucleotides or amino acids, and each of
them is represented in at least two sequences.
LIKELIHOOD RATIO TEST: A statistical test of the goodness-of-fit be-
tween two models. A relatively more complex model is compared
to a simpler model to see if it fits a particular dataset significantly
better.
LINEAGE: A linear evolutionary sequence from an ancestral species
through all intermediate species to a particular descendant species.
MAXIMUM LIKELIHOOD: A statistical procedure of finding the value
of one or more parameters for a given statistic which makes the
known likelihood distribution a maximum.
ORTHOLOGOUS LOCUS: A gene that has evolved directly from an
ancestral locus.
HOMOLOGOUS GENES: Genes that share a common evolutionary
ancestor.
PARALOGOUS LOCUS: A gene that originated by duplication and then
diverged from the parent copy by mutation and selection or drift.
PATTERN OF SUBSTITUTION (SUBSTITUTION SCHEME): The relative fre-
quency with which a nucleotide or an amino acid changes into an-
other during evolution.
POSITIVE SELECTION: Selection for an advantageous mutant allele.
POSTERIOR PROBABILITY: The probability of a parameter value in-
ferred from an analysis.
RELATIVE-RATE TEST: A calibration-free test for checking the con-
stancy of the rate of nucleotide substitutions in different lineages
during their evolution, thus determining whether or not the mole-
cular clock operates at the same rate among different lineages.
ROOTED TREE: A phylogenetic tree that specifies ancestral and de-
scendant species, thus indicating the direction of the evolutionary
path.
SENSE CODON: A codon specifying an amino acid.
SEQUENCE DIVERGENCE (DIVERGENCE): The differences between two
homologous sequences due to the independent accumulation of ge-
netic changes in each lineage.
STOCHASTIC PROCESS: A process, the outcome of which cannot be
predicted exactly from knowledge of initial conditions. However,
given the initial conditions, each of the possible outcomes of the
process can be assigned a certain probability.
SYNTENY: A pair of genomes in which at least some of the genes are
located at similar map positions.
TANDEM DUPLICATION: A duplication, the products of which reside
in close proximity to each other on the chromosome.
TRANSITION: The substitution of a purine for a purine or a pyrimidine
for a pyrimidine.
TRANSVERSION: The substitution of a purine for a pyrimidine or vice
versa.
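As an illustration of the TRANSITION, TRANSVERSION and INFORMATIVE SITE
entries above, the following is a minimal Python sketch of how these
definitions might be applied to nucleotide data. The function names are
hypothetical and are not taken from MBEToolbox or any other package
described in this thesis; the sketch assumes unambiguous DNA characters
(A, C, G, T) and does not handle gaps or ambiguity codes.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Classify a nucleotide pair as identical, transition or transversion.

    Hypothetical helper: a transition stays within purines or within
    pyrimidines; a transversion crosses between the two classes.
    """
    a, b = a.upper(), b.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid:
        raise ValueError("only A, C, G and T are supported in this sketch")
    if a == b:
        return "identical"
    same_class = (a in PURINES) == (b in PURINES)
    return "transition" if same_class else "transversion"


def is_informative_site(column):
    """Return True if an alignment column is an informative site.

    Following the glossary definition: at least two different states,
    each of which is represented in at least two sequences.
    """
    counts = Counter(ch.upper() for ch in column)
    states_seen_twice = sum(1 for n in counts.values() if n >= 2)
    return states_seen_twice >= 2
```

For example, classify_substitution("A", "G") is a transition, whereas
classify_substitution("A", "C") is a transversion; the column "AAGG" is
informative, while "AAAG" is not, because G occurs in only one sequence.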
INTRODUCTION
Penicillium marneffei is a dimorphic fungus that intracellularly
infects the reticuloendothelial system of humans and bamboo rats.
Endemic in Southeast Asia, it infects 10% of AIDS patients in this
region [365, 201, 182, 50, 348, 350]. Complete genome sequencing of
various organisms has accelerated rapidly in recent years, offering
another path to gene discovery. This thesis presents the sequence of
the P. marneffei genome, together with related studies from the
perspectives of comparative and evolutionary genomics. These studies
will shed light on the molecular mechanisms of virulence of this
important pathogenic fungus.
Chapter 1 gives an overview of the P. marneffei genome, including
sequence statistics, gene content and prediction of gene function.
Chapter 2 describes the organisation and implementation of the genome
database of the P. marneffei genome project. The complete mitochondrial
genome of P. marneffei is reported in Chapter 3. The gene content and
gene order of the P. marneffei mitochondrial genome are highly similar
to those of Aspergillus, further confirming their close phylogenetic
relationship. This provides the basis for comparative genomics studies
between P. marneffei and Aspergillus species.
This is followed by Chapter 4, which reports the presence of an
important virulence gene cluster, the melanin biosynthesis gene cluster,
in the P. marneffei genome. Since melanin is a highly toxic natural
product produced by some species of Aspergillus that are
phylogenetically close to P. marneffei, this finding is also valuable
in revealing the evolutionary origin of this gene cluster.
Mating of P. marneffei has not yet been observed in nature or under
laboratory defined conditions. The lack of a sexual stage impairs the
utility of experimental fungal genetics. By using genome sequence infor-
mation, however, we found evidence of the potential mating ability of P.
marneffei (Chapter 5). This suggests that P. marneffei, like other
pathogenic fungi, may limit access to the sexual cycle to generate a
population structure that is in part clonal but which retains the
ability to undergo a sexual cycle in response to challenging conditions
in the environment or in the host. Chapter 6 offers a systematic
exploration of the genetic components in the genome of P. marneffei that
may be responsible for its morphogenetic processes, mainly through
sequence analysis in the context of comparative genomics. Chapter 7
reports an interesting phenomenon: tandemly repeated DNA sequences occur
frequently in the genome of P. marneffei, not only in noncoding regions
but also within protein-coding (intragenic) regions. These highly
dynamic genomic components provide a clue as to how the pathogenic
fungus adapts to the host immune system.
Chapter 8 introduces a systematic test of the extent of gene
duplication in major ascomycetes. We observed significant variation
among ascomycetes in the extent of gene duplication. The age
distribution of gene duplications tentatively suggests that the P.
marneffei genome has experienced two rounds of large-scale duplication.
We argue that the different extents and evolutionary patterns of
duplicate genes in ascomycetes might be associated with the great
genotypic and phenotypic differences among them. Chapter 9 tackles the
question of the origin of species-specific genes. A statistically
significant correlation between accelerated evolutionary rate and the
degree of lineage specificity is confirmed. This correlation is
independent of many confounding factors, such as gene essentiality and
expression level. This finding helps to explain the origin of
P. marneffei-specific genes, which make up about one third of all P. marneffei
genes.
Finally, Chapter 10 introduces the software package, developed in a
high-performance scientific computer language, for sequence data manip-
ulation and analysis, which performed very successfully throughout the
whole genome project.
Publications arising from this thesis are:
1. Cai JJ, Liu B, Woo PC, Lau SKP, Wong SS, Zhen H, Yuen KY (In
preparation) Genomic evidence for the presence of melanin biosyn-
thesis gene cluster in the thermal dimorphic fungus Penicillium
marneffei
2. Cai JJ, Woo PCY, Lau SKP, Smith DK and Yuen KY (2006) Ac-
celerated evolutionary rate may be responsible for the emergence
of lineage-specific genes in Ascomycota Journal of Molecular Evo-
lution, in press
3. Cai JJ, Smith DK, Xia X and Yuen KY (2005) MBEToolbox: a
MATLAB™ toolbox for sequence data analysis in molecular biology
and evolution. BMC Bioinformatics, 6:64
4. Woo PC, Zhen H, Cai JJ, Yu J, Lau SKP, Wang J, Teng JLL,
Wong SS, Tse RH, Chen R, Yang H, Liu B and Yuen KY (2003) The
mitochondrial genome of the thermal dimorphic fungus Penicillium
marneffei is more closely related to those of molds than yeasts.
FEBS Letters, 555 (3): 469-77
5. Yuen KY, Pascal G, Wong SS, Glaser P, Woo PC, Kunst F, Cai JJ,
Cheung EY, Medigue C, Danchin A (2003) Exploring the Penicil-
lium marneffei genome. Archives of Microbiology, 179 (5): 339-53
I have tried to explicitly acknowledge where the other authors’ ideas
have contributed significantly to the present work.
Chapter 1
THE DRAFT GENOME SEQUENCE OF
PENICILLIUM MARNEFFEI
This chapter describes basic features of the genome of Penicillium
marneffei, such as genome assembly, gene content and some comparative
results, in an attempt to give an overall impression of the genome. More
detailed and complete analyses of some topics may be found in the
corresponding chapters.
1.1 Introduction
Although fungi pose little threat to people with healthy immune systems,
they can cause fatal infections in immunocompromised individuals.
Penicillium marneffei is the most important thermal dimorphic fungus
causing respiratory, skin and systemic mycosis in Southeast Asia [365,
201, 182, 50, 348, 350]. The fungus was discovered in 1956 in hepatic
abscesses of the Chinese bamboo rat Rhizomys sinensis, and only 18 cases
of human disease were reported (in HIV-negative patients) until 1985
[66]. The appearance of the HIV pandemic, especially in Southeast Asian
countries, saw the
emergence of the infection as an important opportunistic mycosis in this
group of immunocompromised patients. About 10% of AIDS patients in
Hong Kong are infected with P. marneffei [346]. In northern Thailand,
penicilliosis is the third most common indicator disease of AIDS following
tuberculosis and cryptococcosis [300].
Genome sequencing of P. marneffei will increase the understanding of
the molecular biology and biochemical mechanisms underlying the pathogenicity of
this fungus. Despite its medical importance and its unusual thermal di-
morphism, our understanding of gene organisation in P. marneffei was
limited. To my knowledge, only one cell wall mannoprotein gene has
been characterised and successfully used in serodiagnosis and prevention
of this infection [38,37,347]. As a ‘pilot study’ for this genome
project, an analysis of 2303 random sequence tags was performed [364],
which laid the foundation for the complete genome sequencing project of
this fungus. In 2002, the complete genome sequencing project of P.
marneffei was initiated, and we now have approximately 6.6× coverage of
the genome, which includes a contig that contains the complete
sequence of the mitochondrial genome. The sequencing of its genome
paves the way for the development of novel methods for detecting, pre-
venting and treating this infection.
1.2 Literature Review
In this section I will first recap some basic concepts and terminologies
in fungal biology, and then review some clinical aspects, including the
diagnosis and management of P. marneffei infection. Finally, I will give
a survey of the recent advances in fungal genome projects.
1.2.1 General fungal biology
Fungi are a large and diverse group of eukaryotes characterised by their
absorptive mode of nutrition, i.e., digesting food outside of their bodies.
Modern taxonomists place fungi in their own kingdom, on equal footing
with plants and animals, sometimes called “The Fifth Kingdom”. They
include moulds, yeasts, and mushrooms. Most fungi are multicellular,
but some, the yeasts, are simple unicellular organisms. Fungi are plastic,
having a diversity of forms which influence the manner of function, and
a range of dispersal mechanisms enabling various approaches to survival
over time. Nevertheless, diverse fungi share some basic structures.
A fungal organism consists of a mass of threadlike filaments called
hyphae, which combine to make up the fungal mycelium. Each hypha is
composed of a chain of fungal cells, a continuous cytoplasm with many
nuclei. The hypha is surrounded by a plasma membrane and a polysac-
charide chitin cell wall. The hyphae in a fungus branch off from one
another to form the mycelium, and are all ultimately connected to the
original hypha. Septa are barriers across the filament. In all fungi, septa
form, either adventitiously in all filamentous fungi, or at regular intervals
along the hypha in most members of the Ascomycota and Basidiomycota.
Different methods of reproduction have been adopted by different types
of fungus. For example, yeasts reproduce mitotically, while moulds have
much more complex life cycles involving distinct phases, including diploid
and haploid phases.
Fungi are often directly involved in our lives. Some fungi are in-
deed parasitic, and cause devastating plant infections. Serious agricul-
tural pests, parasitic fungi such as the rusts and the smuts can ruin
entire crops, especially affecting cereals such as wheat and corn. Only
about 50 species are known to harm animals. Many medical applications
of fungi have been discovered, of which antibiotic production by fungi
is the most important. The first among these antibiotics is penicillin,
possibly the most important non-genetic medical breakthrough of the last
century. Approximately 75% of all described fungi belong to the
Ascomycota. Among them are some famous species, such as Saccharomyces
cerevisiae, the baker’s yeast; Penicillium chrysogenum, the producer of
penicillin; Neurospora crassa, the “one-gene-one-enzyme” organism;
Aspergillus flavus, the producer of aflatoxin; and Candida albicans, the
cause of thrush.
Figure 1.1: P. marneffei mould (A) and yeast (B) culture. Courtesy of Prof. KY Yuen, Microbiology, HKU.
1.2.2 P. marneffei, as an important fungal pathogen
Mycology
The fungus grows well on Sabouraud dextrose agar. When grown
at 25°C, the fresh culture appears similar to other Penicillium species,
with rapidly growing greenish-silver mycelial colonies. The reverse side
is usually of a beige colour. One of the most characteristic features is the
production of a soluble red pigment that diffuses into the medium. Of all
the Penicillium species, only P. marneffei, P. citrinum, P. janthinellum,
P. purpurogenum, and P. rubrum produce diffusible red pigments. The
other Penicillium species are generally not associated with human
infections, nor do they display dimorphism. In contrast to a room
temperature culture, the fungus assumes a yeast form at 37°C, whether in
culture or in vivo. Colonies at 37°C are glabrous and beige-coloured and
do not produce any red pigment (Fig. 1.1). The dimorphic growth pattern,
as a yeast-like fungus at 37°C and as a mould in culture at temperatures
below 30°C, is illustrated in Fig. 1.2.
Microscopically, the mycelial form resembles other Penicillium species
with conidiophore-bearing biverticillate penicilli, and each penicillus be-
ing composed of four to five metulae with smooth-walled conidia. The
Figure 1.2: Dimorphic switching of P. marneffei. The diagram was obtained from the website of the Department of Genetics, University of Melbourne.
yeast forms are ovoid or elongated measuring 2–3 µm × 2–6.5 µm. Sim-
ilar forms are also observed in tissue samples obtained from patients,
which may be seen within macrophages or extracellularly. In contrast to
other yeasts, the yeast cells of P. marneffei divide not by budding, but
by fission, with the result that a transverse septum is often seen in the di-
viding cell. This helps to differentiate P. marneffei from other dimorphic
fungi in histological sections, especially Histoplasma capsulatum.
Ecology and epidemiology
P. marneffei is geographically restricted to Southeast Asia. Cases
have been reported mostly from northern Thailand, southwestern China
(e.g., around the Guangxi Province), Hong Kong, Taiwan, Singapore,
Malaysia, and the Philippines.
The ecology and possible environmental reservoirs of P. marneffei were
first investigated in 1986 by Deng et al. [67]. In the Guangxi Province
of the People’s Republic of China, it was found that P. marneffei could
be isolated from the internal organs of 18 out of 19 bamboo rats
belonging to the species Rhizomys pruinosus. The findings of Deng et al.
were confirmed by a subsequent study by Li et al. [195], in which
Rhizomys pruinosus senex bamboo rats in the Guangxi Province were
studied: 93.1% of the wild bamboo rats carried P. marneffei in the
internal organs. The
fungus was most commonly isolated from the lungs (87.5%), followed by
the liver (56.3%), spleen (56.3%) and mesentery lymph node (50%).
The association between P. marneffei and bamboo rats had also been
noted in Thailand, another country endemic for the infection. In two
studies by Ajello et al. [3] and Chariyalertsak et al. [47], P. marneffei
was recovered from various species of bamboo rats, including Cannomys
badius, Rhizomys pruinosus, and R. sumatrensis. The distribution of the
fungus in the internal organs was similar to previous studies, with the
highest prevalence in the lungs followed by the liver.
The consistency of these findings suggests that inhalation of the (pre-
sumably) infective conidia could be an important mode of transmission.
The occurrence of the fungus in the liver could be a result of the propen-
sity of the fungus to invade the reticuloendothelial system. It has been
suggested that bamboo rats, like human victims, probably acquired the
infection from a common environmental source. The possible link to en-
vironmental factors is demonstrated by two studies from northern Thai-
land which showed a significant clustering of cases of penicilliosis marn-
effei during the rainy season [45,46]. A recent history of occupational or
other forms of exposure to soil is also a significant risk factor.
Importantly, exposure to or consumption of bamboo rats was not a risk
factor for infection. The exact mode of transmission of the fungus from
its natural habitat is still unsettled at the moment.
Although P. marneffei is a naturally occurring sylvatic infection in
a high proportion of bamboo rat species [67], it is not known whether
bamboo rats are (1) an obligate stage in P. marneffei ’s life cycle or (2) a
zoonotic focus for human infection. Furthermore, it is not known whether
all lineages of P. marneffei are equally infectious to bamboo rats and hu-
mans or rather represent a subset of a wider, more genetically diverse
population. In order to address these questions, four groups of investiga-
tors reported the use of various molecular typing techniques in the differ-
entiation of P. marneffei strains. Vanittanakom et al. [323] first reported
in 1996 the use of restriction endonuclease analysis for epidemiological
typing of strains isolated in Thailand. Hsueh et al. noted an increase
in the incidence of P. marneffei infection in Taiwan in the 1990s [134].
Antifungal susceptibility, chromosomal DNA restriction fragment-length
polymorphism types, and randomly amplified polymorphic DNA patterns
recognised 8 strain types out of 20 isolates. Trewatcharegon et al., on
the other hand, used pulsed-field gel electrophoresis (PFGE) with NotI
digestion for strain differentiation [316]. Fisher et al. [88] used a
multilocus microsatellite typing (MLMT) system, an accurate and
reproducible method of characterising genetic diversity in eukaryotic
pathogens that have low levels of genetic variation. They observed high
genetic diversity and extensive spatial structure among clinical
isolates, revealing spatially structured P. marneffei populations [88].
In a further study, again based on MLMT typing results, Fisher et al.
[89] showed that different clones of the fungus are found in different
environments and that all the samples from any given location were
genetically very similar. This led them to the conclusion that the
fungus becomes highly adapted to its local environment, making it highly
successful there but preventing it from spreading to other areas. This
is why P. marneffei is endemic only to a relatively small area of
Southeast Asia.
Immunobiology
As with most other pathogens, the availability of iron is crucial to the
survival of P. marneffei in the human host. Studies by Taramelli et al.
have shown that the antifungal activity of macrophages is markedly
suppressed in the presence of iron overload and that iron chelators
inhibit the extracellular
growth of P. marneffei [306].
The route of transmission and infection of P. marneffei is unknown at
the moment. However, it is generally believed that inhalation of the coni-
dia is a likely route, in line with the mode of infection for other moulds.
The attachment of P. marneffei conidia to host cells and tissues is the
first step in the establishment of an infection. The conidia-host
interaction may occur via adhesion to the extracellular matrix proteins laminin
and fibronectin via a sialic acid-dependent process. Using immunofluores-
cence microscopy, Hamilton et al. demonstrated that fibronectin binds to
the conidia surface and to phialides, but not to hyphae [122]. The inves-
tigators suggested that there could be a common receptor for the binding
of fibronectin and laminin on the surface of P. marneffei [123,122].
The interaction between human leukocytes and heat-killed yeast-phase
P. marneffei has been studied by Rongrungruang et al. [269]. Their data
suggested that monocyte-derived macrophages phagocytose P. marneffei
even in the absence of opsonisation and the major receptor(s) recognising
P. marneffei could be a glycoprotein with N-acetyl-beta-D-glucosaminyl
groups. P. marneffei stimulates the respiratory burst of macrophages
regardless of whether opsonins are present, but tumour necrosis factor-α
secretion is stimulated only in the presence of opsonins. The authors thus
speculated that the ability of unopsonised fungal cells to infect mononu-
clear phagocytes in the absence of TNF-α production is a possible viru-
lence mechanism.
Although P. marneffei is capable of infecting and replicating inside
mononuclear macrophages, it is also evident that macrophages do possess
antifungal activities. The fungicidal activity of macrophages is likely to
involve the generation of reactive nitrogen intermediates, as described
by Kudeken et al. [180]. In addition to macrophages, the neutrophils
also exhibit antifungal properties. The fungicidal activity of neutrophils
is significantly increased in the presence of proinflammatory cytokines,
especially GM-CSF, G-CSF and IFN-γ. In addition to GM-CSF, G-CSF
and IFN-γ, other cytokines such as TNF-α and IL-8 are capable of en-
hancing the neutrophil’s inhibitory effects on germination of P. marneffei
conidia. The strongest effect was observed with GM-CSF [179]. Coni-
dia are, however, generally not susceptible to killing by phagocytes. The
fungicidal activity exhibited by neutrophils is believed to be independent
of superoxide anion and to act through exocytosis of granular enzymes [181].
Recently, Koguchi et al. demonstrated that osteopontin (secreted by
monocytes) could be involved in IL-12 production by peripheral blood
mononuclear cells during infection by P. marneffei, and the production
of osteopontin is also regulated by GM-CSF [171]. It is also likely that
the mannose receptor is involved as a signal-transducing receptor for trig-
gering the secretion of osteopontin by P. marneffei-stimulated peripheral
blood mononuclear cells.
Molecular biology
The mechanism of thermal dimorphism and morphogenesis in P. marneffei
is not fully understood. However, studies by Borneman et al. have
started to provide important information in this area [18,19]. It was
shown that the
homologue of the Aspergillus nidulans abaA gene is involved in the reg-
ulation of cell cycle and morphogenesis in P. marneffei [18]. An STE12
homologue of P. marneffei (stlA gene) was subsequently shown to be able
to complement the sexual defect of an A. nidulans steA mutant [19]. A
hitherto unknown sexual stage of P. marneffei is therefore postulated to
be present.
Other genes which are involved in the growth and development of
P. marneffei have been described recently. A CDC42 homologue (cflA
gene) was shown to be required for polarisation and determination of cor-
rect cell shape during yeast-like growth, and for the separation of yeast
cells [22]. Deletion of the homologue of Aspergillus nidulans stuA gene in
P. marneffei showed that the gene is required for metula and phialide for-
mation during conidiation but is not required for dimorphic growth [20].
No vaccine is currently available for P. marneffei. Some recent studies
showed that vaccine development is potentially feasible. The P. marnef-
fei mannoprotein Mp1p (encoded by the MP1 gene) has been tested in a
mouse model as a potential vaccine candidate [347]. The relative
efficacies of an intramuscular MP1 DNA vaccine, an oral mucosal MP1 DNA
vaccine using a live-attenuated Salmonella typhimurium carrier, and an
intraperitoneal recombinant Mp1p protein vaccine were compared. The
intramuscular MP1 DNA vaccine appeared to give the best protection
against P. marneffei.
1.2.3 Penicilliosis marneffei
Clinical features
Penicilliosis marneffei manifests clinically as a progressive systemic febrile
illness as a result of infiltration and inflammation of the reticuloendothe-
lial system by the yeast stage of P. marneffei. Common clinical fea-
tures include systemic symptoms of fever, weight loss, anaemia, and those
due to local organ involvement such as pulmonary syndrome, chest radi-
ographic infiltrate, lymphadenopathy, hepatosplenomegaly, molluscum-
contagiosum-like skin lesions, osteolytic bone lesions, arthritis, subcuta-
neous abscesses and even endophthalmitis. Almost all organs could be
involved in severe disseminated disease.
In immunocompetent hosts, the tissue damage is mainly associated
with granulomatous inflammation with multinucleated giant cells, lym-
phocytes, and neutrophils. A suppurative inflammation dominated by
neutrophils resulting in abscess formation can be present. In immuno-
suppressed hosts, an anergic and necrotising reaction is found with diffuse
infiltration of macrophages engorged with yeast cells.
Underlying immunosuppression could be found in 80% of penicilliosis
patients. The commonest underlying disease is AIDS. P. marneffei is
second only to Cryptococcus neoformans as the commonest opportunis-
tic fungal pathogen in AIDS patients in Southeast Asian countries like
Thailand.
Infections in non-HIV-infected patients have also been described,
primarily among immunocompromised patients and less frequently in
patients without any known underlying diseases. Reported cases of
non-HIV-associated penicilliosis marneffei have occurred in patients
with alcoholism, tuberculosis or systemic lupus erythematosus, patients receiving
corticosteroid or other forms of immunosuppressive therapy, and even
patients without any apparent underlying disease. Manifestations of the
infection included lymphadenopathy, osteomyelitis and septic arthritis,
pulmonary infection, and disseminated infection with multi-organ in-
volvement.
Comparison of the clinical manifestations of penicilliosis in HIV-positive
and HIV-negative patients has been published recently [349]. Of the 15
patients who had culture-documented P. marneffei infection, 8 (53.3%)
were HIV positive and 7 (46.7%) were HIV negative. The HIV-infected
patients had a higher incidence of fungaemia than the non-HIV-infected
patients (50% vs. 28.6%), while the latter group frequently required
tissue biopsies for confirmation of the infection. There was a
significant delay in establishing the diagnosis in non-HIV-infected
patients when compared with HIV-infected patients (median delay of 5.5
weeks vs. 1 week, P < 0.01). Most of the non-HIV patients (85.7%) had
underlying immunocompromising conditions, including haematological
malignancies and autoimmune diseases requiring the use of
corticosteroids or cytotoxic chemotherapy, as well as diabetes mellitus.
In both
categories, pulmonary involvement was the commonest manifestation on
initial presentation, followed by pyrexia of unknown origin and cutaneous
manifestation.
Diagnosis
Fungal culture The infection itself is relatively amenable to antifun-
gal therapy and a cure is potentially possible. Early recognition of the
infection is therefore essential for timely initiation of effective therapy.
Conventional fungal culture remains the diagnostic test of choice in
most settings. The fungus may be cultivated from appropriate clinical
specimens in most cases, such as blood cultures, skin lesions, and
respiratory tract specimens. In AIDS patients with high levels of
fungaemia, it has occasionally been reported that a direct smear of the
peripheral blood may reveal the fungus. In HIV-positive patients,
fungaemia could be detected in at least 55% of the patients in previous
reports.
Unfortunately, fungal culture suffers from the drawbacks of a long
turnaround time and the occasional need for invasive tissue biopsies to
obtain a satisfactory specimen. In a series of HIV-infected patients
from Hong Kong, 50% had documented fungaemia [349].
The yeast form of P. marneffei may be stained by the methenamine
silver or periodic acid-Schiff stains in tissue sections. When the cen-
tral septation of the yeast cell is seen in the histopathological section,
this offers clues to the diagnosis of penicilliosis. Pierard et al. reported
that the monoclonal antibody EB-A1 against the galactomannan of
Aspergillus species may also be used to detect P. marneffei in
formalin-fixed, paraffin-embedded tissues [249].
Serology A number of studies have aimed at detecting fungal antibodies
and/or antigens in the serum and body fluids of infected patients. In
earlier studies, culture filtrates or whole cell extracts were used as
antigens. P. marneffei was cultured in liquid media, and the culture fil-
trate was concentrated to immunise rabbits. The culture filtrate and the
anti-P. marneffei rabbit sera were incorporated in an immunodiffusion
test to detect antibody or antigens respectively [277,333,144].
In 1994, an indirect immunofluorescent antibody test for serodiagnosis
of P. marneffei infection was reported, using the yeast-hyphae (represent-
ing tissue multiplication phase) or the germinating conidia (representing
initial tissue invasion phase) as antigens [365]. None of the eight sera
from culture-documented patients tested at 1 : 10 dilution gave a posi-
tive result for IgM. High IgG titres (of the respective phases, geometric
mean 1 : 905 and 1 : 1280) were found in all eight penicilliosis marneffei
patients, in contrast to that obtained from 78 healthy controls (with a
respective geometric mean of 1 : 1.34 and 1 : 2.14). Sera from patients
with cryptococcosis (n = 2) or candidaemia (n = 2) did not show cross-
reactivity (IgG titre < 1 : 40, which is similar to that of the healthy con-
trols). Overall, the IgG titre was higher than IgA for the cases but there
was little difference in using the germinating conidia or the yeast-hyphae
form as the testing antigen. Moreover, IgA could not be detected in two
out of eight positive cases. Three HIV patients with culture-documented
penicilliosis marneffei tested positive (IgG titres 1 : 80 − 1 : 160).
An IgG titre > 1 : 80 is suggestive of penicilliosis marneffei.
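Because serial dilutions step multiplicatively, titres such as those above are summarised by a geometric rather than an arithmetic mean. The following Python sketch illustrates the calculation under stated assumptions: the function name geometric_mean_titre is hypothetical, titres are passed as reciprocals (1 : 640 as 640), and the four example values are invented for illustration and are not data from the study cited above.

```python
# Illustrative sketch: geometric mean of reciprocal antibody titres.
# The example titres are hypothetical and are not taken from [365].
import math


def geometric_mean_titre(reciprocal_titres):
    """Return the geometric mean of reciprocal titres (e.g. 640 for 1:640)."""
    if not reciprocal_titres:
        raise ValueError("at least one titre is required")
    if any(t <= 0 for t in reciprocal_titres):
        raise ValueError("reciprocal titres must be positive")
    # Averaging on the log scale is equivalent to taking the n-th root
    # of the product, but avoids very large intermediate products.
    mean_log = sum(math.log(t) for t in reciprocal_titres) / len(reciprocal_titres)
    return math.exp(mean_log)


# Hypothetical two-fold dilution series: 1:160, 1:320, 1:640, 1:1280.
example_titres = [160, 320, 640, 1280]
gmt = geometric_mean_titre(example_titres)
print(f"geometric mean titre = 1 : {gmt:.0f}")
```

For these invented values the geometric mean lies midway on the log scale between 1 : 320 and 1 : 640, whereas the arithmetic mean (1 : 600) would be pulled towards the highest titre.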
In 1996 Kaufman et al. developed a latex agglutination test to detect
antigenaemia, where polystyrene beads were coated with rabbit anti-P.
marneffei globulin, obtained from rabbits immunised with yeast culture
filtrate [160]. Of the 17 P. marneffei culture-positive HIV patients,
77% tested positive.
Desakorn et al. later used purified hyperimmune IgG, from rabbits
immunised with yeast cells, in an enzyme-linked immunosorbent assay
(ELISA) to quantitate P. marneffei yeast antigens in urine samples [69].
All urine samples from 33 P. marneffei culture-positive HIV patients
tested positive, with a median titre of 1 : 20.
Jeavons et al. characterised and purified three cytoplasmic yeast anti-
gens of 50-, 54- and 61-kDa, which were found respectively in 48, 71
and 85% of serum samples from 21 P. marneffei culture-positive pa-
tients [146]. Chongtrakool et al. isolated a 38-kDa antigen partially-
purified from yeast culture filtrate, where 45% of P. marneffei culture-
positive HIV patients (n = 51), 17% of HIV positive asymptomatic pa-
tients (n = 262) and 25% of other fungal culture-positive HIV patients
(n = 67) had developed antibodies against this antigen [54].
PCR The detection of P. marneffei genomic DNA in clinical specimens
has also been reported. LoBuglio and Taylor used primers PM2
and PM4 to amplify a 347 bp fragment of the internal transcribed spacer
region between 18S rDNA and 5.8S rDNA [202]. On the other hand
Vanittanakom et al. used a PCR-Southern hybridisation format, where
primers RRF1 and RRH1 were used to amplify a 631 bp fragment of
the 18S rDNA, followed by hybridisation with a P. marneffei -specific 15-
oligonucleotide probe [324]. Recently Vanittanakom et al. described a
nested PCR assay which might prove useful in the detection of P. marn-
effei and identification of young fungal cultures [325].
Mp1p The first gene cloned from P. marneffei was the MP1 gene [37].
Serum from guinea pigs immunised with P. marneffei yeast cells was used
to screen the cDNA library of P. marneffei. The MP1 gene, which encodes
an abundant antigenic cell wall mannoprotein in P. marneffei, was
subsequently cloned. MP1 is a unique gene without homologues in
sequence databases. It codes for a protein, Mp1p, of 462 amino acid
residues, with a few sequence features that are present in several cell wall
proteins of Saccharomyces cerevisiae and Candida albicans. It contains
two putative N-glycosylation sites, a serine- and threonine-rich region for
O-glycosylation, a signal peptide, and a putative glycosylphosphatidyli-
nositol attachment signal sequence. Specific anti-Mp1p antibody was
generated with recombinant Mp1p protein purified from Escherichia coli
to allow further characterisation of Mp1p. Western blot analysis with
anti-Mp1p antibody revealed that Mp1p produces dominant bands with
molecular masses of 58 and 90 kDa and that it belongs to a group of cell
wall proteins that can be readily removed from yeast cell surfaces by glu-
canase digestion. In addition, Mp1p is an abundant yeast glycoprotein
and has high affinity for concanavalin A, a characteristic indicative of a
mannoprotein. Furthermore, ultrastructural analysis with immunogold
staining indicated that Mp1p is present in the cell walls of the yeast, hy-
phae, and conidia of P. marneffei. Finally, it was observed that infected
patients develop a specific antibody response against Mp1p, suggesting
that this protein represents a good cell surface target for host humoral
immunity.
The antibody response of penicilliosis patients to Mp1p was studied
in two subsequent studies [38, 39]. An ELISA-based antibody test with
purified Mp1p was produced. Evaluation of the test with guinea pig sera
against P. marneffei and other pathogenic fungi indicated that this assay
was specific for P. marneffei. Clinical evaluation revealed that high levels
of specific antibody were detected in two immunocompetent penicilliosis
patients. Furthermore, approximately 80% (14 of 17) of the documented
penicilliosis patients with human immunodeficiency virus tested positive
for the specific antibody. No false-positive results were found for serum
samples from 90 healthy blood donors, 20 patients with typhoid fever,
and 55 patients with tuberculosis, indicating a high specificity of the test.
Thus, this ELISA-based test for the detection of anti-Mp1p antibody can
be of significant value as a diagnostic for penicilliosis.
In vitro, Mp1p is found to be secreted into the cell culture super-
natant at a level that can be detected by Western blotting. A sensitive
ELISA developed with antibodies against Mp1p was capable of detecting
this protein in the cell culture supernatant of P. marneffei at 10^4
cells/mL. The anti-Mp1p antibody is specific since it fails to react
with any protein from lysates of Candida albicans, Histoplasma capsulatum, or
Cryptococcus neoformans by Western blotting. In addition, this Mp1p
antigen-based ELISA is also specific for P. marneffei since the cell cul-
ture supernatants of the other three fungi gave negative results. Finally,
a clinical evaluation of sera from penicilliosis patients indicates that 17
of 26 (65%) patients are Mp1p antigen test positive. Furthermore, an
Mp1p antibody test was performed with these serum specimens. The
combined antibody and antigen tests for P. marneffei carry a sensitivity
of 88% (23 of 26), with a positive predictive value of 100% and a negative
predictive value of 96%. The specificities of the tests are high since none
of the 85 control sera was positive by either test.
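The figures quoted above for the combined test can be reproduced from a 2 × 2 table. The Python sketch below assumes that the 26 culture-documented patients are the only true positives and that the 85 control sera are the only true negatives, giving 23 true positives, 3 false negatives, 0 false positives and 85 true negatives; the class and function names are hypothetical.

```python
# Illustrative sketch: diagnostic accuracy measures from a 2 x 2 table.
# Counts are derived from the text under the assumptions stated above.
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # patients positive by either test
    fn: int  # patients negative by both tests
    fp: int  # controls positive by either test
    tn: int  # controls negative by both tests


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Return sensitivity, specificity, PPV and NPV as fractions."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


combined_test = ConfusionCounts(tp=23, fn=3, fp=0, tn=85)
metrics = diagnostic_metrics(combined_test)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under these assumptions the sensitivity is 23/26 and the negative predictive value is 85/88, in agreement with the reported values of 88% and 96%. Predictive values depend on the ratio of patients to controls in the sample, so they would differ in a population with a different prevalence of infection.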
The value of antigen (Mp1p) and antibody (anti-Mp1p) detection in
the diagnosis of penicilliosis marneffei is best evaluated by comparing the
results in patients with or without underlying HIV infection. In a study
involving eight HIV positive and seven non-HIV penicilliosis marneffei
patients, the HIV positive patients tended to have a higher antigen titre
and a lower antibody titre, while the converse is true in the HIV negative
patients. This presumably is due to impaired antibody production as a
result of the underlying immune defects associated with HIV infection
and a higher fungal load in this group of patients. Concomitant testing
of the serum antigen and antibody levels could therefore improve the
diagnostic yield of serology in immunocompromised patients.
When serial serum samples were available for the HIV-positive pa-
tients, it was found that the serum antigen and antibody titres against
P. marneffei were elevated as early as 30 days before the day of posi-
tive cultures. The titres of both serum antigen and antibody dropped
with the initiation of amphotericin B therapy and itraconazole prophy-
laxis. Upon subsequent follow up, there was no clinical and mycological
evidence of relapse and this was associated with a persistently negative
serum antigen and antibody ELISA.
Treatment
In vitro, P. marneffei is susceptible to itraconazole and amphotericin
B, while its susceptibility to fluconazole and 5-fluorocytosine is less
uniform [301]. The recommended antifungal regimen to date consists of two
weeks of intravenous amphotericin B (0.6 mg/kg/d) followed by ten weeks
of oral itraconazole (400 mg/d), which resulted in clinical and
microbiological cure in 97.3% of the patients. Long term secondary prophylaxis
has also been suggested to reduce the relapse rate [290,302]. With wider
use of highly active antiretroviral therapy (HAART) for HIV infection, it
has been suggested that long term antifungal prophylaxis may not be
necessary. HAART has been shown to reduce the incidence of many
opportunistic infections in AIDS patients, including invasive fungal
infection. There is, however, currently no specific cut-off value of
CD4 cell count that can be used to guide the use of secondary antifungal
prophylaxis [140]. One recent interesting observation is that several
4-aminoquinoline agents, including chloroquine, were found to
inhibit the growth of P. marneffei inside macrophages. The activity of
chloroquine on P. marneffei is postulated to be due to an increase in the
intravacuolar pH and a disruption of pH-dependent metabolic processes.
This finding could be of value in the chemotherapy or chemoprophylaxis
of penicilliosis marneffei [307].
1.2.4 Fungal genome projects
Genomics has only just started to have an impact on biological and medical
research, although modern molecular genetics has been at the center of the
biomedical revolution since the 1980s. The study of whole genome
sequences is a new tool in biomedical research.
At the time when this thesis is written, there are about 317 completed
and published genome sequence projects and 549 eukaryotic and 802
prokaryotic ongoing projects (data from the Genomes OnLine Database
(GOLD) at http://www.genomesonline.org/). Current estimates suggest
at least 2 million fungal species, of which only some 50,000 to 70,000
have been documented, and the genomes of only a few of them
have been completed.
S. cerevisiae was the first eukaryote to have its genome fully se-
quenced. In 1996 the work was completed by many different laboratories
and organisations. Its genome contains ≈6,000 genes on 16 chromosomes.
At the time the genome sequence was published, only 43.3% of the
yeast genes were classified as ‘functionally characterised’, i.e., having
experimentally well-investigated properties, being members of well-defined
protein families, or displaying strong homology to proteins with known
biochemical functions. Despite this limitation, it is the best-studied
fungus and serves as the most important model organism for fungal
genetics. An all-against-all comparison of the yeast genome has been
carried out, and duplication patterns within the genome have been
investigated in a systematic way. Such a view of the genome’s architecture,
based on an exhaustive intra-genomic sequence comparison, revealed that
whole genome duplication seems to have had an important influence on
the evolutionary development of S. cerevisiae [220].
The S. pombe genome [354] contains the smallest number of protein-
coding genes yet recorded for a eukaryote: 4,824. Centromere structure
has been well studied in S. pombe: the centromeres are between 35 and
110 kb and contain related repeats including a highly conserved 1.8-kb
element. More introns (of which there are 4,730) are found than in S.
cerevisiae. Some 43% of the genes contain introns. Some homologs of
human disease genes, such as cancer related genes, have been identified.
Comparative study identified highly conserved genes important for eu-
karyotic cell organisation including those required for the cytoskeleton,
compartmentation, cell-cycle control, proteolysis, protein phosphoryla-
tion and RNA splicing, which may have originated with the appearance
of eukaryotic life. In contrast, few similarly conserved genes that are
important for multicellular organisation were identified. The lesson from
studying the S. pombe genome is that the transition from prokaryotes to
eukaryotes required more new genes than did the transition from unicellular
to multicellular organisation.
The N. crassa genome has been reported recently [101]. The genome
was assembled from more than 20-fold sequence coverage. It has the
largest genome size (39.9 Mb) and gene number (10,082 protein-coding
genes) among all published fungal genomes so far. On average, the gene
density is one gene per 3.7 kilobases (kb), with an average of 1.7 short
introns (134 bp on average) per gene. The Neurospora genome comprises
a small number of repetitive elements, a low degree of segmental
duplication and very few paralogous genes. Neurospora genes are highly
divergent: of the predicted proteins, 41% have no significant matches to
known proteins. Many genes have predicted products likely to be involved
in determining hyphal growth and multicellular developmental structures
in Neurospora, as well as in catabolism, chemical detoxification and
stress-defense mechanisms. It has also been noted
that for some Neurospora genes the only known homologs are found in
prokaryotes [216], indicating that occupation of similar ecological niches
has resulted in conservation of genes for substrate degradation and
secondary metabolism.
Magnaporthe grisea, one of the most devastating agricultural pathogens
in the world, has been sequenced [64]. The fungus causes blast disease in
rice, a scourge that destroys enough rice crops to feed 60 million people
annually. The pathogen’s remarkable ability to overcome plant defences
has stymied efforts to fight the disease. Analysis of its predicted gene set
provides an insight into the adaptations required by a fungus to cause
disease. The M. grisea genome encodes a large and diverse set of se-
creted proteins, including those defined by unusual carbohydrate-binding
domains. This fungus also possesses an expanded family of G-protein-
coupled receptors, several new virulence-associated genes and large suites
of enzymes involved in secondary metabolism. Together with the draft
rice genome sequences published earlier this year, the new information
will help researchers develop better and cheaper methods of protecting
plants than the currently available fungicides.
Recently, the C. albicans and C. neoformans genomes were reported
[148, 203], enabling a comparison between these divergent fungi. More-
over, high-quality draft sequences of A. nidulans and A. fumigatus are
already in the public domain, and others, such as Ustilago maydis, are
likely to be available soon. Other genome sequencing projects of patho-
genic fungi are also under way or will soon be started (for instance,
Pneumocystis carinii).
1.3 Materials and Methods
Strain isolation and DNA preparation for the P. marneffei genome project
were carried out by colleagues in the Department of Microbiology,
University of Hong Kong. Library construction and shotgun sequencing
were carried out by the Beijing Genomics Institute (BGI).
1.3.1 Strain and DNA preparation
P. marneffei strain PM1 was isolated from an HIV-negative patient
suffering from culture-documented penicilliosis in Hong Kong. The
arthroconidia (“yeast form”) of PM1 were used throughout the DNA sequencing
experiments. Genomic DNA, including mitochondrial DNA, was
prepared from the arthroconidia purified at 37 °C. A single colony of the
fungus grown on Sabouraud dextrose agar at 37 °C was inoculated into
yeast peptone broth and incubated in a shaker at 30 °C for 3 days. Cells
were cooled in ice for 10 min, harvested by centrifugation at 2000g for
10 min, washed twice and re-suspended in ice cold 50 mmol EDTA/l
buffer (pH 7.5). 20 mg novazym/ml was added and the suspension was
incubated at 37 °C for one hour, followed by digestion in a mixture of
1 mg proteinase K/ml, 1% N-lauroylsarcosine, and 0.5 mol EDTA/l pH 9.5
at 50 °C for 2 hours. Genomic DNA was then extracted with phenol and
phenol-chloroform, and finally precipitated and washed in ethanol. After
digestion with RNase A, a second ethanol precipitation was followed by
washing with 70% ethanol; the DNA was then air-dried and dissolved in
500 µl of TE (pH 8.0).
1.3.2 Library construction, shotgun sequencing
Two genomic DNA libraries were made in pUC18 carrying insert sizes
from 2.0 – 3.0 kb and 7.5 – 8.0 kb, respectively. DNA inserts were pre-
pared by physical shearing using the sonication method. The genome
sequence was assembled from deep whole-genome shotgun (WGS) cov-
erage obtained by paired-end sequencing from a variety of clone types,
i.e., all inserts were sequenced from both ends to generate paired reads.
A total of about 190.3 Mb of sequence data, equivalent to
approximately 6.6× coverage of the genome, was generated by random
shotgun sequencing.
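The coverage figure is simply the total number of sequenced bases divided by the genome length. The sketch below is illustrative only: the function name fold_coverage is hypothetical, and the two denominators are the assembly size and the estimated genome size listed in Table 1.1.

```python
def fold_coverage(total_read_bases, genome_size):
    """Average number of times each base of the genome is sequenced."""
    return total_read_bases / genome_size


READ_BASES = 190.3e6             # total shotgun sequence, from the text
ASSEMBLY_SIZE = 28.98e6          # assembly size excluding gaps (Table 1.1)
ESTIMATED_GENOME_SIZE = 31e6     # estimated genome size (Table 1.1)

print(round(fold_coverage(READ_BASES, ASSEMBLY_SIZE), 2))          # 6.57
print(round(fold_coverage(READ_BASES, ESTIMATED_GENOME_SIZE), 2))  # 6.14
```

The quoted 6.6× matches the first ratio, which suggests (although the text does not say so explicitly) that coverage was computed relative to the assembled length rather than the estimated genome size.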
1.3.3 Sequence assembly
The Phred/Phrap/Consed package was used for base calling, contig assembly
and quality assessment [83, 84, 112]. Contigs were ordered into scaffolds
by the scaffold-building program Bambus [255]. Refer to Chapter 2 for
more detailed descriptions of the annotation procedure and genome database
construction.
1.3.4 Data release
Sequence data generated by the project were released continuously and
were available for searching using the on-site BLAST server and for
downloading by FTP with access restriction. The annotated sequences are
available for browsing and downloading from the web interface of the
P. marneffei Genome Database (PMGD), http://www.pmarneffei.hku.hk. At
present, PMGD contains 10,060 protein-coding genes.
Table 1.1: General features of the P. marneffei genome.
Feature                           Value
Assembly size (excluding gaps)    28.98 Mb
Estimated genome size             ∼31 Mb
GC content overall                47%
GC content (coding)               50%
Protein coding genes              10,060
tRNAs                             110
% coding                          62%
Average gene size                 1,753 bp
Average intergenic distance       1,051 bp
Average intron size               111 bp
Average exon size                 380 bp
1.4 Results
1.4.1 Assembly and general characteristics
Using a pure whole genome shotgun approach, we sequenced the P. marn-
effei genome to 6.6× coverage. The net length of assembled contigs
totalled 28.98 Mbp. Genome statistics are presented in Table 1.1.
Genome sequence
The P. marneffei genome size was estimated at ∼31 Mb (see Section 2.4.2),
which is similar to that of Magnaporthe (∼30 Mbp), larger than those
of S. cerevisiae and S. pombe (both about 12 Mbp), but smaller than that
of Neurospora (greater than 40 Mbp). The resulting assembly consists of
2,911 sequence contigs with a total length of 28,977,603 bp. Contigs
were ordered into 273 supercontigs (i.e., scaffolds) with a total length
of 28.42 Mbp (excluding gaps between contigs). Most of the assembly
(98.35%) is contained in the contigs. Given the high sequencing
coverage, the assembly represents the vast majority (> 95%) of the genome,
as theoretically assessed by the Lander-Waterman model [186]. The
mitochondrial genome (35 kb, circular) has been completely sequenced and
assembled (see Chapter 3 for detail).
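The Lander-Waterman estimate can be illustrated with its simplest consequence: if reads are placed uniformly at random and the fold coverage is c, the expected fraction of the genome covered by at least one read is 1 - exp(-c). The sketch below assumes this idealised form only (no minimum overlap, no cloning or repeat bias), so it is an upper bound rather than the exact calculation used for the assembly; the function name is hypothetical.

```python
import math


def expected_fraction_covered(coverage):
    """Idealised Lander-Waterman estimate: P(base is covered) = 1 - e^(-c)."""
    return 1.0 - math.exp(-coverage)


# At the 6.6x coverage reported for P. marneffei
fraction = expected_fraction_covered(6.6)
print(f"{fraction:.4f}")
```

At 6.6× the idealised model predicts that well over 99% of bases are sampled at least once, which is consistent with the more conservative "> 95%" stated in the text.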
Genes
A total of 10,060 protein-coding genes (9,257 (92%) longer than 100
amino acids) were predicted. This number, again, is similar to that of
Magnaporthe and less than that of Neurospora; it is nearly twice as
many genes as in S. cerevisiae (about 6,300) and S. pombe (about 4,800),
and nearly as many as in D. melanogaster (about 14,300). The average
gene density is one gene per 2.8 kb. The average gene length of 1.75 kb
is slightly longer than the 1.67 kb average gene length for Magnaporthe
and the 1.40 kb for both S. cerevisiae and S. pombe. The protein-coding
sequence is predicted to occupy 62.1% (51.2% excluding introns) of the
sequenced portion of the P. marneffei genome, compared with 71% in
S. cerevisiae (70.5% excluding introns) and 60.2% in S. pombe (57%
excluding introns)
(Table 1.2). An estimated total of 28,180 introns are distributed among
91% of P. marneffei genes, with 34 being the largest number of introns
found within a single gene. Introns varied from 15 to 1,617 nucleotides
long, with a mean length of 111 nucleotides. The telomere tandem re-
peat identified is TTAGGG. Several predicted genes that encode conserved
telomere and centromere proteins, such as, telomere-associated helicases,
were identified, but telomere and centromere sequences have remained
elusive. Note that, although the complete genomes of A. fumigatus and A.
nidulans have not been published, high-quality drafts of their genomes are
available. Preliminary analyses reveal that most of the above statistics
on gene number and gene density of P. marneffei are similar to those
of Aspergillus. This result is consistent with our understanding of the
phylogenetic relationship between them, as inferred from small ribosomal
RNA sequences (Section 1.4.1) and mitochondrial comparison (Chapter 3).
Table 1.2: Comparison of P. marneffei genome statistics to those of other
fungi. PM - P. marneffei, AN - A. nidulans, MG - M. grisea, NC - N.
crassa, SC - S. cerevisiae, and SP - S. pombe.
                                     PM      AN      MG      NC      SC      SP
Genome size (Mb)                     31      31      30      43      12      12
Gene number                          10,060  9,457   11,108  10,620  6,300   4,800
Gene coverage                        62.1%   59.2%   48.2%   44.5%   71.0%   60.2%
Gene coverage (excluding introns)    51.2%   50.6%   40.5%   37.6%   70.5%   57.0%
Ribosomal RNA and tRNA
Copies of the large rRNA tandem repeat containing the 18S, 5.8S and
25S rRNA genes are present in the P. marneffei genome. Ribosomal RNAs
from P. marneffei and other fungi were used to construct a phylogeny
to study phylogenetic relationships. 18S rRNA sequences from 43 species
of Ascomycetes were obtained from Ribosomal Database Project II Release
8.1 (http://rdp.cme.msu.edu/html/). The phylogenetic relationship
is presented in Fig. 1.3. The neighbour-joining method of tree
reconstruction, implemented in MBEToolbox (Chapter 10), was used.
Alignment replicates for bootstrapping were generated by using Phylip [86].
The result suggests that P. marneffei is likely to be an anamorph of a
Talaromyces species. This substantiates the observation that the spacer
regions of the rRNA loci are highly similar to those found in Talaromyces
species [158,330]. Indeed the sequence is almost identical with those of T.
flavus and T. bacillisporus (Fig. 1.3). It is also very similar to that of
Chromocleista cinnabarina, a soil fungus that produces a red pigment, as
does P. marneffei. A total of 110 tRNA genes were identified, including
69 (63%) with introns.
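Neighbour-joining operates on a matrix of pairwise distances, and the tree in Fig. 1.3 uses the Jukes-Cantor correction. As an illustration of that one step, the sketch below computes the Jukes-Cantor distance d = -(3/4) ln(1 - 4p/3) from the proportion p of differing sites in two aligned sequences. It is written in Python for illustration and is not the MBEToolbox (Matlab) code; the function names and the two toy sequences are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping positions with a gap or N."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance in substitutions per site."""
    if p >= 0.75:
        raise ValueError("p-distance is saturated; JC distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy aligned fragments (not real 18S rRNA data): 2 differences in 20 sites
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "ACGTACGAACGTACGTACCT"
p = p_distance(seq_a, seq_b)
d = jukes_cantor(p)
print(f"p = {p:.3f}, JC distance = {d:.4f}")
```

The corrected distance is slightly larger than the raw proportion of differences because it accounts for multiple substitutions at the same site.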
Figure 1.3: Phylogenetic tree showing the relationships of P. marneffei to
other Penicillium and Talaromyces species. The tree was inferred from
18S rRNA data by the neighbour-joining method and bootstrap values
calculated from 1000 trees. The scale bar indicates the estimated number
of substitutions per 100 bases using the Jukes-Cantor correction. Names
and accession numbers are given as cited in the GenBank database.
(Tree graphic not reproduced.)
1.4.2 Genome architecture and co-linearity
Identification of syntenies conserved between species is valuable for
tracing the evolutionary events that affect genomes. However, little
information about synteny among chromosome segments (or contigs) is known
for filamentous ascomycetes. Analysis of orthologous genes among P.
marneffei, A. nidulans and A. fumigatus revealed extensive regions of
conserved synteny, as well as a considerable extent of genome
reorganisation that has occurred in this phylum. A total of 1,340 regions
containing four or more genes were found to be co-linear between
P. marneffei and A. nidulans; these regions contain 3,188 P. marneffei
genes. Between P. marneffei and A. fumigatus there are 1,273 such regions,
containing 3,716 P. marneffei genes. The largest syntenic cluster contains
27 gene pairs and is shared by P. marneffei and A. nidulans.
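One simple way to find co-linear regions of the kind counted above is to walk along the genes of a contig in order and extend a block for as long as each gene's ortholog sits immediately next to the previous ortholog on the same contig of the other genome. The sketch below is an assumption about how such a scan could be done, not the procedure used in this thesis: it requires strict adjacency (no tolerance for inserted genes), and the function name, the data layout and the gene identifiers are all hypothetical.

```python
def collinear_blocks(gene_order, orthologs, min_genes=4):
    """Find runs of consecutive genes whose orthologs are also consecutive.

    gene_order: gene IDs in their order along one contig.
    orthologs:  dict mapping gene ID -> (contig, position) in the other genome.
    Returns a list of blocks, each a list of at least min_genes gene IDs.
    """
    blocks = []
    current = []      # genes in the block being extended
    direction = 0     # +1 or -1 once two genes fix the orientation
    prev = None       # (contig, position) of the previous gene's ortholog
    # A trailing None acts as a sentinel that closes the last block.
    for gene in list(gene_order) + [None]:
        hit = orthologs.get(gene) if gene is not None else None
        extends = False
        if hit is not None and prev is not None and hit[0] == prev[0]:
            delta = hit[1] - prev[1]
            extends = delta in (1, -1) and direction in (0, delta)
        if extends:
            direction = delta
            current.append(gene)
        else:
            if len(current) >= min_genes:
                blocks.append(current)
            current = [gene] if hit is not None else []
            direction = 0
        prev = hit
    return blocks


# Hypothetical example: four genes co-linear with contig "c1", then a break
order = ["pm1", "pm2", "pm3", "pm4", "pm5", "pm6", "pm7"]
orth = {
    "pm1": ("c1", 10), "pm2": ("c1", 11), "pm3": ("c1", 12),
    "pm4": ("c1", 13), "pm5": ("c2", 5), "pm7": ("c1", 14),
}
print(collinear_blocks(order, orth))
```

A production analysis would normally tolerate small gaps and local inversions; the strict version is shown only because it makes the "four or more genes" criterion easy to see.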
Melanin-biosynthesis gene cluster
One of the interesting examples of the syntenic segments conserved be-
tween P. marneffei and Aspergillus spp. is the melanin biosynthesis gene
cluster. This six-gene cluster, spanning ∼ 19 kb, which participates in
DHN-melanin biosynthesis [24, 187, 317, 318], is found in P. marneffei,
and is syntenic in A. fumigatus (Chapter 4).
Pheromone precursor gene loss
Syntenic regions reveal evolutionary events, like gene loss, which are dif-
ficult to identify by other methods. One of the examples is the loss
of known mating pheromone precursor genes. Figure 1.4 shows the mi-
crosyntenies among pheromone precursor loci from P. marneffei, A. nidu-
lans, A. fumigatus and N. crassa. The pheromone precursor gene has
been identified in all these species (highlighted in green) except for P.
marneffei. The hypothetical locations of P. marneffei pheromone pre-
cursor genes before loss are indicated by triangles in the figure.
Figure 1.4: Microsyntenies containing pheromone precursor loci from P.
marneffei, A. nidulans, A. fumigatus and N. crassa. The pheromone
precursor genes have been highlighted in green. The hypothetical locations
of P. marneffei pheromone precursor genes before gene loss are indicated
by triangles.
1.4.3 Gene duplications (multigene families) and comparisons
Among all predicted P. marneffei genes (total 10,060 with 9,541 longer
than 100 bp), 1,335 belong to 428 multigene families, each of which
contains more than one homologous member. The largest gene family consists
of 34 genes. The most expanded gene families include MFS multidrug
transporters, dehydrogenases/reductases and hexose transporters, as well as
pepsin-type proteases (see Table 8.2 on page 165). Comparisons of
contig/supercontig sequences and searches for tracts of conserved gene order
reveal little evidence for large-scale duplications in P. marneffei. The
incomplete genome sequence and unordered contigs obviously impair
the detection. Notably, this result is inconsistent with another
line of evidence, presented in Chapter 8, in which the histogram of
synonymous substitution rates of P. marneffei duplicate gene pairs
suggests that two large-scale gene duplications probably happened.
Compared with S. cerevisiae, which has undergone whole genome duplication
(i.e., the largest-scale gene duplication), P. marneffei has a relatively
small number of recently duplicated gene pairs. However, the age
distribution of duplicate genes in P. marneffei at the first peak (see
Chapter 8 for detail) shows a pattern similar to that in S. cerevisiae,
which might suggest that duplicate genes in P. marneffei originated
through one or two episodic, large-scale gene duplications.
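Grouping genes into multigene families can be viewed as finding connected components in a graph whose edges are pairs of homologous genes. The sketch below uses single-linkage clustering with a union-find structure; this is an assumed, generic approach shown for illustration, since the clustering criteria actually used are described elsewhere in the thesis. The function name and gene identifiers are hypothetical.

```python
def gene_families(homolog_pairs):
    """Single-linkage clustering of genes from pairs of homologs.

    homolog_pairs: iterable of (gene_a, gene_b) tuples judged homologous.
    Returns a list of sets, one per family with more than one member.
    """
    parent = {}

    def find(x):
        # Locate the representative of x, compressing the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in homolog_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    families = {}
    for gene in parent:
        families.setdefault(find(gene), set()).add(gene)
    return [members for members in families.values() if len(members) > 1]


# Hypothetical homolog pairs: g1-g2-g3 form one family, g4-g5 another
pairs = [("g1", "g2"), ("g2", "g3"), ("g4", "g5")]
result = sorted(sorted(family) for family in gene_families(pairs))
print(result)
```

Single linkage tends to merge families that share only one promiscuous domain, so real pipelines usually add alignment-coverage or score thresholds before linking two genes.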
1.4.4 Interspecies proteome comparison
The comparison of genomic sequences of two or more species may provide
information on how evolution shapes genome structure and
content, and reveal specific sequences that have been conserved, as well
as those that have been invented throughout evolution. I conducted such
a comparative analysis of proteome sequences among P. marneffei,
A. fumigatus and S. cerevisiae. The analysis started by defining ortholog
or paralog pairs among proteomes. Two genes are said to be paralogous
if they are derived from a duplication event, but orthologous if they are
derived from a speciation event. Determining orthologs is an important
step in assessing the relationship between genomes. This was performed
using the BLAST comparison tool. BLASTP was used to compare the
sequences of proteins encoded by genes of one genome against those from
the other genomes. Protein sequences, instead of nucleotide sequences,
were compared because protein sequences remain conserved much longer
on an evolutionary time scale, and their alignments can therefore detect
much older relationships. The lower the E-value, the greater the chance
that two proteins are homologous, that is, derived from a common
ancestral protein and therefore likely to have similar functions. E-values
have been shown to be an accurate indication of the ratio of false positives
to true positives of homologous relationships. Genes g and h were
considered orthologues if h is the best BLASTP hit for g and vice versa,
with an E-value less than or equal to 1e-10.
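The reciprocal-best-hit criterion stated above can be sketched in a few lines. The example assumes that BLASTP results have already been reduced to (query, subject, E-value) tuples; the function names and the toy hits are hypothetical, and ranking hits by E-value alone is a simplification, since BLAST itself orders hits by bit score and ties can occur when E-values underflow to zero.

```python
def best_hits(hits, max_evalue=1e-10):
    """Map each query to its lowest-E-value subject at or below the cutoff.

    hits: iterable of (query, subject, evalue) tuples.
    """
    best = {}
    for query, subject, evalue in hits:
        if evalue > max_evalue:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


def reciprocal_best_hits(hits_a_vs_b, hits_b_vs_a, max_evalue=1e-10):
    """Return sorted (g, h) pairs where h is g's best hit and vice versa."""
    best_ab = best_hits(hits_a_vs_b, max_evalue)
    best_ba = best_hits(hits_b_vs_a, max_evalue)
    return sorted(
        (g, h)
        for g, (h, _) in best_ab.items()
        if best_ba.get(h, (None,))[0] == g
    )


# Hypothetical hits between genome A (pm*) and genome B (af*)
a_vs_b = [("pm1", "af1", 1e-50), ("pm1", "af2", 1e-20),
          ("pm2", "af2", 1e-30), ("pm3", "af3", 1e-5)]
b_vs_a = [("af1", "pm1", 1e-48), ("af2", "pm1", 1e-25),
          ("af2", "pm2", 1e-22), ("af3", "pm3", 1e-6)]
orthologue_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(orthologue_pairs)
```

In this toy data only pm1 and af1 qualify: pm2's best hit is af2, but af2's best hit is pm1, and the pm3/af3 pair fails the 1e-10 cutoff.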
The translated ORF sequences of S. cerevisiae were obtained from
the Saccharomyces Genome Database (SGD) at http://www.yeastgenome.org/.
The predicted peptides of A. fumigatus were downloaded from the
FTP service of the A. fumigatus genome project at the Sanger Institute
(http://www.sanger.ac.uk/). The result of the proteome comparison
is given in Fig. 1.5.
Figure 1.5: Graphical representation of a triple proteome comparison
between P. marneffei, S. cerevisiae and A. fumigatus.
1.4.5 Lineage-specific genes
We identified many genes present only in P. marneffei or its closely
related fungal species, namely lineage-specific genes. At the most extreme,
some genes are present in P. marneffei exclusively. These genes are of
particular interest because they may be determinants of characteristic
features of the fungus. A total of 1,447 genes whose proteins lack
significant matches to known proteins from public databases (TBLASTN
cutoff 10^-10) were found. This reflects the fact that genome projects for
Penicillium and its closely related fungi are still at an early stage, and
that the diversity of fungal genes remains to be explored. Furthermore,
2,506 proteins do not have significant matches to genes in either the
sequenced yeasts or A. nidulans. A novel theory about the emergence of
lineage- or species-specific genes is given in Chapter 9. Briefly, an
accelerated evolutionary rate, one of the best characterised properties of
lineage-specific genes, may be responsible for their emergence.
In addition to the lineage-specific genes, many fungal specific domains
have been identified. These include the cell wall antigen MP1 domain, which
was first described in the cell wall antigen Mp1p encoded in P. marneffei
[347]. Mp1p contains two self-conserved regions, namely CR1 and CR2,
which form a new conserved domain family that has not been described
in conserved domain databases such as Pfam and ProDom. The genome
sequence reveals more than 12 P. marneffei genes containing at least one
MP1 domain. That is to say, the genes encoding MP1-containing proteins
have been expanded in the P. marneffei genome. Such an expansion is not
so extensive in A. fumigatus and A. nidulans, although at least two
MP1-containing proteins, afmp1 and afmp2 (GenBank Acc.: AAG09624 and
AAR22399), have been discovered in the A. fumigatus genome.
Figure 1.6: Putative MAPK signalling pathway in P. marneffei.
Overview of major intracellular signalling pathways in P. marneffei.
Common genes between S. cerevisiae and P. marneffei are marked with
asterisks. Names of S. cerevisiae genes are presented. The P. marneffei
genes are in parentheses. Created by using GenMAPP v2.0, a free
program for visualising genes on biological pathways.
1.4.6 Cell signalling and morphogenesis
The sequences encoding proteins that act on well-studied signalling path-
ways, including mitogen-activated protein kinases (MAPK) and cyclic
AMP-dependent protein kinase, as well as small GTPases of the Ras
family, are readily recognised in the P. marneffei genome. Figure 1.6 shows
a comparison of MAPK signalling pathways between S. cerevisiae and
P. marneffei.
1.4.7 Potential mating ability
Traditionally, P. marneffei has been considered an asexual (anamorph)
ascomycete that lacks an apparent sexual (teleomorph) stage in its life
cycle and seems to reproduce only mitotically [44, 104]. Recent genetic
studies, however, suggest it may have an unidentified sexual cycle. Except
for the pheromone precursor gene, the whole set of sex-related genes was
identified in the P. marneffei genome, which demonstrates the potential
mating ability of this important thermally dimorphic fungus (Chapter
5).
1.4.8 Putative virulence genes
What makes a fungus a pathogen is an old question. The P. marneffei
genome sequence has revealed many proteins and systems with functions
that have previously been found to be important in pathogenic fungi. For
example, proteins such as phospholipases and proteinases are involved in
direct host cell damage and lysis. A review of fungal virulence factors
is given in Section 4.2. A few identified putative virulence factors are
presented in Table 1.3.
Table 1.3: Putative virulence genes
Gene      Acc. No.  BLAST hit                                                    E value
Proteinase
Pm47.49   P87184    Intracellular vacuolar serine proteinase precursor           0
Pm61.35   Q96WN2    Lon proteinase                                               0
Pm109.24  P25375    Saccharolysin (EC 3.4.24.37) (Protease D) (Proteinase yscD)  1e-159
Pm61.50   Q6FX66    YCL057w PRD1 proteinase yscD                                 1e-158
Pm88.30   Q64HW0    Aspartyl proteinase                                          1e-122
Pm66.31   P32379    Proteasome component PUP2 (EC 3.4.25.1)                      3e-98
Pm13.58   Q871P4    Related to ubiquitin-specific proteinase UBP1                6e-97
Phospholipase
Pm1.261   Q769K2    N-acyl-phosphatidylethanolamine-hydrolysing phospholipase D  6e-61
Pm103.31  Q874F2    Phospholipase D                                              1e-156
Pm16.57   Q6U820    Lysophospholipase (EC 3.1.1.5)                               0
Pm167.18  Q877A5    Phospholipase (Fragment)                                     2e-51
Pm182.7   Q76H92    Phospholipase A2                                             3e-27
Pm22.27   Q9P866    Candida albicans Phosphatidylinositol phospholipase C        4e-44
Metacaspase
Pm112.34  Q8J140    Metacaspase                                                  1e-91
Pm205.1   Q8J140    Metacaspase                                                  3e-58
Agglutinin
Pm113.29  Q9P5P9    related to A-agglutinin core protein AGA1                    1e-24
Pm10.4    P11219    Lectin precursor (Agglutinin)                                5e-09
Pm2.195   Q8CMU7    Streptococcal hemagglutinin protein                          3e-07
Pm28.53   Q7N911    Similar to hemagglutinin/hemolysin-related protein           0.00005
Toxin
Pm21.30   A45086    HC-toxin synthetase - fungus (Cochliobolus carbonum)         0
Pm21.31   Q9UVN5    AM-toxin synthetase                                          0
Pm71.10   Q9UVN5    AM-toxin synthetase                                          0
Pm71.39   Q9UVN5    AM-toxin synthetase                                          0
Pm137.4   Q9UVN5    AM-toxin synthetase                                          0
Pm151.1   A45086    HC-toxin synthetase - fungus (Cochliobolus carbonum)         0
Pm112.24  Q96WL1    Aflatoxin efflux pump Aflt                                   1e-141
1.4.9 Cell wall antigens and biosynthetic genes
The cell wall of a fungus maintains the structural integrity of the cell,
protects the fungus against the defence mechanisms of the host and
harbours most of the fungal antigens. It consists of a polymer of α and
β(1,3)-glucans, chitin, galactomannan and β(1,3)(1,4)-glucan embedding
protein antigens including the adhesins. The cell wall is synthesised and
continuously remodelled by enzymes including synthases,
transglycosidases and glycosyl hydrolases. All these are absent from human
cells and are thus ideal targets for anti-fungal agents and immunisation.
Previous studies have shown that the specific monoclonal antibody against
the galactofurane side chain of the galactomannan antigen of A. fumigatus
can react with the cell wall of P. marneffei and can be used to detect the
presence of antigenaemia or antigenuria in patients suffering from
penicilliosis marneffei [363]. An ortholog of one of the known P. marneffei
cell wall antigen genes, MP1, is present in A. fumigatus. Within
P. marneffei, homologs of a number of Aspergillus genes encoding similar
biosynthetic enzymes and cell wall antigens have been identified
(Table 1.4).
Table 1.4: Cell wall antigens and biosynthetic genes predicted in P.
marneffei.
Aspergillus gene  Acc. No.    Pm gene   E value
CHSs
Class I CHSA      AAB33397    Pm14.101  e-107
Class II CHSB     AAB33398    Pm132.15  5e-097
Class III CHSG    AAB07678    Pm110.5   0
Class IV CHSF     AAB33402    Pm87.22   6e-064
Class V CHSE      CAA70736    Pm38.37   0
Class VI CHSD     AAB33400    Pm223.4   e-051
β(1,3)-glucan synthase
FKS1              AAB58492    Pm120.1   0
RHO1              AAG12155    Pm203.6   5e-099
α(1,3)-glucan synthase
AGS1              AAL28129    Pm162.3   0
AGS2              AAL18964    Pm66.50   0
β(1,3)-glucanosyl transferases
GEL1              AAC35942    Pm221.6   e-154
GEL2              AAF40139    Pm94.24   e-123
GEL3              AAF40140    Pm119.10  e-124
Mannosyl transferases
MNN9              Afu2g01450  Pm207.2   5e-097
PIG-M             Afu7g01300  Pm90.41   2e-063
Chitinases Endo-β(1,3)-glucanases
Engl1             AAF13033    Pm5.32    0
1.5 Discussion
This is the initial analysis of the genome of a thermally dimorphic
fungus. Although P. marneffei has not been studied intensively, the
analysis of the genome sequence has provided many new insights into a
variety of gene functions and cellular processes, including cell wall
components, signalling pathways, secondary metabolism and mating ability.
Comparisons of the genome of P. marneffei with those of other
pathogenic/nonpathogenic fungi have also uncovered surprising similarities
and differences, providing a new perspective on the molecular underpinnings
of these lifestyles. The analysis of P. marneffei-specific genes might allow
researchers to begin to gain insights into the transition from mould to
yeast growth. Furthermore, the genome sequence has revealed different
patterns of gene duplication in P. marneffei and other ascomycetes,
which might be linked with their divergent biological characteristics. The
apparent lack of a pheromone precursor locus in P. marneffei may provide
an explanation of its asexual life style. However, the fungus may indeed
undergo a yet undetected sexual cycle, which is supported by the finding
of homologs of many mating genes. Finally, one of the most interesting
findings is the abundance of intragenic tandem repeats in the coding
regions of the genome. This finding provides a possible mechanism to explain
how the fungus can change its surface coat and thereby evade detection
by the host’s natural defences (see Chapter 7).
The draft genome sequence of P. marneffei presented in this chapter
provides the first attempt to understand the genetic basis of the
physiology of this special Penicillium species. This first glimpse
may be expanded as many other fungal genomes are generated by fungal
genome sequencing projects that are ongoing or planned. This new era in
fungal biology promises to yield insights into this important group of
organisms, as well as to provide a deeper understanding of the fundamental
cellular processes common to all eukaryotes.
Chapter 2
PENICILLIUM MARNEFFEI GENOME DATABASE
AND ANNOTATION PIPELINE
The draft genome of Penicillium marneffei has been obtained (Chap-
ter 1). The huge amount of sequence data needs efficient analysis in order
to extract valuable information. A computer-based analysis system tai-
lored for the genome is required. Such a sequence data management
system with a number of peripheral applications has been developed to
solve this problem.
2.1 Introduction
The rapidly growing volume of genome information for P. marneffei
needs to be adequately processed, annotated and interpreted. Computational
annotation systems providing tools and algorithms can facilitate
this process and advance our understanding of the genome sequences.
For the systems to be developed and refined, data must be easily acces-
sible and amenable to analysis. The analysed data must be fed back into
the loop to allow the data to be re-analysed, refined, verified, and new
hypotheses to be built. This is the issue of data management. Good data
management practices are essential to users of genomic data.
This chapter is concerned with two aspects: (1) construction of the
PMGD (P. marneffei genome database) system, and (2) the issues relevant
to the development of the annotation pipeline. Many steps are involved
in these two aspects. Among these steps, prediction of protein function
is one of the most critical in genome information processing. The
process of function prediction therefore forms the central part of the
annotation pipeline. Since P. marneffei genetics has not been well
established, most of the proteins derived from its genome will be entirely
unknown to biologists. More than ten thousand unknown proteins will undergo
function prediction. Different methods of protein function prediction
have been developed (see Literature Review). Briefly, these methods
can be categorised into two major groups: homology-based methods and
nonhomology-based methods. The former depend on detectable homology
between the unknown protein and characterised proteins in a database.
The latter are based on various kinds of contextual functional
information, which are collected and integrated around the protein in
order to assign a putative function to it in an indirect way [218].
However, none of these methods can guarantee a ‘one-stop’ solution that
is particularly successful in P. marneffei gene function prediction.
Hence, the newly developed annotation pipeline integrates several
currently used methods, but it is by no means a mere collection
of methodologies. Each method was tailored before being integrated,
in order to maximise its predictive power with respect to
the features of fungal proteins.
In the next section, I will first review the underlying principles behind the
methodologies used for predicting the function of unknown proteins. I will
then examine a few protein function prediction systems implemented by
several research groups, before pointing out some additional considerations
in regard to the further development of similar systems. Note that
the topic of protein function prediction is a broad one. It could be broken
down into subtopics in many different ways. I have tried to organise
them in a flow from theory to application as smoothly as possible.
Even so, the content of the sections might jump around slightly; some key
concepts, such as sequence alignment algorithms, might be mentioned
more than once in different sections.
2.2 Literature Review
In this literature review I will first examine the most widely used methods
in protein function prediction. I will then survey the software/database
systems currently available, highlighting their strengths and shortcomings.
Possible directions for further research will be addressed at the end of
the review.
2.2.1 Methods for predicting protein function
Based on the underlying principle, the methods of protein function pre-
diction can be categorised into two major groups: homology-based meth-
ods and nonhomology-based methods [17,217,142].
Homology-based methods
Homology-based annotation relies on sequence similarity between a query
protein and a well-characterised protein. If two proteins are highly similar
in sequence, they are likely to share the same function. The rationale behind
this extrapolation is that sequence similarity is a sufficiently reliable
indicator of functional similarity. This is reasonable, but counter-examples
are not rare. For instance, in the presence of domains that are shared by
numerous proteins [74], choosing the first or the best hit may not be
optimal. The multi-domain organisation of proteins can lead to incorrectly
annotated database entries. Despite such criticisms, homology-based
methods are by far the most widely used. To calculate similarities/distances
to sequences of known proteins, pairwise similarities
are computed using the rigorous dynamic programming algorithm [292],
or heuristic algorithms such as FASTA [245] and BLAST [6].
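To make the dynamic programming idea concrete, the following is a minimal sketch of a global alignment score in the Needleman-Wunsch style. It is an illustration only: the cited reference [292] may describe a different variant (for example local alignment), and the scoring values (match +1, mismatch -1, linear gap -2) and the function name are assumptions chosen for brevity rather than parameters used in this work.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b.

    Illustrative scoring only: a real protein comparison would use a
    substitution matrix (e.g. PAM250) instead of match/mismatch values.
    """
    # prev[j] is the best score for aligning the processed prefix of a
    # with b[:j]; the first row corresponds to leading gaps only.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

The table is filled row by row, so the running time grows with the product of the two sequence lengths. This cost is the reason why heuristic programs such as FASTA and BLAST are preferred for database-scale searches.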
Besides whole-protein similarity comparison, detecting motifs or
domains shared among proteins gives additional information about function.
A motif is a simple combination of a few consecutive secondary structure
elements with a specific geometric arrangement (e.g., helix-loop-helix).
Some motifs, but not all, are associated with a specific biological
function. A domain is the fundamental unit of structural folding and
evolution. It may combine several secondary structure elements and motifs,
not necessarily contiguous. A domain can fold independently into a stable
3D structure, and it has a specific function. A variety of mathematical
representations of protein motifs/domains have been developed and used
for detecting and storing them, such as regular expressions,
position-specific scoring matrices [97], hidden Markov models [57],
probabilistic suffix trees [15], and sparse Markov transducers [81].
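As an illustration of the simplest of these representations, the regular expression, the sketch below translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The pattern N-{P}-[ST]-{P} is the well-known PROSITE N-glycosylation site pattern; the function names and the test sequence are hypothetical, and only a small subset of the PROSITE syntax is handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a regex string.

    Supported subset: residues, x (any residue), [..] alternatives,
    {..} exclusions, and repetition suffixes such as x(2) or x(2,4).
    """
    parts = []
    for element in pattern.split("-"):
        # Split the element into its core and an optional repetition suffix.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # excluded residues
        # [ST]-style alternatives are already valid regex character classes.
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(find_motif("N-{P}-[ST]-{P}", "MNGSAKNPTL"))
```

Regular expressions are easy to store and search, but they are all-or-nothing: a single substitution abolishes the match. Position-specific scoring matrices and hidden Markov models address this by assigning graded scores.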
Nonhomology-based methods
Although homology-based annotation has been widely successful in extending
knowledge from the small set of experimentally characterised
proteins to the tens of thousands of proteins found in genome sequencing
projects, a critical limitation of this method is that a well-characterised
reference protein must be found based on sequence similarity; otherwise,
one cannot assign a putative function to the unknown protein. According
to the data that we currently have, 30-40% of proteins have no
clear sequence homolog in today’s most up-to-date protein databases.
Another recently finished fungal genome sequencing project has the same
problem [101].
Nonhomology-based methods, also called context-based function prediction,
are complementary to homology-based function prediction. Phylogenetic
profiles, domain fusion and gene neighbourhood are examples of
these methods. Pellegrini et al. [248] presented the phylogenetic profiles
method based on the assumption that proteins that function together in
a pathway or structural complex are likely to evolve in a correlated
fashion. If proteins A and B tend to be either preserved or eliminated
together in a new species, we can expect that they are functionally linked.
In this case, if we know the function of protein A, we can predict the
function of protein B through this functional linkage. The method
of phylogenetic profiling could be useful in predicting the function of
uncharacterised proteins in P. marneffei, especially as more and more
fungal species are sequenced. For the time being, however, this method has
to be performed manually because no free software is available to
automate the analysis.
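The core of the phylogenetic profile idea can be sketched in a few lines. The example below is a simplified illustration and not the procedure of Pellegrini et al. [248] in detail: it assumes that presence or absence of a homolog in each genome has already been decided (for example by a BLAST cut-off), and all names (build_profiles, linked_pairs, the protein and species labels) are hypothetical.

```python
def build_profiles(presence):
    """Turn {protein: set of genomes with a homolog} into 0/1 profiles.

    Returns the sorted genome list and a {protein: tuple of 0/1} mapping
    in which position k refers to genome k of that list.
    """
    genomes = sorted(set().union(*presence.values()))
    profiles = {
        protein: tuple(int(g in found) for g in genomes)
        for protein, found in presence.items()
    }
    return genomes, profiles


def linked_pairs(profiles, max_mismatches=0):
    """Return protein pairs whose profiles differ in at most max_mismatches genomes."""
    names = sorted(profiles)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            mismatches = sum(x != y for x, y in zip(profiles[a], profiles[b]))
            if mismatches <= max_mismatches:
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    # Hypothetical presence/absence data for three proteins in three species.
    presence = {
        "A": {"sp1", "sp3"},
        "B": {"sp1", "sp3"},
        "C": {"sp2"},
    }
    genomes, profiles = build_profiles(presence)
    print(genomes, linked_pairs(profiles))
```

Note that a protein present in every genome has an uninformative profile, so in practice the method gains power only as the number and diversity of sequenced genomes increase.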
2.2.2 Software/database systems for protein function prediction
Over the past decades, with the close cooperation of biological scientists
and software engineers, a wide range of software and/or database systems
have been developed. As we will see below, some
of them rely mainly on one of the methods mentioned above as their
predictive tool, while others integrate more than one method in order
to give a more comprehensive annotation of unknown proteins.
Systems for automatic function assignment
A group of software systems, such as PEDANT, GeneQuiz and Bio-Dictionary,
attempts to accelerate the task of human experts by providing detailed
and exhaustive information for function assignment.
PEDANT (http://pedant.gsf.de) is a software system for completely
automatic and exhaustive analysis of protein sequence sets - from
individual sequences to complete genomes [96]. It was launched in 1996
and is one of the earliest such systems. It has been used extensively at
MIPS, a Europe-based bioinformatics institute. Its developers state that it
is fully integrated with a sequence database system and provides access to
a broad range of biological information through a hierarchically organised,
controlled vocabulary. Like some other similar systems, the whole system
has since been commercialised, which limits its popularity.
The GeneQuiz analysis server is open to public use and accepts
anonymous protein sequences through GQserve [7]. It is composed of several
major modules: GQupdate keeps the target databases current, maintaining
integrated, up-to-date, non-redundant protein and nucleotide sequence
databases, as well as databases of protein structures and motifs; GQsearch
performs database searches with the queries, applies a variety of sequence
analysis tools to the query sequence, and parses and stores the results
in a common format; GQbrowse allows browsing and querying of results.
These modules are general engineering achievements that do not differ
in principle from other database systems. It is the GQreason module
that embodies the most critical know-how of the whole system. The module
analyses the results and makes intelligent guesses, assigning to the query
a specific function, a general functional class, and a reliability estimate.
Bio-Dictionary [264] employs a weighted, position-specific scoring scheme
and uses the complete collection of amino acid patterns (referred to as
seqlets) to determine, in a single pass, all local and
global similarities between the query and any protein already present in a
public database. The most distinctive feature of Bio-Dictionary is its use
of seqlets that completely cover the natural sequence space of proteins in
the currently available public databases. According to its developers, the
seqlets contained in this collection capture both functional and structural
signals that have been reused during evolution both within and
across families of related proteins. With this capacity, seqlets are well
suited for use in protein annotation.
Classification system
It is not always the case that an unknown protein can be readily as-
signed a definite functional description. In such a case, protein classifi-
cation can help to elucidate the function of the new protein. Comparing
a protein sequence with a database of protein families is more effective
than a standard database search. Generally, conserved proteins are
classified according to their homologous relationships. Each protein group
is composed of a set of “seed” proteins, which is represented as multiple
alignments, regular expression profiles or HMMs. Protein classification is
useful in structure and function prediction, and especially important in
large-scale annotation efforts.
As of 2001, Clusters of Orthologous Groups of proteins
(COGs) had been delineated by comparing protein sequences encoded in 43
complete genomes, representing 30 major phylogenetic lineages [308].
The system has since been updated to include more complete genomes
representing broader lineages. Each COG consists of individual proteins or
groups of paralogs from at least 3 lineages and thus corresponds to an
ancient conserved domain. One problem with the COGs system is that it is
not fully open to the public. Batch application of COGnitor, the key
component of the system used to fit new proteins into the COGs, can only be
accessed inside the NCBI. Another issue to be taken into account is
that COGs does not discriminate paralogs (genes from the same genome
which are related by duplication) from orthologs (genes in different species
that evolved from the same ancestral gene). Orthologs typically have
the same function, allowing transfer of functional information from one
member to an entire COG. In contrast, paralogs are functionally diverse
proteins whose genes duplicated after speciation, although high sequence
similarity is normally preserved in paralogs. A system like COGs can
only be used as a classification system for automatically yielding a number
of functional predictions for poorly characterised genomes. The COGs system
is of limited usefulness in the P. marneffei genome project because its
current version contains few fungal genomes. Other database systems,
such as Systers [177], iProClass [135] and ProtoMap [362], have the same
shortcoming as COGs. They are better treated as protein information
storage/retrieval systems than as active protein function prediction
systems.
Protein domain databases
A list of commonly used protein domain databases is given in Table 2.1.
Two of them, Pfam and InterPro, have been used in PMGD.
Pfam (http://www.sanger.ac.uk/Software/Pfam) is a large collec-
tion of multiple sequence alignments and hidden Markov models covering
many common protein domains and families [13]. For each protein fam-
ily, Pfam allows looking at multiple alignments, viewing protein domain
architectures, examining species distribution, and so on. Pfam is built
from fixed releases of Swiss-Prot and TrEMBL. In the current version, 18.0
(2005), 75% of protein sequences in Swiss-Prot and TrEMBL have at
least one match to Pfam.
InterPro (http://www.ebi.ac.uk/interpro) is a database of pro-
tein families, domains and functional sites in which identifiable features
found in known proteins can be applied to unknown protein sequences.
It provides an integrated view of the commonly used signature databases
like PROSITE, PRINTS, SMART, Pfam, ProDom, etc., and has an intuitive
interface for text- and sequence-based searches. The latest release,
11.0, contains 12,294 entries and covers 77.5% of UniProt proteins.
InterProScan is a tool that combines the different protein signature
recognition methods native to the InterPro member databases into one
resource, with look-up of the corresponding InterPro and GO annotation.
2.2.3 The art of gene finding
The last 20 years have witnessed significant development of computational
methodology for finding genes and other functional sites in genomic
DNA. Two major classes of computational approaches are commonly
used to detect genes in genomic sequences: homology-based
approaches and ab initio gene-finding algorithms. The former
are relatively straightforward, searching for homologous
relationships with the content and structure of known genes. If a
Table 2.1: Commonly used domain databases.
Database    Method        Data type   URL
Prosite     Semi-Manual   Motif       www.expasy.ch/prosite/
Pfam        Semi-Auto     Domain      www.sanger.ac.uk/Software/Pfam/
Blocks      Full-Auto     Motif       www.blocks.fhcrc.org/
ProDom      Full-Auto     Domain      prodes.toulouse.inra.fr/prodom
Prints      N/A           Motif       www.bioinf.man.ac.uk/PRINTS/
Domo        Full-Auto     Domain      www.infobiogen.fr/services/domo/
InterPro    N/A           Motif       www.ebi.ac.uk/interpro/
Smart       Semi-Auto     Domain      smart.embl-heidelberg.de/
eMotif      Full-Auto     Motif       dna.stanford.edu/identify
region of sequence is similar to the sequence of an identified gene, this is
highly suggestive, though not necessarily conclusive, of a gene. The most
commonly used program for such comparisons is probably BLAST.
Next I will review some issues related to ab initio gene-finding
algorithms. Generalised hidden Markov models (GHMMs) appear to be
approaching acceptance as a de facto standard for state-of-the-art ab
initio gene finding, as evidenced by the recent proliferation of GHMM
implementations, including GenScan [30] and FGENESH (Softberry). At
the time of writing, neither GenScan nor FGENESH is open source,
and no detailed information about the underlying algorithms and
implementations is available. According to the general algorithm
description, GenScan uses a training set to estimate the HMM parameters;
the algorithm then returns the exon structure using the maximum likelihood
approach standard to many HMM algorithms (the Viterbi algorithm). The
generalised HMM that GenScan uses consists of a number of states
modelling the various parts of a gene. These states include 5’ splice site,
3’ splice site, internal coding exon, start exon, and terminal exon. The
final gene structure predicted by GenScan is the maximum probability path
through the HMM. FGENESH is also HMM-based, with an algorithm
similar to that of GenScan [30]; it differs in that, in its model of gene
structure, a signal term (such as a splice site or start site score) has
some advantage over a content term (such as coding potential), reflecting
the biological significance of the signals. No matter what algorithm a
gene-finding program implements, several basic types of signal must be
detected. The signals (or functional sites in genomic DNA) that researchers
have sought to recognise include splice sites, start and stop codons, branch
points, promoters and terminators of transcription, polyadenylation sites,
ribosomal binding sites, topoisomerase II binding sites, topoisomerase I
cleavage sites, and various transcription factor binding sites [108]. From
the point of view of information science, two basic types of information
are used here: (1) “signals” in the sequence, such as splice sites; and (2)
“content” statistics, such as codon bias. Among signal measures, the
splice junctions (the donor and acceptor sites) are the most important
features to be identified. The most common methods for this have been
“weight matrix” based methods. Other methods, such as consensus sequences,
maximal dependence decomposition (MDD) and neural networks,
are also used. Other signals, such as start and stop codons, TATA boxes,
transcription factor (TF) binding sites, and CpG islands, are also useful
in predicting protein-coding regions. Content measures, such as
codon bias, periodicities and asymmetries of coding regions, help to
distinguish coding from noncoding regions. Fairly long exons are easy to
identify whereas short ones remain difficult. Neural networks have also
been used to distinguish coding from noncoding sequences.
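The “weight matrix” approach to signal detection mentioned above can be sketched as follows. This is a generic log-odds weight matrix, not the model used by GenScan or FGENESH; the training sites are made-up donor-like sequences, and the pseudocount, the uniform background frequency and all function names are assumptions made for illustration.

```python
import math


def build_weight_matrix(sites, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length aligned sites."""
    matrix = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        total = len(column) + pseudocount * len(alphabet)
        # log2(observed frequency / background frequency) for each base
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in alphabet
        })
    return matrix


def score_site(matrix, candidate):
    """Sum the per-position log-odds scores of a candidate site."""
    return sum(column[base] for column, base in zip(matrix, candidate))


def best_site(matrix, sequence):
    """Return (score, start) of the highest-scoring window in sequence.

    Assumes the sequence is at least as long as the matrix.
    """
    width = len(matrix)
    return max(
        (score_site(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    # Made-up aligned donor-like sites, for illustration only.
    training = ["AGGTAAGT", "CGGTGAGT", "AGGTAAGA", "TGGTAAGT"]
    wm = build_weight_matrix(training)
    print(best_site(wm, "TTTTAGGTAAGTCCCC"))
```

A weight matrix treats positions as independent. Methods such as MDD were introduced precisely because real splice sites show dependencies between positions that this simple model cannot capture.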
Recently, homology-based approaches have been incorporated into
ab initio gene-finding algorithms. GenomeScan, for example, combines
two sources of information: probabilistic models of exon-intron
structure and sequence similarity information [361]. It is an extension of
the GenScan program, predicting gene structures that have at least one
exon with supporting evidence from an existing protein sequence. The
major disadvantage of this method is that it requires a close homolog.
It is often the case that homologs are unknown or are remote, in which
case this system would be inappropriate.
Although the programs for gene structure prediction have greatly im-
proved in the last decade, even the best cannot autonomously detect all
genes and genomic elements and have to be supported by experimental
analysis. The programs still produce a considerable proportion of incorrect
and missed exons, and they concentrate only on the detection of coding
exons, while 5’ and 3’ UTRs, promoter elements, and polyA sites often
remain undetected. The elucidation of complex genome organisation,
such as nested and overlapping genes or alternative splicing, has not yet
been considered by any of the programs [267].
2.3 Implementation
The overall objective of PMGD is to design and implement a distributed
information framework that will provide services, tools and infrastruc-
ture for high-quality analysis and annotation of large amounts of diverse
genomic data. The whole system starts from assembly of sequences, and
ends with the web interface for output of all processed information. The
update requirements depend on the genomic data sources
to be updated, so PMGD was designed to be modular and configurable,
so that adding new sequence data is as straightforward as
possible.
2.3.1 Annotation pipeline
The general strategy applied to the analysis of all contigs is diagrammed
in Fig. 2.1. It uses standard published procedures for sequence comparison
as well as sh/bash shell scripts and Perl scripts specifically developed for
this work (see Section 2.3.5). The procedure involves the following major
steps:
[Figure 2.1 flowchart; recoverable box labels: Sequence Data Files; Consed/BAMBUS; Contigs (2911); Scaffolds (273); FGENESH, GenScan, HmmGene; Best Gene Prediction; Predicted Genes (10,060); BLASTX Search; Tandem Repeat Finder; Other Nucleotide Analyses; BLASTP Search; Domain Identification; Other Protein Analyses; Gene Structure & Functional Annotation; Annotation; Relational Database Storing; PMGD Website Interface.]
Figure 2.1: Flowchart of annotation pipeline for P. marneffei genome.
Step 1: contig assembly
Contigs were assembled from the sequence electropherograms using
Phred/Phrap with default options except where otherwise indicated
(for details, see Section 2.3.2).
Step 2: comparisons of contigs to sequence databases
Comparisons of all contigs with fungal DNA sequences were performed
using BLASTN (default parameters) to search for rDNA, plasmid or mi-
tochondrial DNA sequences. The contigs were also compared to all known
proteins in GenBank (release 131) using ungapped BLASTX, with sig-
nificant hits indicating potential exons. The searches were made using
the seg filter and the PAM250 substitution matrix. The searches against
mitochondrial sequences were made using the filamentous fungal
mitochondrial genetic code. In order to facilitate visual inspection of the
alignments, I developed a blast2html script that converts regular
BLAST output to HTML format. A graph is inserted above the
descriptive lines, showing alignments coloured according to their similarity
score with the contig or protein query. Note that BLASTX hits can often
indicate the approximate location of many coding exons, but not every exon,
and do not accurately delineate exon boundaries, so the BLASTX search in
this step provides only preliminary coding information.
Step 3: identification of genetic elements
This step identifies protein coding genes and other genetic elements. Dif-
ferent gene finding programs were evaluated and then the best one was
used as the primary gene finding program (for detail, see Section 2.3.3).
In addition to the protein-coding genes, tRNAs were identified using the
tRNAscan-SE program [207] (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/).
Step 4: BLAST comparisons to protein sequences
After the predicted proteins were obtained, they were compared with the
non-redundant NCBI protein database using BLASTP version
2.0.10 with the seg filter and the PAM250 substitution matrix. All
predicted genes were searched against the Pfam set of hidden Markov
models using the HMMER program, and against InterPro using a modified
InterProScan running locally on the bioinfo server.
Step 5: Data storing and PMGD web interface
Before the annotation data were loaded into the database system, information
from various software programs was integrated and the results were
converted into either GenBank or GFF format (see below). A manual
validation step was introduced at this stage. The data storing procedure is
described in Section 2.3.4.
2.3.2 Assembly process
The Phred/Phrap/Consed package (version 0.99.03.19) is one of the most
frequently used software sets for trace file base calling, contig assembly
and contig editing [83,84,112].
Base calling
The purpose of base calling is to determine the nucleotide sequence on
the basis of multi-colour peaks in the sequence trace. Because traces
(and regions within a trace) are of variable quality, the fidelity of “called”
nucleotides is also variable. The accuracy of each called base is measured
by a base quality value. Phred takes a trace file as input and
provides these base quality values to
help realistically evaluate sequence accuracy. It computes a probability p
of an error in the base call at each position, and converts this to a quality
value q using the transformation q = −10 × log10(p). Thus a quality of
30 corresponds to an error probability of 1/1000, a quality of 40 to an
error probability of 1/10000, etc.
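The quality transformation above is easy to check numerically. The following sketch implements q = −10 × log10(p) and its inverse, and uses the inverse to estimate the expected number of miscalled bases in a read; the function names are hypothetical and are not part of the Phred package.

```python
import math


def error_prob_to_quality(p):
    """Convert a base-call error probability p into a Phred-style quality value."""
    return -10.0 * math.log10(p)


def quality_to_error_prob(q):
    """Convert a Phred-style quality value q back into an error probability."""
    return 10.0 ** (-q / 10.0)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read, given its quality values."""
    return sum(quality_to_error_prob(q) for q in qualities)


if __name__ == "__main__":
    for q in (20, 30, 40):
        print(q, quality_to_error_prob(q))
```

Because the scale is logarithmic, a few low-quality bases dominate the expected error count of a read, which is why low-quality read ends are usually trimmed before assembly.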
Vector clipping
The cross_match alignment program is used to compare each read in the
FASTA-format file generated by base calling with a FASTA database of cloning
and sequencing vectors (vector.fasta). The sequence of the cloning vector
used (the pUC18 plasmid in our case) was added to the vector sequence
database. On the bioinfo server, the vector sequence database is located
at /db/univec/UNIVEC/UniVec or /pgm1/phrap/vector.seq. The
example command line for clipping CLONE.fasta is:
% cross_match -minmatch 12 -penalty -2 -minscore 20 -screen CLONE.fasta /db/univec/UNIVEC/UniVec
The -screen option tells cross_match to produce another FASTA file,
CLONE.fasta.screen, nearly identical to CLONE.fasta, except that recognised
vector sequences are replaced by X (or x, according to the original
capitalisation).
Sequence assembly
The vector-clipped reads are assembled to reconstruct the clone sequence
using the Phrap sequence assembler. The program takes as input a
FASTA-format file of sequence fragments and a companion base quality file,
and constructs the contig sequence as a mosaic of the highest quality parts
of the reads. The assembly program is run using the command line:
% phrap -new_ace CLONE.fasta.screen > phrap.out
As a result, Phrap creates a number of files. The most important ones are:
CLONE.fasta.screen.contigs (assembly consensus sequence in FASTA
format),
CLONE.fasta.screen.contigs.qual (assembly consensus base quality
values assigned by Phrap), and
CLONE.fasta.screen.ace (a complicated-looking file that enables one to
view the result of the assembly in the Consed assembly viewer/editor
program).
In the file CLONE.fasta.screen.contigs.qual, Phrap provides quality
information about the assembly (i.e., quality values for the contig
sequence) by generating its own quality measures (based on read-read
confirmation). This process appears to be rule-based, and few references
describe it. For example, if
all input quality values (given by Phred) are relatively small (less than
15), Phrap assumes that they do not correspond to error probabilities
and attempts to rescale them so that the largest quality value is
approximately 30; in contrast, if input quality values are relatively high
(≥ 40), Phrap may give the base in the contig (the consensus of more than
one read base) a higher quality value, such as 90. After contig assembly,
for a contig of length n, the average quality value is given by:
\[ \bar{q} = \frac{\sum_{i=1}^{n} q_i}{n} \]
where q_i is the quality value of base i in the contig and n is the number
of bases in the contig.
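The average quality of each contig can be computed directly from a quality file such as CLONE.fasta.screen.contigs.qual. The sketch below assumes the usual FASTA-like layout of such files (a “>” header line per contig followed by whitespace-separated integer quality values); the function names and the toy input are hypothetical.

```python
def parse_qual(text):
    """Parse FASTA-like quality text into {contig name: list of quality values}."""
    contigs = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # first word of the header line
            contigs[name] = []
        elif name is not None:
            contigs[name].extend(int(value) for value in line.split())
    return contigs


def average_quality(values):
    """Mean quality value of a contig: sum of base qualities / number of bases."""
    return sum(values) / len(values)


if __name__ == "__main__":
    # Toy input for illustration; a real file would be read from disk.
    example = ">Contig1\n30 40 50\n60\n>Contig2\n10 20\n"
    for contig, values in parse_qual(example).items():
        print(contig, len(values), average_quality(values))
```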
2.3.3 Gene finding
One of the main aims of the annotation pipeline is to aid in the
identification of protein-coding genes. This can be done by using a
gene-finding program to predict gene models (ab initio gene finding), or by
predicting possible genes based on the similarity of the sequence to other
sequences, particularly sequences that have already been characterised. I
used both of these approaches
as follows. Ab initio gene predictions were performed using FGENESH
(SoftBerry). The automated gene prediction pipeline was hosted on the
bioinfo server at the Computer Center, HKU. The original predictions
were manually refined with assistance from GenomeScan, another gene
prediction program, which combines sequence similarity and exon-intron
composition (i.e., the two distinct types of evidence used by these classes
of methods) into one integrated algorithm.
Evaluation of gene recognition accuracy
The predictive accuracy of a gene-finding program is evaluated by com-
paring the exons predicted by the program with the actual coding exons
at the nucleotide level and the exon level [31]. For nucleotide level
accuracy, the values TP (true positives), TN (true negatives), FP (false
positives), and FN (false negatives) are defined as follows: TP = the number
of coding nucleotides predicted as coding; TN = the number of noncoding
nucleotides predicted as noncoding; FP = the number of noncoding
nucleotides predicted as coding; FN = the number of coding nucleotides
predicted as noncoding. Sensitivity is then defined as the proportion of
coding nucleotides that are correctly predicted as coding:
\[ Sn = \frac{TP}{TP + FN}, \]
and specificity as the proportion of nucleotides predicted as coding that
are actually coding:
\[ Sp = \frac{TP}{TP + FP}. \]
For exon level accuracy, the formulas for exon level sensitivity (ESn) and
specificity (ESp) are:
\[ ESn = \frac{TE}{AE}, \qquad ESp = \frac{TE}{PE}, \]
where TE (true exons) is the number of exactly predicted exons and AE
and PE are the numbers of annotated and predicted exons, respectively.
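The accuracy measures defined above can be computed from two lists of exon coordinates. The following sketch assumes that exons are given as inclusive (start, end) pairs on the same sequence and strand and that both lists are non-empty; the function names and the example coordinates are hypothetical.

```python
def covered_positions(exons):
    """Return the set of nucleotide positions covered by inclusive (start, end) exons."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(annotated, predicted):
    """Return (Sn, Sp) at the nucleotide level."""
    actual = covered_positions(annotated)
    called = covered_positions(predicted)
    tp = len(actual & called)    # coding, predicted as coding
    fn = len(actual - called)    # coding, predicted as noncoding
    fp = len(called - actual)    # noncoding, predicted as coding
    return tp / (tp + fn), tp / (tp + fp)


def exon_accuracy(annotated, predicted):
    """Return (ESn, ESp); an exon is 'true' only if both boundaries match exactly."""
    true_exons = len(set(annotated) & set(predicted))
    return true_exons / len(set(annotated)), true_exons / len(set(predicted))


if __name__ == "__main__":
    annotated = [(1, 100), (201, 300)]
    predicted = [(1, 100), (211, 300), (401, 450)]
    print(nucleotide_accuracy(annotated, predicted))
    print(exon_accuracy(annotated, predicted))
```

In this example the second predicted exon overlaps the annotated one at 90 of 100 positions, so it contributes to nucleotide level accuracy, but it does not count as a true exon because its start boundary is wrong. This illustrates why exon level measures are usually much lower than nucleotide level measures.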
Combining predictions from two gene-finding programs
Gene-finding programs are still unable to provide automatic gene discovery
with the desired accuracy. The benefits of combining predictions
from more than one existing gene prediction program have been
explored [268]. Therefore, methods for combining predictions from two
programs, GenScan and HMMgene, were used in the prediction of P. marneffei
genes, in an attempt to improve the exon level accuracy of gene finding by
identifying more probable exon boundaries and by eliminating false positive
exon predictions. The scripts implementing these methods were
obtained from http://www.cs.ubc.ca/labs/beta/genefinding/. Note
that at the time this combined prediction study was conducted, the
gene-finding program FGENESH was not yet available. A later retrospective
test was, however, conducted by combining FGENESH with either GenScan or
HMMgene.
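The general idea of combining two sets of exon predictions can be illustrated with a deliberately simplified sketch. It is not a reimplementation of the scripts obtained from the URL above, whose rules are more elaborate: here exons predicted identically by both programs are accepted, overlapping exons with different boundaries are flagged for inspection, and exons supported by only one program are listed as candidates for removal. All names and coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if inclusive intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def combine_predictions(exons_a, exons_b):
    """Split two exon prediction sets into agreed, disputed and unsupported exons.

    agreed      : exons with identical boundaries in both sets
    disputed    : (exon_a, exon_b) pairs that overlap but differ in a boundary
    unsupported : exons that overlap nothing in the other set
    """
    agreed = sorted(set(exons_a) & set(exons_b))
    disputed = sorted(
        (a, b) for a in exons_a for b in exons_b
        if a != b and overlaps(a, b)
    )
    only_a = [a for a in exons_a if not any(overlaps(a, b) for b in exons_b)]
    only_b = [b for b in exons_b if not any(overlaps(a, b) for a in exons_a)]
    return agreed, disputed, sorted(only_a + only_b)


if __name__ == "__main__":
    program_1 = [(1, 100), (201, 300), (500, 560)]
    program_2 = [(1, 100), (211, 300)]
    print(combine_predictions(program_1, program_2))
```

Accepting only agreed exons raises exon level specificity at the cost of sensitivity. How disputed boundaries are resolved (for example by preferring the higher-probability exon) is the point on which practical combination methods differ.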
2.3.4 Database and databank to store results
The first step in database design is to decide what the database will be
used for and how users will interact with it. Once these are defined, the
data to be stored and the associations among these data
are defined. This is done using a conceptual data model. The model is
independent of how the information will be stored in the final, physical
implementation on the computer. Entities such as gene, contig and gene
product are defined that informally represent concepts from the real
world. The relationships between these concepts were also defined; for
example, a contig may contain more than one gene, and generally one gene
produces one gene product. A formal language, the Unified Modelling
Language (UML), was used for specifying both use cases and conceptual
data models.
The next step is the physical implementation of the data model. At this
point a database management system (DBMS) has to be selected. Here I used
Microsoft Access, a relational database manager running on the Windows
operating system. It is available in our departmental facilities and is
quite powerful and efficient for medium-size database management. It
has straightforward Web-publication capabilities and intuitive graphical
user interface-building capabilities. Administrators of the database work
through the application interface, while users interact with the database
through a web interface. The physical implementation of the conceptual data
model was mediated by the database schema (Fig. 2.3).
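The step from conceptual model to relational schema can be illustrated with a small sketch. It uses SQLite through the Python standard library purely as a stand-in for Microsoft Access, and the table and column names are assumptions made for illustration; they are not the actual PMGD schema shown in Fig. 2.3. The three tables encode the relationships described above: one contig holds many genes, and each gene has at most one gene product.

```python
import sqlite3

# Hypothetical schema; all names are illustrative, not those of PMGD.
SCHEMA = """
CREATE TABLE contig (
    contig_id   INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,
    length      INTEGER NOT NULL
);
CREATE TABLE gene (
    gene_id     INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,
    contig_id   INTEGER NOT NULL REFERENCES contig(contig_id),
    start_pos   INTEGER NOT NULL,
    end_pos     INTEGER NOT NULL,
    strand      TEXT NOT NULL CHECK (strand IN ('+', '-'))
);
CREATE TABLE gene_product (
    product_id  INTEGER PRIMARY KEY,
    gene_id     INTEGER NOT NULL UNIQUE REFERENCES gene(gene_id),
    description TEXT
);
"""


def create_database(path=":memory:"):
    """Create the illustrative schema and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def genes_on_contig(conn, contig_name):
    """Return the names of the genes on a contig, ordered by start position."""
    rows = conn.execute(
        "SELECT gene.name FROM gene JOIN contig USING (contig_id) "
        "WHERE contig.name = ? ORDER BY gene.start_pos",
        (contig_name,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = create_database()
    conn.execute("INSERT INTO contig (name, length) VALUES ('contig1', 5000)")
    conn.execute(
        "INSERT INTO gene (name, contig_id, start_pos, end_pos, strand) "
        "VALUES ('geneA', 1, 100, 900, '+')"
    )
    print(genes_on_contig(conn, "contig1"))
```

The UNIQUE constraint on gene_product.gene_id expresses the “one gene, one gene product” rule. Relaxing it would be the natural change if alternative products per gene had to be stored.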
Large-scale data that are to be made accessible to the community
should be well curated, annotated and documented, and appropriately
formatted for publication. At present, no universally accepted
standard exists for the format of genomics data. Here, I adopted GFF
(http://www.sanger.ac.uk/Software/formats/GFF) and GenBank format to
transfer information to and from public databases and applications.
The database was populated using Perl scripts written for ActiveState
Perl Version 5.6 for Windows (downloaded from
http://www.activestate.com) and the BioPerl modules (obtained from
http://www.bioperl.org).
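To make the role of GFF concrete, the following sketch parses one tab-delimited GFF line into named fields. It is written in Python for illustration, whereas the scripts used in this work were written in Perl with BioPerl; the function and class names are hypothetical, and the column layout assumed is the nine-column GFF version 2 layout (seqname, source, feature, start, end, score, strand, frame, attributes).

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the score column is "."
    strand: str
    frame: str
    attributes: str


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one GFF line; return None for blank lines and comments."""
    line = line.rstrip("\r\n")
    if not line.strip() or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    # The ninth (attributes/group) column is optional in this layout.
    attributes = fields[8] if len(fields) > 8 else ""
    return GffRecord(
        seqname, source, feature, int(start), int(end),
        None if score == "." else float(score),
        strand, frame, attributes,
    )
```

A loader would call such a function once per line and insert the resulting records into the gene and contig tables.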
2.3.5 Perl source code collection
In the annotation pipeline, a sequence of analysis steps, each using
a different tool, must be carried out one after the other. The
challenge was the absence of defined standards for the input and
output of the different tools. Because there is no explicit ‘contract’
between the tools as to which input and output formats each will
support, any tool in the pipeline may at any time change the format of
its input or output and so break the system. To connect multiple tools
together ‘smoothly’ and ‘robustly’, special ‘glue code’ was written,
mostly in Perl. The collection of Perl scripts, organised into several
modules, is available at the PMGD website.
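One common way for glue code to guard against silent format changes is to validate each intermediate file before handing it to the next tool. The sketch below illustrates that idea for FASTA text in Python; it is an assumption about how such a check could look, not a description of the Perl scripts in the PMGD collection, and the function name is hypothetical.

```python
import re

# Letters plus '*' (stop) and '-' (gap) are accepted as sequence characters.
_SEQ_LINE = re.compile(r"^[A-Za-z*\-]+$")


def count_fasta_records(text: str) -> int:
    """Return the number of FASTA records, or raise ValueError if the
    text does not look like FASTA (so the pipeline fails early)."""
    records = 0
    residues_in_record = 0
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if records and residues_in_record == 0:
                raise ValueError(f"line {number}: previous record is empty")
            records += 1
            residues_in_record = 0
        elif records == 0:
            raise ValueError(f"line {number}: sequence before first header")
        elif not _SEQ_LINE.match(line):
            raise ValueError(f"line {number}: unexpected characters")
        else:
            residues_in_record += len(line)
    if records == 0:
        raise ValueError("no FASTA records found")
    if residues_in_record == 0:
        raise ValueError("last record is empty")
    return records
```

Failing at the boundary between two tools makes a format change visible at the step where it occurred, rather than several steps later.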
2.3.6 Genome browser configuration
Visualisation of genomic information is not merely a matter of
aesthetics. It is of practical use, because a graphical display
conveys far more meaning to people than reading raw ‘cipher texts’.
Three of the most prominent genome browsers are the Ensembl Genome
Browser (http://www.ensembl.org/) by the European Bioinformatics
Institute and the Sanger Institute, the Map Viewer
(http://www.ncbi.nlm.nih.gov/mapview/) by the National Center for
Biotechnology Information, and the UCSC Genome Browser
(http://genome.ucsc.edu/) by the University of California Santa Cruz
Genome Bioinformatics Group. Each is highly specialised for its
particular data types and information. Most genome browsers can work
either online or offline. They are usually developed in Perl, Java or
other high-level languages.
PMGD incorporates two free but powerful genome browsers, Argo (a Java
applet; Fig. 2.2) and GBrowse (Generic Genome Browser), in order to
organise and annotate genomic data. GBrowse (http://www.gmod.org/)
combines a database with interactive web pages for manipulating and
displaying annotations on genomes. Setting it up takes three steps:
installation, configuration and customisation. Installation is
straightforward if the instructions are followed. Configuration and
customisation are both done through a configuration file. The host
machine is equipped with a Pentium III processor running at a clock
speed of 800 MHz and 128 MB of main memory. ActivePerl, BioPerl and
the Apache web server must be installed. An advanced option allows a
choice between the ‘in-memory’ database and the relational database
MySQL for storing the sequence and annotation information. For a
genome the size of that of P. marneffei, the ‘in-memory’ architecture
is sufficient. Sequence files (in FASTA format) and annotation files
(in GFF format) are stored under ‘$HTDOCS/gbrowse/databases’ in the
directory of the Apache web server. The configuration file (.conf)
defining the settings is stored in ‘$CONF/gbrowse.conf’. GBrowse is
highly customisable. For example, administrators can use different
colours or shapes to represent exons, introns and other genetic
elements. More sophisticated functions, such as the display of
different reading frames, transcription profiles, ESTs and alignments,
are also provided. Administrators are free to switch these functions
on or off and to alter the default settings so that the genome browser
better fits the purposes of a particular database.
2.3.7 Synteny identification
To perform synteny analyses, amino acid identity between P. marneffei
and A. nidulans (or other fungi) was first determined by comparing the
predicted proteins from each fungus using BLASTP. Putative ortholog
pairs were predicted using the INPARANOID program [261].

Figure 2.2: PMGD genome browser.

Putative ortholog pairs were aligned using ClustalW and the amino acid
per cent identity for each pair was calculated. If the alignment
spanned at least 60% of both genes and the alignment score was within
80% of the top score for either gene of the pair, the pair was
accepted. Using these putative ortholog pairs, supercontigs were
compared with the ADHoRe program [322] (r² cutoff = 0.8, maximum gap
size = 35 genes, minimum number of pairs = 3). Results were filtered
such that the maximum probability for a segment to be generated by
chance was < 0.01.
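The acceptance rule for ortholog pairs can be written as a small predicate. The sketch below is an illustration in Python under one reading of the criteria, namely that the aligned region must cover at least 60% of each gene and that the score must reach at least 80% of the top score of at least one of the two genes; the function and parameter names are hypothetical.

```python
def accept_ortholog_pair(
    aligned_length: int,
    length_a: int,
    length_b: int,
    score: float,
    top_score_a: float,
    top_score_b: float,
    min_span: float = 0.60,
    min_score_fraction: float = 0.80,
) -> bool:
    """Return True if a putative ortholog pair passes both filters."""
    # The alignment must span the required fraction of BOTH genes.
    spans_both = (
        aligned_length >= min_span * length_a
        and aligned_length >= min_span * length_b
    )
    # The score must be close to the best score of EITHER gene.
    near_top = (
        score >= min_score_fraction * top_score_a
        or score >= min_score_fraction * top_score_b
    )
    return spans_both and near_top
```

Pairs passing this filter would then be supplied to the collinearity search; the thresholds are exposed as parameters so that the sensitivity of the result to them can be examined.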
2.4 Results
2.4.1 Statistics of assembly
As mentioned in Section 1.3.2, all inserts were sequenced from both
ends to generate paired reads. These paired sequence fragments were
assembled using the Phrap package of assembly tools [84], yielding a
draft assembly. In total, 98.35% of the reads were assembled, giving
273 supercontigs (2,911 contigs). The longest contig is 178,730 bp and
the longest supercontig is 729,276 bp. The fidelity of the assembly is
supported by the high proportion (80.50%) of plasmid-end pairs
preserved in contigs and scaffolds. The net length of the assembled
contigs totalled 28.98 Mbp, including the mitochondrial genome of ∼ 35
kbp (Table 2.2).
Table 2.2: Summary of assembly statistics.

Features                                   Value
------------------------------------------------------------------
Read
  Total Number of Reads Sequenced          315,580
  Number of Bases in Total Reads           173,664,505 bp
  Average Read Length                      550.20
  Number of Confirmed Reads (by Phrap)     310,365
  Fraction of Reads Assembled              98.35%
  Fraction of Reads Paired in Assembly     80.50%
  Number of Bases Used in Assembly         170,951,774 bp
  Average Shotgun Coverage                 6.6 fold (Phrap report)
Contig
  Total Number of Contigs                  2,911
  Number of Bases in Contigs               28,977,603 bp
  Longest Contigs                          178,730 bp
  Average Length of Contigs                9,955 bp
Supercontig (scaffold)
  Total Number of Supercontigs             273
  Number of Bases in Supercontigs          28,421,390 bp
  Longest Supercontigs                     729,276 bp
  Average Length of Supercontigs           104,110 bp
2.4.2 Genome size estimation
The genome size was approximated from the draft assembly by
estimating the size of the gaps between contigs and scaffolds. As
shown in Table 2.2, the bases total 28.42 Mb in supercontigs and 28.98
Mb in contigs. These totals do not include gaps. Within a supercontig,
the so-called within-supercontig gaps lie between contigs that belong
to the same supercontig. As mentioned in Section 1.3.2, two sequencing
clone libraries were constructed, carrying insert sizes of 2.0 – 3.0
kb and 7.5 – 8.0 kb, respectively. For each pair of reads in contigs
flanking a gap, the library of origin was identified, so the size of
the gap could be derived from the size of the clones spanning it. When
the estimated gap sizes are included, the total physical length of all
scaffolds is estimated to be 29.8 – 30.5 Mb. Between supercontigs
there are the so-called between-supercontig gaps. The size of these
gaps is hard to estimate because no spanning clones are available. In
addition, these gaps include difficult-to-sequence regions of the
genome, including the ribosomal DNA (rDNA) repeats, centromeres and
telomeres. Taking these considerations into account, the genome size
is estimated to be ∼ 31 Mb.
When sequencing is still at a stage of relatively low coverage, there
is a ‘dynamic’ way to estimate genome size by applying the Lander-Smith
mathematical model. Assuming there is no cloning bias, the DNA
fragments generated in the shotgun sequencing process are distributed
along the chromosome according to a Poisson distribution [92]. The
unsequenced fraction of a genome (double-strand) is:

$$p = e^{-nw/L}$$

where n is the number of reads, w is the average length of reads and L
is the length of the genome. For a 20 Mb genome, about 120,000 reads
of 500 bp would theoretically be required to produce about 95% (p =
0.05) coverage.

The unsequenced regions on both strands generate the same number of
contigs, N, which can be calculated as:

$$N = n e^{-nw/L}$$

The sequence data used for this estimate (about 60 Mb of reads)
comprised 119,744 reads with a mean length of 511 bp, which Phrap
assembled into 13,861 contigs. Therefore, n = 119744, w = 511 and N =
13861, and the genome size can be calculated as follows:

$$L = -\frac{nw}{\ln(N/n)} \approx 28{,}377{,}000$$

In practice, the number of contigs is higher than the theoretical
expectation, because when assembling fragments Phrap needs an overlap
of nucleotides to link two reads together. These overlapping regions
do not contribute to the actual coverage but are counted in the
calculation as if they did. Another factor is the bias due to cloning
difficulties [186].
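The three formulas above are easy to check numerically. The following Python sketch implements them directly; the function names are hypothetical, and the only inputs are the quantities n, w, L and N defined in the text.

```python
import math


def unsequenced_fraction(n: int, w: float, L: float) -> float:
    """p = exp(-n*w/L): expected fraction of the genome not covered."""
    return math.exp(-n * w / L)


def expected_contigs(n: int, w: float, L: float) -> float:
    """N = n * exp(-n*w/L): expected number of contigs."""
    return n * math.exp(-n * w / L)


def estimate_genome_size(n: int, w: float, N: int) -> float:
    """Invert N = n * exp(-n*w/L) to obtain L = -n*w / ln(N/n)."""
    if not 0 < N < n:
        raise ValueError("the number of contigs must satisfy 0 < N < n")
    return -n * w / math.log(N / n)


# Values quoted in the text: 119,744 reads of mean length 511 bp
# assembled into 13,861 contigs give an estimate of about 28.4 Mb.
genome_size = estimate_genome_size(119744, 511, 13861)
```

With 120,000 reads of 500 bp and a 20 Mb genome, `unsequenced_fraction` returns a value close to 0.05, in agreement with the 95% coverage figure given above.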
2.4.3 Accuracy of gene finding
The purpose of evaluating gene recognition accuracy is to select the
best gene-finding program. A test data set composed of 103 Penicillium
protein-coding genes that contain multiple exons was built. Our
results show that FGENESH gives the most accurate prediction overall.
With it, we can identify ∼ 90% of coding nucleotides with 12% false
positives. It provides sensitivity (Sn) = 96% and specificity (Sp) =
89% at the base level, Sn = 92% and Sp = 84% at the exon level and Sn
= 85% and Sp = 67% at the gene level.
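For readers unfamiliar with these measures, the sketch below shows how base-level sensitivity and specificity can be computed from annotated and predicted exon coordinates. It assumes the convention common in gene-finding evaluation, Sn = TP/(TP+FN) and Sp = TP/(TP+FP), and 1-based inclusive coordinates; the thesis does not spell out the formulas at this point, so both are assumptions, and the function names are hypothetical.

```python
from typing import Iterable, Set, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based and inclusive


def coding_positions(exons: Iterable[Exon]) -> Set[int]:
    """Expand exon intervals into the set of coding nucleotide positions."""
    positions: Set[int] = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def base_level_accuracy(
    true_exons: Iterable[Exon], predicted_exons: Iterable[Exon]
) -> Tuple[float, float]:
    """Return (Sn, Sp) at the nucleotide level."""
    truth = coding_positions(true_exons)
    predicted = coding_positions(predicted_exons)
    tp = len(truth & predicted)   # coding, predicted coding
    fn = len(truth - predicted)   # coding, missed
    fp = len(predicted - truth)   # non-coding, predicted coding
    sn = tp / (tp + fn) if truth else 0.0
    sp = tp / (tp + fp) if predicted else 0.0
    return sn, sp
```

Exon-level and gene-level measures follow the same pattern but count whole exons or genes whose boundaries are predicted exactly.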
2.4.4 Combination of gene finding
Gene recognition accuracy may be improved by combining predictions
from two gene-finding programs. Rogic et al. [268] implemented a series
of algorithms combining gene prediction from two existing gene finding
systems, GenScan and HMMgene. The combined algorithms were tested
on the HMR195 sequence dataset and generated improved accuracy at
both the nucleotide and exon levels, where the average improvement was
7.9% compared to the best result obtained by GenScan or HMMgene
alone.
In order to identify the most accurate gene prediction system for P.
marneffei, I conducted an evaluation study comparing GenScan, HMMgene
and the combined gene prediction system based on them. The improvement
in accuracy that Rogic et al. obtained with the combined algorithm was
not observed in our study, in which we used a dataset of 103 sequences
with known genes from Penicillium species. Our results show that
GenScan tends to give a significantly better prediction than either of
the other systems. At the nucleotide level, sensitivity was 95% for
GenScan, 89% for HMMgene and 92% for the combined algorithm.
Two considerations arise from the discouraging result obtained when
the combined algorithm was applied to the dataset from Penicillium
species. Firstly, the different performance of the combined algorithm
in our study and in that of Rogic et al. is most likely caused by the
difference in organisms. The HMR195 dataset used in their study is
composed of 195 human, mouse and rat sequences. Secondly, if two
systems generate consistent predictions (whether good or bad),
combining them will not give better results. For the human and rodent
dataset, GenScan and HMMgene performed differently, but neither was
always superior to the other. When GenScan and HMMgene were applied to
our dataset of sequences from Penicillium species, however, GenScan
always generated significantly better results than HMMgene. Clearly,
it does not help to combine gene-finding systems if one system is
always superior.
As mentioned, FGENESH was not available when the combination test was
conducted. A later retrospective test indicated that no improvement is
obtained by combining FGENESH with either GenScan or HMMgene (data not
shown). Consequently, we decided to use FGENESH alone for gene
prediction in this project.
2.4.5 Database and databank to store results
The physical deployment of the P. marneffei genome database differs
from that of the annotation pipeline, which is hosted on a SUN Solaris
server at the Computer Center, HKU. PMGD is located on a Windows
2000-based system at the Department of Microbiology, HKU, which serves
as a workstation for administrators and as a web service system for
general users.
2.5 Discussion
High-throughput DNA sequencing now offers a rapid and cost-effective
approach to obtaining the most important and relevant of all genetic
information: the complete DNA sequence of an organism. As the quantity
of data increases in a genome project such as that of P. marneffei,
researchers have to become more sophisticated about data management.
This study developed such a system for the P. marneffei genome
project. The system performs the semi-automatic tasks of assembly
analysis, gene prediction and analysis, and extragenic region
analysis. In order to be compatible with the computer systems
available at the Department of Microbiology, HKU, the system was
designed to span multiple working environments and to integrate
several public-domain and newly developed software programs capable of
dealing with several types of databases. Our PMGD solution proved to
be a feasible way to handle the information and to manage large
quantities of data, internally or for public use. The genome sequence
was searched against the public protein databases using BLAST. Genes
were predicted using FGENESH and adjusted manually by reference to
GenomeScan. FGENESH was selected as the best predictor from a number
of gene-calling programs validated against a test set of 103
previously characterised Penicillium protein-coding genes.
Ab initio gene finding is challenging in P. marneffei, for two
reasons: 1) the lack of a training dataset. Training a gene-finding
program normally requires more than 300 genes in order to reach
statistical power. However,
Figure 2.3: Database schema of PMGD. [Schema diagram not reproduced;
the tables shown include GENE, GENE_PRODUCT, GENE_FUNCTION, PROTEIN,
CONTIG, SCAFFOLD, BLASTP, BLASTX, HOMOLOG, ORTHOLOG, INTERPRO, PATHWAY,
REFERENCE, AUTHOR and JOURNAL.]
for P. marneffei we do not have enough characterised genes; 2) the
lack of cDNA, which is very useful for confirming initial gene
predictions. Identifying genes that lack an available cDNA sequence
requires other methods, such as interspecies homolog searches. We do
have a small number of RST sequences available [364] but, owing to
their poor sequence quality, they are of little help. Our solution to
this problem was to apply a pre-existing gene-finding program, namely
FGENESH. Generally speaking, if one uses a pre-existing gene-finding
program on a newly sequenced organism, one expects inaccurate
predictions. However, our evaluation shows that FGENESH trained with
the A. nidulans dataset produced satisfactory results when applied to
P. marneffei. This is due to the close phylogenetic relationship
between the two species. We also tried combining predictions made by
more than one gene prediction system, an approach that has been
proposed to improve gene prediction accuracy significantly.
Unfortunately, because FGENESH is consistently better than any other
gene-finding program available, we did not observe such an improvement
after combination.
Future directions can be envisaged on the basis of the current state
of the system. Firstly, one of the striking characteristics of the
genomes of eukaryotic organisms is the existence of multigene
families. This confounds the identification of orthologous
relationships among genes in interspecies comparisons. To discriminate
between orthologs and paralogs, more sophisticated algorithms are
required. These algorithms should take phylogenetic information into
account and integrate it into the protein prediction system. Secondly,
when assigning a function to a protein, a controlled vocabulary should
be used across all organisms. The Gene Ontology project [9] has
produced a dynamic controlled vocabulary that can cope with the
ever-accumulating and changing knowledge of gene and protein
functions. Thirdly, the more function prediction systems develop, the
more important the evaluation of their accuracy becomes. Iliopoulos
(2002) established a scoring scheme to measure the performance of
prediction systems [143]. Despite this, considerable concerns are
still raised regarding the accuracy of assignment and the
reproducibility of methodologies. The evaluation of the performance of
these systems is missing at this stage.
In summary, modern biology has created an information explosion.
Whole-genome sequencing and functional genomics have produced a
prodigious amount of data, and the P. marneffei genome project is no
exception. This study provides a solution in the form of an annotation
pipeline that links various biological software packages in a
systematic way, together with a state-of-the-art database management
system for storing and retrieving biological sequence data. It has
been successfully applied to the day-to-day annotation work on the
most important thermally dimorphic fungus.
Chapter 3
MITOCHONDRIAL GENOME OF PENICILLIUM
MARNEFFEI
The work described in this chapter is closely based on a paper that
I have published with colleagues [353].
3.1 Introduction
Mitochondria are the power centres of the cell. They are generally
the major sites of aerobic respiration and energy production in fungi,
providing the energy a cell needs to move, divide, produce secretory
products and contract. They are small, oval-shaped, membrane-bound
organelles, about the size of a bacterium, surrounded by highly
specialised double membranes. The outer membrane is fairly smooth,
but the inner membrane, where oxidative phosphorylation takes place,
is highly convoluted. The two membranes delimit two compartments, the
intermembrane space and the matrix. The reactions of the citric acid
cycle and fatty acid oxidation occur in the matrix.
Mitochondria maintain their own genomes, and a number of
mitochondrial genome sequences have now become available. At present,
the NCBI organelle genome resource maintains a collection of 350
completed mitochondrial genomes from different organisms, including
256 metazoans, 15 fungi, 9 plants and 22 others. The number is subject
to change as sequencing efforts advance. The gene content of
mitochondrial genomes is generally well conserved. In metazoans, for
example, the mitochondrial genomes are generally circular, about 16 kb
long, and encode three primary transcript types (13 proteins used for
energy production, two rRNAs and 22 tRNAs). The homologous genes
present in the mitochondria of plants, protists, fungi and animals,
and in the genomes of prokaryotes, make it possible to undertake
inter-species gene comparisons. Below I review the major components of
the respiratory pathway of fungal mitochondria.
The common and invariant feature of respiratory pathways of mi-
tochondria is production of ATP coupled to electron transport. The
respiratory chain begins with electrons being transferred from NADH to
complex I (NADH:ubiquinone oxidoreductase) or from the tricarboxylic
acid cycle intermediate succinate to complex II (succinate:ubiquinone
oxidoreductase). Electrons are transferred via ubiquinones, complex III
(ubiquinol:cytochrome c oxidoreductase), cytochrome c, complex IV (cy-
tochrome c oxidase) and finally to molecular oxygen to give water (Fig.
3.1).
Complex I is composed of peptides encoded by both nuclear and
mitochondrial genes (more than 25 nuclear genes and seven
mitochondrially encoded genes, nad1, 2, 3, 4, 4L, 5 and 6), which form
a large multisubunit complex spanning the inner mitochondrial
membrane. Note that a few fungi, such as Saccharomyces cerevisiae and
Schizosaccharomyces pombe, lack complex I, and many fungi have
additional components, such as alternative NADH dehydrogenases and/or
an alternative terminal oxidase (see review [152]). Complex III
contains nine subunits, of which only apocytochrome b is encoded in
the mitochondrion. Between complexes III and IV, cytochrome c resides
in the intermembrane space and passes electrons on. Cytochrome c is
encoded by the nuclear cyc-1 gene. Complex IV contains 7-8
polypeptides, of which three are encoded in the mitochondrion (cox1,
2 and 3). It is the terminal oxidase of the standard respiratory
pathway. Complex V is the mitochondrial ATP synthase, which includes
subunits encoded by the mitochondrial genes atp6 and atp8.
Figure 3.1: Fungal respiratory pathways. The diagram was downloaded
from http://pages.slu.edu/faculty/kennellj

Since several mitochondrial complexes have subunits encoded in both
the mitochondrial and nuclear genomes, the coordinated expression of
genes encoded in the nucleus and the mitochondrion is critical for
mitochondrial function. These complexes include not only the large
respiratory complexes mentioned above, but also the translational
machinery, which involves nuclear-encoded polypeptides and
mitochondrially encoded rRNAs and tRNAs [240]. The nuclear and
mitochondrial genomes therefore both contribute essential subunit
polypeptides to important mitochondrial proteins, and they collaborate
in the synthesis and assembly of these proteins (for review, see
[256]).
In this chapter I report the complete sequence of the mitochondrial
genome of Penicillium marneffei, the first complete mitochondrial DNA
sequence of a thermally dimorphic fungus. This 35 kb mitochondrial
genome contains the genes encoding ATP synthase subunits 6, 8, and 9
(atp6, atp8, and atp9), cytochrome oxidase subunits I, II, and III
(cox1, cox2, and cox3), apocytochrome b (cob), reduced nicotinamide
adenine dinucleotide ubiquinone oxidoreductase subunits (nad1, nad2,
nad3, nad4, nad4L, nad5, and nad6), the ribosomal protein of the small
ribosomal subunit (rps), 28 tRNAs, and the small and large ribosomal
RNAs. Analysis of gene content, gene order and gene sequences revealed
that the mitochondrial genome of P. marneffei is more closely related
to those of moulds than to those of yeasts.
3.2 Materials and Methods
3.2.1 Library construction and sequence assembly
The P. marneffei mitochondrial genome was sequenced as part of the
P. marneffei whole genome sequencing project, as described in Chapters
1 and 2. A genomic DNA (including mitochondrial DNA) library was made
in pUC18 carrying insert sizes from 2.0 to 8.0 kb. DNA inserts were
prepared by physical shearing using the sonication method. This work
was done by my colleagues in the Department of Microbiology, HKU and
the Beijing Genomics Institute. I used the Phred/Phrap/Consed software
package for base calling, contig assembly and assembly quality
assessment [83, 84, 112]. The complete mitochondrial genome was
generated from the assembly of 467 successful sequence reads (100 bp
at Phred value Q20 [112, 243]), which corresponded to an overall
mitochondrial genome coverage of about 7×.
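The quality threshold and the coverage figure can be made concrete with a few lines of code. The Python sketch below uses the standard Phred relationship between quality value and error probability, P = 10^(-Q/10), and the usual definition of fold coverage as total read bases divided by genome length. The reading of ‘successful’ as ‘at least 100 bases with quality of Q20 or higher’ is an assumption, and the function names are hypothetical.

```python
from typing import Sequence


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality value to a base-call error probability."""
    return 10 ** (-q / 10)


def is_successful_read(
    qualities: Sequence[int], min_bases: int = 100, threshold: int = 20
) -> bool:
    """Assumed criterion: at least `min_bases` bases at or above `threshold`."""
    high_quality = sum(1 for q in qualities if q >= threshold)
    return high_quality >= min_bases


def fold_coverage(n_reads: int, mean_read_length: float, genome_length: int) -> float:
    """Fold coverage = total bases in reads / genome length."""
    return n_reads * mean_read_length / genome_length
```

A quality value of Q20 corresponds to an error probability of 1 in 100 per base call.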
3.2.2 Mitochondrial DNA sequence annotation
The putative ORFs in P. marneffei mitochondrial DNA were identified
using Artemis, a free sequence viewer and annotation tool, with the
mould genetic code. Genes containing the putative ORFs were assigned
functions through BLASTP searches against fungal mitochondrially
encoded proteins available in the GenBank database. Introns and rRNAs
were identified mainly by BLASTN pairwise comparison of P. marneffei
mitochondrial DNA with the mitochondrial DNAs of Aspergillus nidulans,
Neurospora crassa, Saccharomyces cerevisiae (Acc. NC_001224),
Schizosaccharomyces pombe (Acc. NC_001326), Podospora anserina (Acc.
NC_001329), Allomyces macrogynus (Acc. NC_001715), Pichia canadensis
(Acc. NC_001762), Candida albicans (Acc. NC_002653), Yarrowia
lipolytica (Acc. NC_002659), and Candida glabrata (Acc. NC_004691)
[29, 91, 101, 354, 175, 262]. The BLASTN results were viewed through
ACT, a DNA sequence comparison viewer based on Artemis [40], and exon
and intron boundaries were adjusted manually. The tRNAs were predicted
by tRNAscan-SE 1.21 [207]. The core structures of the group I introns
were inferred by the program CITRON [200].
3.2.3 Phylogenetic analysis
Phylogenetic analysis was performed by using MBEToolbox as described
in Chapter 10. The 11 genes that encode subunits of respiratory chain
complexes (cox1, cox2, cox3, cob, nad1, nad2, nad3, nad4, nad4L, nad5,
and nad6 ) and the three that encode ATPase subunits (atp6, atp8, and
atp9 ) in the P. marneffei mitochondrial genome and the corresponding
genes in 24 other fungi with completed mitochondrial genomes were used
to determine the phylogenetic relationships of P. marneffei to the other
fungi. Phylogenetic trees were constructed using unambiguously aligned
portions of concatenated amino acid sequences of these 14 protein cod-
ing genes by the maximum likelihood method in the Phylip package [86].
The corresponding nad genes are not present in Schizosaccharomyces
japonicus, Schizosaccharomyces octosporus, S. pombe, C. glabrata,
Saccharomyces castellii, Saccharomyces servazzii, and S. cerevisiae;
the maximum likelihood method was used because it is less sensitive to
such missing sequence information than distance methods are. A total
of 3,462 amino acid positions were included in the analysis.
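Concatenating per-gene alignments when some taxa lack some genes can be sketched as follows. This Python illustration pads an absent gene with gap characters so that every taxon ends up with a sequence of the same length; how absent genes were actually encoded for the analysis is not stated here, so the padding is an assumption, and the function name and the toy data in the test are hypothetical.

```python
from typing import Dict, List, Sequence


def concatenate_alignments(
    gene_alignments: Sequence[Dict[str, str]],
    taxa: Sequence[str],
    missing: str = "-",
) -> Dict[str, str]:
    """Concatenate aligned genes per taxon, padding genes a taxon lacks."""
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for alignment in gene_alignments:
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError("sequences within one gene must be aligned")
        width = widths.pop()
        for taxon in taxa:
            # A taxon without this gene receives a run of gap characters.
            parts[taxon].append(alignment.get(taxon, missing * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}
```

The concatenated sequences can then be written out in the input format of the chosen phylogenetics package.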
3.2.4 Mitochondrial DNA sequences in nuclear genome
Fragments of mitochondrial DNA sequences were searched for in the
corresponding nuclear genomes of P. marneffei, A. nidulans, N. crassa,
S. cerevisiae, and S. pombe. For each fungus, the mitochondrial DNA
sequence was used as the query to search against its own nuclear
genome, using a method published for S. cerevisiae [262]. The
mitochondrial and genomic DNA sequences of A. nidulans and N. crassa
were downloaded from the A. nidulans Database
(http://www-genome.wi.mit.edu/annotation/fungi/aspergillus/) and the
N. crassa Database
(http://www-genome.wi.mit.edu/annotation/fungi/neurospora/)
respectively, and those of S. cerevisiae and S. pombe were obtained
from GenBank. For P. marneffei, the 6.6× coverage of genomic DNA
sequences was generated by our own whole genome sequencing project.
3.3 Results and Discussion
3.3.1 Gene content and genome organisation
The mitochondrial DNA of P. marneffei is a circular DNA molecule of
35,438 bp (Fig. 3.2). The overall G+C content is 25%, and 24% in
protein-coding genes. The genome encodes 28 tRNAs, the small and
the large subunit rRNAs, the ribosomal protein of the small ribosomal
subunit, 11 genes encoding subunits of respiratory chain complexes, and
the three ATPase subunits (Table 3.1). All genes are encoded by the
same DNA strand. 63.6% of the genome is occupied by structural genes
(40.5% corresponds to protein coding exons, 5.9% to the 28 tRNA genes,
and 17.3% to the rRNA subunits), 8.8% by intergenic spacers that are
14-372 bp in size, and 32.4% by the 11 introns.
3.3.2 Protein coding genes
The P. marneffei mitochondrial genome contains 15 protein coding genes.
These include genes encoding ATP synthase subunits 6, 8, and 9 (atp6,
atp8, and atp9 ), the cytochrome oxidase subunits I, II, and III (cox1,
Figure 3.2: Physical map of P. marneffei mitochondrial DNA (35,438
bp). The map is based on an annotation of the reverse complement of
Assembly 3 of the P. marneffei mitochondrial sequence determined by
the P. marneffei Sequencing Project at the University of Hong Kong in
collaboration with the Beijing Genomics Institute of the Chinese
Academy of Sciences. Numbers in the inner circle are in kb. The
sequence is numbered from the unique restriction enzyme ClaI site
(AT|CGAT) (0/35.4), which is located just upstream of the nad4L gene
and downstream of the cox2 gene. Exons are shown in black, introns in
white, and intronic ORFs in gray.
Table 3.1: Gene content of P. marneffei mitochondrial genome. * Exact
start codon could not be determined merely through sequence
comparison.

Genetic       Localisation (nt)                 Size    Size   Start  Stop
element                                         (bp)    (aa)   codon  codon
---------------------------------------------------------------------------
nad4L         26-295                            270     89     ATG    TAA
nad5          295-2271                          1977    658    ATG    TAA
nad2          2289-4028                         1740    579    TTA    TAA
atp9          4216-4440                         225     74     ATG    TAA
cob           Join: (4706-5098, 6270-7037)      2332    386    ATG    TAA
cob-i1-ORF    5099-5965                         867     288    TTG*   TAA
nad1          Join: (7532-8179, 8650-9081)      1550    359    ATA    TAA
nad4          9253-10716                        1464    487    ATG    TAA
atp8          10945-11091                       147     48     ATG    TAG
atp6          11158-11928                       771     256    ATG    TAA
rns           12341-13721                       1381
nad6          14053-14637                       585     194    ATG    TAA
URF1          14722-15177                       456     151    ATG    TAA
cox3          15352-16161                       810     269    ATG    TAA
rnl           Join: (17165-19688, 21361-21902)  4738
rps           19987-21252                       1266    421    ATG    TAA
cox1          Join: (23339-23718, 24994-25099,  9821    561    ATT    TAA
              26298-26641, 27740-27875,
              29012-29201, 30504-30553,
              31652-31806, 32835-33159)
cox1-i1-ORF   23720-24622                       903     300    AAA*   TAA
cox1-i2-ORF   25100-26200                       1101    366    AAA*   TAA
cox1-i3-ORF   26643-27647                       1005    334    AAA*   TAA
cox1-i4-ORF   27876-28928                       1053    350    TGA*   TAA
cox1-i5-ORF   29204-30043                       840     279    TTA*   TAA
cox1-i6-ORF   30554-31384                       831     276    ACA*   TAA
cox1-i7-ORF   31808-32629                       821     273    AGA*   TAG
URF2          33223-33660                       438     145    ATT    TAA
nad3          33955-34362                       408     135    ATG    TAA
cox2          34591-35346                       756     251    ATG    TAA
cox2, and cox3), apocytochrome b (cob), the reduced nicotinamide adenine
dinucleotide ubiquinone oxidoreductase subunits (nad1, nad2, nad3,
nad4, nad4L, nad5, and nad6), and the ribosomal protein of the small
ribosomal subunit (rps). This set of protein coding genes is exactly the
same as that in the A. nidulans mitochondrial genome. Furthermore, the
gene order of the protein genes is the same as that in the A. nidulans mito-
chondrial genome, except for the atp9 gene, which is located between the
cox1 and nad3 genes in the A. nidulans mitochondrial genome, but be-
tween the nad2 and cob genes in the P. marneffei mitochondrial genome
(Fig. 3.3).
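
The gene order comparison described above can be illustrated with a
small sketch. The P. marneffei order is read off Table 3.1; the A.
nidulans order is not taken from an annotation but is reconstructed
here, as an assumption, from the statement that only atp9 differs and
lies between cox1 and nad3. The helper names are hypothetical. The
sketch finds the longest common subsequence of the two orders and
reports the genes left outside it.

```python
# Protein-coding gene order in P. marneffei, read off Table 3.1.
pm_order = ["nad4L", "nad5", "nad2", "atp9", "cob", "nad1", "nad4", "atp8",
            "atp6", "nad6", "cox3", "rps", "cox1", "nad3", "cox2"]

# ASSUMPTION: A. nidulans order rebuilt from the text's description
# (same order, except atp9 sits between cox1 and nad3).
an_order = [g for g in pm_order if g != "atp9"]
an_order.insert(an_order.index("nad3"), "atp9")


def longest_common_subsequence(a, b):
    """Return one longest common subsequence of two gene lists."""
    n, m = len(a), len(b)
    # table[i][j] = LCS length of a[i:] and b[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table to recover the subsequence itself.
    i = j = 0
    common = []
    while i < n and j < m:
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return common


def displaced_genes(a, b):
    """Genes of `a` that fall outside the conserved (collinear) order."""
    conserved = set(longest_common_subsequence(a, b))
    return [g for g in a if g not in conserved]


print(displaced_genes(pm_order, an_order))
```

Under these assumptions the only gene reported as displaced is atp9,
in line with Fig. 3.3.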
Concatenated amino acid sequences of the 14 protein coding genes in
the mitochondrial genomes of P. marneffei and 24 other fungi were used
for phylogenetic tree construction. The closest relatives of P. marnef-
fei were A. nidulans and other moulds, such as P. anserina, N. crassa,
Hypocrea jecorina, and Verticillium lecanii (Fig. 3.4). On the other hand,
the yeasts, such as the Saccharomyces species, Schizosaccharomyces species,
Candida species, and P. canadensis were more distantly related to P.
marneffei. This implies that, phylogenetically, the mitochondrial genome
of P. marneffei is more closely related to those of moulds than to those
of yeasts. This is in line with our previous observation, and with
results published by others, that when the chromosomal 18S rRNA genes,
or the internal transcribed spacers and 5.8S rRNA genes
(ITS1-5.8S-ITS2) and mitochondrial small subunit rRNA genes, were used
for phylogenetic tree construction, the closest neighbours of P.
marneffei, besides the other Penicillium species, were the Aspergillus
species as well as other moulds [202, 364]. Furthermore,
the same gene content and almost the same gene order in the
mitochondrial genomes of P. marneffei and A. nidulans also implies that
the mitochondrial genome is probably not related to the unique charac-
teristic of thermal dimorphism of P. marneffei. Interestingly, MP1, the
gene that encodes an abundant and highly immunogenic protein in P.

Figure 3.3: Gene content and order comparison between P. marneffei
mitochondrial DNA and A. nidulans mitochondrial DNA. Protein and rRNA
genes and tRNA genes are drawn separately. The only exonic gene that has
undergone gene rearrangement is atp9, which is highlighted in black
background.

Figure 3.4: Phylogenetic relationships of P. marneffei to other fungi
and distribution of group I and group II introns in the corresponding
fungi. The maximum likelihood tree was constructed using unambiguously
aligned portions of concatenated amino acid sequences of the 14
protein-coding genes (atp6, atp8, atp9, cob, cox1, cox2, cox3, nad1,
nad2, nad3, nad4, nad4L, nad5 and nad6). A total of 3462 amino acid
positions were used for the inference with ProML [86]. In the intron
distribution panel, the columns correspond to rnl, atp6, atp8, atp9,
cob, cox1, cox2, cox3, nad1, nad2, nad3, nad4, nad4L, nad5 and nad6;
separate symbols mark group I introns with an intronic ORF, group I
introns without an intronic ORF, and group II introns, and genes not
present are crossed out. Sequences were obtained from GenBank:
Allomyces macrogynus (NC_001715), Aspergillus nidulans (CAA32799,
CAA33481, AAA99207, AAA31737, CAA25707, AAA31736, CAA23994, P15956,
CAA23995, CAA33116), Candida albicans (NC_002653), Candida glabrata
(NC_004691), Cryptococcus neoformans var. grubii (NC_004336),
Harpochytrium sp. JEL105 (NC_004623), Harpochytrium sp. JEL94
(NC_004760), Hyaloraphidium curvatum (NC_003048), Hypocrea jecorina
(NC_003388), Monoblepharella sp. JEL15 (NC_004624), Neurospora crassa
(CAA24041, CAA32799, AAA31961, CAA27029, CAA27418, AAA66053, AAA31959),
P. marneffei (present study), Pichia canadensis (NC_001762), Podospora
anserina (NC_001329), Rhizophydium sp. 136 (NC_003053), Saccharomyces
castellii (NC_003920), Saccharomyces cerevisiae (NC_001224),
Saccharomyces servazzii (NC_004918), Schizophyllum commune (NC_003049),
Schizosaccharomyces japonicus (NC_004332), Schizosaccharomyces
octosporus (NC_004312), Schizosaccharomyces pombe (NC_001326),
Spizellomyces punctatus (NC_003052, NC_003061 and NC_003060),
Verticillium lecanii (NC_004514), Yarrowia lipolytica (NC_002659). Some
sequences of A. nidulans were downloaded from the Fungal Mitochondrial
Genome Project
(http://megasun.bch.umontreal.ca/People/lang/FMGP/FMGP.html), and some
sequences of N. crassa were downloaded from
http://pages.slu.edu/faculty/kennellj/genbank.html. The scale bar (0.1)
indicates the branch lengths, which were scaled in terms of expected
numbers of amino acid substitutions.

marneffei, only has known homologues in A. nidulans, A. fumigatus, and
A. flavus, but not in other fungi [37,39,38,363,43,351,352].
3.3.3 Genetic code and codon usage
Since the mitochondrial genome of P. marneffei is phylogenetically closely
related to those of moulds and its gene content is the same as that of A.
nidulans, the genetic code of the mitochondrial genome of P. marneffei
is assumed to be the same as that of A. nidulans.
There is a strong codon usage bias in exonic ORFs in the mitochondr-
ial genome of P. marneffei towards codons ending in A or T. In fact, eight
codons (CTC, CTG, ACG, TGC, TGG, CGC, CGG, and GGC) were not
used at all, five codons (GTC, TCC, TCG, ACC, and AGG) were used
only once, and nine codons (ATC, CCG, GCC, GCG, CAC, CAG, AGG,
GAC, GGG) were used 2 to 10 times, in exonic ORFs. Moreover, this
codon usage bias is also evident in the use of stop codons: TAA is
used as the stop codon in 14 genes, whereas TAG is used in only one gene.
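
As a rough illustration of how such a bias can be quantified, the
sketch below counts codons in an in-frame sequence and computes the
fraction of codons ending in A or T. The function names are
hypothetical and this is not the procedure used to build Table 3.2. As
a worked case it uses the six leucine codon counts from the Genes
column of Table 3.2.

```python
from collections import Counter


def codon_counts(orf):
    """Count codons in an in-frame DNA sequence."""
    orf = orf.upper()
    if len(orf) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return Counter(orf[i:i + 3] for i in range(0, len(orf), 3))


def at3_fraction(counts):
    """Fraction of codons whose third position is A or T."""
    total = sum(counts.values())
    at_ending = sum(n for codon, n in counts.items() if codon[2] in "AT")
    return at_ending / total


# Leucine codons, Genes column of Table 3.2.
leucine = Counter({"TTA": 572, "TTG": 26, "CTT": 49,
                   "CTC": 0, "CTA": 20, "CTG": 0})
leucine_at3 = at3_fraction(leucine)
print(f"Leu codons ending in A/T: {leucine_at3:.3f}")

# Hypothetical toy ORF, only to show codon_counts on raw sequence.
toy_counts = codon_counts("ATGTTATTTGGTAGATGGTAA")
print(toy_counts.most_common(3))
```

For leucine, 641 of 667 codons (about 96%) end in A or T.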
3.3.4 tRNA genes
Twenty-eight tRNA genes were identified in the P. marneffei mitochon-
drial genome (Fig. 3.5). These are all located on the same DNA strand
as the other genes. The set of mitochondrial tRNAs in P. marneffei is
similar in type to that in A. nidulans. Furthermore, the sequences of
the mitochondrial tRNA genes of P. marneffei are fairly well conserved
relative to those of A. nidulans, especially in the two tRNA gene
clusters of the two species (Fig. 3.3).
3.3.5 Other RNA genes
The genes that encode the 23S and 16S ribosomal RNAs of the large and
small subunits of the ribosome (rnl and rns) were identified. Further-
more, a gene (rps), located within the intron of rnl (Table 3.1 and Fig.

Table 3.2: Codon usage in protein-coding genes of P. marneffei
mitochondrial genome. Numbers indicate the total numbers of codons
in either identified protein coding genes or ORFs (including both
free-standing URFs, intronic ORFs and RPS).

Codon  AA  Genes  ORFs    Codon  AA  Genes  ORFs
TTT    F   307    143     TCT    S   160    93
TTC    F   66     13      TCC    S   1      5
TTA    L   572    250     TCA    S   105    45
TTG    L   26     33      TCG    S   1      13

CTT    L   49     42      CCT    P   119    35
CTC    L   0      6       CCC    P   4      2
CTA    L   20     24      CCA    P   25     20
CTG    L   0      4       CCG    P   4      3

ATT    I   182    134     ACT    T   121    78
ATC    I   10     12      ACC    T   1      7
ATA    I   326    162     ACA    T   105    45
ATG    M   112    38      ACG    T   0      4

GTT    V   132    74      GCT    A   144    49
GTC    V   1      3       GCC    A   4      7
GTA    V   131    70      GCA    A   81     35
GTG    V   18     5       GCG    A   7      3

TAT    Y   191    180     TGT    C   24     21
TAC    Y   32     27      TGC    C   0      4
TAA    *   14     9       TGA    W   56     37
TAG    *   1      1       TGG    W   0      5

CAT    H   76     47      CGT    R   10     24
CAC    H   8      7       CGC    R   0      1
CAA    Q   83     75      CGA    R   0      1
CAG    Q   5      7       CGG    R   0      2

AAT    N   196    277     AGT    S   123    90
AAC    N   11     30      AGC    S   15     8
AAA    K   101    347     AGA    R   78     94
AAG    K   6      18      AGG    R   1      9

GAT    D   97     112     GGT    G   188    94
GAC    D   3      11      GGC    G   0      1
GAA    E   89     133     GGA    G   92     32
GAG    E   21     21      GGG    G   6      13

Figure 3.5: 28 tRNAs encoded in the mitochondrial genome of P.
marneffei. Predicted clover-leaf structures of the 28 tRNAs encoded in
the mitochondrial genome of P. marneffei. Anticodons are underlined and
the corresponding amino acids are indicated. tRNAs are listed according
to the order of their positions in the map in Fig. 3.2: 1 Asn, 2 Cys,
3 Arg, 4 Asn, 5 Tyr, 6 Arg, 7 Lys, 8 Gly, 9 Gly, 10 Asp, 11 Ser, 12 Trp,
13 Ile, 14 Ser, 15 Pro, 16 Thr, 17 Glu, 18 Val, 19 Met, 20 Met, 21 Leu,
22 Ala, 23 Phe, 24 Leu, 25 Gln, 26 Met, 27 His, 28 Pro.

3.6), that encodes the ribosomal protein of the small ribosomal subunit,
which is also present in the A. nidulans mitochondrial genome, was also
identified.

Figure 3.6: Predicted secondary structures of two representative group
I introns. Group I introns, PmRnl.1 and PmCox1.3, of rnl and cox1
genes respectively, in P. marneffei. The exon/intron boundaries are
represented by dotted lines. Base pairs are depicted by bars. The
corresponding sizes of nucleotides not shown are indicated in bp. RPS5
gene is depicted by square box. The numbers correspond to the
coordinates in the mitochondrial genome.

3.3.6 Group I introns
In P. marneffei, the cox1 gene contains seven introns (PmCox1.1, Pm-
Cox1.2, PmCox1.3, PmCox1.4, PmCox1.5, PmCox1.6, and PmCox1.7),
while the cob gene, nad1 gene, and rnl gene contain one intron each
(PmCob1.1, PmNad1.1, and PmRnl1.1 respectively). Each intron in the
cox1, nad1, and rnl genes contains an ORF. The ORF in the rnl gene

Table 3.3: Presence of mitochondrial DNA fragments in nuclear genomes.
‘Nuc no.’, number of mtDNA fragments in nuclear genomes; ‘Mt size’,
size of mitochondrial genomes (kb); ‘Nuc size’, size of nuclear genome
(Mb); ‘Ratio’, ratio of sizes of mitochondrial to nuclear genome (kb/Mb).

Fungus          Nuc no.  Mt size  Nuc size  Ratio
P. marneffei    10       35.4     ∼29.5     ∼1.20
A. nidulans     17       ∼33.2    ∼31.0     ∼1.07
N. crassa       21       ∼64.8    ∼43.0     ∼1.51
S. cerevisiae   34       85.7     12.1      7.08
S. pombe        21       19.4     13.8      1.41

corresponds to the rps gene. The predicted secondary structures of two
representative group I introns are depicted in Fig. 3.6. In both introns,
the upstream exons end with a T and the introns end with a G, as is
typical of most group I introns.
A comparison of the distribution of group I and group II introns in the
14 protein coding genes and rnl gene in the P. marneffei mitochondrial
genome and that in the corresponding genes in the other 24 fungi is
shown in Fig. 3.4. As a whole, the distribution of these introns in the
genes encoded in the mitochondrial genome of P. marneffei concurs with
those of the other fungi. The cox1 gene, the gene that contains the
largest number of self-splicing introns in other mitochondrial genomes,
is also the gene that contains the largest number of self-splicing introns
in the P. marneffei genome. The cob and nad1 genes, the genes that
also contain significant numbers of self-splicing introns, also possess one
group I intron each in the P. marneffei mitochondrial genome.
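
The relationship between the exon coordinates of the intron-containing
genes and the sizes listed in Table 3.1 can be checked with a short
sketch. The coordinates are copied from Table 3.1; the function names
are hypothetical, and the only assumption is that the last codon of the
spliced coding sequence is the stop codon.

```python
def spliced_length(exons):
    """Total exon length (nt) for a list of inclusive (start, end) pairs."""
    return sum(end - start + 1 for start, end in exons)


def gene_span(exons):
    """Length (nt) from the first exon start to the last exon end."""
    return exons[-1][1] - exons[0][0] + 1


def protein_length(exons):
    """Residues encoded by the spliced exons, excluding the stop codon."""
    nt = spliced_length(exons)
    if nt % 3 != 0:
        raise ValueError("spliced coding length is not a multiple of 3")
    return nt // 3 - 1


# Exon coordinates from Table 3.1.
genes = {
    "cob": [(4706, 5098), (6270, 7037)],
    "nad1": [(7532, 8179), (8650, 9081)],
    "cox1": [(23339, 23718), (24994, 25099), (26298, 26641),
             (27740, 27875), (29012, 29201), (30504, 30553),
             (31652, 31806), (32835, 33159)],
}

for name, exons in genes.items():
    print(name,
          "introns:", len(exons) - 1,
          "span (bp):", gene_span(exons),
          "protein (aa):", protein_length(exons))
```

With these coordinates the sketch gives 386, 359 and 561 residues and
spans of 2332, 1550 and 9821 bp for cob, nad1 and cox1, matching the
sizes in Table 3.1, and seven introns for cox1.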
3.3.7 Mitochondrial DNA sequences in nuclear genome
The presence of mitochondrial DNA sequence fragments in the corresponding
nuclear genomes of P. marneffei, A. nidulans, N. crassa, S. cerevisiae,
and S. pombe was compared (Table 3.3). By using the same
method of sequence similarity comparison used for S. cerevisiae [262],

Table 3.4: P. marneffei mitochondrial DNA sequences present in nuclear
genome.

No.  Coordinates   Size (bp)  Location     E-value
1    9031..9069    39         nad1         9e-08
2    10182..10201  20         nad4         1e-03
3    11622..11697  76         atp6         2e-15
4    13445..13465  21         rrs          2e-04
5    15158..15177  20         nad6 – cox3  1e-03
6    18757..18776  20         rnl          1e-03
7    25168..25187  20         cox1         1e-03
8    31197..31216  20         cox1         1e-03
9    32560..32580  21         cox1         2e-04
10   34510..34529  20         nad3 – cox2  1e-03

only 10 mitochondrial DNA sequence fragments were detected in the 4× coverage nuclear genome sequences (representing 95% of the genome) of P. marneffei
(Table 3.4). This number of mitochondrial DNA sequence fragments in
the corresponding nuclear genomes, as well as the ratio of mitochondrial
to nuclear genome size, was comparable to those found in A. nidulans,
N. crassa, and S. pombe (Table 3.3). On the other hand, the number
of mitochondrial DNA sequence fragments in the nuclear genome of S.
cerevisiae was 34, about twice the number found in the other fungi.
Although the relatively high ratio of mitochondrial to nuclear genome
size of S. cerevisiae may partly explain this phenomenon, further studies
would be necessary to elucidate the difference in the significance of these
mitochondrial DNA fragments in the nuclear genomes for the different
fungi.
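
The ratios in Table 3.3 and the "about twice" comparison can be
reproduced with a few lines. The values are copied from Table 3.3,
taking the approximate entries at face value; the variable names are
hypothetical.

```python
# (mtDNA fragments in nuclear genome, mt genome size in kb,
#  nuclear genome size in Mb), from Table 3.3.
genomes = {
    "P. marneffei": (10, 35.4, 29.5),
    "A. nidulans": (17, 33.2, 31.0),
    "N. crassa": (21, 64.8, 43.0),
    "S. cerevisiae": (34, 85.7, 12.1),
    "S. pombe": (21, 19.4, 13.8),
}

# Ratio of mitochondrial to nuclear genome size (kb/Mb).
ratios = {name: round(mt_kb / nuc_mb, 2)
          for name, (_, mt_kb, nuc_mb) in genomes.items()}

# Fragment count in S. cerevisiae relative to the mean of the others.
others = [n for name, (n, _, _) in genomes.items() if name != "S. cerevisiae"]
fold = genomes["S. cerevisiae"][0] / (sum(others) / len(others))

for name, ratio in ratios.items():
    print(f"{name:15s} {ratio:.2f}")
print(f"S. cerevisiae fragments vs. mean of others: {fold:.2f}x")
```

The S. cerevisiae fragment count comes out at roughly 1.97 times the
mean of the other four fungi, while its size ratio (7.08) is about five
to seven times theirs.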
In conclusion, among the known mitochondrial genomes of fungi, the P.
marneffei mitochondrial genome has an intermediate size. The replica-
tion origin of the P. marneffei mitochondrial genome is unknown. De-
spite the distinct biological property of thermal dimorphism in P. marn-
effei, its mitochondrial genome is much more closely related to those of
moulds, especially to that of A. nidulans, than to yeasts. The set of
protein coding genes in the P. marneffei mitochondrial genome is ex-
actly the same as that in the A. nidulans mitochondrial genome. Except
for the atp9 gene, the gene order of the protein genes is also the same
as that in the A. nidulans mitochondrial genome. Furthermore, when
concatenated amino acid sequences of 14 protein coding genes in the mi-
tochondrial genomes of P. marneffei and 24 other fungi were used for
phylogenetic tree construction, the closest relatives of P. marneffei were
A. nidulans and other moulds, whereas the yeasts were more distantly
related.
Chapter 4
GENOMIC EVIDENCE FOR THE PRESENCE OF
MELANIN BIOSYNTHESIS GENE CLUSTER IN
PENICILLIUM MARNEFFEI
In this chapter, I first review fungal virulence factors and their
identification by genomic approaches, and then present genomic evidence
for the presence of melanin biosynthesis genes in Penicillium marneffei.
4.1 Introduction
In Chapter 3, when I compared the mitochondrial genome of P. marneffei
to those of other fungi, it was observed that the mitochondrial genome
of P. marneffei is much more closely related to those of moulds, espe-
cially to that of Aspergillus nidulans, than to yeasts. The set of protein
coding genes in the P. marneffei mitochondrial genome is exactly the
same as that in the A. nidulans mitochondrial genome. Except for the
atp9 gene, the gene order of the protein genes is also the same as that
in the A. nidulans mitochondrial genome. Furthermore, the amino acid
sequence identity between the mitochondrial genes of P. marneffei and
those of A. nidulans is significantly higher than those between the mi-
tochondrial genes of P. marneffei and those of Neurospora crassa, Can-
dida albicans, Saccharomyces cerevisiae, and Schizosaccharomyces pombe.
This evidence of close relationships between P. marneffei and Aspergillus
species has prompted a further search for previously undiscovered charac-
teristics in P. marneffei based on our knowledge of the various Aspergillus
species.
Melanins are negatively charged pigments of high molecular weight
with hydrophobic surfaces. They are formed by the oxidative polymeri-
sation of phenolic and/or indolic compounds [341]. They are carcinogens
that are widespread in agricultural products and food. They are mainly
produced by various Aspergillus species, like A. parasiticus and A. flavus,
and less frequently, also by A. nomius, A. pseudotamarii, and A. bombycis
[170]. Since melanin is made by these important pathogenic fungi
and has been implicated in the pathogenesis of a number of fungal infec-
tions, it would be of interest to investigate whether P. marneffei could
synthesise melanin or melanin-like compounds.
Here, after the literature review, I report progress in identifying a
gene cluster in P. marneffei, spanning 19 kb, that contains homologues
of six genes. All six of these genes, which form a cluster in A.
fumigatus, have been shown to be involved in DHN-melanin biosynthesis
[24, 187, 317, 318]. These genes are alb1, arp1, arp2, abr1 and abr2,
encoding a polyketide synthase, a scytalone dehydratase, a
hydroxynaphthalene reductase, a putative protein possessing two
signatures of multicopper oxidases, and a laccase, respectively, as well
as ayg1, of unknown function. The order of genes in the clusters of the
two fungi differs slightly. These
findings indicate that P. marneffei can potentially produce melanin or
melanin-like compounds. Since melanin is an important virulence factor
in other pathogenic fungi, this pigment may have a similar role to play
in the pathogenesis of penicilliosis.
4.2 Literature Review
Most fungi cannot survive in the environment provided by human tissue
and therefore are not pathogenic. Amongst more than 100,000 fungal
species which have been described, only a handful of them are pathogens.
The pathogenic fungi are divided into two classes, primary pathogens and
opportunistic pathogens. Primary pathogenic fungi, e.g., Coccidioides
immitis and Histoplasma capsulatum, are “professional” pathogens which
adapt to live inside healthy mammalian and human tissue, causing dis-
ease not only in immuno-compromised patients but also in healthy peo-
ple. Opportunistic fungi may have an environmental reservoir or exist as
commensals in a healthy host. Some examples include Candida species,
C. neoformans and A. fumigatus. These fungi are able to grow and invade
host tissue only when they can take advantage of an immuno-compromised
host. However, the incidence of life-threatening mycoses caused by
opportunistic fungal pathogens has increased dramatically in recent
years, and they are now the major cause of fungal infections. The
infections caused by pathogenic fungi can be superficial, subcutaneous
or systemic. Superficial infection is localised to the skin, the hair,
and the nails; subcutaneous infection is confined to the dermis,
subcutaneous tissue or adjacent structures; systemic infection refers
to deep infection of the internal organs.
4.2.1 Potential virulence factors
A virulence factor in a fungus literally refers to any factor that the
fungus possesses that increases its virulence in the host. For instance,
if a gene or a protein is essential for growth in vivo, but its deletion
does not affect mycelial growth in vitro, it is considered a virulence
factor [189]. The concept of a virulence factor differs between primary
pathogens and opportunistic pathogens, and it is relatively difficult to
define when dealing with the latter, as pointed out in [128]. For most
fungal pathogens, few virulence factors that contribute to their
pathogenicity have been reported.
Although the mechanisms of fungal pathogenicity remain poorly
understood, the development of a fungal infection must satisfy several
conditions. The fungus must first be able to adhere to the host
tissues. The fungus must colonise the host and invade the host tissue.
Once the fungus has invaded the host tissue, it must be able to adapt to
the tissue environment. Probably most importantly, the fungus must be
able to avoid the host’s cellular defences.
Adherence to host tissues
Adherence factors are essential for fungal pathogens to attach themselves
to host tissue, and to resist physical clearing of the infectious agent.
For example, C. immitis, Aspergillus species, H. capsulatum and
Cryptococcus neoformans all infect via the bronchial route and must have
specific adaptations in order to avoid effective clearance from a host’s
lungs. Adherence is dependent on a variety of factors, including surface
glycoproteins, fungal cell surface hydrophobicity, pH, temperature, and,
of course, the phenotype of the organism. Adhesins are biomolecules that
promote the adherence of fungi to host cells or host-cell ligands, and
that bind to several extracellular matrix proteins of mammalian cells,
such as fibronectin, laminin, fibrinogen and collagen types I and IV.
Amongst the many studies that have shown an association between adherence
and fungal pathogenesis, the studies on adhesion in C. albicans are the
most extensive. Candida species express several cell surface proteins,
termed adhesins, which actively promote binding to host cells. These
include a lectin-like protein that recognises sugar residues of
epithelial cell surface glycoproteins, and a complement receptor-like
protein, CR3, which may play a role in adherence to endothelial cells.
Several adherence-promoting molecules or adhesins of C. albicans
regulate attachment, invasion, and dissemination of the fungus [36,157].
Als1p (agglutinin-like sequence) of C. albicans is a member of a family
of seven glycosylated proteins with similarity to the S. cerevisiae
α-agglutinin protein that is required for cell-cell recognition during
mating. Als1p is essential for virulence in a hematogenously
disseminated murine model [98].
HWP1 is a hyphal- and germ-tube-specific outer surface mannopro-
tein that binds C. albicans hyphae to human buccal epithelial cells [319].
The null mutant was less virulent than parental or single-gene-deleted
strains in a hematogenously disseminated murine model. The yeast ger-
minated less readily in the kidneys of infected mice and caused less en-
dothelial cell damage [319]. C. albicans binds to several ECM ligands,
including FN, laminin and collagens I and IV. C. albicans expresses an
integrin-like protein INT1 which is 25% identical to a non-repeat region
of the fibrinogen-binding protein, ClfA, of Staphylococcus aureus. Strains
of C. albicans deleted in INT1 were less virulent and adhered less readily
to an epithelial cell line [102]. Strains of C. albicans deleted in the
α-1,2-mannosyltransferase gene (MNT1) are less able to adhere in vitro and are
avirulent. Mnt1p is a type II membrane protein that is required for both
O- and N-mannosylation in fungi and found to be required for adherence
to an epithelial cell line [34].
Adhesins of other medically important fungi, such as Blastomyces
dermatitidis (a dimorphic fungal pathogen that infects the host through
inhalation of conidia [276]), have also been characterised. A 120-kDa
surface protein adhesin on B. dermatitidis, WI-1, binds the CD18
and CD14 receptors on human macrophages [232]. Hogan et al. [133]
cloned the gene encoding the WI-1 adhesin and found a total of 30 highly
conserved repeats of a 24-amino acid sequence. The repeat sequence is
similar to invasin, an adhesion-promoting protein of Yersiniae [169].
Invasion
Invasion is required for the development of deep mycoses in the internal
tissues of the body. The process is probably aided by hydrolytic enzymes,
such as proteinases and lipases, and in the case of dermatophytes,
keratinases. Secretion of extracellular enzymes, such as phospholipase,
has been proposed as one of the virulence mechanisms used by bacteria,
parasites, and pathogenic fungi in overcoming host defence mechanisms. The role of
extracellular phospholipase as a potential virulence factor in pathogenic
fungi, including C. albicans, C. neoformans, and A. fumigatus, has been
reported. Of the four candidal phospholipases (PLA, PLB, PLC and PLD),
only the loss of phospholipase B, encoded by PLB1, has been shown to
attenuate virulence: C. albicans null mutants that failed to secrete
this enzyme, constructed by targeted gene disruption, were less virulent
in two clinically relevant murine models of candidiasis. Initial data
suggest that direct host cell damage and lysis are the main virulence
mechanisms.
The secretion of lytic and degradative enzymes is also of obvious
importance to the invasion of host tissues. These enzymes, secreted
by fungi, can break down structural barriers and play an important role
in mediating host tissue invasion. The most extensively studied example
is the SAP gene family in C. albicans [294]. At least nine proteins comprise
the family of secreted aspartyl proteinases. In guinea pig and murine
models of invasive disease, deletions in sap1-6 attenuated virulence. The
SAP genes have been shown to be differentially expressed, according to
the growth phase and phenotype of the organism; SAP2 mRNA was the
dominant transcript in the yeast phase organism; SAP4, SAP5 and SAP6
transcripts were observed only at neutral pH during serum-induced yeast
to hyphal transition. The order of expression, SAP1 and SAP2 followed
sequentially by SAP8, SAP6 and SAP3, was correlated with tissue invasion,
i.e., early invasion (SAP1, SAP2), extensive penetration (SAP8) and
extensive hyphal growth (SAP6). These data indicate that members of the SAP
gene family may have distinct roles in the colonisation and invasion of
the host [63].
Growth at elevated temperature/Thermotolerance
Thermotolerance is one of the most obvious factors leading to
pathogenesis. The ability to grow at the body temperature of 37 °C and
within the fever range of 38–42 °C is important for systemic infection.
The majority of fungi have an
optimum growth temperature of 25 to 30 °C, and may grow only weakly
or not at all at 37 °C. The first genome-wide analysis of the
temperature-regulated transcriptome of C. neoformans was carried out by
Steen et al. [296]. They identified sets of genes with higher transcript
levels at 25 °C or 37 °C, respectively.
Morphology/Morphogenesis
There is a growing body of evidence linking morphogenesis and virulence.
Changes in morphology are advantageous for fungal pathogens. It has
been demonstrated that fungal hyphae can exert significant tip pressure
for penetration [224]. Many fungi adopt this morphological change and
develop virulence. Filamentous fungi (such as Aspergillus species) tend
to form branched hyphae in the lung. C. neoformans, a unique
encapsulated yeast, is coated with a polysaccharide capsule. The capsule
is a potent inhibitor of macrophage phagocytosis, which is an important
factor in the resistance to C. neoformans infection.
The most remarkable ability shared among the dimorphic fungi, such
as B. dermatitidis, C. immitis, H. capsulatum, Paracoccidioides
brasiliensis and Sporothrix schenckii, is to switch between two distinct
forms: yeast and mould. The dimorphic fungi normally exist as
non-pathogenic forms (usually filamentous mycelia) in the environment and
convert into pathogenic forms (yeast) in the tissues of a host. This
process is reversible, although the trigger of the conversion is unknown
and differs amongst fungi. The importance of the yeast cell as an
invasive morphology for dimorphic fungi has been reviewed by Gow et al.
[113, 114]. As shown in Table 4.1, most dimorphic mycelial pathogens
invade tissues of a host as yeast cells. Yeast cells are regarded as
better adapted for dissemination within the host circulatory system and
avoidance of immune
capture. Note that although the opportunistic pathogens C. albicans
and Candida tropicalis show dimorphic growth, these Candida species
mainly form pseudohyphae and are therefore not regarded as true
dimorphic fungi. Nevertheless, conversion to pseudohyphae has long been
regarded as essential for tissue invasion by Candida species.

Table 4.1: Major dimorphic fungal pathogens and their characteristic
morphologies in infectious disease. Taken from [114].

Fungal species                  Form in diseased tissue
Blastomyces dermatitidis        Budding yeasts
Candida albicans                (Pseudo)hyphae, budding yeasts
Candida tropicalis              Yeast and pseudohyphae
Coccidioides immitis            Endosporulating spherules
Cryptococcus neoformans         Budding capsulate yeasts
Histoplasma capsulatum          Budding yeasts
Paracoccidioides brasiliensis   Budding yeasts
Penicillium marneffei           Yeasts undergoing binary fission
Sporothrix schenckii            Budding yeasts
Wangiella dermatitidis          Budding yeasts
4.2.2 Genomic approaches in identification of virulence factors
In practice, combining several of the following techniques has great
potential for elucidating biological systems in detail.
Mining whole genome sequences and fishing for virulence factors
The sequencing of the genome of the budding yeast, S. cerevisiae, was a
landmark of genomics. Since then, progress has been made in sequencing
whole fungal genomes. The second complete sequence of a fungal genome,
that of S. pombe, was published in 2002 [354]. The genome sequences of
the filamentous fungi A. nidulans, A. fumigatus, N. crassa and Ashbya
gossypii are nearing completion (see also Section 1.2.4). In 2002, at an
early stage, the Fungal Genome Initiative (FGI), a genome sequencing
program of the National Human Genome Research Institute, USA, proposed
to sequence 15 fungi selected on the basis of medical, scientific and
commercial criteria. FGI will apply deep-shotgun sequencing approaches
(sequencing coverage > 10×) in order to finish all sequencing work
quickly. If fully funded, it will produce a mass of valuable information
for detailed comparative genomic analysis across the fungal taxa.
The genome sequences have an immediate impact on conventional fungal
genetics by eliminating the years of effort previously associated with
gene discovery. Traditional genetic and biochemical approaches to gene
discovery in fungi suffer from many limitations, such as poor
efficiencies of transfer, lack of stable extrachromosomal elements and
poor growth in the laboratory. With the genomic sequence in hand, one can
bypass these limitations by using genomics approaches, which permit rapid
identification of novel genes. Therefore, obtaining genome sequences from
pathogenic fungi is one of the most efficient steps in the identification
of potential targets for therapeutic intervention and vaccination.
Other genomic approaches
Current genomic approaches can be categorised into three groups: mutagenic-
based, nucleotide-based and protein-based [206]. The mutagenic-based
techniques include signature-tagged mutagenesis and the construction of
mutant libraries. Microarray analysis and serial analysis of gene
expression (SAGE), for example, belong to the nucleotide-based
techniques. The two-hybrid system, protein arrays and 2D-PAGE expression
analysis are examples of protein-based techniques.
4.3 Materials and Methods
4.3.1 Identification of melanin biosynthesis genes in P. marneffei
To identify melanin biosynthesis genes in the P. marneffei genome,
protein sequences of melanin biosynthesis genes of Aspergillus were
downloaded from GenBank. The downloaded protein sequences were used
as queries against the P. marneffei genome. The comparison was conducted
using the NCBI TBLASTN program version 2.0 with the BLOSUM62
scoring matrix [6]. The E-value cutoff used to assign homologues was
1 × 10^-20. The contigs in the P. marneffei genome that contained
homologues were extracted and annotated manually. Predicted peptides
were compared to the amino acid sequences of their corresponding
query proteins using NCBI BLAST2SEQ
(http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html). The “expect value”
statistics were calculated based on the size of the NCBI non-redundant
protein database. Conserved domains/motifs were identified using
InterPro release 5.1 [367].
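The E-value filtering step described above can be sketched as follows. This is only an illustration, not the procedure actually used in this work: it assumes the hits are available as 12-column tab-delimited lines with the E-value in the eleventh column, and the query names, contig names and alignment figures in the example rows are hypothetical. Only the cutoff (1 × 10^-20) is taken from the text.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-20  # cutoff quoted in the text for assigning homologues


@dataclass
class Hit:
    query: str      # name of the query protein
    contig: str     # hypothetical contig identifier
    e_value: float  # expect value of the hit


def parse_e_value(text):
    """Convert an E-value string to float.

    Some BLAST reports abbreviate 1e-140 as 'e-140' (as in Table 4.2),
    which float() cannot parse directly.
    """
    text = text.strip()
    if text.startswith("e"):
        text = "1" + text
    return float(text)


def parse_hits(lines):
    """Parse tab-delimited hits; column 1 = query, 2 = subject, 11 = E-value."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        hits.append(Hit(fields[0], fields[1], parse_e_value(fields[10])))
    return hits


def significant_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Keep hits whose E-value is at or below the cutoff."""
    return [h for h in hits if h.e_value <= cutoff]


# Illustrative rows only: all names and numbers below are made up.
example = [
    "queryA\tcontig_001\t63.0\t254\t90\t2\t5\t258\t1200\t1961\t8e-95\t340",
    "queryB\tcontig_002\t57.0\t403\t170\t3\t1\t403\t300\t1508\te-140\t495",
    "queryC\tcontig_003\t28.0\t80\t55\t3\t10\t89\t40\t279\t0.003\t35",
]
kept = significant_hits(parse_hits(example))
print([(h.query, h.contig) for h in kept])
```

In this sketch the third row is discarded because its E-value exceeds the cutoff; the contigs named in the retained hits would then be extracted for manual annotation.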
4.3.2 Multiple alignments and phylogenetic analyses
Multiple alignments of amino acid sequences were performed using the
program ClustalX 1.81 [311]. Initial pairwise alignments were
performed using the Blosum62 protein weight matrix, and the alignments
were then adjusted manually. Graphic presentations of the alignments and
consensus sequences were prepared using the program BOXSHADE 3.21
(http://www.ch.embnet.org/software/BOX_form.html). Regions of ambiguous
alignment were removed using the GeneDoc program
(http://www.psc.edu/biomed/genedoc). Phylogenetic trees were
inferred by the neighbour-joining method [273]. Bootstrap resampling
with 1000 pseudoreplicates was carried out to assess support for each
individual branch.
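The resampling step of the bootstrap can be illustrated with a short sketch. It is not the software used in this work: it only shows how pseudoreplicate alignments are generated by sampling alignment columns with replacement. The toy alignment and the function names are hypothetical, and tree inference on each pseudoreplicate (neighbour-joining in this chapter) is deliberately left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one pseudoreplicate: columns sampled with replacement."""
    n_cols = len(alignment[0])
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in cols) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates=1000, seed=0):
    """Generate n_replicates pseudoreplicate alignments (seeded for repeatability)."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Toy alignment of three sequences (hypothetical data).
aln = ["MKVLA", "MKILA", "MRVLS"]
reps = bootstrap_replicates(aln, n_replicates=1000, seed=1)
print(len(reps), reps[0])
```

Branch support is then the percentage of pseudoreplicate trees in which a given branch of the original tree is recovered.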
4.4 Results and Discussion
4.4.1 Melanin gene cluster present in P. marneffei
Secondary metabolism, the production of compounds not essential for
growth in culture, is thought to be integrally intertwined with
development in fungi. Secondary metabolism is usually induced by
exhaustion of a nutrient, biosynthesis or addition of an inducer, and/or
a decrease in growth rate. These events generate signals which effect a
cascade of regulatory events resulting in chemical differentiation
(secondary metabolism) and morphological differentiation
(morphogenesis). Microbial secondary metabolites have a major effect on
the health, nutrition and economics of our society. They include antibi-
otics, pigments, toxins, effectors of ecological competition and symbiosis,
pheromones, enzyme inhibitors, immunomodulating agents, receptor an-
tagonists and agonists, pesticides, antitumor agents and growth promot-
ers of animals and plants. Among them, fungal secondary metabolites
are of intense interest due to their pharmaceutical (antibiotics) and/or
toxic (mycotoxins) properties. Unlike primary metabolism, the pathways
of secondary metabolism are still not understood to a great degree and
thus provide opportunities for basic investigations of enzymology, con-
trol and differentiation. Recently tremendous progress has been made
in understanding the genes that are associated with production of var-
ious fungal secondary metabolites. For example, work with Aspergillus
species has revealed a link between asexual reproduction and the
production of toxic secondary metabolites. One of the best-studied fungal
secondary metabolic processes is the biosynthesis of melanin.
Based on the principle of similarity search, we took advantage of
the whole genome sequence to identify the presence of this important
genetic capacity in P. marneffei. The six known genes for DHN-melanin
biosynthesis in A. fumigatus are abr2, abr1, ayg1, arp2, arp1 and alb1
[318]. The functions or gene products of these genes are given in
Table 4.2; note that the function of ayg1 is unknown. All of these genes
are available from GenBank, and their order has been determined by a
previous genetic study [318] and further confirmed by the A. fumigatus
genome project. The gene order is abr2-abr1-ayg1-arp2-arp1-alb1
(Fig. 4.2).
Table 4.2: Putative gene products related to melanin biosynthesis in
P. marneffei.

Af protein (Acc. No.)  Function                                   Pm protein  Length (aa), Af/Pm  E-value  Identity/Positive (%)  Overlap length (aa)
abr1 (AAF03353)        brown 1                                    pm-abr1     664/555             0.0      60/77                  528
ayg1 (AAF03354)        yellowish-green 1                          pm-ayg1     406/403             e-140    57/71                  403
arp2 (AAF03314)        1,3,6,8-tetrahydroxynaphthalene reductase  pm-arp2     273/275             8e-95    63/74                  254
arp1 (AAC49843)        scytalone dehydratase                      pm-arp1     168/208             2e-81    77/91                  160
abr2 (AAF03349)        brown 2                                    pm-abr2     587/526             0.0      55/73                  505
alb1 (AAC39471)        polyketide synthase                        pm-alb1     2146/1568           0.0      59/71                  1639

When the amino acid sequences of the proteins encoded by these 6 genes
were used as queries against the P. marneffei genome, significant hits
were obtained for all 6 proteins. When the predicted peptides of the
corresponding contigs were compared to the amino acid sequences of the
corresponding query proteins, the E-values of the 6 comparisons ranged
from 5E-13 to 0 (Table 4.2), indicating high levels of similarity between
the P. marneffei proteins and the A. fumigatus proteins. In A. fumigatus,
abr1 encodes a multicopper oxidase and abr2 encodes a laccase. We
detected weak sequence similarity (60% alignable overlap with 30%
amino-acid positive similarity) between the two genes at the amino-acid
level. This weak sequence similarity suggests that the two genes are
paralogs that originated from gene duplication. In addition, we collected
abr1 or abr2 homologs from some other fungal species and constructed a
multiple alignment of the gene family (Fig. 4.1), which gives information
about how the gene family has diverged.
Figure 4.1: P. marneffei abr1 gene Cu-oxidase domain homologues. Alignment of partial amino acid sequences of Cu-oxidase domains of ascomycetes.
More importantly, the synthases of secondary metabolism are often
encoded by clustered genes on chromosomal DNA. It has been suggested
that such an organisation of genes may allow coordinated regulation of
the pathway [337]. The 6 melanin biosynthesis genes are located in a gene
cluster in P. marneffei (Fig. 4.2). The gene order is largely conserved
when compared to that of A. fumigatus. In P. marneffei,
abr1-ayg1-arp2-arp1 are located in one contig, and abr2 and alb1 in two
other contigs. Scaffolding suggests that these 3 contigs belong to a
single scaffold. Within this scaffold, the 3 contigs are ordered one
after another, i.e. uninterrupted by other contigs. Therefore, the gene
order in P. marneffei can be inferred as abr1-ayg1-arp2-arp1-abr2-alb1.
This placement was supported by 5 and 6 pairs of forward-reverse paired
reads, respectively, spanning the 2 gaps between the 3 contigs; it is
therefore likely that the 6 genes are correctly ordered and that the
length of this gene cluster can be closely approximated. As shown in
Fig. 4.2, the 6 genes span over 35 kb of the P. marneffei genome, which
is about twice the length in A. fumigatus (19 kb). The majority of this
difference is due to a gene-free region of > 15 kb between abr2 and alb1
(Fig. 4.2). Comparing the gene order in the two fungi, the only change is
that abr2 has moved from the beginning of the cluster (as in
A. fumigatus) to after arp1 in P. marneffei. In addition, the direction
of alb1 is reversed. The tendency of genes for enzymes of certain
metabolic pathways to be clustered in filamentous fungi has been noted
previously [161]. Generally these gene clusters encode optional pathways
for nutrient utilisation (e.g., the optional carbon source, quinate) [107]
or for synthesis of secondary metabolites (e.g., the mycotoxin, sterigma-
tocystin) [28]. Unlike the clustering of genes as operons in prokaryotes,
clusters of similar genes in fungi are not cotranscribed, nor has any vital
regulatory function for clustering been established [161]. Thus the rea-
son for the existence of gene clusters in filamentous fungi has not been
resolved.
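The gene-order comparison described above can be reproduced with a small sketch. It is an illustration only and was not part of the original analysis: the function name is hypothetical, the two gene orders are those given in the text, and gene orientation (such as the reversal of alb1) is ignored. A gene is reported as relocated if removing it from both clusters leaves identical orders.

```python
def relocated_genes(order_a, order_b):
    """Return genes whose removal makes the two gene orders identical."""
    if order_a == order_b:
        return []  # nothing has moved
    moved = []
    for gene in order_a:
        rest_a = [g for g in order_a if g != gene]
        rest_b = [g for g in order_b if g != gene]
        if rest_a == rest_b:
            moved.append(gene)
    return moved


# Gene orders as stated in the text.
a_fumigatus = ["abr2", "abr1", "ayg1", "arp2", "arp1", "alb1"]
p_marneffei = ["abr1", "ayg1", "arp2", "arp1", "abr2", "alb1"]

print(relocated_genes(a_fumigatus, p_marneffei))
```

For the two orders above, abr2 is the only gene whose removal reconciles the clusters, in agreement with the single change in gene order noted in the text.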
Figure 4.2: Comparison of the melanin gene clusters of P. marneffei and
A. fumigatus. Gene order in A. fumigatus: abr2, abr1, ayg1, arp2, arp1,
alb1. Gene order in P. marneffei: abr1, ayg1, arp2, arp1, abr2, alb1.
Scale bar: 5 kb.

4.4.2 Disrupted aflatoxin biosynthesis gene cluster in P. marneffei
With the possible exception of the penicillin metabolic cluster, the most
thoroughly examined fungal secondary metabolite gene clusters are those
involved in mycotoxin biosynthesis, particularly the aflatoxin (AF) and
sterigmatocystin (ST) biosynthetic clusters found in several Aspergillus
species [28]. These clusters contain a total of 23 genes involved in
aflatoxin biosynthesis and other related functions (including 20 genes
that encode enzymes, two genes that encode regulatory proteins, and one
gene that encodes an efflux transport protein) in Aspergillus species. No
sequence information for cypA, norB and ordB was available from GenBank
at the time of analysis. The sequences of the remaining 20 genes,
comprising 17 genes that encode enzymes (hexA, hexB, pksA, nor-1, avnA,
adhA, norA, avfA, cypX, estA, vbs, ver1, moxY, verB, omtB, omtA, and
ordA), the two regulatory genes (aflR and aflJ) and one transport gene
(aflT), were downloaded. When the amino acid sequences of these proteins
were used as queries to search against the P. marneffei genome,
significant hits (TBLASTN E-value cutoff 1.0e-10) were obtained for all
20 proteins. When the predicted peptides of the corresponding contigs
were compared to the amino acid sequences of the corresponding query
proteins, the BLASTP E-values of these comparisons ranged from 5.0e-13
to 0 (data not shown), indicating high levels of similarity between the
P. marneffei proteins and the Aspergillus proteins. Notably, the
putative gene products of omtA and ordA, which are responsible for the
last step in the conversion of ST to AF, were found in P. marneffei to
have high similarity to their counterparts in A. parasiticus.
Although putative homologues of the Aspergillus genes of the aflatoxin
biosynthesis pathway are present in the P. marneffei genome, these
genes do not form a cluster as they do in Aspergillus. This contradicts
the general trend that genes involved in fungal secondary metabolism
usually appear as a cluster, as in the A. flavus and A. parasiticus
genomes. Since almost none of these genes were found in the same contig
in the P. marneffei genome, the homologs we identified might be involved
in the production of other, unknown secondary metabolites instead of
aflatoxin. Alternatively, major movement of the genes of the aflatoxin
biosynthesis gene cluster may have occurred in P. marneffei during
evolution, which might affect the ability to produce aflatoxins and the
amount produced.
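Whether a set of homologues is clustered can be summarised by grouping the hits by contig, as in the sketch below. This is an illustration under stated assumptions rather than the procedure used here. The contig names are hypothetical. The melanin example follows the contig distribution described in Section 4.4.1 (four genes on one contig, abr2 and alb1 on two others), whereas the placements in the scattered example are invented, because the contig locations of the aflatoxin homologues are not listed in the text.

```python
from collections import defaultdict


def genes_by_contig(hit_locations):
    """Group gene names by the contig on which their homologue was found."""
    groups = defaultdict(list)
    for gene, contig in hit_locations.items():
        groups[contig].append(gene)
    return dict(groups)


def largest_colocated_fraction(hit_locations):
    """Fraction of the genes that sit on the single most populated contig."""
    groups = genes_by_contig(hit_locations)
    return max(len(genes) for genes in groups.values()) / len(hit_locations)


# Melanin genes: distribution over contigs as described in Section 4.4.1
# (contig names are hypothetical).
melanin = {
    "abr1": "contig_A", "ayg1": "contig_A", "arp2": "contig_A",
    "arp1": "contig_A", "abr2": "contig_B", "alb1": "contig_C",
}

# Invented placements illustrating a fully scattered gene set.
scattered = {
    "geneW": "contig_1", "geneX": "contig_2",
    "geneY": "contig_3", "geneZ": "contig_4",
}

print(largest_colocated_fraction(melanin))
print(largest_colocated_fraction(scattered))
```

A contig-level summary of this kind understates clustering when a cluster spans several adjacent contigs, as the melanin cluster does, so scaffold information should be consulted before concluding that genes are dispersed.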
4.4.3 Absence of penicillin biosynthesis genes in P. marneffei
Genomic sequence provides evidence for the presence of genetic
components, such as the melanin biosynthesis gene cluster. On the other
hand, it also provides evidence for the absence of some important genetic
components, which is equally valuable. The beta-lactam antibiotic
penicillin, one of the most commonly used antibiotics for the therapy of
infectious diseases, is produced as an end product by some filamentous
fungi, such as Penicillium chrysogenum. Penicillin biosynthesis is
catalysed by three enzymes which are encoded by the following three
genes: acvA (pcbAB), ipnA (pcbC) and aatA (penDE), which are organised in
a gene cluster. Although the production of secondary metabolites, such as
penicillin, is not essential for the direct survival of the producing
organisms, several studies have indicated that penicillin biosynthesis
genes are controlled by a complex regulatory network that responds to,
e.g., ambient pH, carbon source, amino acids and nitrogen. Most notably,
this gene cluster is present in A. nidulans, which is a penicillin
producer.
In conclusion, we have identified the coding capacity for a set of
proteins that could be involved in melanin biosynthesis. The presence of
these homologues suggests a potential ability to synthesise melanin or
melanin-like substances in P. marneffei. Since melanin is a well-defined
fungal virulence factor, it is reasonable to infer that it is also a
virulence factor in P. marneffei, although experimental confirmation is
required. In addition, although putative homologues of the Aspergillus
genes of the aflatoxin biosynthesis pathway are present in the
P. marneffei genome, these genes do not form a cluster as they do in
Aspergillus. They might be involved in the production of other unknown
secondary metabolites.
Chapter 5
MATING ABILITIES IN PENICILLIUM MARNEFFEI
Penicillium marneffei was believed to be asexual, but the genome
sequence analysis suggests that the fungus maintains the genetic capa-
bility for sexual reproduction. If confirmed, this raises the potential for
developing powerful genetic tools for the organism, with far reaching im-
plications for its genetic study and disease control.
5.1 Introduction
The most distinctive feature of Penicillium marneffei is the
temperature-dependent dimorphic switch. At 25 °C P. marneffei exhibits
true filamentous growth, while at 37 °C it undergoes a dimorphic
transition to produce uninucleate yeast cells that divide by fission. The
control of this
“dramatic” developmental process is of interest because it is required for
pathogenicity and may therefore provide a target for controlling infec-
tion. Fungal dimorphic growth and mating are regulated by common
signal transduction pathways, such as the mitogen-activated protein ki-
nase pathway and the nutrient sensing cAMP pathway. Studies of devel-
opment in many fungi have converged to define these conserved pathways,
which are organised in different ways to regulate filamentation, mating
and virulence, in different fungi as they adapt to unique environmental
challenges [192]. Given such a common regulatory mechanism, it is not so
surprising to find an association between the mating process and virulence
in some fungi. For example, a MATα strain of Cryptococcus neoformans
is 30-fold more prevalent in the environment and 40-fold more prevalent
in infections than a MATa strain [183, 193]. Candida albicans utilises a
number of the same genes for both mating and pathogenesis. The mating
pheromone of C. albicans elicits an over-expression of a set of virulence
genes in recipient cells [16]. Proteins encoded by these genes were previ-
ously shown to be required for virulence in a mouse model of disseminated
candidiasis. Therefore, it is of particular interest to understand the P.
marneffei mating system, which may be parallel to dimorphic develop-
ment and pathogenesis of this medically important fungus.
Traditionally, P. marneffei is considered as an asexual (anamorph)
ascomycete that lacks an apparent sexual (teleomorph) stage in its life
cycle and seems to reproduce only mitotically [44, 104]. Recent genetic
studies, however, suggest it may have an unidentified sexual cycle [20,19].
Two homologs of the Aspergillus nidulans steA and stuA genes, stlA and
stuA have been cloned from P. marneffei [20, 19]. Both steA and stuA
are involved in controlling mating in the sexual homothallic A. nidulans.
The stlA gene displays no role in vegetative growth, asexual develop-
ment, or dimorphic switching in P. marneffei and is able to complement
the sexual defect of an steA mutant of A. nidulans [19]. The P. marn-
effei stuA gene encodes a basic helix-loop-helix (bHLH) protein of the
APSES family and is supposed to regulate both dimorphic growth and
mating or asexual sporulation. Loss of stuA from P. marneffei resulted
in no obvious effect on dimorphic growth and P. marneffei stuA is able
to complement the conidiation defect of an A. nidulans stuA mutant [20].
Moreover, the P. marneffei tupA gene, a homolog of rcoA, is able to com-
plement both the asexual and sexual development phenotypes of an A.
nidulans rcoA deletion mutant [315]. This indicates that the sexual func-
tion of tupA has been retained in P. marneffei. Although the presence of
these highly conserved P. marneffei homologs of these A. nidulans genes
indeed suggests the presence of an undiscovered mating system in P.
marneffei, the mating process needs a comprehensive network of genes
to function coordinately. Therefore, the finding of a complete mating
gene repository in P. marneffei would be a stronger piece of evidence to
support the presence of a sexual stage for the fungus.
Now the genome sequence information has enabled us to conduct a
search for mating-related genes in the P. marneffei genome in order to
reveal the potential mating system in this important dimorphic fungal
pathogen. Similar studies have been carried out in C. albicans, which
was thought to be constitutively diploid and to reproduce only asexually
[138]. The complete genome predicted that a mating system existed in C.
albicans after the identification of numerous highly conserved homologs
of S. cerevisiae mating genes [190, 259, 272]. Eventually, two research
groups demonstrated that C. albicans can be induced to mate under
certain conditions [139,213].
The sexual cycle introduces valuable genetic tools for fungal study.
If a fungus has a sexual cycle, we can screen for mutants arising from
recombination events during meiosis and gamete formation, and then
zygote formation. In the case of P. marneffei, the absence of a sexual
stage has handicapped biological studies of this fungus. The genome
sequence analysis reported in this chapter, however, provides encouraging
information: many homologs of sex cycle-related genes have been
identified in the P. marneffei genome, suggesting a potential mating
ability in this important pathogenic fungus, even though its sexual state
has not been reported. Practically, this discovery might open the door to
simple and efficient procedures for obtaining sexual recombinants of
P. marneffei that will be useful for genetic analyses of pathogenicity
and other traits.
5.2 Literature Review
Studies on mating type in fungi have been helpful for the understanding
of many eukaryotic regulation pathways, including cell cycle regulation,
cellular and nuclear identity, and signal transduction. Most ascomycetes
have only two different mating types; their MAT locus encodes
transcription factors that regulate mating-type–specific genes involved in
pheromone production, pheromone sensing, and signal transduction [94].
Some ascomycetes are asexual, while many others have adopted different
reproductive strategies: heterothallic, homothallic, and, less frequently,
pseudohomothallic (Table 5.1). For homothallic species, homokaryotic
haploid strains are self-fertile and complete the sexual cycle without seek-
ing a mate. This diversity is so extensive that even species within the
same genus, such as Neurospora, adopt either homothallic or heterothallic
modes. More strikingly, in a recent study, researchers discovered that the
heterothallic C. neoformans α cells can sexually reproduce via fruiting,
without fusing with a partner of the opposite mating type.
5.2.1 Mating in hemiascomycete yeasts
The mating-type locus has been well studied in the ascomycete
S. cerevisiae. The two haploid cell types of S. cerevisiae are determined
by their MAT loci, denoted α and a. A pheromone-mediated fusion process
creates a diploid cell (a/α), which then, under starvation conditions,
can undergo meiosis with the formation of four haploid cells, two of which
are a, two are α. Each α and a mating-type locus contains two diver-
gently transcribed genes: a1, a2 and α1, α2, respectively. The a1 and
α2 proteins are transcriptional repressors (when both are present) and
both contain a homeodomain DNA-binding motif [284]. The α1 protein
has been shown to be a transcription activator [278] but its DNA-binding
domain (the α-box) has yet to be characterised in detail. The function of
a2 is unknown. The a1 and α1 proteins are encoded by totally dissimilar
sequences of 642 and 747 bp, respectively, while a2 and α2 sequences
have partial similarity [227, 299]. S. cerevisiae is basically
heterothallic; however, a homothallic breeding system can be achieved
through mating-type switching, in which an S. cerevisiae α haploid cell
can switch to the opposite mating type a, or vice versa [132]. This is
caused by gene
conversion between the MAT locus and two MAT-like loci during cellular
division of haploid cells [120]. The molecular basis of the gene conver-
sion is the presence of two MAT-like cassettes, HMR and HML. Normally
they are transcriptionally repressed through silencing by the formation of
a specialised compacted chromatin structure. They are both surrounded
by “silencers,” short specific sequences that are binding sites for DNA-
binding proteins and are also involved in transcriptional activation and
DNA replication (for recent reviews, see [105, 117]). Moreover, haploid-
specific gene products, such as the HO endonuclease, are involved in
repression of meiosis and mating-type switching [120].
5.2.2 Mating in filamentous ascomycetes
Fungal mating systems include many conserved components, such as gene
regulatory polypeptides and pheromone/receptor signal transduction cas-
cades, as well as conserved processes, like self-nonself recognition and
controlled nuclear migration. The mating systems in filamentous as-
comycetes share similar components and processes with those in yeasts
but they exhibit many unique properties. First, the sequence dissimi-
larity between two alternate mating-type alleles is more pronounced in
filamentous ascomycetes. Usually they consist of unrelated and unique
sequences. Second, the mating-type switching mechanism of filamentous
ascomycetes is unknown but different from that of yeast. Filamentous
ascomycetes exhibit great stability of the mating type, which might be
due to the lack of additional copies of mating-type sequences outside
the mating-type locus. The additional copies of the mating-type locus in
yeasts are usually silent copies facilitating mating type switching through
gene conversion.
Table 5.1: Mating strategies adopted by ascomycetous fungi, the presence
of mating type gene and ability in switching between mating types.

Species                 Mating strategy                           Mating type gene  Switching
S. cerevisiae           Homothallic                               Y                 Y
C. glabrata             Asexual?                                  Y                 NA
Kluyveromyces lactis    Heterothallic, some homothallic strains   Y                 Y
Kluyveromyces waltii    Homothallic                               Y                 Y
Ashbya gossypii         Asexual?                                  Y                 Y
Debaryomyces hansenii   Homothallic                               Y                 Y
Yarrowia lipolytica     Heterothallic                             Y                 Y
Neurospora crassa       Homothallic                               Y                 NA
Podospora anserina      Pseudohomothallic                         Y                 NA
Bipolaris sacchari      Asexual                                   Y                 NA
Neurospora intermedia   Heterothallic                             Y                 NA
S. almonella            Heterothallic                             Y                 Y
C. neoformans           Heterothallic                             Y                 N

Among filamentous ascomycetes, the structure of the components and the
genetic arrangement of the mating-type loci vary greatly. Neurospora
crassa and Podospora anserina are two representative ascomycetes in
which the mating systems have been well characterised by molecular
analyses. In N. crassa, mat a-1 and mat A-1 are the two genes responsible
for a and A mating specificity, respectively. Two additional genes,
mat A-2 and mat A-3, with opposite orientations, are present in the
region adjacent to mat A-1. In P. anserina, FPR1 is the only gene present
in the mat+ idiomorph and is sufficient to induce fertilisation; in
contrast, FMR1 and two additional genes, SMR1 and SMR2, are required for
the mat- strain to develop perithecia to maturity.
Heterothallic species require a partner for mating, whereas homothal-
lic species are able to self-mate. The difference between heterothallic
species and homothallic species is not due to the presence or absence of
mating-type genes. Sequences similar to mating types have been identi-
fied and functionally characterised in all the species tested, whether they
are heterothallic or homothallic. Mating type genes are even present in
asexual species, for example, asexual Bipolaris sacchari has a homolog of
the MAT-2 gene of the related species C. heterostrophus. The process of
sexual development is identical in homothallic and heterothallic species.
Homothallic filamentous ascomycetes, even though individual nuclei
contain both types of mating-type information, could be functionally
heterothallic through a proposed mechanism allowing alternate expression
of either mating type.
Mating may serve as a model for the study of developmental genetics
and could help in elucidating regulatory mechanisms of multicellularity
and sexual dimorphism. Mating systems are divergent in ascomycetes.
The presence of mating-type genes does not determine the mode of sexual
reproduction. Because the changes in modes of sexual reproduction are
frequent and disruption of sexual function is tolerated in ascomycetous
fungi, the presence or absence of particular genetic components involved
in the mating system is not necessarily a good indicator of which
reproductive mode a fungus has adopted.
5.3 Materials and Methods
Protein sequences of fungal sex-related genes downloaded from GenBank
were used as queries against the P. marneffei genome sequences. The
comparison was conducted using the NCBI TBLASTN program 2.0 with
the BLOSUM62 scoring matrix [6]. The E-value cutoff used to assign
homologues was 1.0e-20. The contigs of the P. marneffei genome that
contained homologues were extracted and annotated manually. Each
annotated gene was given a sequential locus number of the form Pm## to
identify it uniquely. Each gene also has a version attribute (so loci
are in fact displayed as Pm##.version). Predicted peptides were compared
to the amino acid sequences of their corresponding query proteins using
NCBI BLAST2SEQ (http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html). The
expect value statistics were calculated based on the size of the NCBI
non-redundant protein database. Conserved domains/motifs were identified
using InterPro release 5.1 [367]. Multiple alignments of amino acid
sequences were performed using the program ClustalX 1.81 [311].
Adjustments to the alignments were performed manually. Graphic
presentations of the alignments and consensus sequences were prepared
using the program BOXSHADE 3.21
(http://www.ch.embnet.org/software/BOX_form.html).
In addition to the degree of sequence similarity, several lines of supple-
mentary information were used to further support gene homology. These
include: (i) conserved positions of intron(s) between homologs, which
argue for a common ancestor of the genes studied; (ii) phylogenetic trees
constructed from aligned genes, so that the closest homolog can be
identified when paralogous genes are present; (iii) identified features
characteristic of the family to which a gene belongs.
Phylogenetic trees were inferred by the neighbour-joining method
[273]. Genetic distances between protein sequences were estimated using
the WAG amino-acid substitution model [342] implemented in MBEToolbox
(Chapter 10).

Figure 5.1: Comparison of the mating-type loci in P. marneffei and other
fungi. Boxes interrupted by gaps represent the coding sequences of the
genes and the introns, respectively. Arrows indicate the directions of
genes. Dashed lines indicate that the genes linked together are present
in the genome of the same isolate. Symbols: dark-gray bar, conserved
HMG-box domain; light-gray bar, conserved alpha-box motif. Labels in the
figure: A. fumigatus, AfMAT-2 (Af59.m09249); A. nidulans, AnMAT-1
(AY339600/AN2755.2) and AnMAT-2 (AF508279/AN4734.2*), Chromosome 3 and
Chromosome 6; P. marneffei, PmMAT-1 (Pm1.126); N. crassa, mat a-1
(M54787), mat A-1, mat A-3, mat A-2; S. cerevisiae, MATalpha2, MATalpha1,
MATa1; S. pombe, mat1-P or mat1-M, mat2-P, mat3-M (spacers of 15 kb and
11 kb).
5.4 Results and Discussion
The close relationship between the Penicillium and Aspergillus genera
has been well established based on various sources of evidence. It is
further supported by our recent comparative study of the mitochondrial
genome of P. marneffei and those of other fungi (Chapter 3). This has
prompted the search for previously undiscovered characteristics in
P. marneffei based on our knowledge of the various Aspergillus species.
5.4.1 Homologs of known sexual genes
With respect to the potential mating system of P. marneffei, A. nidu-
lans is of particular interest as this model species has two distinctive
reproductive developmental processes: sexual and asexual development.
We used a set of empirically selected A. nidulans genes involved in sex-
ual development as queries to identify their homologs in P. marneffei.
These genes are veA, medA, tubB, phoA and nsdD. The veA gene was
first known to mediate the light response as early as 1965 [156]. It was
later found to be required for cleistothecium and ascospore formation as
well [159]. The veA1 mutant is unable to develop sexual structures, and
asexual sporulation in this mutant is promoted and increased [164],
implying that the veA gene plays a key role in activating sexual
development and/or inhibiting asexual development. A. nidulans medA
(GenBank Acc.: AAC31205) encodes a transcriptional regulator of sexual and
asexual reproduction. tubB, one of two genes encoding alpha-tubulin, is
involved in the processes of karyogamy and meiosis I [167, 168], but it
is not required for vegetative growth or asexual reproduction, nor is it
required for the initiation or early stages of sexual differentiation. The
gene nsdD encodes a GATA-type transcription factor that functions in
activating sexual development [124]. The gene phoA [33], like stuA [222],
is involved in the biosynthesis of tryptophan and has been identified as
being involved in sexual development [77,314,355].
As in A. nidulans veA, the predicted P. marneffei veA contains one
intron with conserved boundaries. The predicted P. marneffei MedA
(741 aa) shows 49% amino-acid identity to A. nidulans MedA (600 aa)
within an alignable region of 555 aa. The predicted P. marneffei tubB
and phoA are highly conserved, sharing 83 and 80% identical amino acid
residues with A. nidulans tubB and phoA, respectively. The predicted P.
marneffei NsdD consists of 385 amino acid residues and, like A. nidulans
NsdD, is rich in proline (13.8 and 11.3%) and serine (13.8 and 13.4%).
Both have the type IVb C-X2-C-X18-C-X2-C zinc finger DNA-binding
domains at their C-termini.
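The C-X2-C-X18-C-X2-C spacing and the residue-composition figures quoted above can be checked mechanically. The following is a minimal sketch, not the procedure used in the thesis: the function names and the toy sequence are hypothetical, and the regular expression simply transcribes the spacing pattern given in the text.

```python
import re

# Spacing pattern quoted in the text: C-X2-C-X18-C-X2-C (X = any residue).
ZINC_FINGER_IVB = re.compile(r"C.{2}C.{18}C.{2}C")


def find_zinc_fingers(protein):
    """Return (1-based start, matched text) for each non-overlapping match."""
    return [(m.start() + 1, m.group()) for m in ZINC_FINGER_IVB.finditer(protein)]


def residue_percent(protein, residue):
    """Percentage of positions in `protein` occupied by `residue`."""
    return 100.0 * protein.count(residue) / len(protein)


# Hypothetical toy sequence, NOT a real NsdD sequence.
toy = "MPSP" + "C" + "AA" + "C" + "G" * 18 + "C" + "RL" + "C" + "SSPK"

if __name__ == "__main__":
    print(find_zinc_fingers(toy))
    print(round(residue_percent(toy, "P"), 1), round(residue_percent(toy, "S"), 1))
```

Applied to a real predicted protein, `find_zinc_fingers` would report where the motif sits relative to the C-terminus, and `residue_percent` would give proline and serine contents of the kind quoted for NsdD.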
We also identified homologs of two inhibitors of sexual processes, lsdA
and rosA, in P. marneffei. LsdA is expressed abundantly at the late
sexual developmental stage of A. nidulans. Disruption of lsdA causes the
preferential formation of sexual structures even under conditions, such
as high salt concentration, in which sexual development in the wild
type is inhibited [191]. Hence, the lsdA gene inhibits sexual development
in the presence of sex-inhibiting environmental signals. Under low-carbon
conditions and in submerged culture, A. nidulans RosA is also a repressor
of the initiation of sexual development [331]. The predicted P. marneffei
lsdA encodes a 350 amino-acid polypeptide which, compared with the
356 amino-acid A. nidulans LsdA, shares 43% identical and 60% similar
amino-acid residues. The predicted P. marneffei RosA exhibits 57%
amino-acid identity to A. nidulans RosA. The position of the larger intron
of P. marneffei rosA is the same as that in the orthologs of A. nidulans,
Sordaria brevicollis and N. crassa. At the N terminus of P. marneffei RosA,
the highly conserved Zn(II)2Cys6 motif, a putative bipartite nuclear
localisation signal and a DNA-binding domain are predicted.
In summary, although studies of the molecular mechanisms controlling
sexual development in filamentous fungi are very limited, several sexual
genes that have been identified, isolated and characterised in A. nidulans
enabled us to find their homologs in P. marneffei. This finding is in
line with the two genes mentioned above, stuA [222] and steA [19],
which have been experimentally characterised in both A. nidulans and
P. marneffei, revealing the functional exchangeability of corresponding
homologs. The presence of these faithful homologs suggests that
sexual development is potentially possible in P. marneffei. However, this
inference is weakened by the fact that many sexual genes function not
only in sexual development but also in other processes, such as secondary
metabolism. Hence, homologous sexual genes in P. marneffei might be
responsible for processes that are unrelated to sexual development, and
further evidence is needed to draw a conclusion.

Figure 5.2: Comparison of the alpha1 domain of MAT proteins of filamentous ascomycetes. The amino acid sequence alignments are as follows: putative P. marneffei MAT-1 (Pm1.126); putative A. nidulans MAT-1 (AN2755.2); N. crassa mat A-1; Paecilomyces tenuipes MAT1-1-1; Gibberella fujikuroi MAT-1-1; Alternaria alternata MAT-1; Pyrenopeziza brassicae alpha-1 domain protein (CAA06844.1); Gibberella zeae MAT1-1-1; Fusarium oxysporum MAT-1; Cochliobolus ellisii MAT-1; Podospora anserina FMR1. The arrow indicates the conserved position of introns.
5.4.2 Mating type genes
Fungi are capable of sexual reproduction by using either heterothallic
(self-sterile) or homothallic (self-fertile) mating strategies. In most as-
comycetes, mating ability is controlled by a single mating type locus,
MAT, with two alternate forms (MAT-1 and MAT-2) called idiomorphs.
MAT-1 and/or MAT-2 mediate not only mating, but also several other
key processes, including secretion of and response to pheromones and
vegetative incompatibility. In heterothallic ascomycetes, these alternate
idiomorphs reside in different nuclei. In contrast, most homothallic as-
comycetes carry both MAT-1 and MAT-2 in a single nucleus, usually
closely linked.
A. nidulans is a homothallic ascomycete. A. nidulans MAT-2 (AnMAT-2)
has previously been characterised using ‘classic’ molecular biological
techniques [76], while A. nidulans MAT-1 (AnMAT-1, GenBank Acc.
BK001307) was found by similarity searching [76]. In the MIT A.
nidulans genome database, two annotated genes on different contigs,
AN2755.2 and AN4734.2, are in fact AnMAT-1 and AnMAT-2, respectively.
Note that AN4734.2 differs slightly from AnMAT-2 (GenBank Acc.
AF508279), simply because the sequences come from different isolates of
A. nidulans. In contrast to A. nidulans, only MAT-2 has been identified
by genome analyses of A. fumigatus [253,326]. AfMAT-2 encodes a
regulatory protein with a high mobility group (HMG) DNA-binding domain
[320], which is the characteristic feature of MAT-2 genes. No homologue
of the MAT-1 gene sequence of any of the tested fungi was found in the
TIGR A. fumigatus genomic database. This suggests that A. fumigatus is
perhaps a heterothallic rather than a homothallic ascomycete (as all
homothallic euascomycetes so far analysed contain either only MAT-1 or
both MAT-1 and MAT-2 [252]), and that the genome sequence was obtained
from a MAT-2 strain.

Figure 5.3: Gene organisation around the MAT locus of A. nidulans and the putative MAT loci of P. marneffei and A. fumigatus. AnMAT-1 and AnMAT-2 are A. nidulans MAT-1 and MAT-2, located on contigs 47 and 27 of the unfinished A. nidulans genome, respectively. Genes shown: P. marneffei Pm1.124 to Pm1.129, with PmMAT-1 (Pm1.126); A. nidulans contig 47, AN2753.2 to AN2756.2, with AnMAT-1 (AN2755.2); A. nidulans contig 27, AN4732.2 to AN4737.2, with AnMAT-2 (AN4734.2); A. fumigatus Af59.m09246 to Af59.m09250 and Af59.m09500, with AfMAT-2 (Af59.m09249). Connecting lines mark two relationships, "is neighbor" and "is homolog"; annotated flanking genes include a DNA lyase and a cytoskeleton assembly control protein.
Using this pair of Aspergillus species closely related to P.
marneffei, the homothallic A. nidulans and the possibly heterothallic A.
fumigatus, as models, we undertook a series of MAT searches to determine
whether P. marneffei has a hypothetical MAT locus and, if so, whether
it carries both MAT1-1 and MAT1-2 genes. Through BLAST
searches, we identified a putative mating-type (PmMAT) locus in P.
marneffei containing a conserved homolog of A. nidulans MAT-1
(AnMAT-1), denoted PmMAT-1 hereafter. The PmMAT-1
gene encodes a putative 348 amino-acid polypeptide which shares 38%
similarity with AnMAT-1 (361 aa) over its full length, and exhibits 59,
60, 61 and 60% similarity to the alpha-box domains of AnMAT-1, P.
brassicae MAT-1, G. fujikuroi MAT-1 and P. anserina MAT-1, respectively.
More importantly, the intron boundaries are conserved between the
putative PmMAT-1 and other fungal MAT-1 genes (Fig. 5.2).
Despite extensive genome sequence searches, we could not identify a
MAT-2-like gene in P. marneffei. Having only one mating-type gene is
similar to the situation in A. fumigatus, where, in contrast, it is MAT-1
that cannot be found. The other mating-type gene, P. marneffei MAT-2 or
A. fumigatus MAT-1, might be present in other isolates, as observed in
the asexual species Fusarium culmorum [163]. Alternatively, the other
putative mating-type gene could have become extinct, as observed in C.
neoformans populations and Ophiostoma novo-ulmi [356].
The former explanation seems more plausible after we identified pu-
tative mating-type loci in P. marneffei and A. fumigatus, which show
similarity to A. nidulans MAT-2 and MAT-1 regions, respectively. We
compared flanking genes of two mating-type loci to each other, as well as
to corresponding A. nidulans MAT-2 or MAT-1 regions (Fig. 5.3). Strik-
ing patterns were observed in the organisation of flanking genes where
several syntenies were identified. Comparing P. marneffei with A.
fumigatus, PmMAT-1 (Pm1.126) and AfMAT-2 (Af59.m09249) are oriented
differently, each upstream of a hypothetical gene (Pm1.127 and
Af59.m09250, respectively). The mating-type gene and the gene following
it occupy a unique region of ∼5 kb in both P. marneffei and A. fumigatus.
No significant similarity at the amino-acid or nucleotide level can be
detected between the two regions. Three pairs of homologous genes flank
the two regions: the first pair encodes homologues of the S. cerevisiae
SLA2-like cytoskeleton assembly control protein, and the other two encode
a putative DNA lyase and a protein of the cytochrome c oxidase subunit
VIa family. It therefore seems likely that the non-homologous regions in
P. marneffei and A. fumigatus are the mating-type loci of their
respective idiomorphic types. The mating-type locus of the other
idiomorphic type might be found in other isolates. This suggests that
P. marneffei and A. fumigatus are heterothallic fungi.
Together with N. crassa, we now have the schematic organisation of
mating-type loci from four filamentous fungi whose genome sequences
are complete or almost complete (Fig. 5.1). Comparing them with those
of yeasts, we note that the mating-type DNA regions of filamentous
fungi are generally larger than those of S. cerevisiae [10] or
S. pombe [162]. In the fission yeast S. pombe, the mating-type region
comprises three linked loci, mat1, mat2 and mat3, which occupy about 30
kb of DNA on chromosome II [14]. The mat1 locus determines the cell
type, depending on whether it carries P (for plus) or M (for minus)
information. The mat2-P and mat3-M loci are transcriptionally silent and
act as donors of information for switching mat1 DNA by gene conversion.
There is no similar arrangement of mating-type regions in P. marneffei;
however, it is noteworthy that other genes, such as Pm6.88 in P.
marneffei and AN1962.2 in A. nidulans, show similarity to the HMG
mating-type genes. They are not ‘true’ MAT-2 family mating-type genes
because they lack the intron at the conserved position and some other
conserved motifs that are seen only in the MAT-2 gene. Nor are they
located at the MAT locus, unlike the situation in other filamentous
fungi, such as N. crassa, which may have an additional HMG gene at the
MAT-1 idiomorph involved in fertility. These extra HMG genes cannot be
silent copies of MAT genes of the kind seen in the yeasts. However, they
may in theory have some role in fertility, which will need experimental
investigation [Dr Paul S. Dyer, personal communication].
Finally, the detection of mating-type genes, which play roles in sexual
signalling between compatible heterothallic isolates, in a ‘selfing’
fungus like A. nidulans is noteworthy in itself. As suggested by
Dyer [76], this observation can be explained either by the evolution of
heterothallic species towards a homothallic form or vice versa. Taking our
observations from the P. marneffei genome into account, we consider
the former interpretation more plausible, i.e., the homothallic A. nidulans
originated from a heterothallic common ancestor of Penicillium and
Aspergillus.
5.4.3 Mating pheromone precursor genes
The nucleotide and deduced amino-acid sequences of the pheromone
precursor genes from several fungi were used to search the P. marneffei
genome. Even after intensive searches, however, no significant similarity
was found (BLAST E-value cutoff = 10). As mentioned in a previous
section (Section 1.4.2), syntenic comparisons suggest that the original
mating pheromone precursor loci may have been lost in P. marneffei.
However, we cannot exclude the possibility that the P. marneffei mating
pheromone precursor genes are so highly specific, and hence so divergent,
that they cannot be detected by similarity searches.
Figure 5.4: Predicted P. marneffei homologues of the genes involved in the biogenesis of the a-factor pheromones in S. cerevisiae. The a-factor biosynthetic intermediates and the components of the a-factor biogenesis machinery are shown (see the text for more information). Several of the a-factor intermediates can be directly visualised by SDS-PAGE and are designated P0, P1, P2, and M [49]. The a-factor precursor contains an N-terminal extension, a mature portion, and a C-terminal CAAX motif, as indicated at top. During a-factor biogenesis, the unmodified a-factor precursor (P0) undergoes C-terminal modification (prenylation, proteolytic cleavage of AAX, and carboxylmethylation) to yield the fully C-terminally modified species P1. Next, N-terminal proteolytic processing occurs in two distinct steps, the first (P1→P2) cleavage removing seven residues from the N-terminal extension to yield the P2 species, and the second (P2→M) cleavage generating mature a-factor, which is exported from the cell. The corresponding components predicted from P. marneffei are given. Among them, AXL1 has not been identified.
Components shown (S. cerevisiae protein, with P. marneffei match in parentheses):
C-terminal CAAX modification: farnesylation, Ram1p (Pm6.49) and Ram2p (Pm60.30); AAX proteolysis, Rce1p (Pm96.20) and Ste24p (Pm60.4); carboxylmethylation, Ste14p (Pm92.26).
N-terminal processing: P1→P2 proteolysis, Ste24p (Pm60.4); P2→M proteolysis, Axl1p (no match) and Ste23p (Pm134.14).
Export: Ste6p (Pm125.22).
Table 5.2: Pheromone-processing enzymes encoded by the putative P. marneffei genes, as shown by a BLAST search of the P. marneffei genome.

Sc protein (aa) | Function | Pm protein (aa) | E-value, identity and similarity in overlap
Kex1p (729) | Carboxypeptidase; α-factor processing | Pm76.8 (672) | 4e-057, 124/350 (35%), 183/350 (52%)
Kex2p (814) | Endoprotease; α-factor processing | Pm6.3 (813) | 1e-154, 302/774 (39%), 428/774 (55%)
Ste13p (931) | Dipeptidyl aminopeptidase; α-factor processing | Pm10.77 (899) | 1e-128, 263/787 (33%), 399/787 (50%)
Ram2p (316) | CaaX farnesyltransferase α subunit; a-factor modification | Pm60.30 (350) | 5e-051, 124/354 (35%), 177/354 (50%)
Ram1p (431) | CaaX farnesyltransferase β subunit; a-factor modification | Pm6.49 (635) | 6e-050, 114/329 (34%), 157/329 (47%)
Rce1p (315) | CaaX protease; a-factor C-terminal processing | Pm96.20 (333) | 3e-025, 79/263 (30%), 132/263 (50%)
Ste14p (239) | Prenylcysteine carboxylmethyltransferase | Pm92.26 (259) | 1e-034, 61/134 (45%), 87/134 (64%)
Ste24p (453) | CaaX prenyl protease; N- and C-terminal a-factor processing | Pm60.4 (456) | 1e-115, 202/446 (45%), 274/446 (61%)
Ste23p (988) | Metalloprotease involved, with homolog Axl1p, in N-terminal processing of pro-a-factor to the mature form | Pm134.44 (1012) | 0.0, 369/947 (38%), 562/947 (59%)
Ste6p (1290) | ATP-dependent multidrug efflux pump of a-factor | Pm125.22 (1262) | 1e-127, 335/1280 (26%), 580/1280 (45%)
5.4.4 Mating pheromone processing genes
The production of pheromones has provided important insights into pro-
protein processing in eukaryotic cells. The system has been well char-
acterised in S. cerevisiae (for review, see [62]). A budding yeast cell
produces either a-factor or α-factor, according to its mating type.
Either factor is synthesised as a precursor that undergoes multiple
maturation steps to generate its mature form. A number of S. cerevisiae
pheromone processing genes have been cloned and characterised [32]. We
used the protein sequences of all these genes in a BLAST search to iden-
tify pheromone-processing genes encoding putative homologous proteins
in P. marneffei. For all the query S. cerevisiae proteins, except Axl1p,
the corresponding P. marneffei homologs with high levels of amino-acid
similarity have been identified (Table 5.2). Hence, P. marneffei ap-
pears capable of synthesising/processing mating pheromones although
the pheromone precursor gene has not been identified by searching for
known pheromone precursor genes.
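The identity and similarity percentages in Table 5.2 follow directly from the match counts and overlap lengths reported there. The short sketch below recomputes them. It assumes that the percentages are truncated to integers rather than rounded, which is an inference from the table values themselves (for example, 399/787 is listed as 50%), and the function names are illustrative only.

```python
def blast_percent(matches, overlap):
    """Integer percentage, truncated (assumed convention, inferred from Table 5.2)."""
    return (100 * matches) // overlap


# (S. cerevisiae protein, identical residues, similar residues, overlap length),
# transcribed from Table 5.2.
TABLE_5_2 = [
    ("Kex1p", 124, 183, 350),
    ("Kex2p", 302, 428, 774),
    ("Ste13p", 263, 399, 787),
    ("Ram2p", 124, 177, 354),
    ("Ram1p", 114, 157, 329),
    ("Rce1p", 79, 132, 263),
    ("Ste14p", 61, 87, 134),
    ("Ste24p", 202, 274, 446),
    ("Ste23p", 369, 562, 947),
    ("Ste6p", 335, 580, 1280),
]


def summarise(rows):
    """Map each protein name to (identity %, similarity %)."""
    return {name: (blast_percent(ident, n), blast_percent(simil, n))
            for name, ident, simil, n in rows}


if __name__ == "__main__":
    for name, (ident, simil) in summarise(TABLE_5_2).items():
        print(f"{name}: {ident}% identity, {simil}% similarity")
```

Note that these figures apply only to the aligned overlap, not to the full length of either protein.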
Genes involved in the processing of α-factor and a-factor are different.
In the case of α-factor, the maturation requires signal cleavage, glycosy-
lation and proteolytic processing by three peptidases encoded by KEX2,
KEX1 and STE13. The S. cerevisiae KEX2 gene encoding kexin belongs
to the prohormone convertase family, which has been identified in many
species. The S. cerevisiae Kex2p is membrane-bound and cleaves pep-
tide substrates at both Lys-Arg and Arg-Arg sites [26, 100]. A previous
study has shown that mutant Kex2p enzyme molecules lacking as many
as 200 C-terminal residues still retain protease activity. Although not
essential for enzymatic activity, the C-terminal cytoplasmic tail contains
a localisation signal that directs Kex2p to a late compartment of
the Golgi complex. The predicted P. marneffei Kex2p shows high overall
similarity (55%) to S. cerevisiae Kex2p, with slightly lower similarity
at the C-terminal residues; hence, the predicted P. marneffei Kex2p
possibly has protease activity but may be localised differently. The S.
cerevisiae KEX1 gene encodes a carboxypeptidase that removes the Lys-Arg
residues exposed at the C-terminus of the α-factor precursor following
digestion by kexin [60, 70, 188]. As with Kex2p, the C-terminal residues
of S. cerevisiae Kex1p are not highly conserved in the P. marneffei
homolog, again suggesting a difference in protein localisation between
the species. P. marneffei is predicted to have a homolog of S. cerevisiae
Ste13p, a type IV dipeptidyl aminopeptidase that trims N-terminal x-Ala
dipeptides from the α-factor precursors [154].
a-factor undergoes three major maturation stages: C-terminal mod-
ification, N-terminal modification, and export [49], which involve genes
RAM2, RAM1, RCE1, STE14, STE24/AFC1, STE23, AXL1 and STE6
(Fig. 5.4). The S. cerevisiae RAM2 and RAM1 genes encode the α
and β subunits of farnesyltransferase (FTase), respectively [129]. FTase
catalyses the addition of 15-carbon (farnesyl) groups to a-factor
destined for cell membranes [260]. RAM2 and RAM1 are conserved genes
that have mammalian counterparts. RAM2 is essential for the viability
of C. albicans, while RAM1 is essential to C. neoformans, indicating
that protein prenylation is an indispensable cellular process in these
opportunistic yeast pathogens. The predicted P. marneffei Ram1p shows
high levels of similarity to S. cerevisiae Ram1p (51%) and to
mammalian protein farnesyltransferase β subunits (e.g. 55% similarity to
rat fntb). The predicted P. marneffei Ram2p shows 50% similarity to
S. cerevisiae Ram2p, with both containing at least three PPTA (Pfam
acc. PF01239) domains at their N-termini. The S. cerevisiae RCE1 gene
encodes an AAX prenyl protease [21]. The Rce1p sequence contains
three potential transmembrane domains but has no other defining
features and no significant similarity to other proteins; hence it may
belong to a novel superfamily [247]. The predicted P. marneffei Rce1p,
which is 50% similar, also contains multiple potential transmembrane
domains. More importantly, the three putative zinc-binding residues
(E156A, H184A, H248A) and Cys (C251) are all conserved. Mutating
each of these residues inactivates the protease [72]. The S. cerevisiae
STE14 encodes a carboxyl methyltransferase that methylates a-factor.
The predicted P. marneffei Ste14p, containing multiple predicted trans-
membrane spans, shares 64% similarity with S. cerevisiae Ste14p. The
S. cerevisiae Ste24p, a membrane-associated metalloprotease, is required
for the first step of N-terminal processing of a-factor [99]. The predicted
P. marneffei Ste24p shows 60% similarity to its counterpart. Like S.
cerevisiae Ste24p, P. marneffei Ste24p (at position 299 to 303) has a Zn-
dependent metalloprotease motif (HEXXH) [304]. It also matches the
larger consensus sequence characteristic of neutral Zn metalloproteases,
and contains multiple predicted transmembrane regions. Unlike S. cere-
visiae Ste24p, however, the C-terminal di-lysine motif, KKXX (K is Lys)
is replaced with KXXX in P. marneffei Ste24p. Our analysis reveals that
the predicted Ste24p homologs in A. fumigatus (AF58.m07859) and N.
crassa (NCU03637.2) also have the replacement of the di-lysine motif.
Since the di-lysine motif at the C-terminus of many proteins facilitates
their retrieval from the Golgi complex to the ER [310], this suggests
that, whereas Ste24p in S. cerevisiae is localised to the ER, this may
not be the case in P. marneffei or the other two filamentous fungi. The
S. cerevisiae metalloprotease Ste23p, a member of the insulin-degrading enzyme
family, is involved in N-terminal processing of pro-a-factor to the mature
form. Axl1p is a paralog to Ste23p. In S. cerevisiae, Ste23p and Axl1p
proteins show 22% identity and 39% similarity throughout their entire
length and Ste23p performs a role at least partially redundant with that
of Axl1p in a-factor processing [1]. In P. marneffei, I identified a pu-
tative homolog of Ste23p but not Axl1p. P. marneffei Ste23p is highly
conserved, showing 59% similarity to S. cerevisiae Ste23p. We argue that
since STE23 genes are present in S. cerevisiae and P. marneffei while
AXL1 is present in S. cerevisiae only, it is possible that AXL1 was cre-
ated by duplication of the gene STE23 after the separation of the two
species. Moreover, S. cerevisiae STE23 and AXL1 may be an example of
duplicate genes that undergo subfunctionalisation, through which Axl1p
gains a new role in controlling the axial budding pattern of haploid cells
while retaining partial STE23 functions in processing a-factor. Finally,
unlike α-factor, which is exported from MATα cells via the classical
secretion pathway, a-factor is pumped out of the cell by the MATa
cell-specific protein Ste6p. A homolog of Ste6p was identified in P.
marneffei, with multiple transmembrane domains and two ATP-binding domains.
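The two Ste24p sequence features discussed above, the HEXXH metalloprotease motif and the C-terminal di-lysine (KKXX) signal, are simple enough to test with pattern matching. The sketch below is illustrative rather than the method used in the thesis: the function names are hypothetical and the toy sequences are artificial, constructed only so that the motif falls at positions 299 to 303 as reported for P. marneffei Ste24p.

```python
import re


def find_hexxh(protein):
    """Return 1-based (start, end) positions of every HEXXH match.

    A lookahead is used so that overlapping matches are also reported.
    """
    return [(m.start(1) + 1, m.start(1) + 5)
            for m in re.finditer(r"(?=(HE..H))", protein)]


def has_dilysine_signal(protein):
    """True if the last four residues fit the KKXX pattern."""
    return len(protein) >= 4 and protein[-4:-2] == "KK"


# Artificial sequences, NOT real Ste24p sequences.
toy_kkxx = "A" * 298 + "HEAAH" + "G" * 10 + "KKQS"   # ends in KKXX
toy_kxxx = "A" * 298 + "HEAAH" + "G" * 10 + "KQSS"   # di-lysine replaced by KXXX

if __name__ == "__main__":
    print(find_hexxh(toy_kkxx))
    print(has_dilysine_signal(toy_kkxx), has_dilysine_signal(toy_kxxx))
```

A positive pattern match only flags a candidate; whether the motif is functional, or whether the protein is actually retained in the ER, still requires experimental evidence.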
5.4.5 Mating pheromone receptor and other GPCRs
In S. cerevisiae, a- or α-factor binds to cell-type-specific receptors
encoded by STE2 or STE3. STE2 is expressed in a cells and its product
recognises α-factor, whereas STE3 is expressed in α cells and its product
recognises a-factor. This binding is essential for signalling the mating
process between haploid cells. In A. nidulans, Han et al. [125]
identified 9 genes, gprA∼I, belonging to the GPCR family. Among them,
gprA and gprB are putative orthologs of STE2 and STE3. gprD is similar
to the yeast glucose-sensing Gpr1p [176] and plays a key role in
coordinating hyphal growth and sexual development. Using these A.
nidulans GPCRs as queries, I identified 7 P. marneffei GPCRs closely
related to them. A phylogeny reconstructed from a collection of fungal
GPCRs indicates several distinct families, and the seven P. marneffei
GPCRs are distributed across all of them. They all contain multiple
predicted transmembrane domains, which is one of the characteristic
features of GPCRs. Han et al. [125] also reported that 7 putative GPCRs
have been found in the A. fumigatus genome. It would be interesting to
re-analyse this gene family when the gene sequences from all three
genomes of these closely related species become available.
Our results indicate that P. marneffei might have a recent
evolutionary history of sexual recombination and might have the
potential for sexual reproduction. The possible presence of a sexual
cycle is highly significant for the population biology and disease
management of the species.
Chapter 6
EXPLORING THE GENETIC COMPONENTS
ASSOCIATED WITH THE DIMORPHISM OF
PENICILLIUM MARNEFFEI
Penicillium marneffei accommodates both complex asexual development
and dimorphic switching programs, and is therefore a valuable system
for the study of morphogenesis and pathogenicity. The study of
the morphogenetic programs of P. marneffei has recently been greatly
facilitated by the development of molecular genetic techniques, but we
are only beginning to uncover the determinants that control these
events, and the overall picture remains blurred. This chapter offers a
systematic exploration of the genetic components in the genome of P.
marneffei that may be responsible for its morphogenetic processes,
mainly through sequence analysis in the context of comparative
genomics. This will provide insights into the biology of P. marneffei
and its pathogenic capacity.
6.1 Introduction
Dimorphism, the ability to switch between a cellular yeast form and a
filamentous form, is a common morphogenetic feature in many fungi, de-
spite their enormous diversity in size and shape. The change of growth
form is believed to be effected by an altered programme of gene expres-
sion, which is induced by a wide range of metabolic and environmental
factors. In Saccharomyces, it is starvation for nitrogen; in Candida, it is
serum (among other things); in Ustilago, it is a putative molecular signal
from the host plant; and in P. marneffei, it is apparently temperature.
Note that environmentally conditioned dimorphism is reversible.
The yeast form is characterised by round or ovoid unicellular
organisms that divide mitotically, either by budding or by fission, to
form two independent daughters. Filamentous or mould forms are more
complex multicellular structures. The filaments are characterised by long,
thin, parallel-walled tubes that grow by apical extension, with occasional
branching at an angle from the original direction of growth. In contrast
to yeast cells, filamentous cells do not separate after nuclear division
but instead form septa between cellular units that remain physically
associated with the mother cell.
There is a growing body of evidence suggesting that morphogenesis
is a crucial determinant of fungal pathogenicity in both plants and
animals. In Magnaporthe grisea, for example, MAPK and cAMP signalling
promote the formation of a highly specialised infection structure,
the appressorium, which is essential for invasion of the host [223].
Most dimorphic fungal pathogens, including P. marneffei, Blastomyces
dermatitidis, Coccidioides immitis, Histoplasma capsulatum and
Paracoccidioides brasiliensis, typically enter the body as spores or,
possibly, mycelial fragments via the lungs and grow as yeast forms in
the body. Pathogenic
Cryptococcus neoformans has been shown to form self-fertilising, diploid
strains that are thermally dimorphic [286]. Aspergillus fumigatus spores
establish invasive disease in lung tissue exclusively by hyphal develop-
ment.
Because of the prevalence of dimorphism among human pathogenic
fungi, it is of interest and importance to identify the molecules neces-
sary for the morphologic switch. However, the mechanism of thermal di-
morphism of P. marneffei remains unknown. Nevertheless, since fungal
dimorphism has been seen by many investigators as a useful model of dif-
ferentiation in eukaryotic systems, significant progress has been achieved
in the study of morphogenesis in other fungi. The approach taken in
this chapter is to review this progress (especially experimental
developments) achieved in recent years in the field of fungal genetics.
These developments have suggested models and hypotheses for
understanding the regulation of the molecular mechanisms involved in
fungal differentiation. Comparative sequence analysis is adopted to
explore the genetic components that may be involved in the morphogenesis
of P. marneffei. Specifically, we would like to know whether P.
marneffei possesses specific (probably temperature-sensitive) cellular
sensors to detect external stimuli, unique signal transduction pathways
that translate the external stimuli into biochemical messages that alter
genomic expression levels, or an enhanced ability for structural
reorganisation resulting in morphological change.
It is noteworthy that the comparative genomics approach adopted
in this chapter is impaired by the lack of genome sequence information
from other truly dimorphic fungi. Moreover, even if the genome sequences
of Blastomyces dermatitidis, Coccidioides immitis, Histoplasma capsulatum
or Paracoccidioides brasiliensis were available, the comparative
genomics approach might still be handicapped by the large genetic
distance between P. marneffei and these divergent species. The
following analysis is therefore mainly limited to comparisons between P.
marneffei and Aspergillus species.
6.2 Materials and Methods
6.2.1 Sequence similarity
To identify homologous genes in the P. marneffei genome, protein
sequences derived from target genes were used as queries against the P.
marneffei genome. Sequence similarity searches were performed using
BLASTP or PSI-BLAST against selected fungal genomes downloaded from
GenBank. The searches were also performed against an in-house database
composed of whole-genome sequences of several fungal species from
finished and ongoing sequencing projects. The comparisons were
conducted using the BLOSUM62 scoring matrix [6]. The E-value cutoff used
to assign homologues was 1e-5, unless otherwise stated. Conserved
domains/motifs were identified using InterPro release 5.1 [367].
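As an illustration of how an E-value cutoff of this kind can be applied, the sketch below filters BLAST hits and keeps the best hit per query. It is a minimal sketch under stated assumptions, not the pipeline used in the thesis: it assumes tab-separated BLAST output with the standard twelve columns (query, subject, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, E-value, bit score), and the query and subject names in the example are invented.

```python
EVALUE_CUTOFF = 1e-5  # cutoff stated in the text


def best_hits(tabular_text, cutoff=EVALUE_CUTOFF):
    """Return {query: (subject, evalue)} for the lowest-E-value hit per query.

    Hits with an E-value above `cutoff` are discarded. Assumes 12-column,
    tab-separated BLAST output; column 11 (index 10) holds the E-value.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])
        if evalue > cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Invented rows for illustration only; these are not results from the thesis.
EXAMPLE = "\n".join([
    "queryA\tsubject1\t45.0\t300\t150\t5\t1\t300\t1\t300\t1e-60\t220",
    "queryA\tsubject2\t25.0\t120\t85\t5\t1\t120\t1\t120\t0.002\t35",
    "queryB\tsubject3\t22.0\t80\t60\t2\t1\t80\t1\t80\t3.5\t20",
])

if __name__ == "__main__":
    print(best_hits(EXAMPLE))
```

Passing a different `cutoff` reproduces searches run with another threshold, as where the text states that the default does not apply.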
6.2.2 Phylogenetic Analysis
Protein sequences were aligned using PROBCONS [71] and columns of
low conservation removed manually. Phylogenetic trees were inferred
by the neighbour-joining method [273]. The alignments were also used
to infer maximum-likelihood trees. The maximum-likelihood trees were
constructed using the PHYLIP package [86], applying the JTT substi-
tution model with a gamma distribution (alpha = 0.5) of rates over
four categories of variable sites. In general, the maximum-likelihood and
neighbour-joining trees were congruent.
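For readers unfamiliar with neighbour-joining, the sketch below shows the core of the method: a single joining step that selects the pair of taxa minimising the Q criterion and computes their branch lengths to the new internal node. It is a simplified illustration, not the implementation used for the trees in this chapter. The function name and the five-taxon distance matrix are hypothetical, and a full tree is obtained by repeating this step on the reduced matrix.

```python
def nj_first_join(names, dist):
    """Perform one neighbour-joining step on a symmetric distance matrix.

    Returns ((name_i, name_j), (branch_i, branch_j), q_value) for the pair
    with the smallest Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j), where r(k)
    is the sum of distances from taxon k to all other taxa. Requires n > 2.
    """
    n = len(names)
    totals = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    q, i, j = best
    # Branch lengths from the joined taxa to the new internal node.
    branch_i = 0.5 * dist[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    branch_j = dist[i][j] - branch_i
    return (names[i], names[j]), (branch_i, branch_j), q


# Toy distance matrix for illustration only; not data from the thesis.
NAMES = ["a", "b", "c", "d", "e"]
DIST = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

if __name__ == "__main__":
    print(nj_first_join(NAMES, DIST))
```

In this toy matrix the smallest raw distance is between d and e, yet the Q criterion joins a and b first, which shows how neighbour-joining corrects for unequal rates among lineages.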
6.3 Results and Discussion
It has long been assumed that morphogenesis and virulence are associated
in dimorphic fungi, as one morphotype exists in the environment or
during commensalism, and another within the host during the invasive
process. For instance, P. marneffei lives outside the host as an
environmental saprophytic mould. Its primary infectious form may be
conidia or mycelial fragments aerosolised from disturbed soil or animal
excreta. After entering the host via the respiratory route upon
inhalation, the cells rapidly convert to the yeast form. The same is
true of the other dimorphic fungi, such as B. dermatitidis, C. immitis,
H. capsulatum and P. brasiliensis.
From the perspective of the fungal cell, the phenomenon of dimorphic
switching can be divided into four interwoven events as follows [275]:
(i) perception of external stimuli by cellular sensors; (ii) transduction of
biochemical signal; (iii) alteration of the genomic expression, and (iv)
structural reorganization towards the morphological change.
6.3.1 Perception of external stimuli by cellular sensors
Table 6.1: GPCR family in P. marneffei and A. nidulans. Footnotes: ortholog relationship supported by synteny; when knocked out, no phenotypic changes. Abbreviations: Pm, P. marneffei; An, A. nidulans; Sc, S. cerevisiae; Af, A. fumigatus; Sp, S. pombe.

Family | An gene | Pm gene | Sc/Af homolog | Sp homolog
1 | gprA (AN2520.2) | Pm198.6 | Ste2 | -
2 | gprB (AN7743.2) | Pm20.41 | Ste3 | Map3
3 | gprC (AN3765.2) | - | - | -
3 | gprD (AN3387.2) | Pm14.37 | Gpr1 | Git3
3 | gprE (AN9199.2) | - | - | -
4 | gprF (AN5721.2) | Pm105.27 | AF54.m07020 | Stm1
4 | gprG (AN5720.2) | Pm34.71 | - | -
5 | gprH (AN8262.2) | Pm58.4 | AF53.m04209 | -
5 | gprI (AN8348.2) | Pm31.53 | - | -
Limited information about cellular sensors that detect external stim-
uli (especially temperature) is available for ascomycetes. Among known
receptors, G protein-coupled receptors (GPCRs) are key components of
heterotrimeric G protein-mediated signalling pathways. The receptors
detect environmental signals and confer rapid cellular responses. The
GPCR family has expanded in Aspergillus nidulans, as shown in recent
analyses of its genome: nine genes (gprA∼gprI) predicted to encode
seven-transmembrane-spanning GPCRs have been identified [125]. Among
them, the gprD gene was found
to play a central role in coordinating hyphal growth and sexual devel-
opment. Deletion of gprD causes extremely restricted hyphal growth,
delayed conidial germination and uncontrolled activation of sexual devel-
opment resulting in a small colony covered by sexual fruiting bodies. We
identified 7 P. marneffei GPCRs closely related to A. nidulans GPCRs
(Table 6.1). The phylogenetic tree of fungal GPCR family genes (Fig.
6.1) supports the assignment of these putative P. marneffei GPCRs to
their corresponding sub-families.
Figure 6.1: Phylogenetic tree of fungal GPCR family genes. Classification of fungal GPCR families was carried out by analyses of P. marneffei Pm198.6, Pm20.41, Pm14.37, Pm105.27, Pm34.71, Pm58.4 and Pm31.53, A. nidulans GprA∼GprI, A. fumigatus Af54.m07020, Af53.m04209, Saccharomyces cerevisiae Ste2p, Ste3p, Gpr1p, Schizosaccharomyces pombe Mam2p, Map3p, Git3p, Stm1p, Dictyostelium discoideum cAR1p and crlAp (GenBank Acc.: AAO62367) using PROBCONS [71]. Algorithm parameters: Gaps/Missing data - Pairwise Deletion; Distance method - Amino Gamma Model [Pairwise distances]; Tree making method - Neighbour-joining.
6.3.2 Transduction of biochemical signal
Studies combining the powerful genetic and genomics tools available in
fungi (mainly in Saccharomyces) have revealed three pathways that cou-
ple afferent signals to the dimorphic switch. Although many different
signals can induce filamentous development, the strategies for connect-
ing the external signal to the change in cell differentiation are broadly
conserved among the fungi. For example, studies show that distantly
related fungi (Saccharomyces, an ascomycete, and Cryptococcus, a
basidiomycete) use common STE12 family members to form filamentous
structures in response to nitrogen starvation, sharing a high degree of
conservation in the regulatory pathways that control filamentous growth.
Studies on signalling filamentous growth in S. cerevisiae have revealed
that four genes of the MAPK pathway that signals the mating pheromone
response are also required for filamentous growth of diploid cells and the
invasive growth of haploid cells (Fig. 1.6). These four genes are STE20,
STE11 and STE7, which encode three protein kinases that act in se-
quence, and STE12, which acts as a transcription factor at the terminus
of both pathways. In Fig. 1.6, all four genes are marked with asterisks,
indicating that their orthologs in P. marneffei have been identified
(see also Table 6.2). The STE20 homolog from P. marneffei, pakA (GenBank
Acc. AY621630; Pm80.15), is known to be essential during yeast but not
hyphal growth (Boyce KJ et al., personal communication). The STE12
homolog, stlA, has been cloned [19].
The P. marneffei stlA gene together with the A. nidulans steA and C.
neoformans STE12alpha genes form a distinct subclass of STE12 ho-
mologs that have a C2H2 zinc-finger motif in addition to the homeobox
domain that defines STE12 genes. The stlA gene had no detectable role
in vegetative growth, asexual development, or dimorphic switching
in P. marneffei. However, stlA complements the sexual defect of an A.
nidulans steA mutant [19]. These data suggest that although members
Figure 6.2: P. marneffei genes in cAMP pathway. Diagram labels: Ras2p (Pm85.8) and Gpa2p (Pm51.59); Cyr1p (Pm7.24), ATP to cAMP; Pde2p (Pm146.17), cAMP to AMP; PKA regulatory subunit (r), Bcy1p (Pm33.83); PKA catalytic subunits (c), Tpk1p, 2p, 3p (Pm18.86, Pm47.4, Pm19.3).
of the STE12 family of regulators are involved in controlling both mating
and yeast-hyphal transitions in a number of fungi, stlA in P. marneffei
may only play a role in controlling mating processes (see also Chapter 5)
but not dimorphic switching. There may be as yet undetected compen-
satory genes or pathways responsible for dimorphic switching.
Another pathway controlling filamentation in Saccharomyces is the cAMP
pathway (Fig. 6.2). Ras2p and Gpa2p are regulators of cAMP levels,
acting upstream of adenylate cyclase, Cyr1p, which in turn regulates
formation of cAMP. This process inactivates the cAMP-dependent protein
kinase (protein kinase A, PKA), leading to enhanced filamentous growth
in Saccharomyces. Homologs of all genes involved in this pathway have
been identified in P. marneffei (Fig. 6.2 and Table 6.2).
Another regulator implicated in Saccharomyces filamentation is the Rim1p
zinc-finger transcription factor. It is activated by a proteolytic cleavage
dependent on several other RIM genes (RIM8, RIM9, RIM13). The Rim1p
homolog in Aspergillus nidulans, PacC, is also regulated by such a
proteolysis mechanism. Again, homologs of all these RIM genes were identified
in P. marneffei, suggesting the existence of the regulatory pathway.
Because signal transduction pathways have been well elucidated in
Saccharomyces, the yeast has been used as a reference library for the
analysis of conserved signalling pathways. However, even the most detailed
analyses in S. cerevisiae can provide only stepping stones towards
explaining key morphological features in more complex, multicellular
filamentous fungi. These mould-specific features may
include polarized hyphal growth, septation, establishment of multinucle-
ate cellular compartments, cell type-specific gene expression, and sub-
cellular localization of proteins. Furthermore, protein networks of other
fungi may even differ in their regulation of similar morphological tasks.
Hence, further studies toward an understanding of these differences on
the molecular level will remain an important task in functional analy-
ses, particularly of organisms, like P. marneffei, whose genomes will be
completely sequenced in the near future.
6.3.3 Alteration of the genomic expression
Elevated temperature is apparently the major environmental stimulus
that causes P. marneffei to undergo a mycelium-to-yeast
transformation. However, the influence of elevated temperature on the
overall gene expression of P. marneffei has not been studied. Nevertheless,
since survival at elevated temperatures, i.e. thermotolerance,
is a trait critical to the ability of many fungal pathogens to thrive in host
infections, a number of studies have been conducted in other fungi. For
example, two genes have been implicated in growth at elevated
temperatures in C. neoformans. Gene RAS1 (encoding a small GTP-binding
protein) regulates filamentation, mating and growth at high tempera-
ture [5]. Gene CNA1 (encoding calcineurin) is required for C. neofor-
mans virulence and may define signal transduction elements required
for fungal pathogenesis [236]. Homologs of both genes can be identified
Table 6.2: Homologous genes related to signal transduction in filamentous growth.
Sc gene        Pm gene                    Function/product
MAPK pathway
STE20 (CST20)  Pm80.15                    Signal transducing kinase of the PAK family, involved in pheromone response and pseudohyphal/invasive growth pathways
STE11          Pm129.8                    MAP kinase kinase kinase in the filamentous growth pathway
STE7 (HST7)    Pm161.15                   Serine/threonine/tyrosine protein kinase of MAP kinase kinase family
STE12 (CPH1)   Pm201.2 (stlA)             Ortholog to AN2290.2 (SteA). Members of the STE12 family of regulators are involved in controlling mating and yeast-hyphal transitions in a number of fungi
TEC1           Pm109.16 (abaA)            Transcription factor participates in two developmental programmes: conidiation and dimorphic growth
PSS1           Pm41.61                    MAP kinase dedicated to filamentation pathway
FUS3           Pm8.42                     MAP kinase dedicated to pheromone response pathway
cAMP pathway
PDE2           Pm146.17                   cAMP phosphodiesterase, component of the cAMP-dependent protein kinase signaling system
RAS2           Pm85.8                     Regulator of cAMP levels
GPA2           Pm51.59                    G protein alpha subunit homologue
CYR1           Pm7.24                     Adenylate cyclase, required for cAMP production and cAMP-dependent protein kinase signalling
BCY1           Pm33.83                    Regulatory subunit of the cyclic AMP-dependent protein kinase (PKA)
TPK1, 2, 3     Pm18.86, Pm47.4, Pm19.3    Subunit of cytoplasmic cAMP-dependent protein kinase; promotes vegetative growth in response to nutrients; inhibits filamentous growth
RIM1 related
RIM1           Pm20.42                    Rim1p is homologous to the Aspergillus nidulans transcription factor PacC, which is also regulated by proteolysis
RIM8           Pm148.7                    Protein of unknown function, involved in the proteolytic activation of Rim101p in response to alkaline pH; has similarity to A. nidulans PalF
RIM9           Pm26.50                    Involved in the proteolytic activation of Rim101p in response to alkaline pH; has similarity to A. nidulans PalI
RIM13          Pm146.2                    Calpain-like protease involved in proteolytic activation of Rim101p in response to alkaline pH; has similarity to A. nidulans palB
within the P. marneffei genome. The P. marneffei homolog of C. neo-
formans RAS1, Pm85.8, is a known P. marneffei gene (rasA, GenBank
Acc. AY232652). It has been confirmed by experiment to act upstream
of CflA (Cdc42) to regulate germination of spores and polarized growth
of both hyphal and yeast cells, while also exhibiting CflA-independent
activities [23]. For CNA1, the putative homologue gene, Pm119.15, en-
codes a highly conserved (74% aa identity within alignable region of 485
aa) calcineurin peptide sequence (557 aa long).
In addition to these analyses of individual gene functions, Steen
et al. have initiated a genome-wide analysis of the response of C.
neoformans to host temperature [296]. This analysis revealed differences
in the levels of responsiveness of serotype A and D strains to growth
at 25°C versus 37°C, with changes in transcript levels for histone genes,
stress-related genes, and genes encoding translation components. Nunes
et al. [234] used a Paracoccidioides brasiliensis biochip to monitor gene
expression at several time points of the mycelium-to-yeast morpholog-
ical shift. Their results revealed a total of 2,583 genes that displayed
statistically significant modulation in at least one experimental time
point. Among the identified genes, some encoded enzymes involved in
amino acid catabolism, signal transduction, protein synthesis, cell wall
metabolism, genome structure, oxidative stress response, growth control,
and development. In particular, the gene 4-HPPD, encoding 4-hydroxyl-phenyl
pyruvate dioxygenase, is highly overexpressed during mycelium-to-yeast
differentiation, and inhibition of its activity has been shown to block
growth and differentiation of the pathogenic yeast phase of the fungus
in vitro [234]. Two copies of 4-HPPD, Pm48.10 and Pm14.48, were
identified in the P. marneffei genome.
Neither C. neoformans nor P. brasiliensis is phylogenetically closely
related to P. marneffei. Comparison of gene expression patterns with
the much more closely related Aspergillus species may be more meaningful.
Information about A. fumigatus gene expression in metabolic adaptation
to higher temperatures became available recently [233]. Nierman
et al. examined gene expression throughout a time course upon shift of
growth temperature from 30°C to 37°C and 48°C [233]. A total of 1,926
temperature shift-responsive genes were identified. Comparative data also
indicate that high temperature responses in A. fumigatus differ from the
general stress response in yeast. We performed a comparative analysis of
these genes against the P. marneffei genome in order to identify their
homologs. Among the 1,926 genes, 1,032 (about 54%) have homologs in
P. marneffei, i.e., a majority of A. fumigatus temperature shift-responsive
genes are present in P. marneffei. Here the set of homologs was defined by
identifying unique pairwise reciprocal best hits, with at least 40% similarity
in protein sequence and less than 20% difference in length. This result
suggests that the genetic components of the high temperature response in
P. marneffei may not differ much from those for general high temperature
responses in A. fumigatus.
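The homolog definition above (reciprocal best hits, at least 40% similarity, less than 20% length difference) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Hit record, the protein identifiers and the scores are hypothetical, and measuring the length difference relative to the longer protein is an assumption, since the text does not say how the difference was normalised.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise search hit (hypothetical record layout)."""
    query: str
    subject: str
    score: float        # alignment score, higher is better
    similarity: float   # percent similarity over the alignment
    query_len: int
    subject_len: int


def best_hits(hits):
    """Keep only the highest-scoring hit for each query."""
    best = {}
    for h in hits:
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a, min_similarity=40.0, max_len_diff=0.20):
    """Return sorted (a_id, b_id) pairs that are each other's best hit
    and pass the similarity and length filters."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    pairs = []
    for a_id, h in forward.items():
        back = backward.get(h.subject)
        if back is None or back.subject != a_id:
            continue  # not reciprocal
        # Assumption: length difference is relative to the longer protein.
        longer = max(h.query_len, h.subject_len)
        len_diff = abs(h.query_len - h.subject_len) / longer
        if h.similarity >= min_similarity and len_diff < max_len_diff:
            pairs.append((a_id, h.subject))
    return sorted(pairs)


# Toy data with made-up identifiers and values.
af_vs_pm = [
    Hit("Af1", "Pm1", 300, 65.0, 400, 410),
    Hit("Af1", "Pm2", 120, 45.0, 400, 300),
    Hit("Af2", "Pm2", 200, 55.0, 500, 300),  # fails the length filter
    Hit("Af3", "Pm3", 150, 35.0, 200, 210),  # fails the similarity filter
]
pm_vs_af = [
    Hit("Pm1", "Af1", 300, 65.0, 410, 400),
    Hit("Pm2", "Af2", 200, 55.0, 300, 500),
    Hit("Pm3", "Af3", 150, 35.0, 210, 200),
]
pairs = reciprocal_best_hits(af_vs_pm, pm_vs_af)
print(pairs)
```

With the toy data only the Af1/Pm1 pair survives: Af2/Pm2 is reciprocal but differs too much in length, and Af3/Pm3 falls below the similarity threshold.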
The experiments mentioned above identified the temperature shift-
responsive genes that may play a role in the structural or metabolic
changes that take place during morphogenesis or may be necessary for
colonisation and survival in the host. However, inferring the roles of
P. marneffei genes directly from their homology to temperature
shift-responsive genes in other fungi may not be reliable. Moreover, very few
genetic determinants have been identified as directly involved in
phase transition or pathogenicity. Further studies of gene expression
in P. marneffei are necessary to address these problems.
In addition to revealing the overall gene expression pattern, under-
standing the transcriptional mechanisms which control the dimorphic
program is also important. Some of the transcription factors within known
pathways have been mentioned above. Here I mention further studies that
identified several other transcription factors which control conidiation
and dimorphic switching in P. marneffei. The P. marneffei abaA gene
(Pm109.16) encoding an ATTS/TEA DNA-binding domain transcrip-
tional regulator regulates cell cycle events and morphogenesis in both
filamentous and yeast growth [18]. The stuA gene (Pm107.14) encod-
ing a basic helix-loop-helix transcription factor may control processes
that require budding but not those that require fission as in dimorphic
growth in P. marneffei [20]. TATA-binding protein (TBP) is a general
transcription factor required for initiation of transcription in eukaryotes.
The TBP encoding gene, Tbp (Pm19.17), has been cloned and character-
ized in P. marneffei [254]. Tbp is essential for P. marneffei filamentous
growth, but plays a less significant role in growth and development dur-
ing the yeast phase. Furthermore, it has been shown that transcriptional
regulation in S. cerevisiae appears to be mechanistically bipolar, i.e.,
TATA box-containing genes are predominantly involved in responses to
stress, whereas TATA-less genes are mainly associated with constitutive
housekeeping functions [12]. Only 20% of yeast genes contain a TATA
box [12]. It is therefore of interest to see whether TATA-less promoters
are also present in P. marneffei, which would suggest a need to balance
inducible stress-related responses with constitutive housekeeping functions,
or reflect the difference in the regulatory basis for growth and
development of the two morphological forms [254].
6.3.4 Structural reorganization towards the morphological change
It is reasonable to speculate that the mycelium-to-yeast transformation of
P. marneffei is an active process triggered by a shift in temperature. The
fungus undergoes a ‘drastic’ structural reorganisation associated with this
active process. We assume this process may be linked with a number of
phenotypic changes like those characteristic of apoptosis or programmed
cell death. Indeed, programmed cell death has been observed in both A.
fumigatus [225] and A. nidulans [313]. The metazoan upstream apop-
totic machinery is absent in fungi, whereas the downstream effectors and
regulators, both caspase-dependent and caspase-independent, seem to be
present in A. fumigatus [225]. As in animal apoptotic cells, caspase
activities are involved in fungal mycelium self-activated proteolysis. Searches
of the P. marneffei genome revealed three genes (Pm105.4, Pm112.34 and
Pm205.1) encoding metacaspase proteins that could be responsible for the
caspase-like activities. Only two such genes were identified in the
A. nidulans genome. The searches also found a single gene (Pm93.8)
encoding a poly (ADP-ribose) polymerase (PARP) protein, a homologue of
the key participant of caspase-independent apoptosis in mammals. PARP
is one of the known target proteins inactivated by caspase degradation in
animal cells. PARP activity was demonstrated previously in A. nidulans
during sporulation-induced apoptosis. PARP is absent in S. cerevisiae
but present in Aspergillus. The presence of these proteins in P. marnef-
fei and Aspergillus is indicative of the PARP-dependent programmed cell
death pathway. In addition, homologs of mammalian apoptotic protein
AMID are found in P. marneffei and A. fumigatus, but not in unicellular
yeasts such as S. cerevisiae, further suggesting that mechanisms of cell
death are more complex in filamentous fungi.
Analysis of the cell wall of P. marneffei is fundamental to understanding
its morphological transformation. In the mould form, the hyphal cell wall
is essential for P. marneffei to penetrate solid nutrient substrates. In
yeast form, a transformed cell wall is essential to resist host cell defence
reactions. The cell wall protects P. marneffei against the aggressive
human defence reactions, harbours most of the fungal antigens and it
represents a potential drug target. Therefore, comprehension of cell wall
biosynthesis pathways is important. We speculate that, as in many other
filamentous fungi, the structural core of the cell wall of P. marneffei
consists of polysaccharides: alpha(1,3)- and beta(1,3)-glucans, chitin,
galactomannan, and beta(1,3),(1,4)-glucan. The corresponding structural
genes and genes encoding a number of enzymes responsible for the
biosynthesis and remodelling of these polysaccharides, including synthases,
transglycosidases, and glycosyl hydrolases, were identified in the
P. marneffei genome (provided on the PMGD website: www.pmarneffei.hku.hk).
One of the known differences between the yeast cell wall and the mycelium
cell wall is that beta(1,6)-glucan and peptidomannan, which are present in
yeast cell walls, are missing in A. fumigatus [233]. Beta(1,6)-glucan is a
key component of the yeast cell wall, interconnecting cell wall proteins,
beta(1,3)-glucan, and chitin. The yeast genes KRE5, KRE6 and SKN1 are
predicted to encode paralogous proteins that participate in the assembly of
beta(1,6)-glucan. Homologs of these three genes, Pm76.37, Pm104.21 and
Pm34.5, were identified in the P. marneffei genome, as well as in the
A. fumigatus genome. Seemingly, the specificity of the cell wall
biosynthetic gene inventory in the P. marneffei genome determines the
specificity of the polymer organization of the cell wall, but further
analysis is needed for confirmation.
As a general feature of development in eukaryotes, only a small pro-
portion of the genome is associated with any particular morphogenetic
process. In yeast for example, only 21-75 of the estimated 6,000 genes
were assumed to be specific to meiosis and ascospore formation. This
is presumably also the case in P. marneffei. Therefore, the study of
morphogenesis should emphasise the regulation of differential gene
expression or activity, rather than large-scale replacement
of one set of gene products by another. We still lack gene expression
studies in P. marneffei to date. Nevertheless, the findings in this chap-
ter offer new interpretive clues to the mechanisms of fungal virulence
and dimorphism. First, the signalling systems that control dimorphism
may be conserved between P. marneffei and related fungi. That is to
say, many fungal species contain orthologous genes specifying the same
pathways. Presumably, only subtle quantitative differences in the inputs
and outputs of each pathway generate the characteristic differences in
morphology and behaviour. Second, dimorphism in P. marneffei may be
controlled by multiple signalling pathways. As in Saccharomyces, at least
three parallel pathways control the switch to filamentous growth. How
the fungus integrates the information from different pathways to effect a
change in cell type is not known.
In summary, morphogenesis is an essential developmental event, pro-
moting host invasion and evasion by dimorphic fungi. Prevention of this
event may hold the key to control of infections by these fungi. Under-
standing the molecular mechanisms for the morphologic switch could lead
to new drug or vaccine targets that block the earliest events in coloniza-
tion or infection.
Chapter 7
INTRAGENIC TANDEM REPEATS IN PENICILLIUM
MARNEFFEI AND OTHER ASCOMYCETES
Tandemly repeated DNA sequences occur frequently in the genomes of
organisms. Although their function and origin are not truly understood,
these highly dynamic genomic components may provide the most insights
into how a pathogenic fungus adapts to the host immune system.
7.1 Introduction
A tandem repeat (TR) is defined to be two or more adjacent copies of
the same sequence of nucleotides and may result from tandem duplica-
tion event(s). Over time, individual copies within a TR may undergo
additional, uncoordinated mutations so that typically, only approximate
tandem copies are present. The number of adjacent copies in a TR can
be variable. Lengths of TRs range from a few tens of base pairs (micro- and
mini-satellites) to megabases (larger satellite repeats).
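The definition above can be illustrated with a minimal Python sketch that reports runs of exact, adjacent copies of a repeat unit of a given length. It is a simplification for illustration only: real TRs are usually approximate, as noted above, and this sketch does not handle mismatches or indels. The function name and the example sequence are hypothetical.

```python
def find_exact_tandem_repeats(seq, unit_len, min_copies=2):
    """Return (start, unit, copies) for each run of exact adjacent copies
    of a unit of length unit_len. Positions are 0-based."""
    seq = seq.upper()
    results = []
    i = 0
    while i + unit_len * min_copies <= len(seq):
        unit = seq[i:i + unit_len]
        copies = 1
        j = i + unit_len
        # Extend the run while the next window equals the unit.
        while seq[j:j + unit_len] == unit:
            copies += 1
            j += unit_len
        if copies >= min_copies:
            results.append((i, unit, copies))
            i = j  # continue scanning after the run
        else:
            i += 1
    return results


# Toy sequence: three adjacent copies of CAG flanked by TT.
repeats = find_exact_tandem_repeats("TTCAGCAGCAGTT", unit_len=3, min_copies=3)
print(repeats)  # [(2, 'CAG', 3)]
```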
Genomes, particularly of eukaryotes, contain a large number of TRs.
For example, 10% or more of the human genome is composed of TRs. Simple
sequence repeats are fairly abundant in plant genomes, occurring
approximately once every 6 kb [258]. TRs are of biological importance
for many reasons. First, they cause human diseases, including fragile-X
mental retardation, Huntington’s disease, myotonic dystrophy, etc. [288],
which are the result of a dramatic expansion in the number of copies of
a trinucleotide pattern. Second, they play a variety of regulatory and
evolutionary roles. The repeats may interact with transcription factors
or alter the structure of the chromatin or act as protein binding sites [121,
208]. Third, they are important laboratory and analytic tools. They have
been applied in linkage analysis and DNA fingerprinting [78,340] since the
number of copies of a specific TR is often polymorphic in the population.
Last but not least, TRs play an apparent role in the development of
immune system cells in humans. Du et al. [75] showed that breakpoints
of immunoglobulin switch recombination, which occur between pairs of
switch regions located upstream of the constant heavy chain genes, cluster
to a defined subregion in three TRs.
The most interesting feature of TRs is their association with the
functional variability of a gene product. Most TRs are in intergenic
regions, but some are in coding sequences or pseudogenes. Verstrepen et
al. [328] showed that in the genome of Saccharomyces cerevisiae, most
genes containing intragenic TRs (IntraTRs) encode cell-wall proteins.
The presence of IntraTRs facilitates recombination in the gene or between
the gene and a pseudogene. The result of this increased frequency of re-
combination events is an expansion or contraction of the gene size. More
importantly, this size variation creates quantitative alterations in pheno-
types (e.g., adhesion, flocculation or biofilm formation). The variation of
the fungal cell surface allows fungal microbes to ‘disguise’ themselves in
order to evade the host immune system’s defences.
Inspired by the finding of Verstrepen et al. [328], this chapter
aims to reveal the composition of IntraTRs in the genomes of
programs, we searched for both long and short repeated sequences within
protein-coding regions in P. marneffei and related Ascomycetes. Com-
parison of observed frequencies with expected values reveals that repeats
are enriched in the P. marneffei genome.
7.2 Materials and Methods
7.2.1 Identification of coding tandem repeats
The previously described methodology [328] was applied to find IntraTRs
in the P. marneffei genome and other fungal genomes, using the EMBOSS
ETANDEM software [263] to screen the sequences. The ETANDEM
threshold score was set to 20. All known and predicted genes were
scanned for long (≥ 40 nucleotides (nt)) or short (3-39 nt) repeats. Here
a sequence was considered to be an intragenic repeat if it met two
conditions: (i) repeat conservation was at least 85%; and (ii) the number of
repeats was at least 20 for trinucleotide repeats, 16 for repeats between
4 and 10 nt, 10 for repeats between 11 and 39 nt and 3 for repeats of at
least 40 nt.
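The two acceptance conditions above can be written as a small Python filter. This is a sketch of the stated thresholds only; it does not reproduce the ETANDEM run or its output parsing, and the function names are hypothetical.

```python
def min_copies_required(unit_len):
    """Minimum copy number for a repeat unit of unit_len nucleotides,
    following condition (ii). Returns None for units shorter than 3 nt,
    which were not scanned."""
    if unit_len < 3:
        return None
    if unit_len == 3:
        return 20
    if unit_len <= 10:
        return 16
    if unit_len <= 39:
        return 10
    return 3


def is_intragenic_repeat(unit_len, copies, conservation):
    """Apply condition (i), conservation of at least 85%, and condition (ii)."""
    required = min_copies_required(unit_len)
    if required is None:
        return False
    return conservation >= 85.0 and copies >= required


# Illustrative calls (the conservation values are made up).
print(is_intragenic_repeat(3, 24, 90.0))    # True
print(is_intragenic_repeat(3, 19, 90.0))    # False: too few copies
print(is_intragenic_repeat(126, 12, 80.0))  # False: conservation below 85%
```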
7.2.2 Sequence analysis
Position-specific iterated BLAST (PSI-BLAST) [6] was used to search
publicly available microbial genome sequences, GenBank, or EMBL. GenBank
and EMBL were accessed through the National Center for Biotechnology
Information (http://www.ncbi.nlm.nih.gov/) and the Oxford University
Bioinformatics Centre, respectively. Protein domains were determined
using the NCBI Conserved Domain Search. The
MBEToolbox package (Chapter 10) was used for nucleotide and amino
acid sequence analysis and alignments.
7.3 Results and Discussion
One of the ultimate goals of sequence analysis is to accurately iden-
tify candidate virulence genes that confer pathogenicity to P. marneffei.
General comparative analyses, such as ortholog prediction and species-
specific gene detection, are valuable, but not very specific. That is to say,
these methods give too many candidate genes. To narrow these candidate
Table 7.1: P. marneffei genes containing intragenic tandem repeats. Column “size” is the length of the repeat unit; “count” is the number of occurrences of the repeat unit. The total length of the repeat units therefore equals size × count. Sequence identity (%) of the repeat unit is greater than 80%. Consensus sequences of the repeat unit for each gene are available in PMGD. * indicates that the gene contains more than one type of repeat. Genes are ordered by the size of the repeat unit. The last 12 genes contain short repeats; the rest contain long repeats.
Pm gene    Size  Count  Putative function
Pm6.47     228   3      Polyubiquitin, similar to S. cerevisiae UBI4 (YLL039C)
Pm27.95    171   5      Unknown function
Pm78.37*   165   3      Unknown function
Pm54.4     147   3      Streptococcal protective antigen (Q8NZA4)
Pm71.41    144   5      Unknown function
Pm133.2    141   3      Unknown function
Pm1.199    126   12     Homologous to AN7363.2, AN3547.2 and AN8457.2
Pm12.139   126   9      Putative ATP/GTP binding protein
Pm14.111   126   4      O-acetylhomoserine (Thiol)-lyase (CYSD_EMENI)
Pm30.75    126   7      Beta transducin-like protein HET-E2C*4 (Q8X1P4)
Pm35.44    126   11     Beta transducin-like protein HET-E2C (Q8X1P5)
Pm94.31    126   8      Putative ATP/GTP binding protein (Q6TMU6)
Pm210.2    126   9      Beta transducin-like protein HET-D2Y (Q8X1P2)
Pm39.56    120   3      Unknown function
Pm183.10   117   3      Casein kinase I homolog hhp1 (HHP1_SCHPO)
Pm54.56*   108   6      Pedal peptide precursor protein (O01387)
Pm12.114   102   3      Unknown function
Pm161.1    102   3      Phosphorylase (Q8TK58)
Pm77.10    99    5      KIAA1223 protein (Q8TB46)
Pm226.4*   99    7      Ankyrin 2 (Q9NCP8)
Pm209.2    96    3      Beta transducin-like protein HET-E4S (Q8X1P6)
Pm44.53    81    3      Related to transport protein USO1 (Q873K7)
Pm163.5    78    9      Erythrocyte binding protein 3 [Plasmodium falciparum] (Q7K5Q6)
Pm42.29    72    3      Phenol 2-monooxygenase (Q8X0B1)
Pm117.16*  72    5      Unknown function
Pm31.1     66    5      Unknown function
Pm34.34    66    5      Chitinase (Q873Y0)
Pm54.65    66    4      Extensin class I (cell wall hydroxyproline-rich glycoprotein) [Plasmodium falciparum] (Q09082)
Pm78.42    66    3      Chitinase 4 (Q7ZA41)
Pm118.4    63    5      Unknown function
Pm40.30    60    3      PAAA motif protein, similar to microfilament and actin filament cross-linker protein [Pan troglodytes]
Pm64.14    60    8      Zonadhesin [Mouse]; PT repeat protein family (EAL93999) [Aspergillus fumigatus]
Pm95.32    60    3      Related to mannosyltransferase ALG2 (Q8X0H8)
Pm194.2    60    3      Retrovirus-related Pol polyprotein from transposon TNT 1-94 (POLX_TOBAC)
Pm41.72    54    5      Unknown function
Pm166.6    54    3      Unknown function
Pm48.11    48    5      Similar to S. cerevisiae YJR054W (Q6CXI0)
Pm78.3     48    4      Telomere-linked helicase 1 (Q8J216)
Pm173.14   48    4      Telomere-linked helicase 1 (Q8J216)
Pm194.1    48    4      Telomere-associated recQ-like helicase (O13400)
Pm194.5    48    5      Polymerase (Q9C435)
Pm224.1    48    3      Telomere-linked helicase 1 (Q8J216)
Pm224.2    48    5      Telomere-linked helicase 1 (Q8J216)
Pm230.1    48    5      Telomere-linked helicase 1 (Q8J216)
Pm234.1    48    5      Telomere-linked helicase 1 (Q8J216)
Pm236.2    48    4      DWIQ motif containing hypothetical protein (NP_702011) PF14_0123 [Plasmodium falciparum]
Pm236.3    48    7      Telomere-linked helicase 1 (Q8J216)
Pm247.2    48    5      Telomere-linked helicase 1 (Q8J216)
Pm108.33   45    4      Unknown function
Pm8.109    42    3      ATPase, AAA family
Pm40.29    42    3      Unknown function
Pm40.31    42    4      H7H motif in multiple proteins of Plasmodium
Pm52.29    42    3      Mitochondrial chaperone BCS1 (BCS1_XENLA)
Pm210.1    42    4      Unknown function
Pm173.16   24    10     Unknown function
Pm36.21    12    11     Unknown function
Pm1.35     6     25     Transcription initiation factor TFIID subunit 12 (TAF12_YEAST)
Pm1.28     3     24     Unknown function
Pm3.168    3     28     Putative cell wall protein FLO11p (Q7Z884)
Pm5.75     3     25     Dynamin binding protein, TUBA; DNMBP_MOUSE (Q6TXD4)
Pm14.75    3     29     Unknown function
Pm22.8     3     22     Unknown function
Pm67.24    3     22     Related to heat shock transcription factor HSF21 (Q9P554)
Pm76.36    3     21     Unknown function
Pm85.21    3     30     Unknown function
Pm138.7    3     24     Oxygenase-like protein (Q93M01)
genes down to a manageable number, genes that contain IntraTRs were
carefully investigated. This is because IntraTRs have been suggested to
generate functional variability in S. cerevisiae, and variation in IntraTR
number provides the functional diversity of cell surface antigens that, in
fungi and other pathogens, allows rapid adaptation to the environment
and evasion of the host immune system [328]. In S. cerevisiae,
a total of 44 such genes with known functions have been identified.
These genes show unexpected functional similarities: 62% with conserved
long repeats encode cell-wall proteins [328].
A total of 66 P. marneffei genes that contain IntraTR(s) were identified
(Table 7.1). Nearly one third of these genes are of unknown function,
i.e., no putative homologs were detected by the extensive
PSI-BLAST search against GenPept databases, and no putative conserved
domains were detected. These genes may be P. marneffei-specific.
The remaining two thirds, for which putative homologs can be found,
are genes with assigned functions. Nine of these genes, namely, Pm78.3,
Pm173.14, Pm224.1, Pm224.2, Pm230.1, Pm234.1, Pm236.3, Pm247.2,
and Pm194.1, are homologs of the Magnaporthe grisea telomere-linked
helicase 1 (TLH1) gene. Genetic mapping showed that most members
of the TLH gene family are tightly linked to the telomeres and located
within 10 kb from the telomeric repeat. Similar helicase gene families
are also present in the chromosome ends of Saccharomyces cerevisiae
and Ustilago maydis, which suggests the initial association of helicase
genes with fungal telomeres might date back to the very early stages of
the fungal evolution [103]. Four genes, Pm210.2, Pm30.75, Pm35.44, and
Pm209.2, are homologs of beta transducin-like protein genes, most
similar to Podospora anserina het-d2y, het-e2c*4, het-e2c and het-e4s,
respectively (Table 7.1). These genes are involved in vegetative incompatibility, which
prevents a viable heterokaryotic cell from being formed by the fusion of
filaments from two different wild-type strains. In P. anserina, such in-
compatibility is always the consequence of at least one genetic difference
in het genes, specifically het-e and het-d. These loci control heterokaryon
viability through genetic interactions with alleles of the unlinked het-c
locus [82]. Other interesting homologs include streptococcal protective
antigen, chitinase, extensin, zonadhesin, and erythrocyte binding protein
(Table 7.1).
For further experimental studies, such as DNA typing, only those genes
that are most likely to be responsible for the pathogenic adaptation of
P. marneffei should be selected. The selection process involves multi-step
filtering. The underlying rationale is that a candidate virulence gene has
to be (1) P. marneffei-specific (without orthologs, or with orthologs that
contain no similar IntraTR), and (2) either functionally known to be
related to intracellular adaptation or of completely unknown function.
Moreover, in
order to conduct a PCR-based IntraTR length polymorphism study, the
constraint of the length of target DNA in PCR reactions has to be taken
into account. After the multi-step filtering and investigating the lengths
of IntraTR and introns of these genes, two genes, Pm40.30 (745 bp) and
Pm40.31 (733 bp), were selected for further polymorphism study. The
lengths of IntraTRs plus introns of the two genes are 234 and 277 bp,
respectively. What makes these two genes special is their BLAST results.
The top PSI-BLAST hit of Pm40.30 against the NCBI NRProt database is a
hypothetical chimpanzee protein containing multiple PAAA motifs, while
the top hit of Pm40.31 is a hypothetical protein from Plasmodium
falciparum that contains a histidine-rich motif. Although the function of
the hypothetical gene encoding this protein is unknown, it is still noteworthy
that another histidine-rich protein PfHRP2, encoded by P. falciparum
gene HRP-2, is indeed responsible for intracellular adaptation of this
parasite [11]. PfHRP2 binds heme, playing a role in hemoglobin prote-
olysis, which is the primary nutrient source of the erythrocytic growth
stage of P. falciparum [52].
The relative abundances of IntraTRs in different fungi were compared.
Table 7.2 shows the genome size, G, the number of bases in repeat regions,
B, and the number of genes containing repeats, N, for several fungi. When
all diploid and haploid species are considered together, the two diploid
fungi, S. cerevisiae and C. albicans, show higher B/G ratios. It appears
that the genomes of diploid species may accommodate more bases in
IntraTR regions, as much as 3 times more. Among haploid fungi, P.
marneffei shows the highest B/G ratio, i.e., the fraction of its bases that
belong to repeat regions is higher than in any other haploid fungus. We
argue that the relatively more abundant IntraTRs in P. marneffei might be
responsible for its immuno-escaping mechanism, which enables the fungal
pathogen to survive within its host. Finally, note that B/N ratios remain
largely constant across different species, i.e., the average number of
repeat bases per repeat-containing gene is similar.
Table 7.2: Comparison of genome size and bases in repeats. Abbreviations: Pm, P. marneffei; Af, Aspergillus fumigatus; An, Aspergillus nidulans; Sc, Saccharomyces cerevisiae; Ca, Candida albicans; Mg, Magnaporthe grisea; Nc, Neurospora crassa.

                                      Pm      Af      An      Sc      Ca      Mg      Nc
Diploid                               No      No      No      Yes     Yes     No      No
Genome size (Mb), G                   30      28      30      12      16      39      40
Bases in repeat regions (bp), B       23,814  12,687  16,820  29,664  34,662  16,933  22,101
No. of genes containing repeats, N    66      33      31      69      82      62      121
B/G ratio                             794     453     561     2,472   2,166   434     553
B/N ratio                             361     384     543     430     423     273     183
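The two ratios in Table 7.2 can be recomputed directly from G, B and N. The following is a minimal sketch in Python; the dictionary layout and the function name are illustrative assumptions, and only the numbers are taken from the table.

```python
# Values transcribed from Table 7.2; the data layout is illustrative only.
table_7_2 = {
    # species: (genome size G in Mb, bases in repeats B in bp, genes with repeats N)
    "Pm": (30, 23814, 66),
    "Af": (28, 12687, 33),
    "An": (30, 16820, 31),
    "Sc": (12, 29664, 69),
    "Ca": (16, 34662, 82),
    "Mg": (39, 16933, 62),
    "Nc": (40, 22101, 121),
}


def repeat_ratios(genome_mb, repeat_bp, n_genes):
    """Return (B/G in bp per Mb, B/N in bp per repeat-containing gene)."""
    return repeat_bp / genome_mb, repeat_bp / n_genes


ratios = {sp: repeat_ratios(*vals) for sp, vals in table_7_2.items()}
for sp, (bg, bn) in ratios.items():
    print(f"{sp}: B/G = {bg:.0f}, B/N = {bn:.0f}")
```

Rounded to integers, the values agree with the last two rows of the table.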
The amino acid composition of a protein is the mole percent of the
different amino acids in its sequence. It is usually conserved among the
same proteins of different species. Here we performed a cross-species
comparison of the amino acid composition of IntraTRs (Fig. 7.1). The
two yeasts show a different visual pattern from those of the moulds.
S. cerevisiae and C. albicans use much more threonine and/or serine
than any other amino acid, while in moulds the patterns are more
varied. Serine is used most in P. marneffei and A. fumigatus, alanine
in A. nidulans, glycine in N. crassa and isoleucine in M. grisea.
Phenylalanine, valine and tryptophan are used infrequently in all
species. The overall patterns of P. marneffei, A. nidulans and A.
fumigatus are similar to each other. The result shows that the differences
in amino acid composition are associated with the phylogenetic
distances among species. This suggests that the amino acid composition of
IntraTRs is not subject to neutral mutation alone but is under the
constraint of a certain level of selection.
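As an illustration of how a composition profile of the kind compared above can be obtained, the sketch below pools a set of repeat peptide sequences and reports the percentage of each of the 20 amino acids. The function name and the toy sequences are assumptions made for this example; they are not thesis data.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(repeat_seqs):
    """Return the percentage of each of the 20 amino acids over all sequences."""
    counts = Counter()
    for seq in repeat_seqs:
        # Ignore gaps, stops and ambiguity codes.
        counts.update(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Toy serine/threonine-rich repeats (made-up sequences).
toy_repeats = ["STSTSSTT", "SSTA"]
comp = composition(toy_repeats)
for aa in AMINO_ACIDS:
    if comp[aa] > 0:
        print(f"{aa}: {comp[aa]:.1f}%")
```

Running the same function on the IntraTR peptides of each species would give one profile per species, which can then be compared as in Fig. 7.1.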
Figure 7.1: Amino acid composition in intragenic tandem repeats. Fungal species are: Af, A. fumigatus; Pm, P. marneffei; An, A. nidulans; Sc, S. cerevisiae; Ca, C. albicans; Nc, N. crassa; Mg, M. grisea. For each subplot, the x axis is the occurrence (frequency) of each amino acid and the y axis lists the amino acids from top to bottom: A - Alanine, C - Cysteine, D - Aspartic Acid, E - Glutamic Acid, F - Phenylalanine, G - Glycine, H - Histidine, I - Isoleucine, K - Lysine, L - Leucine, M - Methionine, N - Asparagine, P - Proline, Q - Glutamine, R - Arginine, S - Serine, T - Threonine, V - Valine, W - Tryptophan, and Y - Tyrosine.

The cell surfaces of microorganisms show distinctive properties which
can be recognised by the host immune system. Many microorganisms
have the ability to switch their cell-surface molecules, a tactic that per-
mits them to elude the immune system and adhere to diverse materials
and cells (for review, see [329]). The human immune system poses
challenges to P. marneffei, which might have characteristic cell-surface
molecules that are recognized by dedicated phagocytic cells. Recent studies
linked the diversity of cell-surface molecules to variation in IntraTR
number. The persistence of a large number of IntraTRs in the
P. marneffei genome suggests that there is a compensating benefit. We
therefore propose that variation in IntraTR number provides the func-
tional diversity of cell surface antigens in P. marneffei, allowing rapid
adaptation to the environment and evasion of the host immune system.
Chapter 8
EXTENT AND EVOLUTIONARY PATTERN OF
DUPLICATE GENES IN PENICILLIUM MARNEFFEI
AND OTHER ASCOMYCETES
Gene duplication and subsequent divergence have long been believed
to be of importance for the functional novelty and complexity of organ-
isms. The extent and evolutionary patterns of duplicate genes (paralogs)
have long been studied in higher eukaryotes, but not in lower eukary-
otes such as fungi. In this chapter, gene-coding sequences in genomes
from Penicillium marneffei, together with those from other ascomycetes,
Saccharomyces cerevisiae, Schizosaccharomyces pombe, Candida albicans,
Aspergillus nidulans and Neurospora crassa, are used to identify multi-
gene families. The number of synonymous substitutions per synonymous
site, Ks, and the number of nonsynonymous substitutions per nonsyn-
onymous site, Ka, are calculated to measure the time (or relative fre-
quency) of duplication as well as the selective constraint on gene pairs.
The evolutionary rates of duplicate gene pairs are measured by applying
the codon substitution model, which is more sensitive than traditional
models [111]. A large variation in the extent of gene duplication in these
species was found (percentage of genes in multigene families ranged from
23.6% in S. cerevisiae to 8.0% in N. crassa). The age distribution of
the gene duplications tentatively suggests that the P. marneffei genome
may have experienced two rounds of large-scale duplication. Paralogs in
filamentous ascomycetes (but not paralogs in yeast ascomycetes) were also
found to be under weaker functional constraint than orthologs. Analysis
of the divergence of evolutionary rates in S. cerevisiae and C. albicans
revealed that 17.8% of gene pairs show an asymmetric divergence pattern
in amino-acid substitutions. However, there is no evidence to show that
this asymmetry is associated with positive selection. I
speculate that the different extent and evolutionary pattern of duplicate
genes in these ascomycetes might be associated with their genotypical
and phenotypical differences.
8.1 Introduction
In the early 1970s Ohno proposed in his book that gene duplication is a
major evolutionary source of gene innovation [237]. By this he meant that
the creation of a paralog of a gene through duplication (by many possible
means) results in one of the duplicates being functionally redundant. This
redundant copy may mutate more freely without affecting the overall
fitness of the organism, and thus is more likely to become a gene with a
novel function. Biologists now generally accept the view that, by
creating sets of gene paralogs, gene duplication plays an important role in
the adaptation of organisms to their environment and in the origin of the
phenotypic diversity of organismal evolution [210]. Nowadays, with the
completion of several eukaryotic genome projects, it is well known that
one of the characteristics of eukaryotic genomes is the presence of dupli-
cate genes, forming numerous gene families [287]. More than a third of a
typical eukaryotic genome consists of gene families [115,287,345]. Whole
genome duplication(s) during the earlier evolution of the vertebrate lin-
eage have been proposed to account for the presence of extensive gene
duplications in most of the vertebrate genomes [209,221,287].
The extent of gene families in one organism is firstly determined by
the frequency and magnitude of gene duplication events, and secondly de-
termined by the subsequent evolutionary fates of gene pairs following the
duplication events. This may be better understood through comparative
studies of sequence divergence in duplicate genes in different genomes.
However, until recently few studies have been conducted in the limited
number of representative organisms available [68,174,210], because such
kinds of inter-genomic comparisons rely on the availability of complete
genome sequence from multiple organisms.
In this study, I compare the extent and evolutionary pattern of dupli-
cate genes in the phylum ascomycota, using the complete sets of protein-
coding genes in the fungi, Saccharomyces cerevisiae [110], Schizosaccha-
romyces pombe [354], Candida albicans, Aspergillus nidulans, Penicillium
marneffei and Neurospora crassa [101]. These fungal species display dif-
ferent life styles and phenotypic characteristics. The brewer’s yeast, S.
cerevisiae, and fission yeast, S. pombe, have a life cycle characterized
by a unicellular thallus that reproduces by budding and fission respec-
tively. Filamentous ascomycetes, N. crassa and A. nidulans, grow hyphae
apically and branch laterally. P. marneffei shows dimorphic switching
between mould and yeast forms of growth under different temperatures.
It is of interest to know how gene duplication shaped their gene reposi-
tories leading to novel genes conferring novel adaptive functions in these
fungi.
In practice I used nucleotide alignments of duplicate genes to calculate
two key parameters of molecular evolution: the number of synonymous
(silent) substitutions per synonymous site, Ks, and the number of non-
synonymous (amino-acid replacement) substitutions per nonsynonymous
site, Ka. Ks provides a crude measure of the time since duplication for
each gene pair, if Ks is assumed to increase approximately linearly with time.
The ratio Ka/Ks provides a measure of the selection pressure to which a
gene pair is being subjected. Generally speaking, if Ka/Ks ratio = 1, it
means that the duplicate genes are under few or no selective constraints
(i.e., amino acid replacement substitutions occur at the same rate as
synonymous substitutions). A Ka/Ks ratio > 1, which is strong evidence
for positive selection, indicates that replacement substitutions occur at
a rate higher than that expected by chance, so advantageous mutations
have occurred during sequence divergence. In contrast, a Ka/Ks ratio < 1
is consistent with ‘purifying selection’. That is to say, some amino-acid
replacement substitutions have been purged by natural selection because
of their deleterious effects [48]. Another evolutionary pattern that has at-
tracted great interest is the asymmetry of evolutionary rates between the
two copies of a duplicated gene pair, i.e., one copy evolves faster than the
other one. Intensive studies of this pattern in different organisms have
produced a wide range of estimates of the proportion of duplicate gene
pairs that show asymmetric evolution [59,68,137,174,265,321,370,371].
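The interpretation of the Ka/Ks ratio given earlier in this paragraph can be written as a small decision rule. The sketch below is only illustrative: the function name and the optional tolerance band around 1 are assumptions, and in practice a statistical test is needed before a ratio different from 1 is taken as evidence of selection.

```python
def classify_selection(ka, ks, tol=0.0):
    """Classify a gene pair by its Ka/Ks ratio.

    tol is a hypothetical half-width of the band around 1 treated as neutral.
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio Ka/Ks")
    omega = ka / ks
    if omega > 1.0 + tol:
        return "positive selection"
    if omega < 1.0 - tol:
        return "purifying selection"
    return "few or no selective constraints"


print(classify_selection(0.05, 0.40))  # Ka/Ks = 0.125
print(classify_selection(0.60, 0.40))  # Ka/Ks = 1.5
```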
Since the completion of whole genome sequence of S. cerevisiae [110],
a number of studies have involved the identification of multigene families
in this model eukaryotic genome. The resulting numbers of multigene
families in S. cerevisiae reported by Rubin et al. [270] are higher than
those reported by Friedman and Hughes [95] (1858 compared to 1440).
This is because the former study used the simple criterion, BLAST
E-value of 10^-6, while the latter used the much stricter search with E =
10^-50. However, using a single statistical score (such as the E value given
by a BLAST search or a related score) without specifying the proportion
of alignable regions may put two non-homologous proteins into the same
family due to domain sharing [118]. Hence in this study, in order to
obtain a reasonable estimate, I adopted a relatively stringent definition
in which the lengths of the encoded proteins are taken into account,
instead of relying on E-values only.
8.2 Literature Review
The ability to adapt to changing environments and to exploit new niches
has a great influence on the success of an organism [210]. This ability is
associated with new genes or genes with new functions [219]. Gene dupli-
cations are traditionally considered to be a major evolutionary source of
new protein functions. After duplication, the fate of the resulting copy of
a gene is of great interest. At least three hypotheses have been proposed,
as follows:
Nonfunctionalisation The classical view pioneered by Susumu Ohno
[237] holds that gene duplication produces two functionally redundant,
paralogous genes and thereby frees one of them from selective constraints.
The duplicate gene may be degraded to a pseudogene by mutational
inactivation and finally could be removed from the genome by deletion
[237,238]. This is the most likely outcome of duplicate genes [237,68].
Neofunctionalisation The duplicate gene may avoid redundancy by
assuming a novel function, i.e., the redundant copy may be modified and
in time assume a new role [237, 166, 336, 334, 298, 212]. Since this un-
constrained paralog is free to accumulate neutral mutations, there is the
possibility of fixation of mutations that may lead to a new function. This
prediction was supported by studies on isozyme spectra of polyploidy in a
number of organisms (reviewed in [196]). Of course, mutational time is a
deciding factor, since copies need sufficient modifications to assume roles
different from their parents, assuming that they are initially of neutral
fitness. Thus, the deletion rate is of great importance to gene innovation
by being sufficiently slow to give copies time to diverge.
Both the hypotheses above assume one copy of a duplicate gene pair
is free to evolve, while the other remains under selective pressure. This
has been challenged in work by Kondrashov et al. [174] and Lynch and
Conery [212], who show that paralogs do not seem to have experienced
any extensive period of neutral evolution. Kondrashov et al. [174] pro-
posed that paralogs avoid neutrality through gene amplification, followed
by a period of either relaxed or positive selection. They also observed
that paralogs evolve faster than their corresponding orthologs. Again,
this could be due to relaxed or positive selection. Furthermore, a study
of 17 pairs of duplicate genes in the tetraploid frog Xenopus laevis has
shown that both copies were subject to purifying selection, contrary to
the notion of neutrality of one of the copies [137]. The failure of em-
pirical research to support Ohno’s model has led to the proposal of an
alternative hypothesis – subfunctionalisation.
Subfunctionalisation The third hypothesis, ‘subfunctionalisation’ or
the duplication-degeneration-complementation (DDC) model [90], proposes
that duplicate genes come under selective pressure and are retained
by losing separate subfunctions of a multifunctional ancestral
gene. Redundant material is discarded through degradation [90]. It also
states that duplicate genes are initially redundant in function and,
accordingly, a duplication event is selectively neutral. But it differs from
the other hypotheses in that successfully retained subdomains can be
reused for a subset of the original functions or even for other new or
related purposes [90].
As a result, the two genes can be said to belong to a family, being related
by sequence similarity, if not by function. Naturally, this relationship
will decrease with time until no discernable similarity can be observed in
regions of low conservation. A large number of observations support this
model, although mostly in diploid or polyploid eukaryotes.
8.3 Materials and Methods
8.3.1 Sequences and gene families
For each organism, other than P. marneffei, the complete sets of available
putative amino-acid sequences and coding DNA sequences were downloaded
from genomic databases as follows: for S. cerevisiae,
http://genome-www.stanford.edu/Saccharomyces; for S. pombe,
http://www.genedb.org/genedb/pombe (Schizosaccharomyces pombe GeneDB);
for C. albicans, http://genolist.pasteur.fr/CandidaDB/ (CandidaDB Data
Release R1 Dec 17, 2001; this genome database was created by the
EU-funded consortium Galar Fungail by performing independent annotation
of assembly 19 sequence data obtained from the Stanford Genome
Technology Centre, http://www-sequence.stanford.edu/group/candida);
for A. nidulans, http://www.broad.mit.edu/annotation/fungi/aspergillus/
(Aspergillus nidulans Database); and for N. crassa,
http://www-genome.wi.mit.edu/annotation/fungi/neurospora (Neurospora
crassa Database release 3: 02.12.2002). All protein sequences that were
annotated as known or suspected pseudogenes and those proteins encoded by
mitochondrial genomes were removed. Gene families in each genome were
identified by using BLASTCLUST (30% of identical residues and aligned
over at least 80% of their lengths). BLASTCLUST applies the
single-linkage algorithm. For documentation on its use, see
ftp://ftp.ncbi.nlm.nih.gov/blast/documents/README.bcl. The clusters were used to
identify and count duplication events (although not all pairs of genes in
the cluster are homologous to each other). Throughout the analysis, the
same criteria were applied in searching for orthologs of genes from all
other species, that is to say, orthologs were predicted by BLASTP search
for interspecies genes with > 30% identical residues and alignable region
over at least 80%.
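The single-linkage grouping described above can be sketched as follows. This is not BLASTCLUST itself: the input is assumed to be a list of pre-computed pairwise hits, the function and variable names are hypothetical, and requiring the 80% coverage on both sequences is an assumption of this example.

```python
def single_linkage_families(genes, hits, min_identity=30.0, min_coverage=0.8):
    """Group genes into families by single linkage.

    hits: iterable of (gene_a, gene_b, percent_identity,
                       aligned_length, length_a, length_b).
    """
    parent = {g: g for g in genes}

    def find(g):
        # Follow parent pointers to the family representative (with path halving).
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b, ident, aligned, len_a, len_b in hits:
        covered = (aligned / len_a >= min_coverage
                   and aligned / len_b >= min_coverage)
        if ident >= min_identity and covered:
            parent[find(a)] = find(b)  # one qualifying hit is enough to link

    families = {}
    for g in genes:
        families.setdefault(find(g), []).append(g)
    return sorted((sorted(f) for f in families.values()),
                  key=lambda f: (-len(f), f))


# Toy data: g1-g2 and g2-g3 qualify; g3-g4 fails the coverage criterion.
genes = ["g1", "g2", "g3", "g4"]
hits = [
    ("g1", "g2", 45.0, 90, 100, 100),
    ("g2", "g3", 35.0, 85, 100, 105),
    ("g3", "g4", 60.0, 40, 100, 100),
]
families = single_linkage_families(genes, hits)
print(families)
```

In the toy data g1 and g3 end up in the same family although they were never linked directly, which is why not all pairs of genes within a cluster need to satisfy the criteria themselves.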
8.3.2 Estimation of substitution rate
Gene families with sequences similar to known transposable elements
were removed at this point and excluded from the rest of analysis. Paralo-
gous protein sequences were aligned using ClustalW version 1.82 with the
default parameters (PAM matrix; gap opening penalty = 10.0; gap extension
penalty = 0.2). The corresponding nucleotide-sequence alignments
were derived by substituting the respective coding sequences for the
protein sequences using MBEToolbox (Chapter 10). Ks and Ka were
calculated by the method of maximum-likelihood, which is implemented
in the CODEML program of the PAML package version 3.13d [359].
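Deriving a nucleotide alignment from a protein alignment amounts to threading each coding sequence through its gapped protein sequence, three bases per residue. The sketch below illustrates the idea in Python; it is not the MBEToolbox routine (which is Matlab-based), and the function name is hypothetical.

```python
def thread_codons(aligned_protein, cds):
    """Map an ungapped coding sequence onto a gapped protein sequence.

    Each residue consumes one codon; each protein gap becomes '---'.
    """
    n_residues = sum(1 for c in aligned_protein if c != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for the protein")
    out, pos = [], 0
    for c in aligned_protein:
        if c == "-":
            out.append("---")
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


# Toy pair: the second protein has a one-residue gap relative to the first.
codon_aln_1 = thread_codons("MSKT", "ATGTCCAAAACC")
codon_aln_2 = thread_codons("M-KT", "ATGAAAACC")
print(codon_aln_1)
print(codon_aln_2)
```

Because gaps are inserted in whole codons, the reading frame is preserved and the result can be passed to a codon-based program.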
Following the procedure described in Zhang et al. [371], the pair of
duplicate genes with the smallest value of Ks was picked within each
family. This process was repeated for the remaining genes within the
family until there were no gene pairs that could be picked. The process
was implemented by ad hoc scripts in Perl.
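The pair-picking procedure can be sketched as a greedy loop over the pairs of one family sorted by Ks. The original scripts were written in Perl and are not reproduced here; the Python function below, its name and the toy Ks values are assumptions made for illustration.

```python
def pick_independent_pairs(ks_by_pair):
    """Greedily pick non-overlapping gene pairs in order of increasing Ks.

    ks_by_pair: dict mapping (gene_a, gene_b) -> Ks for one gene family.
    Returns a list of (gene_a, gene_b, Ks); each gene is used at most once.
    """
    used, picked = set(), []
    for (a, b), ks in sorted(ks_by_pair.items(), key=lambda kv: kv[1]):
        if a not in used and b not in used:
            picked.append((a, b, ks))
            used.update((a, b))
    return picked


# Toy family of four genes with made-up pairwise Ks values.
toy_family = {
    ("A", "B"): 0.1, ("A", "C"): 0.2, ("B", "C"): 0.3,
    ("C", "D"): 0.9, ("B", "D"): 1.1, ("A", "D"): 1.2,
}
pairs = pick_independent_pairs(toy_family)
print(pairs)
```

Walking the sorted list once and skipping pairs that reuse a gene gives the same result as repeatedly picking the smallest-Ks pair among the remaining genes.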
To plot Ka versus Ks, pairs with Ks > 5.0 or Ka > 5.0 were elim-
inated because such high sequence divergence is often associated with
problems like difficulty in alignment, different codon usage biases or
nucleotide compositions in the different sequences. Ks is known to be
strongly distorted by codon usage bias [283]. The codon adaptation index
(CAI) [282] was used as a measure of codon bias. I therefore calculated
average values of CAI for all gene pairs and excluded those with average
CAI > 0.5 from the analysis.
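The exclusion criteria of this section can be combined into a single predicate, as in the hedged sketch below. The function and parameter names are hypothetical; the thresholds are those stated in the text.

```python
def keep_pair(ks, ka, cai_a, cai_b, max_k=5.0, max_mean_cai=0.5):
    """Return True if a gene pair passes the divergence and codon-bias filters.

    Pairs with Ks > 5.0 or Ka > 5.0, or with an average CAI > 0.5, are excluded.
    """
    mean_cai = (cai_a + cai_b) / 2.0
    return ks <= max_k and ka <= max_k and mean_cai <= max_mean_cai


print(keep_pair(0.30, 0.05, 0.20, 0.40))  # moderate divergence, low codon bias
print(keep_pair(5.20, 0.80, 0.20, 0.40))  # Ks too high
print(keep_pair(0.30, 0.05, 0.60, 0.70))  # strong codon usage bias
```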
8.3.3 Relative rate test
The relative evolutionary rate test aims to compare the substitution rates
of two sequences or two groups of sequences. Here it was applied to
compare the evolutionary rate of two copies of a duplicate gene pair.
In the test I only used recently duplicated genes (i.e., duplicate genes
with Ks < 0.5). These ‘young’ duplicates have fewer multiple substitutions,
so their divergence can be estimated more accurately than that of older
ones. In addition, very young duplicates (Ks < 0.05) were excluded because
they have too few substitutions for a statistical test to reach
significance [199].
Table 8.1: Distribution of multigene families in fungi. Abbreviations: SC - S. cerevisiae; SP - S. pombe; CA - C. albicans; AN - A. nidulans; PM - P. marneffei; NC - N. crassa.

Family size                                        SC     SP     CA     AN     PM     NC
1                                                  4500   4104   5276   7887   8725   9274
2                                                  390    229    188    320    291    198
3                                                  54     34     41     84     64     43
4                                                  23     18     29     38     26     22
5                                                  11     4      8      17     10     5
6-10                                               17     18     24     29     29     15
11-20                                              7      4      2      9      3      3
>20                                                2      0      2      5      5      1
Number of multigene families (size >= 2)           504    307    294    502    428    287
Total genes used in the analysis                   5889   4939   6165   9541   10060  10082
Number of genes in families                        1389   835    1189   1654   1335   808
Number of young duplicate gene pairs (Ks < 0.5)    165    51     50     43     52     10

In order to apply the relative rate test, I obtained outgroup sequences
for these young gene pairs. Each relative rate test was based on one gene
pair and its outgroup, forming a triplet. Outgroups were selected
using the method described in Conant and Wagner [59]. When more
than one outgroup sequence was available, either from the same genome
or from other genomes, the triplet of genes closest to each other in
synonymous divergence, Ks, was chosen. I used two likelihood ratio
(LR) tests to test for asymmetric divergence in both amino-acid and
codon substitution rates. The codon substitution rate was estimated using
the codon substitution model described by Goldman and Yang [111]. To do
the LR test, two models were applied to the data: model 0 constrains the
amino-acid or codon substitution rates to be equal in the two sequences;
and model 1 assumes the rates are free parameters (hence they could be
unequal to each other in the two sequences). Maximum likelihood values
ML0 and ML1 from models 0 and 1 were collected and the likelihood ratios
were calculated as LR = 2(ln(ML1) − ln(ML0)). LR was then compared against
the χ² distribution with one degree of freedom, as detailed by Yang [358].
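The likelihood ratio test between model 0 (equal rates) and model 1 (free rates) can be sketched as follows. The log-likelihoods are assumed to come from an external program such as CODEML; the function name and the example values are made up for illustration. For one degree of freedom the upper tail of the chi-square distribution can be written with the complementary error function, which avoids any dependency outside the standard library.

```python
import math


def likelihood_ratio_test(lnl_equal, lnl_free):
    """Compare model 0 (equal rates) with model 1 (free rates).

    Returns (LR, p-value), where LR = 2 * (lnL_model1 - lnL_model0) is
    compared against a chi-square distribution with one degree of freedom.
    """
    lr = 2.0 * (lnl_free - lnl_equal)
    lr = max(lr, 0.0)  # guard against tiny negative values from optimisation noise
    # For 1 d.f.: P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return lr, p_value


# Made-up log-likelihoods for one gene triplet.
lr, p = likelihood_ratio_test(lnl_equal=-1002.0, lnl_free=-1000.0)
print(f"LR = {lr:.2f}, p = {p:.4f}")
```

A pair would be called asymmetric when the p-value falls below the chosen significance level, for example 0.05, which corresponds to LR > 3.84.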
8.4 Results
8.4.1 Extent of gene duplication in ascomycetes
As shown in Table 8.1, 1,389 (23.6%) of 5,889 genes in S. cerevisiae belong
to multigene families (including at least two genes), compared with 16.9%
in S. pombe, 19.3% in C. albicans, 17.3% in A. nidulans, 13.3% in P.
marneffei, and only 8.0% in N. crassa.
When comparing the number of young duplicates (Ks < 0.5), I found that
genes in young duplicate pairs account for 23.8% of the genes in multigene
families in S. cerevisiae, 12.2% in S. pombe, 8.4% in C. albicans, 5.2% in
A. nidulans, 7.8% in P. marneffei, and only 2.5% in N. crassa (Table 8.1).
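The percentages quoted above can be reproduced from the last rows of Table 8.1. In the sketch below the young-duplicate percentage is computed as two genes per young pair divided by the number of genes in families; this reading is inferred from the numbers in the table, which it reproduces for all six species, rather than stated explicitly in the text.

```python
# Values transcribed from Table 8.1; the data layout is illustrative only.
table_8_1 = {
    # species: (total genes, genes in families, young duplicate pairs)
    "SC": (5889, 1389, 165),
    "SP": (4939, 835, 51),
    "CA": (6165, 1189, 50),
    "AN": (9541, 1654, 43),
    "PM": (10060, 1335, 52),
    "NC": (10082, 808, 10),
}

percent = {}
for sp, (total, in_families, young_pairs) in table_8_1.items():
    pct_in_families = 100.0 * in_families / total
    # Assumption: each young pair contributes two genes.
    pct_young = 100.0 * (2 * young_pairs) / in_families
    percent[sp] = (pct_in_families, pct_young)
    print(f"{sp}: {pct_in_families:.1f}% in families, {pct_young:.1f}% young")
```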
Apparently S. cerevisiae contains more multigene families and more
recently duplicated genes than any other fungus in this analysis. This
is in concordance with an earlier study [345]. A whole-genome duplication
approximately 10^8 years ago was proposed as an explanation for the
presence of many duplicate genes [279]. S. pombe, C. albicans, A.
nidulans and P. marneffei contain moderate numbers of duplicated genes, to
roughly the same extent as each other. Very few duplicated genes are
present in the N. crassa genome. This low number of duplicate genes is
consistent with results reported previously [101,231].
Table 8.2 lists the largest multigene families in each species.
S. cerevisiae contains a large number of transposable elements, which play
an important role in creating duplications in the yeast genome [366]. The
largest multigene families of S. cerevisiae include a group of proteins,
the seripauperins, whose function(s) remain poorly understood [332]. A
comparable number of predicted sugar transporters is found in N. crassa
and S. cerevisiae. Transporter and reductase gene families are expanded in
filamentous fungi. Interestingly, P. marneffei has a large gene family of
24 putative pepsin-like proteases, which is not so substantial in the
other fungi studied here.
8.4.2 Age distribution of duplicate genes
Table 8.2: Large multigene families in fungi.

Fungi           Size of family   Function/Product
S. cerevisiae   20               Hexose transporter
                20               Seripauperins
                17               Amino acid permease
                15               GTP-binding protein
                13               Helicase
S. pombe        20               Multidrug resistance protein
                17               GTP-binding protein
                12               Amino acid permease
                11               Retrotransposable element
                10               Protein kinase
C. albicans     23               Unknown proteins
                21               Amino acid permease
                13               GTP-binding protein
                11               Ferric reductase transmembrane component
                9                Unknown proteins
A. nidulans     61               Hexose transporter
                42               Putative transporter
                36               Oxidoreductase
                28               Multidrug resistance protein
                21               Aldehyde dehydrogenase
P. marneffei    34               MFS multidrug transporter
                31               Short chain dehydrogenase/reductase family
                27               Hexose transporter protein
                24               Pepsin-type protease
                23               Major facilitator superfamily
N. crassa       21               Oxidoreductase
                17               Phosphoethanolamine N-methyltransferase
                16               Hexose transporter
                11               Aldehyde dehydrogenase
                10               Endoglucanase

Figure 8.1: Frequency distribution of Ks. Frequency distribution of duplicate gene pairs as a function of the number of synonymous substitutions per synonymous site (Ks). Arrow indicates the second peak in P. marneffei. Panel summaries: C. albicans, N = 198, mean = 2.09, Std. Dev = 1.63; P. marneffei, N = 174, mean = 2.16, Std. Dev = 1.72; A. nidulans, N = 142, mean = 2.47, Std. Dev = 1.82; S. pombe, N = 123, mean = 1.16, Std. Dev = 1.30; S. cerevisiae, N = 313, mean = 1.39, Std. Dev = 1.65; N. crassa, N = 48, mean = 2.47, Std. Dev = 1.49.

Figure 8.2: Log-log plots of Ka vs. Ks for duplicate gene pairs. Log-log plots of the number of nonsynonymous substitutions per nonsynonymous site (Ka) vs. the number of synonymous substitutions per synonymous site (Ks) for duplicate gene pairs. Each point represents a single pair of gene duplicates. Points below the diagonal (Ka < Ks) imply the genes have been subjected to purifying selection against amino acid changes. Open points denote orthologous gene pairs. Panels: S. cerevisiae, S. pombe, C. albicans, N. crassa, A. nidulans, P. marneffei.

In general, we assume Ks increases approximately linearly with time
because synonymous substitutions do not alter the amino-acid sequence
and therefore there will be lower constraint due to natural selection [212].
If this assumption largely holds, the distribution of Ks can be used as an
indicator for the distribution of duplication events along a time scale. I
plotted the frequency distribution of pairs of duplicate genes as a function
of Ks in Fig. 8.1. An obvious pattern found in all species
is that most gene duplicates are young and the density of duplicates
drops off with increasing Ks. The distribution for C. albicans shows a flat
pattern, in which the gene pairs are evenly distributed over Ks, with a
peak around Ks = 0.2. This may indicate that small-scale gene duplications
happened persistently during the course of evolution.
For P. marneffei, there are two peaks in the plot: the first is a
high peak in the age distribution centred around Ks = 0.1, indicating
that there are a large number of gene pairs of a similar recent age; the
second is a low, broad peak spanning Ks = 2.0 to 4.5. I speculate that the
second peak is a trace of ancient gene duplication events on a relatively
large scale. This proposed ancient duplication would have created many
duplicate gene pairs. After such a long evolutionary time, most of these
gene pairs would be expected to have mutated and become divergent.
Only some pairs retain some degree of similarity, which gives rise to
the second peak. This dual-peak pattern is not readily observed in other
fungal species, except for N. crassa, whose second peak might result
from gene duplication prior to the development of the repeat-induced
point mutation (see below).
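An age distribution of the kind discussed above is simply a histogram of Ks values over the gene pairs. The sketch below bins Ks between 0 and 5.0; the bin width of 0.5, the function name and the toy values are assumptions of this example, not the settings used for Fig. 8.1.

```python
def ks_histogram(ks_values, bin_width=0.5, max_ks=5.0):
    """Count gene pairs per Ks bin over the range [0, max_ks]."""
    n_bins = int(round(max_ks / bin_width))
    counts = [0] * n_bins
    for ks in ks_values:
        if 0.0 <= ks <= max_ks:
            # Put Ks == max_ks into the last bin instead of overflowing.
            idx = min(int(ks / bin_width), n_bins - 1)
            counts[idx] += 1
    return counts


# Toy Ks values (made up); 6.0 lies outside the range and is ignored.
hist = ks_histogram([0.1, 0.2, 0.6, 2.5, 5.0, 6.0])
print(hist)
```

A second, older burst of duplication would show up as a local rise in the counts at high Ks, after the initial decline from the youngest bins.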
8.4.3 Selective constraint between paralogs
As mentioned in the Introduction, Ka/Ks is used as a measure of selective
constraint between two copies of duplicate genes. The smaller the Ka/Ks
value, the stronger the selective constraint between the two copies. Table
8.3 gives the estimated Ka/Ks values in different fungi.
Comparison of Ka/Ks values for different fungi revealed that the
strength of selection is generally similar among yeasts (i.e., S. cerevisiae,
S. pombe and C. albicans), and among moulds (i.e., A. nidulans, P.
marneffei and N. crassa). There is a substantial difference in Ka/Ks between
yeasts and moulds. The strongest purifying selection is among the S.
cerevisiae paralogs and the weakest purifying selection is in A. nidulans.
Mould paralogs show significantly weaker functional constraints, indicated
by larger values of Ka/Ks, than those in yeasts (Student’s t-tests
for pairwise comparisons).
Table 8.3: Ratio of nonsynonymous to synonymous substitution rates (Ka/Ks) for recently diverged paralogs (0.05 < Ks < 0.5).

Fungi           No. of gene pairs   Ka/Ks (mean ± SD)
S. cerevisiae   89                  0.134 ± 0.166
S. pombe        22                  0.148 ± 0.234
C. albicans     34                  0.245 ± 0.224
A. nidulans     12                  0.491 ± 0.214
P. marneffei    29                  0.456 ± 0.231
N. crassa       9                   0.359 ± 0.276
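A pairwise comparison of two species can be approximated from the summary statistics in Table 8.3 alone. The sketch below computes a t statistic using Welch's unequal-variance form, which is a substitution made for this example: the text reports Student's t-tests and does not say which variant was used or give the resulting statistics. The function name is hypothetical.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and approximate degrees of freedom from summary data."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# A. nidulans vs. S. cerevisiae, using the means, SDs and pair counts of Table 8.3.
t_stat, dof = welch_t(0.491, 0.214, 12, 0.134, 0.166, 89)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

The statistic would then be compared against the t distribution with the computed degrees of freedom to obtain a p-value.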
8.4.4 Ka/Ks between paralogs and orthologs
Ka/Ks is also used to estimate the selective constraints acting on or-
thologs. I therefore also characterised rates of synonymous and nonsyn-
onymous substitution of orthologs for each genome. By plotting Ka as
a function of Ks and superimposing data from paralogs onto those from
orthologs, we can get an overall view of how natural selection acts on two
groups of comparisons (Fig. 8.2).
In all species, overall Ka values are much smaller than Ks values,
which implies that the vast majority of duplicate genes are subject to
purifying selection. In C. albicans, A. nidulans and P. marneffei, gene
pairs with smaller Ks tend to gather round the diagonal line (Ka/Ks = 1)
and gene pairs with larger Ks tend to fall away from the line. It seems
that, in these three species, recent duplicates
tolerate more amino-acid replacement substitutions than older duplicates.
In mould species, the purifying selection acting on paralogs
is weaker than that acting on orthologs with the same level of sequence
divergence. As shown in Fig. 8.2, at the same level of Ks, most of
the open points are below clusters of closed points, that is to say, Ka
in paralogs is generally larger than that of orthologs in A. nidulans, P.
marneffei and N. crassa. On the other hand, there is no difference in
overall Ka/Ks between paralogs and orthologs in the yeasts S. cerevisiae,
S. pombe and C. albicans.
8.4.5 Relative evolutionary rate between paralogs
The two copies of a paralog pair may evolve at different rates. If most
paralog pairs evolve in such an asymmetric way, it may indicate that
Ohno’s neofunctionalisation theory is plausible. Therefore, as mentioned,
many studies on the relative evolutionary rates between paralogs have
been conducted. However, these studies have led to different conclusions.
Two critical aspects responsible for the success of such analyses are the
sensitivity of methods and the appropriateness of the outgroup used.
Here I used a method that incorporates a codon-based model. Gen-
erally speaking, methods relying on codon-based models (for example,
[111, 226]) are more sensitive than nucleotide-based tests and amino-
acid based tests, because, in the latter two, one cannot distinguish be-
tween silent substitutions and amino-acid replacement substitutions [59].
A codon-based model, however, takes into account the ratio between the rates
of nonsynonymous and synonymous substitutions which gives a more di-
rect measure of the strength of selection or functional constraints on the
gene.
The major issue in choosing an outgroup is that the potential outgroup
cannot be too distant evolutionarily from the paralogs being studied;
otherwise, saturation at synonymous sites for many genes will interfere with
the power of the statistical test. To avoid this influence, Kondrashov et
al. [174] used a within-genome approach, since their study included four
highly diverged eukaryotic organisms, S. cerevisiae, A. thaliana, C. ele-
gans and D. melanogaster. By using the within-genome approach, they
identified outgroups of S. cerevisiae paralogs within the S. cerevisiae
genome itself. In addition, they required that the two paralogs be closer
in amino-acid sequence to each other than to the outgroup. This extra
condition, which has probably led to an underestimate of asymmetric divergence,
was criticised by Conant and Wagner [59], who adopted a similar
within-genome approach in multiple eukaryotes.
In the selection of gene duplicates and their outgroups, I adopted a
method similar to that of Conant and Wagner [59]. The only modification
made was to search all fungal genomes for outgroups, instead of using
the within-genome approach.
I identified a total of 163 triplets (composed of two paralogs and one
corresponding outgroup) which included 101 triplets based on paralogs
from S. cerevisiae, 6 from S. pombe, 50 from C. albicans, 2 from A.
nidulans, 3 from P. marneffei, and 1 from N. crassa.
Because the majority of triplets are from S. cerevisiae and C. albi-
cans, the following analysis has no power to distinguish differences among
species. Instead it can only be considered as a comprehensive analysis
dealing with the subject of ascomycetes as a whole.
I adopted the model of Goldman and Yang [111] (see Methods) in
the comparison of the relative rates in amino-acid substitution between
each of the paralogs. The result shows that, of a total of 163 analysed
gene pairs from the ascomycetes, 29 (17.8%) evolve at a significantly
(p < 0.05) different rate (Table 8.4). This figure includes 12 (11.9%) of
101 triplets in S. cerevisiae and 17 (32.7%) of 52 in C. albicans. In the
majority of cases, both paralogs evolved at approximately the same rate,
under a similar level of purifying selection.
In order to examine whether Ka/Ks ratio is the factor causing asym-
metry in evolutionary rates between paralogs, I estimated the asymmetry
of Ka/Ks ratios between two paralogs. A 2 × 2 χ² test failed to reject
the null hypothesis that the number of pairs with different Ka/Ks ratio
is independent of the number of pairs with different amino-acid substi-
tution rates (Table 8.4). That is to say, there is no correlation between
different Ka/Ks ratios and different amino-acid substitution rates.
Table 8.4: Amino-acid substitution rates versus Ka/Ks ratios in two copies of duplicate genes. Columns show gene pairs with different or equal amino-acid substitution rates between two paralogs; rows show gene pairs with different or equal Ka/Ks ratios between two paralogs.
                        Different Ka   Equal Ka   Total
Different Ka/Ks ratio   3              10         13
Equal Ka/Ks ratio       26             124        150
Total                   29             134        163
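The test of independence on Table 8.4 can be reproduced with a short sketch. It assumes a Pearson χ² statistic with one degree of freedom and no continuity correction; which variant was actually used is not stated, and the function name is hypothetical. Because one expected count is small (13 × 29 / 163 ≈ 2.3), an exact test might be a reasonable alternative.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts taken from Table 8.4 (rows: Ka/Ks different or equal; columns: Ka different or equal)
chi2, p = chi_square_2x2(3, 10, 26, 124)
print(f"chi2 = {chi2:.3f}, p = {p:.2f}")
```

Under these assumptions the p-value is far above 0.05, in line with the failure to reject independence reported above.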
8.5 Discussion
This study took advantage of the availability of genome sequences of P.
marneffei and five other ascomycetes, S. cerevisiae, S. pombe, C. albicans,
A. nidulans and N. crassa. It also relied on the recent development
of methods to analyse selective constraints on duplicate genes in each
genome. Given the considerable phenotypic variation between the two
groups of distinct ascomycetes, yeasts and moulds, I speculated that gene
duplication may play an evolutionary role at different levels and selection
patterns of duplicate genes may be different. To my knowledge, no similar
analysis has been conducted in fungi, despite several genome-level studies
on gene duplications using S. cerevisiae as one of their model eukaryotic
organisms [95].
8.5.1 Gene duplication in ascomycetes is highly diverse
Most genomes show a certain degree of redundancy caused by single-
gene duplication, chromosomal segment duplication or complete genome
duplication (through polyploidisation). So do the ascomycetes I studied.
S. cerevisiae S. cerevisiae has the largest amount of gene redundancy
among all ascomycetes I analysed. Previous studies have revealed that
its genome contains approximately 55 large duplicated chromosomal re-
gions [345]. It has been widely accepted that the duplicated regions
found in the modern Saccharomyces species are probably the result of
a whole-genome duplication (tetraploidisation) approximately 10⁸ years
ago [95, 250, 279, 280, 345]. This proposed genome duplication might coincide
with the origin of the ability to grow under anaerobic conditions,
one of the most striking physiological differences between S. cerevisiae and
other yeasts.
S. pombe S. pombe and S. cerevisiae have been separated for as long
as 420 million years [289]. Comparing the two yeasts, S. pombe has
fewer gene duplications than S. cerevisiae, which may account in part
for the smaller genome size. Transposable elements exist in the S. pombe
genome. However, their proportion is low compared to S. cerevisiae.
Using phylogenetic analysis, Hughes and Friedman [136] suggested that
parallel gene duplication appears to have played a role in the independent
origin of similar adaptations in the two unicellular fungi, S. pombe and S.
cerevisiae. That is to say, gene duplications have occurred independently
in the same gene families in S. pombe and S. cerevisiae; S. pombe
has adapted to a similar unicellular lifestyle without polyploidisation.
C. albicans The age distribution of relatively young duplicate genes
(Ks < 5) in C. albicans (Fig. 8.1) suggests that duplication events have
probably occurred continuously during the course of evolution in this yeast.
In neither S. pombe nor C. albicans has any evidence of polyploidisation,
such as duplicated genomic blocks, so far been found. Hence,
genome duplication, as happened in S. cerevisiae, which may represent
an extreme adaptive strategy in providing genetic raw material for func-
tional divergence of novel genes, has not occurred in C. albicans.
A. nidulans A. nidulans contains a relatively large number of recently
duplicated gene pairs: 43 in total with Ks < 0.5. The age distribution of
duplicate genes (Ks < 5) in A. nidulans displays a high peak at Ks = 0.1
to 0.2 and shows a pattern similar to that in S. cerevisiae (Fig. 8.2).
However, S. cerevisiae has undergone genome duplication and there are
extensive duplicated blocks in its genome as the traces of the proposed
ancient tetraploidy that remain detectable after widespread deletion of
superfluous duplicate genes and sequence divergence. Most of the gene pairs
in these duplicated regions are believed to have been produced simultaneously
or within a narrow time frame [95]. Based on the similar patterns
of age distribution of gene pairs between A. nidulans and S. cerevisiae,
I propose that duplicate genes in A. nidulans probably originated
through one or more episodic, large-scale gene duplications in a relatively
short period of time. What is uncertain is whether such a peak of gene
duplication over the course of evolution implies a polyploidisation event
in A. nidulans. As noted by Friedman and Hughes [95], a peak of gene
duplication need not imply a polyploidisation event. Therefore, it would
be interesting to know how many duplicated blocks are present within
and between A. nidulans chromosomes when the genome sequencing of
A. nidulans is completely finished.
P. marneffei Slightly fewer genes in P. marneffei belong to multigene
families than in A. nidulans. However, 52 pairs are young duplicate
genes, compared to 43 in A. nidulans. There is no difference in the overall
extent of duplicate genes between these two close species. The pattern
of the Ks histogram is broadly similar to those of A. nidulans and S.
cerevisiae. A difference is the dual-peak pattern, seemingly implying
that besides the modern duplications, there was an ancient large-scale
duplication. The modern peak is at a similar location, Ks = 0.1, to
that of A. nidulans and other fungi, but on a smaller scale (less than 25%
of genes belong to this peak) compared to that of A. nidulans. In contrast,
the second peak at Ks = 2.0 to 4.0 is more apparent than in other fungi
except N. crassa. More evidence is needed before any solid conclusion
can be reached though.
N. crassa N. crassa exhibits much greater morphological and developmental
complexity than the yeasts. Its genome is approximately three times the size
of the S. cerevisiae genome, and accordingly has a protein count much
larger than those of yeasts. However, the paucity of duplicate genes in
N. crassa is obvious: (1) the number of multigene families in N. crassa is
much smaller than that in yeast, and (2) the number of gene pairs with
a small Ks (0.05 < Ks < 0.5) in N. crassa is much smaller than those
in unicellular yeasts (Table 8.1). An extraordinary feature of N. crassa,
repeat-induced point (RIP) mutation [219], has been suggested to play a
major role in preventing gene innovation through gene duplication and
to be responsible for this paucity. RIP, acting as a defence against mobile
DNA [219], can detect and mutate both copies of a sequence duplication.
In fact, RIP is so efficient that all gene duplications remaining
in the N. crassa genome have been proposed to have arisen and been fixed before
the emergence of the RIP mechanism. Examples of the remaining multigene
families that may have ‘survived’ RIP include hexose transporters and
cellulases (Table 8.2). N. crassa may have other mechanisms of gene
innovation, since gene duplication has rarely occurred in its genome.
Ascomycetes display a wide variation in the number of gene duplica-
tion events. This may have provided the foundation for specialisation of
a number of genes and their corresponding proteins, and formed the basis
for diversification. Amplification of their genetic material might increase
their fitness in adapting to the environment. Examples include genes
for the yeast hexose transporters, which increase fitness in low glucose; genes
for N. crassa cellulases, which allow growth on decaying plant material; and genes
for cytochrome P450 and efflux systems involved in detoxification.
8.5.2 Different selective constraints in yeasts and filamentous ascomycetes
There are different models, such as the classical model and the duplication-
degeneration-complementation (DDC) model, to explain the creation of
novel genes by gene duplication. The classical model emphasises that
one copy is neutral and free to evolve while the other remains under
selective pressure. The DDC model [90] explains sub-functional divergence
when a gene has been duplicated. According to the DDC model,
the two gene copies then acquire complementary loss-of-function mutations
in independent sub-functions. Thus both genes are required to produce
the full complement of functions of the single ancestral gene. Both the
classical model and the DDC model predict a period immediately following
duplication when the genome should be able to tolerate a high degree of
nonsynonymous substitutions in one member of a duplicate pair because
the other member is still functioning at full strength.
Comparing Ka with Ks in each genome, I found a common pattern
in all fungi which is in partial agreement with these theoretical expec-
tations. First, in either filamentous fungi or yeasts, purifying selection
was dominant against amino acid changes in paralogous genes. This
confirms the earlier observation that paralogs evolve under purifying se-
lection [211], which challenges the classical model but supports the DDC
model. Second, recent duplicates with smaller Ks appear to tolerate
more replacement amino-acid substitutions than older duplicates, which
is compatible with both models.
I also found two exclusive patterns in filamentous fungi. The first
finding is that there are significantly (p < 0.01) higher values of the
Ka/Ks ratio in paralogs in moulds than in those in yeasts with a similar
level of divergence (Table 8.3). Filamentous fungi show greater morphological
and developmental complexity than do yeasts, and their genomes
are normally larger. As gene duplication is a source of novel protein
functions, the bigger genome size may partially result from frequent
gene duplications that provided a basis for divergence, resulting
in an increase in novel genes through neofunctionalisation, or an
increase in gene number through subfunctionalisation. Therefore,
the higher value of the Ka/Ks ratio in paralogs in moulds may imply that, at
a similar stage after duplication, gene pairs in filamentous fungi have
faster evolutionary rates than those in yeasts. Either positive selection
or relaxed functional constraint can cause a higher value of the Ka/Ks
ratio. Few gene pairs in moulds are actually found to be under positive
selection when Ka/Ks > 1 is used as the indicator of positive selection. Thus,
relaxed functional constraint, reflected in a slightly elevated Ka relative to Ks,
most likely accounts for the larger value of Ka/Ks in gene pairs in moulds.
Another interesting finding is that paralogs in A. nidulans, P. marn-
effei and N. crassa appear to be under weaker functional constraint than
orthologs at the same age. In other words, orthologs in moulds expe-
rience stronger functional constraints than paralogs. Natural selection
seems to allow paralogs in these three filamentous fungi to mutate with
less constraint, which may lead to more advantageous mutations. This
phenomenon was first observed in eukaryotes [174] but it has not been
reported in fungi. Note that this trend is not observed in the unicellular
yeasts, S. cerevisiae, S. pombe, and C. albicans. Therefore, it is suggested
that elevated functional constraint in orthologs or weaker functional con-
straint in paralogs is a more common feature in the evolutionary pattern
of multicellular eukaryotes.
8.5.3 Majority of paralogous genes evolve symmetrically
Estimation of asymmetric evolutionary rates was conducted mainly on paralogs
from S. cerevisiae and C. albicans, so the result should not be
applied to other species. Of a total of 163 analysed gene pairs, 29 (17.8%)
evolve at significantly (p < 0.05) different rates (Table 8.4). Therefore,
in the majority of cases at least in S. cerevisiae and C. albicans, both
paralogs evolved at approximately the same rate, under similar levels of
purifying selection.
Several similar studies have been done in S. cerevisiae and in several
other eukaryotes. Some concluded that both copies of a duplicate gene typically
evolved at the same rate [137,174,265], whereas others suggested that
asymmetric divergence between two paralogs is not uncommon. Because
different organisms were used in those studies and different methods with
varying sensitivities were applied, it is hard to compare data in this study
with others directly. For instance, Kondrashov et al. [174] selected 15 S.
cerevisiae gene triplets and, using a distance-based method, found
no paralogs showing different rates. In another study, Conant and Wagner
[59] identified six of 22 (27%) gene triplets in S. cerevisiae, and three
(21%) of 14 in S. pombe, that showed asymmetry in Ka, using a codon-based
model following Muse and Gaut [226].
An asymmetric evolutionary rate is not always associated with an
asymmetric evolutionary constraint, as indicated by Ka/Ks. Moreover,
no simple dependence between evolutionary rate and gene function is
observed (data not shown). This finding is inconsistent with Zhang’s
finding in young paralogs of human genes [371], that genes with different
Ka/Ks ratios tend to evolve at different rates, suggesting that different
functional constraints might be largely responsible for the unequal evo-
lutionary rates. The incongruence may be again due to the difference in
species used in the studies.
In conclusion, this chapter reports the variation in the extent of gene
duplications in ascomycetes. The age distribution of gene duplications
tentatively suggests that the P. marneffei genome has experienced a
recent as well as an ancient large-scale duplication. Analysis of the di-
vergence of evolutionary rates in S. cerevisiae and C. albicans revealed
that less than 20% of gene pairs in these two yeasts show asymmetric
divergence patterns in amino-acid substitutions. I speculate that the dif-
ferent extent and evolutionary pattern of duplicate genes in ascomycetes
might be associated with their genotypical and phenotypical differences.
Chapter 9
ACCELERATED EVOLUTIONARY RATE MAY BE
RESPONSIBLE FOR THE EMERGENCE OF
LINEAGE-SPECIFIC GENES
Once the genome of Penicillium marneffei became available, genes
could be predicted and annotated. Hundreds of these predicted genes lack
homology to any known gene. They are species-specific genes, also called
“orphan” genes. Where do these genes come from? This is still a mystery.
One suggestion has been that most orphan genes evolve rapidly
so that similarity to other genes cannot be traced after a certain evolu-
tionary distance. This can be tested by examining the divergence rates
of genes with different degrees of lineage specificity. Here the lineage
specificity (LS) of a gene describes the phylogenetic distribution of that
gene’s orthologs in related species. Highly lineage-specific genes will be
distributed in fewer species in a phylogeny.
In this chapter, I used the complete genomes of seven ascomycetes
and two animals to define several levels of LS, such as, Eukaryotes-core,
Ascomycota-core, Euascomycetes-specific, Hemiascomycetes-specific, As-
pergillus-specific and Saccharomyces-specific. The rates of gene evolution
in groups of higher LS are compared with those in groups of lower LS.
Molecular evolutionary analyses indicate a significant increase in nonsyn-
onymous nucleotide substitution rates in genes with higher LS. Multiple
regression analyses suggest that LS is significantly correlated with the
evolutionary rate of the gene. This correlation is stronger than those of a
number of other factors that have been proposed as predictors of a gene’s
evolutionary rate, including the expression level of genes, gene essentiality
or dispensability and the number of protein-protein interactions. The
significantly accelerated evolutionary rates of genes with higher LS may
reflect the influence of selection and adaptive divergence during the emer-
gence of orphan genes. These analyses suggest that accelerated rates of
gene evolution may be responsible for the origin of apparently orphan
genes.
This chapter is very closely based on a paper I have published with
colleagues [in press]. The original draft of the manuscript was revised
by Dr. David K. Smith, in the Department of Biochemistry, HKU.
A preliminary version of this work was presented at the SMBE
conference on 17th June 2004.
9.1 Introduction
During annotation of genome sequences a substantial fraction of the puta-
tive genes are found to lack sequence similarity to any of the genes in pub-
lic databases. These genes or protein-coding regions have been referred
to as “orphan” genes. Some may have crucial organism-specific functions;
however, the origin and evolution of orphan genes remain poorly
understood. A proposed explanation of this problem has been that some
genes evolve so rapidly that their homologs cannot be discovered over
larger evolutionary distances. Although this has been supported by re-
cent findings in Drosophila that orphan genes evolve, on average, more
than three times faster than non-orphan genes [73], the influence of other
factors on the evolutionary rate of genes should be taken into account.
These factors include the expression level of genes [127,241], a gene’s
dispensability (the organism’s fitness after deletion of the gene) [178],
gene essentiality [343], gene duplication [150, 357], and the number of
protein-protein interactions involving the gene’s product [93, 335]. Due
to the inherently stochastic property of evolutionary rates, the influence
of many of these factors has proved difficult to confirm and their relative
importance also needs further elaboration.
In order to systematically examine the relationship between a gene’s
evolutionary rate and the origin of orphan genes, as well as to assess the
influence of other factors, we have devised a study based on the following
rationale. Orthologs of a gene usually have a particular phyletic distri-
bution in several related species, thus giving each gene a certain lineage
specificity (LS). Orphan genes represent the extreme of LS because they
are only present in one node of a phylogeny. In contrast, highly con-
served genes have a low degree of LS and are widely distributed, while a
range of different degrees of LS can be defined for other gene groups. If
an elevated evolutionary rate is the major cause of the origin of orphan
genes, one should find a correlation between evolutionary rate and LS.
Slower evolving genes should tend to be less lineage specific.
Studying the relationship between the evolutionary rate of genes and
LS may reveal the dynamic processes that lead to the origin of species
specific, or orphan, genes. It can also be tested whether the evolutionary
rate leading to the emergence of orphan genes is relatively constant or
highly variable. If genes become lineage-specific gradually, one might
expect a simple relationship (e.g., a linear relationship, perhaps after data
transformation) between divergence time and genetic distance, otherwise,
a more complex relationship would be expected.
To investigate these matters, the complete sets of predicted protein-
coding genes from Aspergillus fumigatus (http://www.sanger.ac.uk/Projects/A_fumigatus/)
and Saccharomyces cerevisiae [110] were ex-
tracted. Orthologs of these genes from five other ascomycotan fungi,
Aspergillus nidulans (http://www.broad.mit.edu/annotation/fungi/
aspergillus/), Schizosaccharomyces pombe [354], Candida albicans [65],
Neurospora crassa [101], and Saccharomyces mikatae, and two metazoans,
Caenorhabditis elegans [79] and Drosophila melanogaster [2], were
also obtained.
The fungi studied here represent three major classes of Ascomycota:
Euascomycetes, Hemiascomycetes and Archaeascomycetes. The Euascomycetes,
which contain well over 90% of Ascomycota, include Aspergillus
and Neurospora. The Hemiascomycetes include the Saccharomyces
yeasts and Candida. The fission yeast, S. pombe, belongs to the
class Archaeascomycetes, whose members are distantly related to each other
and are possibly remnants of an early radiation of Ascomycota [289]. These fungi
also represent two major fungal morphological subdivisions, yeasts and
moulds. Yeasts, like S. cerevisiae, S. mikatae, C. albicans, as well as
S. pombe, have life cycles characterised by unicellular (occasionally di-
morphic) growth. In contrast, the filamentous ascomycota, A. nidulans,
A. fumigatus and N. crassa, predominantly grow as hyphal filaments.
Despite this morphological divergence, all of them share a relatively
recent common ancestor with respect to the rest of the eukaryotes.
The phylogeny of these ascomycota is clear and generally accepted, ex-
cept for the ancient Schizosaccharomyces, S. pombe [289].
Genes from S. cerevisiae and A. fumigatus were classified, according
to their phylogenetic profiles, into several LS groups as follows: Eukaryote-
core, Ascomycota-core, Euascomycetes-specific, Hemiascomycetes-specific,
Aspergillus-specific and Saccharomyces-specific. Average nonsynonymous
substitution rates, Ka, of genes among LS groups were compared and
correlations between LS and several other factors, for example, gene ex-
pression level, gene dispensability and gene redundancy, were explored.
The relative importance of LS and other factors, in terms of the prediction
of a protein’s evolutionary rate, was evaluated, and whether the
divergence rate is relatively constant over genes with similar degrees of
LS was tested.
9.2 Literature Review
From a gene-centric viewpoint, our understanding of evolutionary
novelty is largely limited to the consequences of new gene creation. Recent
attention has been paid to this phenomenon in genomes, yet the mechanism
remains a mystery. Some insights have been obtained, especially by studying
newly created genes (i.e., young genes) [210, 257, 204]. A number of
mechanisms that may be responsible for new gene origination have been
proposed. These include gene duplication, exon shuffling, retroposition,
lateral gene transfer, and transposable element assimilation (for review,
see [204]). Gene duplication has been reviewed in
Chapter 8.
Here I focus only on the origination of exons, the basic units of genes.
Once exons exist, exon shuffling (the recombination or exclusion of exons) is
widely recognised as important in the generation of new genes [109,244,
155]. The creation of new exons has been proposed to occur through three
possible processes: (1) exaptation of transposable elements [27, 215, 230,293],
(2) exon duplication [172,194], and (3) exonisation of intronic sequences
[173].
Exaptation of transposable elements is a process in which a retroelement
takes on new functions for a genome. It was first exemplified by
the integration of an Alu element into the coding portion of the human
decay-accelerating factor (DAF) gene [215]. In another example, an L1
retrotransposon insertion that provides a premature stop codon and polyadenylation
sites is responsible for the generation of the secreted form of the human
transmembrane protein attractin [305]. Recently, about 4% of
human genes were found to contain transposable elements in their coding
regions [230]. Exon duplication also appears to be common: a search of the
genomes of human, fly and worm found that about 10% of all genes contain
tandemly duplicated exons. These exons are likely to be involved in mutually
exclusive alternative splicing events, which might confer further evolutionary
potential [194]. Exonisation of intronic sequences is the most
easily conceived mechanism, but few examples of such a process have
been reported [173]. Wang et al. [339] identified newly evolved exons by
comparing ESTs against an outgroup to learn how new exons originate
and evolve, and how often they appear. They claim that the new
exon origination rate is about 2.71 × 10⁻³ per gene per million years and that a
much higher proportion of new exons than of old exons have Ka/Ks ratios > 1.
It is noteworthy that the gene origination processes mentioned above do
not necessarily create new genes with novel functions, but may instead yield new
variants of genes [369]. Moreover, newly evolved genes often show an
elevated evolutionary rate driven by positive selection [205,235,147,
338,369].
9.3 Materials and Methods
9.3.1 Sequences and data sets
Table 9.1: Genomic sequence sources.
Species           Web source for the sequence data
A. nidulans       www-genome.wi.mit.edu/annotation/fungi/aspergillus/
A. fumigatus      www.sanger.ac.uk/Projects/A_fumigatus
N. crassa         www-genome.wi.mit.edu/annotation/fungi/neurospora/
S. cerevisiae     genome-www.stanford.edu/Saccharomyces
S. mikatae        ftp://genome-ftp.stanford.edu/pub/yeast/data_download/sequence/fungal_genomes/S_mikatae
C. albicans       genolist.pasteur.fr/CandidaDB
S. pombe          www.genedb.org/genedb/pombe/index.jsp
C. elegans        www.sanger.ac.uk/Projects/C_elegans/wormpep
D. melanogaster   www.fruitfly.org
For each Ascomycotan, the complete set of available amino acid sequences
and coding DNA sequences was downloaded from the repositories
given in Table 9.1. All known or suspected pseudogenes and genes in mitochondrial
genomes were removed. The S. mikatae dataset is derived
from the ORF predictions of Cliften et al. [56].
Figure 9.1: LS classification based on phylogenetic profiles of genes. The tree relates A. nidulans, A. fumigatus, N. crassa, S. cerevisiae, S. mikatae, C. albicans, S. pombe and the animals, with node divergence times of 1,458, 1,144, 1,085, 841, 670 and ~10 Mya; the profile groups shown are Eukaryotes-core, Ascomycota-core, Euascomycetes-specific, Hemiascomycetes-specific, Aspergillus-specific and Saccharomyces-specific. Divergence times were adopted from Hedges and Kumar [131]. The divergence times between S. cerevisiae and S. mikatae and between A. fumigatus and A. nidulans are based on the estimates by Cliften et al. [56] and [87], respectively. A solid square (■) means the gene is present in the corresponding species; an open square (□) means it is absent.
Yeast gene expression data came from Cho et al. [51] who charac-
terised all mRNA transcript levels during the cell cycle of S. cerevisiae.
mRNA levels were measured at 17 time points at 10 min intervals, cover-
ing nearly two full cell cycles. The mean of these 17 numbers was taken
to produce one general time-averaged expression level for each protein.
Protein dispensability was assessed by the fitness effect of a single-
gene deletion, as measured by the average growth rate of the knockout
strain in several types of media. The results of assays of a nearly complete
set of single gene deletions in S. cerevisiae [297] were obtained, and the
data were manipulated following the method by Gu et al. [119]. Briefly,
the fitness value f_i is defined as r_i/r_pool, where r_i is the growth rate of
the strain with gene i deleted and r_pool is the pooled average growth rate
of different strains.
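The fitness definition above can be written as a short sketch. It assumes that the pooled average is the arithmetic mean over all strains supplied; the function name, gene labels and growth rates are hypothetical and serve only as an illustration.

```python
from statistics import mean


def fitness_values(growth_rates: dict[str, float]) -> dict[str, float]:
    """Return f_i = r_i / r_pool for each single-gene deletion strain.

    Assumption: r_pool is the arithmetic mean growth rate of all strains
    passed in.
    """
    r_pool = mean(growth_rates.values())
    return {gene: rate / r_pool for gene, rate in growth_rates.items()}


# Hypothetical growth rates for three deletion strains
rates = {"geneA": 1.0, "geneB": 0.5, "geneC": 1.5}
for gene, f in fitness_values(rates).items():
    print(gene, round(f, 3))
```

A value below 1 then marks a deletion that grows more slowly than the pooled average, that is, a less dispensable gene.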
Essential genes were taken from the dataset of the Saccharomyces Genome
Deletion Project, which contains 1,106 essential genes
(http://www-sequence.stanford.edu/group/yeast_deletion_project/). Although gene
dispensability and gene essentiality are highly associated, they were treated
as two separate variables in order to compare the results of each variable
to previous studies.
A list of protein-protein interactions among S. cerevisiae proteins
was obtained from two integrated interaction databases, YEAST GRID
[25] and the yeast subset of DIP [274], and a number of major high-
throughput studies published to date [106]. The final non-redundant set
contains 252,011 interactions involving 5,698 proteins.
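One way to build such a non-redundant set is sketched below. It assumes that an interaction is an unordered pair, so that A-B and B-A from different sources count once; the function names and protein labels are hypothetical.

```python
def merge_interactions(*sources):
    """Merge interaction lists into a non-redundant set of unordered pairs."""
    merged = set()
    for source in sources:
        for protein_a, protein_b in source:
            # frozenset ignores order, so (A, B) and (B, A) collapse to one entry
            merged.add(frozenset((protein_a, protein_b)))
    return merged


def proteins_involved(interactions):
    """Return every protein that takes part in at least one interaction."""
    return set().union(*interactions)


# Hypothetical entries standing in for two source databases
source_1 = [("P1", "P2"), ("P2", "P3")]
source_2 = [("P2", "P1"), ("P3", "P4")]

merged = merge_interactions(source_1, source_2)
print(len(merged), len(proteins_involved(merged)))
```

The number of interactions per protein, used later as a predictor of evolutionary rate, can be counted from the same merged set.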
9.3.2 Identification of orthologs
Orthologs of the genes from S. cerevisiae and A. fumigatus in each other
and in other genomes studied here were identified by the automatic clus-
tering method INPARANOID [261]. Orthologs between the genomes
of two species are derived in this method from mutual best pairwise
BLASTP hits. A further reciprocal test was applied by requiring the
longest region of local sequence similarity between putative orthologs to
cover ≥ 80% of each sequence and to have ≥ 30% sequence identity in
this region. 113 pairs that did not pass this test were excluded. A gene
was considered as being absent from another genome if no sequence sim-
ilarity could be detected between the gene and the genes in that genome.
To define the level at which sequence similarity was not detectable, a
TBLASTN expectation (E) value of 1 × 10⁻² with respect to a fixed effective
search space (set to the size of the N. crassa genome) was used as a
cut-off.
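The two filters described above (coverage and identity for putative orthologs, and the E-value cut-off for absence) can be sketched as follows. The record layout and names are hypothetical, and treating the thresholds as inclusive is an assumption; the values 80%, 30% and 1 × 10⁻² come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alignment:
    """Longest region of local similarity between two putative orthologs."""
    query_len: int        # length of the first sequence
    subject_len: int      # length of the second sequence
    aligned_query: int    # residues of the first sequence inside the region
    aligned_subject: int  # residues of the second sequence inside the region
    identity: float       # fraction of identical residues in the region (0 to 1)


def passes_reciprocal_test(aln: Alignment,
                           min_coverage: float = 0.80,
                           min_identity: float = 0.30) -> bool:
    """Require the region to cover both sequences and to be similar enough."""
    cov_query = aln.aligned_query / aln.query_len
    cov_subject = aln.aligned_subject / aln.subject_len
    return (cov_query >= min_coverage
            and cov_subject >= min_coverage
            and aln.identity >= min_identity)


def is_absent(best_evalue: Optional[float], cutoff: float = 1e-2) -> bool:
    """Treat a gene as absent when no hit reaches the E-value cut-off.

    `best_evalue` is None when the search returned no hit at all.
    """
    return best_evalue is None or best_evalue > cutoff


print(passes_reciprocal_test(Alignment(100, 120, 90, 100, 0.35)))
print(is_absent(0.5))
```

Fixing the effective search space, as described in the text, keeps E-values comparable across genomes of different sizes; that step happens in the search itself and is not modelled here.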
Orthologs of fast-evolving genes may not be detected in more distantly
related genomes by the TBLASTN search used above. To address
this, ancestral sequences were reconstructed (Collins et al. [58]), based
on the detected orthologs, using the maximum likelihood method implemented
in the PAML phylogenetic analysis package version 3.13d [359].
Ancestral sequences are expected to be less divergent from their pos-
sible orthologs in the more distant genomes and their reconstructions
were used to search, as above, for orthologs in the more distantly related
genomes. If potential orthologs were identified, the gene was excluded
from further analysis to avoid ambiguity in the assignment of genes to
LS groups.
9.3.3 Classification of genes into LS groups
Phylogenetic profiles, a gene table giving 1 (or 0) if a gene is present in (or
absent from) a genome, for the genes from S. cerevisiae and A. fumigatus,
were constructed based on the detected orthologs in the genomes studied.
The genes were then classified into the different LS groups, Eukaryotes-
core (present in all genomes studied), Ascomycota-core (present in all fun-
gal genomes), Hemiascomycetes-specific, Euascomycetes-specific, Saccharomyces-
specific and Aspergillus-specific (Fig. 9.1). The phylogenetic tree relating
the species was derived from [131].
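The classification of a phylogenetic profile into an LS group can be sketched as below for A. fumigatus genes. This is a hedged illustration: the genome keys, the two non-fungal placeholder genomes (`outgroup_1`, `outgroup_2`) and the rule that any other presence/absence pattern is left unclassified are assumptions made for the example.

```python
# Genome labels are illustrative; the two outgroups are hypothetical
# placeholders for the non-fungal genomes in the study.
EUASCOMYCETES = ["A_fumigatus", "A_nidulans", "N_crassa"]
OTHER_FUNGI = ["S_cerevisiae", "S_mikatae", "C_albicans", "S_pombe"]
OUTGROUPS = ["outgroup_1", "outgroup_2"]


def classify_a_fumigatus_gene(profile):
    """Map a phylogenetic profile (genome -> 1 present / 0 absent)
    to an LS group, or None if the pattern fits no group."""
    present = {genome for genome, flag in profile.items() if flag == 1}
    all_fungi = set(EUASCOMYCETES + OTHER_FUNGI)
    if present == all_fungi | set(OUTGROUPS):
        return "Eukaryotes-core"
    if present == all_fungi:
        return "Ascomycota-core"
    if present == set(EUASCOMYCETES):
        return "Euascomycetes-specific"
    if present == {"A_fumigatus", "A_nidulans"}:
        return "Aspergillus-specific"
    return None  # ambiguous pattern, left unclassified in this sketch


def make_profile(present_genomes):
    """Helper: build a 0/1 profile over all genomes in the example."""
    everything = EUASCOMYCETES + OTHER_FUNGI + OUTGROUPS
    return {g: 1 if g in present_genomes else 0 for g in everything}
```

An analogous function for S. cerevisiae genes would swap in the Hemiascomycetes and Saccharomyces groupings.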
9.3.4 Divergence Times
Lineage divergence times are somewhat controversial [285]. In this work
divergence times were taken from [130] and [131]. These give the following
divergence times (Fig. 9.1): Animals vs Fungi, 1576 Mya; Fungi vs As-
comycetes, 1144 Mya; Saccharomyces and Candida vs Aspergillus, 1085
Mya; Candida vs Saccharomyces, 841 Mya; Neurospora vs Aspergillus,
670 Mya. Divergence times for S. cerevisiae vs S. mikatae and A. fumi-
gatus vs A. nidulans were taken as ∼10 Mya.
To convert LS into numeric form to calculate correlations with other
properties, the ratio of the time of the animal-fungi divergence to that
of the divergence of a lineage from its last common ancestor was used.
For example, the Eukaryotes-core value is 1 (1458/1458) while that of
Ascomycota-core is 1.27 (1458/1144). The final results were not sensitive
to changes in the divergence time estimates used for this category to
numeric conversion.
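The conversion of an LS category to a number can be illustrated with the worked example above. The sketch below simply divides a reference divergence time by the divergence time of the lineage; the function and argument names are hypothetical, and the reference time of 1458 is the one used in the worked example.

```python
def ls_numeric(reference_mya, lineage_mya):
    """Numeric LS value: reference (animal-fungi) divergence time divided
    by the divergence time of the lineage from its last common ancestor."""
    return reference_mya / lineage_mya


# Values from the worked example in the text.
eukaryotes_core = ls_numeric(1458, 1458)   # 1.0
ascomycota_core = ls_numeric(1458, 1144)   # about 1.27
```

More recent lineages have smaller divergence times and therefore larger numeric LS values.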
9.3.5 Estimation of substitution rates and statistical analyses
The number of synonymous substitutions per synonymous site, Ks, and
the number of nonsynonymous substitutions per nonsynonymous site,
Ka, were estimated between A. fumigatus-A. nidulans ortholog pairs and
S. cerevisiae-S. mikatae ortholog pairs in the Euascomycetes and Hemi-
ascomycetes lineages respectively. For each ortholog pair, the ortholo-
gous protein sequences were aligned using ClustalW version 1.82 with the
default parameters. The corresponding nucleotide-sequence alignments
were derived by substituting the respective coding sequences for the
protein sequences using MBEToolbox (Chapter 10 [35]). Ks and Ka
were then estimated by the maximum-likelihood method implemented in
the CODEML program of PAML [359].
High apparent sequence divergence, as shown by high Ks or Ka values,
is often associated with problems such as difficulty in alignment, or dif-
ferences in codon usage bias or nucleotide composition in the sequences.
Ortholog pairs with Ks < 0.05 may include too few substitutions to
provide a statistically significant measure of change [371]. To accurately
measure the intensity of selective forces acting on a protein, only ortholog
pairs with Ka ≤ 2 and 0.05 ≤ Ks ≤ 2 were used. Similar results were
obtained when more relaxed cutoffs for Ka and Ks (≤ 5) were used (data
not shown). All known ribosomal protein genes were excluded from the
data set as their high level of conservation gives them substantially lower
average values of Ka, Ks and Ka/Ks than those for the rest of the genes.
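The filtering rule above can be written as a single predicate. This is a minimal sketch with hypothetical names; the default thresholds are the ones stated in the text, and the relaxed cut-offs can be passed as arguments.

```python
def keep_pair(ka, ks, is_ribosomal=False,
              ka_max=2.0, ks_min=0.05, ks_max=2.0):
    """Return True if an ortholog pair is retained for rate analysis.

    Ribosomal protein genes are always excluded; otherwise the pair must
    satisfy Ka <= ka_max and ks_min <= Ks <= ks_max.
    """
    if is_ribosomal:
        return False
    return ka <= ka_max and ks_min <= ks <= ks_max
```

For example, `keep_pair(0.13, 0.83)` is retained, whereas `keep_pair(0.01, 0.02)` is dropped because Ks is below 0.05.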
Statistical regression analyses followed the procedure described by
Rocha and Danchin. Since the linear regression model works better with
normally distributed variables, scatter plots of Ka against the other
variables were examined to determine whether linear models were
reasonable for these variables. It was necessary to transform the values
of Ka, expression level and fitness of gene deletion into their logarithmic
forms to give a distribution closer to a normal distribution. For the same
reason, log(Ka) values were used in the correlation and partial correlation
analyses.
9.3.6 Detection of rate variability across species - Relative Divergence
Score (RDS)
To measure the degree of divergence of genes in a species away from
orthologs in other species, TBLASTN comparisons for all proteins in the
A. fumigatus or S. cerevisiae genomes were run against all DNA sequences
in the 9 genomes studied here. The relative divergence score (RDS) was
defined as $D_{A,B} = -\ln(S_{A,B}/S_{A,A})$, where $S_{A,B}$ is the
TBLASTN bit score for the query protein from genome A and subject
genome B, and $S_{A,A}$ is the score of the query against its own genome.
Such scores range from 0 (identical proteins found in the subject genome)
to infinity (no significant hit found). For genes belonging to each LS group,
and to the relevant species at each divergence time point, 10,000 boot-
strapped medians of random samples were taken from the RDS values
of the genes. The mean of the bootstrapped medians was used as the
estimated RDS of the LS group.
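The RDS definition and the bootstrap of medians can be sketched as follows. This is an illustration under assumptions rather than the code used here: the function names are hypothetical, each resample is assumed to have the same size as the input, and a missing hit is mapped to infinity as described in the text.

```python
import math
import random
import statistics


def rds(score_ab, score_aa):
    """Relative divergence score D = -ln(S_AB / S_AA).
    score_ab is None (or non-positive) when no significant hit exists."""
    if score_ab is None or score_ab <= 0:
        return math.inf
    return -math.log(score_ab / score_aa)


def bootstrap_median_mean(values, n_boot=10000, seed=0):
    """Mean of n_boot bootstrapped medians of `values`
    (resampling with replacement, same size as the input)."""
    rng = random.Random(seed)
    n = len(values)
    medians = []
    for _ in range(n_boot):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        medians.append(statistics.median(sample))
    return statistics.mean(medians)
```

A protein whose best hit scores half of its self-score has an RDS of ln 2, about 0.69.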
9.4 Results
9.4.1 Evolutionary rate differences among LS groups
The Ascomycotan fungi used in this study represent two distinct fun-
gal groups: Euascomycetes (A. nidulans, A. fumigatus and N. crassa)
and Hemiascomycetes (S. cerevisiae, S. mikatae and C. albicans) and
the more distantly related fission yeast, S. pombe. Data from the two
groups, Euascomycetes and Hemiascomycetes, were processed separately.
For the Euascomycetes sequences, we predicted 6,432 A. fumigatus-A.
nidulans orthologs and calculated the nonsynonymous substitution rate,
Ka, and the synonymous substitution rate, Ks, for each gene pair. We
then classified the predicted orthologs into the following groups: (1)
Eukaryotes-core, (2) Ascomycota-core, (3) Euascomycetes-specific and
(4) Aspergillus-specific, according to the phylogenetic profiles of A. fu-
migatus genes. The Hemiascomycetes sequences gave 3,707 pairs of
S. cerevisiae-S. mikatae orthologs which were processed similarly and
classified into four groups: (1) Eukaryotes-core, (2) Ascomycota-core,
(3) Hemiascomycetes-specific and (4) Saccharomyces-specific. Thus, LS
groups from (1) to (4) represent increasingly more recent times of origin.
Filtering steps of (1) removing ortholog pairs with Ks, Ka > 2 or
Ks < 0.05, (2) excluding ribosomal proteins, and (3) eliminating genes
where possible similarity to a reconstructed ancestral sequence was found,
were applied to the data set. Step 3 removed only 3 gene pairs, 2 in the
Hemiascomycetes lineage and 1 in the Euascomycetes lineage, which may
be due to either the limits of the ancestral reconstruction method or the
relatively conservative criteria adopted in defining orthologs. Final sets
of 183 A. fumigatus-A. nidulans ortholog pairs and of 359 S. cerevisiae-
S. mikatae ortholog pairs were obtained. The mean Ka, Ks and Ka/Ks
of the ortholog pairs in each LS group are given in Table 9.2.

Figure 9.2: Divergence of nonsynonymous substitution rate in LS groups.
The edges of the boxes indicate the upper and lower quartiles. The line at
the centre of the box indicates the median and the edges of the whiskers
represent the limits of 1.5 times the upper or lower inter-quartile ranges.
The circle (○) indicates cases with values between 1.5 and 3 box lengths
from the upper or lower edge of the box. The number of the gene pairs
(N) is given. (A) A. fumigatus-A. nidulans orthologs (y-axis: Ka; N = 113,
27, 22 and 21 for Eukaryotes-core, Ascomycota-core, Euascomycetes-specific
and Aspergillus-specific, respectively). (B) S. cerevisiae-S. mikatae
orthologs (y-axis: Ka; N = 17, 23, 22 and 297 for Eukaryotes-core,
Ascomycota-core, Hemiascomycetes-specific and Saccharomyces-specific,
respectively).

Genes that are distributed in the more specific lineages tend to have
higher Ka values than more widely distributed genes. Box plots of the
distribution of the Ka values for the Aspergillus and Saccharomyces genes
are shown in Fig. 9.2 (A and B, respectively). In both the Aspergillus
and Saccharomyces gene sets, average Ka increases with the degree of LS
with significant among-group variation as measured by a Kruskal-Wallis
test (Aspergillus, P < 0.001; Saccharomyces, P < 0.001). Moreover, as
expected, Ka is consistently smaller than Ks within all LS groups, which
suggests the operation of purifying (negative) selection or functional con-
straints.
The ratio Ka/Ks (i.e., the rate of nonsynonymous substitutions cor-
rected for neutral rates) showed a trend similar to Ka, namely, the values
of Ka/Ks for genes of high LS (e.g., Aspergillus-specific or Euascomycetes-
specific genes) are significantly higher than those for genes of low LS (e.g.,
Eukaryotes-core or Ascomycota-core genes). The differences among the
rates of sequence divergence for different LS groups are more pronounced
for Ka than for Ks, which suggests that the acceleration of a gene’s di-
vergence rate may be mainly caused by more relaxed purifying selection
against amino acid replacement. Functions of representative genes in dif-
ferent LS groups were also examined. Largely, the functions of highly
lineage-specific genes are poorly characterised or simply unknown.

Figure 9.3: Dependence of log gene expression level, Log(EXP), and
substitution rate. (A) log non-synonymous substitution rate, log(Ka).
(B) log synonymous substitution rate, log(Ks). (x-axis: Log(EXP); series
shown: Saccharomyces-specific, Hemiascomycetes-specific, Ascomycota-core,
Eukaryotes-core and all genes.)

Figure 9.4: Linear regression analysis of divergence time and RDS. (A)
LS of A. fumigatus-A. nidulans genes (series: Euascomycetes-specific,
Ascomycota-core, Eukaryotes-core; R2 values shown: 0.9518 and 0.9429).
(B) LS of S. cerevisiae-S. mikatae genes (series: Hemiascomycetes-specific,
Ascomycota-core, Eukaryotes-core; R2 values shown: 0.9544 and 0.939).
(x-axis: divergence time (Myr); y-axis: -ln(D), D = relative dissimilarity
score.)

9.4.2 Evolutionary rate-related factors of genes belonging to different
LS groups
The correlation between Ka and LS may be confounded by other factors.
For S. cerevisiae-S. mikatae orthologs, bivariate correlations were used
to compute the pairwise associations between Ka and LS and potentially
confounding factors. These factors include the expression level of genes,
the dispensability or essentiality of a gene, gene duplication and the num-
ber of protein-protein interactions of the gene product. The results are
summarised in the upper triangle of Table 9.3. The coefficient for
correlation between log(Ka) and LS is 0.584 (Pearson’s R, P < 0.01, Table
9.3), which is higher than that between log(Ka) and any other factor or
that between any two other factors.
Log gene expression level correlates negatively with log Ka (R =
-0.382, P < 0.01, Table 9.3, Fig. 9.3). This is consistent with previous
studies which showed a correlation between Ka and gene expression
level [127,241]. A correlation between Ka and gene essentiality has long
been proposed [343] but remains controversial [141,149]. The correlation
between log(Ka) and gene essentiality was found to be weak, albeit sig-
nificant (R = -0.163, P < 0.01), and essential genes have a lower mean
Ka (0.081, median 0.081) compared to that for non-essential genes (mean
0.136; median 0.110) (Mann-Whitney U test, P = 0.004).
Our data show a weak correlation between log(Ka) and gene dispens-
ability (R = 0.186, P < 0.001, Table 9.3), which is at a similar magnitude
to that of gene essentiality. This result is consistent with that recently
reported by Hirsh and Fraser. This correlation remains significant af-
ter controlling for gene expression levels (partial R = 0.240, P < 0.01),
suggesting the independent nature of gene dispensability as a factor.
Gene duplication has been shown to play a role in influencing gene
divergence rates [119,150,357]. Genes were classified as duplicate genes
if they belonged to any multigene family and as singletons otherwise. The
mean Ka of 0.097 (median 0.049) for duplicate genes was significantly
smaller than the mean of 0.138 (median 0.114) for singleton genes
(Mann-Whitney U test, P < 0.001). The same pattern was observed between
different LS groups with the exception of the Ascomycota-core group.

Table 9.2: Average Ka, Ks and Ka/Ks among LS classes. ∗A Kruskal-Wallis
test reveals significant rate heterogeneity of average Ka or average
Ka/Ks of genes in different LS groups in both Euascomycetes branch and
Hemiascomycetes branch, P < 0.001. §A Kruskal-Wallis test reveals no
significant rate heterogeneity of average Ks of genes in different LS
groups in both Euascomycetes branch and Hemiascomycetes branch, P > 0.01.

LS class                   Number of    Ka∗             Ks§             Ka/Ks∗
                           gene pairs   mean (SD)       mean (SD)       mean (SD)
A. fumigatus-A. nidulans (Euascomycetes branch)
Eukaryotes-core            113          0.051 (0.032)   1.431 (0.441)   0.039 (0.027)
Ascomycota-core            27           0.126 (0.069)   1.577 (0.329)   0.080 (0.042)
Euascomycetes-specific     22           0.198 (0.118)   1.436 (0.490)   0.155 (0.091)
Aspergillus-specific       21           0.293 (0.136)   1.263 (0.567)   0.261 (0.127)
S. cerevisiae-S. mikatae (Hemiascomycetes branch)
Eukaryotes-core            17           0.018 (0.021)   0.586 (0.213)   0.029 (0.026)
Ascomycota-core            23           0.031 (0.030)   0.639 (0.172)   0.047 (0.040)
Hemiascomycetes-specific   22           0.072 (0.037)   0.839 (0.284)   0.091 (0.045)
Saccharomyces-specific     297          0.131 (0.100)   0.830 (0.329)   0.165 (0.130)

Table 9.3: Correlation (Pearson’s R) (upper triangle) and partial
correlation after controlling for log(Ks) (lower triangle). Abbreviations:
Ka: nonsynonymous substitution rate; LS: lineage specificity; EXP:
expression level; FIT: fitness effect (gene dispensability); ESS: gene
essentiality; DUP: duplicated (or not) gene; INT: number of interactions.
Among them, Ka, Ks, EXP and FIT are in their log forms.

       Ka       LS       EXP      FIT      ESS      DUP      INT      Ks
Ka     –        0.584    -0.382   0.186    -0.163   0.257    -0.308   0.429
LS     0.582    –        -0.271   0.195    -0.263   0.324    -0.428   0.185
EXP    -0.294   -0.161   –        -0.037   0.076    -0.113   0.197    -0.165
FIT    0.240    0.192    -0.049   –        0.032    -0.116   -0.159   -0.048
ESS    -0.018   -0.146   -0.091   0.033    –        0.020    0.243    -0.087
DUP    0.215    0.312    -0.065   -0.106   0.028    –        -0.163   0.160
INT    -0.253   -0.379   0.123    -0.175   -0.007   -0.111   –        -0.128

Ka has been shown to be positively correlated with Ks in several
species [116, 214, 239, 344]. Such a correlation, which may confound
correlations between log(Ka) and LS or other factors, was observed here
for log(Ka) and log(Ks) (R = 0.429, P < 0.01, Table 9.3). To examine
the influence of the correlation of Ka with Ks on other factors, partial
correlation coefficients between log(Ka) and other variables were
calculated while holding the value of log(Ks) constant. The results are
given in the lower triangle of Table 9.3 and indicate that, after
controlling for log(Ks), log(Ka) remains significantly correlated with LS.
There is little change in the value of the coefficient with or without
controlling for log(Ks) (partial $R_{\log(Ka)-LS|\log(Ks)} = 0.582$ versus
$R_{\log(Ka)-LS} = 0.584$). Thus, Ka is correlated with LS independently
of Ks.
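A first-order partial correlation of the kind used above can be computed from three pairwise Pearson coefficients. The sketch below shows the standard formula; it is a generic illustration with hypothetical function names, not the statistical package used in this analysis, so small differences from the tabulated values are to be expected.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of x and y after controlling for z:
    (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

If the association between x and y were entirely due to z (that is, r_xy = r_xz * r_yz), the partial correlation would be zero; a value that barely changes after controlling for z indicates that z is not driving the correlation.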
A decrease in the absolute value of the correlation coefficient was
observed between log(Ka) and expression level when controlling for log(Ks)
($|R_{\log(Ka)-\log(EXP)|\log(Ks)}| = 0.294$ and $|R_{\log(Ka)-\log(EXP)}| = 0.382$).
This suggests Ks might be a confounding factor for gene expression level
in determining Ka. Figure 9.3 plots the relationship of log expression
level with log(Ka) (Fig. 9.3A) and with log(Ks) (Fig. 9.3B) showing the
values for the Saccharomyces gene lineage groups. The more consistent
relationship of log expression value with log(Ks) among the genes can be
seen.

Table 9.4: Results of the regression analyses on 359 predicted
S. cerevisiae-S. mikatae orthologs. ¶R2 is the proportion of variation in
the dependent variable explained by the regression model constructed from
the individual variable. The values indicate the independent contribution
of each variable to explain the global variance of log(Ka). ∗Order of
variables entered into model at each step. ∗∗t statistics can indicate the
relative importance of each variable in the model.

Variable     Indep.         Entry    Unstd. coeff.     Std. coeff.   t∗∗       P
             contribution   order∗   (B) ± 1 SE        (β)
             (R2)¶
Included variables
(Constant)   –              –        -1.149 ± 0.113    –             -10.148   < 0.0001
LS           0.341          1        0.048 ± 0.004     0.562         11.676    < 0.0001
log(EXP)     0.164          2        -0.197 ± 0.038    -0.247        -5.124    < 0.0001
Excluded variables
log(FIT)     0.035          3        –                 0.087         1.836     > 0.1
DUP          0.066          4        –                 0.070         1.399     > 0.1
ESS          0.027          5        –                 0.038         0.787     > 0.1
INT          0.095          6        –                 -0.028        -0.546    > 0.1

Linear multiple regression was used to further examine the effect of
multiple factors on log(Ka). Gene essentiality and gene redundancy were
recoded to be quantitative variables by using two sets of binary variables
(essential = 1 and non-essential = 0; duplicated gene = 1 and singleton
gene = 0). A forward stepwise regression model was used to examine
the contribution of each independent variable to the regression. The
regression model defines log(Ka) as a function of LS ($X_{LS}$), log
expression level ($\log(X_{exp})$), log fitness effect of gene deletion
($\log(X_{fit})$), essentiality ($X_{ess}$), gene duplication ($X_{dup}$),
and the number of protein interactions ($X_{int}$):
$$\log(K_a) = \beta_0 + \beta_{LS} X_{LS} + \beta_{exp}\log(X_{exp}) + \beta_{fit}\log(X_{fit}) + \beta_{ess} X_{ess} + \beta_{dup} X_{dup} + \beta_{int} X_{int}$$
Table 9.4 gives the results of the modelling procedure. The final model
gives a global R2 of 0.436 (P < 0.001). That is, nearly one half of the
variation in log(Ka) is explained by this model. During the construction
of the final model, the predictors most highly correlated with log(Ka),
LS and the expression level, were kept. The remaining variables, which
have minor roles in overall regression with log(Ka), were excluded from
the final model (Table 9.4). The standardised coefficients were examined
to determine the relative importance of the significant predictors. LS
contributes more to the model than does the expression level, as shown
by its larger absolute standardised coefficient of 0.562 and t statistic of
11.676, when compared with values of 0.247 and 5.124, respectively, for
expression level. This analysis suggests that LS is the most relevant
predictor of the rate of protein divergence.
9.4.3 Linear regression of divergence time and relative divergence score
(RDS)
To relate the group divergence times and RDS a linear regression for
each LS group was performed (Fig. 9.4). An increasing linear trend of
RDS with divergence time was observed in each LS group, suggesting
that genes diverge from other species at an approximately constant rate.
Groups with higher LS have greater slopes than those with lower LS, in-
dicating that genes with higher LS evolve faster than those with lower LS.
This trend would still be apparent if different divergence time estimates
were used.
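The per-group regression of RDS on divergence time amounts to an ordinary least-squares line fit. The sketch below is a minimal illustration with a hypothetical function name; the RDS values in the usage example are placeholders chosen for the illustration and are not data from this study.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept.
    Returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Divergence times (Mya) from the text; RDS values are made-up placeholders.
times = [10, 670, 1085, 1144]
rds_values = [0.12, 1.44, 2.27, 2.388]
slope, intercept, r2 = linear_fit(times, rds_values)
```

A steeper slope for one LS group than for another corresponds to faster accumulation of divergence per unit time in that group.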
9.5 Discussion
The phylogenetic distribution of a gene has been suggested to be of bi-
ological importance. For example, genes with the same phylogenetic
distribution may have linked functions [8, 218]. Lineage specificity (LS)
is a form of phylogenetic distribution whereby genes are found only in
a group of species that diverge from a certain point in a species tree.
Orphan genes, those identified from only one species, are the extremes of
lineage specificity. How these orphan and lineage specific genes arose is
still an open question.
Three possibilities are generally proposed [73]. One is that genes in a
lineage originate from a lineage ancestral gene formed by the recombina-
tion of exons from other genes or from random ORFs. These genes might
show similarity to the original exons and so not necessarily be considered
orphans or lineage specific. In the case of formation from random ORFs
it is unlikely that such a protein would be functional. A second option is
gene loss [8, 178]. However it is relatively unlikely that a gene would be
lost in all but one lineage [73] and this may not explain most orphan or
lineage specific genes. The third option, which is examined here, is that
some genes evolve at a rapid rate and so can no longer be recognised as
orthologs of the genes they diverged from after a certain time span.
If accelerated rates of evolution lead to the creation of orphan or
lineage specific genes, then it follows that genes with a high degree of LS
should show higher rates of evolution than genes with lower degrees of LS.
This hypothesis has been tested here with respect to the Ascomycotan
fungi. If LS arose through widespread gene loss or from creation of new
genes from recombination of exons or ORFs there is no reason to expect
accelerated evolutionary rates or a trend in evolutionary rate with respect
to the degree of LS.
The evolutionary rates of genes in Ascomycotan fungi that have
different degrees of LS were compared, revealing a significant, strong
correlation between LS and the evolutionary rates of the genes. A trend
that genes with narrow phylogenetic distributions (high LS) tend to have
elevated evolutionary rates when compared with more ubiquitous genes
(low LS) was observed. This is consistent with the hypothesis that accel-
eration of the evolutionary rate is largely responsible for the formation
of lineage specific genes.
However, the rate of gene evolution is one of the most important
parameters in molecular evolution. Correlations between the rate of gene
evolution and many properties of genes, including their phylogenetic
distribution, have been explored by several studies. As noted in the
Introduction, the evolutionary rate has been associated with expression
level [127,241], gene dispensability [178], essentiality [343] or morbidity,
gene duplication, gene loss [178] and protein-protein interactions [93,335].
Not all these studies have been in agreement (e.g., [93,151]). These factors
may influence the apparent correlation of LS with evolutionary rate.
All pair-wise correlations of these factors with LS, Ka and Ks were
examined to investigate the influence of these factors on the relationship
between LS and Ka. The strongest correlation observed was that of LS
with log(Ka), however log(Ka) also correlated highly with log(Ks). Cor-
relations of log(Ka) with LS and the other factors were then calculated
after controlling for log(Ks). Again the correlation of LS with log(Ka)
was the strongest and similar to that without controlling for log(Ks).
With one exception, both LS and log(Ka) showed significant but low
correlations to all other factors. As log(Ka) showed the strongest corre-
lation with LS in both cases it seems clear that the evolutionary rate has
a considerable, though not unique, influence on the origin of LS.
Further examination of this was undertaken with a stepwise regression
analysis of the factors likely to influence Ka. In the final regression
model, which explained close to half the variation in log(Ka), only the
parameters LS and log expression level were kept, with LS making the
larger individual contribution. The other parameters investigated did
not make significant contributions to the regression model. This again
indicated the role of evolutionary rate on LS.
Another approach used the relative divergence score (RDS) which
measures the divergence of a gene from its orthologs in other genomes
as a ratio of the TBLASTN score with its orthologs to the maximal (or
self-self) score. This provides another view of the degree of divergence
within a lineage and, when matched to divergence times, allows an ex-
amination of the evolutionary rate as the degree of LS increases. Within
each LS group a reasonably constant rate of evolution was seen since the
appearance of the LS group. Groups with low LS show lower RDS values
and evolutionary rates than groups with higher LS, consistent with the
evolutionary rate being a major determinant of LS. Allowing for errors
in the determination of divergence times this trend will still hold.
Genes with a certain degree of LS may have arisen from duplication
followed by acquisition of a lineage specific function [73] or simply have
diverged from a common ancestor to the extent that they cannot be
recognised as orthologs across lineages. Our findings support the idea
that genes destined to have high levels of LS will have higher evolution-
ary rates. It should be noted that Ka is a measurement of the average
nonsynonymous substitution rate along the whole length of a gene.
Although highly lineage-specific genes had higher average Ka, the extent
to which region-specific or site-specific contributions to Ka affect this
was not examined. Further research could be directed to evaluate such
region- or site-specific effects on the rate of protein divergence, especially,
for instance, for genes that have high LS but low evolutionary rates or
vice versa.
For ascomycotan fungi, our findings show that the degree of LS cor-
relates with the evolutionary rate and indicate that an elevated evolu-
tionary rate may be a major cause of the development of lineage specific
genes.
Chapter 10
MBETOOLBOX: A MATLAB TOOLBOX FOR
SEQUENCE DATA ANALYSIS IN MOLECULAR
BIOLOGY AND EVOLUTION
This chapter is very closely based on a paper I have published [35].
The original draft of the manuscript was revised by Dr. David K.
Smith of the Department of Biochemistry, HKU.
10.1 Introduction
Matlab is a high-performance language for technical computing, integrat-
ing computation, visualization, and programming in an easy-to-use envi-
ronment. It has been widely used in many areas, such as, mathematics
and computation, algorithm development, data acquisition, modelling,
simulation, and scientific and engineering graphics. However, few
functions are freely available in Matlab specifically for sequence data
analysis in molecular biology and evolution. I have developed a Matlab
toolbox, called MBEToolbox, which aims to fill this gap by offering
efficient implementations of the most needed functions in molecular
biology and evolution. It can be used to manipulate aligned sequences,
calculate evolutionary distances, estimate synonymous and nonsynonymous
substitution rates, and infer phylogenetic trees. Moreover, it provides
an extensible functional framework for more specialized needs in
exploring and analysing aligned nucleotide or protein sequences from an
evolutionary perspective. All functions in the toolbox are accessible
from the command line for seasoned Matlab users, and a graphical user
interface is also provided, which may be especially useful for
non-specialist end users. Through its application during the Penicillium
marneffei genome project, MBEToolbox has proved to be a useful tool
that can aid in the exploration, interpretation and visualization of data
in molecular biology and evolution. The software is publicly available
at http://web.hku.hk/~jamescai/mbetoolbox/.
10.2 Literature Review
10.2.1 Probabilistic DNA substitution models
In this section I will discuss probability models, more specifically, Markov
models. (Of course, there also exist other types of models, e.g.,
deterministic models.) Markov models can be discrete or continuous with
regard to time. The discrete time models are called Markov chains, whereas
continuous time models are usually called Markov processes. The
mathematical notation used in this section is: R - intrinsic rate matrix; Q
- (instantaneous) transition rate matrix; P - transition probability
matrix; X - divergence matrix; Π - matrix of base frequencies; and t - time
or evolutionary distance.
Molecular evolution of sequences is generally modelled under a
hypothesis of phylogeny, i.e., sequence evolution is modelled along a
branch of a phylogenetic tree. This is done using a continuous time Markov
process, more specifically a finite, aperiodic, irreducible process
(referred to here simply as a Markov process). A Markov process has a
defined state space, e.g., {A, C, G, T}, and the (instantaneous)
transition rate between states is given by an n × n transition rate
matrix, Q, where $Q_{ij} > 0$ for all $i \neq j$ and
$Q_{ii} = -\sum_{j \neq i} Q_{ij}$. Amino acid models have
n = 20, while nucleotide models have n = 4, e.g.:
$$Q = \begin{pmatrix}
-1.218 & 0.504 & 0.336 & 0.378 \\
0.126 & -0.882 & 0.252 & 0.504 \\
0.168 & 0.504 & -1.050 & 0.378 \\
0.126 & 0.672 & 0.252 & -1.050
\end{pmatrix}$$
$Q_{ij}$ is the rate of going from state i to state j. Since the total
instantaneous rate is zero, each row sums to zero. For a specified
time interval, t, we can calculate the transition probability matrix from
$P(t) = e^{Qt}$, e.g.:
$$P(t) = \begin{pmatrix}
0.6883 & 0.1308 & 0.0828 & 0.0981 \\
0.0327 & 0.7783 & 0.0654 & 0.1236 \\
0.0414 & 0.1308 & 0.7297 & 0.0981 \\
0.0327 & 0.1647 & 0.0654 & 0.7372
\end{pmatrix}$$
Here t = 0.33 and the exponential is the matrix exponential. In
Matlab, this is computed using a scaling and squaring algorithm with a
Padé approximation. In P, the rows sum to one, since the total
probability over the time interval is one. If the Markov process is run
for a sufficiently long time, the probabilities P(t) will converge on a
stationary distribution such that, for all states i, j and k,
$P_{i,j}(t) = P_{k,j}(t)$. That is, the probability of the end state is
independent of the starting state. Here we will limit our discussion to
cases where the overall rate of change from state i to state j is the same
as the rate from j to i (i.e., $\pi_i Q_{ij} = \pi_j Q_{ji}$), a constraint
that defines the models said to be time-reversible. The models
used in phylogenetic inference to date are almost exclusively subsets of
this class.
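The calculation of P(t) from Q can be sketched in a few lines. The code below is a plain-Python illustration, not the Matlab routine: it uses scaling and squaring with a truncated Taylor series in place of the Padé approximation, and the function names are hypothetical. The rate matrix is the nucleotide example given earlier in this section.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=20):
    """Matrix exponential by scaling and squaring with a Taylor series."""
    n = len(m)
    norm = max(sum(abs(v) for v in row) for row in m)
    squarings = 0
    while norm > 0.5:            # scale until the matrix norm is small
        norm /= 2.0
        squarings += 1
    scaled = [[v / (2 ** squarings) for v in row] for row in m]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):   # accumulate scaled^k / k!
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):      # undo the scaling by repeated squaring
        result = mat_mul(result, result)
    return result


def transition_probabilities(q, t):
    """P(t) = exp(Q t) for a rate matrix q and time t."""
    return mat_exp([[v * t for v in row] for row in q])


Q = [[-1.218, 0.504, 0.336, 0.378],
     [0.126, -0.882, 0.252, 0.504],
     [0.168, 0.504, -1.050, 0.378],
     [0.126, 0.672, 0.252, -1.050]]
```

Calling `transition_probabilities(Q, t)` for increasing t shows the rows of P(t) approaching a common vector, which is the convergence to the stationary distribution described above.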
The transition rate matrix, Q, can be decomposed into an intrinsic
rate matrix, R, and Π, such that:
$$Q = R\Pi$$
where R is symmetric, Q is constructed as indicated above, and Π
is the diagonal matrix of equilibrium base frequencies. The rates at which
each state is replaced with each alternative state in R, and the methods
for calculating or estimating Π, are set differently in different
situations. Hence, different DNA substitution models exist. The most
general of these models of nucleotide substitution is the general time
reversible model (REV), also called the General Time Reversible model
(GTR). The instantaneous rate matrix for the REV model is:
$$R_{\mathrm{REV}} = \begin{pmatrix}
- & \mu a & \mu b & \mu c \\
\mu a & - & \mu d & \mu e \\
\mu b & \mu d & - & \mu f \\
\mu c & \mu e & \mu f & -
\end{pmatrix}$$
In this matrix, the rows (and columns) correspond to the bases A, C,
G, and T respectively. The factor µ represents the mean instantaneous
rate. This rate is modified by the relative rate parameters a, b, c, ..., f,
which correspond to each possible transformation between two bases. To
construct $Q_{\mathrm{REV}}$, all we need to do is compute RΠ, where
Π = diag(πA, πC, πG, πT) holds the frequency parameters that correspond to
the frequencies of the four bases. The diagonal elements of Q are always
chosen so that the row sums are zero.
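The construction of Q from the relative rates and base frequencies can be sketched as follows. This is a minimal illustration with a hypothetical function name and argument layout, with bases ordered A, C, G, T and the six rates a to f laid out as in the REV rate matrix.

```python
def build_gtr_q(rates, freqs, mu=1.0):
    """Build Q = R * diag(freqs) for the REV/GTR model.

    rates: dict with keys 'a'..'f' (A-C, A-G, A-T, C-G, C-T, G-T)
    freqs: [piA, piC, piG, piT]
    Diagonal entries are set so that each row sums to zero.
    """
    a, b, c, d, e, f = (rates[k] for k in "abcdef")
    r = [[0.0, a, b, c],
         [a, 0.0, d, e],
         [b, d, 0.0, f],
         [c, e, f, 0.0]]
    q = [[mu * r[i][j] * freqs[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q


# Parameter values chosen to match the example Q given earlier.
example_rates = {"a": 1.26, "b": 1.68, "c": 1.26,
                 "d": 1.26, "e": 1.68, "f": 1.26}
example_freqs = [0.1, 0.4, 0.2, 0.3]
q_example = build_gtr_q(example_rates, example_freqs)
```

With these values the function reproduces, up to rounding, the example Q matrix given earlier in this section, and the result satisfies the time-reversibility condition that freqs[i] * Q[i][j] equals freqs[j] * Q[j][i].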
Many other models (all belonging to the GTR class) have been proposed.
They are usually designated by the initial letters of the authors’ last
names and the year of publication. Their relationships are illustrated
in Fig. 10.1. The κ parameter represents the ratio of the instantaneous
rate of transition-type substitutions to transversion-type substitutions.
It assumes the value 1.0 for models in which all substitutions are taken
to occur at the same rate (i.e., the JC and F81 models). In the K2P and
HKY models, the rate of transversion is β, with the rate of transitions
being determined as α = κβ.

Figure 10.1: Relationship of GTR class DNA substitution models. (Boxes:
JC, πA=πC=πG=πT, α=β; K2P, πA=πC=πG=πT, α≠β; F81, πA≠πC≠πG≠πT, α=β;
HKY85, πA≠πC≠πG≠πT, α≠β; GTR/REV, πA≠πC≠πG≠πT, a,b,c,d,e,f. Arrows are
labelled "Allow transition/transversion bias" and "Allow base frequencies
to vary".)

JC model The JC model was described by Jukes & Cantor in 1969
[153] and is the most restrictive model. It assumes that the base
frequencies are all equal and that the instantaneous rate of substitution
is the same for all possible changes. When this model is selected, the base
frequencies (πA, πC, πG, πT) are all set to 0.25 and the relative rates
a, b, c, d, e, f are all set to 1.0. The only free parameter that can be
adjusted under this model is the µt parameter.
F81 model The F81 model was described by Felsenstein (1981) [85].
It is like the JC model in assuming that all possible changes occur at
the same rate, but allows the base frequencies to be unequal. If the base
frequencies are all set to 0.25, this model is equivalent to the JC model.
When this model is selected, you will be free to vary the base frequency
parameters, but the κ parameter will not be changed as it is set to 1.0
under this model.
K2P model The K2P model was described by Kimura in 1980 [165].
It is like the JC model in assuming equal base frequencies, but allows the
rate of transition-type substitutions to differ from the rate of
transversion-type substitutions. The ratio of these two instantaneous
rates is κ. Two parameters, κ and µt, are free to vary under this model,
while the base frequency parameters are forced to be equal. With κ = 1.0,
the K2P model is identical to the JC model.
HKY model The Hasegawa, Kishino and Yano (HKY) model [126]
allows for different rates of transitions and transversions as well as
unequal base frequencies. The parameters required by this
model are the transition to transversion ratio κ and the base frequencies.
If base frequencies are uniform, the HKY model reduces to the K2P model.
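The nesting of the four models can be made concrete with a small Matlab sketch in which a single HKY-style constructor yields JC, F81 and K2P as special cases. The helper names makeR and makeQ, and all numeric values, are hypothetical and chosen for illustration; the matrices are not rescaled to a mean rate of one.

```matlab
% Transition pairs (A<->G and C<->T) for rows/columns ordered A, C, G, T.
ts = logical([0 0 1 0; 0 0 0 1; 1 0 0 0; 0 1 0 0]);

% Relative rates: kappa for transitions, 1 for transversions, 0 on the diagonal.
makeR = @(kappa) (ones(4) - eye(4)) .* (1 + (kappa - 1) * ts);

% Q = R*diag(pi), with the diagonal set so that each row sums to zero.
makeQ = @(kappa, piv) makeR(kappa) * diag(piv) - diag(makeR(kappa) * piv(:));

uniform = [0.25 0.25 0.25 0.25];
skewed  = [0.10 0.40 0.20 0.30];

Q_jc  = makeQ(1, uniform);   % JC:  equal frequencies, kappa = 1
Q_f81 = makeQ(1, skewed);    % F81: unequal frequencies, kappa = 1
Q_k2p = makeQ(2, uniform);   % K2P: equal frequencies, kappa = 2
Q_hky = makeQ(2, skewed);    % HKY: unequal frequencies, kappa = 2

% Under K2P the A->G (transition) rate is kappa times the A->C (transversion) rate.
ratio = Q_k2p(1,3) / Q_k2p(1,2);
```

Setting kappa to 1 or the frequencies to 0.25 in makeQ is all that separates the four models, which mirrors the relationships summarised in Fig. 10.1.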
10.2.2 Maximum likelihood estimation
Maximum likelihood estimation (MLE) is a popular statistical method
used to make inferences about parameters of the underlying probability
distribution of a given data set. Given a set of observations, the method
of maximum likelihood finds the parameters of a model that are most
consistent with these observations.
Here I use a simple and general example to explain the philosophy of
MLE. Suppose n data points, X1, X2, . . . , Xn, are drawn from a given
discrete probability distribution D with known probability mass function
fD and distributional parameter θ. The probability associated with each
observation may be computed:
P (x) = fD(x|θ)
where x ∈ {x1, x2, . . . , xn}. At this moment, although we know that
our data come from the distribution D, we may not know the value of
the parameter θ. This is usually the case when we run an experiment
to sample data points so that we can estimate some parameter of a
distribution, such as θ. The question is: how should we estimate θ?
MLE provides a general technique for seeking an estimate of the value
of θ from the sample. We maximise the likelihood of the observed data
set over all possible values of θ, i.e., seeking the most likely value of the
parameter θ.
We define the likelihood mathematically as:
\[ \mathrm{lik}(\theta) = \prod_{i=1}^{n} f_D(x_i \mid \theta) \]
MLE seeks the value θ which maximises this likelihood function over all
possible θ. MLE methods are versatile and apply to most models and to
different types of data.
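As a minimal numerical illustration of this principle, the sketch below estimates the parameter of a Bernoulli distribution by maximising the log-likelihood with Matlab's fminsearch. The data are invented, and the Bernoulli distribution is only a convenient stand-in for the generic distribution D.

```matlab
% Made-up data: 23 "successes" in 100 independent draws.
x = [zeros(1,77) ones(1,23)];

% Log-likelihood of Bernoulli(theta): sum over i of log f_D(x_i | theta).
loglik = @(theta) sum(x .* log(theta) + (1 - x) .* log(1 - theta));

% fminsearch minimises, so the negative log-likelihood is passed to it.
theta_hat = fminsearch(@(th) -loglik(th), 0.5);

% Analytic MLE for comparison: the sample proportion.
theta_closed = mean(x);
```

For this simple case the numerical estimate agrees with the closed-form answer to within the optimiser's tolerance; in phylogenetics a closed form is rarely available, which is why the numerical route matters.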
The general principle of MLE has been applied to many aspects of
phylogenetics, such as phylogenetic parameter estimation and optimal
tree searching [41, 85]. Generally, the likelihood of observing a given
set of data is maximised for each topology, and the topology that gives
the highest maximum likelihood is chosen as the final tree. In this
case, however, the parameters to be considered are not the topologies but
the branch lengths for each topology, and the likelihood is maximised to
estimate branch lengths rather than the topology. The problem with
phylogenetic inference based on the optimisation principle is that it is
very time-consuming: the number of possible topologies is very large for
a sizable number of nucleotide sequences (> 15), so an enormous amount
of computational time is required to find the optimal tree. Calculating
MLEs in phylogeny often requires specialised software that solves
complex non-linear equations by numerical optimisation.
10.2.3 Elements of phylogenetic theory
The purpose of the remainder of this section is to explain how
phylogenetic trees may be constructed from analysis of nucleotide and
protein sequences. Such analyses enable the evolutionary relationships
among species or genes to be deduced. I will review basic concepts of
phylogenetic theory, such as the phylogenetic tree and the likelihood
calculation of a phylogeny given a substitution model. Then I will
introduce some of the most commonly used software packages for
phylogenetic analyses, with their advantages and shortcomings.
Phylogenetic trees
We usually describe evolution, of either genes or species, by a sketch
of a tree-like structure that represents the hierarchical relationships
among species/genes arising through evolution. Such a tree-like
structure is called a phylogenetic tree. In a rooted tree, the root is the
common ancestor of all the nodes. In an evolutionary tree of species,
the ancestral species is located at the root of the tree and contemporary
species are the leaves. The topology of the tree, i.e., its branching
pattern, defines the phylogenetic relationships among the nodes. When
data for the ancestors are missing, the phylogenetic trees produced are
unrooted: they are only schematic trees comprising a set of nodes linked
together by branches. The location of the common ancestor of all the
species/genes under study cannot be identified in an unrooted tree.
The string representation of a tree, following the Newick standard,
is usually used. It uses the recursive definition of a tree to represent
phylogenies in a computer-readable form with nested parentheses. For
example, a tree can be written:
(outgroup, neurospora, (penicillium, aspergillus));
However, one must be aware that this representation is not unique;
the following one works as well:
(penicillium,(outgroup,neurospora),aspergillus);
When an outgroup is provided, the rooted representation is:
(outgroup,(neurospora,(penicillium,aspergillus)));
In addition to the branch topology, the branch lengths in phylogeny
are also important to specify a particular tree. The lengths of branches
represent the evolutionary distances between two consecutive nodes.
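Two quick checks on a Newick string can be written in Matlab, as sketched below. The sketch assumes a string without branch lengths or quoted labels; it is not a full Newick parser and is not the parser used in MBEToolbox.

```matlab
nwk = '(outgroup,(neurospora,(penicillium,aspergillus)));';

% In a well-formed string the opening and closing parentheses balance.
balanced = sum(nwk == '(') == sum(nwk == ')');

% Leaf labels are the runs of characters between the punctuation marks
% (valid only when the string carries no branch lengths).
leaves = regexp(nwk, '[^(),;:\s]+', 'match');
nleaves = numel(leaves);
```

A balance check of this kind is a cheap way to catch a stray parenthesis before a tree string is handed to a tree-building or likelihood routine.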
Phylogeny reconstruction
Data for phylogeny reconstruction are not limited to nucleotide and
amino acid sequences; protein structures or exon-intron structures can
also be used for this purpose. I will, however, limit the following
discussion to nucleotide and amino acid sequences. It is important to
note that most phylogeny-building methods require a multiple alignment
of the sequences. Sequence alignment is one of the most important
problems in bioinformatics. Much effort has been put into improving its
efficiency and accuracy, and the area is still actively developing.
Once the multiple alignment has been obtained, three different methods
are usually used to construct a phylogeny: the distance matrix method,
the maximum parsimony method and the maximum likelihood method. A good
review of all these methods can be found in [199].
Maximum parsimony infers a phylogenetic tree by minimising the
total number of evolutionary steps required to explain a given set of data,
or in other words by minimising the total tree length. It is a
character-based method: the input data are in the form of “characters” for
a range of taxa. Besides a protein or nucleotide residue, a character could
be a binary value for the presence or absence of a feature (such as the
presence of a tail). Maximum parsimony is a very simple approach, and
is popular for this reason. However, it is not always very accurate.
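To show what "counting evolutionary steps" means in practice, the following Matlab sketch scores one alignment column on two alternative four-taxon topologies using Fitch's set-based counting rule. Fitch's rule is named here as one standard way of obtaining the parsimony length; the taxa, states and helper names (cost, merge) are invented for illustration.

```matlab
% One alignment column for four taxa; states coded 1..4 for A, C, G, T.
states = [1 1 3 3];                          % taxa t1, t2, t3, t4
sets = bsxfun(@eq, states(:), 1:4);          % row k = state set of leaf k

% Fitch's rule: take the intersection of two child sets if it is not
% empty (no step), otherwise take the union (one step).
cost  = @(a, b) double(~any(a & b));
merge = @(a, b) (a & b) | (~any(a & b) & (a | b));

% Topology ((t1,t2),(t3,t4))
n12 = merge(sets(1,:), sets(2,:));
n34 = merge(sets(3,:), sets(4,:));
steps_a = cost(sets(1,:), sets(2,:)) + cost(sets(3,:), sets(4,:)) + cost(n12, n34);

% Topology ((t1,t3),(t2,t4))
n13 = merge(sets(1,:), sets(3,:));
n24 = merge(sets(2,:), sets(4,:));
steps_b = cost(sets(1,:), sets(3,:)) + cost(sets(2,:), sets(4,:)) + cost(n13, n24);
```

The first topology needs one change and the second needs two, so parsimony prefers the first for this column; summing such counts over all columns gives the tree length that is minimised.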
Maximum likelihood evaluates a hypothesis about evolutionary
history in terms of the probability that the proposed model and the
hypothesised history would give rise to the observed data set.
The core of likelihood-based methods is the likelihood function (for a
general description, see Section 10.2.2):
Likelihood = f(Data|T, l, θ)
where T is the topology, l is the set of branch lengths of the given tree
and θ denotes the parameters of the substitution model.
The topology with the highest maximum likelihood is chosen.
Maximum likelihood methods have several advantages over other methods:
they may have lower variance (they are least affected by sampling error),
they tend to be robust to violations of the assumptions in the
evolutionary model, they are statistically well founded, they can
statistically evaluate different tree topologies and they use all of the
sequence information. There are also some disadvantages: they are very
computationally intensive (slow) and the result depends on the model of
evolution.
Computation of likelihood of phylogeny
Substitution models describe the way sequences evolve in time by
nucleotide replacements. The most commonly used Markov models of
DNA substitution have been reviewed in Section 10.2.1.
10.2.4 Programs used for phylogenetic analyses
A few selected programs are introduced below; they are representative
of the programs most commonly used in phylogenetic analyses.
PAUP* - http://paup.csit.fsu.edu/ is an integrated and user-
friendly package. Many distinct models of nucleotide substitution are
available (all possible submodels of the GTR + Γ + inv sites model). It
does not allow analyses of protein sequences using parametric approaches.
Tree-Puzzle - http://www.tree-puzzle.de/ reconstructs phyloge-
netic trees from molecular sequence data by maximum likelihood. It
implements a fast tree search algorithm, quartet puzzling, that allows
analysis of large data sets and automatically assigns estimations of sup-
port to each internal branch. It also computes pairwise maximum likeli-
hood distances as well as branch lengths for user specified trees.
Mesquite - http://mesquiteproject.org/mesquite/mesquite.html
is an extensible and modular program for a variety of evolutionary
analyses. It is written in Java and is therefore platform-independent. At
this point Mesquite is of limited usefulness because it is a modular set of
programs to which specific applications must be added. But it does
implement one- and two-parameter models of evolution for ancestral state
reconstruction.
MrBayes - http://morphbank.ebc.uu.se/mrbayes/ is a program for
Markov chain Monte Carlo analysis of phylogeny. It implements a limited
set of submodels of the GTR + Γ + inv sites model. The current version
allows the use of mixed models (e.g., distinct GTR + Γ + inv sites
submodels for 1st, 2nd, and 3rd codon positions or for different genes). A
number of protein models, using parameters estimated from large-scale
analyses of protein databases, are also available. It is the only known
package implementing the covarion model.
PAML - http://abacus.gene.ucl.ac.uk/software/paml.html is
a package of programs for phylogenetic analyses of DNA or protein
sequences using maximum likelihood. It contains a modular set of programs
for a flexible range of likelihood analyses (submodels of the GTR + Γ + inv
sites model, amino acid models, codon-based models). It is not designed
for tree searches, but its flexibility makes it ideal for analyses of the
evolutionary process and estimation of evolutionary parameters. PAML
has a simulator module called “evolver” that is also quite flexible.
PHYLIP - http://evolution.genetics.washington.edu/phylip.html is
a modular set of programs for various types of phylogenetic analyses
(including likelihood analyses of DNA and proteins). It implements
a heuristic tree space search algorithm, which is faster than PAML, but
does not search as rapidly or as extensively as PAUP*.
10.3 Implementation
MBEToolbox is written in the Matlab language and has been tested on
the Windows platform with Matlab version 6.1.0. The main functions
implemented are: sequence manipulation, computation of evolutionary
distances derived from nucleotide-, amino acid- or codon-based substi-
tution models, phylogenetic tree construction, sequence statistics and
graphics functions to visualize the results of analyses. Although it imple-
ments only a small fraction of the multiplicity of existing methods used
in molecular evolutionary analyses, interested users can easily extend the
toolbox.
10.3.1 Input data and formats
MBEToolbox requires a single ASCII file containing the nucleotide or
amino acid sequence alignment in either Phylip [86], ClustalW [312]
or Fasta format. If an unaligned sequence file is provided, the toolbox
can align it through a built-in ClustalW [312] interface. Protein-coding
DNA sequences can be automatically aligned based on the corresponding
protein alignment with the command alignseqfile.
After input, in common with the MathWorks bioinformatics tool-
box, MBEToolbox represents the alignment as a numeric matrix with
every element standing for a nucleic or amino acid character. Nucleotides
A, C, G and T are converted to integers 1 to 4, and the 20 amino acids are
converted to integers 1 to 20. A header, containing information about the
names and type of the sequences as well as the relevant genetic code for
protein-coding nucleotides, is attached to the alignment matrix to form a
Matlab structure. An example alignment structure, aln, in Matlab code
follows:
aln =
seqtype: 2
geneticcode: 1
seqnames: 1xn cell
seq: [nxm double]
where n is the number of sequences and m is the length of the aligned
sequences. The type of sequence is denoted by 1, 2 or 3 for sequences
of non-coding nucleotides, protein coding nucleotides and amino acids,
respectively.
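For readers who wish to experiment, a structure with the same four fields can be assembled by hand, as in the sketch below. The sequences and names are invented, the interpretation of geneticcode = 1 as the standard code is an assumption, and the structures produced by MBEToolbox's own file readers may differ in detail.

```matlab
% Three short made-up coding sequences, already aligned.
seqs = ['ATGGCA'; 'ATGGCT'; 'ATGACA'];

% Convert characters to integers: A, C, G, T -> 1, 2, 3, 4.
[~, code] = ismember(seqs, 'ACGT');

aln.seqtype     = 2;                          % protein-coding nucleotides
aln.geneticcode = 1;                          % assumed: standard genetic code
aln.seqnames    = {'seq1', 'seq2', 'seq3'};   % hypothetical names
aln.seq         = code;                       % n-by-m numeric matrix

[n, m] = size(aln.seq);                       % n sequences of aligned length m
```

Holding the alignment as a plain numeric matrix is what allows ordinary Matlab indexing and logical operations to act on whole sequences or columns at once.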
10.3.2 Sequence Manipulation and Statistics
The alignment structure, aln, can be manipulated using the Matlab lan-
guage. For example, aln.seq(x,:) will extract the xth sequence from
the alignment, while aln.seq(:,[i:j]) will extract columns i to j from
the alignment. Users may easily extract more specific positions by us-
ing functions developed in the toolbox, such as extractpos(aln,3) or
extractdegeneratesites to obtain the third codon positions or fourfold
degenerate sites, respectively. For each sequence, some basic statistics
such as the nucleotide composition (ntcomposition) and GC content,
can be reported. Other functions include the calculation of the relative
synonymous codon usage (RSCU) and the codon adaptation index (CAI),
counts of segregating sites, taking the reverse complement or translating
a sequence, and determining the sequence complexity.
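The indexing operations mentioned above can be tried on a small coded matrix. The sketch below uses plain Matlab indexing only; it does not call extractpos or ntcomposition, whose internals are not shown here, and the data are invented.

```matlab
% Coded alignment (A, C, G, T = 1..4): three sequences, two codons each.
seq = [1 4 3 3 2 1;
       1 4 3 3 2 4;
       1 4 3 1 2 1];

second   = seq(2, :);            % the second sequence
cols2to4 = seq(:, 2:4);          % alignment columns 2 to 4
third    = seq(:, 3:3:end);      % third codon positions

% GC content of each sequence: proportion of sites coded 2 (C) or 3 (G).
gc = mean(seq == 2 | seq == 3, 2);
```

The GC calculation is a single vectorised expression over the whole alignment, with no explicit loop over sequences or sites.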
10.3.3 Evolutionary Distances
The evolutionary distance is one of the important measures in molecu-
lar evolutionary studies. It is required to measure the diversity among
sequences and to infer distance-based phylogenies. MBEToolbox con-
tains a number of functions to calculate evolutionary distances based
on the observed number of differences. The formulae used in these
functions are analytical solutions of a variety of Markov substitution
models, such as JC69 [153], K2P [165], F84 [86], HKY [126] (see [229]
for detail). Given the stationarity condition, the most general form of
Markov substitution models is the General Time Reversible (GTR or
REV) model [185, 309, 266, 358]. There is no analytical formula to cal-
culate the GTR distance directly. A general method, described by Ro-
driguez et al. [266], has been implemented here. In this method a matrix
F, where Fij denotes the proportion of sites for which sequence 1 (s1) has
an i and sequence 2 (s2) has a j, is formed. The GTR distance between
s1 and s2 is then given by
\[ d = -\mathrm{tr}\left(\Pi \log\left(\Pi^{-1} F\right)\right) \]
where Π denotes the diagonal matrix with values of nucleotide equilib-
rium frequencies on the diagonal, and tr(A) denotes the trace of matrix
A. The above formula can be expressed in Matlab syntax directly as:
>> d=-trace(PI*logm(inv(PI)*F))
MBEToolbox also calculates the gamma distribution distance and the
LogDet distance [295] (i.e., Lake’s paralinear distance [184]).
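A worked example may help to show how F and Π are obtained before the formula is applied. In the sketch below the count table is invented, and both the symmetrisation of F and the estimation of Π from the margins of F are assumptions made for this illustration rather than a description of the toolbox code.

```matlab
% Made-up counts: N(i,j) = number of sites with base i in s1 and base j
% in s2 (order A, C, G, T). From coded sequences such a table could be
% tallied with accumarray([s1(:) s2(:)], 1, [4 4]).
N = [90  2  5  1;
      3 85  1  6;
      4  2 95  2;
      1  7  1 95];

F  = (N + N') / (2 * sum(N(:)));   % symmetrised proportions, sum(F(:)) = 1
PI = diag(sum(F, 2));              % frequencies taken from the margins of F

d = -trace(PI * logm(PI \ F));     % GTR distance; PI\F equals inv(PI)*F
p = 1 - trace(F);                  % uncorrected proportion of differing sites
```

The corrected distance d is larger than the raw proportion p because the matrix logarithm accounts for multiple substitutions at the same site.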
For alignments of codons, the toolbox provides calculation or
estimation of the synonymous (Ks) and non-synonymous (Ka) substitution
rates by the counting method of Nei and Gojobori [228], the degenerate
methods of Li, Wu and Luo [198], the method of Li and of Pamilo and
Bianchi [197, 242], as well as the maximum likelihood method through
PAML [360]. All these methods for calculating Ks and Ka require that
the input sequences are aligned in the appropriate reading frame, which
can be performed by the function alignseqfile. Unresolved codon sites
are removed automatically. In addition, several quantities, including
the number of substitutions per site at only synonymous sites, at
only non-synonymous sites, at only four-fold-degenerate sites, or at only
zero-fold-degenerate sites, can be calculated. The outputs of these
calculations are distance matrices, which can be exported into text or
Excel files, or used directly in further operations.
10.3.4 Phylogeny Inference
Two distance-based tree creation algorithms, Unweighted Pair Group
Method with Arithmetic mean (UPGMA) and neighbour-joining (NJ)
[273] are provided and trees from these methods can be displayed or ex-
ported. Maximum parsimony and maximum likelihood algorithms can
be applied to nucleotide or amino acid alignments through an interface
to the Phylip package [86]. As properly implemented maximum likeli-
hood methods are the best vehicles for statistical inference of evolution-
ary relationships among species from sequence data, several maximum
likelihood functions have been explicitly implemented in MBEToolbox.
These functions allow users to incorporate various evolutionary models,
estimate parameters and compare different evolutionary trees.
The simplest case of estimation of the evolutionary distance between
two sequences, s1 and s2, can be considered as the estimation of the
branch length (the number of substitutions along a branch) separating
ancestor and descendent nodes. Branch lengths, relative to a calibrated
molecular clock, can reveal the time interval for this separation. A con-
tinuous time Markov process is generally used to model evolution along
the branch from s1 to s2. A transition rate matrix, Q, is used to indicate
the rate of changing from one state to another. For a specified time in-
terval or distance, t, the transition probability matrix is calculated from
P(t) = e^{Qt}. If there are N sites, the full likelihood is
\[ L = \prod_{i=1}^{N} \pi_{s_{1,i}} \, P(s_{1,i} \to s_{2,i},\, t) \]
In this equation, s_{1,i} and s_{2,i} are the ith bases of sequences 1 and 2
respectively; \pi_{s_{1,i}} is the expected frequency of base s_{1,i}.
In MBEToolbox, to calculate the likelihood, L, at a given time interval
(or distance) t, we have to specify a substitution model by using an
appropriate model-defining function, such as modeljc, modelk2p or modelgtr
for non-coding nucleotides, modeljtt or modeldayhoff for amino acids,
or modelgy94 for codons. These functions return a model structure
composed of an instantaneous rate matrix, R, and an equilibrium frequency
vector, pi, which together give Q (Q=R*diag(pi)). Once the model is
specified, the function likelidist(t,model,s1,s2) can calculate the log
likelihood of the alignment of the two sequences, s1 and s2, with respect
to the time or distance, t, under the substitution model, model.
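The calculation performed for a pair of sequences can be illustrated with a short self-contained sketch under the JC model. It uses only base Matlab (expm) and invented sequences; the anonymous function lnL is a hypothetical stand-in and is not the code of likelidist.

```matlab
% Two short made-up sequences coded A, C, G, T = 1..4 (no gaps).
s1 = [1 2 3 4 1 2 3 4 1 1];
s2 = [1 2 3 4 1 2 3 1 1 3];

piv = [0.25 0.25 0.25 0.25];          % JC equilibrium frequencies
R   = (ones(4) - eye(4)) * (4/3);     % scaled so that t is in substitutions/site
Q   = R * diag(piv);
Q   = Q - diag(sum(Q, 2));            % rows sum to zero

% Log-likelihood: sum over sites of log( pi(s1_i) * P(s1_i -> s2_i, t) ).
idx   = sub2ind([4 4], s1, s2);
index = @(M, k) M(k);                 % helper for indexing a temporary matrix
lnL   = @(t) sum(log(piv(s1) .* index(expm(Q*t), idx)));

value = lnL(0.2);                     % log-likelihood at distance t = 0.2
```

Under JC the same value follows from the closed-form transition probabilities, 1/4 + 3/4 exp(-4t/3) for identical sites and 1/4 - 1/4 exp(-4t/3) for differing sites, which gives a simple check on the matrix exponential.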
In most cases we wish to estimate t instead of calculating L as a func-
tion of t, so the function optimlikelidist(model,s1,s2) will search for
the t that maximises the likelihood by using the Nelder-Mead simplex (di-
rect search) method, while holding the other parameters in the model at
fixed values. This constraint can be relaxed by allowing every parameter
in the model to be estimated by functions, such as optimlikelidistk2p,
that can estimate both t and the model’s parameters. Figure 10.2(a and
b) illustrates the estimation of the evolutionary distance between two
ribonuclease genes through the fixed- and free-parameter K2P models,
respectively. When the K2P model’s parameter, kappa, is fixed, the re-
sult and trace of the optimisation process is illustrated by the graph of
L and t (Fig. 10.2a). When kappa is a free parameter, a surface shows
the result and trace of the optimisation process (Fig. 10.2b).
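The free-parameter case can be sketched with fminsearch, Matlab's Nelder-Mead simplex routine, applied to the K2P model. The site counts below are invented, the function names are hypothetical, and the parameterisation on a log scale is a choice made for this sketch to keep t and kappa positive; it is not taken from optimlikelidistk2p.

```matlab
% Made-up site counts for a pairwise alignment (assumed, not real data).
n_same = 400; n_ts = 50; n_tv = 24;
n = n_same + n_ts + n_tv;

% K2P rate matrix scaled to one substitution per site per unit time.
ts    = logical([0 0 1 0; 0 0 0 1; 1 0 0 0; 0 1 0 0]);   % A<->G, C<->T
rate  = @(k) ((ones(4) - eye(4)) .* (1 + (k - 1) * ts)) / (k + 2);
makeQ = @(k) rate(k) - diag(sum(rate(k), 2));

% Probabilities of an identical site, a transition and a transversion.
pick = @(P) [P(1,1) P(1,3) P(1,2)];
lnL  = @(t, k) n*log(0.25) + [n_same n_ts n_tv] * log(pick(expm(makeQ(k)*t)))';

% Optimise over log(t) and log(kappa) so that both stay positive.
x_hat     = fminsearch(@(x) -lnL(exp(x(1)), exp(x(2))), log([0.1 2]));
t_hat     = exp(x_hat(1));
kappa_hat = exp(x_hat(2));

% Closed-form K2P distance for comparison.
P = n_ts / n;  V = n_tv / n;
d_k2p = -0.5*log(1 - 2*P - V) - 0.25*log(1 - 2*V);
```

For two sequences under K2P the maximum likelihood distance coincides with the analytical K2P distance, so t_hat and d_k2p should agree to within the optimiser's tolerance.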
When calculating the likelihood of a phylogenetic tree, where s1 and
s2 are two (descendant) nodes in a tree joined to an internal (ancestor)
node, sa, we must sum over all possible assignments of nucleotides to sa
to get the likelihood of the distance between s1 and s2. Consequently,
the number of possible combinations of nucleotides becomes too large to
be enumerated for even moderately sized trees. The pruning algorithm
[Figure 10.2: Log-likelihood of evolutionary distance. (a) Likelihood as function of K2P distance. Distance is estimated by maximising likelihood of the alignment when the bias of transition and transversion, kappa, is fixed. (b) Likelihood as function of distance and kappa. Both distance and kappa are numerically optimised simultaneously to give maximum likelihood. The maximum likelihood peaks are marked with *. The two sequences used are coding regions of two mammalian ribonuclease genes, enc, of 474 bp. Axes: distance (substitutions/site) and ln(Likelihood) in (a); distance (substitutions/site), kappa and ln(Likelihood) in (b).]
introduced by Felsenstein [85] takes advantage of the tree topology to
evaluate the summation in a computationally efficient (but mathemati-
cally equivalent) manner. This and a simple and elegant mapping from a
‘parentheses’ encoding of a tree to the matrix equation for calculating the
likelihood of a tree, developed in the Matlab software, PhylLab [271],
have been adopted in likelitree.
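The equivalence between pruning and the brute-force summation can be verified on a tiny example. The sketch below evaluates one site on a three-taxon rooted tree under the JC model; the tree, branch lengths and observed bases are invented, and the code illustrates the idea only, not the implementation in likelitree.

```matlab
Q    = (ones(4) - 4*eye(4)) / 3;        % JC rate matrix, one substitution per unit time
piv  = [0.25 0.25 0.25 0.25];
P    = @(t) expm(Q*t);
leaf = @(s) double((1:4)' == s);        % conditional likelihoods at a tip

% Tree ((s1:0.1,s2:0.2):0.05,s3:0.3), observed at one site as A, A, G.
P1 = P(0.1); P2 = P(0.2); Pa = P(0.05); P3 = P(0.3);

% Pruning: combine conditional likelihood vectors from the tips to the root.
La = (P1*leaf(1)) .* (P2*leaf(1));      % internal node joining s1 and s2
Lr = (Pa*La)      .* (P3*leaf(3));      % root
L_prune = piv * Lr;

% Brute force: sum over all 16 assignments to the two internal nodes.
L_brute = 0;
for r = 1:4
    for a = 1:4
        L_brute = L_brute + piv(r)*Pa(r,a)*P1(a,1)*P2(a,1)*P3(r,3);
    end
end
```

Both routes give the same site likelihood, but the pruning route replaces the nested loops with a handful of matrix-vector products, and the saving grows rapidly with the number of internal nodes.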
10.3.5 Combination of functions
Basic operations can be combined to give more complicated functions.
A simple combination of the function to extract the fourfold degenerate
sites with the function to calculate GC content produces a new function
(countgc4) that determines the GC content at 4-fold degenerate sites
(GC4). A subfunction for calculating synonymous and nonsynonymous
differences between two codons, getsynnonsyndiff, can be converted
into a program for calculating codon volatility [251] with trivial effort.
Similarly, karlinsig which returns Karlin’s genomic signature (the din-
ucleotide relative abundance or bias) for a given sequence can be easily
re-formulated to estimate relative di-codon frequencies, which may be a
new index of biological signals in a coding sequence. In addition, the
menu-driven user interface, MBEGUI, is also a good example illustrating
the power of combination of basic MBEToolbox functions.
10.3.6 Graphics and GUI
Good visualisation is essential for successful numerical model building.
Leveraging the rich graphics functionality of Matlab, MBEToolbox pro-
vides a number of functions that can be used to create graphic output,
such as scatterplots of Ks vs Ka, plots of the number of transitions and
transversions against genetic distance, sliding window analyses on a nu-
cleotide sequence and the Z-curve (a 3-dimensional curve representation
of a DNA sequence [372]). A simple menu-driven graphical user interface
(GUI) has been developed by using GUIDE (Graphical User Interface
Development Environment) in Matlab. The top menu includes File,
Sequences, Distances, Phylogeny, Graph, Polymorphism and Help sub-
menus (Fig. 10.3). It aids the usage of the most frequently required
functions so that users do not have to run any scripts or functions from
the Matlab command line in most cases.
10.4 Results and Discussion
Only a few Matlab toolboxes or functions are freely available for data
analysis, exploration, and visualisation of nucleotide and protein
sequences. The toolbox presented here, MBEToolbox, fulfils the most
obvious needs in sequence manipulation, genetic distance estimation and
phylogeny inference in the Matlab environment. Moreover, it is an
extensible functional framework to formulate and solve problems in
evolutionary data analysis; it facilitates the rapid construction of both
general applications and special-purpose tools for computational
biologists in a fraction of the time it would take to write a program in a
scalar noninteractive language such as C or FORTRAN.
10.4.1 Vectorisation simplifies programming
Matlab is a matrix language, which means it is designed for vector and
matrix operations. Programming can be simplified and made more effi-
cient by using algorithms that take advantage of vectorisation (converting
for and while loops to the equivalent vector or matrix operations). The
Matlab compiler in version 7.0 will automatically recognise and vectorise
loops without recursion. An example of vectorisation is the calculation
of Z-scores [246] for Smith-Waterman alignments [291] to give a mea-
sure of the significance of an alignment score against a background of
scores from randomly generated sequences with the same composition
and length. Hence, Z-scores are designed to overcome the bias due to the
[Figure 10.3: MBEToolbox GUI. (a) Distances submenu; (b) Phylogeny submenu; and (c) Graph submenu.]
composition of the alignment and are usually calculated by comparing
an actual alignment score with the scores obtained on a set of random
sequences generated by a Monte-Carlo process. The Z-score is defined
as:
Z(A, B) = (S(A,B)−mean)/standard deviation
where S(A,B) is the Smith-Waterman (S-W) score between two se-
quences A and B. The mean and standard deviation are taken from
realignments of the permuted sequences. The algorithm is implemented
as follows in Matlab with as few as 15 lines of code:
function [z,z_raw]=zscores(s1,s2,nboot)
% ZSCORES Z-score of the Smith-Waterman score of s1 against s2,
% estimated from nboot random permutations of each sequence.
m1=length(s1);
m2=length(s2);
% Initialise two vectors holding the S-W scores of
% s1_rep and s2_rep, i.e., replicate samples
% of sequences s1 and s2.
v_z1=zeros(1,nboot);
v_z2=zeros(1,nboot);
z_raw=smithwaterman(s1,s2);
for k=1:nboot
    % Shuffle one sequence at a time, keeping its composition.
    s1_rep=s1(:,randperm(m1));
    v_z1(1,k)=smithwaterman(s1_rep, s2);
    s2_rep=s2(:,randperm(m2));
    v_z2(1,k)=smithwaterman(s1, s2_rep);
end
z1=(z_raw-mean(v_z1))./std(v_z1);
z2=(z_raw-mean(v_z2))./std(v_z2);
% Report the more conservative of the two Z-scores.
z=min(z1,z2);
where randperm(n) is a vector function returning a random permutation
of the integers from 1 to n and smithwaterman performs local alignment
by the standard dynamic programming technique.
10.4.2 Extensibility
An important distinction between compiled languages with subroutine
libraries and interactive environments like Matlab is the ease with which
problems can be specified and solved in the latter. Moreover, Matlab
toolboxes are traditionally organised in a less object-oriented mode and,
consequently, functions are more independent of each other and easier to
combine and extend. Several examples were given in the Implementation
section.
10.4.3 Comparison with other toolboxes
Some other toolboxes have been developed in Matlab for bioinformatics
related analyses. These include PhylLab [271] and MatArray [327]
as well as the bioinformatics toolbox developed by MathWorks. Other
examples can be found at the link and file exchange maintained at Mat-
lab Central [42]. PhylLab is a molecular phylogeny toolbox which
also provides some functions for sequence and tree input and manipula-
tion. Its main focus is on creating a maximum likelihood tree based on
Bayesian principles using a Markov chain Monte Carlo method to com-
pute posterior parameter distributions. MatArray is focussed on the
analysis of gene expression data from microarrays and provides normali-
sation and clustering functions but does not address molecular evolution.
The bioinformatics toolbox from MathWorks provides a range of bioin-
formatics functions, including some related to molecular evolution.
MBEToolbox provides a much broader range of molecular evolution
related functions and phylogenetic methods than either the more
specialised PhylLab project or the more general bioinformatics toolbox from
MathWorks. These extra functions include IO in Phylip format, sta-
tistical and sequence manipulation functions relevant to molecular evo-
lution (e.g. count segregating sites), evolutionary distance calculation
for nucleic and amino acid sequences, phylogeny inference functions and
graphic plots relevant to molecular evolution (e.g. Ka vs Ks). As such
it makes an important contribution to the bioinformatics analyses that
can be performed in the Matlab environment.
10.4.4 A novel enhanced window analysis
To test for the selective pressures in the different lineages of a phyloge-
netic tree, the nonsynonymous to synonymous rate ratio (Ka/Ks) is nor-
mally estimated [281, 4, 61]. Values of Ka/Ks = 1, > 1, or < 1 indicate
neutrality, positive selection, or purifying selection, respectively. How-
ever, Ks and Ka are measurements of average synonymous and nonsyn-
onymous substitutions per site along the whole length of the sequences.
Average Ks and Ka values give neither the pattern of intragenic fluc-
tuation of selective constraints, nor region- or site-specific information.
A sliding window method is usually adopted to examine the intragenic
pattern of the substitution rates and to test for the occurrence of signifi-
cant clusters of variant regions [55, 145, 80, 53]. Significant heterogeneity
in Ks would indicate that the neutral substitution rate varies across the
gene, whereas heterogeneity in Ka may indicate that selective constraints
vary along the gene. The results and accuracy of sliding window meth-
ods, either overlapping or non-overlapping, depend on both the size of
the window and the moving distance adopted. Large window lengths
may obliterate the details of patterns in Ks or Ka, whereas small win-
dow lengths usually result in larger statistical fluctuations. Hence, the
[Figure 10.4: Comparison between sliding window and enhanced sliding window methods. Sliding window analysis of Ks and Ka for the concatenated coding regions of two hepatitis C virus strains, HCV-JS and HCV-JT. The number of codons for the C, E1, E2, NS2, NS3, NS4, NS5A, and NS5B genes are 191, 192, 426, 217, 631, 315, 447, and 591, respectively. The different coding regions are separated by vertical lines. (a) illustrates the result of a normal sliding window analysis; (b) illustrates the result of the enhanced sliding window analysis. Beginnings and ends of regions poor in synonymous substitutions (slope < 0) are indicated by the arrows a and b (genes C and E1) and e and f (gene NS5B). A region rich in synonymous substitutions (slope > 0) in gene NS3 is indicated by arrows c and d. Axes: codon site against substitution number per site in (a) and against transformed substitution number per site in (b); curves: syn and nonsyn.]
resolution of a sliding window is usually limited.
A mathematical formalism, similar to the Z’-curve [368], is introduced
here to solve this problem. Consider a subsequence-based analysis of Ks
or Ka. In the n-th step, count the cumulative number of Ks or Ka
occurring from the first to the n-th nucleotide position in the gene
sequences being inspected. Let K denote either Ks or Ka and K(n) denote
the cumulative K at the n-th sequence position. K(n) is usually an
approximately linear, monotonically increasing function of n. The points
(n, K(n)), n = 1, 2, · · · , N, are fitted by the least squares method to a
linear function, f(n) = βn, to give a straight line with slope β. We define
K′(n) = K(n) − βn
The two-dimensional curve of (K′(n) ∼ n) gives an alternative
representation of the normal sliding window curve.
To compare these two curve representations, the example dataset of
Suzuki and Gojobori [303], which contains the coding regions of two
hepatitis C virus strains (HCV-JS, GenBank Acc.: D85516, and HCV-JT,
GenBank Acc.: D11168), was used. The entire coding sequence is
divided into eight regions (C, E1, E2, NS2, NS3, NS4, NS5A, NS5B).
Some of the coding regions have been combined, as these short ORFs are
unlikely to yield meaningful Ks and Ka values. The reduction of Ks
in the C, E1 and NS5B regions, as well as its elevation in NS3, which
have been shown in a previous study [303], are not clear in a standard
sliding window representation (Fig. 10.4a). In contrast, they are
evident in the (K′(n) ∼ n) curve (Fig. 10.4b), where a sharp increase
in the curve indicates an increase in K, while a drop in the curve
indicates a decrease in K. This new method has been implemented in the
function plotSlidingKaKs. Since it is derived from the sliding window
method, it is called the enhanced sliding window method.
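The positions of the vertical lines that separate the coding regions in Fig. 10.4 follow from the codon counts given in the figure caption. A short Python sketch of that arithmetic is given here; the function and variable names are illustrative assumptions and are not part of MBEToolbox.

```python
from itertools import accumulate

# Codon counts per coding region, as listed in the caption of Figure 10.4.
GENES = ["C", "E1", "E2", "NS2", "NS3", "NS4", "NS5A", "NS5B"]
CODONS = [191, 192, 426, 217, 631, 315, 447, 591]


def region_boundaries(codon_counts):
    """Return the last codon site of each region in the concatenated sequence."""
    return list(accumulate(codon_counts))


def region_of(site, genes=GENES, codon_counts=CODONS):
    """Return the name of the region containing a 1-based codon site."""
    if site < 1:
        raise ValueError("codon sites are numbered from 1")
    for gene, end in zip(genes, region_boundaries(codon_counts)):
        if site <= end:
            return gene
    raise ValueError("site lies beyond the concatenated coding sequence")
```

The concatenated sequence is 3010 codons long, which agrees with the codon-site axis of the figure ending near 3000.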
10.4.5 Limitations
The current version of this toolbox does not introduce novel algorithms;
rather, it implements a variety of existing ones. There are some
limitations in the practical use of MBEToolbox. First, although the
toolbox provides many functions for sequence manipulation and
evolutionary analysis, the full range of these features can only be
accessed through the Matlab command line interface, as in the majority
of Matlab packages. Second, some of the functions cannot handle
ambiguous nucleotide or amino acid codes in the sequences. Future
development of MBEToolbox is intended to overcome these limitations.
In summary, the MBEToolbox project is an ongoing effort to provide an
easy-to-use yet powerful analysis environment for molecular biology
and evolution. Currently, it offers a solid set of frequently used
functions to manipulate sequences, calculate genetic distances, infer
phylogenetic trees and perform related analyses. MBEToolbox is a useful
tool that may encourage evolutionary biologists to take advantage of
Matlab. Moreover, it has been widely applied in data analysis in the
Penicillium marneffei genome project, as mentioned on pages 73, 113,
146, 161 and 190.
Chapter 11
CONCLUDING REMARKS
In this last chapter I provide a summary of the conclusions of the
preceding chapters, together with recommendations for future research.
Chapter 1 presented the draft genome of the important thermally
dimorphic fungus Penicillium marneffei. A number of features of this
pathogenic fungus have been uncovered.
The similarity between the mitochondrial genome of P. marneffei and
that of nonpathogenic Aspergillus species (Chapter 3) suggests that
P. marneffei is more closely related to moulds than to yeasts, which is
consistent with the established classification. No direct association
between mitochondrion-encoded genetic components and pathogenicity was
observed. Moreover, the in silico evidence for the capability of
melanin biosynthesis in P. marneffei (Chapter 4) should inspire further
research towards the experimental elucidation of melanin’s role in
fungal virulence. Based on this computational finding, gene knockout
and in vivo animal survival analyses are being undertaken in our
department. The possible presence of a sexual cycle in P. marneffei
reported in Chapter 5 is highly significant for genetic studies of the
fungus, since the sexual cycle could be a useful genetic tool for
studying the way in which the fungus causes disease. On the other
hand, if the fungus does reproduce sexually as part of its life cycle,
it might evolve more rapidly to become resistant to antifungal drugs,
because sex might create new strains with increased ability to cause
disease and infect humans. Chapter 6 explored our current knowledge of
the genetic components related to fungal morphogenesis, with an
emphasis on the molecular mechanism of dimorphic switching. More
research is nevertheless required in the following directions:
(i) perception of external stimuli by cellular sensors;
(ii) transduction of the biochemical signal; (iii) alteration of
genomic expression; and (iv) structural reorganization towards the
morphological change, in order to complete this task, which is still
far from accomplished. The presence of over-abundant intragenic tandem
repeats (IntraTRs) in the P. marneffei genome is a striking finding
(Chapter 7). The IntraTRs may create quantitative alterations in
phenotypes (e.g., adhesion, flocculation or biofilm formation). The
variation resulting from these quantitative alterations of the fungal
cell surface may have allowed the fungus to ‘disguise’ itself in order
to slip past the vigilant defences of the host immune system. Many
P. marneffei proteins contain tandemly repeated domains or motifs with
some degree of homology to the Plasmodium erythrocyte-binding protein
domain.
The area of gene and genome duplication and its evolutionary
significance has attracted considerable attention from researchers in
recent years. Chapter 8 represents a novel contribution to the field by
presenting a description of gene duplication in five ascomycetes. We
calculated the rates of synonymous and non-synonymous substitution
using the codon substitution model and reported large variation in the
proportion of genes in multigene families across these fungi. We also
suggest that paralogs of filamentous fungi are under less selective
constraint than orthologs (although this does not hold for yeasts),
that there is a lack of evidence for an association between asymmetry
in rates of evolution and positive selection, and finally that
different extents and consequences of gene duplication may explain some
of the phenotypic variation of the ascomycetes. One new conclusion,
that P. marneffei may have undergone a whole-genome duplication, is not
solidly supported by the evidence presented so far; analysis of gene
order information will be necessary to support the claim once the
P. marneffei genome sequencing approaches completion. Moreover, at the
time the analysis was performed, the Aspergillus genomes remained
unpublished; the underlying data might change, and results from a
premature analysis might be hard to reproduce or become obsolete.
Therefore, no Aspergillus genome was included in the comparison;
further analysis of this sort should overcome this limitation.
In addition, in Chapter 9 we analysed genes with various degrees of
conservation among species, as measured by the lineage specificity (LS)
of genes. We examined the correlations between evolutionary rate and
LS, as well as several other related factors, such as expression,
essentiality, and protein-protein interactions. We found that in seven
ascomycete genomes, the more lineage-specific a gene, the higher its
evolutionary rate. This is taken as evidence for the hypothesis that
orphan genes arise as a result of a higher rate of evolution. This
general rule also helps to explain the origin of P. marneffei-specific
genes.
Finally, two software products, the P. marneffei genome database and
MBEToolbox for sequence data analysis, have been developed (Chapters
2 and 10). Together they cover two major aspects of bioinformatics,
i.e., biological database management systems and algorithm
development. They have been successfully applied throughout the whole
genome project, and proved to be efficient and sufficient.
In conclusion, the boom in fungal genome sequence data over the past
few years came with high expectations for new insights into fungal
biology and pathogen control strategies. In the case of P. marneffei,
it has become evident that computational approaches can be used to
decipher the genome so as to derive biological meaning and infer
evolutionary processes. This work paves the way for a systematic
experimental study of the pathogenic fungus.
BIBLIOGRAPHY
[1] N. Adames, K. Blundell, M. N. Ashby, and C. Boone. Role of yeast insulin-degrading enzyme homologs in propheromone processing and bud site selection. Science, 270(5235):464–7, 1995.
[2] M. D. Adams, S. E. Celniker, R. A. Holt, C. A. Evans, J. D. Gocayne, P. G. Amanatides, S. E. Scherer, P. W. Li, R. A. Hoskins, R. F. Galle, R. A. George, S. E. Lewis, S. Richards, M. Ashburner, S. N. Henderson, G. G. Sutton, J. R. Wortman, M. D. Yandell, Q. Zhang, L. X. Chen, R. C. Brandon, Y. H. Rogers, R. G. Blazej, M. Champe, B. D. Pfeiffer, K. H. Wan, C. Doyle, E. G. Baxter, G. Helt, C. R. Nelson, G. L. Gabor, J. F. Abril, A. Agbayani, H. J. An, C. Andrews-Pfannkoch, D. Baldwin, R. M. Ballew, A. Basu, J. Baxendale, L. Bayraktaroglu, E. M. Beasley, K. Y. Beeson, P. V. Benos, B. P. Berman, D. Bhandari, S. Bolshakov, D. Borkova, M. R. Botchan, J. Bouck, P. Brokstein, P. Brottier, K. C. Burtis, D. A. Busam, H. Butler, E. Cadieu, A. Center, I. Chandra, J. M. Cherry, S. Cawley, C. Dahlke, L. B. Davenport, P. Davies, B. de Pablos, A. Delcher, Z. Deng, A. D. Mays, I. Dew, S. M. Dietz, K. Dodson, L. E. Doup, M. Downes, S. Dugan-Rocha, B. C. Dunkov, P. Dunn, K. J. Durbin, C. C. Evangelista, C. Ferraz, S. Ferriera, W. Fleischmann, C. Fosler, A. E. Gabrielian, N. S. Garg, W. M. Gelbart, K. Glasser, A. Glodek, F. Gong, J. H. Gorrell, Z. Gu, P. Guan, M. Harris, N. L. Harris, D. Harvey, T. J. Heiman, J. R. Hernandez, J. Houck, D. Hostin, K. A. Houston, T. J. Howland, M. H. Wei, C. Ibegwam, et al. The genome sequence of drosophila melanogaster. Science, 287(5461):2185–95, 2000.
[3] L. Ajello, A. A. Padhye, S. Sukroongreung, C. H. Nilakul, and S. Tantimavanic. Occurrence of penicillium marneffei infections among wild bamboo rats in thailand. Mycopathologia, 131(1):1–8, 1995.
[4] H. Akashi. Within- and between-species dna sequence variation and the ‘footprint’ of natural selection. Gene, 238:39–51, 1999.
[5] J. A. Alspaugh, L. M. Cavallo, J. R. Perfect, and J. Heitman. Ras1 regulates filamentation, mating and growth at high temperature of cryptococcus neoformans. Mol Microbiol, 36(2):352–65, 2000.
[6] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389–402, 1997.
[7] M. A. Andrade, N. P. Brown, C. Leroy, S. Hoersch, A. de Daruvar, C. Reich, A. Franchini, J. Tamames, A. Valencia, C. Ouzounis, and C. Sander. Automated genome sequence analysis and annotation. Bioinformatics, 15(5):391–412, 1999.
[8] L. Aravind, H. Watanabe, D. J. Lipman, and E. V. Koonin. Lineage-specific loss and divergence of functionally linked genes in eukaryotes. Proc Natl Acad Sci U S A, 97(21):11319–24, 2000.
[9] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. the gene ontology consortium. Nat Genet, 25(1):25–9, 2000.
[10] C. R. Astell, L. Ahlstrom-Jonasson, M. Smith, K. Tatchell, K. A. Nasmyth, and B. D. Hall. The sequence of the dnas coding for the mating-type loci of saccharomyces cerevisiae. Cell, 27(1 Pt 2):15–23, 1981.
[11] J. Baker, J. McCarthy, M. Gatton, D. E. Kyle, V. Belizario, J. Luchavez, D. Bell, and Q. Cheng. Genetic diversity of plasmodium falciparum histidine-rich protein 2 (pfhrp2) and its effect on the performance of pfhrp2-based rapid diagnostic tests. J Infect Dis, 192(5):870–7, 2005.
[12] A. D. Basehoar, S. J. Zanton, and B. F. Pugh. Identification and distinct regulation of yeast tata box-containing genes. Cell, 116(5):699–709, 2004.
[13] A. Bateman, L. Coin, R. Durbin, R. D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E. L. Sonnhammer, D. J. Studholme, C. Yeats, and S. R. Eddy. The pfam protein families database. Nucleic Acids Res, 32(Database issue):D138–41, 2004.
[14] D. H. Beach and A. J. Klar. Rearrangements of the transposable mating-type cassettes of fission yeast. Embo J, 3(3):603–10, 1984.
[15] G. Bejerano and G. Yona. Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17(1):23–43, 2001.
[16] R. J. Bennett and S. C. West. Ruvc protein resolves holliday junctions via cleavage of the continuous (noncrossover) strands. Proc Natl Acad Sci U S A, 92(12):5635–9, 1995.
[17] P. Bork, T. Dandekar, Y. Diaz-Lazcoz, F. Eisenhaber, M. Huynen, and Y. Yuan. Predicting function: from genes to genomes and back. J Mol Biol, 283(4):707–25, 1998.
[18] A. R. Borneman, M. J. Hynes, and A. Andrianopoulos. The abaa homologue of penicillium marneffei participates in two developmental programmes: conidiation and dimorphic growth. Mol Microbiol, 38(5):1034–47, 2000.
[19] A. R. Borneman, M. J. Hynes, and A. Andrianopoulos. An ste12 homolog from the asexual, dimorphic fungus penicillium marneffei complements the defect in sexual development of an aspergillus nidulans stea mutant. Genetics, 157(3):1003–14, 2001.
[20] A. R. Borneman, M. J. Hynes, and A. Andrianopoulos. A basic helix-loop-helix protein with similarity to the fungal morphological regulators, phd1p, efg1p and stua, controls conidiation but not dimorphic growth in penicillium marneffei. Mol Microbiol, 44(3):621–31, 2002.
[21] V. L. Boyartchuk, M. N. Ashby, and J. Rine. Modulation of ras and a-factor function by carboxyl-terminal proteolysis. Science, 275(5307):1796–800, 1997.
[22] K. J. Boyce, M. J. Hynes, and A. Andrianopoulos. The cdc42 homolog of the dimorphic fungus penicillium marneffei is required for correct cell polarization during growth but not development. J Bacteriol, 183(11):3447–57, 2001.
[23] K. J. Boyce, M. J. Hynes, and A. Andrianopoulos. The ras and rho gtpases genetically interact to co-ordinately regulate cell polarity during development in penicillium marneffei. Mol Microbiol, 55(5):1487–501, 2005.
[24] A. A. Brakhage, K. Langfelder, G. Wanner, A. Schmidt, and B. Jahn. Pigment biosynthesis and virulence. Contrib Microbiol, 2:205–15, 1999.
[25] B. J. Breitkreutz, C. Stark, and M. Tyers. The grid: the general repository for interaction datasets. Genome Biol, 4(3):R23, 2003.
[26] C. Brenner and R. S. Fuller. Structural and enzymatic characterization of a purified prohormone-processing enzyme: secreted, soluble kex2 protease. Proc Natl Acad Sci U S A, 89(3):922–6, 1992.
[27] J. Brosius and S. J. Gould. On “genomenclature”: a comprehensive (and respectful) taxonomy for pseudogenes and other “junk dna”. Proc Natl Acad Sci U S A, 89(22):10706–10, 1992.
[28] D. W. Brown, J. H. Yu, H. S. Kelkar, M. Fernandes, T. C. Nesbitt, N. P. Keller, T. H. Adams, and T. J. Leonard. Twenty-five coregulated transcripts define a sterigmatocystin gene cluster in aspergillus nidulans. Proc Natl Acad Sci U S A, 93(4):1418–22, 1996.
[29] T. A. Brown, R. B. Waring, C. Scazzocchio, and R. W. Davies. The aspergillus nidulans mitochondrial genome. Curr Genet, 9(2):113–7, 1985.
[30] C. Burge and S. Karlin. Prediction of complete gene structures in human genomic dna. J Mol Biol, 268(1):78–94, 1997.
[31] M. Burset and R. Guigo. Evaluation of gene structure prediction programs. Genomics, 34(3):353–67, 1996.
[32] H. Bussey. Proteases and the processing of precursors to secreted proteins in yeast. Yeast, 4(1):17–26, 1988.
[33] H. J. Bussink and S. A. Osmani. A cyclin-dependent kinase family member (phoa) is required to link developmental fate to environmental conditions in aspergillus nidulans. Embo J, 17(14):3990–4003, 1998.
[34] E. T. Buurman, C. Westwater, B. Hube, A. J. Brown, F. C. Odds, and N. A. Gow. Molecular analysis of camnt1p, a mannosyl transferase important for adhesion and virulence of candida albicans. Proc Natl Acad Sci U S A, 95(13):7670–5, 1998.
[35] J. J. Cai, D. K. Smith, X. Xia, and K. Y. Yuen. Mbetoolbox: a matlab toolbox for sequence data analysis in molecular biology and evolution. BMC Bioinformatics, 6(1):64, 2005.
[36] R. Calderone. Molecular pathogenesis of fungal infections. Trends Microbiol, 2(12):461–3, 1994.
[37] L. Cao, C. M. Chan, C. Lee, S. S. Wong, and K. Y. Yuen. Mp1 encodes an abundant and highly antigenic cell wall mannoprotein in the pathogenic fungus penicillium marneffei. Infect Immun, 66(3):966–73, 1998.
[38] L. Cao, K. M. Chan, D. Chen, N. Vanittanakom, C. Lee, C. M. Chan, T. Sirisanthana, D. N. Tsang, and K. Y. Yuen. Detection of cell wall mannoprotein mp1p in culture supernatants of penicillium marneffei and in sera of penicilliosis patients. J Clin Microbiol, 37(4):981–6, 1999.
[39] L. Cao, D. L. Chen, C. Lee, C. M. Chan, K. M. Chan, N. Vanittanakom, D. N. Tsang, and K. Y. Yuen. Detection of specific antibodies to an antigenic mannoprotein for diagnosis of penicillium marneffei penicilliosis. J Clin Microbiol, 36(10):3028–31, 1998.
[40] T. J. Carver, K. M. Rutherford, M. Berriman, M. A. Rajandream, B. G. Barrell, and J. Parkhill. Act: the artemis comparison tool. Bioinformatics, 21(16):3422–3, 2005.
[41] L. L. Cavalli-Sforza and A. W. Edwards. Phylogenetic analysis. models and estimation procedures. Am J Hum Genet, 19(3):Suppl 19:233+, 1967.
[42] MATLAB Central. Matlab central, 2005.
[43] C. M. Chan, P. C. Woo, A. S. Leung, S. K. Lau, X. Y. Che, L. Cao, and K. Y. Yuen. Detection of antibodies specific to an antigenic cell wall galactomannoprotein for serodiagnosis of aspergillus fumigatus aspergillosis. J Clin Microbiol, 40(6):2041–5, 2002.
[44] Y. F. Chan and T. C. Chow. Ultrastructural observations on penicillium marneffei in natural human infection. Ultrastruct Pathol, 14(5):439–52, 1990.
[45] S. Chariyalertsak, T. Sirisanthana, K. Supparatpinyo, and K. E. Nelson. Seasonal variation of disseminated penicillium marneffei infections in northern thailand: a clue to the reservoir? J Infect Dis, 173(6):1490–3, 1996.
[46] S. Chariyalertsak, T. Sirisanthana, K. Supparatpinyo, J. Praparattanapan, and K. E. Nelson. Case-control study of risk factors for penicillium marneffei infection in human immunodeficiency virus-infected patients in northern thailand. Clin Infect Dis, 24(6):1080–6, 1997.
[47] S. Chariyalertsak, P. Vanittanakom, K. E. Nelson, T. Sirisanthana, and N. Vanittanakom. Rhizomys sumatrensis and cannomys badius, new natural animal hosts of penicillium marneffei. J Med Vet Mycol, 34(2):105–10, 1996.
[48] D. Charlesworth, B. Charlesworth, and G. A. McVean. Genome sequences and evolutionary biology, a two-way interaction. Trends Ecol Evol, 16(5):235–242, 2001.
[49] P. Chen, S. K. Sapperstein, J. D. Choi, and S. Michaelis. Biogenesis of the saccharomyces cerevisiae mating pheromone a-factor. J Cell Biol, 136(2):251–69, 1997.
[50] C. S. Chim, C. Y. Fong, S. K. Ma, S. S. Wong, and K. Y. Yuen. Reactive hemophagocytic syndrome associated with penicillium marneffei infection. Am J Med, 104(2):196–7, 1998.
[51] R. J. Cho, M. J. Campbell, E. A. Winzeler, L. Steinmetz, A. Conway, L. Wodicka, T. G. Wolfsberg, A. E. Gabrielian, D. Landsman, D. J. Lockhart, and R. W. Davis. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol Cell, 2(1):65–73, 1998.
[52] C. Y. Choi, E. L. Schneider, J. M. Kim, I. Y. Gluzman, D. E. Goldberg, J. A. Ellman, and M. A. Marletta. Interference with heme binding to histidine-rich protein-2 as an antimalarial strategy. Chem Biol, 9(8):881–9, 2002.
[53] S. S. Choi and B. T. Lahn. Adaptive evolution of mrg, a neuron-specific gene family implicated in nociception. Genome Res, 13:2252–2259, 2003.
[54] P. Chongtrakool, S. C. Chaiyaroj, V. Vithayasai, S. Trawatcharegon, R. Teanpaisan, S. Kalnawakul, and S. Sirisinha. Immunoreactivity of a 38-kilodalton penicillium marneffei antigen with human immunodeficiency virus-positive sera. J Clin Microbiol, 35(9):2220–3, 1997.
[55] A. G. Clark and T. Kao. Excess nonsynonymous substitution at shared polymorphic sites among self-incompatibility alleles of solanaceae. Proc Natl Acad Sci USA, 88:9823–9827, 1991.
[56] P. Cliften, P. Sudarsanam, A. Desikan, L. Fulton, B. Fulton, J. Majors, R. Waterston, B. A. Cohen, and M. Johnston. Finding functional features in saccharomyces genomes by phylogenetic footprinting. Science, 301(5629):71–6, 2003.
[57] L. Coin, A. Bateman, and R. Durbin. Enhanced protein domain discovery by using language modeling techniques from speech recognition. Proc Natl Acad Sci U S A, 100(8):4516–20, 2003.
[58] L. J. Collins, A. M. Poole, and D. Penny. Using ancestral sequences to uncover potential gene homologues. Appl Bioinformatics, 2(3 Suppl):S85–95, 2003.
[59] G. C. Conant and A. Wagner. Asymmetric sequence divergence of duplicate genes. Genome Res, 13(9):2052–8, 2003.
[60] A. Cooper and H. Bussey. Characterization of the yeast kex1 gene product: a carboxypeptidase involved in processing secreted precursor proteins. Mol Cell Biol, 9(6):2706–14, 1989.
[61] KA Crandall, CR Kelsey, H Imamichi, HC Lane, and NP Salzman. Parallel evolution of drug resistance in hiv: failure of nonsynonymous/synonymous substitution rate ratio to detect selection. Mol Biol Evol, 16:372–382, 1999.
[62] J. Davey, K. Davis, M. Hughes, G. Ladds, and D. Powner. The processing of yeast pheromones. Semin Cell Dev Biol, 9(1):19–30, 1998.
[63] F. De Bernardis, S. Arancia, L. Morelli, B. Hube, D. Sanglard, W. Schafer, and A. Cassone. Evidence that members of the secretory aspartyl proteinase gene family, in particular sap2, are virulence factors for candida vaginitis. J Infect Dis, 179(1):201–8, 1999.
[64] R. A. Dean, N. J. Talbot, D. J. Ebbole, M. L. Farman, T. K. Mitchell, M. J. Orbach, M. Thon, R. Kulkarni, J. R. Xu, H. Pan, N. D. Read, Y. H. Lee, I. Carbone, D. Brown, Y. Y. Oh, N. Donofrio, J. S. Jeong, D. M. Soanes, S. Djonovic, E. Kolomiets, C. Rehmeyer, W. Li, M. Harding, S. Kim, M. H. Lebrun, H. Bohnert, S. Coughlan, J. Butler, S. Calvo, L. J. Ma, R. Nicol, S. Purcell, C. Nusbaum, J. E. Galagan, and B. W. Birren. The genome sequence of the rice blast fungus magnaporthe grisea. Nature, 434(7036):980–6, 2005.
[65] C. d’Enfert, S. Goyard, S. Rodriguez-Arnaveilhe, L. Frangeul, L. Jones, F. Tekaia, O. Bader, A. Albrecht, L. Castillo, A. Dominguez, J. F. Ernst, C. Fradin, C. Gaillardin, S. Garcia-Sanchez, P. de Groot, B. Hube, F. M. Klis, S. Krishnamurthy, D. Kunze, M. C. Lopez, A. Mavor, N. Martin, I. Moszer, D. Onesime, J. Perez Martin, R. Sentandreu, E. Valentin, and A. J. Brown. Candidadb: a genome database for candida albicans pathogenomics. Nucleic Acids Res, 33(Database issue):D353–7, 2005.
[66] Z. L. Deng and D. H. Connor. Progressive disseminated penicilliosis caused by penicillium marneffei. report of eight cases and differentiation of the causative organism from histoplasma capsulatum. Am J Clin Pathol, 84(3):323–7, 1985.
[67] Z. L. Deng, M. Yun, and L. Ajello. Human penicilliosis marneffei and its relation to the bamboo rat (rhizomys pruinosus). J Med Vet Mycol, 24(5):383–9, 1986.
[68] E. T. Dermitzakis and A. G. Clark. Differential selection after duplication in mammalian developmental genes. Mol Biol Evol, 18(4):557–62, 2001.
[69] V. Desakorn, M. D. Smith, A. L. Walsh, A. J. Simpson, D. Sahassananda, A. Rajanuwong, V. Wuthiekanun, P. Howe, B. J. Angus, P. Suntharasamai, and N. J. White. Diagnosis of penicillium marneffei infection by quantitation of urinary antigen by using an enzyme immunoassay. J Clin Microbiol, 37(1):117–21, 1999.
[70] A. Dmochowska, D. Dignard, D. Henning, D. Y. Thomas, and H. Bussey. Yeast kex1 gene encodes a putative protease with a carboxypeptidase b-like function involved in killer toxin and alpha-factor precursor processing. Cell, 50(4):573–84, 1987.
[71] C. B. Do, M. S. Mahabhashyam, M. Brudno, and S. Batzoglou. Probcons: Probabilistic consistency-based multiple sequence alignment. Genome Res, 15(2):330–40, 2005.
[72] J. M. Dolence, L. E. Steward, E. K. Dolence, D. H. Wong, and C. D. Poulter. Studies with recombinant saccharomyces cerevisiae caax prenyl protease rce1p. Biochemistry, 39(14):4096–104, 2000.
[73] T. Domazet-Loso and D. Tautz. An evolutionary analysis of orphan genes in drosophila. Genome Res, 13(10):2213–9, 2003.
[74] R. F. Doolittle. The multiplicity of domains in proteins. Annu Rev Biochem, 64:287–314, 1995.
[75] J. Du, Y. Zhu, A. Shanmugam, and A. L. Kenter. Analysis of immunoglobulin sgamma3 recombination breakpoints by pcr: implications for the mechanism of isotype switching. Nucleic Acids Res, 25(15):3066–73, 1997.
[76] P. S. Dyer, M. Paoletti, and D. B. Archer. Genomics reveals sexual secrets of aspergillus. Microbiology, 149(Pt 9):2301–3, 2003.
[77] S. E. Eckert, B. Hoffmann, C. Wanke, and G. H. Braus. Sexual development of aspergillus nidulans in tryptophan auxotrophic strains. Arch Microbiol, 172(3):157–66, 1999.
[78] A. Edwards, H. A. Hammond, L. Jin, C. T. Caskey, and R. Chakraborty. Genetic variation at five trimeric and tetrameric tandem repeat loci in four human population groups. Genomics, 12(2):241–53, 1992.
[79] C. elegans Sequencing Consortium. Genome sequence of the nematode c. elegans: a platform for investigating biology. Science, 282(5396):2012–8, 1998.
[80] T Endo, K Ikeo, and T Gojobori. Large-scale search for genes on which positive selection may operate. Mol Biol Evol, 13:685–690, 1996.
[81] E. Eskin, W. N. Grundy, and Y. Singer. Protein family classification using sparse markov transducers. Proc Int Conf Intell Syst Mol Biol, 8:134–45, 2000.
[82] E. Espagne, P. Balhadere, M. L. Penin, C. Barreau, and B. Turcq. Het-e and het-d belong to a new subfamily of wd40 proteins involved in vegetative incompatibility specificity in the fungus podospora anserina. Genetics, 161(1):71–81, 2002.
[83] B. Ewing and P. Green. Base-calling of automated sequencer traces using phred. ii. error probabilities. Genome Res, 8(3):186–94, 1998.
[84] B. Ewing, L. Hillier, M. C. Wendl, and P. Green. Base-calling of automated sequencer traces using phred. i. accuracy assessment. Genome Res, 8(3):175–85, 1998.
[85] J. Felsenstein. Evolutionary trees from dna sequences: a maximum likelihood approach. J Mol Evol, 17:368–376, 1981.
[86] J. Felsenstein. Phylip – phylogeny inference package (version 3.2). Cladistics, 5:164–166, 1989.
[87] Fungal Research Community FGI. Fungal genome initiative (http://www.broad.mit.edu/annotation/fungi/fgi/), 2002.
[88] M. C. Fisher, D. Aanensen, S. de Hoog, and N. Vanittanakom. Multilocus microsatellite typing system for penicillium marneffei reveals spatially structured populations. J Clin Microbiol, 42(11):5065–9, 2004.
[89] M. C. Fisher, W. P. Hanage, S. de Hoog, E. Johnson, M. D. Smith, N. J. White, and N. Vanittanakom. Low effective dispersal of asexual genotypes in heterogeneous landscapes by the endemic pathogen penicillium marneffei. PLoS Pathog, 1(2):e20, 2005.
[90] A. Force, M. Lynch, F. B. Pickett, A. Amores, Y. L. Yan, and J. Postlethwait. Preservation of duplicate genes by complementary, degenerative mutations. Genetics, 151(4):1531–45, 1999.
[91] F. Foury, T. Roganti, N. Lecrenier, and B. Purnelle. The complete sequence of the mitochondrial genome of saccharomyces cerevisiae. FEBS Lett, 440(3):325–31, 1998.
[92] C. M. Fraser and R. D. Fleischmann. Strategies for whole microbial genome sequencing and analysis. Electrophoresis, 18(8):1207–16, 1997.
[93] H. B. Fraser, D. P. Wall, and A. E. Hirsh. A simple dependence between protein evolution rate and the number of protein-protein interactions. BMC Evol Biol, 3(1):11, 2003.
[94] J. A. Fraser and J. Heitman. Evolution of fungal sex chromosomes. Mol Microbiol, 51(2):299–306, 2004.
[95] R. Friedman and A. L. Hughes. Gene duplication and the structure of eukaryotic genomes. Genome Res, 11(3):373–81, 2001.
[96] D. Frishman, M. Mokrejs, D. Kosykh, G. Kastenmuller, G. Kolesov, I. Zubrzycki, C. Gruber, B. Geier, A. Kaps, K. Albermann, A. Volz, C. Wagner, M. Fellenberg, K. Heumann, and H. W. Mewes. The pedant genome database. Nucleic Acids Res, 31(1):207–11, 2003.
[97] M. C. Frith, J. L. Spouge, U. Hansen, and Z. Weng. Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res, 30(14):3214–24, 2002.
[98] Y. Fu, G. Rieg, W. A. Fonzi, P. H. Belanger, Jr. Edwards, J. E., and S. G. Filler. Expression of the candida albicans gene als1 in saccharomyces cerevisiae induces adherence to endothelial and epithelial cells. Infect Immun, 66(4):1783–6, 1998.
[99] K. Fujimura-Kamada, F. J. Nouvet, and S. Michaelis. A novel membrane-associated metalloprotease, ste24p, is required for the first step of nh2-terminal processing of the yeast a-factor precursor. J Cell Biol, 136(2):271–85, 1997.
[100] R. S. Fuller, A. Brake, and J. Thorner. Yeast prohormone processing enzyme (kex2 gene product) is a ca2+-dependent serine protease. Proc Natl Acad Sci U S A, 86(5):1434–8, 1989.
[101] J. E. Galagan, S. E. Calvo, K. A. Borkovich, E. U. Selker, N. D. Read, D. Jaffe, W. FitzHugh, L. J. Ma, S. Smirnov, S. Purcell, B. Rehman, T. Elkins, R. Engels, S. Wang, C. B. Nielsen, J. Butler, M. Endrizzi, D. Qui, P. Ianakiev, D. Bell-Pedersen, M. A. Nelson, M. Werner-Washburne, C. P. Selitrennikoff, J. A. Kinsey, E. L. Braun, A. Zelter, U. Schulte, G. O. Kothe, G. Jedd, W. Mewes, C. Staben, E. Marcotte, D. Greenberg, A. Roy, K. Foley, J. Naylor, N. Stange-Thomann, R. Barrett, S. Gnerre, M. Kamal, M. Kamvysselis, E. Mauceli, C. Bielke, S. Rudd, D. Frishman, S. Krystofova, C. Rasmussen, R. L. Metzenberg, D. D. Perkins, S. Kroken, C. Cogoni, G. Macino, D. Catcheside, W. Li, R. J. Pratt, S. A. Osmani, C. P. DeSouza, L. Glass, M. J. Orbach, J. A. Berglund, R. Voelker, O. Yarden, M. Plamann, S. Seiler, J. Dunlap, A. Radford, R. Aramayo, D. O. Natvig, L. A. Alex, G. Mannhaupt, D. J. Ebbole, M. Freitag, I. Paulsen, M. S. Sachs, E. S. Lander, C. Nusbaum, and B. Birren. The genome sequence of the filamentous fungus neurospora crassa. Nature, 422(6934):859–68, 2003.
[102] C. A. Gale, C. M. Bendel, M. McClellan, M. Hauser, J. M. Becker, J. Berman, and M. K. Hostetter. Linkage of adhesion, filamentous growth, and virulence in candida albicans to a single gene, int1. Science, 279(5355):1355–8, 1998.
[103] W. Gao, C. H. Khang, S. Y. Park, Y. H. Lee, and S. Kang. Evolution and organization of a highly dynamic, subtelomeric helicase gene family in the rice blast fungus magnaporthe grisea. Genetics, 162(1):103–12, 2002.
[104] R. G. Garrison and K. S. Boyd. Dimorphism of penicillium marneffei as observed by electron microscopy. Can J Microbiol, 19(10):1305–9, 1973.
[105] S. M. Gasser and M. M. Cockell. The molecular biology of the sir proteins. Gene, 279(1):1–16, 2001.
[106] A. C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz, J. M. Rick, A. M. Michon, C. M. Cruciat, M. Remor, C. Hofert, M. Schelder, M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi, V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M. A. Heurtier, R. R. Copley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester, P. Bork, B. Seraphin, B. Kuster, G. Neubauer, and G. Superti-Furga. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415(6868):141–7, 2002.
[107] R. F. Geever, L. Huiet, J. A. Baum, B. M. Tyler, V. B. Patel, B. J. Rutledge, M. E. Case, and N. H. Giles. Dna sequence, organization and regulation of the qa gene cluster of neurospora crassa. J Mol Biol, 207(1):15–34, 1989.
[108] M. S. Gelfand. Prediction of function in dna sequence analysis. J Comput Biol, 2(1):87–115, 1995.
[109] W. Gilbert, S. J. de Souza, and M. Long. Origin of genes. Proc Natl Acad Sci U S A, 94(15):7698–703, 1997.
[110] A. Goffeau, B. G. Barrell, H. Bussey, R. W. Davis, B. Dujon, H. Feldmann, F. Galibert, J. D. Hoheisel, C. Jacq, M. Johnston, E. J. Louis, H. W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin, and S. G. Oliver. Life with 6000 genes. Science, 274(5287):546, 563–7, 1996.
[111] N. Goldman and Z. Yang. A codon-based model of nucleotide substitution for protein-coding dna sequences. Mol Biol Evol, 11(5):725–36, 1994.
[112] D. Gordon, C. Abajian, and P. Green. Consed: a graphical tool for sequence finishing. Genome Res, 8(3):195–202, 1998.
[113] N. A. Gow. Candida albicans switches mates. Mol Cell, 10(2):217–8, 2002.
[114] N. A. Gow, A. J. Brown, and F. C. Odds. Fungal morphogenesis and host invasion. Curr Opin Microbiol, 5(4):366–71, 2002.
[115] D. Grant, P. Cregan, and R. C. Shoemaker. Genome organization in dicots: genome duplication in arabidopsis and synteny between soybean and arabidopsis. Proc Natl Acad Sci U S A, 97(8):4168–73, 2000.
[116] D. Graur. Amino acid composition and the evolutionary rates of protein-coding genes. J Mol Evol, 22(1):53–62, 1985.
[117] S. I. Grewal and D. Moazed. Heterochromatin and epigenetic control of gene expression. Science, 301(5634):798–802, 2003.
[118] Z. Gu, A. Cavalcanti, F. C. Chen, P. Bouman, and W. H. Li. Extent of gene duplication in the genomes of drosophila, nematode, and yeast. Mol Biol Evol, 19(3):256–62, 2002.
[119] Z. Gu, L. M. Steinmetz, X. Gu, C. Scharfe, R. W. Davis, and W. H. Li. Role of duplicate genes in genetic robustness against null mutations. Nature, 421(6918):63–6, 2003.
[120] J. E. Haber. Mating-type gene switching in saccharomyces cerevisiae. Annu Rev Genet, 32:561–99, 1998.
[121] H. Hamada, M. Seidman, B. H. Howard, and C. M. Gorman. Enhanced gene expression by the poly(dt-dg).poly(dc-da) sequence. Mol Cell Biol, 4(12):2622–30, 1984.
[122] A. J. Hamilton, L. Jeavons, S. Youngchim, and N. Vanittanakom. Recognition of fibronectin by penicillium marneffei conidia via a sialic acid-dependent process and its relationship to the interaction between conidia and laminin. Infect Immun, 67(10):5200–5, 1999.
[123] A. J. Hamilton, L. Jeavons, S. Youngchim, N. Vanittanakom, and R. J. Hay. Sialic acid-dependent recognition of laminin by penicillium marneffei conidia. Infect Immun, 66(12):6024–6, 1998.
[124] K. H. Han, K. Y. Han, J. H. Yu, K. S. Chae, K. Y. Jahng, and D. M. Han. The nsdd gene encodes a putative gata-type transcription factor necessary for sexual development of aspergillus nidulans. Mol Microbiol, 41(2):299–309, 2001.
[125] K. H. Han, J. A. Seo, and J. H. Yu. A putative g protein-coupled receptor negatively controls sexual development in aspergillus nidulans. Mol Microbiol, 51(5):1333–45, 2004.
[126] M Hasegawa, H Kishino, and T Yano. Dating of the human-ape splitting by a molecular clock of mitochondrial dna. J Mol Evol, 22:160–174, 1985.
[127] K. E. Hastings. Strong evolutionary conservation of broadly expressed protein isoforms in the troponin i gene family and other vertebrate gene families. J Mol Evol, 42(6):631–40, 1996.
[128] K. Haynes. Virulence in candida species. Trends Microbiol, 9(12):591–6, 2001.
242
[129] B. He, P. Chen, S. Y. Chen, K. L. Vancura, S. Michaelis, and S. Powers. Ram2,an essential gene of yeast, and ram1 encode the two polypeptide components ofthe farnesyltransferase that prenylates a-factor and ras proteins. Proc Natl AcadSci U S A, 88(24):11373–7, 1991.
[130] D. S. Heckman, D. M. Geiser, B. R. Eidell, R. L. Stauffer, N. L. Kardos, andS. B. Hedges. Molecular evidence for the early colonization of land by fungi andplants. Science, 293(5532):1129–33, 2001.
[131] S. B. Hedges and S. Kumar. Genomic clocks and evolutionary timescales. TrendsGenet, 19(4):200–6, 2003.
[132] I. Herskowitz. Fungal physiology. yeast branches out. Nature, 357(6375):190–1,1992.
[133] L. H. Hogan, S. Josvai, and B. S. Klein. Genomic cloning, characterization, andfunctional analysis of the major surface adhesin wi-1 on blastomyces dermatitidisyeasts. J Biol Chem, 270(51):30725–32, 1995.
[134] P. R. Hsueh, L. J. Teng, C. C. Hung, J. H. Hsu, P. C. Yang, S. W. Ho, andK. T. Luh. Molecular evidence for strain dissemination of penicillium marneffei:an emerging pathogen in taiwan. J Infect Dis, 181(5):1706–12, 2000.
[135] H. Huang, W. C. Barker, Y. Chen, and C. H. Wu. iproclass: an integrateddatabase of protein family, function and structure information. Nucleic AcidsRes, 31(1):390–2, 2003.
[136] A. L. Hughes and R. Friedman. Parallel evolution by gene duplication in thegenomes of two unicellular fungi. Genome Res, 13(6A):1259–64, 2003.
[137] M. K. Hughes and A. L. Hughes. Evolution of duplicate genes in a tetraploidanimal, xenopus laevis. Mol Biol Evol, 10(6):1360–9, 1993.
[138] C. M. Hull and A. D. Johnson. Identification of a mating type-like locus in theasexual pathogenic yeast candida albicans. Science, 285(5431):1271–5, 1999.
[139] C. M. Hull, R. M. Raisner, and A. D. Johnson. Evidence for mating of the”asexual” yeast candida albicans in a mammalian host. Science, 289(5477):307–10, 2000.
[140] C. C. Hung, M. Y. Chen, S. M. Hsieh, W. H. Sheng, C. F. Hsiao, and S. C.Chang. Discontinuation of secondary prophylaxis for penicilliosis marneffei inaids patients responding to highly active antiretroviral therapy. Aids, 16(4):672–3, 2002.
[141] L. D. Hurst and N. G. Smith. Do essential genes evolve slowly? Curr Biol,9(14):747–50, 1999.
[142] M. Huynen, B. Snel, 3rd Lathe, W., and P. Bork. Predicting protein functionby genomic context: quantitative evaluation and qualitative inferences. GenomeRes, 10(8):1204–10, 2000.
[143] I. Iliopoulos, S. Tsoka, M. A. Andrade, A. J. Enright, M. Carroll, P. Poul-let, V. Promponas, T. Liakopoulos, G. Palaios, C. Pasquier, S. Hamodrakas,J. Tamames, A. T. Yagnik, A. Tramontano, D. Devos, C. Blaschke, A. Valencia,D. Brett, D. Martin, C. Leroy, I. Rigoutsos, C. Sander, and C. A. Ouzounis.Evaluation of annotation strategies using an entire genome sequence. Bioinfor-matics, 19(6):717–26, 2003.
[144] P. Imwidthaya, A. S. Sekhon, T. D. Mastro, A. K. Garg, and E. Ambrosie. Use-fulness of a microimmunodiffusion test for the detection of penicillium marneffeiantigenemia, antibodies, and exoantigens. Mycopathologia, 138(2):51–5, 1997.
[145] Y. Ina. Oden: a program package for molecular evolutionary analysis and data-base search of dna and amino acid sequences. Comput Appl Biosci, 10:11–12,1994.
243
[146] L. Jeavons, A. J. Hamilton, N. Vanittanakom, R. Ungpakorn, E. G. Evans,T. Sirisanthana, and R. J. Hay. Identification and purification of specific peni-cillium marneffei antigens and their recognition by human immune sera. J ClinMicrobiol, 36(4):949–54, 1998.
[147] M. E. Johnson, L. Viggiano, J. A. Bailey, M. Abdul-Rauf, G. Goodwin, M. Roc-chi, and E. E. Eichler. Positive selection of a gene family during the emergenceof humans and african apes. Nature, 413(6855):514–9, 2001.
[148] T. Jones, N. A. Federspiel, H. Chibana, J. Dungan, S. Kalman, B. B. Magee,G. Newport, Y. R. Thorstenson, N. Agabian, P. T. Magee, R. W. Davis, andS. Scherer. The diploid genome sequence of candida albicans. Proc Natl AcadSci U S A, 101(19):7329–34, 2004.
[149] I. K. Jordan, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. Essential genes aremore evolutionarily conserved than are nonessential genes in bacteria. GenomeRes, 12(6):962–8, 2002.
[150] I. K. Jordan, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. Microevolutionarygenomics of bacteria. Theor Popul Biol, 61(4):435–47, 2002.
[151] I. K. Jordan, Y. I. Wolf, and E. V. Koonin. No simple dependence betweenprotein evolution rate and the number of protein-protein interactions: only themost prolific interactors tend to evolve slowly. BMC Evol Biol, 3(1):1, 2003.
[152] T. Joseph-Horne, D. W. Hollomon, and P. M. Wood. Fungal respiration: afusion of standard and alternative components. Biochim Biophys Acta, 1504(2-3):179–95, 2001.
[153] T. H. Jukes and C.R. Cantor. Evolution of protein molecules. In H. N. Munro,editor, Mammalian Protein Metabolism, pages 21–132. Academic Press, NewYork, 1969.
[154] D. Julius, L. Blair, A. Brake, G. Sprague, and J. Thorner. Yeast alpha factor isprocessed from a larger precursor polypeptide: the essential role of a membrane-bound dipeptidyl aminopeptidase. Cell, 32(3):839–52, 1983.
[155] H. Kaessmann, S. Zollner, A. Nekrutenko, and W. H. Li. Signatures of domainshuffling in the human genome. Genome Res, 12(11):1642–50, 2002.
[156] E. Kafer. Origins of translocations in aspergillus nidulans. Genetics, 52(1):217–32, 1965.
[157] T. Kanbe and J. E. Cutler. Minimum chemical requirements for adhesin activ-ity of the acid-stable part of candida albicans cell wall phosphomannoproteincomplex. Infect Immun, 66(12):5812–8, 1998.
[158] R. Kappe, C. Fauser, C. N. Okeke, and M. Maiwald. Universal fungus-specificprimer systems and group-specific hybridization oligonucleotides for 18s rdna.Mycoses, 39(1-2):25–30, 1996.
[159] N. Kato, W. Brooks, and A. M. Calvo. The expression of sterigmatocystin andpenicillin genes in aspergillus nidulans is controlled by vea, a gene required forsexual development. Eukaryot Cell, 2(6):1178–86, 2003.
[160] L. Kaufman, P. G. Standard, M. Jalbert, P. Kantipong, K. Limpakarnjanarat,and T. D. Mastro. Diagnostic antigenemia tests for penicilliosis marneffei. JClin Microbiol, 34(10):2503–5, 1996.
[161] N. P. Keller and T. M. Hohn. Metabolic pathway gene clusters in filamentousfungi. Fungal Genet Biol, 21(1):17–29, 1997.
[162] M. Kelly, J. Burke, M. Smith, A. Klar, and D. Beach. Four mating-type genescontrol sexual differentiation in the fission yeast. Embo J, 7(5):1537–47, 1988.
[163] Z. Kerenyi and L. Hornok. Structure and function of mating-type genes infusarium species. Acta Microbiol Immunol Hung, 49(2-3):313–4, 2002.
244
[164] H. Kim, K. Han, K. Kim, D. Han, K. Jahng, and K. Chae. The vea gene activatessexual development in aspergillus nidulans. Fungal Genet Biol, 37(1):72–80,2002.
[165] M. Kimura. A simple method for estimating evolutionary rates of base sub-stitutions through comparative studies of nucleotide sequences. J Mol Evol,16:111–120, 1980.
[166] M. Kimura and J. L. King. Fixation of a deleterious allele at one of two ”dupli-cate” loci by mutation pressure and random drift. Proc Natl Acad Sci U S A,76(6):2858–61, 1979.
[167] K. E. Kirk and N. R. Morris. The tubb alpha-tubulin gene is essential for sexualdevelopment in aspergillus nidulans. Genes Dev, 5(11):2014–23, 1991.
[168] K. E. Kirk and N. R. Morris. Either alpha-tubulin isogene product is sufficient formicrotubule function during all stages of growth and differentiation in aspergillusnidulans. Mol Cell Biol, 13(8):4465–76, 1993.
[169] B. S. Klein, L. H. Hogan, and J. M. Jones. Immunologic recognition of a 25-aminoacid repeat arrayed in tandem on a major antigen of blastomyces dermatitidis.J Clin Invest, 92(1):330–7, 1993.
[170] M. A. Klich, E. J. Mullaney, C. B. Daly, and J. W. Cary. Molecular and phys-iological aspects of aflatoxin and sterigmatocystin biosynthesis by aspergillustamarii and a. ochraceoroseus. Appl Microbiol Biotechnol, 53(5):605–9, 2000.
[171] Y. Koguchi, K. Kawakami, S. Kon, T. Segawa, M. Maeda, T. Uede, and A. Saito.Penicillium marneffei causes osteopontin-mediated production of interleukin-12by peripheral blood mononuclear cells. Infect Immun, 70(3):1042–8, 2002.
[172] F. A. Kondrashov and E. V. Koonin. Origin of alternative splicing by tandemexon duplication. Hum Mol Genet, 10(23):2661–9, 2001.
[173] F. A. Kondrashov and E. V. Koonin. Evolution of alternative splicing: deletions,insertions and origin of functional parts of proteins from intron sequences. TrendsGenet, 19(3):115–9, 2003.
[174] F. A. Kondrashov, I. B. Rogozin, Y. I. Wolf, and E. V. Koonin. Selection in theevolution of gene duplications. Genome Biol, 3(2):RESEARCH0008, 2002.
[175] R. Koszul, A. Malpertuy, L. Frangeul, C. Bouchier, P. Wincker, A. Thierry,S. Duthoy, S. Ferris, C. Hennequin, and B. Dujon. The complete mitochondrialgenome sequence of the pathogenic yeast candida (torulopsis) glabrata. FEBSLett, 534(1-3):39–48, 2003.
[176] L. Kraakman, K. Lemaire, P. Ma, A. W. Teunissen, M. C. Donaton, P. Van Dijck,J. Winderickx, J. H. de Winde, and J. M. Thevelein. A saccharomyces cerevisiaeg-protein coupled receptor, gpr1, is specifically required for glucose activation ofthe camp pathway during the transition to growth on glucose. Mol Microbiol,32(5):1002–12, 1999.
[177] A. Krause, J. Stoye, and M. Vingron. The systers protein sequence cluster set.Nucleic Acids Res, 28(1):270–2, 2000.
[178] D. M. Krylov, Y. I. Wolf, I. B. Rogozin, and E. V. Koonin. Gene loss, proteinsequence divergence, gene dispensability, expression level, and interactivity arecorrelated in eukaryotic evolution. Genome Res, 13(10):2229–35, 2003.
[179] N. Kudeken, K. Kawakami, and A. Saito. Cytokine-induced fungicidal activityof human polymorphonuclear leukocytes against penicillium marneffei. FEMSImmunol Med Microbiol, 26(2):115–24, 1999.
[180] N. Kudeken, K. Kawakami, and A. Saito. Role of superoxide anion in the fungici-dal activity of murine peritoneal exudate macrophages against penicillium marn-effei. Microbiol Immunol, 43(4):323–30, 1999.
245
[181] N. Kudeken, K. Kawakami, and A. Saito. Mechanisms of the in vitro fungi-cidal effects of human neutrophils against penicillium marneffei induced bygranulocyte-macrophage colony-stimulating factor (gm-csf). Clin Exp Immunol,119(3):472–8, 2000.
[182] E. Y. Kwan, Y. L. Lau, K. Y. Yuen, B. M. Jones, and L. C. Low. Penicil-lium marneffei infection in a non-hiv infected child. J Paediatr Child Health,33(3):267–71, 1997.
[183] K. J. Kwon-Chung and J. E. Bennett. Distribution of alpha and alpha matingtypes of cryptococcus neoformans among natural and clinical isolates. Am JEpidemiol, 108(4):337–40, 1978.
[184] J. A. Lake. Reconstructing evolutionary trees from dna and protein sequences:paralinear distances. Proc Natl Acad Sci USA, 91:1455–1459, 1994.
[185] C. Lanave, G. Preparata, C. Saccone, and G. Serio. A new method for calculatingevolutionary substitution rates. J Mol Evol, 20:86–93, 1984.
[186] E. S. Lander and M. S. Waterman. Genomic mapping by fingerprinting randomclones: a mathematical analysis. Genomics, 2(3):231–9, 1988.
[187] K. Langfelder, B. Jahn, H. Gehringer, A. Schmidt, G. Wanner, and A. A.Brakhage. Identification of a polyketide synthase gene (pksp) of aspergillus fu-migatus involved in conidial pigment biosynthesis and virulence. Med MicrobiolImmunol (Berl), 187(2):79–89, 1998.
[188] L. Latchinian-Sadek and D. Y. Thomas. Expression, purification, and charac-terization of the yeast kex1 gene product, a polypeptide precursor processingcarboxypeptidase. J Biol Chem, 268(1):534–40, 1993.
[189] J. P. Latge and R. Calderone. Host-microbe interactions: fungi invasive humanfungal opportunistic infections. Curr Opin Microbiol, 5(4):355–8, 2002.
[190] E. Leberer, D. Harcus, I. D. Broadbent, K. L. Clark, D. Dignard, K. Ziegelbauer,A. Schmidt, N. A. Gow, A. J. Brown, and D. Y. Thomas. Signal transductionthrough homologs of the ste20p and ste7p protein kinases can trigger hyphalformation in the pathogenic fungus candida albicans. Proc Natl Acad Sci U SA, 93(23):13217–22, 1996.
[191] D. W. Lee, S. Kim, S. J. Kim, D. M. Han, K. Y. Jahng, and K. S. Chae. Theisda gene is necessary for sexual development inhibition by a salt in aspergillusnidulans. Curr Genet, 39(4):237–43, 2001.
[192] K. B. Lengeler, R. C. Davidson, C. D’Souza, T. Harashima, W. C. Shen,P. Wang, X. Pan, M. Waugh, and J. Heitman. Signal transduction cascades reg-ulating fungal development and virulence. Microbiol Mol Biol Rev, 64(4):746–85,2000.
[193] K. B. Lengeler, P. Wang, G. M. Cox, J. R. Perfect, and J. Heitman. Iden-tification of the mata mating-type locus of cryptococcus neoformans reveals aserotype a mata strain thought to have been extinct. Proc Natl Acad Sci U SA, 97(26):14455–60, 2000.
[194] I. Letunic, R. R. Copley, and P. Bork. Common exon duplication in animals andits role in alternative splicing. Hum Mol Genet, 11(13):1561–7, 2002.
[195] J. C. Li, L. Q. Pan, and S. X. Wu. Mycologic investigation on rhizomys pruinoussenex in guangxi as natural carrier with penicillium marneffei. Chin Med J(Engl), 102(6):477–85, 1989.
[196] W. H. Li. Rate of gene silencing at duplicate loci: a theoretical study andinterpretation of data from tetraploid fishes. Genetics, 95(1):237–58, 1980.
[197] W. H. Li. Unbiased estimation of the rates of synonymous and nonsynonymoussubstitution. J Mol Evol, 36:96–99, 1993.
246
[198] W. H. Li, C. I. Wu, and C. C. Luo. A new method for estimating synonymousand nonsynonymous rates of nucleotide substitution considering the relative like-lihood of nucleotide and codon changes. Mol Biol Evol, 2:150–174, 1985.
[199] Wen-Hsiung Li. Molecular evolution. Sinauer Associates, Sunderland, Mass.,1997.
[200] F. Lisacek, Y. Diaz, and F. Michel. Automatic identification of group i introncores in genomic dna sequences. J Mol Biol, 235(4):1206–17, 1994.
[201] C. Y. Lo, D. T. Chan, K. Y. Yuen, F. K. Li, and K. P. Cheng. Penicilliummarneffei infection in a patient with sle. Lupus, 4(3):229–31, 1995.
[202] K. F. LoBuglio and J. W. Taylor. Phylogeny and pcr identification of the humanpathogenic fungus penicillium marneffei. J Clin Microbiol, 33(1):85–9, 1995.
[203] B. J. Loftus, E. Fung, P. Roncaglia, D. Rowley, P. Amedeo, D. Bruno, J. Va-mathevan, M. Miranda, I. J. Anderson, J. A. Fraser, J. E. Allen, I. E. Bosdet,M. R. Brent, R. Chiu, T. L. Doering, M. J. Donlin, C. A. D’Souza, D. S. Fox,V. Grinberg, J. Fu, M. Fukushima, B. J. Haas, J. C. Huang, G. Janbon, S. J.Jones, H. L. Koo, M. I. Krzywinski, J. K. Kwon-Chung, K. B. Lengeler, R. Maiti,M. A. Marra, R. E. Marra, C. A. Mathewson, T. G. Mitchell, M. Pertea, F. R.Riggs, S. L. Salzberg, J. E. Schein, A. Shvartsbeyn, H. Shin, M. Shumway, C. A.Specht, B. B. Suh, A. Tenney, T. R. Utterback, B. L. Wickes, J. R. Wort-man, N. H. Wye, J. W. Kronstad, J. K. Lodge, J. Heitman, R. W. Davis, C. M.Fraser, and R. W. Hyman. The genome of the basidiomycetous yeast and humanpathogen cryptococcus neoformans. Science, 307(5713):1321–4, 2005.
[204] M. Long, E. Betran, K. Thornton, and W. Wang. The origin of new genes:glimpses from the young and old. Nat Rev Genet, 4(11):865–75, 2003.
[205] M. Long and C. H. Langley. Natural selection and the origin of jingwei, achimeric processed functional gene in drosophila. Science, 260(5104):91–5, 1993.
[206] M. C. Lorenz. Genomic approaches to fungal pathogenicity. Curr Opin Micro-biol, 5(4):372–8, 2002.
[207] T. M. Lowe and S. R. Eddy. trnascan-se: a program for improved detection oftransfer rna genes in genomic sequence. Nucleic Acids Res, 25(5):955–64, 1997.
[208] Q. Lu, L. L. Wallrath, H. Granok, and S. C. Elgin. (ct)n (ga)n repeats and heatshock elements have distinct roles in chromatin structure and transcriptionalactivation of the drosophila hsp26 gene. Mol Cell Biol, 13(5):2802–14, 1993.
[209] L. G. Lundin. Evolution of the vertebrate genome as reflected in paralogouschromosomal regions in man and the house mouse. Genomics, 16(1):1–19, 1993.
[210] M. Lynch and J. S. Conery. The evolutionary fate and consequences of duplicategenes. Science, 290(5494):1151–5, 2000.
[211] M. Lynch and J. S. Conery. The evolutionary demography of duplicate genes. JStruct Funct Genomics, 3(1-4):35–44, 2003.
[212] M. Lynch and A. Force. The probability of duplicate gene preservation bysubfunctionalization. Genetics, 154(1):459–73, 2000.
[213] B. B. Magee and P. T. Magee. Induction of mating in candida albicans byconstruction of mtla and mtlalpha strains. Science, 289(5477):310–3, 2000.
[214] W. Makalowski and M. S. Boguski. Synonymous and nonsynonymous substitu-tion distances are correlated in mouse and rat genes. J Mol Evol, 47(2):119–21,1998.
[215] W. Makalowski, G. A. Mitchell, and D. Labuda. Alu sequences in the codingregions of mrna: a source of protein variability. Trends Genet, 10(6):188–93,1994.
247
[216] G. Mannhaupt, C. Montrone, D. Haase, H. W. Mewes, V. Aign, J. D. Hoheisel,B. Fartmann, G. Nyakatura, F. Kempken, J. Maier, and U. Schulte. What’sin the genome of a filamentous fungus? analysis of the neurospora genomesequence. Nucleic Acids Res, 31(7):1944–54, 2003.
[217] E. M. Marcotte, M. Pellegrini, H. L. Ng, D. W. Rice, T. O. Yeates, and D. Eisen-berg. Detecting protein function and protein-protein interactions from genomesequences. Science, 285(5428):751–3, 1999.
[218] E. M. Marcotte, M. Pellegrini, M. J. Thompson, T. O. Yeates, and D. Eisenberg.A combined algorithm for genome-wide prediction of protein function. Nature,402(6757):83–6, 1999.
[219] A. McLysaght, K. Hokamp, and K. H. Wolfe. Extensive genomic duplicationduring early chordate evolution. Nat Genet, 31(2):200–4, 2002.
[220] H. W. Mewes, K. Albermann, M. Bahr, D. Frishman, A. Gleissner, J. Hani,K. Heumann, K. Kleine, A. Maierl, S. G. Oliver, F. Pfeiffer, and A. Zollner.Overview of the yeast genome. Nature, 387(6632 Suppl):7–65, 1997.
[221] A. Meyer and M. Schartl. Gene and genome duplications in vertebrates: theone-to-four (-to-eight in fish) rule and the evolution of novel gene functions.Curr Opin Cell Biol, 11(6):699–704, 1999.
[222] K. Y. Miller, T. M. Toennis, T. H. Adams, and B. L. Miller. Isolation and tran-scriptional characterization of a morphological modifier: the aspergillus nidulansstunted (stua) gene. Mol Gen Genet, 227(2):285–92, 1991.
[223] T. K. Mitchell and R. A. Dean. The camp-dependent protein kinase catalyticsubunit is required for appressorium formation and pathogenesis by the rice blastpathogen magnaporthe grisea. Plant Cell, 7(11):1869–78, 1995.
[224] N. P. Money. Plant pathology. reverend berkeley’s devil. Nature, 411(6838):644,2001.
[225] S. A. Mousavi and G. D. Robson. Oxidative and amphotericin b-mediated celldeath in the opportunistic pathogen aspergillus fumigatus is associated with anapoptotic-like phenotype. Microbiology, 150(Pt 6):1937–45, 2004.
[226] S. V. Muse and B. S. Gaut. A likelihood approach for comparing synonymous andnonsynonymous nucleotide substitution rates, with application to the chloroplastgenome. Mol Biol Evol, 11(5):715–24, 1994.
[227] K. A. Nasmyth and K. Tatchell. The structure of transposable yeast matingtype loci. Cell, 19(3):753–64, 1980.
[228] M. Nei and T. Gojobori. Simple methods for estimating the numbers of synony-mous and nonsynonymous nucleotide substitutions. Mol Biol Evol, 3:418–426,1986.
[229] Masatoshi Nei and S. Kumar. Molecular evolution and phylogenetics. OxfordUniversity Press, Oxford, UK, 2000.
[230] A. Nekrutenko and W. H. Li. Transposable elements are found in a large numberof human protein-coding genes. Trends Genet, 17(11):619–21, 2001.
[231] M. A. Nelson, S. Kang, E. L. Braun, M. E. Crawford, P. L. Dolan, P. M.Leonard, J. Mitchell, A. M. Armijo, L. Bean, E. Blueyes, T. Cushing, A. Er-rett, M. Fleharty, M. Gorman, K. Judson, R. Miller, J. Ortega, I. Pavlova,J. Perea, S. Todisco, R. Trujillo, J. Valentine, A. Wells, M. Werner-Washburne,D. O. Natvig, and et al. Expressed sequences from conidial, mycelial, and sexualstages of neurospora crassa. Fungal Genet Biol, 21(3):348–63, 1997.
[232] S. L. Newman, S. Chaturvedi, and B. S. Klein. The wi-1 antigen of blastomycesdermatitidis yeasts mediates binding to human macrophage cd11b/cd18 (cr3)and cd14. J Immunol, 154(2):753–61, 1995.
248
[233] W. C. Nierman, A. Pain, M. J. Anderson, J. R. Wortman, H. S. Kim, J. Ar-royo, M. Berriman, K. Abe, D. B. Archer, C. Bermejo, J. Bennett, P. Bowyer,D. Chen, M. Collins, R. Coulsen, R. Davies, P. S. Dyer, M. Farman, N. Fedorova,T. V. Feldblyum, R. Fischer, N. Fosker, A. Fraser, J. L. Garcia, M. J. Garcia,A. Goble, G. H. Goldman, K. Gomi, S. Griffith-Jones, R. Gwilliam, B. Haas,H. Haas, D. Harris, H. Horiuchi, J. Huang, S. Humphray, J. Jimenez, N. Keller,H. Khouri, K. Kitamoto, T. Kobayashi, S. Konzack, R. Kulkarni, T. Kuma-gai, A. Lafton, J. P. Latge, W. Li, A. Lord, C. Lu, W. H. Majoros, G. S.May, B. L. Miller, Y. Mohamoud, M. Molina, M. Monod, I. Mouyna, S. Mul-ligan, L. Murphy, S. O’Neil, I. Paulsen, M. A. Penalva, M. Pertea, C. Price,B. L. Pritchard, M. A. Quail, E. Rabbinowitsch, N. Rawlins, M. A. Rajan-dream, U. Reichard, H. Renauld, G. D. Robson, S. Rodriguez de Cordoba, J. M.Rodriguez-Pena, C. M. Ronning, S. Rutter, S. L. Salzberg, M. Sanchez, J. C.Sanchez-Ferrero, D. Saunders, K. Seeger, R. Squares, S. Squares, M. Takeuchi,F. Tekaia, G. Turner, C. R. Vazquez de Aldana, J. Weidman, O. White, J. Wood-ward, J. H. Yu, C. Fraser, J. E. Galagan, K. Asai, M. Machida, N. Hall, B. Bar-rell, and D. W. Denning. Genomic sequence of the pathogenic and allergenicfilamentous fungus aspergillus fumigatus. Nature, 438(7071):1151–6, 2005.
[234] L. R. Nunes, R. Costa de Oliveira, D. B. Leite, V. S. da Silva, E. dos Reis Mar-ques, M. E. da Silva Ferreira, D. C. Ribeiro, L. A. de Souza Bernardes, M. H.Goldman, R. Puccia, L. R. Travassos, W. L. Batista, M. P. Nobrega, F. G. No-brega, D. Y. Yang, C. A. de Braganca Pereira, and G. H. Goldman. Transcrip-tome analysis of paracoccidioides brasiliensis cells undergoing mycelium-to-yeasttransition. Eukaryot Cell, 4(12):2115–28, 2005.
[235] D. I. Nurminsky, M. V. Nurminskaya, D. De Aguiar, and D. L. Hartl. Se-lective sweep of a newly evolved sperm-specific gene in drosophila. Nature,396(6711):572–5, 1998.
[236] A. Odom, S. Muir, E. Lim, D. L. Toffaletti, J. Perfect, and J. Heitman.Calcineurin is required for virulence of cryptococcus neoformans. Embo J,16(10):2576–89, 1997.
[237] S Ohno. Evolution by Gene Duplication. Springer-Verlag Inc., New York, 1970.
[238] T. Ohta. How gene families evolve. Theor Popul Biol, 37(1):213–9, 1990.
[239] T. Ohta. Synonymous and nonsynonymous substitutions in mammalian genesand the nearly neutral theory. J Mol Evol, 40(1):56–63, 1995.
[240] H. D. Osiewacz and E. Kimpel. Mitochondrial-nuclear interactions and lifespancontrol in fungi. Exp Gerontol, 34(8):901–9, 1999.
[241] C. Pal, B. Papp, and L. D. Hurst. Highly expressed genes in yeast evolve slowly.Genetics, 158(2):927–31, 2001.
[242] P. Pamilo and N. O. Bianchi. Evolution of the zfx and zfy genes: rates andinterdependence between the genes. Mol Biol Evol, 10:271–281, 1993.
[243] B. Paquin and B. F. Lang. The mitochondrial dna of allomyces macrogynus: thecomplete genomic sequence from an ancestral fungus. J Mol Biol, 255(5):688–701, 1996.
[244] L. Patthy. Genome evolution and the evolution of exon-shuffling–a review. Gene,238(1):103–14, 1999.
[245] W. R. Pearson. Rapid and sensitive sequence comparison with fastp and fasta.Methods Enzymol, 183:63–98, 1990.
[246] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence com-parison. Proc Natl Acad Sci USA, 85:2444–2448, 1988.
[247] J. Pei and N. V. Grishin. Type ii caax prenyl endopeptidases belong to a novelsuperfamily of putative membrane-bound metalloproteases. Trends Biochem Sci,26(5):275–7, 2001.
249
[248] M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates.Assigning protein functions by comparative genome analysis: protein phyloge-netic profiles. Proc Natl Acad Sci U S A, 96(8):4285–8, 1999.
[249] G. E. Pierard, J. Arrese Estrada, C. Pierard-Franchimont, A. Thiry, and D. Sty-nen. Immunohistochemical expression of galactomannan in the cytoplasm ofphagocytic cells during invasive aspergillosis. Am J Clin Pathol, 96(3):373–6,1991.
[250] J. Piskur. Origin of the duplicated regions in the yeast genomes. Trends Genet,17(6):302–3, 2001.
[251] J. B. Plotkin, J. Dushoff, and H. B. Fraser. Detecting selection using a singlegenome sequence of m. tuberculosis and p. falciparum. Nature, 428:942–945,2004.
[252] S. Poggeler. Mating-type genes for classical strain improvements of ascomycetes.Appl Microbiol Biotechnol, 56(5-6):589–601, 2001.
[253] S. Poggeler. Genomic evidence for mating abilities in the asexual pathogenaspergillus fumigatus. Curr Genet, 42(3):153–60, 2002.
[254] S. Pongsunk, A. Andrianopoulos, and S. C. Chaiyaroj. Conditional lethal dis-ruption of tata-binding protein gene in penicillium marneffei. Fungal Genet Biol,42(11):893–903, 2005.
[255] M. Pop, D. S. Kosack, and S. L. Salzberg. Hierarchical scaffolding with bambus.Genome Res, 14(1):149–59, 2004.
[256] R. O. Poyton and J. E. McEwen. Crosstalk between nuclear and mitochondrialgenomes. Annu Rev Biochem, 65:563–607, 1996.
[257] V. E. Prince and F. B. Pickett. Splitting pairs: the diverging fates of duplicatedgenes. Nat Rev Genet, 3(11):827–37, 2002.
[258] L. Ramsay, M. Macaulay, S. degli Ivanissevich, K. MacLean, L. Cardle, J. Fuller,K. J. Edwards, S. Tuvesson, M. Morgante, A. Massari, E. Maestri, N. Marmiroli,T. Sjakste, M. Ganal, W. Powell, and R. Waugh. A simple sequence repeat-basedlinkage map of barley. Genetics, 156(4):1997–2005, 2000.
[259] M. Raymond, D. Dignard, A. M. Alarco, N. Mainville, B. B. Magee, and D. Y.Thomas. A ste6p/p-glycoprotein homologue from the asexual yeast candidaalbicans transports the a-factor mating pheromone in saccharomyces cerevisiae.Mol Microbiol, 27(3):587–98, 1998.
[260] Y. Reiss, J. L. Goldstein, M. C. Seabra, P. J. Casey, and M. S. Brown. Inhibitionof purified p21ras farnesyl:protein transferase by cys-aax tetrapeptides. Cell,62(1):81–8, 1990.
[261] M. Remm, C. E. Storm, and E. L. Sonnhammer. Automatic clustering oforthologs and in-paralogs from pairwise species comparisons. J Mol Biol,314(5):1041–52, 2001.
[262] M. Ricchetti, C. Fairhead, and B. Dujon. Mitochondrial dna repairs double-strand breaks in yeast chromosomes. Nature, 402(6757):96–100, 1999.
[263] P. Rice, I. Longden, and A. Bleasby. Emboss: the european molecular biologyopen software suite. Trends Genet, 16(6):276–7, 2000.
[264] I. Rigoutsos, T. Huynh, A. Floratos, L. Parida, and D. Platt. Dictionary-drivenprotein annotation. Nucleic Acids Res, 30(17):3901–16, 2002.
[265] M. Robinson-Rechavi and V. Laudet. Evolutionary rates of duplicate genes infish and mammals. Mol Biol Evol, 18(4):681–3, 2001.
[266] F. Rodriguez, J. L. Oliver, A. Marin, and J. R. Medina. The general stochasticmodel of nucleotide substitution. J Theor Biol, 142:485–501, 1990.
250
[267] S. Rogic, A. K. Mackworth, and F. B. Ouellette. Evaluation of gene-findingprograms on mammalian sequences. Genome Res, 11(5):817–32, 2001.
[268] S. Rogic, B. F. Ouellette, and A. K. Mackworth. Improving gene recognitionaccuracy by combining predictions from two gene-finding programs. Bioinfor-matics, 18(8):1034–45, 2002.
[269] Y. Rongrungruang and S. M. Levitz. Interactions of penicillium marneffei withhuman leukocytes in vitro. Infect Immun, 67(9):4732–6, 1999.
[270] G. M. Rubin, M. D. Yandell, J. R. Wortman, G. L. Gabor Miklos, C. R. Nelson,I. K. Hariharan, M. E. Fortini, P. W. Li, R. Apweiler, W. Fleischmann, J. M.Cherry, S. Henikoff, M. P. Skupski, S. Misra, M. Ashburner, E. Birney, M. S.Boguski, T. Brody, P. Brokstein, S. E. Celniker, S. A. Chervitz, D. Coates,A. Cravchik, A. Gabrielian, R. F. Galle, W. M. Gelbart, R. A. George, L. S.Goldstein, F. Gong, P. Guan, N. L. Harris, B. A. Hay, R. A. Hoskins, J. Li,Z. Li, R. O. Hynes, S. J. Jones, P. M. Kuehl, B. Lemaitre, J. T. Littleton, D. K.Morrison, C. Mungall, P. H. O’Farrell, O. K. Pickeral, C. Shue, L. B. Vosshall,J. Zhang, Q. Zhao, X. H. Zheng, and S. Lewis. Comparative genomics of theeukaryotes. Science, 287(5461):2204–15, 2000.
[271] A. Rzhetsky and P. Morozov. Markov chain monte carlo computation of confi-dence intervals for substitution-rate variation in proteins. Pac Symp Biocomput,6:203–214, 2001.
[272] C. Sadhu, D. Hoekstra, M. J. McEachern, S. I. Reed, and J. B. Hicks. A g-protein alpha subunit from asexual candida albicans functions in the matingsignal transduction pathway of saccharomyces cerevisiae and is regulated by thea1-alpha 2 repressor. Mol Cell Biol, 12(5):1977–85, 1992.
[273] N. Saitou and M. Nei. The neighbor-joining method: a new method for recon-structing phylogenetic trees. Mol Biol Evol, 4(4):406–25, 1987.
[274] L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie, and D. Eisen-berg. The database of interacting proteins: 2004 update. Nucleic Acids Res,32(Database issue):D449–51, 2004.
[275] G. San-Blas. [dimorphic fungi: biochemical approach to their dimorphism]. ActaCient Venez, 46(4):221–4, 1995.
[276] G. A. Sarosi and D. S. Serstock. Isolation of blastomyces dermatitidis frompigeon manure. Am Rev Respir Dis, 114(6):1179–83, 1976.
[277] A. S. Sekhon, J. S. Li, and A. K. Garg. Penicillosis marneffei: serological andexoantigen studies. Mycopathologia, 77(1):51–7, 1982.
[278] P. Sengupta and B. H. Cochran. Mat alpha 1 can mediate gene activation by a-mating factor. Genes Dev, 5(10):1924–34, 1991.
[279] C. Seoighe and K. H. Wolfe. Extent of genomic rearrangement after genome duplication in yeast. Proc Natl Acad Sci U S A, 95(8):4447–52, 1998.
[280] C. Seoighe and K. H. Wolfe. Updated map of duplicated regions in the yeast genome. Gene, 238(1):253–61, 1999.
[281] P. M. Sharp. In search of molecular darwinism. Nature, 385:111–112, 1997.
[282] P. M. Sharp and W. H. Li. The codon adaptation index–a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res, 15(3):1281–95, 1987.
[283] P. M. Sharp and W. H. Li. The rate of synonymous substitution in enterobacterial genes is inversely related to codon usage bias. Mol Biol Evol, 4(3):222–30, 1987.
[284] J. C. Shepherd, W. McGinnis, A. E. Carrasco, E. M. De Robertis, and W. J. Gehring. Fly and frog homoeo domains show homologies with yeast mating type regulatory proteins. Nature, 310(5972):70–1, 1984.
[285] R. Shields. Pushing the envelope on molecular dating. Trends Genet, 20(5):221–2, 2004.
[286] R. A. Sia, K. B. Lengeler, and J. Heitman. Diploid strains of the pathogenic basidiomycete cryptococcus neoformans are thermally dimorphic. Fungal Genet Biol, 29(3):153–63, 2000.
[287] A. Sidow. Gen(om)e duplications in the evolution of early vertebrates. Curr Opin Genet Dev, 6(6):715–22, 1996.
[288] R. R. Sinden. Biological implications of the dna structures associated with disease-causing triplet repeats. Am J Hum Genet, 64(2):346–53, 1999.
[289] M. Sipiczki. Where does fission yeast sit on the tree of life? Genome Biol, 1(2):REVIEWS1011, 2000.
[290] T. Sirisanthana, K. Supparatpinyo, J. Perriens, and K. E. Nelson. Amphotericin b and itraconazole for treatment of disseminated penicillium marneffei infection in human immunodeficiency virus-infected patients. Clin Infect Dis, 26(5):1107–10, 1998.
[291] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. J Mol Biol, 147:195–197, 1981.
[292] T. F. Smith, M. S. Waterman, and C. Burks. The statistical distribution of nucleic acid similarities. Nucleic Acids Res, 13(2):645–56, 1985.
[293] R. Sorek, G. Ast, and D. Graur. Alu-containing exons are alternatively spliced. Genome Res, 12(7):1060–7, 2002.
[294] P. Staib, M. Kretschmar, T. Nichterlein, H. Hof, and J. Morschhauser. Differential activation of a candida albicans virulence gene family during infection. Proc Natl Acad Sci U S A, 97(11):6102–7, 2000.
[295] M. A. Steel. Recovering a tree from the leaf colourations it generates under a markov model. Appl Math Lett, 7:19–32, 1994.
[296] B. R. Steen, T. Lian, S. Zuyderduyn, W. K. MacDonald, M. Marra, S. J. Jones, and J. W. Kronstad. Temperature-regulated transcription in the pathogenic fungus cryptococcus neoformans. Genome Res, 12(9):1386–400, 2002.
[297] L. M. Steinmetz, C. Scharfe, A. M. Deutschbauer, D. Mokranjac, Z. S. Herman, T. Jones, A. M. Chu, G. Giaever, H. Prokisch, P. J. Oefner, and R. W. Davis. Systematic screen for human disease genes in yeast. Nat Genet, 31(4):400–4, 2002.
[298] A. Stoltzfus. On the possibility of constructive neutral evolution. J Mol Evol, 49(2):169–81, 1999.
[299] J. N. Strathern, E. Spatola, C. McGill, and J. B. Hicks. Structure and organization of transposable mating type cassettes in saccharomyces yeasts. Proc Natl Acad Sci U S A, 77(5):2839–43, 1980.
[300] K. Supparatpinyo, C. Khamwan, V. Baosoung, K. E. Nelson, and T. Sirisanthana. Disseminated penicillium marneffei infection in southeast asia. Lancet, 344(8915):110–3, 1994.
[301] K. Supparatpinyo, K. E. Nelson, W. G. Merz, B. J. Breslin, C. R. Cooper Jr., C. Kamwan, and T. Sirisanthana. Response to antifungal therapy by human immunodeficiency virus-infected patients with disseminated penicillium marneffei infections and in vitro susceptibilities of isolates from clinical specimens. Antimicrob Agents Chemother, 37(11):2407–11, 1993.
[302] K. Supparatpinyo, J. Perriens, K. E. Nelson, and T. Sirisanthana. A controlled trial of itraconazole to prevent relapse of penicillium marneffei infection in patients infected with the human immunodeficiency virus. N Engl J Med, 339(24):1739–43, 1998.
[303] Y. Suzuki and T. Gojobori. Analysis of coding sequences. In M. Salemi and A. M. Vandamme, editors, The phylogenetic handbook: a practical approach to DNA and protein phylogeny, pages 283–311. Cambridge University Press, Cambridge, UK, 2003.
[304] A. Tam, W. K. Schmidt, and S. Michaelis. The multispanning membrane protein ste24p catalyzes caax proteolysis and nh2-terminal processing of the yeast a-factor precursor. J Biol Chem, 276(50):46798–806, 2001.
[305] W. Tang, T. M. Gunn, D. F. McLaughlin, G. S. Barsh, S. F. Schlossman, and J. S. Duke-Cohan. Secreted and membrane attractin result from alternative splicing of the human atrn gene. Proc Natl Acad Sci U S A, 97(11):6025–30, 2000.
[306] D. Taramelli, S. Brambilla, G. Sala, A. Bruccoleri, C. Tognazioli, L. Riviera-Uzielli, and J. R. Boelaert. Effects of iron on extracellular and intracellular growth of penicillium marneffei. Infect Immun, 68(3):1724–6, 2000.
[307] D. Taramelli, C. Tognazioli, F. Ravagnani, O. Leopardi, G. Giannulis, and J. R. Boelaert. Inhibition of intramacrophage growth of penicillium marneffei by 4-aminoquinolines. Antimicrob Agents Chemother, 45(5):1450–5, 2001.
[308] R. L. Tatusov, D. A. Natale, I. V. Garkavtsev, T. A. Tatusova, U. T. Shankavaram, B. S. Rao, B. Kiryutin, M. Y. Galperin, N. D. Fedorova, and E. V. Koonin. The cog database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res, 29(1):22–8, 2001.
[309] S. Tavare. Some probabilistic and statistical problems in the analysis of dna sequences. Lectures on Mathematics in the Life Sciences, 17:57–86, 1986.
[310] R. D. Teasdale and M. R. Jackson. Signal-mediated sorting of membrane proteins between the endoplasmic reticulum and the golgi apparatus. Annu Rev Cell Dev Biol, 12:27–54, 1996.
[311] J. D. Thompson, D. G. Higgins, and T. J. Gibson. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22(22):4673–80, 1994.
[312] J. D. Thompson, D. G. Higgins, and T. J. Gibson. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22:4673–4680, 1994.
[313] C. Thrane, U. Kaufmann, B. M. Stummann, and S. Olsson. Activation of caspase-like activity and poly (adp-ribose) polymerase degradation during sporulation in aspergillus nidulans. Fungal Genet Biol, 41(3):361–8, 2004.
[314] W. E. Timberlake. Molecular genetics of aspergillus development. Annu Rev Genet, 24:5–36, 1990.
[315] R. B. Todd, J. R. Greenhalgh, M. J. Hynes, and A. Andrianopoulos. Tupa, the penicillium marneffei tup1p homologue, represses both yeast and spore development. Mol Microbiol, 48(1):85–94, 2003.
[316] S. Trewatcharegon, S. Sirisinha, A. Romsai, B. Eampokalap, R. Teanpaisan, and S. C. Chaiyaroj. Molecular typing of penicillium marneffei isolates from thailand by noti macrorestriction and pulsed-field gel electrophoresis. J Clin Microbiol, 39(12):4544–8, 2001.
[317] H. F. Tsai, Y. C. Chang, R. G. Washburn, M. H. Wheeler, and K. J. Kwon-Chung. The developmentally regulated alb1 gene of aspergillus fumigatus: its role in modulation of conidial morphology and virulence. J Bacteriol, 180(12):3031–8, 1998.
[318] H. F. Tsai, M. H. Wheeler, Y. C. Chang, and K. J. Kwon-Chung. A developmentally regulated gene cluster involved in conidial pigment biosynthesis in aspergillus fumigatus. J Bacteriol, 181(20):6469–77, 1999.
[319] N. Tsuchimori, L. L. Sharkey, W. A. Fonzi, S. W. French, J. E. Edwards Jr., and S. G. Filler. Reduced virulence of hwp1-deficient mutants of candida albicans and their interactions with host cells. Infect Immun, 68(4):1997–2002, 2000.
[320] B. G. Turgeon and O. C. Yoder. Proposed nomenclature for mating type genes of filamentous ascomycetes. Fungal Genet Biol, 31(1):1–5, 2000.
[321] Y. Van de Peer, J. S. Taylor, I. Braasch, and A. Meyer. The ghost of selection past: rates of evolution and functional divergence of anciently duplicated genes. J Mol Evol, 53(4-5):436–46, 2001.
[322] K. Vandepoele, Y. Saeys, C. Simillion, J. Raes, and Y. Van De Peer. The automatic detection of homologous regions (adhore) and its application to microcolinearity between arabidopsis and rice. Genome Res, 12(11):1792–801, 2002.
[323] N. Vanittanakom, C. R. Cooper Jr., S. Chariyalertsak, S. Youngchim, K. E. Nelson, and T. Sirisanthana. Restriction endonuclease analysis of penicillium marneffei. J Clin Microbiol, 34(7):1834–6, 1996.
[324] N. Vanittanakom, W. G. Merz, N. Sittisombut, C. Khamwan, K. E. Nelson, and T. Sirisanthana. Specific identification of penicillium marneffei by a polymerase chain reaction/hybridization technique. Med Mycol, 36(3):169–75, 1998.
[325] N. Vanittanakom, P. Vanittanakom, and R. J. Hay. Rapid identification of penicillium marneffei by pcr-based detection of specific sequences on the rrna gene. J Clin Microbiol, 40(5):1739–42, 2002.
[326] J. Varga and B. Toth. Genetic variability and reproductive mode of aspergillus fumigatus. Infect Genet Evol, 3(1):3–17, 2003.
[327] D. Venet. Matarray: a matlab toolbox for microarray data. Bioinformatics, 19:659–660, 2003.
[328] K. J. Verstrepen, A. Jansen, F. Lewitter, and G. R. Fink. Intragenic tandem repeats generate functional variability. Nat Genet, 37(9):986–90, 2005.
[329] K. J. Verstrepen, T. B. Reynolds, and G. R. Fink. Origins of variation in the fungal cell surface. Nat Rev Microbiol, 2(7):533–40, 2004.
[330] P. E. Verweij, J. F. Meis, P. van den Hurk, J. Zoll, R. A. Samson, and W. J. Melchers. Phylogenetic relationships of five species of aspergillus and related taxa as deduced by comparison of sequences of small subunit ribosomal rna. J Med Vet Mycol, 33(3):185–90, 1995.
[331] K. Vienken, M. Scherer, and R. Fischer. The zn(ii)2cys6 putative aspergillus nidulans transcription factor repressor of sexual development inhibits sexual development under low-carbon conditions and in submersed culture. Genetics, 169(2):619–30, 2005.
[332] M. Viswanathan, G. Muthukumar, Y. S. Cong, and J. Lenard. Seripauperins of saccharomyces cerevisiae: a new multigene family encoding serine-poor relatives of serine-rich proteins. Gene, 148(1):149–53, 1994.
[333] M. A. Viviani, A. M. Tortorano, G. Rizzardini, T. Quirino, L. Kaufman, A. A. Padhye, and L. Ajello. Treatment and serological studies of an italian case of penicilliosis marneffei contracted in thailand by a drug addict infected with the human immunodeficiency virus. Eur J Epidemiol, 9(1):79–85, 1993.
[334] A. Wagner. The fate of duplicated genes: loss or new function? Bioessays, 20(10):785–8, 1998.
[335] A. Wagner. The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol, 18(7):1283–92, 2001.
[336] J. B. Walsh. How often do duplicated genes evolve new functions? Genetics, 139(1):421–8, 1995.
[337] J. D. Walton. Horizontal gene transfer and the evolution of secondary metabolite gene clusters in fungi: an hypothesis. Fungal Genet Biol, 30(3):167–71, 2000.
[338] W. Wang, F. G. Brunet, E. Nevo, and M. Long. Origin of sphinx, a young chimeric rna gene in drosophila melanogaster. Proc Natl Acad Sci U S A, 99(7):4448–53, 2002.
[339] W. Wang, H. Zheng, S. Yang, H. Yu, J. Li, H. Jiang, J. Su, L. Yang, J. Zhang, J. McDermott, R. Samudrala, J. Wang, H. Yang, J. Yu, K. Kristiansen, and G. K. Wong. Origin and evolution of new exons in rodents. Genome Res, 15(9):1258–64, 2005.
[340] J. L. Weber and P. E. May. Abundant class of human dna polymorphisms which can be typed using the polymerase chain reaction. Am J Hum Genet, 44(3):388–96, 1989.
[341] M. H. Wheeler and A. A. Bell. Melanins and their importance in pathogenic fungi. Curr Top Med Mycol, 2:338–87, 1988.
[342] S. Whelan and N. Goldman. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol, 18(5):691–9, 2001.
[343] A. C. Wilson, S. S. Carlson, and T. J. White. Biochemical evolution. Annu Rev Biochem, 46:573–639, 1977.
[344] K. H. Wolfe and P. M. Sharp. Mammalian gene evolution: nucleotide sequence divergence between mouse and rat. J Mol Evol, 37(4):441–56, 1993.
[345] K. H. Wolfe and D. C. Shields. Molecular evidence for an ancient duplication of the entire yeast genome. Nature, 387(6634):708–13, 1997.
[346] K. H. Wong and S. S. Lee. Comparing the first and second hundred aids cases in hong kong. Singapore Med J, 39(6):236–40, 1998.
[347] L. P. Wong, P. C. Woo, A. Y. Wu, and K. Y. Yuen. Dna immunization using a secreted cell wall antigen mp1p is protective against penicillium marneffei infection. Vaccine, 20(23-24):2878–86, 2002.
[348] S. S. Wong, H. Siau, and K. Y. Yuen. Penicilliosis marneffei–west meets east. J Med Microbiol, 48(11):973–5, 1999.
[349] S. S. Wong, K. H. Wong, W. T. Hui, S. S. Lee, J. Y. Lo, L. Cao, and K. Y. Yuen. Differences in clinical and laboratory diagnostic characteristics of penicilliosis marneffei in human immunodeficiency virus (hiv)- and non-hiv-infected patients. J Clin Microbiol, 39(12):4535–40, 2001.
[350] S. S. Wong, P. C. Woo, and K. Y. Yuen. Candida tropicalis and penicillium marneffei mixed fungaemia in a patient with waldenstrom’s macroglobulinaemia. Eur J Clin Microbiol Infect Dis, 20(2):132–5, 2001.
[351] P. C. Woo, C. M. Chan, A. S. Leung, S. K. Lau, X. Y. Che, S. S. Wong, L. Cao, and K. Y. Yuen. Detection of cell wall galactomannoprotein afmp1p in culture supernatants of aspergillus fumigatus and in sera of aspergillosis patients. J Clin Microbiol, 40(11):4382–7, 2002.
[352] P. C. Woo, K. T. Chong, A. S. Leung, S. S. Wong, S. K. Lau, and K. Y. Yuen. Aflmp1 encodes an antigenic cell wall protein in aspergillus flavus. J Clin Microbiol, 41(2):845–50, 2003.
[353] P. C. Woo, H. Zhen, J. J. Cai, J. Yu, S. K. Lau, J. Wang, J. L. Teng, S. S. Wong, R. H. Tse, R. Chen, H. Yang, B. Liu, and K. Y. Yuen. The mitochondrial genome of the thermal dimorphic fungus penicillium marneffei is more closely related to those of molds than yeasts. FEBS Lett, 555(3):469–77, 2003.
[354] V. Wood, R. Gwilliam, M. A. Rajandream, M. Lyne, R. Lyne, A. Stewart, J. Sgouros, N. Peat, J. Hayles, S. Baker, D. Basham, S. Bowman, K. Brooks, D. Brown, S. Brown, T. Chillingworth, C. Churcher, M. Collins, R. Connor, A. Cronin, P. Davis, T. Feltwell, A. Fraser, S. Gentles, A. Goble, N. Hamlin, D. Harris, J. Hidalgo, G. Hodgson, S. Holroyd, T. Hornsby, S. Howarth, E. J. Huckle, S. Hunt, K. Jagels, K. James, L. Jones, M. Jones, S. Leather, S. McDonald, J. McLean, P. Mooney, S. Moule, K. Mungall, L. Murphy, D. Niblett, C. Odell, K. Oliver, S. O’Neil, D. Pearson, M. A. Quail, E. Rabbinowitsch, K. Rutherford, S. Rutter, D. Saunders, K. Seeger, S. Sharp, J. Skelton, M. Simmonds, R. Squares, S. Squares, K. Stevens, K. Taylor, R. G. Taylor, A. Tivey, S. Walsh, T. Warren, S. Whitehead, J. Woodward, G. Volckaert, R. Aert, J. Robben, B. Grymonprez, I. Weltjens, E. Vanstreels, M. Rieger, M. Schafer, S. Muller-Auer, C. Gabel, M. Fuchs, A. Dusterhoft, C. Fritzc, E. Holzer, D. Moestl, H. Hilbert, K. Borzym, I. Langer, A. Beck, H. Lehrach, R. Reinhardt, T. M. Pohl, P. Eger, W. Zimmermann, H. Wedler, R. Wambutt, B. Purnelle, A. Goffeau, E. Cadieu, S. Dreano, S. Gloux, et al. The genome sequence of schizosaccharomyces pombe. Nature, 415(6874):871–80, 2002.
[355] J. Wu and B. L. Miller. Aspergillus asexual reproduction and sexual reproduction are differentially affected by transcriptional and translational mechanisms regulating stunted gene expression. Mol Cell Biol, 17(10):6191–201, 1997.
[356] Z. Yan, X. Li, and J. Xu. Geographic distribution of mating type alleles of cryptococcus neoformans in four areas of the united states. J Clin Microbiol, 40(3):965–72, 2002.
[357] J. Yang, Z. Gu, and W. H. Li. Rate of protein evolution versus fitness effect of gene deletion. Mol Biol Evol, 20(5):772–4, 2003.
[358] Z. Yang. Estimating the pattern of nucleotide substitution. J Mol Evol, 39:105–111, 1994.
[359] Z. Yang. Paml: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci, 13(5):555–6, 1997.
[360] Z. Yang. Phylogenetic Analysis by Maximum Likelihood (PAML). Version 3.0. London: University College, 2000.
[361] R. F. Yeh, L. P. Lim, and C. B. Burge. Computational inference of homologous gene structures in the human genome. Genome Res, 11(5):803–16, 2001.
[362] G. Yona, N. Linial, and M. Linial. Protomap: automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Res, 28(1):49–55, 2000.
[363] K. Y. Yuen, C. M. Chan, K. M. Chan, P. C. Woo, X. Y. Che, A. S. Leung, and L. Cao. Characterization of afmp1: a novel target for serodiagnosis of aspergillosis. J Clin Microbiol, 39(11):3830–7, 2001.
[364] K. Y. Yuen, G. Pascal, S. S. Wong, P. Glaser, P. C. Woo, F. Kunst, J. J. Cai, E. Y. Cheung, C. Medigue, and A. Danchin. Exploring the penicillium marneffei genome. Arch Microbiol, 179(5):339–53, 2003.
[365] K. Y. Yuen, S. S. Wong, D. N. Tsang, and P. Y. Chau. Serodiagnosis of penicillium marneffei infection. Lancet, 344(8920):444–5, 1994.
[366] M. Zagulski, B. Babinska, R. Gromadka, A. Migdalski, J. Rytka, J. Sulicka, and C. J. Herbert. The sequence of 24.3 kb from chromosome x reveals five complete open reading frames, all of which correspond to new genes, and a tandem insertion of a ty1 transposon. Yeast, 11(12):1179–86, 1995.
[367] E. M. Zdobnov and R. Apweiler. Interproscan–an integration platform for the signature-recognition methods in interpro. Bioinformatics, 17(9):847–8, 2001.
[368] C. T. Zhang, J. Wang, and R. Zhang. A novel method to calculate the g+c content of genomic dna sequences. J Biomol Struct Dyn, 19:333–341, 2001.
[369] J. Zhang, Y. P. Zhang, and H. F. Rosenberg. Adaptive evolution of a duplicated pancreatic ribonuclease gene in a leaf-eating monkey. Nat Genet, 30(4):411–5, 2002.
[370] L. Zhang, T. J. Vision, and B. S. Gaut. Patterns of nucleotide substitution among simultaneously duplicated gene pairs in arabidopsis thaliana. Mol Biol Evol, 19(9):1464–73, 2002.
[371] P. Zhang, Z. Gu, and W. H. Li. Different evolutionary patterns between young duplicate genes in the human genome. Genome Biol, 4(9):R56, 2003.
[372] R. Zhang and C. T. Zhang. Z curves, an intutive tool for visualizing and analyzing the dna sequences. J Biomol Struct Dyn, 11:767–782, 1994.