Online Supplementary Material
On the origin of MADS-domain transcription factors
Lydia Gramzow, Markus S. Ritz and Günter Theißen*
Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743 Jena,
Germany
*Corresponding author: Theißen, G. ([email protected])
Methods
Datasets
A list of sampled eukaryotic species, together with information about classification, numbers of
retrieved MADS domains, type and source of data is provided in Table S1. For remote homology
detection, the non-redundant databases for microbes and plants available at the National Center for
Biotechnology Information (NCBI) [1] were downloaded. These databases were clustered at the
50% identity level using cd-hit [2] for usability with HHsearch. Default values were used for all
parameters, except that word size was reduced to three. For all clusters, alignments were created
using Clustal W [3], and Hidden Markov Models (HMMs) were constructed using the HMMer
package [4].
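The clustering step can be sketched as a small Python helper that assembles the cd-hit command line. The options -i, -o, -c and -n are cd-hit's input, output, identity threshold and word length options; the file names and function names are hypothetical, and the mapping from identity threshold to word size follows the ranges recommended in the cd-hit user guide rather than anything stated in the text beyond "50% identity, word size three".

```python
import shlex


def cdhit_word_size(identity):
    """Word length (-n) recommended by the cd-hit guide for a protein identity threshold (-c)."""
    if identity >= 0.7:
        return 5
    if identity >= 0.6:
        return 4
    if identity >= 0.5:
        return 3
    if identity >= 0.4:
        return 2
    raise ValueError("cd-hit does not cluster proteins below 40% identity")


def cdhit_command(fasta_in, out_prefix, identity=0.5):
    """Build the argument list for one cd-hit run; nothing is executed here."""
    return [
        "cd-hit",
        "-i", fasta_in,                        # input FASTA file
        "-o", out_prefix,                      # output prefix (representatives and .clstr file)
        "-c", f"{identity:g}",                 # sequence identity threshold
        "-n", str(cdhit_word_size(identity)),  # word length, 3 at the 50% level
    ]


# Hypothetical file names, for illustration only.
print(shlex.join(cdhit_command("nr_microbial.fasta", "nr_microbial_c50")))
```

The argument list can be passed to subprocess.run once cd-hit is installed; all other parameters stay at their defaults, as described above.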
To study the distribution of the MADS domain in eukaryotes, queries on the Entrez protein database
of NCBI [1] and the corresponding annotation databases were carried out for 40 whole genomes
and five EST data sets. All sequences were also translated in the six possible reading frames.
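A minimal sketch of a translation in the six possible reading frames is given below. It assumes the standard genetic code and plain A/C/G/T nucleotide input; the function names are illustrative and do not correspond to the tool actually used for this step, which is not named in the text.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to protein translations."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Stop codons are kept as "*" so that open reading frames can be delimited afterwards.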
Representative SERUM RESPONSE FACTOR (SRF)-like and MYOCYTE ENHANCER
FACTOR 2 (MEF2)-like MADS domains from plants, animals and fungi were chosen such that
both types of MADS domains and sequences from all major groups of eukaryotes for which MADS
domains have been found are included. These sequences were then aligned manually (Figure 1b).
The alignment was used to create an HMM with the HMMer package [4], which was used as a
search pattern. For sequences yielding an HMMer E-value lower than 1, the occurrence of a MADS
domain was confirmed by scanning against the NCBI conserved domains database [1]. For
sequences that were not present in NCBI or the corresponding genome databases, GlimmerHMM
[5] was used with all available training sets to predict genes in the regions where the MADS
domain was identified.
Remote homology detection
The HMMer package and HHSearch (version 1.5.1) [6] were used to find putatively homologous
sequences to the MADS domain in the non-redundant microbial database. For both methods we
used default parameters, except that the E-value threshold was increased to 80 for the HMMer search and
that HHSearch was run to identify global matches. The results were cross-checked by
reverse searches of the plant non-redundant database, using as queries an HMM of the six type IIA
topoisomerase, subunit A (TOPOIIA-A) fragments identified with HMMer and the identified TOPOIIA-A
cluster (HHSearch).
Character state evolution
The type of MADS domain (SRF-like or MEF2-like) was determined for all MADS domains
identified by scanning against the NCBI conserved domains database [1]. The SRF-like and the
MEF2-like MADS domain were then examined separately. Each of them was scored absent only if
none of the searches of a complete eukaryotic genome gave positive results. Character
transformations were reconstructed via likelihood ancestral states and an asymmetric 2-parameter
Markov model with a forward rate of 0.1, a backward rate of 0.9 and equal root state frequencies on
trees corresponding to both rooting hypotheses in Baldauf [7], by using Mesquite, version 2.5 [8].
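To illustrate the asymmetric 2-parameter model, the following Python sketch computes the transition probabilities of a two-state (absent/present) continuous-time Markov chain from a forward and a backward rate, using the standard closed-form solution. The function name and the branch length of 1.0 are assumptions for illustration; the sketch does not reproduce Mesquite's internal implementation.

```python
import math


def transition_matrix(gain, loss, t):
    """Transition probabilities P[i][j] (0 = absent, 1 = present) over branch length t.

    gain is the forward rate (absent -> present), loss the backward rate (present -> absent).
    """
    total = gain + loss
    eq_present = gain / total        # equilibrium frequency of "present"
    eq_absent = loss / total         # equilibrium frequency of "absent"
    decay = math.exp(-total * t)     # how much of the starting state is still remembered
    return [
        [eq_absent + eq_present * decay, eq_present * (1.0 - decay)],
        [eq_absent * (1.0 - decay), eq_present + eq_absent * decay],
    ]


# Rates from the text; the branch length is a placeholder.
P = transition_matrix(gain=0.1, loss=0.9, t=1.0)
print("P(absent -> present):", P[0][1])
print("P(present -> absent):", P[1][0])
```

With these rates a loss along a branch is nine times as probable as a gain, whatever the branch length.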
Alignments and phylogenetic analysis
A dataset of 75 MADS-domain sequences, including the 57 sequences identified before and a
representative sequence of each of the major clades of MADS-domain proteins in Arabidopsis
thaliana, was aligned using Muscle, version 3.6 [9] with default settings (Figure S1). Phylogenetic
analyses were carried out by the maximum likelihood method using the RAxML program [10], with
the WAG [11] model of amino acid substitutions and 1000 bootstrap replicates. The best-fitting
model was determined using ProtTest, version 1.4 [12].
TOPOIIA-A sequences from the non-redundant microbial database were aligned using Clustal W
[3] with default settings. The positions of the partial sequences identified as putatively
homologous to the MADS domain were located in this alignment, and pairwise Clustal W scores
were determined for the corresponding partial alignment.
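A pairwise score of this kind can be approximated by the percent identity between two rows of an alignment. The sketch below is written under that assumption: it counts identical residues over columns in which neither sequence has a gap. The exact scoring performed by Clustal W is not described in the text, and the function name and example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity of two aligned sequences, ignoring columns that contain a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # gapped columns are not scored
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Hypothetical aligned fragments, not taken from the data set.
print(percent_identity("MKR-GK", "MRRAGK"))
```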
Structural analysis
The Phyre protein structure prediction server [13] was used to model the structure of DNA
topoisomerase IV, subunit A of A. variabilis. The server identified the solved structure of gyrase A,
C-terminal domain from B. burgdorferi (PDB identifier: 1SUU) to be a good template to model the
structure (E-value 1.7e-33). The modeled structure was compared to the solved structure of SRF
core from H. sapiens (PDB identifier: 1SRS).
Acquisitions and losses of SRF-like and MEF2-like MADS domains
We used a 2-parameter Markov model with a forward rate for a gain of the MADS domain of 0.1
and a backward rate for a loss of 0.9 (meaning that it is 9 times more likely to lose a MADS domain
than to gain one). Under this model, the likelihoods that the SRF-like and the MEF2-like MADS
domain were present in the most recent common ancestor (MRCA) of extant eukaryotes are
0.60 and 0.70, respectively, for rooting hypothesis I, and 0.84 and 0.92 for rooting hypothesis II
(Figure 2a). The MADS domain is assumed to be of monophyletic origin [14], and the independent
convergent evolution of a defined DNA-binding domain two or more times also appears extremely
unlikely to us. A forward rate of 0.1 was chosen to account for possible events of horizontal
gene transfer (HGT) and is still comparatively high. Nevertheless, the probabilities that SRF-like and
MEF2-like MADS domains were present in the MRCA of extant eukaryotes are well above 50%;
under rooting hypothesis I they drop below 50% only when the forward rate is set higher than
0.18 for SRF-like MADS domains and 0.24 for MEF2-like MADS domains. Note that
two independent HGT events would be required to explain the origin of the MADS domain after the
diversification of discicristates, namely the HGT of the SRF-like and the HGT of the MEF2-like
MADS domain to the lineage that led to N. gruberi. Assuming that at least one of the two trees used
here is largely correct, one can conclude that the MADS domain originated early during eukaryote
evolution, either already in the lineage that led to the MRCA of extant eukaryotes, or after excavates
had branched off, at the latest.
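The dependence of the root-state probability on the forward rate can be explored with a small sketch of a likelihood reconstruction on a tree. The code below applies the pruning algorithm with closed-form two-state transition probabilities and equal root state frequencies. The toy tree, its unit branch lengths and all function names are assumptions for illustration, so the printed values are not expected to reproduce the probabilities reported above.

```python
import math


def transition_matrix(gain, loss, t):
    """Two-state transition probabilities (0 = absent, 1 = present) over branch length t."""
    total = gain + loss
    eq_present, eq_absent = gain / total, loss / total
    decay = math.exp(-total * t)
    return [
        [eq_absent + eq_present * decay, eq_present * (1.0 - decay)],
        [eq_absent * (1.0 - decay), eq_present + eq_absent * decay],
    ]


def conditional_likelihoods(node, gain, loss):
    """Likelihood of the tip data below node, given that node is absent (index 0) or present (1).

    A tip is an int (observed state); an internal node is a list of (child, branch_length) pairs.
    """
    if isinstance(node, int):
        return [1.0 if node == state else 0.0 for state in (0, 1)]
    likelihoods = [1.0, 1.0]
    for child, branch_length in node:
        p = transition_matrix(gain, loss, branch_length)
        below = conditional_likelihoods(child, gain, loss)
        for state in (0, 1):
            likelihoods[state] *= p[state][0] * below[0] + p[state][1] * below[1]
    return likelihoods


def root_presence_probability(tree, gain, loss):
    """Probability that the domain was present at the root, with equal root frequencies."""
    absent, present = conditional_likelihoods(tree, gain, loss)
    return present / (absent + present)  # the equal prior of 0.5 cancels out


# Hypothetical four-taxon tree: three tips with the domain, one without.
toy_tree = [([(1, 1.0), (1, 1.0)], 1.0), ([(1, 1.0), (0, 1.0)], 1.0)]
for gain in (0.1, 0.18, 0.24):
    print(gain, round(root_presence_probability(toy_tree, gain, loss=0.9), 3))
```

Raising the forward rate lets more of the observed presences be explained by later gains, which lowers the probability of presence at the root; this is the effect behind the thresholds of 0.18 and 0.24 mentioned above.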
Phylogenetic tree reconstruction corroborates the presence of two types of MADS domains in the MRCA of extant eukaryotes
To critically test our conclusions concerning the early duplication of MADS-box genes during the
evolution of eukaryotes, we reconstructed a maximum likelihood tree with representative MADS
domain sequences of A. thaliana and the identified MADS domain sequences in the analyzed
genomes (Figures S1 and S2). The overall resolution of the tree is low, but it shows two branches
containing the vast majority of SRF-like and MEF2-like MADS domains, respectively. Some
sequences annotated as SRF-like or MEF2-like MADS domains appear on the other branch, but the
classification of those is usually not well supported by E-values (Table S2, sequences in bold), so
that their classification is questionable. All in all, our tree supports the idea that both types of
MADS domains were present in the MRCA of extant eukaryotes and thus must have been
generated by a gene duplication that occurred in the lineage that led to the MRCA of extant
eukaryotes.
Details on the similarity between TOPOIIA-A and MADS-domain sequences
At positions two and four of the alignment of TOPOIIA-A and MADS-domain sequences (Figure
1b and c), positively charged residues are found that could be important for contact with the
negatively charged backbone of the DNA. The sequence of three positively charged amino acids in
a row at positions 23 to 25 is conserved in all MADS-domain proteins but is interrupted by a
hydrophobic residue in TOPOIIA-A fragments. These residues have frequently been identified to be
part of a nuclear localization signal in the MADS domain [15-17], which would have no function in
prokaryotes. Hydrophobic residues are found in all sequences at positions 11, 21, 35, 46 and 48 of
the alignment. Hydrophobic residues are generally important for the structural stability of proteins
[18].
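Observations of this kind can be checked programmatically by scanning an alignment for columns in which every sequence carries a residue of the same physicochemical class. The sketch below is illustrative only: the residue classes are a common convention and an assumption here, and the toy alignment is hypothetical rather than taken from Figure 1.

```python
# Assumed residue classes; other definitions exist (histidine is sometimes counted as positive).
HYDROPHOBIC = set("AVLIMFWC")
POSITIVE = set("KR")


def conserved_class_columns(alignment, residue_class):
    """Return 1-based alignment positions at which all sequences fall into residue_class."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        column + 1
        for column in range(length)
        if all(seq[column].upper() in residue_class for seq in alignment)
    ]


# Hypothetical toy alignment.
toy_alignment = ["MKRLA", "LRKIA", "IKRVG"]
print("hydrophobic:", conserved_class_columns(toy_alignment, HYDROPHOBIC))
print("positively charged:", conserved_class_columns(toy_alignment, POSITIVE))
```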
Higher order structures of TOPOIIA-A and MADS-domain sequences
Generally, a similar tertiary protein structure is seen as an argument for homology even if proteins
have a low level of sequence identity [19, 20]. As there are no solved structures available for any of
the TOPOIIA-A proteins identified as being homologous to the MADS domain in our study, we
modeled the structure of the identified TOPOIIA-A protein with the lowest E-value, DNA
topoisomerase IV, subunit A of Anabaena variabilis. The solved structure of gyrase A, C-terminal
domain of Borrelia burgdorferi was used as a template (as suggested by the Phyre protein structure
prediction server [13]). The modeled structure folds into a β-pinwheel, similar to the structure
formed by the B. burgdorferi protein. The region that is putatively homologous to the MADS
domain includes five β-strands and one α-helix (Figure S4a). The MADS domain of SRF adopts a
structure with an N-terminal extension, a long α-helix and two β-strands (Figure S4b) [21]. At first
glance, these two structures appear quite dissimilar. However, the two C-terminal β-strands in the
region of topoisomerase IV that is putatively homologous to the MADS domain overlap with the
two β-strands in the MADS domain (Figure S5). If the predicted structure of DNA topoisomerase
IV is correct, there has been an elongation of the α-helix in the part putatively homologous to the
MADS domain in the evolution of this predicted structure from the template structure of B.
burgdorferi. Thus, a change of structure towards a long α-helix during the evolution of the MADS
domain from an ancestral TOPOIIA-A protein seems feasible. Note also that
changes in protein secondary structure can be induced with few or even no
changes in amino acid sequence, as in the case of so-called chameleon sequences [22], prions or the
Arc repressor [23-26].
The three residues identical in the 15 TOPOIIA-A fragments identified by homology searches and
the seed alignment are located in loop regions on the surface of DNA topoisomerase IV and in a
loop region and in the α-helix of the MADS domain. These residues contact DNA in the solved
structure of SRF [21]. The fact that these residues are located in loop regions in the predicted
structure of topoisomerase IV indicates that there are few structural constraints on these residues
to maintain α-helical or β-strand properties. We hence assume that there have been functional
constraints on these residues during evolution, possibly due to DNA binding.
Supplementary figures
Figure S1 – Alignment for the phylogenetic tree shown in Figure S2. Species abbreviations are
as in Figure 1 and Mb – Monosiga brevicollis, Ng – Naegleria gruberi, Um – Ustilago maydis, Nc –
Neurospora crassa, Ps – Phytophthora sojae, Ptri – Phaeodactylum tricornutum. In XP001461257
of Paramecium tetraurelia the sequence NVNLLFQLLILLFLEPLYNLNYYLILC was omitted
between alignment positions 16 and 17 to simplify the presentation. The alignment is colored
according to the Clustal X color scheme. Sequences are named as in Figure 1, the accession number
is given or the common name is used.
Figure S2 – Phylogenetic tree of 75 MADS domains constructed using the Maximum Likelihood
method as implemented in the program RAxML [10]. The two clusters of MEF2-like and SRF-like
MADS domains are indicated. Sequence names in red indicate domains that were classified as a
different type than the majority of the domains in the branch they belong to. For readability,
sequence names are not shown for all annotated domains, whereas all domains of questionable
classification are labeled (also see Table S2). The branches are color coded such that the
respective sequences are from the following groups of organisms: green – plants, red –
opisthokonts, black – alveolates, blue – amoebozoans, yellow – discicristates, cyan – chromistans.
Species abbreviations and sequence names are the same as in Figure 1 and Figure S1.
Figure S3 – Presence or absence of SRF-like and MEF2-like MADS domains in the evolution of
eukaryotes under two alternative rooting hypotheses and reconstructed using the parsimony
principle. Black denotes presence of the MADS domain while white indicates absence of the
MADS domain.
Figure S4 – Structure comparison. Comparison of the predicted structure of DNA topoisomerase
IV, subunit A of Anabaena variabilis, a cyanobacterium (a), and the partial structure of DNA-bound
SERUM RESPONSE FACTOR of Homo sapiens (b; SRF, PDB: 1SRS). For clarity, only
the sequence stretch representing amino acids 225 to 282 of the structure of DNA topoisomerase IV,
subunit A is shown. The MADS domain of SRF and the region putatively homologous to the
MADS domain in TOPOIIA-A are colored in dark blue. Residues identical between 15 TOPOIIA-A
fragments identified by homology searches and the MADS domain are shown in spacefill
representation and colored green.
Figure S5 – Secondary structure alignment of a part of SERUM RESPONSE FACTOR of Homo
sapiens (SRF, PDB: 1SRS) and the predicted structure of the corresponding part of DNA
topoisomerase IV, subunit A of Anabaena variabilis, a cyanobacterium. α-helices are shown as red
boxes and β-strands are shown as green boxes with an arrowhead. An amino acid alignment of the
corresponding amino acid sequences is indicated.
Supplementary tables
Table S1 – MADS domains in six domains of life. The number of MADS domains found in the
corresponding genome is listed in the column “#MADS”. A star in the column “#MADS” indicates
that some of the recovered MADS domains are not annotated in the corresponding databases. The
number in brackets states how many MADS domains are not annotated. Alternating shading was
used to facilitate distinction between major groups of eukaryotes.
Taxonomy          Species                      #MADS   Data set            Data source
Opisthokonta
Fungi
Ascomycota        Neurospora crassa            2       Complete genome     Broad Institute
                  Yarrowia lipolytica          2       Complete genome     Center for Bioinformatics, Bordeaux
                  Saccharomyces cerevisiae     4       Complete genome     Saccharomyces Genome DB
                  Schizosaccharomyces pombe    4*(1)   Complete genome     Sanger
Basidiomycota     Ustilago maydis              2       Complete genome     Broad Institute
                  Cryptococcus neoformans      3*(1)   Complete genome     Broad Institute
Microsporidia     Encephalitozoon cuniculi     1       Complete genome     IMG
                  Antonospora locustae         2*      Complete genome     Antonospora GDB
Metazoa           Drosophila melanogaster      2       Complete genome     Berkeley DGP
Choanoflagellata  Monosiga brevicollis         3       Complete genome     JGI
                  Proterospongia               0       ESTs                TbestDB
Amoebozoa
Myxogastrida      Physarum polycephalum        2*      Complete genome     Washington University GSC
Dictyostelida     Dictyostelium discoideum     4       Complete genome     IMG
Acanthamoebidae   Acanthamoeba castellanii     3*      Complete genome     BCM
Hartmannellidae   Hartmannella vermiformis     2       ESTs                TbestDB
Pelobionta        Entamoeba histolytica        3*(1)   Complete genome     IMG
Plantae
Chlorophyta
Embryophyta       Arabidopsis thaliana         122     Annotated Proteins  TAIR
Chlorophyta       Chlamydomonas reinhardtii    2       Complete genome     JGI
Rhodophyta
Cyanidiales       Cyanidioschyzon merolae      1*      Complete genome     C.m. Genome Project
Glaucocystophyta  Glaucocystis nostochinearum  0       ESTs                TbestDB
Rhizaria
Cercozoa          Bigelowiella natans          0       Nucleomorph         NCBI
Alveolata
Apicomplexa       Toxoplasma gondii            1*      Complete genome     ToxoDB
                  Theileria annulata           0       Complete genome     Sanger
                  Theileria parva              0       Complete genome     IMG
                  Plasmodium falciparum        0       Complete genome     IMG
                  Plasmodium yoelii yoelii     0       Complete genome     IMG
                  Cryptosporidium parvum       0       Complete genome     IMG
                  Cryptosporidium hominis      0       Complete genome     IMG
Dinoflagellata
Dinophyceae       Heterocapsa triquetra        0       ESTs                TbestDB
Ciliophora        Tetrahymena thermophila      1       Complete genome     TIGR
                  Paramecium tetraurelia       8       Complete genome     Paramecium Genome Browser
Chromista
Heterokonta
Oomycota          Phytophthora sojae           1       Complete genome     JGI
                  Phytophthora ramorum         1       Complete genome     JGI
Bacillariophyta   Thalassiosira pseudonana     0       Complete genome     JGI
                  Phaeodactylum tricornutum    1*      Complete genome     JGI
Cryptophyta       Guillardia theta             0       Nucleomorph         IMG
                  Hemiselmis andersenii        0       Nucleomorph         NCBI
Discicristates
Euglenozoa
Kinetoplastida    Leishmania major             0       Complete genome     IMG
                  Leishmania infantum          0       Complete genome     Sanger
                  Trypanosoma cruzi            0       Complete genome     TIGR
                  Trypanosoma brucei           0       Complete genome     IMG
Heterolobosea
Schizopyrenida    Naegleria gruberi            2       Complete genome     JGI
Excavata
Diplomonadina     Giardia lamblia              0       Complete genome     IMG
Parabasalia       Trichomonas vaginalis        0       Complete genome     TIGR
Oxymonadina       Streblomastix strix          0       ESTs                TbestDB
Table S2 – E-values for classification of MADS domains as SRF-like or MEF2-like according
to the NCBI conserved domains database. The lower one of the two E-values, used for
classification in Figures S1 and S2, is shaded. Species abbreviations and sequence names are the
same as in Figure 1 and Figure S1. Bold writing indicates questionable classifications. n.a., not
available.
Sequence SRF-like MEF2-like
JGI_EGW1.7.155.1_Ng 3.00E-14 1.00E-18
JGI_EEGNGPG.C520029_Ng 3.00E-19 3.00E-16
JGI_EEG1PG.C1440013_Ps 5.00E-16 3.00E-24
SCAFFOLD37000080_Pr 4.00E-16 1.00E-24
chr11_37909-37967_Ptri n.a. 5.3E-02
XP001013498_Tt 2.00E-12 1.00E-09
scaffold129_50735-50800_Pt 2.00E-04 6.00E-05
scaffold2_81330-81385_Pt 7.00E-04 6.00E-04
XP001438374_Pt 5.00E-10 9.00E-09
XP001429517_Pt 2.00E-09 1.00E-08
XP001434362_Pt 7.00E-06 2.00E-07
scaffold157_37788-37869_Pt 1.00E-09 2.00E-09
XP001461257_Pt 1.00E-06 1.00E-07
scaffold91_69496-69580_Pt 5.00E-11 3.00E-11
TGG995082_Tg 8.00E-17 1.00E-26
AP006483_98690-98746_Cm 9.00E-12 8.00E-16
SCAFFOLD_3000406_Cr 1.00E-03 6.00E-04
SCAFFOLD66000005_Cr 2.00E-06 1.00E-07
IMG640321549_Eh 4.00E-17 1.00E-19
IMG640313519_Eh 2.00E-12 2.00E-14
NW665827_18595-18632_Eh 1.00E-07 3.00E-07
contig9912_5278-5374_Ac 6.00E-18 2.00E-16
contig4434_456-542_Ac 2.00E-15 3.00E-14
contig14208_1287-1381_Ac 3.00E-14 3.00E-18
IMG639614547_Dd 1.00E-17 3.00E-15
IMG639615340_Dd 1.00E-18 6.00E-16
IMG639614226_Dd 1.00E-12 3.00E-15
IMG639621525_Dd 1.00E-15 4.00E-23
HVL00004593_Hv 4.00E-08 6.00E-10
HVL00000978_Hv 6.00E-17 2.00E-13
Contig6957_233-283_Pp 2.00E-18 1.00E-16
Contig5814_226-273_Pp 9.00E-16 2.00E-24
IMG638215947_Ec 8.00E-15 8.00E-13
contig39_1287-1340_Al 4.00E-03 n.a.
contig175_1617-1674_Al 2.00E-14 9.00E-14
XP772022_Cn 6.00E-18 2.00E-14
XP777518_Cn 3.00E-14 3.00E-21
AACO02000072_75987_76036_Cn 8.00E-08 2.00E-13
XP757371_Um 3.00E-10 8.00E-10
XP761470_Um 9.00E-16 3.00E-20
XP501533_Yl 3.00E-10 4.00E-10
XP505594_Yl 1.00E-16 6.00E-24
fgenesh1pg.C150048_Mb 2.00E-13 1.00E-15
gw1.4.553.1_Mb 1.00E-05 7.00E-04
scaffold_15000047_Mb 9.00E-12 2.00E-12
XP964617_Nc 1.00E-09 6.00E-10
XP965689_Nc 1.00E-15 4.00E-21
NP013756_Sc 3.00E-17 5.00E-16
NP013757_Sc 7.00E-08 5.00E-11
NP015236_Sc 3.00E-18 2.00E-24
NP009741_Sc 6.00E-17 3.00E-21
NP596507_Sp 2.00E-16 2.00E-15
chr2_786512-786567_Sp 1.00E-11 2.00E-11
NP594931_Sp 8.00E-13 3.00E-11
NP595972_Sp 6.00E-11 6.00E-11
NP726438_Dm 2.00E-18 8.00E-14
NP995789_Dm 1.00E-16 1.00E-25
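The classification rule behind Table S2 (the type with the lower E-value is assigned) can be written down as a short sketch. The function name, the use of None for "n.a." and the handling of equal E-values are assumptions for illustration; the example values are rows of Table S2.

```python
def classify_mads_domain(srf_evalue, mef2_evalue):
    """Assign the MADS-domain type with the lower E-value; None stands for 'n.a.'."""
    if srf_evalue is None and mef2_evalue is None:
        raise ValueError("at least one E-value is required")
    if mef2_evalue is None:
        return "SRF-like"
    if srf_evalue is None:
        return "MEF2-like"
    if srf_evalue == mef2_evalue:
        return "tie"  # assumption: the text does not say how equal E-values were resolved
    return "SRF-like" if srf_evalue < mef2_evalue else "MEF2-like"


# Rows taken from Table S2: (sequence, SRF-like E-value, MEF2-like E-value).
rows = [
    ("JGI_EGW1.7.155.1_Ng", 3.00e-14, 1.00e-18),
    ("NP726438_Dm", 2.00e-18, 8.00e-14),
    ("chr11_37909-37967_Ptri", None, 5.3e-02),
]
for name, srf, mef2 in rows:
    print(name, classify_mads_domain(srf, mef2))
```

Rows whose two E-values differ only slightly are the ones for which the assignment is least reliable.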
Table S3 – Results of searching the MADS domain in the non-redundant microbial database
using HMMer. Raw “Score” and empirical “E-value” as calculated by HMMer are shown.
Sequence ID Description Score E-value
gi|75909066| DNA gyrase, subunit A -1.4 6.2
gi|17227937| DNA topoisomerase chain A -2.8 8.7
gi|113953878| DNA gyrase subunit A -6.8 23
gi|33862278| DNA gyrase/topoisomerase IV, subunit A -9.2 41
gi|33239457| Type IIA topoisomerase, A subunit, ParC -11.6 73
gi|124021719| DNA gyrase/topoisomerase IV, subunit A -11.8 78
Table S4 – Microbial clusters identified with HHsearch using the MADS-domain HMM as
query. Clusters are numbered according to the cd-hit clustering procedure. Columns “Query” and
“Template” indicate which amino acid positions of the MADS domain and the identified cluster,
respectively, show sequence similarity. Shading indicates clusters containing TOPOIIA-A
sequences.
Hit Description E-value Query Template
cluster21422 DNA gyrase/topoisomerase, subunit A 0.035 1-58 630-685
cluster276140 Proteasome subunit alpha 1 38-58 1-21
cluster466954 Predicted nucleic acid-binding protein 1 33-58 1-26
cluster72291 Hypothetical proteins 2.5 28-58 1-30
cluster224995 Proteins of unknown function DUF147 2.8 21-58 1-37
cluster170941 PSP1 proteins 4.2 39-58 1-20
cluster7568 Endo-beta-N-acetylglucosaminidase 4.6 22-58 1-35
cluster218999 3-oxoacyl synthases III 7.3 1-58 35-89
cluster493787 Transcriptional regulators 13 22-58 1-37
cluster7640 Chromosome segregation proteins SMC 14 1-58 856-911
cluster372191 Hypothetical proteins 15 1-25 175-199
cluster270568 Membrane proteins/Hypothetical proteins 19 28-58 1-31
cluster175139 Hypothetical proteins 24 28-58 1-34
cluster593245 Hypothetical proteins 33 1-19 37-55
cluster122934 Hypothetical proteins 33 1-20 408-427
cluster155000 Fibronectin-attachment family proteins 33 14-58 1-44
cluster191774 Fe3+ ABC transporters 38 28-58 1-36
cluster517725 Hypothetical proteins 41 28-58 1-31
cluster133636 Hydroxymethylglutaryl-coenzyme A synthases 43 1-21 393-412
cluster455816 Hypothetical proteins 44 35-58 1-23
cluster160652 Glycosyl transferases 45 31-58 1-27
cluster19843 DNA gyrase/topoisomerase, subunit A 49 1-58 663-721
Table S5 – Results of HMMer searches of the non-redundant plant database using an HMM of
the six previously identified TOPOIIA-A sequences. Red shading indicates clusters containing
TOPOIIA-A sequences while blue shading indicates clusters containing MADS-domain
sequences. Raw “Score” and empirical “E-value” as calculated by HMMer are shown.
Sequence Description Score E-value
gi|145355547| predicted protein [Ostreococcus lucimarinus] 4.9 0.015
gi|115441497| Os01g0886200 [Oryza sativa (japonica cultivar-group)] -4.6 0.22
gi|30694601| AGL16 (AGAMOUS-LIKE 16); transcription factor -7.0 0.43
gi|145332879| AGL16 (AGAMOUS-LIKE 16); transcription factor -7.0 0.43
gi|15238067| FLC; transcription factor [Arabidopsis thaliana] -7.6 0.52
gi|145334363| FLC (FLOWERING LOCUS C) [Arabidopsis thaliana] -7.6 0.52
gi|42568779| MAF4 (MADS AFFECTING FLOWERING 4) -8.8 0.73
gi|115487796| Os12g0207000 [Oryza sativa (japonica cultivar-group)] -9.7 0.93
gi|15230284| AGL18 (AGAMOUS-LIKE 18); transcription factor -10.2 1.1
gi|115483150| Os10g0536100 [Oryza sativa (japonica cultivar-group)] -10.5 1.1
gi|42566942| AG (AGAMOUS); transcription factor -10.9 1.3
gi|115467168| Os06g0223300 [Oryza sativa (japonica cultivar-group)] -11.4 1.5
gi|115456153| Os03g0812000 [Oryza sativa (japonica cultivar-group)] -12.3 1.9
gi|115446901| Os02g0579600 [Oryza sativa (japonica cultivar-group)] -13.2 2.5
gi|115466584| Os06g0162800 [Oryza sativa (japonica cultivar-group)] -13.5 2.7
gi|115448477| Os02g0731200 [Oryza sativa (japonica cultivar-group)] -14.0 3.1
gi|115451551| Os03g0215400 [Oryza sativa (japonica cultivar-group)] -14.1 3.2
gi|30681440| DNA gyrase subunit A family protein [Arabidopsis thaliana] -14.3 3.3
gi|115457632| Os04g0304400 [Oryza sativa (japonica cultivar-group)] -14.5 3.6
gi|115458790| Os04g0461300 [Oryza sativa (japonica cultivar-group)] -14.7 3.8
gi|15220084| MADS-box protein (AGL100) [Arabidopsis thaliana] -14.7 3.8
gi|115439679| Os01g0726400 [Oryza sativa (japonica cultivar-group)] -15.2 4.3
gi|115451205| Os03g0186600 [Oryza sativa (japonica cultivar-group)] -15.5 4.8
gi|15218456| MADS-box protein (AGL60) [Arabidopsis thaliana] -15.8 5.2
gi|42562154| AGL65; DNA binding / transcription factor [A. thaliana] -16.7 6.7
gi|115468584| Os06g0565900 [Oryza sativa (japonica cultivar-group)] -17.0 7.2
gi|115455401| Os03g0753100 [Oryza sativa (japonica cultivar-group)] -17.3 7.8
gi|15233857| AGL24 (AGAMOUS-LIKE 24); transcription factor -17.4 8.5
gi|30698092| AGL31; transcription factor [Arabidopsis thaliana] -17.5 8.3
gi|145334905| AGL31 [Arabidopsis thaliana] -17.5 8.3
gi|115469428| Os06g0667200 [Oryza sativa (japonica cultivar-group)] -17.5 8.4
gi|42568781| AGL68/MAF5 (MADS AFFECTING FLOWERING 5) -17.6 8.5
gi|145334907| AGL68/MAF5 (MADS AFFECTING FLOWERING 5) -17.6 8.5
gi|145350260| predicted protein [Ostreococcus lucimarinus CCE9901] -17.9 9.1
gi|15234874| STK (SEEDSTICK); transcription factor [A. thaliana] -17.9 9.3
gi|145332997| STK (SEEDSTICK) [Arabidopsis thaliana] -17.9 9.3
gi|30681253| STK (SEEDSTICK); transcription factor [A. thaliana] -17.9 9.3
gi|115476540| Os08g0431900 [Oryza sativa (japonica cultivar-group)] -18.0 9.6
gi|79376490| AGL94; DNA binding / transcription factor [A. thaliana] -18.1 9.7
Table S6 – Results of reverse HHsearch using an HMM of the previously identified TOPOIIA-
A cluster as a query to search the non-redundant plant database. Clusters are numbered
according to the cd-hit clustering procedure. Blue shading indicates clusters containing
MADS-domain sequences while red shading indicates clusters containing TOPOIIA-A
sequences. Columns “Query” and “Template” indicate which amino acid positions of the
MADS domain and the identified cluster show sequence similarity.
Hit Description E-value Query Template
cluster28707 MADS AFFECTING FLOWERING proteins 0.29 1-56 2-59
cluster25045 AGAMOUS-LIKE 18 proteins 1.9 1-56 2-59
cluster16375 AGAMOUS-LIKE 65 proteins 2.9 1-56 2-59
cluster2431 DNA gyrase subunit A family proteins 3.8 1-56 715-770
cluster25305 AG/SHP/STK proteins 5.7 1-56 2-59
cluster14854 zinc finger family proteins 6.7 1-19 410-429
cluster22457 2-dehydro-3-deoxyphosphooctonate aldolases 7 1-56 87-140
cluster16544 AGAMOUS-LIKE 30 proteins 7.8 1-56 2-59
cluster20126 unknown proteins 9.9 1-27 299-326
cluster25478 SVP proteins 10 1-56 20-78
Table S7 – Clustal W similarity scores between identified and non-identified sequences of TOPOIIA, subunit A. Summarized are Clustal W
scores of the partial TOPOIIA, subunit A identified by HMM- and/or HHSearch-queries to other TOPOIIA, subunit A sequences in the non-
redundant database. Maximum scores are shown except for the identified cyanobacterial sequences where minimum scores are shown. The
column titled “max” shows the overall maximum score and the column titled “avg” shows average scores of sequences not identified by
HMM- and/or HHSearch-queries to the identified sequences.
Acidobacteria Actinobacteria Aquificae Bacteroidetes Chlamydiae Chlorobi Deinococcus Euryarchaeota Firmicutes
gi|17227937| 32 37 30 35 32 33 33 32 33
gi|75909066| 32 37 32 33 28 32 32 32 33
gi|86604741| 30 33 21 35 33 39 39 35 35
gi|86607511| 32 33 34 37 33 41 41 30 35
gi|37521619| 33 33 20 32 30 30 37 33 42
gi|78211563| 28 28 23 21 23 19 23 26 30
gi|33864544| 26 40 27 25 25 21 23 25 28
gi|148238341| 26 29 18 21 25 23 23 19 25
gi|113953878| 23 29 25 23 25 19 26 25 30
gi|124021719| 28 32 30 25 21 28 28 28 28
gi|33862278| 26 30 29 23 19 25 26 28 28
gi|148241104| 17 29 14 23 14 17 26 23 26
gi|72383169| 17 28 10 21 16 14 16 25 28
gi|124024717| 17 28 10 21 16 14 16 25 28
gi|33239457| 19 33 18 21 19 19 23 26 28
Fusobacteria Planctomycetes Proteobacteria Spirochaetes Tenericutes Thermotogae Cyano. (min) max avg (nonid.)
gi|17227937| 30 32 33 28 32 26 32 53 23.28846154
gi|75909066| 28 30 33 28 30 26 33 51 23.453125
gi|86604741| 20 33 37 35 31 24 23 55 19.12980769
gi|86607511| 20 33 39 32 33 34 26 57 20.44591346
gi|37521619| 39 28 37 32 37 28 21 44 23.58173077
gi|78211563| 10 23 32 28 26 19 35 67 18.15264423
gi|33864544| 10 25 33 25 28 25 37 75 19.32572115
gi|148238341| 10 19 30 25 23 30 26 64 16.75
gi|113953878| 12 23 35 28 23 17 33 62 17.89783654
gi|124021719| 12 28 37 25 33 16 30 64 19.10096154
gi|33862278| 14 25 35 25 30 16 26 62 18.23317308
gi|148241104| 10 21 32 23 25 21 33 60 18.26201923
gi|72383169| 14 16 25 19 30 25 21 44 15.04927885
gi|124024717| 14 16 25 19 30 25 21 42 14.86298077
gi|33239457| 12 23 33 23 26 26 23 51 17.02163462
References
1. Sayers, E.W., et al. (2009) Database resources of the National Center for Biotechnology
Information. Nucleic Acids Res 37, D5-15
2. Li, W.Z., and Godzik, A. (2006) Cd-hit: a fast program for clustering and comparing large
sets of protein or nucleotide sequences. Bioinformatics 22, 1658-1659
3. Larkin, M.A., et al. (2007) Clustal W and clustal X version 2.0. Bioinformatics 23, 2947-
2948
4. Eddy, S.R. (1996) Hidden Markov models. Curr Opin Struct Biol 6, 361-365
5. Majoros, W.H., et al. (2004) TigrScan and GlimmerHMM: two open source ab initio
eukaryotic gene-finders. Bioinformatics 20, 2878-2879
6. Söding, J. (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics
21, 951-960
7. Baldauf, S.L. (2003) The deep roots of eukaryotes. Science 300, 1703-1706
8. Maddison, W.P., and Maddison, D.R. (2009) Mesquite: a modular system for evolutionary
analysis. Version 2.5 http://mesquiteproject.org.
9. Edgar, R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced time and
space complexity. BMC Bioinformatics 5, 1-19
10. Stamatakis, A. (2006) RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses
with thousands of taxa and mixed models. Bioinformatics 22, 2688-2690
11. Whelan, S., and Goldman, N. (2001) A general empirical model of protein evolution derived
from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18,
691-699
12. Abascal, F., et al. (2005) ProtTest: selection of best-fit models of protein evolution.
Bioinformatics 21, 2104-2105
13. Kelley, L.A., and Sternberg, M.J.E. (2009) Protein structure prediction on the Web: a case
study using the Phyre server. Nat Protoc 4, 363-371
14. Alvarez-Buylla, E.R., et al. (2000) An ancestral MADS-box gene duplication occurred
before the divergence of plants and animals. Proc Natl Acad Sci USA 97, 5328-5333
15. Gauthier-Rouvière, C., et al. (1995) The Serum Response Factor Nuclear-Localization Signal
- General Implications for Cyclic-Amp-Dependent Protein-Kinase Activity in Control of
Nuclear Translocation. Mol Cell Biol 15, 433-444
16. McGonigle, B., et al. (1996) Nuclear localization of the Arabidopsis APETALA3 and
PISTILLATA homeotic gene products depends on their simultaneous expression (vol 10, pg
1812, 1996). Genes Dev 10, 2235-2235
17. Immink, R.G.H., et al. (2002) Analysis of MADS box protein-protein interactions in living
plant cells. Proc Natl Acad Sci USA 99, 2416-2421
18. Kellis, J.T., et al. (1988) Contribution of Hydrophobic Interactions to Protein Stability.
Nature 333, 784-786
19. Chi, S.W., et al. (1999) Solution structure of a conserved C-terminal domain of p73 with
structural homology to the SAM domain. EMBO J 18, 4438-4445
20. Hellman, M., et al. (2004) Solution structure of coactosin reveals structural homology to
ADF/cofilin family proteins. FEBS Lett 576, 91-96
21. Pellegrini, L., et al. (1995) Structure of Serum Response Factor Core Bound to DNA. Nature
376, 490-498
22. Tan, S., and Richmond, T.J. (1998) Crystal structure of the yeast MAT alpha 2/MCM1/DNA
ternary complex. Nature 391, 660-666
23. Guo, M.X., et al. (2008) PrPC interacts with tetraspanin-7 through bovine PrP154-182
containing alpha-helix 1. Biochem Biophys Res Commun 365, 154-157
24. Harrison, P.M., et al. (1997) The prion folding problem. Curr Opin Struct Biol 7, 53-59
25. Cordes, M.H.J., et al. (1999) Evolution of a protein fold in vitro. Science 284, 325-327
26. Anderson, T.A., et al. (2005) Sequence determinants of a conformational switch in a protein
structure. Proc Natl Acad Sci USA 102, 18344-18349