

Running head:

AffyTrees

Corresponding author:

Georg Weiller

Research School of Biological Sciences

Australian National University

2602 Canberra Australia

Email:[email protected]

Tel: +61 2 6125 5916

Research area: Bioinformatics

Plant Physiology Preview. Published on December 7, 2007, as DOI:10.1104/pp.107.109603

Copyright 2007 by the American Society of Plant Biologists



AffyTrees: facilitating comparative analysis of Affymetrix plant microarray chips.

Tancred Frickey 1, Vagner Augusto Benedito 2, Michael Udvardi 2 and Georg Weiller 1*

1 ARC Centre of Excellence for Integrative Legume Research and Bioinformatics Laboratory,

Genomic Interactions Group, Research School of Biological Sciences, Australian National

University, GPO Box 475, Canberra, ACT 2601 Australia

2 The Samuel Roberts Noble Foundation, Ardmore, Oklahoma 73401

*to whom correspondence should be addressed

Email:

Tancred Frickey: [email protected]

Vagner Benedito: [email protected]

Michael Udvardi: [email protected]

Georg Weiller: [email protected]




Financial source:

This research was funded by an Australian Research Council Centre of Excellence grant. Funding to

pay for the publication charges was provided by the same grant.


Species: Arabidopsis thaliana and Medicago truncatula

Abstract:

Microarrays measure the expression of large numbers of genes simultaneously and can be used to

delve into interaction networks involving many genes at a time. However, it is often difficult to

decide to what extent knowledge about the expression of genes gleaned in one model organism can

be transferred to other species. This can be examined either by measuring the expression of genes of

interest under comparable experimental conditions in other species, or by gathering the necessary

data from comparable microarray experiments. However, it is essential to know which genes to

compare between the organisms. To facilitate comparison of expression data across different

species, we have implemented a web-based software tool that provides information about sequence

orthologs across a range of Affymetrix microarray chips.

AffyTrees provides a quick and easy way of determining which probe sets on different Affymetrix

chips measure the expression of orthologous genes. Even in cases where gene or genome

duplications have complicated the assignment, groups of comparable probe sets can be identified.

The phylogenetic trees provide a resource that can be used to improve sequence annotation and

detect biases in the sequence complement of Affymetrix chips. Being able to identify sequence

orthologs and recognize biases in the sequence complement of chips is necessary for reliable cross-

species microarray comparison. As the amount of work required to generate a single phylogeny in a

non-automated manner is considerable, AffyTrees can greatly reduce the workload for scientists

interested in large-scale cross-species comparisons.




Introduction

Microarray experiments have made it possible to rapidly quantify the expression of large numbers

of genes for a given experimental condition. The rapidity and ease of use of this technology have

enabled research into complex aspects of growth and development involving multiple genes at a

time. However, it remains difficult to extend findings from one organism to another, as it is often

not known which of the spots on different microarray chips measure the expression of comparable

(i.e. orthologous) genes.

The basic idea of using “model organisms” is that the knowledge gained from studying such an

organism will, to a large extent, be transferable to other species. Taking the regulatory feedback

loop controlling branching in Arabidopsis thaliana as an example, validating analyses needed to be

performed in a range of other species to determine to what extent this mechanism was conserved

and how far the knowledge gained in Arabidopsis thaliana could be applied to other plants

(Johnson 2006).

Approaches to validate such regulatory networks range from crudely determining whether the

necessary genes might be present in another genome and then assuming the complete network of

gene interaction to be conserved, to quantifying the expression of the corresponding genes under

comparable experimental conditions and verifying that the genes actually do behave in a similar

manner. The former is a crude but quick, cheap and easy approach, while the latter is more refined,

but work intensive, expensive and complicated. Data-mining available microarray data may provide

an intermediate solution to the problem. Microarray data repositories such as the Gene Expression

Omnibus (GEO) (Edgar 2002) provide a wealth of information about how an organism responds to

a wide variety of experimental conditions and may provide information about the expression of a

gene of interest in a species of interest under an experimental condition of interest.

Regardless of the approach used, it is necessary to know which genes can be compared between

organisms. In many cases, available gene annotation or best BLAST (Altschul 1997) hits are used.

However, gene annotation is not always correct or up to date and best BLAST hits do not always

correspond to the closest phylogenetic relative (Koski 2001). The orthology of genes, i.e. gene

copies that arose due to a speciation event, is the quintessential feature to look for when attempting

to compare genes or gene-products. The underlying assumption is that a gene in an emergent

species will continue to perform the same function it had in the ancestral species. Genes that arose

via duplication (i.e. paralogous genes) are a different matter. As two copies of the gene are present in

the genome of the organism, changes in one of the duplicates are less likely to lead to a

noticeable reduction in fitness and therefore more likely to be passed on to the next

generation. The paralogous genes we observe today were therefore less restrained in their ability to




change, be lost, be inactivated or evolve towards a new function. Alternatively, both of the

duplicates may have changed only slightly, each continuing to perform a subset of the original

gene's tasks or both may have remained fully functional, accumulating only minor changes in the

regulation of their expression to counteract potential dosage effects. This freedom of paralogs to

change is the main reason why comparison of paralogous genes is unlikely to be beneficial or

intended and cross-species comparisons should be confined to orthologous or co-orthologous genes.

A number of tools and databases exist that attempt to determine which genes are orthologous and

therefore comparable across organisms (for example COG (Tatusov 1997), OrthoMCL (Li 2003),

KOG (Tatusov 2003), Genome Clusters Database (Horan 2005), InParanoid (O'Brien 2005),

MultiParanoid (Alexeyenko 2006) and OrthologID (Chiu 2006)). Unfortunately, some of these

provide orthology assignments only for a very restricted set of species while others require

completed genomes to base their predictions on. Both limitations make these databases of little

use to researchers wanting to compare sequences from organisms for which completed

genomes are not yet available and that were not part of the select set of species that were included

in the databases. For such organisms, researchers generally have to rely on sequence similarity

searches to determine potential sequence orthologs in better described species. In addition, the

majority of the methods do not base their orthology predictions on phylogenetic trees but on other

clustering methods and only use phylogenies to visualize the results. Finally, none of the methods

provide an easy lookup of which Affymetrix sequences are comparable across chips, making an

additional mapping of Affymetrix exemplar sequences to predicted sequence orthologs necessary.

Our web-based software tool provides a quick and easy way of assessing the orthology of protein-

coding genes for a variety of plant microarray chips, irrespective of whether the genome of the

organism is completed or not. We focused on Affymetrix chips as the overwhelming majority of

microarray data present in public repositories is based on these (GEO (Edgar 2002)). These chips

generally provide a reasonable coverage of the transcriptome of an organism and the corresponding

sequence data is readily available. As many chips are designed and sold before the corresponding

organism is completely sequenced, there may be cases where sequences spotted on a chip are

thought no longer to be present in the genome or some genes in the genome may be represented

multiple times or missing on the chip. In contrast to other methods, we do not use ORFs predicted

from genomic data, but the sequences from which the probe sets for a given chip were derived,

hereafter referred to as either exemplar or consensus sequences. We thereby avoid problems arising

from inaccurate ORF prediction, genome sequences being revised and changed, as well as errors in

assigning the various probe sets to predicted genomic ORFs. For each of the consensus sequences,

we provide the results of sequence similarity searches against a number of sequence databases, a




Profile-Hidden-Markov-Model (HMM) representative of the sequence family, as well as a multiple

sequence alignment and phylogenetic tree for that family. An additional utility permits determining

sequence orthologs in a species of choice to the sequences present on an Affymetrix chip. A web-

interface is provided to PHAT, part of the PhyloGenie package (Frickey 2004), that allows the

repository of phylogenetic trees to be mined for trees corresponding to specific topological or

species constraints.

Construction and content:

The NCBI non-redundant protein database “nr” and 6-frame translations of the plant microarray

chip consensus sequences provided by Affymetrix provide the set of sequences we base our

predictions on. The 6-frame translations of the consensus sequences provide information as to what

proteins are represented on the various microarray chips. The “nr” database contains a wide variety

of species suitable as outgroups for the phylogenies as well as providing sequences that may have

failed to be included on the microarray chips of the various organisms. The latter are of special

importance as they provide critical data when attempting to assess whether two sequences are

orthologous or paralogous (Fig. 1).

PhyloGenie is used to automatically search for sequence homologs and infer phylogenetic trees for

all consensus sequences on a chip. This tool was originally developed to generate and analyze

phylomes in regards to gene duplications and lateral gene transfers and can be briefly described as

follows: A) Each microarray consensus sequence is compared against the above mentioned

databases using BLAST. The results of these sequence similarity searches are used to identify

potential sequence homologs. B) BLAST High-Scoring-Segment-Pairs (HSPs) with greater than

70% coverage of the query and E-values better than 1e-5 are extracted and aligned to one another.

These parameters were chosen lax enough to detect non-trivial sequence similarities yet stringent

enough to exclude high-scoring local similarities that would, by themselves, not warrant the

assignment of two sequences as being orthologous. The resulting alignment contains the sequence

regions we regard as homologous to the query. C) Hmmer (http://hmmer.janelia.org/) is used to

derive a HMM from this alignment and search the full-length sequences of all BLAST-HSPs with

E-values better than 1. Deriving a HMM from the above alignment gives a better representation of

the sequence family. Using this HMM to search against full-length sequences of even marginal

BLAST hits allows detection of more of the distant sequence homologs and better defines the start

and end of homologous sequence regions than a single BLAST search could. D) Sequence regions

matching the full-length HMM with E-values better than 1e-5 are combined to a multiple sequence

alignment. E) A phylogenetic tree with 100 bootstrap replicates is inferred from this alignment. Due




to limited computational resources, we use Neighbor-Joining (Saitou 1987) to infer phylogenies. All

intermediary files are made available so that the process can be followed from beginning to end and

alternative approaches, for example a different method of tree inference, could be used. The trees

are rooted at the phylogenetic node closest to the “Last Universal Common Ancestor”, as described

in the PhyloGenie manuscript (Frickey 2004).
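The thresholds used in steps B and C above can be illustrated with a minimal sketch. The `Hsp` record and its field names are hypothetical and are not taken from PhyloGenie; coverage is assumed here to mean the fraction of the query length spanned by the aligned segment.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """Hypothetical record for one BLAST high-scoring segment pair."""
    subject_id: str
    query_start: int  # 1-based, inclusive
    query_end: int    # 1-based, inclusive
    evalue: float


def query_coverage(hsp, query_length):
    """Fraction of the query spanned by the HSP (assumed definition)."""
    return (hsp.query_end - hsp.query_start + 1) / query_length


def select_for_alignment(hsps, query_length, min_coverage=0.70, max_evalue=1e-5):
    """Step B: keep HSPs with more than 70% coverage and E-value better than 1e-5."""
    return [h for h in hsps
            if query_coverage(h, query_length) > min_coverage
            and h.evalue < max_evalue]


def select_for_hmm_search(hsps, max_evalue=1.0):
    """Step C: subjects of all HSPs with E-value better than 1 are searched with the HMM."""
    return sorted({h.subject_id for h in hsps if h.evalue < max_evalue})


# Made-up hits against a query of length 100.
hsps = [
    Hsp("seqA", 1, 95, 1e-40),   # long, significant hit
    Hsp("seqB", 10, 60, 1e-20),  # significant but local (51% coverage)
    Hsp("seqC", 1, 90, 0.5),     # marginal hit
]
print([h.subject_id for h in select_for_alignment(hsps, query_length=100)])
print(select_for_hmm_search(hsps))
```

In this toy input only seqA seeds the alignment, while all three subjects are passed on to the HMM search, which mirrors the two-tier filtering described in the text.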

The set of trees generated by PhyloGenie provides the basis of our prediction of sequence

orthologs. The actual prediction requires a number of user-specified parameters and is performed

on-the-fly, allowing for a high degree of flexibility. Detection of sequence orthologs is based on the

number of nodes separating the query sequence, i.e. the sequence for which a tree was derived, from

sequences of any given species in the tree. In the following examples we assume that the user

selected the Arabidopsis thaliana ATH1-121501 chip and was attempting to find sequence

orthologs in Medicago truncatula.

Determining sequence orthologs is done in the following manner (Fig. 2). The number of nodes

separating each Medicago truncatula sequence (yellow) from the query (purple) is determined

(minimum number: 4, standard deviation: 2.87). An additional scaling factor (default: 0.5) allows the

user to specify the range in which they are willing to accept Medicago truncatula sequences as

potential sequence orthologs. Increasing this value causes the program to take into account more

distant sequence relatives as potential orthologs while decreasing this value causes the program to

focus on the most closely related sequences only. In the presented analysis, we used a value of 0.5

as this allowed us to determine orthologs for most of the chip sequences while not causing too many

of the query sequences to be assigned multiple orthologs in the other species. The distance within

which sequences are accepted as potential sequence orthologs is referred to as the “permissive

range” in this manuscript. The permissive range is calculated as the minimal number of nodes

separating the query sequence from a Medicago truncatula homolog in the tree plus the standard

deviation multiplied by the scaling factor. The standard deviation reflects the dispersal pattern of

Medicago truncatula sequences throughout the tree. The more clades in a tree contain Medicago

truncatula sequences, the greater the uncertainty about which of these clades contain sequences

orthologous to the query. We therefore use the standard deviation of the number of nodes separating

Medicago truncatula sequences from the query as a measure for how uncertain we are that the

sequences closest to each other, in number of nodes, really are the sequence orthologs. For the tree

shown in Figure 2, the permissive range (4 + 0.5 × 2.87 ≈ 5.4 nodes) is highlighted in green and encompasses all sequences less

than 6 nodes removed from the query. Affymetrix Arabidopsis thaliana ATH1-121501 sequences

less than 6 nodes removed from the query are regarded as sequence paralogs of the query

(here 260439_at). Medicago truncatula sequences within the permissive range are regarded as potential




sequence orthologs (Mtr.28509.1.S1_at, Mtr.17370.1.S1_at and Mtr.21922.1.S1_at).
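The permissive range described above is the minimum node distance plus the scaling factor times the standard deviation. The following minimal sketch reproduces that calculation. The function name and the example distance list are hypothetical, and the use of the population standard deviation is an assumption, as the text does not state which estimator is used.

```python
import statistics


def permissive_range(node_distances, scaling_factor=0.5):
    """Minimum node distance plus scaling_factor times the standard deviation."""
    minimum = min(node_distances)
    # Assumption: population standard deviation.
    spread = statistics.pstdev(node_distances)
    return minimum + scaling_factor * spread


# Values reported for the tree in Figure 2: minimum 4, standard deviation 2.87.
figure2_limit = 4 + 0.5 * 2.87
print(round(figure2_limit, 3))

# Made-up node distances from a query to five sequences of the target species.
distances = [4, 5, 5, 9, 11]
limit = permissive_range(distances)
accepted = [d for d in distances if d <= limit]
print(round(limit, 2), accepted)
```

With the Figure 2 values the limit is about 5.4, so only sequences at most 5 nodes (less than 6 nodes) from the query fall inside the permissive range.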

For each of the potential orthologs we subsequently perform a reverse lookup. We calculate the

minimum and standard deviation of the number of nodes separating each potential ortholog from

the Affymetrix Arabidopsis thaliana ATH1-121501 sequences present in the tree. As the minimum

and standard deviation are greatly influenced by the position in the tree of the sequence for which

the values are being calculated, the permissive ranges of the potential orthologs may be quite

different from one another. A red and blue line show the permissive ranges for two of our three

potential orthologs. The query sequence does not lie within the permissive range of

Mtr.21922.1.S1_at (blue line). This sequence is therefore removed from the set of potential

orthologs as it appears much more closely related to the Affymetrix Arabidopsis thaliana sequence

“257728_at” than to the query. Mtr.28509.1.S1_at (red line) and Mtr.17370.1.S1_at (not shown)

recover the query sequence in their permissive ranges and both are retained as sequence orthologs

to the query. Analysis of this tree therefore tells us that our query sequence “245641_at” has a

sequence paralog (260439_at) on the Affymetrix Arabidopsis thaliana ATH1-121501 chip and two

sequence orthologs (or co-orthologs) on the Affymetrix Medicago truncatula chip.
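The forward and reverse lookups can be sketched together as follows. This is an illustration under stated assumptions, not the AffyTrees implementation: node distances are assumed to be precomputed and supplied as a nested dictionary, the population standard deviation is assumed, and all sequence names and distances are made up.

```python
import statistics


def permissive_range(distances, scaling_factor):
    """Minimum distance plus scaling_factor times the (population) standard deviation."""
    return min(distances) + scaling_factor * statistics.pstdev(distances)


def predict_orthologs(query, node_dist, group_of, query_group, target_group,
                      scaling_factor=0.5):
    """Forward lookup from the query, then reverse lookup from each candidate."""
    targets = [s for s in group_of if group_of[s] == target_group]
    limit = permissive_range([node_dist[query][t] for t in targets], scaling_factor)
    candidates = [t for t in targets if node_dist[query][t] <= limit]

    own = [s for s in group_of if group_of[s] == query_group]
    orthologs = []
    for cand in candidates:
        back = permissive_range([node_dist[cand][s] for s in own], scaling_factor)
        # Keep the candidate only if it recovers the query in its own range.
        if node_dist[cand][query] <= back:
            orthologs.append(cand)
    return orthologs


# Hypothetical tree: query "q" and a second chip sequence "a2" (group "ATH1"),
# four target-species sequences "m1".."m4" (group "Mtr").
pairs = {("q", "m1"): 4, ("q", "m2"): 4, ("q", "m3"): 5, ("q", "m4"): 10,
         ("a2", "m1"): 8, ("a2", "m2"): 8, ("a2", "m3"): 1, ("a2", "m4"): 9}
node_dist = {}
for (a, b), d in pairs.items():
    node_dist.setdefault(a, {})[b] = d
    node_dist.setdefault(b, {})[a] = d

group_of = {"q": "ATH1", "a2": "ATH1",
            "m1": "Mtr", "m2": "Mtr", "m3": "Mtr", "m4": "Mtr"}
result = predict_orthologs("q", node_dist, group_of, "ATH1", "Mtr")
print(result)
```

In this toy tree, m3 passes the forward lookup but is rejected by the reverse lookup because it lies much closer to a2 than to the query, which parallels the treatment of Mtr.21922.1.S1_at in the example above.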

The aim of this tool is twofold: it offers a fully automated way of retrieving sequence orthologs for

microarray consensus sequences from a wide variety of species and provides the results of a

BLAST search, multiple sequence alignment and phylogenetic inference for every consensus

sequence on a chip. This allows manual validation of any dubious orthology predictions by

comparing the various intermediate results leading to the phylogeny against the corresponding

phylogenetic trees and alignments. In addition, the large number of alignments generated in the

process of constructing the phylogenies are a useful resource on which to base further analyses, as

they provide sets of aligned sequence homologs for every consensus sequence on a chip.

Utility:

User interface: The user interface has five webpages. The home page allows querying of individual

genes and links to the remaining pages, some help and supplemental data. The other four pages of

the interface deal with batch requests, analysis of chip phylomes, generating phylogenies for

sequences provided by the user and predicting sequence orthologs between the consensus sequences

represented on a chip and other species.

The results of an individual query are shown in Figure 3. Tabs at the top of the page allow

navigation between the results of a BLAST search (BLAST), alignment of high-scoring HSPs

(CLN), the derived HMM (HMM), results of the HMM-search (HMS), alignment of high-scoring

HMM-hits (HLN) and either a textual or applet-based representation of a Neighbor-Joining tree




(TRE). The tabs allow the user to retrace every step leading from query sequence to phylogeny and

are very useful to gain a better understanding of why two genes were regarded as homologous,

included in the same tree, or predicted to be sequence orthologs. To facilitate interpretation of batch

requests and complete phylome analyses, intermediate pages can be generated that gather the

results, order them and link to the results pages of the various genes. Prediction of sequence

orthologs between microarray chip consensus sequences and a species of choice generates a tab-

delimited list containing information about which sequences on the chip could be assigned sequence

orthologs in another species, which sequences should be regarded as co-orthologous or paralogous,

and which other homologous sequences were present in the phylogenies but could not be assigned a

more precise relationship.

Supplemental data, providing further information about the programs used, the individual steps

performed to generate the data as well as the parameters the user can tweak, is available at

http://bioinfoserver.rsbs.anu.edu.au/utils/affytrees/help.php. Results of phylome analyses, custom

phylogenetic trees and orthology predictions are stored for a week and can be accessed by referring

to the job identifier provided in the results.

This tool differs from other databases and programs in a number of ways. It provides the data on

which tree inference and orthology prediction are based and thereby allows the user to re-trace each

step of the decision process. Our trees include sequences from the “nr” database which greatly

facilitates correct rooting and interpretation. In addition, this allows us to potentially detect

sequence orthologs for any species represented in “nr” instead of being limited to those species for

which complete genomes or proteomes are available. The use of a user-defined “scaling factor”

avoids problems co-orthologous genes cause for approaches relying solely on reciprocal best hits

between genomes. If, for example, a species has a gene of interest, gene A, that was duplicated in

another species, giving rise to genes B and B', reciprocal best hit approaches may identify genes A

and B or A and B' as reciprocal best hits and assign them as sequence orthologs. However, if A

appears most similar to B but B' appears most similar to A, a possible scenario if non-symmetric

scoring schemes are used, such as employed by BLAST, then no reciprocal best hits can be

determined and no sequence orthologs are assigned. All of the above cases produce an incorrect

assignment of gene orthology, as B and B' are co-orthologous to A (i.e. duplicates derived from a

gene that was orthologous to A) and should be treated as such.
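The first of these cases can be made concrete with a small sketch of a reciprocal best hit procedure. The scores are invented for illustration and the function names are hypothetical; the sketch only shows why one of two co-orthologs is discarded.

```python
def best_hit(scores, gene):
    """Return the highest-scoring subject for `gene`, or None if it has no hits."""
    hits = scores.get(gene, {})
    return max(hits, key=hits.get) if hits else None


def reciprocal_best_hits(forward, reverse):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    pairs = []
    for a in forward:
        b = best_hit(forward, a)
        if b is not None and best_hit(reverse, b) == a:
            pairs.append((a, b))
    return pairs


# Made-up similarity scores: gene A in one species, duplicates B and B' in the other.
forward = {"A": {"B": 200, "B'": 190}}
reverse = {"B": {"A": 200}, "B'": {"A": 190}}
rbh = reciprocal_best_hits(forward, reverse)
print(rbh)
```

Only the pair (A, B) is reported, although B' is equally co-orthologous to A. A tree-based permissive range can retain both duplicates.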

Another part of this tool allows the user to search through the trees of a given species or chip for

those corresponding to specific topological selection criteria. For example, to find all trees in which

a clade contains at least one Medicago truncatula and Arabidopsis thaliana sequence, but no

sequences from the Arabidopsis thaliana ATH1-121501 chip, the selection string “((Medicago




truncatula & Arabidopsis thaliana) & !Arabidopsis ATH1-121501)” could be used. Trees containing

such clades could identify sequences present in Medicago truncatula, the orthologs of which cannot

be measured using the Affymetrix Arabidopsis thaliana ATH1-121501 chip, as no sequence

orthologs are present on that chip. As an example of such a case (Figure 4), we show a tree derived

for a hypothetical protein from Medicago truncatula, the ortholog of which was not included on the

ATH1-121501 chip even though orthologous sequences are present in the Arabidopsis thaliana

genome as well as throughout the plant, fungal and animal kingdoms.
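The logic of such a selection criterion can be sketched as a test applied to every clade of a tree. The nested-tuple tree representation, the leaf label format and the function names below are assumptions made for illustration; they are not the PHAT implementation or its query syntax.

```python
def leaves(clade):
    """Yield the leaf labels of a nested-tuple tree; a leaf is a string."""
    if isinstance(clade, str):
        yield clade
    else:
        for child in clade:
            yield from leaves(child)


def clades(tree):
    """Yield every internal node (clade) of the tree, including the root."""
    if not isinstance(tree, str):
        yield tree
        for child in tree:
            yield from clades(child)


def matches(clade):
    """At least one Medicago and one Arabidopsis sequence, but no ATH1-121501 chip sequence."""
    labels = list(leaves(clade))
    has_medicago = any(l.startswith("Medicago truncatula") for l in labels)
    has_arabidopsis = any(l.startswith("Arabidopsis thaliana") for l in labels)
    has_chip = any(l.startswith("Arabidopsis ATH1-121501") for l in labels)
    return has_medicago and has_arabidopsis and not has_chip


# Hypothetical trees; leaf labels are "source|identifier".
tree_hit = ((("Medicago truncatula|m1", "Arabidopsis thaliana|nr1"),
             "Oryza sativa|o1"),
            "Arabidopsis ATH1-121501|p1")
tree_miss = (("Medicago truncatula|m1", "Arabidopsis ATH1-121501|p1"),
             "Arabidopsis thaliana|nr1")

print(any(matches(c) for c in clades(tree_hit)))
print(any(matches(c) for c in clades(tree_miss)))
```

The first tree contains a clade that groups a Medicago sequence with an Arabidopsis sequence from "nr" while excluding the chip sequence, so it is selected; the second tree is not.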

Future developments include, as a first step, extending this tool beyond the currently available 7

chips to include all publicly available Affymetrix plant microarray chips. Since this system is not

limited as to what species can be analyzed, provided some sequence information for the species is

available, it is conceivable that the system may be extended to cover all available Affymetrix

microarray chips. Beyond that, the aim will be to develop and implement methods that further

facilitate comparative analysis of microarray expression data across species.

Results and discussion:

To determine whether the AffyTrees orthology predictions were less accurate than, comparable to or more

accurate than reciprocal best BLAST hits, the most widely used method to identify sequence

orthologs, we compared the orthology predictions generated by both methods. Phylogenetically

orthologous sequences are generally expected to fulfill the same function in different species and

functionally orthologous sequences are expected to be similarly expressed across different species.

Therefore, phylogenetic orthologs can be expected to show a certain degree of similarity in their

expression across species. We based our comparison on prediction of sequence orthologs between

the Arabidopsis thaliana ATH1-121501 and Medicago truncatula Affymetrix chips. These species

were chosen specifically because sets of comparable microarray experiments were available and

provided us with the opportunity to test whether and how well sequence orthology, as predicted by

reciprocal best BLAST hits and AffyTrees, was reflected in similarity of expression.

The results of comparing the orthology predictions for these two microarray chips are shown in

Figure 5A. BLAST produced many more reciprocal best hits (7025) than AffyTrees predicted

orthologs (5793). Of these, 2926 predictions of sequence orthologs coincided, 4099 orthology

predictions were unique to the reciprocal best BLAST hits and 2867 orthology predictions were

unique to AffyTrees. Even though BLAST produced nearly 30% more orthology predictions, fewer

individual sequences were assigned an ortholog in BLAST than in AffyTrees. This was due to many

of the BLAST hits having multiple ortholog assignments. On average, each Medicago truncatula

chip sequence was assigned 1.78 Arabidopsis thaliana chip sequences as reciprocal best BLAST




hits and every Arabidopsis thaliana chip sequence was assigned 1.57 Medicago truncatula chip

sequences. This artificially inflated the number of “orthology” predictions provided by BLAST.

Dividing the number of exclusively BLAST-based reciprocal best hits (4099) by the average number of assignments per sequence for each

species gives us the number of individual genes for each species that could be assigned at least one

ortholog in the other species: the exclusively BLAST based predictions assigned 2303 sequences

from Medicago one or more orthologs in Arabidopsis and 2611 sequences in Arabidopsis could be

assigned one or more orthologs in Medicago. The exclusively AffyTrees based predictions assigned

2515 Medicago sequences orthologs in Arabidopsis and 2537 Arabidopsis sequences orthologs in

Medicago; 138 more sequences than assigned by reciprocal best BLAST hits.
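The counts reported above are mutually consistent, which can be checked with a few lines of arithmetic. The variable names are ours; all numbers are taken from the text.

```python
# Prediction counts reported in the text.
blast_total, affytrees_total, shared = 7025, 5793, 2926

blast_only = blast_total - shared          # predictions unique to BLAST
affytrees_only = affytrees_total - shared  # predictions unique to AffyTrees

# Average number of reciprocal best BLAST hits assigned per chip sequence.
per_medicago, per_arabidopsis = 1.78, 1.57

medicago_blast = round(blast_only / per_medicago)
arabidopsis_blast = round(blast_only / per_arabidopsis)
blast_genes = medicago_blast + arabidopsis_blast

# Sequences assigned orthologs by the exclusively AffyTrees-based predictions.
affytrees_genes = 2515 + 2537

print(blast_only, affytrees_only)
print(medicago_blast, arabidopsis_blast, blast_genes)
print(affytrees_genes, affytrees_genes - blast_genes)
```

The check recovers 4099 and 2867 unique predictions, 2303 and 2611 sequences for BLAST (4914 in total), 5052 sequences for AffyTrees, and the stated difference of 138.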

To determine which of the methods provided a more accurate orthology prediction, we compared

the expression of predicted sequence orthologs in two sets of microarray experiments, one for

Arabidopsis thaliana (Schmid 2005) and one for Medicago truncatula (Benedito et al., Medicago

Gene Atlas, manuscript in preparation, ArrayExpress accession: E-MEXP-1097). The expression of

genes was compared across 7 tissue types: stems, petioles, leaves, vegetative buds, flowers, roots

and seeds. Different laboratories generated the data, and differences in harvesting, preparation,

experimental procedure, growth conditions and of course the plants themselves, undoubtedly will

have affected the experiments and provide ample explanation for why some sequence orthologs

might not be correlated in their expression in these two species. Therefore, we did not expect all

sequence orthologs to show a strong positive correlation in their expression, but a general positive

trend in correlation was certainly expected. However, our aim was not to show that sequence

orthologs share similar expression patterns, but to use the available expression data to assess the

accuracy of the two prediction methods.

Accepting the 2926 orthology assignments both BLAST and AffyTrees agreed upon as “true”

orthologs, we used the Pearson (linear) correlation coefficient of the expression values to measure

the co-expression of all predicted ortholog pairs. The histogram in Figure 5B shows the number of

predicted ortholog pairs for a given correlation coefficient as well as a fitted scaled extreme-value

distribution (EVD). Most of the predicted ortholog pairs produced positive correlation

coefficients, supporting our expectation that sequence orthologs, in general, should show similar

expression across different organisms. In addition, the graph provides us with a means of testing the

accuracy of reciprocal best BLAST hits and AffyTrees orthology predictions as seen in Figure 5C.

Rather than comparing histograms directly, we approximated the histograms by a distribution with a

small number of parameters to facilitate comparison of multiple datasets. The EVD approximates

the various histograms depicted in Figure 5 quite well. The more accurate the set of orthologs

predicted by each method, the better the corresponding fitted EVD should approximate the EVD




derived from our set of 2926 “true” orthologs.
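The co-expression measure used here is the Pearson correlation coefficient of the expression values across the seven tissue types. A minimal sketch is given below; the expression values are invented for illustration and the function name is ours.

```python
import math


def pearson(x, y):
    """Pearson (linear) correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


TISSUES = ["stems", "petioles", "leaves", "vegetative buds",
           "flowers", "roots", "seeds"]

# Made-up expression values for one predicted ortholog pair, one value per tissue.
arabidopsis = [5.0, 5.5, 9.0, 6.0, 4.0, 2.0, 3.0]
medicago = [10.0, 11.0, 18.0, 12.0, 8.0, 4.0, 6.0]      # same profile, doubled
inverted = [-v for v in arabidopsis]                     # opposite profile

print(round(pearson(arabidopsis, medicago), 3))
print(round(pearson(arabidopsis, inverted), 3))
```

Because the coefficient is insensitive to scale, a pair with the same tissue profile but different absolute expression levels scores close to 1, whereas an opposite profile scores close to -1.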

We then compared the sets of genes for which sequence orthologs could only be predicted by either

BLAST or AffyTrees. Whenever one gene was assigned multiple sequence orthologs, we averaged

their correlation coefficients to reflect that the method generating the prediction could not decide in

more detail which of the predicted orthologs should be used. 4914 genes were assigned sequence

orthologs only in reciprocal BLAST hits and 5052 genes were assigned sequence orthologs only in

AffyTrees. The graphs of the histograms and fitted EVDs for these sets of genes are shown in

Figure 5C. Both BLAST and AffyTrees were able to predict orthologs for similar numbers of genes;

however, the maximum of the BLAST-EVD lies at 0.47, while the maximum of the AffyTrees-EVD

lies at 0.66. The EVD based on the AffyTrees predictions also better approximates the EVD based

on the set of “true” orthologs. Taking the median of the correlation coefficients as the comparison

metric leads to similar results (Figure 5B-D). Bootstrap sampling of the BLAST and AffyTrees

distributions (10000 samples, 1000 replicates) showed the median values of the distributions to be

very resilient to change. The probability of generating a randomly sampled distribution with the

median value observed in the other method was, in both cases, very low (BLAST: 2.1e-36,

AffyTrees: 6.2e-26). Both the median values of the distributions and the maxima of the fitted

EVDs show that the histogram of the AffyTrees predictions (blue) is more similar to the histogram

of the “true” orthologs (green) than the histogram of the best BLAST-based predictions (yellow) is

to the “true” orthologs. This points to the AffyTrees predictions being more reliable than the

predictions based on best BLAST hits.
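The bootstrap procedure can be sketched as repeated resampling with replacement followed by recomputation of the median. How the reported probabilities were derived from the resampled medians is not described in the text, so the sketch stops at the distribution of medians. We assume that each replicate draws a fixed number of values; the input values are randomly generated placeholders, and the demonstration uses smaller sizes than the 10000 samples and 1000 replicates of the analysis.

```python
import random
import statistics


def bootstrap_medians(values, sample_size, replicates, seed=0):
    """Median of each of `replicates` samples drawn with replacement from `values`."""
    rng = random.Random(seed)
    return [statistics.median(rng.choices(values, k=sample_size))
            for _ in range(replicates)]


# Placeholder correlation coefficients standing in for one set of predictions.
rng = random.Random(1)
coefficients = [rng.uniform(-1.0, 1.0) for _ in range(500)]

medians = bootstrap_medians(coefficients, sample_size=1000, replicates=100)
print(len(medians))
print(round(min(medians), 3), round(max(medians), 3))
```

A narrow spread of the resampled medians relative to the difference between the two methods is what makes the observed medians resilient to resampling.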

However, it was recently shown that GCRMA (Wu 2004) normalization can lead to overprediction

of correlated genes (Lim 2007). To see whether this was affecting our results, we repeated the above

analysis using MAS5 (Hubbell 2002) normalized data. The median values of the resulting

distributions were 0.417 for our set of 'true' orthologs, 0.339 for the AffyTrees orthologs, 0.275 for

the BLAST predictions, 0.267 for AffyTrees homologs and 0.018 for random sequence pairs. These

values are similar to those calculated based on the GCRMA normalized data, indicating that,

although GCRMA normalization does seem to increase the median value of the distributions, the

increase is slight and no qualitative difference in how the methods compare to one another is

apparent.

In an attempt to determine why the BLAST-based predictions fared poorly, we examined how various

modes of orthology assignment influence the fitted EVD. We show the histograms and fitted EVD

for two further datasets (Figure 5D). The first set was generated by randomly pairing sequences

from within our set of “true” orthologs (black) and the second by accepting all sequence homologs

present in the AffyTrees phylogenies as sequence orthologs (pink). These phylogenies provide a




large number of groupings of homologous sequences. We know a large number of the trees to

contain paralogous sequences and mis-assigning sequence paralogs as orthologs is one of the key

difficulties in accurately detecting sequence orthologs. The graphs show that an EVD fitted to the

random orthology assignments (black) has its maximum close to zero. Indiscriminately assigning all

sequence homologs present in a tree as sequence orthologs generates many more orthology

predictions, as visible by the increased amplitude of the EVD. However, the maximum of the fitted

EVD is close to 0.5, well below the 0.68 maximum we determined for the EVD of the set of “true”

orthologs (green). We therefore expect the maximum of EVDs fitted to various methods of

orthology assignment, for this dataset, to lie between 0 and 0.7. The closer the maximum lies to 0.7 or

above, the better the prediction method is likely to be. Not differentiating between “orthology” and

“homology”, thereby causing too many sequences to be assigned as sequence orthologs, shifts the

maximum of the fitted EVD to around 0.5. BLAST-based predictions more frequently assigned

multiple sequence orthologs to genes than the AffyTrees predictions. This might explain why the

maximum of the BLAST-EVD lies at 0.47. The best BLAST approach, while quite suited to

detecting sequence homologs, therefore does not appear very accurate when used to distinguish

between sequence orthologs and other homologs. The AffyTrees method, in contrast, appears far

better at reliably determining orthologous sequences.
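The comparison above rests on where the maximum of a fitted EVD lies. The following is a minimal sketch of how such a peak position can be estimated. It is not the procedure used for Figure 5: the text does not state which EVD family or fitting method was applied, so the sketch assumes a left-skewed Gumbel (minimum) distribution fitted by the method of moments. The function names and the synthetic sample are hypothetical and serve only as illustration.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_min(values):
    """Method-of-moments fit of a Gumbel (minimum) distribution.

    Returns (location, scale). For this family the location is also the
    mode, i.e. the position of the peak of the fitted density.
    Assumption: the EVD family and the fitting method are chosen for
    illustration only.
    """
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    scale = stdev * math.sqrt(6) / math.pi   # from var = pi^2 * scale^2 / 6
    location = mean + EULER_GAMMA * scale    # from mean = location - gamma * scale
    return location, scale


def sample_gumbel_min(location, scale, n, rng):
    """Draw n values by inverting the Gumbel (minimum) CDF."""
    out = []
    for _ in range(n):
        u = rng.random()
        while u == 0.0:                      # avoid log(0) in the inverse CDF
            u = rng.random()
        out.append(location + scale * math.log(-math.log(1.0 - u)))
    return out


# Synthetic values standing in for correlation coefficients (not real data).
rng = random.Random(1)
synthetic = sample_gumbel_min(location=0.68, scale=0.2, n=20000, rng=rng)
peak, scale = fit_gumbel_min(synthetic)
print(f"estimated peak: {peak:.3f}, estimated scale: {scale:.3f}")
```

Under these assumptions, applying `fit_gumbel_min` to the correlation coefficients of each prediction method would yield one peak position per method, which could then be compared in the way described above.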

Conclusions:

AffyTrees provides a repository of phylogenetic trees inferred from every consensus sequence

represented on a variety of Affymetrix plant microarray chips. This repository can be used to gain

insights into the relationship of sequence homologs, improve annotation data or automatically

generate a list of sequence orthologs between a species and the consensus sequences represented on

a specific microarray chip. The inclusion of sequences from the “nr” database and our method of

detecting sequence orthologs circumvent the problems reciprocal best hit approaches have when

dealing with co-orthologous genes. For sequences represented on Affymetrix plant microarray

chips, AffyTrees can identify sequence orthologs present on other Affymetrix plant microarray

chips, as well as sequence orthologs present in the “nr” database.

The ability to filter chip phylomes for specific selection criteria allows discrepancies or systematic

biases between the sequence complements of chips and the corresponding genomes to be detected.

Affymetrix chips were designed to measure the transcription of genes and therefore are biased

towards highly expressed and protein coding genes. This is a known and useful bias of these chips.

However, other biases, for example systematic preference for long or short sequences, differences in

the EST-libraries on which the chips were based or differences in the ability to successfully predict



short genes in different species, will have affected which sequences were included on a chip and

thereby influence the results.

We provide a means of comparing the sequence complement of microarray chips to the publicly

available sequence data of the corresponding organism as well as to the microarrays of other

species. Robust ways of assessing sequence orthologs and knowledge about systematic differences

in the sequence complement of various chips are prerequisites to making cross-species analyses of

microarray expression data feasible. Without knowledge of the sequence orthologs present on other

microarray chips, there is no way of determining which probe sets are comparable across chips.

Similarly, without a way of estimating sequence biases or genes missing on a chip, the conclusions

drawn from presence or absence of groups of genes derived from expression data are likely to be

flawed.

We show, to the extent that the limitations of the available experimental data permitted, that the

majority of genes predicted to be orthologous show similar expression across the two examined

species. We also show that AffyTrees is able to assign sequence orthologs to more genes than a

comparable approach relying on reciprocal best BLAST hits and, by comparing the expression of

predicted sequence orthologs, that the AffyTrees orthologs appear more reliable than the BLAST-

based predictions.

AffyTrees provides predictions of sequence orthologs for a wide variety of species with greater

accuracy than reciprocal best BLAST hits. Combined with the available phylogenetic trees,

sequence alignments and additional utilities, AffyTrees should provide a useful resource for

comparative analyses of transcriptomes and proteomes.

Methods:

The sequences we based our sequence-similarity searches on originated from either the “nr”

database, downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz), or from 6-

frame translations of exemplar sequences for a variety of Affymetrix chips. The nucleotide exemplar

sequences were downloaded, after registration, from the Affymetrix website by following the links

to the various species (http://www.affymetrix.com/support/technical/byproduct.affx?cat=exparrays).

BLAST searches were performed against the NCBI non-redundant protein database “nr” and 6-

frame translation of consensus sequences for the Affymetrix microarray chips ATH1-121501,

AtGenome1, Barley1, Citrus, Cotton, Grape, Maize, Medicago, Poplar, Rice, Soybean, Sugar Cane,

Tomato and Wheat. The BLAST results for sequences represented on the Arabidopsis thaliana

ATH1-121501 and Medicago truncatula chips were retrieved via the AffyTrees web-interface.

Putative sequence orthologs between Medicago truncatula and Arabidopsis thaliana sequences



were predicted as described above (scaling factor = 0.5) based on the phylogenies provided by

AffyTrees. To keep the results as comparable as possible, the same cutoffs used to generate the

phylogenies (i.e. >70% coverage of the query and E-values better than 1e-5) were used as a lower

limit for analysis of the reciprocal best BLAST hits. BLAST hits that did not satisfy these cutoffs

were not taken into account. In cases where multiple BLAST hits had identical best E-values, all of

these best hits were taken into account. This made it possible for some genes to be assigned

multiple reciprocal best BLAST hits. The method of orthology prediction we describe allows genes

in one species to be assigned multiple orthologs in another. In such cases, all of the predicted

sequence orthologs were taken into account. A noticeable discrepancy was apparent in the number

of predicted sequence orthologs compared to the number of reciprocal best BLAST hits. To keep

both approaches of detecting sequence orthologs as comparable as possible, we compared

reciprocal AffyTrees orthologs to the reciprocal best BLAST hits. This allowed both methods to use

“reciprocality” as a further criterion to reduce the number of false positive orthology predictions.
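The handling of cutoffs and tied best hits described above can be sketched as follows. This is an illustrative reconstruction, not the code used for the analysis: the tuple layout of a BLAST hit, the function names and the sequence identifiers are assumptions made for the example. Only the cutoffs (more than 70% query coverage, E-value better than 1e-5) and the rule of keeping all tied best hits are taken from the text.

```python
from collections import defaultdict

# Cutoffs quoted in the text: >70% query coverage and E-value better than 1e-5.
MIN_COVERAGE = 0.70
MAX_EVALUE = 1e-5


def best_hits(hits):
    """Map each query to the set of subjects sharing its best (lowest) E-value.

    `hits` is an iterable of (query, subject, evalue, coverage) tuples;
    this layout is an assumption made for the example.
    """
    best = {}                      # query -> lowest E-value seen so far
    partners = defaultdict(set)    # query -> subjects tied at that E-value
    for query, subject, evalue, coverage in hits:
        if coverage <= MIN_COVERAGE or evalue >= MAX_EVALUE:
            continue               # hit fails the cutoffs and is ignored
        if query not in best or evalue < best[query]:
            best[query] = evalue
            partners[query] = {subject}
        elif evalue == best[query]:
            partners[query].add(subject)   # tie: keep every best hit
    return dict(partners)


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return the set of (a, b) pairs that are best hits in both directions."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {
        (a, b)
        for a, subjects in best_ab.items()
        for b in subjects
        if a in best_ba.get(b, set())
    }


# Toy data with hypothetical identifiers.
medicago_vs_arabidopsis = [
    ("Mt1", "At1", 1e-50, 0.95),
    ("Mt1", "At2", 1e-50, 0.90),   # tied best E-value
    ("Mt2", "At3", 1e-3, 0.99),    # E-value too weak
    ("Mt3", "At3", 1e-20, 0.60),   # coverage too low
]
arabidopsis_vs_medicago = [
    ("At1", "Mt1", 1e-48, 0.92),
    ("At2", "Mt1", 1e-45, 0.88),
    ("At3", "Mt2", 1e-30, 0.97),
]
pairs = reciprocal_best_hits(medicago_vs_arabidopsis, arabidopsis_vs_medicago)
print(sorted(pairs))
```

In this toy example the gene Mt1 is assigned two reciprocal best hits because both of its best hits share the same E-value, which mirrors the situation described above in which a gene can receive multiple reciprocal best BLAST hits.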

For each plant species, the Affymetrix CEL files of the experiments we wanted to compare were

normalized using both GCRMA (Wu 2004) and MAS5 (Hubbell 2002) for comparison. All

experimental files for a species were normalized at the same time, as normalizing each set of

experiments individually would have artificially increased the differences observed between the

experimental conditions. Linear correlation coefficients were calculated using the average

expression value of each gene over the three available experimental replicates.
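The correlation step can be illustrated with a short sketch. It assumes that the expression data of a gene are organised as a list of experimental conditions, each holding its replicate measurements, and that the "linear correlation coefficient" is the Pearson coefficient. The data layout, function names and values are hypothetical.

```python
import math


def replicate_means(profile):
    """Average the replicate measurements within each experimental condition.

    `profile` is a list of conditions, each a list of replicate values.
    """
    return [sum(reps) / len(reps) for reps in profile]


def pearson(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    if denom == 0:
        raise ValueError("correlation undefined for a constant profile")
    return sum(a * b for a, b in zip(dx, dy)) / denom


# Hypothetical expression values: 4 conditions x 3 replicates per gene.
gene_a = [[1.0, 1.2, 0.8], [2.1, 1.9, 2.0], [3.0, 3.3, 2.7], [4.0, 4.1, 3.9]]
gene_b = [[2.0, 2.2, 1.8], [4.0, 4.2, 3.8], [6.1, 5.9, 6.0], [8.0, 7.9, 8.1]]

r = pearson(replicate_means(gene_a), replicate_means(gene_b))
print(round(r, 3))
```

Averaging the replicates first means that each condition contributes a single value per gene, so the coefficient reflects the similarity of expression across conditions rather than replicate noise.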

Availability and requirements:

The tool is freely accessible at http://bioinfoserver.rsbs.anu.edu.au/utils/affytrees/. Further

information and help are available at http://bioinfoserver.rsbs.anu.edu.au/utils/affytrees/help.php.

JavaScript should be enabled in the browser and a Java 1.5 or later browser plugin should be

installed for visualization of phylogenetic trees.

Acknowledgements:

This research was funded by an Australian Research Council Centre of Excellence grant. Funding to

pay for the publication charges was provided by the same grant.



Literature cited

Alexeyenko A, Tamas I, Liu G, Sonnhammer EL (2006) Automatic clustering of orthologs and

inparalogs shared by multiple proteomes. Bioinformatics. 22:e9-15.

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped

BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids

Res. 25:3389-3402.

Chiu JC, Lee EK, Egan MG, Sarkar IN, Coruzzi GM, DeSalle R (2006) OrthologID: automation of

genome-scale ortholog identification within a parsimony framework. Bioinformatics. 22:699-707.

Edgar R, Domrachev M, Lash AE (2002) Gene Expression Omnibus: NCBI gene expression and

hybridization array data repository. Nucleic Acids Res. 30:207-210.

Frickey T, Lupas AN (2004) PhyloGenie: automated phylome generation and analysis. Nucleic

Acids Res. 32:5231-5238.

Horan K, Lauricha J, Bailey-Serres J, Raikhel N, Girke T (2005) Genome cluster database. A

sequence family analysis platform for Arabidopsis and rice. Plant Physiol. 138:47-54.

Hubbell E, Liu WM, Mei R (2002) Robust estimators for expression analysis. Bioinformatics

18:1585-1592.

Johnson X, Brcich T, Dun EA, Goussot M, Haurogne K, Beveridge CA, Rameau C (2006)

Branching genes are conserved across species. Genes controlling a novel signal in pea are

coregulated by other long-distance signals. Plant Physiol. 142:1014-1026.

Koski LB, Golding GB (2001) The closest BLAST hit is often not the nearest neighbor. J Mol Evol.

52:540-542.

Li L, Stoeckert CJ Jr, Roos DS (2003) OrthoMCL: identification of ortholog groups for eukaryotic

genomes. Genome Res. 13:2178-2189.

Lim WK, Wang K, Lefebvre C, Califano A (2007) Comparative analysis of microarray

normalization procedures: effects on reverse engineering gene networks. Bioinformatics 23:282-288.

O'Brien KP, Remm M, Sonnhammer EL (2005) Inparanoid: a comprehensive database of eukaryotic

orthologs. Nucleic Acids Res. 33:D476-480.



Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing

phylogenetic trees. Mol Biol Evol. 4:406-425.

Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M, Scholkopf B, Weigel D,

Lohmann JU (2005) A gene expression map of Arabidopsis thaliana development. Nat. Genet.

37:501-506.

Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science

278:631-637.

Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM,

Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S,

Wolf YI, Yin JJ, Natale DA (2003) The COG database: an updated version includes eukaryotes.

BMC Bioinformatics 4:41.

Wu Z, Irizarry RA, Gentleman R, Murillo FM, Spencer F (2004) A Model Based Background

Adjustment for Oligonucleotide Expression Arrays. Technical Report. Johns Hopkins University,

Department of Biostatistics Working Papers, Baltimore, MD.



Figure legends

Figure 1:

An ancestral gene undergoes a duplication and gives rise to two paralogous genes A and B. Some

time later a speciation event gives rise to two species (light and dark). Each of these has retained

both paralogs in its genome, but only gene A for the dark species and gene B' for the light species are

included on the chip. Simple pairwise comparison of the chip sequences alone would predict A

(dark) and B' (light) to be sequence orthologs as these would appear to be reciprocal closest

relatives. Including additional sequence data, such as sequences of outgroup species or the

sequences A' (light) and B (dark) missing on the chips but present in the genomes of the light and

dark species, can help clarify relationships and allow unambiguous assignment of sequence orthologs.

Figure 2:

Determining sequence orthologs based on the number of nodes separating them from the query.

This example provides a case where multiple clades containing both Medicago truncatula and

closely related Arabidopsis thaliana homologs are present. Sequences from the Arabidopsis

thaliana microarray chip ATH1-121501 are highlighted in blue, the query sequence for which this

tree was computed is highlighted in magenta and sequences from the Medicago truncatula

microarray chip are highlighted in yellow. The “permissive range” for the query is shown with a

colored background (green); red and blue lines, above and below the tree, respectively, show the

permissive range for the reverse lookup for two of the three potential sequence orthologs. Circles

show which of the Arabidopsis thaliana ATH1-121501 sequences were recovered in the respective

reverse lookups.

Figure 3:



Screenshot of results using the Arabidopsis thaliana ATH1-121501 chip consensus sequence

261590_at as a query. Part of the corresponding phylogenetic tree is displayed. Red (dark) dots

highlight Medicago truncatula sequences, yellow (light) dots highlight ATH1-121501 sequences

and a blue dot (bottom) highlights the query sequence. The tabs at the top of the page allow

navigation between BLAST results (BLAST), the alignment of HSPs (CLN), the derived HMM

(HMM), the HMM-search results (HMS), the alignment from which the phylogeny is inferred

(HLN) and either a text or graphical representation of the phylogenetic tree (TRE).

Figure 4:

Phylogenetic tree of a protein coding gene present in a wide variety of eukaryotes that is not

represented on either of the Affymetrix Arabidopsis thaliana chips. This is recognizable by the

sequence identifiers. The Arabidopsis thaliana sequences (yellow, light dot) have NCBI gi-numbers

instead of Affymetrix identifiers, signifying that these sequences were taken from the “nr” database

and not one of the 6-frame translations of the microarray chip consensus sequences. The bottom-

most sequence is the Medicago truncatula query sequence for which this tree was generated. Other

Medicago truncatula sequences are highlighted with a red (dark) dot.

Figure 5:

A) Overlap of reciprocal best BLAST hits (yellow) with AffyTrees orthology predictions (blue).

B) Histogram and fitted EVD of ortholog pairs predicted by both BLAST and AffyTrees over the

correlation coefficient of their expression values across the microarray experiments. For comparison

purposes, the fitted EVD curve (green) for this data is represented in 5C and 5D as well. A vertical

dotted line is placed at the peak of the EVD and the correlation coefficient at which the peak is

found is stated in black numbers at the bottom. The median value of each dataset is marked in the

top left corner.

C) Histogram and fitted EVD of the genes assigned orthologs in either BLAST (yellow) or

AffyTrees (blue) over the average correlation coefficient of the assigned orthologs.

D) Histogram and fitted EVD over the average correlation coefficient for genes assigned orthologs

randomly (black) or by indiscriminately using any sequences present in the AffyTrees phylogenies

as orthologs (magenta).


