From Population to Single Cells: Deconvolution of Cell-cycle Dynamics
by
Xin Guo
Department of Computer Science
Duke University
Date:
Approved:
Alexander J. Hartemink, Supervisor
Pankaj K. Agarwal
Uwe Ohler
Steven B. Haase
Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University
2012
Copyright © 2012 by Xin Guo
All rights reserved except the rights granted by the Creative Commons Attribution-Noncommercial License
Abstract
The cell cycle is one of the fundamental processes in all living organisms, and all
cells arise from the division of existing cells. To better understand the regulation of
the cell cycle, synchrony experiments are widely used to monitor cellular dynamics
during this process. In such experiments, a large population of cells is generally
arrested or selected at one stage of the cycle, and then released to progress through
subsequent division stages. Measurements are then taken in this population at a
variety of time points after release to provide insight into the dynamics of the cell
cycle. However, due to cell-to-cell variability and asymmetric cell division, cells in
a synchronized population lose synchrony over time. As a result, the time-series
measurements from the synchronized cell populations do not accurately reflect the
underlying dynamics of cell-cycle processes.
In this thesis, we introduce a deconvolution algorithm that learns a more accu-
rate view of cell-cycle dynamics, free from the convolution effects associated with
imperfect cell synchronization. Through wavelet-basis regularization, our method
sharpens signal without sharpening noise, and can remarkably increase both the
dynamic range and the temporal resolution of time-series data. Though it can be
applied to any such data, we demonstrate the utility of our method by applying
it to a recent cell-cycle transcription time course in the eukaryote Saccharomyces
cerevisiae. We show that our method more sensitively detects cell-cycle-regulated
transcription, and reveals subtle timing differences that are masked in the original
population measurements. Our algorithm also explicitly learns distinct transcription
programs for mother and daughter cells, enabling us to identify 82 genes
transcribed almost entirely in early G1 in a daughter-specific manner.
In addition to the cell-cycle deconvolution algorithm, we introduce DOMAIN,
a protein-protein interaction (PPI) network alignment method, which employs a
novel direct-edge-alignment paradigm to detect conserved functional modules (e.g.,
protein complexes, molecular pathways) from pairwise PPI networks. By applying
our approach to detect protein complexes conserved in yeast-fly and yeast-worm
PPI networks, we show that our approach outperforms two widely used approaches
in most alignment performance metrics. We also show that our approach enables
us to identify conserved cell-cycle-related functional modules across yeast-fly PPI
networks.
Contents
Abstract
List of Tables
List of Figures
List of Abbreviations and Symbols
Acknowledgements
1 Introduction
1.1 Biological background
1.1.1 Overview of cell cycle
1.1.2 Phases of eukaryotic cell cycle
1.1.3 Asymmetric cell division of budding yeast
1.1.4 Cell-cycle control system of budding yeast
1.2 Cell-cycle synchrony experiment and its limitations
1.2.1 Biomarkers for monitoring cell-cycle progression
1.2.2 Cell-cycle synchrony experiment
1.2.3 Synchrony lose significantly in a synchronized cell population
1.3 Motivation: why deconvolution is necessary
1.3.1 Deconvolution: from population to single cells
1.3.2 cloccs: modeling cell-cycle distributions
1.3.3 The missing piece of deconvolution
1.4 Contribution of our deconvolution framework
1.5 Thesis outline
2 The deconvolution framework
2.1 Previous deconvolution algorithms
2.2 General deconvolution objective function
2.3 Branching process in deconvolution
2.4 Introduction to wavelets: selection of wavelets
2.5 Selecting a regularization parameter
2.6 Joint learning from multiple replicates
3 Deconvolution of wild-type cell-cycle transcriptional profiles of budding yeast
3.1 Experimental data
3.2 Branching process model and cell-cycle parameters
3.2.1 Branching process model
3.2.2 Cell-cycle parameters from cloccs
3.3 Deconvolution model
3.3.1 Deconvolution objective function
3.3.2 Constructing a convolution kernel
3.3.3 Selection a regularization parameter
3.3.4 Adjustment of branching process construction from cloccs
3.4 Results
3.4.1 Deconvolving time-series yeast budding index data to assess algorithm accuracy
3.4.2 Deconvolving replicate yeast microarray data to reveal single-cell transcription profiles
3.4.3 Deconvolution is robust with respect to uncertainty in input cloccs parameters
3.4.4 Deconvolution increases temporal resolution and precision of transcription profiles
3.4.5 Deconvolution increases amplitude and dynamic range of transcription profiles
3.4.6 Deconvolution reveals a large number of transcripts fluctuating during the cell cycle
3.4.7 Deconvolution is robust across replicates
3.4.8 Deconvolution reveals fine timing of transcription programs
3.4.9 Identifying over-represented transcription factors (TFs)
3.4.10 Deconvolution reveals R-specific transcriptional program
3.4.11 Deconvolution reveals a daughter-specific G1 transcription program
3.4.12 Transcriptional programs between G1 and DG1
3.4.13 Visualizing transcription timing of gene groups
4 Identifying conserved functional modules across species
4.1 Introduction to network alignment
4.2 DOMAIN: a domain-oriented edge-based PPI network aligner
4.2.1 Constructing and scoring APEs
4.2.2 Building an APE graph
4.2.3 Detecting protein complexes
4.3 Results
4.3.1 Experimental setup
4.3.2 DOMAIN outperforms previous methods in most performance metrics
4.3.3 DOMAIN is sensitive at detecting small alignments
4.3.4 DOMAIN provides a comprehensive means of interpreting alignments
4.3.5 Performance improves by combining cross-species pairwise alignments
4.4 Detecting conserved cell-cycle-related functional modules
4.5 Discussions
5 Conclusions
A Intrinsic disorder within and flanking the DNA-binding domains of human transcription factors
A.1 Introduction to intrinsically disordered structures and transcription factors
A.2 Materials and Methods
A.2.1 Constructing the TF dataset and the non-TF control dataset
A.2.2 Comparing the TF and non-TF sets of proteins
A.2.3 Identifying DNA-binding domains (DBDs) and their locations within proteins
A.2.4 Using multiple prediction methods to predict intrinsically disordered regions (IDRs) within proteins
A.2.5 Defining disorder features: spatial relationships of IDRs relative to DBDs within TFs
A.2.6 Calculating statistical significance of disorder features
A.3 Results
A.3.1 Comparing the three methods to predict IDRs within proteins
A.3.2 IDRs associated with TF DBDs or their flanking regions
A.3.3 Comparison of prediction methods in DBDs
A.3.4 Summary descriptions for some of the most prevalent DBD classes found in human TFs
A.4 Discussion
Bibliography
Biography
List of Tables
3.1 Cell-cycle parameters estimated by cloccs from flow cytometric measurements of DNA content and budding index.
3.2 Full list of over-represented TFs in subclusters of R-specific expressed genes (Fig. 3.11).
3.3 Full list of over-represented TFs in subclusters of daughter-specific genes (Fig. 3.12).
3.4 The contingency table for 82 identified daughter-specific genes according to the daughter-specific and non-daughter-specific genes identified in Di Talia et al. (2009), Spellman et al. (1998), and Colman-Lerner et al. (2001).
4.1 Summary of backbone networks.
4.2 Performance comparisons of DOMAIN with NetworkBLAST and MaWISh on yeast-fly backbone networks.
4.3 Performance comparisons of DOMAIN with NetworkBLAST and MaWISh on yeast-worm backbone networks.
4.4 Cell-cycle-related functional modules conserved across budding yeast and fruit fly
A.1 Statistics summarizing disorder predictions on all the residues of all the proteins in both the TF set and the non-TF control set using three different disorder prediction tools.
A.2 Enrichment analysis of significantly occurring ordered and disordered regions within and flanking human TF DBDs.
List of Figures
1.1 Overview of eukaryotic cell cycle.
1.2 Asymmetric cell division of budding yeast
1.3 Overview of the cell-cycle control system of budding yeast.
1.4 Examples of the measurable cell-cycle progression markers in budding yeast.
1.5 Examples to illustrate that the synchronized population of cells loses synchrony over time.
1.6 Overview of deconvolution framework.
1.7 Branching process in cloccs.
2.1 Branching process in deconvolution.
2.2 Selection of a regularization parameter γ.
3.1 Overview of the deconvolution algorithm.
3.2 Detailed algorithm for selecting a regularization parameter γ.
3.3 Deconvolution recovers dynamic single-cell profiles from population-level data.
3.4 Deconvolution is of capability of de-noising.
3.5 Deconvolved profiles are robust to uncertainty in inputs.
3.6 More examples on the robustness of deconvolved profiles with respect to uncertainty in cloccs parameter estimates.
3.7 Genome-wide analysis of deconvolved transcription profiles reveals a large number of transcripts fluctuating during the cell cycle.
3.8 Transcript dynamics of 1,500 most cell-cycle-regulated genes.
3.9 Robustness of deconvolved profiles with respect to variation across measured data replicates.
3.10 High temporal resolution of deconvolution reveals fine timing of transcription programs.
3.11 Genes whose transcriptional levels are elevated significantly under stress.
3.12 Branching process construction enables deconvolution to reveal a daughter-specific G1 transcription program.
3.13 Relationships of transcription profiles in G1 and DG1.
3.14 Circular representation of peak timing of genes.
4.1 Overview of DOMAIN algorithm
4.2 Four connectivities in an APE graph.
4.3 Evaluation of alignment performance of DOMAIN.
A.1 Generation of TF set and the non-TF control set.
A.2 Distributions of the fraction of each protein’s residues predicted as disordered by each method.
A.3 Meta-plots of five prevalent DBDs in human TFs.
List of Abbreviations and Symbols
Abbreviations
APC anaphase-promoting complex
APE alignable pairs of edges
ATM ataxia telangiectasia mutated
ATR ataxia telangiectasia and Rad3-related protein
CDK cyclin-dependent protein kinases
CLOCCS characterizing loss of cell cycle synchrony
CWT continuous wavelet transform
DBD DNA-binding domain
DDI domain-domain interaction
DG1 daughter-specific G1
DWT discrete wavelet transform
DOMAIN domain-oriented alignment of interaction networks
EM expectation-maximization
FACS fluorescence-activated cell sorter
FDR false discovery rate
GO gene ontology
indel insertion/deletion
MBF MCB binding factor
MCM minichromosome maintenance
MCMC Markov chain Monte Carlo
ORC origin recognition complex
postG1 post G1 (G1 or DG1) interval, including S, G2, and M phases
PPI protein-protein interaction
pre-IC pre-initiation complex
pre-RC pre-replicative complex
PTR peak-to-trough ratio
R recovery interval
SBF SCB binding factor
SPB spindle-pole body
TF transcription factor
WT wild-type
YMC yeast metabolic culture
Symbols
f average levels of molecular species in individual cells at various points in the cell cycle
g measured cell-cycle time-series at population level
H (de)convolution kernel
Acknowledgements
First and foremost, I would like to express my earnest gratitude to my advisor,
Prof. Alex Hartemink, for his support, patience, encouragement, wisdom, countless
insightful suggestions, and long discussions. From Alex, I have learned so much, not
only about science, but also about all aspects of research and life. He is
a great mentor. He taught me how to think, to write, and to present in a scientific
way. He has always been there to listen to my thoughts, which were often rambling, and
turn them into something meaningful. He truly makes our group an enjoyable place
to be, to discuss, and to learn.
I would like to acknowledge my committee members, Prof. Steve Haase, Prof.
Uwe Ohler, and Prof. Pankaj Agarwal. Thank you for helping me through
these years. Steve, with his immense knowledge of yeast biology, often provided
insightful feedback and suggestions on our cell-cycle projects. Uwe taught me a lot
about computational biology and genetics in his classes, and he was always willing to
share his experience whenever I ran into a problem in my research. I thank
Pankaj for his invaluable suggestions and feedback on writing this dissertation. I
would also like to thank Prof. Martha Bulyk, Prof. Edwin Iversen, Prof. Merlise
Clyde, Prof. Rebecca Willett, and Prof. David MacAlpine for helpful discussions at
various points during the development and writing of this dissertation.
Thanks to all members of the Hartemink lab, past and present. I could not have finished
my PhD without their help and support. Special thanks to Dr.
Allister Bernard, who developed the initial framework of the cell-cycle deconvolution
algorithm, and to Dr. Josh Robinson, my officemate for over two years,
who taught me a lot about statistics and brought me a lot of laughter. Dr. Raluca
Gordan, Dr. Narlikar LeeLavati, Dr. David Orlando, Dr. Todd Wasson, Abrita
Chakravarty, Yezhou Huang, Jianling Zhong, Michael Mayhew, Kaixuan Luo, and
Dr. Fantine Mordelet, thank you for so many helpful discussions, and for making
our lab such a lively and productive place.
Last but definitely not least, I thank my family. Dad and mom, thank you
for all the love and encouragement over so many years. Thank you, Rui, my
wife. Without your help and support, I would never have been able to complete this
dissertation. And a big kiss to my daughter Mandy, who is growing up to be such a
wonderful, loving kid.
Thank you all!
1 Introduction
The cell is the basic structural and functional unit of all known living organisms.
All necessary genetic information and molecular machinery are maintained
in individual cells, which enables existing cells to produce new cells through an
intricate series of cell-cycle events. To better understand how these events are regulated,
studies in many organisms have monitored the dynamics of various molecular
species (e.g., transcript levels, protein levels, nucleosome positions) throughout the
cell cycle. Ideally, the dynamics of these species would be studied in individual cells
traversing the cell cycle. Unfortunately, accurate and genome-wide quantification of
many molecular species is still only possible in populations of cells. For population
measurements to provide insight into the dynamics of molecular species in individual
cells, the cells in a population are first arrested at one stage of the cell cycle and
then released to progress through subsequent division cycles. Molecular species can
then be monitored in the population at various time points after release.
However, perfect cell synchrony is neither attainable at synchronization nor maintainable
after release. More importantly, cell division is an asymmetric process in
many kinds of cells, such as budding yeast; after cell division, the newborn daughter
cells are typically smaller than their mothers, and the cell-cycle period of these
daughter cells is significantly longer than that of mothers. For these reasons, time-series
measurements taken over a population of cells do not accurately reflect the
dynamics of individual cells as they traverse the cell cycle, but instead represent the
convolved dynamics of all cells in the imperfectly synchronized population.
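The effect described above can be illustrated with a small simulation. The sketch below is only a toy model under stated assumptions: the function name `simulate_marker_fraction`, the mean cycle lengths, the daughter delay, the cell-to-cell variability, and the "marker window" (a fixed fraction of each cell's cycle during which a hypothetical marker is on) are all illustrative choices, not values or methods from this thesis, and the model is far simpler than cloccs.

```python
import math
import numpy as np

def simulate_marker_fraction(n_cells=500, t_max=300, mother_period=75.0,
                             daughter_delay=20.0, cv=0.1,
                             window=(0.4, 0.6), seed=0):
    """Fraction of the population inside a fixed cell-cycle window at each minute.

    All cells start their cycle at time 0 (perfect initial synchrony). Each cycle
    length is drawn from a normal distribution; daughters get a longer mean cycle
    (assumed extra G1 time). Every division yields one mother and one daughter.
    """
    rng = np.random.default_rng(seed)
    in_window = np.zeros(t_max + 1)
    alive = np.zeros(t_max + 1)
    pending = [(0.0, False)] * n_cells            # (cycle start time, is_daughter)
    while pending:
        start, is_daughter = pending.pop()
        mean = mother_period + (daughter_delay if is_daughter else 0.0)
        length = max(rng.normal(mean, cv * mean), 1.0)
        end = start + length
        # Integer minutes during which this cell is in the current cycle.
        lo, hi = math.ceil(start), min(math.ceil(end), t_max + 1)
        alive[lo:hi] += 1
        # Integer minutes during which this cell is inside the marker window.
        w_lo = math.ceil(start + window[0] * length)
        w_hi = min(math.ceil(start + window[1] * length), t_max + 1)
        in_window[w_lo:w_hi] += 1
        if end <= t_max:                          # division: mother plus new daughter
            pending.append((end, False))
            pending.append((end, True))
    return np.arange(t_max + 1), in_window / alive

times, fraction = simulate_marker_fraction()
first_peak = fraction[:75].max()                  # peak during the first cycle
late_peak = fraction[150:].max()                  # peak after roughly two cycles
```

In this toy setting the first peak of the population signal is close to 1, while later peaks are much lower and broader: the population average flattens even though every simulated cell still switches the marker fully on and off.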
In this thesis, we introduce a deconvolution algorithm that efficiently removes
these synchrony-loss effects from population-level measurements and reveals a detailed
cell-cycle profile at the single-cell level. Our deconvolution algorithm is built
upon cloccs (Characterizing Loss of Cell Cycle Synchrony), a framework for quantitatively
determining cell-cycle distributions in population synchrony experiments.
From cloccs parameter estimates, we construct a convolution kernel that transforms
values from the individual-cell level to the population level. Estimating
single-cell dynamics from population-level measurements can then be viewed as an
ill-posed inverse problem. We address the ill-posed nature of this problem, and
simultaneously tackle the issue of noise in the input data, by employing a novel
wavelet-basis regularization approach.
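To make the inverse problem concrete, the following sketch uses the notation of the symbol list: a single-cell profile f, a kernel H, and a population signal g ≈ H f. It is a minimal illustration under stated assumptions, not the algorithm developed in this thesis: the kernel is a toy (each row is a wrapped Gaussian whose spread grows with time, rather than a kernel built from cloccs estimates), the Haar basis and the ISTA solver are stand-ins for whatever wavelet family and optimizer the later chapters specify, and the names `haar_matrix`, `toy_kernel`, `deconvolve` and all numeric values are hypothetical.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar wavelet transform matrix (n must be a power of two)."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    averages = np.kron(h, [1.0, 1.0])               # coarser-scale rows
    details = np.kron(np.eye(n // 2), [1.0, -1.0])  # finest-scale differences
    return np.vstack([averages, details]) / np.sqrt(2.0)

def toy_kernel(times, n_phase, period=75.0, sigma0=0.03, growth=0.0006):
    """Row t: assumed distribution of cell-cycle positions in the population at time t."""
    phase = (np.arange(n_phase) + 0.5) / n_phase
    center = (times / period) % 1.0
    sigma = sigma0 + growth * times                 # spread widens as synchrony is lost
    dist = (phase[None, :] - center[:, None] + 0.5) % 1.0 - 0.5  # wrapped distance
    H = np.exp(-0.5 * (dist / sigma[:, None]) ** 2)
    return H / H.sum(axis=1, keepdims=True)

def deconvolve(g, H, gamma=0.01, n_iter=2000):
    """Minimize 0.5*||g - H f||^2 + gamma*||detail coefficients of W f||_1 by ISTA."""
    W = haar_matrix(H.shape[1])
    A = H @ W.T                                     # wavelet coefficients -> population signal
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    c = np.zeros(H.shape[1])
    for _ in range(n_iter):
        c = c + step * (A.T @ (g - A @ c))          # gradient step on the data-fit term
        # Soft-threshold every coefficient except the overall-mean coefficient c[0].
        c[1:] = np.sign(c[1:]) * np.maximum(np.abs(c[1:]) - step * gamma, 0.0)
    return W.T @ c

rng = np.random.default_rng(0)
times = np.arange(0.0, 150.0, 3.0)                  # 50 hypothetical sampling times (minutes)
f_true = np.zeros(64)
f_true[20:28] = 1.0                                 # a sharp single-cell expression pulse
H = toy_kernel(times, f_true.size)
g = H @ f_true + rng.normal(0.0, 0.02, times.size)  # noisy population-level measurement

f_naive = np.linalg.pinv(H) @ g                     # unregularized inversion amplifies noise
f_hat = deconvolve(g, H)
err_naive = np.linalg.norm(f_naive - f_true)
err_hat = np.linalg.norm(f_hat - f_true)
```

Comparing `err_naive` with `err_hat` shows why regularization is needed: the kernel is nearly singular, so inverting it directly magnifies measurement noise, whereas penalizing wavelet detail coefficients keeps the estimate stable while still allowing sharp features.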
Before elaborating on the computational details of our deconvolution algorithm,
in this chapter we first review the biological background and experimental
techniques that motivate our work toward accurately estimating cell-cycle dynamics
at the single-cell level. We then describe the basis of the cell-cycle deconvolution
problem and briefly introduce the cloccs model. We end this chapter with an outline
of this thesis.
1.1 Biological background
1.1.1 Overview of cell cycle
The cell cycle, or cell division cycle, is one of the fundamental processes in all living
organisms, from unicellular bacteria to multicellular mammals. During the
course of the cell cycle, a cell replicates its genome and other cellular
contents to produce a new cell. In unicellular organisms such as bacteria or yeasts,
cell division generates an entire new organism. In multicellular species, countless cell
divisions starting from a single founder produce the diverse communities of cells that
make up tissues and organs.
The cell cycle is a series of events that take place in a cell leading to its division
and duplication. Although the details of the cell cycle vary from organism to organism,
certain characteristics are common. At minimum, a cell has to accomplish its
most fundamental task: passing on its genetic information to the next generation.
In cells without a nucleus (prokaryotes), the cell cycle occurs via a process termed
binary fission. In cells with a nucleus (eukaryotes), the cell cycle is controlled by
a complex network of regulatory proteins, known as the cell-cycle control system, that
governs progression through the cell cycle. The core of this system is an ordered
series of biochemical switches that initiate the main events of the cycle, including
chromosome duplication and segregation. In this thesis, we focus on the
eukaryotic cell cycle, especially that of the model organism budding yeast,
Saccharomyces cerevisiae.
1.1.2 Phases of eukaryotic cell cycle
The two most basic functions of the cell cycle are accurate duplication of the large
amount of DNA in the chromosomes and segregation of the precisely duplicated chromosomes
into two daughter cells. The stages of the eukaryotic cell cycle are typically
defined on the basis of these two chromosomal events, separated by two gap phases,
G1 and G2 (Fig. 1.1).
G1 phase (also known as post-mitotic phase) is the major period of cell growth
during one cell cycle. In G1 phase, large amounts of structural proteins and enzymes
are required for synthesizing new organelles, and therefore the rate of metabolism
in the cell is high. The length of G1 phase can vary greatly depending on external
conditions and extracellular signals from other cells (in multicellular organisms).
Sometimes, cells delay progress through G1 and may even enter a specialized resting
state known as G0 phase. Near the end of G1, the cell progresses through a commitment
point, known as Start (in yeasts) or the restriction point (in mammalian
cells), which is a major cell-cycle checkpoint that ensures the DNA is intact and the
cell is functioning normally. After passing this point, the cell is committed to DNA
replication, even if the extracellular signals that stimulate cell growth and division
are removed (Morgan, 1997; Alberts et al., 2007; Murray and Hunt, 1993).
Figure 1.1: Overview of eukaryotic cell cycle. The reproduction of cells includes two major processes: chromosome duplication during S phase, and chromosome segregation during M phase. These two phases are separated by two gap phases: G1 is the gap phase between the previous M phase and S phase, and G2 is the gap phase between S and M phases. Figure is adapted from Morgan (2007).
Next is the synthesis (S) phase, during which the chromosomes are duplicated.
The central event in this phase is DNA replication. It starts from specific locations
in the genome, called ‘replication origins’. A complex of initiator proteins binds
to these sites and opens the DNA, creating two Y-shaped DNA structures called
‘replication forks’. DNA polymerases and other replication proteins are recruited to
these forks, which move outwards in both directions, to synthesize the two new DNA strands.
In addition to DNA replication, chromatin structures are assembled and DNA damage,
if it occurs, is detected and repaired during this phase. Both of these processes
require increased synthesis of proteins, such as histones for packaging the DNA into
chromosomes (Osley, 1991), and ataxia telangiectasia mutated (ATM) and ataxia
telangiectasia and Rad3-related protein (ATR), two master kinases that respond to
DNA double-strand breaks and disruptions in chromatin structure (Bakkenist and
Kastan, 2003).
G2 phase is the second growth period of the cell cycle, occurring between S phase
and the mitosis (M) phase. Curiously, G2 phase is not a necessary part of the cell
cycle. Some cell types (particularly Xenopus embryos and some cancers (Liskay,
1977)) proceed directly from DNA replication to mitosis. Budding yeast, Saccharomyces
cerevisiae, the model organism in the study of the cell cycle, also lacks a clearly
defined G2 phase (Forsburg and Nurse, 1991).
The second major phase of the cell cycle is mitosis (M) phase. M phase is typically
composed of two major events: nuclear division (mitosis) and cell division (cytokinesis).
The first event, mitosis, is a complex and precise process that distributes the
duplicated chromosomes equally into a pair of daughter nuclei. Mitosis can be divided
into four sub-phases: prophase, during which chromatin condenses into duplicated
chromosomes; metaphase, during which the condensed chromosomes align in the
middle of the cell; anaphase, during which chromosomes move to opposite poles of
the cell; and telophase, during which two daughter nuclei form in the cell. In the
second event, cytokinesis, the cytoplasm of a single eukaryotic cell divides
to form two daughter cells, each with a set of chromosomes identical to that of the mother
cell (Morgan, 2007).
1.1.3 Asymmetric cell division of budding yeast
Budding yeast Saccharomyces cerevisiae is a unicellular fungus that has been widely
used in baking and brewing since ancient times, and is therefore commonly called
baker’s or brewer’s yeast. Budding yeast is one of the most intensively studied
eukaryotic organisms in genetics and cell biology, particularly in the field of the cell
cycle. As a unicellular eukaryote, budding yeast offers many advantages for
studying cell-cycle regulation. First, it has a relatively small genome and
is able to proliferate rapidly in simple culture conditions (e.g., approximately 90
minutes per cell division under ideal conditions). Second, the cell cycle of budding
yeast is very similar to the cell cycle of many higher eukaryotes, such as humans.
Third, and more importantly, budding yeast can proliferate in a haploid state, in
which only a single copy of each chromosome is present in the cell. This makes it
easy to manipulate the cells genetically, avoiding the pitfall of recessive mutations.
For example, as early as the 1970s, researchers used haploid cells of budding yeast
to carry out large mutation screens, leading to many key discoveries about the regulation of
cell division (Simchen, 1978). For all these reasons, budding yeast Saccharomyces cerevisiae is
an ideal experimental model organism for the study of the cell cycle.
A distinctive feature of budding yeast S. cerevisiae is its asymmetric division
(Hartwell and Unger, 1977; Lord and Wheals, 1981, 1980; Woldringh et al., 1993;
Bean et al., 2006). As illustrated in Fig. 1.2, the cycle of S. cerevisiae is usually split
into three phases, G1, S, and G2/M, as there exists no normal G2 phase in
budding yeast (Forsburg and Nurse, 1991). Around the time that a cell progresses
from G1 into S phase, a bud emerges from one side of the cell, grows steadily, and
finally separates from its mother after mitosis, forming a daughter cell. After cell
division, the newborn daughter cells are usually smaller than the mother cells, and
the cell-cycle period of these daughter cells is significantly longer than that of mother
cells. This is most likely due to mechanisms, not yet well understood, that delay
daughter cells in early G1 until they achieve a critical cell size (Jorgensen and Tyers,
2004). Mother cells are often already larger than this critical size and thus progress
more rapidly through G1 (Di Talia et al., 2007; Morgan, 2007).
Figure 1.2: Asymmetric cell division of budding yeast S. cerevisiae. The cycle of budding yeast is usually split by landmark events into G1, S, and G2/M phases. The transition from G1 to S is marked by the development of a bud, and the transition from S to G2/M is marked by the completion of DNA synthesis. At the end of M phase, the daughter cell separates from the mother cell. After yeast cell division, the newborn daughter cell is usually smaller than the mother cell, and therefore it needs more time in G1 to grow until it reaches a critical cell size.
1.1.4 Cell-cycle control system of budding yeast
The eukaryotic cell division cycle is controlled by the sequential activation and inactivation
of cyclin-dependent protein kinases (CDKs). CDKs are a family of serine/threonine
protein kinases. In general, a CDK binds a regulatory protein called
a cyclin to play its regulatory role. Without cyclin, a CDK has little kinase activity,
and therefore only the cyclin-CDK complex is an active kinase. CDKs are present in
all known eukaryotes, and their regulatory functions in the cell cycle are conserved
across species. For example, it has been shown that yeast cells can proliferate
normally when their CDK gene is replaced with the homologous human gene (Lee
and Nurse, 1987; Morgan, 2007). In the budding yeast Saccharomyces cerevisiae,
Cdc28/Cdk1 is the only CDK involved in regulating the cell cycle, while in higher
eukaryotes, multiple CDKs (e.g., Cdc2/Cdk1, Cdk2, Cdk4, and Cdk6) control cell-cycle
progression.
Figure 1.3: Overview of the cell-cycle control system of budding yeast. There exist three major sets of gene regulatory factors that provide the underlying framework for an autonomous control system to trigger cell-cycle events in the correct order: SBF/MBF, Mcm-Fkh, and Swi5/Ace2 (blue boxes). In early G1, Cln3-Cdk1 activity sets the system in motion by activating SBF/MBF. Then, the regulatory signals proceed forward through the various Cdks and gene regulatory factors as shown by the solid red arrows, leading to ordered progression through the stages of the cell cycle and back to the stable G1 stage again. Positive feedback (dashed red arrows) enhances the activation of each gene regulatory factor, and negative feedback (dashed blue lines) allows some components to inhibit previous components in the sequence. Figure is adapted from Morgan (2007).
There are nine major cyclins in budding yeast: three G1 cyclins (Cln1-3) and six
B-type cyclins (Clb1-6). All of these cyclins bind to and activate Cdc28, and
together with some other regulatory factors they establish a complex regulatory
network that controls the progression of the cell cycle (Futcher, 2002; Murray, 2004;
Cross, 2003; Chen et al., 2004; Morgan, 2007; Bloom and Cross, 2007; Alberts et al., 2007).
As shown in Fig. 1.3, in early G1, the activity of most Cdks is suppressed by the Cdk
inhibitor Sic1 and by cyclin ubiquitination by the anaphase-promoting complex (APC).
However, these inhibitory factors do not prevent growth-dependent accumulation of
the G1 cyclin Cln3. Therefore, during G1 the activity of the Cln3-Cdk1 complex
accumulates and reaches a threshold level that triggers activation of the gene
regulatory factors SBF (Swi4-Swi6) and MBF (Mbp1-Swi6), and these factors in turn
stimulate the expression of genes encoding G1/S cyclins (Cln1 and Cln2) and S cyclins
(Clb5 and Clb6). Since the G1/S-Cdk complexes are resistant to Sic1 and are not
targeted by the APC, the activity of G1/S-Cdk increases greatly in late G1, leading to
the phosphorylation of Cdh1 and the inactivation of the APC.
APC inactivation and Sic1 destruction allow M cyclins to start accumulating,
and the rising M-Cdk activity stimulates activation of the next gene regulatory
factor in the sequence, Mcm1-Fkh. The Mcm1-Fkh co-factor further stimulates the
expression of M cyclins and other genes required for mitosis, leading the cell to
enter mitosis.
During M phase, after the metaphase-to-anaphase transition, cyclins are destroyed,
which leads to activation of the M/G1 gene regulators, Swi5 and Ace2. Swi5 and Ace2
stimulate the expression of Sic1 and other proteins that cause Cdk inactivation.
Therefore, after division the system has returned to a stable G1 state with low
Cdk activity, poised to begin the next cycle.
1.2 Cell-cycle synchrony experiment and its limitations
1.2.1 Biomarkers for monitoring cell-cycle progression
How can we tell which stage of the cell cycle a budding yeast cell has reached?
One simple and cost-efficient way is to look at living cells with a light microscope
to check whether or not each cell is budded. As mentioned previously, the bud
of a yeast cell appears near the G1-to-S transition and persists until the completion
of mitosis, when the mother cell and its bud (the daughter cell) separate. Therefore,
the presence of a bud tells us whether or not a cell has passed G1 phase, and
the fraction of budded cells in a population gives us some clues about how these
cells are distributed over the cell cycle. For instance, by counting the total number
of cells and the number of cells with buds under a microscope, we can calculate the
fraction of cells that are in G1 and the corresponding fraction of cells that have
passed G1 (including S and G2/M phases). Using a cell-cycle distribution model, such
as cloccs described in Section 2.5, we can accurately estimate how the cells in a
population are distributed over the cell cycle. An example of a typical budding index
profile is shown in Fig. 1.4A.
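The budding index calculation described above can be sketched in a few lines of Python. This is only an illustration: the function name `budding_index` and all cell counts below are hypothetical and are not taken from any experiment in this thesis.

```python
def budding_index(n_budded, n_total):
    """Fraction of counted cells that carry a bud (i.e., have passed G1)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_budded <= n_total:
        raise ValueError("n_budded must lie between 0 and n_total")
    return n_budded / n_total


# Hypothetical counts at successive time points: (minutes, budded, total).
counts = [(0, 4, 200), (30, 62, 210), (60, 171, 190), (90, 96, 205)]

for minutes, budded, total in counts:
    post_g1 = budding_index(budded, total)
    # Unbudded cells are taken to be in G1; budded cells are in S or G2/M.
    print(f"{minutes:3d} min: G1 = {1 - post_g1:.2f}, post-G1 = {post_g1:.2f}")
```

Plotting the post-G1 fraction against time gives a budding index profile of the kind shown in Fig. 1.4A.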
In addition to budding index measurements, actomyosin rings, nuclei, and spindle
pole bodies (SPBs) can also be used as cellular markers, observed by fluorescence
microscopy, to study the cell-cycle stage in budding yeast. Fluorescence microscopy
has become increasingly common as a rich source of marker data (Stacey
and Hitomi, 2008; Harder et al., 2006; Dickinson, 2006; Aikawa et al., 2007). For
example, by tagging proteins associated with these markers with fluorescent labels
and quantifying the presence of these markers under a fluorescence microscope, we
can determine which cell-cycle stage a population of cells has reached. In detail,
• Assembly of the actomyosin ring marks the G1/S transition, and disassembly of the
actomyosin ring marks the end of cytokinesis (Bi et al., 1998).
• The cell nucleus disassembles and re-forms during the cell cycle. At the beginning of
mitosis, the chromosomes condense, the nucleolus disappears, and the nuclear
envelope breaks down, resulting in the release of most of the contents of the
nucleus into the cytoplasm. At the end of mitosis, the process is reversed: The
Figure 1.4: Examples of measurable cell-cycle progression markers in budding yeast S. cerevisiae. (A) A typical budding index curve (percentage of budded cells versus time): the time-course record of the proportion of cells in a population in the G1 and post-G1 phases. (B) Fluorescence images. Shown are (1) two dividing budding yeast cells under differential interference contrast (DIC) microscopy. (2) Red fluorescence. Intense spots represent the myosin rings. (3) Blue fluorescence. Intense spots represent nuclei. (4) Green fluorescence. The small punctate blobs represent the spindle pole bodies (SPBs). (C) A typical DNA content histogram, with 1C and 2C DNA content marked, as measured by flow cytometry for an asynchronous population.
chromosomes de-condense, and nuclear envelopes re-form around the separated
sets of daughter chromosomes. Hence, the movement of the nucleus, especially its
position at the bud neck, can provide much information about the current stage of
the cell cycle (Granovskaia et al., 2010; Lord and Wheals, 1981).
• The SPBs are used to mark subintervals within the S and G2/M phases.
The SPB duplicates, and the two SPBs separate to form a short spindle during S phase
and separate further as the spindle elongates during M phase (Simmons Kovacs
et al., 2008). Thus, we can identify cells at different stages of spindle
formation by tracking the distance between the two SPBs.
In all, with fluorescent labels, we can track many cellular features over the course
of the cell cycle. Fig. 1.4B shows some example fluorescence images for myosin rings
(in red), nuclei (in blue), and SPBs (in green).
Another efficient means of determining cell-cycle position is to measure the
genomic DNA content of the cell using flow cytometry (Haase and Reed, 2002; Slater
et al., 1977; Tobey and Crissman, 1975). A haploid yeast cell begins the cycle with
one copy of genomic DNA in G1. During S phase, the DNA is replicated, and thus
at the end of S phase, the cell contains two copies of genomic DNA. Using flow
cytometry, the DNA content of thousands of cells can be rapidly measured. The
genomic DNA of the cells is labeled with a fluorescent dye, and the flow cytometer
then bins each cell into one of 1024 ordered channels on the basis of its fluorescence
intensity, which is proportional to its DNA content (Pierrez and Ronot, 1992). An
example of a typical DNA content histogram measured by flow cytometry is shown in Fig. 1.4C.
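The channel binning described above can be illustrated with a small Python sketch. Only the figure of 1024 channels comes from the text. The linear intensity-to-channel mapping, the intensity scale, and the simulated 1C and 2C populations are assumptions made for illustration, and S-phase cells with intermediate DNA content are omitted for brevity.

```python
import random

N_CHANNELS = 1024  # number of ordered channels, as stated in the text


def to_channel(intensity, max_intensity):
    """Map an intensity in [0, max_intensity] to a channel index 0..1023.

    A linear mapping is assumed here; out-of-range values are clamped.
    """
    idx = int(intensity / max_intensity * N_CHANNELS)
    return min(max(idx, 0), N_CHANNELS - 1)


def dna_histogram(intensities, max_intensity):
    """Count how many cells fall into each channel."""
    hist = [0] * N_CHANNELS
    for x in intensities:
        hist[to_channel(x, max_intensity)] += 1
    return hist


rng = random.Random(0)
# Hypothetical asynchronous population (arbitrary intensity units):
# cells with one genome copy (1C) near 200, cells with two copies (2C) near 400.
cells = [rng.gauss(200, 15) for _ in range(6000)]
cells += [rng.gauss(400, 25) for _ in range(4000)]

hist = dna_histogram(cells, max_intensity=1000.0)
peak_channel = max(range(N_CHANNELS), key=lambda c: hist[c])
print("cells binned:", sum(hist), "tallest peak at channel", peak_channel)
```

In this toy population the tallest peak falls near the channel corresponding to the 1C intensity, with a second, broader peak near the 2C intensity.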
1.2.2 Cell-cycle synchrony experiment
As described in the previous section, a variety of methods are available to determine
the cell-cycle stage of cells. However, all these methods require a large population
of cells to obtain an accurate measurement. To provide insight into the dynamics
of cell-cycle processes, the cells in such a population should be as synchronized as
possible as they progress through the cell division cycle. To effect this synchrony, cells
are arrested or selected at one stage of the cell cycle, and then released to progress
through subsequent division cycles. Molecular species can then be measured in the
population at various time points after release (Spellman et al., 1998; Cho et al.,
1998; Pramila et al., 2006; Orlando et al., 2008; Granovskaia et al., 2010).
A number of methods have been used to synchronize a population of
yeast cells at various stages of the cell cycle. The two most common approaches are
the physical method of centrifugal elutriation and the genetic method of α-factor
block-release (Orlando, 2009; Futcher, 1999; Amon, 2002).
Synchronization by centrifugal elutriation is a size-based method. The method
extracts small cells from a population of cells, and such cells are typically newborn
daughter cells in early G1 phase. In detail, a population of cells in liquid media
is first pumped into a rapidly spinning chamber. The centrifugal force then causes
a gradient to form, with cells sedimenting at the bottom (outside) of the chamber
and the fluid eluting out of the top (inside) through the exit port. Because small
cells have a higher surface-to-volume ratio than larger cells, their sedimentation is
relatively more affected by the rate of fluid flow. In the end, by carefully adjusting
the pump and the centrifuge speed while monitoring the fraction of unbudded cells
in the output, small cells can be selectively washed out of the chamber and collected.
In elutriation-based synchronization experiments, the initially collected cells
(typically small cells early in G1) are released after experiencing
significant cold and osmotic stress, and therefore require a period of time
to recover. Also, because the small cells are likely to be spread over a broader
range of positions in early G1 phase, a population synchronized by centrifugal
elutriation tends to lose synchrony faster than populations synchronized by other
methods, owing to the asymmetric nature of cell division in S. cerevisiae (more
details in Section 1.2.3). However, since centrifugal elutriation is a size-based
collection method and not an induced arrest, in theory it causes very little
transcriptional alteration of G1 events.
An alternative synchronization method is α-factor block-release. The α-factor
synchrony experiment is a genetic method, achieved by adding the α mating pheromone
(the arrest/block) to an asynchronous culture and then subsequently removing the
pheromone (the release). The α mating pheromone is a short peptide that binds to
the receptor Ste2 in MATa cells and induces a signaling cascade that results in the
inactivation of the G1 cyclin-CDK complexes, leading to a G1-phase arrest. Because
the cells collected in an α-factor experiment are generally larger and well arrested
Figure 1.5: A synchronized population of cells loses synchrony over time. Shown are the measured budding index profile (panel A) and the transcriptional profiles of three genes, PCL1, SIC1, and SSK22 (panels B-D), over two cell cycles, from Orlando et al. (2008) (time course synchronized by centrifugal elutriation).
in G1 phase, compared to centrifugal elutriation, the population of cells synchronized
in such experiments tends to maintain synchrony for a longer period of time. However,
because the α-factor synchrony method relies on an extracellular signal, it induces
large cellular changes during the cell-cycle arrest, significantly altering the
G1 transcriptional program.
1.2.3 Significant loss of synchrony in a synchronized cell population
In the cell-cycle synchrony experiments, the measurements of cell populations would
not be substantially different from average measurements of individual cells if the
cells in the population were always perfectly synchronized. However, as shown in
Fig. 1.5, a synchronized population of cells loses synchrony substantially over time
after release. What causes synchrony loss in a synchronized population of cells?
At least three factors should be taken into account:
1. In the initially collected population, the cells may exhibit variability at the
time of the release.
2. Because there is variability among cells and because individual cells progress
through the cell cycle at different rates, the synchrony in the population
deteriorates gradually over time.
3. Asymmetric cell division is a major source of synchrony loss in many kinds
of cells, and especially in budding yeast S. cerevisiae (Hartwell and Unger,
1977; Lord and Wheals, 1981, 1980; Woldringh et al., 1993; Bean et al., 2006).
As mentioned previously, after yeast cell division, the newborn
daughter cells are smaller than their mothers. Thus, these small daughter cells
need a longer time in early G1 to grow until they achieve a critical cell
size (Jorgensen and Tyers, 2004). Mother cells, on the other hand, have often
already reached this critical size and can therefore progress more rapidly
through G1 (Di Talia et al., 2007).
For these reasons, time-series measurements of a population of cells do not ac-
curately reflect the dynamics of individual cells as they traverse the cell cycle, but
instead represent the convolved dynamics of all cells in the imperfectly synchronized
population. Thus, observed population measurements are only a ‘blurred’ view of
the underlying behavior of individual cells, and this view becomes increasingly blurry
as the time course progresses. For example, the synchrony in the second cycle of the
profiles in Fig. 1.5 is apparently worse than that in the first cycle.
1.3 Motivation: why deconvolution is necessary
1.3.1 Deconvolution: from population to single cells
In the previous sections, we introduced cell-cycle synchrony experiments and
discussed their limitations: a synchronized population loses synchrony
significantly over time, and therefore the time-series measurements
taken over such a population of cells do not accurately reflect the underlying cell-cycle
dynamics. Let us use an example to take a closer look at this phenomenon.
Assume that at a specific time $t$, the cell-cycle measurement taken over a
population of cells is $g_t$. Also, assume that the average measurement level of
individual cells in G1 phase is $f_{G1}$, the average measurement level of individual
cells in S phase is $f_S$, and the average measurement level of individual cells in
G2/M phase is $f_{G2/M}$. If we know how the cells in the population are distributed
among these three cell-cycle stages (denoted as $h_{t,G1}$, $h_{t,S}$, and
$h_{t,G2/M}$, respectively), then the population-level measurement at this specific
time can be written as
$$g_t = f_{G1}\, h_{t,G1} + f_S\, h_{t,S} + f_{G2/M}\, h_{t,G2/M}.$$
In general, $g_t$ is measurable, and if we can calculate $h_{t,G1}$, $h_{t,S}$, and
$h_{t,G2/M}$, then estimating $f_{G1}$, $f_S$, and $f_{G2/M}$ can be viewed as a
deconvolution problem.
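To make the three-phase example concrete, the following Python sketch first computes population-level measurements from assumed single-cell levels and phase fractions, and then recovers the single-cell levels from three time points by solving the resulting 3 × 3 linear system. All numbers are hypothetical, the helper names `population_level` and `solve_linear` are introduced here for illustration only, and this noise-free toy case is far simpler than the deconvolution algorithm developed in this thesis.

```python
def population_level(f, h_t):
    """g_t = f_G1 * h_{t,G1} + f_S * h_{t,S} + f_{G2/M} * h_{t,G2/M}."""
    return sum(f[phase] * h_t[phase] for phase in f)


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


phases = ["G1", "S", "G2/M"]
# Hypothetical average single-cell levels in each phase (the unknown f).
f_true = {"G1": 100.0, "S": 900.0, "G2/M": 300.0}
# Hypothetical phase fractions h_{t,phase} at three time points (rows sum to 1).
h = [
    {"G1": 0.8, "S": 0.15, "G2/M": 0.05},
    {"G1": 0.2, "S": 0.6, "G2/M": 0.2},
    {"G1": 0.1, "S": 0.2, "G2/M": 0.7},
]

# Forward direction: what a population-level measurement would record.
g = [population_level(f_true, h_t) for h_t in h]

# Inverse direction: recover the single-cell levels from g and h.
A = [[h_t[p] for p in phases] for h_t in h]
f_est = dict(zip(phases, solve_linear(A, g)))
print("population measurements:", g)
print("recovered single-cell levels:", f_est)
```

Even at the first time point, where 80% of cells are in G1, the population measurement is more than twice the true G1 level because a minority of S-phase cells contributes strongly. This is the blurring that deconvolution aims to undo.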
A generalized description of deconvolution is shown in Fig. 1.6, in which g
denotes the measured population-level time-series data, f denotes the average cell-cycle
time-series profile of individual cells (e.g., transcriptional profile, protein expression
profile), and H is a matrix (the deconvolution kernel) that quantifies how the cells in
the population are distributed over the course of the cell cycle at each time point. In
real-world applications, g is measured and f is unknown. In the next section, we
introduce cloccs, a cell-cycle distribution model that can be used to accurately
estimate the deconvolution kernel H.
1.3.2 cloccs: modeling cell-cycle distributions
In this section, we briefly introduce cloccs (Characterizing Loss of Cell Cycle Syn-
chrony) (Orlando et al., 2007, 2009; Mayhew et al., 2011), a framework for quantita-
tively determining cell-cycle distributions in population synchrony experiments, or
in other words, estimating the deconvolution kernel H (Fig. 1.6).
In cloccs, the cell-cycle progression of a synchronized population of cells is
Figure 1.6: Overview of the deconvolution framework. Estimating cell-cycle dynamics of individual cells from population-level time-series data can be viewed as a deconvolution problem. Here, we formulate deconvolution as a discrete inverse problem g = H × f, in which g is a column vector containing the measured population-level time-series data (k time points), H is the convolution kernel, which describes how the cells in the population are distributed over the course of the cell cycle at each time point, and f is a column vector representing the unknown cell-cycle dynamics profile of an average individual cell (n time points).
modeled using a linear graphical representation termed a ‘branching process’. In
contrast to the traditional circular representation of the cell cycle (e.g., as in Fig. 1.2),
the branching process enables us to explicitly distinguish the cell cycles of mother and
daughter cells, and allows us to observe cell-cycle events in different cycles. As shown
in Fig. 1.7, the branching process in cloccs is composed of three cell-cycle intervals:
the recovery interval, or R for short, which immediately follows the release from
synchrony and during which the initial cells recover from the synchrony
protocol; the cell cycle of mother cells, during which cells progress through a
standard cell cycle; and the cell cycle of daughter cells, during which daughter cells
progress through a longer daughter-specific cell cycle. According to the branching
process, after synchrony release, the initial population of cells first progresses
through an R interval before entering a standard cell cycle. At the end of the first
cycle, cells divide into mother and daughter cells. Mother cells enter the next standard
cell cycle immediately, whereas the newborn daughter cells traverse a longer
daughter-specific cell cycle since they require more time to grow. Every time cells
Figure 1.7: Branching process in cloccs. The branching process is composed of three intervals: recovery, cell cycle of mother cells, and cell cycle of daughter cells. (A) The initial population of cells is modeled as a normally distributed cohort, reflecting the variability of cell-cycle positions in the initial population. (B) As the population of cells traverses the cell cycles, the variance in the population cohort increases gradually, reflecting that individual cells progress through the cell cycle at different rates. (C) During each division, a new cohort is generated for the newborn daughter cells, reflecting the asymmetric cell division of budding yeast.
divide, a new daughter-specific branch appears and this process repeats.
cloccs models the synchronized population of cells as a normally distributed
cohort. The variance in the initial cell cohort (Fig. 1.7A) reflects the variability of
cell-cycle positions in the initially synchronized population. cloccs assumes that
each cell traverses the cell-cycle branches at a constant velocity, and this
velocity is randomly sampled from a normal distribution. Hence, as the cell
cohort progresses through the branches, the variance of the cohort increases gradually
(Fig. 1.7B), reflecting that individual cells go through cell cycles at different rates.
When each cohort passes the point of division (Fig. 1.7C), the population expands
in size, and a truncated normally distributed cohort appears on the daughter branch
to represent the newborn population of daughter cells. Through the branching
process, cloccs explicitly models the asymmetric cell division of budding yeast, and
accounts for all three factors that cause synchrony loss in the population of cells.
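The way a cohort spreads out under these assumptions can be illustrated with a toy simulation. The sketch below is not the cloccs implementation: it covers only the first two sources of synchrony loss (initial variability and cell-to-cell differences in velocity), omits the recovery interval and the daughter branches, and uses hypothetical parameter values. If the initial position and the velocity of each cell are drawn independently from normal distributions with standard deviations σ0 and σv, the position at time t has standard deviation sqrt(σ0² + σv² t²), which the simulation should reproduce.

```python
import random
import statistics


def cohort_positions(t, n_cells=20000, sigma0=0.05, mean_v=1.0, sigma_v=0.08, seed=0):
    """Positions (in cell-cycle units) of a simulated cohort at time t.

    Each cell starts at a normally distributed position and moves at its own
    constant velocity, itself drawn from a normal distribution.
    All default parameter values are hypothetical.
    """
    rng = random.Random(seed)
    return [rng.gauss(0.0, sigma0) + rng.gauss(mean_v, sigma_v) * t
            for _ in range(n_cells)]


def predicted_sd(t, sigma0=0.05, sigma_v=0.08):
    """Standard deviation of cohort position implied by the toy model."""
    return (sigma0 ** 2 + (sigma_v * t) ** 2) ** 0.5


for t in (0.0, 1.0, 2.0):  # time measured in units of mean cycle length
    simulated = statistics.pstdev(cohort_positions(t))
    print(f"t = {t:.0f}: simulated sd = {simulated:.3f}, "
          f"predicted sd = {predicted_sd(t):.3f}")
```

The spread grows with time, so the cohort is noticeably broader in the second cycle than in the first, in line with the behavior sketched in Fig. 1.7B.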
Using morphological markers—such as budding index (Orlando et al., 2007), flow
cytometric measurement of DNA content (Orlando et al., 2009), and/or fluorescently
tagged molecular markers (Mayhew et al., 2011)—cloccs accurately estimates the
lengths of cell-cycle intervals, the variance in the rate at which cells move through
these intervals, and the positions in the cell cycle at which specific events take place,
such as when DNA replication starts or ends. For the purposes of deconvolving
population-level measurements, cloccs parameters can also be used to precisely
estimate how cells in a population are distributed over the cell cycle at any point in
time following synchrony release. More details are given in the next chapter.
Although we have described cloccs in terms of the branching process for budding
yeast, the concepts of cloccs are very general and can be
used with other branching processes (e.g., a linear process for the cell cycle
of mutated cells, or a symmetric branching process for symmetric cell division). All
that cloccs requires is a branching process that models the underlying
cell divisions, and a corresponding closed-form mathematical formulation for
Markov chain Monte Carlo (MCMC) sampling.
1.3.3 The missing piece of deconvolution
We introduced the concept of deconvolution in the previous sections. As
demonstrated in Fig. 1.6, deconvolution can be expressed as g = H × f,
where g is the measured cell-cycle time-course profile; H is the deconvolution kernel,
which can be calculated from cloccs; and f is the unknown cell-cycle time-course
profile of average individual cells. However, we desire a higher-resolution
profile of average individual cells, which implies that the number of time points
in f should be much larger than that in g. Therefore, estimating f is not trivial and
involves solving an ill-posed discrete inverse problem. In this thesis, we present
a general deconvolution algorithm that employs a wavelet-basis regularization
approach to accurately estimate the cell-cycle dynamics of average individual cells from
population-level time-series measurements.
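The reason the problem is ill-posed can be seen in a very small example. In the Python sketch below, a hypothetical kernel H with two rows (two measured time points) and four columns (four single-cell time points) maps two very different single-cell profiles to exactly the same population-level data, so the data alone cannot tell them apart. The matrix and both profiles are invented for illustration.

```python
def matvec(H, f):
    """Compute the population-level profile g = H f."""
    return [sum(h_ij * f_j for h_ij, f_j in zip(row, f)) for row in H]


# Hypothetical kernel: each measurement averages two neighbouring
# single-cell time points (2 measurements, 4 unknowns).
H = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

f_smooth = [1.5, 2.5, 3.5, 4.5]  # a gradually rising single-cell profile
f_spiky = [0.0, 4.0, 8.0, 0.0]   # a very different, spiky profile

print(matvec(H, f_smooth))  # [2.0, 4.0]
print(matvec(H, f_spiky))   # [2.0, 4.0]
```

Because both profiles fit the measurements perfectly, some additional criterion is needed to choose between them, which is the role of regularization.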
1.4 Contribution of our deconvolution framework
The major purpose of this thesis is to provide a methodology that removes synchrony
loss effects from population-level cell-cycle measurements and reveals a detailed
cell-cycle profile at a single-cell level. Compared to the previous approaches introduced
in Section 2.1, our deconvolution framework has many advantages. The three most
important are:
1. Previous deconvolution algorithms output either a refined cell-cycle profile
(e.g., a smoother cell-cycle transcription profile as in Bar-Joseph et al. (2004))
or the peak timing of cell-cycle profiles (e.g., Rowicka et al. (2007)). Our
deconvolution algorithm removes synchrony loss effects and yields a continuous
cell-cycle profile over the whole course of the cell cycle. In addition, the temporal
resolution of the deconvolved profiles is much higher than that of the measured data.
2. Our algorithm can learn distinct cell-cycle profiles for both mother and
daughter cells. Combined with its ability to reliably estimate cell-cycle profiles
at fine temporal resolution, this allows us to distinguish
subtle timing differences between mother and daughter cells, which are typically
obscured in population-level measurements.
3. Deconvolution aims to enhance the features of blurred population
measurements to sharpen the underlying signal. However, many previous deconvolution
methods end up sharpening noise as well. Our deconvolution algorithm
avoids this problem by formulating an objective function that is Bayesian
l1-regularized using a wavelet basis, and we show in a later chapter that such an
approach can effectively deblur signals while smoothing away noise.
To our knowledge, our deconvolution algorithm is the first approach that can
explicitly learn cell-cycle profiles at a single-cell level over the whole course of the
cell cycle. Although we primarily demonstrate the usefulness of our algorithm
by deconvolving genome-wide transcription profiles (chapter 3), our algorithm
is general and can be applied to many other population-level data sources, such as
nucleosome occupancy measurements, protein expression profiles obtained by Western
blots, or measurements in organisms other than budding yeast Saccharomyces
cerevisiae.
1.5 Thesis outline
The rest of the thesis is organized as follows. In chapter 2, we describe the general
framework of our deconvolution algorithm, which can be used to deconvolve different
types of cell-cycle time-series data to reveal a detailed cell-cycle profile at a
single-cell level. In chapter 3, we apply our deconvolution algorithm to learn single-cell
transcription profiles from two independent replicates of a cell-cycle synchrony
experiment in wild-type budding yeast (Orlando et al., 2008), and we carry out various
analyses on the resultant transcript profiles to characterize the deconvolution
performance. In chapter 4, we move our focus to the network alignment problem, and
introduce DOMAIN, a network alignment method that employs a novel
direct-edge-alignment paradigm to detect conserved functional modules (e.g., protein
complexes, molecular pathways) in protein-protein interaction networks across species.
We compare the alignment performance of DOMAIN with that of two widely used
alignment approaches, and show that our approach outperforms these two approaches
in most alignment performance metrics. We also show that our approach enables us to
detect some cell-cycle-related functional modules between budding yeast and fruit fly
protein-protein interaction networks. In chapter 5, we draw some conclusions regarding
the present and future states of cell-cycle deconvolution algorithms.
2
The deconvolution framework
In this chapter, we present the general deconvolution framework, which aims at
removing synchrony loss effects from time-series data collected at the population level,
and recovering cell-cycle profiles at a single-cell level. In the first part, we review
some previous deconvolution approaches, and then we describe our deconvolution
framework as well as some technical details of our deconvolution algorithm.
Although this is a general algorithm that can be applied to many organisms, we
focus on the model organism budding yeast and demonstrate the technical details of our
algorithm based on its asymmetric cell cycle. Most of the work presented in this chapter
and in chapter 3 appeared in Mayhew et al. (2012) and Guo et al. (2012a).
2.1 Previous deconvolution algorithms
A few studies have attempted to deconvolve time-series microarray data to survey
either transcript levels (Bar-Joseph et al., 2004; Qiu et al., 2006) or peak expres-
sion timing (Rowicka et al., 2007) during the cell cycle in budding yeast. These
approaches modeled variability in cell-cycle progression rate, but ignored the sig-
nificant synchrony loss caused by asymmetric cell division. As a result, they may
not be well-suited to budding yeast data, and certainly cannot distinguish the cell-
cycle transcription programs of mother and daughter cells. Another more recent
study (Siegal-Gaskins et al., 2009) developed a transcription deconvolution method
for Caulobacter cells that was used to deconvolve the transcription profiles of ten
cell-cycle-regulated genes in that bacterium. In the following, we briefly review these
methods, and compare them in various aspects.
Lu et al. (2003): The first work to elaborate the concept of deconvolution
in population-level cell-cycle measurements. However, the method assumes a set of
perfectly synchronized expression values, and cannot be directly used to deconvolve
time-series expression data.
◦ Species: Eukaryote, budding yeast Saccharomyces cerevisiae.
◦ Data type: 1. Basis experiments from synchronized cell-cycle experiments. 2.
Static transcription levels from populations of cells grown in a wide variety of
conditions.
◦ Cell-cycle phases: G1, S, G2, M, and M-to-G1
◦ Synchrony loss model: None. The fractions of cells in five cell-cycle phases
were determined from the basis experiments.
◦ Synchrony loss factors recovered: No synchrony loss consideration; static tran-
scriptional level at one time point is considered.
◦ Deconvolution model: Used transcriptional peaks of some characterized cell-
cycle genes to determine cell-cycle phases. Used a system of weighted linear
equations to fit the measured static transcription.
◦ Deconvolution outputs: Transcriptional levels of a gene at each cell-cycle phase.
◦ Number of genes used as cell-cycle-regulated: From the literature, the authors
selected 696 genes as cell-cycle-dependent.
◦ Resolution in the deconvolved profiles: Static transcription; not applicable.
Bar-Joseph et al. (2004): The work introduced a cell-cycle synchrony loss model
based on the budding index and fluorescence-activated cell sorting (FACS) data.
However, the major synchrony loss factor, asymmetric cell division, was not consid-
ered in this model. The work focused on reducing noise in experimental measure-
ments.
◦ Species: Eukaryote, budding yeast Saccharomyces cerevisiae.
◦ Data type: 1. Budding index or FACS data. 2. Synchronized cell-cycle time
course.
◦ Cell-cycle phases: G1, S, G2/M
◦ Synchrony loss model: Used budding index (or FACS) data to estimate the
duration of each cell-cycle phase and the cell growth variance in the population.
◦ Synchrony loss factors recovered: Variability in cell-cycle rate.
◦ Deconvolution model: Used cubic splines to fit the time-series transcriptional
data.
◦ Deconvolution outputs: Refined transcription profiles of the first and the second
cell cycles.
◦ Number of genes inferred as cell-cycle-regulated: Inferred around 900 cell-cycle-
regulated genes.
◦ Resolution in the deconvolved profiles: Not explicitly estimated.
Qiu et al. (2006): The work focused on reducing variability of cell-cycle rates
in the population, and introduced a synchronization loss model by modeling the
gene expression measurements as a superposition of different cell populations going
through cell cycles at different rates.
◦ Species: Eukaryote, budding yeast Saccharomyces cerevisiae.
◦ Data type: Synchronized cell-cycle time course.
◦ Cell-cycle phases: Not specified.
◦ Synchrony loss model: Used a mixture model to account for cells traversing
through cell cycles at slightly different rates.
◦ Synchrony loss factors recovered: Variability in cell-cycle rates.
◦ Deconvolution model: Used polynomial model to fit the time-series transcrip-
tion data.
◦ Deconvolution outputs: Refined transcription profiles.
◦ Number of genes inferred as cell-cycle-regulated: Not discussed.
◦ Resolution in the deconvolved profiles: Not explicitly estimated.
Rowicka et al. (2007): The work introduced an algorithm that uses a regularization
approach based on the maximum-entropy principle to determine the transcription peak
timing of cell-cycle-regulated genes. However, the work focused only on transcription
peaks and reported the peak timing of genes, not “true” transcriptional
profiles.
◦ Species: Eukaryote, budding yeast Saccharomyces cerevisiae.
◦ Data type: Cell-cycle time course of a synchronized population of cells in the yeast
metabolic cycle (YMC).
◦ Cell-cycle phases: G1, G1-to-S, S, G2, G2-to-M, M, M-to-G1
◦ Synchrony loss model: Used transcription peaks of some characterized cell-
cycle-regulated genes to determine cell-cycle phases and sub-phases.
◦ Synchrony loss factors recovered: Synchrony noise in initial populations.
◦ Deconvolution model: Used a regularization approach based on the
maximum-entropy principle.
◦ Deconvolution outputs: Timing of the transcription peaks (and in some cases
secondary transcription peaks).
◦ Number of genes inferred as cell-cycle-regulated: Inferred 694 high-confidence
cell-cycle-regulated genes, with an extended set of 1,129 genes.
◦ Resolution in the deconvolved profiles: Resolution of transcription peaks around
2 min (≈2% of one cell cycle).
Siegal-Gaskins et al. (2009): The work estimated the proportions of the different cell
types, swarmer (SW) and stalked (ST), at cell division, and used these estimates to
model the synchrony loss caused by asymmetric cell division.
◦ Species: Bacterium, Caulobacter crescentus.
◦ Data type: Synchronized cell-cycle time course.
◦ Cell-cycle phases: SW, EPD, LPD, ST.
◦ Synchrony loss model: Used a probabilistic model to estimate the total cycle
time, SW-to-ST transition point, and cell-cycle distributions.
◦ Synchrony loss factors recovered: 1. Variability in cell-cycle rates. 2. Vari-
ability in the physiological and developmental state of the cell (asymmetric
cell-cycle division)
◦ Deconvolution model: Converted the deconvolution problem to an optimization
problem, using cross-validation to select an appropriate control parameter.
◦ Deconvolution outputs: Single-cell-like transcriptional profiles.
◦ Number of genes inferred as cell-cycle-regulated: Not discussed.
◦ Resolution in the deconvolved profiles: Not explicitly estimated.
2.2 General deconvolution objective function
Our deconvolution framework employs a wavelet-basis regularization approach to ex-
plicitly learn distinct cell-cycle profiles for both mother and daughter cells. The reg-
ularization objective function of our deconvolution framework includes two parts—a
solution norm to measure the goodness-of-fit and a residual norm to measure the
smoothness of the estimates.
In detail, let $f \in \mathbb{R}^n$ be a vector of size $n$, whose elements represent the average
levels of some molecular species in individual cells at various points in the cell cycle;
let $H \in \mathbb{R}^{t \times n}$ be a convolution matrix that transforms values from the individual
cell level to the population level; and let $g \in \mathbb{R}^t$ be a measured population-level
time-series with $t$ time points. As described in the previous chapter, estimating $f$
involves solving an ill-posed discrete inverse problem: $Hf = g$. Then, the solution
norm (goodness-of-fit) is calculated as $\|Hf - g\|_2^2$, where $\|\cdot\|_2$ denotes the $l_2$ norm.
To avoid over-fitting, we use a residual norm to ensure a smooth estimate of $f$.
The composition of this residual norm is based on our prior knowledge of what
the cell-cycle profiles of average individual cells look like, which is related to the
underlying cell-cycle model. For example, we generally expect the cell-cycle
profile of a cell to be smooth throughout the cycle, but the changes at the
transition between the end of one cycle and the start of the next need not be
continuous and smooth. To quantify the smoothness
of cell-cycle intervals, we introduce a wavelet basis (Jansen, 2001). Thus, the general
residual norm can be represented as $\|fW\|_1$, where $W$ is an orthonormal wavelet-basis
matrix and $\|\cdot\|_1$ denotes the $l_1$ norm.
Putting the solution norm and the residual norm together, our general deconvolution
objective function is written as
$$\arg\min_f \; \|Hf - g\|_2^2 + \gamma \, \|fW\|_1 \qquad (2.1)$$
where γ is a regularization control parameter that sets the trade-off between the
solution norm and the residual norm. When deconvolving microarray transcription
data, use of an $l_2$ norm for $Hf - g$ is dubious since it represents an assumption
of additive Gaussian error, whereas transcript level measurements collected using
Affymetrix arrays are generally presumed to exhibit multiplicative Gaussian error.
To model multiplicative error, we can transform $Hf$ and $g$ into log-space, yielding a
more appropriate solution norm $\|\log Hf - \log g\|_2^2$, and the corresponding objective
function is
$$\arg\min_f \; \|\log Hf - \log g\|_2^2 + \gamma \, \|fW\|_1 \qquad (2.2)$$
However, this objective function is no longer convex. To recover convexity, we
approximate this more appropriate objective function using a first-order Taylor series
expansion as
$$\arg\min_f \; \left\| \frac{Hf}{g} - 1 \right\|_2^2 + \gamma \, \|fW\|_1 \qquad (2.3)$$
where the division $Hf/g$ is taken element-wise. This objective is convex, so any
local optimum is also a global optimum. Constraints can be added
to this objective function according to the type of input data. For example, when
deconvolving microarray transcription data, we may require $f \ge 0$ because the actual
transcript levels are always non-negative, and when deconvolving budding index data,
we can instead use the original objective function of Eq. (2.1), requiring $f \in [0, 1]$
because the fraction of budded cells is always between 0 and 1.
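The step from Eq. (2.2) to Eq. (2.3) can be sketched as follows. This is a reading of the first-order expansion, assuming the fit is close enough that each ratio $(Hf)_i / g_i$ is near 1; the index $i$ over time points is notation introduced here for illustration.

```latex
\begin{align*}
\log (Hf)_i - \log g_i &= \log \frac{(Hf)_i}{g_i}, \\
\log x &\approx x - 1 \quad \text{for } x \text{ near } 1, \\
\|\log Hf - \log g\|_2^2
  = \sum_{i=1}^{t} \left( \log \frac{(Hf)_i}{g_i} \right)^2
  &\approx \sum_{i=1}^{t} \left( \frac{(Hf)_i}{g_i} - 1 \right)^2
  = \left\| \frac{Hf}{g} - 1 \right\|_2^2 .
\end{align*}
```

Each term $(Hf)_i / g_i - 1$ is affine in $f$, so the approximated fit term is a sum of squares of affine functions and is therefore convex, which is what the log-space form of Eq. (2.2) lacks.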
In the next sections, we discuss several technical issues related to our deconvolution
objective function, namely, (1) how to specify the residual norm; (2) how to
select a regularization parameter γ; (3) how to choose the orthonormal wavelet-basis
matrix W; and (4) how to jointly learn a single f from multiple replicates.
2.3 Branching process in deconvolution
Our deconvolution framework is built upon the cell-cycle parameters of cloccs, but
it employs a more detailed branching process than the original one in
cloccs. There are two reasons why we need a branching process in deconvolution:
(1) As in cloccs, we need a
branching process to model the underlying cell-cycle process, such as asymmetric
cell division in budding yeast. (2) By decomposing the cell-cycle branches into small
cell-cycle intervals and then building up connections between these intervals, we are
able to model the cell-cycle dynamics under different assumptions and formulate
corresponding solution norms.
Fig. 2.1 illustrates a tree-like full map of the branching process models in decon-
volution. The model on the root of this map is of maximal flexibility. In this model,
R indicates the recovery interval, representing the interval immediately following re-
lease from synchrony; rG1, indicating the first G1 phase immediately following the R
interval, together with postG1 (including S, and G2/M phases) form the first stan-
dard cell cycle. Similarly, cG1, indicating the standard G1 phase of mother cells,
together with postG1 form the second and all the following standard cell cycles.

[Figure 2.1 image: tree-like map of branching process models, with nodes labeled 1, 1.1, 1.2, 1.1.1, 1.1.1.1, 1.1.1.2, 1.1.2, 1.1.2.1, 1.2.1, 1.2.1.1, 1.2.1.2, 1.2.1.3, 1.2.1.4, 1.2.2, and 1.2.3.]
Figure 2.1: Branching process in deconvolution. The map shows the biologically interpretable branching process models in deconvolution. At each split, either some constraints between cell-cycle intervals or some biological implications are introduced. Intervals in the same color indicate that the cell-cycle programs in these intervals are the same, although sometimes stretched. The branching process models in the gray box are the models with the same number of free cell-cycle intervals.

The bottom cell-cycle branch of this model contains three intervals: D, dG1, and postG1.
D is a daughter-specific interval whose length in time is equal to the time difference
between the cell cycles of mother and daughter cells. dG1 indicates the daughter-specific
G1, whose length in time is equal to that of rG1 or cG1. The only assumption of
this model is that after the G1 checkpoint, mother and daughter cells traverse
the postG1 phases with the same cell-cycle programs. This model includes six free
cell-cycle intervals, R, rG1, cG1, D, dG1, and postG1, and there exist no constraints
on the G1 phases. Specifically, the cell-cycle programs of the three G1 intervals
could be totally different: G1 for the cells in the first cycle (rG1), G1 for the
mother cells starting from the second cycle (cG1), and G1 for the daughter cells (dG1).
This root model, labeled 1, has two nested models. The left one, labeled 1.1,
is based on the assumption that the cell-cycle programs in rG1 are the same as the
programs in cG1. That is, the cell-cycle programs of mother cells are all the same
after the R interval. The right one, labeled 1.2, concatenates the intervals R and rG1
into a new RG1 interval. Merging these two intervals does
not reduce the flexibility of the model, but it reduces the number of free cell-cycle
intervals. It also has a different biological implication compared to model 1: there
exists no boundary between the cell-cycle programs of R and rG1. Thus, during
this RG1 interval, the cells in the initial population not only recover from low
temperatures or other stress from arrest, but also prepare for DNA replication
and mitosis.
Similarly, at each split of this tree-like map, new cell-cycle interval constraints or
biological implications are introduced. To save space, we do not elaborate on every
model, but instead describe the details of three models, labeled 1.1.1, 1.1.2.1, and
1.2.1.1. These models are of particular interest in cell-cycle modeling, and they are
the ones actually used in our analysis.
The model 1.1.1 assumes that the D and dG1 intervals are merged into a single DG1 interval.
According to this model, the initial population of cells after release first progresses
through an R interval to recover from arrest, and then traverses a standard
cell cycle composed of the cG1 and postG1 intervals. At each division, daughter
cells are born; they go through a DG1 interval, which is longer in time
than CG1, and then progress through the postG1 interval. During this DG1 interval,
the daughter cells not only grow to reach the critical cell size, but also
prepare for DNA replication and mitosis as the mother cells do in G1. This is the
model we actually used in deconvolving wild-type transcriptional profiles of budding
yeast in Chapter 3.
The model 1.1.2.1 gives an alternative interpretation of the cell cycle of budding
yeast. After the initial cells traverse the R interval for recovery, they progress
through a standard cell cycle (C). Daughter cells first traverse a daughter-specific
D interval to grow and reach the critical cell size, and then progress
through a standard cell cycle. According to this model, the cell cycle of daughters
is constructed from a standard cell cycle and an additional daughter-specific growth
interval. We attempted to use this model to deconvolve the wild-type
transcriptional profiles of budding yeast, and we found that the previous model 1.1.1 is a
more reasonable model because it places fewer constraints on the daughter-specific G1 phase.
Another interesting model, 1.2.1.1, suggests that the cell-cycle programs in the
interval from release until the first postG1 are the same as the cell-cycle programs in
CG1 but run at a slower pace. This assumption makes some sense because the initially
collected cells are typically small cells early in G1, so they need to do at least the
preparation that the mother cells do in the CG1 interval.
2.4 Introduction to wavelets: selection of wavelets
In this section, we first briefly give some background on wavelet transforms,
and then we introduce a few specific wavelet families that are useful in
constructing the orthonormal wavelet-basis matrix W.
The wavelet transform is a mathematical transformation applied to a signal to
obtain information that is not readily available in the raw
signal. In contrast to the Fourier transform, which converts a signal from time versus
amplitude to frequency versus amplitude across the whole time domain, the wavelet
transform decomposes a continuous-time signal into components at different scales. The
wavelet transform therefore provides the frequency content of a signal in local domains
together with the times associated with those frequencies, making it very effective for
analyzing non-periodic signals and convenient in numerous fields, such as
audio and image processing. For more information about wavelets, see
Mallat (1989), Mallat (1999), Daubechies (1992), and Burrus et al. (1998).
Wavelet transforms are classified into discrete wavelet transforms (DWTs) and
continuous wavelet transforms (CWTs); we use DWTs in this work. An
important property of a wavelet is its number of vanishing moments: having p
vanishing moments means that the wavelet coefficients of any polynomial of degree
up to p − 1 are zero. That is, any polynomial signal up to degree p − 1 can be represented completely
in the scaling space. In theory, more vanishing moments mean that the scaling function
can represent more complex signals accurately. p is also called the accuracy of the
wavelet.
There exist many discrete wavelets; here we list a few that are useful in this
study:
◦ Haar: The Haar wavelet, proposed in 1910 (Haar, 1910), is the simplest possible wavelet.
It has a unique advantage for the analysis of
signals with sudden transitions, such as monitoring of tool failure in machines.
In our work, we used it to decompose the signal of budding index, since it only
has two states, budded and unbudded.
◦ Daubechies: The Daubechies wavelets are a family of orthogonal wavelets defining
a discrete wavelet transform and characterized by a maximal number of
vanishing moments for some given support. With each wavelet type of this
class, there is a scaling function (also called father wavelet) which generates an
orthogonal multi-resolution analysis (Daubechies, 1992). The Haar wavelet is
the special case of the Daubechies family with a single vanishing moment (two filter coefficients).
◦ Symmlets: The Symmlet wavelets also have minimum-size
support for a given number of vanishing moments, but they are as symmetrical
as possible, as opposed to the Daubechies filters, which are highly asymmetrical.
In deconvolving gene expression profiles, we employ Symmlets instead of the
popular Daubechies wavelets because of this property of high symmetry.
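To make the role of the wavelet basis concrete, the following is a minimal sketch of an orthonormal Haar transform in plain Python. It is an illustration only, not the implementation used in this work; the function names and the two test signals are hypothetical. It shows why a step-like signal (such as a budded/unbudded profile) has a small $l_1$ norm in the Haar basis.

```python
import math

def haar_dwt(signal):
    """Full orthonormal Haar transform of a sequence whose length is a power of two."""
    approx = [float(x) for x in signal]
    details = []  # detail coefficients, coarsest scale first
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        level_detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        details = level_detail + details
    # One scaling coefficient followed by all detail coefficients.
    return approx + details

def l1_norm(values):
    return sum(abs(v) for v in values)

# Hypothetical signals: a two-state step and a linear ramp.
step = [0, 0, 0, 0, 1, 1, 1, 1]
ramp = [0, 1, 2, 3, 4, 5, 6, 7]

step_coeffs = haar_dwt(step)
ramp_coeffs = haar_dwt(ramp)
step_nonzero = sum(1 for c in step_coeffs if abs(c) > 1e-9)
ramp_nonzero = sum(1 for c in ramp_coeffs if abs(c) > 1e-9)
print(step_nonzero, ramp_nonzero, l1_norm(step_coeffs), l1_norm(ramp_coeffs))
```

The step needs only two nonzero Haar coefficients, whereas the ramp needs all eight, so an $l_1$ penalty on Haar coefficients favors piecewise-constant estimates. A wavelet with more vanishing moments would represent the ramp sparsely instead.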
2.5 Selecting a regularization parameter
A critical step in deconvolving cell-cycle time-series data is to identify a good
regularization parameter γ. A reasonable choice of γ avoids both over-fitting and
over-smoothing and gives biologically interpretable deconvolved estimates. To this end,
as illustrated in Fig. 2.2, we first determined a region of γ that represents a
reasonable trade-off between the goodness-of-fit term (e.g., $\|Hf - g\|_2^2$ in Eq. (2.1)
or $\left\| \frac{Hf}{g} - 1 \right\|_2^2$ in Eq. (2.3)) and the smoothness term ($\|fW\|_1$). Next, we selected
a target-specific optimal regularization parameter γ by locating the point of maximum
curvature on the L-curve (Hansen, 1992) within this region.
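The elbow search can be sketched as below. This is an assumption-laden illustration rather than the procedure used here: it approximates curvature with the discrete Menger curvature of three consecutive points on the log-log curve, and the function names and the sample values of the fit and smoothness terms are hypothetical.

```python
import math

def menger_curvature(p, q, r):
    """Curvature of the circle through three 2-D points (0 if they are collinear)."""
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    sides = math.dist(p, q) * math.dist(q, r) * math.dist(r, p)
    return 0.0 if sides == 0 else 2.0 * abs(cross) / sides

def pick_elbow(gammas, fit_terms, smooth_terms):
    """Return the gamma at the interior point of maximum discrete curvature
    on the log-log curve of fit term versus smoothness term."""
    points = [(math.log(a), math.log(b)) for a, b in zip(fit_terms, smooth_terms)]
    interior = range(1, len(points) - 1)
    best = max(interior,
               key=lambda i: menger_curvature(points[i - 1], points[i], points[i + 1]))
    return gammas[best]

# Hypothetical values: the fit term grows and the smoothness term shrinks with gamma.
gammas = [0.001, 0.002, 0.003, 0.004, 0.006, 0.008, 0.010]
fit_terms = [1.0, 1.01, 1.02, 1.05, 2.0, 4.0, 8.0]
smooth_terms = [8.0, 4.0, 2.0, 1.05, 1.02, 1.01, 1.0]

gamma_hat = pick_elbow(gammas, fit_terms, smooth_terms)
print(gamma_hat)
```

In practice each pair of terms would come from one deconvolution run at the corresponding γ, restricted to the region identified in the first step.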
2.6 Joint learning from multiple replicates
Our convolution kernel design allows us to learn a single robust transcription profile
jointly from multiple experimental replicates. For example, in the case of two
replicates, we can construct convolution kernels $H_1$ and $H_2$ for the two replicates using
their respective cloccs parameter estimates. To ensure the matrices refer to the
same points along the branching process, we should use the same number of
subintervals on the various cell-cycle branches when constructing both $H_1$ and $H_2$. In
this manner, corresponding columns in $H_1$ and $H_2$ represent the same fractional
population estimate for the same subinterval along the cell-cycle branches under
the two experimental conditions. Then, we can construct a joint convolution kernel
$H_J = [H_1^T \; H_2^T]^T$, where the superscript $T$ denotes the transpose. Similarly, we can construct a joint
population-level time-series $g_J = [g_1^T \; g_2^T]^T$ for a target with two replicates. Although
we have two replicates, we only need to learn a single deconvolved profile $f$: we
simply replace $g$ and $H$ within the objective function with $g_J$
and $H_J$, respectively. Generally, this jointly learned $f$ is more robust and accurate
than the $f$ learned from a single experiment.

Figure 2.2: Selection of a regularization parameter γ. First, a region of γ that represents a reasonable trade-off between the goodness-of-fit term and smoothness term is identified (gray region). Next, a target-specific optimal regularization parameter γ is selected within this region by calculating the maximum curvature on the L-curve (Hansen, 1992) (red triangle).
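The stacking of replicates can be illustrated with a small sketch. The kernels, measurements, and candidate profile below are made-up toy values, and the helper names are hypothetical; the point is only that stacking rows makes the joint squared fit term equal to the sum of the per-replicate terms.

```python
def stack_replicates(kernels, series):
    """Stack per-replicate kernels (lists of rows) and time series row-wise."""
    n_columns = len(kernels[0][0])
    for kernel, g in zip(kernels, series):
        if any(len(row) != n_columns for row in kernel):
            raise ValueError("all kernels must share the same subintervals (columns)")
        if len(kernel) != len(g):
            raise ValueError("each kernel needs one row per measurement")
    joint_kernel = [row for kernel in kernels for row in kernel]
    joint_series = [value for g in series for value in g]
    return joint_kernel, joint_series

def squared_residual(kernel, f, g):
    """Squared l2 norm of (kernel @ f - g) using plain lists."""
    return sum((sum(h * x for h, x in zip(row, f)) - y) ** 2
               for row, y in zip(kernel, g))

# Toy replicates: 2 time points each, 3 shared subintervals.
H1 = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]]
H2 = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
g1 = [1.0, 2.0]
g2 = [1.2, 2.1]
f = [0.5, 2.0, 3.0]  # one candidate single-cell profile

HJ, gJ = stack_replicates([H1, H2], [g1, g2])
joint = squared_residual(HJ, f, gJ)
separate = squared_residual(H1, f, g1) + squared_residual(H2, f, g2)
print(len(HJ), joint, separate)
```

Because the columns of both kernels index the same subintervals, a single $f$ is scored against both replicates at once.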
3
Deconvolution of wild-type cell-cycle transcriptional profiles of budding yeast
In the previous chapter, we introduced the general framework of our
deconvolution algorithm. To demonstrate the usefulness of our method, we applied it
to a recent cell-cycle transcription time course in the eukaryote Saccharomyces
cerevisiae (Orlando et al., 2008). The input data are genome-wide cell-cycle transcriptional
profiles at a temporal resolution of 16 minutes, and the output is jointly learned
transcription profiles at a nominal temporal resolution of less than one minute, with
distinct transcription programs learned for mother and daughter cells. In this chapter,
we present various analyses of the resultant deconvolved transcriptional
profiles to characterize the performance of our deconvolution algorithm.
3.1 Experimental data
We apply our deconvolution algorithm to learn single-cell transcription profiles jointly
from two independent replicates of cell-cycle synchrony experiments in wild-type
budding yeast (Orlando et al., 2008). The experiments collected populations of
synchronized early G1 cells by centrifugal elutriation. Two wild-type time-series replicates
were collected with 15 samples taken at 16-minute intervals in each, starting 30
minutes after release in the first replicate and 38 minutes after release in the second.
Both replicates covered approximately two complete cell cycles. For each replicate,
both budding index and flow cytometry data were collected 32 times at 8-minute
intervals, starting 30 minutes after release (Orlando et al., 2009). Budding index
was measured by light microscopy, recording the number of budded and unbudded
cells observed out of at least 200 cells. The DNA content of 10,000 cells per
sample was measured by flow cytometry as described by Haase and Reed (1999). We
downloaded the mRNA expression datasets from
http://www.biology.duke.edu/haaselab/publicData/index.html; for genes with multiple probes, we averaged
the transcript levels across the probes. Consequently, we were left with measured
transcription profiles of 5,670 unique genes.
3.2 Branching process model and cell-cycle parameters
3.2.1 Branching process model
In deconvolving wild-type cell-cycle transcriptional profiles, we decompose the full
branching process of cloccs into four kinds of intervals (Fig. 3.1A): R (recovery)
represents the interval immediately following release from synchrony, during which
initial cells recover from the synchrony protocol; G1 and DG1 (daughter-specific
G1) represent G1 phases of mother and daughter cells, respectively; and postG1
represents the interval immediately following G1 or DG1, during which mother and
daughter cells progress through S, G2, and M. According to this model, after syn-
chrony release, cells progress through the R interval before entering a standard cell
cycle (G1 followed by postG1). At the end of the first cycle, cells divide into mother
and daughter cells; mother cells enter another standard cell cycle, while newborn
daughter cells instead traverse DG1 before entering postG1. Every time a cell
divides, a new branch appears and this process repeats.

[Figure 3.1 image: (A) diagram of the branching process with intervals R, G1, S, G2/M, and DG1; (B) measured population-level CLN2 transcription profile g (k time points; transcript level versus time in minutes), convolution kernel H (cell-cycle distribution), and unknown single-cell transcription profile f (n time points).]
Figure 3.1: Overview of deconvolution algorithm. (A) Branching process in deconvolution. The full branching process is split into four kinds of intervals, R, G1, DG1, and postG1 (including S and G2/M phases). (B) Deconvolution is formulated as an ill-posed discrete inverse problem g = H × f, in which g is a column vector containing the measured population-level time-series data, and here the real transcription profile of the G1 cyclin CLN2 is plotted; H is the convolution kernel calculated from cloccs parameters; and f is a column vector representing the components of the unknown dynamic profile of an average individual cell. After deconvolution, we can learn smooth estimates for the four components of f, corresponding to the intervals R, G1, postG1, and DG1; we consistently color the intervals R, G1, postG1, and DG1 in red, blue, orange, and cyan respectively throughout this chapter.
3.2.2 Cell-cycle parameters from cloccs
Given the specified branching process model, we exploit cloccs to learn the
cell-cycle parameters. Two types of data are available for cloccs: budding index data
and flow cytometry data. For deconvolving all measured transcription profiles, we
applied cloccs to learn cell-cycle parameters from both flow cytometry and budding
index (Orlando et al., 2009). The parameters learned only from flow cytometry were
used for deconvolving budding index profiles, because deconvolving budding index
profiles with the aid of parameters learned from those profiles would produce overly
optimistic estimates of deconvolution performance. The learned cell-cycle parameters
are listed in Table 3.1.

Table 3.1: Cell-cycle parameters estimated by cloccs from flow cytometric measurements of DNA content and budding index.

                                     Flow and budding      Flow only
Cell-cycle parameter                 WT1       WT2         WT1       WT2
length of R (minutes)                94.387    101.904     94.279    101.954
length of C (minutes)                79.487    82.014      79.647    81.965
length of DG1 (minutes)              44.318    37.436      44.326    37.425
length of G1 (fraction of C)         0.153     0.165       -         -
length of G1+S (fraction of C)       0.349     0.391       0.349     0.391
length of G2+M (fraction of C)       0.651     0.609       0.651     0.609
3.3 Deconvolution model
3.3.1 Deconvolution objective function
According to the branching process model, we can split each gene’s single-cell
transcription profile $f$ into four distinct blocks as $f = [f_{\mathrm{R}} \; f_{\mathrm{G1}} \; f_{\mathrm{DG1}} \; f_{\mathrm{postG1}}]$, representing
the transcription profile during subintervals R, G1, DG1, and postG1, respectively.
We expect that the estimated profile $[f_{\mathrm{R}} \; f_{\mathrm{G1}} \; f_{\mathrm{postG1}}]$ should be smooth since it prevails
during the cell-cycle progression of initial cells, and similarly the profile $[f_{\mathrm{DG1}} \; f_{\mathrm{postG1}}]$
should be smooth since it prevails during the cell-cycle progression of daughter cells.
Then the objective function is
$$\arg\min_f \; \left\| \frac{Hf}{g} - 1 \right\|_2^2 + \gamma \left( \left\| [f_{\mathrm{R}} \; f_{\mathrm{G1}} \; f_{\mathrm{postG1}}] W_1 \right\|_1 + w \left\| [f_{\mathrm{DG1}} \; f_{\mathrm{postG1}}] W_2 \right\|_1 \right) \qquad (3.1)$$
where $\|\cdot\|_1$ and $\|\cdot\|_2$ respectively denote the $l_1$ and $l_2$ norms, γ is a regularization control
parameter, $W_1$ and $W_2$ are orthonormal wavelet-basis matrices, and $w$ scales
the two regularization terms to account for the different lengths of the intervals
they cover; we always set $w = 1.5$ because the amount of time spent in R + G1 +
postG1 (regularized by $W_1$) is roughly 1.5 times as long as the amount of time spent
in DG1 + postG1 (regularized by $W_2$). Here, we require $f \ge 0$ for deconvolving
microarray transcription data, because the actual transcript levels are always
non-negative, and we select Symlet (N = 5) wavelets because of their smoothness and
symmetry properties. When deconvolving budding index data, we instead use the
objective function
$$\arg\min_f \; \|Hf - g\|_2^2 + \gamma \left( \left\| [f_{\mathrm{R}} \; f_{\mathrm{G1}} \; f_{\mathrm{postG1}}] W_1 \right\|_1 + w \left\| [f_{\mathrm{DG1}} \; f_{\mathrm{postG1}}] W_2 \right\|_1 \right) \qquad (3.2)$$
and we require $f \in [0, 1]$ because the fraction of budded cells is always between
0 and 1, and use Haar wavelets because of their step-function properties. In each
case, we performed constrained optimization of the respective convex function using
the MATLAB convex optimization package CVX, version 1.2 (Grant and Boyd, 2008,
2010).
3.3.2 Constructing a convolution kernel
cloccs enables us to determine the cell-cycle distribution of a cell population at
any given time and to estimate the fraction of cells within any given cell-cycle
subinterval. Using the cell-cycle position distributions from cloccs (learned parameters
characterizing these distributions for each experiment are listed in Table 3.1), we
can construct a convolution kernel $H \in \mathbb{R}^{t \times n}$, where $t$ denotes the number of
time-series observations in the population-level measurements $g$, and $n$ denotes the total
number of subintervals along the various cell-cycle branches. Specifically, the entry $h_{ij}$ of $H$
quantifies the fraction of cells within a given subinterval $j$ at a given time $i$. To
achieve high temporal resolution, $n$ is chosen much larger than $t$. In our
case, $t = 15$ and $n = 258$ since we used a total of 258 subintervals for deconvolving
transcription profiles: R has 88, G1 has 42, DG1 has 86, and postG1 has 42. In
implementation, we used padding entries and mirror-reflections in both directions of
$f$ to remove the edge effects caused by circular wavelet packets (Lord and Wheals,
1981; Mallat, 2008).
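As a rough illustration of what a convolution kernel does, the sketch below builds a toy kernel and applies it to a square-wave single-cell profile. It is not the cloccs model: it assumes a single cyclic branch with no mother/daughter branching, and a hypothetical Gaussian position distribution whose spread grows linearly with time. All names and parameter values are made up for illustration.

```python
import math

def toy_kernel(times, n_bins, cycle_minutes, sigma0, sigma_rate):
    """Rows are hypothetical cell-cycle position distributions, one per time point.

    Entry [i][j] is the fraction of cells in subinterval j at time times[i]."""
    kernel = []
    for t in times:
        center = ((t / cycle_minutes) % 1.0) * n_bins
        sigma = sigma0 + sigma_rate * t  # spread grows as synchrony is lost
        weights = []
        for j in range(n_bins):
            d = abs(j + 0.5 - center)
            d = min(d, n_bins - d)  # wrap around the cycle
            weights.append(math.exp(-0.5 * (d / sigma) ** 2))
        total = sum(weights)
        kernel.append([w / total for w in weights])
    return kernel

def convolve(kernel, f):
    """Population-level series g = H f using plain lists."""
    return [sum(h * x for h, x in zip(row, f)) for row in kernel]

n_bins = 40
times = list(range(0, 161, 20))  # minutes after release
H = toy_kernel(times, n_bins, cycle_minutes=80.0, sigma0=2.0, sigma_rate=0.05)
f = [0.0] * (n_bins // 2) + [1.0] * (n_bins // 2)  # square-wave single-cell profile
g = convolve(H, f)
print([round(v, 3) for v in g])
```

Each row of the toy kernel sums to one, and the peak at 140 minutes is lower than the peak at 60 minutes even though the single-cell profile is a perfect square wave, which mimics the damping seen in population-level measurements.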
3.3.3 Selecting a regularization parameter
As described in the previous chapter, to select a good regularization parameter γ for
each gene (or budding index) that avoids both over-fitting and over-smoothing, we
first determined a region of γ that represents a reasonable trade-off between the
fit term ($\|Hf - g\|_2^2$ in Eq. (3.2) or $\left\| \frac{Hf}{g} - 1 \right\|_2^2$ in Eq. (3.1)) and the smoothness
term ($\|[f_{\mathrm{R}} \; f_{\mathrm{G1}} \; f_{\mathrm{postG1}}] W_1\|_1 + w \, \|[f_{\mathrm{DG1}} \; f_{\mathrm{postG1}}] W_2\|_1$). Next, we selected a gene-specific
optimal regularization parameter γ by calculating the maximum curvature on the
L-curve (Hansen, 1992) within this region. Precise details for selecting the regularization
parameter γ are given in Fig. 3.2.
Input: Observed transcription profile g
Output: Regularization parameter γ̂ and deconvolved transcription profile f
Deconvolution(g, γ ← 0, . . .) ⟹ ε0            ▷ determine ε0 as the best-fit estimator (no smoothing)
εl ← min(ε0 × φl, ε0 + εl)                     ▷ the left fit error boundary εl
BinarySearch(ε ← εl, γ ∈ [0.001, 0.01]) ⟹ γl   ▷ search for the left boundary of γ
εr ← max(ε0 × φr, ε0 + εr)                     ▷ the right fit error boundary εr
BinarySearch(ε ← εr, γ ∈ [γl, 0.01]) ⟹ γr      ▷ search for the right boundary of γ
FindElbow(γ ∈ [γl, γr]) ⟹ γ̂                    ▷ determine γ̂ at the elbow of the L-curve
Deconvolution(g, γ ← γ̂, . . .) ⟹ f             ▷ deconvolve using γ̂, determine f
Figure 3.2: Detailed algorithm for selecting a regularization parameter γ. We set the fit error boundaries φl = 1.05, εl = 0.04, φr = 1.40, and εr = 0.32.
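The two boundary searches in Fig. 3.2 can be sketched as a bisection, assuming the fit error does not decrease as γ increases. The function `toy_fit_error` is a hypothetical stand-in for running a deconvolution at a given γ and returning its fit error; only the four boundary constants are taken from Fig. 3.2.

```python
def bisect_gamma(fit_error, target, low, high, tol=1e-7):
    """Find gamma in [low, high] where a non-decreasing fit_error reaches target."""
    while high - low > tol:
        mid = 0.5 * (low + high)
        if fit_error(mid) < target:
            low = mid   # fit still better than the target: allow more smoothing
        else:
            high = mid  # fit already at or beyond the target: back off
    return 0.5 * (low + high)

def toy_fit_error(gamma):
    """Hypothetical stand-in for 'deconvolve with this gamma, return the fit error'."""
    return 1.0 + 5000.0 * gamma ** 2

# Boundary constants from Fig. 3.2; offset_l and offset_r are the additive terms.
phi_l, offset_l, phi_r, offset_r = 1.05, 0.04, 1.40, 0.32

eps0 = toy_fit_error(0.0)                       # best fit, no smoothing
eps_left = min(eps0 * phi_l, eps0 + offset_l)   # left fit error boundary
eps_right = max(eps0 * phi_r, eps0 + offset_r)  # right fit error boundary

gamma_left = bisect_gamma(toy_fit_error, eps_left, 0.001, 0.01)
gamma_right = bisect_gamma(toy_fit_error, eps_right, gamma_left, 0.01)
print(gamma_left, gamma_right)
```

The elbow search would then be run only over γ values between `gamma_left` and `gamma_right`.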
3.3.4 Adjustment of branching process construction from cloccs
The branching process model in our deconvolution algorithm would be identical to
that of the original cloccs branching process if mother and daughter cells separated
immediately upon the completion of mitosis and cytokinesis. In budding yeast,
however, mother and daughter cells remain attached to one another for a period of
time after cytokinesis, until the cell walls can be enzymatically detached (Kuranda
and Robbins, 1991). During this time, although the cells have distinct cytoplasmic
compartments and may be executing distinct transcription programs, they appear
under a microscope to be a single budded cell, which is how they are counted for
the purposes of estimating parameters in the original cloccs branching process.
When producing transcription profiles, we need to shift the branching times in our
branching process by a suitable duration to compensate.
To estimate the duration of this attachment period, we use as biomarkers four
genes DSE1-4 (Daughter-Specific Expression 1-4) known to have daughter-specific
transcription profiles. These are specifically transcribed in the daughter cell early
in the cell cycle (Colman-Lerner et al., 2001). We calibrate the duration of the at-
tachment period to be the smallest duration such that the deconvolved transcription
profiles of all four genes are primarily within DG1. The resultant durations for the
two wild-type replicate experiments are 26 and 27 minutes, respectively; in each case,
the duration is around 1/3 of the cell cycle of mother cells and 1/5 of the cell cycle
of daughter cells.
3.4 Results
3.4.1 Deconvolving time-series yeast budding index data to assess algorithm accuracy
Perhaps the most important feature of a deconvolution algorithm is the accuracy of
its resultant estimates. To assess the accuracy of our method, we first deconvolve
measurements of budding index because the true single-cell budding profile is known
and thus provides a clear basis for evaluation: yeast cells produce a bud near the start
of S phase and remain budded until the end of M phase. Although each wild-type
cell is either budded or not budded, time-course budding index measurements appear
like damped sinusoids due to synchrony loss in the population over time (Fig. 3.3A,
left).

[Figure 3.3 image: (A) budding index (% of cells budded versus time in minutes) for WT1 and WT2, with the deconvolved budding profile over R, G1, S, G2/M, and DG1; (B) normalized transcript level versus time in minutes for PCL1, CDC20, SSK22, and SIC1 in WT1 and WT2, with deconvolved mother and daughter profiles.]
Figure 3.3: Deconvolution recovers dynamic single-cell profiles from population-level data. (A) Joint deconvolution of replicate budding index measurements. The left panel illustrates the two replicate wild-type budding index measurements in red, along with the fit to those time series learned by our algorithm overlaid in green. The right panel shows the deconvolved budding profile, learned jointly from the two replicates. The true budding profile is shown as a dashed line for comparison (r² = 0.99). (B) Joint deconvolution of replicate transcription profiles for four representative genes. Shown for each gene are two replicate measured transcription profiles in red, the fit to those time series learned by our algorithm overlaid in green, and separate deconvolved transcription profiles for mother and daughter cells. To facilitate cross-comparison, all transcription profiles are normalized so that their maximum levels are the same height; consequently, the increased amplitude produced by deconvolution is not apparent. The cyclin PCL1 peaks late in both G1 and DG1, the APC activator CDC20 peaks during mitosis, and the CDK inhibitor SIC1 is transcribed primarily during DG1. For genes whose two replicate profiles are in poor agreement, such as the MAP kinase SSK22 (Pearson correlation 0.14), our algorithm removes apparent noise; the resultant deconvolved profile smoothly traces the broad trajectory of measured transcript levels across both replicates.
We used cloccs parameters learned only from flow cytometry data (Orlando
et al., 2009) (i.e., without budding index data) to ensure fair assessment of our algo-
rithm’s accuracy. When the two observed population-level budding index measure-
ments are jointly deconvolved, our algorithm predicts the true single-cell budding
profile nearly perfectly: the originally measured damped sinusoids become square
waves with onset near the start of S and offset near the end of M, as desired (Fig. 3.3A,
right).
3.4.2 Deconvolving replicate yeast microarray data to reveal single-cell transcription profiles
Reassured by the performance of our algorithm on budding index data, we jointly
learned deconvolved transcription profiles from two replicate cell-cycle time-course
microarray experiments in budding yeast (Orlando et al., 2008). Our decision to
keep G1 distinct from DG1 allowed us to capture possibly different transcription
programs for mother and daughter cells during G1. However, because both mother
and daughter cells subsequently enter a single postG1 interval, our model assumes
that both kinds of cells share a common transcription program in cell-cycle phases
after G1. The examples, as shown in Fig. 3.3B and Fig. 3.4, highlight the ability
of our deconvolution algorithm to not only sharpen transcription signal, but also
smooth out experimental noise.
[Figure 3.4 plots: normalized transcript level (0 to max) versus time (0 to 300 min) for the measured replicates WT1 and WT2, and across G1, S, G2/M (mother) and DG1, S, G2/M (daughter) for the deconvolved profiles.]
Figure 3.4: Deconvolution is capable of de-noising. Normalized transcription profiles of 129 ribosomal protein genes before and after deconvolution. The median transcription profile in each case is overlaid in red. The average of the 129 peak-to-trough ratio (PTR) scores, which measure the degree of amplitude in expression (defined in Section 3.4.5), decreased from 1.027 to 1.018 after deconvolution, suggesting that our deconvolution algorithm is effective at not sharpening noise.
3.4.3 Deconvolution is robust with respect to uncertainty in input cloccs parameters
One potential concern about the output of our algorithm is that because it relies on
posterior mean estimates of parameters from cloccs, its output might be sensitive
to uncertainty in those parameter estimates. To assess this, we generated a set of 100
deconvolved profiles using 100 random realizations from the cloccs Markov chain,
rather than using the single posterior mean parameterization. Specifically, to obtain
these 100 random parameterizations, we ran 10 independent cloccs Markov chains
with 100,000 iterations after a lengthy burn-in period. Then, we randomly selected
10 parameter estimates from the last 1,000 iterations of each of the 10 Markov chains,
resulting in 100 random parameterizations. These 100 random realizations reflect our
posterior uncertainty about the cloccs parameters used as input; differences in the
resulting 100 outputs reflect our posterior uncertainty in a deconvolved profile with
respect to the posterior uncertainty of cloccs.
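The sampling scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the chains here are toy lists of (chain id, iteration) pairs rather than real cloccs parameter vectors, the function name `sample_parameterizations` is hypothetical, and sampling without replacement is assumed because the text does not specify it.

```python
import random

def sample_parameterizations(chains, tail=1000, per_chain=10, seed=0):
    """Pick `per_chain` random states from the last `tail` iterations of each chain."""
    rng = random.Random(seed)
    selected = []
    for chain in chains:
        # Restrict to the final `tail` iterations, then sample without replacement
        # (an assumption; the text does not say whether draws were with replacement).
        selected.extend(rng.sample(chain[-tail:], per_chain))
    return selected

# Toy stand-in for cloccs output: 10 chains whose "states" are (chain id, iteration)
# pairs. Real chains would hold parameter vectors and run for 100,000 iterations.
toy_chains = [[(c, i) for i in range(5000)] for c in range(10)]
parameterizations = sample_parameterizations(toy_chains)
```

Each of the resulting 100 parameterizations would then be passed to the deconvolution step in place of the single posterior mean.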
For each gene, we then overlaid the 100 deconvolved profiles generated with 100
different cloccs parameterizations on top of one another to form a composite tran-
scription profile. Composite profiles for four representative genes whose transcripts
peak at different times in the cell cycle are shown in Fig. 3.5A. The posterior un-
certainty is so minimal that the 100 different profiles in each composite are nearly
identical, though the composite profile for DSE3 exhibits slightly higher uncertainty
in the middle of DG1. Non-uniform sampling (collecting data more frequently later
in the time course when synchrony loss has accumulated significantly) could perhaps
be employed in the future to ensure that profiles are equally certain in all intervals
of the cell cycle. Nevertheless, even with the uniformly-sampled data used here, our
deconvolution algorithm is robust enough to the posterior uncertainty in cloccs
parameter estimates that the profiles generated from 100 different parameterizations
are essentially indistinguishable. To further explore the robustness of deconvolved
profiles with respect to uncertainty in the input cloccs parameters, Fig. 3.6 shows
overlaid deconvolved transcription profiles, generated with different cloccs
parameterizations, for genes with different degrees of amplitude in expression.
3.4.4 Deconvolution increases temporal resolution and precision of transcription profiles
One particularly compelling property of a good deconvolution algorithm is the in-
creased temporal resolution of its estimates; for example, although the microarray
data used in this paper were collected at 16 minute intervals, our deconvolved tran-
scription profiles have a nominal temporal resolution of less than one minute. How-
ever, this is by construction; a more meaningful question is, what is the ‘effective
temporal resolution’ of our deconvolved profiles?
To quantitatively estimate the robustness of temporal resolution after deconvolu-
tion to experimental noise, we added random multiplicative noise to the input profile.
Specifically, for each input profile $g = (g_1, \ldots, g_t)$, we added multiplicative Gaussian
[Figure 3.5 plots: panels A and B show normalized transcript level (0 to max) across G1, S, G2/M (mother) and DG1, S, G2/M (daughter) for CLN1, NDD1, ACE2, and DSE3; panel C shows timing difference of transcript peaks (min, 0 to 8) versus level of added noise (as a % of observed measurement, 5 to 20).]
Figure 3.5: Deconvolved profiles are robust to uncertainty in inputs. (A) Robustness of deconvolved profiles with respect to uncertainty in cloccs parameter estimates. Shown are 100 overlaid deconvolved transcription profiles for the G1 cyclin CLN1, the S-phase transcriptional activator NDD1, the transcriptional activator ACE2 expressed late in the cell cycle to drive early G1 transcription in a daughter-specific manner, and the daughter-specifically expressed DSE3. The 100 deconvolved transcription profiles for each gene were produced using 100 different cloccs parameterizations, each a random realization from the cloccs Markov chain. The most noticeable uncertainty in the deconvolved profiles seems to be for DSE3 in the middle of DG1, but even this uncertainty is minimal. More examples are shown in Fig. 3.6. (B) Robustness of deconvolved profile with respect to uncertainty in measured input transcription profiles. Shown are 100 overlaid transcription profiles for CLN1, NDD1, ACE2 and DSE3. The 100 deconvolved transcription profiles for each gene were produced by deconvolving 100 noise-injected (10% level) measured transcription profiles. (C) Effective temporal resolution of deconvolved profiles as a function of measurement noise. The x-axis indicates the average level of random multiplicative noise added to input transcript levels at every point in the time-series. Box-plots display the distribution of timing differences (unsigned) between the transcription peaks of deconvolved profiles with and without noise added. Gray boxes indicate interquartile ranges, heavy black bars indicate median values, and small red squares indicate mean values.
[Figure 3.6 plots: normalized transcript level (0 to max) across G1, S, G2/M (mother) and DG1, S, G2/M (daughter), with PTR ranks in parentheses. Panel A: DSE3 (27), PRY2 (1), ACE2 (30), NDD1 (151). Panel B: TEC1 (601), MCM6 (503), MYO2 (662), HHT2 (513). Panel C: SIW14 (1256), EMP24 (1235), APC1 (1042), SPC24 (1061). Panel D: SIN4 (2617), DID2 (1677), APC9 (2546), HNT2 (2100).]
Figure 3.6: More examples of the robustness of deconvolved profiles with respect to uncertainty in cloccs parameter estimates. Shown are 100 overlaid deconvolved transcription profiles for randomly selected genes with high PTR scores (ranked in the top 500 by PTR scores; panel A), medium PTR scores (ranked in 501-1,000; panel B), and low PTR scores (ranked in 1,001-1,500; panel C), with transcription peaks in G1, S, G2/M, and DG1, respectively. Panel D illustrates four genes with insignificant PTR scores (ranked below 1,501). The 100 deconvolved transcription profiles for each gene were produced using 100 different cloccs parameterizations, each a random realization from the cloccs Markov chain. The numbers in parentheses indicate the ranks of genes by PTR scores. Here, PTR indicates peak-to-trough ratio, a scoring scheme we used to quantify the degree of amplitude of gene expression. More details are given in Section 3.4.5.
noise at every time point, such that $g'_i = g_i \times (1 + \varepsilon_i)$, where $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Fig. 3.5B
shows the 100 overlaid deconvolved transcription profiles for the four characteristic
genes with noise injected at σ = 10%.
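The noise-injection step can be illustrated with a short sketch. The helper name `add_multiplicative_noise` and the toy profile are hypothetical; only the noise model itself (independent zero-mean Gaussian perturbations with standard deviation σ, applied multiplicatively at every time point) is taken from the text.

```python
import random

def add_multiplicative_noise(profile, sigma, rng):
    """Return a copy of `profile` with each level g_i replaced by g_i * (1 + eps_i),
    where eps_i is drawn independently from a normal distribution N(0, sigma^2)."""
    return [g * (1.0 + rng.gauss(0.0, sigma)) for g in profile]

rng = random.Random(1)
profile = [1.0, 2.0, 4.0, 2.0, 1.0]  # toy measured transcript levels
# 100 noise-injected copies at sigma = 10%, mirroring the setup of Fig. 3.5B.
noisy_profiles = [add_multiplicative_noise(profile, 0.10, rng) for _ in range(100)]
```

Because the noise is multiplicative, the absolute perturbation scales with the measured level, so a level of zero stays at zero.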
We further assess effective temporal resolution by determining how much the
timing of a profile changes as varying levels of noise are added to the input data.
This yields a measure of the robustness of timing information to noise in the data.
The simplest means of determining how much the timing of a profile changes is to
focus on how much the timing of the peak shifts, especially since the peak is typically
the most salient feature in a deconvolved profile. We therefore assessed how much
peak timing shifted—whether earlier or later (using unsigned timing differences)—as
varying amounts of multiplicative noise were added to the input data.
In doing so, we selected as our benchmark the 100 genes with the most significant
amplitude in expression before deconvolution (ranked by peak-to-trough ratio, as
described in the next section), since for these genes, the peaks in the deconvolved
transcription profiles are usually easy to ascertain. We say the mother and daughter
peaks of a deconvolved transcription profile occur where the transcript levels in the
mother and daughter cell-cycle intervals are maximal. If the maximal level in one of
those intervals is at least twice as high as that in the other interval, we define this to
be the dominant peak; otherwise, we say the profile contains two dominant peaks. For
each of these genes, at each noise level, we generated 10 noisy transcription profiles,
deconvolved these profiles, and computed the unsigned timing differences of their
dominant peaks to those of the original deconvolved profile. With 10 noisy profiles
for 100 different genes, we thus had at least 1,000 unsigned peak differences (recall,
some profiles contain two dominant peaks) at each noise level.
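The dominant-peak rule and the unsigned timing difference can be sketched as below. All function names and the dictionary layout are hypothetical. One assumption is made explicit: the branches that count as dominant are taken from the reference (noise-free) deconvolved profile, since the text does not say how a change in dominance under noise was handled.

```python
def peak_index(levels):
    """Index of the maximal transcript level in a list."""
    return max(range(len(levels)), key=lambda i: levels[i])

def dominant_branches(mother, daughter, ratio=2.0):
    """Apply the two-fold rule: one branch dominates if its maximum is at least
    `ratio` times the other's; otherwise both peaks are dominant."""
    m, d = max(mother), max(daughter)
    if m >= ratio * d:
        return ["mother"]
    if d >= ratio * m:
        return ["daughter"]
    return ["mother", "daughter"]

def unsigned_peak_shifts(reference, perturbed, times):
    """Unsigned timing differences (in the units of `times`) between dominant peaks.
    Dominance is decided on the reference profile (an assumption, see lead-in)."""
    shifts = []
    for branch in dominant_branches(reference["mother"], reference["daughter"]):
        t_ref = times[branch][peak_index(reference[branch])]
        t_new = times[branch][peak_index(perturbed[branch])]
        shifts.append(abs(t_new - t_ref))
    return shifts

# Toy example with four time points per branch.
times = {"mother": [0, 1, 2, 3], "daughter": [0, 1, 2, 3]}
reference = {"mother": [0, 1, 5, 1], "daughter": [0, 1, 2, 1]}
perturbed = {"mother": [0, 5.5, 5, 1], "daughter": [0, 1, 2, 1]}
shifts = unsigned_peak_shifts(reference, perturbed, times)
```

A profile with two dominant peaks contributes two shifts, which is why the number of peak differences per noise level can exceed 1,000.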
Although the reproducibility of our two replicate microarray experiments was
high (Orlando et al., 2008), and although it has been shown that the intrinsic noise
level in the gene expression of budding yeast is relatively low (Raser and O’Shea,
2005), we chose to examine the effects of average multiplicative noise across a broad
range, from 5% up to 20%. Across this range, the median unsigned peak timing
shift ranged from 0.0 up to 1.6 minutes, and the mean ranged from 0.6 up to 2.7
minutes (Fig. 3.5C). As one specific example, if the input replicate transcript levels
were all perturbed by an average of 10%, the timing of a peak would shift 1 minute, on
average. This indicates that the peak timing information in our deconvolved profiles
is relatively precise.
We observed that the effective temporal resolution of the deconvolved profiles
is not only related to the amount of noise in the input data, but also depends on
the time at which genes are transcribed during the cell cycle. For instance, when
adding 20% noise, while the mean shift in peak timing for all genes is 2.7 minutes
(Fig. 3.5C), it becomes 4.4 minutes for genes whose transcript levels peak late in
the cell cycle. This suggests that when collecting time-series measurements during
the cell cycle, it may again be beneficial to use non-uniform sampling, as suggested
above.
3.4.5 Deconvolution increases amplitude and dynamic range of transcription profiles
Because convolution is a form of smoothing, and deconvolution is therefore a form
of sharpening, deconvolution helps restore the dynamic range of transcript level
fluctuations whose measured levels have been dampened by the effects of convolution.
However, a serious risk of deconvolution is that it will sharpen not only the dampened
signal but also any noise in the measurements. For this reason, it is critical that the
deconvolution objective be regularized appropriately, which we have achieved in our
algorithm through use of a wavelet basis. The result is a deconvolution algorithm that
effectively sharpens signal (thereby increasing dynamic range) without sharpening
noise (Fig. 3.3B).
To assess this on a genome-wide scale, we need to quantify the dynamic range of
transcription profiles before and after deconvolution. Here, we developed a simple
peak-to-trough ratio (PTR) scoring scheme to quantitatively estimate the dynamic
range of transcription of a gene before and after deconvolution. Also, to be robust
against the influence of large or small outliers, we defined our PTR score as the
ratio between the 80th percentile and the 20th percentile of transcript levels over the
course of the cell cycle.
In detail, for a measured transcription profile (before deconvolution), the PTR
score was calculated as the ratio between the 80th percentile and the 20th percentile
of transcript levels after recovery (ignoring the R interval; from the first G1 to the end
of the time course). For a deconvolved transcription profile, we first calculated two
PTRs ($r_m$ and $r_d$) as the ratios between the 80th percentile and the 20th percentile
of the transcript levels in mother and daughter cells, respectively. The deconvolved
PTR score of a gene was then computed as the weighted geometric mean of the two:
$r = \sqrt[3]{r_m^2 r_d}$. A higher weight was placed on the PTR score from the mother because
we had slightly more confidence in the overall mother profile (more data available
for estimating the corresponding entries in f).
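The PTR computation can be sketched as follows. The function names are hypothetical, and the percentile uses linear interpolation between order statistics, which is an assumption: the text does not state which percentile convention was used.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics.
    The interpolation convention is an assumption, not taken from the text."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def ptr(levels):
    """Peak-to-trough ratio: 80th percentile divided by 20th percentile."""
    return percentile(levels, 80) / percentile(levels, 20)

def deconvolved_ptr(mother, daughter):
    """Weighted geometric mean r = (r_m^2 * r_d)^(1/3) of mother and daughter PTRs."""
    r_m, r_d = ptr(mother), ptr(daughter)
    return (r_m ** 2 * r_d) ** (1.0 / 3.0)

# Toy profiles: a fluctuating mother profile and a flat daughter profile.
mother = [1, 2, 3, 4, 5, 6]
daughter = [3, 3, 3]
score = deconvolved_ptr(mother, daughter)
```

Note that `ptr` raises `ZeroDivisionError` when the 20th percentile is zero; the scores reported in this chapter are capped at 100 to handle arbitrarily large ratios.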
PTR scores before and after deconvolution are illustrated in the density scatter-
plot of Fig. 3.7A. Two things are apparent from this scatterplot: the vast majority of
genes exhibit a noticeable increase in their PTR score (they appear above the diag-
onal), as would be expected for a deconvolution method that sharpens transcription
profiles; at the same time, owing to the wavelet-based regularization employed by
our algorithm, and in contrast to most earlier deconvolution methods (e.g., Rowicka
et al. (2007)), genes can have smoother transcription profiles after deconvolution
than before (they can appear below the diagonal).
3.4.6 Deconvolution reveals a large number of transcripts fluctuating during the cell cycle
The increased dynamic range resulting from deconvolution affords us the opportu-
nity to more sensitively identify cell-cycle-regulated transcripts, those whose levels
fluctuate significantly over the course of the cell cycle. Indeed, one nice aspect of
our PTR score is that it provides a direct measure of how significantly a transcript’s
deconvolved levels are fluctuating over the course of the cell cycle. In particular,
our model-based deconvolution and PTR score allow us to avoid the Fourier-based
[Figure 3.7 plots: panel A is a density scatterplot of deconvolved PTR score versus original PTR score (both axes from 1 to 100+), with SSK22, PCL1, CDC20, SIC1, and CLN2 labeled; panel B shows the % of cell-cycle-regulated genes identified by previous studies (Spellman, Pramila, Orlando, Intersection) versus genes ranked by deconvolved PTR score (0 to 6,000).]
Figure 3.7: Genome-wide analysis of deconvolved transcription profiles reveals a large number of transcripts fluctuating during the cell cycle. (A) Dynamic range of transcription profiles before and after deconvolution. The density scatterplot depicts PTR scores for all 5,670 transcription profiles before and after deconvolution. PTR scores above 100 are shown truncated since the PTR score can become arbitrarily large if the denominator approaches zero. Note that while most genes have increased dynamic range after deconvolution (above diagonal), some genes have decreased dynamic range (below diagonal), owing to our wavelet-based regularization. The five genes whose deconvolved transcription profiles appear in Fig. 3.3B are highlighted in blue. The dashed red line indicates the deconvolved PTR score threshold we selected to identify cell-cycle-regulated genes. (B) Recovery of previously identified cell-cycle-regulated genes in yeast. We ranked all 5,670 genes by their deconvolved PTR score. The plot shows the cumulative recall (sensitivity) of recallable genes identified as cell-cycle-regulated in previous studies. Genes with the highest 1,500 PTR scores (dashed red line) showed clear evidence of cell-cycle-regulation; these include 96% of the 440 genes identified by all three earlier studies to be cell-cycle-regulated.
periodicity analyses that have been used to identify cell-cycle-regulated genes in the
past (e.g., Spellman et al. (1998); de Lichtenberg et al. (2005)), with their attendant
limitations when applied to sparsely or irregularly sampled time-series data.
Transcripts cannot easily be categorized in a simple binary fashion as being cell-
cycle-regulated or not, since cell-cycle regulation occurs along a continuum from
strongly-regulated to weakly-regulated, as well as being condition- and strain-dependent.
For this reason, it makes more sense to simply rank genes in terms of their degree of
cell-cycle regulation, for which we used our deconvolved PTR score as a measure. To
visualize how well our deconvolved PTR score recovers genes identified in earlier stud-
ies as cell-cycle-regulated, we plotted the cumulative recall of previously identified
cell-cycle-regulated genes as a function of our deconvolved PTR rank (Fig. 3.7B).
Although the degree of cell-cycle regulation occurs along a continuum, for the
purposes of downstream analysis, we wished to establish a set of genes whose tran-
script levels exhibited a sufficiently high level of fluctuation to be clearly called cell-
cycle-regulated. We established a set of size 1,500 (corresponding to a deconvolved
PTR score ≥ 1.37, shown in Figs. 3.7A and 3.7B by a dashed red line). This set
includes 73% of the 1,271 periodic genes identified in Orlando et al. (2008), 69% of
the 895 recallable periodic genes identified in Pramila et al. (2006), 76% of the 709
recallable periodic genes identified in Spellman et al. (1998), and 96% of the 440
genes in the intersection of the three previous lists. Note that because these previous
studies made predictions without the aid of deconvolution, we should not expect to
see overwhelming agreement with any individual study.
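The cumulative recall curve of Fig. 3.7B can be sketched as below. The function name is hypothetical, and "recallable" is interpreted here as the subset of a previous study's genes that appear in our ranked list at all; that reading is an assumption.

```python
def cumulative_recall(ranked_genes, reference_set):
    """curve[k-1] is the fraction of recallable reference genes found among the
    top k genes of `ranked_genes` (ranked by deconvolved PTR score, best first)."""
    # Only reference genes present in the ranking can ever be recalled (assumption).
    recallable = set(reference_set) & set(ranked_genes)
    hits, curve = 0, []
    for gene in ranked_genes:
        if gene in recallable:
            hits += 1
        curve.append(hits / len(recallable))
    return curve

# Toy example: "z" is in the reference list but was never measured, so it is not recallable.
ranked = ["a", "b", "c", "d", "e"]
curve = cumulative_recall(ranked, {"a", "c", "z"})
```

Reading the curve at rank 1,500 gives the recall percentages quoted above for each previous study. The sketch assumes at least one recallable gene; otherwise the division fails.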
Our set of 1,500 cell-cycle-regulated genes is noticeably larger than what has
previously been identified. Its increased size can be attributed primarily to the
increased sensitivity of our deconvolved profiles, which have had the “blurring” effects
of population asynchrony removed by our algorithm. After capping deconvolved PTR
scores at 100, the PTR scores of the 1,500 periodic genes increased by a factor of 4.7
on average after deconvolution, allowing us to more sensitively identify genes with
transcript-level fluctuations during the cell cycle. Heat-maps of transcript levels for
these 1,500 cell-cycle-regulated genes before and after deconvolution are shown in
Fig. 3.8.
Though we have chosen to focus on the 1,500 genes whose transcript levels are
most strongly cell-cycle-regulated, it is evident that an even larger number of genes
may be moderately or weakly regulated over the course of the cell cycle. This raises
[Figure 3.8 heat maps: panel A shows measured WT1 data; panel B shows deconvolved mother (G1, S, G2/M) and daughter (DG1, S, G2/M) profiles; the color scale is fold change versus mean, from 1/3 to 3.]
Figure 3.8: Transcript dynamics of 1,500 most cell-cycle-regulated genes. Heat maps depict the dynamics of periodic transcripts in the measured (A) and deconvolved (B) transcription profiles of the identified 1,500 periodic genes. Corresponding rows in the various heat maps represent the same gene. Note that although our algorithm learns the deconvolved transcription profiles from two independent replicates of the measured data, only WT1 is shown in panel A for space (WT2 data is nearly identical).
the prospect that a far more significant fraction of the yeast transcriptome may be
under cell-cycle control than previously suspected.
3.4.7 Deconvolution is robust across replicates
To investigate how much the deconvolved profiles would vary if one were to learn them
only from one single replicate versus the other single replicate, we re-deconvolved
our 1,500 cell-cycle-regulated genes using only wild-type 1 (WT1) data and WT1
cloccs parameters, and again using only WT2 data and WT2 cloccs parameters
as listed in Table 3.1. To be clear, this analysis is less an assessment of our method
(and its reliance on good parameter estimates) and more an assessment of the way
in which variations in measured data result in variations in deconvolved profiles.
In addition, the analysis reveals whether or not the two replicates of input data
are consistent. If they are consistent, then the two sets of output data should be
consistent. If not, then the two sets of output data should be inconsistent.
We first show specific results for four genes in Fig. 3.9A, compared to the
reported jointly learned profiles. The deconvolved profiles are largely unaffected,
but in general, the jointly learned profile is slightly smoother, since it is based
on more data. In Fig. 3.9B, we present summary analyses and heat maps for all
1,500 genes. As shown in the figure, the separately deconvolved profiles of the two
replicates are nearly identical to each other, indicating not only that the
reproducibility of the two input datasets is high, but also that our deconvolution
algorithm is robust across replicates.
3.4.8 Deconvolution reveals fine timing of transcription programs
We have shown that our deconvolution algorithm can reliably estimate transcrip-
tion profiles at fine temporal resolution. This enables us to distinguish subtle timing
differences previously obscured in population measurements taken only every 16 min-
utes. Fig. 3.10 provides two examples: the transcription profiles of genes that play
key roles in the selection and activation of origins of DNA replication (Fig. 3.10A),
and the transcription profiles of histone genes (Fig. 3.10B).
Origins of replication are selected and activated by the ordered assembly of pro-
tein complexes on the genome at discrete stages of the cell cycle. Potential origins are
initially marked by the arrival of the origin recognition complex (ORC). During G1,
ORC then associates with Cdt1 and Cdc6 to recruit the helicase MCM complex, form-
ing the pre-replicative complex (pre-RC) and licensing potential replication origins
[Figure 3.9 plots: panel A shows normalized transcript level (0 to max) across G1, S, G2/M (mother) and DG1, S, G2/M (daughter) for ACE2, NDD1, CLN1, and DSE3, with joint, WT1, and WT2 profiles overlaid; panel B shows heat maps of mother and daughter profiles learned from WT1 and WT2, with color scale fold change versus mean, from 1/3 to 3.]
Figure 3.9: Robustness of deconvolved profiles with respect to variation across measured data replicates. (A) Shown are the deconvolved profiles of ACE2, NDD1, CLN1, and DSE3, learned from WT1 (in blue), WT2 (in green), and jointly from the two replicates (in red), respectively. (B) Heat maps depict the dynamics of periodic transcripts in the deconvolved transcription profiles of the identified 1,500 cell-cycle-regulated genes learned from WT1 and WT2, respectively.
[Figure 3.10 plots: normalized transcript level (0 to max) across G1, S, G2/M (mother) and DG1, S, G2/M (daughter). Panel A genes: MCM2, MCM3, MCM5, MCM4, MCM6, MCM7, and CDC6 (group label: MCM complex); CLB5, CLB6, DBF4, CDC7, CDC45, SLD2, SLD5, PSF1, and PSF3 (group labels: S-CDK, DDK, Cdc45 complex, Dpb11 complex, GINS complex). Panel B genes: HTA1, HTA2, HTB1, HTB2, HHT1, HHT2, HHF1/2, HHO1, and HTZ1 (histone labels: H2A, H2B, H3, H4, H2A.Z, H1).]
Figure 3.10: High temporal resolution of deconvolution reveals fine timing of transcription programs. (A) Normalized deconvolved transcription profiles of genes playing key roles in the origin-selection (top) and origin-activation (bottom) steps of DNA replication. Profiles of CDT1, MCM10, SLD3 (in the Cdc45 complex), DPB11 (in the Dpb11 complex), and PSF2 (in the GINS complex) are not shown since their deconvolved PTR scores are below our threshold for calling a gene strongly cell-cycle-regulated (none of these five are identified as cell-cycle-regulated in any previous study (Spellman et al., 1998; Pramila et al., 2006; Orlando et al., 2008) except for PSF2 in Orlando et al. (2008)). (B) Normalized deconvolved transcription profiles of histone genes in yeast. Note that the only two histone genes with somewhat distinctive profiles are the H2A.Z histone variant, which peaks later, and the H1 linker histone, whose transcript levels approach zero during DG1.
for activation. Origins are activated late in G1 by S-CDK and DDK activity, leading
to the formation of a massive protein assembly called the pre-initiation complex
(pre-IC), including the Cdc45 complex, the Dpb11 complex, and the GINS complex.
Assembly of the pre-IC eventually leads to the initiation of DNA synthesis, defining
the start of S phase (Bell and Dutta, 2002).
Fig. 3.10A makes evident that the timing of transcription of genes involved in
the selection (pre-RC) and activation (pre-IC) steps of replication is tightly regu-
lated, with transcripts of pre-RC genes peaking together early in G1 (top panel) and
transcripts of pre-IC genes peaking together later in G1 (bottom panel). The two
catalytically-distinct MCM subgroups, Mcm2-3-5 and Mcm4-6-7 (Schwacha and Bell,
2001), seem to be transcribed coordinately, especially in relation to the troughs of
each profile. Interestingly, the tight regulation evident in mother cells appears to
be relaxed in daughter cells, though it should be recalled that daughter profiles are
slightly more uncertain. Even so, the transcripts of all the pre-RC genes still peak
before the transcripts of all the pre-IC genes.
During replication, newly synthesized DNA is complexed with nucleosomes, his-
tone octamers consisting of two copies of each of the four core histones H2A, H2B, H3,
and H4 (Hereford et al., 1981). Fig. 3.10B reveals that these core histones are tran-
scribed in remarkably tight coordination, peaking precisely at the start of S phase.
In addition, we observe that in both mother and daughter cells, one histone gene
peaks distinctly later than the others: HTZ1, the replication-independent histone
variant H2A.Z, which is not assembled into nascent nucleosomes but is exchanged
for H2A in a subset of nucleosomes afterwards (Kamakaka and Biggins, 2005). The
other histone gene with a somewhat distinctive transcription profile is the H1 linker
histone HHO1 (Bustin et al., 2005), whose transcript levels uniquely approach zero
during DG1, though they peak at essentially the same time as the core histones.
3.4.9 Identifying over-represented transcription factors (TFs)
According to the deconvolved transcription profiles, we classified daughter-specific
genes and stress-response genes into several subclusters by visual inspection. We
expect that the genes with coherent transcription patterns may be regulated by
common transcription factors (TFs), and therefore certain TFs might be signifi-
cantly associated with the promoters of the genes in a given subcluster. To test
this hypothesis, we used the TF-gene regulation mappings from the YEASTRACT
database (Teixeira et al., 2006) (direct evidence only, downloaded February 2011)
to look for over-represented TFs binding to the promoters of genes within each
subcluster. To determine whether a TF is over-represented in a specified list of genes, we
calculated a p-value using a hypergeometric test, and designated it as being over-
represented if the p-value is less than or equal to 0.005. To increase the biological
significance of the identified TFs, we removed TFs that bound fewer than 3, or fewer
than 10%, of the genes in a subcluster.
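The over-representation test and the subsequent filter can be sketched as follows. The function names are hypothetical. The sketch assumes a one-sided (upper-tail) hypergeometric test and takes the background population size N as an input, since neither detail is stated in the text.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """Upper-tail hypergeometric p-value P(X >= k): the chance that at least k of
    the n genes in a subcluster are targets of a TF, when K of the N background
    genes are targets and the subcluster is drawn without replacement."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def is_over_represented(N, K, n, k, alpha=0.005, min_genes=3, min_fraction=0.10):
    """Apply the p-value cutoff plus the filter that drops TFs bound to fewer
    than 3 genes, or fewer than 10% of the genes, in the subcluster."""
    if k < min_genes or k < min_fraction * n:
        return False
    return hypergeom_pvalue(N, K, n, k) <= alpha

# Toy numbers: a TF with 100 targets among 5,670 genes (hypothetical background),
# binding 10 of the 38 genes in a subcluster.
call = is_over_represented(5670, 100, 38, 10)
```

For a subcluster of only three genes, the minimum-count filter means a TF must bind all three genes to be reported.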
3.4.10 Deconvolution reveals R-specific transcriptional program
In elutriation-based synchronization experiments, the initially collected cells—typically
small cells early in G1—are released from synchrony after experiencing significant
cold and osmotic stress. Thus, elevated transcript levels of a gene early in the time
course could arise because the gene is necessary for early G1 events, or because the
gene is part of a stress response, or both. If the former, we would expect to see high
levels of transcription again later in the time course; if the latter, we would expect
the high levels of transcription to be confined to the earliest samples of the time
course.
To identify genes whose high early transcription can be primarily attributed to
stress response, we established two criteria: the integrated transcript level of a gene in
the R interval is at least half the total across all cell-cycle branches (R, G1 + postG1,
Table 3.2: Full list of over-represented TFs in subclusters of R-specific expressed genes (Fig. 3.11).
Subcluster ID   # of genes   Over-represented TFs with p-value ≤ 0.005
1               38           Adr1, Hot1, Sko1, Msn2, Hap1, Sko2, Skn7, Pdr1, Cad1, Fkh2, Nrg1, Rtg3
2               131          Hot1, Sko1, Cad1, Adr1, Yap5, Msn2, Cin5, Sok2, Pdr1, Yap6, Ste12, Skn7
3               12           Put3, Yap5, Pho4, Gcn4, Cin5, Yap6
4               3            Rap1, Sfp1
and DG1 + postG1), and the peak transcript level in R is at least twice as high as
that in mother (G1 + postG1) or daughter (DG1 + postG1) cells. We identified
184 genes satisfying these criteria, heat maps for which are shown in Fig. 3.11.
Gene Ontology (GO) (Ashburner et al., 2000) enrichment analysis reveals that the
biological functions of many of these genes are relevant to vacuolar protein
catabolic processes ($p < 4 \times 10^{-26}$), response to temperature stimuli
($p < 10^{-16}$), response to abiotic stimuli ($p < 10^{-8}$), and similar, suggesting that these
genes are likely indeed stress-response genes, and more specifically, are responding to the
cold temperatures during elutriation. On the basis of their deconvolved transcription
profiles, we refined these genes into four subclusters according to the time at which
the profiles first drop below their mean, and looked for over-represented TFs within
the promoters of genes in each subcluster. Up to five over-represented TFs for each
subcluster are shown in Fig. 3.11. TFs that are involved in the regulation of genes
during stress or amino acid starvation (e.g., Gcn4) are labeled in red. The full list
of over-represented TFs is given in Table 3.2.
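The two criteria used above to call a gene R-specific can be sketched as below. This is a sketch under assumptions: the function name is hypothetical, the integrated level is approximated by a sum over uniformly spaced points, and the second criterion is read as requiring the R peak to be at least twice the larger of the mother and daughter peaks.

```python
def is_r_specific(r, mother, daughter, dt=1.0):
    """Check the two R-specific criteria on deconvolved levels for the R interval,
    the mother branch (G1 + postG1), and the daughter branch (DG1 + postG1)."""
    def integrate(levels):
        # Riemann-sum approximation with uniform spacing dt (an assumption).
        return sum(levels) * dt

    total = integrate(r) + integrate(mother) + integrate(daughter)
    # Criterion 1: R holds at least half of the integrated transcript level.
    half_in_r = integrate(r) >= 0.5 * total
    # Criterion 2: the R peak is at least twice the mother and daughter peaks
    # (interpreted as twice the larger of the two; an assumption).
    peak_in_r = max(r) >= 2.0 * max(max(mother), max(daughter))
    return half_in_r and peak_in_r

# Toy profiles: strongly elevated during recovery, low afterwards.
stress_like = is_r_specific([10, 8, 6], [1, 2, 1, 1], [1, 1, 2, 1])
```

A gene needed for early G1 events would typically fail the first criterion, because its transcript level rises again in later G1 or DG1 intervals.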
To measure the degree of amplitude in expression, we used a simple PTR scoring
scheme to identify cell-cycle-regulated genes. However, since PTR scores intention-
ally ignore the recovery interval to focus on the mother and daughter cell cycle, they
do not take into account transcript levels during R, which for stress response genes
[Figure 3.11 heat map: intervals R, G1, S, G2/M and DG1, S, G2/M; the color scale is fold change versus mean, from 1/3 to 3. Subcluster 1: Adr1, Hot1, Sko1, Msn2, Hap1. Subcluster 2: Hot1, Sko1, Cad1, Adr1, Yap5. Subcluster 3: Put3, Yap5, Pho4, Gcn4, Cin5. Subcluster 4: Rap1, Sfp1.]
Figure 3.11: Genes whose transcriptional levels are elevated significantly under stress. 184 out of 1500 cell-cycle-regulated genes were identified to be R-specific expressed genes. According to the time at which the profiles first drop below their mean, we refined these genes into four subclusters; up to five over-represented TFs within the promoters of genes in each subcluster are shown. TFs that are involved in the regulation of genes during stress or amino acid starvation (e.g., Gcn4) are labeled in red. The full list of the over-represented TFs is given in Table 3.2.
may be significantly elevated. In particular, 128 of the 184 genes listed here are
also included in our set of 1,500 cell-cycle-regulated genes. As can be seen here, the
transcript levels of these genes are often much higher in R than later in the cell cycle,
indicating that their transcription is not exclusively regulated during the cell cycle,
but also through varying environmental conditions and stress.
3.4.11 Deconvolution reveals a daughter-specific G1 transcription program
[Figure 3.12 plots: panel A shows normalized transcript level (0 to max) across G1, S, G2/M (mother) and DG1, S, G2/M (daughter) for ASH1, EGT2, AMN1, DSE3, DSE4, PRY3, SCW11, DSE1, DSE2, and CTS1; panel B shows heat maps (fold change versus mean, from 1/3 to 3) with subcluster labels early: Ace2, Swi5, Sok2, Phd1, Ste12; middle: Sok2, Ste12, Cin5, Yap6; late: Mac1, Tec1, Put3, Mcm1, Ste12.]
Figure 3.12: Branching process construction enables deconvolution to reveal a daughter-specific G1 transcription program. Our deconvolution algorithm explicitly learns distinct cell-cycle transcription programs for both mother and daughter cells, enabling us to explore transcriptional behavior of daughter cells that cannot be observed from the population-level transcription profiles. (A) Deconvolved transcription profiles in mother (left) and daughter cells (right) of genes previously characterized as daughter-specific in Table 1 of Colman-Lerner et al. (2001). (B) Two criteria were used to identify 82 genes transcribed primarily and almost entirely in the DG1 interval (which we call daughter-specific genes). All daughter-specific genes in panel A were identified by our criteria and thus appear in this set. According to the timing of transcription peaks in DG1, we classified these genes into 3 subclusters: early, middle, and late. Up to five over-represented TFs of each subcluster are shown. The full list of the over-represented TFs is given in Table 3.3.
Coupled with our high-resolution estimates, the explicit modeling of asymmetric
Table 3.3: Full list of over-represented TFs in subclusters of daughter-specific genes (Fig. 3.12).
Subcluster   # of genes   Over-represented TFs with p-value ≤ 0.005
Early        54           Ace2, Swi5, Sok2, Phd1, Ste12, Fkh2, Mcm1, Ash1, Fkh1, Skn7, Adr1, Tos8, Swi4, Mbp1, Dal81, Pho4, Yap5
Middle       8            Sok2, Ste12, Cin5, Yap6
Late         20           Mac1, Tec1, Put3, Mcm1, Ste12
cell division enables us to monitor and differentiate distinct mother and daugh-
ter transcription programs. For example, Table 1 of Colman-Lerner et al. (2001)
identified a set of genes that are transcribed in daughter-specific early G1, and sug-
gested that this daughter-specific transcription may, in part, be due to Cbk1/Mob2-
dependent activation and localization of the Ace2 transcription factor to the daughter
cell nucleus. As shown in Fig. 3.12A, our deconvolution algorithm not only correctly
predicts the transcription of these genes as daughter-specific, but also provides a
finely timed view of relevant events in late mitosis and early G1 that are not evident
in the population-level transcription profiles. We observe four distinct sets of
transcription dynamics: 1) ASH1 is transcribed to peak levels first, but is also degraded
first; 2) EGT2, AMN1, and DSE3 transcript levels rise very closely on the heels of
ASH1, but degrade more slowly; 3) DSE4, PRY3, and SCW11 transcript levels begin
to rise at a similar time, but reach their peaks more slowly; and 4) DSE1, DSE2,
and CTS1 transcript levels begin to rise noticeably later and peak last (Fig. 3.12A).
This order of transcription timing is consistent with our knowledge about the
functions of these genes. Ash1 is one of the earliest regulators of daughter-specific
gene expression programs, and is required to repress the transcription of HO from the
beginning of DG1 to block mating-type switching (Sil and Herskowitz, 1996; Cosma,
2004). AMN1 is also transcribed very early in DG1 as Amn1 has been shown to
be part of a daughter-specific switch that helps cells complete mitotic exit (Wang
et al., 2003). On the other hand, DSE2 and CTS1 (chitinase) are transcribed later
in DG1 as they encode proteins that degrade the cell wall from the daughter side,
leading to mother-daughter separation (Colman-Lerner et al., 2001; Doolin et al.,
2001; Kuranda and Robbins, 1991).
Among genes that rise to their peaks concomitantly, we observe that their tran-
script levels may decay at different rates; interestingly, these rates are in rough
qualitative agreement with a recent global study of mRNA half-lives (Miller et al.,
2011). For instance, among the closely transcribed genes ASH1, EGT2, AMN1, and
DSE3, the half-life of ASH1 is shortest (9.35), the half-lives of AMN1 and EGT2
are close to one another (11.02 and 10.67), and the half-life of DSE3 is longest (24.65).
Similarly, the half-life of CTS1 (33.38) is significantly longer than those of the other
two closely transcribed genes DSE1 and DSE2 (7.64 and 7.49).
Having confirmed that the known daughter-specific transcripts of Colman-Lerner
et al. (2001) were primarily transcribed during DG1 after deconvolution (Fig. 3.12A),
we sought to identify other genes that were similarly transcribed primarily during
DG1. We established two criteria: the integrated transcript level of a gene across
all of DG1 should be at least 30% of the total across all cell-cycle branches (R,
G1 + postG1, and DG1 + postG1), and the peak transcript level in DG1 should
be at least 1.5 times higher than the peak during recovery (R) or in mother (G1
+ postG1) cells. We identified 82 genes satisfying these criteria which we consider
to be primarily transcribed in daughter cells during G1 (Fig. 3.12B). Many known
daughter-specific genes are in the list, including all ten genes in Table 1 of Colman-
Lerner et al. (2001), all six genes identified by Di Talia et al. (2009) as “strongly and
fairly specifically activated by Ace2”, and a remarkable 19 of the 22 genes identified
by Di Talia et al. (2009) as “responding to a greater or lesser extent to both Ace2
and Swi5” (p < 2 × 10^-33); these include the cyclin Pcl9 and the CDK inhibitor Sic1
that drives cells out of mitosis (Toyn et al., 1997).
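The two selection criteria above can be written down compactly. The following is a minimal sketch, not the code used in this work: the function name, the argument layout, and the use of plain sums on a uniform time grid as the integrated transcript level are all assumptions made for illustration, and the second criterion is read here as requiring the DG1 peak to reach 1.5 times the larger of the R and mother peaks.

```python
import numpy as np

def is_daughter_specific(recovery, mother, dg1, daughter_post_g1,
                         min_fraction=0.30, min_peak_ratio=1.5):
    """Apply the two criteria to one gene's deconvolved profile.

    Each argument is a 1-D array of non-negative transcript levels sampled
    on a common, uniform time grid (an assumption of this sketch):
      recovery         -- the recovery branch (R)
      mother           -- the mother branch (G1 + postG1)
      dg1              -- the daughter-specific G1 interval (DG1)
      daughter_post_g1 -- the remainder of the daughter branch (postG1)
    """
    recovery, mother, dg1, daughter_post_g1 = (
        np.asarray(x, dtype=float)
        for x in (recovery, mother, dg1, daughter_post_g1))

    # Criterion 1: DG1 holds at least 30% of the integrated transcript level
    # across all branches. On a uniform grid the time step cancels in the
    # ratio, so plain sums are enough.
    total = recovery.sum() + mother.sum() + dg1.sum() + daughter_post_g1.sum()
    fraction_ok = total > 0 and dg1.sum() / total >= min_fraction

    # Criterion 2: the DG1 peak is at least 1.5 times the larger of the
    # recovery peak and the mother peak.
    reference_peak = max(recovery.max(), mother.max())
    peak_ok = dg1.max() >= min_peak_ratio * reference_peak

    return bool(fraction_ok and peak_ok)
```

Under these assumptions, a gene passes only when both the integrated-level test and the peak test succeed; changing `min_fraction` or `min_peak_ratio` shows how sensitive the resulting gene list is to the chosen thresholds.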
Table 3.4: The contingency table for 82 identified daughter-specific genes according to the daughter-specific and non-daughter-specific genes identified in Di Talia et al. (2009), Spellman et al. (1998), and Colman-Lerner et al. (2001).
True positives: 25 genes. We categorize as true positives the 25 identified genes that are reported by Di Talia et al. (2009) in their Supplementary Text S1 to be among 28 genes transcribed only in daughter cells, or particularly responsive to either Ace2 or Swi5: AMN1, ASH1, BUD9, CTS1, CYK3, DSE1, DSE2, DSE3, DSE4, EGT2, GAT1, ISR1, NIS1 (mistyped in their Supplementary Text as HIS1, but evident from their figure as NIS1), PCL9, PIR1, PRR1, PRY3, PST1, RME1, SCW11, SIC1, SUN4, YLR049C, YNL046W, and YPL158C.
False positives: 4 genes. Colman-Lerner et al. (2001) suggested in their Table 2 that 19 genes were not daughter-specific. However, among these, 8 were subsequently confirmed by Di Talia et al. (2009) to actually be daughter-specific: BUD9, CYK3, PCL9, PST1, SIC1, YNL046W, NIS1, and RME1 (mistyped as REM1, but evident from their Fig. 2a as RME1). Of the remaining 11 genes,
◦ 2 are not included on our microarrays: YMR316C-A and YOR263C.
◦ 4 are in the set we identified as daughter-specific: CHS1, HO, PIR3, and TEC1. Thus, these are false positives.
◦ 5 are not in the set we identified as daughter-specific: CDC6, FAA3, PCL2, YGR149W, and PIL1. Thus, these are categorized as true negatives.
True negatives: 5 genes. Refer to the description of false positives.
False negatives: 3 genes. We categorize as false negatives the 3 non-identified genes that are reported by Di Talia et al. (2009) in their Supplementary Text S1 to be among 28 genes transcribed only in daughter cells, or particularly responsive to either Ace2 or Swi5: YLR414C, FTH1, and ESF2.
Although this is not a proper quantitative estimate of the false discovery rate
(FDR), the above categorizations suggest that the FDR is perhaps something
in the ballpark of 4/29 = 0.138. However, since much of the data from
Colman-Lerner et al. (2001) seems to have been superseded by more
recent results (in particular, 8 of the 19 genes claimed not to be daughter-specific
have subsequently been shown to actually be daughter-specific), this may be a high
estimate of the true FDR.
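For reference, the rough figure quoted above is simply the fraction of called genes that fall in the false-positive cell of Table 3.4. The short sketch below reproduces that arithmetic; the function name and the dictionary layout are illustrative choices, not part of the original analysis.

```python
def false_discovery_rate(true_positives, false_positives):
    """Fraction of positive calls that are false: FP / (TP + FP)."""
    called = true_positives + false_positives
    if called == 0:
        raise ValueError("no genes were called positive")
    return false_positives / called

# Counts taken from the contingency table (Table 3.4).
table_3_4 = {"TP": 25, "FP": 4, "TN": 5, "FN": 3}
fdr = false_discovery_rate(table_3_4["TP"], table_3_4["FP"])
print(f"{fdr:.3f}")  # 4/29, printed as 0.138
```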
Regarding the 4 false positives, we identified HO, which controls mating type
switching and is known to participate in mother/daughter differentiation (by being
asymmetrically localized to mothers rather than daughters); and TEC1, which plays
a key role in regulating pseudohyphal growth, and whose binding sites are
suggestively enriched in our “late” cluster of daughter-specific genes, along with STE12,
the key mating pheromone response transcription factor (TF). Taken together, these
results suggest a linkage between mating type/pheromone response pathways and
how mothers and daughters differentiate. We also identified CHS1, a chitin
synthase required to repair the septum after mother/daughter separation, which seems
to be a Swi5 target rather than an Ace2 target; and PIR3, a cell wall protein. The
presence of both HO and CHS1 among our false positives suggests that sometimes
a gene may be included in our list if it is mother- rather than daughter-specific, but
is not present early in our time course experiments. Thus, false positives may include
genes that are asymmetrically localized to mothers during mother/daughter
differentiation, but do not appear until late in our time course experiments.
Gene Ontology (GO) (Ashburner et al., 2000) enrichment analysis indicates that
many of the proteins corresponding to these genes play a role in the processes of
transcription elongation (p < 3 × 10^-8), completion of separation (p < 2 × 10^-7),
cytokinetic cell separation (p < 2 × 10^-6), cell wall organization or biogenesis
(p < 7 × 10^-4), etc. We visually clustered the 82 genes into three clusters and performed
TF-promoter enrichment analysis of the genes in each cluster. Not surprisingly, genes
whose transcript levels peak early in DG1 (Fig. 3.12B, early) share Ace2 and Swi5
as key TFs; also identified are Sok2, Phd1, and Ste12, all regulators of pseudohyphal
growth. Genes whose profiles are above average for almost all of DG1 (Fig. 3.12B,
middle) are further enriched for Cin5 (previously called Yap4) and Yap6, yeast AP-1
homologues that both recruit the Tup1/Ssn6 repressor under stress conditions (Han-
lon et al., 2011). Genes whose onset is a bit later in DG1 (Fig. 3.12B, late) are
enriched for Mcm1, Tec1, and Ste12—all involved in responses to pheromone or
pseudohyphal growth—as well as Mac1, a copper sensing TF, and Put3, a regulator
of the proline utilization pathway.
Since it is experimentally difficult to measure mother and daughter transcrip-
tion programs independently, knowledge of daughter-specific events is still rather
limited, and high-throughput identification of daughter-specific genes has been an
open problem in the field. Our deconvolution algorithm, with its unique ability to
reveal a daughter-specific transcription program from population-level data, provides
a method for generating hypotheses in this direction, and reveals a much larger list
of daughter-specific genes than has previously been identified (Colman-Lerner et al.,
2001). Along with the recent results of Di Talia et al. (2009) and others, this list
provides a step toward understanding the nature of mother-daughter cell differenti-
ation.
3.4.12 Transcriptional programs between G1 and DG1
G1 is the major period of cell growth during the cell cycle. During this phase, both
mother and daughter cells require large amounts of structural proteins and enzymes
for synthesizing new organelles, and many genes are transcribed both in mother
cells during G1 and in daughter cells during DG1. Since mother and daughter cells
are permitted by our model to transcribe genes differently during G1, it is natural to
ask how the transcription programs in G1 and DG1 are related. For example, the
transcription program of a gene in G1 may be essentially identical to that in DG1,
albeit proceeding at a faster pace so that the profile appears to be compressed, an
example being MCM7 (Fig. 3.10A); or the transcription profile in DG1 may be a
delayed-onset version of the G1 profile, preceded by some daughter-specific early G1
profile, an example being MCM3 (Fig. 3.10A).
[Figure 3.13. Panel A: peak ratio ≥ 2 (120 genes), one dominant peak (mother or daughter): 1 gene (PRY1) in mother G1; 119 genes in daughter DG1. Panel B: peak ratio < 2 (1,380 genes), two dominant peaks (mother and daughter): 364 genes in mother G1 and daughter DG1; 381 genes in mother G1 and daughter post-G1; 345 genes in mother post-G1 and daughter DG1; 290 genes in mother post-G1 and daughter post-G1. The 364 genes are arranged in a 3 × 3 grid by the strength of the compressed and delayed correlations (high: corr ≥ 0.9; medium: 0.9 > corr ≥ 0.7; low: corr < 0.7) and grouped as delayed (30.2%), compressed (30.5%), mixed (20.9%), and uncorrelated (18.4%).]
Figure 3.13: Relationships of transcription profiles in G1 and DG1. First, as discussed in the Methods, we separated our 1,500 cell-cycle-regulated genes into two groups: genes with one dominant peak and genes with two dominant peaks. (A) One-dominant-peak genes. In our cell-cycle branching process model, we allowed mother and daughter cells to transcribe genes differently during G1 and DG1, but assumed they share a common transcription program postG1. Therefore, since these genes have only one dominant peak, it must occur either in mother G1 or in daughter DG1. Interestingly, we found only one gene (PRY1) in the first category, but 119 genes in the second. (B) Two-dominant-peak genes. We split the remaining 1,380 genes into four subgroups according to where their two dominant peaks occurred. For the 364 genes whose two dominant peaks are in mother G1 and in daughter DG1, we calculated two Pearson correlation coefficients between the transcription profiles in G1 and DG1: one between the G1 profile and a compressed version of the DG1 profile (compressed); the other between the G1 profile and the later segment of the DG1 profile (delayed). According to the strengths of these two correlation coefficients, we separated the 364 genes into 9 groups, and combined some of the groups into four categories, as shown.
To study relationships of transcription profiles in G1 and DG1, we can calculate
two Pearson correlation coefficients: one between the G1 profile and a compressed
version of the DG1 profile; the other between the G1 profile and the latter segment of
the DG1 profile. Since correlation ignores amplitudes, we also compare the maximum
transcript levels in these intervals to ensure rough equivalence. Focusing on the cell-
cycle-regulated genes that have clear peaks in G1 and DG1, we observed that about
[Figure 3.14. Circular plot with sectors G1, S, and G2/M and concentric circles at PTR scores 1.37, 5, 10, 20, 50, and 100+. Labeled genes: MCM2, MCM3, MCM5, MCM4, MCM6, MCM7, CDC6, CLB5, CLB6, DBF4, CDC7, CDC45, SLD2, SLD5, PSF1, PSF3, HTA1, HTA2, HTB1, HTB2, HHT1, HHT2, HHF1, HHO1, and HTZ1. Legend: genes in the origin-selection step of DNA replication; genes in the origin-activation step of DNA replication; histone genes.]
Figure 3.14: Circular representation of peak timing of genes. The figure depicts the timing of transcriptional peaks of the genes in Fig. 3.10, where colored sectors indicate the cell-cycle phases G1, S, and G2/M, respectively, and the gray dashed circles indicate the deconvolved PTR scores, starting from 1.37, the score threshold for calling a gene cell-cycle-regulated. Therefore, this representation shows not only the peak timing of genes, but also the amplitude of cell-cycle oscillation.
30% are exclusively in the first category, about 30% are exclusively in the second
category, and about 20% can be classified into both categories, like CDC6 (Fig. 5A);
the final 20% are not easily categorized. The details are given in Fig. 3.13.
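The two correlation coefficients described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation used here: it assumes both profiles are sampled at the same time step with DG1 at least as long as G1, it uses linear interpolation (`numpy.interp`) to compress the DG1 profile, and the function names are hypothetical. The bins follow the thresholds shown in Fig. 3.13 (0.9 and 0.7); how the nine bin combinations are merged into the four categories is not reproduced.

```python
import numpy as np

def corr_bin(r):
    """Bin a correlation coefficient using the thresholds of Fig. 3.13."""
    if r >= 0.9:
        return "high"
    if r >= 0.7:
        return "medium"
    return "low"

def compare_g1_dg1(g1, dg1):
    """Correlate a mother G1 profile with the daughter DG1 profile two ways."""
    g1 = np.asarray(g1, dtype=float)
    dg1 = np.asarray(dg1, dtype=float)
    if len(dg1) < len(g1):
        raise ValueError("this sketch assumes DG1 is at least as long as G1")
    n = len(g1)

    # Compressed: squeeze the whole DG1 profile onto the G1 time grid.
    source = np.linspace(0.0, 1.0, len(dg1))
    target = np.linspace(0.0, 1.0, n)
    compressed = np.interp(target, source, dg1)

    # Delayed: keep only the last n samples of the DG1 profile.
    delayed = dg1[-n:]

    r_compressed = np.corrcoef(g1, compressed)[0, 1]
    r_delayed = np.corrcoef(g1, delayed)[0, 1]
    return {
        "compressed": r_compressed, "compressed_bin": corr_bin(r_compressed),
        "delayed": r_delayed, "delayed_bin": corr_bin(r_delayed),
    }
```

Because Pearson correlation ignores amplitude, a comparison of the maximum transcript levels in the two intervals, as noted in the text, would still be needed alongside these coefficients.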
3.4.13 Visualizing transcription timing of gene groups
In previous sections, we have shown that our deconvolution algorithm can provide
single-cell-like transcriptional profiles explicitly for mother and daughter cells, and
that the resolution of the deconvolved profiles increases significantly, from the initial
16 minutes to around 1-2 minutes. In addition, our method is more sensitive and
enables us to reveal subtle timing differences between genes with similar
transcriptional programs in the original population measurements. Based on the PTR
scores, we here propose a novel means of visualizing the transcriptional timing of
gene groups (e.g., protein complexes, genetic pathways, functional modules). As
shown in Fig. 3.14, we represent the standard cell cycle using a circular plot composed
of three sectors, indicating G1, S, and G2/M, respectively. Then, for each gene, we
draw a circle on the plot to mark the timing of its transcriptional peak. The position
of each circle is determined not only by its peak timing, but also by the amplitude of
deconvolved transcription. From such plots, we can not only identify genes that are
transcribed together, but also investigate subtle timing differences among the genes
in a functional group, such as a protein complex.
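One way to place a gene on such a circular plot is to map its peak time to an angle and its PTR score to a radius. The sketch below is only an illustration: the logarithmic radial scale, the cap at a score of 100, the clockwise orientation starting at the top, and both function names are assumptions suggested by the tick labels of Fig. 3.14 rather than details stated in the text.

```python
import math

def peak_to_polar(peak_time, cycle_length, ptr_score,
                  ptr_min=1.37, ptr_cap=100.0):
    """Map a gene's peak time and PTR score to (angle, radius).

    angle  -- radians in [0, 2*pi), proportional to the position of the
              peak within one cell cycle of length cycle_length
    radius -- in [0, 1], logarithmic in the PTR score between the
              cell-cycle-regulation threshold (1.37) and the cap (100)
    """
    if ptr_score < ptr_min:
        raise ValueError("gene is below the cell-cycle-regulated threshold")
    angle = 2.0 * math.pi * ((peak_time % cycle_length) / cycle_length)
    clipped = min(ptr_score, ptr_cap)
    radius = math.log(clipped / ptr_min) / math.log(ptr_cap / ptr_min)
    return angle, radius

def polar_to_xy(angle, radius):
    """Convert to plot coordinates, starting at 12 o'clock and running clockwise."""
    return radius * math.sin(angle), radius * math.cos(angle)
```

With this mapping, genes peaking at the same time fall on the same ray from the center, and genes with stronger oscillation sit farther out.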
4
Identifying conserved functional modules across species
In this chapter, we move our focus from genes to proteins. We introduce a PPI
network alignment method, called ‘DOMAIN’, which exploits protein functional do-
mains to identify equivalent functional modules from pairwise protein-protein inter-
action networks across species.
Conventionally, most network alignment algorithms adopt a node-then-edge-alignment
paradigm: they first identify homologous proteins across networks
and then consider interactions among them to construct network alignments.
DOMAIN is instead built upon a novel direct-edge-alignment paradigm. Specifically,
instead of explicitly identifying homologous proteins, we directly infer plausible
alignable PPIs across species by comparing conservation of their constituent domain
interactions. By applying our approach to detect conserved protein complexes in
yeast-fly and yeast-worm PPI networks, we show that our approach outperforms
two recent approaches in most alignment performance metrics. We also show that
our approach enables us to identify conserved cell-cycle functional modules across
species. Most of the work presented in this chapter appeared in Guo and Hartemink (2009).
4.1 Introduction to network alignment
Understanding complicated networks of interacting proteins is a major challenge in
systems biology. Recently, with the rapid progress of high-throughput experimental
techniques, protein-protein interaction (PPI) databases have exponentially increased
in size, allowing for comparative analysis of PPI networks from which conserved
modules can be identified across PPI networks of different species (Sharan and Ideker,
2006; Srinivasan et al., 2007). By analogy to sequence alignment, this problem is
called PPI network alignment.
Typically, PPI network alignment algorithms compare PPI networks of two or
more species and identify conserved modules (e.g., pathways or protein complexes).
Often a PPI network is represented as an undirected graph in which nodes indicate
proteins and edges indicate interactions. Hence, the network alignment problem can
also be viewed as a graph isomorphism problem.
Many network alignment algorithms have been proposed in recent years and most
of them focus on the pairwise alignment of PPI networks. As an early approach, Path-
BLAST (Kelley et al., 2003) proposed a likelihood-based scoring scheme to search for
conserved pathways. Sharan et al. (2005b) extended PathBLAST to employ a greedy
heuristic to detect conserved protein complexes across species. NetworkBLAST-
E (Hirsh and Sharan, 2007) introduced an evolutionary model of networks into the
alignment scoring function to extract conserved complexes. MaWISh (Koyuturk
et al., 2006) merged pairwise interaction networks into a single alignment graph
and treated network alignment as a maximum weight induced subgraph problem.
MNAligner (Zhenping et al., 2007) described an integer quadratic programming
(IQP) model to identify conserved substructures.
Recently, several network alignment algorithms have been developed that can
align more than two species. Graemlin (Flannick et al., 2006) is capable of aligning
more than ten microbial networks at once. NetworkBLAST (Sharan et al., 2005a),
another extension of PathBLAST, can align networks of up to three species, and
its later version, NetworkBLAST-M (Kalaev et al., 2008), can align ten networks
with tens of thousands of proteins in minutes. In addition, Singh et al. (2008)
described a method inspired by Google’s PageRank to detect global alignments from
five eukaryotic PPI networks.
However, all these network alignment algorithms follow a node-then-edge-alignment
paradigm. That is, they generally first need to identify homologous proteins across
species before they can exploit protein interaction and network topology information
to detect conserved subnetworks. The node alignment step essentially acts as a filter,
artificially constraining the search space of conserved modules to putatively homolo-
gous protein pairs. On the other hand, proteins rarely act alone. They interact with
each other to carry out their activities, and these interacting proteins are likely to
evolve with high correlation during the evolution of species (Pazos et al., 1997; Goh
et al., 2000; Mintseris and Weng, 2005). Further, it has been shown recently that such
co-evolution is more evident if we focus our attention on interacting domains that are
responsible for the PPIs (Jothi et al., 2006; Itzhaki et al., 2006; Schuster-Bockler and
Bateman, 2007). Based on these observations, we present DOMAIN, an algorithm
for domain-oriented alignment of interaction networks, that follows an alternative
direct-edge-alignment paradigm. DOMAIN does not explicitly restrict its attention
to putatively homologous proteins. Instead, it directly aligns PPIs across species by
decomposing PPIs in terms of their constituent domain-domain interactions (DDIs)
and looking for conservation of these DDIs.
Figure 4.1: Overview of DOMAIN algorithm. (1) Constructing alignable pairs of edges (APEs). The input of DOMAIN includes two PPI networks and the constituent domains of the proteins. Using this information, DOMAIN calculates species-specific domain-domain interaction (DDI) probabilities, and then identifies a set of APEs across networks. (2) Building an APE graph. An APE graph is a merged representation of the PPI networks, in which each node represents an APE and each edge represents one of four network connectivities connecting two APEs: a) alignment extension, b) node duplication, c) edge indel (insertion/deletion), or d) edge jump. The details of these connectivities are given in Section 4.2.2. (3) Searching for high-scoring non-redundant subgraphs within the APE graph. We use a greedy heuristic to carry out this task.
4.2 DOMAIN: a domain-oriented edge-based PPI network aligner
As illustrated in Fig. 4.1, DOMAIN consists of three stages: (1) it constructs a
complete set of alignable pairs of edges (APEs); (2) it builds an APE graph; (3)
it employs a heuristic search to identify conserved protein complexes across species.
The three subsections that follow elaborate upon these three stages.
4.2.1 Constructing and scoring APEs
Domains are the structural and functional units of proteins. Many studies (Deng
et al., 2002; Riley et al., 2005; Bernard et al., 2007) have revealed that direct PPIs
are often mediated by interactions between the constituent domains of the two inter-
acting proteins. These studies have made two particular assumptions that we adopt
as well: (1) DDIs are independent of each other, and (2) two proteins interact if and
only if at least one pair of domains from two proteins interact. These assumptions
allow us to formulate the probability of an interaction between two proteins in terms
of a “noisy-or” over the DDIs that might possibly mediate the interaction between
those two proteins. In our network alignment scenario where we seek to align edges
directly, we additionally assume that a pair of cross-species PPIs can be aligned to
one another only if they are plausibly mediated by at least one common DDI.
We represent the input PPI networks from two species as undirected graphs
G1(V1, E1) and G2(V2, E2), where nodes indicate proteins and edges indicate the
observed PPIs. We first wish to construct a complete set of alignable pairs of edges
(APEs). We say that a pair of edges, e1∈E1 and e2∈E2, is alignable if there exists
a DDI that can plausibly mediate the two PPIs represented by that pair of edges.
We say that a DDI can plausibly mediate a PPI if the corresponding interaction
probability between the two domains is above some value ε > 0. Using a nonzero
value for ε allows us to filter out domains between which there is negligible evidence
of a DDI.
For an edge e∈E1 or E2, we define D(e) to be all the possible interactions between
the constituent domains of the two proteins. Given the species-specific probabilities
of DDIs that mediate PPIs, we can then write the score of an APE c = (e1, e2) using
a “noisy-or” formulation:
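\[
f(c) = \Pr(e_1, e_2 \mid \Theta^1, \Theta^2) = 1 - \prod_{d_{\alpha,\beta} \in D(e_1) \cap D(e_2)} \left(1 - g(\theta^1_{\alpha,\beta}, \theta^2_{\alpha,\beta})\right) \tag{4.1}
\]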
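where $d_{\alpha,\beta}$ denotes an interaction between domains $\alpha$ and $\beta$, $\theta_{\alpha,\beta} = \Pr(d_{\alpha,\beta})$,
and $\Theta = \{\theta_{\alpha,\beta}\}$; the superscripts 1 and 2 index the two species. The function
$g(\theta^1_{\alpha,\beta}, \theta^2_{\alpha,\beta})$ measures the probability of aligning the
PPI $e_1$ to the PPI $e_2$ mediated by interactions between domains $\alpha$ and $\beta$. In this
work, we have chosen to set $g(\theta^1_{\alpha,\beta}, \theta^2_{\alpha,\beta}) = (\theta^1_{\alpha,\beta} \cdot \theta^2_{\alpha,\beta})^{1/2}$.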
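As an illustration of Eq. 4.1, the following sketch computes the noisy-or score of one APE from species-specific DDI probabilities, with g taken as the geometric mean chosen above. The data layout (each DDI as a `frozenset` of two domain names, probabilities in dictionaries) and the function names are assumptions made for this example; the default ε of 10^-20 is the value reported later in this chapter.

```python
import math
from itertools import product

def candidate_ddis(domains_a, domains_b):
    """D(e): all possible interactions between the domains of two proteins."""
    return {frozenset((a, b)) for a, b in product(domains_a, domains_b)}

def is_alignable(ddis_e1, ddis_e2, theta1, theta2, eps=1e-20):
    """True if some shared DDI can plausibly mediate both PPIs (prob. > eps)."""
    return any(theta1.get(d, 0.0) > eps and theta2.get(d, 0.0) > eps
               for d in ddis_e1 & ddis_e2)

def ape_score(ddis_e1, ddis_e2, theta1, theta2):
    """Noisy-or score f(c) of Eq. 4.1 over the DDIs shared by the two PPIs."""
    no_alignment = 1.0
    for d in ddis_e1 & ddis_e2:
        # g is the geometric mean of the two species-specific probabilities.
        g = math.sqrt(theta1.get(d, 0.0) * theta2.get(d, 0.0))
        no_alignment *= 1.0 - g
    return 1.0 - no_alignment
```

For example, if the two PPIs share a single DDI with probabilities 0.64 and 0.25 in the two species, g is 0.4 and so is the APE score; each additional shared DDI can only raise the score.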
As previous authors have also done, to estimate the species-specific DDI proba-
bilities Θ, we applied the EM (expectation-maximization) algorithm of Deng et al.
(2002) for each given network.
4.2.2 Building an APE graph
The APE graph is motivated by the evolutionary model of PPI networks suggested
by Berg et al. (2004). The model indicates that PPI networks are shaped primarily
by two kinds of evolutionary events, link dynamics and gene duplication. Link dy-
namics events are primarily caused by sequence mutations of a gene and affect the
connectivities of the protein whose coding sequence undergoes mutations. Gene du-
plication, the second kind of evolutionary event, is often followed by either silencing
of one of the duplicated genes or by functional divergence of the duplicates. From
the perspective of protein domains, a link dynamics event may result from switching
a constituent domain of a protein to another, or a change in a domain’s interaction
partners; a gene duplication event consists of duplication of one protein, followed by
a domain switching or being removed in one or both of the duplicates, or followed
by progressive small changes from point mutations that cause a change in domain
interaction partners.
With this motivation in place, we define an APE graph to be an undirected
weighted graph, where nodes correspond to the APEs identified above, and edges
correspond to one of four evolutionary relationships that we consider between two
APEs, as illustrated in Fig. 4.2 and as listed below:
a. Alignment extension: two APEs are connected if they share two proteins, one
per species.
b. Node duplication: two APEs are connected if they share a protein in one species
Figure 4.2: Four connectivities in an APE graph. The details of these connectivities are given in the text, and the legend of the figure is the same as that given in Fig. 4.1.
and a PPI in the other.
c. Edge indel (insertion/deletion): two APEs are connected if they share a protein
in one species and the graph distance between the two PPIs in the other network
is 1.
d. Edge jump: in this case, all proteins within the two APEs are distinct, but for
each species, the graph distance between the two PPIs in their corresponding
network is 1. We consider this case because our current knowledge of both
PPIs and DDIs is noisy and incomplete. Thus, if there exists a pair of PPIs
that can make two APEs connected in each network, we treat the pair as a
potential APE. Note that some insignificant DDIs (probabilities of DDIs < ε)
are shared in such potential APEs.
Given this definition of an APE graph, we note that every subgraph in an APE graph
corresponds to a network alignment.
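The four connection types listed above can be sketched as a small classifier. This is one possible reading of the definitions, not the implementation used here: a PPI is represented as a `frozenset` of two proteins, an APE as a pair of PPIs (one per species), and two PPIs are taken to be at graph distance 1 when they share no protein but some protein of one interacts with some protein of the other. The function names and that distance convention are assumptions.

```python
def at_distance_one(ppi_a, ppi_b, adjacency):
    """Assumed convention: no shared protein, but joined by one interaction."""
    if ppi_a & ppi_b:
        return False
    return any(v in adjacency.get(u, ()) for u in ppi_a for v in ppi_b)

def connection_type(ape_a, ape_b, adjacency1, adjacency2):
    """Classify the relationship between two APEs, or return None."""
    (a1, a2), (b1, b2) = ape_a, ape_b
    shared1, shared2 = len(a1 & b1), len(a2 & b2)

    if (shared1, shared2) == (2, 2):
        return None                       # the same APE
    if (shared1, shared2) in ((2, 1), (1, 2)):
        return "node duplication"         # same PPI in one species, one shared protein in the other
    if (shared1, shared2) == (1, 1):
        return "alignment extension"      # one shared protein per species

    near1 = at_distance_one(a1, b1, adjacency1)
    near2 = at_distance_one(a2, b2, adjacency2)
    if (shared1 >= 1 and near2) or (shared2 >= 1 and near1):
        return "edge indel"               # shared protein in one species, gap of one edge in the other
    if near1 and near2:
        return "edge jump"                # all proteins distinct, gap of one edge in both species
    return None
```

Here `adjacency1` and `adjacency2` map each protein to the set of its interaction partners in the corresponding PPI network.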
Each node in an APE graph contributes the score f(c) of its corresponding APE,
and each edge is scored by a positive number according to its connection relationship.
Using these edge scores, we want to reward alignment extension and penalize both
node duplication and edge indel. Let γa, γb, γc, and γd be the edge scores of alignment
extension, node duplication, edge indel, and edge jump, respectively. We thus need
to assign γa > 1 and γb, γc < 1. Because we neither wish to reward nor penalize an
edge jump, we simply assign γd = 1. For a subgraph Gs(Vs, Es) in an APE graph,
the overall score for its corresponding network alignment is calculated as
\[
S(G_s) = \prod_{e \in E_s} \gamma(e) \cdot \prod_{c \in V_s} f(c) \tag{4.2}
\]
where γ(e) is the edge score for e∈Es, and f(c) is the score of the APE c∈Vs.
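A direct transcription of Eq. 4.2 is short. The sketch below assumes the one-parameter setting of the edge scores described in Section 4.3.1 (γa = k, γb = γc = 1/k, γd = 1, with k = 10); the function names and the use of string labels for connection types are illustrative choices.

```python
import math

def edge_scores(k=10.0):
    """Edge scores for the four connection types, parameterized by k > 1."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return {
        "alignment extension": k,       # rewarded
        "node duplication": 1.0 / k,    # penalized
        "edge indel": 1.0 / k,          # penalized
        "edge jump": 1.0,               # neutral
    }

def alignment_score(ape_scores, edge_types, k=10.0):
    """S(G_s): product of edge scores times product of APE scores (Eq. 4.2)."""
    gamma = edge_scores(k)
    return math.prod(gamma[t] for t in edge_types) * math.prod(ape_scores)
```

For instance, three APEs with scores 0.5, 0.4, and 0.2 joined by one alignment extension and one edge indel give (10 × 0.1) × (0.5 × 0.4 × 0.2) = 0.04.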
4.2.3 Detecting protein complexes
Network alignment methods generally require a search algorithm to detect high-
scoring subgraphs from a single or several weighted graphs. Such tasks are computa-
tionally difficult, so a number of search heuristics have been proposed: for example,
PathBLAST uses randomized dynamic programming to search for conserved
pathways across networks, while NetworkBLAST-E implements a greedy heuristic
to search for conserved protein complexes. As many pairwise network methods aim
to identify conserved protein complexes, for comparative purposes, we devise a greedy
heuristic for finding conserved protein complexes across species.
The heuristic aims to identify high-scoring non-redundant subgraphs from the
resultant APE graph. Specifically, starting from each APE in turn, we iteratively
expand the subgraph by introducing the new APE that increases the alignment score
the most, until any of the following empirical stopping conditions occurs: (1) the
number of proteins in either species exceeds an upper limit (we used 15); (2) the
score of the next expanding APE is smaller than a threshold (we used 10^-2); (3)
the overall alignment score of the subgraph is smaller than a threshold (we used
10^-3); (4) the graph distance of the next expanding APE exceeds an upper limit (we
used 4). At the end, small and redundant subgraphs are removed: a subgraph is
discarded if it contains fewer than four proteins, or if there exists a higher-scoring
subgraph overlapping more than 80% of its proteins in either species.
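The expansion step of the heuristic can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used here: it covers stopping conditions (1) to (3) only, omits the graph-distance limit (4) and the final removal of small and redundant subgraphs, and uses a hypothetical data layout (`ape_scores` maps an APE identifier to f(c), `ape_proteins` maps it to a pair of protein sets, one per species, and `edge_scores` maps a `frozenset` of two APE identifiers to γ).

```python
import math

def grow_alignment(seed, ape_scores, ape_proteins, edge_scores,
                   max_proteins=15, min_ape_score=1e-2, min_total_score=1e-3):
    """Greedily expand a subgraph of the APE graph from one seed APE."""
    # Adjacency lists of the APE graph.
    neighbors = {}
    for edge in edge_scores:
        a, b = tuple(edge)
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    members = {seed}
    score = ape_scores[seed]
    while True:
        frontier = set().union(*(neighbors.get(m, set()) for m in members)) - members
        best, best_score = None, -1.0
        for cand in sorted(frontier):
            # Adding cand multiplies the score by f(cand) and by the score of
            # every APE-graph edge that links cand to the current subgraph.
            links = math.prod(edge_scores[frozenset((cand, m))]
                              for m in members
                              if frozenset((cand, m)) in edge_scores)
            cand_score = score * ape_scores[cand] * links
            if cand_score > best_score:
                best, best_score = cand, cand_score
        if best is None:
            break                                    # nothing left to add
        if ape_scores[best] < min_ape_score:
            break                                    # condition (2)
        if best_score < min_total_score:
            break                                    # condition (3)
        grown = members | {best}
        n1 = len(set().union(*(ape_proteins[m][0] for m in grown)))
        n2 = len(set().union(*(ape_proteins[m][1] for m in grown)))
        if max(n1, n2) > max_proteins:
            break                                    # condition (1)
        members, score = grown, best_score
    return members, score
```

Running this from every APE as a seed and then filtering the resulting subgraphs by size and overlap would complete the procedure described above.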
4.3 Results
4.3.1 Experimental setup
We compare our method to two extant pairwise network alignment algorithms, Net-
workBLAST and MaWISh. We do not include NetworkBLAST-M and Graemlin in
our comparisons because they mainly focus on alignment of multiple networks, and
because Graemlin requires the unavailable in-house SRINI algorithm (Srinivasan
et al., 2006) to assign weights to PPIs. The ISOrank algorithm aims at resolving a
different problem of aligning networks globally, and NetworkBLAST-E performs sim-
ilarly to NetworkBLAST and is not available online. We thus exclude these methods
from the comparisons as well.
We apply DOMAIN to yeast-fly and yeast-worm PPI networks taken from DIP
(Database of Interacting Proteins, Oct 2008) (Xenarios et al., 2002), as they have been
widely used as benchmarks in pairwise network alignment studies. The protein-to-
domain mappings are taken from Pfam (Pfam 23.0) (Finn et al., 2008), and we only
consider high-quality Pfam-A entries. Because not all proteins contain significant
Pfam domains, we generate a so-called “backbone” network, a subnetwork of DIP in
which all proteins contain at least one Pfam-A domain. As summarized in Table 4.1,
Table 4.1: Summary of backbone networks.
                              DIP                          Backbone DIP
                              Yeast    Fly      Worm       Yeast    Fly      Worm
# PPIs                        17,528   22,381   4,038      11,426   11,013   2,213
# proteins                    4,928    7,446    2,644      3,300    4,500    1,620
# GO annotated proteins *     4,625    4,477    1,566      3,280    3,253    1,145
# MIPS annotated proteins **  1,100    -        -          860      -        -
* With respect to biological process annotation of Gene Ontology.
** Excluding MIPS category 550.
78.2% of MIPS annotated proteins and over 70% of GO annotated proteins are
contained in backbone networks. To simplify the setting of the four γ parameters,
we reduced the parameter space to one dimension by insisting that γa = k, γb = γc =
1/k, and γd = 1, for some value of k>1. We found that DOMAIN was not sensitive
to changes in k. In the results that follow, we used k=10.
4.3.2 DOMAIN outperforms previous methods in most performance metrics
We employ three measures to evaluate the biological significance of the alignments:
sensitivity/specificity, MIPS purity, and GO enrichment. These measures are also
suggested in several other network alignment studies (Hirsh and Sharan, 2007; Dutkowski
and Tiuryn, 2007; Kalaev et al., 2008).
The first two measures use the known yeast protein complexes cataloged in MIPS
(May 2006) (Mewes et al., 2002) as a gold standard. We exclude category 550
(obtained from high-throughput experiments) and only use complexes at level 3 or
lower. As a result, the yeast backbone network contains 122 MIPS complexes spanning
519 yeast proteins; 62 of these complexes contain at least 3 proteins and together span 438
proteins. For each identified yeast alignment, we try to find a complex from MIPS
that maximizes the hypergeometric score and calculate an empirical enrichment p-
value. The significance level is obtained from sampling 10,000 random sets of proteins
of the same size, and the p-values are corrected for multiple testing using the false
Table 4.2: Performance comparisons of DOMAIN with NetworkBLAST and MaWISh on yeast-fly backbone networks.
method         # of complexes   # proteins (yeast)   # proteins (fly)   SPE (%)   SEN (%)   MIPS purity (%)   GO enrichment, yeast (%)   GO enrichment, fly (%)
DOMAIN         100              338                  313                34.0      9.0       66.7              89.0                       78.0
NetworkBLAST   82               299                  213                31.7      7.4       40.6              87.8                       79.3
MaWISh         54               193                  142                18.5      4.1       30.0              75.9                       66.7
Table 4.3: Performance comparisons of DOMAIN with NetworkBLAST and MaWISh on yeast-worm backbone networks.
method         # of complexes   # proteins (yeast)   # proteins (worm)   SPE (%)   SEN (%)   MIPS purity (%)   GO enrichment, yeast (%)   GO enrichment, worm (%)
DOMAIN         21               84                   63                  36.4      3.3       75.0              90.5                       9.5
NetworkBLAST   19               82                   51                  7.7       0.8       60.0              89.5                       10.5
MaWISh         11               42                   32                  11.1      1.6       42.8              63.6                       9.1
discovery rate (FDR) (Benjamini and Hochberg, 1995). Then, the specificity is
defined as the percent of yeast alignments that have a significant match in MIPS (p-
value <0.05), and the sensitivity is defined as the percent of MIPS alignments that
have significant matches in the resulting alignments. Moreover, an alignment is called
a pure alignment if it satisfies two conditions: (1) it contains at least three MIPS
annotated proteins and (2) there exists a complex in MIPS that covers more than
75% of its MIPS annotated proteins. We report purity, calculated by the number
of pure alignments divided by the total number of alignments with at least three
MIPS annotated proteins, as an alternative measure of the sensitive identification of
specific complexes.
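The scoring procedure for the first measure can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used in this work: the function names and toy data are hypothetical, we assume that the "hypergeometric score" is the upper-tail hypergeometric probability of the observed overlap, and the default number of random samples is reduced from 10,000 to keep the example fast.

```python
import random
from math import comb


def hypergeom_tail(universe_size, complex_size, sample_size, overlap):
    """P(X >= overlap) when sample_size proteins are drawn without replacement."""
    total = comb(universe_size, sample_size)
    upper = min(complex_size, sample_size)
    favourable = sum(
        comb(complex_size, i) * comb(universe_size - complex_size, sample_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total


def best_match_score(proteins, complexes, universe_size):
    """Smallest tail probability over all reference complexes (assumed score)."""
    return min(
        hypergeom_tail(universe_size, len(c), len(proteins), len(proteins & c))
        for c in complexes
    )


def empirical_pvalue(alignment, complexes, universe, n_samples=1000, seed=0):
    """Compare the observed score with scores of random protein sets of equal size."""
    rng = random.Random(seed)
    observed = best_match_score(alignment, complexes, len(universe))
    as_good = sum(
        best_match_score(set(rng.sample(universe, len(alignment))),
                         complexes, len(universe)) <= observed
        for _ in range(n_samples)
    )
    return (as_good + 1) / (n_samples + 1)  # add-one estimate, never exactly 0


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value downwards
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy data: protein identifiers and complexes are made up for illustration.
    universe = [f"P{i}" for i in range(200)]
    complexes = [set(universe[0:5]), set(universe[10:16])]
    alignments = [set(universe[0:4]), {"P50", "P90", "P130"}]
    pvals = [empirical_pvalue(a, complexes, universe) for a in alignments]
    adjusted = benjamini_hochberg(pvals)
    specificity = 100.0 * sum(q < 0.05 for q in adjusted) / len(alignments)
    print(pvals, adjusted, specificity)
```

Sensitivity would be computed analogously by swapping the roles of alignments and reference complexes.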
GO enrichment measures the functional coherence of the proteins in an identified
alignment with respect to the biological process annotation of GO, for each species
separately. We use the tool GO TermFinder (Boyle et al., 2004) to compute empirical
enrichment p-values, and correct for multiple testing using FDR. For each species, we
report the fraction of process-coherent alignments with p-value < 0.05 (considering
only the alignments with at least one GO annotated protein).
We chose to set the probability threshold of DDIs ε to the low but nonzero value
of 10^-20 so as to take into account as much DDI information as possible. For
yeast-fly alignment, DOMAIN generated an APE graph consisting of 6,918 APEs with
47,964 alignment extension links, 24,549 node duplication links, 5,573 edge indel
links, and 1,149 edge jump links; for yeast-worm alignment, it returned a 1,410-
node APE graph with 4,230 alignment extension links, 4,087 node duplication links,
140 edge indel links, and 37 edge jump links. For accurate comparison, we applied
NetworkBLAST and MaWISh on backbone networks with their suggested parameter
settings (see Sharan et al., 2005a; Koyuturk et al., 2006 for details). As summarized
in Tables 4.2 and 4.3, DOMAIN identified more significant non-redundant alignments
than NetworkBLAST and MaWISh in both alignments—explaining the good scores
on the sensitivity metric—but also managed to outperform the other methods on the
specificity and purity metrics. Indeed, it achieved the highest performance on almost
every evaluation metric, and in the instances in which it was bested, the difference
is slight.
The running time of DOMAIN is comparable to that of NetworkBLAST and MaWISh.
DOMAIN is currently implemented in Perl, and its running time on yeast-fly
and yeast-worm backbone networks is less than one minute (Intel Core 2 CPU,
2GB RAM). Because the running time is so small, we were able to
exhaustively expand from all APEs. If for some reason we needed to further reduce
computational complexity, we could instead consider an alternative expansion strat-
egy where we would expand only from “seed” APEs. The idea would be that if a
protein complex is conserved in many species, the PPIs in this complex are likely to
be conserved as well, and therefore the corresponding subgraph in the APE graph
should contain many alignment extension links. With this in mind, we could rank
the APEs by counting the number of their surrounding alignment extension links
and select, say, the top 25% as seeds for expansion. We tested this, and the results
were nearly identical to those listed in Tables 4.2 and 4.3, but the running times
for yeast-fly and yeast-worm alignments fell to 30 and 15 seconds, respectively.
In our case, the running time was not a problem, but it is reassuring that a seed-
based expansion strategy seems to be effective at reducing the running time without
affecting the results.
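The seed selection step can be sketched as below. This is a hypothetical illustration rather than the Perl implementation: we assume that the "surrounding" alignment extension links of an APE are the links incident to it, that links are given as pairs of APE identifiers, and that ties are broken by identifier.

```python
from math import ceil


def select_seed_apes(all_apes, extension_links, top_fraction=0.25):
    """Rank APEs by their number of incident alignment extension links
    and return the top fraction as seeds for expansion."""
    degree = {ape: 0 for ape in all_apes}
    for u, v in extension_links:
        degree[u] += 1
        degree[v] += 1
    # Highest degree first; ties broken by identifier for reproducibility.
    ranked = sorted(all_apes, key=lambda ape: (-degree[ape], ape))
    n_seeds = max(1, ceil(top_fraction * len(ranked)))
    return ranked[:n_seeds]


if __name__ == "__main__":
    apes = ["a", "b", "c", "d", "e", "f", "g", "h"]
    links = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("e", "f")]
    print(select_seed_apes(apes, links))
```

Expansion would then start only from the returned APEs instead of from every node in the APE graph.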
4.3.3 DOMAIN is sensitive at detecting small alignments
DOMAIN is sensitive at detecting small network alignments that might be deemed by
other algorithms to be topologically insignificant. For example, DOMAIN reported
a network alignment between the yeast NEF1 complex and the fly proteins mei-9,
Ercc1, and Xpac with high confidence (Fig. 4.3A). The GO process coherence of
these three fly proteins is significant: nucleotide-excision repair (p-value ≃ 10^-8),
DNA repair (p-value ≃ 10^-6), cellular response to DNA damage stimulus (p-value
≃ 10^-6), etc. However, neither MaWISh nor NetworkBLAST reports any alignment
involving the yeast NEF1 complex. They are likely to miss such alignments because
1) the sequence similarity between RAD10 and Ercc1 is insignificant (BLAST E-value
≃ 10^-8) and may be ignored if using a restrictive BLAST E-value threshold (e.g.,
10^-10 suggested in Hirsh and Sharan 2007), and 2) this alignment consists of only
three matched proteins and two conserved interactions, so it may not be sufficiently
topologically significant for some aligners to detect. On the other hand, the DDIs
within this alignment are well-conserved across species (the DDI probabilities of
ERCC4-Rad10 are 1.00 in both species; the DDI probabilities of Rad10-XPA_C are
1.00 and 0.54 in yeast and fly, respectively).
4.3.4 DOMAIN provides a comprehensive means of interpreting alignments
Another advantage of DOMAIN is that often it provides a more comprehensive means
of interpreting the identified network alignments, because protein domains are
directly relevant to function in many cases. For instance, RAD14 and Xpac may play
a similar role in the biological process of nucleotide-excision repair, as they share a
common XPA_C domain. Furthermore, although the XPA_N domain is not reported
as a significant domain for RAD14 in Pfam (E-value = 0.023), the alignment of
yeast RAD14 to fly Xpac suggests that XPA_N is potentially an important functional
domain in RAD14.

Figure 4.3: Evaluation of alignment performance of DOMAIN. (A) DOMAIN
is sensitive to small alignments. DOMAIN reports a network alignment between
the yeast NEF1 complex (MIPS category 510.180.10.10) and the fly proteins mei-9,
Ercc1, and Xpac. The object to the right of the double arrow depicts the
corresponding subgraph of this alignment in the APE graph. (B) DOMAIN provides a
comprehensive means to interpret network alignments. DOMAIN reports an
alignment between 10 yeast proteins and 3 worm proteins that significantly matches the
pathway of SNARE interactions in vesicular transport in KEGG. (C) An example of
improving network alignment by combining several cross-species pairwise alignments.
(Green: yeast proteins; blue: fly proteins; orange: worm proteins.)

Identifying conserved biological pathways across species is another important
application of network alignment. Fig. 4.3B shows an example of an alignment
reported by DOMAIN between 10 yeast proteins and 3 worm proteins, in which 9
yeast proteins (all except NYV1) and all 3 worm proteins are known to be involved
in the pathway of SNARE interactions in vesicular transport in KEGG (Kanehisa
and Goto, 2000).
4.3.5 Performance improves by combining cross-species pairwise alignments
Alignment performance may further be improved by combining several cross-species
pairwise network alignments. Fig. 4.3C shows an example of combining three align-
ments taken from yeast-fly, yeast-worm, and fly-worm network alignments, respec-
tively. By aligning yeast and fly networks, DOMAIN detects an alignment between
3 fly proteins (CG8142, RfC3, and RfC40) and 7 yeast proteins, and 4 of them
(RFC1-4) are involved in the replication factor C complex (MIPS: 410.40.30). As
the yeast replication factor C complex contains 5 proteins (RFC1-5), the F-score [1] is
0.666. Further, we see that 2 worm proteins (F44B9.8 and rFc-2) are aligned to all
these 3 fly proteins in fly-worm alignment and 3 of these 7 yeast proteins (RFC2-4)
in yeast-worm alignment. This three-way alignment suggests that the alignment
between fly proteins CG8142, RfC3, and RfC40 and yeast proteins RFC2-4 is of high
confidence, and the F-score is increased to 0.750.
[1] The F-score is defined as F = 2 × (precision × recall) / (precision + recall).
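The two F-scores quoted in this section can be checked with a few lines of Python. The helper name and the three placeholder protein names are hypothetical; only RFC1-5 are named in the text, so the remaining aligned yeast proteins are represented by stand-ins.

```python
def f_score(predicted, reference):
    """F = 2 * precision * recall / (precision + recall) for two protein sets."""
    predicted, reference = set(predicted), set(reference)
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


rfc_complex = {"RFC1", "RFC2", "RFC3", "RFC4", "RFC5"}

# Yeast-fly alignment: 7 yeast proteins, 4 of which (RFC1-4) are in the complex.
# "other_1" to "other_3" are placeholders for the proteins not named in the text.
pairwise = {"RFC1", "RFC2", "RFC3", "RFC4", "other_1", "other_2", "other_3"}

# Three-way consensus: only RFC2-4 remain.
three_way = {"RFC2", "RFC3", "RFC4"}

if __name__ == "__main__":
    print(f_score(pairwise, rfc_complex), f_score(three_way, rfc_complex))
```

For the pairwise alignment, precision is 4/7 and recall is 4/5, giving F = 2/3; for the three-way consensus, precision is 1 and recall is 3/5, giving F = 0.75.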
Table 4.4: Cell-cycle-related functional modules conserved across budding yeast and fruit fly.

clusters in yeast | clusters in fly | enriched GO term in common | alignment info
Spr28,Cdc11,Cdc12,Cdc10,Shs1,Cdc3 | Sep4,Sep2,Sep1 | cytokinesis, cell division | conserved proteins that play key roles in septin ring assembly
Htb1,Hta1,Hhf1,Hht1 | Hs2A,His4,cenH3,His2B | chromatin assembly or disassembly, chromatin organization, chromosome organization | conserved histone genes
Clb2,Rts1,Cdc28 | CycD,dMST,PP2A | - | Inferred from the yeast cluster, the fly cluster seems to play a role in M phase (e.g., chromosome segregation, spindle assembly).
Clb2,Swe1,Cdc28 | CycD,Cdk5,Cdk4 | regulation of cell cycle process | both clusters play a role in regulation of cell cycle (inferred: M-Cdk)
Clb5,Kin1,Cdc28 | CycD,Cdk5,Cdk2 | regulation of cell cycle, phosphorylation | both clusters play a role in regulation of cell cycle (inferred: S-Cdk)
Ste20,Cdc28,Cln2 | Cdk2,CycG,CG11533 | regulation of cell cycle | both clusters play a role in regulation of cell cycle (inferred: G1/S-Cdk)

4.4 Detecting conserved cell-cycle-related functional modules
In the previous sections, we introduced DOMAIN, a network alignment method
that employs a novel direct-edge-alignment paradigm to detect conserved functional
modules by pairwise alignment of protein-protein interaction networks across species.
We demonstrated that DOMAIN is sensitive at detecting small conserved alignments
across species and that, on the basis of protein functional domains, DOMAIN can also
provide functional information about the resulting alignments. In this section, we
focus on the alignments that DOMAIN produces between the protein-protein interaction
networks of budding yeast and fruit fly, the two largest and most studied networks. As
the major goal of this thesis is to study cell-cycle regulation, we ask
how cell-cycle-related functional modules are conserved across species.
The cell-cycle-related network alignments are listed in Table 4.4.
Six of the identified clusters across the two species appear to be related to the cell
cycle, including yeast clusters containing Cdc28-Clb2, Cdc28-Clb5, and Cdc28-Cln2,
three cyclin-CDK complexes. The aligned clusters in fruit fly contain cyclin-dependent
kinases (Cdk2, Cdk4, and Cdk5, as listed in Table 4.4) together with the cyclins CycD
and CycG, suggesting that these fly clusters may also be cyclin-CDK complexes and
play key roles during the cell cycle. These results not only indicate that cyclin-CDK
complexes are highly conserved during evolution (Ubersax et al., 2003), but also
demonstrate the alignment performance of our method.
4.5 Discussions
In this chapter, we described DOMAIN, a domain-oriented pairwise network align-
ment framework. To our knowledge, DOMAIN is the first algorithm to introduce
protein domains into the network alignment problem. Also, DOMAIN uses a novel
direct-edge-alignment paradigm to directly detect equivalent PPI pairs across species
and suggests a new graph representation to merge these equivalent PPI pairs and
their network-evolutionary based relationships into one graph. We tested DOMAIN
to identify conserved protein complexes in the yeast-fly and yeast-worm protein in-
teraction networks, and the experimental results show that DOMAIN exhibits better
performance than two recent pairwise network alignment methods in most perfor-
mance metrics.
Although DOMAIN can be applied only to a subset of proteins with domain map-
pings, we notice that most functionally annotated proteins contain domain structures
and remain in this subset. To further overcome this restriction, we may employ a
larger domain database (e.g., CDD (Marchler-Bauer et al., 2007)), or combine DO-
MAIN with other network aligners. In addition, as the set of defined domains expands
and is refined over time, this will gradually become less of a restriction.
Further directions for research include extending this approach to multiple net-
work alignment and to network querying. Since multiple network alignment requires
more than two networks by definition, we would simply need to devise an appropriate
scoring scheme that can handle more than a pair of alignable PPIs at once, and then
extend the notion of the APE graph accordingly.
The goal of network querying is to identify subnetworks in a given network that
are similar to the query. Typically, the network query is a hypothetical or known
functional module. We may simply treat the query as a small input network and
apply our DOMAIN method directly to it. A more sophisticated approach would
be to devise a sequence-profile-like structure to describe the DDI contents of the
network query, as well as perhaps constructing such structures for the full network
as a one-time expense for many successive queries.
5
Conclusions
The imperfect synchrony of a synchronized population of cells prevents us from di-
rectly using populations to precisely observe the dynamics of processes that occur
in single cells. In this thesis, we mainly present a deconvolution algorithm that ef-
ficiently removes the effects of synchrony loss from population-level measurements.
When applied to recent replicate microarray data, it robustly recovers precise tran-
scription profiles with markedly increased dynamic range and temporal resolution.
Our algorithm is built upon the cloccs framework which models three distinct
asynchrony sources: imperfect synchronization in the initial cell populations, vari-
ance in progression rates of individual cells through the cell cycle, and asymmetric
cell division. It should be explicitly noted that our deconvolution method cannot
assess variability across single cells, which might be interesting, especially for molec-
ular species at very low concentrations where noise plays an important role. Rather,
our method provides a high-resolution view of the transcript levels of the average
single cell; or alternatively, it learns what would be observed if we were to measure
a population of cells that starts and remains in perfect synchrony throughout a time
course.
Our approach has several algorithmic advantages: (1) Our algorithm optimizes
a convex objective function, and thus has a unique global optimum. Mature convex
optimization techniques and implementations enable an optimal solution to be found
efficiently: in practice, we can deconvolve a transcription profile in a few seconds in
MATLAB running on standard hardware. (2) By design, deconvolution algorithms
enhance the features of blurred population-level measurements to sharpen the underlying
signal. However, previous deconvolution methods often end up sharpening noise as
well. We avoid this problem by formulating an objective function that is Bayesian
l1-regularized using a wavelet basis. Such an approach has been used in the signal
and image processing communities, where it has been shown to effectively deblur
signals and images while smoothing away noise (Donoho et al., 1994); to our knowl-
edge, however, wavelet-basis regularization has never been applied in a branching
process context as we require here. The usefulness of this approach is evident, as
about one third of genes had a PTR that decreased after deconvolution, presumably
because the fluctuation in measured transcript levels was due to noise rather than
cell-cycle regulation. For example, after deconvolution, the constitutively expressed
actin gene ACT1 and almost all ribosomal protein genes are essentially flat over
the entire course of the cell cycle (Fig. 3.4). These observations indicate that our
deconvolution algorithm can correctly dampen noise even while sharpening signal.
(3) The extensible design of our convolution kernel approach allows us to learn a
single transcription profile from replicate time-series experiments, leading to more
accurate and robust estimates.
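To make the idea in point (2) concrete, the following Python sketch deconvolves a blurred signal under an l1 penalty on wavelet coefficients. It is a generic illustration and not the objective or solver used in this thesis: the Haar basis, the iterative soft-thresholding (ISTA) solver, the toy circular blurring kernel, and all function names are assumptions made for the example. The l1 penalty corresponds to a Laplace prior on the coefficients in a MAP reading, and the objective 0.5*||K W^T c - y||^2 + lam*||c||_1 is convex.

```python
import math


def haar_matrix(n):
    """Orthonormal Haar transform matrix (rows are basis vectors); n is a power of two."""
    if n == 1:
        return [[1.0]]
    s = 1.0 / math.sqrt(2.0)
    half = haar_matrix(n // 2)
    coarse = [[s * v for v in row for _ in (0, 1)] for row in half]
    detail = [[0.0] * n for _ in range(n // 2)]
    for i in range(n // 2):
        detail[i][2 * i], detail[i][2 * i + 1] = s, -s
    return coarse + detail


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def ista(a, y, lam, n_iter=2000):
    """Minimise 0.5*||a c - y||^2 + lam*||c||_1 by iterative soft-thresholding."""
    at = transpose(a)
    # 1/||a||_F^2 is never larger than 1/L, so each iteration cannot increase the objective.
    step = 1.0 / sum(v * v for row in a for v in row)
    c = [0.0] * len(a[0])
    for _ in range(n_iter):
        residual = [p - q for p, q in zip(matvec(a, c), y)]
        grad = matvec(at, residual)
        moved = [ci - step * gi for ci, gi in zip(c, grad)]
        c = [math.copysign(max(abs(v) - step * lam, 0.0), v) for v in moved]
    return c


def deconvolve(kernel, y, lam):
    """Estimate x with y ~ kernel x, penalising the Haar coefficients c = W x."""
    wt = transpose(haar_matrix(len(kernel[0])))
    c = ista(matmul(kernel, wt), y, lam)
    return matvec(wt, c)  # x = W^T c


if __name__ == "__main__":
    n = 8
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):  # toy circular blur standing in for a synchrony-loss kernel
        for offset, weight in ((-1, 0.25), (0, 0.5), (1, 0.25)):
            kernel[i][(i + offset) % n] += weight
    x_true = [0, 0, 4, 4, 0, 0, 0, 0]
    y = matvec(kernel, x_true)
    print([round(v, 2) for v in deconvolve(kernel, y, lam=0.01)])
```

A larger `lam` suppresses more wavelet coefficients and therefore flattens the estimate, which is the mechanism by which noise is smoothed away while sharp features are retained.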
A further advantage of our deconvolution algorithm is that when applying it to
population-level measurements of transcript dynamics across the yeast cell cycle, it
can learn distinct cell-cycle transcription programs for mother and daughter cells,
because we explicitly model them as distinct within the branching process. Our
algorithm identifies 82 genes that appear to be transcribed specifically in daughter
cells, and we anticipate this finding will be useful for studying late mitotic and early
G1 cell-cycle events, as well as cell differentiation in yeast. Moreover, the ability to
distinguish programs for biologically relevant sub-populations is not limited simply to
mother and daughter cells in budding yeast; by modifying the underlying branching
process model, this feature of our deconvolution algorithm could be extended to other
systems, and thereby lead to the identification of transcription programs that occur
only in distinct sub-populations of cells.
Our deconvolved estimates show a significant increase in amplitude of cell-cycle
oscillation for most of the genes measured. Using our results, we established a larger
periodic gene set (nearly twice as large as that identified in Spellman et al. (1998))
that includes about 70% of the periodic genes identified in the previous studies.
Although we do not believe all these genes are exclusively cell-cycle-regulated—for
example, some genes with significant stress-response regulation are included (see
Supplementary Figure S2 for details)—the size of this set suggests that many genes
may exhibit previously unrecognized transcriptional regulation during the cell cycle.
On the other hand, we also noticed that some well-studied cell-cycle-involved genes
like MCM1 and CDT1 are not in our cell-cycle-regulated set (or any previously
established sets, for that matter). One explanation may be that their expression
does not vary during the cell cycle. Another explanation is that their expression is
variable but regulated post-transcriptionally (i.e., we might see fluctuating expression
if we monitored protein abundance, or in the case of kinase targets, abundance of
phosphorylated protein). A more remote third possibility is that these genes may
be transcriptionally regulated, but transcribed at multiple times during a single cell
cycle, possibly because they may play multiple roles; due to convolution effects, the
transcription profiles of such genes would be greatly muddled in a cell population,
and deconvolving them to achieve sufficiently large PTR scores may be difficult,
given the level of noise in microarray experiments.
Although we have demonstrated the usefulness of our algorithm by deconvolving
genome-wide transcription profiles, the algorithm is general and can be used to de-
convolve many other population-level data sources, such as nucleosome occupancy
measurements, protein expression profiles obtained by Western blots, or measure-
ments in organisms other than budding yeast. All the algorithm needs as input
are synchrony measurements from cloccs or some other distribution model (e.g.,
the cell-type distribution model used in Siegal-Gaskins et al. (2009)) and time-series
measurements to be deconvolved.
Future work on cell-cycle deconvolution could proceed in two directions. The
first is to develop a user-friendly application. Because our deconvolution
algorithm is general and can be applied to many different types of cell-cycle data
sources in many organisms, and because it is built upon the cell-cycle distribution
model cloccs, it would be beneficial to develop an application, ideally a web
application, that integrates cloccs and the deconvolution framework. Users would upload
their time-series data as well as the corresponding cell-cycle profiles of biomarkers;
the application would first estimate the cell-cycle parameters using cloccs, and
then use these parameters to deconvolve the given time-series data. Users
could then simply review the resulting deconvolved cell-cycle profiles without needing
to know the back-end algorithms. The second direction is to extend our deconvolution
framework to other biological processes that occur during the cell cycle. For example, the
transcript levels learned from the deconvolution are the net result of mRNA production
and degradation. So if we knew the mRNA decay rates at the single-cell level, we could,
at least in principle, accurately estimate the corresponding mRNA production rates,
which are conceptually regulated by the binding of upstream transcription factors.
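As a rough illustration of this second direction, suppose (as an assumption not made in this thesis) that the transcript level m(t) of a gene follows first-order kinetics, dm/dt = p(t) - d * m(t), with a constant decay rate d. The production rate p(t) can then be recovered from a deconvolved profile by finite differences. The function name and the sample values below are hypothetical.

```python
def production_rates(times, levels, decay_rate):
    """Estimate p(t) = dm/dt + decay_rate * m(t) from a sampled profile.

    Assumes first-order decay with a constant rate; slopes use central
    differences in the interior and one-sided differences at the ends.
    Requires at least two time points.
    """
    n = len(times)
    rates = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        slope = (levels[hi] - levels[lo]) / (times[hi] - times[lo])
        rates.append(slope + decay_rate * levels[i])
    return rates


if __name__ == "__main__":
    minutes = [0, 10, 20, 30, 40]
    profile = [1.0, 2.5, 4.0, 2.5, 1.0]  # made-up deconvolved transcript levels
    print(production_rates(minutes, profile, decay_rate=0.05))
```

With noisy data the finite-difference slope would amplify noise, so in practice the profile would need to be smoothed first or the production rate estimated jointly with the deconvolution.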
Appendix A
Intrinsic disorder within and flanking the DNA-binding domains of human transcription
factors
In this appendix, we present a study of the association between intrinsically
disordered regions (IDRs) and transcription factors (TFs). Using several computational
disorder prediction methods, we investigate the prevalence of IDRs within
DNA-binding domains (DBDs) and in their flanking regions across human TFs. Our
results confirm the hypothesis that the most prevalent DBDs in human TFs exhibit
significant order, but the flanking regions of these DBDs generally exhibit significant
disorder. Most of the work presented in this appendix appeared in Guo et al. (2012b).
A.1 Introduction to intrinsically disordered structures and transcription factors
The function of a protein is encoded in its amino acid sequence (i.e., primary struc-
ture). However, protein activity typically depends on the protein being folded prop-
erly into its component secondary structure elements (e.g., alpha helices, beta sheets)
and the overall, global conformation of the protein (i.e., tertiary structure). Protein
structure can be determined experimentally at high resolution either by X-ray crys-
tallography or by nuclear magnetic resonance (NMR). X-ray crystallography is often
used, but cannot provide information on the conformation of regions that are either
highly dynamic or unstructured in the crystal. NMR can provide information about
flexibility and dynamics in proteins, but this technique is limited to smaller proteins.
Through a combination of structural and biochemical studies, it has become
increasingly appreciated that a protein may not adopt a single, well-defined “struc-
ture”, a term connoting a measure of rigidity. Rather, a protein may sample an
ensemble of global conformations; parts of the protein may be consistently
structured across this ensemble, while other parts may be quite variable or flexible
across the ensemble. These latter regions are sometimes termed “intrinsically disor-
dered regions” (IDRs), though they may adopt a more structured conformation upon
interaction with another molecule, whether a protein, DNA, or other ligand (Eliezer,
2009).
Proteins are largely involved in processes related to molecular recognition (e.g.,
binding, signaling, complex formation, enzymatic catalysis), and IDRs may enable
these recognition events either directly (e.g., serving as the recognition domain of
a protein) or indirectly (e.g., serving as a hinge that allows two ordered regions of
a protein to come together to effect recognition). For this reason, IDRs have been
studied rather extensively over the past decade, and a large number of computational
methods have been developed for the prediction of IDRs on the basis of amino acid
sequence, though this remains an imperfect art (see He et al. (2009) for a review).
In this study, we were interested in exploring the role(s) that IDRs might play
in the recognition tasks of transcription factors (TFs) in particular. Computational
explorations have found that IDRs are generally more prevalent in TFs than would
be expected by chance, especially in eukaryotes (Minezaki et al., 2006; Liu et al.,
2006; Fuxreiter et al., 2011). As a specific example, careful molecular studies have
shown that a region of fifteen amino acids within the DNA-binding domain (DBD) of
the estrogen receptor is disordered in solution, and makes contacts with DNA (and
with another ER DBD monomer), as shown in a co-crystal structure of the ER DBD
bound to DNA (Schwabe et al., 1993). Moreover, IDRs outside the homeodomain
DBD have also been found to impact the DNA-binding affinity of the Drosophila TF
Ubx (Liu et al., 2008). In addition, the region N-terminal to the proximal accessory
region of the Saccharomyces cerevisiae C2H2 zinc finger TF Adr1 is disordered in
solution (even after binding DNA) and increases the affinity for non-specific DNA,
mainly by increasing the DNA association rate; increased affinity for non-specific
DNA might allow a protein to find its specific sites more quickly after translocation
from non-specific sites that are bound initially (Schaufler and Klevit, 2003). Finally,
DBDs often have N- or C-terminal extensions, referred to as ‘arms’ or ‘tails’, that
bind DNA but are disordered when free in solution (Crane-Robinson et al., 2006).
Intrigued by this ensemble of findings pointing to the importance of IDRs in TFs
and their interactions with DNA, we sought to explore the connection between IDRs
and TF function more precisely and systematically. We were particularly interested
in determining whether IDRs were more prevalent in the regions flanking the DBDs
that are responsible for the binding of sequence-specific TFs to DNA.
A.2 Materials and Methods
A.2.1 Constructing the TF dataset and the non-TF control dataset
We created two non-redundant datasets of human proteins: a TF set and a non-TF
set for use as a control. The procedure for constructing these sets and ensuring their
non-redundancy is described below and summarized in Figure A.1A.
We assembled the TF set from a published repertoire of human TFs (Vaquerizas
et al., 2009). In their study, Vaquerizas and colleagues manually curated and iden-
tified 1,987 TF-coding human genomic loci in the Ensembl database (Flicek et al.,
2011); the list includes 1,960 high-confidence entries and 27 entries curated as prob-
able. We cross-referenced these Ensembl loci against the RefSeq database (release
47) (Pruitt et al., 2009) to obtain 2,362 protein isoforms associated with 1,747 genes.
To reduce sequence redundancy and thus potential bias, if multiple isoforms were
associated with the same gene, we selected only the longest. This resulted in a final
total of 1,747 unique TF protein sequences, and in subsequent analysis, we call this
our TF set.
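The redundancy-removal step amounts to keeping one isoform per gene. A minimal sketch follows; the record layout, the identifiers in the example, and the tie-breaking rule (first isoform encountered wins) are assumptions for illustration.

```python
def longest_isoform_per_gene(isoforms):
    """Keep only the longest protein isoform for each gene.

    isoforms: iterable of (gene_id, protein_id, sequence) tuples.
    Returns a dict mapping gene_id to (protein_id, sequence).
    """
    best = {}
    for gene_id, protein_id, sequence in isoforms:
        current = best.get(gene_id)
        if current is None or len(sequence) > len(current[1]):
            best[gene_id] = (protein_id, sequence)
    return best


if __name__ == "__main__":
    # Hypothetical records, not real RefSeq entries.
    records = [
        ("geneA", "isoform_1", "MKT"),
        ("geneA", "isoform_2", "MKTAYIAK"),
        ("geneB", "isoform_1", "MSDE"),
    ]
    print(longest_isoform_per_gene(records))
```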
We assembled our non-TF control set by downloading all human proteins from
RefSeq, and excluding the 2,362 TF-associated isoforms from above, which yielded a
total of 32,567 non-TF proteins. To match the size and sequence length distribution
of our TF set, we randomly sampled 1,747 proteins from the 32,567 according to
the empirical sequence length distribution of the TF set; to ensure non-redundancy
during this process, at each iteration we required that the sampled protein come
from a locus not previously sampled. Therefore, the resulting control set contains
1,747 unique non-TF protein sequences.
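One way to realize this length-matched sampling is sketched below. The exact sampling scheme is not specified here, so the binning of lengths, the bin width, the fallback to the nearest length when a bin is exhausted, and the function name are all assumptions.

```python
import random


def sample_length_matched(tf_lengths, candidates, bin_width=50, seed=0):
    """Draw one non-TF protein per TF length, never reusing a locus.

    tf_lengths: sequence lengths of the proteins in the TF set.
    candidates: list of (protein_id, locus_id, length) for non-TF proteins.
    """
    rng = random.Random(seed)
    used_loci = set()
    picks = []
    for target in tf_lengths:
        available = [c for c in candidates if c[1] not in used_loci]
        if not available:
            raise ValueError("not enough distinct loci to match the TF set")
        same_bin = [c for c in available if c[2] // bin_width == target // bin_width]
        if same_bin:
            choice = rng.choice(same_bin)  # uniform within the length bin
        else:
            choice = min(available, key=lambda c: abs(c[2] - target))
        used_loci.add(choice[1])
        picks.append(choice)
    return picks


if __name__ == "__main__":
    # Hypothetical candidates: (protein_id, locus_id, length).
    pool = [("p1", "L1", 120), ("p2", "L1", 130), ("p3", "L2", 125),
            ("p4", "L3", 480), ("p5", "L4", 900)]
    print(sample_length_matched([110, 140, 500], pool))
```

Because each pick marks its locus as used, at most one isoform per locus can enter the control set, mirroring the non-redundancy of the TF set.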
A.2.2 Comparing the TF and non-TF sets of proteins
To ensure that the non-TF set represents a well-constructed control for the TF set,
we compared various properties of the two sets. First, we compared the sequence
length distributions of the TF set and the non-TF control set, in addition to the
set of all human TFs (i.e. with redundancy). As shown in Fig. A.1B, no apparent
differences exist between the sequence length distributions in the TF set, the non-TF
control set, and the set of all human TFs.
[Figure A.1, image not reproduced. Panel A, pipeline: all human TF loci in Ensembl (1,987) -> cross-reference -> protein entries in RefSeq (2,362) -> remove redundancy, 1 isoform per locus -> non-redundant (nr) TFs (1,747); all human protein entries in RefSeq (~35K) -> remove TFs -> non-TF protein entries (~32.6K) -> sample w.r.t. sequence length distribution of nr TF set -> sampled non-TF ctrl set (1,747). Panel B: frequency (%) versus length (amino acids) for all TFs, nr TFs, and non-TF ctrl. Panel C: frequency (%) of each amino acid, in the order W F Y I M L V N C T A G R D H Q K S E P, for all TFs, nr TFs, and non-TF ctrl.]
Figure A.1: Generation of TF set and the non-TF control set. (A) A schematic
of the pipeline for generating the TF set and the non-TF control set. (B) Sequence
length distributions of the TF set, the non-TF control set, and the set of all human
TFs (with redundancy). (C) The amino acid compositions of the TF set, the non-TF
control set, and the set of all human TFs (with redundancy). Amino acids are listed
from most order-promoting to most disorder-promoting, according to Campen et al.
(2008). It is apparent from the histogram that compared to proteins in general, TFs
have fewer order-promoting residues (e.g., W, F, Y, I, M, L, V) and more
disorder-promoting residues (e.g., P, E, S, K, Q, H).

Next, we compared the amino acid compositions of the TF set, the non-TF
control set, and the set of all human TFs (Fig. A.1C). The amino acid composition
of sequences in IDRs has been shown to be significantly different from that in
ordered regions (Dunker et al., 2001), and IDRs have been shown to have high
prevalence in TFs (Liu et al., 2006), so we might expect compositional differences
between the TF sets and the non-TF control set. Indeed, compared to the non-TF
control set, both TF sets are enriched in disorder-promoting amino acids (e.g., P,
E, S, K, Q, H), and depleted in order-promoting amino acids (e.g., W, F, Y, I, M,
L, V) (Dunker et al., 2001; Campen et al., 2008), as expected. However, the amino
acid compositions of our non-redundant TF set and the set of all human TFs are
nearly identical, suggesting that our procedure for removing redundancy introduces
no significant compositional bias.
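The composition comparison rests on a simple per-residue frequency calculation, sketched here. The function name and example sequences are hypothetical; the residue ordering is the order-promoting to disorder-promoting ordering used on the axis of Fig. A.1C.

```python
from collections import Counter

# Residues ordered from most order-promoting to most disorder-promoting,
# as on the horizontal axis of Fig. A.1C.
ORDER_TO_DISORDER = "WFYIMLVNCTAGRDHQKSEP"


def composition(sequences):
    """Percentage of each standard amino acid over a set of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in ORDER_TO_DISORDER)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: 100.0 * counts[aa] / total for aa in ORDER_TO_DISORDER}


if __name__ == "__main__":
    tf_like = ["MPSEKPQSSE", "PEKSQHPPSK"]     # made-up sequences
    control_like = ["MWFYILVLIV", "AWLIVFYMLG"]  # made-up sequences
    tf_comp, ctrl_comp = composition(tf_like), composition(control_like)
    for aa in ORDER_TO_DISORDER:
        print(aa, round(tf_comp[aa], 1), round(ctrl_comp[aa], 1))
```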
A.2.3 Identifying DNA-binding domains (DBDs) and their locations within proteins
Our goal is to investigate the prevalence and locations of IDRs within human TFs,
and in particular, the spatial relationships between IDRs and DBDs in TFs. To iden-
tify all sequence-specific DBDs that occur within human TFs, we started with the
entire set of human proteins from RefSeq and identified every Pfam domain (Finn
et al., 2010) that was contained in a human protein with a p-value below 0.05. We
manually filtered for those domains whose text descriptions in the Pfam or Inter-
Pro (Hunter et al., 2009) databases indicated that the domain mediates sequence-
specific DNA binding, resulting in 76 domains which we henceforth call Pfam DBDs.
Using HMMER (Eddy, 2009) with default parameters, we searched for the loca-
tions of matches to Pfam DBDs within our TF set. We found 71 of the 76 Pfam DBDs
matched to proteins in our TF set, with 32 DBDs appearing more than five times.
Of the 1,747 proteins in our TF set, 669 contained only a single DBD, while another
642 contained multiple DBDs; proteins with multiple DBDs are typically those con-
taining multiple zinc fingers, which are annotated as separate domains even if they
occur in tandem within a protein. Indeed, the TF with the highest number of DBDs
is zinc finger protein 91 (RefSeq: NP_003421), which contains 31 zf-C2H2 (zinc
finger, C2H2-type) domains. The zf-C2H2 domain is interesting in its own right as it
is by far the most prevalent domain in our TF set, appearing a total of 4,154 times,
almost 20 times as often as the next most prevalent domain.
A.2.4 Using multiple prediction methods to predict intrinsically disordered regions (IDRs) within proteins
To perform our analysis, we first needed to predict the ordered and disordered re-
gions within proteins using existing computational tools. Since this remains a bit of
an imperfect art, we took care to ensure that our conclusions would not be overly de-
pendent on the predictions of any single choice of method. Consequently, we chose to
use three distinct disorder prediction tools, each demonstrated to perform with high
accuracy (He et al., 2009): PONDR VSL2 (Peng et al., 2006), DISOPRED2 (Ward
et al., 2004), and PreDisorder 1.1 (Deng et al., 2009). PONDR VSL2 was evaluated
as the top-ranked disorder predictor in CASP7 in 2006 (Bordoli et al., 2007), and Pre-
Disorder was ranked among the top methods in disorder prediction during CASP8 in
2008 and CASP9 in 2010. These methods employ a variety of techniques to analyze
sequence and structural information for IDR prediction: PONDR VSL2 uses support
vector machines (SVMs) to separately address prediction problems in short versus
long sequence regions, and then merges the results using a logistic regression model;
DISOPRED2 is also based on SVMs, and compared to other prediction methods,
the main difference is that it is directly trained on the whole sequence using various
combinations of binary-encoded amino acid sequence, secondary structure predic-
tions, and sequence profiles; and PreDisorder 1.1 is based on an ab initio prediction
method along with a meta-prediction method.
A.2.5 Defining disorder features: spatial relationships of IDRs relative to DBDs within TFs
Given the annotated DBDs and the predicted disorder regions in the TF set and the
non-TF control set, we sought to systematically analyze the association between TF
DBDs and predicted IDRs by testing for enrichment of IDRs at different locations
relative to DBDs. Specifically, we were interested in IDRs within the DBD itself,
as well as the regions flanking the DBD, and we developed five distinct ‘disorder
features’: we say that a DBD is disordered if at least a fraction f of its residues are
predicted to be disordered; we say that the N-terminal flank of a DBD is disordered
if at least a fraction f of the 30 residues flanking the DBD in the N-terminal direction
are predicted to be disordered; analogously, we say that the C-terminal flank of a
DBD is disordered if at least a fraction f of the 30 residues flanking the DBD in
the C-terminal direction are predicted to be disordered; we say that both flanks of
a DBD are disordered if both the N-terminal and C-terminal flanks are disordered;
and finally, we say that an entire TF is disordered if at least a fraction f of all of its
residues are disordered. We wanted to be fairly stringent in identifying these disorder
features, so that we could focus on those with the highest confidence; therefore, we
chose the value of 0.8 for f.
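The five disorder features defined above can be stated compactly in code. The following is a minimal sketch, not the pipeline used in this study: it assumes that per-residue disorder calls are available as a list of booleans and that a DBD is given by 0-based, half-open coordinates. The function names, the coordinate convention, and the handling of flanks truncated by a protein terminus (a truncated flank is scored over the residues that exist, and an empty flank is never called disordered) are illustrative assumptions.

```python
from typing import Dict, List

FLANK = 30   # flank length in residues, as in the text
F_MIN = 0.8  # minimum disordered fraction f, as in the text


def frac_disordered(calls: List[bool]) -> float:
    """Fraction of residues called disordered; 0.0 for an empty region."""
    return sum(calls) / len(calls) if calls else 0.0


def disorder_features(calls: List[bool], dbd_start: int, dbd_end: int,
                      flank: int = FLANK, f: float = F_MIN) -> Dict[str, bool]:
    """Evaluate the five disorder features for one DBD occurrence.

    calls      per-residue disorder calls for the whole protein
    dbd_start  0-based index of the first DBD residue (assumed convention)
    dbd_end    0-based index one past the last DBD residue
    """
    # Flanks are clipped at the protein termini (an assumption of this sketch).
    n_flank = calls[max(0, dbd_start - flank):dbd_start]
    c_flank = calls[dbd_end:dbd_end + flank]
    n_dis = frac_disordered(n_flank) >= f
    c_dis = frac_disordered(c_flank) >= f
    return {
        "dbd": frac_disordered(calls[dbd_start:dbd_end]) >= f,
        "n_flank": n_dis,
        "c_flank": c_dis,
        "both_flanks": n_dis and c_dis,
        "whole_tf": frac_disordered(calls) >= f,
    }


# Toy protein: 30 disordered residues, a 20-residue ordered DBD, 30 disordered residues.
toy = [True] * 30 + [False] * 20 + [True] * 30
features = disorder_features(toy, dbd_start=30, dbd_end=50)
```

In this toy case both flanks are called disordered while the DBD is not, and the whole protein falls just short of the threshold because only 60 of its 80 residues are disordered.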
A.2.6 Calculating statistical significance of disorder features
To assess whether the prevalence of disorder features within and flanking DBDs was
unusually high or low, we needed to determine a suitable measure of significance.
Moreover, since different computational tools predict IDRs at different rates, our
significance measure needed to enable the comparison of results across methods,
and not be biased by methods that are systematically more or less likely to predict
disorder within proteins.
We thus developed two different null models to test for the significance of our
disorder features (e.g., disordered DBD, N-terminal flank, or C-terminal flank). The
first null model assumed that each DBD was located uniformly at random within its
sequence, and was based on the TF set. The second null model made the same
assumption, but was based on the non-TF control set. In summary, these two null
models, in which the location of a DBD was chosen uniformly at random, were designed
to test whether the spatial relationships between IDRs and DBDs were statistically
significant or simply occurred by chance.
With each null model providing a baseline expectation for how often a disorder
feature might be found by chance, we could then compute a significance measure
based on the p-value from a hypergeometric distribution (i.e., Fisher’s exact test).
For each disorder feature we considered, we were able to compute two separate p-
values, one for each null model. Consistency of significance across the two different
null models thus gave us some confidence that our results were robust to the specific
choice of null model.
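To make the significance computation concrete, the following sketch evaluates the upper tail of a hypergeometric distribution using only the Python standard library. The text above does not spell out how the counts are arranged, so the setup here is an assumption made for illustration: N is the total number of DBD placements considered (observed plus null), K is how many of them show the disorder feature, n is the number of observed DBD occurrences, and k is how many observed occurrences show the feature. The function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N  population size (all placements, observed plus null; assumed setup)
    K  number of placements in the population showing the feature
    n  number of draws (observed DBD occurrences)
    k  number of observed occurrences showing the feature
    """
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("need 0 <= K <= N and 0 <= n <= N")
    lo = max(k, 0, n - (N - K))   # smallest feasible count that is >= k
    hi = min(n, K)                # largest feasible count
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(lo, hi + 1)) / total


# Toy example: 10 placements, 4 with the feature, 3 observed, 2 of them with the feature.
p = hypergeom_upper_tail(k=2, N=10, K=4, n=3)
```

Under this arrangement, a p-value near 0 indicates that the feature occurs more often than the null model predicts, and a p-value near 1 indicates that it occurs less often.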
A.3 Results
A.3.1 Comparing the three methods to predict IDRs within proteins
We used three different disorder prediction tools to predict IDRs in both the TF set
and the non-TF control set. Though the purpose of this paper is to make use of
existing prediction methods and not to evaluate them (which has already been done
by others), it is important to at least have a summary sense of how each method
is performing on our various protein sets. A summary of the results of the three
methods is listed in Table A.1 and shown in Figure A.2. In Table A.1, we calculate
the total percentage of protein residues predicted as disordered by each method,
along with the average length of each predicted IDR. In Figure A.2, we compare the
fraction of each protein’s residues predicted as disordered by each method. The table
and figure reveal that all three methods consistently predict proteins in the TF set
to have more disordered residues, longer IDRs, and a greater fraction of disordered
residues than proteins in the non-TF control set, confirming earlier findings that
IDRs are enriched in TFs.
As an aside, it is apparent that PONDR VSL2 is far more likely than the other
two methods to call a residue disordered, in both the TF set and the non-TF
control set, suggesting that the method is probably operating at a different point
on its receiver operating characteristic (ROC) curve, with high sensitivity but also
perhaps a relatively high false positive rate (Bordoli et al., 2007). In addition, the
average length of IDRs predicted by PONDR VSL2 is greater than that of the other two
methods, which may be related to the previous point, but may also be because the
method uses different SVMs to predict IDRs in short and long sequences separately.
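The two summary statistics reported in Table A.1 can be computed from per-residue disorder calls as sketched below. This is an illustration under stated assumptions rather than the exact procedure used here: an IDR is taken to be a maximal run of consecutive residues called disordered, and the percentage is pooled over all residues in the set rather than averaged per protein. The function names are hypothetical.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def idr_lengths(calls: List[bool]) -> List[int]:
    """Lengths of maximal runs of consecutive disordered residues."""
    return [sum(1 for _ in run) for is_dis, run in groupby(calls) if is_dis]


def summarize(proteins: Iterable[List[bool]]) -> Tuple[float, float]:
    """Return (% of residues in IDRs, average IDR length) over a protein set."""
    total = 0
    disordered = 0
    lengths: List[int] = []
    for calls in proteins:
        total += len(calls)
        disordered += sum(calls)
        lengths.extend(idr_lengths(calls))
    pct = 100.0 * disordered / total if total else 0.0
    avg = sum(lengths) / len(lengths) if lengths else 0.0
    return pct, avg


# Toy set: one protein with IDRs of length 2 and 4, and one fully ordered protein.
toy_set = [
    [True, True, False, False, True, True, True, True],
    [False, False, False, False],
]
pct_in_idrs, avg_idr_len = summarize(toy_set)
```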
Table A.1: Statistics summarizing disorder predictions on all the residues of all the
proteins in both the TF set and the non-TF control set using three different disorder
prediction tools.
                  TF set                            non-TF ctrl set
                  % of res. predicted  avg length   % of res. predicted  avg length
                  in IDRs              of IDRs      in IDRs              of IDRs
PONDR VSL2        83.2%                106          53.3%                39
DISOPRED2         47.4%                44           34.1%                36
PreDisorder 1.1   50.1%                19           38.3%                18
A.3.2 IDRs associated with TF DBDs or their flanking regions
To systematically study the associations between IDRs and DBDs, for each occur-
rence of a DBD class within a human TF, we calculated 30 different p-values: the
significance under two different null models (based on the TF set and the non-TF
control set) of five different kinds of disorder features (DBD, N-terminal flank, C-
terminal flank, both flanks, and entire TF) as computed by three different prediction
methods (PONDR VSL2, DISOPRED2, and PreDisorder 1.1). For each combination
of null model and feature, we say that the feature exhibits significant disorder under
that null model if at least two of the three prediction methods predict disorder at
p-value ≤ 0.005; on the other hand, we say that the feature exhibits significant order
under that null model if at least two of the three prediction methods predict disorder
at p-value ≥ 0.995. Note that it is certainly possible for a feature to be neither
significantly ordered nor significantly disordered under a particular null model.
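The two-of-three consensus rule described above is simple enough to express directly. The sketch below uses the thresholds given in the text and the ID/OR/dash labels of Table A.2; the function name and the dictionary layout (one p-value per prediction method, for a single feature under a single null model) are assumptions made for illustration. Applying it once per null model and feature covers all 2 x 5 x 3 = 30 p-values computed for a DBD class.

```python
from typing import Dict

DISORDER_P = 0.005  # threshold for significant disorder, as in the text
ORDER_P = 0.995     # threshold for significant order, as in the text


def call_feature(pvalues: Dict[str, float], min_methods: int = 2) -> str:
    """Return 'ID', 'OR', or '-' for one feature under one null model.

    pvalues maps a prediction method name to its p-value for disorder.
    """
    n_dis = sum(p <= DISORDER_P for p in pvalues.values())
    n_ord = sum(p >= ORDER_P for p in pvalues.values())
    if n_dis >= min_methods:
        return "ID"
    if n_ord >= min_methods:
        return "OR"
    return "-"


# Toy p-values for a single feature under a single null model.
example = call_feature({"PONDR VSL2": 0.001, "DISOPRED2": 0.004, "PreDisorder 1.1": 0.2})
```

With three methods and a minimum of two in agreement, a feature can never be called both ID and OR, so the order in which the two conditions are checked does not matter.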
Although we computed whether features exhibited significant order or disorder
across all Pfam DBDs occurring in our TF set, to avoid artifacts due to small sample
size, we restricted our subsequent analysis to the 32 DBD classes with at least five
occurrences in the TF set. Many of the most frequent DBD classes, including the
10 most prevalent ones, are structurally similar and can be roughly classified into
two groups: (1) those containing zinc fingers, and (2) those containing a basic helix-
turn-helix type of domain, domains in which helices are separated by loops (e.g.,
Homeobox, HLH, Fork head, Ets). The enrichment analysis results for these 32
DBD classes are listed in Table A.2; at the bottom of the table, we also included the
Pfam domains Basic, AT hook, and P53 (Basic and AT hook are included because
we mention them below in comparison to another study; P53 is a well-studied DBD
included for general interest).
The top 10 most frequently occurring DBD classes in human TFs all exhibit
significant order within the DBD itself, suggesting that structural flexibility within
these domains is rather limited. Strikingly, our results indicate that although the
DBDs themselves exhibit significant order, the regions flanking the DBDs are likely
to exhibit significant disorder. Only in the case of zf-C2H2 do the flanking regions
exhibit significant order (this will be discussed further in the next section). In contrast,
26 of the other 31 DBDs exhibit significant disorder in either the N-terminal flank,
the C-terminal flank, or both; and none of the other 31 DBDs exhibit significant
order in either flank under either null model. This is consistent with prior studies in
which it was found that DBDs are often separated by flexible linker regions, allowing
TFs to bind DNA with fine control over DNA binding affinity (Zhou, 2001; Fukuchi
et al., 2006).
Figure A.2: Distributions of the fraction of each protein's residues predicted as
disordered by each method (DISOPRED2, PreDisorder 1.1, and PONDR VSL2) for the
proteins in (A) the TF set and (B) the non-TF control set. Horizontal axis: fraction of
a protein's residues predicted as disordered; vertical axis: frequency (%).
A.3.3 Comparison of prediction methods in DBDs
To further investigate the detailed spatial relationships of the IDR predictions of the
three different methods to protein DBDs, we generated a meta-plot of the average
predicted order/disorder in the vicinity of each Pfam DBD according to each pre-
diction method. To do this, we first identified all occurrences of a Pfam DBD in
the TF set, and then across all those occurrences, calculated the average (mean) or-
der/disorder score predicted by each method at each residue within the DBD match
and both of its flanks (up to 30 amino acids). In cases where a TF contained only a
partial DBD match and not a full domain according to the HMMER alignment, we
considered only the aligned region in our calculations. We normalized the resulting
scores for the purpose of comparison across methods, and for uniformity in scale
across plots for different DBD classes (Fig. A.3).
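The averaging step behind the meta-plots can be sketched as follows. Several details are not specified in the text and are assumed here for illustration only: each DBD occurrence is represented as a mapping from an aligned position key, such as ("N", -1) for the residue just before the DBD, ("DBD", 1) for the first aligned domain position, or ("C", 1) for the residue just after it, to a predicted disorder score; positions missing from an occurrence (for example in a partial match or a truncated flank) are skipped; and normalization is done by min-max scaling. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable


def meta_profile(occurrences: Iterable[Dict[Hashable, float]]) -> Dict[Hashable, float]:
    """Mean score at each aligned position, skipping occurrences that lack it."""
    sums: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for occ in occurrences:
        for pos, score in occ.items():
            sums[pos] += score
            counts[pos] += 1
    return {pos: sums[pos] / counts[pos] for pos in sums}


def min_max(profile: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Rescale a profile to [0, 1]; a flat profile maps to all zeros."""
    lo, hi = min(profile.values()), max(profile.values())
    span = hi - lo
    return {pos: (v - lo) / span if span else 0.0 for pos, v in profile.items()}


# Two toy occurrences; the second lacks an N-terminal flank residue.
occ1 = {("N", -1): 0.8, ("DBD", 1): 0.2, ("C", 1): 0.6}
occ2 = {("DBD", 1): 0.4, ("C", 1): 1.0}
mean_profile = meta_profile([occ1, occ2])
scaled = min_max(mean_profile)
```

In the toy example the domain position averages to the lowest score and both flank positions to the highest, which is the "island of order" shape that the meta-plots are designed to reveal.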
Fig. A.3 displays meta-plots for five of the ten Pfam DBDs most prevalent in
human TFs. Results from DISOPRED2 and PreDisorder 1.1 are fairly consistent
across all five domain classes. Moreover, all three methods are in good agreement in
zf-C2HC and demonstrate similar prediction trends in zf-C4, Homeobox, and HLH.
Extended to all the DBDs listed in Table A.2, over 67.2% of the DBD classes that
are found to exhibit either significant disorder or significant order are identified as
such by all three methods.
Table A.2: Enrichment analysis of significantly occurring ordered and disordered regions within and flanking human TF DBDs.
                                                 DBD             N-terminal      C-terminal      both flanks     whole TF
                                                                 flank           flank                           sequence
No.  DBD      TF family        average  number   TF set  non-TF  TF set  non-TF  TF set  non-TF  TF set  non-TF  TF set  non-TF
     (Pfam)                    DBD      of DBDs          ctrl set        ctrl set        ctrl set        ctrl set        ctrl set
                               length   in TFs
                               (res.)
1    PF00096  zf-C2H2          23.1     4154     OR      OR      OR      OR      OR      OR      OR      OR      OR      OR
2    PF00046  Homeobox         56.3     216      OR      OR      ID      ID      ID      ID      ID      ID      ID      ID
3    PF00010  HLH              53.3     100      OR      OR      ID      ID      ID      ID      ID      ID      –       ID
4    PF00505  HMG box          68.0     56       OR      OR      ID      ID      ID      ID      ID      ID      –       ID
5    PF00250  Fork head        98.2     47       OR      OR      ID      ID      ID      ID      ID      ID      ID      ID
6    PF00105  zf-C4            70.2     45       OR      OR      ID      ID      ID      ID      ID      ID      OR      –
7    PF00249  Myb DNA-binding  47.2     43       OR      OR      –       –       –       ID      –       –       –       –
8    PF00170  bZIP 1           64.2     34       OR      OR      ID      ID      –       –       ID      ID      ID      ID
9    PF00178  Ets              85.0     27       OR      OR      ID      ID      ID      ID      ID      ID      –       –
10   PF00320  GATA             35.1     20       –       OR      ID      ID      ID      ID      ID      ID      –       ID
11   PF00907  T-box            187.6    18       –       –       ID      ID      ID      ID      ID      ID      –       –
12   PF01530  zf-C2HC          31.0     14       ID      ID      ID      ID      ID      ID      ID      ID      ID      ID
13   PF02319  E2F TDP          73.8     13       –       –       ID      ID      –       –       –       ID      –       –
14   PF00313  CSD              68.6     12       –       –       –       –       –       –       –       –       –       –
15   PF05485  THAP             89.2     12       –       –       ID      ID      ID      ID      ID      ID      –       –
16   PF01422  zf-NF-X1         21.5     11       –       –       –       –       –       –       –       –       –       –
17   PF03165  MH1              109.9    11       –       –       ID      ID      ID      ID      ID      ID      –       –
18   PF07716  bZIP 2           54.0     10       –       –       ID      ID      –       –       ID      ID      –       –
19   PF00292  PAX              125.6    9        –       –       ID      ID      ID      ID      ID      ID      –       –
20   PF00098  zf-CCHC          17.9     8        –       –       –       ID      –       ID      –       ID      –       –
21   PF00808  CBFD NFYB HMF    63.1     8        –       –       –       ID      –       –       –       –       –       –
22   PF04218  CENP-B N         52.5     8        –       –       –       –       –       –       –       –       –       –
23   PF00751  DM               47.0     7        –       –       ID      ID      ID      ID      ID      ID      –       ID
24   PF01342  SAND             79.0     7        –       –       ID      ID      ID      ID      ID      ID      –       –
25   PF02257  RFX DNA binding  72.7     7        –       –       ID      ID      –       –       –       –       –       –
26   PF02864  STAT bind        251.9    7        –       –       –       –       –       –       –       –       –       –
27   PF02892  zf-BED           50.1     7        –       –       ID      ID      ID      ID      ID      ID      –       –
28   PF10401  IRF-3            174.0    7        –       –       –       –       –       –       –       ID      –       –
29   PF00447  HSF DNA-bind     104.2    6        –       –       –       –       ID      ID      ID      ID      –       –
30   PF04516  CP2              227.2    6        –       –       –       –       –       –       –       –       –       –
31   PF03299  TF AP-2          208.2    5        –       –       –       ID      –       –       ID      ID      –       –
32   PF05044  Prox1            224.0    5        –       –       ID      ID      –       –       ID      ID      –       –
     PF01586  Basic            91.0     4        ID      ID      ID      ID      ID      ID      ID      ID      ID      ID
     PF00870  P53              196.3    3        –       –       ID      ID      ID      ID      ID      ID      –       –
     PF02178  AT hook          109.0    1        ID      ID      ID      ID      ID      ID      ID      ID      ID      ID
Notes: The DBDs with at least 5 occurrences in the TF set are listed in the table, together with Basic, P53, and AT hook.
ID indicates significant disordered DBDs, DBD flanks, or TFs (in at least two of three methods, p-value ≤ 0.005).
OR indicates significant ordered DBDs, DBD flanks, or TFs (in at least two of three methods, p-value ≤ 0.005).
A dash (–) indicates entries that are neither significantly ordered nor significantly disordered.
The DBDs with fewer than 5 occurrences in the TF set include: Runt, TEA, Basic, HALZ, z-alpha, FYRN, FYRC, P53, ARID,
DMA, AKAP95, GATA-N, P53 tetramer, Homez, XPA N, zf-DHHC, GCM, CG-1, Vert HS TF, SIM C, Rad51, HAND, Beta-trefoil,
LAG1-DNAbind, PWI, zf-MYND, SAP, GCR, Oest recep, Prog receptor, zf-TRAF, zf-CHY, Vert IL3-reg TF, HSA, Rio2 N,
BrkDBD, zf-RAG1, AT hook, and TMF DNA bd.
Nevertheless, some discrepancies in the results from the different methods are
evident, for example in zf-C2H2. The C2H2-type zinc finger domain is the most prevalent
DBD class found in metazoan TFs, including in human (Tupler et al., 2001). It is
also one of the most highly ordered DBDs; however, the linker regions between these
C2H2 zinc finger domains are often disordered (Pabo et al., 2001). As shown in Fig-
ure A.3A, PONDR VSL2 reports that the C2H2 domain occurrences in human TFs
exhibit significant disorder in both the C2H2 domain itself and the adjacent N- and
C-terminal flanks; however, DISOPRED2 and PreDisorder both report the opposite,
namely that zf-C2H2 and its flanks exhibit significant order. Liu et al. (2006) care-
fully analyzed the difficulties of predicting intrinsic disorder in the zf-C2H2 domains
and their linker regions. They concluded that because many linker regions between
C2H2 zinc fingers are quite short, the windowing procedures employed by some IDR
prediction algorithms prevent them from being detected as disordered; the result is
an artifact in which linker regions between C2H2 zinc fingers are over-predicted as
being ordered.
A.3.4 Summary descriptions for some of the most prevalent DBD classes found in human TFs
Zinc fingers
Zinc fingers are small structural motifs whose folds are stabilized by coordination
of one or more zinc ions. Zinc fingers can be classified according to their zinc-
coordinating residues and folds. In Fig. A.3A-C, we show our IDR prediction results
for the three major zinc finger domain classes found in human TFs: zf-C2H2 (the
most prevalent DBD class in human TFs), zf-C4 (also referred to as nuclear receptors),
and zf-C2HC. Although all three classes contain zinc fingers, we find variability
in their regions of order and disorder. As discussed above, the C2H2 zinc finger domain
is itself believed to be highly ordered, with individual ordered zinc fingers
separated by highly flexible linker regions (Pabo et al., 2001). We find that the C4
domain exhibits significant order within the DBD itself, but significant disorder in
flanking regions. In contrast, we find that the C2HC domain exhibits significant
disorder in both the DBD and flanking regions.
Figure A.3: Shown are meta-plots for five prevalent DBDs in human TFs. (A) zinc-finger
C2H2-type (length: ∼23 amino acids), (B) zinc-finger C2HC-type (length: ∼31 amino
acids), (C) zinc-finger C4-type (length: ∼70 amino acids), (D) homeodomain fold
(length: ∼58 amino acids), and (E) helix-loop-helix (length: ∼53 amino acids). Each
panel shows the DBD together with its N-terminal and C-terminal flanks; the vertical
axis runs from ordered to disordered, with one curve per method (PONDR VSL2,
DISOPRED2, and PreDisorder 1.1).
Homeobox
Homeobox (homeodomain fold) is the second-most abundant DBD class within hu-
man TFs. The homeodomain fold consists of an approximately 60 amino acid helix-
turn-helix structure in which three alpha helices are connected by short loop regions.
Our results (Fig. A.3D) extend the results of a prior study (Liu et al., 2008) that
found multiple intrinsically disordered sequences located outside the homeodomain
DBD of the Drosophila TF Ubx, which allow Hox family members (i.e., a subclass
of TFs with Homeobox DBDs) to bind DNA with high affinity but relatively low
specificity (Gehring et al., 1994; Hoey and Levine, 1988).
HLH
HLH (basic helix-loop-helix) is the third-most abundant DBD class within human
TFs, and is characterized by two α-helices connected by a loop. TFs that have this
domain typically bind DNA as either homo- or hetero-dimers, with each monomer
contacting DNA through a helix containing basic residues that facilitate DNA bind-
ing (Littlewood and Evan, 1995). As shown in Fig. A.3E, all three methods report
that HLH exhibits significant order within the domain itself, but significant disorder
in both the N- and C-terminal flanking regions. Our results also indicate that a short
but highly disordered region may frequently occur in the middle of the HLH domain,
consistent with prior observations that the linker regions and the loop region of HLH
proteins are of higher flexibility, allowing dimerization by folding and packing one
smaller helix against the other one (Littlewood and Evan, 1995).
A.4 Discussion
In this study, we used three different computational disorder prediction methods to
investigate the prevalence of IDRs within DBDs and in their flanking regions across
essentially the entire repertoire of human, sequence-specific TFs and their associated
Pfam DBDs. Our choice of multiple prediction methods was motivated by a desire
to be able to draw robust conclusions that were not dependent on any one particular
method.
Previously it was found that TFs are enriched for IDRs (Liu et al., 2006; Minezaki
et al., 2006). At the same time, DBDs responsible for TF binding did not seem
themselves to be particularly enriched for IDRs. For example, of the 25 DBDs studied
by Liu et al. (2006), only the Basic and AT hook domains exhibited high amounts
of disorder; however, those domains are not particularly prevalent in human TFs,
occurring just four times and once in our TF set, respectively. (Though they do not
occur often, where they do occur, they exhibit significant disorder in our results as
well, corroborating the results of Liu et al. (2006); see Table A.2.) We were intrigued
by the possibility that the enrichment of IDRs observed in TFs might be at least
partly due to disorder in the regions flanking DBDs; under such a hypothesis, DBDs
can be thought of as islands of order flanked by regions of disorder.
Our results support exactly such a hypothesis: the most prevalent DBDs in human
TFs exhibit significant order, but the flanking regions of these DBDs generally
exhibit significant disorder. Similarly, among DBDs of intermediate prevalence
(occurring between 5 and 20 times in our TF set), although they do not appear often
enough to exhibit either significant order or disorder within the domains themselves,
most of them still exhibit significant disorder in one or both flanking regions.
The functional role played by the significant prevalence of disorder in the regions
flanking DBDs of human TFs is unclear. However, we can speculate that the increased
flexibility afforded by these flanking IDRs might contribute to the ability of
TFs to 1) recognize target sequences in the DNA appropriately, 2) bind to a wider
diversity of DNA target sequences, 3) be anchored with higher affinity to the DNA
after recognizing target sequences, 4) bind to other factors and complexes positioned
on the DNA or involved in transcriptional regulation, or 5) present activation domains
to downstream transcriptional regulatory machinery. It should be emphasized
that these possibilities are speculative; however, the results of this study suggest
numerous testable hypotheses regarding the roles of N- and C-terminal regions flanking
DBDs for many frequently occurring DBDs in hundreds of human TFs. For example,
the importance of the predicted disorder in these flanking regions in determining or
modulating the DNA binding affinity and/or specificity of the associated TFs could
be investigated with protein binding microarrays (PBMs) (Mukherjee et al., 2004;
Berger et al., 2008). PBMs could assay the affinity and/or specificity of proteins
representing the DBDs with their flanking regions, as compared to either the DBDs
alone or the DBDs with mutant flanking regions predicted not to be significantly dis-
ordered. If found to contribute to the DNA binding affinity and/or specificity of TFs,
IDRs that flank DBDs would broaden the scope of functional domains to be consid-
ered when evaluating the potential impact of mutations or natural polymorphisms
within exomes, such as in medical sequencing projects.
This study was focused on human TFs; however, since these DBD classes are the
predominant DBD classes not just in human TFs but throughout eukaryotes, the
results of this study may have important implications for studies of TFs across all
eukaryotes.
Bibliography
Aikawa, E., Nahrendorf, M., Sosnovik, D., Lok, V. M., Jaffer, F. A., Aikawa, M., and Weissleder, R. (2007), “Multimodality molecular imaging identifies proteolytic and osteogenic activities in early aortic valve disease,” Circulation, 115, 377–386.
Alberts, B., Johnson, A., Lewis, J., Roberts, K., and Walter, P. (2007), Molecular Biology of the Cell in Cell, 5th Edition, Garland Science.
Amon, A. (2002), “Synchronization procedures,” Meth. Enzymol., 351, 457–467.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), “Gene Ontology: Tool for the unification of biology. The Gene Ontology Consortium,” Nat. Genet., 25, 25–29.
Bakkenist, C. J. and Kastan, M. B. (2003), “DNA damage activates ATM through intermolecular autophosphorylation and dimer dissociation,” Nature, 421, 499–506.
Bar-Joseph, Z., Farkash, S., Gifford, D. K., Simon, I., and Rosenfeld, R. (2004), “Deconvolving cell cycle expression data with complementary information,” Bioinformatics, 20 Suppl 1, 23–30.
Bean, J. M., Siggia, E. D., and Cross, F. R. (2006), “Coherence and timing of cell cycle start examined at single-cell resolution,” Mol. Cell, 21, 3–14.
Bell, S. P. and Dutta, A. (2002), “DNA replication in eukaryotic cells,” Annu. Rev. Biochem., 71, 333–374.
Benjamini, Y. and Hochberg, Y. (1995), “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 289–300.
Berg, J., Lassig, M., and Wagner, A. (2004), “Structure and evolution of protein interaction networks: a statistical model for link dynamics and gene duplications,” BMC Evol. Biol., 4, 51.
Berger, M. F., Badis, G., Gehrke, A. R., Talukder, S., Philippakis, A. A., Pena-Castillo, L., Alleyne, T. M., Mnaimneh, S., Botvinnik, O. B., Chan, E. T., Khalid, F., Zhang, W., Newburger, D., Jaeger, S. A., Morris, Q. D., Bulyk, M. L., and Hughes, T. R. (2008), “Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences,” Cell, 133, 1266–1276.
Bernard, A., Vaughn, D., and Hartemink, A. (2007), “Reconstructing the topology of protein complexes,” in Research in Computational Molecular Biology, pp. 32–46, Springer.
Bi, E., Maddox, P., Lew, D. J., Salmon, E. D., McMillan, J. N., Yeh, E., and Pringle, J. R. (1998), “Involvement of an actomyosin contractile ring in Saccharomyces cerevisiae cytokinesis,” J. Cell Biol., 142, 1301–1312.
Bloom, J. and Cross, F. R. (2007), “Multiple levels of cyclin specificity in cell-cycle control,” Nat. Rev. Mol. Cell Biol., 8, 149–160.
Bordoli, L., Kiefer, F., and Schwede, T. (2007), “Assessment of disorder predictions in CASP7,” Proteins, 69 Suppl 8, 129–136.
Boyle, E. I., Weng, S., Gollub, J., Jin, H., Botstein, D., Cherry, J. M., and Sherlock, G. (2004), “GO::TermFinder–open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes,” Bioinformatics, 20, 3710–3715.
Burrus, C., Gopinath, R., and Guo, H. (1998), Introduction to wavelets and wavelet transforms: a primer, Prentice Hall.
Bustin, M., Catez, F., and Lim, J. H. (2005), “The dynamics of histone H1 function in chromatin,” Mol. Cell, 17, 617–620.
Campen, A., Williams, R. M., Brown, C. J., Meng, J., Uversky, V. N., and Dunker, A. K. (2008), “TOP-IDP-scale: a new amino acid scale measuring propensity for intrinsic disorder,” Protein Pept. Lett., 15, 956–963.
Chen, K. C., Calzone, L., Csikasz-Nagy, A., Cross, F. R., Novak, B., and Tyson, J. J. (2004), “Integrative analysis of cell cycle control in budding yeast,” Mol. Biol. Cell, 15, 3841–3862.
Cho, R. J., Campbell, M. J., Winzeler, E. A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T. G., Gabrielian, A. E., Landsman, D., Lockhart, D. J., and Davis, R. W. (1998), “A genome-wide transcriptional analysis of the mitotic cell cycle,” Mol. Cell, 2, 65–73.
Colman-Lerner, A., Chin, T. E., and Brent, R. (2001), “Yeast Cbk1 and Mob2 activate daughter-specific genetic programs to induce asymmetric cell fates,” Cell, 107, 739–750.
Cosma, M. P. (2004), “Daughter-specific repression of Saccharomyces cerevisiae HO: Ash1 is the commander,” EMBO Rep., 5, 953–957.
Crane-Robinson, C., Dragan, A. I., and Privalov, P. L. (2006), “The extended arms of DNA-binding domains: a tale of tails,” Trends Biochem. Sci., 31, 547–552.
Cross, F. R. (2003), “Two redundant oscillatory mechanisms in the yeast cell cycle,” Dev. Cell, 4, 741–752.
Daubechies, I. (1992), Ten lectures on wavelets, vol. 61, Society for Industrial Mathematics.
de Lichtenberg, U., Jensen, L. J., Fausboll, A., Jensen, T. S., Bork, P., and Brunak, S. (2005), “Comparison of computational methods for the identification of cell cycle-regulated genes,” Bioinformatics, 21, 1164–1171.
Deng, M., Mehta, S., Sun, F., and Chen, T. (2002), “Inferring domain-domain interactions from protein-protein interactions,” Genome Res., 12, 1540–1548.
Deng, X., Eickholt, J., and Cheng, J. (2009), “PreDisorder: ab initio sequence-based prediction of protein disordered regions,” BMC Bioinformatics, 10, 436.
Di Talia, S., Skotheim, J. M., Bean, J. M., Siggia, E. D., and Cross, F. R. (2007), “The effects of molecular noise and size control on variability in the budding yeast cell cycle,” Nature, 448, 947–951.
Di Talia, S., Wang, H., Skotheim, J. M., Rosebrock, A. P., Futcher, B., and Cross, F. R. (2009), “Daughter-specific transcription factors regulate cell size control in budding yeast,” PLoS Biol., 7, e1000221.
Dickinson, M. E. (2006), “Multimodal imaging of mouse development: tools for the postgenomic era,” Dev. Dyn., 235, 2386–2400.
Donoho, D. and Johnstone, I. M. (1994), “Ideal spatial adaptation by wavelet shrinkage,” Biometrika, 81, 425–455.
Doolin, M. T., Johnson, A. L., Johnston, L. H., and Butler, G. (2001), “Overlapping and distinct roles of the duplicated yeast transcription factors Ace2p and Swi5p,” Mol. Microbiol., 40, 422–432.
Dunker, A. K., Lawson, J. D., Brown, C. J., Williams, R. M., Romero, P., Oh, J. S., Oldfield, C. J., Campen, A. M., Ratliff, C. M., Hipps, K. W., Ausio, J., Nissen, M. S., Reeves, R., Kang, C., Kissinger, C. R., Bailey, R. W., Griswold, M. D., Chiu, W., Garner, E. C., and Obradovic, Z. (2001), “Intrinsically disordered protein,” J. Mol. Graph. Model., 19, 26–59.
Dutkowski, J. and Tiuryn, J. (2007), “Identification of functional modules from conserved ancestral protein-protein interactions,” Bioinformatics, 23, i149–158.
Eddy, S. R. (2009), “A new generation of homology search tools based on probabilistic inference,” Genome Inform, 23, 205–211.
Eliezer, D. (2009), “Biophysical characterization of intrinsically disordered proteins,” Curr. Opin. Struct. Biol., 19, 23–30.
Finn, R. D., Tate, J., Mistry, J., Coggill, P. C., Sammut, S. J., Hotz, H. R., Ceric, G., Forslund, K., Eddy, S. R., Sonnhammer, E. L., and Bateman, A. (2008), “The Pfam protein families database,” Nucleic Acids Res., 36, D281–288.
Finn, R. D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J. E., Gavin, O. L., Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, E. L., Eddy, S. R., and Bateman, A. (2010), “The Pfam protein families database,” Nucleic Acids Res., 38, D211–222.
Flannick, J., Novak, A., Srinivasan, B. S., McAdams, H. H., and Batzoglou, S. (2006), “Graemlin: general and robust alignment of multiple large interaction networks,” Genome Res., 16, 1169–1181.
Flicek, P., Amode, M. R., Barrell, D., Beal, K., Brent, S., Chen, Y., Clapham, P., Coates, G., Fairley, S., Fitzgerald, S., Gordon, L., Hendrix, M., Hourlier, T., Johnson, N., Kahari, A., Keefe, D., Keenan, S., Kinsella, R., Kokocinski, F., Kulesha, E., Larsson, P., Longden, I., McLaren, W., Overduin, B., Pritchard, B., Riat, H. S., Rios, D., Ritchie, G. R., Ruffier, M., Schuster, M., Sobral, D., Spudich, G., Tang, Y. A., Trevanion, S., Vandrovcova, J., Vilella, A. J., White, S., Wilder, S. P., Zadissa, A., Zamora, J., Aken, B. L., Birney, E., Cunningham, F., Dunham, I., Durbin, R., Fernandez-Suarez, X. M., Herrero, J., Hubbard, T. J., Parker, A., Proctor, G., Vogel, J., and Searle, S. M. (2011), “Ensembl 2011,” Nucleic Acids Res., 39, D800–806.
Forsburg, S. L. and Nurse, P. (1991), “Cell cycle regulation in the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe,” Annu. Rev. Cell Biol., 7, 227–256.
Fukuchi, S., Homma, K., Minezaki, Y., and Nishikawa, K. (2006), “Intrinsically disordered loops inserted into the structural domains of human proteins,” J. Mol. Biol., 355, 845–857.
Futcher, B. (1999), “Cell cycle synchronization,” Methods Cell Sci, 21, 79–86.
Futcher, B. (2002), “Transcriptional regulatory networks and the yeast cell cycle,” Curr. Opin. Cell Biol., 14, 676–683.
Fuxreiter, M., Simon, I., and Bondos, S. (2011), “Dynamic protein-DNA recognition: beyond what can be seen,” Trends Biochem. Sci., 36, 415–423.
Gehring, W. J., Qian, Y. Q., Billeter, M., Furukubo-Tokunaga, K., Schier, A. F., Resendez-Perez, D., Affolter, M., Otting, G., and Wuthrich, K. (1994), “Homeodomain-DNA recognition,” Cell, 78, 211–223.
Goh, C. S., Bogan, A. A., Joachimiak, M., Walther, D., and Cohen, F. E. (2000), “Co-evolution of proteins with their interaction partners,” J. Mol. Biol., 299, 283–293.
Granovskaia, M. V., Jensen, L. J., Ritchie, M. E., Toedling, J., Ning, Y., Bork, P., Huber, W., and Steinmetz, L. M. (2010), “High-resolution transcription atlas of the mitotic cell cycle in budding yeast,” Genome Biol., 11, R24.
Grant, M. and Boyd, S. (2008), Graph implementations for nonsmooth convex programs, Lecture Notes in Control and Information Sciences, Springer-Verlag Limited.
Grant, M. and Boyd, S. (2010), “CVX: Matlab Software for Disciplined Convex Programming, version 1.21,” http://cvxr.com/cvx.
Guo, X. and Hartemink, A. J. (2009), “Domain-oriented edge-based alignment of protein interaction networks,” Bioinformatics, 25, i240–246.
Guo, X., Bernard, A., Orlando, D. A., Haase, S. B., and Hartemink, A. J. (2012a), “Branching process deconvolution algorithm reveals a detailed cell-cycle transcriptional program,” submitted.
Guo, X., Bulyk, M. L., and Hartemink, A. J. (2012b), “Intrinsic disorder within and flanking the DNA-binding domains of human transcription factors,” in Pacific Symposium on Biocomputing, p. 104.
Haar, A. (1910), “Zur theorie der orthogonalen funktionensysteme,” Mathematische Annalen, 69, 331–371.
Haase, S. B. and Reed, S. I. (1999), “Evidence that a free-running oscillator drives G1 events in the budding yeast cell cycle,” Nature, 401, 394–397.
Haase, S. B. and Reed, S. I. (2002), “Improved flow cytometric analysis of the budding yeast cell cycle,” Cell Cycle, 1, 132–136.
Hanlon, S. E., Rizzo, J. M., Tatomer, D. C., Lieb, J. D., and Buck, M. J. (2011), “The Stress Response Factors Yap6, Cin5, Phd1, and Skn7 Direct Targeting of the Conserved Co-Repressor Tup1-Ssn6 in S. cerevisiae,” PLoS ONE, 6, e19060.
Hansen, P. (1992), “Analysis of discrete ill-posed problems by means of the L-curve,” SIAM Review, 34, 561–580.
Harder, N., Mora-Bermudez, F., Godinez, W. J., Ellenberg, J., Eils, R., and Rohr, K. (2006), “Automated analysis of the mitotic phases of human cells in 3D fluorescence microscopy image sequences,” Med Image Comput Comput Assist Interv, 9, 840–848.
Hartwell, L. H. and Unger, M. W. (1977), “Unequal division in Saccharomyces cerevisiae and its implications for the control of cell division,” J. Cell Biol., 75, 422–435.
He, B., Wang, K., Liu, Y., Xue, B., Uversky, V. N., and Dunker, A. K. (2009), “Predicting intrinsic disorder in proteins: an overview,” Cell Res., 19, 929–949.
Hereford, L. M., Osley, M. A., Ludwig, T. R., and McLaughlin, C. S. (1981), “Cell-cycle regulation of yeast histone mRNA,” Cell, 24, 367–375.
Hirsh, E. and Sharan, R. (2007), “Identification of conserved protein complexes based on a model of protein network evolution,” Bioinformatics, 23, e170–176.
Hoey, T. and Levine, M. (1988), “Divergent homeo box proteins recognize similar DNA sequences in Drosophila,” Nature, 332, 858–861.
Hunter, S., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., Duquenne, L., Finn, R. D., Gough, J., Haft, D., Hulo, N., Kahn, D., Kelly, E., Laugraud, A., Letunic, I., Lonsdale, D., Lopez, R., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Mulder, N., Natale, D., Orengo, C., Quinn, A. F., Selengut, J. D., Sigrist, C. J., Thimma, M., Thomas, P. D., Valentin, F., Wilson, D., Wu, C. H., and Yeats, C. (2009), “InterPro: the integrative protein signature database,” Nucleic Acids Res., 37, D211–215.
Itzhaki, Z., Akiva, E., Altuvia, Y., and Margalit, H. (2006), “Evolutionary conservation of domain-domain interactions,” Genome Biol., 7, R125.
Jansen, M. (2001), Noise reduction by wavelet thresholding, Lecture Notes in Statistics, Springer-Verlag.
Jorgensen, P. and Tyers, M. (2004), “How cells coordinate growth and division,” Curr. Biol., 14, R1014–1027.
Jothi, R., Cherukuri, P. F., Tasneem, A., and Przytycka, T. M. (2006), “Co-evolutionary analysis of domains in interacting proteins reveals insights into domain-domain interactions mediating protein-protein interactions,” J. Mol. Biol., 362, 861–875.
Kalaev, M., Bafna, V., and Sharan, R. (2008), “Fast and accurate alignment of multiple protein networks,” in Research in Computational Molecular Biology, pp. 246–256, Springer.
Kamakaka, R. T. and Biggins, S. (2005), “Histone variants: Deviants?” Genes Dev., 19, 295–310.
Kanehisa, M. and Goto, S. (2000), “KEGG: kyoto encyclopedia of genes and genomes,” Nucleic Acids Res., 28, 27–30.
Kelley, B. P., Sharan, R., Karp, R. M., Sittler, T., Root, D. E., Stockwell, B. R., and Ideker, T. (2003), “Conserved pathways within bacteria and yeast as revealed by global protein network alignment,” Proc. Natl. Acad. Sci. U.S.A., 100, 11394–11399.
Koyuturk, M., Kim, Y., Topkara, U., Subramaniam, S., Szpankowski, W., and Grama, A. (2006), “Pairwise alignment of protein interaction networks,” J. Comput. Biol., 13, 182–199.
Kuranda, M. J. and Robbins, P. W. (1991), “Chitinase is required for cell separation during growth of Saccharomyces cerevisiae,” J. Biol. Chem., 266, 19758–19767.
Lee, M. G. and Nurse, P. (1987), “Complementation used to clone a human homologue of the fission yeast cell cycle control gene cdc2,” Nature, 327, 31–35.
Liskay, R. M. (1977), “Absence of a measurable G2 phase in two Chinese hamster cell lines,” Proc. Natl. Acad. Sci. U.S.A., 74, 1622–1625.
Littlewood, T. D. and Evan, G. I. (1995), “Transcription factors 2: helix-loop-helix,” Protein Profile, 2, 621–702.
Liu, J., Perumal, N. B., Oldfield, C. J., Su, E. W., Uversky, V. N., and Dunker, A. K. (2006), “Intrinsic disorder in transcription factors,” Biochemistry, 45, 6873–6888.
Liu, Y., Matthews, K. S., and Bondos, S. E. (2008), “Multiple intrinsically disordered sequences alter DNA binding by the homeodomain of the Drosophila hox protein ultrabithorax,” J. Biol. Chem., 283, 20874–20887.
Lord, P. G. and Wheals, A. E. (1980), “Asymmetrical division of Saccharomyces cerevisiae,” J. Bacteriol., 142, 808–818.
Lord, P. G. and Wheals, A. E. (1981), “Variability in individual cell cycles of Saccharomyces cerevisiae,” J. Cell. Sci., 50, 361–376.
Lu, P., Nakorchevskiy, A., and Marcotte, E. M. (2003), “Expression deconvolution: A reinterpretation of DNA microarray data reveals dynamic changes in cell populations,” Proc. Natl. Acad. Sci. U.S.A., 100, 10370–10375.
Mallat, S. (1989), “A theory for multiresolution signal decomposition: The wavelet representation,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, 11, 674–693.
Mallat, S. (1999), A wavelet tour of signal processing, Academic Press.
Mallat, S. G. (2008), A wavelet tour of signal processing, Academic Press.
Marchler-Bauer, A., Anderson, J. B., Derbyshire, M. K., DeWeese-Scott, C., Gonzales, N. R., Gwadz, M., Hao, L., He, S., Hurwitz, D. I., Jackson, J. D., Ke, Z., Krylov, D., Lanczycki, C. J., Liebert, C. A., Liu, C., Lu, F., Lu, S., Marchler, G. H., Mullokandov, M., Song, J. S., Thanki, N., Yamashita, R. A., Yin, J. J., Zhang, D., and Bryant, S. H. (2007), “CDD: a conserved domain database for interactive domain family analysis,” Nucleic Acids Res., 35, D237–240.
Mayhew, M. B., Robinson, J. W., Jung, B., Haase, S. B., and Hartemink, A. J. (2011), “A generalized model for multi-marker analysis of cell cycle progression in synchrony experiments,” Bioinformatics, 27, i295–i303.
Mayhew, M. B., Guo, X., Haase, S. B., and Hartemink, A. J. (2012), “Close encounters of the collaborative kind,” Computer, 45, 24–30.
Mewes, H. W., Frishman, D., Guldener, U., Mannhaupt, G., Mayer, K., Mokrejs, M., Morgenstern, B., Munsterkotter, M., Rudd, S., and Weil, B. (2002), “MIPS: a database for genomes and protein sequences,” Nucleic Acids Res., 30, 31–34.
Miller, C., Schwalb, B., Maier, K., Schulz, D., Dumcke, S., Zacher, B., Mayer, A., Sydow, J., Marcinowski, L., Dolken, L., Martin, D. E., Tresch, A., and Cramer, P. (2011), “Dynamic transcriptome analysis measures rates of mRNA synthesis and decay in yeast,” Mol. Syst. Biol., 7, 458.
Minezaki, Y., Homma, K., Kinjo, A. R., and Nishikawa, K. (2006), “Human transcription factors contain a high fraction of intrinsically disordered regions essential for transcriptional regulation,” J. Mol. Biol., 359, 1137–1149.
Mintseris, J. and Weng, Z. (2005), “Structure, function, and evolution of transient and obligate protein-protein interactions,” Proc. Natl. Acad. Sci. U.S.A., 102, 10930–10935.
Morgan, D. (2007), The Cell Cycle: Principles of Control, London, New Science Press.
Morgan, D. O. (1997), “Cyclin-dependent kinases: engines, clocks, and microprocessors,” Annu. Rev. Cell Dev. Biol., 13, 261–291.
Mukherjee, S., Berger, M. F., Jona, G., Wang, X. S., Muzzey, D., Snyder, M., Young, R. A., and Bulyk, M. L. (2004), “Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays,” Nat. Genet., 36, 1331–1339.
Murray, A. and Hunt, T. (1993), The Cell Cycle. An introduction, New York, W. H. Freeman & Co.
Murray, A. W. (2004), “Recycling the cell cycle: cyclins revisited,” Cell, 116, 221–234.
Orlando, D. A. (2009), “Regulation of Global Transcription Dynamics During Cell Division and Root Development,” PhD dissertation, Duke University.
Orlando, D. A., Lin, C. Y., Bernard, A., Iversen, E. S., Hartemink, A. J., and Haase, S. B. (2007), “A probabilistic model for cell cycle distributions in synchrony experiments,” Cell Cycle, 6, 478–488.
Orlando, D. A., Lin, C. Y., Bernard, A., Wang, J. Y., Socolar, J. E., Iversen, E. S., Hartemink, A. J., and Haase, S. B. (2008), “Global control of cell-cycle transcription by coupled CDK and network oscillators,” Nature, 453, 944–947.
Orlando, D. A., Iversen, E. S., Hartemink, A. J., and Haase, S. B. (2009), “A branching process model for flow cytometry and budding index measurements in cell synchrony experiments,” Annals of Applied Statistics, 3, 1521–1541.
Osley, M. A. (1991), “The regulation of histone synthesis in the cell cycle,” Annu. Rev. Biochem., 60, 827–861.
Pabo, C. O., Peisach, E., and Grant, R. A. (2001), “Design and selection of novel Cys2His2 zinc finger proteins,” Annu. Rev. Biochem., 70, 313–340.
Pazos, F., Helmer-Citterich, M., Ausiello, G., and Valencia, A. (1997), “Correlated mutations contain information about protein-protein interaction,” J. Mol. Biol., 271, 511–523.
Peng, K., Radivojac, P., Vucetic, S., Dunker, A. K., and Obradovic, Z. (2006), “Length-dependent prediction of protein intrinsic disorder,” BMC Bioinformatics, 7, 208.
Pierrez, J. and Ronot, X. (1992), “Flow cytometric analysis of the cell cycle: mathematical modeling and biological interpretation,” Acta Biotheor., 40, 131–137.
Pramila, T., Wu, W., Miles, S., Noble, W. S., and Breeden, L. L. (2006), “The Forkhead transcription factor Hcm1 regulates chromosome segregation genes and fills the S-phase gap in the transcriptional circuitry of the cell cycle,” Genes Dev., 20, 2266–2278.
Pruitt, K. D., Tatusova, T., Klimke, W., and Maglott, D. R. (2009), “NCBI Reference Sequences: current status, policy and new initiatives,” Nucleic Acids Res., 37, D32–36.
Qiu, P., Wang, Z. J., and Liu, K. J. (2006), “Polynomial model approach for resynchronization analysis of cell-cycle gene expression data,” Bioinformatics, 22, 959–966.
Raser, J. M. and O’Shea, E. K. (2005), “Noise in gene expression: Origins, consequences, and control,” Science, 309, 2010–2013.
Riley, R., Lee, C., Sabatti, C., and Eisenberg, D. (2005), “Inferring protein domain interactions from databases of interacting proteins,” Genome Biol., 6, R89.
Rowicka, M., Kudlicki, A., Tu, B. P., and Otwinowski, Z. (2007), “High-resolution timing of cell cycle-regulated gene expression,” Proc. Natl. Acad. Sci. U.S.A., 104, 16892–16897.
Schaufler, L. E. and Klevit, R. E. (2003), “Mechanism of DNA binding by the ADR1 zinc finger transcription factor as determined by SPR,” J. Mol. Biol., 329, 931–939.
Schuster-Bockler, B. and Bateman, A. (2007), “Reuse of structural domain-domain interactions in protein networks,” BMC Bioinformatics, 8, 259.
Schwabe, J. W., Chapman, L., Finch, J. T., Rhodes, D., and Neuhaus, D. (1993), “DNA recognition by the oestrogen receptor: from solution to the crystal,” Structure, 1, 187–204.
Schwacha, A. and Bell, S. P. (2001), “Interactions between two catalytically distinct MCM subgroups are essential for coordinated ATP hydrolysis and DNA replication,” Mol. Cell, 8, 1093–1104.
Sharan, R. and Ideker, T. (2006), “Modeling cellular machinery through biological network comparison,” Nat. Biotechnol., 24, 427–433.
Sharan, R., Suthram, S., Kelley, R. M., Kuhn, T., McCuine, S., Uetz, P., Sittler, T., Karp, R. M., and Ideker, T. (2005a), “Conserved patterns of protein interaction in multiple species,” Proc. Natl. Acad. Sci. U.S.A., 102, 1974–1979.
Sharan, R., Ideker, T., Kelley, B., Shamir, R., and Karp, R. M. (2005b), “Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data,” J. Comput. Biol., 12, 835–846.
Siegal-Gaskins, D., Ash, J. N., and Crosson, S. (2009), “Model-based deconvolution of cell cycle time-series data reveals gene expression details at high resolution,” PLoS Comput. Biol., 5, e1000460.
Sil, A. and Herskowitz, I. (1996), “Identification of asymmetrically localized determinant, Ash1p, required for lineage-specific transcription of the yeast HO gene,” Cell, 84, 711–722.
Simchen, G. (1978), “Cell cycle mutants,” Annual review of genetics, 12, 161–191.
Simmons Kovacs, L. A., Nelson, C. L., and Haase, S. B. (2008), “Intrinsic and cyclin-dependent kinase-dependent control of spindle pole body duplication in budding yeast,” Mol. Biol. Cell, 19, 3243–3253.
Singh, R., Xu, J., and Berger, B. (2008), “Global alignment of multiple protein interaction networks with application to functional orthology detection,” Proc. Natl. Acad. Sci. U.S.A., 105, 12763–12768.
Slater, M. L., Sharrow, S. O., and Gart, J. J. (1977), “Cell cycle of Saccharomyces cerevisiae in populations growing at different rates,” Proc. Natl. Acad. Sci. U.S.A., 74, 3850–3854.
Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and Futcher, B. (1998), “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Mol. Biol. Cell, 9, 3273–3297.
Srinivasan, B., Novak, A., Flannick, J., Batzoglou, S., and McAdams, H. (2006), “Integrated protein interaction networks for 11 microbes,” in Research in Computational Molecular Biology, pp. 1–14, Springer.
Srinivasan, B. S., Shah, N. H., Flannick, J. A., Abeliuk, E., Novak, A. F., and Batzoglou, S. (2007), “Current progress in network research: toward reference networks for key model organisms,” Brief. Bioinformatics, 8, 318–332.
Stacey, D. W. and Hitomi, M. (2008), “Cell cycle studies based upon quantitative image analysis,” Cytometry A, 73, 270–278.
Teixeira, M. C., Monteiro, P., Jain, P., Tenreiro, S., Fernandes, A. R., Mira, N. P., Alenquer, M., Freitas, A. T., Oliveira, A. L., and Sa-Correia, I. (2006), “The YEASTRACT database: A tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae,” Nucleic Acids Res., 34, D446–451.
Tobey, R. A. and Crissman, H. A. (1975), “Unique techniques for cell analysis utilizing mithramycin and flow microfluorometry,” Exp. Cell Res., 93, 235–239.
Toyn, J. H., Johnson, A. L., Donovan, J. D., Toone, W. M., and Johnston, L. H. (1997), “The Swi5 transcription factor of Saccharomyces cerevisiae has a role in exit from mitosis through induction of the CDK-inhibitor Sic1 in telophase,” Genetics, 145, 85–96.
Tupler, R., Perini, G., and Green, M. R. (2001), “Expressing the human genome,” Nature, 409, 832–833.
Ubersax, J. A., Woodbury, E. L., Quang, P. N., Paraz, M., Blethrow, J. D., Shah, K., Shokat, K. M., and Morgan, D. O. (2003), “Targets of the cyclin-dependent kinase Cdk1,” Nature, 425, 859–864.
Vaquerizas, J. M., Kummerfeld, S. K., Teichmann, S. A., and Luscombe, N. M. (2009), “A census of human transcription factors: function, expression and evolution,” Nat. Rev. Genet., 10, 252–263.
Wang, Y., Shirogane, T., Liu, D., Harper, J. W., and Elledge, S. J. (2003), “Exit from exit: Resetting the cell cycle through Amn1 inhibition of G protein signaling,” Cell, 112, 697–709.
Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F., and Jones, D. T. (2004), “Prediction and functional analysis of native disorder in proteins from the three kingdoms of life,” J. Mol. Biol., 337, 635–645.
Woldringh, C. L., Huls, P. G., and Vischer, N. O. (1993), “Volume growth of daughter and parent cells during the cell cycle of Saccharomyces cerevisiae a/alpha as determined by image cytometry,” J. Bacteriol., 175, 3174–3181.
Xenarios, I., Salwinski, L., Duan, X. J., Higney, P., Kim, S. M., and Eisenberg, D. (2002), “DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions,” Nucleic Acids Res., 30, 303–305.
Zhenping, L., Zhang, S., Wang, Y., Zhang, X. S., and Chen, L. (2007), “Alignment of molecular networks by integer quadratic programming,” Bioinformatics, 23, 1631–1639.
Zhou, H. X. (2001), “The affinity-enhancing roles of flexible linkers in two-domain DNA-binding proteins,” Biochemistry, 40, 15069–15073.
Biography
Xin Guo was born on June 13, 1979 in Harbin, China. He earned a B.E. degree
from Chiba Institute of Technology, Japan in April 2002, and earned two master’s
degrees from Tokyo Institute of Technology, Japan and Saarland University, Germany,
respectively. In 2006, he joined the Ph.D. program in Computer Science at
Duke University. Upon completion of his degree, he will join Gilead Sciences, a
biotechnology company headquartered in Foster City, CA, as a research scientist.
Publications:
1. Guo, X., Bernard, A., Orlando, D. A., Haase, S. B., Hartemink, A. J. (2012)
“Branching process deconvolution algorithm reveals a detailed cell-cycle
transcriptional program,” (submitted).
2. Mayhew, M. B., Guo, X., Haase, S. B., Hartemink, A. J. (2012) “Close
encounters of the collaborative kind,” IEEE Computer, 45:24–30.
3. Guo, X., Bulyk, M. L., Hartemink, A. J. (2012) “Intrinsic disorder within and
flanking the DNA-binding domains of human transcription factors,” Pacific
Symposium on Biocomputing (PSB2012), 17:104–115, January 2012.
4. Guo, X., Hartemink, A. J. (2009) “Domain-oriented edge-based alignment of
protein interaction networks,” Intelligent Systems in Molecular Biology 2009
(ISMB09). Bioinformatics, 25:i240–246, July 2009.