
From Population to Single Cells: Deconvolution of Cell-cycle Dynamics

by

Xin Guo

Department of Computer Science
Duke University

Date:
Approved:

Alexander J. Hartemink, Supervisor

Pankaj K. Agarwal

Uwe Ohler

Steven B. Haase

Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University

2012

Abstract

From Population to Single Cells: Deconvolution of Cell-cycle Dynamics

by

Xin Guo

Department of Computer Science
Duke University

Date:
Approved:

Alexander J. Hartemink, Supervisor

Pankaj K. Agarwal

Uwe Ohler

Steven B. Haase

An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science in the Graduate School of Duke University

2012

Copyright © 2012 by Xin Guo
All rights reserved except the rights granted by the Creative Commons Attribution-Noncommercial License
http://creativecommons.org/licenses/by-nc/3.0/us/

Abstract

The cell cycle is one of the fundamental processes in all living organisms, and all cells arise from the division of existing cells. To better understand the regulation of the cell cycle, synchrony experiments are widely used to monitor cellular dynamics during this process. In such experiments, a large population of cells is generally arrested or selected at one stage of the cycle, and then released to progress through subsequent division stages. Measurements are then taken in this population at a variety of time points after release to provide insight into the dynamics of the cell cycle. However, due to cell-to-cell variability and asymmetric cell division, cells in a synchronized population lose synchrony over time. As a result, the time-series measurements from the synchronized cell populations do not accurately reflect the underlying dynamics of cell-cycle processes.

In this thesis, we introduce a deconvolution algorithm that learns a more accurate view of cell-cycle dynamics, free from the convolution effects associated with imperfect cell synchronization. Through wavelet-basis regularization, our method sharpens signal without sharpening noise, and can markedly increase both the dynamic range and the temporal resolution of time-series data. Though it can be applied to any such data, we demonstrate the utility of our method by applying it to a recent cell-cycle transcription time course in the eukaryote Saccharomyces cerevisiae. We show that our method more sensitively detects cell-cycle-regulated transcription, and reveals subtle timing differences that are masked in the original population measurements. Our algorithm also explicitly learns distinct transcription programs for both mother and daughter cells, enabling us to identify 82 genes transcribed almost entirely in early G1 in a daughter-specific manner.

In addition to the cell-cycle deconvolution algorithm, we introduce DOMAIN, a protein-protein interaction (PPI) network alignment method, which employs a novel direct-edge-alignment paradigm to detect conserved functional modules (e.g., protein complexes, molecular pathways) from pairwise PPI networks. By applying our approach to detect protein complexes conserved in yeast-fly and yeast-worm PPI networks, we show that our approach outperforms two widely used approaches in most alignment performance metrics. We also show that our approach enables us to identify conserved cell-cycle-related functional modules across yeast-fly PPI networks.


Contents

Abstract
List of Tables
List of Figures
List of Abbreviations and Symbols
Acknowledgements

1 Introduction
  1.1 Biological background
    1.1.1 Overview of cell cycle
    1.1.2 Phases of eukaryotic cell cycle
    1.1.3 Asymmetric cell division of budding yeast
    1.1.4 Cell-cycle control system of budding yeast
  1.2 Cell-cycle synchrony experiment and its limitations
    1.2.1 Biomarkers for monitoring cell-cycle progression
    1.2.2 Cell-cycle synchrony experiment
    1.2.3 Synchrony is lost significantly in a synchronized cell population
  1.3 Motivation: why deconvolution is necessary
    1.3.1 Deconvolution: from population to single cells
    1.3.2 CLOCCS: modeling cell-cycle distributions
    1.3.3 The missing piece of deconvolution
  1.4 Contribution of our deconvolution framework
  1.5 Thesis outline

2 The deconvolution framework
  2.1 Previous deconvolution algorithms
  2.2 General deconvolution objective function
  2.3 Branching process in deconvolution
  2.4 Introduction to wavelets: selection of wavelets
  2.5 Selecting a regularization parameter
  2.6 Joint learning from multiple replicates

3 Deconvolution of wild-type cell-cycle transcriptional profiles of budding yeast
  3.1 Experimental data
  3.2 Branching process model and cell-cycle parameters
    3.2.1 Branching process model
    3.2.2 Cell-cycle parameters from CLOCCS
  3.3 Deconvolution model
    3.3.1 Deconvolution objective function
    3.3.2 Constructing a convolution kernel
    3.3.3 Selecting a regularization parameter
    3.3.4 Adjustment of branching process construction from CLOCCS
  3.4 Results
    3.4.1 Deconvolving time-series yeast budding index data to assess algorithm accuracy
    3.4.2 Deconvolving replicate yeast microarray data to reveal single-cell transcription profiles
    3.4.3 Deconvolution is robust with respect to uncertainty in input CLOCCS parameters
    3.4.4 Deconvolution increases temporal resolution and precision of transcription profiles
    3.4.5 Deconvolution increases amplitude and dynamic range of transcription profiles
    3.4.6 Deconvolution reveals a large number of transcripts fluctuating during the cell cycle
    3.4.7 Deconvolution is robust across replicates
    3.4.8 Deconvolution reveals fine timing of transcription programs
    3.4.9 Identifying over-represented transcription factors (TFs)
    3.4.10 Deconvolution reveals R-specific transcriptional program
    3.4.11 Deconvolution reveals a daughter-specific G1 transcription program
    3.4.12 Transcriptional programs between G1 and DG1
    3.4.13 Visualizing transcription timing of gene groups

4 Identifying conserved functional modules across species
  4.1 Introduction to network alignment
  4.2 DOMAIN: a domain-oriented edge-based PPI network aligner
    4.2.1 Constructing and scoring APEs
    4.2.2 Building an APE graph
    4.2.3 Detecting protein complexes
  4.3 Results
    4.3.1 Experimental setup
    4.3.2 DOMAIN outperforms previous methods in most performance metrics
    4.3.3 DOMAIN is sensitive at detecting small alignments
    4.3.4 DOMAIN provides a comprehensive means of interpreting alignments
    4.3.5 Performance improves by combining cross-species pairwise alignments
  4.4 Detecting conserved cell-cycle-related functional modules
  4.5 Discussions

5 Conclusions

A Intrinsic disorder within and flanking the DNA-binding domains of human transcription factors
  A.1 Introduction to intrinsically disordered structures and transcription factors
  A.2 Materials and Methods
    A.2.1 Constructing the TF dataset and the non-TF control dataset
    A.2.2 Comparing the TF and non-TF sets of proteins
    A.2.3 Identifying DNA-binding domains (DBDs) and their locations within proteins
    A.2.4 Using multiple prediction methods to predict intrinsically disordered regions (IDRs) within proteins
    A.2.5 Defining disorder features: spatial relationships of IDRs relative to DBDs within TFs
    A.2.6 Calculating statistical significance of disorder features
  A.3 Results
    A.3.1 Comparing the three methods to predict IDRs within proteins
    A.3.2 IDRs associated with TF DBDs or their flanking regions
    A.3.3 Comparison of prediction methods in DBDs
    A.3.4 Summary descriptions for some of the most prevalent DBD classes found in human TFs
  A.4 Discussion

Bibliography

Biography

List of Tables

3.1 Cell-cycle parameters estimated by CLOCCS from flow cytometric measurements of DNA content and budding index.
3.2 Full list of over-represented TFs in subclusters of R-specific expressed genes (Fig. 3.11).
3.3 Full list of over-represented TFs in subclusters of daughter-specific genes (Fig. 3.12).
3.4 The contingency table for 82 identified daughter-specific genes according to the daughter-specific and non-daughter-specific genes identified in Di Talia et al. (2009), Spellman et al. (1998), and Colman-Lerner et al. (2001).
4.1 Summary of backbone networks.
4.2 Performance comparisons of DOMAIN with NetworkBLAST and MaWISh on yeast-fly backbone networks.
4.3 Performance comparisons of DOMAIN with NetworkBLAST and MaWISh on yeast-worm backbone networks.
4.4 Cell-cycle-related functional modules conserved across budding yeast and fruit fly.
A.1 Statistics summarizing disorder predictions on all the residues of all the proteins in both the TF set and the non-TF control set using three different disorder prediction tools.
A.2 Enrichment analysis of significantly occurring ordered and disordered regions within and flanking human TF DBDs.

List of Figures

1.1 Overview of eukaryotic cell cycle.
1.2 Asymmetric cell division of budding yeast.
1.3 Overview of the cell-cycle control system of budding yeast.
1.4 Examples of the measurable cell-cycle progression markers in budding yeast.
1.5 Examples to illustrate that the synchronized population of cells loses synchrony over time.
1.6 Overview of deconvolution framework.
1.7 Branching process in CLOCCS.
2.1 Branching process in deconvolution.
2.2 Selection of a regularization parameter γ.
3.1 Overview of the deconvolution algorithm.
3.2 Detailed algorithm for selecting a regularization parameter γ.
3.3 Deconvolution recovers dynamic single-cell profiles from population-level data.
3.4 Deconvolution is capable of de-noising.
3.5 Deconvolved profiles are robust to uncertainty in inputs.
3.6 More examples on the robustness of deconvolved profiles with respect to uncertainty in CLOCCS parameter estimates.
3.7 Genome-wide analysis of deconvolved transcription profiles reveals a large number of transcripts fluctuating during the cell cycle.
3.8 Transcript dynamics of 1,500 most cell-cycle-regulated genes.
3.9 Robustness of deconvolved profiles with respect to variation across measured data replicates.
3.10 High temporal resolution of deconvolution reveals fine timing of transcription programs.
3.11 Genes whose transcriptional levels are elevated significantly under stress.
3.12 Branching process construction enables deconvolution to reveal a daughter-specific G1 transcription program.
3.13 Relationships of transcription profiles in G1 and DG1.
3.14 Circular representation of peak timing of genes.
4.1 Overview of DOMAIN algorithm.
4.2 Four connectivities in an APE graph.
4.3 Evaluation of alignment performance of DOMAIN.
A.1 Generation of TF set and the non-TF control set.
A.2 Distributions of the fraction of each protein’s residues predicted as disordered by each method.
A.3 Meta-plots of five prevalent DBDs in human TFs.

List of Abbreviations and Symbols

Abbreviations

APC anaphase-promoting complex

APE alignable pairs of edges

ATM ataxia telangiectasia mutated

ATR ataxia telangiectasia and Rad3-related protein

CDK cyclin-dependent protein kinases

CLOCCS characterizing loss of cell cycle synchrony

CWT continuous wavelet transform

DBD DNA-binding domain

DDI domain-domain interaction

DG1 daughter-specific G1

DWT discrete wavelet transform

DOMAIN domain-oriented alignment of interaction networks

EM expectation-maximization

FACS fluorescence-activated cell sorter

FDR false discovery rate

GO gene ontology

indel insertion/deletion

MBF MCB binding factor

MCM minichromosome maintenance

MCMC Markov chain Monte Carlo

ORC origin recognition complex

postG1 post G1 (G1 or DG1) interval, including S, G2, and M phases

PPI protein-protein interaction

pre-IC pre-initiation complex

pre-RC pre-replicative complex

PTR peak-to-trough ratio

R recovery interval

SBF SCB binding factor

SPB spindle-pole body

TF transcription factor

WT wild-type

YMC yeast metabolic culture

Symbols

f average levels of molecular species in individual cells at various points in the cell cycle

g measured cell-cycle time-series at population level

H (de)convolution kernel


Acknowledgements

First and foremost, I would like to express my earnest gratitude to my advisor, Prof. Alex Hartemink, for his support, patience, encouragement, wisdom, countless insightful suggestions, and long discussions. From Alex, I have learned so much, not only about science, but also about all aspects of my research work and life. He is a great mentor. He taught me how to think, to write, and to present in a scientific way. He has always been there to listen to my thoughts, which were often rambling, and turn them into something meaningful. He truly makes our group an enjoyable place to be, to discuss, and to learn.

I would like to acknowledge my committee members, Prof. Steve Haase, Prof. Uwe Ohler, and Prof. Pankaj Agarwal. Thank you for helping me through these years. Steve, with his immense knowledge of yeast biology, often provided me with insightful feedback and suggestions on our cell-cycle projects. Uwe taught me a lot about computational biology and genetics in his classes, and he was always willing to share his experience whenever I ran into a problem in my research. I thank Pankaj for his invaluable suggestions and feedback on the writing of this dissertation. I would also like to thank Prof. Martha Bulyk, Prof. Edwin Iversen, Prof. Merlise Clyde, Prof. Rebecca Willett, and Prof. David MacAlpine for helpful discussions at various points during the development and writing of this dissertation.

Thanks to all members of the Hartemink lab, past and present. I could not have finished my PhD study without their help and support. Special thanks to Dr. Allister Bernard, who developed the initial framework of the cell-cycle deconvolution algorithm, and to Dr. Josh Robinson, my officemate for over two years, who taught me a lot about statistics and brought me a lot of laughter. Dr. Raluca Gordan, Dr. Narlikar LeeLavati, Dr. David Orlando, Dr. Todd Wasson, Abrita Chakravarty, Yezhou Huang, Jianling Zhong, Michael Mayhew, Kaixuan Luo, and Dr. Fantine Mordelet, thank you for so many helpful discussions, and for making our lab such a lively and productive place.

Last but definitely not least, I thank my family. Dad and Mom, thank you for all the love and encouragement over so many years. Thank you, Rui, my wife. Without your help and support, I would never have been able to complete this dissertation. And a big kiss to my daughter Mandy, who is growing up to be such a wonderful, loving kid.

Thank you all!


1 Introduction

The cell is the basic structural and functional unit of all known living organisms. All necessary genetic information and molecular machinery are maintained in individual cells, which enables existing cells to produce new cells through an intricate series of cell-cycle events. To better understand how these events are regulated, studies in many organisms have monitored the dynamics of various molecular species (e.g., transcript levels, protein levels, nucleosome positions) throughout the cell cycle. Ideally, the dynamics of these species would be studied in individual cells traversing the cell cycle. Unfortunately, accurate and genome-wide quantification of many molecular species is still only possible in populations of cells. For population measurements to provide insight into the dynamics of molecular species in individual cells, the cells in a population must first be synchronized: they are arrested at one stage of the cell cycle, and then released to progress through subsequent division cycles. Molecular species can then be monitored in the population at various time points after release.

However, perfect cell synchrony is neither attainable at synchronization nor maintainable after release. More importantly, cell division is an asymmetric process in many kinds of cells, such as budding yeast; after cell division, the newborn daughter cells are typically smaller than their mothers, and the cell-cycle period of these daughter cells is significantly longer than that of mothers. For these reasons, time-series measurements taken over a population of cells do not accurately reflect the dynamics of individual cells as they traverse the cell cycle, but instead represent the convolved dynamics of all cells in the imperfectly synchronized population.
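The damping caused by synchrony loss can be illustrated with a small simulation. The sketch below is not part of the method developed in this thesis: the cosine-shaped single-cell signal, the 90-minute mean period, and the 10% cell-to-cell spread are assumed values chosen only for illustration, and asymmetric division is ignored. It averages a periodic signal over cells that each advance through the cycle at a slightly different rate, and compares the peak-to-trough amplitude of the population average in the first and fourth cycles after release.

```python
import math
import random


def single_cell_signal(phase):
    """Hypothetical periodic signal; `phase` is measured in completed cycles."""
    return 1.0 + math.cos(2.0 * math.pi * phase)


def population_average(t, periods):
    """Mean signal at time t over cells that each have their own cycle period."""
    return sum(single_cell_signal(t / p) for p in periods) / len(periods)


def amplitude(times, periods):
    """Peak-to-trough difference of the population average over `times`."""
    values = [population_average(t, periods) for t in times]
    return max(values) - min(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumed values: 2000 cells, mean period 90 min, standard deviation 9 min.
periods = [rng.gauss(90.0, 9.0) for _ in range(2000)]

amp_first = amplitude(range(0, 90, 5), periods)      # first cycle after release
amp_fourth = amplitude(range(270, 360, 5), periods)  # fourth cycle after release

print(f"first-cycle amplitude:  {amp_first:.2f}")
print(f"fourth-cycle amplitude: {amp_fourth:.2f}")
```

Every simulated cell carries the same full-amplitude signal, yet the population average flattens as the cells drift apart in phase. This is the sense in which population measurements are a convolved view of single-cell dynamics.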

In this thesis, we introduce a deconvolution algorithm that efficiently removes these synchrony loss effects from population-level measurements and reveals a detailed cell-cycle profile at a single-cell level. Our deconvolution algorithm is built upon CLOCCS (Characterizing Loss of Cell Cycle Synchrony), a framework for quantitatively determining cell-cycle distributions in population synchrony experiments. From CLOCCS parameter estimates, we construct a convolution kernel that transforms values from the individual-cell level to the population level. The problem of estimating single-cell dynamics from population-level measurements can then be viewed as an ill-posed inverse problem. We address the ill-posed nature of this problem, and simultaneously tackle the issue of noise in the input data, by employing a novel wavelet-basis regularization approach.
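To make the inverse-problem view concrete, the sketch below writes the population-level measurements as g = H f, where f is the single-cell profile and H is the convolution kernel, and recovers f by regularized least squares. It is a toy illustration under stated assumptions rather than the algorithm of this thesis: the kernel is a hypothetical Gaussian whose width grows with time (a stand-in for a kernel built from CLOCCS estimates), the regularizer is a simple second-difference smoothness penalty (a stand-in for the wavelet-basis regularization), and the profile, the spread, and the value of gamma are arbitrary choices.

```python
import math


def blur_kernel(n, spread):
    """Row t holds the assumed fraction of cells in each phase bin at time t."""
    H = []
    for t in range(n):
        width = 0.5 + spread * t  # the phase distribution widens over time
        row = [math.exp(-0.5 * ((j - t) / width) ** 2) for j in range(n)]
        total = sum(row)
        H.append([w / total for w in row])
    return H


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def deconvolve(H, g, gamma):
    """Minimize ||g - H f||^2 + gamma * ||D f||^2 (D = second differences)."""
    n = len(H[0])
    D = [[0.0] * n for _ in range(n - 2)]
    for i in range(n - 2):
        D[i][i], D[i][i + 1], D[i][i + 2] = 1.0, -2.0, 1.0
    # Normal equations: (H^T H + gamma * D^T D) f = H^T g
    A = [[sum(H[t][j] * H[t][k] for t in range(len(H)))
          + gamma * sum(D[i][j] * D[i][k] for i in range(n - 2))
          for k in range(n)] for j in range(n)]
    b = [sum(H[t][j] * g[t] for t in range(len(H))) for j in range(n)]
    return solve(A, b)


def rms_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


n = 24
f_true = [1.0 + math.cos(2.0 * math.pi * j / 12.0) for j in range(n)]  # assumed profile
H = blur_kernel(n, spread=0.15)
g = matvec(H, f_true)                 # simulated population-level measurements
f_hat = deconvolve(H, g, gamma=1e-3)  # regularized estimate of the single-cell profile

print(f"RMS error of g vs f_true:     {rms_error(g, f_true):.3f}")
print(f"RMS error of f_hat vs f_true: {rms_error(f_hat, f_true):.3f}")
```

Without the penalty term, the normal equations would be badly conditioned wherever the kernel is wide, which is what makes the problem ill-posed. The penalty trades a small bias for stability, and gamma controls that trade-off.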

Before elaborating on the computational details of our deconvolution algorithm, in this chapter we first review the biological background and experimental techniques that motivate our work toward accurately estimating cell-cycle dynamics at a single-cell level. We then describe the basis of the cell-cycle deconvolution problem and briefly introduce the CLOCCS model. We end this chapter with an outline of this thesis.

1.1 Biological background

1.1.1 Overview of cell cycle

The cell cycle, or cell division cycle, is one of the fundamental processes in all living organisms, from unicellular bacteria to multicellular mammals. During the course of the cell cycle, a cell replicates its genome and other cellular contents and divides to produce a new cell. In unicellular organisms such as bacteria or yeasts, cell division generates an entire new organism. In multicellular species, countless cell divisions starting from a single founder produce the diverse communities of cells that make up tissues and organs.

The cell cycle is a series of events that take place in a cell leading to its duplication and division. Although the details of the cell cycle vary from organism to organism, certain characteristics are common. At a minimum, a cell has to accomplish its most fundamental task: passing on its genetic information to the next generation. In cells without a nucleus (prokaryotes), the cell cycle occurs via a process termed binary fission. In cells with a nucleus (eukaryotes), the cell cycle is controlled by a complex network of regulatory proteins, known as the cell-cycle control system, that governs progression through the cell cycle. The core of this system is an ordered series of biochemical switches that initiate the main events of the cycle, including chromosome duplication and segregation. In this thesis, we focus on the eukaryotic cell cycle, especially the cell cycle of the model organism budding yeast, Saccharomyces cerevisiae.

1.1.2 Phases of eukaryotic cell cycle

The two most basic functions of the cell cycle are accurate duplication of the large amount of DNA in the chromosomes and segregation of the precisely duplicated chromosomes into two daughter cells. The stages of the eukaryotic cell cycle are typically defined on the basis of these two chromosomal events, separated by two gap phases, G1 and G2 (Fig. 1.1).

Figure 1.1: Overview of eukaryotic cell cycle. The reproduction of cells includes two major processes: chromosome duplication during S phase, and chromosome segregation during M phase. These two phases are separated by two gap phases: G1 is the gap phase between the previous M phase and S phase, and G2 is the gap phase between S and M phases. Figure is adapted from Morgan (2007).

G1 phase (also known as post-mitotic phase) is the major period of cell growth during one cell cycle. In G1 phase, large amounts of structural proteins and enzymes are required for synthesizing new organelles, and therefore the rate of metabolism in the cell is high. The length of G1 phase can vary greatly depending on external conditions and extracellular signals from other cells (in multicellular organisms). Sometimes, cells delay progress through G1 and may even enter a specialized resting state known as G0 phase. Near the end of G1, the cell progresses through a commitment point, known as Start (in yeasts) or the restriction point (in mammalian cells), which is a major cell-cycle checkpoint to ensure the DNA is intact and the cell is functioning normally. After passing this point, the cell is committed to DNA replication, even if the extracellular signals that stimulate cell growth and division are removed (Morgan, 1997; Alberts et al., 2007; Murray and Hunt, 1993).

Next is the synthesis (S) phase, during which the chromosomes are duplicated. The central event in this phase is DNA replication. It starts from specific locations in the genome, called ‘replication origins’. A complex of initiator proteins binds to these sites and opens the DNA, making two Y-shaped DNA structures called ‘replication forks’. DNA polymerases and other replication proteins are recruited to these forks, which move outwards in both directions, to form the two new DNA strands.


In addition to DNA replication, chromatin structures are constructed and DNA damage, if it occurs, is detected and repaired during this phase. Both of these processes require increased synthesis of proteins, such as histones for packaging the DNA into chromosomes (Osley, 1991), and ataxia telangiectasia mutated (ATM) and ataxia telangiectasia and Rad3-related protein (ATR), two master kinases that respond to DNA double-strand breaks and disruptions in chromatin structure (Bakkenist and Kastan, 2003).

G2 phase is the second growth period of the cell cycle, occurring between S phase and mitosis (M) phase. Curiously, G2 phase is not a necessary part of the cell cycle. Some cell types (particularly Xenopus embryos and some cancers (Liskay, 1977)) proceed directly from DNA replication to mitosis. Budding yeast Saccharomyces cerevisiae, the model organism in the study of the cell cycle, also lacks a clearly defined G2 phase (Forsburg and Nurse, 1991).

The second major phase of the cell cycle is mitosis (M) phase. M phase is typically composed of two major events: nuclear division (mitosis) and cell division (cytokinesis). The first event, mitosis, is a complex and precise process that distributes the duplicated chromosomes equally into a pair of daughter nuclei. Mitosis can be divided into four sub-phases: prophase, during which chromatin condenses into double chromosomes; metaphase, during which the condensed chromosomes align in the middle of the cell; anaphase, during which chromosomes move to opposite poles of the cell; and telophase, during which two daughter nuclei form in the cell. In the second event, cytokinesis, the cytoplasm of a single eukaryotic cell divides to form two daughter cells, each with a set of chromosomes identical to that of the mother cell (Morgan, 2007).


1.1.3 Asymmetric cell division of budding yeast

Budding yeast Saccharomyces cerevisiae is a unicellular fungus that has been widely used in baking and brewing since ancient times, and it is therefore commonly called baker’s or brewer’s yeast. Budding yeast is one of the most intensively studied eukaryotic organisms in genetics and cell biology, particularly in the study of the cell cycle. As a unicellular eukaryote, budding yeast offers many advantages for studying cell-cycle regulation. First, it has a relatively small genome and is able to proliferate rapidly in simple culture conditions (e.g., approximately 90 minutes per cell division under ideal conditions). Second, the cell cycle of budding yeast is very similar to the cell cycle of many higher eukaryotes, such as humans. Third, and more importantly, budding yeast can proliferate in a haploid state, in which only a single copy of each chromosome is present in the cell. This makes it easy to manipulate the cells genetically, avoiding the pitfall of recessive mutations being masked by a second copy of the gene. For example, as early as the 1970s, researchers used haploid cells of budding yeast to carry out large mutation screens, leading to many key discoveries about the regulation of cell division (Simchen, 1978). For all these reasons, budding yeast Saccharomyces cerevisiae is an ideal experimental model organism for the study of the cell cycle.

Figure 1.2: Asymmetric cell division of budding yeast S. cerevisiae (diagram labels: G1, S, G2/M, START). The cycle of budding yeast is usually split by landmark events into G1, S, and G2/M phases. The transition from G1 to S is marked by the development of a bud, and the transition from S to G2 is marked by the completion of DNA synthesis. At the end of M phase, the daughter cell separates from the mother cell. After yeast cell division, the newborn daughter cell is usually smaller than the mother cell, and therefore it needs more time in G1 to grow until it reaches a critical cell size.

A particularity of budding yeast S. cerevisiae lies in its asymmetric division (Hartwell and Unger, 1977; Lord and Wheals, 1981, 1980; Woldringh et al., 1993; Bean et al., 2006). As illustrated in Fig. 1.2, the cycle of S. cerevisiae is usually split into three phases, G1, S, and G2/M, as there exists no clearly defined G2 phase in budding yeast (Forsburg and Nurse, 1991). Around the time that a cell progresses from G1 into S phase, a bud is initiated from one side of the cell, grows steadily, and finally separates from its mother after mitosis, forming a daughter cell. After cell division, the newborn daughter cells are usually smaller than the mother cells, and the cell-cycle period of these daughter cells is significantly longer than that of mother cells. This is most likely due to mechanisms, not yet well understood, that delay daughter cells in early G1 until they achieve a critical cell size (Jorgensen and Tyers, 2004). Mother cells are often already larger than this critical size and thus progress more rapidly through G1 (Di Talia et al., 2007; Morgan, 2007).
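The way a daughter-specific G1 delay spreads division events over time can be sketched with a deterministic toy branching model. This is an illustration only and not a model from this thesis: the 90-minute mother cycle echoes the approximate division time quoted earlier, while the 20-minute daughter delay, the absence of cell-to-cell variability, and the treatment of the founder cell as a mother are all assumptions.

```python
import heapq


def division_times(mother_period, daughter_delay, t_end):
    """Division times up to t_end for a population founded by one cell at t = 0."""
    pending = [mother_period]  # times at which existing cells will next divide
    times = []
    while pending and pending[0] <= t_end:
        t = heapq.heappop(pending)
        times.append(t)
        heapq.heappush(pending, t + mother_period)                   # mother divides again
        heapq.heappush(pending, t + mother_period + daughter_delay)  # daughter needs extra time in G1
    return times


# Assumed values: 90-minute mother cycle, 20-minute extra G1 delay for daughters.
symmetric = division_times(90, 0, 360)
asymmetric = division_times(90, 20, 360)

print("distinct division times, symmetric: ", sorted(set(symmetric)))
print("distinct division times, asymmetric:", sorted(set(asymmetric)))
```

With no delay, every cell divides at multiples of 90 minutes and the population stays in lockstep. With the delay, division events spread across intermediate times even though no randomness is present, so asymmetric division alone is enough to erode synchrony.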

1.1.4 Cell-cycle control system of budding yeast

The eukaryotic cell division cycle is controlled by a sequential activation and in-

activation of cyclin-dependent protein kinases (CDKs). CDKs are a family of ser-

ine/threonine protein kinases. In general, a CDK binds a regulatory protein called

a cyclin to play its regulatory role. Without cyclin, CDK has little kinase activity,

and therefore only the cyclin-CDK complex is an active kinase. CDKs are present in

all known eukaryotes, and their regulatory functions in the cell cycle are conserved

7

Figure 1.3: Overview of the cell-cycle control system of budding yeast. There existthree major sets of gene regulatory factors that provide the underlying frameworkfor an autonomous control system to trigger cell-cycle events in the correct order:SBF/MBF, Mcm-Fkh, and Swi5/Ace2 (blue boxes). In early G1, Cln3-Cdk1 activitysets the system in motion by activating SBF/MBF. Then, the regulatory signalsproceed forward through the various Cdks and gene regulatory factors as shown bythe solid red arrows, leading to ordered progression thorough the stages of the cellcycle and back to the stable G1 stage again. Positive feedback (dashed red arrows)enhances the activation of each gene regulatory factor, and negative feedback (dashedblue lines) allows some components to inhibit previous components in the sequence.Figure is adapted from Morgan (2007).

across species. For example, it has been shown that yeast cells can proliferate

normally when their CDK gene is replaced with the homologous human gene (Lee

and Nurse, 1987; Morgan, 2007). In the budding yeast Saccharomyces cerevisiae,

Cdc28/Cdk1 is the only CDK involved in regulating the cell cycle, while in higher

eukaryotes, multiple CDKs (e.g., Cdc2/Cdk1, Cdk2, Cdk4, and Cdk6) control cell

cycle progression.

There exist nine major cyclins in budding yeast: three G1 cyclins (Cln1-3) and six

B-type cyclins (Clb1-6). All these cyclins bind to and activate Cdc28, and they

together with some other regulatory factors establish a complex regulatory network


to control the progression of the cell cycle (Futcher, 2002; Murray, 2004; Cross, 2003;

Chen et al., 2004; Morgan, 2007; Bloom and Cross, 2007; Alberts et al., 2007).

As shown in Fig. 1.3, in early G1, the activity of most Cdks is suppressed by the Cdk

inhibitor Sic1 and by cyclin ubiquitination by the anaphase-promoting complex (APC).

However, these inhibitory factors do not prevent growth-dependent accumulation of

G1 cyclin Cln3. Therefore during G1 the activity of Cln3-Cdk1 complex accumulates

and reaches a threshold level that triggers activation of the gene regulatory factors

SBF (Swi4-Swi6) and MBF (Mbp1-Swi6), and these factors sequentially stimulate the

expression of genes encoding G1/S cyclins (Cln1 and Cln2) and S cyclins (Clb5 and

Clb6). Since the G1/S-Cdk complexes are resistant to Sic1 and are not targeted by

APC, the activity of G1/S-Cdk increases greatly in late G1, leading to phosphorylation

of Cdh1 and inactivation of APC.

APC inactivation and Sic1 destruction allow M cyclins to start accumulating,

and the rising M-Cdk activity stimulates activation of the next gene regulatory

factor in the sequence, Mcm1-Fkh. The Mcm1-Fkh co-factor further stimulates the

expression of M cyclins and other genes required for mitosis, leading the cell to enter

mitosis.

During M phase, after the metaphase-to-anaphase transition, cyclins are destroyed, which

leads to activation of the M/G1 gene regulators Swi5 and Ace2. Swi5 and Ace2

stimulate the expression of Sic1 and other proteins that cause Cdk inactivation.

Therefore, after division the system has returned to a stable G1 state with low

Cdk activity, poised to begin the next cycle.

1.2 Cell-cycle synchrony experiment and its limitations

1.2.1 Biomarkers for monitoring cell-cycle progression

How can we tell what stage a budding yeast cell has reached in the cell cycle?

One simple and cost-efficient way is to look at the living cells with a light microscope


to check whether or not the cell is budded. As mentioned previously, the bud

of a yeast cell appears near the G1-to-S transition and persists until the completion of mitosis,

when the mother cell and its bud (daughter cell) separate. Therefore, the presence

of a bud tells us whether or not a cell has passed G1 phase, and

the fraction of budded cells in a population indicates how the cells

are distributed over the cell cycle. For instance, by counting the total number of

cells and the number of cells with buds under a microscope, we can calculate the

fraction of cells that are in G1 and the corresponding fraction of cells that have passed

G1 (including S and G2/M phases). Using a cell-cycle distribution model, such

as cloccs described in Section 2.5, we can accurately estimate how the cells in a

population are distributed over the cell cycle. An example of a typical budding index

profile is shown in Fig. 1.4A.
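The counting step described above amounts to a simple ratio. The following is a minimal sketch in Python; the function name and the cell counts are hypothetical and are not taken from any experiment in this thesis.

```python
def budding_index(n_budded, n_total):
    """Return (fraction past G1, fraction in G1) from microscope counts."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_budded <= n_total:
        raise ValueError("n_budded must lie between 0 and n_total")
    post_g1 = n_budded / n_total   # budded cells are in S or G2/M
    return post_g1, 1.0 - post_g1  # unbudded cells are in G1


# Hypothetical (budded, total) counts at three time points after release.
counts = [(12, 200), (130, 200), (90, 200)]
for budded, total in counts:
    post_g1, g1 = budding_index(budded, total)
    print(f"budded: {post_g1:.0%}, unbudded (G1): {g1:.0%}")
```

Plotting the budded fraction against time gives a budding index curve of the kind shown in Fig. 1.4A.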

In addition to budding index measurements, actomyosin rings, nuclei, and spindle

pole bodies (SPBs) can also be used as cellular markers with fluorescence microscopy

to study the stage of the cell cycle in budding yeast. Fluorescence microscopy has become

increasingly common in recent years and provides a rich source of marker data (Stacey

and Hitomi, 2008; Harder et al., 2006; Dickinson, 2006; Aikawa et al., 2007). For

example, by tagging proteins associated with these markers using fluorescent dyes

and quantifying the presence of these markers under a fluorescence microscope, we

can determine what cell-cycle stage a population of cells has reached. In detail,

• Appearance of the actomyosin ring marks the G1/S transition, and disassembly of the actomyosin

ring marks the end of cytokinesis (Bi et al., 1998).

• Cell nucleus disassembles and re-forms during the cell cycle. At the beginning of

mitosis, the chromosomes condense, the nucleolus disappears, and the nuclear

envelope breaks down, resulting in the release of most of the contents of the

nucleus into the cytoplasm. At the end of mitosis, the process is reversed: The


[Figure 1.4 image. Panel A: budding index, % of cells budded (0 to 100) versus time. Panel B: images (1) to (4). Panel C: DNA content histograms with 1C and 2C peaks.]

Figure 1.4: Examples of the measurable cell-cycle progression markers in budding yeast S. cerevisiae. (A) A typical budding index curve: the time-course record of the proportion of cells in a population in the G1 and post-G1 phases. (B) Fluorescence images. Shown are (1) two dividing budding yeast cells under differential interference contrast (DIC) microscopy. (2) Red fluorescence. Intense spots represent the myosin rings. (3) Blue fluorescence. Intense spots represent nuclei. (4) Green fluorescence. The small punctate blobs represent the spindle pole bodies (SPBs). (C) A typical DNA content histogram as measured by flow cytometry for an asynchronous population.

chromosomes de-condense, and nuclear envelopes re-form around the separated

sets of daughter chromosomes. Hence, the movement of the nucleus, especially its

position at the bud neck, can provide much information about the current stage of

the cell cycle (Granovskaia et al., 2010; Lord and Wheals, 1981).

• The SPBs are used to mark subintervals throughout S and G2/M phases.

The SPB duplicates, and the two SPBs separate to form a short spindle during S phase

and separate further as the spindle elongates during M phase (Simmons Kovacs

et al., 2008). Thus, we can identify cells at different stages of spindle

formation by tracking the distance between the two SPBs.


In all, with fluorescent dyes, we can track many cellular features during the course

of the cell cycle. Fig. 1.4B shows some fluorescence image examples for myosin rings (in red),

nuclei (in blue), and SPBs (in green).

Another efficient means of determining the cell cycle position is to measure the

genomic DNA content of the cell using flow cytometry (Haase and Reed, 2002; Slater

et al., 1977; Tobey and Crissman, 1975). A haploid yeast cell begins the cycle with

one copy of genomic DNA in G1. During S phase, the DNA is replicated, and thus

at the end of S phase, the cell contains two copies of genomic DNA. Using flow

cytometry, the DNA content of thousands of cells can be rapidly measured. The

genomic DNA of the cells is labeled with a fluorescent dye, and the flow cytometer then bins

each cell into one of 1024 ordered channels on the basis of its fluorescence intensity, which

is proportional to its DNA content (Pierrez and Ronot, 1992). A typical

DNA content histogram from flow cytometry is shown in Fig. 1.4C.
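The channel-binning step can be illustrated with a short Python sketch on simulated data. Everything here is an assumption for illustration: the intensities are drawn from two normal distributions standing in for the 1C and 2C populations, and the full-scale intensity of 1024 arbitrary units is chosen so that one channel spans one unit. Only the 1024-channel count comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CHANNELS = 1024        # number of ordered channels, as stated in the text
MAX_INTENSITY = 1024.0   # assumed full-scale intensity (arbitrary units)

# Simulated fluorescence intensities: a 1C peak near 200 and a 2C peak near 400.
cells_1c = rng.normal(200.0, 15.0, size=6000)
cells_2c = rng.normal(400.0, 25.0, size=4000)
intensity = np.clip(np.concatenate([cells_1c, cells_2c]), 0.0, None)

# Assign each cell to a channel in proportion to its intensity.
channel = (intensity / MAX_INTENSITY * N_CHANNELS).astype(int)
channel = np.minimum(channel, N_CHANNELS - 1)

# The DNA content histogram is the number of cells per channel.
histogram = np.bincount(channel, minlength=N_CHANNELS)

peak_1c = int(histogram[:300].argmax())
peak_2c = int(histogram[300:].argmax()) + 300
print("1C peak channel:", peak_1c, "2C peak channel:", peak_2c)
```

In a real asynchronous population, cells in S phase would fill the region between the two peaks; this sketch omits them for brevity.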

1.2.2 Cell-cycle synchrony experiment

As described in the previous section, a variety of methods are available to determine

the cell-cycle stage of cells. However, all these methods require a large population

of cells to obtain an accurate measurement. To provide insight into the dynamics

of cell-cycle processes, the cells in such a population should be as synchronized as

possible as they progress through the cell division cycle. To effect this synchrony, cells

are arrested or selected at one stage of the cell cycle, and then released to progress

through subsequent division cycles. Molecular species can then be measured in the

population at various time points after release (Spellman et al., 1998; Cho et al.,

1998; Pramila et al., 2006; Orlando et al., 2008; Granovskaia et al., 2010).

A number of methods have been used for synchronization of a population of

yeast cells at various stages of the cell cycle. The two most common approaches are

the physical means of centrifugal elutriation and the genetic means of α-factor block-release

(Orlando, 2009; Futcher, 1999; Amon, 2002).

Synchronization by centrifugal elutriation is a size-based method. The method

extracts small cells from a population of cells, and such cells are typically newborn

daughter cells in early G1 phase. In detail, a population of cells in liquid medium

is first pumped into a rapidly spinning chamber. The centrifugal force then causes

a gradient to form, with cells sedimenting at the bottom (outside) of the chamber

and the fluid eluting out the top (inside) through the exit port. Because small

cells have a higher surface-to-volume ratio than larger cells, their sedimentation is

relatively more affected by the rate of fluid flow. In the end, by carefully adjusting

the pump and centrifuge speeds while monitoring the fraction of unbudded cells in the output,

small cells can be selectively washed out of the chamber and collected.

In elutriation-based synchronization experiments, the initially collected cells—

typically small cells early in G1—are released from synchrony after experiencing

significant cold and osmotic stress, and therefore such cells require a period of time

to recover. Also, because the small cells are more likely to occupy a broader

range of positions in early G1 phase, the population synchronized by centrifugal

elutriation tends to lose synchrony faster than with other methods, owing to the asymmetric

nature of cell division in S. cerevisiae (more details in Section 1.2.3). However,

since centrifugal elutriation is a size-based collection method and not an induced

arrest, there is in theory very little transcriptional alteration of G1 events.

An alternative synchronization method is α-factor block-release. The α-factor synchrony

experiment is a genetic method, achieved by adding the α mating pheromone

(the arrest/block) to an asynchronous culture and then subsequently removing the

pheromone (the release). The α mating pheromone is a short peptide that binds to

the receptor Ste2 on MATa cells and induces a cascade that results in the inactivation

of the G1 cyclin-CDK complexes, leading to a G1-phase arrest. Because

the cells collected from an α-factor experiment are generally larger and well arrested


[Figure 1.5 image. Panel A: budding index. Panel B: PCL1. Panel C: SIC1. Panel D: SSK22. Cycle 1 and Cycle 2 are marked.]

Figure 1.5: A synchronized population of cells loses synchrony over time. Shown are the measured budding index profile (panel A) and transcriptional profiles of three genes (panels B-D) from Orlando et al. (2008) (time course synchronized by centrifugal elutriation).

in G1 phase, compared to centrifugal elutriation, the population of cells synchronized

in such experiments tends to maintain synchrony for a longer period of time. However,

because the α-factor synchrony method relies on an extra-cellular signal, it induces

large cellular changes during the cell-cycle arrest, significantly altering the

G1 transcriptional program.

1.2.3 Significant synchrony loss in a synchronized cell population

In the cell-cycle synchrony experiments, the measurements of cell populations would

not be substantially different from average measurements of individual cells if the

cells in the population were always perfectly synchronized. However, as shown in

Fig. 1.5, a synchronized population of cells loses much of its synchrony over time

after release. What causes synchrony loss in a synchronized population of cells?

At least three factors should be taken into account:

1. In the initially collected population, the cells may exhibit variability at the

time of the release.


2. Because there exists variability among cells and because individual cells progress

through the cell cycle at different rates, the synchrony in the population

deteriorates gradually over time.

3. Asymmetric cell division is a major source of synchrony loss in many kinds

of cells, and especially in budding yeast S. cerevisiae (Hartwell and Unger,

1977; Lord and Wheals, 1981, 1980; Woldringh et al., 1993; Bean et al., 2006).

As mentioned previously, after yeast cell division, the newborn

daughter cells are smaller than their mothers. Thus, these small daughter cells

need a longer time in early G1 to grow until they achieve a critical cell

size (Jorgensen and Tyers, 2004). On the other hand, mother cells have often

already reached this critical size and therefore can progress more rapidly

through G1 (Di Talia et al., 2007).

For these reasons, time-series measurements of a population of cells do not ac-

curately reflect the dynamics of individual cells as they traverse the cell cycle, but

instead represent the convolved dynamics of all cells in the imperfectly synchronized

population. Thus, observed population measurements are only a ‘blurred’ view of

the underlying behavior of individual cells, and this view becomes increasingly blurry

as the time course progresses. For example, the synchrony in the second cycle of the

profiles in Fig. 1.5 is apparently worse than that in the first cycle.

1.3 Motivation: why deconvolution is necessary

1.3.1 Deconvolution: from population to single cells

In the previous sections, we introduced cell-cycle synchrony experiments and

discussed their limitations: a synchronized

population loses synchrony significantly over time, and therefore the time-series measurements


taken over such a population of cells do not accurately reflect the underlying cell-cycle

dynamics. Let us use an example to have a closer look at this phenomenon.

Assume that at a specific time $t$, the cell-cycle measurement taken over a population

of cells is $g_t$. Also, assume that the average measurement level of individual

cells in G1 phase is $f_{G1}$, the average measurement level of individual cells

in S phase is $f_S$, and the average measurement level of individual cells in G2/M

phase is $f_{G2/M}$. If we know how the cells in the population are distributed

over these three cell-cycle stages (denoted as $h_{t,G1}$, $h_{t,S}$, and $h_{t,G2/M}$, respectively),

then the population-level measurement at this specific time can be written

as $g_t = f_{G1} \times h_{t,G1} + f_S \times h_{t,S} + f_{G2/M} \times h_{t,G2/M}$. In general, $g_t$ is measurable, and if

we can calculate $h_{t,G1}$, $h_{t,S}$, and $h_{t,G2/M}$, then estimating $f_{G1}$, $f_S$, and $f_{G2/M}$ can be

viewed as a deconvolution problem.
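As a worked illustration of this three-phase example, the Python sketch below builds population measurements from assumed per-phase levels and then recovers those levels. All numbers are hypothetical. With three time points and three phases the system is square and can be solved directly, which is not the case in the general problem treated later.

```python
import numpy as np

# Row t holds the fractions of cells in (G1, S, G2/M) at time t; each row sums to 1.
H = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
])

# Hypothetical average single-cell levels in G1, S, and G2/M.
f_true = np.array([100.0, 400.0, 250.0])

# Forward model: each population measurement is a weighted mix of the phase levels.
g = H @ f_true
print("population measurements:", g)

# Deconvolution in this small, well-conditioned case: solve H f = g for f.
f_est = np.linalg.solve(H, g)
print("recovered phase levels:", f_est)
```

For the first time point, for instance, the measurement is 0.8 × 100 + 0.1 × 400 + 0.1 × 250 = 145, which lies far from the level of any single phase.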

A generalized description of deconvolution is shown in Fig. 1.6, in which g de-

notes the measured population-level time-series data, f denotes the average cell-cycle

time-series profile of individual cells (e.g., transcriptional profile, protein expression

profile), and H is a matrix (deconvolution kernel), which quantifies how the cells in

the population are distributed over the course of cell cycle at each time point. In

real-world application, g is measured, f is unknown, and in the next section, we will

introduce cloccs, a cell-cycle distribution model, which can be used to accurately

estimate the deconvolution kernel, H.

1.3.2 cloccs: modeling cell-cycle distributions

In this section, we briefly introduce cloccs (Characterizing Loss of Cell Cycle Syn-

chrony) (Orlando et al., 2007, 2009; Mayhew et al., 2011), a framework for quantita-

tively determining cell-cycle distributions in population synchrony experiments, or

in other words, estimating the deconvolution kernel H (Fig. 1.6).

In cloccs, the cell-cycle progression of a synchronized population of cells is


[Figure 1.6 image. Panels: population-level time-series profile (measured, k time-points); convolution kernel (unknown, cell-cycle distribution) at t(1), ..., t(k); single-cell time-series profile (unknown, n time-points). Axes: transcript level (0 to 8000); time (0 to 300 min); cell-cycle phase (G1, S, G2/M).]

Figure 1.6: Overview of the deconvolution framework. Estimating cell-cycle dynamics of individual cells from population-level time-series data can be viewed as a deconvolution problem. Here, we formulate deconvolution as a discrete inverse problem g = H × f, in which g is a column vector containing the measured population-level time-series data, H is the convolution kernel, which describes how the cells in the population are distributed over the course of the cell cycle at each time point, and f is a column vector representing the unknown cell-cycle dynamics profile of an average individual cell.

modeled using a linear graphical representation termed a 'branching process'. In contrast

to the traditional circular form of cell-cycle representation (e.g., as in Fig. 1.2),

the branching process enables us to explicitly distinguish the cell cycles of mother and

daughter cells, and allows us to observe cell-cycle events in different cycles. As shown

in Fig. 1.7, the branching process in cloccs is composed of three cell-cycle intervals:

the recovery interval, or R for short, which immediately follows

the release from synchrony and during which initial cells recover from the synchrony

protocol; the cell cycle of mother cells, during which the cells progress through a standard

cell cycle; and the cell cycle of daughter cells, during which daughter cells progress

through a longer daughter-specific cell cycle. According to the branching process,

after synchrony release, the initial population of cells first progresses through an R

interval before entering a standard cell cycle. At the end of the first cycle, cells

divide into mother and daughter cells. Mother cells enter the next standard

cell cycle immediately, and the newborn daughter cells traverse a longer

daughter-specific cell cycle since they require more time to grow. Every time cells


Figure 1.7: Branching process in cloccs. The branching process is composed of three intervals: recovery, cell cycle of mother cells, and cell cycle of daughter cells. (A) The initial population of cells is modeled as a normally distributed cohort, reflecting the variability of cell-cycle positions in the initial population. (B) As the population of cells traverses the cell cycles, the variance of the population cohort increases gradually, reflecting that individual cells progress through the cell cycle at different rates. (C) During each division, a new cohort is generated for the newborn daughter cells, reflecting the asymmetric cell division of budding yeast.

divide, a new daughter-specific branch appears and this process repeats.
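The branching structure can be made concrete with a deliberately simplified, deterministic Python sketch. This is not the cloccs model, which is probabilistic; it only counts cells when every cell follows the same schedule. The function name and the interval lengths (recovery 30 min, mother cycle 75 min, daughter delay 20 min) are hypothetical.

```python
def count_cells(t_end, first_division, cycle, daughter_delay):
    """Count a cell and all of its descendants alive at time t_end.

    The cell divides at first_division and then every `cycle` minutes.
    Each newborn daughter needs `daughter_delay` extra minutes before
    completing its own first cycle, after which it behaves as a mother.
    """
    total = 1
    t = first_division
    while t <= t_end:
        # A daughter is born at time t and starts its own branch.
        total += count_cells(t_end, t + daughter_delay + cycle, cycle, daughter_delay)
        t += cycle
    return total


recovery, cycle, delay = 30, 75, 20     # hypothetical interval lengths (min)
first_division = recovery + cycle       # initial cells pass through R, then one cycle

asymmetric = count_cells(260, first_division, cycle, delay)
symmetric = count_cells(260, first_division, cycle, 0)
print("cells at 260 min with daughter delay:", asymmetric)
print("cells at 260 min without daughter delay:", symmetric)
```

With no daughter delay, all cells divide in lockstep and the count doubles at each division; with the delay, daughters fall behind their mothers, which is the asymmetry the daughter-specific branch represents.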

cloccs models the synchronized population of cells as a normally distributed cohort.

The variance in the initial cell cohort (Fig. 1.7A) reflects the variability of

cell-cycle positions in the initially synchronized population. cloccs assumes that

each cell traverses the cell-cycle branches at a constant velocity, and this velocity

is randomly sampled from a normal distribution. Hence, as the cell

cohort progresses through the branches, the variance of the cohort increases gradually

(Fig. 1.7B), reflecting that the individual cells go through cell cycles at different rates.

When each cohort passes the point of division (Fig. 1.7C), the population expands


in size, and a truncated normally distributed cohort appears on the daughter branch

to represent the newborn population of daughter cells. Through the branching

process, cloccs explicitly models the asymmetric cell division of budding yeast and

accounts for all three factors that cause synchrony loss in the population of cells.
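The growth in cohort variance follows directly from the two assumptions stated above. If a cell starts at position $x_0$ and moves at constant velocity $v$, with $x_0$ and $v$ drawn independently from normal distributions with standard deviations $\sigma_0$ and $\sigma_v$, then its position at time $t$ is $x_0 + vt$ and the cohort variance is $\sigma_0^2 + t^2 \sigma_v^2$. The Python sketch below checks this by simulation. It covers a single cohort only, ignores division and the daughter branch, and uses hypothetical parameter values; it is not the cloccs implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells = 100_000

mu0, sigma0 = 0.0, 5.0     # hypothetical initial position mean and sd
mu_v, sigma_v = 1.0, 0.1   # hypothetical velocity mean and sd

x0 = rng.normal(mu0, sigma0, n_cells)   # variability at release
v = rng.normal(mu_v, sigma_v, n_cells)  # cell-to-cell variability in rate


def cohort_sd(t):
    """Empirical spread of cell-cycle positions at time t."""
    return float(np.std(x0 + v * t))


def predicted_sd(t):
    """Spread implied by independent normal position and velocity."""
    return float(np.sqrt(sigma0**2 + (sigma_v * t) ** 2))


for t in (0, 60, 120, 240):
    print(t, round(cohort_sd(t), 2), round(predicted_sd(t), 2))
```

The spread grows roughly linearly in time once the velocity term dominates, which is why later cycles appear more blurred than earlier ones.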

Using morphological markers—such as budding index (Orlando et al., 2007), flow

cytometric measurement of DNA content (Orlando et al., 2009), and/or fluorescently

tagged molecular markers (Mayhew et al., 2011)—cloccs accurately estimates the

lengths of cell-cycle intervals, the variance in the rate at which cells move through

these intervals, and the positions in the cell cycle at which specific events take place,

such as when DNA replication starts or ends. For the purposes of deconvolving

population-level measurements, cloccs parameters can also be used to precisely

estimate how cells in a population are distributed over the cell cycle at any point in

time following synchrony release. More details are given in the next chapter.

Although we have briefly described the usefulness of cloccs based on the branching

process for budding yeast, the concepts of cloccs are very general and can be

used with other branching processes (e.g., a linear process for the cell cycle

of mutant cells, or a symmetric branching process for symmetric cell division). All that

cloccs requires is the construction of a branching process to model the underlying

cell divisions, and a corresponding closed-form mathematical formulation for

Markov chain Monte Carlo (MCMC) sampling.

1.3.3 The missing piece of deconvolution

We introduced the concept of deconvolution in the previous sections. As demonstrated

in Fig. 1.6, deconvolution can be represented by the formula g = H × f,

where g is the measured cell-cycle time-course profile; H is the deconvolution kernel,

which can be calculated from cloccs; and f is the unknown cell-cycle time-course

profile of average individual cells. However, we desire a higher-resolution

profile of average individual cells, which implies that the number of time points

in f should be much larger than that in g. Therefore, estimating f is not trivial and

involves solving an ill-posed discrete inverse problem. In this thesis, we present

a general deconvolution algorithm that employs a wavelet-basis regularization approach

to accurately estimate the cell-cycle dynamics of average individual cells from

population-level time-series measurements.
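Why the problem is ill-posed can be seen in a small numerical sketch. When H has fewer rows (time points in g) than columns (positions in f), many different profiles f produce exactly the same measurements g. The Python example below uses a random stand-in for H with hypothetical sizes, not a kernel estimated by cloccs, and it does not implement the regularization approach of this thesis.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 5, 20   # hypothetical sizes: 5 measured time points, 20 single-cell positions

# Stand-in kernel: each row is a distribution of cells over the n positions.
H = rng.random((k, n))
H /= H.sum(axis=1, keepdims=True)

# One candidate single-cell profile.
f1 = np.sin(np.linspace(0.0, 2.0 * np.pi, n)) + 2.0

# With full_matrices=True (the default), vt is n x n and its last n - k rows
# span the null space of H, so adding one of them leaves H @ f unchanged.
_, _, vt = np.linalg.svd(H)
null_vec = vt[-1]
f2 = f1 + 3.0 * null_vec

g1 = H @ f1
g2 = H @ f2
print("same measurements:", np.allclose(g1, g2))
print("same profiles:", np.allclose(f1, f2))
```

Because the data alone cannot choose between such profiles, some additional constraint on f is needed to select a single estimate.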

1.4 Contribution of our deconvolution framework

The major purpose of this thesis is to provide a methodology that removes synchrony

loss effects from population-level cell-cycle measurements and reveals a detailed cell-cycle

profile at a single-cell level. Compared to the previous approaches reviewed

in Section 2.1, our deconvolution framework has many advantages. The three most

important are:

1. Previous deconvolution algorithms output either a refined cell-cycle profile

(e.g., a smoother cell-cycle transcription profile as in Bar-Joseph et al. (2004))

or peak timing of cell-cycle profiles (e.g., Rowicka et al. (2007)). Our deconvolution

algorithm removes synchrony loss effects and yields a continuous cell-cycle

profile over the whole course of the cell cycle. In addition, the resolution of the

deconvolved profiles is many times higher.

2. Our algorithm can learn distinct cell-cycle profiles for both mother and daugh-

ter cells. Combined with the first feature that our algorithm can reliably es-

timate cell-cycle profiles at fine temporal resolution, we can now distinguish

subtle timing differences between mother and daughter cells, which are typically

obscured in population-level measurements.

3. Deconvolution aims to enhance the features of blurred population measurements

to sharpen the underlying signal. However, many previous deconvolution

methods often end up sharpening noise as well. Our deconvolution algorithm

avoids this problem by formulating an objective function that is Bayesian l1-regularized

using a wavelet basis, and we show in a later chapter that such an approach

can effectively deblur signals while smoothing away noise.

To our knowledge, our deconvolution algorithm is the first approach that can

explicitly learn cell-cycle profiles at a single-cell level over the whole course of the

cell cycle. Although we demonstrate the usefulness of our algorithm in

detail by deconvolving genome-wide transcription profiles (chapter 3), our algorithm

is general and can be applied to many other population-level data sources, such as

nucleosome occupancy measurements, protein expression profiles obtained by West-

ern blots, or measurements in organisms other than budding yeast Saccharomyces

cerevisiae.

1.5 Thesis outline

The rest of the thesis is organized as follows. In chapter 2 we describe the general

framework of our deconvolution algorithm, which can be used to deconvolve different

types of cell-cycle time-series data to reveal a detailed cell-cycle profile at a single-cell

level. In chapter 3, we apply our deconvolution algorithm to learn single-cell

transcription profiles from two independent replicates of a cell-cycle synchrony experiment

in wild-type budding yeast (Orlando et al., 2008), and we carry out various

analyses on the resultant transcript profiles to characterize the deconvolution performance.

In chapter 4, we turn to the network alignment problem and introduce

DOMAIN, a network alignment method that employs a novel direct-edge-alignment

paradigm to detect conserved functional modules (e.g., protein complexes, molecular

pathways) in protein-protein interaction networks across species. We compare

the alignment performance of DOMAIN with that of two widely used alignment approaches,

and show that our approach outperforms these two approaches in most alignment

performance metrics. We also show that our approach enables us to detect some

cell-cycle-related functional modules between budding yeast and fruit fly protein-protein

interaction networks. In chapter 5, we draw some conclusions regarding

the present and future states of cell-cycle deconvolution algorithms.


2

The deconvolution framework

In this chapter, we present the general deconvolution framework, which aims at re-

moving synchrony loss effects from time-series data collected at population level, and

recovering cell-cycle profiles at a single-cell level. In the first part, we introduce some

previous deconvolution approaches, and then we describe our deconvolution framework

as well as some technical details of our deconvolution algorithm.

Although this is a generalized algorithm that can be applied to many organisms, we

focus on the model organism budding yeast and demonstrate the technical details of our

algorithm based on its asymmetric cell cycle. Most of the work presented in this chapter

and in chapter 3 appeared in Mayhew et al. (2012) and Guo et al. (2012a).

2.1 Previous deconvolution algorithms

A few studies have attempted to deconvolve time-series microarray data to survey

either transcript levels (Bar-Joseph et al., 2004; Qiu et al., 2006) or peak expres-

sion timing (Rowicka et al., 2007) during the cell cycle in budding yeast. These

approaches modeled variability in cell-cycle progression rate, but ignored the sig-

nificant synchrony loss caused by asymmetric cell division. As a result, they may


not be well-suited to budding yeast data, and certainly cannot distinguish the cell-

cycle transcription programs of mother and daughter cells. Another more recent

study (Siegal-Gaskins et al., 2009) developed a transcription deconvolution method

for Caulobacter cells that was used to deconvolve the transcription profiles of ten

cell-cycle-regulated genes in that bacterium. In the following, we briefly review these

methods, and compare them in various aspects.

Lu et al. (2003): The first work to elaborate the concept of deconvolution

in population-level cell-cycle measurements. However, the method assumes a set of

perfectly synchronized expression values, and cannot be directly used to deconvolve

time series expression data.

◦ Species: Eukaryote, budding yeast Saccharomyces cerevisiae.

◦ Data type: 1. Basis experiments from synchronized cell-cycle experiments. 2.

Static transcription levels from populations of cells grown in a wide variety of

conditions.

◦ Cell-cycle phases: G1, S, G2, M, and M-to-G1

◦ Synchrony loss model: None. The fractions of cells in five cell-cycle phases

were determined from the basis experiments.

◦ Synchrony loss factors recovered: No synchrony loss consideration; static tran-

scriptional level at one time point is considered.

◦ Deconvolution model: Used transcriptional peaks of some characterized cell-

cycle genes to determine cell-cycle phases. Used a system of weighted linear

equations to fit the measured static transcription.

◦ Deconvolution outputs: Transcriptional levels of a gene at each cell-cycle phase.


◦ Number of genes used as cell-cycle-regulated: From the literature, the authors selected

696 genes as cell-cycle-dependent.

◦ Resolution in the deconvolved profiles: Static transcription; not applicable.

Bar-Joseph et al. (2004): The work introduced a cell-cycle synchrony loss model

based on the budding index and fluorescence-activated cell sorting (FACS) data.

However, the major synchrony loss factor, asymmetric cell division, was not consid-

ered in this model. The work focused on reducing noise in experimental measure-

ments.


◦ Data type: 1. Budding index or FACS data. 2. Synchronized cell-cycle time

course.

◦ Cell-cycle phases: G1, S, G2/M

◦ Synchrony loss model: Used budding index (or FACS) data to estimate the

duration of each cell-cycle phase and cell growth variance in the population.

◦ Synchrony loss factors recovered: Variability in cell-cycle rate.

◦ Deconvolution model: Used cubic splines to fit the time-series transcriptional

data.

◦ Deconvolution outputs: Refined transcription profiles of the first and the second

cell cycles.

◦ Number of genes inferred as cell-cycle-regulated: Inferred around 900 cell-cycle-

regulated genes.

◦ Resolution in the deconvolved profiles: Not explicitly estimated.


Qiu et al. (2006): The work focused on reducing variability of cell-cycle rates

in the population, and introduced a synchronization loss model by modeling the

gene expression measurements as a superposition of different cell populations going

through cell cycles at different rates.

◦ Species: Eukaryote: budding yeast Saccharomyces cerevisiae.

◦ Data type: Synchronized cell-cycle time course.

◦ Cell-cycle phases: Not specified.

◦ Synchrony loss model: Used a mixture model to account for cells traversing

through cell cycles at slightly different rates.

◦ Synchrony loss factors recovered: Variability in cell-cycle rates.

◦ Deconvolution model: Used polynomial model to fit the time-series transcrip-

tion data.

◦ Deconvolution outputs: Refined transcription profiles.

◦ Number of genes inferred as cell-cycle-regulated: Not discussed.


Rowicka et al. (2007): The work introduced an algorithm that uses a regularization

approach based on the maximum-entropy principle to determine transcription peak

timing of cell-cycle-regulated genes. However, the work only focused on transcription

peaks and reported the transcription peak timing of genes, not “true” transcriptional

profiles.



◦ Data type: cell-cycle time course of a synchronized population of cells in yeast

metabolic culture (YMC).

◦ Cell-cycle phases: G1, G1-to-S, S, G2, G2-to-M, M, M-to-G1

◦ Synchrony loss model: Used transcription peaks of some characterized cell-

cycle-regulated genes to determine cell-cycle phases and sub-phases.

◦ Synchrony loss factors recovered: Synchrony noise in initial populations.

◦ Deconvolution model: Used a regularization approach based on the maximum-entropy

principle.

◦ Deconvolution outputs: Timing of the transcription peaks (and in some cases

secondary transcription peaks).

◦ Number of genes inferred as cell-cycle-regulated: Inferred 694 high-confidence

cell-cycle-regulated genes, with an extended set of 1,129 genes.

◦ Resolution in the deconvolved profiles: Resolution of transcription peaks around

2 min (≈2% of one cell cycle).

Siegal-Gaskins et al. (2009): The work estimated the proportion of different cell-

types (SW, ST) at cell division, and used these estimates to model the synchrony

loss by asymmetric cell division.

◦ Species: Bacterium, Caulobacter crescentus.

◦ Data type: Synchronized cell-cycle time course.

◦ Cell-cycle phases: SW, EPD, LPD, ST.

◦ Synchrony loss model: Used a probabilistic model to estimate the total cycle

time, SW-to-ST transition point, and cell-cycle distributions.


◦ Synchrony loss factors recovered: 1. Variability in cell-cycle rates. 2. Vari-

ability in the physiological and developmental state of the cell (asymmetric

cell-cycle division)

◦ Deconvolution model: Converted the deconvolution problem to an optimization

problem, using cross-validation to select an appropriate control parameter.

◦ Deconvolution outputs: Single-cell-like transcriptional profiles.

◦ Number of genes inferred as cell-cycle-regulated: Not discussed.


2.2 General deconvolution objective function

Our deconvolution framework employs a wavelet-basis regularization approach to ex-

plicitly learn distinct cell-cycle profiles for both mother and daughter cells. The reg-

ularization objective function of our deconvolution framework includes two parts—a

solution norm to measure the goodness-of-fit and a residual norm to measure the

smoothness of the estimates.

In detail, let $f \in \mathbb{R}^n$ be a vector of size $n$ whose elements represent the average levels of some molecular species in individual cells at various points in the cell cycle; let $H \in \mathbb{R}^{t \times n}$ be a convolution matrix that transforms values from the individual cell level to the population level; and let $g \in \mathbb{R}^t$ be a measured population-level time series with $t$ time points. As described in the previous chapter, estimating $f$ involves solving an ill-posed discrete inverse problem: $Hf = g$. The solution norm (goodness-of-fit) is then calculated as $\|Hf - g\|_2^2$, where $\|\cdot\|_2$ denotes the $\ell_2$ norm.

To avoid over-fitting, we use a residual norm to ensure a smooth estimate of $f$. The composition of this residual norm is based on our prior knowledge about what the cell-cycle profiles of average individual cells look like, which is related to the underlying cell-cycle model. For example, we generally expect the cell-cycle profile of a cell to be smooth within a cycle, but the transition between the end of one cycle and the start of the next need not be continuous or smooth. To quantify the smoothness over cell-cycle intervals, we introduce a wavelet basis (Jansen, 2001). The general residual norm can then be written as $\|fW\|_1$, where $W$ is an orthonormal wavelet-basis matrix and $\|\cdot\|_1$ denotes the $\ell_1$ norm.

Putting the solution norm and the residual norm together, our general deconvo-

lution objective function is written as

$$\arg\min_f \; \|Hf - g\|_2^2 + \gamma \|fW\|_1 \tag{2.1}$$

where $\gamma$ is a regularization parameter that controls the trade-off between the solution norm and the residual norm. When deconvolving microarray transcription data, use of an $\ell_2$ norm for $Hf - g$ is dubious since it represents an assumption of additive Gaussian error, whereas transcript level measurements collected using Affymetrix arrays are generally presumed to exhibit multiplicative Gaussian error. To model multiplicative error, we can transform $Hf$ and $g$ into log-space, yielding a more appropriate solution norm $\|\log Hf - \log g\|_2^2$, and the corresponding objective function is

$$\arg\min_f \; \|\log Hf - \log g\|_2^2 + \gamma \|fW\|_1 \tag{2.2}$$

However, this objective function is no longer convex. To recover convexity, we ap-

proximate this more appropriate objective function using a first-order Taylor series

expansion as

$$\arg\min_f \; \left\| \frac{Hf}{g} - 1 \right\|_2^2 + \gamma \|fW\|_1 \tag{2.3}$$

which is convex, so any local optimum is also a global optimum; here the division $Hf/g$ is taken element-wise. Constraints can be added to this objective function according to the type of input data. For example, when deconvolving microarray transcription data, we may require $f \ge 0$ because actual transcript levels are always non-negative, and when deconvolving budding index data, we can instead use the original objective function of Eq. (2.1), requiring $f \in [0, 1]$ because the fraction of budded cells is always between 0 and 1.
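The first-order approximation behind Eq. (2.3) can be spelled out as follows. This is a sketch under the assumption that the ratio $(Hf)_i / g_i$ stays close to 1 at every time point $i$; the index $i$ and the element-wise reading of the division are notational assumptions introduced here for illustration.

```latex
\begin{align*}
\log (Hf)_i - \log g_i
  &= \log\!\left(1 + \left(\frac{(Hf)_i}{g_i} - 1\right)\right)
   \approx \frac{(Hf)_i}{g_i} - 1, \\
\|\log Hf - \log g\|_2^2
  &\approx \sum_{i=1}^{t} \left(\frac{(Hf)_i}{g_i} - 1\right)^2
   = \left\| \frac{Hf}{g} - 1 \right\|_2^2 .
\end{align*}
```

Because each term $(Hf)_i / g_i - 1$ is affine in $f$, the approximated fit term is a convex quadratic in $f$, which is consistent with the convexity noted above.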

In the next sections, we discuss some technical issues related to our deconvolution objective function, namely, (1) how to specify the residual norm; (2) how to select a regularization parameter $\gamma$; (3) how to choose the orthonormal wavelet-basis matrix $W$; and (4) how to jointly learn a single $f$ from multiple replicates.

2.3 Branching process in deconvolution

Our deconvolution framework is built upon the cell-cycle parameters of cloccs, and it employs a more detailed branching process than the original one in cloccs. We need a branching process in deconvolution for two reasons: (1) As in cloccs, the branching process models the underlying cell-cycle progression, such as asymmetric cell division in budding yeast. (2) By decomposing the cell-cycle branches into small cell-cycle intervals and then building up connections between these intervals, we are able to model the cell-cycle dynamics under different assumptions and formulate corresponding solution norms.

Fig. 2.1 illustrates a tree-like map of the branching process models in deconvolution. The model at the root of this map has maximal flexibility. In this model, R indicates the recovery interval, representing the interval immediately following release from synchrony; rG1, indicating the first G1 phase immediately following the R interval, together with postG1 (including the S and G2/M phases) forms the first standard cell cycle. Similarly, cG1, indicating the standard G1 phase of mother cells, together with postG1 forms the second and all following standard cell cycles. The bottom cell-cycle branch of this model contains three intervals: D, dG1, and postG1. D is a daughter-specific interval whose length in time is equal to the time difference between the cell cycles of mother and daughter cells. dG1 indicates the daughter-specific G1 whose length in time is equal to that of rG1 or cG1. The only assumption of this model is that after the G1 checkpoint, mother and daughter cells traverse the postG1 phases with the same cell-cycle programs. This model includes six free cell-cycle intervals, R, rG1, cG1, D, dG1, and postG1, and there are no constraints on the G1 phases. Specifically, the cell-cycle programs of the three G1 intervals, namely G1 for the cells in the first cycle (rG1), G1 for the mother cells starting from the second cycle (cG1), and G1 for the daughter cells (dG1), could be totally different.

Figure 2.1: Branching process in deconvolution. The map shows the biologically interpretable branching process models in deconvolution. At each split, either some constraints between cell-cycle intervals or some biological implications are introduced. Intervals in the same color indicate that the cell-cycle programs in these intervals are the same, although sometimes stretched. The branching process models in the gray box are the models with the same number of free cell-cycle intervals. (Model labels appearing in the figure: 1; 1.1, 1.2; 1.1.1, 1.1.2, 1.2.1, 1.2.2, 1.2.3; 1.1.1.1, 1.1.1.2, 1.1.2.1, 1.2.1.1, 1.2.1.2, 1.2.1.3, 1.2.1.4.)

This root model, labeled 1, has two nested models. The left one, labeled 1.1, is based on the assumption that the cell-cycle programs in rG1 are the same as the programs in cG1. That is, the cell-cycle programs of mother cells are all the same after the R interval. The right one, labeled 1.2, concatenates the intervals R and rG1 into a new RG1 interval. Merging these two intervals does not reduce the flexibility of the model, but it reduces the number of free cell-cycle intervals. It also has a different biological implication compared to model 1: there is no boundary between the cell-cycle programs of R and rG1. Thus, during this RG1 interval, the cells in the initial population not only recover from low temperatures or other stress from arrest, but also prepare for DNA replication and mitosis.

Similarly, at each split of this tree-like map, new cell-cycle interval constraints or biological implications are introduced. To save space, we do not elaborate on every model, but instead describe the details of three models, labeled 1.1.1, 1.1.2.1, and 1.2.1.1. These models are of particular interest in cell-cycle modeling, and they are the ones actually used in our analysis.

Model 1.1.1 assumes that the D and dG1 intervals should be merged. According to this model, the initial population of cells after release first progresses through an R interval to recover from arrest, and then traverses a standard cell cycle composed of the cG1 and postG1 intervals. At each division, daughter cells are born, and they go through a DG1 interval whose length in time is longer than CG1, and then progress through the postG1 interval. During this DG1 interval, the daughter cells not only grow to reach the critical cell size, but also prepare for DNA replication and mitosis as the mother cells do in G1. This is the model we used to deconvolve wild-type transcriptional profiles of budding yeast in Chapter 3.

Model 1.1.2.1 gives an alternative interpretation of the cell cycle of budding yeast. After the initial cells traverse the R interval for recovery, they progress through a standard cell cycle (C). Daughter cells first traverse a daughter-specific D interval to grow and reach the critical cell size, and then progress through a standard cell cycle. According to this model, the cell cycle of daughters consists of a standard cell cycle preceded by a daughter-specific growth interval. We attempted to use this model to deconvolve the wild-type transcriptional profiles of budding yeast, and we found that the previous model 1.1.1 is more reasonable because it places fewer constraints on the daughter-specific G1 phase.

Another interesting model, 1.2.1.1, suggests that the cell-cycle programs in the interval from release until the first postG1 are the same as the cell-cycle programs in CG1 but run at a slower pace. This assumption is plausible because the initially collected cells are typically small cells early in G1, so they need to carry out at least the same preparation as mother cells do in the CG1 interval.

2.4 Introduction to wavelets: selection of wavelets

In this section, we first briefly give some background on wavelet transforms, and then we introduce a few specific wavelet families that are useful in constructing the orthonormal wavelet-basis matrix $W$.

The wavelet transform is one of the mathematical transformations applied to a signal to obtain information that is not readily available in the raw signal. In contrast to the Fourier transform, which converts a signal from time versus amplitude to frequency versus amplitude across the whole time domain, the wavelet transform decomposes a signal into components at different scales. It can therefore provide both the frequency content of a signal in local domains and the times associated with those frequencies, making it very effective for analyzing non-periodic signals and convenient for applications in numerous fields, such as audio and image processing. For more information about wavelets, see Mallat (1989), Mallat (1999), Daubechies (1992), and Burrus et al. (1998).

Wavelet transforms are classified into discrete wavelet transforms (DWTs) and continuous wavelet transforms (CWTs); we use DWTs in this work. An important property of a wavelet is its number of vanishing moments: having $p$ vanishing moments means that the wavelet coefficients of any polynomial of degree up to $p-1$ are zero. That is, any polynomial signal of degree up to $p-1$ can be represented completely in the scaling space. In theory, more vanishing moments mean that the scaling function can represent more complex signals accurately. $p$ is also called the accuracy of the wavelet.

There exist many discrete wavelets, and here we list a few that are useful in this study:

◦ Haar: The Haar wavelet, proposed in 1910 (Haar, 1910), is the simplest possible wavelet. It has a unique advantage for the analysis of signals with sudden transitions, such as monitoring of tool failure in machines. In our work, we used it to decompose the budding index signal, since a cell has only two states, budded and unbudded.

◦ Daubechies: The Daubechies wavelets are a family of orthogonal wavelets defining a discrete wavelet transform and characterized by a maximal number of vanishing moments for some given support. With each wavelet type of this class, there is a scaling function (also called the father wavelet) which generates an orthogonal multi-resolution analysis (Daubechies, 1992). The Haar wavelet is the special case of the Daubechies wavelet with one vanishing moment (two filter coefficients).

◦ Symmlets: The Symmlet wavelets also have minimum-size support for a given number of vanishing moments, but they are as symmetrical as possible, as opposed to the Daubechies filters, which are highly asymmetrical. When deconvolving gene expression profiles, we employ Symmlets instead of the popular Daubechies wavelets because of this high symmetry.
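To make the role of an orthonormal wavelet basis concrete, the following minimal sketch builds a Haar analysis matrix and applies it to a budding-like step profile. It is an illustration only, not the implementation used in this work: the function names `haar_matrix` and `transform`, the toy signal, and the convention of storing basis vectors as rows (so that coefficients are computed as a matrix-vector product) are all assumptions introduced here.

```python
import math

def haar_matrix(n):
    """Return the n x n orthonormal Haar analysis matrix (n a power of two) as a list of rows."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    rows = [[1.0]]
    s = 1.0 / math.sqrt(2.0)
    while len(rows) < n:
        m = len(rows)
        # Coarser-scale rows: every existing entry is repeated over a pair of samples.
        coarse = [[v * s for v in row for _ in (0, 1)] for row in rows]
        # Finest-scale detail rows: difference of each adjacent pair of samples.
        detail = [[0.0] * (2 * m) for _ in range(m)]
        for k in range(m):
            detail[k][2 * k] = s
            detail[k][2 * k + 1] = -s
        rows = coarse + detail
    return rows

def transform(W, f):
    """Wavelet coefficients of f, one per basis row of W."""
    return [sum(w * x for w, x in zip(row, f)) for row in W]

W = haar_matrix(8)
step_profile = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0]  # toy unbudded/budded pattern
step_coeffs = transform(W, step_profile)
const_coeffs = transform(W, [3.0] * 8)  # constant signal: only the scaling coefficient survives
l1_penalty = sum(abs(c) for c in step_coeffs)  # the quantity an l1 wavelet penalty would measure
```

In this toy case the step profile is captured by only three nonzero Haar coefficients, which illustrates why an $\ell_1$ penalty in a Haar basis favors piecewise-constant estimates.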

2.5 Selecting a regularization parameter

A critical step in deconvolving cell-cycle time-series data is to identify a good regularization parameter $\gamma$. A reasonable choice of $\gamma$ avoids both over-fitting and over-smoothing and gives biologically interpretable deconvolved estimates. As illustrated in Fig. 2.2, we first determined a region of $\gamma$ that represents a reasonable trade-off between the goodness-of-fit term (e.g., $\|Hf - g\|_2^2$ in Eq. (2.1) or $\left\|\frac{Hf}{g} - 1\right\|_2^2$ in Eq. (2.3)) and the smoothness term ($\|fW\|_1$). Next, we selected a target-specific optimal regularization parameter $\gamma$ by locating the point of maximum curvature on the L-curve (Hansen, 1992) within this region.
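The maximum-curvature step can be sketched as follows. This is a hedged illustration rather than the procedure actually used here: it assumes the fit and smoothness terms have already been evaluated on a grid of candidate $\gamma$ values inside the chosen region, and it uses a discrete three-point (Menger) curvature on the log-log curve as a stand-in for whatever curvature estimate was used. The function name `lcurve_corner` and the synthetic data are hypothetical.

```python
import math

def lcurve_corner(gammas, fit_terms, smooth_terms):
    """Return the gamma at the point of maximum discrete curvature of the log-log L-curve."""
    x = [math.log(v) for v in fit_terms]
    y = [math.log(v) for v in smooth_terms]
    best_gamma, best_kappa = None, -1.0
    for i in range(1, len(gammas) - 1):
        a, b, c = (x[i - 1], y[i - 1]), (x[i], y[i]), (x[i + 1], y[i + 1])
        # Twice the triangle area spanned by three consecutive points.
        area2 = abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]))
        sides = math.dist(a, b) * math.dist(b, c) * math.dist(a, c)
        # Menger curvature: 4 * area / (product of side lengths).
        kappa = 2.0 * area2 / sides if sides > 0 else 0.0
        if kappa > best_kappa:
            best_gamma, best_kappa = gammas[i], kappa
    return best_gamma

# Synthetic, perfectly L-shaped curve with its corner at the fourth grid point.
gammas = [0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007]
fit_terms = [math.exp(v) for v in (0, 0, 0, 0, 1, 2, 3)]
smooth_terms = [math.exp(v) for v in (3, 2, 1, 0, 0, 0, 0)]
corner = lcurve_corner(gammas, fit_terms, smooth_terms)
```

On real data the curve is noisy, so restricting the search to a pre-selected region of $\gamma$, as described above, helps avoid spurious corners.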

Figure 2.2: Selection of a regularization parameter $\gamma$. First, a region of $\gamma$ that represents a reasonable trade-off between the goodness-of-fit term and smoothness term is identified (gray region). Next, a target-specific optimal regularization parameter $\gamma$ is selected within this region by calculating the maximum curvature on the L-curve (Hansen, 1992) (red triangle).

2.6 Joint learning from multiple replicates

Our convolution kernel design allows us to learn a single robust transcription profile jointly from multiple experimental replicates. For example, in the case of two replicates, we can construct convolution kernels $H_1$ and $H_2$ for the two replicates using their respective cloccs parameter estimates. To ensure the matrices refer to the same points along the branching process, we should use the same number of subintervals on the various cell-cycle branches when constructing both $H_1$ and $H_2$. In this manner, corresponding columns in $H_1$ and $H_2$ represent the fractional population estimate for the same subinterval along the cell-cycle branches under the two experimental conditions. Then, we can construct a joint convolution kernel $H_J = [H_1^\top \; H_2^\top]^\top$, where $\top$ denotes the transpose. Similarly, we can construct a joint population-level time series $g_J = [g_1^\top \; g_2^\top]^\top$ for a target with two replicates. Although we have two replicates, we only need to learn a single deconvolved profile $f$: we simply replace $g$ and $H$ within the objective function with $g_J$ and $H_J$, respectively. Generally, this jointly learned $f$ is more robust and accurate than the $f$ learned from a single experiment.
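The stacking of replicates can be sketched in a few lines. The helper names `stack_replicates` and `fit_term` and the toy numbers are hypothetical and serve only to show that the joint fit term is the sum of the per-replicate fit terms for a shared profile.

```python
def stack_replicates(H1, g1, H2, g2):
    """Stack two replicate kernels (lists of rows) and measurements into H_J and g_J."""
    n = len(H1[0])
    if any(len(row) != n for row in H1 + H2):
        raise ValueError("H1 and H2 must share the same number of subintervals (columns)")
    if len(H1) != len(g1) or len(H2) != len(g2):
        raise ValueError("each kernel needs one row per time point")
    return H1 + H2, g1 + g2

def fit_term(H, f, g):
    """Squared l2 norm of H f - g."""
    return sum((sum(h * x for h, x in zip(row, f)) - gi) ** 2 for row, gi in zip(H, g))

# Toy data: two replicates, two time points each, three subintervals.
H1 = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]]
H2 = [[0.6, 0.3, 0.1], [0.1, 0.5, 0.4]]
g1 = [1.0, 2.0]
g2 = [1.1, 2.2]
f = [0.5, 2.0, 3.0]  # a single candidate single-cell profile shared by both replicates

HJ, gJ = stack_replicates(H1, g1, H2, g2)
joint = fit_term(HJ, f, gJ)
separate = fit_term(H1, f, g1) + fit_term(H2, f, g2)
```

Because the rows are simply concatenated, the same column index in both kernels must refer to the same cell-cycle subinterval, which is why both kernels need identical subinterval counts.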


3

Deconvolution of wild-type cell-cycle transcriptional profiles of budding yeast

In the previous chapter, we have introduced the general framework of our decon-

volution algorithm. To demonstrate the usefulness of our method, we applied it

to a recent cell-cycle transcription time course in the eukaryote Saccharomyces cere-

visiae (Orlando et al., 2008). The input data is genome-wide cell-cycle transcriptional

profiles at a temporal resolution of 16 minutes, and the output is jointly learned tran-

scription profiles at a nominal temporal resolution of less than one minute, with dis-

tinct transcription programs learned for mother and daughter cells. In this chapter,

we show various analyses that we carry out on the resultant deconvolved transcrip-

tional profiles to characterize the performance of our deconvolution algorithm.

3.1 Experimental data

We apply our deconvolution algorithm to learn single-cell transcription profiles jointly from two independent replicates of cell-cycle synchrony experiments in wild-type budding yeast (Orlando et al., 2008). The experiments collected populations of synchronized early G1 cells by centrifugal elutriation. Two wild-type time-series replicates were collected with 15 samples taken at 16-minute intervals in each, starting 30 minutes after release in the first replicate and 38 minutes after release in the second. Both replicates covered approximately two complete cell cycles. For each replicate, both budding index and flow cytometry data were collected 32 times at 8-minute intervals, starting 30 minutes after release (Orlando et al., 2009). Budding index was measured by light microscopy, recording the number of budded and unbudded cells observed out of at least 200 cells. The DNA content of 10,000 cells per sample was measured by flow cytometry as described in Haase and Reed (1999). We downloaded the mRNA expression datasets from http://www.biology.duke.edu/haaselab/publicData/index.html; for genes with multiple probes, we averaged the transcript levels across the probes. Consequently, we were left with measured transcription profiles of 5,670 unique genes.

3.2 Branching process model and cell-cycle parameters

3.2.1 Branching process model

To deconvolve wild-type cell-cycle transcriptional profiles, we decompose the full

branching process of cloccs into four kinds of intervals (Fig. 3.1A): R (recovery)

represents the interval immediately following release from synchrony, during which

initial cells recover from the synchrony protocol; G1 and DG1 (daughter-specific

G1) represent G1 phases of mother and daughter cells, respectively; and postG1

represents the interval immediately following G1 or DG1, during which mother and

daughter cells progress through S, G2, and M. According to this model, after syn-

chrony release, cells progress through the R interval before entering a standard cell

cycle (G1 followed by postG1). At the end of the first cycle, cells divide into mother

and daughter cells; mother cells enter another standard cell cycle, while newborn

daughter cells instead traverse DG1 before entering postG1. Every time a cell divides, a new branch appears and this process repeats.

Figure 3.1: Overview of deconvolution algorithm. (A) Branching process in deconvolution. The full branching process is split into four kinds of intervals, R, G1, DG1, and postG1 (including S and G2/M phases). (B) Deconvolution is formulated as an ill-posed discrete inverse problem $g = H \times f$, in which $g$ is a column vector containing the measured population-level time-series data ($k$ time points), and here the real transcription profile of the G1 cyclin CLN2 is plotted (transcript level versus time in minutes); $H$ is the convolution kernel (cell-cycle distribution) calculated from cloccs parameters; and $f$ is a column vector ($n$ time points) representing the components of the unknown dynamic profile of an average individual cell. After deconvolution, we can learn smooth estimates for the four components of $f$, corresponding to the intervals R, G1, postG1, and DG1; we consistently color the intervals R, G1, postG1, and DG1 in red, blue, orange, and cyan respectively throughout this chapter.

3.2.2 Cell-cycle parameters from cloccs

Given the specified branching process model, we use cloccs to learn the cell-cycle parameters. Two types of data are available for cloccs: budding index data and flow cytometry data. For deconvolving all measured transcription profiles, we applied cloccs to learn cell-cycle parameters from both flow cytometry and budding index (Orlando et al., 2009). The parameters learned only from flow cytometry were used for deconvolving budding index profiles, because deconvolving budding index profiles with the aid of parameters learned from those profiles would produce overly optimistic estimates of deconvolution performance. The learned cell-cycle parameters are listed in Table 3.1.

Table 3.1: Cell-cycle parameters estimated by cloccs from flow cytometric measurements of DNA content and budding index.

                                    Flow and budding        Flow only
Cell-cycle parameter                WT1       WT2           WT1       WT2
length of R (minutes)               94.387    101.904       94.279    101.954
length of C (minutes)               79.487    82.014        79.647    81.965
length of DG1 (minutes)             44.318    37.436        44.326    37.425
length of G1 (fraction of C)        0.153     0.165         -         -
length of G1+S (fraction of C)      0.349     0.391         0.349     0.391
length of G2+M (fraction of C)      0.651     0.609         0.651     0.609

3.3 Deconvolution model

3.3.1 Deconvolution objective function

According to the branching process model, we can split each gene’s single-cell transcription profile $f$ into four distinct blocks as $f = [f_R \; f_{G1} \; f_{DG1} \; f_{postG1}]$, representing the transcription profile during subintervals R, G1, DG1, and postG1, respectively. We expect that the estimated profile $[f_R \; f_{G1} \; f_{postG1}]$ should be smooth since it prevails during the cell-cycle progression of initial cells, and similarly the profile $[f_{DG1} \; f_{postG1}]$ should be smooth since it prevails during the cell-cycle progression of daughter cells. Then the objective function is

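$$\arg\min_f \; \left\| \frac{Hf}{g} - 1 \right\|_2^2 + \gamma \left( \left\| [f_R \; f_{G1} \; f_{postG1}] W_1 \right\|_1 + w \left\| [f_{DG1} \; f_{postG1}] W_2 \right\|_1 \right) \tag{3.1}$$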

where $\|\cdot\|_1$ and $\|\cdot\|_2$ respectively denote the $\ell_1$ and $\ell_2$ norms, $\gamma$ is a regularization control parameter, $W_1$ and $W_2$ are orthonormal wavelet-basis matrices, and $w$ simply scales the two regularization terms to account for the different lengths of the intervals they cover; we always set $w = 1.5$ because the amount of time spent in R + G1 + postG1 (regularized by $W_1$) is roughly 1.5 times as long as the amount of time spent in DG1 + postG1 (regularized by $W_2$). Here, we require $f \ge 0$ for deconvolving microarray transcription data, because actual transcript levels are always non-negative, and we select Symlet ($N = 5$) wavelets because of their smoothness and symmetry properties. When deconvolving budding index data, we instead use the objective function

$$\arg\min_f \; \|Hf - g\|_2^2 + \gamma \left( \left\| [f_R \; f_{G1} \; f_{postG1}] W_1 \right\|_1 + w \left\| [f_{DG1} \; f_{postG1}] W_2 \right\|_1 \right) \tag{3.2}$$

and we require f ∈ [0, 1] because the fraction of budded cells is always between

0 and 1, and use Haar wavelets because of their step-function properties. In each

case, we performed constrained optimization of the respective convex function using

the MATLAB convex optimization package CVX, version 1.2 (Grant and Boyd, 2008,

2010).

3.3.2 Constructing a convolution kernel

cloccs enables us to determine the cell-cycle distribution of a cell population at any given time and to estimate the fraction of cells within any given cell-cycle subinterval. Using the cell-cycle position distributions from cloccs (learned parameters characterizing these distributions for each experiment are listed in Table 3.1), we can construct a convolution kernel $H \in \mathbb{R}^{t \times n}$, where $t$ denotes the number of time-series observations in the population-level measurements $g$, and $n$ denotes the total number of subintervals along the various cell-cycle branches. Specifically, the entry $h_{ij}$ of $H$ quantifies the fraction of cells within a given subinterval $j$ at a given time $i$. To obtain high temporal resolution, $n$ is chosen to be much larger than $t$. In our case, $t = 15$ and $n = 258$, since we used a total of 258 subintervals for deconvolving transcription profiles: R has 88, G1 has 42, DG1 has 86, and postG1 has 42. In our implementation, we used padding entries and mirror reflections in both directions of $f$ to remove the edge effects caused by circular wavelet packets (Lord and Wheals, 1981; Mallat, 2008).
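The shape and meaning of such a kernel can be illustrated with a deliberately simplified sketch. It is not the cloccs model: it assumes a single lineage with no branching, a Gaussian distribution of cell-cycle position whose spread grows linearly with time, and hypothetical parameter values (`length`, `sigma0`, `spread_rate`). Only the sampling times (15 samples, 16 minutes apart, starting at 30 minutes) and the total of 258 subintervals are taken from the text.

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def build_kernel(times, n, length, sigma0, spread_rate):
    """Toy kernel: row i holds the fraction of cells in each of n equal subintervals at times[i]."""
    edges = [length * j / n for j in range(n + 1)]
    H = []
    for tau in times:
        sigma = sigma0 + spread_rate * tau  # synchrony loss: the spread grows with time
        row = [normal_cdf(edges[j + 1], tau, sigma) - normal_cdf(edges[j], tau, sigma)
               for j in range(n)]
        total = sum(row)
        H.append([v / total for v in row])  # renormalize the mass truncated at the ends
    return H

def convolve(H, f):
    """Population-level signal g = H f."""
    return [sum(h * x for h, x in zip(row, f)) for row in H]

times = [30 + 16 * k for k in range(15)]  # 15 samples at 16-minute intervals
n = 258
H = build_kernel(times, n, length=300.0, sigma0=5.0, spread_rate=0.05)  # hypothetical values

# A square single-cell profile (for example, budded versus unbudded) gets blurred in the population.
f = [1.0 if 100 <= j < 180 else 0.0 for j in range(n)]
g = convolve(H, f)
```

Each row of `H` sums to one, so every population-level measurement is a weighted average of the single-cell profile; with $n$ much larger than the number of time points, recovering `f` from `g` is underdetermined, which is why regularization is needed.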

3.3.3 Selecting a regularization parameter

As described in the previous chapter, to select a good regularization parameter $\gamma$ for each gene (or budding index) that avoids both over-fitting and over-smoothing, we first determined a region of $\gamma$ that represents a reasonable trade-off between the fit term ($\|Hf - g\|_2^2$ in Eq. (3.2) or $\left\|\frac{Hf}{g} - 1\right\|_2^2$ in Eq. (3.1)) and the smoothness term ($\|[f_R \; f_{G1} \; f_{postG1}] W_1\|_1 + w \|[f_{DG1} \; f_{postG1}] W_2\|_1$). Next, we selected a gene-specific optimal regularization parameter $\gamma$ by calculating the maximum curvature on the L-curve (Hansen, 1992) within this region. Precise details for selecting the regularization parameter $\gamma$ are given in Fig. 3.2.

Input: Observed transcription profile g
Output: Regularization parameter γ̂ and deconvolved transcription profile f

Deconvolution(g, γ ← 0, ...) ⇒ ε_0                  ▷ determine ε_0 as the best-fit estimator (no smoothing)
ε_l ← min(ε_0 × φ_l, ε_0 + ε_l)                     ▷ the left fit error boundary ε_l
BinarySearch(ε ← ε_l, γ ∈ [0.001, 0.01]) ⇒ γ_l      ▷ search for the left boundary of γ
ε_r ← max(ε_0 × φ_r, ε_0 + ε_r)                     ▷ the right fit error boundary ε_r
BinarySearch(ε ← ε_r, γ ∈ [γ_l, 0.01]) ⇒ γ_r        ▷ search for the right boundary of γ
FindElbow(γ ∈ [γ_l, γ_r]) ⇒ γ̂                       ▷ determine γ̂ at the elbow of the L-curve
Deconvolution(g, γ ← γ̂, ...) ⇒ f                    ▷ deconvolve using γ̂, determine f

Figure 3.2: Detailed algorithm for selecting a regularization parameter γ. We set the fit error boundaries φ_l = 1.05, ε_l = 0.04, φ_r = 1.40, and ε_r = 0.32.
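The boundary-search steps of Fig. 3.2 can be sketched as below. This is an illustration under stated assumptions, not the original implementation: `fit_error` stands in for running the deconvolution at a given $\gamma$ and returning its fit error, it is assumed to be nondecreasing in $\gamma$, and the quadratic `toy_fit_error` is a hypothetical placeholder. The elbow search and the final deconvolution are omitted.

```python
import math

def binary_search_gamma(fit_error, target, lo, hi, tol=1e-9):
    """Find gamma in [lo, hi] whose fit error matches target, assuming fit_error is nondecreasing."""
    if fit_error(lo) >= target:
        return lo
    if fit_error(hi) <= target:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fit_error(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def gamma_region(fit_error, phi_l=1.05, eps_l=0.04, phi_r=1.40, eps_r=0.32,
                 gamma_min=0.001, gamma_max=0.01):
    """Return (gamma_l, gamma_r), the region of gamma with an acceptable fit error."""
    eps0 = fit_error(0.0)                               # best-fit error without smoothing
    target_l = min(eps0 * phi_l, eps0 + eps_l)          # left fit error boundary
    gamma_l = binary_search_gamma(fit_error, target_l, gamma_min, gamma_max)
    target_r = max(eps0 * phi_r, eps0 + eps_r)          # right fit error boundary
    gamma_r = binary_search_gamma(fit_error, target_r, gamma_l, gamma_max)
    return gamma_l, gamma_r

def toy_fit_error(gamma):
    # Hypothetical monotone fit-error curve standing in for a real deconvolution run.
    return 1.0 + 8000.0 * gamma * gamma

gamma_l, gamma_r = gamma_region(toy_fit_error)
```

In practice each call to `fit_error` would involve solving the constrained convex problem, so the logarithmic number of evaluations of a binary search keeps the per-gene cost manageable.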


3.3.4 Adjustment of branching process construction from cloccs

The branching process model in our deconvolution algorithm would be identical to

that of the original cloccs branching process if mother and daughter cells separated

immediately upon the completion of mitosis and cytokinesis. In budding yeast,

however, mother and daughter cells remain attached to one another for a period of

time after cytokinesis, until the cell walls can be enzymatically detached (Kuranda

and Robbins, 1991). During this time, although the cells have distinct cytoplasmic

compartments and may be executing distinct transcription programs, they appear

under a microscope to be a single budded cell, which is how they are counted for

the purposes of estimating parameters in the original cloccs branching process.

When producing transcription profiles, we need to shift the branching times in our

branching process by a suitable duration to compensate.

To estimate the duration of this attachment period, we use as biomarkers four

genes DSE1-4 (Daughter-Specific Expression 1-4) known to have daughter-specific

transcription profiles. These are specifically transcribed in the daughter cell early

in the cell cycle (Colman-Lerner et al., 2001). We calibrate the duration of the at-

tachment period to be the smallest duration such that the deconvolved transcription

profiles of all four genes are primarily within DG1. The resultant durations for the

two wild-type replicate experiments are 26 and 27 minutes, respectively; in each case,

the duration is around 1/3 of the cell cycle of mother cells and 1/5 of the cell cycle

of daughter cells.
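The stated fractions can be checked with a short calculation using the flow-and-budding values from Table 3.1. The sketch assumes that the daughter cell cycle lasts the standard cycle length C plus the tabulated DG1 length; this reading of the table is an assumption made here for illustration, and the dictionary layout is hypothetical.

```python
# Values copied from Table 3.1 (flow and budding columns) and the attachment durations above.
replicates = {
    "WT1": {"C": 79.487, "DG1": 44.318, "attach": 26.0},
    "WT2": {"C": 82.014, "DG1": 37.436, "attach": 27.0},
}

fractions = {}
for name, p in replicates.items():
    mother_cycle = p["C"]
    daughter_cycle = p["C"] + p["DG1"]  # assumption: the DG1 length adds to the standard cycle
    fractions[name] = (p["attach"] / mother_cycle, p["attach"] / daughter_cycle)

for name, (mother_frac, daughter_frac) in fractions.items():
    print(f"{name}: {mother_frac:.2f} of mother cycle, {daughter_frac:.2f} of daughter cycle")
```

Under this assumption both replicates give a mother-cycle fraction close to one third and a daughter-cycle fraction close to one fifth.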

3.4 Results

3.4.1 Deconvolving time-series yeast budding index data to assess algorithm accuracy

Perhaps the most important feature of a deconvolution algorithm is the accuracy of its resultant estimates. To assess the accuracy of our method, we first deconvolve measurements of budding index because the true single-cell budding profile is known and thus provides a clear basis for evaluation: yeast cells produce a bud near the start of S phase and remain budded until the end of M phase. Although each wild-type cell is either budded or not budded, time-course budding index measurements appear like damped sinusoids due to synchrony loss in the population over time (Fig. 3.3A, left).

Figure 3.3: Deconvolution recovers dynamic single-cell profiles from population-level data. (A) Joint deconvolution of replicate budding index measurements. The left panel illustrates the two replicate wild-type budding index measurements in red, along with the fit to those time series learned by our algorithm overlaid in green. The right panel shows the deconvolved budding profile, learned jointly from the two replicates. The true budding profile is shown as a dashed line for comparison ($r^2 = 0.99$). (B) Joint deconvolution of replicate transcription profiles for four representative genes (PCL1, CDC20, SSK22, and SIC1). Shown for each gene are two replicate measured transcription profiles in red, the fit to those time series learned by our algorithm overlaid in green, and separate deconvolved transcription profiles for mother and daughter cells. To facilitate cross-comparison, all transcription profiles are normalized so that their maximum levels are the same height; consequently, the increased amplitude produced by deconvolution is not apparent. The cyclin PCL1 peaks late in both G1 and DG1, the APC activator CDC20 peaks during mitosis, and the CDK inhibitor SIC1 is transcribed primarily during DG1. For genes whose two replicate profiles are in poor agreement, such as the MAP kinase SSK22 (Pearson correlation 0.14), our algorithm removes apparent noise; the resultant deconvolved profile smoothly traces the broad trajectory of measured transcript levels across both replicates.

We used cloccs parameters learned only from flow cytometry data (Orlando

et al., 2009) (i.e., without budding index data) to ensure fair assessment of our algo-

rithm’s accuracy. When the two observed population-level budding index measure-

ments are jointly deconvolved, our algorithm predicts the true single-cell budding

profile nearly perfectly: the originally measured damped sinusoids become square

waves with onset near the start of S and offset near the end of M, as desired (Fig. 3.3A,

right).

3.4.2 Deconvolving replicate yeast microarray data to reveal single-cell transcription profiles

Reassured by the performance of our algorithm on budding index data, we jointly

learned deconvolved transcription profiles from two replicate cell-cycle time-course

microarray experiments in budding yeast (Orlando et al., 2008). Our decision to

keep G1 distinct from DG1 allowed us to capture possibly different transcription

programs for mother and daughter cells during G1. However, because both mother

and daughter cells subsequently enter a single postG1 interval, our model assumes

that both kinds of cells share a common transcription program in cell-cycle phases

after G1. The examples, as shown in Fig. 3.3B and Fig. 3.4, highlight the ability

of our deconvolution algorithm to not only sharpen transcription signal, but also

smooth out experimental noise.

[Figure 3.4 graphic omitted. Axes: normalized transcript level (0 to max) versus time (min, 0 to 300) for WT1 and WT2, and over the mother (G1, S, G2/M) and daughter (DG1, S, G2/M) intervals after deconvolution.]

Figure 3.4: Deconvolution is capable of de-noising. Normalized transcription profiles of 129 ribosomal protein genes before and after deconvolution. The median transcription profile in each case is overlaid in red. The average of the 129 peak-to-trough ratio (PTR) scores, which measure the degree of amplitude in expression (defined in Section 3.4.5), decreased from 1.027 to 1.018 after deconvolution, suggesting that our deconvolution algorithm is effective at not sharpening noise.

3.4.3 Deconvolution is robust with respect to uncertainty in input cloccs parameters

One potential concern about the output of our algorithm is that because it relies on

posterior mean estimates of parameters from cloccs, its output might be sensitive

to uncertainty in those parameter estimates. To assess this, we generated a set of 100

deconvolved profiles using 100 random realizations from the cloccs Markov chain,

rather than using the single posterior mean parameterization. Specifically, to obtain

these 100 random parameterizations, we ran 10 independent cloccs Markov chains

with 100,000 iterations after a lengthy burn-in period. Then, we randomly selected

10 parameter estimates from the last 1,000 iterations of each of the 10 Markov chains,

resulting in 100 random parameterizations. These 100 random realizations reflect our

posterior uncertainty about the cloccs parameters used as input; differences in the

resulting 100 outputs reflect our posterior uncertainty in a deconvolved profile with

respect to the posterior uncertainty of cloccs.
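The sampling scheme described above can be sketched in Python as follows. This is an illustration only, not the code used to produce the results: the function name `sample_parameterizations`, its arguments, and the toy chains are hypothetical, and sampling without replacement within each chain is an assumption.

```python
import random

def sample_parameterizations(chains, per_chain=10, tail=1000, rng=None):
    """Draw `per_chain` parameter sets from the last `tail` iterations of each chain.

    `chains` is a list of post-burn-in chains, each a list of parameter sets.
    Sampling without replacement within a chain is an assumption of this sketch.
    """
    if rng is None:
        rng = random.Random()
    draws = []
    for chain in chains:
        draws.extend(rng.sample(chain[-tail:], per_chain))
    return draws

# Toy stand-in for 10 chains: each "parameter set" is just (chain_id, iteration).
# The chains in the text have 100,000 iterations; 2,000 keeps the example small.
toy_chains = [[(c, i) for i in range(2000)] for c in range(10)]
draws = sample_parameterizations(toy_chains, rng=random.Random(1))
print(len(draws))  # 10 chains x 10 draws per chain
```

Each of the resulting parameterizations would then be passed to the deconvolution routine in place of the single posterior mean parameterization.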

For each gene, we then overlaid the 100 deconvolved profiles generated with 100

different cloccs parameterizations on top of one another to form a composite tran-

scription profile. Composite profiles for four representative genes whose transcripts

peak at different times in the cell cycle are shown in Fig. 3.5A. The posterior un-

certainty is so minimal that the 100 different profiles in each composite are nearly

identical, though the composite profile for DSE3 exhibits slightly higher uncertainty

in the middle of DG1. Non-uniform sampling (collecting data more frequently later

in the time course when synchrony loss has accumulated significantly) could perhaps

be employed in the future to ensure that profiles are equally certain in all intervals

of the cell cycle. Nevertheless, even with the uniformly-sampled data used here, our

deconvolution algorithm is robust enough to the posterior uncertainty in cloccs

parameter estimates that the profiles generated from 100 different parameterizations

are essentially indistinguishable. To further explore the robustness of deconvolved

profiles with respect to uncertainty in the input cloccs parameters, Fig. 3.6 shows genes

with different degrees of amplitude in expression, together with their overlaid deconvolved

transcription profiles under different cloccs parameterizations.

3.4.4 Deconvolution increases temporal resolution and precision of transcription profiles

One particularly compelling property of a good deconvolution algorithm is the in-

creased temporal resolution of its estimates; for example, although the microarray

data used in this paper were collected at 16 minute intervals, our deconvolved tran-

scription profiles have a nominal temporal resolution of less than one minute. How-

ever, this is by construction; a more meaningful question is, what is the ‘effective

temporal resolution’ of our deconvolved profiles?

To quantitatively estimate the robustness of temporal resolution after deconvolu-

tion to experimental noise, we added random multiplicative noise to the input profile.

Specifically, for each input profile $g = (g_1, \ldots, g_t)$, we added multiplicative Gaussian

[Figure 3.5 graphic omitted. Panels A and B: normalized transcript level (0 to max) for CLN1, NDD1, ACE2, and DSE3 over the mother (G1, S, G2/M) and daughter (DG1, S, G2/M) intervals. Panel C: timing difference of transcript peaks (min, 0 to 8) versus level of added noise (as a % of observed measurement, 5 to 20).]

Figure 3.5: Deconvolved profiles are robust to uncertainty in inputs. (A) Robustness of deconvolved profiles with respect to uncertainty in cloccs parameter estimates. Shown are 100 overlaid deconvolved transcription profiles for the G1 cyclin CLN1, the S-phase transcriptional activator NDD1, the transcriptional activator ACE2 expressed late in the cell cycle to drive early G1 transcription in a daughter-specific manner, and the daughter-specifically expressed DSE3. The 100 deconvolved transcription profiles for each gene were produced using 100 different cloccs parameterizations, each a random realization from the cloccs Markov chain. The most noticeable uncertainty in the deconvolved profiles seems to be for DSE3 in the middle of DG1, but even this uncertainty is minimal. More examples are shown in Fig. 3.6. (B) Robustness of deconvolved profiles with respect to uncertainty in measured input transcription profiles. Shown are 100 overlaid transcription profiles for CLN1, NDD1, ACE2, and DSE3. The 100 deconvolved transcription profiles for each gene were produced by deconvolving 100 noise-injected (10% level) measured transcription profiles. (C) Effective temporal resolution of deconvolved profiles as a function of measurement noise. The x-axis indicates the average level of random multiplicative noise added to input transcript levels at every point in the time series. Box-plots display the distribution of timing differences (unsigned) between the transcription peaks of deconvolved profiles with and without noise added. Gray boxes indicate interquartile ranges, heavy black bars indicate median values, and small red squares indicate mean values.

[Figure 3.6 graphic omitted. Each panel plots normalized transcript level (0 to max) over the mother (G1, S, G2/M) and daughter (DG1, S, G2/M) intervals. Genes shown, with PTR rank in parentheses: panel A: PRY2 (1), DSE3 (27), ACE2 (30), NDD1 (151); panel B: MCM6 (503), HHT2 (513), TEC1 (601), MYO2 (662); panel C: APC1 (1042), SPC24 (1061), EMP24 (1235), SIW14 (1256); panel D: DID2 (1677), HNT2 (2100), APC9 (2546), SIN4 (2617).]

Figure 3.6: More examples of the robustness of deconvolved profiles with respect to uncertainty in cloccs parameter estimates. Shown are 100 overlaid deconvolved transcription profiles for randomly selected genes with high PTR scores (ranked in the top 500 by PTR scores; panel A), medium PTR scores (ranked in 501-1,000; panel B), and low PTR scores (ranked in 1,001-1,500; panel C), with transcription peaks in G1, S, G2/M, and DG1, respectively. Panel D illustrates four genes with insignificant PTR scores (ranked below 1,501). The 100 deconvolved transcription profiles for each gene were produced using 100 different cloccs parameterizations, each a random realization from the cloccs Markov chain. The numbers in parentheses indicate the ranks of genes by PTR scores. Here, PTR indicates peak-to-trough ratio, a scoring scheme we used to quantify the degree of amplitude of gene expression. More details are given in Section 3.4.5.

noise at every time point, such that $g'_i = g_i \times (1 + \varepsilon_i)$, where $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Fig. 3.5B

shows the 100 overlaid deconvolved transcription profiles for the four characteristic

genes with noise injected at $\sigma = 10\%$.
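The noise-injection step can be illustrated with a short Python sketch. It is not the code used for the experiments: the function name `add_multiplicative_noise`, the example profile values, and the random seed are hypothetical, and only the noise model itself (each measurement multiplied by one plus a zero-mean Gaussian draw with standard deviation sigma) is taken from the text.

```python
import random

def add_multiplicative_noise(profile, sigma, rng=None):
    """Return a noisy copy of `profile`: g'_i = g_i * (1 + eps_i), eps_i ~ N(0, sigma^2)."""
    if rng is None:
        rng = random.Random()
    return [g * (1.0 + rng.gauss(0.0, sigma)) for g in profile]

# Hypothetical measured profile (arbitrary units), one value per time point.
profile = [1.0, 2.0, 4.0, 2.0, 1.0]

# 100 noisy realizations at sigma = 10%, as used for the overlays in Fig. 3.5B.
rng = random.Random(0)
noisy_profiles = [add_multiplicative_noise(profile, 0.10, rng) for _ in range(100)]
```

Because the noise is multiplicative, the absolute perturbation scales with the measured level, so high and low transcript levels are perturbed by the same relative amount.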

We further assess effective temporal resolution by determining how much the

timing of a profile changes as varying levels of noise are added to the input data.

This yields a measure of the robustness of timing information to noise in the data.

The simplest means of determining how much the timing of a profile changes is to

focus on how much the timing of the peak shifts, especially since the peak is typically

the most salient feature in a deconvolved profile. We therefore assessed how much

peak timing shifted—whether earlier or later (using unsigned timing differences)—as

varying amounts of multiplicative noise were added to the input data.

To do so, we selected the 100 genes with the most significant amplitude in expression

(ranked by peak-to-trough ratio, as described in Section 3.4.5) before deconvolution

as our benchmark, since for these genes, the peaks in the deconvolved

transcription profiles are usually easy to ascertain. We say the mother and daughter

peaks of a deconvolved transcription profile occur where the transcript levels in the

mother and daughter cell-cycle intervals are maximal. If the maximal level in one of

those intervals is at least twice as high as that in the other interval, we define this to

be the dominant peak; otherwise, we say the profile contains two dominant peaks. For

each of these genes, at each noise level, we generated 10 noisy transcription profiles,

deconvolved these profiles, and computed the unsigned timing differences of their

dominant peaks to those of the original deconvolved profile. With 10 noisy profiles

for 100 different genes, we thus had at least 1,000 unsigned peak differences (recall,

some profiles contain two dominant peaks) at each noise level.
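The dominant-peak rule and the unsigned timing difference can be sketched as follows. This is a minimal illustration under stated assumptions, not the original analysis code: the function names, the representation of a profile as a (mother, daughter) pair of lists, and the `minutes_per_step` grid spacing are all hypothetical, and comparing each dominant peak of the original profile with the maximum of the same branch in the noisy profile is one plausible reading of the procedure.

```python
def argmax(levels):
    """Index of the maximal transcript level in a list."""
    return max(range(len(levels)), key=levels.__getitem__)

def dominant_peaks(mother, daughter):
    """Return [(branch, index)] for the dominant peak(s) of a deconvolved profile.

    If the maximum of one branch is at least twice the maximum of the other,
    only that branch holds the dominant peak; otherwise both peaks are dominant.
    """
    m_idx, d_idx = argmax(mother), argmax(daughter)
    m_max, d_max = mother[m_idx], daughter[d_idx]
    if m_max >= 2 * d_max:
        return [("mother", m_idx)]
    if d_max >= 2 * m_max:
        return [("daughter", d_idx)]
    return [("mother", m_idx), ("daughter", d_idx)]

def unsigned_peak_shifts(reference, perturbed, minutes_per_step=1.0):
    """Unsigned timing differences (min) for the dominant peaks of `reference`.

    `reference` and `perturbed` are (mother, daughter) pairs on the same grid.
    """
    shifts = []
    for branch, ref_idx in dominant_peaks(*reference):
        series = perturbed[0] if branch == "mother" else perturbed[1]
        shifts.append(abs(argmax(series) - ref_idx) * minutes_per_step)
    return shifts

original = ([0, 1, 5, 1], [0, 1, 2, 1])  # mother peak dominates (5 >= 2 * 2)
noisy = ([0, 5, 1, 1], [0, 1, 2, 1])     # mother peak moved one step earlier
print(unsigned_peak_shifts(original, noisy))
```

A profile with two dominant peaks contributes two timing differences, which is why the total per noise level is at least, rather than exactly, 1,000.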

Although the reproducibility of our two replicate microarray experiments was

high (Orlando et al., 2008), and although it has been shown that the intrinsic noise

level in the gene expression of budding yeast is relatively low (Raser and O’Shea,

2005), we chose to examine the effects of average multiplicative noise across a broad

range, from 5% up to 20%. Across this range, the median unsigned peak timing

shift ranged from 0.0 up to 1.6 minutes, and the mean ranged from 0.6 up to 2.7

minutes (Fig. 3.5C). As one specific example, if the input replicate transcript levels

were all perturbed by an average of 10%, the timing of a peak would shift 1 minute, on

average. This indicates that the peak timing information in our deconvolved profiles

is relatively precise.

We observed that the effective temporal resolution of the deconvolved profiles

is not only related to the amount of noise in the input data, but also depends on

the time at which genes are transcribed during the cell cycle. For instance, when

adding 20% noise, while the mean shift in peak timing for all genes is 2.7 minutes

(Fig. 3.5C), it becomes 4.4 minutes for genes whose transcript levels peak late in

the cell cycle. This suggests that when collecting time-series measurements during

the cell cycle, it may again be beneficial to use non-uniform sampling, as suggested

above.

3.4.5 Deconvolution increases amplitude and dynamic range of transcription profiles

Because convolution is a form of smoothing, and deconvolution is therefore a form

of sharpening, deconvolution helps restore the dynamic range of transcript level

fluctuations whose measured levels have been dampened by the effects of convolution.

However, a serious risk of deconvolution is that it will sharpen not only the dampened

signal but also any noise in the measurements. For this reason, it is critical that the

deconvolution objective be regularized appropriately, which we have achieved in our

algorithm through use of a wavelet basis. The result is a deconvolution algorithm that

effectively sharpens signal (thereby increasing dynamic range) without sharpening

noise (Fig. 3.3B).

To assess this on a genome-wide scale, we need to quantify the dynamic range of

transcription profiles before and after deconvolution. Here, we developed a simple

peak-to-trough ratio (PTR) scoring scheme to quantitatively estimate the dynamic

range of transcription of a gene before and after deconvolution. Also, to be robust

against the influence of large or small outliers, we defined our PTR score as the

ratio between the 80th percentile and the 20th percentile of transcript levels over the

course of the cell cycle.

In detail, for a measured transcription profile (before deconvolution), the PTR

score was calculated as the ratio between the 80th percentile and the 20th percentile

of transcript levels after recovery (ignoring the R interval; from the first G1 to the end

of the time course). For a deconvolved transcription profile, we first calculated two

PTRs ($r_m$ and $r_d$) as the ratios between the 80th percentile and the 20th percentile

of the transcript levels in mother and daughter cells, respectively. The deconvolved

PTR score of a gene was then computed as the weighted geometric mean of the two:

$r = \sqrt[3]{r_m^2 r_d}$. A higher weight was placed on the PTR score from the mother because

we had slightly more confidence in the overall mother profile (more data available

for estimating the corresponding entries in f).
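The PTR computation can be sketched in Python as follows. This is an illustrative sketch rather than the original implementation: the function names are hypothetical, the percentile rule (linear interpolation between order statistics) is an assumption because the text does not specify one, and returning infinity for a zero denominator is likewise a choice made only for this example.

```python
def percentile(values, q):
    """q-th percentile (0-100), interpolating linearly between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def ptr(levels):
    """Peak-to-trough ratio: 80th percentile divided by 20th percentile."""
    trough = percentile(levels, 20)
    if trough == 0:
        return float("inf")  # the score grows without bound as the trough nears zero
    return percentile(levels, 80) / trough

def deconvolved_ptr(mother, daughter):
    """Weighted geometric mean r = (r_m^2 * r_d)^(1/3), mother weighted twice."""
    r_m, r_d = ptr(mother), ptr(daughter)
    return (r_m ** 2 * r_d) ** (1.0 / 3.0)

levels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # hypothetical transcript levels
print(ptr(levels))                       # 80th percentile is 9, 20th is 3
print(deconvolved_ptr(levels, levels))   # equal mother and daughter PTRs
```

Using the 80th and 20th percentiles instead of the maximum and minimum means that a single outlying time point cannot dominate the score.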

PTR scores before and after deconvolution are illustrated in the density scatter-

plot of Fig. 3.7A. Two things are apparent from this scatterplot: the vast majority of

genes exhibit a noticeable increase in their PTR score (they appear above the diag-

onal), as would be expected for a deconvolution method that sharpens transcription

profiles; at the same time, owing to the wavelet-based regularization employed by

our algorithm, and in contrast to most earlier deconvolution methods (e.g., Rowicka

et al. (2007)), genes can have smoother transcription profiles after deconvolution

than before (they can appear below the diagonal).

3.4.6 Deconvolution reveals a large number of transcripts fluctuating during the cell cycle

The increased dynamic range resulting from deconvolution affords us the opportu-

nity to more sensitively identify cell-cycle-regulated transcripts, those whose levels

fluctuate significantly over the course of the cell cycle. Indeed, one nice aspect of

our PTR score is that it provides a direct measure of how significantly a transcript’s

deconvolved levels are fluctuating over the course of the cell cycle. In particular,

our model-based deconvolution and PTR score allow us to avoid the Fourier-based

[Figure 3.7 graphic omitted. Panel A: density scatterplot of deconvolved PTR score versus original PTR score (axis ticks 1, 2, 5, 10, 20, 50, 100; density shaded from low to high), with SSK22, PCL1, CDC20, SIC1, and CLN2 labeled. Panel B: % of cell-cycle-regulated genes identified by previous studies (0 to 100) versus genes ranked by deconvolved PTR score (0 to 6,000), with curves for Spellman, Pramila, Orlando, and their intersection.]

Figure 3.7: Genome-wide analysis of deconvolved transcription profiles reveals a large number of transcripts fluctuating during the cell cycle. (A) Dynamic range of transcription profiles before and after deconvolution. The density scatterplot depicts PTR scores for all 5,670 transcription profiles before and after deconvolution. PTR scores above 100 are shown truncated since the PTR score can become arbitrarily large if the denominator approaches zero. Note that while most genes have increased dynamic range after deconvolution (above diagonal), some genes have decreased dynamic range (below diagonal), owing to our wavelet-based regularization. The five genes whose deconvolved transcription profiles appear in Fig. 3.3B are highlighted in blue. The dashed red line indicates the deconvolved PTR score threshold we selected to identify cell-cycle-regulated genes. (B) Recovery of previously identified cell-cycle-regulated genes in yeast. We ranked all 5,670 genes by their deconvolved PTR score. The plot shows the cumulative recall (sensitivity) of recallable genes identified as cell-cycle-regulated in previous studies. Genes with the highest 1,500 PTR scores (dashed red line) showed clear evidence of cell-cycle regulation; these include 96% of the 440 genes identified by all three earlier studies to be cell-cycle-regulated.

periodicity analyses that have been used to identify cell-cycle-regulated genes in the

past (e.g., Spellman et al. (1998); de Lichtenberg et al. (2005)), with their attendant

limitations when applied to sparsely or irregularly sampled time-series data.

Transcripts cannot easily be categorized in a simple binary fashion as being cell-

cycle-regulated or not, since cell-cycle regulation occurs along a continuum from

strongly-regulated to weakly-regulated, as well as being condition- and strain-dependent.

For this reason, it makes more sense to simply rank genes in terms of their degree of

cell-cycle regulation, for which we used our deconvolved PTR score as a measure. To

visualize how well our deconvolved PTR score recovers genes identified in earlier stud-

ies as cell-cycle-regulated, we plotted the cumulative recall of previously identified

cell-cycle-regulated genes as a function of our deconvolved PTR rank (Fig. 3.7B).

Although the degree of cell-cycle regulation occurs along a continuum, for the

purposes of downstream analysis, we wished to establish a set of genes whose tran-

script levels exhibited a sufficiently high level of fluctuation to be clearly called cell-

cycle-regulated. We established a set of size 1,500 (corresponding to a deconvolved

PTR score ≥ 1.37, shown in Figs. 3.7A and 3.7B by a dashed red line). This set

includes 73% of the 1,271 periodic genes identified in Orlando et al. (2008), 69% of

the 895 recallable periodic genes identified in Pramila et al. (2006), 76% of the 709

recallable periodic genes identified in Spellman et al. (1998), and 96% of the 440

genes in the intersection of the three previous lists. Note that because these previous

studies made predictions without the aid of deconvolution, we should not expect to

see overwhelming agreement with any individual study.
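The cumulative recall curve and the recall at a fixed cutoff can be sketched as follows. This is an illustration with toy data, not the original analysis: the function name `cumulative_recall` and the gene labels are hypothetical, and interpreting "recallable" as the previously reported genes that are present in our ranked list is an assumption.

```python
def cumulative_recall(ranked_genes, previously_identified):
    """Fraction of recallable genes recovered within the top k ranks, for every k.

    `ranked_genes` is ordered from highest to lowest deconvolved PTR score.
    Only previously identified genes present in the ranking count as recallable.
    """
    recallable = set(previously_identified) & set(ranked_genes)
    if not recallable:
        return [0.0] * len(ranked_genes)
    hits, curve = 0, []
    for gene in ranked_genes:
        if gene in recallable:
            hits += 1
        curve.append(hits / len(recallable))
    return curve

# Toy example: gene "z" was reported previously but is absent from the ranking.
ranked = ["a", "b", "c", "d"]
curve = cumulative_recall(ranked, {"a", "c", "z"})
print(curve[1])  # recall within the top 2 ranks
```

Reading the curve at rank 1,500 would give the percentages quoted above for each earlier study.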

Our set of 1,500 cell-cycle-regulated genes is noticeably larger than what has

previously been identified. Its increased size can be attributed primarily to the

increased sensitivity of our deconvolved profiles, which have had the “blurring” effects

of population asynchrony removed by our algorithm. After capping deconvolved PTR

scores at 100, the PTR scores of the 1,500 periodic genes increased by a factor of 4.7

on average after deconvolution, allowing us to more sensitively identify genes with

transcript-level fluctuations during the cell cycle. Heat-maps of transcript levels for

these 1,500 cell-cycle-regulated genes before and after deconvolution are shown in

Fig. 3.8.

Though we have chosen to focus on the 1,500 genes whose transcript levels are

most strongly cell-cycle-regulated, it is evident that an even larger number of genes

may be moderately or weakly regulated over the course of the cell cycle. This raises

[Figure 3.8 graphic omitted. Heat maps: panel A shows measured WT1 data; panel B shows deconvolved mother (G1, S, G2/M) and daughter (DG1, S, G2/M) profiles. Color scale: fold change versus mean, from 1/3 to 3.]

Figure 3.8: Transcript dynamics of the 1,500 most cell-cycle-regulated genes. Heat maps depict the dynamics of periodic transcripts in the measured (A) and deconvolved (B) transcription profiles of the identified 1,500 periodic genes. Corresponding rows in the various heat maps represent the same gene. Note that although our algorithm learns the deconvolved transcription profiles from two independent replicates of the measured data, only WT1 is shown in panel A for space (WT2 data is nearly identical).

the prospect that a far more significant fraction of the yeast transcriptome may be

under cell-cycle control than previously suspected.

3.4.7 Deconvolution is robust across replicates

To investigate how much the deconvolved profiles would vary if one were to learn them

only from one single replicate versus the other single replicate, we re-deconvolved

our 1,500 cell-cycle-regulated genes using only wild-type 1 (WT1) data and WT1

cloccs parameters, and again using only WT2 data and WT2 cloccs parameters

as listed in Table 3.1. To be clear, this analysis is less an assessment of our method

(and its reliance on good parameter estimates) than an assessment of

the way in which variations in measured data result in variations in deconvolved

profiles. However, the analysis does reveal whether the two replicates of input data

are consistent: if they are, then the two sets of output data should be

consistent; if not, then the two sets of output data should be inconsistent.

We first show specific results for four genes in Fig. 3.9A, compared

to the jointly learned profiles reported above. The results show that the

deconvolved profile is largely unaffected, but in general, the jointly learned profile is

slightly smoother, since it is based on more data. In Fig. 3.9B, we present summary

analyses and heat maps for all 1,500 genes. As shown in the figure, the separately

deconvolved profiles of the two replicates are nearly identical to each other, indicating

not only that the reproducibility of the two input datasets is high, but also that our

deconvolution algorithm is robust across replicates.

3.4.8 Deconvolution reveals fine timing of transcription programs

We have shown that our deconvolution algorithm can reliably estimate transcrip-

tion profiles at fine temporal resolution. This enables us to distinguish subtle timing

differences previously obscured in population measurements taken only every 16 min-

utes. Fig. 3.10 provides two examples: the transcription profiles of genes that play

key roles in the selection and activation of origins of DNA replication (Fig. 3.10A),

and the transcription profiles of histone genes (Fig. 3.10B).

Origins of replication are selected and activated by the ordered assembly of pro-

tein complexes on the genome at discrete stages of the cell cycle. Potential origins are

initially marked by the arrival of the origin recognition complex (ORC). During G1,

ORC then associates with Cdt1 and Cdc6 to recruit the helicase MCM complex, form-

ing the pre-replicative complex (pre-RC) and licensing potential replication origins

[Figure 3.9 graphic omitted. Panel A: normalized transcript level (0 to max) for ACE2, NDD1, CLN1, and DSE3 over the mother (G1, S, G2/M) and daughter (DG1, S, G2/M) intervals; legend: joint, WT1, WT2. Panel B: heat maps of mother and daughter profiles learned from WT1 and from WT2; color scale: fold change versus mean, from 1/3 to 3.]

Figure 3.9: Robustness of deconvolved profiles with respect to variation across measured data replicates. (A) Shown are the deconvolved profiles of ACE2, NDD1, CLN1, and DSE3, learned from WT1 (in blue), WT2 (in green), and jointly from the two replicates (in red), respectively. (B) Heat maps depict the dynamics of periodic transcripts in the deconvolved transcription profiles of the identified 1,500 cell-cycle-regulated genes learned from WT1 and WT2, respectively.

[Figure 3.10 graphic omitted. All panels plot normalized transcript level (0 to max) over the mother (G1, S, G2/M) and daughter (DG1, S, G2/M) intervals. Panel A, top: MCM complex (MCM2, MCM3, MCM5, MCM4, MCM6, MCM7) and CDC6. Panel A, bottom: CLB5, CLB6, DBF4, CDC7, CDC45, SLD2, SLD5, PSF1, PSF3, grouped in the legend as S-CDK, DDK, Cdc45 complex, Dpb11 complex, and GINS complex. Panel B: H2A (HTA1, HTA2), H2B (HTB1, HTB2), H3 (HHT1, HHT2), H4 (HHF1/2), H1 (HHO1), and H2A.Z (HTZ1).]

Figure 3.10: High temporal resolution of deconvolution reveals fine timing of transcription programs. (A) Normalized deconvolved transcription profiles of genes playing key roles in the origin-selection (top) and origin-activation (bottom) steps of DNA replication. Profiles of CDT1, MCM10, SLD3 (in the Cdc45 complex), DPB11 (in the Dpb11 complex), and PSF2 (in the GINS complex) are not shown since their deconvolved PTR scores are below our threshold for calling a gene strongly cell-cycle-regulated (none of these five are identified as cell-cycle-regulated in any previous study (Spellman et al., 1998; Pramila et al., 2006; Orlando et al., 2008) except for PSF2 in Orlando et al. (2008)). (B) Normalized deconvolved transcription profiles of histone genes in yeast. Note that the only two histone genes with somewhat distinctive profiles are the H2A.Z histone variant, which peaks later, and the H1 linker histone, whose transcript levels approach zero during DG1.

for activation. Origins are activated late in G1 by S-CDK and DDK activity, leading

to the formation of a massive protein assembly called the pre-initiation complex

(pre-IC), including the Cdc45 complex, the Dpb11 complex, and the GINS complex.

Assembly of the pre-IC eventually leads to the initiation of DNA synthesis, defining

the start of S phase (Bell and Dutta, 2002).

Fig. 3.10A makes evident that the timing of transcription of genes involved in

the selection (pre-RC) and activation (pre-IC) steps of replication is tightly regu-

lated, with transcripts of pre-RC genes peaking together early in G1 (top panel) and

transcripts of pre-IC genes peaking together later in G1 (bottom panel). The two

catalytically-distinct MCM subgroups, Mcm2-3-5 and Mcm4-6-7 (Schwacha and Bell,

2001), seem to be transcribed coordinately, especially in relation to the troughs of

each profile. Interestingly, the tight regulation evident in mother cells appears to

be relaxed in daughter cells, though it should be recalled that daughter profiles are

slightly more uncertain. Even so, the transcripts of all the pre-RC genes still peak

before the transcripts of all the pre-IC genes.

During replication, newly synthesized DNA is complexed with nucleosomes, his-

tone octamers consisting of two copies of each of the four core histones H2A, H2B, H3,

and H4 (Hereford et al., 1981). Fig. 3.10B reveals that these core histones are tran-

scribed in remarkably tight coordination, peaking precisely at the start of S phase.

In addition, we observe that in both mother and daughter cells, one histone gene

peaks distinctly later than the others: HTZ1 , the replication-independent histone

variant H2A.Z which is not assembled into nascent nucleosomes, but is exchanged

for H2A in a subset of nucleosomes afterwards (Kamakaka and Biggins, 2005). The

other histone gene with a somewhat distinctive transcription profile is the H1 linker

histone HHO1 (Bustin et al., 2005), whose transcript levels uniquely approach zero

during DG1, though they peak at essentially the same time as the core histones.

3.4.9 Identifying over-represented transcription factors (TFs)

According to the deconvolved transcription profiles, we classified daughter-specific

genes and stress-response genes into several subclusters by visual inspection. We

expect that the genes with coherent transcription patterns may be regulated by

common transcription factors (TFs), and therefore certain TFs might be signifi-

cantly associated with the promoters of the genes in a given subcluster. To test

this hypothesis, we used the TF-gene regulation mappings from the YEASTRACT

database (Teixeira et al., 2006) (direct evidence only, downloaded February 2011)

to look for over-represented TFs binding to the promoters of genes within each subcluster.

To determine whether a TF is over-represented in a specified list of genes, we

calculated a p-value using a hypergeometric test, and designated it as being over-

represented if the p-value is less than or equal to 0.005. To increase the biological

significance of the identified TFs, we removed TFs that bound fewer than 3, or fewer

than 10%, of the genes in a subcluster.
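The over-representation test and the two filters can be sketched as follows. This is a minimal sketch under stated assumptions, not the original pipeline: the function names and argument names are hypothetical, the one-sided upper-tail form of the hypergeometric test is assumed, and the choice of background gene set (the value of `n_background`) is not specified in the text.

```python
from math import comb

def hypergeom_pvalue(n_background, n_targets, n_cluster, n_hits):
    """P(X >= n_hits) for X ~ Hypergeometric.

    n_background: genes in the background set (assumed; not given in the text)
    n_targets:    background genes regulated by the TF
    n_cluster:    genes in the subcluster
    n_hits:       subcluster genes regulated by the TF
    """
    upper = min(n_targets, n_cluster)
    tail = sum(
        comb(n_targets, i) * comb(n_background - n_targets, n_cluster - i)
        for i in range(n_hits, upper + 1)
    )
    return tail / comb(n_background, n_cluster)

def is_over_represented(n_background, n_targets, n_cluster, n_hits,
                        alpha=0.005, min_genes=3, min_fraction=0.10):
    """Apply the p-value cutoff plus the minimum-count and minimum-fraction filters."""
    if n_hits < min_genes or n_hits < min_fraction * n_cluster:
        return False
    return hypergeom_pvalue(n_background, n_targets, n_cluster, n_hits) <= alpha

# Toy numbers: 5 of 10 background genes are TF targets, and all 5 cluster genes are hits.
print(hypergeom_pvalue(10, 5, 5, 5))     # equals 1 / comb(10, 5)
print(is_over_represented(10, 5, 5, 5))
```

The count and fraction filters guard against TFs that reach a small p-value in a tiny subcluster on the strength of only one or two bound genes.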

3.4.10 Deconvolution reveals an R-specific transcriptional program

In elutriation-based synchronization experiments, the initially collected cells—typically

small cells early in G1—are released from synchrony after experiencing significant

cold and osmotic stress. Thus, elevated transcript levels of a gene early in the time

course could arise because the gene is necessary for early G1 events, or because the

gene is part of a stress response, or both. If the former, we would expect to see high

levels of transcription again later in the time course; if the latter, we would expect

the high levels of transcription to be confined to the earliest samples of the time

course.

To identify genes whose high early transcription can be primarily attributed to

stress response, we established two criteria: the integrated transcript level of a gene in

the R interval is at least half the total across all cell-cycle branches (R, G1 + postG1,

Table 3.2: Full list of over-represented TFs in subclusters of R-specific expressed genes (Fig. 3.11).

Subcluster ID   # of genes   Over-represented TFs with p-value ≤ 0.005
1               38           Adr1, Hot1, Sko1, Msn2, Hap1, Sko2, Skn7, Pdr1, Cad1, Fkh2, Nrg1, Rtg3
2               131          Hot1, Sko1, Cad1, Adr1, Yap5, Msn2, Cin5, Sok2, Pdr1, Yap6, Ste12, Skn7
3               12           Put3, Yap5, Pho4, Gcn4, Cin5, Yap6
4               3            Rap1, Sfp1

and DG1 + postG1), and the peak transcript level in R is at least twice as high as

that in mother (G1 + postG1) or daughter (DG1 + postG1) cells. We identified

184 genes satisfying these criteria, heat maps for which are shown in Fig. 3.11.

Gene Ontology (GO) (Ashburner et al., 2000) enrichment analysis reveals that the

biological functions of many of these genes relate to vacuolar

protein catabolic processes ($p < 4 \times 10^{-26}$), response to temperature stimuli

($p < 10^{-16}$), response to abiotic stimuli ($p < 10^{-8}$), and similar processes, suggesting that these

genes are indeed likely stress-response genes that are, more specifically, responding to the

cold temperatures during elutriation. On the basis of their deconvolved transcription

profiles, we refined these genes into four subclusters according to the time at which

the profiles first drop below their mean, and looked for over-represented TFs within

the promoters of genes in each subcluster. Up to five over-represented TFs for each

subcluster are shown in Fig. 3.11. TFs that are involved in the regulation of genes

during stress or amino acid starvation (e.g., Gcn4) are labeled in red. The full list

of over-represented TFs is given in Table 3.2.
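The two criteria used to call a gene R-specific can be sketched as follows. This is an illustrative sketch, not the original code: the function name `is_r_specific`, the uniform time grid with spacing `dt`, and the use of a simple Riemann sum for the integrated transcript level are assumptions, and the peak criterion is interpreted here as requiring the R peak to be at least twice the higher of the mother and daughter peaks.

```python
def is_r_specific(r, mother, daughter, dt=1.0):
    """Check the two R-specific criteria on a deconvolved profile.

    r:        transcript levels in the recovery (R) interval
    mother:   transcript levels in G1 + postG1
    daughter: transcript levels in DG1 + postG1
    dt:       grid spacing (hypothetical; assumed uniform across branches)
    """
    integrate = lambda levels: sum(levels) * dt  # Riemann-sum approximation
    total = integrate(r) + integrate(mother) + integrate(daughter)
    # Criterion 1: R holds at least half of the integrated transcript level.
    half_in_r = integrate(r) >= 0.5 * total
    # Criterion 2: the R peak is at least twice the mother and daughter peaks.
    peak_in_r = max(r) >= 2 * max(max(mother), max(daughter))
    return half_in_r and peak_in_r

stress_like = is_r_specific([10, 10, 10], [1, 2, 1], [1, 1, 1])
cycling = is_r_specific([3, 3], [2, 2, 2, 2], [2, 2])
print(stress_like, cycling)
```

A gene needed for early G1 events would show renewed transcription in the G1 or DG1 intervals, so it would fail the first criterion even if its R-interval levels were high.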

To measure the degree of amplitude in expression, we used a simple PTR scoring

scheme to identify cell-cycle-regulated genes. However, since PTR scores intention-

ally ignore the recovery interval to focus on the mother and daughter cell cycle, they

do not take into account transcript levels during R, which for stress response genes


[Figure 3.11: heat maps over the R, G1, DG1, S, and G2/M intervals, with scale ticks 3, 3/2, 1, 2/3, 1/3. Subcluster labels: 1: Adr1, Hot1, Sko1, Msn2, Hap1; 2: Hot1, Sko1, Cad1, Adr1, Yap5; 3: Put3, Yap5, Pho4, Gcn4, Cin5; 4: Rap1, Sfp1.]

Figure 3.11: Genes whose transcriptional levels are elevated significantly under stress. 184 out of 1500 cell-cycle-regulated genes were identified to be R-specific expressed genes. According to the time at which the profiles first drop below their mean, we refined these genes into four subclusters; up to five over-represented TFs within the promoters of genes in each subcluster are shown. TFs that are involved in the regulation of genes during stress or amino acid starvation (e.g., Gcn4) are labeled in red. The full list of the over-represented TFs is given in Table 3.2.

may be significantly elevated. In particular, 128 of the 184 genes listed here are

also included in our set of 1,500 cell-cycle-regulated genes. As can be seen here, the

transcript levels of these genes are often much higher in R than later in the cell cycle,

indicating that their transcription is not exclusively regulated during the cell cycle,

but also through varying environmental conditions and stress.


3.4.11 Deconvolution reveals a daughter-specific G1 transcription program

[Figure 3.12: (A) normalized transcript level (0 to max) in mother (G1, S, G2/M) and daughter (DG1, S, G2/M) cells for ASH1, EGT2, AMN1, DSE3, DSE4, PRY3, SCW11, DSE1, DSE2, and CTS1. (B) heat map with scale ticks 3, 3/2, 1, 2/3, 1/3 and subcluster labels early: Ace2, Swi5, Sok2, Phd1, Ste12; middle: Sok2, Ste12, Cin5, Yap6; late: Mac1, Tec1, Put3, Mcm1, Ste12.]

Figure 3.12: Branching process construction enables deconvolution to reveal a daughter-specific G1 transcription program. Our deconvolution algorithm explicitly learns distinct cell-cycle transcription programs for both mother and daughter cells, enabling us to explore transcriptional behavior of daughter cells that cannot be observed from the population-level transcription profiles. (A) Deconvolved transcription profiles in mother (left) and daughter cells (right) of genes previously characterized as daughter-specific in Table 1 of Colman-Lerner et al. (2001). (B) Two criteria were used to identify 82 genes transcribed primarily and almost entirely in the DG1 interval (which we call daughter-specific genes). All daughter-specific genes in panel A were identified by our criteria and thus appear in this set. According to the timing of transcription peaks in DG1, we classified these genes into 3 subclusters: early, middle, and late. Up to five over-represented TFs of each subcluster are shown. The full list of the over-represented TFs is given in Table 3.3.

Coupled with our high-resolution estimates, the explicit modeling of asymmetric


Table 3.3: Full list of over-represented TFs in subclusters of daughter-specific genes (Fig. 3.12).

Subcluster name | # of genes | Over-represented TFs with p-value ≤ 0.005
Early | 54 | Ace2, Swi5, Sok2, Phd1, Ste12, Fkh2, Mcm1, Ash1, Fkh1, Skn7, Adr1, Tos8, Swi4, Mbp1, Dal81, Pho4, Yap5
Middle | 8 | Sok2, Ste12, Cin5, Yap6
Late | 20 | Mac1, Tec1, Put3, Mcm1, Ste12

cell division enables us to monitor and differentiate distinct mother and daugh-

ter transcription programs. For example, Table 1 of Colman-Lerner et al. (2001)

identified a set of genes that are transcribed in daughter-specific early G1, and sug-

gested that this daughter-specific transcription may, in part, be due to Cbk1/Mob2-

dependent activation and localization of the Ace2 transcription factor to the daughter

cell nucleus. As shown in Fig. 3.12A, our deconvolution algorithm not only correctly

predicts the transcription of these genes as daughter-specific, but also provides a

finely timed view of relevant events in late mitosis and early G1 that are not evident

in the population-level transcription profiles. We observe four distinct sets of tran-

scription dynamics: 1) ASH1 is transcribed to peak levels first, but is also degraded

first; 2) EGT2 , AMN1 , and DSE3 transcript levels rise very closely on the heels of

ASH1 , but degrade more slowly; 3) DSE4 , PRY3 , and SCW11 transcript levels be-

gin to rise at a similar time, but reach their peaks more slowly; and 4) DSE1 , DSE2 ,

and CTS1 transcript levels begin to rise noticeably later and peak last (Fig. 3.12A).

This order of transcription timing is consistent with our knowledge about the

functions of these genes. Ash1 is one of the earliest regulators of daughter-specific

gene expression programs, and is required to repress the transcription of HO from the

beginning of DG1 to block mating-type switching (Sil and Herskowitz, 1996; Cosma,

2004). AMN1 is also transcribed very early in DG1 as Amn1 has been shown to

be part of a daughter-specific switch that helps cells complete mitotic exit (Wang


et al., 2003). On the other hand, DSE2 and CTS1 (chitinase) are transcribed later

in DG1 as they encode proteins that degrade the cell wall from the daughter side,

leading to mother-daughter separation (Colman-Lerner et al., 2001; Doolin et al.,

2001; Kuranda and Robbins, 1991).

Among genes that rise to their peaks concomitantly, we observe that their tran-

script levels may decay at different rates; interestingly, these rates are in rough

qualitative agreement with a recent global study of mRNA half-lives (Miller et al.,

2011). For instance, among the closely transcribed genes ASH1 , EGT2 , AMN1 , and

DSE3 , the half-life of ASH1 is shortest (9.35), the half-lives of AMN1 and EGT2

are close to one another (11.02 and 10.67), and the half-life of DSE3 is longest (24.65).

Similarly, the half-life of CTS1 (33.38) is significantly longer than those of the other

two closely transcribed genes DSE1 and DSE2 (7.64 and 7.49).

Having confirmed that the known daughter-specific transcripts of Colman-Lerner

et al. (2001) were primarily transcribed during DG1 after deconvolution (Fig. 3.12A),

we sought to identify other genes that were similarly transcribed primarily during

DG1. We established two criteria: the integrated transcript level of a gene across

all of DG1 should be at least 30% of the total across all cell-cycle branches (R,

G1 + postG1, and DG1 + postG1), and the peak transcript level in DG1 should

be at least 1.5 times higher than the peak during recovery (R) or in mother (G1

+ postG1) cells. We identified 82 genes satisfying these criteria which we consider

to be primarily transcribed in daughter cells during G1 (Fig. 3.12B). Many known

daughter-specific genes are in the list, including all ten genes in Table 1 of Colman-

Lerner et al. (2001), all six genes identified by Di Talia et al. (2009) as “strongly and

fairly specifically activated by Ace2”, and a remarkable 19 of the 22 genes identified

by Di Talia et al. (2009) as “responding to a greater or lesser extent to both Ace2

and Swi5” (p < 2 × 10^{-33}); these include the cyclin Pcl9 and the CDK inhibitor Sic1

that drives cells out of mitosis (Toyn et al., 1997).


Table 3.4: The contingency table for the 82 identified daughter-specific genes according to the daughter-specific and non-daughter-specific genes identified in Di Talia et al. (2009), Spellman et al. (1998), and Colman-Lerner et al. (2001).

True positives: 25 genes. We categorize as true positives the 25 identified genes that are reported by Di Talia et al. (2009) in their Supplementary Text S1 to be among 28 genes transcribed only in daughter cells, or particularly responsive to either Ace2 or Swi5: AMN1, ASH1, BUD9, CTS1, CYK3, DSE1, DSE2, DSE3, DSE4, EGT2, GAT1, ISR1, NIS1 (mistyped in their Supplementary Text as HIS1, but evident from their figure as NIS1), PCL9, PIR1, PRR1, PRY3, PST1, RME1, SCW11, SIC1, SUN4, YLR049C, YNL046W, and YPL158C.

False positives: 4 genes. Colman-Lerner et al. (2001) suggested in their Table 2 that 19 genes were not daughter-specific. However, among these, 8 were subsequently confirmed by Di Talia et al. (2009) to actually be daughter-specific: BUD9, CYK3, PCL9, PST1, SIC1, YNL046W, NIS1, and RME1 (mistyped as REM1, but evident from their Fig. 2a as RME1). Of the remaining 11 genes,

◦ 2 are not included on our microarrays: YMR316C-A and YOR263C.

◦ 4 are in the set we identified as daughter-specific: CHS1, HO, PIR3, and TEC1. Thus, these are false positives.

◦ 5 are not in the set we identified as daughter-specific: CDC6, FAA3, PCL2, YGR149W, and PIL1. Thus, these are categorized as true negatives.

True negatives: 5 genes. Refer to the description of false positives.

False negatives: 3 genes. We categorize as false negatives the 3 non-identified genes that are reported by Di Talia et al. (2009) in their Supplementary Text S1 to be among 28 genes transcribed only in daughter cells, or particularly responsive to either Ace2 or Swi5: YLR414C, FTH1, and ESF2.

Although this is not a proper quantitative estimate of the false discovery rate

(FDR), the above categorizations suggest that the FDR is roughly

4/29 ≈ 0.138. However, since much of the data from Colman-Lerner et al. (2001)

seems to have been superseded by more recent results (in particular, 8 of the

19 genes claimed not to be daughter-specific have subsequently been shown to

actually be daughter-specific), this may be a high estimate of the true FDR.
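The arithmetic behind this rough estimate follows directly from the four cells of Table 3.4. The short sketch below simply recomputes it; the sensitivity figure is an additional quantity derived from the same counts for illustration, and both numbers inherit the caveat above that the reference sets are small and literature-derived.

```python
from fractions import Fraction

# Cell counts taken from Table 3.4.
true_pos, false_pos, true_neg, false_neg = 25, 4, 5, 3

# False discovery rate: false positives among all genes called daughter-specific
# that could be checked against the reference sets.
fdr = Fraction(false_pos, true_pos + false_pos)

# Sensitivity: reference daughter-specific genes that were recovered.
sensitivity = Fraction(true_pos, true_pos + false_neg)

print(f"FDR = {fdr} = {float(fdr):.3f}")
print(f"sensitivity = {sensitivity} = {float(sensitivity):.3f}")
```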

Regarding the 4 false positives, we identified HO , which controls mating type

switching and is known to participate in mother/daughter differentiation (by being

asymmetrically localized to mothers rather than daughters); and TEC1 , which plays

a key role in regulating pseudohyphal growth, and whose binding sites are sugges-

tively enriched in our “late” cluster of daughter-specific genes, along with STE12 ,

the key mating pheromone response transcription factor (TF). Taken together, these

results suggest a linkage between mating type/pheromone response pathways and

how mothers and daughters differentiate. We also identified CHS1 , a chitin syn-

thase required to repair the septum after mother/daughter separation, which seems

to be a Swi5 target rather than an Ace2 target; and PIR3, a cell wall protein. The

presence of both HO and CHS1 among our false positives suggests that a gene may

sometimes be included in our list if it is mother- rather than daughter-specific but

is not present early in our time course experiments. False positives may therefore

include genes that are asymmetrically localized to mothers during mother/daughter

differentiation but do not appear until late in our time course experiments.

Gene Ontology (GO) (Ashburner et al., 2000) enrichment analysis indicates that

many of the proteins corresponding to these genes play a role in the processes of

transcription elongation (p < 3 × 10^{-8}), completion of separation (p < 2 × 10^{-7}),

cytokinetic cell separation (p < 2 × 10^{-6}), cell wall organization or biogenesis (p <

7 × 10^{-4}), etc. We visually clustered the 82 genes into three clusters and performed

TF-promoter enrichment analysis of the genes in each cluster. Not surprisingly, genes

whose transcript levels peak early in DG1 (Fig. 3.12B, early) share Ace2 and Swi5

as key TFs; also identified are Sok2, Phd1, and Ste12, all regulators of pseudohyphal

growth. Genes whose profiles are above average for almost all of DG1 (Fig. 3.12B,

middle) are further enriched for Cin5 (previously called Yap4) and Yap6, yeast AP-1

homologues that both recruit the Tup1/Ssn6 repressor under stress conditions

(Hanlon et al., 2011). Genes whose onset is a bit later in DG1 (Fig. 3.12B, late) are

enriched for Mcm1, Tec1, and Ste12—all involved in responses to pheromone or

pseudohyphal growth—as well as Mac1, a copper sensing TF, and Put3, a regulator

of the proline utilization pathway.

Since it is experimentally difficult to measure mother and daughter transcrip-

tion programs independently, knowledge of daughter-specific events is still rather

limited, and high-throughput identification of daughter-specific genes has been an

open problem in the field. Our deconvolution algorithm, with its unique ability to

reveal a daughter-specific transcription program from population-level data, provides

a method for generating hypotheses in this direction, and reveals a much larger list

of daughter-specific genes than has previously been identified (Colman-Lerner et al.,

2001). Along with the recent results of Di Talia et al. (2009) and others, this list

provides a step toward understanding the nature of mother-daughter cell differenti-

ation.

3.4.12 Transcriptional programs between G1 and DG1

G1 is the major period of cell growth during the cell cycle. During this phase, both

mother and daughter cells require large amounts of structural proteins and enzymes

for synthesizing new organelles, and many genes are transcribed both in mother

cells in G1 and in daughter cells in DG1. Since mother and daughter cells are permitted

by our model to transcribe genes differently during G1, it might be interesting to

ask how the transcription programs in G1 and DG1 are related. For example, the

transcription program of a gene in G1 may be essentially identical to that in DG1,

albeit proceeding at a faster pace so that the profile appears to be compressed, an

example being MCM7 (Fig. 3.10A); or the transcription profile in DG1 may be a

delayed-onset version of the G1 profile, preceded by some daughter-specific early G1

profile, an example being MCM3 (Fig. 3.10A).


[Figure 3.13: (A) Peak ratio ≥ 2 (120 genes), one dominant peak (mother or daughter): 1 gene (PRY1) in mother G1; 119 genes in daughter DG1. (B) Peak ratio < 2 (1380 genes), two dominant peaks (mother & daughter): 364 genes in mother G1 & in daughter DG1; 381 genes in mother G1 & in daughter post-G1; 345 genes in mother post-G1 & in daughter DG1; 290 genes in mother post-G1 & in daughter post-G1. For the 364 genes, the compressed and delayed correlations are each graded High (corr ≥ 0.9), Medium (0.9 > corr ≥ 0.7), or Low (corr < 0.7), giving the categories delayed (30.2%), compressed (30.5%), mixed (20.9%), and uncorrelated (18.4%).]

Figure 3.13: Relationships of transcription profiles in G1 and DG1. First, as discussed in the Methods, we separated our 1,500 cell-cycle-regulated genes into two groups: genes with one dominant peak and genes with two dominant peaks. (A) One-dominant-peak genes. In our cell-cycle branching process model, we allowed mother and daughter cells to transcribe genes differently during G1 and DG1, but assumed they share a common transcription program postG1. Therefore, since these genes have only one dominant peak, it must occur either in mother G1 or in daughter DG1. Interestingly, we found only one gene (PRY1) in the first category, but 119 genes in the second. (B) Two-dominant-peak genes. We split the remaining 1380 genes into four subgroups according to where their two dominant peaks occurred. For the 364 genes whose two dominant peaks are in mother G1 and in daughter DG1, we calculated two Pearson correlation coefficients between the transcription profiles in G1 and DG1: one between the G1 profile and a compressed version of the DG1 profile (compressed); the other between the G1 profile and the later segment of the DG1 profile (delayed). According to the strengths of these two correlation coefficients, we separated the 364 genes into 9 groups, and combined some of the groups into four categories, as shown.

To study relationships of transcription profiles in G1 and DG1, we can calculate

two Pearson correlation coefficients: one between the G1 profile and a compressed

version of the DG1 profile; the other between the G1 profile and the latter segment of

the DG1 profile. Since correlation ignores amplitudes, we also compare the maximum

transcript levels in these intervals to ensure rough equivalence. Focusing on the cell-

cycle-regulated genes that have clear peaks in G1 and DG1, we observed that about


[Figure 3.14: circular plot with sectors G1, S, and G2/M and radial PTR-score rings at 1.37, 5, 10, 20, 50, and 100+. Genes shown: MCM2, MCM3, MCM5, MCM4, MCM6, MCM7, CDC6, CLB5, CLB6, DBF4, CDC7, CDC45, SLD2, SLD5, PSF1, PSF3, HTA1, HTA2, HTB1, HTB2, HHT1, HHT2, HHF1, HHO1, HTZ1. Legend: genes in the origin-selection step of DNA replication; genes in the origin-activation step of DNA replication; histone genes.]

Figure 3.14: Circular representation of peak timing of genes. The figure depicts the timing of transcriptional peaks of genes in Fig. 3.10, where colored sectors indicate respectively the cell cycle phases of G1, S, and G2/M, and the gray dashed circles indicate the deconvolved PTR scores, starting from 1.37, the score threshold for calling a gene cell-cycle-regulated. Therefore, this representation shows not only the peak timing of genes, but also the amplitude of cell-cycle oscillation.

30% are exclusively in the first category (compressed), about 30% are exclusively in

the second category (delayed), and about 20% can be classified into both categories,

like CDC6 (Fig. 3.10A); the final 20% are not easily categorized. The details are given in Fig. 3.13.
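The two correlation coefficients can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function names are hypothetical, both profiles are assumed to be sampled at the same time resolution, DG1 is assumed to be at least as long as G1, "compression" is implemented as linear resampling of the DG1 profile onto the G1 grid, and the strength labels use the cut-offs shown in Fig. 3.13.

```python
import numpy as np


def resample(profile, n):
    """Linearly resample a profile onto n evenly spaced points."""
    profile = np.asarray(profile, dtype=float)
    old_grid = np.linspace(0.0, 1.0, len(profile))
    new_grid = np.linspace(0.0, 1.0, n)
    return np.interp(new_grid, old_grid, profile)


def g1_dg1_correlations(g1, dg1):
    """Return (compressed, delayed) Pearson correlations between G1 and DG1."""
    g1 = np.asarray(g1, dtype=float)
    dg1 = np.asarray(dg1, dtype=float)
    if len(dg1) < len(g1):
        raise ValueError("DG1 is assumed to be at least as long as G1")
    # Compressed: squeeze the whole DG1 profile into the duration of G1.
    compressed = np.corrcoef(g1, resample(dg1, len(g1)))[0, 1]
    # Delayed: compare G1 with the final stretch of DG1 of the same length.
    delayed = np.corrcoef(g1, dg1[-len(g1):])[0, 1]
    return float(compressed), float(delayed)


def strength(corr):
    """Grade a correlation with the cut-offs from Fig. 3.13."""
    if corr >= 0.9:
        return "high"
    if corr >= 0.7:
        return "medium"
    return "low"
```

Because correlation ignores amplitude, a complete version would also compare the maximum transcript levels in the two intervals, as noted in the text; that check is omitted from the sketch.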

3.4.13 Visualizing transcription timing of gene groups

In previous sections, we have shown that our deconvolution algorithm provides

single-cell-like transcription profiles explicitly for mother and daughter cells, and

that the resolution of the deconvolved profiles increases significantly, from the

initial 16 minutes to around 1-2 minutes. In addition, our method is more sensitive,

enabling us to reveal subtle timing differences between genes whose transcription

programs appear similar in the original population measurements. Based on the PTR

scores, we here propose a novel means of visualizing the transcriptional timing of

gene groups (e.g., protein complexes, genetic pathways, functional modules). As shown

in Fig. 3.14, we represent the standard cell cycle as a circular plot composed of

three sectors, indicating G1, S, and G2/M, respectively. For each gene, we then draw

a small circle on the plot to mark the timing of its transcriptional peak. The

position of each circle is determined not only by the gene's peak timing, but also

by the amplitude of its deconvolved transcription. From such plots, we can not only

identify genes that are transcribed together, but also investigate subtle timing

differences among the genes in a functional group, such as a protein complex.
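One way to compute the position of each gene's marker on such a plot is sketched below. The mapping is an assumption made for illustration: the peak time is taken as a fraction of one cell cycle and converted to an angle, and the PTR score is placed on a logarithmic radial axis that starts at the 1.37 threshold and saturates at 100, which is suggested by the ring labels in Fig. 3.14 but not stated in the text. The function name and both constants' roles are hypothetical.

```python
import math

PTR_THRESHOLD = 1.37  # score threshold quoted in the Fig. 3.14 caption
PTR_CAP = 100.0       # assumption: scores above 100 share the outermost ring


def polar_position(peak_fraction, ptr_score):
    """Map a gene to (angle in radians, radius in [0, 1]).

    peak_fraction: peak time as a fraction of one cell cycle, in [0, 1).
    ptr_score: deconvolved PTR score, at least PTR_THRESHOLD.
    """
    if not 0.0 <= peak_fraction < 1.0:
        raise ValueError("peak_fraction must lie in [0, 1)")
    if ptr_score < PTR_THRESHOLD:
        raise ValueError("gene is below the cell-cycle-regulated threshold")
    angle = 2.0 * math.pi * peak_fraction
    clipped = min(ptr_score, PTR_CAP)
    # Logarithmic radial axis (assumed), normalized so the cap sits at radius 1.
    radius = math.log(clipped / PTR_THRESHOLD) / math.log(PTR_CAP / PTR_THRESHOLD)
    return angle, radius
```

The resulting pairs can be passed to any polar plotting routine; genes that peak together then share an angle, and strongly oscillating genes sit farther from the center.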


4

Identifying conserved functional modules across species

In this chapter, we move our focus from genes to proteins. We introduce a PPI

network alignment method, called ‘DOMAIN’, which exploits protein functional do-

mains to identify equivalent functional modules from pairwise protein-protein inter-

action networks across species.

Conventionally, most network alignment algorithms adopt a node-then-edge-alignment

paradigm: they first identify homologous proteins across networks and then consider

interactions among them to construct network alignments. DOMAIN is instead built

upon a novel direct-edge-alignment paradigm. Specifically, instead of explicitly

identifying homologous proteins, we directly infer plausibly alignable PPIs across

species by comparing the conservation of their constituent domain interactions. By

applying our approach to detect conserved protein complexes in yeast-fly and

yeast-worm PPI networks, we show that it outperforms two recent approaches in most

alignment performance metrics. We also show that our approach enables us to identify

conserved cell-cycle functional modules across species. Most of the work presented

in this chapter appeared in Guo and Hartemink (2009).

4.1 Introduction to network alignment

Understanding complicated networks of interacting proteins is a major challenge in

systems biology. Recently, with the rapid progress of high-throughput experimental

techniques, protein-protein interaction (PPI) databases have exponentially increased

in size, allowing for comparative analysis of PPI networks from which conserved

modules can be identified across PPI networks of different species (Sharan and Ideker,

2006; Srinivasan et al., 2007). By analogy to sequence alignment, this problem is

called PPI network alignment.

Typically, PPI network alignment algorithms compare PPI networks of two or

more species and identify conserved modules (e.g., pathways or protein complexes).

Often a PPI network is represented as an undirected graph in which nodes indicate

proteins and edges indicate interactions. Hence, the network alignment problem can

also be viewed as a graph isomorphism problem.

Many network alignment algorithms have been proposed in recent years and most

of them focus on the pairwise alignment of PPI networks. As an early approach, Path-

BLAST (Kelley et al., 2003) proposed a likelihood-based scoring scheme to search for

conserved pathways. Sharan et al. (2005b) extended PathBLAST to employ a greedy

heuristic to detect conserved protein complexes across species. NetworkBLAST-

E (Hirsh and Sharan, 2007) introduced an evolutionary model of networks into the

alignment scoring function to extract conserved complexes. MaWISh (Koyuturk

et al., 2006) merged pairwise interaction networks into a single alignment graph

and treated network alignment as a maximum weight induced subgraph problem.

MNAligner (Zhenping et al., 2007) described an integer quadratic programming

(IQP) model to identify conserved substructures.

Recently, several network alignment algorithms have been developed that can


align more than two species. Graemlin (Flannick et al., 2006) is capable of aligning

more than ten microbial networks at once. NetworkBLAST (Sharan et al., 2005a),

another extension of PathBLAST, can align networks of up to three species, and

its later version, NetworkBLAST-M (Kalaev et al., 2008), can align ten networks

with tens of thousands of proteins in minutes. In addition, Singh et al. (2008)

described a method inspired by Google’s PageRank to detect global alignments from

five eukaryotic PPI networks.

However, all these network alignment algorithms follow a node-then-edge-alignment

paradigm. That is, they generally first need to identify homologous proteins across

species before they can exploit protein interaction and network topology information

to detect conserved subnetworks. The node alignment step essentially acts as a filter,

artificially constraining the search space of conserved modules to putatively homolo-

gous protein pairs. On the other hand, proteins rarely act alone. They interact with

each other to carry out their activities, and these interacting proteins are likely to

evolve with high correlation during the evolution of species (Pazos et al., 1997; Goh

et al., 2000; Mintseris and Weng, 2005). Further, it has been shown recently that such

co-evolution is more evident if we focus our attention on interacting domains that are

responsible for the PPIs (Jothi et al., 2006; Itzhaki et al., 2006; Schuster-Bockler and

Bateman, 2007). Based on these observations, we present DOMAIN, an algorithm

for domain-oriented alignment of interaction networks, that follows an alternative

direct-edge-alignment paradigm. DOMAIN does not explicitly restrict its attention

to putatively homologous proteins. Instead, it directly aligns PPIs across species by

decomposing PPIs in terms of their constituent domain-domain interactions (DDIs)

and looking for conservation of these DDIs.


Figure 4.1: Overview of DOMAIN algorithm. (1) Constructing alignable pairs of edges (APEs). The input of DOMAIN includes two PPI networks and the constituent domains of the proteins. Using this information, DOMAIN calculates species-specific domain-domain interaction (DDI) probabilities, and then identifies a set of APEs across networks. (2) Building an APE graph. An APE graph is a merged representation of the PPI networks, in which each node represents an APE and each edge represents one of four network connectivities connecting two APEs: a) alignment extension, b) node duplication, c) edge indel (insertion/deletion), or d) edge jump. The details of these connectivities are given in section 4.2.2. (3) Searching for high-scoring non-redundant subgraphs within the APE graph. We use a greedy heuristic to carry out this task.

4.2 DOMAIN: a domain-oriented edge-based PPI network aligner

As illustrated in Fig. 4.1, DOMAIN consists of three stages: (1) it constructs a

complete set of alignable pairs of edges (APEs); (2) it builds an APE graph; (3)

it employs a heuristic search to identify conserved protein complexes across species.

The three subsections that follow elaborate upon these three stages.

4.2.1 Constructing and scoring APEs

Domains are the structural and functional units of proteins. Many studies (Deng

et al., 2002; Riley et al., 2005; Bernard et al., 2007) have revealed that direct PPIs

are often mediated by interactions between the constituent domains of the two

interacting proteins. These studies have made two particular assumptions that we adopt

as well: (1) DDIs are independent of each other, and (2) two proteins interact if and

only if at least one pair of domains from the two proteins interacts. These assumptions

allow us to formulate the probability of an interaction between two proteins in terms

of a “noisy-or” over the DDIs that might possibly mediate the interaction between

those two proteins. In our network alignment scenario where we seek to align edges

directly, we additionally assume that a pair of cross-species PPIs can be aligned to

one another only if they are plausibly mediated by at least one common DDI.

We represent the input PPI networks from two species as undirected graphs

G1(V1, E1) and G2(V2, E2), where nodes indicate proteins and edges indicate the

observed PPIs. We first wish to construct a complete set of alignable pairs of edges

(APEs). We say that a pair of edges, e1∈E1 and e2∈E2, is alignable if there exists

a DDI that can plausibly mediate the two PPIs represented by that pair of edges.

We say that a DDI can plausibly mediate a PPI if the corresponding interaction

probability between the two domains is above some value ε > 0. Using a nonzero

value for ε allows us to filter out domains between which there is negligible evidence

of a DDI.

For an edge e∈E1 or E2, we define D(e) to be all the possible interactions between

the constituent domains of the two proteins. Given the species-specific probabilities

of DDIs that mediate PPIs, we can then write the score of an APE c = (e1, e2) using

a “noisy-or” formulation:

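\[
f(c) = \Pr(e_1, e_2 \mid \Theta^1, \Theta^2) = 1 - \prod_{d_{\alpha,\beta} \in D(e_1) \cap D(e_2)} \bigl(1 - g(\theta^1_{\alpha,\beta}, \theta^2_{\alpha,\beta})\bigr) \tag{4.1}
\]

where $d_{\alpha,\beta}$ denotes an interaction between domains $\alpha$ and $\beta$, and $\theta_{\alpha,\beta} = \Pr(d_{\alpha,\beta})$,

and $\Theta = \{\theta_{\alpha,\beta}\}$. The function $g(\theta^1_{\alpha,\beta}, \theta^2_{\alpha,\beta})$ measures the probability of aligning the

PPI $e_1$ to the PPI $e_2$ mediated by interactions between domains $\alpha$ and $\beta$. In this

work, we have chosen to set $g(\theta^1_{\alpha,\beta}, \theta^2_{\alpha,\beta}) = (\theta^1_{\alpha,\beta} \cdot \theta^2_{\alpha,\beta})^{1/2}$.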

As previous authors have also done, to estimate the species-specific DDI proba-

bilities Θ, we applied the EM (expectation-maximization) algorithm of Deng et al.

(2002) for each given network.
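The alignability test and the noisy-or score of Eq. 4.1 can be sketched as below. This is a minimal illustration, not the DOMAIN implementation: the function names are hypothetical, domain pairs are represented as sorted tuples of domain identifiers, and the DDI probabilities for each species are assumed to be available as dictionaries (in this work they come from the EM procedure of Deng et al. (2002), which is not reproduced here). Domain pairs missing from a dictionary are treated as having probability zero.

```python
import math
from itertools import product


def candidate_ddis(domains_a, domains_b):
    """D(e): all unordered domain pairs between the two proteins of an edge."""
    return {tuple(sorted(pair)) for pair in product(domains_a, domains_b)}


def is_alignable(ddis1, ddis2, theta1, theta2, eps):
    """True if some shared DDI plausibly mediates both PPIs (probability > eps)."""
    return any(
        theta1.get(d, 0.0) > eps and theta2.get(d, 0.0) > eps
        for d in ddis1 & ddis2
    )


def ape_score(ddis1, ddis2, theta1, theta2):
    """Noisy-or score f(c) of Eq. 4.1 with g = sqrt(theta1 * theta2)."""
    none_mediates = 1.0
    for d in ddis1 & ddis2:
        g = math.sqrt(theta1.get(d, 0.0) * theta2.get(d, 0.0))
        none_mediates *= 1.0 - g
    return 1.0 - none_mediates
```

For example, with two shared domain pairs whose geometric-mean probabilities are 0.48 and 0.25, the score is 1 - (0.52)(0.75) = 0.61: each additional shared DDI can only raise the score, which is the defining behavior of a noisy-or.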

4.2.2 Building an APE graph

The APE graph is motivated by the evolutionary model of PPI networks suggested

by Berg et al. (2004). The model indicates that PPI networks are shaped primarily

by two kinds of evolutionary events, link dynamics and gene duplication. Link dy-

namics events are primarily caused by sequence mutations of a gene and affect the

connectivities of the protein whose coding sequence undergoes mutations. Gene du-

plication, the second kind of evolutionary event, is often followed by either silencing

of one of the duplicated genes or by functional divergence of the duplicates. From

the perspective of protein domains, a link dynamics event may result from switching

a constituent domain of a protein to another, or a change in a domain’s interaction

partners; a gene duplication event consists of duplication of one protein, followed by

a domain switching or being removed in one or both of the duplicates, or followed

by progressive small changes from point mutations that cause a change in domain

interaction partners.

With this motivation in place, we define an APE graph to be an undirected

weighted graph, where nodes correspond to the APEs identified above, and edges

correspond to one of four evolutionary relationships that we consider between two

APEs, as illustrated in Fig. 4.2 and as listed below:

a. Alignment extension: two APEs are connected if they share two proteins, one

per species.

b. Node duplication: two APEs are connected if they share a protein in one species

and a PPI in the other.

c. Edge indel (insertion/deletion): two APEs are connected if they share a protein

in one species and the graph distance between the two PPIs in the other network

is 1.

d. Edge jump: in this case, all proteins within the two APEs are distinct, but for

each species, the graph distance between the two PPIs in their corresponding

network is 1. We consider this case because our current knowledge of both

PPIs and DDIs is noisy and incomplete. Thus, if there exists a pair of PPIs

that can make two APEs connected in each network, we treat the pair as a

potential APE. Note that some insignificant DDIs (probabilities of DDIs < ε)

are shared in such potential APEs.

Figure 4.2: Four connectivities in an APE graph. The details of these connectivities are given in the text, and the legend of the figure is the same as that given in Fig. 4.1.

Given this definition of an APE graph, we note that every subgraph in an APE graph

corresponds to a network alignment.

Each node in an APE graph contributes the score f(c) of its corresponding APE,

and each edge is scored by a positive number according to its connection relationship.

Using these edge scores, we want to reward alignment extension and penalize both

node duplication and edge indel. Let γa, γb, γc, and γd be the edge scores of alignment

extension, node duplication, edge indel, and edge jump, respectively. We thus need

to assign γa > 1 and γb, γc < 1. Because we neither wish to reward nor penalize an

edge jump, we simply assign γd = 1. For a subgraph Gs(Vs, Es) in an APE graph,

the overall score for its corresponding network alignment is calculated as

\[
S(G_s) = \prod_{e \in E_s} \gamma(e) \cdot \prod_{c \in V_s} f(c) \tag{4.2}
\]

where γ(e) is the edge score for e∈Es, and f(c) is the score of the APE c∈Vs.
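Since Eq. 4.2 is a product of many factors, most of them below one, it is convenient to evaluate it in log space. The sketch below does so for a subgraph given as a list of APE scores and a list of edge types. The numerical edge scores in `GAMMA` are placeholders chosen only to satisfy the constraints stated above (alignment extension above 1, node duplication and edge indel below 1, edge jump equal to 1); they are not the values used in this work, and the function name is hypothetical.

```python
import math

# Placeholder edge scores (assumed for illustration only).
GAMMA = {"extension": 2.0, "duplication": 0.5, "indel": 0.5, "jump": 1.0}


def alignment_log_score(ape_scores, edge_types, gamma=GAMMA):
    """Return log S(G_s) = sum of log gamma(e) + sum of log f(c) (Eq. 4.2)."""
    if any(f <= 0.0 for f in ape_scores):
        return -math.inf  # a zero-scoring APE makes the whole product zero
    edge_term = sum(math.log(gamma[t]) for t in edge_types)
    node_term = sum(math.log(f) for f in ape_scores)
    return edge_term + node_term
```

Exponentiating the result recovers S(G_s); comparing log scores directly gives the same ranking of subgraphs while avoiding floating-point underflow for large alignments.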

4.2.3 Detecting protein complexes

Network alignment methods generally require a search algorithm to detect high-

scoring subgraphs from a single or several weighted graphs. Such tasks are computa-

tionally difficult, so a number of search heuristics have been proposed: for example,

PathBLAST uses randomized dynamic programming to search for conserved

pathways across networks, while NetworkBLAST-E implements a greedy heuristic

to search for conserved protein complexes. As many pairwise network methods aim

to identify conserved protein complexes, for comparative purposes, we devise a greedy

heuristic for finding conserved protein complexes across species.

The heuristic aims to identify high-scoring non-redundant subgraphs from the

resultant APE graph. Specially, exhaustively starting from each APE, we iteratively

79

expand the subgraph by introducing a new APE that increases the alignment score

the most, until any of the following empirical stopping conditions occur: (1) the

number of proteins in either species exceeds an upper limit (we used 15); (2) the

score of the next expanding APE is smaller than a threshold (we used 10−2); (3)

the overall alignment score of the subgraph is smaller than a threshold (we used

10−3); (4) the graph distance of the next expanding APE exceeds an upper limit (we

used 4). At the end, small and redundant subgraphs are removed if the number of

proteins in a subgraph is less than four, or if there exists a higher-scored subgraph

overlapping more than 80% of proteins in either species.
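The expansion step described above can be sketched as follows. This is an illustration under stated assumptions, not the Perl implementation: the APE graph is assumed to be a symmetric adjacency dictionary whose values are hypothetical edge-type labels, graph distance is assumed to be measured in hops from the starting APE, and the final removal of small and redundant subgraphs is not shown.

```python
from collections import deque

# Edge scores with k = 10; the edge-type labels are hypothetical.
GAMMA = {"extension": 10.0, "duplication": 0.1, "indel": 0.1, "jump": 1.0}


def hops(adj, src, dst, limit):
    """Breadth-first hop count from src to dst, or None if it exceeds limit."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        if d < limit:
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
    return None


def species_proteins(members, proteins):
    """Proteins covered in each species; proteins[ape] = (set_a, set_b)."""
    side_a = set().union(*(proteins[m][0] for m in members))
    side_b = set().union(*(proteins[m][1] for m in members))
    return side_a, side_b


def expand(seed, adj, f, proteins, max_proteins=15, min_ape=1e-2,
           min_total=1e-3, max_dist=4):
    """Greedily grow a subgraph from seed; returns (members, score)."""
    members, score = {seed}, f[seed]
    while True:
        frontier = {n for m in members for n in adj[m]} - members
        best, best_gain = None, 0.0
        for cand in frontier:
            # Multiplicative change in Eq. 4.2 if cand joins the subgraph.
            gain = f[cand]
            for m in members:
                if cand in adj[m]:
                    gain *= GAMMA[adj[m][cand]]
            if gain > best_gain:
                best, best_gain = cand, gain
        if best is None:
            break
        grown = members | {best}
        side_a, side_b = species_proteins(grown, proteins)
        if (max(len(side_a), len(side_b)) > max_proteins      # condition (1)
                or f[best] < min_ape                          # condition (2)
                or score * best_gain < min_total              # condition (3)
                or hops(adj, seed, best, max_dist) is None):  # condition (4)
            break
        members, score = grown, score * best_gain
    return members, score
```

Calling `expand` once per APE reproduces the exhaustive starting strategy; the default thresholds are the values quoted in the text.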

4.3 Results

4.3.1 Experimental setup

We compare our method to two extant pairwise network alignment algorithms, Net-

workBLAST and MaWISh. We do not include NetworkBLAST-M and Graemlin in

our comparisons because they mainly focus on alignment of multiple networks, and

because Graemlin requires the unavailable in-house SRINI algorithm (Srinivasan

et al., 2006) to assign weights to PPIs. The ISOrank algorithm aims at resolving a

different problem of aligning networks globally, and NetworkBLAST-E performs sim-

ilarly to NetworkBLAST and is not available online. We thus exclude these methods

from the comparisons as well.

We apply DOMAIN to yeast-fly and yeast-worm PPI networks taken from DIP

(Database of Interacting Proteins, Oct 2008) (Xenarios et al., 2002), as they were

widely used in pairwise network alignment studies as benchmarks. The protein-to-

domain mappings are taken from Pfam (Pfam 23.0) (Finn et al., 2008), and we only

consider high-quality Pfam-A entries. Because not all proteins contain significant

Pfam domains, we generate a so-called “backbone” network, a subnetwork of DIP in

which all proteins contain at least one Pfam-A domain. As summarized in Table 4.1,

78.2% of MIPS annotated proteins and over 70% of GO annotated proteins are

contained in backbone networks. To simplify the setting of the four γ parameters,

we reduced the parameter space to one dimension by insisting that γa = k, γb = γc =

1/k, and γd = 1, for some value of k > 1. We found that DOMAIN was not sensitive

to changes in k. In the results that follow, we used k = 10.

Table 4.1: Summary of backbone networks.

                                        DIP                    Backbone DIP
                               Yeast     Fly    Worm      Yeast     Fly    Worm
# PPIs                        17,528  22,381   4,038     11,426  11,013   2,213
# proteins                     4,928   7,446   2,644      3,300   4,500   1,620
# GO annotated proteins *      4,625   4,477   1,566      3,280   3,253   1,145
# MIPS annotated proteins **   1,100       -       -        860       -       -

*  With respect to biological process annotation of Gene Ontology.
** Excluding MIPS category 550.

4.3.2 DOMAIN outperforms previous methods in most performance metrics

We employ three measures to evaluate the biological significance of the alignments:

sensitivity/specificity, MIPS purity, and GO enrichment. These measures are also

suggested in several other network alignment studies (Hirsh and Sharan, 2007; Dutkowski

and Tiuryn, 2007; Kalaev et al., 2008).

The first two measures use the known yeast protein complexes cataloged in MIPS

(May 2006) (Mewes et al., 2002) as a gold standard. We exclude category 550

(obtained from high-throughput experiments) and only use complexes at level 3 or

lower. As a result, 122 MIPS complexes spanning 519 yeast proteins remain

in the yeast backbone network; 62 of them contain at least 3 proteins, together spanning 438

proteins. For each identified yeast alignment, we try to find a complex from MIPS

that maximizes the hypergeometric score and calculate an empirical enrichment p-

value. The significance level is obtained from sampling 10,000 random sets of proteins

of the same size, and the p-values are corrected for multiple testing using the false

discovery rate (FDR) (Benjamini and Hochberg, 1995). Then, the specificity is

defined as the percent of yeast alignments that have a significant match in MIPS

(p-value < 0.05), and the sensitivity is defined as the percent of MIPS complexes that

have significant matches in the resulting alignments. Moreover, an alignment is called

a pure alignment if it satisfies two conditions: (1) it contains at least three MIPS

annotated proteins and (2) there exists a complex in MIPS that covers more than

75% of its MIPS annotated proteins. We report purity, calculated as the number

of pure alignments divided by the total number of alignments with at least three

MIPS annotated proteins, as an alternative measure of the sensitive identification of

specific complexes.

Table 4.2: Performance comparisons of DOMAIN with NetworkBLAST and MaWISh on yeast-fly backbone networks.

                  # of       # proteins     SPE    SEN    MIPS     GO enrichment
Method         complexes   yeast    fly     (%)    (%)    (%)    yeast (%)  fly (%)
DOMAIN            100       338     313    34.0    9.0    66.7     89.0      78.0
NetworkBLAST       82       299     213    31.7    7.4    40.6     87.8      79.3
MaWISh             54       193     142    18.5    4.1    30.0     75.9      66.7

Table 4.3: Performance comparisons of DOMAIN with NetworkBLAST and MaWISh on yeast-worm backbone networks.

                  # of       # proteins     SPE    SEN    MIPS     GO enrichment
Method         complexes   yeast   worm     (%)    (%)    (%)    yeast (%)  worm (%)
DOMAIN             21        84      63    36.4    3.3    75.0     90.5       9.5
NetworkBLAST       19        82      51     7.7    0.8    60.0     89.5      10.5
MaWISh             11        42      32    11.1    1.6    42.8     63.6       9.1

SPE: specificity; SEN: sensitivity; MIPS: MIPS purity.

GO enrichment measures the functional coherence of the proteins in an identified

alignment with respect to the biological process annotation of GO, for each species

separately. We use the tool GO TermFinder (Boyle et al., 2004) to compute empirical

enrichment p-values, and correct for multiple testing using FDR. For each species, we

report the fraction of process-coherent alignments with p-value < 0.05 (considering

only the alignments with at least one GO annotated protein).
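Both the MIPS and the GO evaluations rely on FDR-corrected p-values followed by a 0.05 cutoff. A minimal sketch of the Benjamini-Hochberg adjustment and of the resulting fraction of significant alignments is given below; the function names are illustrative, and the input p-values are assumed to have been computed beforehand (e.g., by the empirical sampling procedure or by GO TermFinder).

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def fraction_significant(pvalues, alpha=0.05):
    """Fraction of alignments whose adjusted p-value is below alpha."""
    adjusted = benjamini_hochberg(pvalues)
    return sum(p < alpha for p in adjusted) / len(adjusted)
```

With one p-value per alignment, `fraction_significant` corresponds to the specificity or the GO enrichment percentage, depending on which p-values are supplied.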

We chose to set the probability threshold of DDIs ε to the low but nonzero value

of 10⁻²⁰ so as to take into account as much DDI information as possible. For yeast-

fly alignment, DOMAIN generated an APE graph consisting of 6,918 APEs with

47,964 alignment extension links, 24,549 node duplication links, 5,573 edge indel

links, and 1,149 edge jump links; for yeast-worm alignment, it returned a 1,410-

node APE graph with 4,230 alignment extension links, 4,087 node duplication links,

140 edge indel links, and 37 edge jump links. For accurate comparison, we applied

NetworkBLAST and MaWISh on backbone networks with their suggested parameter

settings (see Sharan et al., 2005a; Koyuturk et al., 2006 for details). As summarized

in Tables 4.2 and 4.3, DOMAIN identified more significant non-redundant alignments

than NetworkBLAST and MaWISh in both alignments—explaining the good scores

on the sensitivity metric—but also managed to outperform the other methods on the

specificity and purity metrics. Indeed, it achieved the highest performance on almost

every evaluation metric, and in the instances in which it was bested, the difference

is slight.

The running time of DOMAIN is comparable to NetworkBLAST and MaW-

ISh. DOMAIN is currently implemented in Perl, and its running time on yeast-

fly and yeast-worm backbone networks is less than one minute (Intel Core 2 CPU

[email protected], 2GB RAM). Because the running time is so small, we were able to

exhaustively expand from all APEs. If for some reason we needed to further reduce

computational complexity, we could instead consider an alternative expansion strat-

egy where we would expand only from “seed” APEs. The idea would be that if a

protein complex is conserved in many species, the PPIs in this complex are likely to

be conserved as well, and therefore the corresponding subgraph in the APE graph

should contain many alignment extension links. With this in mind, we could rank

the APEs by counting the number of their surrounding alignment extension links

and select, say, the top 25% as seeds for expansion. We tested this, and the results

were nearly identical to those listed in Tables 4.2 and 4.3, but the running time

for yeast-fly and yeast-worm alignments dropped to 30 and 15 seconds, respectively.

In our case, the running time was not a problem, but it is reassuring that a seed-

based expansion strategy seems to be effective at reducing the running time without

affecting the results.
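The seed-selection idea can be sketched in a few lines. The sketch assumes that "surrounding" alignment extension links means links incident to an APE, and that the APE graph is stored as an adjacency dictionary with hypothetical edge-type labels; neither detail is specified in the text.

```python
import math


def select_seeds(adj, fraction=0.25):
    """Rank APEs by incident alignment-extension links; keep the top fraction."""
    counts = {
        ape: sum(1 for edge_type in nbrs.values() if edge_type == "extension")
        for ape, nbrs in adj.items()
    }
    ranked = sorted(counts, key=lambda ape: counts[ape], reverse=True)
    n_seeds = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:n_seeds]
```

Expansion would then start only from the returned APEs rather than from every node of the APE graph.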

4.3.3 DOMAIN is sensitive at detecting small alignments

DOMAIN is sensitive at detecting small network alignments that might be deemed by

other algorithms to be topologically insignificant. For example, DOMAIN reported

a network alignment between the yeast NEF1 complex and the fly proteins mei-9,

Ercc1, and Xpac with high confidence (Fig. 4.3A). The GO process coherence of

these three fly proteins is significant: nucleotide-excision repair (p-value ≃ 10⁻⁸),

DNA repair (p-value ≃ 10⁻⁶), cellular response to DNA damage stimulus (p-value

≃ 10⁻⁶), etc. However, neither MaWISh nor NetworkBLAST reports any alignment

involving the yeast NEF1 complex. They are likely to miss such alignments because

1) the sequence similarity between RAD10 and Ercc1 is insignificant (BLAST E-value

≃ 10⁻⁸) and may be ignored if using a restrictive BLAST E-value threshold (e.g.,

10⁻¹⁰ suggested in Hirsh and Sharan 2007), and 2) this alignment consists of only

three matched proteins and two conserved interactions, so it may not be sufficiently

topologically significant for some aligners to detect. On the other hand, the DDIs

within this alignment are well-conserved across species (the DDI probabilities of

ERCC4-Rad10 are 1.00 in both species; the DDI probabilities of Rad10-XPA_C are

1.00 and 0.54 in yeast and fly, respectively).

4.3.4 DOMAIN provides a comprehensive means of interpreting alignments

Another advantage of DOMAIN is that often it provides a more comprehensive means

of interpreting the identified network alignments, because protein domains are

directly relevant to function in many cases. For instance, RAD14 and Xpac may play

a similar role in the biological process of nucleotide-excision repair, as they share a

common XPA_C domain. Furthermore, although the XPA_N domain is not reported

as a significant domain for RAD14 in Pfam (E-value = 0.023), the alignment of

yeast RAD14 to fly Xpac suggests that XPA_N is potentially an important functional

domain in RAD14.

Figure 4.3: Evaluation of alignment performance of DOMAIN. (A) DOMAIN is sensitive to small alignments. DOMAIN reports a network alignment between the yeast NEF1 complex (MIPS category 510.180.10.10) and the fly proteins mei-9, Ercc1, and Xpac. The object to the right of the double arrow depicts the corresponding subgraph of this alignment in the APE graph. (B) DOMAIN provides a comprehensive means to interpret network alignments. DOMAIN reports an alignment between 10 yeast proteins and 3 worm proteins that significantly matches the pathway of SNARE interactions in vesicular transport in KEGG. (C) An example of improving network alignment by combining several cross-species pairwise alignments. (Green: yeast proteins; blue: fly proteins; orange: worm proteins.)

Identifying conserved biological pathways across species is another important

application of network alignment. Fig. 4.3B shows an example of an alignment

reported by DOMAIN between 10 yeast proteins and 3 worm proteins, in which 9

yeast proteins (all except NYV1) and all 3 worm proteins are known to be involved

in the pathway of SNARE interactions in vesicular transport in KEGG (Kanehisa

and Goto, 2000).

4.3.5 Performance improves by combining cross-species pairwise alignments

Alignment performance may be further improved by combining several cross-species

pairwise network alignments. Fig. 4.3C shows an example of combining three

alignments taken from the yeast-fly, yeast-worm, and fly-worm network alignments,

respectively. By aligning the yeast and fly networks, DOMAIN detects an alignment between

3 fly proteins (CG8142, RfC3, and RfC40) and 7 yeast proteins, 4 of which

(RFC1-4) belong to the replication factor C complex (MIPS: 410.40.30). As

the yeast replication factor C complex contains 5 proteins (RFC1-5), the F-score¹ is

0.666. Further, we see that 2 worm proteins (F44B9.8 and rFc-2) are aligned to all

3 of these fly proteins in the fly-worm alignment and to 3 of these 7 yeast proteins (RFC2-4)

in the yeast-worm alignment. This three-way alignment suggests that the alignment

between fly proteins CG8142, RfC3, and RfC40 and yeast proteins RFC2-4 is of high

confidence, and the F-score increases to 0.750.

1 F-score is defined as F = 2× (precision× recall)/(precision + recall)
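The two F-scores quoted for the replication factor C example follow directly from this definition, as the short sketch below illustrates. The three yeast proteins outside the complex are not named in the text, so the placeholder identifiers are hypothetical.

```python
def f_score(predicted, reference):
    """F = 2 * precision * recall / (precision + recall) on protein sets."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


rfc_complex = {"RFC1", "RFC2", "RFC3", "RFC4", "RFC5"}

# Yeast-fly alignment: 7 yeast proteins, 4 of them in the complex.
# "other1".."other3" are placeholders for the unnamed proteins.
pairwise = {"RFC1", "RFC2", "RFC3", "RFC4", "other1", "other2", "other3"}

# Three-way consensus: only RFC2-4 remain.
three_way = {"RFC2", "RFC3", "RFC4"}

f_pairwise = f_score(pairwise, rfc_complex)    # precision 4/7, recall 4/5
f_three_way = f_score(three_way, rfc_complex)  # precision 3/3, recall 3/5
```

Here `f_pairwise` evaluates to 2/3, which the text reports truncated as 0.666, and `f_three_way` evaluates to 0.75.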

Table 4.4: Cell-cycle-related functional modules conserved across budding yeast and fruit fly.

Clusters in yeast | Clusters in fly | Enriched GO terms in common | Alignment info
Spr28, Cdc11, Cdc12, Cdc10, Shs1, Cdc3 | Sep4, Sep2, Sep1 | cytokinesis, cell division | conserved proteins that play key roles in septin ring assembly
Htb1, Hta1, Hhf1, Hht1 | Hs2A, His4, cenH3, His2B | chromatin assembly or disassembly, chromatin organization, chromosome organization | conserved histone genes
Clb2, Rts1, Cdc28 | CycD, dMST, PP2A | - | inferred from the yeast cluster, the fly cluster seems to play a role in M phase (e.g., chromosome segregation, spindle assembly)
Clb2, Swe1, Cdc28 | CycD, Cdk5, Cdk4 | regulation of cell cycle process | both clusters play a role in regulation of cell cycle (inferred: M-Cdk)
Clb5, Kin1, Cdc28 | CycD, Cdk5, Cdk2 | regulation of cell cycle, phosphorylation | both clusters play a role in regulation of cell cycle (inferred: S-Cdk)
Ste20, Cdc28, Cln2 | Cdk2, CycG, CG11533 | regulation of cell cycle | both clusters play a role in regulation of cell cycle (inferred: G1/S-Cdk)

4.4 Detecting conserved cell-cycle-related functional modules

In the previous sections, we introduced DOMAIN, a network alignment method

that employs a novel direct-edge-alignment paradigm to detect conserved

functional modules across pairwise protein-protein interaction networks of different species.

We demonstrated that DOMAIN is sensitive at detecting small conserved alignments

across species and that, on the basis of protein functional domains, DOMAIN can also

provide functional information about the resulting alignments. In this section, we

focus on the alignments produced by DOMAIN between the protein-protein interaction

networks of budding yeast and fruit fly, the two largest and most studied networks. As

the major goal of this thesis is to study cell-cycle regulation, we ask

how cell-cycle-related functional modules are conserved across species.

The cell-cycle-related network alignments are listed in Table 4.4.

We identified 6 clusters across the two species that seem to be related to the cell cycle,

including yeast clusters containing Cdc28-Clb2, Cdc28-Clb5, and Cdc28-Cln2, three cyclin-CDK

complexes. The fly clusters aligned to them mostly contain cyclin-dependent

kinases (Cdk2, Cdk4, and Cdk5, as listed in Table 4.4), together with the cyclins CycD and CycG, suggesting

that these fly clusters may also be cyclin-CDK complexes and play key roles

during the cell cycle. These results not only indicate that cyclin-CDK complexes are

highly conserved during evolution (Ubersax et al., 2003), but also demonstrate the

alignment performance of our method.

4.5 Discussions

In this chapter, we described DOMAIN, a domain-oriented pairwise network align-

ment framework. To our knowledge, DOMAIN is the first algorithm to introduce

protein domains into the network alignment problem. Also, DOMAIN uses a novel

direct-edge-alignment paradigm to directly detect equivalent PPI pairs across species

and suggests a new graph representation to merge these equivalent PPI pairs and

their network-evolutionary based relationships into one graph. We tested DOMAIN

to identify conserved protein complexes in the yeast-fly and yeast-worm protein in-

teraction networks, and the experimental results show that DOMAIN exhibits better

performance than two recent pairwise network alignment methods in most perfor-

mance metrics.

Although DOMAIN can be applied only to a subset of proteins with domain map-

pings, we notice that most functionally annotated proteins contain domain structures

and remain in this subset. To further overcome this restriction, we may employ a

larger domain database (e.g., CDD (Marchler-Bauer et al., 2007)), or combine DO-

MAIN with other network aligners. In addition, as the set of defined domains expands

and is refined over time, this will gradually become less of a restriction.

Further directions for research include extending this approach to multiple net-

work alignment and to network querying. Since multiple network alignment requires

more than two networks by definition, we would simply need to devise an appropriate

scoring scheme that can handle more than a pair of alignable PPIs at once, and then

extend the notion of the APE graph accordingly.

The goal of network querying is to identify subnetworks in a given network that

are similar to the query. Typically, the network query is a hypothetical or known

functional module. We may simply treat the query as a small input network and

apply our DOMAIN method directly to it. A more sophisticated approach would

be to devise a sequence-profile-like structure to describe the DDI contents of the

network query, as well as perhaps constructing such structures for the full network

as a one-time expense for many successive queries.

5

Conclusions

The imperfect synchrony of a synchronized population of cells prevents us from di-

rectly using populations to precisely observe the dynamics of processes that occur

in single cells. In this thesis, we mainly present a deconvolution algorithm that ef-

ficiently removes the effects of synchrony loss from population-level measurements.

When applied to recent replicate microarray data, it robustly recovers precise tran-

scription profiles with markedly increased dynamic range and temporal resolution.

Our algorithm is built upon the cloccs framework which models three distinct

asynchrony sources: imperfect synchronization in the initial cell populations, vari-

ance in progression rates of individual cells through the cell cycle, and asymmetric

cell division. It should be explicitly noted that our deconvolution method cannot

assess variability across single cells, which might be interesting, especially for molec-

ular species at very low concentrations where noise plays an important role. Rather,

our method provides a high-resolution view of the transcript levels of the average

single cell; or alternatively, it learns what would be observed if we were to measure

a population of cells that starts and remains in perfect synchrony throughout a time

course.

Our approach has several algorithmic advantages: (1) Our algorithm optimizes

a convex objective function, and thus has a unique global optimum. Mature convex

optimization techniques and implementations enable an optimal solution to be found

efficiently: in practice, we can deconvolve a transcription profile in a few seconds in

MATLAB running on standard hardware. (2) By design, deconvolution algorithms

enhance the features of blurred population-level measurements to sharpen the underlying

signal. However, previous deconvolution methods often end up sharpening noise as

well. We avoid this problem by formulating an objective function that is Bayesian

l1-regularized using a wavelet basis. Such an approach has been used in the signal

and image processing communities, where it has been shown to effectively deblur

signals and images while smoothing away noise (Donoho et al., 1994); to our knowl-

edge, however, wavelet-basis regularization has never been applied in a branching

process context as we require here. The usefulness of this approach is evident, as

about one third of genes had a PTR that decreased after deconvolution, presumably

because the fluctuation in measured transcript levels was due to noise rather than

cell-cycle regulation. For example, after deconvolution, the constitutively expressed

actin gene ACT1 and almost all ribosomal protein genes are essentially flat over

the entire course of the cell cycle (Fig. 3.4). These observations indicate that our

deconvolution algorithm can correctly dampen noise even while sharpening signal.

(3) The extensible design of our convolution kernel approach allows us to learn a

single transcription profile from replicate time-series experiments, leading to more

accurate and robust estimates.

A further advantage of our deconvolution algorithm is that when applying it to

population-level measurements of transcript dynamics across the yeast cell cycle, it

can learn distinct cell-cycle transcription programs for mother and daughter cells,

because we explicitly model them as distinct within the branching process. Our

algorithm identifies 82 genes that appear to be transcribed specifically in daughter

cells, and we anticipate this finding will be useful for studying late mitotic and early

G1 cell-cycle events, as well as cell differentiation in yeast. Moreover, the ability to

distinguish programs for biologically relevant sub-populations is not limited simply to

mother and daughter cells in budding yeast; by modifying the underlying branching

process model, this feature of our deconvolution algorithm could be extended to other

systems, and thereby lead to the identification of transcription programs that occur

only in distinct sub-populations of cells.

Our deconvolved estimates show a significant increase in amplitude of cell-cycle

oscillation for most of the genes measured. Using our results, we established a larger

periodic gene set (nearly twice as large as that identified in Spellman et al. (1998))

that includes about 70% of the periodic genes identified in the previous studies.

Although we do not believe all these genes are exclusively cell-cycle-regulated—for

example, some genes with significant stress-response regulation are included (see

Supplementary Figure S2 for details)—the size of this set suggests that many genes

may exhibit previously unrecognized transcriptional regulation during the cell cycle.

On the other hand, we also noticed that some well-studied cell-cycle-involved genes

like MCM1 and CDT1 are not in our cell-cycle-regulated set (or any previously

established sets, for that matter). One explanation may be that their expression

does not vary during the cell cycle. Another explanation is that their expression is

variable but regulated post-transcriptionally (i.e., we might see fluctuating expression

if we monitored protein abundance, or in the case of kinase targets, abundance of

phosphorylated protein). A more remote third possibility is that these genes may

be transcriptionally regulated, but transcribed at multiple times during a single cell

cycle, possibly because they may play multiple roles; due to convolution effects, the

transcription profiles of such genes would be greatly muddled in a cell population,

and deconvolving them to achieve sufficiently large PTR scores may be difficult,

given the level of noise in microarray experiments.

Although we have demonstrated the usefulness of our algorithm by deconvolving

genome-wide transcription profiles, the algorithm is general and can be used to de-

convolve many other population-level data sources, such as nucleosome occupancy

measurements, protein expression profiles obtained by Western blots, or measure-

ments in organisms other than budding yeast. All the algorithm needs as input

are synchrony measurements from cloccs or some other distribution model (e.g.,

the cell-type distribution model used in Siegal-Gaskins et al. (2009)) and time-series

measurements to be deconvolved.

Future work on cell-cycle deconvolution may proceed in two directions. The

first is to develop a user-friendly application. Because our deconvolution

algorithm is general and can be applied to many different types of cell-cycle data

sources in many organisms, and because it is built upon the cell-cycle distribution model

cloccs, it would be beneficial to develop an application, ideally a web

application, that integrates cloccs and the deconvolution framework. Users would upload

their time-series data as well as the corresponding cell-cycle profiles of biomarkers;

the application would first estimate the cell-cycle parameters using cloccs, and

then use these parameters to deconvolve the given time-series data. Users

could then simply review the resulting deconvolved cell-cycle profiles without knowing any of the

back-end algorithms. The second direction is to extend our deconvolution

framework to other biological processes during the cell cycle. For example, the

transcript levels learned from the deconvolution are the result of mRNA production

and degradation. So if we know the mRNA decay rate at a single-cell level, we can,

at least in principle, accurately estimate the corresponding mRNA production rate,

which is conceptually regulated by the binding of upstream transcription factors.
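To illustrate the idea only, suppose transcript abundance x(t) follows first-order kinetics, dx/dt = p(t) - d·x(t), with a constant decay rate d. This model is an assumption introduced here for the sketch and is not part of the deconvolution framework; under it, the production rate can be recovered from a deconvolved profile by finite differences.

```python
def production_rates(times, levels, decay_rate):
    """Estimate p(t) = dx/dt + d * x(t) from a sampled profile.

    Assumes first-order decay with a constant rate (an illustrative model),
    strictly increasing time points, and at least two samples.
    """
    n = len(times)
    rates = []
    for i in range(n):
        # Central difference in the interior, one-sided at the ends.
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        slope = (levels[hi] - levels[lo]) / (times[hi] - times[lo])
        rates.append(slope + decay_rate * levels[i])
    return rates
```

A flat profile therefore implies a constant production rate equal to the decay rate times the transcript level, whereas a rising profile implies production above that baseline.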

Appendix A

Intrinsic disorder within and flanking the DNA-binding domains of human transcription factors

In this appendix chapter, we present a study of the association between intrinsically

disordered regions (IDRs) and transcription factors (TFs). Using different

computational disorder prediction methods, we investigate the prevalence of IDRs within

DNA-binding domains (DBDs) and in their flanking regions across human TFs. Our

results confirm the hypothesis that the most prevalent DBDs in human TFs exhibit

significant order, but the flanking regions of these DBDs generally exhibit significant

disorder. Most of the work presented in this chapter appeared in Guo et al. (2012b).

A.1 Introduction to intrinsically disordered structures and transcription factors

The function of a protein is encoded in its amino acid sequence (i.e., primary struc-

ture). However, protein activity typically depends on the protein being folded prop-

erly into its component secondary structure elements (e.g., alpha helices, beta sheets)

and the overall, global conformation of the protein (i.e., tertiary structure). Protein

structure can be determined experimentally at high resolution either by X-ray crys-

tallography or by nuclear magnetic resonance (NMR). X-ray crystallography is often

used, but cannot provide information on the conformation of regions that are either

highly dynamic or unstructured in the crystal. NMR can provide information about

flexibility and dynamics in proteins, but this technique is limited to smaller proteins.

Through a combination of structural and biochemical studies, it has become

increasingly appreciated that a protein may not adopt a single, well-defined “struc-

ture”, a term connoting a measure of rigidity. Rather, a protein may sample an

ensemble of global conformations; parts of the protein may be largely constantly

structured across this ensemble, while other parts may be quite variable or flexible

across the ensemble. These latter regions are sometimes termed “intrinsically disor-

dered regions” (IDRs), though they may adopt a more structured conformation upon

interaction with another molecule, whether a protein, DNA, or other ligand (Eliezer,

2009).

Proteins are largely involved in processes related to molecular recognition (e.g.,

binding, signaling, complex formation, enzymatic catalysis), and IDRs may enable

these recognition events either directly (e.g., serving as the recognition domain of

a protein) or indirectly (e.g., serving as a hinge that allows two ordered regions of

a protein to come together to effect recognition). For this reason, IDRs have been

studied rather extensively over the past decade, and a large number of computational

methods have been developed for the prediction of IDRs on the basis of amino acid

sequence, though this remains an imperfect art (see He et al. (2009) for a review).

In this study, we were interested in exploring the role(s) that IDRs might play

in the recognition tasks of transcription factors (TFs) in particular. Computational

explorations have found that IDRs are generally more prevalent in TFs than would

be expected by chance, especially in eukaryotes (Minezaki et al., 2006; Liu et al.,

2006; Fuxreiter et al., 2011). As a specific example, careful molecular studies have

shown that a region of fifteen amino acids within the DNA-binding domain (DBD) of

the estrogen receptor is disordered in solution, and makes contacts with DNA (and

with another ER DBD monomer), as shown in a co-crystal structure of the ER DBD

bound to DNA (Schwabe et al., 1993). Moreover, IDRs outside the homeodomain

DBD have also been found to impact the DNA-binding affinity of the Drosophila TF

Ubx (Liu et al., 2008). In addition, the region N-terminal to the proximal accessory

region of the Saccharomyces cerevisiae C2H2 zinc finger TF Adr1 is disordered in

solution (even after binding DNA) and increases the affinity for non-specific DNA,

mainly by increasing the DNA association rate; increased affinity for non-specific

DNA might allow a protein to find its specific sites more quickly after translocation

from non-specific sites that are bound initially (Schaufler and Klevit, 2003). Finally,

DBDs often have N- or C-terminal extensions, referred to as ‘arms’ or ‘tails’, that

bind DNA but are disordered when free in solution (Crane-Robinson et al., 2006).

Intrigued by this ensemble of findings pointing to the importance of IDRs in TFs

and their interactions with DNA, we sought to explore the connection between IDRs

and TF function more precisely and systematically. We were particularly interested

in determining whether IDRs were more prevalent in the regions flanking the DBDs

that are responsible for the binding of sequence-specific TFs to DNA.

A.2 Materials and Methods

A.2.1 Constructing the TF dataset and the non-TF control dataset

We created two non-redundant datasets of human proteins: a TF set and a non-TF

set for use as a control. The procedure for constructing these sets and ensuring their

non-redundancy is described below and summarized in Figure A.1A.

We assembled the TF set from a published repertoire of human TFs (Vaquerizas

et al., 2009). In their study, Vaquerizas and colleagues manually curated and iden-

tified 1,987 TF-coding human genomic loci in the Ensembl database (Flicek et al.,

2011); the list includes 1,960 high-confidence entries and 27 entries curated as prob-

able. We cross-referenced these Ensembl loci against the RefSeq database (release

47) (Pruitt et al., 2009) to obtain 2,362 protein isoforms associated with 1,747 genes.

To reduce sequence redundancy and thus potential bias, if multiple isoforms were

associated with the same gene, we selected only the longest. This resulted in a final

total of 1,747 unique TF protein sequences, and in subsequent analysis, we call this

our TF set.
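The redundancy-removal step amounts to keeping one isoform per gene. A minimal sketch is shown below; the record layout (gene identifier, isoform identifier, sequence) is a hypothetical simplification of RefSeq entries, and ties in length are resolved in favor of the first isoform encountered, which the text does not specify.

```python
def longest_isoform_per_gene(isoforms):
    """Keep the longest protein isoform for each gene.

    isoforms: iterable of (gene_id, isoform_id, sequence) tuples.
    Returns a dict mapping gene_id to (isoform_id, sequence).
    """
    best = {}
    for gene_id, isoform_id, sequence in isoforms:
        if gene_id not in best or len(sequence) > len(best[gene_id][1]):
            best[gene_id] = (isoform_id, sequence)
    return best
```

Applied to the 2,362 isoforms, a procedure of this kind would return one entry for each of the 1,747 genes.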

We assembled our non-TF control set by downloading all human proteins from

RefSeq, and excluding the 2,362 TF-associated isoforms from above, which yielded a

total of 32,567 non-TF proteins. To match the size and sequence length distribution

of our TF set, we randomly sampled 1,747 proteins from the 32,567 according to

the empirical sequence length distribution of the TF set; to ensure non-redundancy

during this process, at each iteration we required that the sampled protein come

from a locus not previously sampled. Therefore, the resulting control set contains

1,747 unique non-TF protein sequences.
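One way to realize this length-matched sampling is sketched below. The text does not say how lengths are matched, so the fixed-width length bins, the bin width, and the record layout are all assumptions made for illustration; a TF length whose bin has no unused candidate is simply skipped in this sketch.

```python
import random


def sample_length_matched(tf_lengths, candidates, bin_width=50, seed=0):
    """Draw one non-TF protein per TF length from the same length bin.

    candidates: list of (protein_id, locus_id, length) tuples.
    Each locus contributes at most one protein to the control set.
    """
    rng = random.Random(seed)
    bins = {}
    for protein_id, locus_id, length in candidates:
        bins.setdefault(length // bin_width, []).append((protein_id, locus_id))
    for members in bins.values():
        rng.shuffle(members)

    used_loci, chosen = set(), []
    for length in tf_lengths:
        pool = bins.get(length // bin_width, [])
        while pool:
            protein_id, locus_id = pool.pop()
            if locus_id not in used_loci:  # enforce non-redundancy by locus
                used_loci.add(locus_id)
                chosen.append(protein_id)
                break
    return chosen
```

Because candidates are consumed as they are drawn, no protein or locus can appear twice in the resulting control set.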

A.2.2 Comparing the TF and non-TF sets of proteins

To ensure that the non-TF set represents a well-constructed control for the TF set,

we compared various properties of the two sets. First, we compared the sequence

length distributions of the TF set and the non-TF control set, in addition to the

set of all human TFs (i.e., with redundancy). As shown in Fig. A.1B, no apparent

differences exist between the sequence length distributions in the TF set, the non-TF

control set, and the set of all human TFs.

Figure A.1: Generation of the TF set and the non-TF control set. (A) A schematic of the pipeline for generating the TF set and the non-TF control set. (B) Sequence length distributions of the TF set, the non-TF control set, and the set of all human TFs (with redundancy). (C) The amino acid compositions of the TF set, the non-TF control set, and the set of all human TFs (with redundancy). Amino acids are listed from most order-promoting to most disorder-promoting, according to Campen et al. (2008). It is apparent from the histogram that compared to proteins in general, TFs have fewer order-promoting residues (e.g., W, F, Y, I, M, L, V) and more disorder-promoting residues (e.g., P, E, S, K, Q, H).

[Figure A.1 panels. (A) Pipeline: all human TF loci in Ensembl (1,987) → cross-reference → protein entries in RefSeq (2,362) → remove redundancy, 1 isoform per locus → non-redundant (nr) TFs (1,747); all human protein entries in RefSeq (~35K) → remove TFs → non-TF protein entries (~32.6K) → sample w.r.t. sequence length distribution of nr TF set → sampled non-TF ctrl set (1,747). (B) Frequency (%) versus length (amino acids) for all TFs, nr TFs, and non-TF ctrl. (C) Frequency (%) of each amino acid, in the order W F Y I M L V N C T A G R D H Q K S E P, for the same three sets.]

Next, we compared the amino acid compositions of the TF set, the non-TF

control set, and the set of all human TFs (Fig. A.1C). The amino acid composition

of sequences in IDRs has been shown to be significantly different from that in

ordered regions (Dunker et al., 2001), and IDRs have been shown to have high

prevalence in TFs (Liu et al., 2006), so we might expect compositional differences

between the TF sets and the non-TF control set. Indeed, compared to the non-TF

control set, both TF sets are enriched in disorder-promoting amino acids (e.g., P,

E, S, K, Q, H), and depleted in order-promoting amino acids (e.g., W, F, Y, I, M,

L, V) (Dunker et al., 2001; Campen et al., 2008), as expected. However, the amino

acid compositions of our non-redundant TF set and the set of all human TFs are

nearly identical, suggesting that our procedure for removing redundancy introduces

no significant compositional bias.

A.2.3 Identifying DNA-binding domains (DBDs) and their locations within proteins

Our goal is to investigate the prevalence and locations of IDRs within human TFs,

and in particular, the spatial relationships between IDRs and DBDs in TFs. To iden-

tify all sequence-specific DBDs that occur within human TFs, we started with the


entire set of human proteins from RefSeq and identified every Pfam domain (Finn

et al., 2010) that was contained in a human protein with a p-value below 0.05. We

manually filtered for those domains whose text descriptions in the Pfam or Inter-

Pro (Hunter et al., 2009) databases indicated that the domain mediates sequence-

specific DNA binding, resulting in 76 domains which we henceforth call Pfam DBDs.

Using HMMER (Eddy, 2009) with default parameters, we searched for the loca-

tions of matches to Pfam DBDs within our TF set. We found 71 of the 76 Pfam DBDs

matched to proteins in our TF set, with 32 DBDs appearing more than five times.

Of the 1,747 proteins in our TF set, 669 contained only a single DBD, while another

642 contained multiple DBDs; proteins with multiple DBDs are typically those con-

taining multiple zinc fingers, which are annotated as separate domains even if they

occur in tandem within a protein. Indeed, the TF with the highest number of DBDs

is zinc finger protein 91 (RefSeq: NP_003421), which contains 31 zf-C2H2 (zinc

finger, C2H2-type) domains. The zf-C2H2 domain is interesting in its own right as it

is by far the most prevalent domain in our TF set, appearing a total of 4,154 times,

almost 20 times as often as the next most prevalent domain.

A.2.4 Using multiple prediction methods to predict intrinsically disordered regions (IDRs) within proteins

To perform our analysis, we first needed to predict the ordered and disordered

regions within proteins using existing computational tools. Because disorder prediction

remains imperfect, we took care to ensure that our conclusions would not depend unduly

on the predictions of any single method. Consequently, we chose to

use three distinct disorder prediction tools, each demonstrated to perform with high

accuracy (He et al., 2009): PONDR VSL2 (Peng et al., 2006), DISOPRED2 (Ward

et al., 2004), and PreDisorder 1.1 (Deng et al., 2009). PONDR VSL2 was evaluated

as the top-ranked disorder predictor in CASP7 in 2006 (Bordoli et al., 2007), and

PreDisorder was ranked among the top methods in disorder prediction during CASP8 in

2008 and CASP9 in 2010. These methods employ a variety of techniques to analyze

sequence and structural information for IDR prediction: PONDR VSL2 uses support

vector machines (SVMs) to separately address prediction problems in short versus

long sequence regions, and then merges the results using a logistic regression model;

DISOPRED2 is also based on SVMs but, unlike the other prediction methods,

is trained directly on the whole sequence using various

combinations of binary-encoded amino acid sequence, secondary structure predictions,

and sequence profiles; and PreDisorder 1.1 is based on an ab initio prediction

method along with a meta-prediction method.

A.2.5 Defining disorder features: spatial relationships of IDRs relative to DBDs within TFs

Given the annotated DBDs and the predicted disorder regions in the TF set and the

non-TF control set, we sought to systematically analyze the association between TF

DBDs and predicted IDRs by testing for enrichment of IDRs at different locations

relative to DBDs. Specifically, we were interested in IDRs within the DBD itself,

as well as the regions flanking the DBD, and we developed five distinct ‘disorder

features’: we say that a DBD is disordered if at least a fraction f of its residues are

predicted to be disordered; we say that the N-terminal flank of a DBD is disordered

if at least a fraction f of the 30 residues flanking the DBD in the N-terminal direction

are predicted to be disordered; analogously, we say that the C-terminal flank of a

DBD is disordered if at least a fraction f of the 30 residues flanking the DBD in

the C-terminal direction are predicted to be disordered; we say that both flanks of

a DBD are disordered if both the N-terminal and C-terminal flanks are disordered;

and finally, we say that an entire TF is disordered if at least a fraction f of all of its

residues are disordered. We wanted to be fairly stringent in identifying these disorder


features, so that we could focus on those with the highest confidence; therefore, we

chose the value of 0.8 for f .
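The five disorder features defined above can be expressed compactly in code. The sketch below is illustrative only: the function names, the 0-based half-open coordinate convention, and the handling of flanks truncated by a protein terminus are assumptions that the text does not specify.

```python
def fraction_disordered(mask, start, end):
    """Fraction of residues in mask[start:end] predicted disordered (0.0 if empty)."""
    window = mask[max(start, 0):min(end, len(mask))]
    return sum(window) / len(window) if window else 0.0

def disorder_features(mask, dbd_start, dbd_end, f=0.8, flank=30):
    """Evaluate the five disorder features for one DBD occurrence.

    mask: per-residue booleans (True = predicted disordered).
    dbd_start, dbd_end: 0-based half-open DBD coordinates (an assumed convention).
    Assumption: a flank truncated by the protein terminus is scored on the
    residues that exist, and an absent flank counts as not disordered.
    """
    n_flank = fraction_disordered(mask, dbd_start - flank, dbd_start) >= f
    c_flank = fraction_disordered(mask, dbd_end, dbd_end + flank) >= f
    return {
        "dbd": fraction_disordered(mask, dbd_start, dbd_end) >= f,
        "n_flank": n_flank,
        "c_flank": c_flank,
        "both_flanks": n_flank and c_flank,
        "whole_tf": fraction_disordered(mask, 0, len(mask)) >= f,
    }
```

For example, a protein with an ordered 20-residue DBD between two fully disordered 30-residue flanks has disordered flanks but is not disordered as a whole at f = 0.8, since only 75% of its residues are disordered.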

A.2.6 Calculating statistical significance of disorder features

To assess whether the prevalence of disorder features within and flanking DBDs was

unusually high or low, we needed to determine a suitable measure of significance.

Moreover, since different computational tools predict IDRs at different rates, our

significance measure needed to enable the comparison of results across methods,

and not be biased by methods that are systematically more or less likely to predict

disorder within proteins.

We thus developed two different null models to test for the significance of our

disorder features (e.g., disordered DBD, N-terminal flank, or C-terminal flank). The

first null model assumed that each DBD was placed uniformly at random

within its sequence, and was based on the TF set. The second null model made the

same assumption of uniformly random DBD placement within each sequence,

but was based on the non-TF control set. In summary, these two null models, in

which the location of a DBD was chosen uniformly at random, were designed to

test whether the spatial relationships between IDRs and DBDs were statistically

significant or simply occurred by chance.

With each null model providing a baseline expectation for how often a disorder

feature might be found by chance, we could then compute a significance measure

based on the p-value from a hypergeometric distribution (i.e., Fisher’s exact test).

For each disorder feature we considered, we were able to compute two separate p-

values, one for each null model. Consistency of significance across the two different

null models thus gave us some confidence that our results were robust to the specific

choice of null model.
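As a rough illustration of the hypergeometric calculation, the sketch below computes a one-sided upper-tail p-value from four counts. How those counts are assembled from the observed DBD placements and the randomly placed null DBDs is not detailed in the text, so the parameterization here (and the function name `hypergeom_upper_tail`) is an assumption.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: total placements (observed plus null), K: placements showing the feature,
    n: observed placements, k: observed placements showing the feature.
    Assumption: observed and null placements are pooled into one population.
    """
    lo = max(k, n - (N - K), 0)
    hi = min(n, K)
    total = comb(N, n)
    # Exact integer arithmetic; a single division at the end avoids rounding drift.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(lo, hi + 1)) / total
```

Under this one-sided convention, a p-value near 0 indicates that the feature occurs more often than the null model predicts, while a p-value near 1 indicates that it occurs less often.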


A.3 Results

A.3.1 Comparing the three methods to predict IDRs within proteins

We used three different disorder prediction tools to predict IDRs in both the TF set

and the non-TF control set. Though the purpose of this study is to make use of

existing prediction methods and not to evaluate them (which has already been done

by others), it is important to at least have a summary sense of how each method

is performing on our various protein sets. A summary of the results of the three

methods is listed in Table A.1 and shown in Figure A.2. In Table A.1, we calculate

the total percentage of protein residues predicted as disordered by each method,

along with the average length of each predicted IDR. In Figure A.2, we compare the

fraction of each protein’s residues predicted as disordered by each method. The table

and figure reveal that all three methods consistently predict proteins in the TF set

to have more disordered residues, longer IDRs, and a greater fraction of disordered

residues than proteins in the non-TF control set, confirming earlier findings that

IDRs are enriched in TFs.
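The two summary statistics reported in Table A.1 can be computed from per-residue predictions along the following lines. This is a sketch under a stated assumption: an IDR is taken to be any maximal run of consecutive residues predicted as disordered, which may differ from how each tool delimits its regions. The function name `idr_summary` is hypothetical.

```python
def idr_summary(masks):
    """Return (percent of residues in IDRs, mean IDR length) over a set of proteins.

    masks: one list of booleans per protein (True = residue predicted disordered).
    Assumption: an IDR is any maximal run of consecutive disordered residues.
    """
    total = sum(len(m) for m in masks)
    run_lengths = []
    for m in masks:
        run = 0
        for flag in m:
            if flag:
                run += 1
            elif run:
                run_lengths.append(run)  # a run just ended
                run = 0
        if run:
            run_lengths.append(run)  # run extends to the C-terminus
    disordered = sum(run_lengths)
    pct = 100.0 * disordered / total if total else 0.0
    mean_len = disordered / len(run_lengths) if run_lengths else 0.0
    return pct, mean_len
```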

As an aside, it is apparent that PONDR VSL2 is far more likely than the other

two methods to call a residue as disordered, in both the TF set and the non-TF

control set, suggesting that the method is probably operating at a different point

on its receiver operating characteristic (ROC) curve, with high sensitivity but also

perhaps a relatively high false positive rate (Bordoli et al., 2007). In addition, the

average length of IDRs predicted by PONDR VSL2 is higher than that of the other two

methods, which may be related to the previous point, but may also be because the

method uses different SVMs to predict IDRs in short and long sequences separately.


Table A.1: Statistics summarizing disorder predictions on all the residues of all the proteins in both the TF set and the non-TF control set using three different disorder prediction tools.

                   TF set                             non-TF ctrl set
                   % of res. predicted  avg length    % of res. predicted  avg length
                   in IDRs              of IDRs       in IDRs              of IDRs

PONDR VSL2         83.2%                106           53.3%                39
DISOPRED2          47.4%                44            34.1%                36
PreDisorder 1.1    50.1%                19            38.3%                18

A.3.2 IDRs associated with TF DBDs or their flanking regions

To systematically study the associations between IDRs and DBDs, for each occur-

rence of a DBD class within a human TF, we calculated 30 different p-values: the

significance under two different null models (based on the TF set and the non-TF

control set) of five different kinds of disorder features (DBD, N-terminal flank, C-

terminal flank, both flanks, and entire TF) as computed by three different prediction

methods (PONDR VSL2, DISOPRED2, and PreDisorder 1.1). For each combination

of null model and feature, we say that the feature exhibits significant disorder under

that null model if at least two of the three prediction methods predict disorder at

p-value ≤ 0.005; on the other hand, we say that the feature exhibits significant order

under that null model if at least two of the three prediction methods predict disorder

at p-value ≥ 0.995. Note that it is certainly possible for a feature to be neither

significantly ordered nor significantly disordered under a particular null model.
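The two-of-three decision rule above can be written as a small function. The function name, argument names, and the returned labels are illustrative choices; the thresholds are those stated in the text.

```python
def call_feature(p_values, disorder_cutoff=0.005, order_cutoff=0.995, min_votes=2):
    """Label one feature under one null model from the per-method p-values.

    p_values: one p-value per prediction method (three in this study).
    Returns "ID" (significant disorder), "OR" (significant order), or "-".
    """
    disorder_votes = sum(p <= disorder_cutoff for p in p_values)
    order_votes = sum(p >= order_cutoff for p in p_values)
    if disorder_votes >= min_votes:
        return "ID"
    if order_votes >= min_votes:
        return "OR"
    return "-"
```

With three methods and a two-vote requirement, the two outcomes are mutually exclusive, so the order in which they are checked does not matter.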

Although we computed whether features exhibited significant order or disorder

across all Pfam DBDs occurring in our TF set, to avoid artifacts due to small sample

size, we restricted our subsequent analysis to the 32 DBD classes with at least five

occurrences in the TF set. Many of the most frequent DBD classes, including the

10 most prevalent ones, are structurally similar and can be roughly classified into

two groups: (1) those containing zinc fingers, and (2) those containing a basic helix-turn-helix

type of domain, in which helices are separated by loops (e.g.,

Homeobox, HLH, Fork head, Ets). The enrichment analysis results for these 32

DBD classes are listed in Table A.2; at the bottom of the table, we also included the

Pfam domains Basic, AT hook, and P53 (Basic and AT hook are included because

we mention them below in comparison to another study; P53 is a well-studied DBD

included for general interest).

The top 10 most frequently occurring DBD classes in human TFs all exhibit

significant order within the DBD itself, suggesting that structural flexibility within

[Figure A.2. Panels A and B: frequency (%) versus fraction of a protein's residues predicted as disordered (0.0 to 1.0), with one curve each for PONDR VSL2, DISOPRED2, and PreDisorder 1.1.]

Figure A.2: Distributions of the fraction of each protein’s residues predicted as disordered by each method for the proteins in (A) the TF set and (B) the non-TF control set.


these domains is rather limited. Strikingly, our results indicate that although the

DBDs themselves exhibit significant order, the regions flanking the DBDs are likely

to exhibit significant disorder. Only in the case of zf-C2H2 do the flanking regions ex-

hibit significant order (this will be discussed further in the next section). In contrast,

26 of the other 31 DBDs exhibit significant disorder in either the N-terminal flank,

the C-terminal flank, or both; and none of the other 31 DBDs exhibit significant

order in either flank under either null model. This is consistent with prior studies in

which it was found that DBDs are often separated by flexible linker regions, allowing

TFs to bind DNA with fine control over DNA binding affinity (Zhou, 2001; Fukuchi

et al., 2006).

A.3.3 Comparison of prediction methods in DBDs

To further investigate the detailed spatial relationships of the IDR predictions of the

three different methods to protein DBDs, we generated a meta-plot of the average

predicted order/disorder in the vicinity of each Pfam DBD according to each pre-

diction method. To do this, we first identified all occurrences of a Pfam DBD in

the TF set, and then across all those occurrences, calculated the average (mean) or-

der/disorder score predicted by each method at each residue within the DBD match

and both of its flanks (up to 30 amino acids). In cases where a TF contained only a

partial DBD match and not a full domain according to the HMMER alignment, we

considered only the aligned region in our calculations. We normalized the resulting

scores for the purpose of comparison across methods, and for uniformity in scale

across plots for different DBD classes (Fig. A.3).
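The averaging step behind the meta-plots can be sketched as below. The sketch assumes that every occurrence has already been placed on a common coordinate system (N-terminal flank, DBD, C-terminal flank), with `None` marking positions that have no aligned residue; the min-max rescaling is a stand-in, since the normalization actually applied is not specified. Both function names are hypothetical.

```python
def metaplot_profile(aligned_scores):
    """Position-wise mean of disorder scores across DBD occurrences.

    aligned_scores: equal-length lists, one per occurrence, covering the
    N-terminal flank, the DBD, and the C-terminal flank; None marks positions
    with no aligned residue (partial matches or truncated flanks).
    """
    width = len(aligned_scores[0])
    profile = []
    for pos in range(width):
        column = [row[pos] for row in aligned_scores if row[pos] is not None]
        profile.append(sum(column) / len(column) if column else None)
    return profile

def rescale(profile):
    """Min-max rescale to [0, 1]; an assumed normalization, not the one used in the study."""
    values = [v for v in profile if v is not None]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat profile
    return [None if v is None else (v - lo) / span for v in profile]
```

Skipping `None` entries column by column means that partial domain matches contribute only at the positions they actually cover, in the spirit of considering only the aligned region.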

Fig. A.3 displays meta-plots for five of the ten Pfam DBDs most prevalent in

human TFs. Results from DISOPRED2 and PreDisorder 1.1 are fairly consistent

across all five domain classes. Moreover, all three methods are in good agreement in

zf-C2HC and demonstrate similar prediction trends in zf-C4, Homeobox, and HLH.


Table A.2: Enrichment analysis of significantly occurring ordered and disordered regions within and flanking human TF DBDs.

                                average number  DBD             N-terminal      C-terminal      both flanks     whole TF
                                DBD     of DBDs                 flank           flank                           sequence
     DBD      TF family         length  in TFs  TF    non-TF    TF    non-TF    TF    non-TF    TF    non-TF    TF    non-TF
No.  (Pfam)                     (res.)          set   ctrl set  set   ctrl set  set   ctrl set  set   ctrl set  set   ctrl set

1    PF00096  zf-C2H2           23.1    4154    OR    OR        OR    OR        OR    OR        OR    OR        OR    OR
2    PF00046  Homeobox          56.3    216     OR    OR        ID    ID        ID    ID        ID    ID        ID    ID
3    PF00010  HLH               53.3    100     OR    OR        ID    ID        ID    ID        ID    ID        –     ID
4    PF00505  HMG box           68.0    56      OR    OR        ID    ID        ID    ID        ID    ID        –     ID
5    PF00250  Fork head         98.2    47      OR    OR        ID    ID        ID    ID        ID    ID        ID    ID
6    PF00105  zf-C4             70.2    45      OR    OR        ID    ID        ID    ID        ID    ID        OR    –
7    PF00249  Myb DNA-binding   47.2    43      OR    OR        –     –         –     ID        –     –         –     –
8    PF00170  bZIP 1            64.2    34      OR    OR        ID    ID        –     –         ID    ID        ID    ID
9    PF00178  Ets               85.0    27      OR    OR        ID    ID        ID    ID        ID    ID        –     –
10   PF00320  GATA              35.1    20      –     OR        ID    ID        ID    ID        ID    ID        –     ID
11   PF00907  T-box             187.6   18      –     –         ID    ID        ID    ID        ID    ID        –     –
12   PF01530  zf-C2HC           31.0    14      ID    ID        ID    ID        ID    ID        ID    ID        ID    ID
13   PF02319  E2F TDP           73.8    13      –     –         ID    ID        –     –         –     ID        –     –
14   PF00313  CSD               68.6    12      –     –         –     –         –     –         –     –         –     –
15   PF05485  THAP              89.2    12      –     –         ID    ID        ID    ID        ID    ID        –     –
16   PF01422  zf-NF-X1          21.5    11      –     –         –     –         –     –         –     –         –     –
17   PF03165  MH1               109.9   11      –     –         ID    ID        ID    ID        ID    ID        –     –
18   PF07716  bZIP 2            54.0    10      –     –         ID    ID        –     –         ID    ID        –     –
19   PF00292  PAX               125.6   9       –     –         ID    ID        ID    ID        ID    ID        –     –
20   PF00098  zf-CCHC           17.9    8       –     –         –     ID        –     ID        –     ID        –     –
21   PF00808  CBFD NFYB HMF     63.1    8       –     –         –     ID        –     –         –     –         –     –
22   PF04218  CENP-B N          52.5    8       –     –         –     –         –     –         –     –         –     –
23   PF00751  DM                47.0    7       –     –         ID    ID        ID    ID        ID    ID        –     ID
24   PF01342  SAND              79.0    7       –     –         ID    ID        ID    ID        ID    ID        –     –
25   PF02257  RFX DNA binding   72.7    7       –     –         ID    ID        –     –         –     –         –     –
26   PF02864  STAT bind         251.9   7       –     –         –     –         –     –         –     –         –     –
27   PF02892  zf-BED            50.1    7       –     –         ID    ID        ID    ID        ID    ID        –     –
28   PF10401  IRF-3             174.0   7       –     –         –     –         –     –         –     ID        –     –
29   PF00447  HSF DNA-bind      104.2   6       –     –         –     –         ID    ID        ID    ID        –     –
30   PF04516  CP2               227.2   6       –     –         –     –         –     –         –     –         –     –
31   PF03299  TF AP-2           208.2   5       –     –         –     ID        –     –         ID    ID        –     –
32   PF05044  Prox1             224.0   5       –     –         ID    ID        –     –         ID    ID        –     –

     PF01586  Basic             91.0    4       ID    ID        ID    ID        ID    ID        ID    ID        ID    ID
     PF00870  P53               196.3   3       –     –         ID    ID        ID    ID        ID    ID        –     –
     PF02178  AT hook           109.0   1       ID    ID        ID    ID        ID    ID        ID    ID        ID    ID

Notes: The DBDs with at least 5 occurrences in the TF set are listed in the table, together with Basic, P53, and AT hook. ID indicates significant disordered DBDs, DBD flanks, or TFs (in at least two of three methods, p-value ≤ 0.005). OR indicates significant ordered DBDs, DBD flanks, or TFs (in at least two of three methods, p-value ≤ 0.005). A dash (–) indicates entries that are neither significantly ordered nor significantly disordered. The DBDs with fewer than 5 occurrences in the TF set include: Runt, TEA, Basic, HALZ, z-alpha, FYRN, FYRC, P53, ARID, DMA, AKAP95, GATA-N, P53 tetramer, Homez, XPA N, zf-DHHC, GCM, CG-1, Vert HS TF, SIM C, Rad51, HAND, Beta-trefoil, LAG1-DNAbind, PWI, zf-MYND, SAP, GCR, Oest recep, Prog receptor, zf-TRAF, zf-CHY, Vert IL3-reg TF, HSA, Rio2 N, Brk DBD, zf-RAG1, AT hook, and TMF DNA bd.


Extending this comparison to all the DBDs listed in Table A.2, over 67.2% of the DBD classes that

are found to exhibit either significant disorder or significant order are identified as

such by all three methods.

Nevertheless, some discrepancies in the results from the different methods are

evident, such as zf-C2H2. The C2H2-type zinc finger domain is the most prevalent

DBD class found in metazoan TFs, including in human (Tupler et al., 2001). It is

also one of the most highly ordered DBDs; however, the linker regions between these

C2H2 zinc finger domains are often disordered (Pabo et al., 2001). As shown in Fig-

ure A.3A, PONDR VSL2 reports that the C2H2 domain occurrences in human TFs

exhibit significant disorder in both the C2H2 domain itself and the adjacent N- and

C-terminal flanks; however, DISOPRED2 and PreDisorder both report the opposite,

namely that zf-C2H2 and its flanks exhibit significant order. Liu et al. (2006) care-

fully analyzed the difficulties of predicting intrinsic disorder in the zf-C2H2 domains

and their linker regions. They concluded that because many linker regions between

C2H2 zinc fingers are quite short, the windowing procedures employed by some IDR

prediction algorithms prevent them from being detected as disordered; the result is

an artifact in which linker regions between C2H2 zinc fingers are over-predicted as

being ordered.

A.3.4 Summary descriptions for some of the most prevalent DBD classes found in human TFs

Zinc fingers

Zinc fingers are small structural motifs whose folds are stabilized by coordination

of one or more zinc ions. Zinc fingers can be classified according to their zinc-

coordinating residues and folds. In Fig. A.3A-C, we show our IDR prediction results

for the three major zinc finger domain classes found in human TFs: zf-C2H2 (the

most prevalent DBD class in human TFs), zf-C4 (also referred to as nuclear recep-


[Figure A.3. Panels (A) zf-C2H2, (B) zf-C2HC, (C) zf-C4, (D), and (E): average predicted score on a vertical axis running between disordered and ordered, plotted across the N-term flank, the DBD, and the C-term flank, with one curve each for PONDR VSL2, DISOPRED2, and PreDisorder 1.1.]

Figure A.3: Shown are meta-plots for five prevalent DBDs in human TFs. (A) zinc-finger C2H2-type (length: ∼23 amino acids), (B) zinc-finger C2HC-type (length: ∼31 amino acids), (C) zinc-finger C4-type (length: ∼70 amino acids), (D) homeodomain fold (length: ∼58 amino acids), and (E) helix-loop-helix (length: ∼53 amino acids).

tors), and zf-C2HC. Although all three classes contain zinc fingers, we find variability

in their regions of order and disorder. As discussed above, the C2H2 zinc finger do-

main is itself believed to be highly ordered, with individual ordered zinc fingers

separated by highly flexible linker regions (Pabo et al., 2001). We find that the C4

domain exhibits significant order within the DBD itself, but significant disorder in

flanking regions. In contrast, we find that the C2HC domain exhibits significant

disorder in both the DBD and flanking regions.

Homeobox

Homeobox (homeodomain fold) is the second-most abundant DBD class within hu-

man TFs. The homeodomain fold consists of an approximately 60 amino acid helix-

turn-helix structure in which three alpha helices are connected by short loop regions.

Our results (Fig. A.3D) extend the results of a prior study (Liu et al., 2008) that


found multiple intrinsically disordered sequences located outside the homeodomain

DBD of the Drosophila TF Ubx, which allow Hox family members (i.e., a subclass

of TFs with Homeobox DBDs) to bind DNA with high affinity but relatively low

specificity (Gehring et al., 1994; Hoey and Levine, 1988).

HLH

HLH (basic helix-loop-helix) is the third-most abundant DBD class within human

TFs, and is characterized by two α-helices connected by a loop. TFs that have this

domain typically bind DNA as either homo- or hetero-dimers, with each monomer

contacting DNA through a helix containing basic residues that facilitate DNA bind-

ing (Littlewood and Evan, 1995). As shown in Fig. A.3E, all three methods report

that HLH exhibits significant order within the domain itself, but significant disorder

in both the N- and C-terminal flanking regions. Our results also indicate that a short

but highly disordered region may frequently occur in the middle of the HLH domain,

consistent with prior observations that the linker regions and the loop region of HLH

proteins are of higher flexibility, allowing dimerization by folding and packing one

smaller helix against the other one (Littlewood and Evan, 1995).

A.4 Discussion

In this study, we used three different computational disorder prediction methods to

investigate the prevalence of IDRs within DBDs and in their flanking regions across

essentially the entire repertoire of human, sequence-specific TFs and their associated

Pfam DBDs. Our choice of multiple prediction methods was motivated by a desire

to be able to draw robust conclusions that were not dependent on any one particular

method.

Previously it was found that TFs are enriched for IDRs (Liu et al., 2006; Minezaki

et al., 2006). At the same time, DBDs responsible for TF binding did not seem


themselves to be particularly enriched for IDRs. For example, of the 25 DBDs studied

in Liu et al. (2006), only the Basic and AT hook domains exhibited high amounts

of disorder; however, those domains are not particularly prevalent in human TFs,

occurring just four times and once in our TF set, respectively.1 We were intrigued

by the possibility that the enrichment of IDRs observed in TFs might be at least

partly due to disorder in the regions flanking DBDs; under such a hypothesis, DBDs

can be thought of as islands of order flanked by regions of disorder.

Our results support exactly such a hypothesis: the most prevalent DBDs in hu-

man TFs exhibit significant order, but the flanking regions of these DBDs generally

exhibit significant disorder. Similarly, among DBDs of intermediate prevalence (oc-

curring between 5 and 20 times in our TF set), although they do not appear often

enough to exhibit either significant order or disorder within the domains themselves,

most of them still exhibit significant disorder in one or both flanking regions.

The functional role played by the significant prevalence of disorder in the regions

flanking DBDs of human TFs is unclear. However, we can speculate that the in-

creased flexibility afforded by these flanking IDRs might contribute to the ability of

TFs to 1) recognize target sequences in the DNA appropriately, 2) bind to a wider

diversity of DNA target sequences, 3) be anchored with higher affinity to the DNA

after recognizing target sequences, 4) bind to other factors and complexes positioned

on the DNA or involved in transcriptional regulation, or 5) present activation do-

mains to downstream transcriptional regulatory machinery. It should be emphasized

that these possibilities are speculative; however, the results of this study suggest nu-

merous testable hypotheses regarding the roles of N- and C-terminal regions flanking

DBDs for many frequently occurring DBDs in hundreds of human TFs. For example,

the importance of the predicted disorder in these flanking regions in determining or

1 Though they do not occur often, where they do occur, they exhibit significant disorder in our results as well, corroborating the results in Liu et al. (2006); see Table A.2.


modulating the DNA binding affinity and/or specificity of the associated TFs could

be investigated with protein binding microarrays (PBMs) (Mukherjee et al., 2004;

Berger et al., 2008). PBMs could assay the affinity and/or specificity of proteins

representing the DBDs with their flanking regions, as compared to either the DBDs

alone or the DBDs with mutant flanking regions predicted not to be significantly dis-

ordered. If found to contribute to the DNA binding affinity and/or specificity of TFs,

IDRs that flank DBDs would broaden the scope of functional domains to be consid-

ered when evaluating the potential impact of mutations or natural polymorphisms

within exomes, such as in medical sequencing projects.

This study was focused on human TFs; however, since these DBD classes are the

predominant DBD classes not just in human TFs but throughout eukaryotes, the

results of this study may have important implications for studies of TFs across all

eukaryotes.


Bibliography

Aikawa, E., Nahrendorf, M., Sosnovik, D., Lok, V. M., Jaffer, F. A., Aikawa, M., and Weissleder, R. (2007), “Multimodality molecular imaging identifies proteolytic and osteogenic activities in early aortic valve disease,” Circulation, 115, 377–386.

Alberts, B., Johnson, A., Lewis, J., Roberts, K., and Walter, P. (2007), Molecular Biology of the Cell, 5th Edition, Garland Science.

Amon, A. (2002), “Synchronization procedures,” Meth. Enzymol., 351, 457–467.

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., and Sherlock, G. (2000), “Gene Ontology: Tool for the unification of biology. The Gene Ontology Consortium,” Nat. Genet., 25, 25–29.

Bakkenist, C. J. and Kastan, M. B. (2003), “DNA damage activates ATM through intermolecular autophosphorylation and dimer dissociation,” Nature, 421, 499–506.

Bar-Joseph, Z., Farkash, S., Gifford, D. K., Simon, I., and Rosenfeld, R. (2004), “Deconvolving cell cycle expression data with complementary information,” Bioinformatics, 20 Suppl 1, 23–30.

Bean, J. M., Siggia, E. D., and Cross, F. R. (2006), “Coherence and timing of cell cycle start examined at single-cell resolution,” Mol. Cell, 21, 3–14.

Bell, S. P. and Dutta, A. (2002), “DNA replication in eukaryotic cells,” Annu. Rev. Biochem., 71, 333–374.

Benjamini, Y. and Hochberg, Y. (1995), “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 289–300.

Berg, J., Lassig, M., and Wagner, A. (2004), “Structure and evolution of protein interaction networks: a statistical model for link dynamics and gene duplications,” BMC Evol. Biol., 4, 51.


Berger, M. F., Badis, G., Gehrke, A. R., Talukder, S., Philippakis, A. A., Pena-Castillo, L., Alleyne, T. M., Mnaimneh, S., Botvinnik, O. B., Chan, E. T., Khalid, F., Zhang, W., Newburger, D., Jaeger, S. A., Morris, Q. D., Bulyk, M. L., and Hughes, T. R. (2008), “Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences,” Cell, 133, 1266–1276.

Bernard, A., Vaughn, D., and Hartemink, A. (2007), “Reconstructing the topology of protein complexes,” in Research in Computational Molecular Biology, pp. 32–46, Springer.

Bi, E., Maddox, P., Lew, D. J., Salmon, E. D., McMillan, J. N., Yeh, E., and Pringle, J. R. (1998), “Involvement of an actomyosin contractile ring in Saccharomyces cerevisiae cytokinesis,” J. Cell Biol., 142, 1301–1312.

Bloom, J. and Cross, F. R. (2007), “Multiple levels of cyclin specificity in cell-cycle control,” Nat. Rev. Mol. Cell Biol., 8, 149–160.

Bordoli, L., Kiefer, F., and Schwede, T. (2007), “Assessment of disorder predictions in CASP7,” Proteins, 69 Suppl 8, 129–136.

Boyle, E. I., Weng, S., Gollub, J., Jin, H., Botstein, D., Cherry, J. M., and Sherlock, G. (2004), “GO::TermFinder–open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes,” Bioinformatics, 20, 3710–3715.

Burrus, C., Gopinath, R., and Guo, H. (1998), Introduction to wavelets and wavelet transforms: a primer, Prentice Hall.

Bustin, M., Catez, F., and Lim, J. H. (2005), “The dynamics of histone H1 function in chromatin,” Mol. Cell, 17, 617–620.

Campen, A., Williams, R. M., Brown, C. J., Meng, J., Uversky, V. N., and Dunker, A. K. (2008), “TOP-IDP-scale: a new amino acid scale measuring propensity for intrinsic disorder,” Protein Pept. Lett., 15, 956–963.

Chen, K. C., Calzone, L., Csikasz-Nagy, A., Cross, F. R., Novak, B., and Tyson, J. J. (2004), “Integrative analysis of cell cycle control in budding yeast,” Mol. Biol. Cell, 15, 3841–3862.

Cho, R. J., Campbell, M. J., Winzeler, E. A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T. G., Gabrielian, A. E., Landsman, D., Lockhart, D. J., and Davis, R. W. (1998), “A genome-wide transcriptional analysis of the mitotic cell cycle,” Mol. Cell, 2, 65–73.

Colman-Lerner, A., Chin, T. E., and Brent, R. (2001), “Yeast Cbk1 and Mob2 activate daughter-specific genetic programs to induce asymmetric cell fates,” Cell, 107, 739–750.


Cosma, M. P. (2004), “Daughter-specific repression of Saccharomyces cerevisiae HO:Ash1 is the commander,” EMBO Rep., 5, 953–957.

Crane-Robinson, C., Dragan, A. I., and Privalov, P. L. (2006), “The extended armsof DNA-binding domains: a tale of tails,” Trends Biochem. Sci., 31, 547–552.

Cross, F. R. (2003), “Two redundant oscillatory mechanisms in the yeast cell cycle,” Dev. Cell, 4, 741–752.

Daubechies, I. (1992), Ten lectures on wavelets, vol. 61, Society for Industrial Mathematics.

de Lichtenberg, U., Jensen, L. J., Fausboll, A., Jensen, T. S., Bork, P., and Brunak, S. (2005), “Comparison of computational methods for the identification of cell cycle-regulated genes,” Bioinformatics, 21, 1164–1171.

Deng, M., Mehta, S., Sun, F., and Chen, T. (2002), “Inferring domain-domain interactions from protein-protein interactions,” Genome Res., 12, 1540–1548.

Deng, X., Eickholt, J., and Cheng, J. (2009), “PreDisorder: ab initio sequence-based prediction of protein disordered regions,” BMC Bioinformatics, 10, 436.

Di Talia, S., Skotheim, J. M., Bean, J. M., Siggia, E. D., and Cross, F. R. (2007), “The effects of molecular noise and size control on variability in the budding yeast cell cycle,” Nature, 448, 947–951.

Di Talia, S., Wang, H., Skotheim, J. M., Rosebrock, A. P., Futcher, B., and Cross, F. R. (2009), “Daughter-specific transcription factors regulate cell size control in budding yeast,” PLoS Biol., 7, e1000221.

Dickinson, M. E. (2006), “Multimodal imaging of mouse development: tools for the postgenomic era,” Dev. Dyn., 235, 2386–2400.

Donoho, D. and Johnstone, I. M. (1994), “Ideal spatial adaptation by wavelet shrinkage,” Biometrika, 81, 425–455.

Doolin, M. T., Johnson, A. L., Johnston, L. H., and Butler, G. (2001), “Overlapping and distinct roles of the duplicated yeast transcription factors Ace2p and Swi5p,” Mol. Microbiol., 40, 422–432.

Dunker, A. K., Lawson, J. D., Brown, C. J., Williams, R. M., Romero, P., Oh, J. S., Oldfield, C. J., Campen, A. M., Ratliff, C. M., Hipps, K. W., Ausio, J., Nissen, M. S., Reeves, R., Kang, C., Kissinger, C. R., Bailey, R. W., Griswold, M. D., Chiu, W., Garner, E. C., and Obradovic, Z. (2001), “Intrinsically disordered protein,” J. Mol. Graph. Model., 19, 26–59.

Dutkowski, J. and Tiuryn, J. (2007), “Identification of functional modules from conserved ancestral protein-protein interactions,” Bioinformatics, 23, i149–158.

Eddy, S. R. (2009), “A new generation of homology search tools based on probabilistic inference,” Genome Inform, 23, 205–211.

Eliezer, D. (2009), “Biophysical characterization of intrinsically disordered proteins,” Curr. Opin. Struct. Biol., 19, 23–30.

Finn, R. D., Tate, J., Mistry, J., Coggill, P. C., Sammut, S. J., Hotz, H. R., Ceric, G., Forslund, K., Eddy, S. R., Sonnhammer, E. L., and Bateman, A. (2008), “The Pfam protein families database,” Nucleic Acids Res., 36, D281–288.

Finn, R. D., Mistry, J., Tate, J., Coggill, P., Heger, A., Pollington, J. E., Gavin, O. L., Gunasekaran, P., Ceric, G., Forslund, K., Holm, L., Sonnhammer, E. L., Eddy, S. R., and Bateman, A. (2010), “The Pfam protein families database,” Nucleic Acids Res., 38, D211–222.

Flannick, J., Novak, A., Srinivasan, B. S., McAdams, H. H., and Batzoglou, S. (2006), “Graemlin: general and robust alignment of multiple large interaction networks,” Genome Res., 16, 1169–1181.

Flicek, P., Amode, M. R., Barrell, D., Beal, K., Brent, S., Chen, Y., Clapham, P., Coates, G., Fairley, S., Fitzgerald, S., Gordon, L., Hendrix, M., Hourlier, T., Johnson, N., Kahari, A., Keefe, D., Keenan, S., Kinsella, R., Kokocinski, F., Kulesha, E., Larsson, P., Longden, I., McLaren, W., Overduin, B., Pritchard, B., Riat, H. S., Rios, D., Ritchie, G. R., Ruffier, M., Schuster, M., Sobral, D., Spudich, G., Tang, Y. A., Trevanion, S., Vandrovcova, J., Vilella, A. J., White, S., Wilder, S. P., Zadissa, A., Zamora, J., Aken, B. L., Birney, E., Cunningham, F., Dunham, I., Durbin, R., Fernandez-Suarez, X. M., Herrero, J., Hubbard, T. J., Parker, A., Proctor, G., Vogel, J., and Searle, S. M. (2011), “Ensembl 2011,” Nucleic Acids Res., 39, D800–806.

Forsburg, S. L. and Nurse, P. (1991), “Cell cycle regulation in the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe,” Annu. Rev. Cell Biol., 7, 227–256.

Fukuchi, S., Homma, K., Minezaki, Y., and Nishikawa, K. (2006), “Intrinsically disordered loops inserted into the structural domains of human proteins,” J. Mol. Biol., 355, 845–857.

Futcher, B. (1999), “Cell cycle synchronization,” Methods Cell Sci, 21, 79–86.

Futcher, B. (2002), “Transcriptional regulatory networks and the yeast cell cycle,” Curr. Opin. Cell Biol., 14, 676–683.

Fuxreiter, M., Simon, I., and Bondos, S. (2011), “Dynamic protein-DNA recognition: beyond what can be seen,” Trends Biochem. Sci., 36, 415–423.

Gehring, W. J., Qian, Y. Q., Billeter, M., Furukubo-Tokunaga, K., Schier, A. F., Resendez-Perez, D., Affolter, M., Otting, G., and Wuthrich, K. (1994), “Homeodomain-DNA recognition,” Cell, 78, 211–223.

Goh, C. S., Bogan, A. A., Joachimiak, M., Walther, D., and Cohen, F. E. (2000), “Co-evolution of proteins with their interaction partners,” J. Mol. Biol., 299, 283–293.

Granovskaia, M. V., Jensen, L. J., Ritchie, M. E., Toedling, J., Ning, Y., Bork, P., Huber, W., and Steinmetz, L. M. (2010), “High-resolution transcription atlas of the mitotic cell cycle in budding yeast,” Genome Biol., 11, R24.

Grant, M. and Boyd, S. (2008), Graph implementations for nonsmooth convex programs, Lecture Notes in Control and Information Sciences, Springer-Verlag Limited.

Grant, M. and Boyd, S. (2010), “CVX: Matlab Software for Disciplined Convex Programming, version 1.21,” http://cvxr.com/cvx.

Guo, X. and Hartemink, A. (2009), “Domain-oriented edge-based alignment of protein interaction networks,” Bioinformatics, 25, i240–246.

Guo, X., Bernard, A., Orlando, D. A., Haase, S. B., and Hartemink, A. (2012a), “Branching process deconvolution algorithm reveals a detailed cell-cycle transcriptional program,” submitted.

Guo, X., Bulyk, M. L., and Hartemink, A. J. (2012b), “Intrinsic disorder within and flanking the DNA-binding domains of human transcription factors,” in Pacific Symposium on Biocomputing, p. 104.

Haar, A. (1910), “Zur Theorie der orthogonalen Funktionensysteme,” Mathematische Annalen, 69, 331–371.

Haase, S. B. and Reed, S. I. (1999), “Evidence that a free-running oscillator drives G1 events in the budding yeast cell cycle,” Nature, 401, 394–397.

Haase, S. B. and Reed, S. I. (2002), “Improved flow cytometric analysis of the budding yeast cell cycle,” Cell Cycle, 1, 132–136.

Hanlon, S. E., Rizzo, J. M., Tatomer, D. C., Lieb, J. D., and Buck, M. J. (2011), “The Stress Response Factors Yap6, Cin5, Phd1, and Skn7 Direct Targeting of the Conserved Co-Repressor Tup1-Ssn6 in S. cerevisiae,” PLoS ONE, 6, e19060.

Hansen, P. (1992), “Analysis of discrete ill-posed problems by means of the L-curve,” SIAM Review, 34, 561–580.

Harder, N., Mora-Bermudez, F., Godinez, W. J., Ellenberg, J., Eils, R., and Rohr, K. (2006), “Automated analysis of the mitotic phases of human cells in 3D fluorescence microscopy image sequences,” Med Image Comput Comput Assist Interv, 9, 840–848.

Hartwell, L. H. and Unger, M. W. (1977), “Unequal division in Saccharomyces cerevisiae and its implications for the control of cell division,” J. Cell Biol., 75, 422–435.

He, B., Wang, K., Liu, Y., Xue, B., Uversky, V. N., and Dunker, A. K. (2009), “Predicting intrinsic disorder in proteins: an overview,” Cell Res., 19, 929–949.

Hereford, L. M., Osley, M. A., Ludwig, T. R., and McLaughlin, C. S. (1981), “Cell-cycle regulation of yeast histone mRNA,” Cell, 24, 367–375.

Hirsh, E. and Sharan, R. (2007), “Identification of conserved protein complexes based on a model of protein network evolution,” Bioinformatics, 23, e170–176.

Hoey, T. and Levine, M. (1988), “Divergent homeo box proteins recognize similar DNA sequences in Drosophila,” Nature, 332, 858–861.

Hunter, S., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L., Duquenne, L., Finn, R. D., Gough, J., Haft, D., Hulo, N., Kahn, D., Kelly, E., Laugraud, A., Letunic, I., Lonsdale, D., Lopez, R., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Mulder, N., Natale, D., Orengo, C., Quinn, A. F., Selengut, J. D., Sigrist, C. J., Thimma, M., Thomas, P. D., Valentin, F., Wilson, D., Wu, C. H., and Yeats, C. (2009), “InterPro: the integrative protein signature database,” Nucleic Acids Res., 37, D211–215.

Itzhaki, Z., Akiva, E., Altuvia, Y., and Margalit, H. (2006), “Evolutionary conservation of domain-domain interactions,” Genome Biol., 7, R125.

Jansen, M. (2001), Noise reduction by wavelet thresholding, Lecture Notes in Statistics, Springer-Verlag.

Jorgensen, P. and Tyers, M. (2004), “How cells coordinate growth and division,” Curr. Biol., 14, R1014–1027.

Jothi, R., Cherukuri, P. F., Tasneem, A., and Przytycka, T. M. (2006), “Co-evolutionary analysis of domains in interacting proteins reveals insights into domain-domain interactions mediating protein-protein interactions,” J. Mol. Biol., 362, 861–875.

Kalaev, M., Bafna, V., and Sharan, R. (2008), “Fast and accurate alignment of multiple protein networks,” in Research in Computational Molecular Biology, pp. 246–256, Springer.

Kamakaka, R. T. and Biggins, S. (2005), “Histone variants: Deviants?” Genes Dev., 19, 295–310.

Kanehisa, M. and Goto, S. (2000), “KEGG: kyoto encyclopedia of genes and genomes,” Nucleic Acids Res., 28, 27–30.

Kelley, B. P., Sharan, R., Karp, R. M., Sittler, T., Root, D. E., Stockwell, B. R., and Ideker, T. (2003), “Conserved pathways within bacteria and yeast as revealed by global protein network alignment,” Proc. Natl. Acad. Sci. U.S.A., 100, 11394–11399.

Koyuturk, M., Kim, Y., Topkara, U., Subramaniam, S., Szpankowski, W., and Grama, A. (2006), “Pairwise alignment of protein interaction networks,” J. Comput. Biol., 13, 182–199.

Kuranda, M. J. and Robbins, P. W. (1991), “Chitinase is required for cell separation during growth of Saccharomyces cerevisiae,” J. Biol. Chem., 266, 19758–19767.

Lee, M. G. and Nurse, P. (1987), “Complementation used to clone a human homologue of the fission yeast cell cycle control gene cdc2,” Nature, 327, 31–35.

Liskay, R. M. (1977), “Absence of a measurable G2 phase in two Chinese hamster cell lines,” Proc. Natl. Acad. Sci. U.S.A., 74, 1622–1625.

Littlewood, T. D. and Evan, G. I. (1995), “Transcription factors 2: helix-loop-helix,” Protein Profile, 2, 621–702.

Liu, J., Perumal, N. B., Oldfield, C. J., Su, E. W., Uversky, V. N., and Dunker, A. K. (2006), “Intrinsic disorder in transcription factors,” Biochemistry, 45, 6873–6888.

Liu, Y., Matthews, K. S., and Bondos, S. E. (2008), “Multiple intrinsically disordered sequences alter DNA binding by the homeodomain of the Drosophila hox protein ultrabithorax,” J. Biol. Chem., 283, 20874–20887.

Lord, P. G. and Wheals, A. E. (1980), “Asymmetrical division of Saccharomyces cerevisiae,” J. Bacteriol., 142, 808–818.

Lord, P. G. and Wheals, A. E. (1981), “Variability in individual cell cycles of Saccharomyces cerevisiae,” J. Cell. Sci., 50, 361–376.

Lu, P., Nakorchevskiy, A., and Marcotte, E. M. (2003), “Expression deconvolution: A reinterpretation of DNA microarray data reveals dynamic changes in cell populations,” Proc. Natl. Acad. Sci. U.S.A., 100, 10370–10375.

Mallat, S. (1989), “A theory for multiresolution signal decomposition: The wavelet representation,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, 11, 674–693.

Mallat, S. (1999), A wavelet tour of signal processing, Academic Pr.

Mallat, S. G. (2008), A wavelet tour of signal processing, Academic Press.

Marchler-Bauer, A., Anderson, J. B., Derbyshire, M. K., DeWeese-Scott, C., Gonzales, N. R., Gwadz, M., Hao, L., He, S., Hurwitz, D. I., Jackson, J. D., Ke, Z., Krylov, D., Lanczycki, C. J., Liebert, C. A., Liu, C., Lu, F., Lu, S., Marchler, G. H., Mullokandov, M., Song, J. S., Thanki, N., Yamashita, R. A., Yin, J. J., Zhang, D., and Bryant, S. H. (2007), “CDD: a conserved domain database for interactive domain family analysis,” Nucleic Acids Res., 35, D237–240.

Mayhew, M. B., Robinson, J. W., Jung, B., Haase, S. B., and Hartemink, A. J. (2011), “A generalized model for multi-marker analysis of cell cycle progression in synchrony experiments,” Bioinformatics, 27, i295–i303.

Mayhew, M. B., Guo, X., Haase, S. B., and Hartemink, A. J. (2012), “Close encounters of the collaborative kind,” Computer, 45, 24–30.

Mewes, H. W., Frishman, D., Guldener, U., Mannhaupt, G., Mayer, K., Mokrejs, M., Morgenstern, B., Munsterkotter, M., Rudd, S., and Weil, B. (2002), “MIPS: a database for genomes and protein sequences,” Nucleic Acids Res., 30, 31–34.

Miller, C., Schwalb, B., Maier, K., Schulz, D., Dumcke, S., Zacher, B., Mayer, A., Sydow, J., Marcinowski, L., Dolken, L., Martin, D. E., Tresch, A., and Cramer, P. (2011), “Dynamic transcriptome analysis measures rates of mRNA synthesis and decay in yeast,” Mol. Syst. Biol., 7, 458.

Minezaki, Y., Homma, K., Kinjo, A. R., and Nishikawa, K. (2006), “Human transcription factors contain a high fraction of intrinsically disordered regions essential for transcriptional regulation,” J. Mol. Biol., 359, 1137–1149.

Mintseris, J. and Weng, Z. (2005), “Structure, function, and evolution of transient and obligate protein-protein interactions,” Proc. Natl. Acad. Sci. U.S.A., 102, 10930–10935.

Morgan, D. (2007), The Cell Cycle: Principles of Control, London, New Science Press.

Morgan, D. O. (1997), “Cyclin-dependent kinases: engines, clocks, and microprocessors,” Annu. Rev. Cell Dev. Biol., 13, 261–291.

Mukherjee, S., Berger, M. F., Jona, G., Wang, X. S., Muzzey, D., Snyder, M., Young, R. A., and Bulyk, M. L. (2004), “Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays,” Nat. Genet., 36, 1331–1339.

Murray, A. and Hunt, T. (1993), The Cell Cycle: An Introduction, New York, W. H. Freeman & Co.

Murray, A. W. (2004), “Recycling the cell cycle: cyclins revisited,” Cell, 116, 221–234.

Orlando, D. A. (2009), “Regulation of Global Transcription Dynamics During Cell Division and Root Development,” PhD dissertation, Duke University.

Orlando, D. A., Lin, C. Y., Bernard, A., Iversen, E. S., Hartemink, A. J., and Haase, S. B. (2007), “A probabilistic model for cell cycle distributions in synchrony experiments,” Cell Cycle, 6, 478–488.

Orlando, D. A., Lin, C. Y., Bernard, A., Wang, J. Y., Socolar, J. E., Iversen, E. S., Hartemink, A. J., and Haase, S. B. (2008), “Global control of cell-cycle transcription by coupled CDK and network oscillators,” Nature, 453, 944–947.

Orlando, D. A., Iversen, E. S., Hartemink, A. J., and Haase, S. B. (2009), “A branching process model for flow cytometry and budding index measurements in cell synchrony experiments,” Annals of Applied Statistics, 3, 1521–1541.

Osley, M. A. (1991), “The regulation of histone synthesis in the cell cycle,” Annu. Rev. Biochem., 60, 827–861.

Pabo, C. O., Peisach, E., and Grant, R. A. (2001), “Design and selection of novel Cys2His2 zinc finger proteins,” Annu. Rev. Biochem., 70, 313–340.

Pazos, F., Helmer-Citterich, M., Ausiello, G., and Valencia, A. (1997), “Correlated mutations contain information about protein-protein interaction,” J. Mol. Biol., 271, 511–523.

Peng, K., Radivojac, P., Vucetic, S., Dunker, A. K., and Obradovic, Z. (2006), “Length-dependent prediction of protein intrinsic disorder,” BMC Bioinformatics, 7, 208.

Pierrez, J. and Ronot, X. (1992), “Flow cytometric analysis of the cell cycle: mathematical modeling and biological interpretation,” Acta Biotheor., 40, 131–137.

Pramila, T., Wu, W., Miles, S., Noble, W. S., and Breeden, L. L. (2006), “The Forkhead transcription factor Hcm1 regulates chromosome segregation genes and fills the S-phase gap in the transcriptional circuitry of the cell cycle,” Genes Dev., 20, 2266–2278.

Pruitt, K. D., Tatusova, T., Klimke, W., and Maglott, D. R. (2009), “NCBI Reference Sequences: current status, policy and new initiatives,” Nucleic Acids Res., 37, D32–36.

Qiu, P., Wang, Z. J., and Liu, K. J. (2006), “Polynomial model approach for resynchronization analysis of cell-cycle gene expression data,” Bioinformatics, 22, 959–966.

Raser, J. M. and O’Shea, E. K. (2005), “Noise in gene expression: Origins, consequences, and control,” Science, 309, 2010–2013.

Riley, R., Lee, C., Sabatti, C., and Eisenberg, D. (2005), “Inferring protein domain interactions from databases of interacting proteins,” Genome Biol., 6, R89.

Rowicka, M., Kudlicki, A., Tu, B. P., and Otwinowski, Z. (2007), “High-resolution timing of cell cycle-regulated gene expression,” Proc. Natl. Acad. Sci. U.S.A., 104, 16892–16897.

Schaufler, L. E. and Klevit, R. E. (2003), “Mechanism of DNA binding by the ADR1 zinc finger transcription factor as determined by SPR,” J. Mol. Biol., 329, 931–939.

Schuster-Bockler, B. and Bateman, A. (2007), “Reuse of structural domain-domain interactions in protein networks,” BMC Bioinformatics, 8, 259.

Schwabe, J. W., Chapman, L., Finch, J. T., Rhodes, D., and Neuhaus, D. (1993), “DNA recognition by the oestrogen receptor: from solution to the crystal,” Structure, 1, 187–204.

Schwacha, A. and Bell, S. P. (2001), “Interactions between two catalytically distinct MCM subgroups are essential for coordinated ATP hydrolysis and DNA replication,” Mol. Cell, 8, 1093–1104.

Sharan, R. and Ideker, T. (2006), “Modeling cellular machinery through biological network comparison,” Nat. Biotechnol., 24, 427–433.

Sharan, R., Suthram, S., Kelley, R. M., Kuhn, T., McCuine, S., Uetz, P., Sittler, T., Karp, R. M., and Ideker, T. (2005a), “Conserved patterns of protein interaction in multiple species,” Proc. Natl. Acad. Sci. U.S.A., 102, 1974–1979.

Sharan, R., Ideker, T., Kelley, B., Shamir, R., and Karp, R. M. (2005b), “Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data,” J. Comput. Biol., 12, 835–846.

Siegal-Gaskins, D., Ash, J. N., and Crosson, S. (2009), “Model-based deconvolution of cell cycle time-series data reveals gene expression details at high resolution,” PLoS Comput. Biol., 5, e1000460.

Sil, A. and Herskowitz, I. (1996), “Identification of asymmetrically localized determinant, Ash1p, required for lineage-specific transcription of the yeast HO gene,” Cell, 84, 711–722.

Simchen, G. (1978), “Cell cycle mutants,” Annual review of genetics, 12, 161–191.

Simmons Kovacs, L. A., Nelson, C. L., and Haase, S. B. (2008), “Intrinsic and cyclin-dependent kinase-dependent control of spindle pole body duplication in budding yeast,” Mol. Biol. Cell, 19, 3243–3253.

Singh, R., Xu, J., and Berger, B. (2008), “Global alignment of multiple protein interaction networks with application to functional orthology detection,” Proc. Natl. Acad. Sci. U.S.A., 105, 12763–12768.

Slater, M. L., Sharrow, S. O., and Gart, J. J. (1977), “Cell cycle of Saccharomyces cerevisiae in populations growing at different rates,” Proc. Natl. Acad. Sci. U.S.A., 74, 3850–3854.

Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and Futcher, B. (1998), “Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization,” Mol. Biol. Cell, 9, 3273–3297.

Srinivasan, B., Novak, A., Flannick, J., Batzoglou, S., and McAdams, H. (2006), “Integrated protein interaction networks for 11 microbes,” in Research in Computational Molecular Biology, pp. 1–14, Springer.

Srinivasan, B. S., Shah, N. H., Flannick, J. A., Abeliuk, E., Novak, A. F., and Batzoglou, S. (2007), “Current progress in network research: toward reference networks for key model organisms,” Brief. Bioinformatics, 8, 318–332.

Stacey, D. W. and Hitomi, M. (2008), “Cell cycle studies based upon quantitative image analysis,” Cytometry A, 73, 270–278.

Teixeira, M. C., Monteiro, P., Jain, P., Tenreiro, S., Fernandes, A. R., Mira, N. P., Alenquer, M., Freitas, A. T., Oliveira, A. L., and Sa-Correia, I. (2006), “The YEASTRACT database: A tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae,” Nucleic Acids Res., 34, D446–451.

Tobey, R. A. and Crissman, H. A. (1975), “Unique techniques for cell analysis utilizing mithramycin and flow microfluorometry,” Exp. Cell Res., 93, 235–239.

Toyn, J. H., Johnson, A. L., Donovan, J. D., Toone, W. M., and Johnston, L. H. (1997), “The Swi5 transcription factor of Saccharomyces cerevisiae has a role in exit from mitosis through induction of the CDK-inhibitor Sic1 in telophase,” Genetics, 145, 85–96.

Tupler, R., Perini, G., and Green, M. R. (2001), “Expressing the human genome,” Nature, 409, 832–833.

Ubersax, J. A., Woodbury, E. L., Quang, P. N., Paraz, M., Blethrow, J. D., Shah, K., Shokat, K. M., and Morgan, D. O. (2003), “Targets of the cyclin-dependent kinase Cdk1,” Nature, 425, 859–864.

Vaquerizas, J. M., Kummerfeld, S. K., Teichmann, S. A., and Luscombe, N. M. (2009), “A census of human transcription factors: function, expression and evolution,” Nat. Rev. Genet., 10, 252–263.

Wang, Y., Shirogane, T., Liu, D., Harper, J. W., and Elledge, S. J. (2003), “Exit from exit: Resetting the cell cycle through Amn1 inhibition of G protein signaling,” Cell, 112, 697–709.

Ward, J. J., Sodhi, J. S., McGuffin, L. J., Buxton, B. F., and Jones, D. T. (2004), “Prediction and functional analysis of native disorder in proteins from the three kingdoms of life,” J. Mol. Biol., 337, 635–645.

Woldringh, C. L., Huls, P. G., and Vischer, N. O. (1993), “Volume growth of daughter and parent cells during the cell cycle of Saccharomyces cerevisiae a/alpha as determined by image cytometry,” J. Bacteriol., 175, 3174–3181.

Xenarios, I., Salwinski, L., Duan, X. J., Higney, P., Kim, S. M., and Eisenberg, D. (2002), “DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions,” Nucleic Acids Res., 30, 303–305.

Zhenping, L., Zhang, S., Wang, Y., Zhang, X. S., and Chen, L. (2007), “Alignment of molecular networks by integer quadratic programming,” Bioinformatics, 23, 1631–1639.

Zhou, H. X. (2001), “The affinity-enhancing roles of flexible linkers in two-domain DNA-binding proteins,” Biochemistry, 40, 15069–15073.

Biography

Xin Guo was born on June 13, 1979 in Harbin, China. He earned a B.E. degree from Chiba Institute of Technology, Japan, in April 2002, and two master's degrees, from Tokyo Institute of Technology, Japan, and Saarland University, Germany, respectively. In 2006, he joined the Ph.D. program in Computer Science at Duke University. Upon completion of his degree, he will join Gilead Sciences, a biotechnology company headquartered in Foster City, CA, as a research scientist.

Publications:

1. Guo, X., Bernard, A., Orlando, D. A., Haase, S. B., Hartemink, A. J. (2012) “Branching process deconvolution algorithm reveals a detailed cell-cycle transcriptional program,” (submitted).

2. Mayhew, M. B., Guo, X., Haase, S. B., Hartemink, A. J. (2012) “Close encounters of the collaborative kind,” IEEE Computer, 45:24–30.

3. Guo, X., Bulyk, M. L., Hartemink, A. J. (2012) “Intrinsic disorder within and flanking the DNA-binding domains of human transcription factors,” Pacific Symposium on Biocomputing (PSB2012), 17:104–115, January 2012.

4. Guo, X., Hartemink, A. J. (2009) “Domain-oriented edge-based alignment of protein interaction networks,” Intelligent Systems in Molecular Biology 2009 (ISMB09). Bioinformatics, 25:i240–246, July 2009.
