gene network inference via sequence alignment and ... · figure page 4. stable steady state...

Gene Network Inference via

Sequence Alignment and Rectification

by

Philippe Christophe Faucon

A Dissertation Presented in Partial Fulfillmentof the Requirements for the Degree

Doctor of Philosophy

Approved August 2017 by theGraduate Supervisory Committee:

Huan Liu, ChairXiao Wang

Sharon CrookYalin Wang

Hessam Sarjoughian

ARIZONA STATE UNIVERSITY

December 2017

ABSTRACT

While techniques for reading DNA in some capacity has been possible for decades,

the ability to accurately edit genomes at scale has remained elusive. Novel techniques

have been introduced recently to aid in the writing of DNA sequences. While writing

DNA is more accessible, it still remains expensive, justifying the increased interest in

in silico predictions of cell behavior. In order to accurately predict the behavior of

cells it is necessary to extensively model the cell environment, including gene-to-gene

interactions as completely as possible.

Significant algorithmic advances have been made for identifying these interactions,

but despite these improvements current techniques fail to infer some edges, and

fail to capture some complexities in the network. Much of this limitation is due to

heavily underdetermined problems, whereby tens of thousands of variables are to be

inferred using datasets with the power to resolve only a small fraction of the variables.

Additionally, failure to correctly resolve gene isoforms using short reads contributes

significantly to noise in gene quantification measures.

This dissertation introduces novel mathematical models, machine learning tech-

niques, and biological techniques to solve the problems described above. Mathematical

models are proposed for simulation of gene network motifs, and raw read simula-

tion. Machine learning techniques are shown for DNA sequence matching, and DNA

sequence correction.

Results provide novel insights into the low level functionality of gene networks.

Also shown is the ability to use normalization techniques to aggregate data for gene

network inference leading to larger data sets while minimizing increases in inter-

experimental noise. Results also demonstrate that high error rates experienced by

third generation sequencing are significantly different than previous error profiles, and

i

that these errors can be modeled, simulated, and rectified. Finally, techniques are

provided for amending this DNA error that preserve the benefits of third generation

sequencing.

ii

I dedicate this work to my parents Arlene and Philippe, who inspire me to be the

best version of myself; to my sister Anni and my brother Pierre, who inspire me to

pursue what I love; and to my many excellent teachers and mentors throughout the

years, who inspire me to learn and to teach.

iii

ACKNOWLEDGMENTS

I would like to thank my advisors, Huan Liu and Xiao Wang for their support

and guidance throughout my PhD. Without their mentoring, and at times corralling,

I would almost certainly still be working on a million small projects without any

significant progress. I especially appreciate their willingness to let me fully explore my

research interests, and their forthrightness when I set my mind to a task that they

believed would not be fruitful. I would also like to thank the rest of my committee;

Sharon Crook who introduced me to bioinformatics during my undergraduate years,

and continued to guide me during my graduate research, along with Yalin Wang and

Hessam Sarjoughian, who provided valuable suggestions on my research directions.

I am remarkably fortunate to have worked with so many amazing colleagues in

both the Xiao Lab and Data Mining and Machine Learning(DMML) lab at ASU.

Over my graduate studies I’ve had the pleasure of working with many exceptional

researchers: Parithi Balachandran, Ghazaleh Beigi Xingwen Chen, Ryan Dougherty,

Hao Hu, Isaac Jones, Jundong Li, David Menn, Fred Morstatter, Tahora Nazer, Suhas

Ranganath, Justin Sampson, Kai Shu, Kylie Standage-Beier, Ri-Qi Su, Robert Trevino,

Lezhi Wang, Suhang Wang, Fuqing Wu, Liang Wu, Tsung-Yen (John) Yu, and Qi

Zhang. Above all I would like to thank my sidekick Parithi Balachandran, an excellent

research colleague and friend, and without whom many research projects would not

have been possible. I am also incredibly grateful to Robert Trevino for his assistance

with for his research insights, and for introducing me to conference paper writing.

Having a life outside of research, and the support of friends is critical to enjoying

and succeeding in grad school. On that note I would especially like to thank Parithi

Balachandran, Michael Berg, Mario Giacomazzo, Hao Hu, David Menn, Fred Morstat-

ter, James Palazzolo, Jose Perez, Kim Phan, Brianna Rudow, Jeff Semmens, and

iv

Robert Trevino and the ASU fencing team. Without their support and friendship I

would likely have left the university long before completing my degree.

I would also like to acknowledge my family, who supported me through the most

difficult parts of my studies and provided a constant stream of encouragement. My

grandparents Berit and Anker, my parents Arlene and Philippe, and my siblings Anni

and Pierre, have all played a monumental role in shaping me into the person I am

today. I could not be more grateful for the constant support they have given me, and

I hope that I have, do, and will continue to make them proud.

Learning to learn is one of the most difficult tasks that we as people, and especially

as researchers have to undertake. We have to find techniques for locating relevant

and trustworthy sources, sorting through their available knowledge, and distilling it

down to what we can remember and make use of. Through my studies I was very

fortunate to meet mentors that inspired me to learn, and showed me the joy in learning.

Likewise while teaching I had outstanding students that showed me joy in teaching,

and inspiring others to learn and achieve great things. In specific I would like to thank

Karen Glazier, the first (and only) teacher to fail me, for reminding me that regardless

of how much you think you know you can still fail if you don’t apply yourself.

v

TABLE OF CONTENTS

Page

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

CHAPTER

1 GENE NETWORK MOTIFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.1.1 Network Enumeration and Parameter Scanning for Multi-

stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.1.2 Complete Auto-Activation Contributes to Multistability . . . . 7

1.1.3 Sensitivity and bifurcation of multistability . . . . . . . . . . . . . . . . 8

1.1.4 Stochastic State Transitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.1.5 SSS Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.1.6 A Regulatory Network for Pluripotency . . . . . . . . . . . . . . . . . . . 17

1.1.7 In Silico Induced Pluripotency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

1.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

1.3 Materials and methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

1.3.1 Enumerating and Eliminating Redundant Networks . . . . . . . . 31

1.3.2 ODE Modeling and High-Throughput Screening . . . . . . . . . . . 31

1.3.3 Analysis of the Parameter Space . . . . . . . . . . . . . . . . . . . . . . . . . . 33

1.3.4 Bifurcation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

1.3.5 Stochastic Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

1.3.6 Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

2 GENE NETWORK MODELING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.1 Network Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

vi

CHAPTER Page

2.1.1 Boolean Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

2.1.2 Bayesian and Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

2.1.3 Differential Equation Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

2.2 RNA Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

2.3 Data Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

2.3.1 Acquiring the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

2.3.2 Noise Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

2.3.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3 NANOPORE READ SIMULATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

3.3.1 Error Source Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3.3.2 Error Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3.3.3 Read Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.4.1 Modeling K-mer Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.4.2 Identifying Bias Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

3.4.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

3.4.4 Simulator Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4 READ CORRECTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

vii

CHAPTER Page

4.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.4 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.4.1 High-Accuracy K-mers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.4.2 Utilization of K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

4.4.3 Synthetic Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

4.5.1 Clustering on Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.1 Methodological Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

NOTES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

viii

LIST OF TABLES

Table Page

1. Overview of RNA-Seq Experiments Analyzed with a Listing of Tool and

Asset Versions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

2. Identified Error Sources and Their Contribution to K-Mer Overall Accuracy

with Respect to the Bias Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

3. Sum of Squares Error (SSE) between K-Mer Simulators and Their Respective

Training Sets, and between the Fragmented R9 Data Set and the Complete

R9 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4. memory and Wall-Clock Time Consumption of SNaReSim under Various

Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5. Comparison of MHAP and K-Means Clusters over the Engineered Feature

Space, the Number of K-Means Clusters Was Set to the Number of Compo-

nents Identified by MHAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

ix

LIST OF FIGURES

Figure Page

1. Multistability Arises from Small Gene Networks and Underlies Cell Differ-

entiation. Fully-Connected Triads (FCTs) Are Important, Recurring Tran-

scriptional Networks in the Development and Maintenance of Cellular States.

Notably, the Oct4-Sox2-Nanog Triad Has Been Implicated in Inducing and

Maintaining Stem Cell Pluripotency. In the Waddington Model for Cell

Differentiation, the Cell’s Underlying Developmental Landscape Is Governed

by the Dynamical Potential of These Small Gene Networks to Realize Multi-

stability. Transition from One State to Another Can Be Guided via Altered

Topology or Strength of Wiring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2. High-Throughput Computational Screening for Multistable Networks. (A)

FCTs Were Enumerated and Modeled by ODEs (Parameters Denoted as P).

(B) The Multistability Probability of 104 FCTs Generated by the Computa-

tional Screen Is Plotted, with a High Multistability Probability FCTs Marked

with an X with the Color Representing the Multistability Group; the Two

Highest Probability Groups Are Labeled by Their Indices. (C) The Fifteen

FCTs with the Highest Multistable Potentials Are Represented Visually. . . . . 6

x

Figure Page

3. Bifurcation and Stochastic Analysis of Network 84. (A) The Topology of

Network 84 Was Predicted to Have the Highest Probability for Multistability.

(B) The Bifurcation Diagram of Network 84 Plots Transcription Factor

Concentrations of X and Y at Each SSS against α, the Self-Activation

Strength. The Mutual Inhibition Parameters Are Also Set Equal and Referred

to as β, Here Set to 0.1 and α Is Used as the Bifurcation Parameter Ranging

between .01 and 100. SSS Are Color Coded and Listed in the Legend to

Distinguish Different SSS, Where ‘+’ Indicates the Gene Is ‘ON’ and ‘-

’ Means ‘OFF’. (C) Simulations of Noise-Induced State Transitioning in

Network 84 under Different Levels of Noise. Simulations Were Performed

with Auto-Regulation Equal to 1 and Mutual Inhibition Equal to 0.1, with

Noise Levels of 0.5, 0.85, and 1 from left to right. The Locations of the

Deterministically Calculated States Are Indicated with Red Spheres, with

Their Size Correlated to Their Spectral Radius. The Blue Ribbons Indicate

the Temporal Trajectories of the Simulations. The Black Arrows Indicate

Initial Direction of State Transitions. (D) The Number of States Traveled

in the Stochastic Simulations Plotted versus Noise Strength. The Red Line

Represents Simulations that Were Initiated from the All-ON State, and the

Blue Line Represents Simulations that Were Initiated from the Origin. . . . . . . 9

xi

Figure Page

4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting

the Number of Stable Steady States Generated by Combinations of Differ-

ent Auto-Regulatory Strengths (Rows) and Mutual-Regulatory Strengths

(Columns). The Green and Orange Regions Are Visualized as FCT Diagrams

Independently in B. (B) The Green-Boxed and Orange-Boxed Diagrams

Corresponds to the Green-Boxed and Orange-Boxed Parameter Combina-

tion Regime from (A). (C) Many of the Parameter Combinations Yielding

Multi-Stable Systems Are Not Represented by the Matrix, and Are instead

Abstracted Here. As an Example, Here We Present a Parameter Combination

Regime that Can Support 6 SSS and Auto-Activation Strengths Are Similar

to Those in the Orange Box (Thick Red Arrows) but Have One Unrestrained

Pair of Mutual Inhibition that Can Exist at a Very High Strength (Blue

T-Head Repression Lines). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

xii

Figure Page

5. Bifurcation and Stochastic Analysis of Network 1. (A) Network 1 Shares the

Topological Architecture of the Oct4-Sox2-Nanog Triad, and Is Believed to

Underlie Stem Cell Differentiation and Reprogramming. (B) The Bifurcation

Diagram of Network 84 Plots Transcription Factor Concentrations of X and Y

at Each SSS against α, the Self-Activation Strength. The Mutual Inhibition

Parameters Are Also Set Equal and Referred to as β, Here Set to 0.1 and α

Is Used as the Bifurcation Parameter Ranging between .01 and 100. SSS Are

Color Coded and Listed in the Legend to Distinguish Different SSS, Where

‘+’ Indicates the Gene Is ‘ON’ and ‘-’ Means ‘OFF’. (C) Simulations of Noise-

Induced State Transitioning in Network 1 under Different Levels of Noise.

Simulations Were Performed with Auto-Regulation Equal to 1 and Mutual

Inhibition Equal to 0.1, with Noise Levels of 0.5, 0.85, and 1 from left to right.

The Locations of the Deterministically Calculated States Are Indicated with

Red Spheres, with Their Size Correlated to Their Spectral Radius. The Blue

Ribbons Indicate the Temporal Trajectories of the Simulations. The Black

Arrows Indicate Initial Direction of State Transitions. (D) The Number of

States Traveled in the Stochastic Simulations Plotted versus Noise Strength.

The Red Line Represents Simulations that Were Initiated from the All-ON

State, and the Blue Line Represents Simulations that Were Initiated from

the Origin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

xiii

Figure Page

6. Stable Steady State Analysis of Network 1. (A) A Matrix Presenting the Num-

ber of Stable Steady States Generated by Combinations of Different Auto-

Regulatory Strengths (Rows) and Mutual-Regulatory Strengths (Columns).

The Green and Orange Regions Are Visualized as FCT Diagrams Indepen-

dently in B. (B) The Green-Boxed and Orange-Boxed Diagrams Corresponds

to the Green-Boxed and Orange-Boxed Parameter Combination Regime from

(A). (C) Many of the Parameter Combinations Yielding Multi-Stable Systems

Are Not Represented by the Matrix, and Are instead Abstracted Here. As

an Example, Here We Present a Parameter Combination Regime that Can

Support 6 SSS and Auto-Activation Strengths Are Similar to Those in the

Orange Box (Thick Red Arrows) but Have One Unrestrained Pair of Mutual

Inhibition that Can Exist at a Very High Strength (Blue T-Head Repression

Lines). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

xiv

Figure Page

7. Bifurcation and Stochastic Analysis of Network 1 with Constitutive Basal

Expression. (A) The Topology of Network 1 Is Used to Mimic IPSC Exper-

iments with Added Constitutive Expression of All Three Genes. (B) The

X and Y Protein Concentrations of Stable and Unstable Steady States Are

Plotted against Self-Activation Strengths. With the Addition of the Consti-

tutive Expression We Can See a Further Decrease in Stabilities, Specifically

Two-ON States Are Never Stable. Additionally One-ON States Lose Stability

Very Rapidly, Approximately at α=1, and the All-OFF State Destabilizes

after α=9 Leaving Only the All-ON State. This Can Also Be Seen in the

Sizes of the Spectral Radii, Deterministically These Steady States Exist but

in Practice They Have Very Little Influence. (C) Simulations of Network 1

Modeled in the Presence of an Additional Constitutive Overexpression Term

Consistently Show a Transition from an All-OFF State to an All-ON State

after a Period of Latency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

8. Stochastic Simulations Capture Latency Distributions in Induced Pluripo-

tency. (A) Distribution of Simulated Latency for the Modified Network 1

Model Illustrates a Skewed Bell Shape. The Histogram Was Generated from

2000 Simulations. (B) Similar Results Generated from 2000 Simulations with

Network 84. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

xv

Figure Page

9. Two Fully Connected(A/B), and Two Partially Connected(C/D) Triads Are

Depicted. (A/C) Illustrate Only Positive Network Regulation (Activation),

Indicated by Red Arrows, I.e. Increasing Quantities of the Transcription

Factor at the Tail of the Arrow Also Increases Quantities of the Transcription

Factor at the Head. (B/D) Utilize Both Positive and Negative Network

Regulation (Repression), Indicated by the Blue Bars, I.e. Increasing the

Quantities of the Transcription Factor at the Tail of the Error Decreases

Quantities of the Transcription Factor at the Head. . . . . . . . . . . . . . . . . . . . . . . . . 38

10.In This Figure We Present the Data Processing Pipeline from 5 Other ScRNA-

Seq Experiments Available on NCBI GEO. GSE59127 and GSE59129 (left,

Blue), GSE 52583 (left Center, Red), GSE21180(Center, Olive), GSE45284

(right Center, Orange) and Our Pipeline (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

11.Scatterplots Showing Ratios of Coefficient of Variation(CV), Where Points

Represent CV of a Single Gene within the Original Author’s Pipeline, and

Ours. The X=y Line Drawn, Representing the Point Where CV Is Equal

between Both Pipelines. Points on the X>y Side Are Drawn in Red, and

Point on the Y>x Side Are Drawn in Black. For the Experiments in (A-

C), Raw ScRNA-Seq Reads Are Provided by the NCBI GEO Service and

Analyzed Using Our Customized Pipeline. In Addition to Raw Reads Each

Author Provides Estimates of Gene Activity Levels through FPKM or TPM.

In (A-C) We Plot the Coefficient of Variation (CV) of Our Pipeline Results

against Those of the Original Authors. In (D) We Group the Experiments

from (A-C) to Visualize Inter-Experimental CV. . . . . . . . . . . . . . . . . . . . . . . . . . . 50

xvi

Figure Page

12.The Accuracies of All 6-Mers with Accuracy on the Vertical Axis, and a Kmer

Index on the Horizontal Axis. (A) All Experiments Sorted with Respect to

Their Accuracy. (B) Experiments Sorted with Respect to the Escherichia

Coli R9 2D Experiment. (C) The R9 E. Coli Template Experiment Divided

Randomly into 5 Groups, Sorted with Respect to the First Group. (D) Only

the R9 E. Coli Template Experiment, Aligned with Both Graphmap and

BWA-MEM, Sorted with Respect to BWA-MEM.. . . . . . . . . . . . . . . . . . . . . . . . . . 58

13.A Plot of the Non-Negative Least Squares (Nnls) Fit of the Predicted Features

Compared to the Actual Distribution from the R9 E. Coli 2D Experiment,

Sorted by the True Experiment. (A) All Model Feautures and the True

Accuracy of Neighboring K-Mers as a Feature. (B) Only the Features that

Are Fully Predictable from the Model, as Described in Bias Souces Table. . . . 64

14.Comparison of 3 Previously Published Sequencing Simulators with the New

Model We Proposed. Fit of R9 Pore Model K-Mer Accuracies Sorted Inter-

nally (A), and by 2D E. Coli, the Training Model (B). Fit of R7 Pore Model

K-Mer Accuracies Sorted Internally (C), and by 2D E. Coli, the Training

Model (D). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

xvii

Figure Page

15.(A) Strands of DNA Bases Are Added to the Nanopore Sequencer as an

Analog Data Input (B) As Strands Pass through the Nanopores Electrical

Current Readings Are Taken. These Readings Are Consequently Grouped into

‘‘Events”, with Each Event Approximately Representing a K-Mer (Sequence

of Bases). These Events May Be Improperly Split, Events May Have a Mean

Value Higher or Lower than Expected for a DNA K-Mer, and Event Lengths

May Be Different. (C) A Base Caller Uses the Sequence of Events to Predict

the Original Sequence of Bases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

16.(A) DNA Sequences Can Be Interpreted as a Sequence of K-Mers, Each K-

Mer Can Be Replaced by Its Expected Mean Current to Provide a Mapping

from DNA Sequences into a High-Dimensional Current Space. (B) Clusters

of 2-Dimensional Points (Sequence Length 7), Clusters Are Found Using

DBSCAN with Neighborhood Size 1 and Epsilon 2 . . . . . . . . . . . . . . . . . . . . . . . . . 77

xviii

Chapter 1

GENE NETWORK MOTIFS

Figure 1. Multistability arises from small gene networks and underlies celldifferentiation. Fully-connected triads (FCTs) are important, recurringtranscriptional networks in the development and maintenance of cellular states.Notably, the Oct4-Sox2-Nanog triad has been implicated in inducing and maintainingstem cell pluripotency. In the Waddington model for cell differentiation, the cell’sunderlying developmental landscape is governed by the dynamical potential of thesesmall gene networks to realize multistability. Transition from one state to anothercan be guided via altered topology or strength of wiring.

Embryonic stem cells have the capacity to differentiate into more than 200 isogenic

progenitor and terminal cell types [1], each of which represents a stable cellular state.

The potential for a system to realize multiple states is termed multistability, which

has been proposed as the mechanism underlying cell differentiation for over half a

century, with stable states represented as either valleys in the developmental landscape

1

[2], [3] or as dynamic attractors in high-dimensional gene expression space [4]–[6].

Bistability is the simplest form of multistability, which has been studied extensively

both in natural contexts [7] and in synthetic systems [8]–[13]. It has been shown to

be responsible for both normal cell fate determination in Xenopus oocyte maturation

[14] and oncogenesis [15]. Additionally, recent work in hematopoietic stem cells (HSC)

implies that more complicated cases of multistability help in canalizing HSC lineage

determination [16]. The discovery of induced pluripotent stem cells (iPSC) [17]–[20]

further demonstrates that cellular state transitioning within a multistable system is a

reversible process that can potentially be engineered for therapeutic purposes.

Despite evidence for the existence of multiple substates of pluripotent and

hematopoietic stem cells [21]–[25], the molecular mechanisms responsible for gen-

erating complex multistability are poorly understood. Transitions between stable

states may underlie widely observed stem cell population heterogeneity [23], [25]–[34] ,

random latency in induced pluripotency [34], [35], and even the progression of diseases

such as cancer [36], [37]. According to this view, regulatory networks capable of

adopting multiple stable states may enable cells to explore adjacent fate possibilities

through state transitions, thereby enabling differentiation in the case of stem cells, or

generating phenotypic diversity for selection to act upon in the case of tumor evolu-

tion. Recent studies have implicated the fully-connected triad (FCT) [38]–[40] as a

recurring network motif among the transcriptional regulatory circuits that control the

development and maintenance of cellular states. Most notably, the Oct4-Sox2-Nanog

triad is believed to be the key functional module in the maintenance of stem cell

pluripotency [41]–[43]. The success of induced pluripotency through the overexpression

of transcription factors alone further indicates the dominant role of transcription factor

networks in maintaining pluripotency and directing cellular reprogramming. All this

2

suggests a strong relationship between biological multistability and the conserved FCT

network of transcription factors(Figure 1), and motivates a need for understanding

the relationship between FCT regulatory relationships and multistability.

Here we developed a high-throughput computational approach to search the

network dynamics of all fully-connected transcription triads for their potential to

realize multistability. Our computational searches identified complete auto-activation,

(i.e.: all three nodes activate themselves), as a topological feature of triads capable

of multiple stable steady states (SSS). Detailed analyses were performed on network

topologies selected for a high likelihood of multistability. These analyses show that the

relative stability of SSS can be tuned by adjusting the network regulatory strengths

between nodes. After incorporation of gene expression noise in stochastic simulations,

the system was able to spontaneously transit from metastable (SSS with relatively

lower stabilities) to differentiated states (SSS with at least one gene OFF and high

stabilities), behaving like a three-dimensional Markovian jump system with bias.

This is also consistent with experimentally demonstrated stochastic cancer cell state

transitions [44], which result in a phenotypic equilibrium in populations of cancer

cells.

Different levels of noise, designed to mimic degrees of “noisy” transcriptional activity

in cellular systems, were found to either promote or disrupt state transitions, with some

transitions requiring layovers at one or more intermediate states. Such connections

between gene expression noise and cell state transitions have also been demonstrated

in many biological processes [45]. Finally, in our simplified theoretical simulation of

reprogramming to generate induced pluripotent stem cells, the distribution of random

temporal latency could be described by an Inverse Gaussian distribution, a finding

consistent with experimental observations [34]. Taken together, our work illustrates

3

that computational screening, analysis, and simulation of network dynamics can serve

as valuable tools for connecting conserved regulatory modules with complex biological

processes, such as development, disease and stem cell reprogramming.

1.1 Results

1.1.1 Network Enumeration and Parameter Scanning for Multistability

To define the set of FCT networks with a high likelihood for multistability, we

first enumerated all possible fully-connected three-node topologies (2A). Here, nodes

represent transcription factors and edges represent possible transcriptional regulations.

After taking into account topological equivalencies, we were left with 104 unique FCTs

as candidate topologies, which fully represent all 29=512 possible FCT topologies.

Each FCT candidate was generally modeled by a set of three-variable ordinary

differential equations (ODEs), with each variable representing the abundance of a

single transcription factor and each of the nine tunable parameters representing a

single regulation strength (Figure 2A). The activation and repression of each gene by

other factors are modeled additively to account for reported mechanistic independence

of transcription regulations [16], [40], [46] (MATERIALS AND METHODS). This

simplification can be expanded to include more mechanistic details of specific networks

without altering the general conclusion of this screening. A state of the network is

then defined as an equilibrium or SSS if all transcription factor abundances do not

change over time. An algorithm was developed to automate the search process to

identify all possible states.

Specifically, to measure the probability of multistability, we scanned N = 421,875

parameter sets for each of the 104 networks to identify all possible states. The values

4

of these parameters are generic but cover a wide range to serve as a proof of principle

for diverse biological scenarios. The probability of multistability was then defined as

the number of parameter sets giving rise to at least 5 stable steady states divided

by N. This probability reflects the robustness of certain FCT’s multistability, and

hence quantifies its possible natural presence frequency. The entire FCT multistability

screen is an ensemble of many independent computations and, as a result, was run

in parallel and performed in high-throughput computer clusters (see MATERIALS

AND METHODS). Such large-scale screening is computationally intensive and is only

possible with simplified model equations. Therefore the chosen forms of our ODE

models omit many biochemical details such as epigenetics and transcriptional factor

co-binding. The models serve as a general abstraction and system specific details

could be incorporated later in light of specific experimental evidence.

5

Figure 2. High-throughput computational screening for multistable networks. (A)FCTs were enumerated and modeled by ODEs (parameters denoted as P). (B) Themultistability probability of 104 FCTs generated by the computational screen isplotted, with a high multistability probability FCTs marked with an X with the colorrepresenting the multistability group; the two highest probability groups are labeledby their indices. (C) The Fifteen FCTs with the highest multistable potentials arerepresented visually.

6

1.1.2 Complete Auto-Activation Contributes to Multistability

Using our network screening approach, we quantified and ranked the potential for

multistability of all 104 possible FCTs (Figure 2A). Of the 104, 15 showed significant

probability for multistability (marked in Figure 2B and topologies drawn in Figure

2C).

The multistable FCTs were empirically categorized into three groups according to

their probability of multistability (color shaded in Figure 2B,C). The first group, which

had the highest probability of realizing multistability and with the potential for eight

states, consisted of a single network (Network 84) with all positive auto-regulation

and all mutual inhibition. The second group of multistable FCTs, consisting of

Networks 8, 38, 60, and 62, was also exclusively positive auto-regulating but displayed

a mixture of positive and negative mutual regulation. Specifically, these networks were

either enriched for mutual inhibition (Networks 60 and 62) or displayed symmetrical

topologies (Network 8 and 38). The third group, which was comprised of 10 FCTs that

exhibited lower but still significant potential for multistability, once again featured

similar positive auto-regulation as a common attribute. The networks of the third

group had either asymmetric topologies (Networks 3, 10, 12, 27, 36 and 56) or

competing mutual regulation of target genes (Network 5, 25, and 30), except for

Network 1, which was entirely void of inhibitory regulation. These results suggest that

an appropriate combination of mutual inhibitory and activating regulation helps to

increase the potential for a network to realize multistability, a finding that is consistent

with previous theoretical studies [47].

From our screen, we were able to identify a set of important, minimum topological

features capable of multistability. In particular, of the possible 16 FCTs with com-

plete positive auto-regulation, our screen selected 15 as having a high likelihood for

7

multistability. Network 54 was the only FCT displaying complete auto-activation with

insignificant potential for multistability, likely because of competing mutual activation

and repression of all the nodes (bottom of Figure 2C).

Indeed, positive auto-regulation has been shown to be an important feature of

bistable systems [7], [9], [10], [14], [48], [49], and recent experimental work also

suggests it could be necessary for maintaining multipotency in stem cells [50]. The

role of positive feedback in multistability has been well studied [9], [14], [51], and the

interconnected autoregulatory loop formed by Oct4, Sox2, and Nanog is thought to

underlie the bistable switch between ESC self-renewal and differentiation, and the

ability of core ESC regulators to ‘boot up’ the pluripotency transcriptional program

by forced expression during reprogramming [52]. Our computational results similarly

suggest that, within the limits of the study, complete positive auto-regulation is an

important contributor to the generation of high-dimensional multistability across

diverse FCT topologies. It is striking to note that one or two instances of auto-

activation, no matter how strong, are not sufficient to generate multistability in our

simulations. Therefore, it is possible that complete auto-activation of FCTs could be

used as an important hallmark in the identification of new multistable gene networks.

1.1.3 Sensitivity and bifurcation of multistability

8

Figure 3. Bifurcation and Stochastic Analysis of Network 84. (A) The topology ofNetwork 84 was predicted to have the highest probability for multistability. (B) Thebifurcation diagram of Network 84 plots transcription factor concentrations of X andY at each SSS against α, the self-activation strength. The mutual inhibitionparameters are also set equal and referred to as β, here set to 0.1 and α is used asthe bifurcation parameter ranging between .01 and 100. SSS are color coded andlisted in the legend to distinguish different SSS, where ‘+’ indicates the gene is ‘ON’and ‘-’ means ‘OFF’. (C) Simulations of noise-induced state transitioning in Network84 under different levels of noise. Simulations were performed with auto-regulationequal to 1 and mutual inhibition equal to 0.1, with noise levels of 0.5, 0.85, and 1from left to right. The locations of the deterministically calculated states areindicated with red spheres, with their size correlated to their spectral radius. Theblue ribbons indicate the temporal trajectories of the simulations. The black arrowsindicate initial direction of state transitions. (D) The number of states traveled in thestochastic simulations plotted versus noise strength. The red line representssimulations that were initiated from the all-ON state, and the blue line representssimulations that were initiated from the origin.

9

Network 84 (Figure 3A, all positive auto-regulation and all mutual inhibition)

yielded the highest potential for multistability in our screen (Figure 2C). To gain a

deeper understanding of this FCT, we performed a bifurcation analysis [53], [54] to

investigate how the change of one regulation strength affects the number, stability,

and location of all states, while other parameters are kept constant (Figure 3B).

Bifurcation analysis is a useful analytical tool that can reveal when and where one

state can bifurcate into two or more states in response to a changing parameter.

This is analogous to a progenitor cell transitioning from a self-renewing phase to a

multi-lineage differentiating phase in response to a stimulus [16].

In the bifurcation diagram of Network 84 (Figure 3B), locations of both SSS (thick

colored lines) and unstable steady states (thin grayed lines) are plotted. X and Y

indicate the abundance of gene product of X and Y. The axis of α represents the

strength of all self-activation, while other parameters are kept constant. Gene product

Z is omitted because only three quantities can be visualized in the plot. Nevertheless,

its abundance can be inferred from that of X and Y. Thus the figure provides the

position of SSS as the concentration of the three transcription factors, and their

self-activation strength, varies.

As can be seen in Figure 3B, Network 84 has complex dynamics. When auto-

activation is very weak (α less than 1), the system only has one SSS (the black line),

which corresponds to the state in which the abundances of all three gene products are

very low and could be roughly categorized as all genes OFF. This finding intuitively

makes sense considering the mutual repression of Network 84’s topology, which under

weak auto-activation does not support individual genes overcoming degradation to

support the ON state. However, as the strengths of auto-activation increase beyond

a threshold, the system suddenly demonstrates many more SSS (colored solid lines).

10

All of these SSS branch out from the all-OFF state, which is the typical nature

of bifurcation. When α = 1, the system can assume its maximum of eight states

(illustrated in Figure 3B as red circles). These states include the following possible

binary combinations: all three genes ON (orange); all three genes OFF (black); two

genes ON and one gene OFF (blue, cyan, and purple); and, one gene ON and two

genes OFF (green, brown, and yellow). The locations of these eight SSS are also

plotted in Figure 3C in XYZ coordinate. The sizes of the circles correspond to each

state’s stability (details below). Network 84 is able to maintain multiple SSS for

a wide parameter range of auto-activation. The all-OFF state (Figure 3B black

line) disappears when auto-activation strengths become greater than 1.05 (Figure 3B

gray circles illustrate locations of SSS when α =5). As the auto-activations become

stronger (e.g., α greater than 7), all three one-gene-ON (green, brown, and yellow)

states disappeared, leaving only two-gene-ON and all-ON states to be SSS (blue, cyan,

purple, and orange lines in Figure 3B).

The potential for this three-node gene network to yield as many as eight stable

states underscores the idea that many cell types (e.g., over 200) could be generated

from networks comprised of a small number of nodes. Taken further, a fully-connected

network with only eight nodes could produce 256 states [47]. While only theoretical,

it is an example of the potential for such combinatorial complexity and provides

a plausible mechanism for cell lineage determination, where different gene product

abundances and ON-OFF combinations could drive the system towards a spectrum

of cell fates. Coupled with our bifurcation analysis, where we modeled the abrupt

appearance or collapse of SSS in this topology, FCTs could also lead to discrete jumps

between states in response to a changing cellular environment or signaling. The

phenomenon of hysteresis, when such jumps cannot be reversed by returning to the

11

previous environmental conditions, has been established as the mechanism for many

irreversible cell fate determinations [55].

1.1.4 Stochastic State Transitioning

Next, we investigated the effects of gene expression fluctuations on the multistability

potential of Network 84. Fluctuations in gene expression under constant environmental

conditions may arise from inherent stochastic fluctuations in gene expression, or

from regulatory architecture, chromatin structure, or transcriptional bursting. Here,

we use the broad term ‘gene expression noise’ to encompass all these possibilities.

Gene expression noise has been shown to affect cellular physiology in single-cell

organisms [56]–[62], and has also been linked to random developmental patterns and

cell fate decisions in multicellular organisms [63]–[67]. The mechanisms underlying

the utilization and regulation of gene expression noise in higher organisms, however,

are not well understood. We introduced such noise into our model by modifying the

algorithm from Adalsteinsson et al. [68] to help understand how inherent stochasticity

affects cell differentiation (see MATERIALS AND METHODS).

Using our modified algorithm, we simulated the temporal evolution of all three

gene product abundances under the influence of different levels of gene expression

noise. The system was initialized at the all-ON state to mimic the promiscuous

gene expression state that is often observed when various progenitor cells start to

differentiate [69]–[73]. The locations of eight SSS when α =1 are indicated by the red

spheres in Figure 3C, where the size of the spheres represents the stability of each

state as measured by its spectral radius [74], [75]. As shown in Figure 3C, at small

noise levels (Figure 3C, left panel), expression of all three genes displayed fluctuations,

yet the trajectory remained focused around the initial all-ON state, indicating that

12

small levels of gene expression noise were not sufficient to induce spontaneous state

transitioning in Network 84. This also indicates the stability of the network under

well-buffered conditions. As the level of noise was increased to an intermediate level

(Figure 3C, middle panel), the system began to transition from the initial all-ON state

to a two-ON-one-OFF state and later to a two-OFF-one-ON state. Within the time

frame simulated, the system transitioned back and forth between these two latter

states and spent most of the time in the vicinity of these two states. When the noise

level was increased further, the system was able to visit all six states with at least one

gene ON and one gene OFF (Figure 3C, right panel).

The results of the stochastic state-transition simulations have several interesting

biological implications. First, the system transitioned exclusively between states

with at least one gene ON and one gene OFF. After jumping out of the initial state,

it never returned to the all-ON state within the simulation time frame; moreover,

it never reached the all-OFF state. For this topology, the all-ON and all-OFF

states are significantly less stable than the other six states, and thus serve as weaker

attractors. Specifically, it is the fine balance in the abundance of all gene products

that determines the stability of the all-ON and all-OFF states, and this balance

is easily disrupted by random perturbations such as gene expression noise. These

results provide a simplified and yet concrete demonstration of the recently proposed

concept of “dynamic equilibrium” [76], in which individual cells can jump between

states randomly, while the proportion of cells in a population at one specific state

remains constant. Here we can see the proportion (probability) of cells in one state

is determined by the stability of each state, which in turn is determined by network

topology and system parameters. Therefore when one single fluorescence reporter

tagged to one gene is experimentally used to track single cell dynamics, the cell

13

population will exhibit a broad distribution because this multi-dimensional system is

projected onto a single measured dimension.

Second, state transitions were only observed between two adjacent states and

not between non-neighboring states. In other words, the transition from one state

to another non-neighboring state required layovers at one or more intermediate

states. This phenomenon resembles the experimental observation that progenitor cells

pass through multiple intermediate states as they differentiate [5]. Moreover, these

simulations also highlight that the transition of a triad from one state to another can

take multiple possible routes. Similar results have been observed at the cellular level,

where recent progress in de-differentiation [17], [19], [20] and trans-differentiation [77]

has shown that natural developmental cellular state transitions may be only one of

many possible routes between cell types [78].

Third, trajectories of the stochastic simulation showed a directional bias in gene

expression state space, presumably due to mutual repression. For instance, when the

system was in the X-ON/Y-ON/Z-OFF state, its trajectory was virtually limited to

the X-Y plane, with little fluctuation in the Z direction (Figure 3C, middle and right

panels). The result is that each state in these simulations was functionally “flat” and

comprised of the expression from only two genes. Similarly, the one-ON states had a

polar or single dimensional expression distribution.

Next, we subjected the system to a range of noise levels to gain a deeper under-

standing of the system’s response and pattern of state transitioning. Noise strengths

ranging from 0.4 to 1.2 were selected for analysis. Plotted in Figure 3D (red line) is the

number of states traveled by the system, from the all ON state, within the simulated

time versus the noise strength. The result shows a step function-like increase in the

number of states with increasing noise strength. A similar pattern was observed when

14

the simulation was initiated at the origin (blue line) rather than the all-ON state.

One exception is that from this alternative starting point the system could not travel

to the all-ON state and therefore had only six accessible states.

In both simulations, the sharp increase in the number of states traveled occurred

between noise strengths of 0.8 and 1 (Figure 3D). This appears to be the critical

threshold for overcoming the stability barrier between states and for complete noise-

induced state transitioning. The existence of such a critical noise threshold suggests

that natural systems may have the ability to operate in different regimes that either

utilize or repress gene expression stochasticity to achieve distinct physiological goals.

1.1.5 SSS Analysis

To further explore Network 84’s response to two or more simultaneously changing

parameters, we conducted more thorough parameter scanning (see MATERIALS AND

METHODS for details). First, two parameter combination regimes shown in Figure

4A and B suggest that, with strong auto-activation, Network 84 is ensured to have at

least four states (bottom two rows). Under these conditions, weak mutual inhibition

led to seven or eight states (orange and green boxes) while strong mutual inhibition led

to only four: all two-ON-one-OFF combinations and all-ON (purple box). However,

when both mutual inhibition and auto-activation are weak, Network 84 assumes only

one state, that of all genes OFF (Figure 4A, blue box). When auto-activation is weak

and mutual inhibition is strong the system is capable of the three two-ON-one-OFF

states (Figure 4A, red box, locations of SSS not shown). Taken together, this again

reinforces the importance of strong auto-activation in generating multistability and

hence the potential to canalize cell differentiation. As a third parameter combination

regime, we tested the effect of asymmetrical regulation on multistability (Figure 4C, see

15

Figure 4. Stable Steady State Analysis of Network 84. (A) A matrix presenting thenumber of stable steady states generated by combinations of different auto-regulatorystrengths (rows) and mutual-regulatory strengths (columns). The green and orangeregions are visualized as FCT diagrams independently in B. (B) The green-boxed andorange-boxed diagrams corresponds to the green-boxed and orange-boxed parametercombination regime from (A). (C) Many of the parameter combinations yieldingmulti-stable systems are not represented by the matrix, and are instead abstractedhere. As an example, here we present a parameter combination regime that cansupport 6 SSS and auto-activation strengths are similar to those in the orange box(thick red arrows) but have one unrestrained pair of mutual inhibition that can existat a very high strength (blue T-head repression lines).

MATERIALS AND METHODS for details). Here we increased the mutual repression

of a single pair of genes for the conditions represented by the green and orange boxes

(Figure 4A, auto-regulatory strength between 1- 5) and saw the number of SSS fall

to six. These findings provide possible insight into why certain network topologies

have been adopted by evolution for particular regulatory tasks. For example, Network

84 demonstrates a broad, multistable parameter space that can be tuned, through

regulatory strengths, for the step-wise control of the maximal number of states. Such

16

a mechanism could be envisioned to transit cellular identity from a pluripotent state

to multi- or uni-potent state during differentiation.

1.1.6 A Regulatory Network for Pluripotency

Experiments probing the structure and function of the regulatory network governing

pluripotency suggest that the core pluripotency regulators Oct4, Sox2, and Nanog

are organized into a regulatory topology of mutual activation that is shared with

Network 1 selected by our screen, [21], [79]. This fully interconnected autoregulatory

loop is believed to stabilize self-renewal, govern the bistable switch between ES cell

self-renewal and differentiation, and underlie the ability of selected pluripotency

regulators to ‘boot up’ the pluripotency transcriptional programming upon forced

expression in somatic cells, thereby reprogramming them to a stem cell-like state

[52]. Additional layers of regulatory complexity have been described in this network,

including autorepression by Nanog [80], [81] and differential cell phenotypes in response

to varying levels of these factors [82]–[84]. Moreover, Nanog acts as a homodimer [85],

while Sox2 and Oct4 heterodimerize to regulate transcription [86], and attempts have

been made to model the regulatory dynamics of these transcriptional complexes and

their interaction with signaling factors [87]. Further adding to the complexity, Oct4

may also partner with other Sox factors to form alternative regulatory complexes that

bind a distinct set of enhancers [88].

Notably, protein levels of the core pluripotency regulator Nanog have been observed

to fluctuate in cultured pluripotent stem cell populations [23], [24], [89]–[92], suggesting

that this regulatory network is capable of adopting multiple states. To further explore

possible roles of multistability in stem cell transcriptional regulation, we carried out

17

Figure 5. Bifurcation and Stochastic Analysis of Network 1. (A) Network 1 sharesthe topological architecture of the Oct4-Sox2-Nanog triad, and is believed to underliestem cell differentiation and reprogramming. (B) The bifurcation diagram of Network84 plots transcription factor concentrations of X and Y at each SSS against α, theself-activation strength. The mutual inhibition parameters are also set equal andreferred to as β, here set to 0.1 and α is used as the bifurcation parameter rangingbetween .01 and 100. SSS are color coded and listed in the legend to distinguishdifferent SSS, where ‘+’ indicates the gene is ‘ON’ and ‘-’ means ‘OFF’. (C)Simulations of noise-induced state transitioning in Network 1 under different levels ofnoise. Simulations were performed with auto-regulation equal to 1 and mutualinhibition equal to 0.1, with noise levels of 0.5, 0.85, and 1 from left to right. Thelocations of the deterministically calculated states are indicated with red spheres,with their size correlated to their spectral radius. The blue ribbons indicate thetemporal trajectories of the simulations. The black arrows indicate initial direction ofstate transitions. (D) The number of states traveled in the stochastic simulationsplotted versus noise strength. The red line represents simulations that were initiatedfrom the all-ON state, and the blue line represents simulations that were initiatedfrom the origin.

18

Figure 6. Stable Steady State Analysis of Network 1. (A) A matrix presenting thenumber of stable steady states generated by combinations of different auto-regulatorystrengths (rows) and mutual-regulatory strengths (columns). The green and orangeregions are visualized as FCT diagrams independently in B. (B) The green-boxed andorange-boxed diagrams corresponds to the green-boxed and orange-boxed parametercombination regime from (A). (C) Many of the parameter combinations yieldingmulti-stable systems are not represented by the matrix, and are instead abstractedhere. As an example, here we present a parameter combination regime that cansupport 6 SSS and auto-activation strengths are similar to those in the orange box(thick red arrows) but have one unrestrained pair of mutual inhibition that can existat a very high strength (blue T-head repression lines).

detailed analysis on this network topology (Figure 5A). Our goal in doing so is not to

capture every regulatory complexity present in the endogenous biological network, but

rather to provide a theoretical framework for how this simplified network topology and

fluctuations in network components might affect multistability. As with Network 84,

we first carried out the bifurcation analysis for Network 1 (Figure 5B). Considering

the topology of Network 1 consists solely of positive regulation, one might expect that

the network would primarily assume the all OFF “resting” or all ON “activated” states.

This intuition is largely correct but not complete. While network 1 does tend toward

19

the all-OFF state it is also capable of generating eight states when the strengths of

auto-regulation and mutual regulation are 1 and 0.1, respectively (Figure 5B and C;

Figure 6A). This diagram (Figure 5B) shares some similarities with that of Network

84 (Figure 3B). For instance, both networks exhibit eight SSS when auto-activation

strength is equal to 1, and fewer as auto-activation strength increases. Likewise, when

auto-activation is reduced to be smaller than 1, both networks collapse to only support

one SSS (the black line). Thus Network 1 also displays discrete emergence and collapse,

or hysteresis, of SSS, allowing it to potentially serve as a foundation for irreversible

cell fate decision-making. It is interesting that while the two-ON states (blue, cyan,

and purple lines) collapse readily as α increases, the one-ON states (green, brown,

and yellow lines) and the all-ON and all-OFF states (orange, black), are stable over a

wide range of parameter values. It is also important to note that despite similarities

in the state space locations of each of the eight SSS in Networks 1 and 84, there are

considerable differences in their relative stabilities.

As with Network 84, we next incorporated different strengths of gene expression

stochasticity, or noise, to simulate state transitioning (Figure 5C). The simulations

were started from the all-ON state to mimic the all-ON pluripotent state of the

Oct4-Sox2-Nanog triad. In the presence of small levels of noise, the system remained

in the initial state throughout the duration of the simulation (Figure 5C, left panel).

When the noise level was increased, the system was able to spontaneously transition

to other states before settling in the all-OFF state, which represents the all-OFF

terminal, or differentiated, state (Figure 5C, middle panel). When the noise level was

increased further, the system transitioned out of the initial state much earlier toward

the all-OFF state via an intermediate state (Figure 5C, right panel), confirming that

the all-OFF state is indeed the terminal state and robust to different levels of noise.

20

It should be noted that simulation times were the same for each panel (Figure

5C). In the high noise level simulation (Figure 5C, right panel), the system reverted

to the all-OFF state relatively quickly and exhibited small stochastic deviations

that rarely exceeded the boundary of the sphere representing the state. This could

perhaps be likened to the terminal state of the Oct4-Sox2-Nanog triad after cell

differentiation, where expression of these genes is markedly depressed in gene expression

profiles. Furthermore, for the modeling of cell fate decisions during development, the

binary nature of these simulations appears to mimic the observed stable states of the

Oct4-Sox2-Nanog triad, typically in either the all-ON or all-OFF state, with partial

activation visible transiently during iPSC induction [17], [18], [93].

Much like the trajectories observed for Network 84, those for Network 1 exhibited

the ability to transition between many states stochastically, with the potential of

accessing five of the eight possible states. Although, as depicted, the residence time

of states for Network 1 was largely restricted to either all-ON or all-OFF (Figures

3C, 5C). Network 1 simulations could not access the three two-ON-one-OFF states

unless initiated there, presumably because of the states’ low stabilities. Moreover,

unlike Network 84, all the simulation trajectories for Network 1 ultimately reached

and remained at the all-OFF state, a result of Network 1’s topology and the superior

stability of the all-OFF state. This also led to a lower average number of states

traveled by Network 1 at higher noise levels (Figure 5D). A higher noise threshold for

state switching of this sort could serve as added protection against aberrant regulation

of pluripotency. From a modeling perspective, these results offer a useful means of

rapidly predicting the dynamical stochastic trajectory of a network based on state

stabilities, without having to run entire stochastic simulations.

As with Network 84, we tested the sensitivity of Network 1 to different combinations

21

of regulation strengths (Figure 6). Again, weak mutual activation and strong auto-

activation gave rise to the maximal number of states (Figure 6A, green and orange

boxes). Interestingly, however, the remainder of the regulation parameter space

was much simpler than that of Network 84, yielding only one or two states (Figure

5A, blue and red boxes). Again, independent manipulation of mutual regulatory

interactions can also lead to parameter combinations that enable multistability. A

thorough parameter scan for such multistability found a third interesting configuration,

one where a gene (Z) propagates signal to downstream genes (X and Y), which also

have strong auto-activation (Figure 6C). This pattern has one defining characteristic

in that it maintains multi-stability as long as there is only one master while the

promoted genes have weak interactions with each other. Such multi-stability is lost

when the system adopts a two-driver gene topology. Therefore Network 1 may be

considered a largely bimodal topology that retains the potential for greater state

diversity, but only within a narrow parameter window (Figure 5B and Figure 6A).

This characterization of Network 1 topology is supported by other studies. Specifically,

while claims of monoallelic expression of Nanog [89] have been contested [24], [29],

a common finding across the literature is heterogenous expression of endogenous

Nanog protein in cultured ES cells by immunostaining [23], [24], [75], [76], [89], [90].

These findings suggest bistable expression of Nanog, and support the idea that the

Oct4-Sox2-Nanog network may exist in at least four stable states (all high; Oct4 and

Sox2 high but Nanog low, as occurs in low Nanog ESCs; high Sox2 but low Nanog

and Oct4 as occurs in neural progenitor cells [94]; or all low/off in the differentiated

state).

Analysis of the regulatory parameter space for networks 1 and 84 raises an important

consideration that may be broadly applicable. Could progressive lineage determination,

22

or restriction of pluripotency, be correlated with the strength of mutual regulation?

For both of the networks analyzed, the highest potential for multistability (eight states)

was found in a parameter space with weak mutual regulation (Figure 4A and Figure

6A). As the strength of mutual regulation increases from this point, more restricted

potential fates predominate, much as would be expected as cell lines differentiate. The

influence of greater mutual regulation on the potential for multistability would be

interesting to explore, and perhaps is well-suited to modeling in the hematopoietic

system where hematopoietic stem cells (HSCs) or multi-potent progenitors could be

compared with bi-potent progenitors.

1.1.7 In Silico Induced Pluripotency

Reprogramming mature cell types into iPSCs is achieved by the overexpression of a

number of defined transcription factors (typically including Oct4, Sox2, c-Myc, and

Klf-4) that regulate the necessary pluripotency phenotypes as well as each other’s

expression [18]–[20]. Recently, it has been shown that overexpression of only Oct4,

Sox2, and Nanog results in similar rates of iPSC generation from human amnion cells

[95], and, importantly, the Oct4-Sox2-Nanog triad is highly expressed and appears

to be the core regulatory circuit for pluripotency in stem cells [96]. To construct a

simplified in silico version of the iPSC experiment involving the Oct4-Sox2-Nanog

triad, we added a constitutive expression term to each gene of Network 1 (Figure 7A,

see MATERIALS AND METHODS).

To begin, we performed bifurcation analysis on the network 1 topology with an

added term for constitutive expression (Figure 7B). The addition of a basal level of

expression to Network 1 resulted in at most five possible states, two of which (the

orange all-ON and black all-OFF lines) possessed considerably higher stabilities than

23

Figure 7. Bifurcation and Stochastic Analysis of Network 1 with constitutive basalexpression. (A) The topology of Network 1 is used to mimic iPSC experiments withadded constitutive expression of all three genes. (B) The X and Y proteinconcentrations of stable and unstable steady states are plotted against self-activationstrengths. With the addition of the constitutive expression we can see a furtherdecrease in stabilities, specifically two-ON states are never stable. Additionallyone-ON states lose stability very rapidly, approximately at α=1, and the all-OFFstate destabilizes after α=9 leaving only the all-ON state. This can also be seen inthe sizes of the spectral radii, deterministically these steady states exist but inpractice they have very little influence. (C) Simulations of Network 1 modeled in thepresence of an additional constitutive overexpression term consistently show atransition from an all-OFF state to an all-ON state after a period of latency.

the others (Figure 7B). All one-ON-two-OFF SSS (green, brown, and yellow lines)

are only stable for a narrow range of α. Even when they are stable, their stabilities

are insignificant. Looking closely we can see that between the red, green, and yellow

lines there are 3 unstable paths without a stable component. These lines would be the

locations of two-ON-one-OFF states if they achieved stability. More importantly, as

auto-activation strengthens, even the all-OFF state’s stability becomes insignificant

when compared with the all-ON state (Figure 7B gray circles).

A comparison of bifurcation diagrams of Network 1 with and without the constitu-

tive expression term (Figure 5B, Figure 7B) highlights some striking and important

24

differences. Firstly, as observed experimentally, by adding the constitutive basal

expression term to all three genes, there is increase in the stability of the all-ON state.

Moreover, our simulation predicts a reduction in the number of possible SSS from

eight to five, as well as a significant reduction in the stability of the non all-ON or

all-OFF states. While these findings must be considered within the limits of our study,

they suggest constitutive expression could help to canalize the transition from all-OFF

to all-ON by simplifying the triad’s regulatory space, and thus provide a foundation

for induced pluripotency.

To further examine modified Network 1, the effects of gene expression noise were

investigated through stochastic simulation from the all-OFF state with a moderate

level of noise (see MATERIALS AND METHODS). Consistent with experimental

evidence [34], at the start of the simulation, the system hovered near the origin for a

period of time and then began to transition toward the all-ON state. Once again, by

adding a basal overexpression term to the genes, we found that the stability of the

all-ON state was increased to be greater than that of the all-OFF state (Figure 7C).

As a result, the system was attracted to the all-ON “pluripotent” state in a way that

was not observed with the unmodified Network 1 (Figure 5C).

These findings reflect observations from the experimental practice of induced

pluripotency in which the reprogramming factors are strongly overexpressed, and may

shed light on the regulatory mechanisms behind induced pluripotency. For instance,

the overexpression of defined transcription factors in differentiated cell types may alter

the regulatory parameters of the FCT into ones that prefer the all-ON “pluripotent”

state. Affected cells would then gradually transition from their original state to the

preferred all-ON state. Further, as is often done experimentally, we removed the

constitutive expression term from our simulation and found the system to return

25

to an unmodified Network 1 behavior. Much like iPSCs, the network was able to

re-differentiate to the all-OFF state or to remain in the all-ON state, under conditions

of low noise or high auto-activation respectively.

Figure 8. Stochastic simulations capture latency distributions in inducedpluripotency. (A) Distribution of simulated latency for the modified Network 1 modelillustrates a skewed bell shape. The histogram was generated from 2000 simulations.(B) Similar results generated from 2000 simulations with Network 84.

To further test the utility of our model as an abstraction for the transcriptional

regulation of stem cells and stem cell reprogramming, we next used our model to

predict the temporal dynamics of stochastic reprogramming. We first investigated

the time required for the system to transition from the all-OFF to the all-ON state,

which is termed the first passage time (FPT) [97]. An FPT distribution of induced

pluripotency was generated by running 2000 simulations with our modified Network 1,

resulting in a characteristic asymmetrical bell shape distribution with a long tail and a

peak at approximately 50 normalized time units (Figure 8A). To ensure the generality

of this observation, an FPT distribution for the transition from one state to another

26

in Network 84 was similarly generated and found to yield the same pattern (Figure

8B). Such a skewed bell shape distribution is consistent with published experimental

data [34].

1.2 Discussion

Over fifty years have passed since the connection between development and multista-

bility was first proposed [3]. However, despite efforts in the last decade to reverse

engineer transcription regulatory networks, the molecular basis for multistability is still

not well understood. Likewise, multistable systems with more than two states have

not yet been constructed de novo. We address this challenge with a high-throughput,

dynamical systems search algorithm, powered by parallel computing, that identifies

gene network topologies enriched in their capacity to realize multistability.

Focusing our screen on FCTs, our algorithm identified complete auto-activation

as an important topological feature for high-dimensional multistability, along with a

variety of combinations of mutual regulation. Our analyses shed light on existing ideas

and generate new testable hypotheses for some of the regulatory mechanisms underlying

cell fate determination. Many of the seemingly disparate experimental observations

including random reprogramming latency, the existence of intermediate states during

differentiation and reprogramming, population heterogeneity, and alternate routes of

cellular state transitioning [5], [27], [34], [96], [98], can potentially be connected and

addressed by our dynamical analyses of FCTs.

The central importance of complete auto-activation in multistable triads suggests

that nodes for such regulatory networks could be identified in exogenous overexpression

screens for known key regulators. In such a screen, the components of multistable

FCTs would self-identify, by maintaining their own expression after the withdrawal

27

of candidate transcription factor overexpression. Transcription factors identified this

way could represent “pioneers”, those capable of activating regulatory partners or

overcoming transcriptional inaccessibility, i.e., silenced loci, such as Oct4.

The identification of regulatory partners (mutual regulation) could also be char-

acterized by the analysis of promoter/enhancer regions upstream of identified auto-

activation transcription factors. Such a “guilt by association” method may also help

to infer the function of the triad as a whole, or lesser-known regulatory partners.

Thus, using established developmental transcription factors as seeds, the Waddington

landscapes of whole cell lineages could perhaps be mapped by using auto-activation

as a cipher.

The notion that specific topologies may serve particular regulatory roles better

than others is supported by the identification of an additional Network 1 regulatory

topology in a stem cell population. The transcription factors Gata2, Fli1 and Scl/Tal1

form a regulatory triad that operates through mutual and auto-activation to promote

hematopoiesis in the mouse embryo [99]. As discussed in the context of the Oct4-

Sox2-Nanog triad, the Network 1 topology has a tendency to operate in a bimodal

and terminal manner, meaning that once switched to the all-ON state the triad has

a tendency to be locked into that state. Such a regulatory system would impart

a sort of cell “memory”, perhaps allowing newly formed HSCs to travel from the

aorta-gonad-mesonephros (AGM) or primordial niche to the fetal liver and bone

marrow niche without losing their multipotent state [99], [100].

Importantly, regulatory triads do not exist in isolation and are not static entities.

Rather, inputs from the environment and other signaling pathways can influence the

state or composition of endogenous triads through the perturbation of regulatory

relationships. The consequence of such influence was evident in our bifurcation

28

analysis of Networks 84, 1, and modified 1, where small changes to the strengths of the

network regulation created or eliminated SSS. Thus, under the influence of external

signaling (e.g., post-translational modification), subcellular localization of co-factors

or regulatory ligands [101]–[103], the same network topology may alternate between

different regulatory state configurations. Such state switching could be an effective

regulatory strategy for responding to changing environmental and developmental cues.

While gene expression stochasticity is an accepted factor of cell fate determination

[61], its role in complex processes such as differentiation remains elusive. We subjected

our chosen multistable networks (e.g., Network 84) to stochastic simulations and found

that moderate levels of gene expression noise were sufficient to disrupt the balance of

protein abundances required for maintaining all-ON and all-OFF states, even with

high protein levels. In other words, our simulations suggest that noise is critical to a

biological system’s ability to transition from meta-stable states to differentiated states.

Moreover, noise-induced transitioning showed multiple possible routes from one state

to another, with the trajectories traversing step-wise through adjacent states (Figures

3C, 5C). This is possibly the in silico manifestation of experimental observations

showing progenitor cells passing through alternate intermediate states on the way to

a differentiated state [5]. Additionally, by testing a range of noise levels, we found

that the number of states traveled by these multistable systems increased sharply

at specific noise levels (Figures 3C, 5C). Perhaps cellular systems straddle such a

threshold, whereby gene expression noise can either be suppressed or amplified to

realize very different physiological outcomes. The advent of single-cell gene expression

profiling methods [104] affords new windows on cellular heterogeneity, and theoretical

frameworks such as we present here may provide valuable insights when informed by

the experimental data these technologies will provide.

29

Like signaling cascades in their simplest form, the regulatory relationships between

the genes of FCTs define their meta-properties. In a manner reminiscent of the low-

pass filtering capacity of cascades, our stochastic simulations detected network-specific

variation in tolerance for noise. Specifically, in the analyses of Networks 1 and 84,

significantly different noise thresholds for multistability were measured—0.8 and 1.4,

respectively (Figures 3 and 5). This observation suggests that, along with other

measures, networks may be tuned for unique noise-switching thresholds. For example,

the Gata2-Fli1-Sci/Tal1 triad (Network 1) would be predicted to remain in the all-ON

state as long as noise levels are maintained low, but to transition quickly to an all-OFF

state under high noise. This could be tested experimentally using tunable synthetic

noise controllers [94] or generators [105] upstream of these HSC regulators to apply

large fluctuations to the system.

As such, differentiation or pluripotency may provide readily accessible experimental

models. One possible method to induce noise could be the expression of splice isoforms

[106], [107]. Focusing on the endogenous triads discussed, state change dynamics

could be evaluated under the perturbations of either single or multiple splice isoforms.

Thus, while controlling for total protein expression within a particular triad, one

could evaluate whether “noise”, derived from the concurrent expression of multiple

isoforms, influences the product or rate of the induced state transition. If so, an

interesting outcome may be that individual differentiation pathways could be primed

for state change by expressing tailored suites of splice-isoforms or in the case of iPSC

reprogramming, perhaps accelerate the rate of state switching.

Finally, our work offers new directions for synthetic biology [11], [12], [108]–[112].

As a tool, the application of synthetic biology to disease-related or developmental

triads may allow for true reverse engineering of in vivo processes. For example,

30

the regulatory dynamics of Network 1 (Gata2-Fli1-Scl/Tal1 and Oct4-Sox2-Nanog)

or Network 84 (RUNX2, SOX9 and PPAR-γ) topologies could be systematically

examined for response to perturbation or noise by constructing such FCTs in an

orthogonal environment (e.g., bacteria). Similarly, the effects of stochasticity on the

regulation of pluripotency could be studied directly in mammalian systems using

tunable synthetic noise generators. Such a system could vary the gene expression of

triad components, while keeping their mean expression level constant, to evaluate

the influence of noise on network state change [57], [113]. Our work here provides a

starting point for the design of next-generation synthetic circuits that could exploit

multistability and its physiological consequences.

1.3 Materials and methods

1.3.1 Enumerating and Eliminating Redundant Networks

Three-node networks can have 39=19683 configurations, each with nine edges that

consist of positive, negative or null regulations between the nodes. Our focus on

FCTs resulted in a total of 29=512 possible configurations because each edge can

only represent positive or negative regulation. Through permutation analysis many of

these 512 networks were found to be identical to each other. Identical networks were

eliminated, leaving us with 104 unique networks (see SI).

1.3.2 ODE Modeling and High-Throughput Screening

A three-node gene regulatory network can be described by a three-dimensional ODE:

dx

dt= a1f1 (x) + b1g1 (y) + c1h1 (z)− δx (1.1)

31

dy

dt= a2f2 (x) + b2g2 (y) + c2h2 (z)− δy (1.2)

dz

dt= a3f3 (x) + b3g3 (y) + c3h3 (z)− δz (1.3)

In this formulation each variable (x,y,z ) represents the protein abundance of one

gene product (Equations (1.1)–(1.3)). Here only one generic equation is used to

describe the gene expression regulation including transcription, translation, and other

post-transcriptional and post-translational modifications. Each equation captures

the core nonlinear dynamics of gene expression regulation and omits details of the

fine-tuning of it. The parameters (a1−3,b1−3,c1−3) represent the strength of mutual or

auto-regulation, be it activation or inhibition. Each function (f, g or h) has the form Fn

/ (kn + Fn) when representing activation, and the form kn/(kn+Fn) when representing

inhibition, where F can be either x, y, or z. The activation and repression of each

gene by other factors are modeled additively to account for reported mechanistic

independence of transcription regulations [16], [40], [46]. For example, Network 84

shown in Figure 3A can be described by the ODE:

dx

dt= a1

xn

kn + xn+ b1

kn

kn + yn+ c1

kn

kn + zn− δx (1.4)

dy

dt= a2

kn

kn + xn+ b2

yn

kn + yn+ c2

kn

kn + zn− δy (1.5)

dz

dt= a3

kn

kn + xn+ b3

kn

kn + yn+ c3

zn

kn + zn− δz (1.6)

In the models values of n=4 and k=0.5 were assumed for all genes (Equations (1.4)–

(1.6)). While it is known that transcription factors critical in stem cell programming

often form homodimers [85], [114] or heterodimers [115], n is set to be equal 4 because

the nonlinearity quantified by n is often affected by many factors in addition to

protein multimerization [10]. A hill coefficient of 4 has been used previously to

demonstrate generic behavior of stem cell differentiation [16]. Recently, hill coefficients

32

up to 9 have also been used [40] to construct models with experimentally verified

predictions. It is also found in our study that hill coefficient values affect the region

of multistability but not general conclusions. While it did not affect the results of

steady-state calculations, δ was set to be 1; this assumption can be adjusted in light of

future available experimental results. For mutual regulation strength, the parameters

(b1, c1, a2, c2, a3, b3) were set to [0.1, 0.2, 0.5, 1 or 5]; for auto-regulation strength,

the parameters (a1, b2, c3) were set to 0.1, 0.5 or 1. These parameter possibilities

lead to a total of 56*33=421,875 parameter sets to test.

We numerically solved for the root of the right-hand side of each ODE set with

over 1000 different initial guesses uniformly distributed over the entire state space.

Measuring the probability of multistability for each FCT was accomplished by analyz-

ing all 421,875 parameter sets. A parallel algorithm was developed to distribute this

task as an ensemble of independent computations (See the SI).

1.3.3 Analysis of the Parameter Space

In addition to the tabular data (Figure 4A and 4A) that is useful for quick visualizations

and analysis of stabilities, it is desirable to recognize key patterns of parameter

combinations (PCs) yielding multistability. Therefore networks chosen (Network 84

and 1) for in-depth analysis had their results recomputed with a wider sample of

parameter spaces. Specifically the PCs were expanded to five levels for each edge, [0.1

0.5 1 2 6], yielding 5^9 = 1,953,125 possible PCs. The found clusters can then be

generalized to create the FCT diagrams in Figures 4C and 4C.

33

1.3.4 Bifurcation Analysis

Bifurcation diagrams were generated using MatCont [116], a continuation toolbox

for MATLAB. Bifurcation analysis was performed on the auto-activation strength;

mutual-regulation strengths were set at .1. Spectral radii were gathered at specific

intervals to compare the stability of the stable steady states. Stable and unstable

steady state information was graphed to generate Figures 3B, 5B, and 7B.

1.3.5 Stochastic Simulations

The ODE model can be converted to chemical Langevin equations to introduce

stochasticity while approximating the Gillespie algorithm [68], [117]. All parameters

were scaled up so that copy numbers of gene products are in biologically reasonable

quantities and can reach over one hundred. This was done by multiplying the

coefficients in a, b, c and k by 100 with other parameters unchanged. This linear

shift does not alter the bifurcation dynamics or the relative locations of SSS. At each

iteration, all chemical species are updated using the equation:

N (t+ ∆t) = N (t) + ∆tA (N (t)) + S√∆t (F (N (t)) +B (N (t)))z (t) (1.7)

In this equation, N (t) represents the abundance of chemical species, A(t) is the

right-hand side the ODE, and F and B are the forward and backward reaction terms

in each equation, respectively, which were extracted from corresponding ODEs. z (t)

are the standard Normal variables, S is the stoichiometric matrix of the biochemical

reactions (refer to [68] for additional details). In our modified algorithm, we added

one parameter, α, to Equation 1.7 to represent the tunable noise strength:

34

N (t+ ∆t) = N (t) + ∆tA (N (t)) + α ∗ S√∆t (F (N (t)) +B (N (t)))z (t) (1.8)

Here α was the noise strength used in Figures 3C and 5C. Therefore, the results

were comparable to pure discrete simulations when α equals 1. Modulation of the

noise strength(α) was used to facilitate investigations of the system’s dynamics in

response to different noise levels. The multiplicative noise term is chosen for simplicity

without loss of generality.

1.3.6 Statistical Tests

All statistical tests were performed using Matlab. The parameters for each distribution

were estimated using the Matlab command mle. Chi-square (X 2) statistics for each

distribution were computed with the chi2gof command.

35

Chapter 2

GENE NETWORK MODELING

Embryonic stem cells have the capacity to differentiate into more than 200 isogenic

progenitor and terminal cell types [1], each of which represents a stable cellular state.

The potential for a system to realize multiple states is termed multistability, which has

been proposed as the mechanism underlying cell differentiation for over half a century,

with stable states represented as either valleys in the developmental landscape[2], [3]

or as dynamic attractors in high-dimensional gene expression space [4]–[6]. These

states are realized through a combination of inputs resulting from both the cytoplasm,

and the host of the cell. The interplay of genes with these inputs within the cell drives

cells towards large responses such as embryonic or carcinogenic differentiation, but also

smaller responses such as metabolic activities [118], stress responses . Understanding

these interactions is critical to the construction of novel gene circuits with a behavior

that is predictable in silico, and consistent in vivo [119].

The ability to infer gene networks has made tremendous strides over the past

decades thanks to advances along 3 main fronts. Decreased sequencing costs have lead

to an explosion of available sequencing data, much of which is available to researchers

from systems like the NCBI GEO DataSets [120]. Due to the nature of the problem,

whereby tens of thousands of genes are predicted using hundreds of individual sample

points, the problem is massively underdetermined. This ability to access and pool

these samples puts better constraints on the system as a whole, and leads to much

stronger conclusions during network inference. Second, significant advances have been

made in the algorithms used for gene network inference. These advances have been

36

made through both algorithmic advancements, and novel approaches made possible

through generally increasing computational power. Third, the quality of sequencing

results has improved significantly. In specific the length of sequence that can be read

successfully has improved, and these longer sequences can more easily be resolved

to unique positions, increasing the quality of gene quantifications, and consequently

network inference.

Here we discuss network inference methodologies capable of inferring these network

interactions from sequencing data and some of their limitations. We discuss limitations

of tools downstream from the sequencers, and propose a solution to integrating data

from multiple sources. We propose a technique for automatically collecting and

pre-processing a large corpus of sequencing runs from public sources, allowing for

increased power during the network inference phase. We demonstrate the viability of

our pipeline as a means of both increasing input data, and decreasing gene expression

noise, a significant problem in network inference [104], [121]–[123].

2.1 Network Models

Gene regulatory network (GRN) modeling is a technique dedicated making functional

understanding of gene regulation at a network level, a crucial step in the understanding

of cell fate determination. Specifically it seeks to understand the processes behind

GRNs and employ them to engineer novel biological devices, metabolic processes,

and therapeutics. Since the development of using Boolean networks [4] to model

cell fate determination, there have been massive advances in model accuracy. These

advances have sourced both from growth in computational abilities, but also in

greater understanding of cells and gene networks. As computational power has grown

[124], the desire for more powerful and predictive models has evolved the modeling

37

landscape from simple and lightweight into more complex and flexible models. Initially

starting with Boolean networks [4], [125] and moving through deterministic and linear

models [126], [127], Bayesian network modeling [128]–[130], and more recently to

differential equation based models. Differential equations have the added benefit of

being extensible to incorporate stochastic simulations in a more natural way than

previous models.

Figure 9. Two fully connected(A/B), and two partially connected(C/D) triads aredepicted. (A/C) illustrate only positive network regulation (activation), indicated byred arrows, i.e. increasing quantities of the transcription factor at the tail of thearrow also increases quantities of the transcription factor at the head. (B/D) utilizeboth positive and negative network regulation (repression), indicated by the bluebars, i.e. increasing the quantities of the transcription factor at the tail of the errordecreases quantities of the transcription factor at the head

2.1.1 Boolean Networks

Since their introduction, Boolean networks have remained a first tool choice for many

gene network modelers, even as more powerful tools become available [125], [131],

[132]. The primary motivations for this are their low cost of simulation, and a strong

theoretical background from computer science theory. Additionally, because the model

38

is very simple Boolean networks tend to be easy to fit with regard to experimental

data. As a technique for network approximation this remains a go-to tool, with more

recent and advanced Boolean network models [132], [133] capable of modeling cell state

dynamics, and even hysteresis. While these models are efficient for an overarching

view, they lose traction as one attempts to model cell state with more fidelity, both

due to their discretization of the input domain, and particularly because of difficulties

in modeling the influence of noise.

Boolean networks are modeled as a set of boolean expressions defining transitions

between state n and state n+1. As an example Equations (Equations (2.1)–(2.4))

show one set of boolean rules for the networks from Figure 9. It is relatively clear that

these rules are overly restrictive by the network’s inability to model oscillations, some

of the steady states, or many of the empirical behaviors observed by [119]. Larger

networks, however, are still capable of modeling some stochastic behaviors, including

oscillations and dynamic attractors [4]. These models have performed remarkably

well [122] despite their simplicity, likely due to the significantly reduced number of

features required to properly fit the model. Boolean network models have fairly limited

usage in simulation [133], but these are still quite successful at inferring rules for the

connectivity of genes, suggesting positive and negative regulations [121], [134], [135].

Boolean Equation for Figure 9A:

fx(x, y, z) = x ∧ y ∧ z

fy(x, y, z) = x ∧ y ∧ z

fz(x, y, z) = x ∧ y ∧ z

(2.1)

39

Boolean Equation for Figure 9B:

fx(x, y, z) = x ∧ ¬y ∧ ¬z

fy(x, y, z) = ¬x ∧ y ∧ ¬z

fz(x, y, z) = ¬x ∧ ¬y ∧ z

(2.2)

Boolean Equation for Figure 9C:

fx(x, y, z) = y ∧ z

fy(x, y, z) = y ∧ z

fz(x, y, z) = y ∧ z

(2.3)

Boolean Equation for Figure 9D:

fx(x, y, z) = ¬y ∧ ¬z

fy(x, y, z) = y ∧ ¬z

fz(x, y, z) = ¬y ∧ z

(2.4)


gene product. Much of the success of Boolean networks can be attributed to the

relatively small number of features required to fit the boolean networks, and the

simplicity of the resultant models. In the examples above the model only needs to

determine whether gene X is activated, repressed, or indifferent to the behavior of

other nodes. Boolean models can be extended to allow for some increased flexibility,

for example instead of requiring all conditions to match only a subset are required

to match, but as the number of subsets increases the difficulty in fitting the model

increases as well.

2.1.2 Bayesian and Neural Networks

Bayesian and Neural networks offer a more powerful alternative to the less compli-

cated Boolean networks. Bayesian networks allow a generated model and parameter

40

combination to be inferred almost in their entirety from the input data [129], removing

some of the need for domain knowledge. Neural networks in particular have shown

great success in fields such as gene expression profiling [136] and credit card fraud

detection [137], [138]. Much of this success comes at two costs: first, the input data

set must be quite large relative to the dimensionality of the problem [139], and second,

much of the power behind these tools must be sacrificed in order to maintain human

interpretability of the resulting model.

Neural networks are typically highly connected [140], meaning that the parameters

required are often on the order of O(N^2). Additionally, where most other networks

have only an input or current step, and an output, or next step, neural networks often

contain “hidden nodes” nested between the input and output space. These nodes have

an effect on the result and are necessary to model high dimensional nonlinearity [134],

but they dramatically increase the number of features rendering network inference

problems exceedingly underdetermined. Additionally, the meaning of these hidden

nodes is often difficult, if possible, to interpret.

Bayesian Equation for Figure 9A:

Px(x, y, z) = a1P (x)b1P (y)c1P (z)

Py(x, y, z) = a2P (x)b2P (y)c2P (z)

Pz(x, y, z) = a3P (x)b3P (y)c3P (z)

(2.5)

Bayesian Equation for Figure 9B:

Px(x, y, z) = a1P (x)b1(1− P (y))c1(1− P (z))

Py(x, y, z) = a2(1− P (x))b2P (y)c2(1− P (z))

Pz(x, y, z) = a3(1− P (x))b3(1− P (y))c3P (z)

(2.6)

41

Bayesian Equation for Figure 9C:

Px(x, y, z) = b1P (y)c1P (z)

Py(x, y, z) = b2P (y)c2P (z)

Pz(x, y, z) = b3P (y)c3P (z)

(2.7)

Bayesian Equation for Figure 9D:

Px(x, y, z) = b1(1− P (y))c1(1− P (z))

Py(x, y, z) = b2P (y)c2(1− P (z))

Pz(x, y, z) = b3(1− P (y))c3P (z)

(2.8)

In Equations (Equations (2.5)–(2.8)) each variable (x,y,z ) represents the protein

abundance of one gene product. Bayesian networks are not significantly more difficult

to learn than boolean networks, adding only a small number of new parameters. Thanks

to this simplicity Bayesian networks have been used to replace Boolean networks when

slightly more predictive power is needed, or more data is available [141], [142]. In the

examples above the model needs to determine whether x is activated, repressed, or

indifferent to the behavior of other nodes, and also determine the strength of each

link probabalistically. Boolean models can be extended to allow for some increased

flexibility, for example instead of requiring all conditions to match only a subset are

required to match, but as the number of subsets increases the difficulty in fitting the

model increases as well.

2.1.3 Differential Equation Networks

Linear and nonlinear differential equation network models provide an important

advance in the understanding and modeling of GRNs [119]. They present fewer

variables than Bayesian networks while allowing a potentially better behavioral fit.

42

Additionally they utilize a continuous input and output domain while maintaining

the interpretability and some of the simplicity of Boolean networks. While linear

and nonlinear differential equation models are superficially similar, recent research

has shown that the effect of transcription regulations on genes is nonlinear in nature

[12]. The nonlinearity of transcription regulator binding has been demonstrated and

previous models have been capable of approximating nonlinearity through additional

parameters, but we recently showed that the effects are nonlinear even with a single

gene activator or repressor [12]. This nonlinearity explains the existence of multiple

stable steady states and encourages yet more accurate modeling of cell state, as a

small change in initial conditions in an undifferentiated cell may lead to a major

change between differentiation pathways.

Differential Equation for Figure 9A

dx

dt= a1

xn

kn + xn+ b1

yn

kn + yn+ c1

zn

kn + zn− δx

dy

dt= a2

xn

kn + xn+ b2

yn

kn + yn+ c2

zn

kn + zn− δy

dz

dt= a3

xn

kn + xn+ b3

yn

kn + yn+ c3

zn

kn + zn− δz

(2.9)

Differential Equation for Figure 9B

dx

dt= a1

xn

kn + xn+ b1

kn

kn + yn+ c1

kn

kn + zn− δx

dy

dt= a2

kn

kn + xn+ b2

yn

kn + yn+ c2

kn

kn + zn− δy

dz

dt= a3

kn

kn + xn+ b3

kn

kn + yn+ c3

zn

kn + zn− δz

(2.10)

Differential Equation for Figure 9C

dx

dt= b1

yn

kn + yn+ c1

zn

kn + zn− δx

dy

dt= b2

yn

kn + yn+ c2

zn

kn + zn− δy

dz

dt= b3

yn

kn + yn+ c3

zn

kn + zn− δz

(2.11)

43

Differential Equation for Figure 9D

dx

dt= b1

kn

kn + yn+ c1

kn

kn + zn− δx

dy

dt= b2

yn

kn + yn+ c2

kn

kn + zn− δy

dz

dt= b3

kn

kn + yn+ c3

zn

kn + zn− δz

(2.12)


gene product (Equations (2.9)–(2.12)). Here only one generic equation is used to

describe the gene expression regulation including transcription, translation, and other

post-transcriptional and post-translational modifications. Each equation captures

the core nonlinear dynamics of gene expression regulation and omits details of the

fine-tuning of it. The parameters (a1−3,b1−3,c1−3) represent the strength of mutual or

auto-regulation, be it activation or inhibition. Each function (f, g or h) has the form Fn

/ (kn + Fn) when representing activation, and the form kn/(kn+Fn) when representing

inhibition, where F can be either x, y, or z. The activation and repression of each

gene by other factors are modeled additively to account for reported mechanistic

independence of transcription regulations [16], [40], [46].

2.2 RNA Sequencing

2.3 Data Aggregation

The past decade has seen numerous revolutions in “next-gen” sequencing, and with

them new tools to aide in their use. This has come largely as an increase in sequencing

accuracy, increased sequence length, and the introduction of single-cell sequencing.

Analysis (Table 1) shows that these breakneck advances in sequencing technology have

resulted in a fractured landscape of techniques that are difficult to rectify, and that

44

ExperimentID

Cell Types ReferenceGenome

Library Prep Aligner andQuantifier

200021180 ES, Neuro2A,MEF

NCBI 37.1 custom Bowtie v0.9.6

200029087 ES, MEF mm9 custom Bowtie v0.9.6200045284 NPC, MEF mm9 Illumina

TruSeq RNASample PrepKit v2

TopHat v2.03and Bowtie2.0.0.6 cuffdiff2.02

200045719 Liver andDevelopmentcells

mm9 SMARTer Ul-tra Low InputRNA for Illu-mina Sequenc-ing

bowtie 0.12.7Rpkmforgenes

200052583 Lung mm10 Nextera bowtie2-2.1.0tophat-2.0.8cufflinks-2.0.2

200059127 Kidney mm9 Nextera TopHat v1.4.1GeneSpring12.6.1-GX-NGS-RV

200059129 Kidney mm9 Nextera TopHat v1.4.1GeneSpring12.6.1-GX-NGS-RV

Table 1. Overview of RNA-seq experiments analyzed with a listing of tool and assetversions.

result in different outputs. In the interest of meta-genomics, authors should provide a

set of pre-processed data, which could be easily utilized by researchers for downstream

applications. In fact many grants require authors to submit their data to the NCBI

GEO repository so that replication of results and downstream analysis is more readily

available. Unfortunately this data is not always usable for researchers as the tools

used to quantify read counts introduce a significant bias into the reads Figure 11,

45

when this bias is varied between experiments it introduces another source of technical

noise. Additionally, while NCBI’s GEO requires authors to submit processed data

there is little restriction on how the data is formatted and which calculations are used,

with some authors providing transcripts per million (TPM) [143], others fragments per

kilobase per million mapped reads (FPKM) [144], others provide a direct read count

[145], and some provide alignments for the individual reads without quantification

[146].

The next possibility is that researchers could have an easy way to take arbitrary

unprocessed reads and align them on their own. To this end there has been a great

deal of effort; from standardization of reference genomes, to online resources such as

Galaxy [147], [148], even including tools such as mygene.info [149] capable of converting

between different gene nomenclatures. Despite this progress many holes remain in

the analysis pipelines; quantified gene readings are not sufficiently available (or are

inconsistent), and quantifying genes is too complicated, requiring significant computing

power and in-depth knowledge of both the sequence library preparation technique, and

of how to perform alignment and quantification in general. Unfortunately this kind of

broad knowledge is not common, and while critical to the alignment phases, having

this kind of information that is not of interest to most downstream computational

studies. To this end we created a tool for automatically downloading experimental data,

recognizing experimental parameters, and performing alignment and quantification

steps such that output data has a minimal bias, and that the bias provided is uniform

among experiments. We further demonstrate that this data contains a smaller amount

of noise relative to the disparate processing pipelines provided by the original authors.

46

2.3.1 Acquiring the Data

Through our analysis of datasets publicly available in NCBI’s GEO repository we

found a few key problems:

1. Varied adapter sequences

2. Multiple barcode handling techniques

3. Differing sequence lengths

Depending on the sequencing chemistry used there are different adapters present

to attach the recerse-transcribed mRNA transcript to the sequencer. In theory these

sequences should not be included in the reads provided, and as a result should have

no influence on downstream processing. In our results we find that they frequently

remain within the sequenced records, likely as a result of multiple adapter attachment.

While analyzing one data set we found over 30% of reads containing a copy of the

adapter sequence (NCBI GEO GSE52583) [143]. Unfortunately GEO does not provide

a standardized field for storing the adapter sequences that were used, consequently one

must analyze the sequenced data and search for overrepresented sequences occurring

on the forward or reverse strand near either end of the read. The results of a failure

to remove the adapter sequence are ambiguous and further depend on the alignment

pipeline used. Read aligners will have varying degrees of success in aligning the reads

to the genome, with some spurious alignments of the sequence against somewhat

random locations in the genome that happen to match the adapter sequence. In order

to remedy this our pipeline performs an automated search for likely adapter sequences

in the sequenced data, and trims them to reduce this noise. In the case that the reads

are too short after the adapter has been trimmed, the reads are discarded.

47

One major problem in sequencing is the random dropout of individual molecules

during sequencing. To overcome this issue most experiments overcome this issue

by using a PCR amplification step, resulting in multiple copies of the same read,

slightly biased by the PCR procedure [150]. Brunskill et al [144] demonstrate that

this PCR bias can be overcome by attaching a barcode to each unique molecule. This

provides a significant reduction in noise, with a minimal loss of sequencing length.

This too provides a challenge to individual researchers who may not be aware of the

presence of a barcode, or the constant flanking sequences. Failure to account for the

barcodes can reduce the accuracy in the same way that an ignored adapter can, while

simultaneously throwing away valuable information. To account for barcodes our

pipeline has a special search within the adapter finder to look for varying bases, which

are automatically recognized as barcodes and utilized to remove duplicated reads.

Read lengths within second generation sequencing continue to grow, starting

with Illumina’s 25 base pair single-ended read, through their current 300 base pair

double-ended read. This increase in read length is critical to the proper resolution of

RNA isoforms, and unique mapping of identified reads. Nonetheless, it too presents

challenges. Increased read lengths drastically increase the probability of reading across

an RNA splice junction meaning that the genome annotation used should contain

the most extensive and accurate list of splice junctions for your organism of choice.

Most researchers use the University of California Santa Clara (UCSC) annotations,

but even among those annotations there are multiple versions in use. This results in

yet another source of noise as different annotation versions will refer to identical genes

with different names that an also be ambiguous, mapping to more than one gene in

annotations from a different group. Even differing versions of annotations from a

48

single group may contain different genes, or may rename genes that were previously

suspected and were later confirmed.

2.3.2 Noise Minimization

Figure 10. In this figure we present the data processing pipeline from 5 otherScRNA-seq experiments available on NCBI GEO. GSE59127 and GSE59129 (left,blue), GSE 52583 (left center, red), GSE21180(center, olive), GSE45284 (right center,orange) and our pipeline (right).

As shown in Figure 11A-C the coefficient of variation(CV) of our pipeline is

approximately in line with those provided by the providers of the respective data sets.

In Figure 11D we demonstrate that despite having similar intra-experimental CV’s

our pipeline produces a reduced inter-experimental CV compared with the mixed

49

Figure 11. Scatterplots showing ratios of Coefficient of Variation(CV), where pointsrepresent CV of a single gene within the original author’s pipeline, and ours. Thex=y line drawn, representing the point where CV is equal between both pipelines.Points on the x>y side are drawn in red, and point on the y>x side are drawn inblack. For the experiments in (A-C), raw scRNA-seq reads are provided by the NCBIGEO service and analyzed using our customized pipeline. In addition to raw readseach author provides estimates of gene activity levels through FPKM or TPM. In(A-C) we plot the coefficient of variation (CV) of our pipeline results against those ofthe original authors. In (D) we group the experiments from (A-C) to visualizeinter-experimental CV.

pipelines of the original authors. Furthermore we expect that the effect of this CV

reduction will continue to become more pronounced as the number of aggregated

works, and distinct sequencing pipelines, increases.

50

2.3.3 Future Work

We propose to add two components to the pipeline: automated parameter optimization

and post-processing of the data. Due to differences in the data it is currently important

to determine key features such as maximum and minimum fragment lengths, adapter

and barcode presence, and in some cases even the format of the read type(color space or

base space) and quality score. This is currently done manually, with some parameters

determined through examination of the sequencing procedures and sequencing files

with other parameters collected by aligning a subset of the data and then resequencing

the remaining data. Automation of these tasks can be performed well through a

rule-based system utilizing lists of sequencers, and a distribution-based analysis of read

lengths. In addition, we propose to post-process data to remove previously included

contextual noise, such as cell-cycle noise. Not all cells undergo cell cycle, but much of

the available scRNA-seq data is from differentiating cells. This differentiation noise can

obfuscate subpopulations of cells, which may have expression levels that are masked

by the boom and bust of genes required for cell division [151]. In differentiating T cells

upwards of 50% of non-technical noise can be accounted for by cell cycle variations

[151]. The techniques to remove this noise can also be expanded to account for

other sources of noise, allowing us to extract and classify noise sources, validate them

biologically, then remove them from the data source. A large pool of homogenized

data will also allow us to replicate algorithmic experiments from other groups to

examine whether they function generally, or only in special cases, and similarly to

develop algorithms that generalize well across data sets from different labs.

51

Chapter 3

NANOPORE READ SIMULATION

Nanopores represent the first commercial technology in decades to present a

significantly different technique for DNA sequencing, and one of the first technologies

to propose direct RNA sequencing. Despite significant differences with previous

sequencing technologies, read simulators to date make similar assumptions with

respect to error profiles and their analysis, resulting in incorrect characterization

of nanopore error. This is a great disservice to both nanopore sequencing and to

computer scientists who seek to optimize their tools for the platform. Previous works

have discussed the occurrence of some bias in the identifiability of certain k-mers, but

this discussion has been focused on homopolymers, leaving unanswered the question

of whether k-mer bias exists over all k-mers, the strength of the bias, how it occurs,

and what can be done to reduce the effects. In this work, we demonstrate that current

read simulators fail to accurately represent k-mer error distributions. We explore the

sources of k-mer bias in nanopore basecalls, and we present a model for predicting

k-mers that are difficult to identify. We also propose SNaReSim, a new state-of-the-art

simulator, and demonstrate that it provides higher accuracy with respect to 6-mer

accuracy biases.

3.1 Introduction

DNA sequencing has become an integral component of biological research, with ap-

plications ranging from gene network or organism identification through biological

engineering. This is accomplished by reading analog nucleotides and representing

52

them as digital sequences of bases that can be further analyzed downstream. Histor-

ically, sequencing has been dominated by synthesis-based approaches involving the

replication of an existing DNA or cDNA molecule, with the fluorescent nucleotides,

providing a stepwise 4-dimensional color space readout of the DNA sequence. This

approach provides high accuracy through amplification of the input sequence, but

these amplification steps obfuscate modifications on the DNA, and lead to uneven

representation of input sequences, largely as a function of GC content [152]. A novel

approach known as nanopore sequencing allows for the identification of an input

sequence by measuring electrical current signals as the DNA strand passes through a

nanopore. This eschews the pre-amplification step, reducing sequence-selection bias,

and allowing for the detection of modifications on the raw DNA strands.

Nanopore data introduces novel challenges when compared with previous genera-

tions of sequencers. First, nanopore reads provide significantly larger mean lengths

and higher error rates than synthesis-based approaches (10,000 bases at 80% accuracy,

compared with 500 bases at 99.9% accuracy) [153]. Additionally, synthesis-based

approaches tend to have an error profile that is not heavily biased by the underlying

DNA sequence, while nanopore errors are strongly connected to the DNA sequence be-

ing read [152], [154]. This bias is unaccounted for in most simulators and downstream

analysis tools, and lacks a published model, but it has significant implications. For

most downstream analysis, sequences that are read in must be aligned to a reference

genome, or to another sequence. Because sequence alignment algorithms are O(N2),

and most genomes are immense, almost all current generation sequence aligners rely

on an indexed set of perfect or nearly perfect “seed“ sequences to propose candidates

for an alignment, where a seed is 6-30 nucleotides that must match on each sequence.

Because of observations on the DNA sequencing data when these aligners were writ-

53

ten, seeds are typically treated as either completely equal, or as a function of “seed”

uniqueness within a genome [155]–[158]. Similarly, after matching a “seed“ between

two sequences the alignment is extended utilizing a cost function for matches and

mismatches, but these cost functions also do not incorporate the varied probability of

error within across a DNA sequence. Furthermore, the ability of the community to

address these issues has be hampered by a lack of tools capable of simulating these

results, and a model to explain possible sources of this bias.

Biological data is often difficult to obtain, requiring significant time and reagents

to culture, prepare, and process experimental samples. Because of this it is beneficial

to run in silico trials of their experiments before attempting a “real” run. Because

these in silico results are being used as a proxy for real data it is similarly desireable

to minimize the differences between in silico results and experimental results. To

this end simulators have been made for most sequencing technologies, incorporating

various facets and behaviors intrinsic to technology. Because nanopore sequencers are

novel in their approach to reading DNA there many of the assumptions inherited from

previous simulators are no longer valid, and while some invalid assumptions have been

identified there are more that remain [159]–[161].

We demonstrate that the specific sequence under investigation strongly contributes

to the local probability of observing an error. While previous studies have observed

that homopolymers are not identified correctly we extend this to show that it is

more extensive than the previously observed homopolymer bias, and that current

generation simulators are unable to properly model this variation in error probability.

We propose a read simulator based on a Markov chain, with observations modified

by the underlying seuqence and demonstrate that it provides a higher fidelity to the

error distribution, which fit from real data. To accomplish this task, we identify the

54

accuracy of individual k-mers and their behavior in different contexts. We also predict

key features of the error bias, and calculate their individual contributions to the final

error. Our simulator is fully automated, allowing for both in silico amplification of

read data, or simulation of data on completely novel genomes. Our contributions are

summarized below:

• We demonstrate that while some base calling error is random there is a significant

component that is systematic and unaccounted for (section 3.4.1).

• We develop a set of features for predicting k-mer accuracy and show that our

model generally correlates well with data found empirically (section 3.4.2).

• We propose an algorithm for simulating reads, a variant of the Hidden Markov

Model (HMM) employed by Nanosim [159], with modifications to apply observed

k-mer bias. We demonstrate that it models the k-mer error distribution better

than other popular read simulators [159]–[161] (section 3.4.3).

3.2 Related Work

Cost, throughput, and accuracy have been major hindrances in DNA sequencing.

With the development of NGS technologies cost has reduced while throughput and

accuracy continue to climb. Still, simulators offer a significant benefit due to their

low cost and exceptionally high throughput, allowing testing while developing new

algorithms. Simulators aim to produce sequences with the most fidelity possible for

their given platform. As such, simulated reads should account for biological and

technical bias [162].

Simulators typically generate synthetic reads by extracting a sequence from a

reference genome and then introducing errors into that sequence. Parameters required

by simulators to introduce these features into the sequence are either provided at run

55

time or are stored inside a metadata called a model or error profile. By analyzing the

alignment of empirical data to a reference genome error profiles are created. Errors

have been generated by first predicting a quality score [163], by base position within

a simulated read [164], or by predicting an error sequence and then applying it to a

read [159], [160], [165].

Modeling of third generation single-molecule reads has many advantages when

compared with second generation sequencers. Polymerase chain reaction (PCR),

required by second generation sequencers during the pre-amplification step, introduces

significant bias, but is not required for third generation sequencers [166]. GC bias, a

secondary effect of the PCR amplification step, resulting in low base accuracy and

high coverage variability, is also removed [167]. Third generation sequencing on the

other hand has new challenges that must be modeled. Simulators for third generation

sequencers must deal with longer reads containing significantly larger stretches of

errors, and a biased error that varies with nearby nucleotides. There are two main

platforms for third generation sequencing, PacBio’s SMRT sequencing, and Oxford

Nanopore’s nanopore sequencing. Each platform has their own read simulators, but

none are unable to properly model the k-mer bias of nanopore sequencing data.

NanoSim [159] is a read simulator for ONT data, modeling reads as the result

of a HMM. The model is fit by aligning empirical reads to a reference genome, then

collecting a list of error subtypes, lengths, and transition probabilities. Reads can

be simulated by sampling a length, then generating a sequence of errors. One major

limitation of this approach is that it is unable to properly model k-mer bias; this

is somewhat overcome by a post-processing step where all homopolymers of length

greater than 5 (e.g. “AAAAAAA“) are compressed to a length of 5.

LongISLND [161] is a read simulator developed for PacBio data. It models k-mer

56

bias explicitly and directly by storing observed mutations for each k-mer it sees,

stored as a key-value pair. It also uses an Extended K-mer (EKmer) model to aid

in homopolymer error generation. An EKmer is a regular k-mer followed by an

integer representing a length of homopolymer covering the middle term in the k-mer

facilitating the simulation of arbitrary stretches of homopolymer without explicitly

observing them. One major limitation of LongISLND is that while is is able to

approximate the k-mer bias it lacks the accuracy level observed in true ONT data.

This is likely due to ONT data having grouped errors (i.e. the probability of error

given an error exists nearby is higher than normal) that are not properly modeled

using their simulator.

3.3 Problem Description

DNA has a label space of Σ ∈ {A,C, T,G} representing the different nucleotide bases.

Let s ∈ Σn be a DNA strand of length n. Using Nanopore technology, a series of

discrete measurements δ is generated, representing the current that passes through

the nanopore at each time step from time t0 to tT , where tT reflects the total time

required for s to completely transit the Nanopore. This creates a corresponding vector

∆ ∈ Rd, where d >> n is the number of discrete measurements obtained from time t0

to tT .

The vector ∆ is subsequently binned into q different bins, B = b1, b2...bq, using

different time intervals that capture ≈ dnmeasurements per bin. The mean µi and

standard deviation σi are subsequently derived for each bin bi. An approximation

strand, s, of the original strand s is reconstructed or “base called” using the sequence

of (µi, sigmai) pairs using either a Recurrent Neural Network [168], or a HMM [153],

57

(A) (B)

(C) (D)

Figure 12. The accuracies of all 6-mers with accuracy on the vertical axis, and akmer index on the horizontal axis. (A) All experiments sorted with respect to theiraccuracy. (B) Experiments sorted with respect to the Escherichia coli R9 2Dexperiment. (C) The R9 E. coli template experiment divided randomly into 5 groups,sorted with respect to the first group. (D) Only the R9 E. coli template experiment,aligned with both Graphmap and BWA-MEM, sorted with respect to BWA-MEM.

[169] in conjunction with a lookup table from the Nanopore manufacturer that specifies

the most probable k-mer base pair for the given µi value. Using the R9 pore from

Oxford Nanopore µi is best explained by a sequence of 6 DNA bases, thus the k-mer

length k = 6 is used in later analysis.

58

3.3.1 Error Source Identification

Naively, one may assume that error is distributed evenly over s, however it is trivial to

see that it is not the case. The expected mean current µ exists on a linear range, thus

k-mers near the minimum current are not as influenced by δ readings lower than their

expected mean, with the opposite true for readings near the top of the range. Second,

if the density range of the k-mer currents is uneven, more error is to be expected in

high density regions [154]. Third, some sequences of k-mers are easily identifiable

due to large changes in current, while others are difficult to identify due to small

changes [154].

To elucidate the drivers of k-mer bias we first calculate a list of k-mer accuracies K,

and propose a set of features F , where each feature fi confers a nonnegative accuracy

to each k-mer. Error cannot be negative, thus we can approximate the influence of

each feature by solving Equation 3.1.

arg minx

‖Fx−K‖2

subject to x ≥ 0

(3.1)

Where x is a vector indicating the contribution of each error feature F. Fx is then

the best approximation of the original k-mer bias with respect to euclidean distance.

3.3.2 Error Features

Error during sequencing can occur in multiple locations, and from multiple sources.

Some of these sources of error are recoverable because of constraints enforced by the

sequential relationship of the bins B, i.e. that each bin predicts a k-mer for itself,

but also implies limitations on adjacent k-mers. Other error sources are significantly

59

more difficult to recover from, such is the case. To this end we generated features to

capture specific subtypes of errors.

Sequence Identifiability refers to the ability to easily differentiate one sequence of

k-mers from another given a sequence of bins. These kinds of errors can arise if there

is a shift in the current that persists for multiple bins, or a drift between recalibration

phases. This can be approximated by calculating the number of sequences of length L

with a manhattan distance lower than a provided threshold [154]. We also analyze

the local density of the pore model, as highly dense regions reduce the probability

that a bin’s mean value implicates a specific k-mer.

Transition Identifiability Some transitions between k-mers result in an expected

current change smaller than the standard deviation of either nucleotide. As a result,

some of these changes are not captured successfully by the event detection algorithm,

or in an attempt to correct this behavior there may be multiple bins that should only

be a single bin. In nanopore base callers these gaps or duplications are known as

“skips“ or “stays”. To predict this, for each k-mer, we calculate the difference in mean

current value between the k-mer under inspection and all immediately previous and

subsequent k-mers. We also calculate the number of steps required to return to the

same k-mer, and the difference in current that should have been observed across those

steps.

3.3.3 Read Simulation

Read simulation occurs in 2 phases: model fitting, then simulating reads from an

input genome. Fitting the model requires existing reads to be aligned to a reference

genome, alignment is performed using BWA-MEM [156] with standard options. From

each read, the top alignment is selected, and unaligned reads are discarded. From the

60

remaining reads, parameters are extracted with respect to average error length for the

error types {Insertion, Deletion, Mismatch}, transition probabilities, and the accuracy

of all k-mers of length 6. The inverse of the k-mer accuracy is also used as a “cost”

of correctly predicting a k-mer. These parameters are then stored for downstream

simulation.

Reads are simulated according to a HMM which generates transitions between

correct and erroneous stretches, and the lengths of those stretches. This approach has

the benefit of easy explainability, but has limitations in that it is unable to correctly

model k-mer accuracy distribution. Moreover, this shortfall is difficult to overcome, as

modeling k-mer accuracies as states would require an intractable number of parameters.

To remedy this situation we allow errors and error lengths to be generated according

to the HMM, but we adjust the lengths by using a cost function described above in a

process detailed in Algorithm 1.

As a second approach we introduce read simulation as an optimization problem

where we attempt to minimize the differences between X ∈ Rn approximated by

X ∈ Rn, where each x is a parameter, including accuracies of each of the 6-mers,

transition probabilities between each error subtype, and average correct read length.

This is formally specified in Equation (3.2).

minimize{‖X− X‖2} (3.2)

Minimization is achieved by monte carlo simulation, iteratively proposing errors to

the simulated sequence until the error difference reaches a threshold, or an empirically

determined number of mutations has been rejected. The overall algorithm is detailed

in Algorithm 2.

61

Data: read extracted from sample genome

Result: mutated read

while readPosition < readlength do

scaleFactor = mean k-mer cost around readPosition;

isError = random ∗ scaleFactor < .5;

if isError then

errorType = transition probability from HMM;

end

budget = length from model based on errorType;

while cost < budget do

cost += cost of k-mer at readPosition + i;

i += 1;

end

if isError thensave the mutation to a mutation list

end

readPosition += i;

endAlgorithm 1: Generation of mutation list for sampled read using the k-mer biased

HMM model.

62

Data: read extracted from sample genome

Result: mutated read

while ‖X − X‖2 > minDifference AND numReject < maxNumReject do

errorPosition = uniformRandom;

errorType = random based on parameters;

errorLength = length from model based on errorType;

tempX = X with error applied;

if ‖X − tempX‖2 < ‖X − X‖2 then

X = tempX;

else

numReject += 1;

end

endAlgorithm 2: Random introduction of errors to achieve minimum difference.

3.4 Results

3.4.1 Modeling K-mer Bias

Before attempting to model k-mer bias in real data is it important to understand the

granularity at which it occurs, and the consistency. To this end we examined 2 public

datasets 1 2 to determine whether bias was present, and to what extent. We show

that while there is a consistent k-mer bias within experiments, there are significant

differences between the R7 and R9 pores in Figure 12. We attribute the majority

1http://lab.loman.net/2016/07/30/nanopore-r9-data-release/

2http://lab.loman.net/2014/10/01/where-can-i-get-oxford-nanopore-miniontm-data-from/

63

http://lab.loman.net/2016/07/30/nanopore-r9-data-release/

http://lab.loman.net/2014/10/01/where-can-i-get-oxford-nanopore-miniontm-data-from/

(A) (B)

Figure 13. A plot of the non-negative least squares (nnls) fit of the predictedfeatures compared to the actual distribution from the R9 E. coli 2D experiment,sorted by the true experiment. (A) All model feautures and the true accuracy ofneighboring k-mers as a feature. (B) Only the features that are fully predictable fromthe model, as described in bias souces table.

Bias Source Example Low Example High ContributionTransition Identifiability AAAAAA ATATAT 64.6%Sequence Identifiability CTGTCA TATATT 3.32%Standard Deviation CTAGAG TTGAAA 1.79%

Fraction A CGTCGT AAAAAA 9.73%Fraction T CGACGA TTTTTT 2.90%Fraction C AGTAGT CCCCCC 9.33%Fraction G CATCAT GGGGGG 7.03%

Table 2. Identified error sources and their contribution to k-mer overall accuracy withrespect to the bias source

of the differences to the pores themselves, but some influence also likely comes from

different versions of ONT basecaller used on each dataset.

64

3.4.2 Identifying Bias Sources

After identifying the presence of k-mer bias we attempted to elucidate the source of

the bias. To this end we generated a set of features with each providing a score for

each k-mer, and we attempted to find a linear combination of each feature capable of

explaining the bias, and providing a fractional contribution of each step. As shown

in Table 2 multiple error sources influence the accuracy fraction of each k-mer. We

find that the strongest feature for predicting the accuracy is the median accuracy of

neighbors at one step away. While this is not directly helpful, it does suggest that

direct modeling of error sources must incorporate neighbor accuracy aspects. Overall

we find that our predictive model provides a good first attempt, providing some insight,

but leaving much of the error signal uncaptured, as illustrated in Figure 13. We expect

that much of the remaining error signal is a result of missing features that may be

identified through a more thorough analysis. We also expect that some of the error

signal cannot be captured through linear combinations. For example, the original

basecallers were incapable of capturing homopolymers with a length greater than 5,

using linear combinations the k-mer “AAAAAA” would need to have 0 accuracy across

all features, an unrealistic expectation.

3.4.3 Simulation Results

To validate our simulation results we generated read profiles for Nanosim [159],

PBSim [160], LONGIslnd [161], and the two simulators we have proposed. Nanopore

sequencing experiments typically yield between 10,000 and 20,000 reads [159], with

measured statistics being approximately identical at even 20% of this size as shown in

Figure 12 (C). To measure this effect in simulations, we generated 2 data sets with

65

(A) (B)

(C) (D)

Figure 14. Comparison of 3 previously published sequencing simulators with the newmodel we proposed. Fit of R9 pore model k-mer accuracies sorted internally (A), andby 2D E. coli, the training model (B). Fit of R7 pore model k-mer accuracies sortedinternally (C), and by 2D E. coli, the training model (D).

each simulator: one with 4,000 reads, and one with 20,000 reads. These reads were

then aligned using BWA-MEM [156] with standard parameters. The sum of squares

error is the sum of the difference between the model 6-mer accuracy and the simulated

k-mer accuracy, for all 6-mers. Results of both large and small simulations are shown

in Table 3, while the 20,000 read simulations are shown in Figure 14. Ultimately we

66

Source Sample Size SSE

Split R9 6,481 0.121

LongISLND R9 20,000 167.04

LongISLND R9 4,000 167.80

Nanosim R9 20,000 44.26

Nanosim R9 4,000 46.76

SNRG R9 20,000 9.52

SNRG R9 4,000 9.66



Nanosim R7 20,000 30.76

Nanosim R7 4,000 36.16

SNRG R7 20,000 5.24

SNRG R7 4,000 5.48

Table 3. Sum of squares error (SSE) between k-mer simulators and their respectivetraining sets, and between the fragmented R9 data set and the complete R9 data set

find that sample size has very little influence on existing simulators, as the k-mer

bias is either completely un-modeled, or has systematic difficulties capturing the

k-mer error rate (LongISLND). In our simulators we find some improvement through

increased sample size, though most of the remaining difference does appear to be from

systematic errors.

3.4.4 Simulator Complexity

In addition to validity, the performance of a simulator is also of great importance. If

reads can be simulated properly but with an excessive requirement on the processing

power or memory the simulation is of little value. To this end we benchmarked

67

Genome (MB) Sample Size Cores max RSS (MB) Runtime (h:m:s)

E. Coli (4.64) 2000 4 659.51 0:48:20

E. Coli (4.64) 2000 1 532.87 2:18:41

E. Coli (4.64) 10000 4 1,105.60 3:33:01

E. Coli (4.64) 10000 1 636.41 12:36:30

E. Coli (4.64) 20000 4 1,666.17 6:58:18

E. Coli (4.64) 20000 1 810.46 22:53:12

Saccharomyces (12.1) 2000 4 670.12 0:51:37

Saccharomyces (12.1) 2000 1 558.66 2:44:32

Saccharomyces (12.1) 10000 4 2,040.84 3:57:08

Saccharomyces (12.1) 10000 1 560.92 17:11:49

Saccharomyces (12.1) 20000 4 1023.23 12:29:48

Saccharomyces (12.1) 20000 1 556.47 31:38:11

Table 4. memory and wall-clock time consumption of SNaReSim under variousconditions

the simulator on an 8-core Ubuntu system with 32GB of RAM under 3 different

genome sizes and experiment sizes. The 2 genomes are chosen to be E. Coli, and

Saccharomyces Cerevisae, as each are model organisms, with at an order of magnitude

of size difference one magnitude of difference in size. The sizes chosen are 2,000, 10,000,

and 20,000, to represent simulation of small, medium, and very large experimental

runs. data was collected by running the simulator through ’time -v’, where each result

was tracked for speed and memory consumption. In order to improve results, each

test was conducted 5 times with only the median reported. The results are shown in

table 4.

The results illustrate a few strengths, and a few weaknesses, with many weaknesses

arriving from inefficient Python optimizations. Standard Python implementations for

68

example (CPython) contain a Global Interpreter Lock (GIL), meaning that threading

where global variables are shared between threads is not capable of utilizing more than

one CPU core completely. To this end many projects are implemented using processes,

where each global variable is replicated and communication must occur using pipes,

or a block of data that will be returned by the process. SNaReSim implements data

passing in the second way, largely because of implementation simplicities, but this

results in a significant memory waste observed as the difference between the 4-core

and 1-core versions. When running in a single-thread mode simulated reads are

immediately dumped into an open file descriptor, meaning that the resident set size is

able to be quite small. The Multi-core implementation on the other hand must pass

the data blocks back to the head process before writing to the file. This difference in

memory consumption could be largely resolved by having child processes write to a

queue, and having the parent process constantly write the queue contents into a file.

3.5 Conclusion

We have demonstrated the presence of biased k-mer accuracy within Oxford Nanopore

Technologies sequencing platform. We show that this bias is consistent within ex-

periments, and between sequence aligners, but varies between pore models. This

information is of significant value to sequence aligners, which typically use in index of

seed sequences to begin alignment, in order to reduce the search space and increase

speed. Due to the bias in k-mer accuracy we show that some seed sequences are

virtually impossible to correctly identify in sequenced samples, while other seeds have

an accuracy significantly greater than the mean.

We demonstrate that this k-mer bias has some observable and predictable founda-

tion in the mean current readings of each k-mer(i.e. the manufacturers pore model),

69

although significant progress remains to be made. This provides a way to estimate the

read accuracy for a provided pore model without performing large sequencing runs; it

also suggests that a model-guided approach for nanopore design could improve overall

accuracy. Alternatively it could allow for the development of specific pores, where

accurate discrimination of some k-mer subtypes is more important than others.

Finally we propose a novel nanopore read simulator capable of modeling the k-mer

accuracy bias observed in this experiment. We demonstrate that our simulator has

a significantly reduced sum of squares error with respect to 6-mer accuracy when

compared with other third generation sequencing simulators. This provides a more

realistic benchmark for sequence aligners and genome assemblers to compare against.

The availability of high accuracy reads allows for the exploration of new applications,

including; sequencing of larger organisms, organism disambiguation when sequencing

a population, exact sequence detection in diploid and polyploid organisms, and the

ability to scaffold genomes across exceptionally long repeat regions. Providing an

understanding of accuracy in nanopore design, and development of tools to aid the

alignment of the produced reads is thus critical to continued progress.

70

Chapter 4

READ CORRECTION

Nanopore sequencing has introduced the ability to sequence long stretches of

DNA, enabling the resolution of repeating segments, or paired SNPs across long

stretches of DNA. Unfortunately significant error rates >15%, introduced through

systematic and random noise inhibit downstream analysis. We propose a novel method,

using unsupervised learning, to correct biologically amplified reads before downstream

analysis proceeds. We also demonstrate that our method has performance comparable

to existing techniques without limiting the detection of repeats, or the length of the

input sequence.

4.1 Introduction

DNA sequencing has become a critical part of most biological research, in tasks ranging

from gene network identification, to biological engineering, to organism identification

and the generation of phylogenies. Most of these applications have 2 primary foci:

the length of DNA sequence reads, and the per-base error in those reads. Third

generation sequencing technologies presented by Oxford Nanopore Technologies(ONT)

and PacBio offer read lengths 10-100x what was possible with previous sequencing

technologies, but at a per-base read accuracy near 85%, down from 99.99% using

second generation technologies. While the increased read length enables many new

biological applications [170], [171] the low read accuracy hinders others. Extensive

research has gone into developing techniques that can robustly improve Nanopore

read accuracy and analyzing the trade-offs [153], [172], [173].

71

We propose an approach to correcting errors that does not make assumptions

about the presence of a reference genome and is robust to mixed biological populations

where reads with significant overlap must be identified as different. To accomplish this

task multiple copies of the same original DNA sequence must be read, resulting in a

large number reads, of which a small number are high-error copies of one another. We

then use the K-means algorithm to cluster reads based on an engineered feature space.

This technique does not require replicated reads to be biologically attached [174],

meaning that the individual sequence lengths are not constrained. Additionally our

approach groups reads with their own replicates, removing the need for a reference

genome. Our contributions are summarized below:

• We demonstrate that while some base calling error is random there is a significant

component that is systematic and can be modeled. in section 4.4.1 we present a

simple model for predicting easily identifiable k-mers and show that our model

correlates well with data found empirically.

• We propose a feature space based on the easily identifiable k-mers that accu-

rately groups copies of unique DNA strands. In section 4.5.1, we demonstrate

comparable clustering performance to MinHash [175] with a number of simulated

reads similar to biological experiments.

• We demonstrate that a simple K-means approach can replace a more complex

biological process to group unique DNA strands together.

• The proposed method allows for the analysis of larger DNA strands compared

to INC-Seq [174].

72

GGCACGTTAC

GGCACGTTAC

GGCACGTTAC

(b) nanopore sequencing(a) replicated reads (c) base call

GGTCAGTTAC

GATCAGTTAC

GGGTCACGTTAC

. . .

. . .

. . .

Figure 15. (A) Strands of DNA bases are added to the nanopore sequencer as an

analog data input (B) As strands pass through the nanopores electrical current

readings are taken. These readings are consequently grouped into “events”, with each

event approximately representing a k-mer (sequence of bases). These events may be

improperly split, events may have a mean value higher or lower than expected for a

DNA k-mer, and event lengths may be different. (C) A base caller uses the sequence

of events to predict the original sequence of bases.

4.2 Related Works

Given the importance of sequence read accuracy, a variety of methods have been

proposed for increasing read accuracy. These approaches centered around either

detecting overlapping regions in genomic alignments to provide read corrections [153],

[157], [176], or correcting reads before alignment [172], [174], [177], [178].

Read overlapping [157], and correction from genomic alignments [153], [176] has

shown significant potential for read correction. These techniques have shown the

ability to increase sequence accuracy into the high 90%’s, with relatively minimal

requirements [153]. One major shortcoming is that they are limited by error rate

and genomic similarity, with large genomes or high error rates resulting in excessive

correction times or erroneous results [176]. Some of these shortcomings can be softened

73

by the presence of a reasonable reference genome, as the error rate for read-genome

alignments is approximately half of that seen in read-read alignments.

Read pre-correction has gained significant steam recently, and particular advantages

have been seen with small genomes which contain more than 10x even within a single

sequencing run. Early attempts at pre-correction focused on aligning high accuracy

second generation reads to long nanopore reads [172], [178]. This provides the possibil-

ity to resolve many features but introduces systematic error within repeating regions

of the reads. More recent attempts have focused on nanopore reads exclusively [174],

trading a significantly reduced read length for increased read accuracy. To a large

extent both of these correction efforts are unacceptable. Introducing a systematic bias

into sequencing results prevents the detection of mutations in sequencing experiments,

or leads to incorrect references during de novo genome constructions [172], [177].

Decreasing the relative read length is similarly detrimental in that much of the benefit

of 3rd generation sequencing is consequently lost.

After correction reads typically must be mapped to a reference to be used. To aid

in this mapping high-information sequences have been explored. Although because

previous technologies yielded a very high level of accuracy ’high-information’ is taken to

mean rare k-mers, ones that appear less frequently than random chance would suggest.

Focusing on these highly informative subsequences has allowed for significantly faster

sequence matching [175], sequence pseudoalignment [179], or full alignment [158],

[180]. Unlike second generation sequencing, nanopore has significantly higher error,

and it is introduced as a biased error in readings. Because of this, the definition of

high-information subsequences can be stretched to incorporate the probability of being

able to correctly identify a k-mer.

74

4.3 Problem Description

DNA has a label space of Σ ∈ {A,C, T,G} representing the different nucleotide bases.

Let s ∈ Σn be a DNA strand of length n. Using Nanopore technology, a series of

discrete measurements δ is generated that represents the change in electrical current as

each base passes through the Nanopore from time t0 to tT , where tT reflects the total

time required for s to completely transit the Nanopore. This creates a corresponding

vector ∆ ∈ Rd, where d >> n (d ≈ n ∗ 12 ) is the number of discrete measurements

obtained from time t0 to tT .

The vector ∆ is subsequently binned into q(≈ n) different bins, B = b1, b2...bq,

using different time intervals that capture ≈ dqmeasurements per bin. The mean µi,

and standard deviation σi are subsequently derived for each bin bi. An approximation

strand, s, of the original strand s is approximated from or “base called“ using µi in

conjunction with a lookup table from the Nanopore manufacturer that specifies the

most probable k-mer base pair for the given µi value. Using the R9 pore from Oxford

Nanopore µi is best explained by a sequence of 6 DNA bases, thus the k-mer length

k = 6 is used. Thus, a function that uses manufacturer provided information will

yield a 6-mer f(µi, σi) = [A|T |C|G]6. Unfortunately, the approximation s obtained

from this function has > 15% [174] median error rate.

Let each approximated DNA strand be treated as a single data point {s1, s2, ...sv},

where v is the total number of strands being analyzed. Let there also be GGG =

G1, G2, ...Gp different groups associated with p unique strands. If copies are made of

each strand such that v >> p, we would like to group each strand with its respective

copies. The problem is formally defined as:

arg minGGG

p∑k=1

∑φ(s)∈Gk

||φ(s)− ck||2, (4.1)

75

where φ(·) represents the transformation of a DNA strand to a descriptive feature

space. ck is the centroid of group Gk defined as

ck =1

|Gk|∑

φ(s)∈Gk

φ(s), (4.2)

which is the mean of group k in the transformed feature space. Therefore, selecting an

appropriate feature space transformation is vital to clustering the copies together. The

objective becomes finding a feature space transformation that minimizes the distance

of the centroid and its members, such that the centroid is accurately representative of

each unique strand.

4.4 Proposed Method

4.4.1 High-Accuracy K-mers

One key step in clustering DNA sequences is identifying reliable features to generate a

“thumb print” for each DNA strand. This thumb print allows copies of each respective

DNA strand to be grouped together with high accuracy so that subsequent sequencing

corrections can be performed. In order to find the most reliable thumb print, we

analyze the transition behavior of the DNA bases passing through a Nanopore. Each

6-mer DNA sequence will have six different transitions reflecting the transition of

the sequence, one base at a time, through the pore. These transitions are reflected

by the change in the previously defined µ from time ti to ti+1. Some sequences and

transitions have been previously identified as being difficult to base call correctly [181].

The reference measurements provided by the nanopore manufacturer provide the mean

and standard deviation associated with what one should expect to observe given a

particular 6-mer DNA sequence. A state transition table is subsequently derived for

76

(76.6, 109.1,92.4)

(79.5, 97.1)

(83.5, 83.5)

(83.5, 81.1)

(80.7, 98.9)

AAAAAAA

AAAAAAG

TTGTATG

AAACCTT

k-mers meancurrents

visualizedcurrents

TTGTCACG

(A) (B)

Figure 16. (A) DNA sequences can be interpreted as a sequence of k-mers, each

k-mer can be replaced by its expected mean current to provide a mapping from DNA

sequences into a high-dimensional current space. (B) Clusters of 2-dimensional points

(sequence length 7), clusters are found using DBSCAN with neighborhood size 1 and

epsilon 2

each K-mer DNA base. Each transition has an associated transition value τ that

reflects the change in value from one K-mer to the next.

We used DBSCAN [182], a density-based clustering approach, to identify the most

unique τ patterns. The DBSCAN algorithm clusters these transitions based on density

and their distance to one another. Clusters that are generated can be interpreted as

sequences of k-mers which are easy to confuse with other members of the cluster, but

easy to differentiate from other sequences. As a result, those clusters that tend to

be smaller and more isolated reflect transition sequences that are more unique and

less likely to be confused with other sequences; the more unique a sequence is, the

more likely it is to be base called correctly. Because of these requirements we set the

DBSCAN required density to 1 (i.e. a K-mer should be treated as a cluster even if

77

it has no similar k-mers), and we varied the DBSCAN epsilon(neighbor euclidean

distance requirement) between .5 and 5; empirically we found that an epsilon of 2

(approximately the median standard deviations provided by the nanopore the model)

with k-mers of length 10 (4 transitions) gave us a sufficient number of high-quality

clusters. To determine the features used we selected the 500 smallest clusters, and

selected the longest common subsequence within the cluster. These k-mers were

then treated as high-accuracy k-mers. A virtual DNA thumb print that is capable of

clustering each unique strand with its respective copies was then generated using the

high-accuracy k-mers.

The results-guided search allowed us to verify and refine our model predictions

through the analysis of real-world data. To this end we gathered E. Coli K12 data

previously made public by Nick Loman’s lab {http://lab.loman.net/2016/07/30/

nanopore-r9-data-release/}. These reads were then aligned to the genome using

Graphmap [155], while discarding unmapped reads. From the resulting alignment

files the segments of exactly matching sequence were extracted, and all k-mers (for

k=3-10) within them were extracted and counted. These counts were then divided

by the true counts, defined by the regions on the reference genome matched by the

individual reads. This ratio provides recall for all k-mers, precision is calculated by

selecting all k-mers in the same range present in the reads that were not eliminated.

We find a high concordance in the k-mers identified by the model-guided and

results-guided approaches, with a significant portion of the disagreement arising from

subsequences with low change between dimensions as demonstrated in the first example

of 16(A), where low current change between k-mers results in a higher probability of

event caller error that is not accounted for in the model.

78



4.4.2 Utilization of K-means

After DNA sequences have been reduced to their thumb print, the K-means algorithm

was employed over the feature space to cluster DNA copies. K-means is a robust

clustering algorithm that minimizes the objective function previously defined in

Equation 4.1 while allowing us to define the number of clusters. The number of

unique strands define how many clusters the K-means algorithm should find. With

simulated results the exact number of unique strands is known, and future biological

experiments will provide the same information (through a colony count). We find

though that we are able to recover the error-corrected original sequences even in cases

where the number of clusters is modestly overpredicted or underpredicted, which is of

great value when the counts may not be exact due to possible sample contamination.

K-means was employed in our system using the L2 norm within the reduced

subspace created by our DNA thumb print. One limitation of K-means is that it is

only guaranteed to find a local minimum; to remedy this we perform 10 runs and

select the run with the lowest sum of squared error (SSE). Aside from being a generic

choice in K-means clustering results we find that SSE has a strong negative correlation

with cluster purity on our simulated data. Finally, our implementation was employed

using individual points as initial clusters; We find that this significantly removes the

probability that a cluster will be empty. Additionally any cluster with less than 5

reads is treated as empty and ignored, as there is no chance of reads within being

corrected to an acceptable level.

4.4.3 Synthetic Data Generation

Current read simulators [159], [160] are incapable of properly modeling k-mer accuracy

distributions identified here. nonetheless due to the low cost, high availability, and

79

Organism Simulator number of MHAP K-Means

reads purity purity

E. Coli NanoSim 2,000 98.2% 96.6%

E. Coli NanoSim 20,000 77.2% 75.9%

Table 5. Comparison of MHAP and K-means clusters over the engineered featurespace, the number of K-means clusters was set to the number of componentsidentified by MHAP

relatively high fidelity it is still valuable to validate computational techniques on

simulated data. To this end we generated synthetic reads using Nanosim [159]. Each

unique read was replicated according to a Poisson random value with an expected

value of 50, this was empirically chosen to ensure (>99%) that at least 30 replicates

were generated as we determined was required for read correction. Each replicate

originates with the same sequence, but errors are calculated and applied independently.

In total we generated two data sets; one with 2,000 (40 unique) synthetic reads 20,000

synthetic reads (414 unique) and performed downstream experiments on this set.

4.5 Results

4.5.1 Clustering on Synthetic Data

There are not currently any standard tools for identifying replicated reads without a

reference genome. Most existing tools were generated for 2nd generation sequencing

technologies where reads are already high accuracy, and duplicated reads provide

no additional information, and as such they are removed after alignment [183]. In

this regard the closest tool we could find is the popular read overlapper MinHash

Alignment Process (MHAP) [175] used for de novo genome construction. As a metric

of comparison we use cluster purity as shown in figure 5. To generate the following

80

data we ran MHAP using default parameters, from the output we removed any links

with less than 70% overlap, and counted connected components as clusters. In small

data sets our performance is good, but less than MHAP due to MHAP performing an

alignment post-processing step; in this way the full reads can be used allowing for the

identification of many false positives. In larger data sets there are larger numbers of

more significant overlaps that come from distinct reads.

Despite the utilization of cluster purity as a metric, the ultimate goal is to

generate accurate reads to be used for downstream processing. To this end we utilized

PBDAGCON [177] to create consensus sequences from aligning reads within each

cluster. We find that in the smaller E. Coli data set we are able to accurately recover

each of the 40 unique reads with a mean read accuracy of 97%, up from an individual

mean read accuracy of 81%. In the larger data set we are able to recover 408 of 414

individual sequences, again with a mean accuracy of 97%. On inspection most of

the failing reads are caused by a division of read copies between multiple clusters.

Interestingly, while some sequences can be base called with only 15 copies, depending

on the error distribution the consensus caller can fail with as many as 30 copies,

indicating that biological sequencing experiments should aim to have more than 30

copies for read correction.

4.6 Conclusion

We have proposed a novel unsupervised learning technique capable of significantly

increasing accuracy of nanopore base calls. We demonstrate that our technique

provides significant advantages over current techniques by providing similar levels of

accuracy while removing read length limitations imposed by previous techniques. We

81

tested our method on data simulated from a state of the art simulator, and hypothesize

that on actual results the performance should be even better.

The availability of high accuracy reads allows for the exploration of new applications,

including; sequencing of larger organisms, organism disambiguation when sequencing

a population, exact sequence detection in diploid and polyploid organisms, and the

ability to scaffold genomes across exceptionally long repeat regions. Providing a path

to these high accuracy-reads without compromising read length is therefore crucial to

continued progress.

82

Chapter 5

CONCLUSIONS

This chapter concludes the dissertation by summarizing the contributions of the

work and highlighting future directions.

5.1 Methodological Contributions

1. Modeling Gene Network Motifs In Chapter 1 this thesis presents a thorough

analysis of a differential equation-based fully-connected triadic network construct.

This construct is analyzed in both stable and dynamic environments to identify

stable steady states, and to validate their differential stability, and transition

dynamics. Implementing these discoveries requires a way to map real genes

within a network into this triadic construct though, so this thesis delves deeper

into network inference techniques.

2. Techniques for Data Aggregation In Chapter 2 this thesis reviews tech-

niques for network inference, and provides techniques of data aggregation and

normalization to improve inference quality. This is performed by collecting

publicly available data and normalizing the process for interpreting raw reads.

We find that while this approach shows some reduction to the noise incurred by

combining data sets there is still asignificant amount of noise remaining, and for

significant progress to be made high quality data is required.

3. Modeling and Simulation of Nanopore Sequencing Data in Chapter

4 this thesis reviews error patters in third generation nanopore sequencing.

We demonstrate that existing read simulators fail to adequately represent the

83

accuracy bias that is present in nanopore data. To explain the sources of error a

set of features are generated from the model and fit to the error profile, providing

an explanation for the sources of error. Finally an HMM-based simulator capable

of generating the accuracy-biased nanopore reads is demonstrated and validated

against current read simulators.

4. Error Correction in Nanopore Sequencing In the final chapters this thesis

introduces computational and biological techniques for improving the quality of

nanopore reads using basecalled reads, using a directed acyclic graph, and by

correcting the raw signal provided by reads, which can later be basecalled with

higher fidelity.

5.2 Future Work

1. Further Exploration of Nanopore Error Sources. while this thesis

presents some analysis on sources of errors in nanopore sequencing there is

a significant fraction of error that remains unexplained. A more thorough

understanding of this error, and the ability to predict k-mer accuracy errors

completely in silico would be of significant value in the future of pore design.

2. Integration of Error Bias Findings Into Existing Software Tools. Our

findings indicate that many existing software tools that assume parity among all

k-mers are incorrect in their assumptions. Collaborations with other researchers

to integrate our findings both into sequence aligners and basecallers would be of

significant value to the scientific community as a whole.

84

NOTES

85

REFERENCES

[1] LC Junqueira, J Carneiro, and RO Kelley. “Basic Histology. Appleton andLange”. In: Stamford, Connecticut. USA pp (1995), pp. 283–300.

[2] B. D. Macarthur, A. Ma’ayan, and I. R. Lemischka. “Systems biology of stemcell fate and cellular reprogramming”. In: Nat Rev Mol Cell Biol 10 (Oct. 2009).10, pp. 672–81. doi: 10.1038/nrm2766.

[3] C. H. Waddington. The strategy of the genes; a discussion of some aspects oftheoretical biology. London, Allen & Unwin, 1957. ix, 262 p.

[4] S.A. Kauffman. “Metabolic stability and epigenesis in randomly constructedgenetic nets”. In: Journal of Theoretical Biology 22.3 (Mar. 1969), pp. 437–467.doi: 10.1016/0022-5193(69)90015-0. (Visited on 09/26/2013).

[5] S. Huang, G. Eichler, Y. Bar-Yam, et al. “Cell fates as high-dimensionalattractor states of a complex gene regulatory network”. In: Phys Rev Lett 94(Apr. 1, 2005). 12, p. 128701.

[6] Manu, S. Surkova, A. V. Spirov, et al. “Canalization of gene expression anddomain shifts in the Drosophila blastoderm by dynamical attractors”. In: PLoSComput Biol 5 (Mar. 2009). 3, e1000303. doi: 10.1371/journal.pcbi.1000303.

[7] M. Acar, A. Becskei, and A. van Oudenaarden. “Enhancement of cellularmemory by reducing stochastic transitions”. In: Nature 435 (May 12, 2005).7039, pp. 228–32.

[8] T. S. Gardner, C. R. Cantor, and J. J. Collins. “Construction of a genetic toggleswitch in Escherichia coli”. In: Nature 403 (Jan. 20, 2000). 6767, pp. 339–42.

[9] Attila Becskei, Bertrand Séraphin, and Luis Serrano. “Positive feedback ineukaryotic gene networks: cell differentiation by graded to binary responseconversion”. In: The EMBO Journal 20.10 (May 15, 2001), pp. 2528–2535. doi:10.1093/emboj/20.10.2528. (Visited on 09/25/2013).

[10] F. J. Isaacs, J. Hasty, C. R. Cantor, et al. “Prediction and measurement of anautoregulatory genetic module”. In: Proc Natl Acad Sci U S A 100 (June 24,2003). 13, pp. 7714–9. doi: 10.1073/pnas.1332628100.

[11] Tom Ellis, Xiao Wang, and James J. Collins. “Diversity-based, model-guidedconstruction of synthetic gene networks with predicted functions”. In: Nature

86

http://dx.doi.org/10.1038/nrm2766

http://dx.doi.org/10.1016/0022-5193(69)90015-0

http://dx.doi.org/10.1371/journal.pcbi.1000303

http://dx.doi.org/10.1093/emboj/20.10.2528

http://dx.doi.org/10.1073/pnas.1332628100

Biotechnology 27.5 (May 2009), pp. 465–471. doi: 10.1038/nbt.1536. (Visitedon 07/03/2013).

[12] Min Wu, Ri-Qi Su, Xiaohui Li, et al. “Engineering of regulated stochastic cellfate determination”. In: Proceedings of the National Academy of Sciences 110.26(June 25, 2013), pp. 10610–10615. doi: 10.1073/pnas.1305423110. (Visited on07/03/2013).

[13] Daniel Huang, William J. Holtz, and Michel M. Maharbiz. “A genetic bistableswitch utilizing nonlinear protein degradation”. In: Journal of Biological En-gineering 6.1 (July 9, 2012), p. 9. doi: 10.1186/1754-1611-6-9. (Visited on07/27/2013).

[14] W. Xiong and J. E. Ferrell. “A positive-feedback-based bistable ’memorymodule’ that governs a cell fate decision”. In: Nature 426 (Nov. 27, 2003). 6965,pp. 460–5.

[15] Tetsuya Shiraishi, Shinako Matsuyama, and Hiroaki Kitano. “Large-ScaleAnalysis of Network Bistability for Human Cancers”. In: PLoS Comput Biol6.7 (July 8, 2010), e1000851. doi: 10.1371/journal.pcbi.1000851. (Visited on07/27/2013).

[16] S. Huang, Y. P. Guo, G. May, et al. “Bifurcation dynamics in lineage-commitment in bipotent progenitor cells”. In: Dev Biol 305 (May 15, 2007). 2,pp. 695–713.

[17] I. H. Park, R. Zhao, J. A. West, et al. “Reprogramming of human somatic cellsto pluripotency with defined factors”. In: Nature 451 (Jan. 10, 2008). 7175,pp. 141–6. doi: 10.1038/nature06534.

[18] Kazutoshi Takahashi and Shinya Yamanaka. “Induction of Pluripotent StemCells from Mouse Embryonic and Adult Fibroblast Cultures by Defined Factors”.In: Cell 126.4 (Aug. 25, 2006), pp. 663–676. doi: 10.1016/j.cell.2006.07.024.(Visited on 10/08/2013).

[19] K. Takahashi, K. Tanabe, M. Ohnuki, et al. “Induction of pluripotent stemcells from adult human fibroblasts by defined factors”. In: Cell 131 (Nov. 30,2007). 5, pp. 861–72. doi: 10.1016/j.cell.2007.11.019.

[20] J. Yu, M. A. Vodyanik, K. Smuga-Otto, et al. “Induced pluripotent stem celllines derived from human somatic cells”. In: Science 318 (Dec. 21, 2007). 5858,pp. 1917–20. doi: 10.1126/science.1151526.

87

http://dx.doi.org/10.1038/nbt.1536


http://dx.doi.org/10.1186/1754-1611-6-9


http://dx.doi.org/10.1038/nature06534

http://dx.doi.org/10.1016/j.cell.2006.07.024


http://dx.doi.org/10.1126/science.1151526

[21] L. A. Boyer, T. I. Lee, M. F. Cole, et al. “Core transcriptional regulatorycircuitry in human embryonic stem cells”. In: Cell 122 (Sept. 23, 2005). 6,pp. 947–56. doi: 10.1016/j.cell.2005.08.020.

[22] Jonghwan Kim, Jianlin Chu, Xiaohua Shen, et al. “An Extended TranscriptionalNetwork for Pluripotency of Embryonic Stem Cells”. In: Cell 132.6 (Mar. 21,2008), pp. 1049–1061. doi: 10.1016/j.cell.2008.02.039. (Visited on 07/03/2013).

[23] T. Kalmar, C. Lim, P. Hayward, et al. “Regulated fluctuations in nanogexpression mediate cell fate decisions in embryonic stem cells”. In: PLoS Biol 7(July 2009). 7, e1000149.

[24] Ian Chambers, Jose Silva, Douglas Colby, et al. “Nanog safeguards pluripotencyand mediates germline development”. In: Nature 450.7173 (Dec. 20, 2007),pp. 1230–1234. doi: 10.1038/nature06403.

[25] T. Graf and M. Stadtfeld. “Heterogeneity of embryonic and adult stem cells”. In:Cell Stem Cell 3 (Nov. 6, 2008). 5, pp. 480–3. doi: 10.1016/j.stem.2008.10.007.

[26] Maurice A Canham, Alexei A Sharov, Minoru S H Ko, et al. “Functional het-erogeneity of embryonic stem cells revealed through translational amplificationof an early endodermal transcript”. In: PLoS biology 8.5 (May 2010), e1000379.doi: 10.1371/journal.pbio.1000379.

[27] E. M. Chan, S. Ratanasirintrawoot, I. H. Park, et al. “Live cell imagingdistinguishes bona fide human iPS cells from partially reprogrammed cells”. In:Nat Biotechnol 27 (Nov. 2009). 11, pp. 1033–7. doi: 10.1038/nbt.1580.

[28] Katsuhiko Hayashi, Susana M Chuva de Sousa Lopes, Fuchou Tang, et al.“Dynamic equilibrium and heterogeneity of mouse pluripotent stem cells withdistinct functional and epigenetic states”. In: Cell stem cell 3.4 (Oct. 9, 2008),pp. 391–401. doi: 10.1016/j.stem.2008.07.027.

[29] Ben D MacArthur, Ana Sevilla, Michel Lenz, et al. “Nanog-dependent feedbackloops regulate murine embryonic stem cell heterogeneity”. In: Nature cell biology14.11 (Nov. 2012), pp. 1139–1147. doi: 10.1038/ncb2603.

[30] Todd S Macfarlan, Wesley D Gifford, Shawn Driscoll, et al. “Embryonic stemcell potency fluctuates with endogenous retrovirus activity”. In: Nature 487.7405(July 5, 2012), pp. 57–63. doi: 10.1038/nature11244.

[31] J. Silva and A. Smith. “Capturing pluripotency”. In: Cell 132 (Feb. 22, 2008).4, pp. 532–6. doi: 10.1016/j.cell.2008.02.006.

88




http://dx.doi.org/10.1016/j.stem.2008.10.007

http://dx.doi.org/10.1371/journal.pbio.1000379



http://dx.doi.org/10.1038/ncb2603



[32] Yayoi Toyooka, Daisuke Shimosato, Kazuhiro Murakami, et al. “Identificationand characterization of subpopulations in undifferentiated ES cell culture”.In: Development (Cambridge, England) 135.5 (Mar. 2008), pp. 909–918. doi:10.1242/dev.017400.

[33] Jamie Trott, Katsuhiko Hayashi, Azim Surani, et al. “Dissecting ensemblenetworks in ES cell populations reveals micro-heterogeneity underlying pluripo-tency”. In: Molecular bioSystems 8.3 (Mar. 2012), pp. 744–752. doi: 10.1039/c1mb05398a.

[34] J. Hanna, K. Saha, B. Pando, et al. “Direct cell reprogramming is a stochasticprocess amenable to acceleration”. In: Nature 462 (Dec. 3, 2009). 7273, pp. 595–601. doi: 10.1038/nature08592.

[35] S. Yamanaka. “Elite and stochastic models for induced pluripotent stem cellgeneration”. In: Nature 460 (July 2, 2009). 7251, pp. 49–52. doi: 10.1038/nature08180.

[36] A. Brock, H. Chang, and S. Huang. “Non-genetic heterogeneity–a mutation-independent driving force for the somatic evolution of tumours”. In: Nat RevGenet 10 (May 2009). 5, pp. 336–42. doi: 10.1038/nrg2556.

[37] Sui Huang. “Tumor progression: chance and necessity in Darwinian and Lamar-ckian somatic (mutationless) evolution”. In: Progress in biophysics and molecularbiology 110.1 (Sept. 2012), pp. 69–86. doi: 10.1016/j.pbiomolbio.2012.05.001.

[38] Wenzhe Ma, Ala Trusina, Hana El-Samad, et al. “Defining Network Topologiesthat Can Achieve Biochemical Adaptation”. In: Cell 138.4 (Aug. 21, 2009),pp. 760–773. doi: 10.1016/j.cell.2009.06.013. (Visited on 07/03/2013).

[39] R. Milo, S. Shen-Orr, S. Itzkovitz, et al. “Network motifs: simple buildingblocks of complex networks”. In: Science 298 (Oct. 25, 2002). 5594, pp. 824–7.doi: 10.1126/science.298.5594.824.

[40] Jian Shu, Chen Wu, Yetao Wu, et al. “Induction of pluripotency in mousesomatic cells with lineage specifiers”. In: Cell 153.5 (May 23, 2013), pp. 963–975.doi: 10.1016/j.cell.2013.05.001.

[41] I. Chambers and A. Smith. “Self-renewal of teratocarcinoma and embryonicstem cells”. In: Oncogene 23 (Sept. 20, 2004). 43, pp. 7150–60. doi: 10.1038/sj.onc.1207930.

89

http://dx.doi.org/10.1242/dev.017400

http://dx.doi.org/10.1039/c1mb05398a

http://dx.doi.org/10.1039/c1mb05398a




http://dx.doi.org/10.1038/nrg2556

http://dx.doi.org/10.1016/j.pbiomolbio.2012.05.001


http://dx.doi.org/10.1126/science.298.5594.824


http://dx.doi.org/10.1038/sj.onc.1207930

http://dx.doi.org/10.1038/sj.onc.1207930

[42] T. Graf and T. Enver. “Forcing cells to change lineages”. In: Nature 462 (Dec. 3,2009). 7273, pp. 587–94. doi: 10.1038/nature08533.

[43] H. Niwa. “How is pluripotency determined and maintained?” In: Development134 (Feb. 2007). 4, pp. 635–46. doi: 10.1242/dev.02787.

[44] Piyush B. Gupta, Christine M. Fillmore, Guozhi Jiang, et al. “Stochastic StateTransitions Give Rise to Phenotypic Equilibrium in Populations of Cancer Cells”.In: Cell 146.4 (Aug. 19, 2011), pp. 633–644. doi: 10.1016/j.cell.2011.07.026.(Visited on 07/27/2013).

[45] Avigdor Eldar and Michael B. Elowitz. “Functional roles for noise in geneticcircuits”. In: Nature 467.7312 (Sept. 9, 2010), pp. 167–173. doi: 10.1038/nature09326. (Visited on 07/27/2013).

[46] Peter Laslo, Chauncey J. Spooner, Aryeh Warmflash, et al. “MultilineageTranscriptional Priming and Determination of Alternate Hematopoietic CellFates”. In: Cell 126.4 (Aug. 25, 2006), pp. 755–766. doi: 10.1016/j.cell.2006.06.052. (Visited on 09/04/2013).

[47] M. Kaufman, C. Soule, and R. Thomas. “A new necessary condition on inter-action graphs for multistationarity”. In: J Theor Biol 248 (Oct. 21, 2007). 4,pp. 675–85. doi: 10.1016/j.jtbi.2007.06.016.

[48] J. Hasty, J. Pradines, M. Dolnik, et al. “Noise-based switches and amplifiersfor gene expression”. In: Proc Natl Acad Sci U S A 97 (Feb. 29, 2000). 5,pp. 2075–80.

[49] Santhosh Palani and Casim A Sarkar. “Transient noise amplification and geneexpression synchronization in a bistable mammalian cell-fate switch”. In: Cellreports 1.3 (Mar. 29, 2012), pp. 215–224. doi: 10.1016/j.celrep.2012.01.007.

[50] Harukazu Suzuki, Alistair R. R. Forrest, Erik van Nimwegen, et al. “Thetranscriptional network that controls growth arrest and differentiation in ahuman myeloid leukemia cell line”. In: Nature Genetics 41.5 (May 2009),pp. 553–562. doi: 10.1038/ng.375. (Visited on 08/08/2013).

[51] Hao Yuan Kueh, Ameya Champhekhar, Stephen L. Nutt, et al. “PositiveFeedback Between PU.1 and the Cell Cycle Controls Myeloid Differentiation”.In: Science 341.6146 (Aug. 9, 2013), pp. 670–673. doi: 10.1126/science.1240831.(Visited on 08/08/2013).

90








http://dx.doi.org/10.1016/j.jtbi.2007.06.016

http://dx.doi.org/10.1016/j.celrep.2012.01.007

http://dx.doi.org/10.1038/ng.375


[52] Richard A. Young. “Control of the Embryonic Stem Cell State”. In: Cell 144.6(Mar. 18, 2011), pp. 940–954. doi: 10.1016/j.cell.2011.01.032. (Visited on06/04/2014).

[53] Bard Ermentrout. Simulating, analyzing, and animating dynamical systems : aguide to XPPAUT for researchers and students. Software, environments, tools.Philadelphia: Society for Industrial and Applied Mathematics, 2002. xiv, 290 p.

[54] Steven H. Strogatz. Nonlinear Dynamics and Chaos: With Applications toPhysics, Biology, Chemistry, and Engineering. Reading, Mass.: Addison-WesleyPub., 1994. xi, 498.

[55] K. C. Chen, L. Calzone, A. Csikasz-Nagy, et al. “Integrative analysis of cellcycle control in budding yeast”. In: Mol Biol Cell 15 (Aug. 2004). 8, pp. 3841–62.

[56] W. J. Blake, K. AErn M, C. R. Cantor, et al. “Noise in eukaryotic geneexpression”. In: Nature 422 (Apr. 10, 2003). 6932, pp. 633–7.

[57] W. J. Blake, G. Balazsi, M. A. Kohanski, et al. “Phenotypic consequences ofpromoter-mediated transcriptional noise”. In: Mol Cell 24 (Dec. 28, 2006). 6,pp. 853–65.

[58] M. B. Elowitz, A. J. Levine, E. D. Siggia, et al. “Stochastic gene expression ina single cell”. In: Science 297 (Aug. 16, 2002). 5584, pp. 1183–6.

[59] Nicholas J Guido, Xiao Wang, David Adalsteinsson, et al. “A bottom-upapproach to gene regulation”. In: Nature 439.7078 (Feb. 16, 2006), pp. 856–860.doi: 10.1038/nature04473.

[60] J. M. Raser and E. K. O’Shea. “Control of stochasticity in eukaryotic geneexpression”. In: Science 304 (June 18, 2004). 5678, pp. 1811–4.

[61] G. M. Suel, J. Garcia-Ojalvo, L. M. Liberman, et al. “An excitable generegulatory circuit induces transient cellular differentiation”. In: Nature 440(Mar. 23, 2006). 7083, pp. 545–50.

[62] M. Thattai and A. van Oudenaarden. “Intrinsic noise in gene regulatory net-works”. In: Proc Natl Acad Sci U S A 98 (July 17, 2001). 15, pp. 8614–9.

[63] M. L. Bell, J. B. Earl, and S. G. Britt. “Two types of Drosophila R7 photorecep-tor cells are arranged randomly: a model for stochastic cell-fate determination”.In: J Comp Neurol 502 (May 1, 2007). 1, pp. 75–85.

91



[64] V. Chickarmane, T. Enver, and C. Peterson. “Computational modeling ofthe hematopoietic erythroid-myeloid switch reveals insights into cooperativity,priming, and irreversibility”. In: PLoS Comput Biol 5 (Jan. 2009). 1, e1000268.

[65] C. F. Chuang, M. K. Vanhoven, R. D. Fetter, et al. “An innexin-dependentcell network establishes left-right neuronal asymmetry in C. elegans”. In: Cell129 (May 18, 2007). 4, pp. 787–99. doi: 10.1016/j.cell.2007.02.052.

[66] A. Raj, S. A. Rifkin, E. Andersen, et al. “Variability in gene expression underliesincomplete penetrance”. In: Nature 463 (Feb. 18, 2010). 7283, pp. 913–8. doi:10.1038/nature08781.

[67] G. Yao, T. J. Lee, S. Mori, et al. “A bistable Rb-E2F switch underlies therestriction point”. In: Nat Cell Biol 10 (Apr. 2008). 4, pp. 476–82.

[68] D. Adalsteinsson, D. McMillen, and T. C. Elston. “Biochemical NetworkStochastic Simulator (BioNetS): software for stochastic modeling of biochemicalnetworks”. In: BMC Bioinformatics 5 (Mar. 8, 2004), p. 24.

[69] Koichi Akashi, Xi He, Jie Chen, et al. “Transcriptional accessibility for genes ofmultiple tissues and hematopoietic lineages is hierarchically controlled duringearly hematopoiesis”. In: Blood 101.2 (Jan. 15, 2003), pp. 383–389. doi: 10.1182/blood-2002-06-1780.

[70] M. A. Cross and T. Enver. “The lineage commitment of haemopoietic progenitorcells”. In: Curr Opin Genet Dev 7 (Oct. 1997). 5, pp. 609–13.

[71] M. Hu, D. Krause, M. Greaves, et al. “Multilineage gene expression precedescommitment in the hemopoietic system”. In: Genes Dev 11 (Mar. 15, 1997). 6,pp. 774–85.

[72] Carla F Bender Kim, Erica L Jackson, Amber E Woolfenden, et al. “Identifica-tion of bronchioalveolar stem cells in normal lung and lung cancer”. In: Cell121.6 (June 17, 2005), pp. 823–835. doi: 10.1016/j.cell.2005.03.032.

[73] T. Miyamoto, H. Iwasaki, B. Reizis, et al. “Myeloid or lymphoid promiscuityas a critical step in hematopoietic lineage commitment”. In: Dev Cell 3 (July2002). 1, pp. 137–47.

[74] I. S. Gradshteyn, I. M. Ryzhik, and Alan Jeffrey. Table of integrals, series,and products. 6th. San Diego: Academic Press, 2000. xlvii, 1163 p. url: http://www.loc.gov/catdir/description/els031/00104373.html%20http://www.loc.gov/catdir/toc/els031/00104373.html.

92



http://dx.doi.org/10.1182/blood-2002-06-1780

http://dx.doi.org/10.1182/blood-2002-06-1780


http://www.loc.gov/catdir/description/els031/00104373.html%20http://www.loc.gov/catdir/toc/els031/00104373.html



[75] Joseph Xu Zhou, M. D. S. Aliyu, Erik Aurell, et al. “Quasi-potential landscapein complex multi-stable systems”. In: Journal of The Royal Society Interface(Aug. 29, 2012), rsif20120434. doi: 10 . 1098 / rsif . 2012 . 0434. (Visited on03/19/2014).

[76] Ben D. MacArthur and Ihor R. Lemischka. “Statistical Mechanics of Pluripo-tency”. In: Cell 154.3 (Aug. 1, 2013), pp. 484–489. doi: 10.1016/j.cell.2013.07.024. (Visited on 08/02/2013).

[77] T. Vierbuchen, A. Ostermeier, Z. P. Pang, et al. “Direct conversion of fibroblaststo functional neurons by defined factors”. In: Nature 463 (Feb. 25, 2010). 7284,pp. 1035–41. doi: 10.1038/nature08797.

[78] T. Nakagawa, M. Sharma, Y. Nabeshima, et al. “Functional hierarchy andreversibility within the murine spermatogenic stem cell compartment”. In:Science 328 (Apr. 2, 2010). 5974, pp. 62–7. doi: 10.1126/science.1182868.

[79] Y. H. Loh, Q. Wu, J. L. Chew, et al. “The Oct4 and Nanog transcriptionnetwork regulates pluripotency in mouse embryonic stem cells”. In: Nat Genet38 (Apr. 2006). 4, pp. 431–40. doi: 10.1038/ng1760.

[80] Miguel Fidalgo, Francesco Faiola, Carlos-Filipe Pereira, et al. “Zfp281 mediatesNanog autorepression through recruitment of the NuRD complex and inhibitssomatic cell reprogramming”. In: Proceedings of the National Academy ofSciences of the United States of America 109.40 (Oct. 2, 2012), pp. 16202–16207. doi: 10.1073/pnas.1208533109.

[81] Pablo Navarro, Nicola Festuccia, Douglas Colby, et al. “OCT4/SOX2-independent Nanog autorepression modulates heterogeneous Nanog geneexpression in mouse ES cells”. In: The EMBO journal 31.24 (Dec. 12, 2012),pp. 4547–4562. doi: 10.1038/emboj.2012.321.

[82] H Niwa, J Miyazaki, and A G Smith. “Quantitative expression of Oct-3/4defines differentiation, dedifferentiation or self-renewal of ES cells”. In: Naturegenetics 24.4 (Apr. 2000), pp. 372–376. doi: 10.1038/74199.

[83] Aliaksandra Radzisheuskaya, Gloryn Le Bin Chia, Rodrigo L dos Santos, et al.“A defined Oct4 level governs cell state transitions of pluripotency entry anddifferentiation into all embryonic lineages”. In: Nature cell biology 15.6 (June2013), pp. 579–590. doi: 10.1038/ncb2742.

[84] Violetta Karwacki-Neisius, Jonathan Göke, Rodrigo Osorno, et al. “ReducedOct4 Expression Directs a Robust Pluripotent State with Distinct Signaling

93

http://dx.doi.org/10.1098/rsif.2012.0434





http://dx.doi.org/10.1038/ng1760


http://dx.doi.org/10.1038/emboj.2012.321

http://dx.doi.org/10.1038/74199

http://dx.doi.org/10.1038/ncb2742

Activity and Increased Enhancer Occupancy by Oct4 and Nanog”. In: CellStem Cell 12.5 (Feb. 5, 2013), pp. 531–545. doi: 10.1016/j.stem.2013.04.023.(Visited on 06/04/2014).

[85] J. Wang, D. N. Levasseur, and S. H. Orkin. “Requirement of Nanog dimerizationfor stem cell self-renewal and pluripotency”. In: Proc Natl Acad Sci U S A 105(Apr. 29, 2008). 17, pp. 6326–31. doi: 10.1073/pnas.0802288105.

[86] Joon-Lin Chew, Yuin-Han Loh, Wensheng Zhang, et al. “Reciprocal transcrip-tional regulation of Pou5f1 and Sox2 via the Oct4/Sox2 complex in embryonicstem cells”. In: Molecular and cellular biology 25.14 (July 2005), pp. 6031–6046.doi: 10.1128/MCB.25.14.6031-6046.2005.

[87] Vijay Chickarmane, Carl Troein, Ulrike A Nuber, et al. “Transcriptional Dynam-ics of the Embryonic Stem Cell Switch”. In: PLoS Comput Biol 2.9 (Sept. 15,2006), e123. doi: 10.1371/journal.pcbi.0020123. (Visited on 06/04/2014).

[88] Irene Aksoy, Ralf Jauch, Jiaxuan Chen, et al. “Oct4 switches partnering fromSox2 to Sox17 to reinterpret the enhancer code and specify endoderm”. In: TheEMBO journal 32.7 (Apr. 3, 2013), pp. 938–953. doi: 10.1038/emboj.2013.31.

[89] Dina A. Faddah, Haoyi Wang, Albert Wu Cheng, et al. “Single-Cell AnalysisReveals that Expression of Nanog Is Biallelic and Equally Variable as that ofOther Pluripotency Factors in Mouse ESCs”. In: Cell Stem Cell 13.1 (July 3,2013), pp. 23–29. doi: 10.1016/j.stem.2013.04.019. (Visited on 08/08/2013).

[90] Adam Filipczyk, Konstantinos Gkatzis, Jun Fu, et al. “Biallelic Expression ofNanog Protein in Mouse Embryonic Stem Cells”. In: Cell Stem Cell 13.1 (July 3,2013), pp. 12–13. doi: 10.1016/j.stem.2013.04.025. (Visited on 08/08/2013).

[91] Yusuke Miyanari and Maria-Elena Torres-Padilla. “Control of ground-statepluripotency by allelic regulation of Nanog”. In: Nature 483.7390 (Feb. 12,2012), pp. 470–473. doi: 10.1038/nature10807. (Visited on 03/22/2012).

[92] Austin Smith. “Nanog Heterogeneity: Tilting at Windmills?” In: Cell StemCell 13.1 (July 3, 2013), pp. 6–7. doi: 10.1016/j.stem.2013.06.016. (Visited on08/08/2013).

[93] O. Adewumi, B. Aflatoonian, L. Ahrlund-Richter, et al. “Characterization ofhuman embryonic stem cell lines by the International Stem Cell Initiative”. In:Nat Biotechnol 25 (July 2007). 7, pp. 803–16. doi: 10.1038/nbt1318.

94



http://dx.doi.org/10.1128/MCB.25.14.6031-6046.2005


http://dx.doi.org/10.1038/emboj.2013.31





http://dx.doi.org/10.1038/nbt1318

[94] Kevin F Murphy, Rhys M Adams, Xiao Wang, et al. “Tuning and controllinggene expression noise in synthetic gene networks”. In: Nucleic acids research38.8 (May 2010), pp. 2712–2726. doi: 10.1093/nar/gkq091.

[95] Hong-xi Zhao, Yang Li, Hai-feng Jin, et al. “Rapid and efficient repro-gramming of human amnion-derived cells into pluripotency by three factorsOCT4/SOX2/NANOG”. In: Differentiation; research in biological diversity 80.2(Oct. 2010), pp. 123–129. doi: 10.1016/j.diff.2010.03.002.

[96] R. Jaenisch and R. Young. “Stem cells, the molecular circuitry of pluripotencyand nuclear reprogramming”. In: Cell 132 (Feb. 22, 2008). 4, pp. 567–82. doi:10.1016/j.cell.2008.01.015.

[97] Vidyadhar G. Kulkarni. Modeling and analysis of stochastic systems. New York:Chapman & Hall, 1995. url: http://www.loc.gov/catdir/enhancements/fy0744/95015182-d.html.

[98] Filipe Grácio, Joaquim Cabral, and Bruce Tidor. “Modeling Stem Cell InductionProcesses”. In: PLoS ONE 8.5 (May 8, 2013), e60240. doi: 10.1371/journal.pone.0060240. (Visited on 08/08/2013).

[99] John E Pimanda, Katrin Ottersbach, Kathy Knezevic, et al. “Gata2, Fli1, andScl form a recursively wired gene-regulatory circuit during early hematopoieticdevelopment”. In: Proceedings of the National Academy of Sciences of theUnited States of America 104.45 (Nov. 6, 2007), pp. 17692–17697. doi: 10.1073/pnas.0707045104.

[100] Michael Kyba and George Q Daley. “Hematopoiesis from embryonic stem cells:lessons from and for ontogeny”. In: Experimental hematology 31.11 (Nov. 2003),pp. 994–1006.

[101] Sunhong Kim, Yongsung Kim, Jiwoon Lee, et al. “Regulation of FOXO1 byTAK1-Nemo-like kinase pathway”. In: The Journal of biological chemistry285.11 (Mar. 12, 2010), pp. 8122–8129. doi: 10.1074/jbc.M110.101824.

[102] B. Pourcet, I. Pineda-Torra, B. Derudas, et al. “SUMOylation of humanperoxisome proliferator-activated receptor alpha inhibits its trans-activitythrough the recruitment of the nuclear corepressor NCoR”. In: J Biol Chem285 (Feb. 26, 2010). 9, pp. 5983–92. doi: 10.1074/jbc.M109.078311.

[103] F. Wei, H. R. Scholer, and M. L. Atchison. “Sumoylation of Oct4 enhances itsstability, DNA binding, and transactivation”. In: J Biol Chem 282 (July 20,2007). 29, pp. 21551–60. doi: 10.1074/jbc.M611041200.

95

http://dx.doi.org/10.1093/nar/gkq091

http://dx.doi.org/10.1016/j.diff.2010.03.002


http://www.loc.gov/catdir/enhancements/fy0744/95015182-d.html

http://www.loc.gov/catdir/enhancements/fy0744/95015182-d.html

http://dx.doi.org/10.1371/journal.pone.0060240




http://dx.doi.org/10.1074/jbc.M110.101824

http://dx.doi.org/10.1074/jbc.M109.078311

http://dx.doi.org/10.1074/jbc.M611041200

[104] Alex K. Shalek, Rahul Satija, Xian Adiconis, et al. “Single-cell transcriptomicsreveals bimodality in expression and splicing in immune cells”. In: Nature498.7453 (June 13, 2013), pp. 236–240. doi: 10.1038/nature12172. url: http://www.nature.com/nature/journal/v498/n7453/full/nature12172.html (visitedon 08/28/2014).

[105] Ting Lu, Michael Ferry, Ron Weiss, et al. “A molecular noise generator”. In:Physical biology 5.3 (2008), p. 036006. doi: 10.1088/1478-3975/5/3/036006.

[106] Mathieu Gabut, Payman Samavarchi-Tehrani, Xinchen Wang, et al. “An Al-ternative Splicing Switch Regulates Embryonic Stem Cell Pluripotency andReprogramming”. In: Cell 147 (Sept. 2011), pp. 132–146. doi: 10.1016/j.cell.2011.08.023. (Visited on 10/18/2011).

[107] Yu Lu, Yuin-Han Loh, Hu Li, et al. “Alternative Splicing of MBD2 SupportsSelf-Renewal in Human Pluripotent Stem Cells”. In: Cell stem cell (May 7,2014). doi: 10.1016/j.stem.2014.04.002.

[108] T. L. Deans, C. R. Cantor, and J. J. Collins. “A tunable genetic switch basedon RNAi and repressor proteins for regulating gene expression in mammaliancells”. In: Cell 130 (July 27, 2007). 2, pp. 363–72.

[109] A. S. Khalil and J. J. Collins. “Synthetic biology: applications come of age”. In:Nat Rev Genet 11 (May 2010). 5, pp. 367–79. doi: 10.1038/nrg2775.

[110] T. K. Lu, A. S. Khalil, and J. J. Collins. “Next-generation synthetic genenetworks”. In: Nat Biotechnol 27 (Dec. 2009). 12, pp. 1139–50. doi: 10.1038/nbt.1591.

[111] J. Stricker, S. Cookson, M. R. Bennett, et al. “A fast, robust and tunablesynthetic gene oscillator”. In: Nature 456 (Nov. 27, 2008). 7221, pp. 516–9. doi:10.1038/nature07389.

[112] M. Tigges, T. T. Marquez-Lago, J. Stelling, et al. “A tunable synthetic mam-malian oscillator”. In: Nature 457 (Jan. 15, 2009). 7227, pp. 309–12.

[113] R. Lu, F. Markowetz, R. D. Unwin, et al. “Systems-level dynamic analyses offate change in murine embryonic stem cells”. In: Nature 462 (Nov. 19, 2009).7271, pp. 358–62. doi: 10.1038/nature08575.

[114] Ingmar Glauche, Maria Herberg, and Ingo Roeder. “Nanog Variability andPluripotency Regulation of Embryonic Stem Cells - Insights from a Mathe-

96


http://www.nature.com/nature/journal/v498/n7453/full/nature12172.html

http://www.nature.com/nature/journal/v498/n7453/full/nature12172.html

http://dx.doi.org/10.1088/1478-3975/5/3/036006




http://dx.doi.org/10.1038/nrg2775





matical Model Analysis”. In: PLoS ONE 5.6 (June 21, 2010), e11238. doi:10.1371/journal.pone.0011238. (Visited on 07/03/2013).

[115] A. Remenyi, K. Lins, L. J. Nissen, et al. “Crystal structure of aPOU/HMG/DNA ternary complex suggests differential assembly ofOct4 and Sox2 on two enhancers”. In: Genes Dev 17 (Aug. 15, 2003). 16,pp. 2048–59. doi: 10.1101/gad.269303.

[116] A. Dhooge, W. Govaerts, and Yu. A. Kuznetsov. “MATCONT: A MATLABPackage for Numerical Bifurcation Analysis of ODEs”. In: ACM Trans. Math.Softw. 29.2 (June 2003), pp. 141–164. doi: 10.1145/779359.779362. (Visited on11/15/2013).

[117] Daniel Gillespie. “Exact stochastic simulation of coupled chemical reactions”.In: J. Phys. Chem. 81 (1977), pp. 2340–2361.

[118] Tong Ihn Lee, Nicola J. Rinaldi, Francois Robert, et al. “Transcriptional Regu-latory Networks in Saccharomyces cerevisiae”. In: Science 298.5594 (Oct. 25,2002), pp. 799–804. doi: 10 . 1126/ science . 1075090. url: http : // science .sciencemag .org .ezproxy1. lib .asu.edu/content/298/5594/799 (visited on07/13/2017).

[119] Philippe C. Faucon, Keith Pardee, Roshan M. Kumar, et al. “Gene Networks ofFully Connected Triads with Complete Auto-Activation Enable Multistabilityand Stepwise Stochastic Transitions”. In: PLOS ONE 9.7 (July 24, 2014),e102873. doi: 10.1371/journal.pone.0102873. url: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0102873 (visited on 09/14/2016).

[120] Tanya Barrett, Stephen E. Wilhite, Pierre Ledoux, et al. “NCBI GEO: archivefor functional genomics data sets”. In: Nucleic Acids Research 41 (D1 Jan. 1,2013), pp. D991–D995. doi: 10.1093/nar/gks1193. url: https://academic.oup.com/nar/article/41/D1/D991/1067995/NCBI-GEO-archive-for-functional-genomics-data-sets (visited on 07/13/2017).

[121] Markus Maucher, Barbara Kracher, Michael Kühl, et al. “Inferring Booleannetwork structure via correlation”. In: Bioinformatics 27.11 (June 1, 2011),pp. 1529–1536. doi: 10.1093/bioinformatics/btr166. url: http://bioinformatics.oxfordjournals.org/content/27/11/1529 (visited on 03/21/2016).

[122] Daniel Marbach, James C. Costello, Robert Küffner, et al. “Wisdom of crowdsfor robust gene network inference”. In: Nature Methods 9.8 (Aug. 2012), pp. 796–804. doi: 10.1038/nmeth.2016. url: http://www.nature.com.ezproxy1.lib.asu.edu/nmeth/journal/v9/n8/full/nmeth.2016.html (visited on 07/13/2017).

97


http://dx.doi.org/10.1101/gad.269303

http://dx.doi.org/10.1145/779359.779362


http://science.sciencemag.org.ezproxy1.lib.asu.edu/content/298/5594/799

http://science.sciencemag.org.ezproxy1.lib.asu.edu/content/298/5594/799


http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0102873


http://dx.doi.org/10.1093/nar/gks1193

https://academic.oup.com/nar/article/41/D1/D991/1067995/NCBI-GEO-archive-for-functional-genomics-data-sets



http://dx.doi.org/10.1093/bioinformatics/btr166

http://bioinformatics.oxfordjournals.org/content/27/11/1529


http://dx.doi.org/10.1038/nmeth.2016

http://www.nature.com.ezproxy1.lib.asu.edu/nmeth/journal/v9/n8/full/nmeth.2016.html


[123] Jing Zhao, Lili Miao, Jian Yang, et al. “Prediction of Links and Weights inNetworks by Reliable Routes”. In: Scientific Reports 5 (July 22, 2015), p. 12261.doi: 10.1038/srep12261. url: http://www.nature.com/articles/srep12261(visited on 04/04/2016).

[124] Gordon E Moore et al. Cramming more components onto integrated circuits.McGraw-Hill, 1965.

[125] Ilya Shmulevich, Edward R. Dougherty, Seungchan Kim, et al. “ProbabilisticBoolean networks: a rule-based uncertainty model for gene regulatory networks”.In: Bioinformatics 18.2 (Feb. 1, 2002), pp. 261–274. doi: 10.1093/bioinformatics/18.2.261. (Visited on 09/24/2013).

[126] E P van Someren, L F Wessels, and M J Reinders. “Linear modeling of geneticnetworks from experimental data”. In: Proceedings / ... International Conferenceon Intelligent Systems for Molecular Biology ; ISMB. International Conferenceon Intelligent Systems for Molecular Biology 8 (2000), pp. 355–366.

[127] P D’haeseleer, X Wen, S Fuhrman, et al. “Linear modeling of mRNA expres-sion levels during CNS development and injury”. In: Pacific Symposium onBiocomputing. Pacific Symposium on Biocomputing (1999), pp. 41–52.

[128] Min Zou and Suzanne D. Conzen. “A new dynamic Bayesian network (DBN)approach for identifying gene regulatory networks from time course microarraydata”. In: Bioinformatics 21.1 (Jan. 1, 2005), pp. 71–79. doi: 10.1093/bioinformatics/bth463. (Visited on 10/10/2013).

[129] Nir Friedman, Michal Linial, Iftach Nachman, et al. “Using Bayesian Net-works to Analyze Expression Data”. In: Journal of Computational Biology7.3 (Aug. 2000), pp. 601–620. doi: 10.1089/106652700750050961. (Visited on09/25/2013).

[130] E. J. Moler, M. L. Chow, and I. S. Mian. “Analysis of molecular profile datausing generative and discriminative methods”. In: Physiological Genomics 4.2(Jan. 1, 2000), pp. 109–126. (Visited on 09/25/2013).

[131] Sui Huang, Ingemar Ernberg, and Stuart Kauffman. “Cancer attractors: Asystems view of tumors from a gene network dynamics and developmentalperspective”. In: Seminars in Cell & Developmental Biology 20.7 (Sept. 2009),pp. 869–876. doi: 10.1016/j.semcdb.2009.07.003. (Visited on 09/26/2013).

98

http://dx.doi.org/10.1038/srep12261

http://www.nature.com/articles/srep12261

http://dx.doi.org/10.1093/bioinformatics/18.2.261

http://dx.doi.org/10.1093/bioinformatics/18.2.261

http://dx.doi.org/10.1093/bioinformatics/bth463


http://dx.doi.org/10.1089/106652700750050961

http://dx.doi.org/10.1016/j.semcdb.2009.07.003

[132] Sui Huang and Donald E. Ingber. “The structural and mechanical complexityof cell-growth control”. In: Nature Cell Biology 1.5 (Sept. 1999), E131–E138.doi: 10.1038/13043. (Visited on 10/08/2013).

[133] Sui Huang. “Gene expression profiling, genetic networks, and cellular states:an integrating concept for tumorigenesis and drug discovery”. In: Journal ofMolecular Medicine 77.6 (June 1, 1999), pp. 469–480. doi: 10.1007/s001099900023. (Visited on 10/08/2013).

[134] Kevin Murphy and Saira Mian. “Modelling Gene Expression Data using Dy-namic Bayesian Networks”. In: (May 1999). url: http://www-devel.cs.ubc.ca/~murphyk/Papers/ismb99.pdf.

[135] Huilei Xu, Yen-Sin Ang, Ana Sevilla, et al. “Construction and validation ofa regulatory network for pluripotency and self-renewal of mouse embryonicstem cells”. In: PLoS computational biology 10.8 (Aug. 2014), e1003777. doi:10.1371/journal.pcbi.1003777.

[136] Javed Khan, Jun S. Wei, Markus Ringner, et al. “Classification and diagnosticprediction of cancers using gene expression profiling and artificial neural net-works”. In: Nature Medicine 7.6 (June 2001), pp. 673–679. doi: 10.1038/89044.(Visited on 09/27/2013).

[137] R. Brause, T. Langsdorf, and M. Hepp. “Neural data mining for credit card frauddetection”. In: 11th IEEE International Conference on Tools with ArtificialIntelligence, 1999. Proceedings. 11th IEEE International Conference on Toolswith Artificial Intelligence, 1999. Proceedings. 1999, pp. 103–106. doi: 10.1109/TAI.1999.809773.

[138] M. Syeda, Yan-Qing Zhang, and Y. Pan. “Parallel granular neural networks forfast credit card fraud detection”. In: Proceedings of the 2002 IEEE InternationalConference on Fuzzy Systems, 2002. FUZZ-IEEE’02. Proceedings of the 2002IEEE International Conference on Fuzzy Systems, 2002. FUZZ-IEEE’02. Vol. 1.2002, pp. 572–577. doi: 10.1109/FUZZ.2002.1005055.

[139] Richard Bellman and Robert E Kalaba. Dynamic programming and moderncontrol theory. Academic Press New York, 1965.

[140] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, et al. “Empirical Eval-uation of Gated Recurrent Neural Networks on Sequence Modeling”. In:arXiv:1412.3555 [cs] (Dec. 11, 2014). arXiv: 1412.3555. url: http://arxiv.org/abs/1412.3555 (visited on 10/11/2016).

99

http://dx.doi.org/10.1038/13043

http://dx.doi.org/10.1007/s001099900023

http://dx.doi.org/10.1007/s001099900023

http://www-devel.cs.ubc.ca/~murphyk/Papers/ismb99.pdf

http://www-devel.cs.ubc.ca/~murphyk/Papers/ismb99.pdf


http://dx.doi.org/10.1038/89044

http://dx.doi.org/10.1109/TAI.1999.809773

http://dx.doi.org/10.1109/TAI.1999.809773

http://dx.doi.org/10.1109/FUZZ.2002.1005055

http://arxiv.org/abs/1412.3555



[141] Mukesh Bansal, Vincenzo Belcastro, Alberto Ambesi-Impiombato, et al. “Howto infer gene networks from expression profiles”. In: Molecular Systems Biology3.1 (Jan. 1, 2007), p. 78. doi: 10.1038/msb4100120. url: http://msb.embopress.org/content/3/1/78 (visited on 07/13/2017).

[142] Jing Yu, V. Anne Smith, Paul P. Wang, et al. “Advances to Bayesian networkinference for generating causal networks from observational biological data”.In: Bioinformatics 20.18 (Dec. 12, 2004), pp. 3594–3603. doi: 10.1093/bioinformatics/bth448. url: https://academic.oup.com/bioinformatics/article/20/18/3594/202541/Advances-to-Bayesian-network-inference-for (visited on07/13/2017).

[143] Barbara Treutlein, Doug G. Brownfield, Angela R. Wu, et al. “Reconstructinglineage hierarchies of the distal lung epithelium using single-cell RNA-seq”. In:Nature 509.7500 (May 15, 2014), pp. 371–375. doi: 10.1038/nature13173.

[144] Eric W. Brunskill, Joo-Seop Park, Eunah Chung, et al. “Single cell dissection ofearly kidney development: multilineage priming”. In: Development (Cambridge,England) 141.15 (Aug. 2014), pp. 3093–3101. doi: 10.1242/dev.110601.

[145] Saiful Islam, Una Kjallquist, Annalena Moliner, et al. “Characterization of thesingle-cell transcriptional landscape by highly multiplex RNA-seq”. In: GenomeResearch 21.7 (July 2011), pp. 1160–1167. doi: 10.1101/gr.110882.110.

[146] Robert N. Plasschaert, Sebastien Vigneau, Italo Tempera, et al. “CTCF bindingsite sequence differences are associated with unique regulatory and functionaltrends during embryonic stem cell differentiation”. In: Nucleic Acids Research42.2 (Jan. 2014), pp. 774–789. doi: 10.1093/nar/gkt910.

[147] Jeremy Goecks, Anton Nekrutenko, James Taylor, et al. “Galaxy: a com-prehensive approach for supporting accessible, reproducible, and transparentcomputational research in the life sciences”. In: Genome Biology 11.8 (2010),R86. doi: 10.1186/gb-2010-11-8-r86.

[148] Daniel Blankenberg, Gregory Von Kuster, Nathaniel Coraor, et al. “Galaxy: aweb-based genome analysis tool for experimentalists”. In: Current Protocols inMolecular Biology Chapter 19 (Jan. 2010), pages. doi: 10.1002/0471142727.mb1910s89.

[149] Chunlei Wu, Ian Macleod, and Andrew I. Su. “BioGPS and MyGene.info:organizing online, gene-centric information”. In: Nucleic Acids Research 41(Database issue Jan. 2013), pp. D561–565. doi: 10.1093/nar/gks1114.

100

http://dx.doi.org/10.1038/msb4100120

http://msb.embopress.org/content/3/1/78

http://msb.embopress.org/content/3/1/78



https://academic.oup.com/bioinformatics/article/20/18/3594/202541/Advances-to-Bayesian-network-inference-for

https://academic.oup.com/bioinformatics/article/20/18/3594/202541/Advances-to-Bayesian-network-inference-for



http://dx.doi.org/10.1101/gr.110882.110

http://dx.doi.org/10.1093/nar/gkt910

http://dx.doi.org/10.1186/gb-2010-11-8-r86

http://dx.doi.org/10.1002/0471142727.mb1910s89

http://dx.doi.org/10.1002/0471142727.mb1910s89

http://dx.doi.org/10.1093/nar/gks1114

[150] Zhide Fang and Xiangqin Cui. “Design and validation issues in RNA-seqexperiments”. In: Briefings in Bioinformatics 12.3 (May 1, 2011), pp. 280–287.doi: 10.1093/bib/bbr004. url: http://bib.oxfordjournals.org/content/12/3/280 (visited on 10/09/2014).

[151] Florian Buettner, Kedar N. Natarajan, F. Paolo Casale, et al. “Computationalanalysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data revealshidden subpopulations of cells”. In: Nature Biotechnology 33.2 (Feb. 2015),pp. 155–160. doi: 10.1038/nbt.3102. url: http://www.nature.com.ezproxy1.lib.asu.edu/nbt/journal/v33/n2/full/nbt.3102.html (visited on 07/18/2017).

[152] Michael G. Ross, Carsten Russ, Maura Costello, et al. “Characterizing andmeasuring bias in sequence data”. In: Genome Biology 14.5 (May 29, 2013), R51.doi: 10.1186/gb-2013-14-5-r51. url: http://genomebiology.biomedcentral.com.ezproxy1.lib.asu.edu/articles/10.1186/gb-2013-14-5-r51 (visited on04/11/2017).

[153] Nicholas J. Loman, Joshua Quick, and Jared T. Simpson. “A complete bacterialgenome assembled de novo using only nanopore sequencing data”. In: NatureMethods 12.8 (Aug. 2015), pp. 733–735. doi: 10 . 1038/nmeth . 3444. url:http://www.nature.com.ezproxy1.lib.asu.edu/nmeth/journal/v12/n8/full/nmeth.3444.html (visited on 09/12/2016).

[154] Philippe Christophe Faucon, Robert Trevino, Parithi Balachandran, et al. “HighAccuracy Base Calls in Nanopore Sequencing”. In: bioRxiv (Apr. 11, 2017),p. 126680. doi: 10.1101/126680. url: http://biorxiv.org/content/early/2017/04/11/126680 (visited on 04/11/2017).

[155] Ivan Sović, Mile Šikić, Andreas Wilm, et al. “Fast and sensitive mapping ofnanopore sequencing reads with GraphMap”. In: Nature Communications 7(Apr. 15, 2016), p. 11307. doi: 10.1038/ncomms11307. url: http://www.nature.com.ezproxy1.lib.asu.edu/ncomms/2016/160415/ncomms11307/full/ncomms11307.html (visited on 01/09/2017).

[156] Heng Li. “Toward better understanding of artifacts in variant calling fromhigh-coverage samples”. In: Bioinformatics 30.20 (Oct. 15, 2014), pp. 2843–2851. doi: 10.1093/bioinformatics/btu356. url: https://academic.oup.com/bioinformatics/article/30/20/2843/2422145/Toward-better-understanding-of-artifacts-in (visited on 04/11/2017).

[157] Konstantin Berlin, Sergey Koren, Chen-Shan Chin, et al. “Assembling largegenomes with single-molecule sequencing and locality-sensitive hashing”. In:Nature Biotechnology 33.6 (June 2015), pp. 623–630. doi: 10.1038/nbt.3238.

101

http://dx.doi.org/10.1093/bib/bbr004

http://bib.oxfordjournals.org/content/12/3/280

http://bib.oxfordjournals.org/content/12/3/280


http://www.nature.com.ezproxy1.lib.asu.edu/nbt/journal/v33/n2/full/nbt.3102.html


http://dx.doi.org/10.1186/gb-2013-14-5-r51

http://genomebiology.biomedcentral.com.ezproxy1.lib.asu.edu/articles/10.1186/gb-2013-14-5-r51

http://genomebiology.biomedcentral.com.ezproxy1.lib.asu.edu/articles/10.1186/gb-2013-14-5-r51




http://dx.doi.org/10.1101/126680

http://biorxiv.org/content/early/2017/04/11/126680


http://dx.doi.org/10.1038/ncomms11307

http://www.nature.com.ezproxy1.lib.asu.edu/ncomms/2016/160415/ncomms11307/full/ncomms11307.html



http://dx.doi.org/10.1093/bioinformatics/btu356

https://academic.oup.com/bioinformatics/article/30/20/2843/2422145/Toward-better-understanding-of-artifacts-in




url: http://www.nature.com.ezproxy1.lib.asu.edu/nbt/journal/v33/n6/abs/nbt.3238.html (visited on 01/24/2017).

[158] Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, et al. “GappedBLAST and PSI-BLAST: a new generation of protein database search pro-grams”. In: Nucleic Acids Research 25.17 (Sept. 1, 1997), pp. 3389–3402. doi:10.1093/nar/25.17.3389. url: http://nar.oxfordjournals.org/content/25/17/3389 (visited on 12/15/2016).

[159] Chen Yang, Justin Chu, Ren, et al. “NanoSim: nanopore sequence read simulatorbased on statistical characterization”. In: bioRxiv (Mar. 18, 2016), p. 044545.doi: 10.1101/044545. url: http://biorxiv.org/content/early/2016/03/18/044545.1 (visited on 12/16/2016).

[160] Yukiteru Ono, Kiyoshi Asai, and Michiaki Hamada. “PBSIM: PacBio readssimulator—toward accurate genome assembly”. In: Bioinformatics 29.1 (Jan. 1,2013), pp. 119–121. doi: 10.1093/bioinformatics/bts649. url: http://bioinformatics.oxfordjournals.org/content/29/1/119 (visited on 12/16/2016).

[161] Bayo Lau, Marghoob Mohiyuddin, John C. Mu, et al. “LongISLND: in silicosequencing of lengthy and noisy datatypes”. In: Bioinformatics 32.24 (Dec. 15,2016), pp. 3829–3832. doi: 10.1093/bioinformatics/btw602. url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5167071/ (visited on 02/15/2017).

[162] Merly Escalona, Sara Rocha, and David Posada. “A comparison of tools for thesimulation of genomic next-generation sequencing data”. In: Nature ReviewsGenetics 17.8 (Aug. 2016), pp. 459–469. doi: 10 .1038/nrg .2016.57. url:http://www.nature.com.ezproxy1.lib.asu.edu/nrg/journal/v17/n8/abs/nrg.2016.57.html (visited on 04/18/2017).

[163] Matthew Frampton and Richard Houlston. “Generation of Artificial FASTQFiles to Evaluate the Performance of Next-Generation Sequencing Pipelines”. In:PLOS ONE 7.11 (Nov. 12, 2012), e49110. doi: 10.1371/journal.pone.0049110.url: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0049110 (visited on 04/18/2017).

[164] Dent A. Earl, Keith Bradnam, John St John, et al. “Assemblathon 1: Acompetitive assessment of de novo short read assembly methods”. In: GenomeResearch (Sept. 16, 2011), gr.126599.111. doi: 10.1101/gr.126599.111. url:http://genome.cshlp.org/content/early/2011/09/16/gr.126599.111 (visited on04/18/2017).

102

http://www.nature.com.ezproxy1.lib.asu.edu/nbt/journal/v33/n6/abs/nbt.3238.html

http://www.nature.com.ezproxy1.lib.asu.edu/nbt/journal/v33/n6/abs/nbt.3238.html

http://dx.doi.org/10.1093/nar/25.17.3389

http://nar.oxfordjournals.org/content/25/17/3389

http://nar.oxfordjournals.org/content/25/17/3389

http://dx.doi.org/10.1101/044545

http://biorxiv.org/content/early/2016/03/18/044545.1

http://biorxiv.org/content/early/2016/03/18/044545.1

http://dx.doi.org/10.1093/bioinformatics/bts649



http://dx.doi.org/10.1093/bioinformatics/btw602

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5167071/


http://dx.doi.org/10.1038/nrg.2016.57

http://www.nature.com.ezproxy1.lib.asu.edu/nrg/journal/v17/n8/abs/nrg.2016.57.html

http://www.nature.com.ezproxy1.lib.asu.edu/nrg/journal/v17/n8/abs/nrg.2016.57.html




http://dx.doi.org/10.1101/gr.126599.111

http://genome.cshlp.org/content/early/2011/09/16/gr.126599.111

[165] Weichun Huang, Leping Li, Jason R. Myers, et al. “ART: a next-generationsequencing read simulator”. In: Bioinformatics 28.4 (Feb. 15, 2012), pp. 593–594. doi: 10.1093/bioinformatics/btr708. url: https://academic.oup.com/bioinformatics/article/28/4/593/213322/ART-a-next-generation-sequencing-read-simulator (visited on 04/18/2017).

[166] Juliane C. Dohm, Claudio Lottaz, Tatiana Borodina, et al. “Substantial biasesin ultra-short read data sets from high-throughput DNA sequencing”. In:Nucleic Acids Research 36.16 (Sept. 2008), e105. doi: 10.1093/nar/gkn425.url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2532726/ (visited on04/18/2017).

[167] Yen-Chun Chen, Tsunglin Liu, Chun-Hui Yu, et al. “Effects of GC Bias inNext-Generation-Sequencing Data on De Novo Genome Assembly”. In: PLOSONE 8.4 (Apr. 29, 2013), e62856. doi: 10.1371/journal.pone.0062856. url:http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0062856(visited on 04/18/2017).

[168] Vladimír Boža, Broňa Brejová, and Tomáš Vinař. “DeepNano: Deep Recur-rent Neural Networks for Base Calling in MinION Nanopore Reads”. In:arXiv:1603.09195 [q-bio] (Mar. 30, 2016). arXiv: 1603 . 09195. url: http ://arxiv.org/abs/1603.09195 (visited on 10/11/2016).

[169] Matei David, Lewis Jonathan Dursi, Delia Yao, et al. “Nanocall: An OpenSource Basecaller for Oxford Nanopore Sequencing Data”. In: bioRxiv (Mar. 28,2016), p. 046086. doi: 10.1101/046086. url: http://biorxiv.org/content/early/2016/03/28/046086 (visited on 10/11/2016).

[170] Joshua Quick, Nicholas J. Loman, Sophie Duraffour, et al. “Real-time, portablegenome sequencing for Ebola surveillance”. In: Nature 530.7589 (Feb. 11, 2016),pp. 228–232. doi: 10 . 1038/nature16996. url: http : //www .nature . com.ezproxy1.lib.asu.edu/nature/journal/v530/n7589/full/nature16996.html(visited on 09/12/2016).

[171] Miten Jain, Ian T. Fiddes, Karen H. Miga, et al. “Improved data analysisfor the MinION nanopore sequencer”. In: Nature Methods 12.4 (Apr. 2015),pp. 351–356. doi: 10.1038/nmeth.3290. url: http://www.nature.com/nmeth/journal/v12/n4/full/nmeth.3290.html (visited on 09/13/2016).

[172] Sara Goodwin, James Gurtowski, Scott Ethe-Sayers, et al. “Oxford Nanoporesequencing, hybrid error correction, and de novo assembly of a eukaryoticgenome”. In: Genome Research 25.11 (Nov. 1, 2015), pp. 1750–1756. doi:

103

http://dx.doi.org/10.1093/bioinformatics/btr708

https://academic.oup.com/bioinformatics/article/28/4/593/213322/ART-a-next-generation-sequencing-read-simulator



http://dx.doi.org/10.1093/nar/gkn425







http://dx.doi.org/10.1101/046086




http://www.nature.com.ezproxy1.lib.asu.edu/nature/journal/v530/n7589/full/nature16996.html

http://www.nature.com.ezproxy1.lib.asu.edu/nature/journal/v530/n7589/full/nature16996.html


http://www.nature.com/nmeth/journal/v12/n4/full/nmeth.3290.html


10.1101/gr.191395.115. url: http://genome.cshlp.org/content/25/11/1750(visited on 09/12/2016).

[173] Tamas Szalay and Jene A. Golovchenko. “De novo sequencing and variantcalling with nanopores using PoreSeq”. In: Nature Biotechnology 33.10 (Oct.2015), pp. 1087–1091. doi: 10.1038/nbt.3360. url: http://www.nature.com/nbt/journal/v33/n10/full/nbt.3360.html (visited on 09/13/2016).

[174] Chenhao Li, Kern Rei Chng, Esther Jia Hui Boey, et al. “INC-Seq: accuratesingle molecule reads using nanopore sequencing”. In: GigaScience 5 (Aug. 2,2016), p. 34. doi: 10.1186/s13742-016-0140-7. url: http://dx.doi.org/10.1186/s13742-016-0140-7 (visited on 12/15/2016).

[175] Brian D. Ondov, Todd J. Treangen, Adam B. Mallonee, et al. “Fast genomeand metagenome distance estimation using MinHash”. In: bioRxiv (Oct. 26,2015), p. 029827. doi: 10.1101/029827. url: http://biorxiv.org/content/early/2015/10/26/029827 (visited on 12/15/2016).

[176] Sergey Koren, Brian P. Walenz, Konstantin Berlin, et al. Canu: scalable andaccurate long-read assembly via adaptive k-mer weighting and repeat separation| bioRxiv. Aug. 24, 2016. url: http://biorxiv.org/content/early/2016/08/24/071282 (visited on 12/15/2016).

[177] Chen-Shan Chin, David H. Alexander, Patrick Marks, et al. “Nonhybrid,finished microbial genome assemblies from long-read SMRT sequencing data”.In: Nature Methods 10.6 (June 2013), pp. 563–569. doi: 10.1038/nmeth.2474.url: http://www.nature.com.ezproxy1.lib.asu.edu/nmeth/journal/v10/n6/abs/nmeth.2474.html (visited on 12/15/2016).

[178] Sergey Koren, Michael C. Schatz, Brian P. Walenz, et al. “Hybrid error cor-rection and de novo assembly of single-molecule sequencing reads”. In: NatureBiotechnology 30.7 (July 2012), pp. 693–700. doi: 10.1038/nbt.2280. url:http://www.nature.com.ezproxy1.lib.asu.edu/nbt/journal/v30/n7/full/nbt.2280.html (visited on 12/15/2016).

[179] Nicolas L. Bray, Harold Pimentel, Páll Melsted, et al. “Near-optimal proba-bilistic RNA-seq quantification”. In: Nature Biotechnology 34.5 (May 2016),pp. 525–527. doi: 10.1038/nbt.3519. url: http://www.nature.com.ezproxy1.lib.asu.edu/nbt/journal/v34/n5/full/nbt.3519.html (visited on 12/15/2016).

[180] Stephen F. Altschul, Warren Gish, Webb Miller, et al. “Basic local alignmentsearch tool”. In: Journal of Molecular Biology 215.3 (Oct. 5, 1990), pp. 403–410.

104

http://dx.doi.org/10.1101/gr.191395.115

http://genome.cshlp.org/content/25/11/1750


http://www.nature.com/nbt/journal/v33/n10/full/nbt.3360.html

http://www.nature.com/nbt/journal/v33/n10/full/nbt.3360.html

http://dx.doi.org/10.1186/s13742-016-0140-7

http://dx.doi.org/10.1186/s13742-016-0140-7

http://dx.doi.org/10.1186/s13742-016-0140-7

http://dx.doi.org/10.1101/029827






http://www.nature.com.ezproxy1.lib.asu.edu/nmeth/journal/v10/n6/abs/nmeth.2474.html

http://www.nature.com.ezproxy1.lib.asu.edu/nmeth/journal/v10/n6/abs/nmeth.2474.html







doi: 10.1016/S0022-2836(05)80360-2. url: http://www.sciencedirect.com/science/article/pii/S0022283605803602 (visited on 12/15/2016).

[181] T. Laver, J. Harrison, P. A. O’Neill, et al. “Assessing the performance ofthe Oxford Nanopore Technologies MinION”. In: Biomolecular Detection andQuantification 3 (Mar. 2015), pp. 1–8. doi: 10.1016/j.bdq.2015.02.001. url:http://www.sciencedirect.com/science/article/pii/S2214753515000224 (visitedon 03/15/2016).

[182] Martin Ester, Hans-Peter Kriegel, Jörg Sander, et al. “A density-based algorithmfor discovering clusters in large spatial databases with noise.” In: Kdd. Vol. 96.1996, pp. 226–231.

[183] Philip Brennecke, Simon Anders, Jong Kyoung Kim, et al. “Accounting fortechnical noise in single-cell RNA-seq experiments”. In: Nature Methods 10.11(Nov. 2013), pp. 1093–1095. doi: 10.1038/nmeth.2645. url: http://www.nature . com/nmeth/ journal/v10/n11/ full/nmeth .2645 .html (visited on03/27/2015).

105

http://dx.doi.org/10.1016/S0022-2836(05)80360-2

http://www.sciencedirect.com/science/article/pii/S0022283605803602


http://dx.doi.org/10.1016/j.bdq.2015.02.001





gene network inference via sequence alignment and ... · figure page 4. stable steady state...

Documents