gene network inference via sequence alignment and ... · figure page 4. stable steady state...
TRANSCRIPT
![Page 1: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/1.jpg)
Gene Network Inference via
Sequence Alignment and Rectification
by
Philippe Christophe Faucon
A Dissertation Presented in Partial Fulfillmentof the Requirements for the Degree
Doctor of Philosophy
Approved August 2017 by theGraduate Supervisory Committee:
Huan Liu, ChairXiao Wang
Sharon CrookYalin Wang
Hessam Sarjoughian
ARIZONA STATE UNIVERSITY
December 2017
![Page 2: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/2.jpg)
©2017 Philippe Christophe Faucon
All Rights Reserved
![Page 3: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/3.jpg)
ABSTRACT
While techniques for reading DNA in some capacity has been possible for decades,
the ability to accurately edit genomes at scale has remained elusive. Novel techniques
have been introduced recently to aid in the writing of DNA sequences. While writing
DNA is more accessible, it still remains expensive, justifying the increased interest in
in silico predictions of cell behavior. In order to accurately predict the behavior of
cells it is necessary to extensively model the cell environment, including gene-to-gene
interactions as completely as possible.
Significant algorithmic advances have been made for identifying these interactions,
but despite these improvements current techniques fail to infer some edges, and
fail to capture some complexities in the network. Much of this limitation is due to
heavily underdetermined problems, whereby tens of thousands of variables are to be
inferred using datasets with the power to resolve only a small fraction of the variables.
Additionally, failure to correctly resolve gene isoforms using short reads contributes
significantly to noise in gene quantification measures.
This dissertation introduces novel mathematical models, machine learning tech-
niques, and biological techniques to solve the problems described above. Mathematical
models are proposed for simulation of gene network motifs, and raw read simula-
tion. Machine learning techniques are shown for DNA sequence matching, and DNA
sequence correction.
Results provide novel insights into the low level functionality of gene networks.
Also shown is the ability to use normalization techniques to aggregate data for gene
network inference leading to larger data sets while minimizing increases in inter-
experimental noise. Results also demonstrate that high error rates experienced by
third generation sequencing are significantly different than previous error profiles, and
i
![Page 4: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/4.jpg)
that these errors can be modeled, simulated, and rectified. Finally, techniques are
provided for amending this DNA error that preserve the benefits of third generation
sequencing.
ii
![Page 5: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/5.jpg)
I dedicate this work to my parents Arlene and Philippe, who inspire me to be the
best version of myself; to my sister Anni and my brother Pierre, who inspire me to
pursue what I love; and to my many excellent teachers and mentors throughout the
years, who inspire me to learn and to teach.
iii
![Page 6: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/6.jpg)
ACKNOWLEDGMENTS
I would like to thank my advisors, Huan Liu and Xiao Wang for their support
and guidance throughout my PhD. Without their mentoring, and at times corralling,
I would almost certainly still be working on a million small projects without any
significant progress. I especially appreciate their willingness to let me fully explore my
research interests, and their forthrightness when I set my mind to a task that they
believed would not be fruitful. I would also like to thank the rest of my committee;
Sharon Crook who introduced me to bioinformatics during my undergraduate years,
and continued to guide me during my graduate research, along with Yalin Wang and
Hessam Sarjoughian, who provided valuable suggestions on my research directions.
I am remarkably fortunate to have worked with so many amazing colleagues in
both the Xiao Lab and Data Mining and Machine Learning(DMML) lab at ASU.
Over my graduate studies I’ve had the pleasure of working with many exceptional
researchers: Parithi Balachandran, Ghazaleh Beigi Xingwen Chen, Ryan Dougherty,
Hao Hu, Isaac Jones, Jundong Li, David Menn, Fred Morstatter, Tahora Nazer, Suhas
Ranganath, Justin Sampson, Kai Shu, Kylie Standage-Beier, Ri-Qi Su, Robert Trevino,
Lezhi Wang, Suhang Wang, Fuqing Wu, Liang Wu, Tsung-Yen (John) Yu, and Qi
Zhang. Above all I would like to thank my sidekick Parithi Balachandran, an excellent
research colleague and friend, and without whom many research projects would not
have been possible. I am also incredibly grateful to Robert Trevino for his assistance
with for his research insights, and for introducing me to conference paper writing.
Having a life outside of research, and the support of friends is critical to enjoying
and succeeding in grad school. On that note I would especially like to thank Parithi
Balachandran, Michael Berg, Mario Giacomazzo, Hao Hu, David Menn, Fred Morstat-
ter, James Palazzolo, Jose Perez, Kim Phan, Brianna Rudow, Jeff Semmens, and
iv
![Page 7: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/7.jpg)
Robert Trevino and the ASU fencing team. Without their support and friendship I
would likely have left the university long before completing my degree.
I would also like to acknowledge my family, who supported me through the most
difficult parts of my studies and provided a constant stream of encouragement. My
grandparents Berit and Anker, my parents Arlene and Philippe, and my siblings Anni
and Pierre, have all played a monumental role in shaping me into the person I am
today. I could not be more grateful for the constant support they have given me, and
I hope that I have, do, and will continue to make them proud.
Learning to learn is one of the most difficult tasks that we as people, and especially
as researchers have to undertake. We have to find techniques for locating relevant
and trustworthy sources, sorting through their available knowledge, and distilling it
down to what we can remember and make use of. Through my studies I was very
fortunate to meet mentors that inspired me to learn, and showed me the joy in learning.
Likewise while teaching I had outstanding students that showed me joy in teaching,
and inspiring others to learn and achieve great things. In specific I would like to thank
Karen Glazier, the first (and only) teacher to fail me, for reminding me that regardless
of how much you think you know you can still fail if you don’t apply yourself.
v
![Page 8: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/8.jpg)
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
CHAPTER
1 GENE NETWORK MOTIFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 Network Enumeration and Parameter Scanning for Multi-
stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.2 Complete Auto-Activation Contributes to Multistability . . . . 7
1.1.3 Sensitivity and bifurcation of multistability . . . . . . . . . . . . . . . . 8
1.1.4 Stochastic State Transitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.1.5 SSS Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.1.6 A Regulatory Network for Pluripotency . . . . . . . . . . . . . . . . . . . 17
1.1.7 In Silico Induced Pluripotency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.3 Materials and methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.3.1 Enumerating and Eliminating Redundant Networks . . . . . . . . 31
1.3.2 ODE Modeling and High-Throughput Screening . . . . . . . . . . . 31
1.3.3 Analysis of the Parameter Space . . . . . . . . . . . . . . . . . . . . . . . . . . 33
1.3.4 Bifurcation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.3.5 Stochastic Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.3.6 Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2 GENE NETWORK MODELING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.1 Network Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
vi
![Page 9: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/9.jpg)
CHAPTER Page
2.1.1 Boolean Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.1.2 Bayesian and Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.1.3 Differential Equation Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.2 RNA Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3 Data Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.3.1 Acquiring the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.3.2 Noise Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.3.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3 NANOPORE READ SIMULATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.1 Error Source Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3.2 Error Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3.3 Read Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4.1 Modeling K-mer Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4.2 Identifying Bias Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4.4 Simulator Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4 READ CORRECTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
vii
![Page 10: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/10.jpg)
CHAPTER Page
4.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.1 High-Accuracy K-mers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4.2 Utilization of K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4.3 Synthetic Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.5.1 Clustering on Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.1 Methodological Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
NOTES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
viii
![Page 11: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/11.jpg)
LIST OF TABLES
Table Page
1. Overview of RNA-Seq Experiments Analyzed with a Listing of Tool and
Asset Versions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2. Identified Error Sources and Their Contribution to K-Mer Overall Accuracy
with Respect to the Bias Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3. Sum of Squares Error (SSE) between K-Mer Simulators and Their Respective
Training Sets, and between the Fragmented R9 Data Set and the Complete
R9 Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4. memory and Wall-Clock Time Consumption of SNaReSim under Various
Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5. Comparison of MHAP and K-Means Clusters over the Engineered Feature
Space, the Number of K-Means Clusters Was Set to the Number of Compo-
nents Identified by MHAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
ix
![Page 12: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/12.jpg)
LIST OF FIGURES
Figure Page
1. Multistability Arises from Small Gene Networks and Underlies Cell Differ-
entiation. Fully-Connected Triads (FCTs) Are Important, Recurring Tran-
scriptional Networks in the Development and Maintenance of Cellular States.
Notably, the Oct4-Sox2-Nanog Triad Has Been Implicated in Inducing and
Maintaining Stem Cell Pluripotency. In the Waddington Model for Cell
Differentiation, the Cell’s Underlying Developmental Landscape Is Governed
by the Dynamical Potential of These Small Gene Networks to Realize Multi-
stability. Transition from One State to Another Can Be Guided via Altered
Topology or Strength of Wiring. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2. High-Throughput Computational Screening for Multistable Networks. (A)
FCTs Were Enumerated and Modeled by ODEs (Parameters Denoted as P).
(B) The Multistability Probability of 104 FCTs Generated by the Computa-
tional Screen Is Plotted, with a High Multistability Probability FCTs Marked
with an X with the Color Representing the Multistability Group; the Two
Highest Probability Groups Are Labeled by Their Indices. (C) The Fifteen
FCTs with the Highest Multistable Potentials Are Represented Visually. . . . . 6
x
![Page 13: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/13.jpg)
Figure Page
3. Bifurcation and Stochastic Analysis of Network 84. (A) The Topology of
Network 84 Was Predicted to Have the Highest Probability for Multistability.
(B) The Bifurcation Diagram of Network 84 Plots Transcription Factor
Concentrations of X and Y at Each SSS against α, the Self-Activation
Strength. The Mutual Inhibition Parameters Are Also Set Equal and Referred
to as β, Here Set to 0.1 and α Is Used as the Bifurcation Parameter Ranging
between .01 and 100. SSS Are Color Coded and Listed in the Legend to
Distinguish Different SSS, Where ‘+’ Indicates the Gene Is ‘ON’ and ‘-
’ Means ‘OFF’. (C) Simulations of Noise-Induced State Transitioning in
Network 84 under Different Levels of Noise. Simulations Were Performed
with Auto-Regulation Equal to 1 and Mutual Inhibition Equal to 0.1, with
Noise Levels of 0.5, 0.85, and 1 from left to right. The Locations of the
Deterministically Calculated States Are Indicated with Red Spheres, with
Their Size Correlated to Their Spectral Radius. The Blue Ribbons Indicate
the Temporal Trajectories of the Simulations. The Black Arrows Indicate
Initial Direction of State Transitions. (D) The Number of States Traveled
in the Stochastic Simulations Plotted versus Noise Strength. The Red Line
Represents Simulations that Were Initiated from the All-ON State, and the
Blue Line Represents Simulations that Were Initiated from the Origin. . . . . . . 9
xi
![Page 14: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/14.jpg)
Figure Page
4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting
the Number of Stable Steady States Generated by Combinations of Differ-
ent Auto-Regulatory Strengths (Rows) and Mutual-Regulatory Strengths
(Columns). The Green and Orange Regions Are Visualized as FCT Diagrams
Independently in B. (B) The Green-Boxed and Orange-Boxed Diagrams
Corresponds to the Green-Boxed and Orange-Boxed Parameter Combina-
tion Regime from (A). (C) Many of the Parameter Combinations Yielding
Multi-Stable Systems Are Not Represented by the Matrix, and Are instead
Abstracted Here. As an Example, Here We Present a Parameter Combination
Regime that Can Support 6 SSS and Auto-Activation Strengths Are Similar
to Those in the Orange Box (Thick Red Arrows) but Have One Unrestrained
Pair of Mutual Inhibition that Can Exist at a Very High Strength (Blue
T-Head Repression Lines). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
xii
![Page 15: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/15.jpg)
Figure Page
5. Bifurcation and Stochastic Analysis of Network 1. (A) Network 1 Shares the
Topological Architecture of the Oct4-Sox2-Nanog Triad, and Is Believed to
Underlie Stem Cell Differentiation and Reprogramming. (B) The Bifurcation
Diagram of Network 84 Plots Transcription Factor Concentrations of X and Y
at Each SSS against α, the Self-Activation Strength. The Mutual Inhibition
Parameters Are Also Set Equal and Referred to as β, Here Set to 0.1 and α
Is Used as the Bifurcation Parameter Ranging between .01 and 100. SSS Are
Color Coded and Listed in the Legend to Distinguish Different SSS, Where
‘+’ Indicates the Gene Is ‘ON’ and ‘-’ Means ‘OFF’. (C) Simulations of Noise-
Induced State Transitioning in Network 1 under Different Levels of Noise.
Simulations Were Performed with Auto-Regulation Equal to 1 and Mutual
Inhibition Equal to 0.1, with Noise Levels of 0.5, 0.85, and 1 from left to right.
The Locations of the Deterministically Calculated States Are Indicated with
Red Spheres, with Their Size Correlated to Their Spectral Radius. The Blue
Ribbons Indicate the Temporal Trajectories of the Simulations. The Black
Arrows Indicate Initial Direction of State Transitions. (D) The Number of
States Traveled in the Stochastic Simulations Plotted versus Noise Strength.
The Red Line Represents Simulations that Were Initiated from the All-ON
State, and the Blue Line Represents Simulations that Were Initiated from
the Origin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
xiii
![Page 16: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/16.jpg)
Figure Page
6. Stable Steady State Analysis of Network 1. (A) A Matrix Presenting the Num-
ber of Stable Steady States Generated by Combinations of Different Auto-
Regulatory Strengths (Rows) and Mutual-Regulatory Strengths (Columns).
The Green and Orange Regions Are Visualized as FCT Diagrams Indepen-
dently in B. (B) The Green-Boxed and Orange-Boxed Diagrams Corresponds
to the Green-Boxed and Orange-Boxed Parameter Combination Regime from
(A). (C) Many of the Parameter Combinations Yielding Multi-Stable Systems
Are Not Represented by the Matrix, and Are instead Abstracted Here. As
an Example, Here We Present a Parameter Combination Regime that Can
Support 6 SSS and Auto-Activation Strengths Are Similar to Those in the
Orange Box (Thick Red Arrows) but Have One Unrestrained Pair of Mutual
Inhibition that Can Exist at a Very High Strength (Blue T-Head Repression
Lines). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
xiv
![Page 17: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/17.jpg)
Figure Page
7. Bifurcation and Stochastic Analysis of Network 1 with Constitutive Basal
Expression. (A) The Topology of Network 1 Is Used to Mimic IPSC Exper-
iments with Added Constitutive Expression of All Three Genes. (B) The
X and Y Protein Concentrations of Stable and Unstable Steady States Are
Plotted against Self-Activation Strengths. With the Addition of the Consti-
tutive Expression We Can See a Further Decrease in Stabilities, Specifically
Two-ON States Are Never Stable. Additionally One-ON States Lose Stability
Very Rapidly, Approximately at α=1, and the All-OFF State Destabilizes
after α=9 Leaving Only the All-ON State. This Can Also Be Seen in the
Sizes of the Spectral Radii, Deterministically These Steady States Exist but
in Practice They Have Very Little Influence. (C) Simulations of Network 1
Modeled in the Presence of an Additional Constitutive Overexpression Term
Consistently Show a Transition from an All-OFF State to an All-ON State
after a Period of Latency. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
8. Stochastic Simulations Capture Latency Distributions in Induced Pluripo-
tency. (A) Distribution of Simulated Latency for the Modified Network 1
Model Illustrates a Skewed Bell Shape. The Histogram Was Generated from
2000 Simulations. (B) Similar Results Generated from 2000 Simulations with
Network 84. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
xv
![Page 18: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/18.jpg)
Figure Page
9. Two Fully Connected(A/B), and Two Partially Connected(C/D) Triads Are
Depicted. (A/C) Illustrate Only Positive Network Regulation (Activation),
Indicated by Red Arrows, I.e. Increasing Quantities of the Transcription
Factor at the Tail of the Arrow Also Increases Quantities of the Transcription
Factor at the Head. (B/D) Utilize Both Positive and Negative Network
Regulation (Repression), Indicated by the Blue Bars, I.e. Increasing the
Quantities of the Transcription Factor at the Tail of the Error Decreases
Quantities of the Transcription Factor at the Head. . . . . . . . . . . . . . . . . . . . . . . . . 38
10.In This Figure We Present the Data Processing Pipeline from 5 Other ScRNA-
Seq Experiments Available on NCBI GEO. GSE59127 and GSE59129 (left,
Blue), GSE 52583 (left Center, Red), GSE21180(Center, Olive), GSE45284
(right Center, Orange) and Our Pipeline (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
11.Scatterplots Showing Ratios of Coefficient of Variation(CV), Where Points
Represent CV of a Single Gene within the Original Author’s Pipeline, and
Ours. The X=y Line Drawn, Representing the Point Where CV Is Equal
between Both Pipelines. Points on the X>y Side Are Drawn in Red, and
Point on the Y>x Side Are Drawn in Black. For the Experiments in (A-
C), Raw ScRNA-Seq Reads Are Provided by the NCBI GEO Service and
Analyzed Using Our Customized Pipeline. In Addition to Raw Reads Each
Author Provides Estimates of Gene Activity Levels through FPKM or TPM.
In (A-C) We Plot the Coefficient of Variation (CV) of Our Pipeline Results
against Those of the Original Authors. In (D) We Group the Experiments
from (A-C) to Visualize Inter-Experimental CV. . . . . . . . . . . . . . . . . . . . . . . . . . . 50
xvi
![Page 19: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/19.jpg)
Figure Page
12.The Accuracies of All 6-Mers with Accuracy on the Vertical Axis, and a Kmer
Index on the Horizontal Axis. (A) All Experiments Sorted with Respect to
Their Accuracy. (B) Experiments Sorted with Respect to the Escherichia
Coli R9 2D Experiment. (C) The R9 E. Coli Template Experiment Divided
Randomly into 5 Groups, Sorted with Respect to the First Group. (D) Only
the R9 E. Coli Template Experiment, Aligned with Both Graphmap and
BWA-MEM, Sorted with Respect to BWA-MEM.. . . . . . . . . . . . . . . . . . . . . . . . . . 58
13.A Plot of the Non-Negative Least Squares (Nnls) Fit of the Predicted Features
Compared to the Actual Distribution from the R9 E. Coli 2D Experiment,
Sorted by the True Experiment. (A) All Model Feautures and the True
Accuracy of Neighboring K-Mers as a Feature. (B) Only the Features that
Are Fully Predictable from the Model, as Described in Bias Souces Table. . . . 64
14.Comparison of 3 Previously Published Sequencing Simulators with the New
Model We Proposed. Fit of R9 Pore Model K-Mer Accuracies Sorted Inter-
nally (A), and by 2D E. Coli, the Training Model (B). Fit of R7 Pore Model
K-Mer Accuracies Sorted Internally (C), and by 2D E. Coli, the Training
Model (D). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
xvii
![Page 20: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/20.jpg)
Figure Page
15.(A) Strands of DNA Bases Are Added to the Nanopore Sequencer as an
Analog Data Input (B) As Strands Pass through the Nanopores Electrical
Current Readings Are Taken. These Readings Are Consequently Grouped into
‘‘Events”, with Each Event Approximately Representing a K-Mer (Sequence
of Bases). These Events May Be Improperly Split, Events May Have a Mean
Value Higher or Lower than Expected for a DNA K-Mer, and Event Lengths
May Be Different. (C) A Base Caller Uses the Sequence of Events to Predict
the Original Sequence of Bases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
16.(A) DNA Sequences Can Be Interpreted as a Sequence of K-Mers, Each K-
Mer Can Be Replaced by Its Expected Mean Current to Provide a Mapping
from DNA Sequences into a High-Dimensional Current Space. (B) Clusters
of 2-Dimensional Points (Sequence Length 7), Clusters Are Found Using
DBSCAN with Neighborhood Size 1 and Epsilon 2 . . . . . . . . . . . . . . . . . . . . . . . . . 77
xviii
![Page 21: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/21.jpg)
Chapter 1
GENE NETWORK MOTIFS
Figure 1. Multistability arises from small gene networks and underlies celldifferentiation. Fully-connected triads (FCTs) are important, recurringtranscriptional networks in the development and maintenance of cellular states.Notably, the Oct4-Sox2-Nanog triad has been implicated in inducing and maintainingstem cell pluripotency. In the Waddington model for cell differentiation, the cell’sunderlying developmental landscape is governed by the dynamical potential of thesesmall gene networks to realize multistability. Transition from one state to anothercan be guided via altered topology or strength of wiring.
Embryonic stem cells have the capacity to differentiate into more than 200 isogenic
progenitor and terminal cell types [1], each of which represents a stable cellular state.
The potential for a system to realize multiple states is termed multistability, which
has been proposed as the mechanism underlying cell differentiation for over half a
century, with stable states represented as either valleys in the developmental landscape
1
![Page 22: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/22.jpg)
[2], [3] or as dynamic attractors in high-dimensional gene expression space [4]–[6].
Bistability is the simplest form of multistability, which has been studied extensively
both in natural contexts [7] and in synthetic systems [8]–[13]. It has been shown to
be responsible for both normal cell fate determination in Xenopus oocyte maturation
[14] and oncogenesis [15]. Additionally, recent work in hematopoietic stem cells (HSC)
implies that more complicated cases of multistability help in canalizing HSC lineage
determination [16]. The discovery of induced pluripotent stem cells (iPSC) [17]–[20]
further demonstrates that cellular state transitioning within a multistable system is a
reversible process that can potentially be engineered for therapeutic purposes.
Despite evidence for the existence of multiple substates of pluripotent and
hematopoietic stem cells [21]–[25], the molecular mechanisms responsible for gen-
erating complex multistability are poorly understood. Transitions between stable
states may underlie widely observed stem cell population heterogeneity [23], [25]–[34] ,
random latency in induced pluripotency [34], [35], and even the progression of diseases
such as cancer [36], [37]. According to this view, regulatory networks capable of
adopting multiple stable states may enable cells to explore adjacent fate possibilities
through state transitions, thereby enabling differentiation in the case of stem cells, or
generating phenotypic diversity for selection to act upon in the case of tumor evolu-
tion. Recent studies have implicated the fully-connected triad (FCT) [38]–[40] as a
recurring network motif among the transcriptional regulatory circuits that control the
development and maintenance of cellular states. Most notably, the Oct4-Sox2-Nanog
triad is believed to be the key functional module in the maintenance of stem cell
pluripotency [41]–[43]. The success of induced pluripotency through the overexpression
of transcription factors alone further indicates the dominant role of transcription factor
networks in maintaining pluripotency and directing cellular reprogramming. All this
2
![Page 23: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/23.jpg)
suggests a strong relationship between biological multistability and the conserved FCT
network of transcription factors(Figure 1), and motivates a need for understanding
the relationship between FCT regulatory relationships and multistability.
Here we developed a high-throughput computational approach to search the
network dynamics of all fully-connected transcription triads for their potential to
realize multistability. Our computational searches identified complete auto-activation,
(i.e.: all three nodes activate themselves), as a topological feature of triads capable
of multiple stable steady states (SSS). Detailed analyses were performed on network
topologies selected for a high likelihood of multistability. These analyses show that the
relative stability of SSS can be tuned by adjusting the network regulatory strengths
between nodes. After incorporation of gene expression noise in stochastic simulations,
the system was able to spontaneously transit from metastable (SSS with relatively
lower stabilities) to differentiated states (SSS with at least one gene OFF and high
stabilities), behaving like a three-dimensional Markovian jump system with bias.
This is also consistent with experimentally demonstrated stochastic cancer cell state
transitions [44], which result in a phenotypic equilibrium in populations of cancer
cells.
Different levels of noise, designed to mimic degrees of “noisy” transcriptional activity
in cellular systems, were found to either promote or disrupt state transitions, with some
transitions requiring layovers at one or more intermediate states. Such connections
between gene expression noise and cell state transitions have also been demonstrated
in many biological processes [45]. Finally, in our simplified theoretical simulation of
reprogramming to generate induced pluripotent stem cells, the distribution of random
temporal latency could be described by an Inverse Gaussian distribution, a finding
consistent with experimental observations [34]. Taken together, our work illustrates
3
![Page 24: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/24.jpg)
that computational screening, analysis, and simulation of network dynamics can serve
as valuable tools for connecting conserved regulatory modules with complex biological
processes, such as development, disease and stem cell reprogramming.
1.1 Results
1.1.1 Network Enumeration and Parameter Scanning for Multistability
To define the set of FCT networks with a high likelihood for multistability, we
first enumerated all possible fully-connected three-node topologies (2A). Here, nodes
represent transcription factors and edges represent possible transcriptional regulations.
After taking into account topological equivalencies, we were left with 104 unique FCTs
as candidate topologies, which fully represent all 29=512 possible FCT topologies.
Each FCT candidate was generally modeled by a set of three-variable ordinary
differential equations (ODEs), with each variable representing the abundance of a
single transcription factor and each of the nine tunable parameters representing a
single regulation strength (Figure 2A). The activation and repression of each gene by
other factors are modeled additively to account for reported mechanistic independence
of transcription regulations [16], [40], [46] (MATERIALS AND METHODS). This
simplification can be expanded to include more mechanistic details of specific networks
without altering the general conclusion of this screening. A state of the network is
then defined as an equilibrium or SSS if all transcription factor abundances do not
change over time. An algorithm was developed to automate the search process to
identify all possible states.
Specifically, to measure the probability of multistability, we scanned N = 421,875
parameter sets for each of the 104 networks to identify all possible states. The values
4
![Page 25: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/25.jpg)
of these parameters are generic but cover a wide range to serve as a proof of principle
for diverse biological scenarios. The probability of multistability was then defined as
the number of parameter sets giving rise to at least 5 stable steady states divided
by N. This probability reflects the robustness of certain FCT’s multistability, and
hence quantifies its possible natural presence frequency. The entire FCT multistability
screen is an ensemble of many independent computations and, as a result, was run
in parallel and performed in high-throughput computer clusters (see MATERIALS
AND METHODS). Such large-scale screening is computationally intensive and is only
possible with simplified model equations. Therefore the chosen forms of our ODE
models omit many biochemical details such as epigenetics and transcriptional factor
co-binding. The models serve as a general abstraction and system specific details
could be incorporated later in light of specific experimental evidence.
5
![Page 26: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/26.jpg)
Figure 2. High-throughput computational screening for multistable networks. (A)FCTs were enumerated and modeled by ODEs (parameters denoted as P). (B) Themultistability probability of 104 FCTs generated by the computational screen isplotted, with a high multistability probability FCTs marked with an X with the colorrepresenting the multistability group; the two highest probability groups are labeledby their indices. (C) The Fifteen FCTs with the highest multistable potentials arerepresented visually.
6
![Page 27: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/27.jpg)
1.1.2 Complete Auto-Activation Contributes to Multistability
Using our network screening approach, we quantified and ranked the potential for
multistability of all 104 possible FCTs (Figure 2A). Of the 104, 15 showed significant
probability for multistability (marked in Figure 2B and topologies drawn in Figure
2C).
The multistable FCTs were empirically categorized into three groups according to
their probability of multistability (color shaded in Figure 2B,C). The first group, which
had the highest probability of realizing multistability and with the potential for eight
states, consisted of a single network (Network 84) with all positive auto-regulation
and all mutual inhibition. The second group of multistable FCTs, consisting of
Networks 8, 38, 60, and 62, was also exclusively positive auto-regulating but displayed
a mixture of positive and negative mutual regulation. Specifically, these networks were
either enriched for mutual inhibition (Networks 60 and 62) or displayed symmetrical
topologies (Network 8 and 38). The third group, which was comprised of 10 FCTs that
exhibited lower but still significant potential for multistability, once again featured
similar positive auto-regulation as a common attribute. The networks of the third
group had either asymmetric topologies (Networks 3, 10, 12, 27, 36 and 56) or
competing mutual regulation of target genes (Network 5, 25, and 30), except for
Network 1, which was entirely void of inhibitory regulation. These results suggest that
an appropriate combination of mutual inhibitory and activating regulation helps to
increase the potential for a network to realize multistability, a finding that is consistent
with previous theoretical studies [47].
From our screen, we were able to identify a set of important, minimum topological
features capable of multistability. In particular, of the possible 16 FCTs with com-
plete positive auto-regulation, our screen selected 15 as having a high likelihood for
7
![Page 28: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/28.jpg)
multistability. Network 54 was the only FCT displaying complete auto-activation with
insignificant potential for multistability, likely because of competing mutual activation
and repression of all the nodes (bottom of Figure 2C).
Indeed, positive auto-regulation has been shown to be an important feature of
bistable systems [7], [9], [10], [14], [48], [49], and recent experimental work also
suggests it could be necessary for maintaining multipotency in stem cells [50]. The
role of positive feedback in multistability has been well studied [9], [14], [51], and the
interconnected autoregulatory loop formed by Oct4, Sox2, and Nanog is thought to
underlie the bistable switch between ESC self-renewal and differentiation, and the
ability of core ESC regulators to ‘boot up’ the pluripotency transcriptional program
by forced expression during reprogramming [52]. Our computational results similarly
suggest that, within the limits of the study, complete positive auto-regulation is an
important contributor to the generation of high-dimensional multistability across
diverse FCT topologies. It is striking to note that one or two instances of auto-
activation, no matter how strong, are not sufficient to generate multistability in our
simulations. Therefore, it is possible that complete auto-activation of FCTs could be
used as an important hallmark in the identification of new multistable gene networks.
1.1.3 Sensitivity and bifurcation of multistability
8
![Page 29: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/29.jpg)
Figure 3. Bifurcation and Stochastic Analysis of Network 84. (A) The topology ofNetwork 84 was predicted to have the highest probability for multistability. (B) Thebifurcation diagram of Network 84 plots transcription factor concentrations of X andY at each SSS against α, the self-activation strength. The mutual inhibitionparameters are also set equal and referred to as β, here set to 0.1 and α is used asthe bifurcation parameter ranging between .01 and 100. SSS are color coded andlisted in the legend to distinguish different SSS, where ‘+’ indicates the gene is ‘ON’and ‘-’ means ‘OFF’. (C) Simulations of noise-induced state transitioning in Network84 under different levels of noise. Simulations were performed with auto-regulationequal to 1 and mutual inhibition equal to 0.1, with noise levels of 0.5, 0.85, and 1from left to right. The locations of the deterministically calculated states areindicated with red spheres, with their size correlated to their spectral radius. Theblue ribbons indicate the temporal trajectories of the simulations. The black arrowsindicate initial direction of state transitions. (D) The number of states traveled in thestochastic simulations plotted versus noise strength. The red line representssimulations that were initiated from the all-ON state, and the blue line representssimulations that were initiated from the origin.
9
![Page 30: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/30.jpg)
Network 84 (Figure 3A, all positive auto-regulation and all mutual inhibition)
yielded the highest potential for multistability in our screen (Figure 2C). To gain a
deeper understanding of this FCT, we performed a bifurcation analysis [53], [54] to
investigate how the change of one regulation strength affects the number, stability,
and location of all states, while other parameters are kept constant (Figure 3B).
Bifurcation analysis is a useful analytical tool that can reveal when and where one
state can bifurcate into two or more states in response to a changing parameter.
This is analogous to a progenitor cell transitioning from a self-renewing phase to a
multi-lineage differentiating phase in response to a stimulus [16].
In the bifurcation diagram of Network 84 (Figure 3B), locations of both SSS (thick
colored lines) and unstable steady states (thin grayed lines) are plotted. X and Y
indicate the abundance of gene product of X and Y. The axis of α represents the
strength of all self-activation, while other parameters are kept constant. Gene product
Z is omitted because only three quantities can be visualized in the plot. Nevertheless,
its abundance can be inferred from that of X and Y. Thus the figure provides the
position of SSS as the concentration of the three transcription factors, and their
self-activation strength, varies.
As can be seen in Figure 3B, Network 84 has complex dynamics. When auto-
activation is very weak (α less than 1), the system only has one SSS (the black line),
which corresponds to the state in which the abundances of all three gene products are
very low and could be roughly categorized as all genes OFF. This finding intuitively
makes sense considering the mutual repression of Network 84’s topology, which under
weak auto-activation does not support individual genes overcoming degradation to
support the ON state. However, as the strengths of auto-activation increase beyond
a threshold, the system suddenly demonstrates many more SSS (colored solid lines).
10
![Page 31: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/31.jpg)
All of these SSS branch out from the all-OFF state, which is the typical nature
of bifurcation. When α = 1, the system can assume its maximum of eight states
(illustrated in Figure 3B as red circles). These states include the following possible
binary combinations: all three genes ON (orange); all three genes OFF (black); two
genes ON and one gene OFF (blue, cyan, and purple); and, one gene ON and two
genes OFF (green, brown, and yellow). The locations of these eight SSS are also
plotted in Figure 3C in XYZ coordinate. The sizes of the circles correspond to each
state’s stability (details below). Network 84 is able to maintain multiple SSS for
a wide parameter range of auto-activation. The all-OFF state (Figure 3B black
line) disappears when auto-activation strengths become greater than 1.05 (Figure 3B
gray circles illustrate locations of SSS when α =5). As the auto-activations become
stronger (e.g., α greater than 7), all three one-gene-ON (green, brown, and yellow)
states disappeared, leaving only two-gene-ON and all-ON states to be SSS (blue, cyan,
purple, and orange lines in Figure 3B).
The potential for this three-node gene network to yield as many as eight stable
states underscores the idea that many cell types (e.g., over 200) could be generated
from networks comprised of a small number of nodes. Taken further, a fully-connected
network with only eight nodes could produce 256 states [47]. While only theoretical,
it is an example of the potential for such combinatorial complexity and provides
a plausible mechanism for cell lineage determination, where different gene product
abundances and ON-OFF combinations could drive the system towards a spectrum
of cell fates. Coupled with our bifurcation analysis, where we modeled the abrupt
appearance or collapse of SSS in this topology, FCTs could also lead to discrete jumps
between states in response to a changing cellular environment or signaling. The
phenomenon of hysteresis, when such jumps cannot be reversed by returning to the
11
![Page 32: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/32.jpg)
previous environmental conditions, has been established as the mechanism for many
irreversible cell fate determinations [55].
1.1.4 Stochastic State Transitioning
Next, we investigated the effects of gene expression fluctuations on the multistability
potential of Network 84. Fluctuations in gene expression under constant environmental
conditions may arise from inherent stochastic fluctuations in gene expression, or
from regulatory architecture, chromatin structure, or transcriptional bursting. Here,
we use the broad term ‘gene expression noise’ to encompass all these possibilities.
Gene expression noise has been shown to affect cellular physiology in single-cell
organisms [56]–[62], and has also been linked to random developmental patterns and
cell fate decisions in multicellular organisms [63]–[67]. The mechanisms underlying
the utilization and regulation of gene expression noise in higher organisms, however,
are not well understood. We introduced such noise into our model by modifying the
algorithm from Adalsteinsson et al. [68] to help understand how inherent stochasticity
affects cell differentiation (see MATERIALS AND METHODS).
Using our modified algorithm, we simulated the temporal evolution of all three
gene product abundances under the influence of different levels of gene expression
noise. The system was initialized at the all-ON state to mimic the promiscuous
gene expression state that is often observed when various progenitor cells start to
differentiate [69]–[73]. The locations of eight SSS when α =1 are indicated by the red
spheres in Figure 3C, where the size of the spheres represents the stability of each
state as measured by its spectral radius [74], [75]. As shown in Figure 3C, at small
noise levels (Figure 3C, left panel), expression of all three genes displayed fluctuations,
yet the trajectory remained focused around the initial all-ON state, indicating that
12
![Page 33: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/33.jpg)
small levels of gene expression noise were not sufficient to induce spontaneous state
transitioning in Network 84. This also indicates the stability of the network under
well-buffered conditions. As the level of noise was increased to an intermediate level
(Figure 3C, middle panel), the system began to transition from the initial all-ON state
to a two-ON-one-OFF state and later to a two-OFF-one-ON state. Within the time
frame simulated, the system transitioned back and forth between these two latter
states and spent most of the time in the vicinity of these two states. When the noise
level was increased further, the system was able to visit all six states with at least one
gene ON and one gene OFF (Figure 3C, right panel).
The results of the stochastic state-transition simulations have several interesting
biological implications. First, the system transitioned exclusively between states
with at least one gene ON and one gene OFF. After jumping out of the initial state,
it never returned to the all-ON state within the simulation time frame; moreover,
it never reached the all-OFF state. For this topology, the all-ON and all-OFF
states are significantly less stable than the other six states, and thus serve as weaker
attractors. Specifically, it is the fine balance in the abundance of all gene products
that determines the stability of the all-ON and all-OFF states, and this balance
is easily disrupted by random perturbations such as gene expression noise. These
results provide a simplified and yet concrete demonstration of the recently proposed
concept of “dynamic equilibrium” [76], in which individual cells can jump between
states randomly, while the proportion of cells in a population at one specific state
remains constant. Here we can see the proportion (probability) of cells in one state
is determined by the stability of each state, which in turn is determined by network
topology and system parameters. Therefore when one single fluorescence reporter
tagged to one gene is experimentally used to track single cell dynamics, the cell
13
![Page 34: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/34.jpg)
population will exhibit a broad distribution because this multi-dimensional system is
projected onto a single measured dimension.
Second, state transitions were only observed between two adjacent states and
not between non-neighboring states. In other words, the transition from one state
to another non-neighboring state required layovers at one or more intermediate
states. This phenomenon resembles the experimental observation that progenitor cells
pass through multiple intermediate states as they differentiate [5]. Moreover, these
simulations also highlight that the transition of a triad from one state to another can
take multiple possible routes. Similar results have been observed at the cellular level,
where recent progress in de-differentiation [17], [19], [20] and trans-differentiation [77]
has shown that natural developmental cellular state transitions may be only one of
many possible routes between cell types [78].
Third, trajectories of the stochastic simulation showed a directional bias in gene
expression state space, presumably due to mutual repression. For instance, when the
system was in the X-ON/Y-ON/Z-OFF state, its trajectory was virtually limited to
the X-Y plane, with little fluctuation in the Z direction (Figure 3C, middle and right
panels). The result is that each state in these simulations was functionally “flat” and
comprised of the expression from only two genes. Similarly, the one-ON states had a
polar or single dimensional expression distribution.
Next, we subjected the system to a range of noise levels to gain a deeper under-
standing of the system’s response and pattern of state transitioning. Noise strengths
ranging from 0.4 to 1.2 were selected for analysis. Plotted in Figure 3D (red line) is the
number of states traveled by the system, from the all ON state, within the simulated
time versus the noise strength. The result shows a step function-like increase in the
number of states with increasing noise strength. A similar pattern was observed when
14
![Page 35: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/35.jpg)
the simulation was initiated at the origin (blue line) rather than the all-ON state.
One exception is that from this alternative starting point the system could not travel
to the all-ON state and therefore had only six accessible states.
In both simulations, the sharp increase in the number of states traveled occurred
between noise strengths of 0.8 and 1 (Figure 3D). This appears to be the critical
threshold for overcoming the stability barrier between states and for complete noise-
induced state transitioning. The existence of such a critical noise threshold suggests
that natural systems may have the ability to operate in different regimes that either
utilize or repress gene expression stochasticity to achieve distinct physiological goals.
1.1.5 SSS Analysis
To further explore Network 84’s response to two or more simultaneously changing
parameters, we conducted more thorough parameter scanning (see MATERIALS AND
METHODS for details). First, two parameter combination regimes shown in Figure
4A and B suggest that, with strong auto-activation, Network 84 is ensured to have at
least four states (bottom two rows). Under these conditions, weak mutual inhibition
led to seven or eight states (orange and green boxes) while strong mutual inhibition led
to only four: all two-ON-one-OFF combinations and all-ON (purple box). However,
when both mutual inhibition and auto-activation are weak, Network 84 assumes only
one state, that of all genes OFF (Figure 4A, blue box). When auto-activation is weak
and mutual inhibition is strong the system is capable of the three two-ON-one-OFF
states (Figure 4A, red box, locations of SSS not shown). Taken together, this again
reinforces the importance of strong auto-activation in generating multistability and
hence the potential to canalize cell differentiation. As a third parameter combination
regime, we tested the effect of asymmetrical regulation on multistability (Figure 4C, see
15
![Page 36: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/36.jpg)
Figure 4. Stable Steady State Analysis of Network 84. (A) A matrix presenting thenumber of stable steady states generated by combinations of different auto-regulatorystrengths (rows) and mutual-regulatory strengths (columns). The green and orangeregions are visualized as FCT diagrams independently in B. (B) The green-boxed andorange-boxed diagrams corresponds to the green-boxed and orange-boxed parametercombination regime from (A). (C) Many of the parameter combinations yieldingmulti-stable systems are not represented by the matrix, and are instead abstractedhere. As an example, here we present a parameter combination regime that cansupport 6 SSS and auto-activation strengths are similar to those in the orange box(thick red arrows) but have one unrestrained pair of mutual inhibition that can existat a very high strength (blue T-head repression lines).
MATERIALS AND METHODS for details). Here we increased the mutual repression
of a single pair of genes for the conditions represented by the green and orange boxes
(Figure 4A, auto-regulatory strength between 1- 5) and saw the number of SSS fall
to six. These findings provide possible insight into why certain network topologies
have been adopted by evolution for particular regulatory tasks. For example, Network
84 demonstrates a broad, multistable parameter space that can be tuned, through
regulatory strengths, for the step-wise control of the maximal number of states. Such
16
![Page 37: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/37.jpg)
a mechanism could be envisioned to transit cellular identity from a pluripotent state
to multi- or uni-potent state during differentiation.
1.1.6 A Regulatory Network for Pluripotency
Experiments probing the structure and function of the regulatory network governing
pluripotency suggest that the core pluripotency regulators Oct4, Sox2, and Nanog
are organized into a regulatory topology of mutual activation that is shared with
Network 1 selected by our screen, [21], [79]. This fully interconnected autoregulatory
loop is believed to stabilize self-renewal, govern the bistable switch between ES cell
self-renewal and differentiation, and underlie the ability of selected pluripotency
regulators to ‘boot up’ the pluripotency transcriptional programming upon forced
expression in somatic cells, thereby reprogramming them to a stem cell-like state
[52]. Additional layers of regulatory complexity have been described in this network,
including autorepression by Nanog [80], [81] and differential cell phenotypes in response
to varying levels of these factors [82]–[84]. Moreover, Nanog acts as a homodimer [85],
while Sox2 and Oct4 heterodimerize to regulate transcription [86], and attempts have
been made to model the regulatory dynamics of these transcriptional complexes and
their interaction with signaling factors [87]. Further adding to the complexity, Oct4
may also partner with other Sox factors to form alternative regulatory complexes that
bind a distinct set of enhancers [88].
Notably, protein levels of the core pluripotency regulator Nanog have been observed
to fluctuate in cultured pluripotent stem cell populations [23], [24], [89]–[92], suggesting
that this regulatory network is capable of adopting multiple states. To further explore
possible roles of multistability in stem cell transcriptional regulation, we carried out
17
![Page 38: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/38.jpg)
Figure 5. Bifurcation and Stochastic Analysis of Network 1. (A) Network 1 sharesthe topological architecture of the Oct4-Sox2-Nanog triad, and is believed to underliestem cell differentiation and reprogramming. (B) The bifurcation diagram of Network84 plots transcription factor concentrations of X and Y at each SSS against α, theself-activation strength. The mutual inhibition parameters are also set equal andreferred to as β, here set to 0.1 and α is used as the bifurcation parameter rangingbetween .01 and 100. SSS are color coded and listed in the legend to distinguishdifferent SSS, where ‘+’ indicates the gene is ‘ON’ and ‘-’ means ‘OFF’. (C)Simulations of noise-induced state transitioning in Network 1 under different levels ofnoise. Simulations were performed with auto-regulation equal to 1 and mutualinhibition equal to 0.1, with noise levels of 0.5, 0.85, and 1 from left to right. Thelocations of the deterministically calculated states are indicated with red spheres,with their size correlated to their spectral radius. The blue ribbons indicate thetemporal trajectories of the simulations. The black arrows indicate initial direction ofstate transitions. (D) The number of states traveled in the stochastic simulationsplotted versus noise strength. The red line represents simulations that were initiatedfrom the all-ON state, and the blue line represents simulations that were initiatedfrom the origin.
18
![Page 39: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/39.jpg)
Figure 6. Stable Steady State Analysis of Network 1. (A) A matrix presenting thenumber of stable steady states generated by combinations of different auto-regulatorystrengths (rows) and mutual-regulatory strengths (columns). The green and orangeregions are visualized as FCT diagrams independently in B. (B) The green-boxed andorange-boxed diagrams corresponds to the green-boxed and orange-boxed parametercombination regime from (A). (C) Many of the parameter combinations yieldingmulti-stable systems are not represented by the matrix, and are instead abstractedhere. As an example, here we present a parameter combination regime that cansupport 6 SSS and auto-activation strengths are similar to those in the orange box(thick red arrows) but have one unrestrained pair of mutual inhibition that can existat a very high strength (blue T-head repression lines).
detailed analysis on this network topology (Figure 5A). Our goal in doing so is not to
capture every regulatory complexity present in the endogenous biological network, but
rather to provide a theoretical framework for how this simplified network topology and
fluctuations in network components might affect multistability. As with Network 84,
we first carried out the bifurcation analysis for Network 1 (Figure 5B). Considering
the topology of Network 1 consists solely of positive regulation, one might expect that
the network would primarily assume the all OFF “resting” or all ON “activated” states.
This intuition is largely correct but not complete. While network 1 does tend toward
19
![Page 40: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/40.jpg)
the all-OFF state it is also capable of generating eight states when the strengths of
auto-regulation and mutual regulation are 1 and 0.1, respectively (Figure 5B and C;
Figure 6A). This diagram (Figure 5B) shares some similarities with that of Network
84 (Figure 3B). For instance, both networks exhibit eight SSS when auto-activation
strength is equal to 1, and fewer as auto-activation strength increases. Likewise, when
auto-activation is reduced to be smaller than 1, both networks collapse to only support
one SSS (the black line). Thus Network 1 also displays discrete emergence and collapse,
or hysteresis, of SSS, allowing it to potentially serve as a foundation for irreversible
cell fate decision-making. It is interesting that while the two-ON states (blue, cyan,
and purple lines) collapse readily as α increases, the one-ON states (green, brown,
and yellow lines) and the all-ON and all-OFF states (orange, black), are stable over a
wide range of parameter values. It is also important to note that despite similarities
in the state space locations of each of the eight SSS in Networks 1 and 84, there are
considerable differences in their relative stabilities.
As with Network 84, we next incorporated different strengths of gene expression
stochasticity, or noise, to simulate state transitioning (Figure 5C). The simulations
were started from the all-ON state to mimic the all-ON pluripotent state of the
Oct4-Sox2-Nanog triad. In the presence of small levels of noise, the system remained
in the initial state throughout the duration of the simulation (Figure 5C, left panel).
When the noise level was increased, the system was able to spontaneously transition
to other states before settling in the all-OFF state, which represents the all-OFF
terminal, or differentiated, state (Figure 5C, middle panel). When the noise level was
increased further, the system transitioned out of the initial state much earlier toward
the all-OFF state via an intermediate state (Figure 5C, right panel), confirming that
the all-OFF state is indeed the terminal state and robust to different levels of noise.
20
![Page 41: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/41.jpg)
It should be noted that simulation times were the same for each panel (Figure
5C). In the high noise level simulation (Figure 5C, right panel), the system reverted
to the all-OFF state relatively quickly and exhibited small stochastic deviations
that rarely exceeded the boundary of the sphere representing the state. This could
perhaps be likened to the terminal state of the Oct4-Sox2-Nanog triad after cell
differentiation, where expression of these genes is markedly depressed in gene expression
profiles. Furthermore, for the modeling of cell fate decisions during development, the
binary nature of these simulations appears to mimic the observed stable states of the
Oct4-Sox2-Nanog triad, typically in either the all-ON or all-OFF state, with partial
activation visible transiently during iPSC induction [17], [18], [93].
Much like the trajectories observed for Network 84, those for Network 1 exhibited
the ability to transition between many states stochastically, with the potential of
accessing five of the eight possible states. Although, as depicted, the residence time
of states for Network 1 was largely restricted to either all-ON or all-OFF (Figures
3C, 5C). Network 1 simulations could not access the three two-ON-one-OFF states
unless initiated there, presumably because of the states’ low stabilities. Moreover,
unlike Network 84, all the simulation trajectories for Network 1 ultimately reached
and remained at the all-OFF state, a result of Network 1’s topology and the superior
stability of the all-OFF state. This also led to a lower average number of states
traveled by Network 1 at higher noise levels (Figure 5D). A higher noise threshold for
state switching of this sort could serve as added protection against aberrant regulation
of pluripotency. From a modeling perspective, these results offer a useful means of
rapidly predicting the dynamical stochastic trajectory of a network based on state
stabilities, without having to run entire stochastic simulations.
As with Network 84, we tested the sensitivity of Network 1 to different combinations
21
![Page 42: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/42.jpg)
of regulation strengths (Figure 6). Again, weak mutual activation and strong auto-
activation gave rise to the maximal number of states (Figure 6A, green and orange
boxes). Interestingly, however, the remainder of the regulation parameter space
was much simpler than that of Network 84, yielding only one or two states (Figure
5A, blue and red boxes). Again, independent manipulation of mutual regulatory
interactions can also lead to parameter combinations that enable multistability. A
thorough parameter scan for such multistability found a third interesting configuration,
one where a gene (Z) propagates signal to downstream genes (X and Y), which also
have strong auto-activation (Figure 6C). This pattern has one defining characteristic
in that it maintains multi-stability as long as there is only one master while the
promoted genes have weak interactions with each other. Such multi-stability is lost
when the system adopts a two-driver gene topology. Therefore Network 1 may be
considered a largely bimodal topology that retains the potential for greater state
diversity, but only within a narrow parameter window (Figure 5B and Figure 6A).
This characterization of Network 1 topology is supported by other studies. Specifically,
while claims of monoallelic expression of Nanog [89] have been contested [24], [29],
a common finding across the literature is heterogenous expression of endogenous
Nanog protein in cultured ES cells by immunostaining [23], [24], [75], [76], [89], [90].
These findings suggest bistable expression of Nanog, and support the idea that the
Oct4-Sox2-Nanog network may exist in at least four stable states (all high; Oct4 and
Sox2 high but Nanog low, as occurs in low Nanog ESCs; high Sox2 but low Nanog
and Oct4 as occurs in neural progenitor cells [94]; or all low/off in the differentiated
state).
Analysis of the regulatory parameter space for networks 1 and 84 raises an important
consideration that may be broadly applicable. Could progressive lineage determination,
22
![Page 43: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/43.jpg)
or restriction of pluripotency, be correlated with the strength of mutual regulation?
For both of the networks analyzed, the highest potential for multistability (eight states)
was found in a parameter space with weak mutual regulation (Figure 4A and Figure
6A). As the strength of mutual regulation increases from this point, more restricted
potential fates predominate, much as would be expected as cell lines differentiate. The
influence of greater mutual regulation on the potential for multistability would be
interesting to explore, and perhaps is well-suited to modeling in the hematopoietic
system where hematopoietic stem cells (HSCs) or multi-potent progenitors could be
compared with bi-potent progenitors.
1.1.7 In Silico Induced Pluripotency
Reprogramming mature cell types into iPSCs is achieved by the overexpression of a
number of defined transcription factors (typically including Oct4, Sox2, c-Myc, and
Klf-4) that regulate the necessary pluripotency phenotypes as well as each other’s
expression [18]–[20]. Recently, it has been shown that overexpression of only Oct4,
Sox2, and Nanog results in similar rates of iPSC generation from human amnion cells
[95], and, importantly, the Oct4-Sox2-Nanog triad is highly expressed and appears
to be the core regulatory circuit for pluripotency in stem cells [96]. To construct a
simplified in silico version of the iPSC experiment involving the Oct4-Sox2-Nanog
triad, we added a constitutive expression term to each gene of Network 1 (Figure 7A,
see MATERIALS AND METHODS).
To begin, we performed bifurcation analysis on the network 1 topology with an
added term for constitutive expression (Figure 7B). The addition of a basal level of
expression to Network 1 resulted in at most five possible states, two of which (the
orange all-ON and black all-OFF lines) possessed considerably higher stabilities than
23
![Page 44: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/44.jpg)
Figure 7. Bifurcation and Stochastic Analysis of Network 1 with constitutive basalexpression. (A) The topology of Network 1 is used to mimic iPSC experiments withadded constitutive expression of all three genes. (B) The X and Y proteinconcentrations of stable and unstable steady states are plotted against self-activationstrengths. With the addition of the constitutive expression we can see a furtherdecrease in stabilities, specifically two-ON states are never stable. Additionallyone-ON states lose stability very rapidly, approximately at α=1, and the all-OFFstate destabilizes after α=9 leaving only the all-ON state. This can also be seen inthe sizes of the spectral radii, deterministically these steady states exist but inpractice they have very little influence. (C) Simulations of Network 1 modeled in thepresence of an additional constitutive overexpression term consistently show atransition from an all-OFF state to an all-ON state after a period of latency.
the others (Figure 7B). All one-ON-two-OFF SSS (green, brown, and yellow lines)
are only stable for a narrow range of α. Even when they are stable, their stabilities
are insignificant. Looking closely we can see that between the red, green, and yellow
lines there are 3 unstable paths without a stable component. These lines would be the
locations of two-ON-one-OFF states if they achieved stability. More importantly, as
auto-activation strengthens, even the all-OFF state’s stability becomes insignificant
when compared with the all-ON state (Figure 7B gray circles).
A comparison of bifurcation diagrams of Network 1 with and without the constitu-
tive expression term (Figure 5B, Figure 7B) highlights some striking and important
24
![Page 45: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/45.jpg)
differences. Firstly, as observed experimentally, by adding the constitutive basal
expression term to all three genes, there is increase in the stability of the all-ON state.
Moreover, our simulation predicts a reduction in the number of possible SSS from
eight to five, as well as a significant reduction in the stability of the non all-ON or
all-OFF states. While these findings must be considered within the limits of our study,
they suggest constitutive expression could help to canalize the transition from all-OFF
to all-ON by simplifying the triad’s regulatory space, and thus provide a foundation
for induced pluripotency.
To further examine modified Network 1, the effects of gene expression noise were
investigated through stochastic simulation from the all-OFF state with a moderate
level of noise (see MATERIALS AND METHODS). Consistent with experimental
evidence [34], at the start of the simulation, the system hovered near the origin for a
period of time and then began to transition toward the all-ON state. Once again, by
adding a basal overexpression term to the genes, we found that the stability of the
all-ON state was increased to be greater than that of the all-OFF state (Figure 7C).
As a result, the system was attracted to the all-ON “pluripotent” state in a way that
was not observed with the unmodified Network 1 (Figure 5C).
These findings reflect observations from the experimental practice of induced
pluripotency in which the reprogramming factors are strongly overexpressed, and may
shed light on the regulatory mechanisms behind induced pluripotency. For instance,
the overexpression of defined transcription factors in differentiated cell types may alter
the regulatory parameters of the FCT into ones that prefer the all-ON “pluripotent”
state. Affected cells would then gradually transition from their original state to the
preferred all-ON state. Further, as is often done experimentally, we removed the
constitutive expression term from our simulation and found the system to return
25
![Page 46: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/46.jpg)
to an unmodified Network 1 behavior. Much like iPSCs, the network was able to
re-differentiate to the all-OFF state or to remain in the all-ON state, under conditions
of low noise or high auto-activation respectively.
Figure 8. Stochastic simulations capture latency distributions in inducedpluripotency. (A) Distribution of simulated latency for the modified Network 1 modelillustrates a skewed bell shape. The histogram was generated from 2000 simulations.(B) Similar results generated from 2000 simulations with Network 84.
To further test the utility of our model as an abstraction for the transcriptional
regulation of stem cells and stem cell reprogramming, we next used our model to
predict the temporal dynamics of stochastic reprogramming. We first investigated
the time required for the system to transition from the all-OFF to the all-ON state,
which is termed the first passage time (FPT) [97]. An FPT distribution of induced
pluripotency was generated by running 2000 simulations with our modified Network 1,
resulting in a characteristic asymmetrical bell shape distribution with a long tail and a
peak at approximately 50 normalized time units (Figure 8A). To ensure the generality
of this observation, an FPT distribution for the transition from one state to another
26
![Page 47: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/47.jpg)
in Network 84 was similarly generated and found to yield the same pattern (Figure
8B). Such a skewed bell shape distribution is consistent with published experimental
data [34].
1.2 Discussion
Over fifty years have passed since the connection between development and multista-
bility was first proposed [3]. However, despite efforts in the last decade to reverse
engineer transcription regulatory networks, the molecular basis for multistability is still
not well understood. Likewise, multistable systems with more than two states have
not yet been constructed de novo. We address this challenge with a high-throughput,
dynamical systems search algorithm, powered by parallel computing, that identifies
gene network topologies enriched in their capacity to realize multistability.
Focusing our screen on FCTs, our algorithm identified complete auto-activation
as an important topological feature for high-dimensional multistability, along with a
variety of combinations of mutual regulation. Our analyses shed light on existing ideas
and generate new testable hypotheses for some of the regulatory mechanisms underlying
cell fate determination. Many of the seemingly disparate experimental observations
including random reprogramming latency, the existence of intermediate states during
differentiation and reprogramming, population heterogeneity, and alternate routes of
cellular state transitioning [5], [27], [34], [96], [98], can potentially be connected and
addressed by our dynamical analyses of FCTs.
The central importance of complete auto-activation in multistable triads suggests
that nodes for such regulatory networks could be identified in exogenous overexpression
screens for known key regulators. In such a screen, the components of multistable
FCTs would self-identify, by maintaining their own expression after the withdrawal
27
![Page 48: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/48.jpg)
of candidate transcription factor overexpression. Transcription factors identified this
way could represent “pioneers”, those capable of activating regulatory partners or
overcoming transcriptional inaccessibility, i.e., silenced loci, such as Oct4.
The identification of regulatory partners (mutual regulation) could also be char-
acterized by the analysis of promoter/enhancer regions upstream of identified auto-
activation transcription factors. Such a “guilt by association” method may also help
to infer the function of the triad as a whole, or lesser-known regulatory partners.
Thus, using established developmental transcription factors as seeds, the Waddington
landscapes of whole cell lineages could perhaps be mapped by using auto-activation
as a cipher.
The notion that specific topologies may serve particular regulatory roles better
than others is supported by the identification of an additional Network 1 regulatory
topology in a stem cell population. The transcription factors Gata2, Fli1 and Scl/Tal1
form a regulatory triad that operates through mutual and auto-activation to promote
hematopoiesis in the mouse embryo [99]. As discussed in the context of the Oct4-
Sox2-Nanog triad, the Network 1 topology has a tendency to operate in a bimodal
and terminal manner, meaning that once switched to the all-ON state the triad has
a tendency to be locked into that state. Such a regulatory system would impart
a sort of cell “memory”, perhaps allowing newly formed HSCs to travel from the
aorta-gonad-mesonephros (AGM) or primordial niche to the fetal liver and bone
marrow niche without losing their multipotent state [99], [100].
Importantly, regulatory triads do not exist in isolation and are not static entities.
Rather, inputs from the environment and other signaling pathways can influence the
state or composition of endogenous triads through the perturbation of regulatory
relationships. The consequence of such influence was evident in our bifurcation
28
![Page 49: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/49.jpg)
analysis of Networks 84, 1, and modified 1, where small changes to the strengths of the
network regulation created or eliminated SSS. Thus, under the influence of external
signaling (e.g., post-translational modification), subcellular localization of co-factors
or regulatory ligands [101]–[103], the same network topology may alternate between
different regulatory state configurations. Such state switching could be an effective
regulatory strategy for responding to changing environmental and developmental cues.
While gene expression stochasticity is an accepted factor of cell fate determination
[61], its role in complex processes such as differentiation remains elusive. We subjected
our chosen multistable networks (e.g., Network 84) to stochastic simulations and found
that moderate levels of gene expression noise were sufficient to disrupt the balance of
protein abundances required for maintaining all-ON and all-OFF states, even with
high protein levels. In other words, our simulations suggest that noise is critical to a
biological system’s ability to transition from meta-stable states to differentiated states.
Moreover, noise-induced transitioning showed multiple possible routes from one state
to another, with the trajectories traversing step-wise through adjacent states (Figures
3C, 5C). This is possibly the in silico manifestation of experimental observations
showing progenitor cells passing through alternate intermediate states on the way to
a differentiated state [5]. Additionally, by testing a range of noise levels, we found
that the number of states traveled by these multistable systems increased sharply
at specific noise levels (Figures 3C, 5C). Perhaps cellular systems straddle such a
threshold, whereby gene expression noise can either be suppressed or amplified to
realize very different physiological outcomes. The advent of single-cell gene expression
profiling methods [104] affords new windows on cellular heterogeneity, and theoretical
frameworks such as we present here may provide valuable insights when informed by
the experimental data these technologies will provide.
29
![Page 50: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/50.jpg)
Like signaling cascades in their simplest form, the regulatory relationships between
the genes of FCTs define their meta-properties. In a manner reminiscent of the low-
pass filtering capacity of cascades, our stochastic simulations detected network-specific
variation in tolerance for noise. Specifically, in the analyses of Networks 1 and 84,
significantly different noise thresholds for multistability were measured—0.8 and 1.4,
respectively (Figures 3 and 5). This observation suggests that, along with other
measures, networks may be tuned for unique noise-switching thresholds. For example,
the Gata2-Fli1-Sci/Tal1 triad (Network 1) would be predicted to remain in the all-ON
state as long as noise levels are maintained low, but to transition quickly to an all-OFF
state under high noise. This could be tested experimentally using tunable synthetic
noise controllers [94] or generators [105] upstream of these HSC regulators to apply
large fluctuations to the system.
As such, differentiation or pluripotency may provide readily accessible experimental
models. One possible method to induce noise could be the expression of splice isoforms
[106], [107]. Focusing on the endogenous triads discussed, state change dynamics
could be evaluated under the perturbations of either single or multiple splice isoforms.
Thus, while controlling for total protein expression within a particular triad, one
could evaluate whether “noise”, derived from the concurrent expression of multiple
isoforms, influences the product or rate of the induced state transition. If so, an
interesting outcome may be that individual differentiation pathways could be primed
for state change by expressing tailored suites of splice-isoforms or in the case of iPSC
reprogramming, perhaps accelerate the rate of state switching.
Finally, our work offers new directions for synthetic biology [11], [12], [108]–[112].
As a tool, the application of synthetic biology to disease-related or developmental
triads may allow for true reverse engineering of in vivo processes. For example,
30
![Page 51: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/51.jpg)
the regulatory dynamics of Network 1 (Gata2-Fli1-Scl/Tal1 and Oct4-Sox2-Nanog)
or Network 84 (RUNX2, SOX9 and PPAR-γ) topologies could be systematically
examined for response to perturbation or noise by constructing such FCTs in an
orthogonal environment (e.g., bacteria). Similarly, the effects of stochasticity on the
regulation of pluripotency could be studied directly in mammalian systems using
tunable synthetic noise generators. Such a system could vary the gene expression of
triad components, while keeping their mean expression level constant, to evaluate
the influence of noise on network state change [57], [113]. Our work here provides a
starting point for the design of next-generation synthetic circuits that could exploit
multistability and its physiological consequences.
1.3 Materials and methods
1.3.1 Enumerating and Eliminating Redundant Networks
Three-node networks can have 39=19683 configurations, each with nine edges that
consist of positive, negative or null regulations between the nodes. Our focus on
FCTs resulted in a total of 29=512 possible configurations because each edge can
only represent positive or negative regulation. Through permutation analysis many of
these 512 networks were found to be identical to each other. Identical networks were
eliminated, leaving us with 104 unique networks (see SI).
1.3.2 ODE Modeling and High-Throughput Screening
A three-node gene regulatory network can be described by a three-dimensional ODE:
dx
dt= a1f1 (x) + b1g1 (y) + c1h1 (z)− δx (1.1)
31
![Page 52: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/52.jpg)
dy
dt= a2f2 (x) + b2g2 (y) + c2h2 (z)− δy (1.2)
dz
dt= a3f3 (x) + b3g3 (y) + c3h3 (z)− δz (1.3)
In this formulation each variable (x,y,z ) represents the protein abundance of one
gene product (Equations (1.1)–(1.3)). Here only one generic equation is used to
describe the gene expression regulation including transcription, translation, and other
post-transcriptional and post-translational modifications. Each equation captures
the core nonlinear dynamics of gene expression regulation and omits details of the
fine-tuning of it. The parameters (a1−3,b1−3,c1−3) represent the strength of mutual or
auto-regulation, be it activation or inhibition. Each function (f, g or h) has the form Fn
/ (kn + Fn) when representing activation, and the form kn/(kn+Fn) when representing
inhibition, where F can be either x, y, or z. The activation and repression of each
gene by other factors are modeled additively to account for reported mechanistic
independence of transcription regulations [16], [40], [46]. For example, Network 84
shown in Figure 3A can be described by the ODE:
dx
dt= a1
xn
kn + xn+ b1
kn
kn + yn+ c1
kn
kn + zn− δx (1.4)
dy
dt= a2
kn
kn + xn+ b2
yn
kn + yn+ c2
kn
kn + zn− δy (1.5)
dz
dt= a3
kn
kn + xn+ b3
kn
kn + yn+ c3
zn
kn + zn− δz (1.6)
In the models values of n=4 and k=0.5 were assumed for all genes (Equations (1.4)–
(1.6)). While it is known that transcription factors critical in stem cell programming
often form homodimers [85], [114] or heterodimers [115], n is set to be equal 4 because
the nonlinearity quantified by n is often affected by many factors in addition to
protein multimerization [10]. A hill coefficient of 4 has been used previously to
demonstrate generic behavior of stem cell differentiation [16]. Recently, hill coefficients
32
![Page 53: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/53.jpg)
up to 9 have also been used [40] to construct models with experimentally verified
predictions. It is also found in our study that hill coefficient values affect the region
of multistability but not general conclusions. While it did not affect the results of
steady-state calculations, δ was set to be 1; this assumption can be adjusted in light of
future available experimental results. For mutual regulation strength, the parameters
(b1, c1, a2, c2, a3, b3) were set to [0.1, 0.2, 0.5, 1 or 5]; for auto-regulation strength,
the parameters (a1, b2, c3) were set to 0.1, 0.5 or 1. These parameter possibilities
lead to a total of 56*33=421,875 parameter sets to test.
We numerically solved for the root of the right-hand side of each ODE set with
over 1000 different initial guesses uniformly distributed over the entire state space.
Measuring the probability of multistability for each FCT was accomplished by analyz-
ing all 421,875 parameter sets. A parallel algorithm was developed to distribute this
task as an ensemble of independent computations (See the SI).
1.3.3 Analysis of the Parameter Space
In addition to the tabular data (Figure 4A and 4A) that is useful for quick visualizations
and analysis of stabilities, it is desirable to recognize key patterns of parameter
combinations (PCs) yielding multistability. Therefore networks chosen (Network 84
and 1) for in-depth analysis had their results recomputed with a wider sample of
parameter spaces. Specifically the PCs were expanded to five levels for each edge, [0.1
0.5 1 2 6], yielding 5^9 = 1,953,125 possible PCs. The found clusters can then be
generalized to create the FCT diagrams in Figures 4C and 4C.
33
![Page 54: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/54.jpg)
1.3.4 Bifurcation Analysis
Bifurcation diagrams were generated using MatCont [116], a continuation toolbox
for MATLAB. Bifurcation analysis was performed on the auto-activation strength;
mutual-regulation strengths were set at .1. Spectral radii were gathered at specific
intervals to compare the stability of the stable steady states. Stable and unstable
steady state information was graphed to generate Figures 3B, 5B, and 7B.
1.3.5 Stochastic Simulations
The ODE model can be converted to chemical Langevin equations to introduce
stochasticity while approximating the Gillespie algorithm [68], [117]. All parameters
were scaled up so that copy numbers of gene products are in biologically reasonable
quantities and can reach over one hundred. This was done by multiplying the
coefficients in a, b, c and k by 100 with other parameters unchanged. This linear
shift does not alter the bifurcation dynamics or the relative locations of SSS. At each
iteration, all chemical species are updated using the equation:
N (t+ ∆t) = N (t) + ∆tA (N (t)) + S√∆t (F (N (t)) +B (N (t)))z (t) (1.7)
In this equation, N (t) represents the abundance of chemical species, A(t) is the
right-hand side the ODE, and F and B are the forward and backward reaction terms
in each equation, respectively, which were extracted from corresponding ODEs. z (t)
are the standard Normal variables, S is the stoichiometric matrix of the biochemical
reactions (refer to [68] for additional details). In our modified algorithm, we added
one parameter, α, to Equation 1.7 to represent the tunable noise strength:
34
![Page 55: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/55.jpg)
N (t+ ∆t) = N (t) + ∆tA (N (t)) + α ∗ S√∆t (F (N (t)) +B (N (t)))z (t) (1.8)
Here α was the noise strength used in Figures 3C and 5C. Therefore, the results
were comparable to pure discrete simulations when α equals 1. Modulation of the
noise strength(α) was used to facilitate investigations of the system’s dynamics in
response to different noise levels. The multiplicative noise term is chosen for simplicity
without loss of generality.
1.3.6 Statistical Tests
All statistical tests were performed using Matlab. The parameters for each distribution
were estimated using the Matlab command mle. Chi-square (X 2) statistics for each
distribution were computed with the chi2gof command.
35
![Page 56: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/56.jpg)
Chapter 2
GENE NETWORK MODELING
Embryonic stem cells have the capacity to differentiate into more than 200 isogenic
progenitor and terminal cell types [1], each of which represents a stable cellular state.
The potential for a system to realize multiple states is termed multistability, which has
been proposed as the mechanism underlying cell differentiation for over half a century,
with stable states represented as either valleys in the developmental landscape[2], [3]
or as dynamic attractors in high-dimensional gene expression space [4]–[6]. These
states are realized through a combination of inputs resulting from both the cytoplasm,
and the host of the cell. The interplay of genes with these inputs within the cell drives
cells towards large responses such as embryonic or carcinogenic differentiation, but also
smaller responses such as metabolic activities [118], stress responses . Understanding
these interactions is critical to the construction of novel gene circuits with a behavior
that is predictable in silico, and consistent in vivo [119].
The ability to infer gene networks has made tremendous strides over the past
decades thanks to advances along 3 main fronts. Decreased sequencing costs have lead
to an explosion of available sequencing data, much of which is available to researchers
from systems like the NCBI GEO DataSets [120]. Due to the nature of the problem,
whereby tens of thousands of genes are predicted using hundreds of individual sample
points, the problem is massively underdetermined. This ability to access and pool
these samples puts better constraints on the system as a whole, and leads to much
stronger conclusions during network inference. Second, significant advances have been
made in the algorithms used for gene network inference. These advances have been
36
![Page 57: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/57.jpg)
made through both algorithmic advancements, and novel approaches made possible
through generally increasing computational power. Third, the quality of sequencing
results has improved significantly. In specific the length of sequence that can be read
successfully has improved, and these longer sequences can more easily be resolved
to unique positions, increasing the quality of gene quantifications, and consequently
network inference.
Here we discuss network inference methodologies capable of inferring these network
interactions from sequencing data and some of their limitations. We discuss limitations
of tools downstream from the sequencers, and propose a solution to integrating data
from multiple sources. We propose a technique for automatically collecting and
pre-processing a large corpus of sequencing runs from public sources, allowing for
increased power during the network inference phase. We demonstrate the viability of
our pipeline as a means of both increasing input data, and decreasing gene expression
noise, a significant problem in network inference [104], [121]–[123].
2.1 Network Models
Gene regulatory network (GRN) modeling is a technique dedicated making functional
understanding of gene regulation at a network level, a crucial step in the understanding
of cell fate determination. Specifically it seeks to understand the processes behind
GRNs and employ them to engineer novel biological devices, metabolic processes,
and therapeutics. Since the development of using Boolean networks [4] to model
cell fate determination, there have been massive advances in model accuracy. These
advances have sourced both from growth in computational abilities, but also in
greater understanding of cells and gene networks. As computational power has grown
[124], the desire for more powerful and predictive models has evolved the modeling
37
![Page 58: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/58.jpg)
landscape from simple and lightweight into more complex and flexible models. Initially
starting with Boolean networks [4], [125] and moving through deterministic and linear
models [126], [127], Bayesian network modeling [128]–[130], and more recently to
differential equation based models. Differential equations have the added benefit of
being extensible to incorporate stochastic simulations in a more natural way than
previous models.
Figure 9. Two fully connected(A/B), and two partially connected(C/D) triads aredepicted. (A/C) illustrate only positive network regulation (activation), indicated byred arrows, i.e. increasing quantities of the transcription factor at the tail of thearrow also increases quantities of the transcription factor at the head. (B/D) utilizeboth positive and negative network regulation (repression), indicated by the bluebars, i.e. increasing the quantities of the transcription factor at the tail of the errordecreases quantities of the transcription factor at the head
2.1.1 Boolean Networks
Since their introduction, Boolean networks have remained a first tool choice for many
gene network modelers, even as more powerful tools become available [125], [131],
[132]. The primary motivations for this are their low cost of simulation, and a strong
theoretical background from computer science theory. Additionally, because the model
38
![Page 59: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/59.jpg)
is very simple Boolean networks tend to be easy to fit with regard to experimental
data. As a technique for network approximation this remains a go-to tool, with more
recent and advanced Boolean network models [132], [133] capable of modeling cell state
dynamics, and even hysteresis. While these models are efficient for an overarching
view, they lose traction as one attempts to model cell state with more fidelity, both
due to their discretization of the input domain, and particularly because of difficulties
in modeling the influence of noise.
Boolean networks are modeled as a set of boolean expressions defining transitions
between state n and state n+1. As an example Equations (Equations (2.1)–(2.4))
show one set of boolean rules for the networks from Figure 9. It is relatively clear that
these rules are overly restrictive by the network’s inability to model oscillations, some
of the steady states, or many of the empirical behaviors observed by [119]. Larger
networks, however, are still capable of modeling some stochastic behaviors, including
oscillations and dynamic attractors [4]. These models have performed remarkably
well [122] despite their simplicity, likely due to the significantly reduced number of
features required to properly fit the model. Boolean network models have fairly limited
usage in simulation [133], but these are still quite successful at inferring rules for the
connectivity of genes, suggesting positive and negative regulations [121], [134], [135].
Boolean Equation for Figure 9A:
fx(x, y, z) = x ∧ y ∧ z
fy(x, y, z) = x ∧ y ∧ z
fz(x, y, z) = x ∧ y ∧ z
(2.1)
39
![Page 60: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/60.jpg)
Boolean Equation for Figure 9B:
fx(x, y, z) = x ∧ ¬y ∧ ¬z
fy(x, y, z) = ¬x ∧ y ∧ ¬z
fz(x, y, z) = ¬x ∧ ¬y ∧ z
(2.2)
Boolean Equation for Figure 9C:
fx(x, y, z) = y ∧ z
fy(x, y, z) = y ∧ z
fz(x, y, z) = y ∧ z
(2.3)
Boolean Equation for Figure 9D:
fx(x, y, z) = ¬y ∧ ¬z
fy(x, y, z) = y ∧ ¬z
fz(x, y, z) = ¬y ∧ z
(2.4)
In this formulation each variable (x,y,z ) represents the protein abundance of one
gene product. Much of the success of Boolean networks can be attributed to the
relatively small number of features required to fit the boolean networks, and the
simplicity of the resultant models. In the examples above the model only needs to
determine whether gene X is activated, repressed, or indifferent to the behavior of
other nodes. Boolean models can be extended to allow for some increased flexibility,
for example instead of requiring all conditions to match only a subset are required
to match, but as the number of subsets increases the difficulty in fitting the model
increases as well.
2.1.2 Bayesian and Neural Networks
Bayesian and Neural networks offer a more powerful alternative to the less compli-
cated Boolean networks. Bayesian networks allow a generated model and parameter
40
![Page 61: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/61.jpg)
combination to be inferred almost in their entirety from the input data [129], removing
some of the need for domain knowledge. Neural networks in particular have shown
great success in fields such as gene expression profiling [136] and credit card fraud
detection [137], [138]. Much of this success comes at two costs: first, the input data
set must be quite large relative to the dimensionality of the problem [139], and second,
much of the power behind these tools must be sacrificed in order to maintain human
interpretability of the resulting model.
Neural networks are typically highly connected [140], meaning that the parameters
required are often on the order of O(N^2). Additionally, where most other networks
have only an input or current step, and an output, or next step, neural networks often
contain “hidden nodes” nested between the input and output space. These nodes have
an effect on the result and are necessary to model high dimensional nonlinearity [134],
but they dramatically increase the number of features rendering network inference
problems exceedingly underdetermined. Additionally, the meaning of these hidden
nodes is often difficult, if possible, to interpret.
Bayesian Equation for Figure 9A:
Px(x, y, z) = a1P (x)b1P (y)c1P (z)
Py(x, y, z) = a2P (x)b2P (y)c2P (z)
Pz(x, y, z) = a3P (x)b3P (y)c3P (z)
(2.5)
Bayesian Equation for Figure 9B:
Px(x, y, z) = a1P (x)b1(1− P (y))c1(1− P (z))
Py(x, y, z) = a2(1− P (x))b2P (y)c2(1− P (z))
Pz(x, y, z) = a3(1− P (x))b3(1− P (y))c3P (z)
(2.6)
41
![Page 62: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/62.jpg)
Bayesian Equation for Figure 9C:
Px(x, y, z) = b1P (y)c1P (z)
Py(x, y, z) = b2P (y)c2P (z)
Pz(x, y, z) = b3P (y)c3P (z)
(2.7)
Bayesian Equation for Figure 9D:
Px(x, y, z) = b1(1− P (y))c1(1− P (z))
Py(x, y, z) = b2P (y)c2(1− P (z))
Pz(x, y, z) = b3(1− P (y))c3P (z)
(2.8)
In Equations (Equations (2.5)–(2.8)) each variable (x,y,z ) represents the protein
abundance of one gene product. Bayesian networks are not significantly more difficult
to learn than boolean networks, adding only a small number of new parameters. Thanks
to this simplicity Bayesian networks have been used to replace Boolean networks when
slightly more predictive power is needed, or more data is available [141], [142]. In the
examples above the model needs to determine whether x is activated, repressed, or
indifferent to the behavior of other nodes, and also determine the strength of each
link probabalistically. Boolean models can be extended to allow for some increased
flexibility, for example instead of requiring all conditions to match only a subset are
required to match, but as the number of subsets increases the difficulty in fitting the
model increases as well.
2.1.3 Differential Equation Networks
Linear and nonlinear differential equation network models provide an important
advance in the understanding and modeling of GRNs [119]. They present fewer
variables than Bayesian networks while allowing a potentially better behavioral fit.
42
![Page 63: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/63.jpg)
Additionally they utilize a continuous input and output domain while maintaining
the interpretability and some of the simplicity of Boolean networks. While linear
and nonlinear differential equation models are superficially similar, recent research
has shown that the effect of transcription regulations on genes is nonlinear in nature
[12]. The nonlinearity of transcription regulator binding has been demonstrated and
previous models have been capable of approximating nonlinearity through additional
parameters, but we recently showed that the effects are nonlinear even with a single
gene activator or repressor [12]. This nonlinearity explains the existence of multiple
stable steady states and encourages yet more accurate modeling of cell state, as a
small change in initial conditions in an undifferentiated cell may lead to a major
change between differentiation pathways.
Differential Equation for Figure 9A
dx
dt= a1
xn
kn + xn+ b1
yn
kn + yn+ c1
zn
kn + zn− δx
dy
dt= a2
xn
kn + xn+ b2
yn
kn + yn+ c2
zn
kn + zn− δy
dz
dt= a3
xn
kn + xn+ b3
yn
kn + yn+ c3
zn
kn + zn− δz
(2.9)
Differential Equation for Figure 9B
dx
dt= a1
xn
kn + xn+ b1
kn
kn + yn+ c1
kn
kn + zn− δx
dy
dt= a2
kn
kn + xn+ b2
yn
kn + yn+ c2
kn
kn + zn− δy
dz
dt= a3
kn
kn + xn+ b3
kn
kn + yn+ c3
zn
kn + zn− δz
(2.10)
Differential Equation for Figure 9C
dx
dt= b1
yn
kn + yn+ c1
zn
kn + zn− δx
dy
dt= b2
yn
kn + yn+ c2
zn
kn + zn− δy
dz
dt= b3
yn
kn + yn+ c3
zn
kn + zn− δz
(2.11)
43
![Page 64: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/64.jpg)
Differential Equation for Figure 9D
dx
dt= b1
kn
kn + yn+ c1
kn
kn + zn− δx
dy
dt= b2
yn
kn + yn+ c2
kn
kn + zn− δy
dz
dt= b3
kn
kn + yn+ c3
zn
kn + zn− δz
(2.12)
In this formulation each variable (x,y,z ) represents the protein abundance of one
gene product (Equations (2.9)–(2.12)). Here only one generic equation is used to
describe the gene expression regulation including transcription, translation, and other
post-transcriptional and post-translational modifications. Each equation captures
the core nonlinear dynamics of gene expression regulation and omits details of the
fine-tuning of it. The parameters (a1−3,b1−3,c1−3) represent the strength of mutual or
auto-regulation, be it activation or inhibition. Each function (f, g or h) has the form Fn
/ (kn + Fn) when representing activation, and the form kn/(kn+Fn) when representing
inhibition, where F can be either x, y, or z. The activation and repression of each
gene by other factors are modeled additively to account for reported mechanistic
independence of transcription regulations [16], [40], [46].
2.2 RNA Sequencing
2.3 Data Aggregation
The past decade has seen numerous revolutions in “next-gen” sequencing, and with
them new tools to aide in their use. This has come largely as an increase in sequencing
accuracy, increased sequence length, and the introduction of single-cell sequencing.
Analysis (Table 1) shows that these breakneck advances in sequencing technology have
resulted in a fractured landscape of techniques that are difficult to rectify, and that
44
![Page 65: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/65.jpg)
ExperimentID
Cell Types ReferenceGenome
Library Prep Aligner andQuantifier
200021180 ES, Neuro2A,MEF
NCBI 37.1 custom Bowtie v0.9.6
200029087 ES, MEF mm9 custom Bowtie v0.9.6200045284 NPC, MEF mm9 Illumina
TruSeq RNASample PrepKit v2
TopHat v2.03and Bowtie2.0.0.6 cuffdiff2.02
200045719 Liver andDevelopmentcells
mm9 SMARTer Ul-tra Low InputRNA for Illu-mina Sequenc-ing
bowtie 0.12.7Rpkmforgenes
200052583 Lung mm10 Nextera bowtie2-2.1.0tophat-2.0.8cufflinks-2.0.2
200059127 Kidney mm9 Nextera TopHat v1.4.1GeneSpring12.6.1-GX-NGS-RV
200059129 Kidney mm9 Nextera TopHat v1.4.1GeneSpring12.6.1-GX-NGS-RV
Table 1. Overview of RNA-seq experiments analyzed with a listing of tool and assetversions.
result in different outputs. In the interest of meta-genomics, authors should provide a
set of pre-processed data, which could be easily utilized by researchers for downstream
applications. In fact many grants require authors to submit their data to the NCBI
GEO repository so that replication of results and downstream analysis is more readily
available. Unfortunately this data is not always usable for researchers as the tools
used to quantify read counts introduce a significant bias into the reads Figure 11,
45
![Page 66: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/66.jpg)
when this bias is varied between experiments it introduces another source of technical
noise. Additionally, while NCBI’s GEO requires authors to submit processed data
there is little restriction on how the data is formatted and which calculations are used,
with some authors providing transcripts per million (TPM) [143], others fragments per
kilobase per million mapped reads (FPKM) [144], others provide a direct read count
[145], and some provide alignments for the individual reads without quantification
[146].
The next possibility is that researchers could have an easy way to take arbitrary
unprocessed reads and align them on their own. To this end there has been a great
deal of effort; from standardization of reference genomes, to online resources such as
Galaxy [147], [148], even including tools such as mygene.info [149] capable of converting
between different gene nomenclatures. Despite this progress many holes remain in
the analysis pipelines; quantified gene readings are not sufficiently available (or are
inconsistent), and quantifying genes is too complicated, requiring significant computing
power and in-depth knowledge of both the sequence library preparation technique, and
of how to perform alignment and quantification in general. Unfortunately this kind of
broad knowledge is not common, and while critical to the alignment phases, having
this kind of information that is not of interest to most downstream computational
studies. To this end we created a tool for automatically downloading experimental data,
recognizing experimental parameters, and performing alignment and quantification
steps such that output data has a minimal bias, and that the bias provided is uniform
among experiments. We further demonstrate that this data contains a smaller amount
of noise relative to the disparate processing pipelines provided by the original authors.
46
![Page 67: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/67.jpg)
2.3.1 Acquiring the Data
Through our analysis of datasets publicly available in NCBI’s GEO repository we
found a few key problems:
1. Varied adapter sequences
2. Multiple barcode handling techniques
3. Differing sequence lengths
Depending on the sequencing chemistry used there are different adapters present
to attach the recerse-transcribed mRNA transcript to the sequencer. In theory these
sequences should not be included in the reads provided, and as a result should have
no influence on downstream processing. In our results we find that they frequently
remain within the sequenced records, likely as a result of multiple adapter attachment.
While analyzing one data set we found over 30% of reads containing a copy of the
adapter sequence (NCBI GEO GSE52583) [143]. Unfortunately GEO does not provide
a standardized field for storing the adapter sequences that were used, consequently one
must analyze the sequenced data and search for overrepresented sequences occurring
on the forward or reverse strand near either end of the read. The results of a failure
to remove the adapter sequence are ambiguous and further depend on the alignment
pipeline used. Read aligners will have varying degrees of success in aligning the reads
to the genome, with some spurious alignments of the sequence against somewhat
random locations in the genome that happen to match the adapter sequence. In order
to remedy this our pipeline performs an automated search for likely adapter sequences
in the sequenced data, and trims them to reduce this noise. In the case that the reads
are too short after the adapter has been trimmed, the reads are discarded.
47
![Page 68: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/68.jpg)
One major problem in sequencing is the random dropout of individual molecules
during sequencing. To overcome this issue most experiments overcome this issue
by using a PCR amplification step, resulting in multiple copies of the same read,
slightly biased by the PCR procedure [150]. Brunskill et al [144] demonstrate that
this PCR bias can be overcome by attaching a barcode to each unique molecule. This
provides a significant reduction in noise, with a minimal loss of sequencing length.
This too provides a challenge to individual researchers who may not be aware of the
presence of a barcode, or the constant flanking sequences. Failure to account for the
barcodes can reduce the accuracy in the same way that an ignored adapter can, while
simultaneously throwing away valuable information. To account for barcodes our
pipeline has a special search within the adapter finder to look for varying bases, which
are automatically recognized as barcodes and utilized to remove duplicated reads.
Read lengths within second generation sequencing continue to grow, starting
with Illumina’s 25 base pair single-ended read, through their current 300 base pair
double-ended read. This increase in read length is critical to the proper resolution of
RNA isoforms, and unique mapping of identified reads. Nonetheless, it too presents
challenges. Increased read lengths drastically increase the probability of reading across
an RNA splice junction meaning that the genome annotation used should contain
the most extensive and accurate list of splice junctions for your organism of choice.
Most researchers use the University of California Santa Clara (UCSC) annotations,
but even among those annotations there are multiple versions in use. This results in
yet another source of noise as different annotation versions will refer to identical genes
with different names that an also be ambiguous, mapping to more than one gene in
annotations from a different group. Even differing versions of annotations from a
48
![Page 69: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/69.jpg)
single group may contain different genes, or may rename genes that were previously
suspected and were later confirmed.
2.3.2 Noise Minimization
Figure 10. In this figure we present the data processing pipeline from 5 otherScRNA-seq experiments available on NCBI GEO. GSE59127 and GSE59129 (left,blue), GSE 52583 (left center, red), GSE21180(center, olive), GSE45284 (right center,orange) and our pipeline (right).
As shown in Figure 11A-C the coefficient of variation(CV) of our pipeline is
approximately in line with those provided by the providers of the respective data sets.
In Figure 11D we demonstrate that despite having similar intra-experimental CV’s
our pipeline produces a reduced inter-experimental CV compared with the mixed
49
![Page 70: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/70.jpg)
Figure 11. Scatterplots showing ratios of Coefficient of Variation(CV), where pointsrepresent CV of a single gene within the original author’s pipeline, and ours. Thex=y line drawn, representing the point where CV is equal between both pipelines.Points on the x>y side are drawn in red, and point on the y>x side are drawn inblack. For the experiments in (A-C), raw scRNA-seq reads are provided by the NCBIGEO service and analyzed using our customized pipeline. In addition to raw readseach author provides estimates of gene activity levels through FPKM or TPM. In(A-C) we plot the coefficient of variation (CV) of our pipeline results against those ofthe original authors. In (D) we group the experiments from (A-C) to visualizeinter-experimental CV.
pipelines of the original authors. Furthermore we expect that the effect of this CV
reduction will continue to become more pronounced as the number of aggregated
works, and distinct sequencing pipelines, increases.
50
![Page 71: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/71.jpg)
2.3.3 Future Work
We propose to add two components to the pipeline: automated parameter optimization
and post-processing of the data. Due to differences in the data it is currently important
to determine key features such as maximum and minimum fragment lengths, adapter
and barcode presence, and in some cases even the format of the read type(color space or
base space) and quality score. This is currently done manually, with some parameters
determined through examination of the sequencing procedures and sequencing files
with other parameters collected by aligning a subset of the data and then resequencing
the remaining data. Automation of these tasks can be performed well through a
rule-based system utilizing lists of sequencers, and a distribution-based analysis of read
lengths. In addition, we propose to post-process data to remove previously included
contextual noise, such as cell-cycle noise. Not all cells undergo cell cycle, but much of
the available scRNA-seq data is from differentiating cells. This differentiation noise can
obfuscate subpopulations of cells, which may have expression levels that are masked
by the boom and bust of genes required for cell division [151]. In differentiating T cells
upwards of 50% of non-technical noise can be accounted for by cell cycle variations
[151]. The techniques to remove this noise can also be expanded to account for
other sources of noise, allowing us to extract and classify noise sources, validate them
biologically, then remove them from the data source. A large pool of homogenized
data will also allow us to replicate algorithmic experiments from other groups to
examine whether they function generally, or only in special cases, and similarly to
develop algorithms that generalize well across data sets from different labs.
51
![Page 72: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/72.jpg)
Chapter 3
NANOPORE READ SIMULATION
Nanopores represent the first commercial technology in decades to present a
significantly different technique for DNA sequencing, and one of the first technologies
to propose direct RNA sequencing. Despite significant differences with previous
sequencing technologies, read simulators to date make similar assumptions with
respect to error profiles and their analysis, resulting in incorrect characterization
of nanopore error. This is a great disservice to both nanopore sequencing and to
computer scientists who seek to optimize their tools for the platform. Previous works
have discussed the occurrence of some bias in the identifiability of certain k-mers, but
this discussion has been focused on homopolymers, leaving unanswered the question
of whether k-mer bias exists over all k-mers, the strength of the bias, how it occurs,
and what can be done to reduce the effects. In this work, we demonstrate that current
read simulators fail to accurately represent k-mer error distributions. We explore the
sources of k-mer bias in nanopore basecalls, and we present a model for predicting
k-mers that are difficult to identify. We also propose SNaReSim, a new state-of-the-art
simulator, and demonstrate that it provides higher accuracy with respect to 6-mer
accuracy biases.
3.1 Introduction
DNA sequencing has become an integral component of biological research, with ap-
plications ranging from gene network or organism identification through biological
engineering. This is accomplished by reading analog nucleotides and representing
52
![Page 73: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/73.jpg)
them as digital sequences of bases that can be further analyzed downstream. Histor-
ically, sequencing has been dominated by synthesis-based approaches involving the
replication of an existing DNA or cDNA molecule, with the fluorescent nucleotides,
providing a stepwise 4-dimensional color space readout of the DNA sequence. This
approach provides high accuracy through amplification of the input sequence, but
these amplification steps obfuscate modifications on the DNA, and lead to uneven
representation of input sequences, largely as a function of GC content [152]. A novel
approach known as nanopore sequencing allows for the identification of an input
sequence by measuring electrical current signals as the DNA strand passes through a
nanopore. This eschews the pre-amplification step, reducing sequence-selection bias,
and allowing for the detection of modifications on the raw DNA strands.
Nanopore data introduces novel challenges when compared with previous genera-
tions of sequencers. First, nanopore reads provide significantly larger mean lengths
and higher error rates than synthesis-based approaches (10,000 bases at 80% accuracy,
compared with 500 bases at 99.9% accuracy) [153]. Additionally, synthesis-based
approaches tend to have an error profile that is not heavily biased by the underlying
DNA sequence, while nanopore errors are strongly connected to the DNA sequence be-
ing read [152], [154]. This bias is unaccounted for in most simulators and downstream
analysis tools, and lacks a published model, but it has significant implications. For
most downstream analysis, sequences that are read in must be aligned to a reference
genome, or to another sequence. Because sequence alignment algorithms are O(N2),
and most genomes are immense, almost all current generation sequence aligners rely
on an indexed set of perfect or nearly perfect “seed“ sequences to propose candidates
for an alignment, where a seed is 6-30 nucleotides that must match on each sequence.
Because of observations on the DNA sequencing data when these aligners were writ-
53
![Page 74: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/74.jpg)
ten, seeds are typically treated as either completely equal, or as a function of “seed”
uniqueness within a genome [155]–[158]. Similarly, after matching a “seed“ between
two sequences the alignment is extended utilizing a cost function for matches and
mismatches, but these cost functions also do not incorporate the varied probability of
error within across a DNA sequence. Furthermore, the ability of the community to
address these issues has be hampered by a lack of tools capable of simulating these
results, and a model to explain possible sources of this bias.
Biological data is often difficult to obtain, requiring significant time and reagents
to culture, prepare, and process experimental samples. Because of this it is beneficial
to run in silico trials of their experiments before attempting a “real” run. Because
these in silico results are being used as a proxy for real data it is similarly desireable
to minimize the differences between in silico results and experimental results. To
this end simulators have been made for most sequencing technologies, incorporating
various facets and behaviors intrinsic to technology. Because nanopore sequencers are
novel in their approach to reading DNA there many of the assumptions inherited from
previous simulators are no longer valid, and while some invalid assumptions have been
identified there are more that remain [159]–[161].
We demonstrate that the specific sequence under investigation strongly contributes
to the local probability of observing an error. While previous studies have observed
that homopolymers are not identified correctly we extend this to show that it is
more extensive than the previously observed homopolymer bias, and that current
generation simulators are unable to properly model this variation in error probability.
We propose a read simulator based on a Markov chain, with observations modified
by the underlying seuqence and demonstrate that it provides a higher fidelity to the
error distribution, which fit from real data. To accomplish this task, we identify the
54
![Page 75: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/75.jpg)
accuracy of individual k-mers and their behavior in different contexts. We also predict
key features of the error bias, and calculate their individual contributions to the final
error. Our simulator is fully automated, allowing for both in silico amplification of
read data, or simulation of data on completely novel genomes. Our contributions are
summarized below:
• We demonstrate that while some base calling error is random there is a significant
component that is systematic and unaccounted for (section 3.4.1).
• We develop a set of features for predicting k-mer accuracy and show that our
model generally correlates well with data found empirically (section 3.4.2).
• We propose an algorithm for simulating reads, a variant of the Hidden Markov
Model (HMM) employed by Nanosim [159], with modifications to apply observed
k-mer bias. We demonstrate that it models the k-mer error distribution better
than other popular read simulators [159]–[161] (section 3.4.3).
3.2 Related Work
Cost, throughput, and accuracy have been major hindrances in DNA sequencing.
With the development of NGS technologies cost has reduced while throughput and
accuracy continue to climb. Still, simulators offer a significant benefit due to their
low cost and exceptionally high throughput, allowing testing while developing new
algorithms. Simulators aim to produce sequences with the most fidelity possible for
their given platform. As such, simulated reads should account for biological and
technical bias [162].
Simulators typically generate synthetic reads by extracting a sequence from a
reference genome and then introducing errors into that sequence. Parameters required
by simulators to introduce these features into the sequence are either provided at run
55
![Page 76: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/76.jpg)
time or are stored inside a metadata called a model or error profile. By analyzing the
alignment of empirical data to a reference genome error profiles are created. Errors
have been generated by first predicting a quality score [163], by base position within
a simulated read [164], or by predicting an error sequence and then applying it to a
read [159], [160], [165].
Modeling of third generation single-molecule reads has many advantages when
compared with second generation sequencers. Polymerase chain reaction (PCR),
required by second generation sequencers during the pre-amplification step, introduces
significant bias, but is not required for third generation sequencers [166]. GC bias, a
secondary effect of the PCR amplification step, resulting in low base accuracy and
high coverage variability, is also removed [167]. Third generation sequencing on the
other hand has new challenges that must be modeled. Simulators for third generation
sequencers must deal with longer reads containing significantly larger stretches of
errors, and a biased error that varies with nearby nucleotides. There are two main
platforms for third generation sequencing, PacBio’s SMRT sequencing, and Oxford
Nanopore’s nanopore sequencing. Each platform has their own read simulators, but
none are unable to properly model the k-mer bias of nanopore sequencing data.
NanoSim [159] is a read simulator for ONT data, modeling reads as the result
of a HMM. The model is fit by aligning empirical reads to a reference genome, then
collecting a list of error subtypes, lengths, and transition probabilities. Reads can
be simulated by sampling a length, then generating a sequence of errors. One major
limitation of this approach is that it is unable to properly model k-mer bias; this
is somewhat overcome by a post-processing step where all homopolymers of length
greater than 5 (e.g. “AAAAAAA“) are compressed to a length of 5.
LongISLND [161] is a read simulator developed for PacBio data. It models k-mer
56
![Page 77: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/77.jpg)
bias explicitly and directly by storing observed mutations for each k-mer it sees,
stored as a key-value pair. It also uses an Extended K-mer (EKmer) model to aid
in homopolymer error generation. An EKmer is a regular k-mer followed by an
integer representing a length of homopolymer covering the middle term in the k-mer
facilitating the simulation of arbitrary stretches of homopolymer without explicitly
observing them. One major limitation of LongISLND is that while is is able to
approximate the k-mer bias it lacks the accuracy level observed in true ONT data.
This is likely due to ONT data having grouped errors (i.e. the probability of error
given an error exists nearby is higher than normal) that are not properly modeled
using their simulator.
3.3 Problem Description
DNA has a label space of Σ ∈ {A,C, T,G} representing the different nucleotide bases.
Let s ∈ Σn be a DNA strand of length n. Using Nanopore technology, a series of
discrete measurements δ is generated, representing the current that passes through
the nanopore at each time step from time t0 to tT , where tT reflects the total time
required for s to completely transit the Nanopore. This creates a corresponding vector
∆ ∈ Rd, where d >> n is the number of discrete measurements obtained from time t0
to tT .
The vector ∆ is subsequently binned into q different bins, B = b1, b2...bq, using
different time intervals that capture ≈ dnmeasurements per bin. The mean µi and
standard deviation σi are subsequently derived for each bin bi. An approximation
strand, s, of the original strand s is reconstructed or “base called” using the sequence
of (µi, sigmai) pairs using either a Recurrent Neural Network [168], or a HMM [153],
57
![Page 78: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/78.jpg)
(A) (B)
(C) (D)
Figure 12. The accuracies of all 6-mers with accuracy on the vertical axis, and akmer index on the horizontal axis. (A) All experiments sorted with respect to theiraccuracy. (B) Experiments sorted with respect to the Escherichia coli R9 2Dexperiment. (C) The R9 E. coli template experiment divided randomly into 5 groups,sorted with respect to the first group. (D) Only the R9 E. coli template experiment,aligned with both Graphmap and BWA-MEM, sorted with respect to BWA-MEM.
[169] in conjunction with a lookup table from the Nanopore manufacturer that specifies
the most probable k-mer base pair for the given µi value. Using the R9 pore from
Oxford Nanopore µi is best explained by a sequence of 6 DNA bases, thus the k-mer
length k = 6 is used in later analysis.
58
![Page 79: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/79.jpg)
3.3.1 Error Source Identification
Naively, one may assume that error is distributed evenly over s, however it is trivial to
see that it is not the case. The expected mean current µ exists on a linear range, thus
k-mers near the minimum current are not as influenced by δ readings lower than their
expected mean, with the opposite true for readings near the top of the range. Second,
if the density range of the k-mer currents is uneven, more error is to be expected in
high density regions [154]. Third, some sequences of k-mers are easily identifiable
due to large changes in current, while others are difficult to identify due to small
changes [154].
To elucidate the drivers of k-mer bias we first calculate a list of k-mer accuracies K,
and propose a set of features F , where each feature fi confers a nonnegative accuracy
to each k-mer. Error cannot be negative, thus we can approximate the influence of
each feature by solving Equation 3.1.
arg minx
‖Fx−K‖2
subject to x ≥ 0
(3.1)
Where x is a vector indicating the contribution of each error feature F. Fx is then
the best approximation of the original k-mer bias with respect to euclidean distance.
3.3.2 Error Features
Error during sequencing can occur in multiple locations, and from multiple sources.
Some of these sources of error are recoverable because of constraints enforced by the
sequential relationship of the bins B, i.e. that each bin predicts a k-mer for itself,
but also implies limitations on adjacent k-mers. Other error sources are significantly
59
![Page 80: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/80.jpg)
more difficult to recover from, such is the case. To this end we generated features to
capture specific subtypes of errors.
Sequence Identifiability refers to the ability to easily differentiate one sequence of
k-mers from another given a sequence of bins. These kinds of errors can arise if there
is a shift in the current that persists for multiple bins, or a drift between recalibration
phases. This can be approximated by calculating the number of sequences of length L
with a manhattan distance lower than a provided threshold [154]. We also analyze
the local density of the pore model, as highly dense regions reduce the probability
that a bin’s mean value implicates a specific k-mer.
Transition Identifiability Some transitions between k-mers result in an expected
current change smaller than the standard deviation of either nucleotide. As a result,
some of these changes are not captured successfully by the event detection algorithm,
or in an attempt to correct this behavior there may be multiple bins that should only
be a single bin. In nanopore base callers these gaps or duplications are known as
“skips“ or “stays”. To predict this, for each k-mer, we calculate the difference in mean
current value between the k-mer under inspection and all immediately previous and
subsequent k-mers. We also calculate the number of steps required to return to the
same k-mer, and the difference in current that should have been observed across those
steps.
3.3.3 Read Simulation
Read simulation occurs in 2 phases: model fitting, then simulating reads from an
input genome. Fitting the model requires existing reads to be aligned to a reference
genome, alignment is performed using BWA-MEM [156] with standard options. From
each read, the top alignment is selected, and unaligned reads are discarded. From the
60
![Page 81: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/81.jpg)
remaining reads, parameters are extracted with respect to average error length for the
error types {Insertion, Deletion, Mismatch}, transition probabilities, and the accuracy
of all k-mers of length 6. The inverse of the k-mer accuracy is also used as a “cost”
of correctly predicting a k-mer. These parameters are then stored for downstream
simulation.
Reads are simulated according to a HMM which generates transitions between
correct and erroneous stretches, and the lengths of those stretches. This approach has
the benefit of easy explainability, but has limitations in that it is unable to correctly
model k-mer accuracy distribution. Moreover, this shortfall is difficult to overcome, as
modeling k-mer accuracies as states would require an intractable number of parameters.
To remedy this situation we allow errors and error lengths to be generated according
to the HMM, but we adjust the lengths by using a cost function described above in a
process detailed in Algorithm 1.
As a second approach we introduce read simulation as an optimization problem
where we attempt to minimize the differences between X ∈ Rn approximated by
X ∈ Rn, where each x is a parameter, including accuracies of each of the 6-mers,
transition probabilities between each error subtype, and average correct read length.
This is formally specified in Equation (3.2).
minimize{‖X− X‖2} (3.2)
Minimization is achieved by monte carlo simulation, iteratively proposing errors to
the simulated sequence until the error difference reaches a threshold, or an empirically
determined number of mutations has been rejected. The overall algorithm is detailed
in Algorithm 2.
61
![Page 82: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/82.jpg)
Data: read extracted from sample genome
Result: mutated read
while readPosition < readlength do
scaleFactor = mean k-mer cost around readPosition;
isError = random ∗ scaleFactor < .5;
if isError then
errorType = transition probability from HMM;
end
budget = length from model based on errorType;
while cost < budget do
cost += cost of k-mer at readPosition + i;
i += 1;
end
if isError thensave the mutation to a mutation list
end
readPosition += i;
endAlgorithm 1: Generation of mutation list for sampled read using the k-mer biased
HMM model.
62
![Page 83: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/83.jpg)
Data: read extracted from sample genome
Result: mutated read
while ‖X − X‖2 > minDifference AND numReject < maxNumReject do
errorPosition = uniformRandom;
errorType = random based on parameters;
errorLength = length from model based on errorType;
tempX = X with error applied;
if ‖X − tempX‖2 < ‖X − X‖2 then
X = tempX;
else
numReject += 1;
end
endAlgorithm 2: Random introduction of errors to achieve minimum difference.
3.4 Results
3.4.1 Modeling K-mer Bias
Before attempting to model k-mer bias in real data is it important to understand the
granularity at which it occurs, and the consistency. To this end we examined 2 public
datasets 1 2 to determine whether bias was present, and to what extent. We show
that while there is a consistent k-mer bias within experiments, there are significant
differences between the R7 and R9 pores in Figure 12. We attribute the majority
1http://lab.loman.net/2016/07/30/nanopore-r9-data-release/
2http://lab.loman.net/2014/10/01/where-can-i-get-oxford-nanopore-miniontm-data-from/
63
![Page 84: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/84.jpg)
(A) (B)
Figure 13. A plot of the non-negative least squares (nnls) fit of the predictedfeatures compared to the actual distribution from the R9 E. coli 2D experiment,sorted by the true experiment. (A) All model feautures and the true accuracy ofneighboring k-mers as a feature. (B) Only the features that are fully predictable fromthe model, as described in bias souces table.
Bias Source Example Low Example High ContributionTransition Identifiability AAAAAA ATATAT 64.6%Sequence Identifiability CTGTCA TATATT 3.32%Standard Deviation CTAGAG TTGAAA 1.79%
Fraction A CGTCGT AAAAAA 9.73%Fraction T CGACGA TTTTTT 2.90%Fraction C AGTAGT CCCCCC 9.33%Fraction G CATCAT GGGGGG 7.03%
Table 2. Identified error sources and their contribution to k-mer overall accuracy withrespect to the bias source
of the differences to the pores themselves, but some influence also likely comes from
different versions of ONT basecaller used on each dataset.
64
![Page 85: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/85.jpg)
3.4.2 Identifying Bias Sources
After identifying the presence of k-mer bias we attempted to elucidate the source of
the bias. To this end we generated a set of features with each providing a score for
each k-mer, and we attempted to find a linear combination of each feature capable of
explaining the bias, and providing a fractional contribution of each step. As shown
in Table 2 multiple error sources influence the accuracy fraction of each k-mer. We
find that the strongest feature for predicting the accuracy is the median accuracy of
neighbors at one step away. While this is not directly helpful, it does suggest that
direct modeling of error sources must incorporate neighbor accuracy aspects. Overall
we find that our predictive model provides a good first attempt, providing some insight,
but leaving much of the error signal uncaptured, as illustrated in Figure 13. We expect
that much of the remaining error signal is a result of missing features that may be
identified through a more thorough analysis. We also expect that some of the error
signal cannot be captured through linear combinations. For example, the original
basecallers were incapable of capturing homopolymers with a length greater than 5,
using linear combinations the k-mer “AAAAAA” would need to have 0 accuracy across
all features, an unrealistic expectation.
3.4.3 Simulation Results
To validate our simulation results we generated read profiles for Nanosim [159],
PBSim [160], LONGIslnd [161], and the two simulators we have proposed. Nanopore
sequencing experiments typically yield between 10,000 and 20,000 reads [159], with
measured statistics being approximately identical at even 20% of this size as shown in
Figure 12 (C). To measure this effect in simulations, we generated 2 data sets with
65
![Page 86: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/86.jpg)
(A) (B)
(C) (D)
Figure 14. Comparison of 3 previously published sequencing simulators with the newmodel we proposed. Fit of R9 pore model k-mer accuracies sorted internally (A), andby 2D E. coli, the training model (B). Fit of R7 pore model k-mer accuracies sortedinternally (C), and by 2D E. coli, the training model (D).
each simulator: one with 4,000 reads, and one with 20,000 reads. These reads were
then aligned using BWA-MEM [156] with standard parameters. The sum of squares
error is the sum of the difference between the model 6-mer accuracy and the simulated
k-mer accuracy, for all 6-mers. Results of both large and small simulations are shown
in Table 3, while the 20,000 read simulations are shown in Figure 14. Ultimately we
66
![Page 87: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/87.jpg)
Source Sample Size SSE
Split R9 6,481 0.121
LongISLND R9 20,000 167.04
LongISLND R9 4,000 167.80
Nanosim R9 20,000 44.26
Nanosim R9 4,000 46.76
SNRG R9 20,000 9.52
SNRG R9 4,000 9.66
LongISLND R7 20,000 88.78
LongISLND R7 4,000 88.96
Nanosim R7 20,000 30.76
Nanosim R7 4,000 36.16
SNRG R7 20,000 5.24
SNRG R7 4,000 5.48
Table 3. Sum of squares error (SSE) between k-mer simulators and their respectivetraining sets, and between the fragmented R9 data set and the complete R9 data set
find that sample size has very little influence on existing simulators, as the k-mer
bias is either completely un-modeled, or has systematic difficulties capturing the
k-mer error rate (LongISLND). In our simulators we find some improvement through
increased sample size, though most of the remaining difference does appear to be from
systematic errors.
3.4.4 Simulator Complexity
In addition to validity, the performance of a simulator is also of great importance. If
reads can be simulated properly but with an excessive requirement on the processing
power or memory the simulation is of little value. To this end we benchmarked
67
![Page 88: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/88.jpg)
Genome (MB) Sample Size Cores max RSS (MB) Runtime (h:m:s)
E. Coli (4.64) 2000 4 659.51 0:48:20
E. Coli (4.64) 2000 1 532.87 2:18:41
E. Coli (4.64) 10000 4 1,105.60 3:33:01
E. Coli (4.64) 10000 1 636.41 12:36:30
E. Coli (4.64) 20000 4 1,666.17 6:58:18
E. Coli (4.64) 20000 1 810.46 22:53:12
Saccharomyces (12.1) 2000 4 670.12 0:51:37
Saccharomyces (12.1) 2000 1 558.66 2:44:32
Saccharomyces (12.1) 10000 4 2,040.84 3:57:08
Saccharomyces (12.1) 10000 1 560.92 17:11:49
Saccharomyces (12.1) 20000 4 1023.23 12:29:48
Saccharomyces (12.1) 20000 1 556.47 31:38:11
Table 4. memory and wall-clock time consumption of SNaReSim under variousconditions
the simulator on an 8-core Ubuntu system with 32GB of RAM under 3 different
genome sizes and experiment sizes. The 2 genomes are chosen to be E. Coli, and
Saccharomyces Cerevisae, as each are model organisms, with at an order of magnitude
of size difference one magnitude of difference in size. The sizes chosen are 2,000, 10,000,
and 20,000, to represent simulation of small, medium, and very large experimental
runs. data was collected by running the simulator through ’time -v’, where each result
was tracked for speed and memory consumption. In order to improve results, each
test was conducted 5 times with only the median reported. The results are shown in
table 4.
The results illustrate a few strengths, and a few weaknesses, with many weaknesses
arriving from inefficient Python optimizations. Standard Python implementations for
68
![Page 89: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/89.jpg)
example (CPython) contain a Global Interpreter Lock (GIL), meaning that threading
where global variables are shared between threads is not capable of utilizing more than
one CPU core completely. To this end many projects are implemented using processes,
where each global variable is replicated and communication must occur using pipes,
or a block of data that will be returned by the process. SNaReSim implements data
passing in the second way, largely because of implementation simplicities, but this
results in a significant memory waste observed as the difference between the 4-core
and 1-core versions. When running in a single-thread mode simulated reads are
immediately dumped into an open file descriptor, meaning that the resident set size is
able to be quite small. The Multi-core implementation on the other hand must pass
the data blocks back to the head process before writing to the file. This difference in
memory consumption could be largely resolved by having child processes write to a
queue, and having the parent process constantly write the queue contents into a file.
3.5 Conclusion
We have demonstrated the presence of biased k-mer accuracy within Oxford Nanopore
Technologies sequencing platform. We show that this bias is consistent within ex-
periments, and between sequence aligners, but varies between pore models. This
information is of significant value to sequence aligners, which typically use in index of
seed sequences to begin alignment, in order to reduce the search space and increase
speed. Due to the bias in k-mer accuracy we show that some seed sequences are
virtually impossible to correctly identify in sequenced samples, while other seeds have
an accuracy significantly greater than the mean.
We demonstrate that this k-mer bias has some observable and predictable founda-
tion in the mean current readings of each k-mer(i.e. the manufacturers pore model),
69
![Page 90: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/90.jpg)
although significant progress remains to be made. This provides a way to estimate the
read accuracy for a provided pore model without performing large sequencing runs; it
also suggests that a model-guided approach for nanopore design could improve overall
accuracy. Alternatively it could allow for the development of specific pores, where
accurate discrimination of some k-mer subtypes is more important than others.
Finally we propose a novel nanopore read simulator capable of modeling the k-mer
accuracy bias observed in this experiment. We demonstrate that our simulator has
a significantly reduced sum of squares error with respect to 6-mer accuracy when
compared with other third generation sequencing simulators. This provides a more
realistic benchmark for sequence aligners and genome assemblers to compare against.
The availability of high accuracy reads allows for the exploration of new applications,
including; sequencing of larger organisms, organism disambiguation when sequencing
a population, exact sequence detection in diploid and polyploid organisms, and the
ability to scaffold genomes across exceptionally long repeat regions. Providing an
understanding of accuracy in nanopore design, and development of tools to aid the
alignment of the produced reads is thus critical to continued progress.
70
![Page 91: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/91.jpg)
Chapter 4
READ CORRECTION
Nanopore sequencing has introduced the ability to sequence long stretches of
DNA, enabling the resolution of repeating segments, or paired SNPs across long
stretches of DNA. Unfortunately significant error rates >15%, introduced through
systematic and random noise inhibit downstream analysis. We propose a novel method,
using unsupervised learning, to correct biologically amplified reads before downstream
analysis proceeds. We also demonstrate that our method has performance comparable
to existing techniques without limiting the detection of repeats, or the length of the
input sequence.
4.1 Introduction
DNA sequencing has become a critical part of most biological research, in tasks ranging
from gene network identification, to biological engineering, to organism identification
and the generation of phylogenies. Most of these applications have 2 primary foci:
the length of DNA sequence reads, and the per-base error in those reads. Third
generation sequencing technologies presented by Oxford Nanopore Technologies(ONT)
and PacBio offer read lengths 10-100x what was possible with previous sequencing
technologies, but at a per-base read accuracy near 85%, down from 99.99% using
second generation technologies. While the increased read length enables many new
biological applications [170], [171] the low read accuracy hinders others. Extensive
research has gone into developing techniques that can robustly improve Nanopore
read accuracy and analyzing the trade-offs [153], [172], [173].
71
![Page 92: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/92.jpg)
We propose an approach to correcting errors that does not make assumptions
about the presence of a reference genome and is robust to mixed biological populations
where reads with significant overlap must be identified as different. To accomplish this
task multiple copies of the same original DNA sequence must be read, resulting in a
large number reads, of which a small number are high-error copies of one another. We
then use the K-means algorithm to cluster reads based on an engineered feature space.
This technique does not require replicated reads to be biologically attached [174],
meaning that the individual sequence lengths are not constrained. Additionally our
approach groups reads with their own replicates, removing the need for a reference
genome. Our contributions are summarized below:
• We demonstrate that while some base calling error is random there is a significant
component that is systematic and can be modeled. in section 4.4.1 we present a
simple model for predicting easily identifiable k-mers and show that our model
correlates well with data found empirically.
• We propose a feature space based on the easily identifiable k-mers that accu-
rately groups copies of unique DNA strands. In section 4.5.1, we demonstrate
comparable clustering performance to MinHash [175] with a number of simulated
reads similar to biological experiments.
• We demonstrate that a simple K-means approach can replace a more complex
biological process to group unique DNA strands together.
• The proposed method allows for the analysis of larger DNA strands compared
to INC-Seq [174].
72
![Page 93: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/93.jpg)
GGCACGTTAC
GGCACGTTAC
GGCACGTTAC
(b) nanopore sequencing(a) replicated reads (c) base call
GGTCAGTTAC
GATCAGTTAC
GGGTCACGTTAC
. . .
. . .
. . .
Figure 15. (A) Strands of DNA bases are added to the nanopore sequencer as an
analog data input (B) As strands pass through the nanopores electrical current
readings are taken. These readings are consequently grouped into “events”, with each
event approximately representing a k-mer (sequence of bases). These events may be
improperly split, events may have a mean value higher or lower than expected for a
DNA k-mer, and event lengths may be different. (C) A base caller uses the sequence
of events to predict the original sequence of bases.
4.2 Related Works
Given the importance of sequence read accuracy, a variety of methods have been
proposed for increasing read accuracy. These approaches centered around either
detecting overlapping regions in genomic alignments to provide read corrections [153],
[157], [176], or correcting reads before alignment [172], [174], [177], [178].
Read overlapping [157], and correction from genomic alignments [153], [176] has
shown significant potential for read correction. These techniques have shown the
ability to increase sequence accuracy into the high 90%’s, with relatively minimal
requirements [153]. One major shortcoming is that they are limited by error rate
and genomic similarity, with large genomes or high error rates resulting in excessive
correction times or erroneous results [176]. Some of these shortcomings can be softened
73
![Page 94: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/94.jpg)
by the presence of a reasonable reference genome, as the error rate for read-genome
alignments is approximately half of that seen in read-read alignments.
Read pre-correction has gained significant steam recently, and particular advantages
have been seen with small genomes which contain more than 10x even within a single
sequencing run. Early attempts at pre-correction focused on aligning high accuracy
second generation reads to long nanopore reads [172], [178]. This provides the possibil-
ity to resolve many features but introduces systematic error within repeating regions
of the reads. More recent attempts have focused on nanopore reads exclusively [174],
trading a significantly reduced read length for increased read accuracy. To a large
extent both of these correction efforts are unacceptable. Introducing a systematic bias
into sequencing results prevents the detection of mutations in sequencing experiments,
or leads to incorrect references during de novo genome constructions [172], [177].
Decreasing the relative read length is similarly detrimental in that much of the benefit
of 3rd generation sequencing is consequently lost.
After correction reads typically must be mapped to a reference to be used. To aid
in this mapping high-information sequences have been explored. Although because
previous technologies yielded a very high level of accuracy ’high-information’ is taken to
mean rare k-mers, ones that appear less frequently than random chance would suggest.
Focusing on these highly informative subsequences has allowed for significantly faster
sequence matching [175], sequence pseudoalignment [179], or full alignment [158],
[180]. Unlike second generation sequencing, nanopore has significantly higher error,
and it is introduced as a biased error in readings. Because of this, the definition of
high-information subsequences can be stretched to incorporate the probability of being
able to correctly identify a k-mer.
74
![Page 95: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/95.jpg)
4.3 Problem Description
DNA has a label space of Σ ∈ {A,C, T,G} representing the different nucleotide bases.
Let s ∈ Σn be a DNA strand of length n. Using Nanopore technology, a series of
discrete measurements δ is generated that represents the change in electrical current as
each base passes through the Nanopore from time t0 to tT , where tT reflects the total
time required for s to completely transit the Nanopore. This creates a corresponding
vector ∆ ∈ Rd, where d >> n (d ≈ n ∗ 12 ) is the number of discrete measurements
obtained from time t0 to tT .
The vector ∆ is subsequently binned into q(≈ n) different bins, B = b1, b2...bq,
using different time intervals that capture ≈ dqmeasurements per bin. The mean µi,
and standard deviation σi are subsequently derived for each bin bi. An approximation
strand, s, of the original strand s is approximated from or “base called“ using µi in
conjunction with a lookup table from the Nanopore manufacturer that specifies the
most probable k-mer base pair for the given µi value. Using the R9 pore from Oxford
Nanopore µi is best explained by a sequence of 6 DNA bases, thus the k-mer length
k = 6 is used. Thus, a function that uses manufacturer provided information will
yield a 6-mer f(µi, σi) = [A|T |C|G]6. Unfortunately, the approximation s obtained
from this function has > 15% [174] median error rate.
Let each approximated DNA strand be treated as a single data point {s1, s2, ...sv},
where v is the total number of strands being analyzed. Let there also be GGG =
G1, G2, ...Gp different groups associated with p unique strands. If copies are made of
each strand such that v >> p, we would like to group each strand with its respective
copies. The problem is formally defined as:
arg minGGG
p∑k=1
∑φ(s)∈Gk
||φ(s)− ck||2, (4.1)
75
![Page 96: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/96.jpg)
where φ(·) represents the transformation of a DNA strand to a descriptive feature
space. ck is the centroid of group Gk defined as
ck =1
|Gk|∑
φ(s)∈Gk
φ(s), (4.2)
which is the mean of group k in the transformed feature space. Therefore, selecting an
appropriate feature space transformation is vital to clustering the copies together. The
objective becomes finding a feature space transformation that minimizes the distance
of the centroid and its members, such that the centroid is accurately representative of
each unique strand.
4.4 Proposed Method
4.4.1 High-Accuracy K-mers
One key step in clustering DNA sequences is identifying reliable features to generate a
“thumb print” for each DNA strand. This thumb print allows copies of each respective
DNA strand to be grouped together with high accuracy so that subsequent sequencing
corrections can be performed. In order to find the most reliable thumb print, we
analyze the transition behavior of the DNA bases passing through a Nanopore. Each
6-mer DNA sequence will have six different transitions reflecting the transition of
the sequence, one base at a time, through the pore. These transitions are reflected
by the change in the previously defined µ from time ti to ti+1. Some sequences and
transitions have been previously identified as being difficult to base call correctly [181].
The reference measurements provided by the nanopore manufacturer provide the mean
and standard deviation associated with what one should expect to observe given a
particular 6-mer DNA sequence. A state transition table is subsequently derived for
76
![Page 97: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/97.jpg)
(76.6, 109.1,92.4)
(79.5, 97.1)
(83.5, 83.5)
(83.5, 81.1)
(80.7, 98.9)
AAAAAAA
AAAAAAG
TTGTATG
AAACCTT
k-mers meancurrents
visualizedcurrents
TTGTCACG
(A) (B)
Figure 16. (A) DNA sequences can be interpreted as a sequence of k-mers, each
k-mer can be replaced by its expected mean current to provide a mapping from DNA
sequences into a high-dimensional current space. (B) Clusters of 2-dimensional points
(sequence length 7), clusters are found using DBSCAN with neighborhood size 1 and
epsilon 2
each K-mer DNA base. Each transition has an associated transition value τ that
reflects the change in value from one K-mer to the next.
We used DBSCAN [182], a density-based clustering approach, to identify the most
unique τ patterns. The DBSCAN algorithm clusters these transitions based on density
and their distance to one another. Clusters that are generated can be interpreted as
sequences of k-mers which are easy to confuse with other members of the cluster, but
easy to differentiate from other sequences. As a result, those clusters that tend to
be smaller and more isolated reflect transition sequences that are more unique and
less likely to be confused with other sequences; the more unique a sequence is, the
more likely it is to be base called correctly. Because of these requirements we set the
DBSCAN required density to 1 (i.e. a K-mer should be treated as a cluster even if
77
![Page 98: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/98.jpg)
it has no similar k-mers), and we varied the DBSCAN epsilon(neighbor euclidean
distance requirement) between .5 and 5; empirically we found that an epsilon of 2
(approximately the median standard deviations provided by the nanopore the model)
with k-mers of length 10 (4 transitions) gave us a sufficient number of high-quality
clusters. To determine the features used we selected the 500 smallest clusters, and
selected the longest common subsequence within the cluster. These k-mers were
then treated as high-accuracy k-mers. A virtual DNA thumb print that is capable of
clustering each unique strand with its respective copies was then generated using the
high-accuracy k-mers.
The results-guided search allowed us to verify and refine our model predictions
through the analysis of real-world data. To this end we gathered E. Coli K12 data
previously made public by Nick Loman’s lab {http://lab.loman.net/2016/07/30/
nanopore-r9-data-release/}. These reads were then aligned to the genome using
Graphmap [155], while discarding unmapped reads. From the resulting alignment
files the segments of exactly matching sequence were extracted, and all k-mers (for
k=3-10) within them were extracted and counted. These counts were then divided
by the true counts, defined by the regions on the reference genome matched by the
individual reads. This ratio provides recall for all k-mers, precision is calculated by
selecting all k-mers in the same range present in the reads that were not eliminated.
We find a high concordance in the k-mers identified by the model-guided and
results-guided approaches, with a significant portion of the disagreement arising from
subsequences with low change between dimensions as demonstrated in the first example
of 16(A), where low current change between k-mers results in a higher probability of
event caller error that is not accounted for in the model.
78
![Page 99: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/99.jpg)
4.4.2 Utilization of K-means
After DNA sequences have been reduced to their thumb print, the K-means algorithm
was employed over the feature space to cluster DNA copies. K-means is a robust
clustering algorithm that minimizes the objective function previously defined in
Equation 4.1 while allowing us to define the number of clusters. The number of
unique strands define how many clusters the K-means algorithm should find. With
simulated results the exact number of unique strands is known, and future biological
experiments will provide the same information (through a colony count). We find
though that we are able to recover the error-corrected original sequences even in cases
where the number of clusters is modestly overpredicted or underpredicted, which is of
great value when the counts may not be exact due to possible sample contamination.
K-means was employed in our system using the L2 norm within the reduced
subspace created by our DNA thumb print. One limitation of K-means is that it is
only guaranteed to find a local minimum; to remedy this we perform 10 runs and
select the run with the lowest sum of squared error (SSE). Aside from being a generic
choice in K-means clustering results we find that SSE has a strong negative correlation
with cluster purity on our simulated data. Finally, our implementation was employed
using individual points as initial clusters; We find that this significantly removes the
probability that a cluster will be empty. Additionally any cluster with less than 5
reads is treated as empty and ignored, as there is no chance of reads within being
corrected to an acceptable level.
4.4.3 Synthetic Data Generation
Current read simulators [159], [160] are incapable of properly modeling k-mer accuracy
distributions identified here. nonetheless due to the low cost, high availability, and
79
![Page 100: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/100.jpg)
Organism Simulator number of MHAP K-Means
reads purity purity
E. Coli NanoSim 2,000 98.2% 96.6%
E. Coli NanoSim 20,000 77.2% 75.9%
Table 5. Comparison of MHAP and K-means clusters over the engineered featurespace, the number of K-means clusters was set to the number of componentsidentified by MHAP
relatively high fidelity it is still valuable to validate computational techniques on
simulated data. To this end we generated synthetic reads using Nanosim [159]. Each
unique read was replicated according to a Poisson random value with an expected
value of 50, this was empirically chosen to ensure (>99%) that at least 30 replicates
were generated as we determined was required for read correction. Each replicate
originates with the same sequence, but errors are calculated and applied independently.
In total we generated two data sets; one with 2,000 (40 unique) synthetic reads 20,000
synthetic reads (414 unique) and performed downstream experiments on this set.
4.5 Results
4.5.1 Clustering on Synthetic Data
There are not currently any standard tools for identifying replicated reads without a
reference genome. Most existing tools were generated for 2nd generation sequencing
technologies where reads are already high accuracy, and duplicated reads provide
no additional information, and as such they are removed after alignment [183]. In
this regard the closest tool we could find is the popular read overlapper MinHash
Alignment Process (MHAP) [175] used for de novo genome construction. As a metric
of comparison we use cluster purity as shown in figure 5. To generate the following
80
![Page 101: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/101.jpg)
data we ran MHAP using default parameters, from the output we removed any links
with less than 70% overlap, and counted connected components as clusters. In small
data sets our performance is good, but less than MHAP due to MHAP performing an
alignment post-processing step; in this way the full reads can be used allowing for the
identification of many false positives. In larger data sets there are larger numbers of
more significant overlaps that come from distinct reads.
Despite the utilization of cluster purity as a metric, the ultimate goal is to
generate accurate reads to be used for downstream processing. To this end we utilized
PBDAGCON [177] to create consensus sequences from aligning reads within each
cluster. We find that in the smaller E. Coli data set we are able to accurately recover
each of the 40 unique reads with a mean read accuracy of 97%, up from an individual
mean read accuracy of 81%. In the larger data set we are able to recover 408 of 414
individual sequences, again with a mean accuracy of 97%. On inspection most of
the failing reads are caused by a division of read copies between multiple clusters.
Interestingly, while some sequences can be base called with only 15 copies, depending
on the error distribution the consensus caller can fail with as many as 30 copies,
indicating that biological sequencing experiments should aim to have more than 30
copies for read correction.
4.6 Conclusion
We have proposed a novel unsupervised learning technique capable of significantly
increasing accuracy of nanopore base calls. We demonstrate that our technique
provides significant advantages over current techniques by providing similar levels of
accuracy while removing read length limitations imposed by previous techniques. We
81
![Page 102: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/102.jpg)
tested our method on data simulated from a state of the art simulator, and hypothesize
that on actual results the performance should be even better.
The availability of high accuracy reads allows for the exploration of new applications,
including; sequencing of larger organisms, organism disambiguation when sequencing
a population, exact sequence detection in diploid and polyploid organisms, and the
ability to scaffold genomes across exceptionally long repeat regions. Providing a path
to these high accuracy-reads without compromising read length is therefore crucial to
continued progress.
82
![Page 103: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/103.jpg)
Chapter 5
CONCLUSIONS
This chapter concludes the dissertation by summarizing the contributions of the
work and highlighting future directions.
5.1 Methodological Contributions
1. Modeling Gene Network Motifs In Chapter 1 this thesis presents a thorough
analysis of a differential equation-based fully-connected triadic network construct.
This construct is analyzed in both stable and dynamic environments to identify
stable steady states, and to validate their differential stability, and transition
dynamics. Implementing these discoveries requires a way to map real genes
within a network into this triadic construct though, so this thesis delves deeper
into network inference techniques.
2. Techniques for Data Aggregation In Chapter 2 this thesis reviews tech-
niques for network inference, and provides techniques of data aggregation and
normalization to improve inference quality. This is performed by collecting
publicly available data and normalizing the process for interpreting raw reads.
We find that while this approach shows some reduction to the noise incurred by
combining data sets there is still asignificant amount of noise remaining, and for
significant progress to be made high quality data is required.
3. Modeling and Simulation of Nanopore Sequencing Data in Chapter
4 this thesis reviews error patters in third generation nanopore sequencing.
We demonstrate that existing read simulators fail to adequately represent the
83
![Page 104: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/104.jpg)
accuracy bias that is present in nanopore data. To explain the sources of error a
set of features are generated from the model and fit to the error profile, providing
an explanation for the sources of error. Finally an HMM-based simulator capable
of generating the accuracy-biased nanopore reads is demonstrated and validated
against current read simulators.
4. Error Correction in Nanopore Sequencing In the final chapters this thesis
introduces computational and biological techniques for improving the quality of
nanopore reads using basecalled reads, using a directed acyclic graph, and by
correcting the raw signal provided by reads, which can later be basecalled with
higher fidelity.
5.2 Future Work
1. Further Exploration of Nanopore Error Sources. while this thesis
presents some analysis on sources of errors in nanopore sequencing there is
a significant fraction of error that remains unexplained. A more thorough
understanding of this error, and the ability to predict k-mer accuracy errors
completely in silico would be of significant value in the future of pore design.
2. Integration of Error Bias Findings Into Existing Software Tools. Our
findings indicate that many existing software tools that assume parity among all
k-mers are incorrect in their assumptions. Collaborations with other researchers
to integrate our findings both into sequence aligners and basecallers would be of
significant value to the scientific community as a whole.
84
![Page 105: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/105.jpg)
NOTES
85
![Page 106: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/106.jpg)
REFERENCES
[1] LC Junqueira, J Carneiro, and RO Kelley. “Basic Histology. Appleton andLange”. In: Stamford, Connecticut. USA pp (1995), pp. 283–300.
[2] B. D. Macarthur, A. Ma’ayan, and I. R. Lemischka. “Systems biology of stemcell fate and cellular reprogramming”. In: Nat Rev Mol Cell Biol 10 (Oct. 2009).10, pp. 672–81. doi: 10.1038/nrm2766.
[3] C. H. Waddington. The strategy of the genes; a discussion of some aspects oftheoretical biology. London, Allen & Unwin, 1957. ix, 262 p.
[4] S.A. Kauffman. “Metabolic stability and epigenesis in randomly constructedgenetic nets”. In: Journal of Theoretical Biology 22.3 (Mar. 1969), pp. 437–467.doi: 10.1016/0022-5193(69)90015-0. (Visited on 09/26/2013).
[5] S. Huang, G. Eichler, Y. Bar-Yam, et al. “Cell fates as high-dimensionalattractor states of a complex gene regulatory network”. In: Phys Rev Lett 94(Apr. 1, 2005). 12, p. 128701.
[6] Manu, S. Surkova, A. V. Spirov, et al. “Canalization of gene expression anddomain shifts in the Drosophila blastoderm by dynamical attractors”. In: PLoSComput Biol 5 (Mar. 2009). 3, e1000303. doi: 10.1371/journal.pcbi.1000303.
[7] M. Acar, A. Becskei, and A. van Oudenaarden. “Enhancement of cellularmemory by reducing stochastic transitions”. In: Nature 435 (May 12, 2005).7039, pp. 228–32.
[8] T. S. Gardner, C. R. Cantor, and J. J. Collins. “Construction of a genetic toggleswitch in Escherichia coli”. In: Nature 403 (Jan. 20, 2000). 6767, pp. 339–42.
[9] Attila Becskei, Bertrand Séraphin, and Luis Serrano. “Positive feedback ineukaryotic gene networks: cell differentiation by graded to binary responseconversion”. In: The EMBO Journal 20.10 (May 15, 2001), pp. 2528–2535. doi:10.1093/emboj/20.10.2528. (Visited on 09/25/2013).
[10] F. J. Isaacs, J. Hasty, C. R. Cantor, et al. “Prediction and measurement of anautoregulatory genetic module”. In: Proc Natl Acad Sci U S A 100 (June 24,2003). 13, pp. 7714–9. doi: 10.1073/pnas.1332628100.
[11] Tom Ellis, Xiao Wang, and James J. Collins. “Diversity-based, model-guidedconstruction of synthetic gene networks with predicted functions”. In: Nature
86
![Page 107: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/107.jpg)
Biotechnology 27.5 (May 2009), pp. 465–471. doi: 10.1038/nbt.1536. (Visitedon 07/03/2013).
[12] Min Wu, Ri-Qi Su, Xiaohui Li, et al. “Engineering of regulated stochastic cellfate determination”. In: Proceedings of the National Academy of Sciences 110.26(June 25, 2013), pp. 10610–10615. doi: 10.1073/pnas.1305423110. (Visited on07/03/2013).
[13] Daniel Huang, William J. Holtz, and Michel M. Maharbiz. “A genetic bistableswitch utilizing nonlinear protein degradation”. In: Journal of Biological En-gineering 6.1 (July 9, 2012), p. 9. doi: 10.1186/1754-1611-6-9. (Visited on07/27/2013).
[14] W. Xiong and J. E. Ferrell. “A positive-feedback-based bistable ’memorymodule’ that governs a cell fate decision”. In: Nature 426 (Nov. 27, 2003). 6965,pp. 460–5.
[15] Tetsuya Shiraishi, Shinako Matsuyama, and Hiroaki Kitano. “Large-ScaleAnalysis of Network Bistability for Human Cancers”. In: PLoS Comput Biol6.7 (July 8, 2010), e1000851. doi: 10.1371/journal.pcbi.1000851. (Visited on07/27/2013).
[16] S. Huang, Y. P. Guo, G. May, et al. “Bifurcation dynamics in lineage-commitment in bipotent progenitor cells”. In: Dev Biol 305 (May 15, 2007). 2,pp. 695–713.
[17] I. H. Park, R. Zhao, J. A. West, et al. “Reprogramming of human somatic cellsto pluripotency with defined factors”. In: Nature 451 (Jan. 10, 2008). 7175,pp. 141–6. doi: 10.1038/nature06534.
[18] Kazutoshi Takahashi and Shinya Yamanaka. “Induction of Pluripotent StemCells from Mouse Embryonic and Adult Fibroblast Cultures by Defined Factors”.In: Cell 126.4 (Aug. 25, 2006), pp. 663–676. doi: 10.1016/j.cell.2006.07.024.(Visited on 10/08/2013).
[19] K. Takahashi, K. Tanabe, M. Ohnuki, et al. “Induction of pluripotent stemcells from adult human fibroblasts by defined factors”. In: Cell 131 (Nov. 30,2007). 5, pp. 861–72. doi: 10.1016/j.cell.2007.11.019.
[20] J. Yu, M. A. Vodyanik, K. Smuga-Otto, et al. “Induced pluripotent stem celllines derived from human somatic cells”. In: Science 318 (Dec. 21, 2007). 5858,pp. 1917–20. doi: 10.1126/science.1151526.
87
![Page 108: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/108.jpg)
[21] L. A. Boyer, T. I. Lee, M. F. Cole, et al. “Core transcriptional regulatorycircuitry in human embryonic stem cells”. In: Cell 122 (Sept. 23, 2005). 6,pp. 947–56. doi: 10.1016/j.cell.2005.08.020.
[22] Jonghwan Kim, Jianlin Chu, Xiaohua Shen, et al. “An Extended TranscriptionalNetwork for Pluripotency of Embryonic Stem Cells”. In: Cell 132.6 (Mar. 21,2008), pp. 1049–1061. doi: 10.1016/j.cell.2008.02.039. (Visited on 07/03/2013).
[23] T. Kalmar, C. Lim, P. Hayward, et al. “Regulated fluctuations in nanogexpression mediate cell fate decisions in embryonic stem cells”. In: PLoS Biol 7(July 2009). 7, e1000149.
[24] Ian Chambers, Jose Silva, Douglas Colby, et al. “Nanog safeguards pluripotencyand mediates germline development”. In: Nature 450.7173 (Dec. 20, 2007),pp. 1230–1234. doi: 10.1038/nature06403.
[25] T. Graf and M. Stadtfeld. “Heterogeneity of embryonic and adult stem cells”. In:Cell Stem Cell 3 (Nov. 6, 2008). 5, pp. 480–3. doi: 10.1016/j.stem.2008.10.007.
[26] Maurice A Canham, Alexei A Sharov, Minoru S H Ko, et al. “Functional het-erogeneity of embryonic stem cells revealed through translational amplificationof an early endodermal transcript”. In: PLoS biology 8.5 (May 2010), e1000379.doi: 10.1371/journal.pbio.1000379.
[27] E. M. Chan, S. Ratanasirintrawoot, I. H. Park, et al. “Live cell imagingdistinguishes bona fide human iPS cells from partially reprogrammed cells”. In:Nat Biotechnol 27 (Nov. 2009). 11, pp. 1033–7. doi: 10.1038/nbt.1580.
[28] Katsuhiko Hayashi, Susana M Chuva de Sousa Lopes, Fuchou Tang, et al.“Dynamic equilibrium and heterogeneity of mouse pluripotent stem cells withdistinct functional and epigenetic states”. In: Cell stem cell 3.4 (Oct. 9, 2008),pp. 391–401. doi: 10.1016/j.stem.2008.07.027.
[29] Ben D MacArthur, Ana Sevilla, Michel Lenz, et al. “Nanog-dependent feedbackloops regulate murine embryonic stem cell heterogeneity”. In: Nature cell biology14.11 (Nov. 2012), pp. 1139–1147. doi: 10.1038/ncb2603.
[30] Todd S Macfarlan, Wesley D Gifford, Shawn Driscoll, et al. “Embryonic stemcell potency fluctuates with endogenous retrovirus activity”. In: Nature 487.7405(July 5, 2012), pp. 57–63. doi: 10.1038/nature11244.
[31] J. Silva and A. Smith. “Capturing pluripotency”. In: Cell 132 (Feb. 22, 2008).4, pp. 532–6. doi: 10.1016/j.cell.2008.02.006.
88
![Page 109: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/109.jpg)
[32] Yayoi Toyooka, Daisuke Shimosato, Kazuhiro Murakami, et al. “Identificationand characterization of subpopulations in undifferentiated ES cell culture”.In: Development (Cambridge, England) 135.5 (Mar. 2008), pp. 909–918. doi:10.1242/dev.017400.
[33] Jamie Trott, Katsuhiko Hayashi, Azim Surani, et al. “Dissecting ensemblenetworks in ES cell populations reveals micro-heterogeneity underlying pluripo-tency”. In: Molecular bioSystems 8.3 (Mar. 2012), pp. 744–752. doi: 10.1039/c1mb05398a.
[34] J. Hanna, K. Saha, B. Pando, et al. “Direct cell reprogramming is a stochasticprocess amenable to acceleration”. In: Nature 462 (Dec. 3, 2009). 7273, pp. 595–601. doi: 10.1038/nature08592.
[35] S. Yamanaka. “Elite and stochastic models for induced pluripotent stem cellgeneration”. In: Nature 460 (July 2, 2009). 7251, pp. 49–52. doi: 10.1038/nature08180.
[36] A. Brock, H. Chang, and S. Huang. “Non-genetic heterogeneity–a mutation-independent driving force for the somatic evolution of tumours”. In: Nat RevGenet 10 (May 2009). 5, pp. 336–42. doi: 10.1038/nrg2556.
[37] Sui Huang. “Tumor progression: chance and necessity in Darwinian and Lamar-ckian somatic (mutationless) evolution”. In: Progress in biophysics and molecularbiology 110.1 (Sept. 2012), pp. 69–86. doi: 10.1016/j.pbiomolbio.2012.05.001.
[38] Wenzhe Ma, Ala Trusina, Hana El-Samad, et al. “Defining Network Topologiesthat Can Achieve Biochemical Adaptation”. In: Cell 138.4 (Aug. 21, 2009),pp. 760–773. doi: 10.1016/j.cell.2009.06.013. (Visited on 07/03/2013).
[39] R. Milo, S. Shen-Orr, S. Itzkovitz, et al. “Network motifs: simple buildingblocks of complex networks”. In: Science 298 (Oct. 25, 2002). 5594, pp. 824–7.doi: 10.1126/science.298.5594.824.
[40] Jian Shu, Chen Wu, Yetao Wu, et al. “Induction of pluripotency in mousesomatic cells with lineage specifiers”. In: Cell 153.5 (May 23, 2013), pp. 963–975.doi: 10.1016/j.cell.2013.05.001.
[41] I. Chambers and A. Smith. “Self-renewal of teratocarcinoma and embryonicstem cells”. In: Oncogene 23 (Sept. 20, 2004). 43, pp. 7150–60. doi: 10.1038/sj.onc.1207930.
89
![Page 110: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/110.jpg)
[42] T. Graf and T. Enver. “Forcing cells to change lineages”. In: Nature 462 (Dec. 3,2009). 7273, pp. 587–94. doi: 10.1038/nature08533.
[43] H. Niwa. “How is pluripotency determined and maintained?” In: Development134 (Feb. 2007). 4, pp. 635–46. doi: 10.1242/dev.02787.
[44] Piyush B. Gupta, Christine M. Fillmore, Guozhi Jiang, et al. “Stochastic StateTransitions Give Rise to Phenotypic Equilibrium in Populations of Cancer Cells”.In: Cell 146.4 (Aug. 19, 2011), pp. 633–644. doi: 10.1016/j.cell.2011.07.026.(Visited on 07/27/2013).
[45] Avigdor Eldar and Michael B. Elowitz. “Functional roles for noise in geneticcircuits”. In: Nature 467.7312 (Sept. 9, 2010), pp. 167–173. doi: 10.1038/nature09326. (Visited on 07/27/2013).
[46] Peter Laslo, Chauncey J. Spooner, Aryeh Warmflash, et al. “MultilineageTranscriptional Priming and Determination of Alternate Hematopoietic CellFates”. In: Cell 126.4 (Aug. 25, 2006), pp. 755–766. doi: 10.1016/j.cell.2006.06.052. (Visited on 09/04/2013).
[47] M. Kaufman, C. Soule, and R. Thomas. “A new necessary condition on inter-action graphs for multistationarity”. In: J Theor Biol 248 (Oct. 21, 2007). 4,pp. 675–85. doi: 10.1016/j.jtbi.2007.06.016.
[48] J. Hasty, J. Pradines, M. Dolnik, et al. “Noise-based switches and amplifiersfor gene expression”. In: Proc Natl Acad Sci U S A 97 (Feb. 29, 2000). 5,pp. 2075–80.
[49] Santhosh Palani and Casim A Sarkar. “Transient noise amplification and geneexpression synchronization in a bistable mammalian cell-fate switch”. In: Cellreports 1.3 (Mar. 29, 2012), pp. 215–224. doi: 10.1016/j.celrep.2012.01.007.
[50] Harukazu Suzuki, Alistair R. R. Forrest, Erik van Nimwegen, et al. “Thetranscriptional network that controls growth arrest and differentiation in ahuman myeloid leukemia cell line”. In: Nature Genetics 41.5 (May 2009),pp. 553–562. doi: 10.1038/ng.375. (Visited on 08/08/2013).
[51] Hao Yuan Kueh, Ameya Champhekhar, Stephen L. Nutt, et al. “PositiveFeedback Between PU.1 and the Cell Cycle Controls Myeloid Differentiation”.In: Science 341.6146 (Aug. 9, 2013), pp. 670–673. doi: 10.1126/science.1240831.(Visited on 08/08/2013).
90
![Page 111: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/111.jpg)
[52] Richard A. Young. “Control of the Embryonic Stem Cell State”. In: Cell 144.6(Mar. 18, 2011), pp. 940–954. doi: 10.1016/j.cell.2011.01.032. (Visited on06/04/2014).
[53] Bard Ermentrout. Simulating, analyzing, and animating dynamical systems : aguide to XPPAUT for researchers and students. Software, environments, tools.Philadelphia: Society for Industrial and Applied Mathematics, 2002. xiv, 290 p.
[54] Steven H. Strogatz. Nonlinear Dynamics and Chaos: With Applications toPhysics, Biology, Chemistry, and Engineering. Reading, Mass.: Addison-WesleyPub., 1994. xi, 498.
[55] K. C. Chen, L. Calzone, A. Csikasz-Nagy, et al. “Integrative analysis of cellcycle control in budding yeast”. In: Mol Biol Cell 15 (Aug. 2004). 8, pp. 3841–62.
[56] W. J. Blake, K. AErn M, C. R. Cantor, et al. “Noise in eukaryotic geneexpression”. In: Nature 422 (Apr. 10, 2003). 6932, pp. 633–7.
[57] W. J. Blake, G. Balazsi, M. A. Kohanski, et al. “Phenotypic consequences ofpromoter-mediated transcriptional noise”. In: Mol Cell 24 (Dec. 28, 2006). 6,pp. 853–65.
[58] M. B. Elowitz, A. J. Levine, E. D. Siggia, et al. “Stochastic gene expression ina single cell”. In: Science 297 (Aug. 16, 2002). 5584, pp. 1183–6.
[59] Nicholas J Guido, Xiao Wang, David Adalsteinsson, et al. “A bottom-upapproach to gene regulation”. In: Nature 439.7078 (Feb. 16, 2006), pp. 856–860.doi: 10.1038/nature04473.
[60] J. M. Raser and E. K. O’Shea. “Control of stochasticity in eukaryotic geneexpression”. In: Science 304 (June 18, 2004). 5678, pp. 1811–4.
[61] G. M. Suel, J. Garcia-Ojalvo, L. M. Liberman, et al. “An excitable generegulatory circuit induces transient cellular differentiation”. In: Nature 440(Mar. 23, 2006). 7083, pp. 545–50.
[62] M. Thattai and A. van Oudenaarden. “Intrinsic noise in gene regulatory net-works”. In: Proc Natl Acad Sci U S A 98 (July 17, 2001). 15, pp. 8614–9.
[63] M. L. Bell, J. B. Earl, and S. G. Britt. “Two types of Drosophila R7 photorecep-tor cells are arranged randomly: a model for stochastic cell-fate determination”.In: J Comp Neurol 502 (May 1, 2007). 1, pp. 75–85.
91
![Page 112: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/112.jpg)
[64] V. Chickarmane, T. Enver, and C. Peterson. “Computational modeling ofthe hematopoietic erythroid-myeloid switch reveals insights into cooperativity,priming, and irreversibility”. In: PLoS Comput Biol 5 (Jan. 2009). 1, e1000268.
[65] C. F. Chuang, M. K. Vanhoven, R. D. Fetter, et al. “An innexin-dependentcell network establishes left-right neuronal asymmetry in C. elegans”. In: Cell129 (May 18, 2007). 4, pp. 787–99. doi: 10.1016/j.cell.2007.02.052.
[66] A. Raj, S. A. Rifkin, E. Andersen, et al. “Variability in gene expression underliesincomplete penetrance”. In: Nature 463 (Feb. 18, 2010). 7283, pp. 913–8. doi:10.1038/nature08781.
[67] G. Yao, T. J. Lee, S. Mori, et al. “A bistable Rb-E2F switch underlies therestriction point”. In: Nat Cell Biol 10 (Apr. 2008). 4, pp. 476–82.
[68] D. Adalsteinsson, D. McMillen, and T. C. Elston. “Biochemical NetworkStochastic Simulator (BioNetS): software for stochastic modeling of biochemicalnetworks”. In: BMC Bioinformatics 5 (Mar. 8, 2004), p. 24.
[69] Koichi Akashi, Xi He, Jie Chen, et al. “Transcriptional accessibility for genes ofmultiple tissues and hematopoietic lineages is hierarchically controlled duringearly hematopoiesis”. In: Blood 101.2 (Jan. 15, 2003), pp. 383–389. doi: 10.1182/blood-2002-06-1780.
[70] M. A. Cross and T. Enver. “The lineage commitment of haemopoietic progenitorcells”. In: Curr Opin Genet Dev 7 (Oct. 1997). 5, pp. 609–13.
[71] M. Hu, D. Krause, M. Greaves, et al. “Multilineage gene expression precedescommitment in the hemopoietic system”. In: Genes Dev 11 (Mar. 15, 1997). 6,pp. 774–85.
[72] Carla F Bender Kim, Erica L Jackson, Amber E Woolfenden, et al. “Identifica-tion of bronchioalveolar stem cells in normal lung and lung cancer”. In: Cell121.6 (June 17, 2005), pp. 823–835. doi: 10.1016/j.cell.2005.03.032.
[73] T. Miyamoto, H. Iwasaki, B. Reizis, et al. “Myeloid or lymphoid promiscuityas a critical step in hematopoietic lineage commitment”. In: Dev Cell 3 (July2002). 1, pp. 137–47.
[74] I. S. Gradshteyn, I. M. Ryzhik, and Alan Jeffrey. Table of integrals, series,and products. 6th. San Diego: Academic Press, 2000. xlvii, 1163 p. url: http://www.loc.gov/catdir/description/els031/00104373.html%20http://www.loc.gov/catdir/toc/els031/00104373.html.
92
![Page 113: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/113.jpg)
[75] Joseph Xu Zhou, M. D. S. Aliyu, Erik Aurell, et al. “Quasi-potential landscapein complex multi-stable systems”. In: Journal of The Royal Society Interface(Aug. 29, 2012), rsif20120434. doi: 10 . 1098 / rsif . 2012 . 0434. (Visited on03/19/2014).
[76] Ben D. MacArthur and Ihor R. Lemischka. “Statistical Mechanics of Pluripo-tency”. In: Cell 154.3 (Aug. 1, 2013), pp. 484–489. doi: 10.1016/j.cell.2013.07.024. (Visited on 08/02/2013).
[77] T. Vierbuchen, A. Ostermeier, Z. P. Pang, et al. “Direct conversion of fibroblaststo functional neurons by defined factors”. In: Nature 463 (Feb. 25, 2010). 7284,pp. 1035–41. doi: 10.1038/nature08797.
[78] T. Nakagawa, M. Sharma, Y. Nabeshima, et al. “Functional hierarchy andreversibility within the murine spermatogenic stem cell compartment”. In:Science 328 (Apr. 2, 2010). 5974, pp. 62–7. doi: 10.1126/science.1182868.
[79] Y. H. Loh, Q. Wu, J. L. Chew, et al. “The Oct4 and Nanog transcriptionnetwork regulates pluripotency in mouse embryonic stem cells”. In: Nat Genet38 (Apr. 2006). 4, pp. 431–40. doi: 10.1038/ng1760.
[80] Miguel Fidalgo, Francesco Faiola, Carlos-Filipe Pereira, et al. “Zfp281 mediatesNanog autorepression through recruitment of the NuRD complex and inhibitssomatic cell reprogramming”. In: Proceedings of the National Academy ofSciences of the United States of America 109.40 (Oct. 2, 2012), pp. 16202–16207. doi: 10.1073/pnas.1208533109.
[81] Pablo Navarro, Nicola Festuccia, Douglas Colby, et al. “OCT4/SOX2-independent Nanog autorepression modulates heterogeneous Nanog geneexpression in mouse ES cells”. In: The EMBO journal 31.24 (Dec. 12, 2012),pp. 4547–4562. doi: 10.1038/emboj.2012.321.
[82] H Niwa, J Miyazaki, and A G Smith. “Quantitative expression of Oct-3/4defines differentiation, dedifferentiation or self-renewal of ES cells”. In: Naturegenetics 24.4 (Apr. 2000), pp. 372–376. doi: 10.1038/74199.
[83] Aliaksandra Radzisheuskaya, Gloryn Le Bin Chia, Rodrigo L dos Santos, et al.“A defined Oct4 level governs cell state transitions of pluripotency entry anddifferentiation into all embryonic lineages”. In: Nature cell biology 15.6 (June2013), pp. 579–590. doi: 10.1038/ncb2742.
[84] Violetta Karwacki-Neisius, Jonathan Göke, Rodrigo Osorno, et al. “ReducedOct4 Expression Directs a Robust Pluripotent State with Distinct Signaling
93
![Page 114: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/114.jpg)
Activity and Increased Enhancer Occupancy by Oct4 and Nanog”. In: CellStem Cell 12.5 (Feb. 5, 2013), pp. 531–545. doi: 10.1016/j.stem.2013.04.023.(Visited on 06/04/2014).
[85] J. Wang, D. N. Levasseur, and S. H. Orkin. “Requirement of Nanog dimerizationfor stem cell self-renewal and pluripotency”. In: Proc Natl Acad Sci U S A 105(Apr. 29, 2008). 17, pp. 6326–31. doi: 10.1073/pnas.0802288105.
[86] Joon-Lin Chew, Yuin-Han Loh, Wensheng Zhang, et al. “Reciprocal transcrip-tional regulation of Pou5f1 and Sox2 via the Oct4/Sox2 complex in embryonicstem cells”. In: Molecular and cellular biology 25.14 (July 2005), pp. 6031–6046.doi: 10.1128/MCB.25.14.6031-6046.2005.
[87] Vijay Chickarmane, Carl Troein, Ulrike A Nuber, et al. “Transcriptional Dynam-ics of the Embryonic Stem Cell Switch”. In: PLoS Comput Biol 2.9 (Sept. 15,2006), e123. doi: 10.1371/journal.pcbi.0020123. (Visited on 06/04/2014).
[88] Irene Aksoy, Ralf Jauch, Jiaxuan Chen, et al. “Oct4 switches partnering fromSox2 to Sox17 to reinterpret the enhancer code and specify endoderm”. In: TheEMBO journal 32.7 (Apr. 3, 2013), pp. 938–953. doi: 10.1038/emboj.2013.31.
[89] Dina A. Faddah, Haoyi Wang, Albert Wu Cheng, et al. “Single-Cell AnalysisReveals that Expression of Nanog Is Biallelic and Equally Variable as that ofOther Pluripotency Factors in Mouse ESCs”. In: Cell Stem Cell 13.1 (July 3,2013), pp. 23–29. doi: 10.1016/j.stem.2013.04.019. (Visited on 08/08/2013).
[90] Adam Filipczyk, Konstantinos Gkatzis, Jun Fu, et al. “Biallelic Expression ofNanog Protein in Mouse Embryonic Stem Cells”. In: Cell Stem Cell 13.1 (July 3,2013), pp. 12–13. doi: 10.1016/j.stem.2013.04.025. (Visited on 08/08/2013).
[91] Yusuke Miyanari and Maria-Elena Torres-Padilla. “Control of ground-statepluripotency by allelic regulation of Nanog”. In: Nature 483.7390 (Feb. 12,2012), pp. 470–473. doi: 10.1038/nature10807. (Visited on 03/22/2012).
[92] Austin Smith. “Nanog Heterogeneity: Tilting at Windmills?” In: Cell StemCell 13.1 (July 3, 2013), pp. 6–7. doi: 10.1016/j.stem.2013.06.016. (Visited on08/08/2013).
[93] O. Adewumi, B. Aflatoonian, L. Ahrlund-Richter, et al. “Characterization ofhuman embryonic stem cell lines by the International Stem Cell Initiative”. In:Nat Biotechnol 25 (July 2007). 7, pp. 803–16. doi: 10.1038/nbt1318.
94
![Page 115: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/115.jpg)
[94] Kevin F Murphy, Rhys M Adams, Xiao Wang, et al. “Tuning and controllinggene expression noise in synthetic gene networks”. In: Nucleic acids research38.8 (May 2010), pp. 2712–2726. doi: 10.1093/nar/gkq091.
[95] Hong-xi Zhao, Yang Li, Hai-feng Jin, et al. “Rapid and efficient repro-gramming of human amnion-derived cells into pluripotency by three factorsOCT4/SOX2/NANOG”. In: Differentiation; research in biological diversity 80.2(Oct. 2010), pp. 123–129. doi: 10.1016/j.diff.2010.03.002.
[96] R. Jaenisch and R. Young. “Stem cells, the molecular circuitry of pluripotencyand nuclear reprogramming”. In: Cell 132 (Feb. 22, 2008). 4, pp. 567–82. doi:10.1016/j.cell.2008.01.015.
[97] Vidyadhar G. Kulkarni. Modeling and analysis of stochastic systems. New York:Chapman & Hall, 1995. url: http://www.loc.gov/catdir/enhancements/fy0744/95015182-d.html.
[98] Filipe Grácio, Joaquim Cabral, and Bruce Tidor. “Modeling Stem Cell InductionProcesses”. In: PLoS ONE 8.5 (May 8, 2013), e60240. doi: 10.1371/journal.pone.0060240. (Visited on 08/08/2013).
[99] John E Pimanda, Katrin Ottersbach, Kathy Knezevic, et al. “Gata2, Fli1, andScl form a recursively wired gene-regulatory circuit during early hematopoieticdevelopment”. In: Proceedings of the National Academy of Sciences of theUnited States of America 104.45 (Nov. 6, 2007), pp. 17692–17697. doi: 10.1073/pnas.0707045104.
[100] Michael Kyba and George Q Daley. “Hematopoiesis from embryonic stem cells:lessons from and for ontogeny”. In: Experimental hematology 31.11 (Nov. 2003),pp. 994–1006.
[101] Sunhong Kim, Yongsung Kim, Jiwoon Lee, et al. “Regulation of FOXO1 byTAK1-Nemo-like kinase pathway”. In: The Journal of biological chemistry285.11 (Mar. 12, 2010), pp. 8122–8129. doi: 10.1074/jbc.M110.101824.
[102] B. Pourcet, I. Pineda-Torra, B. Derudas, et al. “SUMOylation of humanperoxisome proliferator-activated receptor alpha inhibits its trans-activitythrough the recruitment of the nuclear corepressor NCoR”. In: J Biol Chem285 (Feb. 26, 2010). 9, pp. 5983–92. doi: 10.1074/jbc.M109.078311.
[103] F. Wei, H. R. Scholer, and M. L. Atchison. “Sumoylation of Oct4 enhances itsstability, DNA binding, and transactivation”. In: J Biol Chem 282 (July 20,2007). 29, pp. 21551–60. doi: 10.1074/jbc.M611041200.
95
![Page 116: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/116.jpg)
[104] Alex K. Shalek, Rahul Satija, Xian Adiconis, et al. “Single-cell transcriptomicsreveals bimodality in expression and splicing in immune cells”. In: Nature498.7453 (June 13, 2013), pp. 236–240. doi: 10.1038/nature12172. url: http://www.nature.com/nature/journal/v498/n7453/full/nature12172.html (visitedon 08/28/2014).
[105] Ting Lu, Michael Ferry, Ron Weiss, et al. “A molecular noise generator”. In:Physical biology 5.3 (2008), p. 036006. doi: 10.1088/1478-3975/5/3/036006.
[106] Mathieu Gabut, Payman Samavarchi-Tehrani, Xinchen Wang, et al. “An Al-ternative Splicing Switch Regulates Embryonic Stem Cell Pluripotency andReprogramming”. In: Cell 147 (Sept. 2011), pp. 132–146. doi: 10.1016/j.cell.2011.08.023. (Visited on 10/18/2011).
[107] Yu Lu, Yuin-Han Loh, Hu Li, et al. “Alternative Splicing of MBD2 SupportsSelf-Renewal in Human Pluripotent Stem Cells”. In: Cell stem cell (May 7,2014). doi: 10.1016/j.stem.2014.04.002.
[108] T. L. Deans, C. R. Cantor, and J. J. Collins. “A tunable genetic switch basedon RNAi and repressor proteins for regulating gene expression in mammaliancells”. In: Cell 130 (July 27, 2007). 2, pp. 363–72.
[109] A. S. Khalil and J. J. Collins. “Synthetic biology: applications come of age”. In:Nat Rev Genet 11 (May 2010). 5, pp. 367–79. doi: 10.1038/nrg2775.
[110] T. K. Lu, A. S. Khalil, and J. J. Collins. “Next-generation synthetic genenetworks”. In: Nat Biotechnol 27 (Dec. 2009). 12, pp. 1139–50. doi: 10.1038/nbt.1591.
[111] J. Stricker, S. Cookson, M. R. Bennett, et al. “A fast, robust and tunablesynthetic gene oscillator”. In: Nature 456 (Nov. 27, 2008). 7221, pp. 516–9. doi:10.1038/nature07389.
[112] M. Tigges, T. T. Marquez-Lago, J. Stelling, et al. “A tunable synthetic mam-malian oscillator”. In: Nature 457 (Jan. 15, 2009). 7227, pp. 309–12.
[113] R. Lu, F. Markowetz, R. D. Unwin, et al. “Systems-level dynamic analyses offate change in murine embryonic stem cells”. In: Nature 462 (Nov. 19, 2009).7271, pp. 358–62. doi: 10.1038/nature08575.
[114] Ingmar Glauche, Maria Herberg, and Ingo Roeder. “Nanog Variability andPluripotency Regulation of Embryonic Stem Cells - Insights from a Mathe-
96
![Page 117: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/117.jpg)
matical Model Analysis”. In: PLoS ONE 5.6 (June 21, 2010), e11238. doi:10.1371/journal.pone.0011238. (Visited on 07/03/2013).
[115] A. Remenyi, K. Lins, L. J. Nissen, et al. “Crystal structure of aPOU/HMG/DNA ternary complex suggests differential assembly ofOct4 and Sox2 on two enhancers”. In: Genes Dev 17 (Aug. 15, 2003). 16,pp. 2048–59. doi: 10.1101/gad.269303.
[116] A. Dhooge, W. Govaerts, and Yu. A. Kuznetsov. “MATCONT: A MATLABPackage for Numerical Bifurcation Analysis of ODEs”. In: ACM Trans. Math.Softw. 29.2 (June 2003), pp. 141–164. doi: 10.1145/779359.779362. (Visited on11/15/2013).
[117] Daniel Gillespie. “Exact stochastic simulation of coupled chemical reactions”.In: J. Phys. Chem. 81 (1977), pp. 2340–2361.
[118] Tong Ihn Lee, Nicola J. Rinaldi, Francois Robert, et al. “Transcriptional Regu-latory Networks in Saccharomyces cerevisiae”. In: Science 298.5594 (Oct. 25,2002), pp. 799–804. doi: 10 . 1126/ science . 1075090. url: http : // science .sciencemag .org .ezproxy1. lib .asu.edu/content/298/5594/799 (visited on07/13/2017).
[119] Philippe C. Faucon, Keith Pardee, Roshan M. Kumar, et al. “Gene Networks ofFully Connected Triads with Complete Auto-Activation Enable Multistabilityand Stepwise Stochastic Transitions”. In: PLOS ONE 9.7 (July 24, 2014),e102873. doi: 10.1371/journal.pone.0102873. url: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0102873 (visited on 09/14/2016).
[120] Tanya Barrett, Stephen E. Wilhite, Pierre Ledoux, et al. “NCBI GEO: archivefor functional genomics data sets”. In: Nucleic Acids Research 41 (D1 Jan. 1,2013), pp. D991–D995. doi: 10.1093/nar/gks1193. url: https://academic.oup.com/nar/article/41/D1/D991/1067995/NCBI-GEO-archive-for-functional-genomics-data-sets (visited on 07/13/2017).
[121] Markus Maucher, Barbara Kracher, Michael Kühl, et al. “Inferring Booleannetwork structure via correlation”. In: Bioinformatics 27.11 (June 1, 2011),pp. 1529–1536. doi: 10.1093/bioinformatics/btr166. url: http://bioinformatics.oxfordjournals.org/content/27/11/1529 (visited on 03/21/2016).
[122] Daniel Marbach, James C. Costello, Robert Küffner, et al. “Wisdom of crowdsfor robust gene network inference”. In: Nature Methods 9.8 (Aug. 2012), pp. 796–804. doi: 10.1038/nmeth.2016. url: http://www.nature.com.ezproxy1.lib.asu.edu/nmeth/journal/v9/n8/full/nmeth.2016.html (visited on 07/13/2017).
97
![Page 118: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/118.jpg)
[123] Jing Zhao, Lili Miao, Jian Yang, et al. “Prediction of Links and Weights inNetworks by Reliable Routes”. In: Scientific Reports 5 (July 22, 2015), p. 12261.doi: 10.1038/srep12261. url: http://www.nature.com/articles/srep12261(visited on 04/04/2016).
[124] Gordon E Moore et al. Cramming more components onto integrated circuits.McGraw-Hill, 1965.
[125] Ilya Shmulevich, Edward R. Dougherty, Seungchan Kim, et al. “ProbabilisticBoolean networks: a rule-based uncertainty model for gene regulatory networks”.In: Bioinformatics 18.2 (Feb. 1, 2002), pp. 261–274. doi: 10.1093/bioinformatics/18.2.261. (Visited on 09/24/2013).
[126] E P van Someren, L F Wessels, and M J Reinders. “Linear modeling of geneticnetworks from experimental data”. In: Proceedings / ... International Conferenceon Intelligent Systems for Molecular Biology ; ISMB. International Conferenceon Intelligent Systems for Molecular Biology 8 (2000), pp. 355–366.
[127] P D’haeseleer, X Wen, S Fuhrman, et al. “Linear modeling of mRNA expres-sion levels during CNS development and injury”. In: Pacific Symposium onBiocomputing. Pacific Symposium on Biocomputing (1999), pp. 41–52.
[128] Min Zou and Suzanne D. Conzen. “A new dynamic Bayesian network (DBN)approach for identifying gene regulatory networks from time course microarraydata”. In: Bioinformatics 21.1 (Jan. 1, 2005), pp. 71–79. doi: 10.1093/bioinformatics/bth463. (Visited on 10/10/2013).
[129] Nir Friedman, Michal Linial, Iftach Nachman, et al. “Using Bayesian Net-works to Analyze Expression Data”. In: Journal of Computational Biology7.3 (Aug. 2000), pp. 601–620. doi: 10.1089/106652700750050961. (Visited on09/25/2013).
[130] E. J. Moler, M. L. Chow, and I. S. Mian. “Analysis of molecular profile datausing generative and discriminative methods”. In: Physiological Genomics 4.2(Jan. 1, 2000), pp. 109–126. (Visited on 09/25/2013).
[131] Sui Huang, Ingemar Ernberg, and Stuart Kauffman. “Cancer attractors: Asystems view of tumors from a gene network dynamics and developmentalperspective”. In: Seminars in Cell & Developmental Biology 20.7 (Sept. 2009),pp. 869–876. doi: 10.1016/j.semcdb.2009.07.003. (Visited on 09/26/2013).
98
![Page 119: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/119.jpg)
[132] Sui Huang and Donald E. Ingber. “The structural and mechanical complexityof cell-growth control”. In: Nature Cell Biology 1.5 (Sept. 1999), E131–E138.doi: 10.1038/13043. (Visited on 10/08/2013).
[133] Sui Huang. “Gene expression profiling, genetic networks, and cellular states:an integrating concept for tumorigenesis and drug discovery”. In: Journal ofMolecular Medicine 77.6 (June 1, 1999), pp. 469–480. doi: 10.1007/s001099900023. (Visited on 10/08/2013).
[134] Kevin Murphy and Saira Mian. “Modelling Gene Expression Data using Dy-namic Bayesian Networks”. In: (May 1999). url: http://www-devel.cs.ubc.ca/~murphyk/Papers/ismb99.pdf.
[135] Huilei Xu, Yen-Sin Ang, Ana Sevilla, et al. “Construction and validation ofa regulatory network for pluripotency and self-renewal of mouse embryonicstem cells”. In: PLoS computational biology 10.8 (Aug. 2014), e1003777. doi:10.1371/journal.pcbi.1003777.
[136] Javed Khan, Jun S. Wei, Markus Ringner, et al. “Classification and diagnosticprediction of cancers using gene expression profiling and artificial neural net-works”. In: Nature Medicine 7.6 (June 2001), pp. 673–679. doi: 10.1038/89044.(Visited on 09/27/2013).
[137] R. Brause, T. Langsdorf, and M. Hepp. “Neural data mining for credit card frauddetection”. In: 11th IEEE International Conference on Tools with ArtificialIntelligence, 1999. Proceedings. 11th IEEE International Conference on Toolswith Artificial Intelligence, 1999. Proceedings. 1999, pp. 103–106. doi: 10.1109/TAI.1999.809773.
[138] M. Syeda, Yan-Qing Zhang, and Y. Pan. “Parallel granular neural networks forfast credit card fraud detection”. In: Proceedings of the 2002 IEEE InternationalConference on Fuzzy Systems, 2002. FUZZ-IEEE’02. Proceedings of the 2002IEEE International Conference on Fuzzy Systems, 2002. FUZZ-IEEE’02. Vol. 1.2002, pp. 572–577. doi: 10.1109/FUZZ.2002.1005055.
[139] Richard Bellman and Robert E Kalaba. Dynamic programming and moderncontrol theory. Academic Press New York, 1965.
[140] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, et al. “Empirical Eval-uation of Gated Recurrent Neural Networks on Sequence Modeling”. In:arXiv:1412.3555 [cs] (Dec. 11, 2014). arXiv: 1412.3555. url: http://arxiv.org/abs/1412.3555 (visited on 10/11/2016).
99
![Page 120: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/120.jpg)
[141] Mukesh Bansal, Vincenzo Belcastro, Alberto Ambesi-Impiombato, et al. “Howto infer gene networks from expression profiles”. In: Molecular Systems Biology3.1 (Jan. 1, 2007), p. 78. doi: 10.1038/msb4100120. url: http://msb.embopress.org/content/3/1/78 (visited on 07/13/2017).
[142] Jing Yu, V. Anne Smith, Paul P. Wang, et al. “Advances to Bayesian networkinference for generating causal networks from observational biological data”.In: Bioinformatics 20.18 (Dec. 12, 2004), pp. 3594–3603. doi: 10.1093/bioinformatics/bth448. url: https://academic.oup.com/bioinformatics/article/20/18/3594/202541/Advances-to-Bayesian-network-inference-for (visited on07/13/2017).
[143] Barbara Treutlein, Doug G. Brownfield, Angela R. Wu, et al. “Reconstructinglineage hierarchies of the distal lung epithelium using single-cell RNA-seq”. In:Nature 509.7500 (May 15, 2014), pp. 371–375. doi: 10.1038/nature13173.
[144] Eric W. Brunskill, Joo-Seop Park, Eunah Chung, et al. “Single cell dissection ofearly kidney development: multilineage priming”. In: Development (Cambridge,England) 141.15 (Aug. 2014), pp. 3093–3101. doi: 10.1242/dev.110601.
[145] Saiful Islam, Una Kjallquist, Annalena Moliner, et al. “Characterization of thesingle-cell transcriptional landscape by highly multiplex RNA-seq”. In: GenomeResearch 21.7 (July 2011), pp. 1160–1167. doi: 10.1101/gr.110882.110.
[146] Robert N. Plasschaert, Sebastien Vigneau, Italo Tempera, et al. “CTCF bindingsite sequence differences are associated with unique regulatory and functionaltrends during embryonic stem cell differentiation”. In: Nucleic Acids Research42.2 (Jan. 2014), pp. 774–789. doi: 10.1093/nar/gkt910.
[147] Jeremy Goecks, Anton Nekrutenko, James Taylor, et al. “Galaxy: a com-prehensive approach for supporting accessible, reproducible, and transparentcomputational research in the life sciences”. In: Genome Biology 11.8 (2010),R86. doi: 10.1186/gb-2010-11-8-r86.
[148] Daniel Blankenberg, Gregory Von Kuster, Nathaniel Coraor, et al. “Galaxy: aweb-based genome analysis tool for experimentalists”. In: Current Protocols inMolecular Biology Chapter 19 (Jan. 2010), pages. doi: 10.1002/0471142727.mb1910s89.
[149] Chunlei Wu, Ian Macleod, and Andrew I. Su. “BioGPS and MyGene.info:organizing online, gene-centric information”. In: Nucleic Acids Research 41(Database issue Jan. 2013), pp. D561–565. doi: 10.1093/nar/gks1114.
100
![Page 121: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/121.jpg)
[150] Zhide Fang and Xiangqin Cui. “Design and validation issues in RNA-seqexperiments”. In: Briefings in Bioinformatics 12.3 (May 1, 2011), pp. 280–287.doi: 10.1093/bib/bbr004. url: http://bib.oxfordjournals.org/content/12/3/280 (visited on 10/09/2014).
[151] Florian Buettner, Kedar N. Natarajan, F. Paolo Casale, et al. “Computationalanalysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data revealshidden subpopulations of cells”. In: Nature Biotechnology 33.2 (Feb. 2015),pp. 155–160. doi: 10.1038/nbt.3102. url: http://www.nature.com.ezproxy1.lib.asu.edu/nbt/journal/v33/n2/full/nbt.3102.html (visited on 07/18/2017).
[152] Michael G. Ross, Carsten Russ, Maura Costello, et al. “Characterizing andmeasuring bias in sequence data”. In: Genome Biology 14.5 (May 29, 2013), R51.doi: 10.1186/gb-2013-14-5-r51. url: http://genomebiology.biomedcentral.com.ezproxy1.lib.asu.edu/articles/10.1186/gb-2013-14-5-r51 (visited on04/11/2017).
[153] Nicholas J. Loman, Joshua Quick, and Jared T. Simpson. “A complete bacterialgenome assembled de novo using only nanopore sequencing data”. In: NatureMethods 12.8 (Aug. 2015), pp. 733–735. doi: 10 . 1038/nmeth . 3444. url:http://www.nature.com.ezproxy1.lib.asu.edu/nmeth/journal/v12/n8/full/nmeth.3444.html (visited on 09/12/2016).
[154] Philippe Christophe Faucon, Robert Trevino, Parithi Balachandran, et al. “HighAccuracy Base Calls in Nanopore Sequencing”. In: bioRxiv (Apr. 11, 2017),p. 126680. doi: 10.1101/126680. url: http://biorxiv.org/content/early/2017/04/11/126680 (visited on 04/11/2017).
[155] Ivan Sović, Mile Šikić, Andreas Wilm, et al. “Fast and sensitive mapping ofnanopore sequencing reads with GraphMap”. In: Nature Communications 7(Apr. 15, 2016), p. 11307. doi: 10.1038/ncomms11307. url: http://www.nature.com.ezproxy1.lib.asu.edu/ncomms/2016/160415/ncomms11307/full/ncomms11307.html (visited on 01/09/2017).
[156] Heng Li. “Toward better understanding of artifacts in variant calling fromhigh-coverage samples”. In: Bioinformatics 30.20 (Oct. 15, 2014), pp. 2843–2851. doi: 10.1093/bioinformatics/btu356. url: https://academic.oup.com/bioinformatics/article/30/20/2843/2422145/Toward-better-understanding-of-artifacts-in (visited on 04/11/2017).
[157] Konstantin Berlin, Sergey Koren, Chen-Shan Chin, et al. “Assembling largegenomes with single-molecule sequencing and locality-sensitive hashing”. In:Nature Biotechnology 33.6 (June 2015), pp. 623–630. doi: 10.1038/nbt.3238.
101
![Page 122: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/122.jpg)
url: http://www.nature.com.ezproxy1.lib.asu.edu/nbt/journal/v33/n6/abs/nbt.3238.html (visited on 01/24/2017).
[158] Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, et al. “GappedBLAST and PSI-BLAST: a new generation of protein database search pro-grams”. In: Nucleic Acids Research 25.17 (Sept. 1, 1997), pp. 3389–3402. doi:10.1093/nar/25.17.3389. url: http://nar.oxfordjournals.org/content/25/17/3389 (visited on 12/15/2016).
[159] Chen Yang, Justin Chu, Ren, et al. “NanoSim: nanopore sequence read simulatorbased on statistical characterization”. In: bioRxiv (Mar. 18, 2016), p. 044545.doi: 10.1101/044545. url: http://biorxiv.org/content/early/2016/03/18/044545.1 (visited on 12/16/2016).
[160] Yukiteru Ono, Kiyoshi Asai, and Michiaki Hamada. “PBSIM: PacBio readssimulator—toward accurate genome assembly”. In: Bioinformatics 29.1 (Jan. 1,2013), pp. 119–121. doi: 10.1093/bioinformatics/bts649. url: http://bioinformatics.oxfordjournals.org/content/29/1/119 (visited on 12/16/2016).
[161] Bayo Lau, Marghoob Mohiyuddin, John C. Mu, et al. “LongISLND: in silicosequencing of lengthy and noisy datatypes”. In: Bioinformatics 32.24 (Dec. 15,2016), pp. 3829–3832. doi: 10.1093/bioinformatics/btw602. url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC5167071/ (visited on 02/15/2017).
[162] Merly Escalona, Sara Rocha, and David Posada. “A comparison of tools for thesimulation of genomic next-generation sequencing data”. In: Nature ReviewsGenetics 17.8 (Aug. 2016), pp. 459–469. doi: 10 .1038/nrg .2016.57. url:http://www.nature.com.ezproxy1.lib.asu.edu/nrg/journal/v17/n8/abs/nrg.2016.57.html (visited on 04/18/2017).
[163] Matthew Frampton and Richard Houlston. “Generation of Artificial FASTQFiles to Evaluate the Performance of Next-Generation Sequencing Pipelines”. In:PLOS ONE 7.11 (Nov. 12, 2012), e49110. doi: 10.1371/journal.pone.0049110.url: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0049110 (visited on 04/18/2017).
[164] Dent A. Earl, Keith Bradnam, John St John, et al. “Assemblathon 1: Acompetitive assessment of de novo short read assembly methods”. In: GenomeResearch (Sept. 16, 2011), gr.126599.111. doi: 10.1101/gr.126599.111. url:http://genome.cshlp.org/content/early/2011/09/16/gr.126599.111 (visited on04/18/2017).
102
![Page 123: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/123.jpg)
[165] Weichun Huang, Leping Li, Jason R. Myers, et al. “ART: a next-generationsequencing read simulator”. In: Bioinformatics 28.4 (Feb. 15, 2012), pp. 593–594. doi: 10.1093/bioinformatics/btr708. url: https://academic.oup.com/bioinformatics/article/28/4/593/213322/ART-a-next-generation-sequencing-read-simulator (visited on 04/18/2017).
[166] Juliane C. Dohm, Claudio Lottaz, Tatiana Borodina, et al. “Substantial biasesin ultra-short read data sets from high-throughput DNA sequencing”. In:Nucleic Acids Research 36.16 (Sept. 2008), e105. doi: 10.1093/nar/gkn425.url: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2532726/ (visited on04/18/2017).
[167] Yen-Chun Chen, Tsunglin Liu, Chun-Hui Yu, et al. “Effects of GC Bias inNext-Generation-Sequencing Data on De Novo Genome Assembly”. In: PLOSONE 8.4 (Apr. 29, 2013), e62856. doi: 10.1371/journal.pone.0062856. url:http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0062856(visited on 04/18/2017).
[168] Vladimír Boža, Broňa Brejová, and Tomáš Vinař. “DeepNano: Deep Recur-rent Neural Networks for Base Calling in MinION Nanopore Reads”. In:arXiv:1603.09195 [q-bio] (Mar. 30, 2016). arXiv: 1603 . 09195. url: http ://arxiv.org/abs/1603.09195 (visited on 10/11/2016).
[169] Matei David, Lewis Jonathan Dursi, Delia Yao, et al. “Nanocall: An OpenSource Basecaller for Oxford Nanopore Sequencing Data”. In: bioRxiv (Mar. 28,2016), p. 046086. doi: 10.1101/046086. url: http://biorxiv.org/content/early/2016/03/28/046086 (visited on 10/11/2016).
[170] Joshua Quick, Nicholas J. Loman, Sophie Duraffour, et al. “Real-time, portablegenome sequencing for Ebola surveillance”. In: Nature 530.7589 (Feb. 11, 2016),pp. 228–232. doi: 10 . 1038/nature16996. url: http : //www .nature . com.ezproxy1.lib.asu.edu/nature/journal/v530/n7589/full/nature16996.html(visited on 09/12/2016).
[171] Miten Jain, Ian T. Fiddes, Karen H. Miga, et al. “Improved data analysisfor the MinION nanopore sequencer”. In: Nature Methods 12.4 (Apr. 2015),pp. 351–356. doi: 10.1038/nmeth.3290. url: http://www.nature.com/nmeth/journal/v12/n4/full/nmeth.3290.html (visited on 09/13/2016).
[172] Sara Goodwin, James Gurtowski, Scott Ethe-Sayers, et al. “Oxford Nanoporesequencing, hybrid error correction, and de novo assembly of a eukaryoticgenome”. In: Genome Research 25.11 (Nov. 1, 2015), pp. 1750–1756. doi:
103
![Page 124: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/124.jpg)
10.1101/gr.191395.115. url: http://genome.cshlp.org/content/25/11/1750(visited on 09/12/2016).
[173] Tamas Szalay and Jene A. Golovchenko. “De novo sequencing and variantcalling with nanopores using PoreSeq”. In: Nature Biotechnology 33.10 (Oct.2015), pp. 1087–1091. doi: 10.1038/nbt.3360. url: http://www.nature.com/nbt/journal/v33/n10/full/nbt.3360.html (visited on 09/13/2016).
[174] Chenhao Li, Kern Rei Chng, Esther Jia Hui Boey, et al. “INC-Seq: accuratesingle molecule reads using nanopore sequencing”. In: GigaScience 5 (Aug. 2,2016), p. 34. doi: 10.1186/s13742-016-0140-7. url: http://dx.doi.org/10.1186/s13742-016-0140-7 (visited on 12/15/2016).
[175] Brian D. Ondov, Todd J. Treangen, Adam B. Mallonee, et al. “Fast genomeand metagenome distance estimation using MinHash”. In: bioRxiv (Oct. 26,2015), p. 029827. doi: 10.1101/029827. url: http://biorxiv.org/content/early/2015/10/26/029827 (visited on 12/15/2016).
[176] Sergey Koren, Brian P. Walenz, Konstantin Berlin, et al. Canu: scalable andaccurate long-read assembly via adaptive k-mer weighting and repeat separation| bioRxiv. Aug. 24, 2016. url: http://biorxiv.org/content/early/2016/08/24/071282 (visited on 12/15/2016).
[177] Chen-Shan Chin, David H. Alexander, Patrick Marks, et al. “Nonhybrid,finished microbial genome assemblies from long-read SMRT sequencing data”.In: Nature Methods 10.6 (June 2013), pp. 563–569. doi: 10.1038/nmeth.2474.url: http://www.nature.com.ezproxy1.lib.asu.edu/nmeth/journal/v10/n6/abs/nmeth.2474.html (visited on 12/15/2016).
[178] Sergey Koren, Michael C. Schatz, Brian P. Walenz, et al. “Hybrid error cor-rection and de novo assembly of single-molecule sequencing reads”. In: NatureBiotechnology 30.7 (July 2012), pp. 693–700. doi: 10.1038/nbt.2280. url:http://www.nature.com.ezproxy1.lib.asu.edu/nbt/journal/v30/n7/full/nbt.2280.html (visited on 12/15/2016).
[179] Nicolas L. Bray, Harold Pimentel, Páll Melsted, et al. “Near-optimal proba-bilistic RNA-seq quantification”. In: Nature Biotechnology 34.5 (May 2016),pp. 525–527. doi: 10.1038/nbt.3519. url: http://www.nature.com.ezproxy1.lib.asu.edu/nbt/journal/v34/n5/full/nbt.3519.html (visited on 12/15/2016).
[180] Stephen F. Altschul, Warren Gish, Webb Miller, et al. “Basic local alignmentsearch tool”. In: Journal of Molecular Biology 215.3 (Oct. 5, 1990), pp. 403–410.
104
![Page 125: Gene Network Inference via Sequence Alignment and ... · Figure Page 4. Stable Steady State Analysis of Network 84. (A) A Matrix Presenting theNumberofStableSteadyStatesGeneratedbyCombinationsofDiffer-](https://reader036.vdocuments.us/reader036/viewer/2022070903/5f64c9d3976c41076d70a906/html5/thumbnails/125.jpg)
doi: 10.1016/S0022-2836(05)80360-2. url: http://www.sciencedirect.com/science/article/pii/S0022283605803602 (visited on 12/15/2016).
[181] T. Laver, J. Harrison, P. A. O’Neill, et al. “Assessing the performance ofthe Oxford Nanopore Technologies MinION”. In: Biomolecular Detection andQuantification 3 (Mar. 2015), pp. 1–8. doi: 10.1016/j.bdq.2015.02.001. url:http://www.sciencedirect.com/science/article/pii/S2214753515000224 (visitedon 03/15/2016).
[182] Martin Ester, Hans-Peter Kriegel, Jörg Sander, et al. “A density-based algorithmfor discovering clusters in large spatial databases with noise.” In: Kdd. Vol. 96.1996, pp. 226–231.
[183] Philip Brennecke, Simon Anders, Jong Kyoung Kim, et al. “Accounting fortechnical noise in single-cell RNA-seq experiments”. In: Nature Methods 10.11(Nov. 2013), pp. 1093–1095. doi: 10.1038/nmeth.2645. url: http://www.nature . com/nmeth/ journal/v10/n11/ full/nmeth .2645 .html (visited on03/27/2015).
105