accurate mutation detection in cancer through improved ......1.1 sources and alterations of...

Accurate Mutation Detection in Cancer Through ImprovedError Suppression and Quantification of Molecular

Complexity

by

Ting Ting Wang

A thesis submitted in conformity with the requirementsfor the degree of Master of ScienceDepartment of Medical Biophysics

University of Toronto

c© Copyright 2018 by Ting Ting Wang

Abstract

Accurate Mutation Detection in Cancer Through Improved Error Suppression and

Quantification of Molecular Complexity

Ting Ting Wang

Master of Science

Department of Medical Biophysics

University of Toronto

2018

DNA from cancer cells is often masked by nucleic acids from normal cells in tumour tissues

and blood plasma, resulting in low allele frequency mutations. Next-generation sequenc-

ing has enabled detection of somatic alterations, however it is limited by sequence errors.

A promising strategy to suppress errors utilizes unique molecular identifiers (UMIs) to

amalgamate reads from the same molecule into a consensus sequence. While current

methods depend on redundant sequencing of template molecules, we have developed an

error suppression strategy that only requires single reads. We expanded UMIs with se-

quence properties to enhance molecular quantification, which reduced single-strand con-

sensus errors up to 70% for deep sequencing. Additionally, we suppressed errors down to

1 x 10-4% in a quarter of single reads with moderate sequencing coverage. Using com-

plementary strands of single reads, we improved recovery of double-strand molecules by

5-fold. Together, our method enabled sensitive mutation detection below 0.1% frequency.

ii

Acknowledgements

First and foremost, I would like to express my deepest gratitude to my supervisor,

Dr. Trevor Pugh, for his mentorship and insightful expertise over these past two years.

He has provided me with an abundance of opportunities and exposure to clinical and

translational research. His impeccable vision, optimism, and collaborative nature are

qualities to which I aspire. I would like to extend my sincerest gratitude to my co-

supervisor, Dr. Scott Bratman, who has been an integral component to this body of

work and as well to my own personal development. I am grateful for his kindness and

support through the toughest of times. His extensive knowledge in this field has been

central to the progression of my research. I have been privileged to learn from two

outstanding scientists and I would like to thank them for their guidance throughout my

graduate journey.

I would also like to thank members of my supervisory committee, Dr. Mathieu

Lupien and Dr. Sean Cleary, for their scientific support, feedback, and insightful career

discussions. Thank you to Dr. Sagi Abelson for technical suggestions and aid in the

development of the method presented in this body of work. I would also like to thank

members of the Princess Margaret Genomics Centre (Nicholas Khuu, Julissa Tsao, and

Zhibin Lu) for their assistance with sequencing and raw data processing. I am extremely

grateful to members of the Pugh lab and the Bratman lab, both past and present. Thank

you to Dr. Jeff Bruce, Arnavaz Danesh, Marco DiGrappa, and Rene Quevedo for their

constant computational support, as well as Zhen Zhao, Dr. Tiantian Li, and Iulia Cirlan

for sample preparation and wet lab expertise. I deeply appreciate the companionship of

the truly amazing people I am surrounded by daily, as they have made each and every

day a delight.

Finally and most importantly, I want to thank my family and friends for accompanying

me through my graduate pursuit. To mom and dad, thank you for your never-ending

encouragement and unconditional love and support for everything that I do. To my

friends, thank you for challenging me to embrace new experiences and inspiring me with

your compassion. In particular, I want to thank Ingrid Kao and Parasvi Patel for building

with me life-long friendships and Tommy Hu for always bringing a smile to my face. I am

extremely grateful to have shared so many memorable moments with all of you. Thank

you.

iii

Contents

Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

List of Terms and Abbreviations . . . . . . . . . . . . . . . . . . . . . . . xi

Glossary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

1 Introduction 1

1.1 Next-generation sequencing in oncology . . . . . . . . . . . . . . . . . . . 1

1.1.1 Challenges with tissue biopsy . . . . . . . . . . . . . . . . . . . . 2

1.1.2 Overview of liquid biopsy . . . . . . . . . . . . . . . . . . . . . . 2

1.1.3 Origins and characteristics of circulating tumour DNA . . . . . . 3

1.2 Clinical implications of circulating tumour DNA . . . . . . . . . . . . . . 6

1.2.1 Earlier diagnosis of disease . . . . . . . . . . . . . . . . . . . . . . 7

1.2.2 Minimal residual disease and recurrence monitoring . . . . . . . . 8

1.2.3 Evaluating patient response to treatment . . . . . . . . . . . . . . 9

1.3 Challenges with low-allele-frequency variant

detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.3.1 Pre-analytical barriers . . . . . . . . . . . . . . . . . . . . . . . . 10

1.3.2 Technology platforms for rare variant analysis . . . . . . . . . . . 11

1.3.3 Library complexity and artefactual errors . . . . . . . . . . . . . . 12

1.3.4 Molecular quantification and error suppression strategies . . . . . 15

1.4 Rationale and hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.5 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Materials and Methods 17

2.1 Target panel design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.2 Next-generation sequencing library preparation . . . . . . . . . . . . . . 17

2.3 Target capture and sequencing . . . . . . . . . . . . . . . . . . . . . . . . 18

2.4 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.5 Molecular barcode analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.5.1 Unique molecular identifiers . . . . . . . . . . . . . . . . . . . . . 19

2.5.2 Single strand consensus sequences . . . . . . . . . . . . . . . . . . 21

2.5.3 Singleton rescue (SR) . . . . . . . . . . . . . . . . . . . . . . . . . 24

iv

2.5.4 Duplex Consensus Sequences . . . . . . . . . . . . . . . . . . . . . 25

2.6 Evaluation of consensus sequence formation . . . . . . . . . . . . . . . . 25

2.7 Error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.8 in silico downsampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.9 Mutation analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3 Results 28

3.1 Sequence properties can enhance barcode depiction of molecular diversity 28

3.2 Singleton rescue improves efficiency of consensus formation . . . . . . . . 30

3.3 Singleton rescue achieves comparable rates of error suppression as duplex

correction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.4 Oxidative damage is effectively suppressed by singleton rescue . . . . . . 34

3.5 Duplex molecular recovery is maximized through singleton rescue . . . . 35

3.6 Singleton enabled error suppression improves

lower limit of mutation detection . . . . . . . . . . . . . . . . . . . . . . 39

4 Discussion 46

4.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.2 Potential impact and applications . . . . . . . . . . . . . . . . . . . . . . 47

4.3 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

Bibliography 50

v

List of Tables

2.1 Summary of library preparation specifications for each cell line dilution. . 18

vi

List of Figures

1.1 Sources and alterations of cell-free DNA. cfDNA can be derived from a

multitude of cells and tissue types: tumour, stroma of the tumour mi-

croenvironment, immune, and other tissues or organs. Cells can release

cfDNA through apoptosis, necrosis, and secretion. Genetic and epigenetic

alterations can be used to infer tumour-derived cell-free DNA. Reprinted

by permission from Macmillan Publishers Ltd: Nature Reviews Cancer

[1], 2017. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2 ctDNA applications for disease management. Clinical applications of ctDNA

analysis in a hypothetical patient throughout course of disease. ctDNA

could enable earlier diagnosis and molecular profiling of tumours prior

to metastatic spread. Localized cancer is more amenable to surgery and

treatments with curative intent. Detection of minimal residual disease

post-operation could inform of relapse and neo-adjuvant therapy can be

monitored with serial liquid biopsies to elucidate clonal evolution. This

schematic illustrates the development of a single tumour clone that di-

verges into multiple distinct entities as a result of selective pressures from

cancer progression and drug therapies. Reprinted by permission from

Macmillan Publishers Ltd: Nature Reviews Cancer [1], 2017. . . . . . . . 6

1.3 Stages impacting molecular complexity and artefactual errors. a) Detec-

tion of ctDNA is impacted by molecular complexity and is limited by level

of background errors. While adapter ligation and target capture may im-

pact molecular complexity, the lower limit of detection is determined by the

rate of artefactual errors. These sequence mistakes may arise from poly-

merase errors, sequencing blunders, or oxidative damage during sample

extraction and target capture. b) Raw sequence reads are not represen-

tative of true molecular diversity due to PCR duplicates. De-duplication

using genome coordinates is not an accurate depiction either as indepen-

dent molecules may map to the same position. UMIs enable more accurate

quantification of molecular complexity and can enable sequence error sup-

pression by correcting artefacts through consensus formation. Although

traditional barcoding methods rely on duplicate reads, we have developed

a strategy to enable error suppression in single reads. . . . . . . . . . . . 14

vii

2.1 UMIs: A barcode family is defined by barcodes and sequence features that

enable unique identification of an individual molecule. Our 1) 4-bp barcode

allows for 256 different molecules at each 2) genomic position. In addition,

3) CIGAR strings can provide additional identifier diversity as it encodes

sequence variation to the reference genome highlighting distinct attributes

such as clipped regions, insertions, or deletions. Furthermore, read 4) ori-

entation and 5) number can differentiate palindromic barcodes and provide

proper molecular grouping of reads belonging to different strands. R1 =

Read 1 and R2 = Read 2 of paired-end reads. . . . . . . . . . . . . . . . 20

2.2 Source of CIGAR: When DNA sequencing reads (grey bars) are aligned

to a reference sequence, specific genome coordinates and mismatches to

the reference (CIGAR strings) are noted in the sequence alignment file.

CIGAR strings encode information about fragment clipping (S, soft clipped

bases in 5’ or 3’ end of the read that are not part of the alignment), as well

as sequence concordance with the reference genome (M, match/mismatch

bases), insertions (I), and deletions (D). . . . . . . . . . . . . . . . . . . . 21

2.3 Duplex sequencing schematic: An uncollapsed BAM file is first processed

through SSCS maker.py to create an error suppressed single strand con-

sensus sequences (SSCS) BAM file and an uncorrected Singleton BAM

file. The single reads can be recovered through singleton strand rescue.py,

which salvages singletons with its complementary SSCS or singleton. SSCS

reads can be directly made into duplex consensus sequences (DCS) or

merged with rescued singletons to create an expanded pool of DCS reads

(Figure illustrates singleton rescue merged work flow). . . . . . . . . . . . 22

3.1 Impact of alignment information on molecular complexity and error sup-

pression. a) Fraction of reads with alternative alignment to the reference

genome in SSCS and singletons across two cell line dilutions. Error bars

represent standard error between samples. b) CIGAR strings are indica-

tive of alignment as they encapsulate information about base concordance,

fragment clipping, insertions, and deletions. Scatter plots show change in

singletons and SSCS error rate with CIGAR embedded identifiers com-

pared to reads derived without alignment differentiation. Left: a large

panel sequenced to moderate depth (LgMid, n = 12). Right: a small

panel with deep sequencing (SmDeep, n = 8). . . . . . . . . . . . . . . . 29

viii

3.2 Dual strategies of single read error suppression: Singleton rescue by SSCS

starts with collapsing duplicate reads to form SSCS, which are subse-

quently used to correct the reverse complement singleton. Additionally,

singletons can be rescued by other singletons through complementary

strand error suppression. . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.3 Proportion of reads with duplicate support eligible for consensus-based er-

ror suppression. Singleton rescue (SR) enables additional error correction

of singletons through rescue strategies using complementary SSCS or sin-

gletons. The left shows the average read distribution of a large panel with

sequencing to moderate depth (LgMid, n = 12), while the right shows a

summary of reads from a small panel with deep sequencing (SmDeep, n =

8). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.4 Error suppression performance at each stage of error correction between

sequencing with moderate (LgMid) and deep (SmDeep) coverage. Back-

ground error determined by non-reference bases at allele frequencies below

5%. Error bars represents standard error across all samples. . . . . . . . 33

3.5 Duplex strategies of error correction suppresses oxidative damage. a)

Schematic of oxidative damaging induced 8-oxoguanine (8-oxoG) asym-

metric strand lesions in DNA. 8-oxoG arises from oxidation and converts

to thymine after PCR amplification. As only plus strands are captured

with our target panel comprised of negative probes, there is a bias for C>A

errors. b) Comparison of 12 substitution classes between uncorrected and

error suppressed sequences in LgMid (n=12). c-g) Ratio of errors be-

tween reciprocal base substitutions. Imbalances between C>A/G>T is

indicative of oxidative damage. . . . . . . . . . . . . . . . . . . . . . . . 36

ix

3.6 Singleton rescue impact on consensus formation across a wide coverage

range (500x to 128,000x). Results are shown for LgMid and SmDeep cell

line dilutions that were down-sampled by half at each interval step. Error

bars represents standard error of all the samples across ten simulations.

Mean target coverage is log10 transformed to show trends at lower range.

a) Proportion of singletons rescued through complementary SSCS and

singleton correction. b) Efficiency of SSCS formation as determined by

number of SSCS / total uncollapsed reads. c) DCS efficiency as calculated

by number of DCS / total uncollapsed reads d) DCS recovery rate of

observed / expected DCS. Expected DCS corresponds to half of all SSCS

reads as two single molecules should theoretically form a DCS. Efficiency

is an assessment of over-sequencing relative to unique input molecules,

whereas recovery is an estimate of molecular retrieval. . . . . . . . . . . . 37

3.7 SmDeep: three HCT116 mutations identified through deep sequencing of

a 5 gene panel across dilutions from 100% to 10-4%. Every base at each

position was called to illustrate the frequency of background errors present.

Top panel compares limit of detection with standard uncollapsed reads,

middle panel represents single strand consensus sequences, and bottom

panel highlights duplex consensus sequences. . . . . . . . . . . . . . . . . 42

3.8 LgMid: Five MOLM13 mutations identified through sequencing to moder-

ate depth across dilutions from 100% to 0.04%. Every base at each position

was called to determine frequency of known mutations and as well to il-

lustrate the frequency of background errors. Each row corresponds to an

error suppression method and each column represents a different mutation. 43

3.9 Summary trends of SmDeep mutations across different error suppression

strategies. Comparison of detection limit using a) allele frequency and b)

reads supporting each variant. . . . . . . . . . . . . . . . . . . . . . . . . 44

3.10 Summary trends of LgMid mutations across different error suppression

strategies. Comparison of detection limit using a) allele frequency and b)

reads supporting each variant. . . . . . . . . . . . . . . . . . . . . . . . . 45

x

List of Terms and Abbreviations

8-oxoG 8-oxoguanine lesions

bp Base pairs

CCLE Cancer Cell Line Encyclopedia

CIGAR Concise Idiosyncratic Gapped Alignment Report

ctDNA Circulating tumour DNA

cfDNA Cell-free DNA

CTC Circulating tumour cell

CDX CTC-derived xenograft

DCS Duplex / double-strand consensus sequence

DNA Deoxyribonucleic acid

dPCR Digital PCR

EDTA Ethylenediaminetetraacetic acid

LLOD Lower limit of detection

MAPK Mitogen-activated protein kinase

miRNA MicroRNA

MRD Minimal residual disease

NGS Next-generation sequencing

NSCLC Non-small cell lung cancer

PCR Polymerase chain reaction

qPCR Quantitative PCR

SR Singleton rescue

SSCS Single strand consensus sequence

xi

TKI Tyrosine kinase inhibitor

UMI Unique molecular identifier

WBC White blood cells

WGS Whole genome sequencing

xii

Glossary

Allele frequency Frequency of an allele at a particular locus or position.

CIGAR Sequence alignment information relative to the reference genome.

Duplex consensus sequence Consensus sequence derived from complementary

single-strand consensus sequences of a molecule.

Mean target coverage Average number of reads per position within a target region.

Molecular diversity Unique DNA molecules within a sample, inferred by

bioinformatic removal of PCR duplicates.

Molecular recovery Proportion of original inferred molecular diversity retained after

the process of library preparation, sequencing, and bioinformatic analysis.

Molecular saturation The point of sequencing beyond which additional sequencing

does not result in increased molecular diversity or recovery.

Read String of consecutive bases from a single molecule of DNA produced by

next-generation sequencing.

Singleton Molecules that have been sequenced only once (i.e., without PCR

duplicates in the sequence data).

Singleton rescue Single read error correction through consensus formation with the

complementary strand.

Single strand consensus sequence Consensus of all reads derived from the same

strand of a template DNA molecule.

Uncollapsed reads Mapped reads that have not been subjected to de-duplication,

error suppression, or other analysis related to molecular identity using UMIs

or barcodes.

Unique molecular identifier A series of bases or barcodes added to a molecule

either through ligation or amplification. After sequencing, these barcodes can

be used to identify which original template DNA molecule a particular read is

derived from.

xiii

Chapter 1

Introduction

1.1 Next-generation sequencing in oncology

Cancer arises through an accumulation of molecular alterations affecting cellular growth

and survival. Cells become malignant as they undergo continuous proliferation, evasion

of growth suppressors and cell death, induction of angiogenesis, and initiation of tumour

invasion and metastasis [2]. This transformation is the result of combined destabilization

of many important cellular processes. As cancer evolves dynamically with development

of disease, tumours become heterogeneous habouring a diverse array of cells with dis-

tinct molecular profiles. Pressures applied by treatment can select for cancer cells with

specific genome alterations that confer a growth advantage. Despite acquiring similar

advantageous phenotypes, these alterations can differ between tumours, highlighting the

need for comprehensive molecular profiling, including massively parallel next-generation

sequencing of tumour DNA[3].

Diverse molecular alterations can propagate within a single tumour, between multiple

tumour sites within a patient, or across multiple patients. Inter-tumour heterogeneity is

the variability between patients with distinct molecular and histological profiles. Patients

can be classified into subgroups based on these different features to improve prediction of

prognosis and response to therapy. Intra-tumour heterogeneity refers to the variability

between tumour cells of an individual. This can manifest as spatial diversity across

different disease sites or within a tumour, as well as temporal variations in the molecular

composition of cancer cells over time [4, 5]. Selective pressure from treatment may confer

resistance through expansion of subclonal populations or evolution of drug tolerant cells.

While heterogeneity allows tumours to overcome evolutionary pressures, it can also be

exploited for therapeutic targets. Characterizing the molecular landscape of these mixed

cell populations is essential in guiding precision medicine to improve patient response to

therapy.

Next generation sequencing (NGS) has revolutionized the understanding of genetics

by enabling broader characterization of the genome. NGS has had profound implications

in cancer by uncovering alterations responsible for the development of disease. Single nu-

cleotide and structural variations with high allele frequencies can be profiled using NGS

1

2

for molecular characterization of tumour heterogeneity. However, detection of subclonal

mutations at lower frequencies (<1-2%) has been limited by technical artefacts and se-

quencer errors [6]. Whole genome sequencing (WGS) identifies a range of alterations

including copy number aberrations, nucleotide substitutions, and structural rearrange-

ments. This can enable discovery of novel chromosomal rearrangements and somatic

mutations of non-coding regions that were previously unobserved with other methods.

While WGS provides the most comprehensive profiling of the genome, it is burdened

by the high cost associated with the massive amount of sequencing it requires. Typically,

90 Gb of sequence data is needed to cover 3 billion bases of the human genome with

an average of 30x coverage per base [7]. As only 1% of the human genome is comprised

of coding exons of genes, whole exome sequencing can achieve 100-fold greater coverage

(3,000X) using the same number of reads. Targeted sequencing allows greater coverage

across regions of interest at a lower cost. Under the same scenario, a target panel of 0.3

MB can be sequenced 300,000x per position with the same number of reads as WGS.

Thus, deep molecular profiling with targeted sequencing has the potential to enable detec-

tion of low-allele-frequency mutations despite admixtures of tumour and non-malignant

cells.

1.1.1 Challenges with tissue biopsy

Heterogeneity confounds identification of biomarkers indicative of response, which is criti-

cal for patient selection and the success of targeted therapies. At the present time, patient

profiling is primarily achieved using a single tissue resection or biopsy sample. As an in-

dividual tumour sample is not representative of diversity within primary or metastatic

lesions, this greatly confounds patient stratification for personalized medicine. In addi-

tion, somatic mutations can be obscured by non-malignant cells within the tissue. Clonal

populations in one biopsy may be subclonal or absent in subsequent tumour sampling.

These invasive sampling procedures are subject to complications and are bound by the

accessibility of the tumour [8]. Moreover, the impracticality of serial tissue biopsies pre-

clude the possibility for multi-site genotyping or longitudinal profiling. There is a clinical

need for new molecular and diagnostic tools as current methodologies only provide a lim-

ited scope on the complexity of cancer.

1.1.2 Overview of liquid biopsy

Liquid biopsy has the potential to transform cancer management by overcoming barriers

inherent to conventional biomarkers [9]. Cells, exosomes, and nucleic acids found in the

3

blood or other bodily fluids enable a non-invasive holistic view of cancer clonal popula-

tions [10, 11]. Circulating tumour cells (CTCs) shed from primary or metastatic lesions

can be enriched through size selection or surface markers. These cells can then be ex-

panded and characterized to facilitate the development of personalized cancer medicine.

Although challenges exist with culturing CTC cell lines, CTC-derived xenograft (CDX)

models have been shown to mirror donor response to chemotherapy [12]. As CDXs share

similar genomic profiles to their donors, new drug targets could potentially be identified

by comparing models derived before and after treatment resistance. However, CTCs are

found at a frequency of approximately 1 tumour cell per billion blood cells in metastatic

cancer patients. While the low frequency and the time required to establish in vivo

models impede its immediate clinical utility, CTCs still have significant implications for

research [13].

In addition to intact cells, extracellular vesicles important for intercellular communi-

cation can be isolated through centrifugation or marker selection. Their RNA contents

can be sequenced for cancer-specific gene expression profiles [14]. This approach may

be useful for patients with limited quantities of detectable circulating tumour DNA, as

mRNA from highly expressed genes are of greater abundance relative to the two copies

present in circulating tumour DNA. While cell-free RNA can also be detected in the

circulation, mRNA are of low abundance due to plasma nucleases. MicroRNA (miRNA)

are more stable as they are protected by the argonaute-2 protein complex or encapsulated

in extracellular vesicles [15]. As miRNA transcripts differ between cancer and healthy

individuals, they may be potential candidates for disease biomarkers [16]. Although

CTCs and exosomes provide insight on the molecular architecture of cancer populations,

their utility is hindered by their limited availability and poor scalability to be translated

into the clinic [17]. Similarly, applications of cell-free RNA are also controversial as it

is unclear whether they are of tumour origins [9]. For these reasons, among the various

potential analyses for liquid biopsy applications, circulating DNA may hold the greatest

promise for providing clinical impact in oncology.

1.1.3 Origins and characteristics of circulating tumour DNA

Although the presence of cell-free nucleic acid was first described more than half a century

ago [18], observations in oncology were not reported until decades later and its impli-

cations for cancer patients were not extensively explored until recently [19]. With the

advent of next-generation sequencing, genetic and epigenetic alterations in the tumour

can be identified in the mutant DNA fragments of the peripheral blood. Aside from the

4

Figure 1.1: Sources and alterations of cell-free DNA. cfDNA can be derived from amultitude of cells and tissue types: tumour, stroma of the tumour microenvironment,immune, and other tissues or organs. Cells can release cfDNA through apoptosis, necrosis,and secretion. Genetic and epigenetic alterations can be used to infer tumour-derivedcell-free DNA. Reprinted by permission from Macmillan Publishers Ltd: Nature ReviewsCancer [1], 2017.

blood, circulating tumour DNA (ctDNA) is also evident in the cerebrospinal fluid for

central nervous system cancers and pleural effusion fluids for respiratory system cancers.

For truly non-invasive interrogation, ctDNA can be collected from the stool for colorectal

cancers, urine for bladder cancers, and saliva for head and neck cancers [9]. Concentra-

tion of ctDNA may be higher in bodily fluids in greater proximity to the tumour, however

detection in the blood has the broadest clinical applications across multiple cancers.

Cell-free DNA (cfDNA) can be derived from a multitude of cells and tissue types:

tumour, stroma of the tumour microenvironment, immune, and other tissues or organs.

While the mechanism of cfDNA origins is unknown, fragments are suggested to have

been released into the circulation via apoptosis, necrosis, and secretion (Figure 1.1).

cfDNA has a characteristic peak (166 bp) corresponding to the length of nuclease-cleaved

DNA from apoptotic degradation. This length encompasses the span of DNA wrapped

around histones (˜147 bp) and linked between nucleosome cores [20]. Using long-read

sequencing technology, large cfDNA fragments (>1,000 bp) have also been identified and

may be associated with exosomal or necrotic release [1]. Interestingly, cfDNA from non-

mutant cells are longer than those that are tumour derived (˜166 bp vs. 132-145 bp)

5

[21]. This is also observed in animal xenografts, where tumour fragments can be easily

distinguished by their human origin from background rat cfDNA. Differences in length

between mutant and wild-type cell-free molecules is potentially caused by variability in

nucleosome wrapping. This distinction between DNA length could be exploited for more

efficient isolation of tumour fragments.

Genome-wide sequencing of cfDNA can be used to infer nucleosome occupancy or

methylation profiles enabling determination of cell-types of origin [22, 23]. Germ line

cfDNA from non-mutant cells represent the majority of cell-free molecules and they are

primarily derived from haematopoeitic cells in healthy individuals [23]. Despite represent-

ing only a small fraction of total cfDNA, ctDNA abundance in the plasma is prognostic

of disease progression. While ctDNA can be found ranging below 0.1% to above 10%,

it has been shown to correlate with cancer stage and tumour size [24, 25]. Notably, the

quantity of cell-free molecules can also increase with exercise, infection, pregnancy, or

smoking [23, 9, 1]. Although cfDNA is continuously shed, their half-life ranges between

16 minutes and 2.5 hours with clearance from the circulation via kidneys, liver, spleen,

and circulating nucleases [1]. This constant overturn of cfDNA enables more dynamic

monitoring of patient response than other blood-based biomarkers [26].

In recent studies, ctDNA has demonstrated clinical utility with capabilities to reca-

pitulate both spatial and temporal heterogeneity [9]. With continuous release and rapid

clearance of circulating cfDNA, ctDNA can provide a real-time snapshot of tumour dis-

ease burden. ctDNA is shed by both primary and metastatic lesions and its abundance

has been shown to correlate with cancer stage and tumour size [25]. In a study with

640 patients across various stages and cancer types (breast, colorectal, gastroesophageal,

pancreatic), ctDNA was detected at a frequency of fewer than 10 copies per 5ml of

plasma in early-stage patients. In contrast, ctDNA was found at a median concentration

of 100-fold higher in advanced stage malignancies [24]. Analysis of ctDNA can provide

earlier indications of relapse and out perform conventional blood-based assays [27, 26].

The presence of these tumour-derived DNA fragments containing somatic mutations after

therapy is indicative of recurrence and can be detected 5 months prior to radiographic

imaging [26]. In turn, serial blood sampling can elucidate clonal evolution and permit

dynamic longitudinal disease monitoring [28]. In summary, ctDNA has significant utility

in oncology from earlier diagnosis to monitoring recurrent disease.

6

1.2 Clinical implications of circulating tumour DNA

Unlike tissue biopsies subject to complications and sampling biases, liquid biopsies pro-

vide a safer approach for comprehensive cancer profiling. The lower limit of detection

(LLOD) for ctDNA is the threshold to which we can confidently discern mutations from

background germ line cfDNA. Sensitivity of detection is measured by proportion of pos-

itive mutations correctly identified, whereas specificity measures the proportion of neg-

atives correctly identified. ctDNA abundance is often influenced by disease burden and

extent of tumour genomic characterization, which guides mutational search and governs

scope of detection. With NGS technology, the LLOD of ctDNA analysis is dictated by

the depth of sequencing as well as the level of technical artefacts from PCR or sequenc-

ing errors. This has significant implications in oncology where ctDNA abundance can

be used to monitor patient response, while presence of ctDNA can be used for earlier

diagnosis or identification of recurrent disease (Figure 1.2).

Figure 1.2: ctDNA applications for disease management. Clinical applications ofctDNA analysis in a hypothetical patient throughout course of disease. ctDNA couldenable earlier diagnosis and molecular profiling of tumours prior to metastatic spread.Localized cancer is more amenable to surgery and treatments with curative intent. Detec-tion of minimal residual disease post-operation could inform of relapse and neo-adjuvanttherapy can be monitored with serial liquid biopsies to elucidate clonal evolution. Thisschematic illustrates the development of a single tumour clone that diverges into multipledistinct entities as a result of selective pressures from cancer progression and drug ther-apies. Reprinted by permission from Macmillan Publishers Ltd: Nature Reviews Cancer[1], 2017.

7

1.2.1 Earlier diagnosis of disease

Tumour material obtained from the blood can be used as a marker of disease for early

diagnosis. Dawson et al. have shown that mutations in ctDNA have greater correla-

tion with changes in tumour burden than other blood-based markers, including protein

antigens or CTCs [26]. Recently, evaluation of ctDNA in 200 patients with early-stage

cancers found mutations in the plasma of 71% of colorectal, 68% of ovarian, 59% of lung,

and 59% of breast cancer patients with a minimum threshold of 0.1% allele frequency

[27]. Interestingly, presence of ctDNA can also be detected in healthy individuals [29, 27],

some of whom progress to develop cancer at a later time [30]. While mutations were de-

tectable in plasma on average 18 months before diagnosis, caution should be used when

screening healthy individuals [30, 31]. Mutations in genes commonly linked to cancer,

such as TP53, have been identified at low frequencies in the plasma of healthy individuals

[32].

As cancer arises from accumulation of alterations, pre-cancerous mutations may be

present years before onset of clinical symptoms [33]. Recurrent leukemic mutations ac-

cumulate in peripheral blood cells of healthy individuals with increased age. Somatic

mutations in genes associated with leukemia have been observed in 1% of people under

50 years of age. Although the prevalence of these alterations increased to 10% in people

over 65 years of age, the absolute risk conversion into hematologic cancer was 1% per

year as not all mutations led to the development of disease [34]. To distinguish individ-

uals at high risk of developing leukemia, pre-leukemic mutations could be evaluated in

peripheral blood of individuals prior to diagnosis of disease. As early mutations are likely

low in frequency, a sensitive detection method is required to screen these rare variants.

Since not all mutations in white blood cells (WBC) lead to cancer progression, it is im-

portant to distinguish tumour-derived alterations from mutated cfDNA of non-malignant

cells. In a study of 821 healthy individuals, there was a high correlation between somatic

variants detected between WBC and cfDNA. This warrants caution when searching for

low frequency tumour fragments in plasma, as 90% of somatic mutations detected in

cfDNA have been identified to be derived from blood cells [35]. Therefore, it is im-

portant to use patient matched WBC to differentiate ctDNA from background somatic

mutations. To prevent over diagnosis, ctDNA screening could be performed in symp-

tomatic individuals or those at high risk due to germ line genetic carrier status and

family history. Early detection prior to metastatic spread can improve patient outcomes,

as localized cancer is more amenable to curative therapy.

8

1.2.2 Minimal residual disease and recurrence monitoring

Following surgery or treatments with curative intent, there may be minimal residual

disease (MRD) that is radiologically undetectable. Currently, disease burden is monitored

using blood-based markers that lack specificity or imaging which may expose patients to

radiation. The paradigm of using ctDNA to detect MRD remains understudied but has

shown early promise in certain clinical scenarios for outperforming standard monitoring

methods [26, 36]. In a landmark paper, Diehl et al. showed ctDNA as a highly specific

marker of recurrence with 100% sensitivity compared to 56% with the standard protein

biomarker for tracking disease in colorectal cancer, carcinoembryonic antigen. In patients

with detectable ctDNA post-surgery, they observed relapse within one year [37]. In a

separate cohort of 230 resected stage II colorectal cancer patients, all of the patients with

presence of ctDNA relapsed. On the contrary, 91% of patients with absence of ctDNA

had recurrence-free survival [36]. Similarly in a study of 55 patients with early-stage

breast cancer receiving neo-adjuvant chemotherapy, serial sampling improved recurrence

predictions estimating relapse by a median of 7.9 months prior to clinical observation

[28]. Recently in a cohort of 40 patients across stages I-III of lung cancer, ctDNA was

detectable in the first post-treatment blood sample of 94% of patients with recurrence.

ctDNA identification surpassed radiological monitoring in 72% of patients by a median of

5 months [38]. Together, these studies suggest ctDNA is a reliable marker of recurrence

with prognostic predictive value months in advance of imaging.

Liquid biopsies could offer an alternative to invasive procedures for MRD detection

such as bone marrow aspirates. While the majority of patients with multiple myeloma

achieve complete response to treatment, MRD is prognostic of relapse. Currently, the

standard method for MRD testing is flow cytometry where malignant plasma cells are

enriched from bone marrow samples and detected using fluorescent markers. MRD by

flow cytometry is determined by the detection of 1 myeloma cell amongst 106 bone

marrow cells; however, flow sorting of a single biopsy is subject to sampling biases [39].

Recently, our group has demonstrated the utility of a hybrid-capture based liquid biopsy

sequencing method for complementary molecular profiling of bone marrow in multiple

myeloma. We detected 96% of somatic mutations in cfDNA identified from matched

bone-marrow samples with 98% specificity. While we found ctDNA at allele frequencies

as low as 0.25%, current ctDNA sequencing methods are inadequate for sensitive MRD

detection due to the high prevalence of sequence errors [40]. New molecular methods are

needed to suppress artifacts for real-time patient monitoring with serial liquid biopsies.

Moreover, earlier detection of MRD can be used as patient selection criteria for adjuvant

therapy and raises the possibility of reducing over-treatment in disease free individuals.

9

1.2.3 Evaluating patient response to treatment

As it is not feasible nor desirable to perform serial tissue biopsies, non-invasive liquid

biopsies provide a good alternative to characterize patient response to treatment. A

considerable amount of effort has been devoted to identifying mutations in the plasma of

difficult-to-biopsy cancers. In non-small cell lung cancer (NSCLC), there is a reduction in

allele frequency observed in 92% of ctDNA mutations within two days after surgery [41].

Serial plasma biopsies post-surgery have enabled phylogenetic characterization of relapse

in early-stage lung cancer patients following adjuvant chemotherapy [11]. As surgery is

not accessible to all patients, epidermal growth factor receptor (EGFR) tyrosine kinase

inhibitors (TKI) are used as first-line of treatment. Following the first cycle of targeted

therapy, resistance occurs in the majority of patients through secondary mutations in the

EGFR kinase domain. Plasma assays offer as an alternative for patients who warrant

tissue re-biopsies for confirmation of therapeutic resistance drivers, such as the EGFR

T790M mutation. While third-generation EGFR inhibitors have been developed to target

these resistance mutations [42], ctDNA analyses have revealed tumour heterogeneity leads

to further drug tolerance through selection for T790M wild-type cells [43].

Beyond lung cancer, ctDNA profiling has revealed clonal evolution and resistance to

EGFR blockade in colorectal cancer. KRAS mutants emerged with introduction of anti-

EGFR antibodies and subdued with their withdrawal. Fluctuating mutation frequencies

in ctDNA from multiple challenges of EGFR blockade resulted in partial response, pro-

viding molecular evidence for rechallenge therapies [10]. Similarly, clonal evolution in the

plasma samples of a patient with metastatic breast cancer reflected the clonal evolution

inferred from multi-regional tumour tissues [44]. These studies have demonstrated that

clonal mutations occurring early in tumour development are ideal candidates for moni-

toring tumour burden in ctDNA, whereas subclonal mutations arising later in phylongey

are better to inform treatment decisions. Currently, the US Food and Drug Administra-

tion and the European Medicines Agency has approved ctDNA analysis as a companion

diagnostic to aid with NSCLC patient selection for EGFR inhibitors [45, 46]. While these

approved tests enable sensitive detection of a few mutations, a challenge still remains for

simultaneous profiling of a large number of genomic regions from a limited sample of

blood.

10

1.3 Challenges with low-allele-frequency variant

detection

Detection of low-allele-frequency variants has wide implications in oncology from early

detection of disease to monitoring advanced-stage malignancies. These rare variant alle-

les are often confounded by tumour heterogeneity and contamination from normal cells

[23]. While this is a known issue for identification of subclones and elucidation of clonal

evolution, this also has impact on circulating tumour DNA (ctDNA) detection. In this

study, we have developed a method to improve low-allele-frequency variant detection

with the main intention of improving ctDNA analysis for liquid biopsies. From sample

collection to processing, pre-analytical barriers may diffuse ctDNA signals with noise

from germ line cfDNA. The detection limit of low abundance populations is dictated by

amount of input molecules, sequencing depth, and level of technical artefacts [47]. Errors

can arise as a result of oxidative damage, polymerase mistakes, or sequencer artifacts

during DNA extraction and target capture (Figure 1.3) [48, 49, 50]. Collectively, these

errors determine the threshold to which we can discern true genetic variants from false

positives.

1.3.1 Pre-analytical barriers

Abundance of tumour fragments in the plasma is influenced by disease burden and tu-

mour location. As such, cancers with low mutational burden or those stemming from

protected regions like the brain have limited quantities of ctDNA identified in the periph-

eral blood [24]. One of the major factors confounding ctDNA analysis is contamination of

germ line DNA from white blood cell lysis after blood draw. This can influence the limit

of detection as mutant allele frequencies can be below one mutant molecule per 10,000

wild-type molecules in advanced-stage solid malignancies (0.01%) [37, 32, 29]. Standard-

ization of sample processing is important for reproducibility and sensitive detection of

low-abundance ctDNA molecules. The state of blood following collection impacts the

quantity of cell-free DNA as coagulated serum samples can be subject to contamination

from lysed immune cells during the clotting process. To improve ctDNA signal, samples

should be stored in tubes coated with an anticoagulant, as plasma samples that are free

of clotting factors contain lower levels of germ line contamination. It is also important

to use an anticoagulant agent compatible with polymerase chain reaction (PCR), such as

ethylenediaminetetraacetic acid (EDTA), as other agents have been reported to interfere

with PCR efficiency [51].

11

To prevent contamination by cfDNA, plasma should be isolated through centrifuga-

tion within a few hours of blood draw. cfDNA inflation has been observed after 2 hours

of storage [51]. This can be problematic in clinical practice as collection sites are often

not equipped for immediate plasma isolation. In addition, sample storage conditions and

laboratory hours of operation may delay processing and compromise sample integrity.

One solution is to use collection tubes containing fixative agents to preserve cell stability

and prevent lysis, such as propriety fixatives employed by commercialized Streck tubes.

This has been shown to maintain cfDNA yield and conserve background mutational rates

for up to 5 days at room temperature [52]. After centrifugation, a layer of buffy coat

can be isolated and used as a source of matched germline DNA to determine somatic

mutations in cfDNA. The buffy coat is also the source for detecting blood-based cancers

present in the circulation, such as acute myeloid leukemia. In addition, there is a multi-

tude of extraction methods available that can impact cfDNA including affinity column,

filtration, magnetic bead, phenol-chloroform, and polymer based methods [53, 1]. As

there may be size differences between cfDNA and ctDNA, variability between fragment

size captured by each method may affect molecular recovery of ctDNA [21].

1.3.2 Technology platforms for rare variant analysis

Analysis of low frequency alleles can pertain to quantification of a single mutation to

evaluation of a whole genome. This is relevant for characterizing ctDNA, peripheral

blood mononuclear cells, and tissues. Using allele-specific primers for recurrent hot-spot

mutations, quantitative PCR (qPCR) can be used to detect ctDNA in the plasma of can-

cer patients. While some qPCR assays have been approved for clinical use to determine

treatment eligibility in NSCLC patients, this method has poor differentiation of low-

allele-frequency mutations with 0.05% limit of detection [1]. Subsequent development of

digital PCR (dPCR) has allowed for more accurate identification and absolute quantifi-

cation of rare mutant fragments. As molecules are partitioned into hundreds or millions

of parallel PCR reactions and quantified with sequence-specific fluorescent probes, dPCR

negates the impact of amplification inefficiencies inherent with conventional PCR [54].

While dPCR can achieve limits of detection below 0.001%, it is difficult to multiplex

or simultaneously quantify multiple sequences of interest as each target requires its own

unique primer. The possibility of primers hybridizing with one another restricts the

applicability of dPCR to at most 10 mutations for simultaneous interrogation [1, 55].

With the development of NGS technologies, a wider range of the genome can be exam-

ined at a higher throughput. While whole genome sequencing profiles the entire genome,

12

targeted sequencing uses panels containing genes or select gene regions for characteriza-

tion of a few exons to the entire exome. Smaller panels targeting fewer number of loci

allows for deeper sequencing at a lower cost. One configuration is amplicon-based assays

where targets are enriched through PCR amplification followed by deep sequencing for

detection of mutations across multiple loci [56]. Another approach is hybrid capture

where sequences of interest are hybridized to biotinylated probes and isolated using mag-

netic beads enabling interrogation of hundreds of genes [32]. Although amplicon-based

protocols require less input DNA and have higher on-target levels, they are prone to

amplification biases, high duplicate rates, and are unable to detect fusions without prior

knowledge of the genomic breakpoint. Hybrid capture on the other hand retains addi-

tional sequences surrounding regions of interest giving potential for discovery of novel

translocations and other structural variants [57].

Whole genome sequencing of cfDNA has identified cancer-specific chromosomal aber-

rations and rearrangements [58]. Recently, shallow whole genome sequencing (sWGS)

of cfDNA with 0.4x mean coverage found aberrations previously unseen in the tumour

biopsy [59]. This highlights the inadequacy of tissue biopsies and the capacity of cfDNA

to reflect spatial tumour heterogeneity. sWGS may offer a cost-effective alternative to

tissue copy number calling. Whole exome sequencing of cfDNA revealed high concor-

dance with tumour biopsies, however analysis was limited to patients with >10% cfDNA

tumour content [60]. Detection of low-allele-frequency mutations requires more sensitive

methods such as digital droplet PCR or targeted sequencing. Current commercial gene

panels have a mutation detection limit above 1% as they are restricted by background

errors inherent to the technology [61, 62]. As recurrent mutations are not found in all

patients, personalized cancer profiling with patient-specific panels in combination with

targeted sequencing can further enhance rare variant detection [32]. The threshold to

which mutations can be confidently called is restricted by background sequencer errors.

Recently, advancements in error suppression strategies have led to improved detection of

allele frequencies below 0.01% for targeted sequencing [29].

1.3.3 Library complexity and artefactual errors

Complexity of a sequence library is reflective of the molecular diversity within a sample

and mirrors the technical challenges throughout the preparation process. Diversity is

characterized by the unique molecules present in the source material. In contrast, molec-

ular recovery is the proportion of those original molecules retained through the process

of library construction, sequencing, and bioinformatic analysis (Figure 1.3a). During

13

sample isolation, contamination from exogenous DNA or germ line DNA can mask low-

allele-frequency oncogenic variants. This is problematic for analysis of rare fragments

such as subclones in mixed cell populations and ctDNA in peripheral blood plasma.

During library preparation, adapter ligation is critical for multiplexing a large number

of samples on a single sequence run. The adapter-to-DNA ratio is crucial as insufficient

adapters will lead to ligation inefficiencies and excessive adapters could result in adapter

dimers, both of which would reduce complexity [63]. While PCR is important for ampli-

fication, uneven coverage of molecules as result of GC biased regions reduces the fidelity

of the original diversity [64]. Molecular complexity is further impacted by hybridization

capture as only select molecules within the target range are retained. Probes for target

enrichment of genes are designed based on a reference sequence, thus sequences with

substantial deviations or insertions and deletions may not be captured as efficiently [63].

Together, these technical challenges impact molecular recovery, and a method for accu-

rate quantification is needed to characterize library complexity of limited input samples,

such as cfDNA.

Sequence errors and technical artefacts pose a major barrier for sensitive detection

as they mimic low-allele-frequency somatic variants. Since the limit of detection is de-

termined by the level of background errors, it is important to consider sources of bias

during experimental design (Figure 1.3a). Errors in DNA sequence can arise from sample

degradation during formalin fixed preservation of tissue. In addition, oxidative damage

during DNA extraction or target capture is a prevalent source of sequence errors con-

founding rare variant detection. 8-oxoguanine (8-oxoG) lesions during DNA shearing

or hybrid-capture enrichment results in characteristic imbalance between C>A / G >T

substitution errors. Through PCR amplification, oxidation induced 8-oxoG is corrected

to a thymine and paired with an adenine, resulting in an asymmetric artefact on one

strand of a molecule [65, 29, 49].

Errors can arise during PCR and sequencing. High fidelity DNA polymerases cause

1 error per 3.6 x 106 nucleotide incorporations [66]. Error rates inherent to distinct

NGS platforms range from 0.1-1% [67]. Sequencer artefacts can arise during cluster

amplification, cycle sequencing, or image analysis. While sequencing is a known source

of artefactual error, its impact on molecular complexity is less clearly defined. Altogether,

these background errors set the threshold to which we can discern true somatic variants

and effective error suppression strategies can lower our limit of detection.

14

Figure 1.3: Stages impacting molecular complexity and artefactual errors. a) Detec-tion of ctDNA is impacted by molecular complexity and is limited by level of backgrounderrors. While adapter ligation and target capture may impact molecular complexity, thelower limit of detection is determined by the rate of artefactual errors. These sequencemistakes may arise from polymerase errors, sequencing blunders, or oxidative damageduring sample extraction and target capture. b) Raw sequence reads are not represen-tative of true molecular diversity due to PCR duplicates. De-duplication using genomecoordinates is not an accurate depiction either as independent molecules may map to thesame position. UMIs enable more accurate quantification of molecular complexity andcan enable sequence error suppression by correcting artefacts through consensus forma-tion. Although traditional barcoding methods rely on duplicate reads, we have developeda strategy to enable error suppression in single reads.

15

1.3.4 Molecular quantification and error suppression strategies

Molecular complexity can be measured by the frequency of unique molecules present in

sequencing data. This requires removal of redundant sequencing reads from the same

template molecule through bioinformatic processing. One approach is to eliminate dupli-

cate copies sharing the same mapping coordinates to a reference genome. However, this

is an inaccurate measure as the chance of profiling independent molecules sharing the

same genome coordinate increases with sequencing depth (Figure 1.3b). Quantification

of molecular complexity using NGS can be improved with unique molecular identifiers

(UMIs). Each unique DNA molecule is labeled with a barcode prior to amplification. As

each barcode marks a different molecule, reads sharing the same UMI are inferred to result

from PCR duplication from the same original template molecule. Thus, quantification of

distinct UMIs can determine the absolute number of molecules [68]. In addition, UMIs

can be used to aggregate duplicate reads from the same fragment into a consolidated se-

quence with the most frequent base represented in each position. With these properties,

UMIs have been used in a wide range of applications from tagging unique transcripts in

single cell sequencing to error suppression in duplex sequencing [69, 70, 71, 29, 27].

A promising strategy to suppress artefactual errors is duplex sequencing. Barcodes

are ligated onto the ends of double-strand DNA fragments to track individual molecules

for quantification of true molecular complexity. Reads sharing the same UMI are derived

from the same strand of a unique fragment and can be collapsed to form single strand

consensus sequences (SSCS). This process removes PCR duplicates by establishing a con-

solidated consensus sequence representative of all duplicate reads. As the most common

base is identified at each position, this process eliminates miscalled bases caused by poly-

merase and sequencer errors. These condensed sequences can be subsequently combined

with their complementary strand into duplex consensus sequences (DCS). This additional

layer of error removal corrects for DNA damage on asymmetric strands accrued prior to

PCR, for example due to oxidative damage (Figure 1.3b). Only mutations present in

the consensus sequences of both strands of DNA are retained through this second step

of correction. While the duplex strategy has shown greater error suppression, it is an

inefficient process dictated by quantity of input molecules and sequencing depth [71, 29].

Barcode-based strategies were initially employed to improve counting accuracy of

unique molecules and later expanded to suppress PCR and sequencing errors on single-

strands of DNA [72, 73, 74, 68, 75]. In 2012, Schmitt et al. introduced a duplex strategy

using UMIs to track and correct errors within double-stranded DNA [70]. As they used

a long 12 base barcode on each adapter end, the applicability of their method was re-

stricted to small genomes due to costs associated with long barcodes. Succeeding their

16

initial work, they re-designed their adapters and demonstrated broader implementation of

the method with targeted sequencing [71]. Subsequently, Newman et al. highlighted the

use of short 2 base barcodes combined with genome coordinates as cost-effective UMIs for

ctDNA. In addition, they combined duplex sequencing with a background error polishing

strategy. While modeling position-specific errors using control samples enables further

error correction, the power of this method relies on additional sequencing of a group of

healthy individuals. Moreover, the use of short barcodes may suffer from UMI collisions

where two molecules could share the same identifier [29]. Recently, Phallen et al. vali-

dated the utility of short barcodes combined with genomic coordinates for detection of

early stage cancers using ctDNA. In addition, they presented different variant filtering

strategies depending on the number of supporting reads for a given molecule. However,

these UMI aware callers do not address common issues filtered from conventional callers

such as triallelic sites, variants observed in control, or clustered positions to name a few.

As duplicate reads are needed to construct consensus sequences, current molecular bar-

coding strategies rely on a minimum of 2 supporting reads. With inadequate sequencing,

consensus based error suppression cannot be achieved for every molecule resulting in a

high abundance of error prone single reads.

1.4 Rationale and hypothesis

In order to achieve accurate molecular quantification and optimal error suppression for

sensitive detection of low-allele-frequency variants, there is a need to overcome barri-

ers imposed by molecular recovery and technical artefacts from library preparation and

sequencing. Current error suppression strategies using UMIs are limited to a subset of

molecules with duplicate reads. I hypothesize that extension of error suppression to single

reads will improve detection sensitivity of rare mutations.

1.5 Objectives

1. Develop an NGS strategy for accurate measurement of molecular complexity.

2. Improve duplex sequencing through error suppression incorporating single reads.

3. Evaluate ability to detect rare variants with sensitive profiling of cell line dilution

series.

Chapter 2

Materials and Methods

In order to achieve accurate molecular quantification and optimal error suppression

for sensitive low-allele-frequency variant detection, we developed a novel computational

pipeline for molecular barcode-based sequencing across two applications. The first is

LgMid which employs a large (1.2 MB) panel sequenced to moderate depth (4,223x tar-

get coverage). The second is SmDeep which employs a small (0.01 Mb) panel sequenced

to great depth (186,312x).

2.1 Target panel design

We constructed hybrid capture panels targeting a range of genomic footprints from 13 KB

to 1.2 MB sequenced to an average of 186,312x and 4,223x target coverage, respectively.

The deeply sequenced small panel (SmDeep) encompassed exons of 5 genes (KRAS,

NRAS, BRAF, EGFR, and PIK3CA) important in the mitogen-activated protein kinase

(MAPK) pathway. An application of this panel was to assess whether sequencing of cell-

free DNA in blood was comparable to bone marrow biopsies for molecular profiling of

multiple myeloma [40]. We also moderately sequenced a large panel (LgMid) comprised

of 260 leukemia associated genes. This was used for the identification of pre-leukemic

mutations in white blood cells of individuals who later developed acute myeloid leukemia.

To evaluate our limit of detection using these target panels of different sizes, we mixed

cancer cell lines with known genetic alterations to emulate varying levels of mutant allele

frequencies from 100% to 10-4% (Table 2.1). As background errors set the threshold to

which we can discern true mutations from technical artefacts, we developed a sequencing

strategy enabling error suppression for single reads.

2.2 Next-generation sequencing library preparation

Illumina NGS libraries were prepared for limit of detection assays using cell line DNA.

Library preparation was performed according to sample and application as outlined in

Table 2.1. Cell line genomic DNA was used to create a 6-point dilution at ratios of

1/5 for LgMid and an 8-point dilution at ratios of 1/10 for SmDeep. For each dilution,

17

18

Project LgMid SmDeepCelllinespike-in MOLM13 HCT116Celllinebackground SW48 MM1SDilutionpoints 6(x2technicalreplicates) 8Dilution(%) 5to5-2 10to10-4

DNAinput(ng) 100 60Fragmentsize(bp) 250 180Pre-capturePCRcycles 8 4Samplespooledpercapture

3 4

TargetPanel xGen®AMLCancerPanelv1.0

AllexonsofBRAF,EGFR,KRAS,NRAS,PIK3CA

Panelsize(MB) 1.20 0.01Post-capturePCRcycles 10 15Sequencer HiSeq2500 HiSeq2000v3Sequencelength(bp) 125 100Numberofreads(billions) 5.32 2.05MeanTargetCoverage 4,223 186,312On/Neartarget 90% 60%

Table 2.1: Summary of library preparation specifications for each cell line dilution.

60 and 100ng DNA was sheared before library construction using a Covaris R© M220

sonicator (Covaris, Woburn, MA, USA) for 180 and 250 bp fragments in SmDeep and

LgMid respectively. The DNA libraries were constructed using the KAPA Hyper Prep

kits (#KK8504, Kapa Biosystems, Wilmington, MA, USA) with custom double strand

duplex molecular barcodes (xGen Lockdown Custom Probes Mini Pool, Integrated DNA

Technologies, Coralville, IA, USA) designed by Dr. Scott Bratman. Following end repair

and A-tailing, adapter ligation was performed overnight using 100-fold molar excess of

molecular barcode adapters. Agencourt AMPure XP beads (Beckman-Coulter) were used

for library clean up and ligated fragments were amplified between 4-8 cycles using 0.5µM

Illuminal universal and sample specific index primers.

2.3 Target capture and sequencing

For each dilution, half of the indexed Illumina libraries were pooled together in a single

capture hybridization (Table 2.1). Following the IDT Hybridization capture protocol,

each pool of DNA was combined with Cot-I DNA (Invitrogen) and xGen Universal

Blocking Oligo (Integrated DNA Technologies, Coralville, IA, USA) to prevent cross

19

hybridization and minimize off-target capture. Samples were dried and re-suspended

in hybridization buffer and enhancer. Target capture with custom xGen Lockdown

Probes (Integrated DNA Technologies, Coralville, IA, USA) was performed overnight.

Streptavidin-coated magnetic beads were used to isolate hybridized targets according to

manufacturer’s specifications. Following target selection, the captured DNA fragments

were enriched with 10-15 cycles of PCR. Pooled libraries were sequenced using 100-125

bp paired-end runs on Illumina platforms (HiSeq v3 2000, HiSeq 2500) at the Princess

Margaret Genomics Centre. See Table 2.1 for details.

2.4 Data preprocessing

After sequencing, reads were de-multiplexed using sample specific indices into separate

paired-end FASTQ files. A two base pair molecular barcode and a one base pair invariant

spacer sequence were removed from each read. A thymine base was encoded in the third

position for adapter ligation and a spacer filter was enforced to remove reads incompli-

ant with this design. The extracted barcodes from paired-end reads were grouped and

written into the FASTQ sequence identifier header of each read for downstream in silico

molecular identification [71]. FASTQ files were mapped to the human reference genome

hg19 using BWA (v 0.7.12) [76], processed using the Genome Analysis ToolKit (GATK)

IndelRealigner (v 3.4-46) [77], and sorted by genome position and indexed using SAM-

tools (v 1.3) [78]. This process created sorted BAM files containing sequence alignment

data.

2.5 Molecular barcode analysis

2.5.1 Unique molecular identifiers

A range of barcode lengths have been developed for the identification of unique molecules.

Kennedy et al. originally proposed the concept of duplex sequencing using a 12 bp

identifier tag with a 5 bp invariant spacer on each end of a molecule [71]. Subsequently,

Newman et al. demonstrated the utility of dual end 2 bp barcodes and 2 bp spacers

partnered with genome mapping positions [29]. While shorter barcodes are more cost

effective, they are subject to barcode collisions where the same identifier may be assigned

to two different molecules. As there are only 256 possible combinations at a given locus

with 4 bp barcodes (2 bp on each end), Newman et al. estimated the fraction of molecular

loss to be between 0.15% and 0.56% [29]. Phallen et al. further confirmed the utility

20

of short sequence tags through Monte Carlo simulations of varying barcode lengths in

combination with genomic positions. Their evaluations revealed that barcodes between

4-16 bp in length are sufficient to distinguish individual molecules when paired with

genome positional identifiers [27].

Figure 2.1: UMIs: A barcode family is defined by barcodes and sequence features thatenable unique identification of an individual molecule. Our 1) 4-bp barcode allows for256 different molecules at each 2) genomic position. In addition, 3) CIGAR strings canprovide additional identifier diversity as it encodes sequence variation to the referencegenome highlighting distinct attributes such as clipped regions, insertions, or deletions.Furthermore, read 4) orientation and 5) number can differentiate palindromic barcodesand provide proper molecular grouping of reads belonging to different strands. R1 =Read 1 and R2 = Read 2 of paired-end reads.

Short oligonucleotide barcodes have the benefit of reduced cost for barcode synthesis

and conservation of DNA bases for biological DNA in short read sequencing. To charac-

terize unique molecules, we utilized a pair of 2 bp barcodes (4 bp in total) in combination

with 4 sequence features from paired-end reads: i) genomic position, ii) concise idiosyn-

cratic gapped alignment report (CIGAR), iii) read orientation, and iv) first read or second

read (Fig. 2.1). Hybridization capture approaches have the benefit of catching a wide

range of molecules with varying mapping positions, whereas amplicon-based methods

capture fragments with conserved positions. By utilizing the diverse genome locations of

hybrid capture fragments, shorter barcodes can be employed in combination for unique

molecular identification. By encoding both the i) start and end locations of each frag-

ment, we can preserve read pairs across large insert sizes of chromosomal translocations.

A 4-bp barcode enables 44 or 256 unique possible combinations with each positional pair.

In addition, we integrated ii) CIGAR strings in our identifier as they reflect alignment to

the reference genome (Fig. 2.2). CIGAR operations can add diversity through incorpo-

ration of read features such as M = match/mismatches, I = insertions, D = deletions, S

= soft clips, and H = hard clips. Inclusion of sequence alignment properties may improve

identifier complexity and expand the pool of UMI combinations. This can potentially

21

enable more accurate categorization of reads and mitigate barcode related errors.

Figure 2.2: Source of CIGAR: When DNA sequencing reads (grey bars) are aligned to areference sequence, specific genome coordinates and mismatches to the reference (CIGARstrings) are noted in the sequence alignment file. CIGAR strings encode informationabout fragment clipping (S, soft clipped bases in 5’ or 3’ end of the read that are notpart of the alignment), as well as sequence concordance with the reference genome (M,match/mismatch bases), insertions (I), and deletions (D).

Moreover, we incorporated iii) strand orientation and iv) first or second sequence in

paired reads to distinguish palindromic barcodes for differentiation of reads from separate

strands. As there are 4n/2 (n = number of bp in barcode) palindromic combinations, our

4 bp barcode (e.g. AT from read R1 and AT from R2) encodes 16 possibilities where the

barcode sequence reads the same for plus and minus strands. Without these sequence

distinctions, reads from separate strands of a molecule may be grouped together. While

this only represents 6.25% of our barcodes, this is an issue that increases exponentially

with longer barcodes where genome coordinates alone are insufficient for differentiating

palindromic sequences. Thus, it is crucial to employ multiple sequence measures to

supplement short barcodes for accurate duplex molecular identification.

2.5.2 Single strand consensus sequences

Using our UMIs, reads derived from the same strand of a molecule can be condensed

into single strand consensus sequences (SSCS) (Fig. 2.3). First, a filter was applied to

exclude reads which were unmapped, paired with an unmapped mate, or had multiple

alignments. Paired reads were assigned UMIs as described above using barcode, genome

22

Figure 2.3: Duplex sequencing schematic: An uncollapsed BAM file is first processedthrough SSCS maker.py to create an error suppressed single strand consensus sequences(SSCS) BAM file and an uncorrected Singleton BAM file. The single reads can berecovered through singleton strand rescue.py, which salvages singletons with its comple-mentary SSCS or singleton. SSCS reads can be directly made into duplex consensussequences (DCS) or merged with rescued singletons to create an expanded pool of DCSreads (Figure illustrates singleton rescue merged work flow).

23

mapping, CIGAR string, strand of origin, orientation, and read number information.

Reads sharing the same UMIs were grouped into the same read family. Only families

with 2 or more members were error suppressed and collapsed to form SSCSs as following:

1. For each position i across a sequence length, a Phred quality threshold of Q30 was

enforced for every read r j. As Phred scores are a measure of the quality of base

assignment called by the sequencer, only bases with an error probability of one in

a thousand or less (r i,j >Q30) were evaluated for consensus formation.

2. The most frequent base at each position across all replicate reads of the same

molecule was established as the consensus (ci). The most common base was as-

signed if the proportion of reads representing that base was greater than or equal

to the threshold required to confidently call a consensus (default cutoff 0.7 - based

on previous literature[71]), otherwise an N was assigned.

3. Here, we introduced a molecular Phred quality score for each position to reflect

bases that went into making each consensus sequence. A sequencer Phred score

(Qseq) is a measure of the estimated probability (Pseq) of incorrectly calling a single

base:

P seq = 10-Qseq/10 (2.1)

Qseq = −10log10(P seq) (2.2)

A molecular Phred score (Pmol) estimates the probability (Pmol) of incorrectly call-

ing ci. It is the product of error probabilities of Phred scores (Qk) corresponding

to bases supporting ci, where k is the index of scores and n is the total number of

scores matching the consensus base.

Pmol =n∏

k=1

10-Qk/10 (2.3)

Qmol = −10log10(Pmol) (2.4)

The molecular Phred score can be simplified to the addition of quality scores to

summarize probability of errors.

Qmol =n∑

k=1

Qk (2.5)

While Phred quality scores can theoretically range from 0 to infinity, a maximum

Q60 cap was enforced to accommodate thresholds imposed by various genomic tools

24

to flag misencoded quality scores. If no consensus could be reached at a position

and an N was present, Pmol = 0 was assigned for that base.

4. As each SSCS represents multiple reads derived from the same strand of a unique

fragment, a consensus query name was assigned to each SSCS pair. Similar to

our UMIs, the pairing tag consists of a barcode along with 4 sequence features:

i) genome mapping ordered by coordinate, ii) strand of origin inferred from read

orientation and number, iii) CIGAR string ordered by strand of origin and read

number, and iv) family size (number of reads supporting SSCS).

Filtered reads, SSCS with two or more supporting reads, and singletons were written

as separate BAM files. While single reads are traditionally excluded from consensus

formation, we incorporated a singleton rescue strategy enabling error suppression without

molecular duplicates. Salvaged singletons can be combined with SSCS for downstream

DCS formation.

2.5.3 Singleton rescue (SR)

Previous UMI-based techniques for duplex sequencing were restricted by the high depth

required for duplicate recovery. Kennedy, Newman, and Phallen consensus approaches

have limited capacity for error correction in singleton dominant samples [71, 29, 27].

Here, we introduce two approaches for singleton rescue using the duplex nature of DNA

molecules for biological elimination of technical artefacts (Fig. 3.2). Following the for-

mation of SSCS, singletons can be corrected with its reverse complement SSCS for i)

singleton rescue by SSCS. If a complementary SSCS cannot be identified, single reads

can also be paired with its reverse complement singleton for ii) singleton rescue by single-

tons. In this case similar to the duplex consensus strategy, only two reads corresponding

to the dual strands of a template molecule are required to error correct singletons as

following:

1. UMIs were assigned to singleton and SSCS reads. For each singleton, a duplex

identifier was determined by interchanging barcodes, swapping CIGAR strings,

switching strand direction, and changing read number. If R1 and R2 on a positive

strand had AC/GT as barcodes and 98M/4S94M for CIGAR strings, their duplex

would be GT/AC for barcodes and 4S94M/98M for CIGAR strings on the minus

strand. R1 in the the forward orientation on the plus strand corresponds to R2 in

the forward orientation on the minus strand.

25

2. Singleton rescue can be achieved using either its reverse complement i) SSCS or ii)

singleton read. For each base, a Phred quality filter of Q30 was enforced to remove

error prone bases. Consensus sequences were formed by taking concordant bases at

each position or by assigning Ns for mismatches. Similar to SSCS, molecular Phred

quality scores were computed by taking the sum of sequencer Phred qualities and

capped at a maximum score of Q60.

3. Error suppressed singleton pairs were assigned a consensus query name as described

above for SSCS reads.

Recovered singleton were written to separate BAM files depending on method of rescue.

They were subsequently merged with SSCS reads for duplex formation. The inclusion

of singletons allows a larger pool of reads to undergo duplex error correction removing

strand bias errors.

2.5.4 Duplex Consensus Sequences

For optimal error suppression, duplex consensus sequences (DCS) can be established by

comparing SSCSs corresponding to the double strands of a molecule. This second stage

of duplex error suppression eliminates asymmetric strand damage induced by oxidation.

While DCS methods form sequences with lower rates of error, it has been characterized

as a fairly inefficient process [71, 29]. Here, we optimized DCS formation by expanding

our pool of candidates to include singletons. While our method error corrected single-

tons prior to DCS formation, uncorrected singletons can be used directly for DCS in

theory as both DCS and singleton rescue utilize the duplex strategy for error suppres-

sion. Consensus sequences were established by preserving matched bases between reverse

complementary reads. Mismatches were denoted as N and molecular Phred qualities were

assigned to consistent bases. Although DCS reads have the lowest rates of error, they

only depict a portion of the total molecular population. To portray accurate molecular

representation for variant calling, a BAM file containing all unique molecules can be

created by combining DCS, SSCS (without duplex pair), and singletons.

2.6 Evaluation of consensus sequence formation

Efficiency of consensus formation reflects the frequency of consensus sequences generated

per read. This is determined by the average number of reads required to construct a

consensus. For example, an efficiency rate of 10% indicates each read contributes to 0.1

of a consensus or rather 10 reads are needed to form a single consensus.

26

In order to compare target panels of varying sizes, efficiency rates were calculated

using the mean target coverage (cov). GATK (v 3.6) DepthOfCoverage was used to

determine mean fragment coverage per target position. Notably, we performed fragment

counting as it considers overlapping reads as a single entity rather than double-counting

those reads:

Efficiency =covDCS/SSCS

covUncollapsed

(2.6)

As DCS formation is dependent on the number of SSCS and rescued singletons, DCS

recovery rates were estimated by comparing observed over expected rates:

RecoveryDCS =ObservedDCS

ExpectedDCS

=covDCS

(covSSCS

2)

(2.7)

2.7 Error analysis

Error is an estimation of the confidence that a single base is assigned to the correct nu-

cleotide. Probabilities of error were determined using the integrated digital error suppres-

sion (iDES) tool (https://cappseq.stanford.edu/ides/download.php#bgReport) [29].

BAM files were first converted to base frequency files for each genomic position us-

ing ides-bam2freq.pl. With the ides-bgreport.pl, background errors were calculated using

non-reference bases with at least one read support below 5% allele frequency. Error rates

were determined as the number of non-reference bases over all sequenced bases within

our capture panel range.

2.8 in silico downsampling

To compare hybrid capture panels of different sizes sequenced to various depths, we

performed down-sampling of average read coverage per target position to set intervals.

9 intervals were established between 128,000x to 500x coverage with depth decreasing

by half each step. Each sample was reduced in coverage to the set intervals through in

silico down-sampling of paired-reads with Samtools (v 1.3). This process was repeated

ten times for each sample across all intervals to address sampling biases. Next, each

down-sampled file was run through our duplex sequencing pipeline to generate singletons,

SSCS, DCS, singleton rescue by SSCS, and singleton rescue by singletons files. Rescued

singletons were merged with SSCS to generate SSCS + SR for down stream duplex

27

formation (DCS + SR). Efficiency for consensus formation and molecular recovery was

assessed for each error suppression strategy across the broad range of coverage intervals.

2.9 Mutation analysis

We selected variants based on annotated SNPs from the Cancer Cell Line Encyclopedia

(CCLE) overlapping our target panel and corresponding to the cell lines we used for

our dilution series. These variants were filtered for only those found exclusively in the

spike-in cell line. For SmDeep, we identified 3 mutations in BRAF, KRAS, and PIK3CA

of cell line HCT116. Whereas, 5 mutations in DST, NF1, RYR1, SMG1, and VCAN

were called for LgMid.

Somatic mutation calls were generated for each sample using MuTect (v 1.1.5) with

the following parameters [79]:

–enable extended output –tumor f pretest 0.000001f –downsampling type NONE –

force output –force alleles –gap events threshold 1000 –fraction contamination 0.00f –

coverage file

We force called every base of each selected CCLE annotated position to evaluate the

number of reads and allele frequency supporting each variant. MuTect metrics were used

to assess limit of detection and background noise for each stage of barcode-mediated error

correction across the dilution series.

Chapter 3

Results

3.1 Sequence properties can enhance barcode depic-

tion of molecular diversity

Here, we developed a method with a 2 base pairs (bp) barcode followed by a 1 bp spacer

for adapter ligation. We designed compact molecular labels to reduce barcode synthe-

sis expenditures, while minimizing artificial sequence uptake reserving more bases for

biological DNA. To supplement our short barcodes, we incorporated sequence features

including genome coordinates which have been shown to be effective for distinct molec-

ular characterization of ctDNA [29, 27]. In addition, we integrated sequence properties

(CIGAR strings) to our UMIs indicative of base concordance, fragment clipping, inser-

tions, and deletions (Fig. 2.1). We assessed the contribution of these additional features

for capturing molecular diversity by quantifying reads with misalignment to the reference

genome. To evaluate our method, we compared molecular complexity and error suppres-

sion in cell line dilutions across two contrasting scenarios: targeted deep sequencing and

broad moderate sequencing (SmDeep and LgMid, see methods). As we require two or

more reads supporting the same UMI to construct single-strand consensus sequences

(SSCS), single reads without duplicate support for error correction were classified as sin-

gletons. We observed a higher ratio of alternative alignments within singletons compared

to SSCS (Fig. 3.1a), which suggests CIGAR strings may pose as a stringent filter for

consensus formation. Next, we wanted to explore whether inclusion of alignment in-

formation enables more accurate depiction of molecular complexity and improves error

suppression. To test this idea, we compared consensus sequences derived from molecular

identifiers with and without alignment information. While there were minor changes

in reads of LgMid, there was on average a 33% increase in singletons of SmDeep with

CIGAR labeling (Fig. 3.1b). As molecules were saturated across a narrow target range

with deep sequencing, these results suggest barcodes and genome coordinates alone were

insufficient for characterizing molecular complexity. This may be also be problematic

for cfDNA molecules in plasma as their tight size distribution would result in a narrow

range of mapping coordinates. Expanding the combinations of molecular identifiers with

28

29

Figure 3.1: Impact of alignment information on molecular complexity and error sup-pression. a) Fraction of reads with alternative alignment to the reference genome in SSCSand singletons across two cell line dilutions. Error bars represent standard error betweensamples. b) CIGAR strings are indicative of alignment as they encapsulate informationabout base concordance, fragment clipping, insertions, and deletions. Scatter plots showchange in singletons and SSCS error rate with CIGAR embedded identifiers compared toreads derived without alignment differentiation. Left: a large panel sequenced to mod-erate depth (LgMid, n = 12). Right: a small panel with deep sequencing (SmDeep, n =8).

alignment information may improve depiction of cfDNA populations. In order to un-

derstand the impact of alignment on consensus-based error correction, we evaluated the

change in background error profiles by examining non-reference bases excluding germline

30

SNPs. With CIGAR derived consensus sequences, we observed an average decrease in

error rate by 7% in LgMid and 41% in SmDeep. Notably, we found the reduction in SSCS

error was correlated with increased singletons for SmDeep (n = 8, r = -0.87, p <0.005,

Pearson correlation). This is likely due to CIGAR identifiers enabling more defined read

categorization. As large read families sharing the same barcode and genome coordinate

are differentiated into smaller groups with varying alignment, some UMIs may lose multi-

read support and become singletons. Overall, our findings suggest alignment information

enables more distinct molecular identification and improves error suppression.

3.2 Singleton rescue improves efficiency of consensus

formation

A major barrier to adequate error suppression with conventional UMI-based strategies is

the constraint of high depth required for sufficient molecular recovery. Traditional duplex

approaches rely on duplicate reads to form consensus sequences and have limited utility

when there is inadequate coverage and an overabundance of singletons [71]. Here, we

propose harnessing the double-strand nature of DNA molecules for biological elimination

of technical artefacts in singletons. We can recover singletons by employing the duplex

strategy of error correction using a complementary SSCS or singleton (Fig. 3.2). In the

scenario where one strand S1 of a fragment has multiple copies and the other strand S2

only has one copy, reads corresponding to S1 are first formed into a SSCS. The consensus

sequence of S1 can be subsequently matched with the singleton of S2 for 1) singleton

rescue by SSCS. Alternatively if both S1 and S2 are singletons, the lone reads can error

correct one another through 2) singleton rescue by singletons. Together, these dual

strategies of singleton rescue (SR) increase the candidate pool of reads for downstream

duplex formation.

To evaluate the SR strategies, we examined consensus formation in two mixtures of

cancer cell lines emulating varying levels of ctDNA. As a consensus sequence requires a

minimum of two duplicate supports for construction, 67% of all reads in LgMid (4,223x

mean target coverage) qualified for consensus formation (Fig. 3.3). This corresponded to

a 25% single-strand consensus sequence formation rate (equation 2.6), which is equivalent

to an average of 4 reads used to build each SSCS. With the assumption that 2 single-

strand molecules form a duplex molecule, we would expect the frequency of DCS to be

half of total SSCS if every double-strand molecule was recovered. However, only 14% of

the expected DCS (equation 2.7) were observed. This corresponded to a low efficiency

31

Figure 3.2: Dual strategies of single read error suppression: Singleton rescue by SSCSstarts with collapsing duplicate reads to form SSCS, which are subsequently used tocorrect the reverse complement singleton. Additionally, singletons can be rescued byother singletons through complementary strand error suppression.

rate of ˜2% for DCS formation with an average of 54 reads needed to construct each DCS.

As a third of all sequences lacked duplicate read support, our SR strategy recovered a

quarter of singletons. SR corrected 10% of singletons using the complementary SSCS

strand and 12% using the complementary singleton strand. This suggests a quarter of

singletons had the potential to qualify for duplex correction, but would have been missed

with conventional UMI strategies due to the lack of duplicate support. Furthermore, one

tenth of those singletons were the complementary strand to SSCSs, which explains the low

DCS recovery rate we observed. With the traditional duplex model, these SSCSs failed

to form DCSs as their other strand lacked sequencing replicates. Expansion of consensus

formation to singletons reduced the number of reads needed to build a consensus sequence

to 3 for SSCS (33% efficiency rate) and 11 for DCS (9% efficiency). Together, the dual

strategies of SR recovered 53% of expected DCS resulting in a 4.6-fold improvement

compared to conventional duplex methods.

In contrast to the panel sequenced to moderate depth, 98.7% of reads in the deeply se-

quenced SmDeep (186,312x mean target coverage) contributed to consensus development.

As the majority of molecules had duplicate read support, SSCS formation was inefficient

needing 30 reads on average to assemble a single consensus sequence (3% SSCS forma-

tion rate). Consequently, DCS efficiency was even lower at 0.5%, which corresponds

to an average of 184 reads used to construct each DCS. The high ratio of duplicates

32

to singletons sequenced account for the low efficiency rates. However, deep sequencing

corresponded to a higher DCS recovery rate of 33.6% compared to the 14% previously

observed with sequencing to moderate depth. As singletons accounted for only ˜1% of

all reads in SmDeep, SR had insignificant impact on SSCS and DCS formation as rela-

tively few singletons were recovered. Although singleton correction improves consensus

efficiency, it is not equally effective in all scenarios. Therefore, to efficiently maximize

molecular recovery, there is a need to characterize the optimal sequencing context for

error suppression.

Figure 3.3: Proportion of reads with duplicate support eligible for consensus-basederror suppression. Singleton rescue (SR) enables additional error correction of singletonsthrough rescue strategies using complementary SSCS or singletons. The left shows theaverage read distribution of a large panel with sequencing to moderate depth (LgMid, n= 12), while the right shows a summary of reads from a small panel with deep sequencing(SmDeep, n = 8).

3.3 Singleton rescue achieves comparable rates of er-

ror suppression as duplex correction

Singletons typically cannot be error corrected with barcoding and contain artefacts which

may confound detection of variant allele frequencies <0.1% expected in ctDNA. To assess

the error suppression performance of SR, we evaluated errors at each stage of consensus

formation. Approximately half of the large panel was error-free, whereas the small panel

had errors in nearly every position (Fig. 3.4). The lower rate of error-free positions in

33

SmDeep compared to LgMid may be explained by the higher probability of mismatched

bases occurring within a narrow target range. Although SSCS reduced the frequency of

positions with errors to less than 25%, duplex was required for maximal error reduction.

DCS had the highest rate of error-free positions with 99.96% for LgMid and 99.82% for

SmDeep. Across both panels, SR achieved comparable metrics with mismatches found

below 0.1% across the target panel.

Figure 3.4: Error suppression performance at each stage of error correction betweensequencing with moderate (LgMid) and deep (SmDeep) coverage. Background errordetermined by non-reference bases at allele frequencies below 5%. Error bars representsstandard error across all samples.

Panel-wide error rates corresponded with error-free positions, where higher frequency

of positions without errors concurred with lower error rates. Across the selected genes of

both target panels, we found SSCS reduced error rates to below 0.01%. Whereas, removal

of asymmetric-strand damage with DCS suppressed errors to an order of magnitude

lower below 0.001% (Fig. 3.4). While SR of SmDeep presented the lowest error rates

across the error suppression strategies, this was biased by the relatively low proportion

34

of singletons rescued. With deep sequencing, only 1% of SmDeep reads were singletons

and 2% of those singletons were rescued. As SmDeep did not adequately represent

correctable singletons, we focused on LgMid to provide a better assessment of errors

with SR. Across the 23% of singletons corrected in LgMid, We observed error rates

of 1.0 x 10-3% for SR by SSCS and 1.2 x 10-3% for SR by singletons. Most notably,

both SR strategies minimized singleton errors by 25-fold down to an average of 1.1 x

10-3% per base. This is noteworthy as it suggests high quality error suppression can

be achieved with an individual read, challenging the fundamental notion of requiring

multiple supporting reads for conventional consensus formation. Despite the high rate

of artefacts in singletons, our correction strategy obtained similar background levels of

error as top performing DCS.

3.4 Oxidative damage is effectively suppressed by

singleton rescue

An analysis of background artefacts in LgMid across base substitution classes (C>A,

C>G, C>T, T>A, T>C, T>G; and their reciprocal substitutions) revealed different

error profiles with each stage of consensus correction. We observed similar error rates

across all 12 substitution classes between uncollapsed reads and singletons without error

suppression, suggesting singletons are the result of molecule sampling biases and are not

inherently different in error profiles. Consistent with previous reports by Schmitt et al.

and Newman et al., we found errors indicative of oxidative damage in uncorrected and

SSCS reads as indicated by the imbalance between C>A/G>T transversions (Fig. 3.5b)

[70, 29]. Oxidation induces 8-oxoguanine (8-oxoG) lesions during DNA shearing and

hybridization capture. Through PCR amplification, 8-oxoG is corrected to a thymine

and paired with an adenine, resulting in asymmetric damage to one strand of a molecule

(Fig. 3.5a).

As our hybrid capture panel utilizes probes generated against the positive strand of

the human genome reference sequence, only the plus strand of DNA was captured. Con-

sequently, we observed a ratio of 0.6 G >T compared to C>A errors. Although Schmitt

et al. previously reported a higher frequency of G >T errors, our findings are consis-

tent with those previously reported by Newman et al. where they demonstrated inverse

ratios in substitution errors depending on the strandedness of probes [70, 29]. While

the oxidative damage signature was present in uncollapsed and SSCS reads, we did not

observe its presence within DCS (Fig. 3.5c-e). Together, these results demonstrate the

35

importance of DCS error suppression for removal of asymmetric strand damage induced

by oxidation.

Next, we explored the ability of SR to address mutational signatures of error by

examining base substitutions. We observed elimination of oxidative damage across both

SR by SSCS and SR by singletons (Fig. 3.5f-g). Interestingly, this raises the question of

whether multiple reads are necessary for error suppression as complementary singletons

achieved error rates and substitution profiles similar to DCS. Both DCS and SR rely

on the same mechanism of artefact elimination as they are contingent on consolidation

of double-strands. While ratio of substitution errors vary between samples in DCS,

this difference is potentially due to the few reads available to form DCS. From ˜4,000x

uncollapsed reads per target position, only 80x reads remain after DCS construction.

Contrary, SSCS rescued 142x singletons and complementary singletons recovered 171x

reads per target position. As both strategies of SR was similar to DCS across error-free

positions, error rates, and substitution profiles, our findings strongly support inclusion

of singletons to maximize error suppression.

3.5 Duplex molecular recovery is maximized through

singleton rescue

While UMI based error suppression strategies have been deployed for circulating tumour

DNA detection, DCS has been described as an inefficient process [29, 27, 71]. This

is partly due to the need for deep sequencing, as published UMI algorithms require a

minimum of 2 read supports for consensus formation. Since more replicates are needed to

construct consensus sequences, less molecules will meet the criteria resulting in lower rates

of SSCS and DCS formation. In addition to the bottlenecks on molecular recovery during

library preparation and sequencing, bioinformatic analysis may also bias quantification

of molecular diversity depending on criteria for defining unique molecules. Here, we

explore the impact of an alternative UMI approach with expansion of error suppression

to singletons. As SR has the potential to broaden the applicability of UMI correction, we

wanted to determine the conditions in which it can improve SSCS and DCS formation.

To compare efficiency of consensus formation across panels of different sizes sequenced to

various depths, we performed in silico down-sampling to coverage intervals between 500x

and 128,000x read per target base. Each sample was downsampled to 9 coverage levels

starting at 500x reads per target position and doubling with each increased interval.

Through ten simulations across nine intervals, we assessed the frequency of SR and

36

Figure 3.5: Duplex strategies of error correction suppresses oxidative damage. a)Schematic of oxidative damaging induced 8-oxoguanine (8-oxoG) asymmetric strand le-sions in DNA. 8-oxoG arises from oxidation and converts to thymine after PCR ampli-fication. As only plus strands are captured with our target panel comprised of negativeprobes, there is a bias for C>A errors. b) Comparison of 12 substitution classes be-tween uncorrected and error suppressed sequences in LgMid (n=12). c-g) Ratio of errorsbetween reciprocal base substitutions. Imbalances between C>A/G>T is indicative ofoxidative damage.

its impact on consensus formation across a broad coverage range. As we utilize the

double-strand nature of DNA for biological elimination of errors, the success of SR was

37

Figure 3.6: Singleton rescue impact on consensus formation across a wide coverage range(500x to 128,000x). Results are shown for LgMid and SmDeep cell line dilutions thatwere down-sampled by half at each interval step. Error bars represents standard error ofall the samples across ten simulations. Mean target coverage is log10 transformed to showtrends at lower range. a) Proportion of singletons rescued through complementary SSCSand singleton correction. b) Efficiency of SSCS formation as determined by number ofSSCS / total uncollapsed reads. c) DCS efficiency as calculated by number of DCS /total uncollapsed reads d) DCS recovery rate of observed / expected DCS. Expected DCScorresponds to half of all SSCS reads as two single molecules should theoretically form aDCS. Efficiency is an assessment of over-sequencing relative to unique input molecules,whereas recovery is an estimate of molecular retrieval.

dictated by the presence of its complementary strand. With increased sequencing depth,

we observed a greater portion of unique molecules and consequently a higher number of

rescued singletons (Fig. 3.6a). Starting from an average of 500x uncollapsed reads per

target position, we observed a 3% recovery of singletons in SmDeep. As the number of

reads per position doubled at each coverage interval, we found the frequency of SR to

double as well. This trend continued until reaching a peak SR rate of 20% at 8,000x

coverage, where one out of five singletons could be recovered. Beyond this point, we

38

observed a decrease in SR suggesting an increased prevalence of duplicate reads. This

was validated by SSCS efficiency, where we found the distribution of SR corresponded

with SSCS formation. As the development of SSCS relies on repeat measures of the

same molecule, rate of consensus formation increased until sign of unique molecular

saturation observed around 8,000x coverage (Fig. 3.6b). This inflection marked the

maximum efficiency rate of SSCS in SmDeep generated from 27% of molecules in the

library, representing an average of 4 reads to construct a single-strand molecule. After

this point, we began to saturate our unique single-strand molecules with duplicate reads,

which led to decreased SSCS efficiency and number of singletons rescued.

Overabundance of singletons pose as a major limitation for conventional UMI meth-

ods as there are insufficient reads to form consensus sequences. At 500x reads per target

position, inclusion of SR improved SSCS efficiency from 6% to 8% in SmDeep. With in-

creased depth of coverage, we observed a gradual improvement in SSCS efficiency through

SR, reaching a peak of 33% at 8,000x mean target coverage. This is a 6% improvement

from conventional SSCS formation where only duplicate reads are utilized. However, in-

creases in SSCS efficiency with SR was only observed until 32,00x mean target coverage

in SmDeep, at which point we saw no improvements from SR (Fig. 3.6b). Our results

suggest SSCS formation only benefit from SR at sequencing depths where molecules lack

duplicate reads.

While we also observed increased DCS formation with higher depth of sequencing, the

peak DCS rate of 3% was not observed until 16,000x mean target coverage (Fig. 3.6c).

Despite deep sequencing of SmDeep, we still found DCS formation to be highly inefficient

requiring a minimum of 33 reads at its highest observed performance. With the inclusion

of SR, we observed an increase in DCS efficiency and a shift towards more effective DCS

formation at lower coverage. This resembled the distribution of SSCS efficiency, as peak

DCS efficiency occurred between 4,000x - 8,000x reads per target base. In addition,

expansion of DCS formation to include SR resulted in a 2-fold improvement in DCS

efficiency up to 6% in SmDeep. While LgMid could only be down-sampled to coverage

intervals between 500x and 4,000x as it was moderately sequenced, we observed similar

trends in SR and SSCS/DCS efficiency as SmDeep. Notably, LgMid achieved 8.7%

DCS efficiency with an average of 4,000x reads per target position. This corresponded

to approximately 12 reads used for DCS construction, which is the highest efficiency

reported in literature to date.

Next, we assessed the frequency of expected double-strand molecules to estimate the

recovery of molecular complexity after library preparation, sequencing, and bioinformatic

analysis. DCS recovery was calculated by taking the ratio of observed over expected DCS.

39

Since two single-strand molecules could theoretically form a DCS, we determined the

expected frequency of DCS to be half of the total SSCS population. Using conventional

UMI principles requiring duplicates to construct each consensus sequence, we observed a

gradual recovery of DCS with increased depth of coverage. At 64,000x reads per target

position, we reached DCS saturation with only 34% recovery from both strands of a DNA

molecule (Fig. 3.6d).

Expansion of DCS formation with SR, maximized DCS recovery regardless of sequenc-

ing depth. Starting from 500x uncollapsed reads per target position, we demonstrated

that incorporation of SR improved DCS recovery rate from 0% to 40% in SmDeep and

50% in LgMid. Notably, the frequency of recovered DCS remained in this range despite

increased sequencing coverage. This suggests inclusion of consensus sequences from com-

plementary singletons overcomes traditional hurdles of error suppression and can achieve

optimal recovery of DCS at any given sequencing depth.

As not all DCSs were retrieved, our results suggest the matching strand was miss-

ing for half of all molecules in SmDeep. Further enhancements in library preparation

upstream of in silico analyses may improve double-strand molecular recovery. While

both dilutions displayed similar trends across all four performance metrics (Fig. 3.6),

the variability between them may be accounted by the difference in library preparation

and sample input (see methods). Together, our findings demonstrate the ability of SR to

recover additional molecules for error suppression, potentially increasing the sensitivity

of mutation detection.

3.6 Singleton enabled error suppression improves

lower limit of mutation detection

Using mixed cancer cell lines with known genetic alterations, we emulated varying levels of

mutant allele frequencies. For each dilution series, we evaluated the frequency of SNPs in

the spike-in cell line previously identified by the Cancer Cell Line Encyclopedia (CCLE).

We called every base of these CCLE annotated SNPs found within each target panel. In

SmDeep, we examined the frequency of three known mutations in BRAF, KRAS, and

PIK3CA of cell line HCT116 across the dilution series from 100% to 10-4%. With deep

sequencing, we detected the BRAF and KRAS mutations down to 1% and PIK3CA

down to 0.1% dilution in the uncollapsed reads without error suppression. Beyond these

frequencies, detection was prohibited by the rate of background errors (Fig. 3.7).

To improve upon this, we used our SSCS error correction and observed suppression

40

of artefacts represented by alternative mutations that are not matching the reference or

known allele. This process permitted identification of all three variants at 0.1% dilution

with observed allele frequencies as low as 0.01% for the BRAF mutation. While DCS

error suppression was required for complete removal of artefacts, known mutations were

only uncovered at 1% dilution for BRAF and KRAS and 10% for PIK3CA. While the

PIK3CA mutation was detected at a lower dilution compared to the other two mutations

in uncollapsed reads, the absence of the PIK3CA mutation in lower dilutions of DCS

suggests poor double-strand recovery. These results highlight the variability in molecu-

lar recovery and the need to optimize experimental design for improved double-strand

molecular retention as currently only 33.6% of maximum theoretical DCSs were retained.

As nearly all of the reads in SmDeep had duplicates sequenced, few molecules were

recovered with SR. Consequently, we did not identify observable changes in our limit of

detection with the addition of error suppressed singletons (Fig. 3.9a). Without elimi-

nation of artefacts, singletons contained errors corresponding to alternative bases rather

than the reference or known allele. Despite error correction using duplicate reads, a few

errors still persisted in SSCS. As these artefacts in SSCS were only found in 1-3 reads,

filters can be imposed to enable higher confidence variant calling with SSCS (Fig. 3.9b).

In the LgMid experiment, we evaluated 5 mutations present in the MOLM13 cell line

that were captured by our mutation panel (DST, NF1, RYR1, SMG1, VCAN ). With

a dilution from 100% to 0.04%, all 5 mutations were detectable at the 1% spike-in in

the uncollapsed reads (Fig. 3.8). Although the mutation in DST was also identified at

0.04%, it was undetectable at the 0.2% dilution in uncorrected reads. With SSCS, DST

and RYR1 mutations were detectable down to 0.04%. While NF1 was also identified at

similar frequencies, it was also found in the background cell line along with an alternative

allele, both of which were artefacts. As artefactual errors compete with signal of true

variants in SSCS, it is difficult to confidently call such mutations in NF1 as errors are

retained at 1% allele frequency. Although detection limit of SMG1 was the same in

uncollapsed and SSCS, the mutation in VCAN was detected at a lower limit of 0.2%

allele frequency in SSCS.

Despite DCSs presenting the lowest error rates, poor DCS recovery with sequencing

to moderate depth resulted in reduced number of mutations identified at lower dilutions

compared to uncollapsed reads. Incorporation of SR had minimal impact on mutation

detection in SSCS, whereas inclusion of SR with DCS improved mutation detection from

5% to 1% in SMG1 and from 1% to 0.2% in VCAN. In addition, expansion of DCS with

singletons recovered the known mutation in NF1 at the 1% dilution, which was previously

unidentified in DCS. Together, our results demonstrate the utility of SR to improve DCS

41

recovery enabling high confidence mutation detection down to 0.2% frequency in DCS.

In LgMid, we observe frequencies of background error at 0.1% in uncollapsed reads

prohibiting detection of mutations below such threshold (Fig. 3.10a). While the reduction

in errors enabled variant detection at allele frequencies down to 0.04% in SSCS, false

positives still persisted as observed by the presence of the spike in variant at 0% dilution.

Notably, these false positives were removed in DCS where no sign of the spike-in mutant

was observed in the undiluted background cell line (Fig. 3.10b). While DCS of SmDeep

was completely error free, a single read supporting an alternative mutation indicative of

error was present in DCS of LgMid at the 1% dilution. As a single error persisted into

DCS, a lenient filter could also be applied to DCS for higher confidence variant calls.

While DCS may enable higher confidence calling, SSCS achieves a lower limit of variant

detection. In order to optimally utilize error suppressed consensus sequences, our results

suggest a UMI aware caller is needed to enable different filtering strategies for SSCS

and DCS depending on confidence of sequence quality. Polishing strategies to model

background error rates, as previously described by Newman et al., may further improve

our limit of detection, as it can suppress artefacts not addressed by barcoding [29].

42

Figure 3.7: SmDeep: three HCT116 mutations identified through deep sequencing ofa 5 gene panel across dilutions from 100% to 10-4%. Every base at each position wascalled to illustrate the frequency of background errors present. Top panel compareslimit of detection with standard uncollapsed reads, middle panel represents single strandconsensus sequences, and bottom panel highlights duplex consensus sequences.

43

Figure 3.8: LgMid: Five MOLM13 mutations identified through sequencing to moder-ate depth across dilutions from 100% to 0.04%. Every base at each position was called todetermine frequency of known mutations and as well to illustrate the frequency of back-ground errors. Each row corresponds to an error suppression method and each columnrepresents a different mutation.

44

Figure 3.9: Summary trends of SmDeep mutations across different error suppressionstrategies. Comparison of detection limit using a) allele frequency and b) reads support-ing each variant.

45

Figure 3.10: Summary trends of LgMid mutations across different error suppressionstrategies. Comparison of detection limit using a) allele frequency and b) reads support-ing each variant.

Chapter 4

Discussion

4.1 Conclusions

Here, we proposed two strategies to improve molecular characterization and error sup-

pression for detection of low-allele-frequency variants across two applications: 1) enhance

molecular distinction using alignment information for deep sequencing (SmDeep), and 2)

expand error suppression to singletons for moderate depth sequencing (LgMid).

With deep sequencing, we demonstrated the utility of alignment information (i.e.

CIGAR strings) to increased identification of singletons by 33%. This improved quantifi-

cation of molecular diversity for our small deeply sequenced panel and enabled additional

error suppression in SSCS up to 70%. This has potential of improving distinction of rare

fragments, such as cfDNA with tight size distributions where genome mapping and bar-

codes alone may be insufficient. Deep sequencing is required for consensus formation

with traditional UMI methods as multiple reads are needed for error suppression. We

established a novel approach for consensus construction enabling error correction within

single reads. Utilizing the duplex methodology of aggregating complementary strands,

we rescued singletons using their matching SSCS or singleton strand to remove technical

artefacts. This expands the pool of reads for downstream consensus formation with up to

25% of singletons and improved efficiency rates for SSCS and DCS development. Interest-

ingly, rescued singletons outperformed conventional SSCSs achieving similar error rates

as DCS at 1 x 10-4% error per target base. Upon further examination of error profiles,

singleton rescue displayed similar substitution error features as DCS. Notably, corrected

singletons successfully eliminated oxidative damage, which SSCS was incapable of despite

having multiple read support. This is a phenomenon that can only be addressed through

double-strand artefact suppression.

Through down-sampling of SmDeep, we mapped the trajectory of consensus formation

with increasing depth of coverage. With rise in coverage, we observed a decrease in the

number of reads required to construct each consensus. SSCS efficiency decreased upon

saturating single-strand molecules at 8,000x coverage, while DCS saturation was not

observed until 16,000x coverage. At the maximum efficiency rate, DCS was constructed

from 3% of molecules in the library. This highlighted the inefficiencies with conventional

46

47

duplex barcoding as previously reported by other groups [71, 29]. With the incorporation

of singleton rescue, we achieved DCS saturation between 4,000x and 8,000x coverage

with maximum DCS efficiency doubling the conventional duplex method. Furthermore,

inclusion of singleton rescue maximized theoretical DCS recovery regardless of sequencing

depth. These observations can provide a reference to guide future sequencing experiments

utilizing UMIs for optimal consensus establishment. Our method offers a lower limit of

detection than conventional calling algorithms reaching 0.1% with deep sequencing. In

addition, singleton rescue improved variant detection limit in DCS from 1% to 0.2% allele

frequency with moderate depth sequencing. Together, our findings suggest UMI enabled

error suppression can achieve lower limits of detection and has potential of improving

sensitive detection of low-allele-frequency variants for clinical applications.

4.2 Potential impact and applications

Singleton rescue offers an error suppression strategy to address barriers for low-allele-

frequency variant detection. This has potential applications for circulating tumour DNA

detection in liquid biopsies. As sample input is often limited and signal is diluted by

cfDNA from non-malignant cells, molecular recovery and suppression of background er-

rors are critical for ctDNA analysis. This study highlights an improved strategy to depict

molecular complexity and eliminate artefactual errors. The results emphasize the capabil-

ity of mutation calling with DCS at a lower detection limit of 0.2%. This has implications

for identifying ctDNA at frequencies below the current threshold set by background lev-

els of technical artefacts. Furthermore, in silico assessment of molecular recovery across

defined coverage intervals allowed us to quantify the frequency of unique molecules per

target position. This distribution can be modeled to guide target panel design and depth

of coverage for future liquid-biopsy sequencing experiments.

As much of the morbidity and mortality associated with cancer is related to late di-

agnosis of disease, earlier detection could improve timing and efficacy of interventions.

ctDNA may be used to evaluate the onset and progression of disease. Although UMIs

has been employed in ctDNA analysis for earlier diagnosis, identifying somatic alter-

ations remains a major challenge without knowledge of matched tumour genotypes. As

ctDNA is limited at early stages of disease, sensitivity may be improved with increased

molecular recovery using complex identifiers reflective of individual sequence properties.

In addition, singleton rescue may further increase sensitivity through error suppression

of additional molecules. These refinements may improve screening of high risk individ-

uals with family histories of carrying genetic mutations, paving the path for potential

48

diagnostics of disease in the general population.

While advancements are needed to enhance applications of liquid biopsies for screen-

ing, ctDNA has shown promise with detection of residual disease [26, 36]. Following

surgery or treatments with curative intent, ctDNA analysis has demonstrated earlier

identification of recurrence than current standard of care imaging [38]. With our sensi-

tive method for single molecule error suppression, we hope to facilitate monitoring and

enable earlier detection of disease. Currently, we have ongoing projects using UMI for

detection of minimal residual disease in ctDNA of multiple myeloma patients, negating

bone marrow biopsies [40]. Additionally, this technique has implications for monitoring

relapse in patients post-surgery or transplantation. Particular impact is likely in diseases

such as liver cancer where recurrence rates can be as high as 70% [80]. Overall, serial

monitoring with ctDNA has relevance throughout the patient life cycle from screening

to detection of relapse. Across many cancers, liquid biopsies can be exploited to improve

clinical care and enable personalized treatment decisions.

4.3 Future directions

In this study, we observed a lack of molecular diversity within deeply profiled targeted

panels. While we sequenced to 186,000x depth per target position in one experiment, we

found that only 8,000x coverage was sufficient to generate SSCS from 27% of molecules in

the library. We observed molecular saturation at 8,000x coverage for SSCS and 16,000x

coverage for DCS. As higher depth of sequencing is required, DCS saturation is unlikely

achievable with larger panels without significant increases in costs. Therefore, singleton

rescue is needed to reduce the number of reads required to construct DCSs. In addition,

singleton rescue maximized overall DCS recovery regardless of sequencing coverage across

9 intervals between 500x to 128,000x reads per target position. Together, our results

support inclusion of singleton error suppression for all future UMI experiments as it is

fundamental for optimizing DCS sequencing.

A major challenge remains with alteration of molecular barcodes caused by poly-

merase and sequencer errors. These technical artefacts modify UMIs causing mis-cat-

egorization of reads and inaccurate quantification of molecules. While deeper coverage

profiling is often suggested as a solution to improve molecular recovery and sensitivity of

detection, increased depth also results in a linear increase in UMI errors [81]. Conven-

tional duplex strategies require deep sequencing for DCS recovery, whereas our improved

strategy of singleton rescue reduces the necessity for high depth of coverage. Rather, our

survey of molecular abundance across a broad range of coverage intervals urges against

49

over sequencing for economical reasons and avoidance of artificial UMIs generated with

deep sequencing. Overall, this work will enable precise modelling of depth and expected

molecular complexity to guide the design of future panels and targeted DNA sequencing

studies.

Bibliography

[1] J. C. M. Wan, C. Massie, J. Garcia-Corbacho, F. Mouliere, J. D. Brenton, C. Caldas,

S. Pacey, R. Baird, and N. Rosenfeld, “Liquid biopsies come of age: towards imple-

mentation of circulating tumour DNA,” Nature Reviews Cancer, vol. 17, pp. 223–

238, Apr. 2017.

[2] D. Hanahan and R. Weinberg, “Hallmarks of Cancer: The Next Generation,” Cell,

vol. 144, pp. 646–674, Mar. 2011.

[3] L. Ding, T. J. Ley, D. E. Larson, C. A. Miller, D. C. Koboldt, J. S. Welch, J. K.

Ritchey, M. A. Young, T. Lamprecht, M. D. McLellan, J. F. McMichael, J. W.

Wallis, C. Lu, D. Shen, C. C. Harris, D. J. Dooling, R. S. Fulton, L. L. Fulton,

K. Chen, H. Schmidt, J. Kalicki-Veizer, V. J. Magrini, L. Cook, S. D. McGrath,

T. L. Vickery, M. C. Wendl, S. Heath, M. A. Watson, D. C. Link, M. H. Tomasson,

W. D. Shannon, J. E. Payton, S. Kulkarni, P. Westervelt, M. J. Walter, T. A.

Graubert, E. R. Mardis, R. K. Wilson, and J. F. DiPersio, “Clonal evolution in

relapsed acute myeloid leukaemia revealed by whole-genome sequencing,” Nature,

vol. 481, pp. 506–510, Jan. 2012.

[4] R. A. Burrell, N. McGranahan, J. Bartek, and C. Swanton, “The causes and conse-

quences of genetic heterogeneity in cancer evolution,” Nature, vol. 501, pp. 338–345,

Sept. 2013.

[5] A. T. Shaw and I. Dagogo-Jack, “Tumour heterogeneity and resistance to cancer

therapies,” Nature Reviews Clinical Oncology, Nov. 2017.

[6] A. Bardelli, A. Snyder, A. Califano, A. A. Alizadeh, C. Caldas, C. Blanpain,

C. Swanton, C. W. M. Roberts, C. Borowski, C. Bock, C. E. Mason, D. Pe’er,

H. Stower, J. O. Korbel, J. C. Zenklusen, J. Zucman-Rossi, J. Zuber, K. Polyak,

L. Siu, M. Esteller, M. Elsner, M. Doherty, N. Navin, P. Lichter, R. Fitzgerald,

R. G. W. Verhaak, and V. Aranda, “Toward understanding and exploiting tumor

heterogeneity,” Nature Medicine, vol. 21, p. 846, Aug. 2015.

[7] M. Meyerson, S. Gabriel, and G. Getz, “Advances in understanding cancer genomes

through second-generation sequencing,” Nature Reviews Genetics, vol. 11, pp. 685–

696, Oct. 2010.

50

51

[8] M. J. Overman, J. Modak, S. Kopetz, R. Murthy, J. C. Yao, M. E. Hicks, J. L.

Abbruzzese, and A. L. Tam, “Use of Research Biopsies in Clinical Trials: Are Risks

and Benefits Adequately Discussed?,” Journal of Clinical Oncology, vol. 31, pp. 17–

22, Jan. 2013.

[9] G. Siravegna, S. Marsoni, S. Siena, and A. Bardelli, “Integrating liquid biopsies into

the management of cancer,” Nature Reviews Clinical Oncology, vol. advance online

publication, Mar. 2017.

[10] G. Siravegna, B. Mussolin, M. Buscarino, G. Corti, A. Cassingena, G. Crisafulli,

A. Ponzetti, C. Cremolini, A. Amatu, C. Lauricella, S. Lamba, S. Hobor, A. Aval-

lone, E. Valtorta, G. Rospo, E. Medico, V. Motta, C. Antoniotti, F. Tatangelo,

B. Bellosillo, S. Veronese, A. Budillon, C. Montagut, P. Racca, S. Marsoni, A. Fal-

cone, R. B. Corcoran, F. Di Nicolantonio, F. Loupakis, S. Siena, A. Sartore-Bianchi,

and A. Bardelli, “Clonal evolution and resistance to EGFR blockade in the blood of

colorectal cancer patients,” Nature Medicine, vol. 21, pp. 795–801, July 2015.

[11] C. Abbosh, N. J. Birkbak, G. A. Wilson, M. Jamal-Hanjani, T. Constantin, R. Salari,

J. L. Quesne, D. A. Moore, S. Veeriah, R. Rosenthal, T. Marafioti, E. Kirkizlar,

T. B. K. Watkins, N. McGranahan, S. Ward, L. Martinson, J. Riley, F. Fraioli, M. A.

Bakir, E. GrOnroos, F. Zambrana, R. Endozo, W. L. Bi, F. M. Fennessy, N. Sponer,

D. Johnson, J. Laycock, S. Shafi, J. Czyzewska-Khan, A. Rowan, T. Chambers,

N. Matthews, S. Turajlic, C. Hiley, S. M. Lee, M. D. Forster, T. Ahmad, M. Falzon,

E. Borg, D. Lawrence, M. Hayward, S. Kolvekar, N. Panagiotopoulos, S. M. Janes,

R. Thakrar, A. Ahmed, F. Blackhall, Y. Summers, D. Hafez, A. Naik, A. Ganguly,

S. Kareht, R. Shah, L. Joseph, A. M. Quinn, P. Crosbie, B. Naidu, G. Middleton,

G. Langman, S. Trotter, M. Nicolson, H. Remmen, K. Kerr, M. Chetty, L. Gomer-

sall, D. A. Fennell, A. Nakas, S. Rathinam, G. Anand, S. Khan, P. Russell, V. Ezhil,

B. Ismail, M. Irvin-sellers, V. Prakash, J. F. Lester, M. Kornaszewska, R. Attanoos,

H. Adams, H. Davies, D. Oukrif, A. U. Akarca, J. A. Hartley, H. L. Lowe, S. Lock,

N. Iles, H. Bell, Y. Ngai, G. Elgar, Z. Szallasi, R. F. Schwarz, J. Herrero, A. Stew-

art, S. A. Quezada, P. Van Loo, C. Dive, C. J. Lin, M. Rabinowitz, H. J. Aerts,

A. Hackshaw, J. A. Shaw, B. G. Zimmermann, the TRACERx Consortium, the

PEACE Consortium, and C. Swanton, “Phylogenetic ctDNA analysis depicts early

stage lung cancer evolution,” Nature, vol. advance online publication, Apr. 2017.

[12] C. L. Hodgkinson, C. J. Morrow, Y. Li, R. L. Metcalf, D. G. Rothwell, F. Tra-

pani, R. Polanski, D. J. Burt, K. L. Simpson, K. Morris, S. D. Pepper, D. Nonaka,

52

A. Greystoke, P. Kelly, B. Bola, M. G. Krebs, J. Antonello, M. Ayub, S. Faulkner,

L. Priest, L. Carter, C. Tate, C. J. Miller, F. Blackhall, G. Brady, and C. Dive,

“Tumorigenicity and genetic profiling of circulating tumor cells in small-cell lung

cancer,” Nature Medicine, vol. 20, pp. 897–903, Aug. 2014.

[13] D. A. Haber and V. E. Velculescu, “Blood-Based Analyses of Cancer: Circulating

Tumor Cells and Circulating Tumor DNA,” Cancer discovery, vol. 4, pp. 650–661,

June 2014.

[14] S. Rani, K. O’Brien, F. C. Kelleher, C. Corcoran, S. Germano, M. W. Radomski,

J. Crown, and L. O’Driscoll, “Isolation of Exosomes for Subsequent mRNA, Mi-

croRNA, and Protein Profiling,” in Gene Expression Profiling, Methods in Molecular

Biology, pp. 181–195, Humana Press, 2011. DOI: 10.1007/978-1-61779-289-2 13.

[15] J. D. Arroyo, J. R. Chevillet, E. M. Kroh, I. K. Ruf, C. C. Pritchard, D. F. Gibson,

P. S. Mitchell, C. F. Bennett, E. L. Pogosova-Agadjanyan, D. L. Stirewalt, J. F. Tait,

and M. Tewari, “Argonaute2 complexes carry a population of circulating microRNAs

independent of vesicles in human plasma,” Proceedings of the National Academy of

Sciences, vol. 108, pp. 5003–5008, Mar. 2011.

[16] T. Yuan, X. Huang, M. Woodcock, M. Du, R. Dittmar, Y. Wang, S. Tsai, M. Kohli,

L. Boardman, T. Patel, and L. Wang, “Plasma extracellular RNA profiles in healthy

and cancer patients,” Scientific Reports, vol. 6, p. srep19413, Jan. 2016.

[17] P. Li, M. Kaslan, S. H. Lee, J. Yao, and Z. Gao, “Progress in Exosome Isolation

Techniques,” Theranostics, vol. 7, pp. 789–804, Jan. 2017.

[18] P. Mandel and P. Metais, “Les acides nucleiques du plasma sanguin chez l’homme.,”

Comptes Rendus Des Seances De La Societe De Biologie Et De Ses Filiales, vol. 142,

pp. 241–243, Feb. 1948. bibtex: mandel cfDNA 1948.

[19] S. A. Leon, B. Shapiro, D. M. Sklaroff, and M. J. Yaros, “Free DNA in the Serum of

Cancer Patients and the Effect of Therapy,” Cancer Research, vol. 37, pp. 646–650,

Mar. 1977.

[20] Y. M. D. Lo, K. C. A. Chan, H. Sun, E. Z. Chen, P. Jiang, F. M. F. Lun, Y. W.

Zheng, T. Y. Leung, T. K. Lau, C. R. Cantor, and R. W. K. Chiu, “Maternal Plasma

DNA Sequencing Reveals the Genome-Wide Genetic and Mutational Profile of the

Fetus,” Science Translational Medicine, vol. 2, pp. 61ra91–61ra91, Dec. 2010.

53

[21] H. R. Underhill, J. O. Kitzman, S. Hellwig, N. C. Welker, R. Daza, D. N. Baker,

K. M. Gligorich, R. C. Rostomily, M. P. Bronner, and J. Shendure, “Fragment

Length of Circulating Tumor DNA,” PLOS Genetics, vol. 12, p. e1006162, July

2016.

[22] M. Snyder, M. Kircher, A. Hill, R. Daza, and J. Shendure, “Cell-free DNA Comprises

an In Vivo Nucleosome Footprint that Informs Its Tissues-Of-Origin,” Cell, vol. 164,

pp. 57–68, Jan. 2016.

[23] K. Sun, P. Jiang, K. C. A. Chan, J. Wong, Y. K. Y. Cheng, R. H. S. Liang, W.-

k. Chan, E. S. K. Ma, S. L. Chan, S. H. Cheng, R. W. Y. Chan, Y. K. Tong,

S. S. M. Ng, R. S. M. Wong, D. S. C. Hui, T. N. Leung, T. Y. Leung, P. B. S.

Lai, R. W. K. Chiu, and Y. M. D. Lo, “Plasma DNA tissue mapping by genome-

wide methylation sequencing for noninvasive prenatal, cancer, and transplantation

assessments,” Proceedings of the National Academy of Sciences, vol. 112, pp. E5503–

E5512, Oct. 2015.

[24] C. Bettegowda, M. Sausen, R. J. Leary, I. Kinde, Y. Wang, N. Agrawal, B. R.

Bartlett, H. Wang, B. Luber, R. M. Alani, E. S. Antonarakis, N. S. Azad, A. Bardelli,

H. Brem, J. L. Cameron, C. C. Lee, L. A. Fecher, G. L. Gallia, P. Gibbs, D. Le,

R. L. Giuntoli, M. Goggins, M. D. Hogarty, M. Holdhoff, S.-M. Hong, Y. Jiao, H. H.

Juhl, J. J. Kim, G. Siravegna, D. A. Laheru, C. Lauricella, M. Lim, E. J. Lip-

son, S. K. N. Marie, G. J. Netto, K. S. Oliner, A. Olivi, L. Olsson, G. J. Riggins,

A. Sartore-Bianchi, K. Schmidt, l.-M. Shih, S. M. Oba-Shinjo, S. Siena, D. Theodor-

escu, J. Tie, T. T. Harkins, S. Veronese, T.-L. Wang, J. D. Weingart, C. L. Wolfgang,

L. D. Wood, D. Xing, R. H. Hruban, J. Wu, P. J. Allen, C. M. Schmidt, M. A. Choti,

V. E. Velculescu, K. W. Kinzler, B. Vogelstein, N. Papadopoulos, and L. A. Diaz,

“Detection of Circulating Tumor DNA in Early- and Late-Stage Human Malignan-

cies,” Science translational medicine, vol. 6, p. 224ra24, Feb. 2014.

[25] A. R. Thierry, F. Mouliere, C. Gongora, J. Ollier, B. Robert, M. Ychou, M. Del Rio,

and F. Molina, “Origin and quantification of circulating DNA in mice with human

colorectal cancer xenografts,” Nucleic Acids Research, vol. 38, pp. 6159–6175, Oct.

2010.

[26] S.-J. Dawson, D. W. Tsui, M. Murtaza, H. Biggs, O. M. Rueda, S.-F. Chin, M. J.

Dunning, D. Gale, T. Forshew, B. Mahler-Araujo, S. Rajan, S. Humphray, J. Becq,

D. Halsall, M. Wallis, D. Bentley, C. Caldas, and N. Rosenfeld, “Analysis of Circu-

54

lating Tumor DNA to Monitor Metastatic Breast Cancer,” New England Journal of

Medicine, vol. 368, pp. 1199–1209, Mar. 2013.

[27] J. Phallen, M. Sausen, V. Adleff, A. Leal, C. Hruban, J. White, V. Anagnostou,

J. Fiksel, S. Cristiano, E. Papp, S. Speir, T. Reinert, M.-B. W. Orntoft, B. D.

Woodward, D. Murphy, S. Parpart-Li, D. Riley, M. Nesselbush, N. Sengamalay,

A. Georgiadis, Q. K. Li, M. R. Madsen, F. V. Mortensen, J. Huiskens, C. Punt, N. v.

Grieken, R. Fijneman, G. Meijer, H. Husain, R. B. Scharpf, L. A. Diaz, S. Jones,

S. Angiuoli, T. Ørntoft, H. J. Nielsen, C. L. Andersen, and V. E. Velculescu, “Direct

detection of early-stage cancers using circulating tumor DNA,” Science Translational

Medicine, vol. 9, p. eaan2415, Aug. 2017.

[28] I. Garcia-Murillas, G. Schiavon, B. Weigelt, C. Ng, S. Hrebien, R. J. Cutts,

M. Cheang, P. Osin, A. Nerurkar, I. Kozarewa, J. A. Garrido, M. Dowsett, J. S.

Reis-Filho, I. E. Smith, and N. C. Turner, “Mutation tracking in circulating tumor

DNA predicts relapse in early breast cancer,” Science Translational Medicine, vol. 7,

pp. 302ra133–302ra133, Aug. 2015.

[29] A. M. Newman, A. F. Lovejoy, D. M. Klass, D. M. Kurtz, J. J. Chabon, F. Scherer,

H. Stehr, C. L. Liu, S. V. Bratman, C. Say, L. Zhou, J. N. Carter, R. B. West,

G. W. Sledge Jr, J. B. Shrager, B. W. Loo Jr, J. W. Neal, H. A. Wakelee, M. Diehn,

and A. A. Alizadeh, “Integrated digital error suppression for improved detection of

circulating tumor DNA,” Nature Biotechnology, vol. 34, pp. 547–555, May 2016.

[30] E. Gormally, P. Vineis, G. Matullo, F. Veglia, E. Caboux, E. L. Roux, M. Peluso,

S. Garte, S. Guarrera, A. Munnia, L. Airoldi, H. Autrup, C. Malaveille, A. Dun-

ning, K. Overvad, A. Tjønneland, E. Lund, F. Clavel-Chapelon, H. Boeing, A. Tri-

chopoulou, D. Palli, V. Krogh, R. Tumino, S. Panico, H. B. Bueno-de Mesquita,

P. H. Peeters, G. Pera, C. Martinez, M. Dorronsoro, A. Barricarte, C. Navarro,

J. R. Quiros, G. Hallmans, N. E. Day, T. J. Key, R. Saracci, R. Kaaks, E. Riboli,

and P. Hainaut, “TP53 and KRAS2 Mutations in Plasma DNA of Healthy Sub-

jects and Subsequent Cancer Occurrence: A Prospective Study,” Cancer Research,

vol. 66, pp. 6871–6876, July 2006.

[31] K. C. A. Chan, E. C. W. Hung, J. K. S. Woo, P. K. S. Chan, S.-F. Leung, F. P. T.

Lai, A. S. M. Cheng, S. W. Yeung, Y. W. Chan, T. K. C. Tsui, J. S. S. Kwok,

A. D. King, A. T. C. Chan, A. C. van Hasselt, and Y. M. D. Lo, “Early detec-

tion of nasopharyngeal carcinoma by plasma Epstein-Barr virus DNA analysis in a

surveillance program,” Cancer, vol. 119, pp. 1838–1844, May 2013.

55

[32] A. M. Newman, S. V. Bratman, J. To, J. F. Wynne, N. C. W. Eclov, L. A. Modlin,

C. L. Liu, J. W. Neal, H. A. Wakelee, R. E. Merritt, J. B. Shrager, B. W. Loo Jr,

A. A. Alizadeh, and M. Diehn, “An ultrasensitive method for quantitating circulat-

ing tumor DNA with broad patient coverage,” Nature Medicine, vol. 20, pp. 548–554,

May 2014.

[33] M. Greaves, “Leukaemia ’firsts’ in cancer research and treatment,” Nature Reviews

Cancer, vol. 16, p. 163, Mar. 2016.

[34] G. Genovese, A. K. Kahler, R. E. Handsaker, J. Lindberg, S. A. Rose, S. F.

Bakhoum, K. Chambert, E. Mick, B. M. Neale, M. Fromer, S. M. Purcell, O. Svantes-

son, M. Landen, M. Hoglund, S. Lehmann, S. B. Gabriel, J. L. Moran, E. S. Lander,

P. F. Sullivan, P. Sklar, H. Gronberg, C. M. Hultman, and S. A. McCarroll, “Clonal

Hematopoiesis and Blood-Cancer Risk Inferred from Blood DNA Sequence,” New

England Journal of Medicine, vol. 371, pp. 2477–2487, Dec. 2014.

[35] L. Xia, Z. Li, B. Zhou, G. Tian, L. Zeng, H. Dai, X. Li, C. Liu, S. Lu, F. Xu,

X. Tu, F. Deng, Y. Xie, W. Huang, and J. He, “Statistical analysis of mutant allele

frequency level of circulating cell-free DNA and blood cells in healthy individuals,”

Scientific Reports, vol. 7, p. 7526, Aug. 2017.

[36] J. Tie, Y. Wang, C. Tomasetti, L. Li, S. Springer, I. Kinde, N. Silliman, M. Tacey,

H.-L. Wong, M. Christie, S. Kosmider, I. Skinner, R. Wong, M. Steel, B. Tran,

J. Desai, I. Jones, A. Haydon, T. Hayes, T. J. Price, R. L. Strausberg, L. A. Diaz,

N. Papadopoulos, K. W. Kinzler, B. Vogelstein, and P. Gibbs, “Circulating tumor

DNA analysis detects minimal residual disease and predicts recurrence in patients

with stage II colon cancer,” Science Translational Medicine, vol. 8, pp. 346ra92–

346ra92, July 2016.

[37] F. Diehl, K. Schmidt, M. A. Choti, K. Romans, S. Goodman, M. Li, K. Thornton,

N. Agrawal, L. Sokoll, S. A. Szabo, K. W. Kinzler, B. Vogelstein, and L. A. Diaz Jr,

“Circulating mutant DNA to assess tumor dynamics,” Nature Medicine, vol. 14,

pp. 985–990, Sept. 2008.

[38] A. A. Chaudhuri, J. J. Chabon, A. F. Lovejoy, A. M. Newman, H. Stehr, T. D. Azad,

M. S. Khodadoust, M. S. Esfahani, C. L. Liu, L. Zhou, F. Scherer, D. M. Kurtz,

C. Say, J. N. Carter, D. J. Merriott, J. C. Dudley, M. S. Binkley, L. Modlin, S. K.

Padda, M. F. Gensheimer, R. B. West, J. B. Shrager, J. W. Neal, H. A. Wakelee,

B. W. Loo, A. A. Alizadeh, and M. Diehn, “Early Detection of Molecular Residual

56

Disease in Localized Lung Cancer by Circulating Tumor DNA Profiling,” Cancer

Discovery, Sept. 2017.

[39] S. Mailankody, N. Korde, A. M. Lesokhin, N. Lendvai, H. Hassoun, M. Stetler-

Stevenson, and O. Landgren, “Minimal residual disease in multiple myeloma: bring-

ing the bench to the bedside,” Nature Reviews Clinical Oncology, vol. 12, pp. 286–

295, May 2015.

[40] O. Kis, R. Kaedbey, S. Chow, A. Danesh, M. Dowar, T. Li, Z. Li, J. Liu, M. Mansour,

E. Masih-Khan, T. Zhang, S. V. Bratman, A. M. Oza, S. Kamel-Reid, S. Trudel, and

T. J. Pugh, “Circulating tumour DNA sequence analysis as an alternative to multiple

myeloma bone marrow aspirates,” Nature Communications, vol. 8, p. ncomms15086,

May 2017.

[41] N. Guo, F. Lou, Y. Ma, J. Li, B. Yang, W. Chen, H. Ye, J.-B. Zhang, M.-Y.

Zhao, W.-J. Wu, R. Shi, L. Jones, K. S. Chen, X. F. Huang, S.-Y. Chen, and Y. Liu,

“Circulating tumor DNA detection in lung cancer patients before and after surgery,”

Scientific Reports, vol. 6, p. srep33519, Sept. 2016.

[42] P. A. Janne, J. C.-H. Yang, D.-W. Kim, D. Planchard, Y. Ohe, S. S. Ramalingam,

M.-J. Ahn, S.-W. Kim, W.-C. Su, L. Horn, D. Haggstrom, E. Felip, J.-H. Kim,

P. Frewer, M. Cantarini, K. H. Brown, P. A. Dickinson, S. Ghiorghiu, and M. Ran-

son, “AZD9291 in EGFR Inhibitor–Resistant Non–Small-Cell Lung Cancer,” New

England Journal of Medicine, vol. 372, pp. 1689–1699, Apr. 2015.

[43] Z. Piotrowska, M. J. Niederst, C. A. Karlovich, H. A. Wakelee, J. W. Neal, M. Mino-

Kenudson, L. Fulton, A. N. Hata, E. L. Lockerman, A. Kalsy, S. Digumarthy,

A. Muzikansky, M. Raponi, A. R. Garcia, H. E. Mulvey, M. K. Parks, R. H. DiCecca,

D. Dias-Santagata, A. J. Iafrate, A. T. Shaw, A. R. Allen, J. A. Engelman, and L. V.

Sequist, “Heterogeneity Underlies the Emergence of EGFR T790 Wild-Type Clones

Following Treatment of T790m-Positive Cancers with a Third Generation EGFR

Inhibitor,” Cancer discovery, vol. 5, pp. 713–722, July 2015.

[44] M. Murtaza, S.-J. Dawson, K. Pogrebniak, O. M. Rueda, E. Provenzano, J. Grant,

S.-F. Chin, D. W. Y. Tsui, F. Marass, D. Gale, H. R. Ali, P. Shah, T. Contente-

Cuomo, H. Farahani, K. Shumansky, Z. Kingsbury, S. Humphray, D. Bentley, S. P.

Shah, M. Wallis, N. Rosenfeld, and C. Caldas, “Multifocal clonal evolution charac-

terized using circulating tumour DNA in a case of metastatic breast cancer,” Nature

Communications, vol. 6, p. ncomms9760, Nov. 2015.

57

[45] “US Food & Drug Administration. Premarket approval

P150044 — Cobas EGFR MUTATION TEST V2. FDA

https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfpma/pma.cfm?id=P150044.”

[46] “European Medicines Agency. Tagrisso: public as-

sessment report — product information. EMA

http://www.ema.europa.eu/ema/index.jsp?curl=pages/medicines/human/medicines/

004124/human med 001961.jsp&mid=WC0b01ac058001d124.”

[47] J. Chung, D.-S. Son, H.-J. Jeon, K.-M. Kim, G. Park, G. H. Ryu, W.-Y. Park, and

D. Park, “The minimal amount of starting DNA for Agilent’s hybrid capture-based

targeted massively parallel sequencing,” Scientific Reports, vol. 6, p. 26732, May

2016.

[48] K. Robasky, N. E. Lewis, and G. M. Church, “The role of replicates for error miti-

gation in next-generation sequencing,” Nature Reviews Genetics, vol. 15, pp. 56–62,

Jan. 2014.

[49] L. Chen, P. Liu, T. C. Evans, and L. M. Ettwiller, “DNA damage is a pervasive cause

of sequencing errors, directly confounding variant identification,” Science, vol. 355,

pp. 752–756, Feb. 2017.

[50] K. Wang, S. Lai, X. Yang, T. Zhu, X. Lu, C.-I. Wu, and J. Ruan, “Ultrasensitive

and high-efficiency screen of de novo low-frequency mutations by o2n-seq,” Nature

Communications, vol. 8, p. ncomms15335, May 2017.

[51] S. El Messaoudi, F. Rolet, F. Mouliere, and A. R. Thierry, “Circulating cell free

DNA: Preanalytical considerations,” Clinica Chimica Acta, vol. 424, pp. 222–230,

Sept. 2013.

[52] I. M. Diaz, A. Nocon, D. H. Mehnert, J. Fredebohm, F. Diehl, and F. Holtrup,

“Performance of Streck cfDNA Blood Collection Tubes for Liquid Biopsy Testing,”

PLOS ONE, vol. 11, p. e0166354, Nov. 2016.

[53] A. S. Devonshire, A. S. Whale, A. Gutteridge, G. Jones, S. Cowen, C. A. Foy, and

J. F. Huggett, “Towards standardisation of cell-free DNA measurement in plasma:

controls for extraction efficiency, fragment size bias and quantification,” Analytical

and Bioanalytical Chemistry, vol. 406, no. 26, pp. 6499–6512, 2014.

[54] B. J. Hindson, K. D. Ness, D. A. Masquelier, P. Belgrader, N. J. Heredia, A. J.

Makarewicz, I. J. Bright, M. Y. Lucero, A. L. Hiddessen, T. C. Legler, T. K. Kitano,

58

M. R. Hodel, J. F. Petersen, P. W. Wyatt, E. R. Steenblock, P. H. Shah, L. J.

Bousse, C. B. Troup, J. C. Mellen, D. K. Wittmann, N. G. Erndt, T. H. Cauley,

R. T. Koehler, A. P. So, S. Dube, K. A. Rose, L. Montesclaros, S. Wang, D. P.

Stumbo, S. P. Hodges, S. Romine, F. P. Milanovich, H. E. White, J. F. Regan,

G. A. Karlin-Neumann, C. M. Hindson, S. Saxonov, and B. W. Colston, “High-

Throughput Droplet Digital PCR System for Absolute Quantitation of DNA Copy

Number,” Analytical Chemistry, vol. 83, pp. 8604–8610, Nov. 2011.

[55] L. Mamanova, A. J. Coffey, C. E. Scott, I. Kozarewa, E. H. Turner, A. Kumar,

E. Howard, J. Shendure, and D. J. Turner, “Target-enrichment strategies for next-

generation sequencing,” Nature Methods, vol. 7, pp. 111–118, Feb. 2010.

[56] T. Forshew, M. Murtaza, C. Parkinson, D. Gale, D. W. Y. Tsui, F. Kaper, S.-J.

Dawson, A. M. Piskorz, M. Jimenez-Linan, D. Bentley, J. Hadfield, A. P. May,

C. Caldas, J. D. Brenton, and N. Rosenfeld, “Noninvasive Identification and Moni-

toring of Cancer Mutations by Targeted Deep Sequencing of Plasma DNA,” Science

Translational Medicine, vol. 4, pp. 136ra68–136ra68, May 2012.

[57] J. Gagan and E. M. Van Allen, “Next-generation sequencing to guide cancer ther-

apy,” Genome Medicine, vol. 7, no. 1, p. 80, 2015.

[58] R. J. Leary, M. Sausen, I. Kinde, N. Papadopoulos, J. D. Carpten, D. Craig,

J. O’Shaughnessy, K. W. Kinzler, G. Parmigiani, B. Vogelstein, L. A. Diaz, and V. E.

Velculescu, “Detection of Chromosomal Alterations in the Circulation of Cancer

Patients with Whole-Genome Sequencing,” Science Translational Medicine, vol. 4,

pp. 162ra154–162ra154, Nov. 2012.

[59] N. V. Roy, M. V. d. Linden, B. Menten, A. Dheedene, C. Vandeputte, J. V. Dorpe,

G. Laureys, M. Renard, T. Sante, T. Lammens, B. D. Wilde, F. Speleman, and

K. D. Preter, “Shallow whole genome sequencing on circulating cell-free DNA al-

lows reliable non-invasive copy number profiling in neuroblastoma patients,” Clinical

Cancer Research, p. clincanres.0675.2017, Jan. 2017.

[60] V. A. Adalsteinsson, G. Ha, S. S. Freeman, A. D. Choudhury, D. G. Stover, H. A.

Parsons, G. Gydush, S. C. Reed, D. Rotem, J. Rhoades, D. Loginov, D. Livitz,

D. Rosebrock, I. Leshchiner, J. Kim, C. Stewart, M. Rosenberg, J. M. Francis, C.-

Z. Zhang, O. Cohen, C. Oh, H. Ding, P. Polak, M. Lloyd, S. Mahmud, K. Helvie,

M. S. Merrill, R. A. Santiago, E. P. O’Connor, S. H. Jeong, R. Leeson, R. M. Barry,

J. F. Kramkowski, Z. Zhang, L. Polacek, J. G. Lohr, M. Schleicher, E. Lipscomb,

59

A. Saltzman, N. M. Oliver, L. Marini, A. G. Waks, L. C. Harshman, S. M. Tolaney,

E. M. V. Allen, E. P. Winer, N. U. Lin, M. Nakabayashi, M.-E. Taplin, C. M.

Johannessen, L. A. Garraway, T. R. Golub, J. S. Boehm, N. Wagle, G. Getz, J. C.

Love, and M. Meyerson, “Scalable whole-exome sequencing of cell-free DNA reveals

high concordance with metastatic tumors,” Nature Communications, vol. 8, p. 1324,

Nov. 2017.

[61] R. Lebofsky, C. Decraene, V. Bernard, M. Kamal, A. Blin, Q. Leroy, T. Rio Frio,

G. Pierron, C. Callens, I. Bieche, A. Saliou, J. Madic, E. Rouleau, F.-C. Bidard,

O. Lantz, M.-H. Stern, C. Le Tourneau, and J.-Y. Pierga, “Circulating tumor DNA

as a non-invasive substitute to metastasis biopsy for tumor genotyping and person-

alized medicine in a prospective trial across all tumor types,” Molecular Oncology,

vol. 9, pp. 783–790, Apr. 2015.

[62] J. S. Frenel, S. Carreira, J. Goodall, D. Roda, R. Perez-Lopez, N. Tunariu, R. Ri-

isnaes, S. Miranda, I. Figueiredo, D. NavaRodrigues, A. Smith, C. Leux, I. Garcia-

Murillas, R. Ferraldeschi, D. Lorente, J. Mateo, M. Ong, T. A. Yap, U. Banerji,

D. G. Tandefelt, N. Turner, G. Attard, and J. S. de Bono, “Serial Next Generation

Sequencing of Circulating Cell Free DNA Evaluating Tumour Clone Response To

Molecularly Targeted Drug Administration,” Clinical cancer research : an official

journal of the American Association for Cancer Research, vol. 21, pp. 4586–4596,

Oct. 2015.

[63] S. R. Head, H. K. Komori, S. A. LaMere, T. Whisenant, F. Van Nieuwerburgh,

D. R. Salomon, and P. Ordoukhanian, “Library construction for next-generation

sequencing: Overviews and challenges,” BioTechniques, vol. 56, pp. 61–passim, Feb.

2014.

[64] D. J. Turner, I. Kozarewa, M. J. Sanders, M. Berriman, M. A. Quail, and

Z. Ning, “Amplification-free Illumina sequencing-library preparation facilitates im-

proved mapping and assembly of (G+C)-biased genomes,” Nature Methods, vol. 6,

p. 291, Mar. 2009.

[65] M. Costello, T. J. Pugh, T. J. Fennell, C. Stewart, L. Lichtenstein, J. C. Meldrim,

J. L. Fostel, D. C. Friedrich, D. Perrin, D. Dionne, S. Kim, S. B. Gabriel, E. S.

Lander, S. Fisher, and G. Getz, “Discovery and characterization of artifactual mu-

tations in deep coverage targeted capture sequencing data due to oxidative DNA

damage during sample preparation,” Nucleic Acids Research, vol. 41, pp. e67–e67,

Apr. 2013.

60

[66] C. Xu, M. R. N. Ranjbar, Z. Wu, J. DiCarlo, and Y. Wang, “Detecting very low allele

fraction variants using targeted DNA sequencing and a novel molecular barcode-

aware variant caller,” BMC Genomics, vol. 18, p. 5, Jan. 2017.

[67] S. Goodwin, J. D. McPherson, and W. R. McCombie, “Coming of age: ten years of

next-generation sequencing technologies,” Nature Reviews Genetics, vol. 17, pp. 333–

351, June 2016.

[68] T. Kivioja, A. Vaharautio, K. Karlsson, M. Bonke, M. Enge, S. Linnarsson, and

J. Taipale, “Counting absolute numbers of molecules using unique molecular identi-

fiers,” Nature Methods, vol. 9, pp. 72–74, Nov. 2011.

[69] E. Z. Macosko, A. Basu, R. Satija, J. Nemesh, K. Shekhar, M. Goldman, I. Tirosh,

A. R. Bialas, N. Kamitaki, E. M. Martersteck, J. J. Trombetta, D. A. Weitz, J. R.

Sanes, A. K. Shalek, A. Regev, and S. A. McCarroll, “Highly Parallel Genome-wide

Expression Profiling of Individual Cells Using Nanoliter Droplets,” Cell, vol. 161,

pp. 1202–1214, May 2015.

[70] M. W. Schmitt, S. R. Kennedy, J. J. Salk, E. J. Fox, J. B. Hiatt, and L. A. Loeb, “De-

tection of ultra-rare mutations by next-generation sequencing,” Proceedings of the

National Academy of Sciences of the United States of America, vol. 109, pp. 14508–

14513, Sept. 2012.

[71] S. R. Kennedy, M. W. Schmitt, E. J. Fox, B. F. Kohrn, J. J. Salk, E. H. Ahn,

M. J. Prindle, K. J. Kuong, J.-C. Shen, R.-A. Risques, and L. A. Loeb, “Detect-

ing ultralow-frequency mutations by Duplex Sequencing,” Nature Protocols, vol. 9,

pp. 2586–2606, Nov. 2014.

[72] J. B. Hiatt, R. P. Patwardhan, E. H. Turner, C. Lee, and J. Shendure, “Parallel, tag-

directed assembly of locally derived short sequence reads,” Nature Methods, vol. 7,

pp. 119–122, Feb. 2010.

[73] I. Kinde, J. Wu, N. Papadopoulos, K. W. Kinzler, and B. Vogelstein, “Detection and

quantification of rare mutations with massively parallel sequencing,” Proceedings of

the National Academy of Sciences, vol. 108, pp. 9530–9535, June 2011.

[74] C. B. Jabara, C. D. Jones, J. Roach, J. A. Anderson, and R. Swanstrom, “Accu-

rate sampling and deep sequencing of the HIV-1 protease gene using a Primer ID,”

Proceedings of the National Academy of Sciences, vol. 108, pp. 20166–20171, Dec.

2011.

61

[75] J. A. Casbon, R. J. Osborne, S. Brenner, and C. P. Lichtenstein, “A method for

counting PCR template molecules with application to next-generation sequencing,”

Nucleic Acids Research, vol. 39, pp. e81–e81, July 2011.

[76] H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows–Wheeler

transform,” Bioinformatics, vol. 25, pp. 1754–1760, July 2009.

[77] M. DePristo, E. Banks, R. Poplin, K. Garimella, J. Maguire, C. Hartl, A. Philip-

pakis, G. del Angel, M. Rivas, M. Hanna, A. McKenna, T. Fennell, A. Kernytsky,

A. Sivachenko, K. Cibulskis, S. Gabriel, D. Altshuler, and M. Daly, “A framework

for variation discovery and genotyping using next-generation DNA sequencing data,”

Nature genetics, vol. 43, pp. 491–498, May 2011.

[78] H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth,

G. Abecasis, and R. Durbin, “The Sequence Alignment/Map format and SAMtools,”

Bioinformatics, vol. 25, pp. 2078–2079, Aug. 2009.

[79] K. Cibulskis, M. S. Lawrence, S. L. Carter, A. Sivachenko, D. Jaffe, C. Sougnez,

S. Gabriel, M. Meyerson, E. S. Lander, and G. Getz, “Sensitive detection of somatic

point mutations in impure and heterogeneous cancer samples,” Nature Biotechnol-

ogy, vol. 31, pp. 213–219, Mar. 2013.

[80] S. T. Fan, R. T. P. Poon, C. Yeung, C. M. Lam, C. M. Lo, W. K. Yuen, K. K. C. Ng,

C. L. Liu, and S. C. Chan, “Outcome after partial hepatectomy for hepatocellular

cancer within the Milan criteria,” British Journal of Surgery, vol. 98, pp. 1292–1300,

Sept. 2011.

[81] T. Smith, A. Heger, and I. Sudbery, “UMI-tools: modeling sequencing errors in

Unique Molecular Identifiers to improve quantification accuracy,” Genome Research,

vol. 27, pp. 491–499, Mar. 2017.

accurate mutation detection in cancer through improved ......1.1 sources and alterations of...

Documents