Overview of NGS Errors Working Group Or...My Declaration of War on the Bioinformatics Pipeline K. S. Dorman Department of Statistics and Genetics, Development & Cell Biology Iowa State University SAMSI - Beyond Bioinformatics May 11–13 2015 NGS Error Iowa State University


Page 1: Overview of NGS Errors Working Group · Title: Overview of NGS Errors Working Group - Or...My Declaration of War on the Bioinformatics Pipeline Author: K. S. Dorman Created Date:

Overview of NGS Errors Working Group
Or...My Declaration of War on the Bioinformatics Pipeline

K. S. Dorman

Department of Statistics and Genetics, Development & Cell Biology
Iowa State University

SAMSI - Beyond Bioinformatics
May 11–13, 2015

NGS Error Iowa State University


Initial Proposed WG Goals

Overarching Goal. Increase reproducibility by reducing noise from the pipeline.

• To learn about common pipeline components.

• To identify pipeline components most likely to affect downstream analyses.

• To write a review on statistical concerns in common pipeline components: What you need to know.

• To develop research projects to integrate more mathematical/statistical thinking in pipeline components.


WG Meetings - Pipeline Components Covered

• Sequencing platforms: Maria C. Rivera, Virginia Commonwealth U.

• Base calling: Xinping Cui, U. California, Riverside

• Read quality filtering/trimming: Gabriel Murillo, U. California, Riverside

• Error correction: Karin Dorman, Vahid Noroozi, Xin Yin, Iowa State U.

• Alignment: Adam B. Olshen, U. California, San Francisco

• Assembly: A. Severin, Iowa State U.

• Bud Mishra, Courant Institute (TotalReCaller, Base-Calling, Alignment, Assembly, GWAS)


Sequencing Platforms

• Main platforms from: Illumina, Life Technologies, Pacific Biosciences, Oxford Nanopore

• Isn’t it dangerous to invest so much energy in a technology that will undoubtedly be soon replaced?

• Errors. Error rates remain high, highest for the newest technology.

• Trade-off. High throughput with errors, or low throughput with precision.

• The key replication in these experiments is the reading and re-reading of short fragments.
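
To make the point about replication concrete, here is a minimal sketch of how re-reading the same position can suppress errors. It is not from the talk: the function name `majority_error` is hypothetical, and the model assumes that reads err independently with a common probability and that all erroneous reads agree on the same wrong base (a pessimistic simplification, so the result is an upper bound under the independence assumption).

```python
from math import comb


def majority_error(per_read_error: float, depth: int) -> float:
    """Upper bound on the chance that a majority vote miscalls one base.

    Assumptions (illustrative only): each of `depth` reads covering the
    position is wrong independently with probability `per_read_error`,
    all wrong reads report the same wrong base, and ties count as errors.
    """
    # The vote fails when wrong reads are at least as numerous as correct ones.
    first_failing = (depth + 1) // 2
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error) ** (depth - k)
        for k in range(first_failing, depth + 1)
    )


if __name__ == "__main__":
    for depth in (1, 3, 5, 9):
        print(depth, majority_error(0.01, depth))
```

Under these assumptions, a 1% per-read error rate drops to roughly 0.03% at depth 3. Real errors are not independent (they cluster by cycle, sequence context, and strand), which is one reason the bound should be read as a sketch rather than a prediction.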


Base Calling (Illumina)

• The overwhelming majority of calls are made by Bustard, the unpublished Illumina base caller.

• Bustard is not the best method.


Error Removal

• Base calling. Really good base calling can prevent errors.

• The consequences of residual errors.

  • De novo assembly. Errors lead to false k-mers and to increased complexity of, and errors in, the resulting assembly.

  • Alignment. Errors lead to misaligned or unaligned sequences.

• Read trimming. Identify low-quality bases and drop them from the end(s) of reads.

• Error correction. Identify and correct low-abundance k-mers, but cannot work when coverage is not uniform (e.g., RNA-seq, metagenomics, single-cell, heterogeneous populations, like cancer).

• Alignment or Assembly. Smart aligners can tolerate some residual errors, and assemblers often do error correction.
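
The k-mer idea behind error correction can be sketched in a few lines. This is an illustration only, not the method of any particular corrector: the helper names `kmer_counts` and `weak_kmers` and the `min_count` threshold are hypothetical. It counts every k-mer across the reads and flags those seen fewer than `min_count` times as likely to contain an error.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmers(reads, k, min_count):
    """Return k-mers observed fewer than min_count times.

    Under roughly uniform coverage, true k-mers recur many times, so rare
    k-mers are suspected errors. This reasoning breaks down when coverage
    varies, because a genuinely rare sequence also produces rare k-mers.
    """
    counts = kmer_counts(reads, k)
    return {kmer for kmer, count in counts.items() if count < min_count}


if __name__ == "__main__":
    # Five identical reads plus one read with a single substitution (T -> A).
    reads = ["ACGTAC"] * 5 + ["ACGAAC"]
    print(sorted(weak_kmers(reads, k=3, min_count=2)))
```

In this toy input the single substitution creates three k-mers seen only once (AAC, CGA, GAA), which the threshold flags. The same threshold would also flag k-mers from a low-abundance transcript or a rare subclone, which is the non-uniform coverage problem the slide names.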


Read Trimming

Del Fabbro, C.; Scalabrin, S.; Morgante, M. & Giorgi, F. M. An extensive evaluation of read trimming effects on Illumina NGS data analysis. PLoS One, 2013, 8, e85024

“What is the best trimming algorithm? [N]o generic answer can be given. ... [It depends] on the dataset, downstream analysis, and user-decided parameter-dependent tradeoffs. ... [but for] processing DNA-Seq data ... trimming should be applied every time in order to improve quality and performance.”
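
For readers unfamiliar with trimming, the simplest variant can be sketched as follows. This is not the algorithm of any tool evaluated in the cited paper: the function name `trim_3prime` and the default threshold are hypothetical, and the sketch assumes FASTQ quality strings in the standard Phred+33 encoding. It removes bases from the 3' end until it reaches one whose quality meets the threshold.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop trailing bases whose Phred quality is below min_q.

    seq and qual are the sequence and quality strings of one FASTQ record;
    offset=33 assumes Phred+33 encoding. Trimming stops at the first base
    (scanning from the 3' end) with quality >= min_q.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    print(trim_3prime("ACGTACGT", "IIIII###"))
```

Even this toy version exposes a user-decided parameter (`min_q`) of the kind the quotation warns about: a stricter threshold removes more errors but also shortens reads, and which matters more depends on the downstream analysis.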


The Science of Pipelines - Concerns

• Few studies. There are few comprehensive comparisons of base-calling, trimming, error correction, or alignment pipeline components. The combinatorics is forbidding.

• Wrong outcome. In comparison studies, downstream performance may be implied via:

  • Alignability, or correct genome localization if simulation data.

  • Error rate conditional on alignment, or true error rate if simulation data.

• Complicated findings. Almost anything can affect downstream performance.

  • The dataset, particularly its quality.

  • The particular pipeline component used.

  • The parameter settings of the pipeline components.

  • The downstream objective itself.
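
The proxy outcomes named above can be computed side by side when the truth is known, as in simulation. The sketch below is purely illustrative: the `SimRead` record, the `summarize` function, and the numbers in the example are all made up for this purpose and do not come from any study.

```python
from dataclasses import dataclass


@dataclass
class SimRead:
    """One simulated read (hypothetical record for illustration)."""
    length: int    # number of bases in the read
    errors: int    # true number of miscalled bases, known from simulation
    aligned: bool  # whether the aligner placed the read


def summarize(reads):
    """Compute alignability and error rates with and without conditioning."""
    aligned = [r for r in reads if r.aligned]
    total_bases = sum(r.length for r in reads)
    aligned_bases = sum(r.length for r in aligned)
    return {
        "alignability": len(aligned) / len(reads),
        "true_error_rate": sum(r.errors for r in reads) / total_bases,
        "error_rate_given_aligned": (
            sum(r.errors for r in aligned) / aligned_bases
            if aligned_bases
            else float("nan")
        ),
    }


if __name__ == "__main__":
    reads = [
        SimRead(100, 1, True),
        SimRead(100, 0, True),
        SimRead(100, 2, True),
        SimRead(100, 15, False),  # error-laden read that failed to align
    ]
    print(summarize(reads))
```

In this made-up example the error rate among aligned reads is 1%, while the true rate over all reads is 4.5%, because the worst read never aligned. Neither number says directly how a downstream analysis would fare.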


The Usage of Pipelines - Concerns

• Hard to set up.

  • Even if the manuals are good, setting parameters still takes extensive experience and empirical experimentation.

  • There are tens of versions of each component, generally not thoroughly compared.

• Not generalizable. In desperation, we learn from practiced experts, but not all their experience extends to new situations, our situation.

• Poorly understood. Not only is our knowledge limited, it is largely empirical; there is little insight as to why one pipeline works better than another.

• Herd mentality. Since no one really understands them, pipelines are popularized as temporary fads, with little science driving their adoption or death.


The solution?

• Pipelines do not seem to be leaking completely at random, and

• Additional components have work to do only because signal is lost, unused, and ignored, especially at pipeline junctions.

Death to the bioinformatics pipeline! (At least everything between base calling and alignment/assembly.)

• Ashley Cacho, U. California Riverside: A comprehensive comparison of base calling methods.

• Xin Yin, Iowa State University: Combined base calling and error correction.
