Practical Course in Genome Bioinformatics

Day 2 - Friday 27th January 2017

What is involved in a Genome Project?

• From a bioinformatics perspective, once experimental design, library construction and other wet lab preparations are done, a genome project involves:

• Sequencing and sequencing platforms

• De novo Assembly

• RNA-sequencing and Mapping

• Ab initio Gene Prediction

• Protein annotation

• Submission and publication of the genome in a database

• Further downstream analysis

Shortest Common Supersequence Problem

• Definition: Given two sequences X and Y, what is the shortest sequence Z such that X and Y are both subsequences of Z?

• X = "I am the very model"

• Y = "model of a modern Major-General"

• Z = "I am the very model of a modern Major-General"

• Looks deceptively easy, but in the general case (i.e. an arbitrary number of sequences) the problem is NP-complete
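
The two-sequence example above can be sketched in a few lines of Python. This is a minimal illustration, assuming X and Y must appear as contiguous stretches of Z with X placed first (a simplification of the general problem); the helper names `overlap` and `merge` are made up for this sketch.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge(a: str, b: str) -> str:
    """Shortest string that starts with a and ends with b."""
    return a + b[overlap(a, b):]


x = "I am the very model"
y = "model of a modern Major-General"
z = merge(x, y)
print(z)
```

With only two sequences the answer follows from a single overlap search; the difficulty comes from choosing the order and overlaps for many sequences at once.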

Simple Example

• The first line of Charles Dickens's novel 'A Tale of Two Cities' begins: "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness."

• Our sequences will be 5 words long each: "it was the best of", "was the best of times", etc.

• We have no knowledge of the original sentence, just the 5 word sequences

• build a graph of the sequences

• traverse the graph

• output the "best" path as an estimate of the original sentence

Simple Example (2)

Figure adapted from Michael C. Schatz

Assembly Overview

• DNA isolated from target species

• Random fragmentation for shotgun sequencing

• Assemble paired-end reads into contigs (consensus sequence)

• Scaffold contigs based on long range information (mate pairs)

Sequencing reads

• Recall from last lecture: many different platforms and approaches to sequence DNA

• Simplified approach for assembly:

• Paired-end reads for assembly

• Mate-pair reads for scaffolding

• Long reads for gap filling

Figure adapted from "A field guide to whole-genome sequencing, assembly and annotation", Ekblom and Wolf, Evolutionary Applications, 2014

Contigs

• Contig: a contiguous, linear stretch of DNA/RNA consensus sequence

• Contigs terminate at:

• coverage gaps (more on this in a few slides!)

• conflicts, i.e. errors, repeat boundaries

Scaffolds

• We have a large number of contigs, but they are unordered. Use mate pairs to reconstruct the contig ordering

• The final scaffold can still have gaps. We know the order, orientation and approximate spacing of the contigs, so by default the gaps are filled with Ns; alternatively, use PacBio long reads to fill them
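
Filling gaps with Ns can be illustrated with a small sketch. It assumes the contigs are already ordered and oriented and that a gap size estimate (for example from mate-pair insert sizes) is available for each junction; the function name `build_scaffold` and the toy sequences are hypothetical.

```python
def build_scaffold(contigs, gaps):
    """Join ordered, oriented contigs, writing each estimated gap as a run of Ns."""
    if len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap estimate between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gaps, contigs[1:]):
        parts.append("N" * gap)  # unknown bases of approximately known length
        parts.append(contig)
    return "".join(parts)


scaffold = build_scaffold(["ACGTAC", "GGCTA", "TTAGC"], [4, 2])
print(scaffold)
```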

Coverage

• Why do we have gaps?

• Related to random fragmentation during shotgun sequencing

• 1x coverage:

Figure (two panels): number of balls per bin (x-axis: Bin, 0 to 1000; y-axis: Number of balls, 0 to 20) and histogram of balls in bin (x-axis: Balls in bin, 0 to 20; y-axis: Freq, 0 to 400)

• 2x coverage:

Figure: same two panels at 2x coverage

• 4x coverage:

Figure: same two panels at 4x coverage

• 8x coverage:

Figure: same two panels at 8x coverage
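
The balls-in-bins picture of coverage can be reproduced with a short simulation. This is a sketch under simple assumptions: 1000 bins, each "read" is one ball thrown into a uniformly random bin, and the fixed seed is only there to make runs repeatable. The function name `simulate` is made up for this example.

```python
import math
import random


def simulate(coverage, bins=1000, seed=1):
    """Throw coverage * bins balls into bins uniformly at random; return the count per bin."""
    rng = random.Random(seed)
    counts = [0] * bins
    for _ in range(coverage * bins):
        counts[rng.randrange(bins)] += 1
    return counts


for c in (1, 2, 4, 8):
    counts = simulate(c)
    empty = counts.count(0) / len(counts)
    # exp(-c) is the fraction of empty bins expected if counts are roughly Poisson with mean c.
    print(f"{c}x: fraction of empty bins {empty:.3f}, exp(-{c}) = {math.exp(-c):.3f}")
```

Empty bins correspond to positions with no read, i.e. the gaps that end contigs; raising the coverage shrinks their fraction quickly but never to exactly zero.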

Poisson distribution

• The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event

• For large λ it resembles a normal distribution restricted to non-negative integers; it is defined by a single parameter λ (the expected number of occurrences)

• Standard deviation is sqrt(λ)

Text and figure adapted from Wikipedia https://en.wikipedia.org/wiki/Poisson_distribution
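
As a small worked example, the Poisson probabilities for the coverage levels used earlier can be computed directly from the formula P(k) = λ^k e^(-λ) / k!. The function name `poisson_pmf` is chosen for this sketch; taking λ equal to the average coverage is the modelling assumption.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected number is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


for lam in (1, 2, 4, 8):
    p0 = poisson_pmf(0, lam)  # probability that a position is covered by no read
    print(f"lambda={lam}: P(0)={p0:.5f}, standard deviation={math.sqrt(lam):.2f}")
```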

Overlap-Layout-Consensus Assembly

• Overlap-layout-consensus (OLC) proceeds in three steps:

• Overlap: find approximate overlaps between all pairs of sequences (consistent with the error rate of the platform)

• Layout: use overlap information to find the layout or tiling of reads

• Consensus: produce consensus sequence of a given region

• Find the optimal path through a graph where each sequence is a node and edges are weighted by the length of overlap between sequences

• Early methods used a greedy algorithm to find a Hamiltonian path giving the shortest common supersequence (this compressed repetitive regions :-( )

• Later ('90s) methods used a string graph and found the Eulerian tour such that the "arrival rate" (sequence start positions) is consistent with coverage
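
The overlap and layout steps can be sketched with a toy greedy assembler. This is an illustration only: it assumes error-free reads from one strand, uses exact rather than approximate overlaps, and skips the consensus step. The names `overlap` and `greedy_assemble` and the three reads are invented for the example.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Longest suffix of a that exactly matches a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(reads) > 1:
        n, a, b = max(
            ((overlap(a, b, min_len), a, b) for a, b in permutations(reads, 2)),
            key=lambda t: t[0],
        )
        if n == 0:
            break  # nothing overlaps any more: the remaining sequences are separate contigs
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[n:])
    return reads


contigs = greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATC"])
print(contigs)
```

Comparing all pairs in every round is what makes this approach scale poorly with the number of reads, and always taking the largest overlap is what collapses repeats.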

Overlap-Layout-Consensus Assembly (2)

de Bruijn Graph Assembly

• OLC is difficult to apply to NGS data

• Reads are shorter (is the overlap significant?)

• Many more reads (finding overlaps scales quadratically!)

• de Bruijn graph:

• Break reads into k-mers: constant length subsequences

• Build a graph where nodes are k-mers and edges are overlaps of length k-1

• Clean the graph (remove tips and bubbles)

• Find the Eulerian tour
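
The steps above can be sketched on a toy example. The code assumes error-free reads from a single strand (so the cleaning step is skipped), links two k-mers when they are adjacent in some read, and keeps each link only once, which ignores repeat multiplicity. The function names and reads are hypothetical, and the path search is Hierholzer's algorithm, one standard way to find an Eulerian path.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Nodes are k-mers; an edge joins two k-mers that are adjacent in some read."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + k + 1]))
    graph = defaultdict(list)
    for src, dst in sorted(edges):
        graph[src].append(dst)
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    out_deg = {node: len(targets) for node, targets in graph.items()}
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    nodes = set(out_deg) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next(
        (n for n in sorted(nodes) if out_deg.get(n, 0) - in_deg.get(n, 0) == 1),
        min(nodes),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a path of overlapping k-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCGT", "GCGTGCA", "TGCAATC"]
assembled = spell(eulerian_path(de_bruijn(reads, k=3)))
print(assembled)
```

No pairwise read comparison is needed here: reads that share k-mers meet automatically at the same nodes.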

de Bruijn Graph Assembly (2)

Figure adapted from Wikipedia

Domain specific problems

• Outside of the scope of this course, but the assembly problem is made harder by the following problems:

• read orientation (which DNA strand it came from) is unknown

• reads contain errors (sequencing errors)

• incomplete coverage (non-random insert sampling)

• paired-end data can be misleading in a tiny minority of cases

• repetition (genome specific)

• chimeric reads* (PCR errors)

• contamination* (student errors ;-P)

• excessively high error rates despite good quality scores*
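
The first problem in the list, unknown read orientation, can be made concrete with a short sketch. A read and its reverse complement describe the same piece of DNA, so a common convention in k-mer based tools is to store each k-mer under a single "canonical" form. The names below are chosen for this example and are not taken from any particular assembler.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Sequence of the opposite strand, read in its own 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


print(reverse_complement("ATGC"))
print(canonical("GCAT") == canonical("ATGC"))
```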

Assembly Evaluation

• We want to be able to compare and assess the output of de novo assembly programs:

• Number of contigs

• Average, minimum, maximum contig lengths

• Sum of contig lengths (assembly size)

• N50

• Coverage

Assembly size

• Summation of the lengths of all contigs in basepairs

• Ideally should be close to the genome size of the studied organism
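
The simpler metrics can be computed directly from the contig sequences. A minimal sketch, assuming the contigs are already loaded as plain strings (reading a FASTA file is left out); `contig_stats` and the toy contigs are made up for illustration.

```python
from statistics import mean


def contig_stats(contigs):
    """Basic length statistics for a list of contig sequences."""
    lengths = [len(c) for c in contigs]
    return {
        "n_contigs": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "assembly_size": sum(lengths),  # sum of all contig lengths in bp
    }


stats = contig_stats(["ACGT" * 5, "AC" * 3, "A" * 10])
print(stats)
```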

N50

• N50 is the length of the shortest contig in the smallest set of longest contigs that together comprise at least 50% of the assembly size

• The same statistic can be computed for other thresholds: N25, N90, etc.

• N50 ranking highly correlated with overall ranking of many evaluation metrics (Assemblathon 1 & 2 articles)
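
A short sketch of the computation, assuming only a list of contig lengths is available. The function name `nx` and the example lengths are invented for illustration.

```python
def nx(lengths, x=50):
    """Length of the contig at which the running sum of the longest contigs reaches x% of the total."""
    threshold = sum(lengths) * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    raise ValueError("no contig lengths given")


lengths = [80, 70, 50, 40, 30, 20]  # assembly size 290
print("N25:", nx(lengths, 25))
print("N50:", nx(lengths))
print("N90:", nx(lengths, 90))
```

Here the two longest contigs (80 + 70 = 150) are the first to reach half of 290, so the N50 is the length of the second one.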

Is there a best assembler?

"These analyses reveal that — at least in this competition — it is very hard to make an assembly that performs consistently when assessed by different metrics within a species, or when assessed by the same metrics in different species."

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species, Bradnam et al., GigaScience, 2013

"An assembler may produce an excellent assembly when judged by one approach, but a much poorer assembly when judged by another."

"Even when an assembler performs well across a range of metrics in one species, it is no guarantee that this assembler will work as well with a different genome."

Advice from Assemblathon 2 for large eukaryotic genomes

• Don’t trust the results of a single assembly. If possible, generate several assemblies (with different assemblers and/or different assembler parameters). Some of the best assemblies entered for Assemblathon 2 were the evaluation assemblies rather than the competition entries.

• Do not place too much faith in a single metric. It is unlikely that we would have considered SGA to have produced the highest ranked snake assembly if we had only considered a single metric.

• Potentially choose an assembler that excels in the area you are interested in (e.g., coverage, continuity, or number of error free bases).

• If you are interested in generating a genome assembly for the purpose of genic analysis (e.g., training a gene finder, studying codon usage bias, looking for intron-specific motifs), then it may not be necessary to be concerned by low N50/NG50 values or by a small assembly size.

• Assess the levels of heterozygosity in your target genome before you assemble (or sequence) it and set your expectations accordingly.

Computer Exercises

• Today:

• Sequence data from A.cyanobacterium generated using Illumina MiSeq

• Read quality evaluation using FastQC

• Assemblers

• Minimo: OLC assembler (from AMOS package)

• Velvet: de Bruijn graph based assembler

• SPAdes: recent de Bruijn graph based assembler

• Assembly evaluation

• Exercises: http://ekhidna.biocenter.helsinki.fi/downloads/teaching/spring2017/Exercises_day2.pdf

A Technological History of Genome Assembly

• 1982: Sanger and colleagues use shotgun method to sequence the bacteriophage λ (50Kbp genome, 200bp reads)

• 1990: Public Human genome project initiated. 100Kb clones shotgun sequenced (assumed upper limit) and assembled by many research groups

• 1995: Venter assembled 1.5-3Mbp bacterial genomes using pure shotgun approach

• 1996: ABI sequencers (500bp reads, 2-5% error, 10Kbp / day)

• 1997: Weber and Myers invent paired-end reads

• 1997: Capillary gels for Sanger sequencing introduced (700bp reads, 2% error, 1Mbp / day)

• 1998: Celera genomics founded

• 1999: Fruit fly genome (Celera) 140Mbp

• 2000: Human genome (public and Celera) 3Gbp + Mouse genome (Celera) 2.8Gbp

• 2000-present: Celera approach (shotgun + paired-end) is the dominant paradigm for genome assembly
