Practical Course in Genome Bioinformatics
Day 2 - Friday 27th January 2017
What is involved in a Genome Project?
• After experimental design, library construction and other wet lab preparations, from a bioinformatics perspective, a genome project involves:
• Sequencing and sequencing platforms
• De novo Assembly
• RNA-sequencing and Mapping
• Ab initio Gene Prediction
• Protein annotation
• Submission and publication of genome in database
• Further downstream analysis
Shortest Common Supersequence Problem
• Definition: Given two sequences X and Y, what is the shortest sequence Z such that X and Y are both subsequences of Z?
• X = "I am the very model"
• Y = "model of a modern Major-General"
• Z = "I am the very model of a modern Major-General"
• Looks deceptively easy, but in the general case (i.e. an arbitrary number of sequences) the problem is NP-complete
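For just two sequences the problem can be solved exactly with dynamic programming. The following is a minimal sketch, not part of the original slides; it treats each word as one symbol (an assumption made here so that the example above yields Z), and the function name is illustrative.

```python
def shortest_common_supersequence(x, y):
    """Return one shortest sequence containing x and y as subsequences."""
    n, m = len(x), len(y)
    # dp[i][j] = length of the shortest common supersequence of x[i:] and y[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n:
                dp[i][j] = m - j
            elif j == m:
                dp[i][j] = n - i
            elif x[i] == y[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = 1 + min(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the start, emitting one symbol per step.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if x[i] == y[j]:
            out.append(x[i])
            i += 1
            j += 1
        elif dp[i + 1][j] <= dp[i][j + 1]:
            out.append(x[i])
            i += 1
        else:
            out.append(y[j])
            j += 1
    out.extend(x[i:])
    out.extend(y[j:])
    return out


X = "I am the very model".split()
Y = "model of a modern Major-General".split()
print(" ".join(shortest_common_supersequence(X, Y)))
```

The table has (|X|+1) x (|Y|+1) entries, so two sequences are cheap; the difficulty noted above comes from extending this to an arbitrary number of sequences.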
Simple Example
• The first line of Charles Dickens' novel 'A Tale of Two Cities' is: "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness."
• Our sequences will be 5 words long each: "it was the best of", "was the best of times", etc.
• We have no knowledge of the original sentence, just the 5 word sequences
• build a graph of the sequences
• traverse the graph
• output the "best" path as an estimate of the original sentence
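The graph-building step can be sketched as follows. This is an illustrative sketch under stated assumptions: punctuation and case are ignored, each distinct 5-word sequence is a node, and an edge joins two nodes when the last 4 words of one equal the first 4 words of the other. The helper names are hypothetical.

```python
import re
from collections import defaultdict

SENTENCE = ("It was the best of times, it was the worst of times, "
            "it was the age of wisdom, it was the age of foolishness.")


def word_windows(text, size=5):
    """Return every run of `size` consecutive words (lowercased, no punctuation)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]


def build_overlap_graph(windows):
    """Map each distinct window to the windows that overlap it by size-1 words."""
    nodes = set(windows)
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:-1]].append(node)
    return {node: sorted(by_prefix.get(node[1:], [])) for node in nodes}


graph = build_overlap_graph(word_windows(SENTENCE))
# Nodes with more than one successor are where a traversal has to choose.
branching = sorted(" ".join(n) for n, succ in graph.items() if len(succ) > 1)
print(len(graph), "distinct sequences")
print("ambiguous nodes:", branching)
```

The repeated phrases in the sentence produce nodes with two possible successors, which is why the traversal can only output a "best" path as an estimate rather than a guaranteed reconstruction.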
Simple Example (2)
Figure adapted from Michael C. Schatz
Assembly Overview
• DNA isolated from target species
• Random fragmentation for shotgun sequencing
• Assemble paired-end reads into contigs (consensus sequence)
• Scaffold contigs based on long range information (mate pairs)
Sequencing reads
• Recall from last lecture: many different platforms and approaches to sequence DNA
• Simplified approach for assembly:
• Paired-end reads for assembly
• Mate-pair reads for scaffolding
• Long reads for gap filling
Figure adapted from "A field guide to whole-genome sequencing, assembly and annotation", Ekblom and Wolf, Evolutionary Applications, 2014
Contigs
• Contig: a contiguous, linear stretch of DNA/RNA consensus sequence
• Contigs terminate at:
• coverage gaps (more on this in a few slides!)
• conflicts, i.e. errors, repeat boundaries
Scaffolds
• We have a large number of contigs, but they are unordered; mate pairs are used to reconstruct contig order and orientation
• The final scaffold can still have gaps: we know order, orientation and approximate spacing. By default the gaps are filled with Ns; alternatively they can be closed using PacBio long reads
Coverage
• Why do we have gaps?
• Related to random fragmentation during shotgun sequencing
• 1x coverage:
[Figure: number of balls per bin (x-axis: Bin, 0 to 1000; y-axis: Number of balls, 0 to 20) and histogram of bin occupancy (x-axis: Balls in bin, 0 to 20; y-axis: Freq, 0 to 400)]
• 2x coverage:
[Figure: number of balls per bin (x-axis: Bin, 0 to 1000; y-axis: Number of balls, 0 to 20) and histogram of bin occupancy (x-axis: Balls in bin, 0 to 20; y-axis: Freq, 0 to 400)]
• 4x coverage:
[Figure: number of balls per bin (x-axis: Bin, 0 to 1000; y-axis: Number of balls, 0 to 20) and histogram of bin occupancy (x-axis: Balls in bin, 0 to 20; y-axis: Freq, 0 to 400)]
• 8x coverage:
[Figure: number of balls per bin (x-axis: Bin, 0 to 1000; y-axis: Number of balls, 0 to 20) and histogram of bin occupancy (x-axis: Balls in bin, 0 to 20; y-axis: Freq, 0 to 400)]
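The balls-in-bins picture used on the coverage slides can be reproduced with a short simulation. This is a sketch under assumptions: 1000 bins as on the figure axis, "Nx coverage" read as N balls per bin on average, and a fixed random seed chosen here for repeatability. The function name is illustrative.

```python
import random
from collections import Counter


def throw_balls(n_bins=1000, coverage=1, seed=0):
    """Drop n_bins * coverage balls into bins uniformly at random."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_bins * coverage):
        counts[rng.randrange(n_bins)] += 1
    return counts


for coverage in (1, 2, 4, 8):
    counts = throw_balls(coverage=coverage)
    histogram = Counter(counts)  # occupancy -> number of bins with that occupancy
    print(f"{coverage}x: {histogram[0]} of {len(counts)} bins empty, "
          f"max {max(counts)} balls in one bin")
```

Empty bins correspond to gaps: positions that random fragmentation happened not to sample. Their number falls quickly as coverage increases, but it does not reach zero at low coverage.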
Poisson distribution
• The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event
• Defined over the non-negative integers by a single parameter λ (the expected number of occurrences); for large λ it resembles a normal distribution
• Standard deviation is sqrt(λ)
Text and figure adapted from Wikipedia https://en.wikipedia.org/wiki/Poisson_distribution
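To connect the distribution to sequencing gaps, the probability of zero events can be computed directly. The sketch below assumes that read depth at a position is Poisson distributed with λ equal to the average coverage; that modelling step is an assumption made here, suggested by the preceding coverage slides.

```python
import math


def poisson_pmf(k, lam):
    """P(exactly k events) for a Poisson distribution with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


for coverage in (1, 2, 4, 8):
    uncovered = poisson_pmf(0, coverage)  # equals exp(-coverage)
    print(f"{coverage}x: P(depth = 0) = {uncovered:.4f}, "
          f"sd = {math.sqrt(coverage):.2f}")
```

Under this model the expected uncovered fraction is exp(-λ), so each doubling of coverage squares the fraction of positions left unsampled.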
Overlap-Layout-Consensus Assembly
• Overlap-layout-consensus (OLC) proceeds in three steps:
• Overlap: find approximate overlaps between all pairs of sequences (consistent with the error rate of the platform)
• Layout: use overlap information to find the layout or tiling of reads
• Consensus: produce consensus sequence of a given region
• Find the optimal path through a graph where each sequence is a node and edges are weighted by the length of overlap between sequences
• Early methods approximated the Hamiltonian path giving the shortest common supersequence with a greedy algorithm, which collapsed repetitive regions :-(
• Later ('90s) methods used a string graph and found the Eulerian tour such that the "arrival rate" (sequence start positions) is consistent with coverage
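The overlap step and a greedy layout can be sketched on toy reads. This is a simplified illustration, not how any particular assembler works: overlaps are exact rather than approximate, reads are assumed error-free and from one strand, and the read set and function names are made up for the example.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        # All-pairs comparison: this is the quadratic cost of the overlap step.
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap(a, b, min_length)
                    if olen > best_len:
                        best_len, best_i, best_j = olen, i, j
        if best_len == 0:
            break  # no remaining overlaps: the reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


toy_reads = ["ATGCGTAC", "GTACGTTA", "CGTTAGC"]
print(greedy_assemble(toy_reads))
```

Because the greedy rule always takes the longest overlap first, two copies of a repeat look like one strong overlap and get merged, which is the compression of repetitive regions mentioned above.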
Overlap-Layout-Consensus Assembly (2)
de Bruijn Graph Assembly
• OLC difficult to apply to NGS data
• Reads are shorter (is the overlap significant?)
• Many more reads (finding overlaps scales quadratically!)
• de Bruijn graph:
• Break reads into k-mers: constant length subsequences
• Build graph where nodes are k-mers and edges are k-1 length overlaps
• Clean the graph (remove tips and bubbles)
• Find the Eulerian tour
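The steps above can be sketched on toy reads. Note the assumptions: this sketch uses the common formulation in which each distinct k-mer is an edge between its (k-1)-mer prefix and suffix, since that is the form in which an Eulerian path visits every k-mer once. It skips graph cleaning, ignores reverse complements, and assumes error-free reads for which an Eulerian path exists. All names are illustrative.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Edges prefix -> suffix, one per distinct k-mer found in the reads."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    edges = defaultdict(list)
    for kmer in sorted(kmers):
        edges[kmer[:-1]].append(kmer[1:])
    return edges


def eulerian_path(edges):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    in_deg = defaultdict(int)
    for targets in edges.values():
        for t in targets:
            in_deg[t] += 1
    nodes = set(edges) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in sorted(nodes)
                  if len(edges.get(n, [])) - in_deg.get(n, 0) == 1), min(nodes))
    remaining = {n: list(v) for n, v in edges.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a path of overlapping (k-1)-mers back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


toy_reads = ["ATGCGTAC", "GTACGTTA", "CGTTAGC"]
print(spell(eulerian_path(de_bruijn_graph(toy_reads, k=5))))
```

No read-against-read comparison is needed here: every read is cut into k-mers in one pass, which is what makes the approach practical for very large numbers of short reads.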
de Bruijn Graph Assembly (2)
Figure adapted from Wikipedia
Domain specific problems
• Outside of the scope of this course, but the assembly problem is made harder by the following problems:
• read orientation (which DNA strand it came from) is unknown
• reads contain errors (sequencing errors)
• incomplete coverage (non-random insert sampling)
• paired-end data can be misleading in a tiny minority of cases
• repetition (genome specific)
• chimeric reads* (PCR errors)
• contamination* (student errors ;-P)
• excessively high error rates despite good quality scores*
Assembly Evaluation
• We want to be able to compare and assess the output of de novo assembly programs:
• Number of contigs
• Average, minimum, maximum contig lengths
• Sum of contig lengths (assembly size)
• N50
• Coverage
Assembly size
• Summation of the lengths of all contigs in basepairs
• Ideally should be close to the genome size of the studied organism
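The basic length statistics can be computed from a list of contig lengths. A minimal sketch; the contig lengths below are made-up values and the function name is illustrative.

```python
def contig_summary(lengths):
    """Basic assembly statistics from a list of contig lengths (in bp)."""
    if not lengths:
        raise ValueError("no contigs given")
    total = sum(lengths)
    return {
        "num_contigs": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": total / len(lengths),
        "assembly_size": total,  # sum of all contig lengths
    }


example_lengths = [80, 70, 50, 40, 30, 20]  # hypothetical contigs
print(contig_summary(example_lengths))
```

Comparing `assembly_size` with the expected genome size gives a first sanity check: a much smaller value suggests missing sequence, while a much larger one can point to redundancy or contamination.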
N50
• N50 is the length of the shortest contig such that contigs of that length or longer together make up at least 50% of the assembly size
• The same definition applies to other thresholds: N25, N90, etc.
• N50 ranking highly correlated with overall ranking of many evaluation metrics (Assemblathon 1 & 2 articles)
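The definition translates directly into a short computation: sort contigs from longest to shortest and accumulate lengths until the threshold is reached. A minimal sketch with made-up contig lengths; the function name `nx` is illustrative.

```python
def nx(lengths, x=50):
    """Length of the contig at which the running sum of the longest contigs
    first reaches x% of the assembly size (x=50 gives N50)."""
    if not lengths:
        raise ValueError("no contigs given")
    threshold = sum(lengths) * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


example_lengths = [80, 70, 50, 40, 30, 20]  # hypothetical contigs, total 290
print("N25:", nx(example_lengths, 25))
print("N50:", nx(example_lengths, 50))
print("N90:", nx(example_lengths, 90))
```

In this example half the assembly size is 145; the two longest contigs sum to 150, so N50 is the length of the second one, 70.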
Is there a best assembler?
"These analyses reveal that — at least in this competition — it is very hard to make an assembly that performs consistently when assessed by different metrics within a species, or when assessed by the same metrics in different species."
Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species, Bradnam et al., GigaScience, 2013
"An assembler may produce an excellent assembly when judged by one approach, but a much poorer assembly when judged by another."
"Even when an assembler performs well across a range of metrics in one species, it is no guarantee that this assembler will work as well with a different genome."
Advice from Assemblathon 2 for large eukaryotic genomes
• Don’t trust the results of a single assembly. If possible, generate several assemblies (with different assemblers and/or different assembler parameters). Some of the best assemblies entered for Assemblathon 2 were the evaluation assemblies rather than the competition entries.
• Do not place too much faith in a single metric. It is unlikely that we would have considered SGA to have produced the highest ranked snake assembly if we had only considered a single metric.
• Potentially choose an assembler that excels in the area you are interested in (e.g., coverage, continuity, or number of error free bases).
• If you are interested in generating a genome assembly for the purpose of genic analysis (e.g., training a gene finder, studying codon usage bias, looking for intron-specific motifs), then it may not be necessary to be concerned by low N50/NG50 values or by a small assembly size.
• Assess the levels of heterozygosity in your target genome before you assemble (or sequence) it and set your expectations accordingly.
Computer Exercises
• Today:
• Sequence data from A.cyanobacterium generated using Illumina MiSeq
• Read quality evaluation using FastQC
• Assemblers
• Minimo: OLC assembler (from AMOS package)
• Velvet: de Bruijn graph based assembler
• SPAdes: recent de Bruijn graph based assembler
• Assembly evaluation
• Exercises: http://ekhidna.biocenter.helsinki.fi/downloads/teaching/spring2017/Exercises_day2.pdf
A Technological History of Genome Assembly
• 1982: Sanger and colleagues use shotgun method to sequence the bacteriophage λ (50Kbp genome, 200bp reads)
• 1990: Public Human genome project initiated. 100Kb clones shotgun sequenced (assumed upper limit) and assembled by many research groups
• 1995: Venter assembled 1.5-3Mbp bacterial genomes using pure shotgun approach
• 1996: ABI sequencers (500bp reads, 2-5% error, 10Kbp / day)
• 1997: Weber and Myers invent paired-end reads
• 1997: Capillary gels for Sanger sequencing introduced (700bp reads, 2% error, 1Mbp / day)
• 1998: Celera genomics founded
• 1999: Fruit fly genome (Celera) 140Mbp
• 2000: Human genome (public and Celera) 3Gbp + Mouse genome (Celera) 2.8Gbp
• 2000-present: Celera approach (shotgun + paired-end) is the dominant paradigm for genome assembly