next-generation sequencing course, part 1: technologies
[I0D51A] Bioinformatics: High-Throughput Analysis
Next-generation sequencing. Part 1: Technologies
Prof Jan Aerts
Faculty of Engineering - ESAT/[email protected]
TA: Alejandro Sifrim ([email protected])
Announcements
May 27th (9am-noon): evaluation
open book
Note to self...
Upload s_1_sequence.txt and s_2_sequence.txt to Galaxy first...
Overview
• linux refresher (6/5)
• next-generation sequencing technologies and applications (6/5)
• sequence mapping (13/5)
• variant calling - SNPs (20/5)
• variant calling - structural variation (20/5)
Linux Refresher...
Next-generation sequencing technologies
General principle
Big data...
First vs second generation sequencing
Shendure & Ji, 2008
Sanger sequencing (1st gen) 2nd/next gen sequencing
Korbel et al, 2007
Paired-end sequencing
General approaches
• 2nd generation: clonally amplified single molecules
• Roche 454 pyrosequencing
• Illumina Genome Analyzer -> HiSeq: reversible terminator technology
• ABI SOLiD: ligation-based extension
• Next-next-generation/3rd generation: true single molecule
• Helicos: HeliScope
• Pacific Biosciences: SMRT
Mardis, 2011
Steps
template preparation
sequencing and imaging
data analysis
genome enrichment
A. Genome enrichment
Sequencing costs
What?
Only sequence relevant parts of the genome instead of whole genome, e.g.:
• specific Mb-scale regions known to be involved in particular disease (e.g. based on GWAS)
• specific candidate genes belonging to disease pathway
• exome (= all exons)
=> how to isolate these from non-target sequence? “pulldown”
Pulldown: on-array
Turner et al, 2009
Pulldown: in-solution
Turner et al, 2009
Performance metrics
• fold-enrichment: ratio of abundance of target sequences post-enrichment vs pre-enrichment
• capture specificity: fraction of sequence reads that map to target
• uniformity: relative abundance of individual targets after enrichment
• completeness: fraction of target bases detectably captured
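The four metrics above can be sketched as simple ratios. This is a minimal illustration, not a published pipeline: the function name, its arguments, and the example numbers are all assumptions, and fold-enrichment is approximated by assuming that before enrichment the targets are represented in proportion to their share of the genome.

```python
from statistics import mean


def enrichment_metrics(on_target_reads, total_reads, target_bp, genome_bp,
                       per_target_depth, per_base_depth, min_depth=1):
    """Compute capture metrics from summary counts (hypothetical inputs)."""
    # Capture specificity: fraction of sequence reads that map to target.
    specificity = on_target_reads / total_reads

    # Fold-enrichment: target abundance post-enrichment vs pre-enrichment.
    # Assumption: pre-enrichment abundance equals the target's genome share.
    pre_enrichment_fraction = target_bp / genome_bp
    fold_enrichment = specificity / pre_enrichment_fraction

    # Uniformity: abundance of each target relative to the mean over targets.
    mean_depth = mean(per_target_depth)
    uniformity = [depth / mean_depth for depth in per_target_depth]

    # Completeness: fraction of target bases covered at least min_depth times.
    covered = sum(1 for depth in per_base_depth if depth >= min_depth)
    completeness = covered / len(per_base_depth)

    return {
        "fold_enrichment": fold_enrichment,
        "specificity": specificity,
        "uniformity": uniformity,
        "completeness": completeness,
    }


# Made-up example: a 30 Mb target in a 3 Gb genome, 70% of reads on target.
metrics = enrichment_metrics(
    on_target_reads=700_000, total_reads=1_000_000,
    target_bp=30_000_000, genome_bp=3_000_000_000,
    per_target_depth=[10, 20, 30], per_base_depth=[0, 3, 5, 0, 8],
)
print(metrics)
```

Relative depths far from 1 in `uniformity` flag targets that were over- or under-captured.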
B. Template preparation
Problem: most imaging systems not designed to detect single fluorescent event => need amplified templates
Aim: to produce a representative, non-biased source of nucleic acid material from the genome under investigation => population of identical templates
Steps:
1. shear DNA
2. amplify templates
Options: emulsion PCR (emPCR) or solid phase amplification
Amplification by emulsion PCR
emulsion = mixture of two or more immiscible (unblendable) liquids; e.g. mayonnaise, vinaigrette
emPCR: thousands of microreactors/micro-eppendorfs
one bead + one DNA molecule per microreactor => PCR to 1000s of copies
Metzker, 2010
Williams et al, 2006
Solid-phase amplification
Metzker, 2010
http://bit.ly/6JYIUz
http://www.youtube.com/watch?v=77r5p8IBwJk&NR=1
C. Sequencing and imaging
Sequencing and imaging
Technologies:
1. cyclic reversible termination
2. sequencing by ligation
3. pyrosequencing
4. real-time sequencing
Cyclic reversible termination
DNA synthesis is terminated after adding single nucleotide
start/stop/start/stop/start/stop/...
Illumina: 4-colour
sequencing steps
Metzker, 2010
sequencing result
Helicos: 1-colour
Metzker, 2010
sequencing steps
Metzker, 2010
sequencing result
Sequencing by ligation
http://bit.ly/fPh22X
sequencing steps
Pyrosequencing
Metzker, 2010
Real-time sequencing
“ZMW” zero-mode waveguide
DNA polymerase
“strobe sequencing”
| Platform | Run time | Gb/run |
|---|---|---|
| Roche 454 | 8.5 hr | 0.45 |
| Illumina | 9 days | 35 |
| SOLiD | 14 days | 50 |
| Helicos | 8 days | 37 |
| PacBio | ? | ? |
Accuracy - base calling error
• base calling errors
Sanger > SOLiD > Illumina > 454 > Helicos
• base quality drops along read
(“dephasing” within clusters)
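Why dephasing lowers base quality along a read can be illustrated with a toy simulation. This is a deliberately simplified sketch, not any vendor's model: the function name, cluster size, and per-cycle failure probability `p_lag` are invented for illustration, and only lagging strands are modelled (strands that run ahead are ignored).

```python
import random


def simulate_cluster(n_strands=1000, n_cycles=100, p_lag=0.01, seed=0):
    """Return, per cycle, the fraction of strands still in phase.

    Assumption: in every cycle each strand independently fails to
    incorporate a base with probability p_lag and then stays one
    or more bases behind for the rest of the run.
    """
    rng = random.Random(seed)
    positions = [0] * n_strands          # bases incorporated per strand
    in_phase_fraction = []
    for cycle in range(1, n_cycles + 1):
        for i in range(n_strands):
            if rng.random() >= p_lag:    # successful incorporation
                positions[i] += 1
        # A strand is in phase if it has incorporated one base per cycle.
        in_phase = sum(1 for pos in positions if pos == cycle)
        in_phase_fraction.append(in_phase / n_strands)
    return in_phase_fraction


purity = simulate_cluster()
for cycle in (1, 25, 50, 100):
    print(f"cycle {cycle:3d}: {purity[cycle - 1]:.2f} of strands in phase")
```

The in-phase fraction can only fall, so the signal for the correct base gets progressively mixed with signal from neighbouring positions.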
Accuracy - homopolymer runs
Issue for Roche 454:
39% of errors are homopolymers
A5 motifs: 3.3% error rate
A8 motifs: 50% error rate
Reason: use signal intensity as a measure for homopolymer length
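A small sketch of why intensity-based length calling gets harder for long homopolymers. The functions and the 8% noise figure are assumptions for illustration only, not the 454 base caller: the point is that the relative difference between n and n+1 bases shrinks as 1/n, so the same proportional noise that is harmless for a single base can flip the call for a long run.

```python
def call_length(signal, unit=1.0):
    """Call homopolymer length as the nearest whole multiple of the
    single-base signal (hypothetical, idealised caller)."""
    return int(signal / unit + 0.5)


def relative_step(n):
    """Relative signal increase when going from n to n + 1 bases."""
    return 1.0 / n


noise = 1.08  # assumed 8% over-estimate of the light signal
for true_length in (1, 5, 8):
    observed = true_length * noise
    print(true_length, call_length(observed), f"{relative_step(true_length):.3f}")
```

With this made-up noise level the single base and the run of 5 are still called correctly, while the run of 8 is called as 9.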
Ronaghi, Genome Res 11:3-11 (2001)
http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg
Is it 4? Is it 5? Is it 4?
http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg
Consensus accuracy
Increase accuracy for SNP calling by increasing coverage:
Illumina: 20X
SOLiD: 12X
454: 7.4X
Sanger: 3X
Factors: raw accuracy + read length
How deep do you have to sequence? => Poisson distribution: “If you sequence at an average of 10X, how much of the genome will be covered at least 5X?”
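The Poisson question can be worked out directly. This sketch assumes the idealised model in which per-base depth follows a Poisson distribution with the average coverage as its mean; real coverage is usually more uneven, so treat the result as an upper bound. The function name is illustrative.

```python
import math


def fraction_covered_at_least(min_depth, mean_depth):
    """P(depth >= min_depth) when depth ~ Poisson(mean_depth)."""
    p_below = sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - p_below


# Average 10X coverage: fraction of the genome covered at least 5X.
print(round(fraction_covered_at_least(5, 10), 4))  # about 0.97
```

Under this model, about 97% of the genome is covered at least 5X at an average depth of 10X.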
Bentley et al, Nature 456:53-59 (2008)
FASTQ file format
“@” + identifier
sequence
“+” + identifier (optional)
phred-based quality scores
phred quality score encoding
Wikipedia
example fastq entries (n=2)
example fasta entries (n=2)
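The four-line record layout and the phred encoding can be sketched in a few lines. This is a minimal reader that assumes exactly four lines per record (no wrapped sequences) and a quality offset of 33 (Sanger encoding); the record below is made up, and the function names are illustrative.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality string) per 4-line record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:                       # end of file
            break
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"length mismatch in {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string: score = ASCII code - offset."""
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Phred definition: Q = -10 * log10(P), so P = 10^(-Q/10)."""
    return 10 ** (-score / 10)


example = io.StringIO("@read1\nACGT\n+\nII5!\n")   # hypothetical record
for identifier, sequence, quality in read_fastq(example):
    scores = phred_scores(quality)
    print(identifier, sequence, scores)
    print([error_probability(q) for q in scores])
```

A score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and 40 to 1 in 10,000.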
Sequence quality control
Is this good sequence? (essential!)
E.g.: using FastQC tool (Babraham Institute, UK; http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/)
Sequence quality control
per base sequence quality
good bad
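The idea behind a per-base quality plot can be sketched as a per-position summary over all reads. This is not FastQC's implementation, only an illustration: the function name is made up, the quality strings are invented, and an offset of 33 is assumed.

```python
from statistics import mean, median


def per_base_quality(quality_strings, offset=33):
    """Summarise quality scores by read position.

    Returns a list of (position, mean, median, minimum) tuples,
    with positions counted from 1. Reads may differ in length.
    """
    longest = max(len(q) for q in quality_strings)
    summary = []
    for pos in range(longest):
        scores = [ord(q[pos]) - offset for q in quality_strings if pos < len(q)]
        summary.append((pos + 1, mean(scores), median(scores), min(scores)))
    return summary


# Hypothetical quality strings from three reads.
for row in per_base_quality(["II5", "I5!", "II"]):
    print(row)
```

In a "good" run the per-position medians stay high along the read; in a "bad" run they fall off towards the 3' end.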
Sequence quality control
per sequence quality scores
good bad
Sequence quality control
per base sequence content
good bad
Sequence quality control
per base GC content
good bad
Sequence quality control
per sequence GC content
good bad
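A per-sequence GC distribution can be sketched as follows. Again this is an illustration rather than FastQC's method: the function names, the 10% bin width, and the example reads are assumptions, and uncalled bases (N) are ignored when computing the fraction.

```python
from collections import Counter


def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    sequence = sequence.upper()
    called = sum(sequence.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / called


def gc_histogram(sequences, bin_width=10):
    """Count reads per GC-percentage bin (bin label = lower edge)."""
    histogram = Counter()
    for sequence in sequences:
        percent = 100 * gc_fraction(sequence)
        # Put exactly 100% into the last bin instead of opening a new one.
        lower_edge = min(int(percent // bin_width) * bin_width, 100 - bin_width)
        histogram[lower_edge] += 1
    return dict(sorted(histogram.items()))


print(gc_histogram(["ATGC", "GGCC", "ATAT"]))   # hypothetical reads
```

A single smooth peak is what a clean library tends to show; extra peaks or shoulders can hint at contamination or bias.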
Sequence quality control
k-mer content
good bad
Intermezzo: Galaxy
Online genome analysis
http://galaxy.psu.edu/
“Galaxy allows you to do analyses you cannot do anywhere else without the need to install or download anything. You can analyze multiple alignments, compare genomic annotations, profile metagenomic samples and much much more...”
Applications of next-generation sequencing
Kahvejian et al, 2008
Kahvejian et al, 2008
DNA-seq
ChIP-seq
RNA-seq
identify sequence variations
identify pathogens
![Page 58: Next-generation sequencing course, part 1: technologies](https://reader036.vdocuments.us/reader036/viewer/2022081401/557d7a46d8b42a75548b4b70/html5/thumbnails/58.jpg)
Exercises
Try to log in to the server mentioned on Toledo with the username and password provided there.
There are 2 FASTQ files in /mnt/homes/jaerts/: s_1_sequence.txt and s_2_sequence.txt (= paired ends)
• How many sequences are in s_1_sequence.txt?
• What encoding was used for the quality score? Illumina? Sanger?
• What are the numerical quality scores for the first sequence in s_1_sequence.txt (i.e. 7172283/1)?
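One possible way to approach the first two questions in Python rather than with shell tools. This is a sketch under stated assumptions: the helper names are made up, records are assumed to span exactly four lines, and the encoding guess is only a heuristic based on the ASCII ranges of the common encodings (characters below 59 only occur with offset 33; characters above 74 only occur with offset 64). The sample lines are invented and are not taken from s_1_sequence.txt.

```python
def count_fastq_records(lines):
    """Number of records, assuming four non-empty lines per record."""
    n_lines = sum(1 for line in lines if line.strip())
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4")
    return n_lines // 4


def guess_quality_offset(quality_strings):
    """Return 33 (Sanger), 64 (older Illumina/Solexa) or None if unclear."""
    codes = [ord(char) for q in quality_strings for char in q]
    if min(codes) < 59:
        return 33
    if max(codes) > 74:
        return 64
    return None   # range is compatible with both encodings


# Hypothetical file content: two records.
sample = ["@r1/1", "ACGT", "+", "hhfB", "@r2/1", "TTGA", "+", "ggeB"]
qualities = sample[3::4]          # every fourth line, starting at line 4
print(count_fastq_records(sample), guess_quality_offset(qualities))
```

For a real file the same functions could be fed `open(path)` for counting and the quality lines of the first few thousand records for the guess.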
• Create an account on the Galaxy server
• Download s_1_sequence.txt and s_2_sequence.txt from Toledo and upload them into Galaxy. These files are also available on the Linux server.
• Have a look at the contents of s_1_sequence.txt.
• Convert quality scores to numeric values for s_1_sequence.txt (“FASTQ Groomer”)
• Draw the quality score boxplot for s_1_sequence.txt
• Draw the nucleotide distribution chart for s_1_sequence.txt
References
Bentley DR et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456: 53-59 (2008)
Kahvejian A, Quackenbush J & Thompson JF. What would you do if you could sequence everything? Nature Biotechnology 26: 1125-1133 (2008)
Korbel JO et al. Paired-end mapping reveals extensive structural variation in the human genome. Science 318: 420-426 (2007)
Mardis ER. A decade’s perspective on DNA sequencing technology. Nature 470: 198-203 (2011)
Metzker ML. Sequencing technologies - the next generation. Nature Reviews Genetics 11: 31-46 (2010)
Shendure J & Ji H. Next-generation DNA sequencing. Nature Biotechnology 26: 1135-1145 (2008)
Turner EH et al. Methods for genomic partitioning. Annual Review of Genomics and Human Genetics 10 (2009)