next-generation sequencing – the informatics angle
Gabor T. Marth, Boston College Biology Department. AGBT 2008, Marco Island, FL. February 6, 2008.
![Page 1: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/1.jpg)
Next-generation sequencing – the informatics angle
Gabor T. Marth
Boston College Biology Department
AGBT 2008
Marco Island, FL. February 6, 2008
![Page 2: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/2.jpg)
T1. Roche / 454 FLX system
• pyrosequencing technology
• variable read-length
• the only new technology with >100bp reads
• tested in many published applications
• supports paired-end read protocols with up to 10kb separation size
![Page 3: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/3.jpg)
T2. Illumina / Solexa Genome Analyzer
• fixed-length short-read sequencer
• read properties are very close to those of traditional capillary sequences
• very low INDEL error rate
• tested in many published applications
• paired-end read protocols support short (<600bp) separation
![Page 4: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/4.jpg)
T3. AB / SOLiD system
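Figure: 2-base encoding matrix (rows: 1st base A, C, G, T; columns: 2nd base A, C, G, T; each of the 16 cells holds one of the codes 0-3, and each code appears four times)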
• fixed-length short-read sequencer
• employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy
• requires color-space informatics
• published applications underway / in review
• paired-end read protocols support up to 10kb separation size
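The 2-base encoding idea can be sketched in a few lines. This is an illustration only: it assumes the commonly described color scheme in which each color is the XOR of 2-bit base codes (A=0, C=1, G=2, T=3), and the function names are hypothetical. It shows the property that matters for SNP calling: an isolated base change alters two adjacent colors, so a lone color mismatch is more likely a measurement error.

```python
# Sketch of 2-base (color-space) encoding. Assumption: the commonly described
# scheme where color = XOR of 2-bit base codes (A=0, C=1, G=2, T=3).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_colors(seq):
    """Return the first base plus one color per adjacent base pair."""
    colors = [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode_colors(first_base, colors):
    """Rebuild the base sequence by walking the colors from the first base."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


def color_differences(seq_a, seq_b):
    """Color positions at which two equal-length sequences differ."""
    _, colors_a = encode_colors(seq_a)
    _, colors_b = encode_colors(seq_b)
    return [i for i, (x, y) in enumerate(zip(colors_a, colors_b)) if x != y]


reference = "ACGTTGCA"
snp_read = "ACGATGCA"  # toy read with one base change (T -> A at index 3)
print(encode_colors(reference))
print(color_differences(reference, snp_read))  # two adjacent color positions
```

Decoding depends on every preceding color, which is one reason alignment is usually done directly in color space rather than after conversion to bases.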
![Page 5: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/5.jpg)
T4. Helicos / Heliscope system
• experimental short-read sequencer system
• single molecule sequencing
• no amplification
• variable read-length
• error rate reduced with 2-pass template sequencing
![Page 6: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/6.jpg)
A1. Variation discovery: SNPs and short-INDELs
1. sequence alignment
2. dealing with non-unique mapping
3. looking for allelic differences
![Page 7: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/7.jpg)
A2. Structural variation detection
• structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations
• copy number (for amplifications, deletions) from depth of read coverage
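As a rough illustration of the depth-of-coverage idea, the sketch below counts read starts in fixed windows and scales each window by the median depth. The function names, the window size, the toy data and the diploid baseline are all assumptions made for this example; real data would also need correction for mappability and representational biases.

```python
from statistics import median


def window_depths(read_starts, genome_length, window):
    """Count read start positions falling in each fixed-size window."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1
    return counts


def copy_number_estimates(counts, baseline_copies=2):
    """Scale each window's depth by the median depth (assumed baseline copy number)."""
    typical = median(counts)
    return [round(baseline_copies * c / typical) for c in counts]


# Toy data: five 100 bp windows holding 20, 20, 10, 20 and 30 evenly spaced reads.
read_starts = []
for index, n_reads in enumerate([20, 20, 10, 20, 30]):
    read_starts += [index * 100 + (i * 100) // n_reads for i in range(n_reads)]

counts = window_depths(read_starts, genome_length=500, window=100)
print(counts)
print(copy_number_estimates(counts))
```

In this toy case the third window looks like a one-copy deletion and the fifth like a one-copy gain relative to the assumed baseline of two copies.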
![Page 8: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/8.jpg)
A3. Identification of protein-bound DNA
Figure labels: genome sequence, aligned reads
• chromatin structure (CHIP-SEQ) (Mikkelsen et al. Nature 2007)
• transcription factor binding sites (Robertson et al. Nature Methods 2007)
![Page 9: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/9.jpg)
A4. Novel transcript discovery (genes)
• novel genes / exons (figure labels: Inferred exon 1, Inferred exon 2)
• novel transcripts in known genes (figure labels: Known exon 1, Known exon 2)
![Page 10: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/10.jpg)
A5. Novel transcript discovery (miRNAs)
Ruby et al. Cell, 2006
![Page 11: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/11.jpg)
A6. Expression profiling by tag counting
Figure labels: gene, aligned reads
Jones-Rhoades et al. PLoS Genetics, 2007
![Page 12: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/12.jpg)
A7. De novo organismal genome sequencing
Figure labels: assembled sequence contigs, short reads, longer reads, read pairs
Lander et al. Nature 2001
![Page 13: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/13.jpg)
C1. Read length
read length [bp]
• Roche / 454: ~250 (variable)
• Illumina / Solexa: 25-40 (fixed)
• AB / SOLiD: 25-35 (fixed)
• Helicos / Heliscope: 20-35 (variable)
![Page 14: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/14.jpg)
When does read length matter?
• short reads often sufficient where the entire read length can be used for mapping:
  • SNPs, short-INDELs, SVs
  • CHIP-SEQ
  • short RNA discovery
  • counting (mRNA, miRNA)
• longer reads are needed where one must use parts of reads for mapping:
  • de novo sequencing
  • novel transcript discovery
Figure: example reads aacttagacttacagacttacatacgta and accgattactatacta; labels Known exon 1, Known exon 2
![Page 15: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/15.jpg)
C2. Read error rate
• error rate dictates how many errors the aligner should tolerate
• error rate typically 0.4 - 1%
• the more errors the aligner must tolerate, the lower the fraction of the reads that can be uniquely aligned
Figure: fraction of genome (y-axis, 0.00 to 0.40) vs. number of mismatches allowed (x-axis, 0 to 2)
• for applications where, in addition, specific alleles are essential, error rate is even more important
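The link between error rate and the number of mismatches an aligner must tolerate can be made concrete with a binomial calculation. This is a simplified sketch: it assumes errors are independent and uniform along the read, and the 35 bp read length and the function name are assumptions chosen for illustration.

```python
from math import comb


def prob_at_most_k_errors(read_length, error_rate, k):
    """Binomial probability that a read carries k or fewer sequencing errors."""
    return sum(
        comb(read_length, i) * error_rate**i * (1 - error_rate) ** (read_length - i)
        for i in range(k + 1)
    )


READ_LENGTH = 35  # hypothetical value within the short-read range
for error_rate in (0.004, 0.01):
    for k in (0, 1, 2):
        p = prob_at_most_k_errors(READ_LENGTH, error_rate, k)
        print(f"error rate {error_rate:.1%}, up to {k} mismatches: "
              f"{p:.4f} of reads within the limit")
```

Raising the mismatch allowance recovers more error-containing reads, but it also makes more genomic locations match each read, which lowers the fraction that aligns uniquely.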
![Page 16: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/16.jpg)
C3. Error rate grows with each cycle
Figure: error rate (y-axis, 0% to 10%) vs. position on read (x-axis, 0 to 40)
• this phenomenon limits useful read length
![Page 17: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/17.jpg)
C4. Substitutions vs. INDEL errors
where substitution errors dominate:
• SNP discovery may require higher coverage for allele confirmation
• INDELs can be discovered with very high confidence!
where INDEL errors dominate:
• gapped alignment necessary
• good SNP discovery accuracy
• short-INDEL discovery difficult
![Page 18: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/18.jpg)
C5. Quality values are important for allele calling
• PHRED base quality values represent the estimated likelihood of sequencing error and help us pick out true alternate alleles
• inaccurate or not well calibrated base quality values hinder allele calling
Q-values should be accurate … and high!
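For reference, the standard PHRED relation is Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The sketch below converts between the two and, as a deliberately naive illustration rather than a real allele caller, multiplies the error probabilities of several reads supporting the same alternate allele, assuming their errors are independent. The function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Convert a PHRED quality value to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a PHRED quality value."""
    return -10 * math.log10(p)


def prob_all_errors(qualities):
    """Probability that every supporting base call is wrong (independence assumed)."""
    product = 1.0
    for q in qualities:
        product *= phred_to_error_prob(q)
    return product


print(phred_to_error_prob(20))    # Q20 corresponds to a 1% error probability
print(prob_all_errors([10, 10]))  # two low-quality observations
print(prob_all_errors([30, 30]))  # two high-quality observations
```

Two Q30 observations of an alternate allele are far harder to explain as sequencing errors than two Q10 observations, which is why accurate and high quality values both matter.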
![Page 19: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/19.jpg)
Quality values should be well-calibrated
assigned base quality value should be calibrated to represent the actual base quality value in every sequencing cycle
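One way to check calibration, sketched under stated assumptions, is to align reads to a known reference, tally mismatches for each combination of sequencing cycle and assigned quality value, and compare the empirical quality with the assigned one. The function name, the input layout and the add-one smoothing are assumptions made for this example.

```python
import math
from collections import defaultdict


def empirical_quality(observations):
    """Map (cycle, assigned_q) to the quality implied by the observed error rate.

    observations: iterable of (cycle, assigned_q, is_error) tuples.
    Add-one smoothing keeps the logarithm defined when no errors were seen.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for cycle, assigned_q, is_error in observations:
        totals[(cycle, assigned_q)] += 1
        errors[(cycle, assigned_q)] += int(is_error)
    result = {}
    for key in sorted(totals):
        rate = (errors[key] + 1) / (totals[key] + 2)
        result[key] = -10 * math.log10(rate)
    return result


# Toy data: Q20 bases are wrong 1% of the time in cycle 1 but 10% in cycle 30.
observations = [(1, 20, i < 10) for i in range(1000)]
observations += [(30, 20, i < 100) for i in range(1000)]
for (cycle, assigned_q), observed_q in empirical_quality(observations).items():
    print(f"cycle {cycle}: assigned Q{assigned_q}, empirical Q{observed_q:.1f}")
```

In the toy data the assigned Q20 is close to correct in cycle 1 and about ten units too optimistic in cycle 30, the kind of cycle-dependent drift that recalibration aims to remove.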
![Page 20: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/20.jpg)
C6. Representational biases / library complexity
• fragmentation biases
• amplification biases (PCR)
• sequencing biases (sequencing)
Figure labels: low/no representation, high representation
![Page 21: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/21.jpg)
Dispersal of read coverage
• this affects variation discovery (deeper starting read coverage is needed)
• its major impact is on counting applications
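One simple way to quantify dispersal, offered here as an illustrative sketch rather than the method used in the talk, is the variance-to-mean ratio of per-window read counts. Under ideal random (Poisson) sampling the ratio is about 1; values well above 1 indicate over-dispersed coverage. The function name and the toy depth vectors are assumptions.

```python
from statistics import mean, pvariance


def dispersion_index(window_depths):
    """Variance-to-mean ratio of per-window read counts (about 1 for Poisson sampling)."""
    return pvariance(window_depths) / mean(window_depths)


even = [9, 11, 10, 10, 12, 8, 10, 10]  # toy data: near-uniform coverage
uneven = [0, 2, 35, 1, 0, 28, 3, 11]   # toy data: same total, strong dispersal
print(dispersion_index(even), dispersion_index(uneven))
```

Both toy vectors have the same total number of reads, but in the uneven case several windows are nearly empty, so more sequencing would be needed to reach a usable depth everywhere.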
![Page 22: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/22.jpg)
Amplification errors
many reads from clonal copies of a single fragment
• early PCR errors in “clonal” read copies lead to false positive allele calls
early amplification error gets propagated onto every clonal copy
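A common heuristic for limiting the effect of clonal copies, shown here as a sketch and not necessarily the approach taken in the talk, is to keep one read per (chromosome, start, strand) combination so that a propagated amplification error counts as a single observation. The `Read` type, the function name and the toy reads are assumptions.

```python
from collections import namedtuple

Read = namedtuple("Read", "chrom start strand sequence")


def collapse_clonal_reads(reads):
    """Keep the first read seen for each (chrom, start, strand) combination."""
    representatives = {}
    for read in reads:
        representatives.setdefault((read.chrom, read.start, read.strand), read)
    return list(representatives.values())


# Toy data: three clonal copies share the same mismatch (A at position 105),
# while one independent read covering that position shows G.
reads = [
    Read("chr1", 100, "+", "ACGTTAGC"),
    Read("chr1", 100, "+", "ACGTTAGC"),
    Read("chr1", 100, "+", "ACGTTAGC"),
    Read("chr1", 103, "+", "TTGGCATG"),
]
print(len(reads), "reads before,", len(collapse_clonal_reads(reads)), "after collapsing")
```

Before collapsing, the mismatch has three supporting reads against one; afterwards it has one against one, which is a fairer reflection of the independent evidence.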
![Page 23: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/23.jpg)
C7. Paired-end reads
• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency
• circularization: 500bp - 10kb (sweet spot ~3kb)
• fragment length limited by library complexity
Korbel et al. Science 2007
• paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if a fragment length constraint is used to rescue non-uniquely mapping ends)
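The rescue idea can be sketched as follows: when one end maps uniquely and the other has several candidate positions, keep the candidate whose distance from the unique end fits the expected fragment length range. This is a simplified illustration; the function name is hypothetical, and it assumes both ends are on the same chromosome and ignores orientation and read length.

```python
def rescue_mate(anchor_pos, candidate_positions, min_fragment, max_fragment):
    """Return the single candidate consistent with the fragment length range, else None."""
    consistent = [
        pos for pos in candidate_positions
        if min_fragment <= abs(pos - anchor_pos) <= max_fragment
    ]
    return consistent[0] if len(consistent) == 1 else None


# Toy example with a 100 - 600 bp fragment length range.
print(rescue_mate(10_000, [2_500, 10_350, 880_000], 100, 600))  # one consistent candidate
print(rescue_mate(10_000, [10_200, 10_400], 100, 600))          # still ambiguous
```

A narrower fragment length range resolves more ambiguous ends, at the cost of discarding pairs that span genuine structural variation.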
![Page 24: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/24.jpg)
Paired-end reads for SV discovery
• longer fragments increase the chance of spanning SV breakpoints and/or entire events
• SV breakpoint detection sensitivity & resolution depend on the width of the fragment length distribution (most 2kb deletions would be detected at 10% std but missed at 30% std)
• longer fragments tend to have wider fragment length distributions
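The dependence on the width of the fragment length distribution can be illustrated with a back-of-the-envelope calculation. A deletion of size d makes a spanning pair map d further apart than expected, so its deviation in standard-deviation units is d divided by the standard deviation. The 3 kb mean fragment length and the 3-standard-deviation cutoff below are assumptions chosen for illustration, as are the function names; practical detection would rely on clusters of discordant pairs rather than one pair.

```python
def deletion_z_score(deletion_size, mean_fragment, std_fraction):
    """Deviation of a deletion-spanning pair's mapped distance, in standard deviations."""
    return deletion_size / (std_fraction * mean_fragment)


def is_detectable(deletion_size, mean_fragment, std_fraction, z_cutoff=3.0):
    """True when a single spanning pair deviates by at least z_cutoff (assumed cutoff)."""
    return deletion_z_score(deletion_size, mean_fragment, std_fraction) >= z_cutoff


MEAN_FRAGMENT = 3_000  # hypothetical mean fragment length in bp
for std_fraction in (0.10, 0.30):
    z = deletion_z_score(2_000, MEAN_FRAGMENT, std_fraction)
    verdict = "detected" if is_detectable(2_000, MEAN_FRAGMENT, std_fraction) else "missed"
    print(f"std {std_fraction:.0%}: z = {z:.2f}, {verdict}")
```

Under these assumptions a 2 kb deletion stands out clearly at a 10% standard deviation and falls below the cutoff at 30%.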
![Page 25: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/25.jpg)
C8. Technologies / properties / applications
Technology | Roche/454 | Illumina/Solexa | AB/SOLiD
Read properties
Read length | 250bp | 20-40bp | 25-35bp
Error rate | <0.5% | <1.0% | <0.5%
Dominant error type | INDEL | SUB | SUB
Paired-end reads available | yes | yes | yes
Paired-end separation | < 10kb (3kb optimal) | 100 - 600bp | 500bp - 10kb (3kb optimal)
Applications
SNP discovery ○ ● ●
short-INDEL discovery ● ○
SV discovery ○ ○ ●
CHIP-SEQ ○ ● ●
small RNA/gene discovery ○ ● ●
mRNA transcript discovery ● ○ ○
Expression profiling ○ ● ●
De novo sequencing ● ? ?
![Page 26: Next-generation sequencing – the informatics angle](https://reader036.vdocuments.us/reader036/viewer/2022062409/56814f77550346895dbd28b7/html5/thumbnails/26.jpg)
Thanks
http://bioinformatics.bc.edu/marthlab
Derek Barnett
Eric Tsung
Aaron Quinlan
Damien Croteau-Chonka
Weichun Huang
Michael Stromberg
Chip Stewart
Michele Busby
MOSAIK talk Thursday, 7:40PM
Michael Egholm
David Bentley
Francisco de la Vega
Kristen Stoops
Ed Thayer
Clive Brown
Elaine Mardis