National Center for Genome Analysis Support: http://ncgas.org
Carrie Ganote
Ram Podicheti
Le-Shin Wu
Tom Doak
Quality Control and Assessment of RNA-Seq Data
What do the data look like?
@SRR638895.6046 6046 length=76
GTGAAAGACTCTCGTAGCAAACGAAACGTCAAGTCGGTGAGGCCAACTCTTGTCGTAGCCGCGTCCATTGCGCCCT
+SRR638895.6046 6046 length=76
GDGACGFFGF7EDDAECBEDFEGFGECGEDGFGE:=BDD@FD59B67>:=9>:8>>;;<;=CD@9+=???######
FASTQ is a common format for storing next-generation sequencing data.
• Text based
• Stores both the sequence and quality information
• Originally developed at the Wellcome Trust Sanger Institute and later adopted by Solexa (Bennett, 2004)
• Information for each read comprises 4 lines
Bennett, S. (2004). Solexa Ltd. Pharmacogenomics, 5(4), 433-438. doi: 10.1517/14622416.5.4.433
FASTQ Format
@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+
@@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI
Line 1:
• Sequence Identifier
• Begins with an @ symbol
• Comprises:
  • Instrument Name
  • Flowcell Lane
  • Tile
  • X and Y coordinates of the Cluster on the Tile
  • Member of a Pair (1 or 2)
  • Index
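As an illustration of the identifier layout, here is a minimal Python sketch that splits the example identifier into named fields. The function name `parse_illumina_identifier` is hypothetical. The `run`, `filtered`, and `control` fields are not listed on the slide; they are assumed from the usual Illumina identifier layout, so treat them as assumptions.

```python
def parse_illumina_identifier(line):
    """Split an Illumina-style FASTQ identifier line into named fields."""
    # The part before the space locates the cluster; the part after describes the read.
    location, _, flags = line.lstrip("@").partition(" ")
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    pair_member, filtered, control, index = flags.split(":")
    return {
        "instrument": instrument,
        "run": run,            # assumed field, not listed on the slide
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "pair_member": int(pair_member),
        "filtered": filtered,  # assumed field, not listed on the slide
        "control": control,    # assumed field, not listed on the slide
        "index": index,
    }


fields = parse_illumina_identifier(
    "@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG"
)
print(fields["instrument"], fields["lane"], fields["tile"], fields["index"])
```

Identifiers written by other instruments or pipelines may not follow this layout, in which case the unpacking above raises a `ValueError`.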
Line 2: Read Sequence (A, G, T, C, N)
Line 3:
• ‘+’ character
• Can be followed by the same Sequence Identifier (from Line 1)
Line 4:
• Base Quality Scores (Phred+33) for the sequence in Line 2
• Must contain the same number of characters as the sequence
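The four-line layout can be checked with a few lines of Python. This is a minimal sketch, not a full parser: the function name `parse_fastq_record` is hypothetical, and it assumes each record occupies exactly four lines with no wrapped sequence lines.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into (identifier, sequence, quality)."""
    if len(lines) != 4:
        raise ValueError("a FASTQ record is expected to span exactly 4 lines")
    header, sequence, plus, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("line 1 must begin with '@'")
    if not plus.startswith("+"):
        raise ValueError("line 3 must begin with '+'")
    if len(sequence) != len(quality):
        raise ValueError("line 4 must hold one quality symbol per base")
    return header[1:], sequence, quality


record = [
    "@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG",
    "CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA",
    "+",
    "@@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI",
]
identifier, sequence, quality = parse_fastq_record(record)
print(identifier)
print(len(sequence), len(quality))
```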
Sequencers can assign a “confidence” value per call based on how ambiguous the base call is
Quality Scores
Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194. doi:10.1101/gr.8.3.186. PMID 9521922.
The sequencer will estimate the probability that a given base call is NOT correct (Ewing 1998)
[Figure: plot of Estimated Accuracy (y-axis, 0.89 to 0.99) against PHRED Score (x-axis, 10 to 40)]
| p | -10*log10(p) | Est. Accuracy = 1-p |
|---|---|---|
| 0.1 | 10 | 0.9 |
| 0.01 | 20 | 0.99 |
| 0.001 | 30 | 0.999 |
| 0.0001 | 40 | 0.9999 |
PHRED Score is defined as q = -10 × log10(p) (Ewing 1998), where p is the probability that the base call is not correct
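The relationship between error probability and PHRED score can be worked through in Python. This is a small sketch of the formula only; the function names are hypothetical.

```python
import math


def error_prob_to_phred(p):
    """PHRED score q for an error probability p: q = -10 * log10(p)."""
    return -10 * math.log10(p)


def phred_to_error_prob(q):
    """Inverse of the definition: p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


# Print score, error probability, and estimated accuracy (1 - p).
for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(q, p, 1 - p)
```

Each step of 10 in the score corresponds to a tenfold drop in the estimated error probability.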
Why not just have numbers?
Quality Score Encodings
@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGT…
+
3131303537373739…
Quality symbols to the rescue
• Letters are represented deep down in the computer as numbers
• The quality score plus a constant offset (usually 33 or 64) gives a number, which is converted to the quality symbol using ASCII
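The symbol encoding can be sketched in Python with the built-in `ord` and `chr`. The function names are hypothetical, and the default offset of 33 is an assumption; data encoded with an offset of 64 would need `offset=64`.

```python
def decode_qualities(symbols, offset=33):
    """Turn quality symbols into numeric scores: ASCII code minus the offset."""
    return [ord(ch) - offset for ch in symbols]


def encode_qualities(scores, offset=33):
    """Turn numeric scores into quality symbols: score plus the offset, as ASCII."""
    return "".join(chr(q + offset) for q in scores)


print(decode_qualities("@@?DFFFFH"))
print(encode_qualities([25, 36, 2]))
```

One symbol per base keeps the quality line exactly as long as the sequence line, which a string of multi-digit numbers would not.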
FastQC is an excellent program for visualizing the overall quality of all reads in a FASTQ file
Quality Scores
FastQC is developed by the Babraham Bioinformatics Group: http://www.bioinformatics.babraham.ac.uk
Tactics for increasing overall quality
We want to cut away the low quality bases!
Trimming Based on Quality
Wholesale cutting by base position
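Cutting by base position applies the same coordinates to every read, regardless of its quality values. A minimal sketch, with a hypothetical function name and cut positions chosen only for illustration:

```python
def crop_by_position(sequence, quality, start=0, end=None):
    """Keep bases in the half-open range [start, end), identically for every read."""
    return sequence[start:end], quality[start:end]


# Keep only the first 12 bases of an 18-base read.
print(crop_by_position("ACGAAAACGGTGAGGCCT", "::::::EEEEEE######", 0, 12))
```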
Start from ends of read and cut away until quality is above a specified threshold (usually 20)
@SRR638895.6046 6046 length=76
GTGAAAGACTCTCGTAGCAAACGAAACGTCAAGTCGGTGAGGCCAACTCTTGTCGTAGCCGCGTCCATTGCGCCCT
+SRR638895.6046 6046 length=76
######################EEEEEEEEEEEEEEEEEEEEEEEEEEE###########################
Quality values: # = 2, E = 36
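Trimming from both ends can be sketched as follows. The function name is hypothetical, and two choices are assumptions: scores are decoded with an offset of 33, and a base is kept once its score is at least the threshold.

```python
def trim_ends(sequence, quality, threshold=20, offset=33):
    """Cut low-quality bases from both ends; interior bases are never examined."""
    scores = [ord(ch) - offset for ch in quality]
    start, end = 0, len(scores)
    # Advance from the left while bases are below the threshold.
    while start < end and scores[start] < threshold:
        start += 1
    # Retreat from the right while bases are below the threshold.
    while end > start and scores[end - 1] < threshold:
        end -= 1
    return sequence[start:end], quality[start:end]


print(trim_ends("AACCGGTT", "##EEEE##"))
```

A read whose bases are all below the threshold comes back empty, so a minimum-length check normally follows this step.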
Start from one end and keep bases until they fall below a specified threshold
@SRR638895.6046 6046 length=76
GTGAAAGACTCTCGTAGCAAACGAAACGTCAAGTCGGTGAGGCCAACTCTTGTCGTAGCCGCGTCCATTGCGCCCT
+SRR638895.6046 6046 length=76
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE###########################
Quality values: E = 36, # = 2
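The one-directional variant keeps bases from the start of the read and stops at the first base below the threshold, even if later bases are good again. A minimal sketch with a hypothetical function name, assuming an offset of 33:

```python
def trim_at_first_low(sequence, quality, threshold=20, offset=33):
    """Keep the leading run of bases whose score is at least the threshold."""
    for i, ch in enumerate(quality):
        if ord(ch) - offset < threshold:
            # Everything from the first low-quality base onward is discarded.
            return sequence[:i], quality[:i]
    return sequence, quality


print(trim_at_first_low("ACGTACGT", "EEEEE#E#"))
```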
Sliding windows and minimum vs. average quality scores
ACGAAAACGGTGAGGCCT
::::::EEEEEE######
Quality values: : = 25, E = 36, # = 2
Window Size = 6, Step Size = 5
Target: Average below 20

| Window | Average | Min | Max |
|---|---|---|---|
| 1 | 25 | 25 | 25 |
| 2 | 34.2 | 25 | 36 |
| 3 | 13.3 | 2 | 36 |
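The window statistics for this example can be reproduced with a short Python sketch. The function names are hypothetical, and two choices are assumptions: scores use an offset of 33, and the read is cut at the start of the first window whose average falls below the target. Actual trimming programs may handle the cut point and partial windows differently.

```python
def window_stats(quality, window=6, step=5, offset=33):
    """Return (start, average, min, max) for each full window along the read."""
    scores = [ord(ch) - offset for ch in quality]
    stats = []
    for start in range(0, len(scores) - window + 1, step):
        chunk = scores[start:start + window]
        stats.append((start, sum(chunk) / window, min(chunk), max(chunk)))
    return stats


def sliding_window_trim(sequence, quality, window=6, step=5, threshold=20, offset=33):
    """Cut the read at the first window whose average is below the threshold."""
    for start, average, _, _ in window_stats(quality, window, step, offset):
        if average < threshold:
            return sequence[:start], quality[:start]
    return sequence, quality


for start, average, low, high in window_stats("::::::EEEEEE######"):
    print(start, round(average, 1), low, high)
print(sliding_window_trim("ACGAAAACGGTGAGGCCT", "::::::EEEEEE######"))
```

Using the average tolerates an isolated low-quality base inside a good window, whereas a minimum-based rule would cut as soon as any single base in the window is poor.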
Mate pairs, orphans and minimum sequence length
@Right Read
ACGAAAACGG
+
::::::EEEE
Right read too short to keep
@Left Read
GTGAAAGACTCTCGTAGCAAACGAAACGTCAAGTCGGTGAGGCCAACT
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
Left read survives trimming
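The pair handling can be sketched as a small decision function. The function name, the category labels, and the minimum length of 20 are all assumptions chosen for illustration; the slides do not state a cutoff.

```python
def classify_pair(left_seq, right_seq, min_length=20):
    """Decide what happens to a read pair after trimming."""
    left_ok = len(left_seq) >= min_length
    right_ok = len(right_seq) >= min_length
    if left_ok and right_ok:
        return "paired"        # both mates survive and stay together
    if left_ok:
        return "left_orphan"   # only the left mate survives
    if right_ok:
        return "right_orphan"  # only the right mate survives
    return "discard"           # neither mate is long enough


left = "GTGAAAGACTCTCGTAGCAAACGAAACGTCAAGTCGGTGAGGCCAACT"
right = "ACGAAAACGG"
print(classify_pair(left, right))
```

Orphans are usually written to a separate file so that the paired files keep their mates in matching order.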
Trimming Software
• Trimmomatic: Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
• Trim Galore!, developed by the Babraham Bioinformatics Group: http://www.bioinformatics.babraham.ac.uk
• FASTX Toolkit: http://hannonlab.cshl.edu/fastx_toolkit
• Galaxy Trimming tools:
  • Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86.
  • Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology. 2010 Jan; Chapter 19:Unit 19.10.1-21.
  • Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. "Galaxy: a platform for interactive large-scale genome analysis." Genome Research. 2005 Oct; 15(10):1451-5.
What’s a Kmer?
For a given sequence and a number, K, how many sub-sequences of length K are there?
Kmers
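The question above has a simple answer: a sequence of length L contains L - K + 1 sub-sequences of length K, counting repeats. A minimal Python sketch, with hypothetical function names:

```python
from collections import Counter


def kmers(sequence, k):
    """List every sub-sequence of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_kmers(sequence, k):
    """Count how many times each distinct k-mer occurs."""
    return Counter(kmers(sequence, k))


print(kmers("ACGAAAACGG", 3))
print(count_kmers("ACGAAAACGG", 3).most_common(2))
```

K-mers that occur far more often than expected are one way that over-represented sequences show up in a QC report.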
When fragments are shorter than the read length, adapters will be sequenced on both mates of a paired-end read.
For example, if we use technology that can sequence up to 100 bp:
Primers and Adapters
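Adapter read-through can be simulated in a few lines of Python. Everything here is an illustrative assumption: the function names, the short fragment, the adapter string, and the read lengths (much smaller than 100 bp so the output stays readable). The trimming step uses exact matching only, so it tolerates no sequencing errors, which dedicated adapter trimmers generally handle.

```python
def simulate_read(fragment, adapter, read_length):
    """The sequencer reads the fragment, then runs on into the adapter."""
    return (fragment + adapter)[:read_length]


def trim_adapter(read, adapter, min_overlap=3):
    """Remove an exact-match adapter, or a prefix of it, from the 3' end."""
    full = read.find(adapter)
    if full != -1:
        return read[:full]
    # Otherwise look for the longest adapter prefix that ends the read.
    for overlap in range(min(len(adapter), len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read


fragment = "ACGTTGCAAC"      # 10 bp insert (made up)
adapter = "AGATCGGAAGAGC"    # illustrative adapter string
for read_length in (8, 15, 30):
    read = simulate_read(fragment, adapter, read_length)
    print(read_length, read, trim_adapter(read, adapter))
```

With a read length of 8 the fragment is longer than the read and no adapter appears; at 15 and 30 the read runs into the adapter and trimming recovers the fragment.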
When to suspect this:
Patterns toward ends of reads
Software for removing adapters
• Cutadapt: Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), pp. 10-12. doi: 10.14806/ej.17.1.200
• FASTX-Toolkit http://hannonlab.cshl.edu/fastx_toolkit
• Scythe https://github.com/ucdavis-bioinformatics/scythe
Library Prep – retained and sequenced poly-As/poly-Ts
When to suspect this:
Poly-A Tails and Other Artifacts
PRINSEQ (Schmieder 2011) can deal with poly-Ts: it measures the percentage of each read that consists of T's and filters out reads above a cutoff
A conservative rule: if 60% of a read is T, kick it out.
Filter on % base, sequence complexity, duplicates
Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27:863-864. [PMID: 21278185]
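The percentage-based filter can be sketched in Python. This is not how PRINSEQ is implemented; the function names are hypothetical, and treating exactly 60% as a failure is an assumption based on the rule of thumb above.

```python
def base_fraction(sequence, base):
    """Fraction of the read made up of one base (0.0 for an empty read)."""
    if not sequence:
        return 0.0
    return sequence.upper().count(base) / len(sequence)


def passes_base_filter(sequence, base="T", max_fraction=0.6):
    """Keep the read only if the base makes up less than max_fraction of it."""
    return base_fraction(sequence, base) < max_fraction


for read in ("TTTTTTTTAC", "ACGTACGTAC"):
    print(read, base_fraction(read, "T"), passes_base_filter(read))
```

The same function covers retained poly-A stretches by passing `base="A"`.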
Conservative QC vs Aggressive QC - factors
How much sequence one can afford to cut out depends on the following:
• Coverage: If your sample was sequenced at very low coverage, you may not want to cut aggressively
• Sequence length: You can afford to cut 20 bp out of a 150 bp read, but not out of a 30 bp read
• Goals: Depending on your end goal, cut more or less aggressively
References
• Bennett, S. (2004). Solexa Ltd. Pharmacogenomics, 5(4), 433-438. doi: 10.1517/14622416.5.4.433
• Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology. 2010 Jan; Chapter 19:Unit 19.10.1-21.
• Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
• Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194. doi:10.1101/gr.8.3.186. PMID 9521922.
• Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. "Galaxy: a platform for interactive large-scale genome analysis." Genome Research. 2005 Oct; 15(10):1451-5.
• Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86.
• Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), pp. 10-12. doi: 10.14806/ej.17.1.200
• Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27:863-864. [PMID: 21278185]
Fin
Thanks for watching!
Questions and comments:
Email [email protected]