National Center for Genome Analysis Support: http://ncgas.org Carrie Ganote Ram Podicheti Le-Shin Wu Tom Doak Quality Control and Assessment of RNA-Seq Data

Upload: maude-fleming

Post on 26-Dec-2015


National Center for Genome Analysis Support: http://ncgas.org

Carrie Ganote

Ram Podicheti

Le-Shin Wu

Tom Doak

Quality Control and Assessment of RNA-Seq Data


What do the data look like?

@SRR638895.6046 6046 length=76
GTGAAAGACTCTCGTAGCAAACGAAACGTCAAGTCGGTGAGGCCAACTCTTGTCGTAGCCGCGTCCATTGCGCCCT
+SRR638895.6046 6046 length=76
GDGACGFFGF7EDDAECBEDFEGFGECGEDGFGE:=BDD@FD59B67>:=9>:8>>;;<;=CD@9+=???######

FASTQ is a common format for storing Next Gen Sequencing data.

• Text based
• Stores both the sequence and quality information
• Originally developed at the Wellcome Trust Sanger Institute and later adopted by Solexa (Bennett, 2004)
• Information for each read comprises 4 lines

Bennett, S. (2004). Solexa Ltd. Pharmacogenomics, 5(4), 433-438. doi: 10.1517/14622416.5.4.433


FASTQ Format

@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+
@@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI

• Line 1: Sequence Identifier
• Begins with a @ symbol
• Comprises
    • Instrument Name
    • Flowcell Lane
    • Tile
    • X and Y coordinates of the Cluster on the Tile
    • Member of a Pair (1 or 2)
    • Index
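The identifier fields listed on this slide can be pulled apart with a few string splits. The following is a minimal sketch, not a reference parser: it assumes the colon-separated layout seen in the example record, and the field names `run`, `flowcell`, `filtered` and `control` are assumptions, since the slide does not name those positions.

```python
from typing import NamedTuple


class ReadId(NamedTuple):
    instrument: str
    run: str          # assumed name; not listed on the slide
    flowcell: str     # assumed name; not listed on the slide
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int  # 1 or 2
    filtered: str     # assumed name; not listed on the slide
    control: str      # assumed name; not listed on the slide
    index: str


def parse_identifier(line: str) -> ReadId:
    """Split a sequence identifier line into its fields."""
    if not line.startswith("@"):
        raise ValueError("identifier line must begin with '@'")
    # The part before the space locates the cluster; the part after describes the read.
    location, flags = line[1:].split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    pair_member, filtered, control, index = flags.split(":")
    return ReadId(instrument, run, flowcell, int(lane), int(tile),
                  int(x), int(y), int(pair_member), filtered, control, index)


read_id = parse_identifier("@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG")
print(read_id.instrument, read_id.lane, read_id.tile, read_id.index)
```

Identifiers from other sources (for example the `@SRR638895.6046 6046 length=76` record shown earlier) do not follow this layout and would need their own handling.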


FASTQ Format

@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+
@@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI

• Line 2: Read Sequence (A, G, T, C, N)


FASTQ Format

@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+
@@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI

• Line 3: ‘+’ character
• Can be followed by the same Sequence Identifier (from Line 1)


FASTQ Format

@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA
+
@@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI

• Line 4: Base Quality Scores (Phred33) for the sequence in Line 2
• Must contain the same number of characters as those in the sequence
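The four-line layout described on these slides can be read with a short generator. This is a sketch under the assumption that every record occupies exactly four lines with no line wrapping; the function name `read_fastq` is made up for illustration and is not from any particular library.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input (or a trailing blank line)
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError("line 1 must begin with '@'")
        if not plus.startswith("+"):
            raise ValueError("line 3 must begin with '+'")
        if len(seq) != len(qual):
            raise ValueError("quality line must be as long as the sequence")
        yield header[1:], seq, qual


EXAMPLE = (
    "@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG\n"
    "CGTTCAGTCATAATCCAGCGCACGGTAGCTTCGCGCCACTGGCTTTTCAA\n"
    "+\n"
    "@@?DFFFFHGHHHIJJJJIIJJJJIHGHIEIIIFIEI>BHIJIIJIJEGI\n"
)

for name, seq, qual in read_fastq(io.StringIO(EXAMPLE)):
    print(name, len(seq), len(qual))
```

Note that a quality line may itself begin with `@`, as it does here, which is why the reader counts lines instead of searching for `@` to find the start of a record.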


Quality Scores

Sequencers can assign a “confidence” value per call based on how ambiguous the base call is

The sequencer will estimate the probability that a given base call is NOT correct (Ewing 1998)

Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194. doi:10.1101/gr.8.3.186. PMID 9521922.


Quality Scores

PHRED Score is defined as q = -10 x log10(P) (Ewing 1998)

P = probability call is not correct

P        q = -10*log10(P)    Est. Accuracy = 1-P
0.1      10                  0.9
0.01     20                  0.99
0.001    30                  0.999
0.0001   40                  0.9999

[Figure: "PHRED Scores Estimated Accuracy"; x-axis: PHRED Score (10 to 40), y-axis: % Estimated Accuracy (0.89 to 0.99)]

Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194. doi:10.1101/gr.8.3.186. PMID 9521922.
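The relationship between the error probability and the PHRED score can be checked numerically. The sketch below simply applies the definition q = -10 x log10(P) and its inverse; the function names are chosen here for illustration.

```python
import math


def phred_from_error(p: float) -> float:
    """PHRED score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def error_from_phred(q: float) -> float:
    """Error probability implied by a PHRED score q."""
    return 10 ** (-q / 10)


def accuracy_from_phred(q: float) -> float:
    """Estimated accuracy, 1 - P, for a PHRED score q."""
    return 1 - error_from_phred(q)


for q in (10, 20, 30, 40):
    print(q, error_from_phred(q), accuracy_from_phred(q))
```

Each step of 10 in the score corresponds to a tenfold drop in the error probability, which is why a threshold of 20 (1 error in 100) is a common cutoff.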


Quality Score Encodings

Why not just have numbers?

@CCRI0219:135:D243EACXX:1:1101:1682:1955 1:N:0:ACAGTG
CGTTCAGT…
+
3131303537373739…

Quality symbols to the rescue


Quality Score Encodings

• Letters are represented deep down in the computer as numbers
• The quality score plus a constant offset (usually 33 or 64) gives a number, and the ASCII character with that number is the quality symbol
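Converting between quality symbols and scores is a one-line lookup in most languages. A minimal sketch, assuming an offset of 33 unless stated otherwise (the helper names are hypothetical):

```python
def decode_quality(symbols: str, offset: int = 33) -> list:
    """Turn a string of quality symbols into a list of PHRED scores."""
    return [ord(ch) - offset for ch in symbols]


def encode_quality(scores, offset: int = 33) -> str:
    """Turn PHRED scores back into quality symbols."""
    return "".join(chr(score + offset) for score in scores)


# First nine symbols of the quality line in the example record.
print(decode_quality("@@?DFFFFH"))
# The three symbols used in the trimming examples.
print(decode_quality("#:E"))
```

With one symbol per base, the quality line always has the same length as the sequence line, which would not hold if the scores were written as multi-digit numbers.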


ASCII Table


Quality Scores

FastQC is an excellent program for visualizing the overall quality of all reads in a fastq file

FastQC is developed by the Babraham Bioinformatics Group: http://www.bioinformatics.babraham.ac.uk


Tactics for increasing overall quality

We want to cut away the low quality bases!

Trimming Based on Quality


Wholesale cutting by base position

Trimming Based on Quality


Trimming Based on Quality

Start from the ends of the read and cut away until quality is above a specified threshold (usually 20)

@SRR638895.6046 6046 length=76
GTGAAAGACTCTCGTAGCAAACGAAACGTCAAGTCGGTGAGGCCAACTCTTGTCGTAGCCGCGTCCATTGCGCCCT
+SRR638895.6046 6046 length=76
######################EEEEEEEEEEEEEEEEEEEEEEEEEEE###########################

Quality values: "E" = 36, "#" = 2
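Trimming inward from both ends can be sketched as two loops that stop at the first base meeting the threshold. This is an illustration only, assuming Phred+33 symbols; `trim_ends` is a made-up name, and real trimmers offer more options.

```python
def trim_ends(seq: str, qual: str, threshold: int = 20, offset: int = 33):
    """Cut low-quality bases from both ends until a base at or above the threshold is reached."""
    scores = [ord(ch) - offset for ch in qual]
    start, end = 0, len(scores)
    while start < end and scores[start] < threshold:
        start += 1
    while end > start and scores[end - 1] < threshold:
        end -= 1
    return seq[start:end], qual[start:end]


print(trim_ends("ACGTACGT", "##EEEE##"))
```

A read whose bases are all below the threshold comes back empty, so a minimum-length filter is usually applied afterwards.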


Trimming Based on Quality

Start from one end and keep bases until they fall below a specified threshold

@SRR638895.6046 6046 length=76
GTGAAAGACTCTCGTAGCAAACGAAACGTCAAGTCGGTGAGGCCAACTCTTGTCGTAGCCGCGTCCATTGCGCCCT
+SRR638895.6046 6046 length=76
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE###########################

Quality values: "E" = 36, "#" = 2
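Keeping bases from one end until the quality drops can be sketched as a single scan. The example assumes Phred+33 symbols and scans from the start of the read; `keep_until_low` is a hypothetical name.

```python
def keep_until_low(seq: str, qual: str, threshold: int = 20, offset: int = 33):
    """Keep bases from the start of the read up to the first base below the threshold."""
    for i, ch in enumerate(qual):
        if ord(ch) - offset < threshold:
            return seq[:i], qual[:i]
    return seq, qual


print(keep_until_low("ACGTACGT", "EEE#EEEE"))
```

Unlike cutting inward from the ends, this approach discards every base after the first low-quality one, even if the quality recovers later in the read.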


Trimming Based on Quality

Sliding windows and minimum vs. average quality scores

ACGAAAACGGTGAGGCCT
::::::EEEEEE######

Quality values: ":" = 25, "E" = 36, "#" = 2

Window Size = 6
Step Size = 5
Target: Average below 20

Window             Average   Min   Max
1 (bases 1-6)      25        25    25
2 (bases 6-11)     34.2      25    36
3 (bases 11-16)    13.3      2     36
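The window statistics on these slides can be reproduced in a few lines. The sketch below assumes Phred+33 symbols and cuts the read at the start of the first window whose average falls below the target; the function names are illustrative, and actual trimming tools may choose the cut point differently.

```python
def window_stats(qual: str, window: int = 6, step: int = 5, offset: int = 33):
    """Return (start, average, minimum, maximum) for each window along the quality string."""
    scores = [ord(ch) - offset for ch in qual]
    stats = []
    for start in range(0, len(scores) - window + 1, step):
        values = scores[start:start + window]
        stats.append((start, sum(values) / window, min(values), max(values)))
    return stats


def sliding_window_trim(seq: str, qual: str, window: int = 6, step: int = 5,
                        min_average: float = 20, offset: int = 33):
    """Cut the read at the start of the first window whose average is below min_average."""
    for start, average, _, _ in window_stats(qual, window, step, offset):
        if average < min_average:
            return seq[:start], qual[:start]
    return seq, qual


SEQ = "ACGAAAACGGTGAGGCCT"
QUAL = "::::::EEEEEE######"
for start, average, low, high in window_stats(QUAL):
    print(start + 1, round(average, 1), low, high)
print(sliding_window_trim(SEQ, QUAL))
```

Using the average tolerates an isolated poor base inside an otherwise good window; using the window minimum instead would cut as soon as any single base drops below the target.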


Trimming Based on Quality

Mate pairs, orphans and minimum sequence length

@Right Read
ACGAAAACGG
+
::::::EEEE

Right read too short to keep

@Left Read
GTGAAAGACTCTCGTAGCAAACGAAACGTCAAGTCGGTGAGGCCAACT
+
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

Left read survives trimming
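After trimming, each pair has to be sorted into "both mates kept" or "one mate orphaned". The following sketch assumes reads are `(name, sequence, quality)` tuples and uses a minimum length of 20, which is an arbitrary value chosen for the example and not one given in the slides.

```python
def filter_pairs(pairs, min_len: int = 20):
    """Split trimmed read pairs into intact pairs and orphaned single reads."""
    kept_pairs, orphans = [], []
    for left, right in pairs:
        left_ok = len(left[1]) >= min_len
        right_ok = len(right[1]) >= min_len
        if left_ok and right_ok:
            kept_pairs.append((left, right))
        elif left_ok:
            orphans.append(left)    # right mate was too short
        elif right_ok:
            orphans.append(right)   # left mate was too short
        # if neither mate is long enough, the whole pair is dropped
    return kept_pairs, orphans


left = ("Left Read", "GTGAAAGACTCTCGTAGCAAACGAAACGTCAAGTCGGTGAGGCCAACT", "E" * 48)
right = ("Right Read", "ACGAAAACGG", "::::::EEEE")
kept, orphans = filter_pairs([(left, right)])
print(len(kept), [name for name, _, _ in orphans])
```

Orphans are usually written to a separate file so that the paired files stay in step with each other.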


Trimming Software

• Trimmomatic: Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.

• Trim Galore!, developed by the Babraham Bioinformatics Group: http://www.bioinformatics.babraham.ac.uk

• FASTX Toolkit: http://hannonlab.cshl.edu/fastx_toolkit

• Galaxy Trimming tools

• Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86.

• Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology. 2010 Jan; Chapter 19:Unit 19.10.1-21.

• Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. "Galaxy: a platform for interactive large-scale genome analysis." Genome Research. 2005 Oct; 15(10):1451-5.


What’s a Kmer?

For a given sequence and a number, K, how many sub-sequences of length K are there?

Kmers


Kmers

Why?

K = 5
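Counting the sub-sequences of length K in a read takes only a sliding slice. A minimal sketch with K = 5, using an invented ten-base sequence:

```python
from collections import Counter


def count_kmers(seq: str, k: int = 5) -> Counter:
    """Count every overlapping sub-sequence of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = count_kmers("ACGTACGTAC", k=5)
print(sum(counts.values()), "kmers,", len(counts), "distinct")
print(counts.most_common(2))
```

A sequence of length L contains L - K + 1 overlapping kmers, so this ten-base example yields six, of which four are distinct. Kmers that occur far more often than expected across many reads can point to adapters or other artifacts.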


Primers and Adapters

When fragments are shorter than the read length, adapters will be sequenced on both mates of a paired-end read.

For example, if we use technology that can sequence up to 100 bp:
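Adapter read-through can be illustrated with a toy example: when the fragment is shorter than the read, the remaining cycles run into the adapter. The sketch below uses exact matching only, and the adapter sequence is a made-up placeholder, not a real Illumina adapter; dedicated tools also tolerate mismatches.

```python
ADAPTER = "GATTACAGATTACA"  # hypothetical adapter sequence, for illustration only


def trim_adapter(read: str, adapter: str = ADAPTER, min_overlap: int = 3) -> str:
    """Remove an exact adapter match, or a partial adapter at the 3' end of the read."""
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]  # full adapter found inside the read
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]  # adapter prefix at the end of the read
    return read


fragment = "ACGTACGTAC"  # 10 bp fragment
read_length = 18         # the sequencer reads 18 bases in this toy example
read = (fragment + ADAPTER)[:read_length]
print(read, "->", trim_adapter(read))
```

Here 18 - 10 = 8 bases of the read come from the adapter. In the 100 bp case from the slide, an 80 bp fragment would leave 20 adapter bases at the end of each mate.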


When to suspect this:

Patterns toward ends of reads

Primers and Adapters


Primers and Adapters

Software for removing adapters

• Cutadapt: Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. 2011, 17(1). doi: 10.14806/ej.17.1.200 pp. 10-12

• FASTX-Toolkit: http://hannonlab.cshl.edu/fastx_toolkit

• Scythe: https://github.com/ucdavis-bioinformatics/scythe


Library Prep – retained and sequenced poly-As/poly-Ts

When to suspect this:

Poly-A Tails and Other Artifacts


Poly-A Tails and Other Artifacts

PRINSEQ (Schmieder 2011) can be used to remove poly-T reads: it takes a threshold % of T’s and sorts out the reads that reach it

Conservatively, 60% of a read is T? Kick it out.

Filter on % base, sequence complexity, duplicates
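The "60% of a read is T" rule can be written as a simple fraction test. This is a sketch of the idea only and does not reproduce how PRINSEQ implements its filters; the function names and example reads are invented.

```python
def base_fraction(seq: str, base: str = "T") -> float:
    """Fraction of the read made up of the given base."""
    if not seq:
        return 0.0
    return seq.upper().count(base) / len(seq)


def filter_by_base_fraction(reads, base: str = "T", max_fraction: float = 0.6):
    """Keep only reads in which the base makes up less than max_fraction of the read."""
    return [read for read in reads if base_fraction(read, base) < max_fraction]


reads = ["TTTTTTTACG", "ACGTACGTAC", "TTTTTTTTTT"]
print(filter_by_base_fraction(reads))
```

The same function with `base="A"` would catch retained poly-A tails.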

Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27:863-864. [PMID: 21278185]


Conservative QC vs Aggressive QC - factors

How much sequence one can afford to cut out depends on the following:
• Coverage: If your sequence was run with very low coverage, you may not want to cut aggressively
• Sequence length: You can afford to cut 20bp out of a 150bp read, but not out of a 30bp read
• Goals: Depending on your end goal, cut more or less aggressively


References

• Bennett, S. (2004). Solexa Ltd. Pharmacogenomics, 5(4), 433-438. doi: 10.1517/14622416.5.4.433

• Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, Nekrutenko A, Taylor J. "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology. 2010 Jan; Chapter 19:Unit 19.10.1-21.

• Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.

• Ewing B, Green P (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities". Genome Res. 8 (3): 186–194. doi:10.1101/gr.8.3.186. PMID 9521922.

• Giardine B, Riemer C, Hardison RC, Burhans R, Elnitski L, Shah P, Zhang Y, Blankenberg D, Albert I, Taylor J, Miller W, Kent WJ, Nekrutenko A. "Galaxy: a platform for interactive large-scale genome analysis." Genome Research. 2005 Oct; 15(10):1451-5.

• Goecks, J, Nekrutenko, A, Taylor, J and The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010 Aug 25;11(8):R86.

• Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. 2011, 17(1). doi: 10.14806/ej.17.1.200 pp. 10-12

• Schmieder R and Edwards R: Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011, 27:863-864. [PMID: 21278185]


Fin

Thanks for watching!

Questions and comments:

Email [email protected]