RNA-seq workshop: COUNTING & HTSEQ
Erin Osborne Nishimura
Today’s simple analysis pipeline
• .fastq file → trimmomatic/bbduk.sh → _trim.fastq file → TOPHAT2 → .bam/.sam file
• .bam/.sam file → bedtools genomecov → .bg file → bedGraphToBigWig → .bw file → IGV/UCSC → Pretty browser shots
• .bam/.sam file → HTseq → counts.txt file → DESeq2/R → Differentially abundant genes
Quantification with htseq
The problem
Counting reads
• What will we count?
– Genes?
– Exons?
– Isoforms?
• What are some of the issues we need to account for when counting reads?
– Paralogs?
– Overlap?
– Isoforms?
– Errors?
• How to count?
– Raw counts
– RPKM: Reads per kilobase per million mapped reads
– FPKM: Fragments per kilobase per million mapped reads
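The RPKM definition above can be sketched as a small calculation. This is a minimal illustration, not part of the workshop materials; the function name and the example numbers are hypothetical. FPKM follows the same arithmetic with fragments (read pairs) in place of reads.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    kilobases = gene_length_bp / 1_000            # gene length in kb
    millions = total_mapped_reads / 1_000_000     # library size in millions of reads
    return read_count / kilobases / millions


# Hypothetical numbers for illustration only:
# 500 reads on a 2 kb gene in a library of 10 million mapped reads.
print(rpkm(read_count=500, gene_length_bp=2_000, total_mapped_reads=10_000_000))
```

Raw counts, not RPKM/FPKM values, are what gets passed on to DESeq2 in the pipeline shown earlier.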
htseq-count
• Manual
– http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
• Paper
– http://bioinformatics.oxfordjournals.org/content/31/2/166
The problem
The three htseq-count modes
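The three modes described in the htseq-count manual are union, intersection-strict, and intersection-nonempty. The sketch below is a simplified illustration of how each mode resolves a read, assuming we already know which features overlap each position of the read. The function name and the toy inputs are hypothetical, and this is not the HTSeq implementation, which also handles quality filters, strandedness, and multi-mapped reads.

```python
def assign_read(position_feature_sets, mode="union"):
    """Return a feature name, '__no_feature', or '__ambiguous'.

    position_feature_sets: one set of overlapping feature names
    for each position covered by the read.
    """
    sets = [set(s) for s in position_feature_sets]
    if mode == "union":
        # every feature touched anywhere along the read
        result = set().union(*sets)
    elif mode == "intersection-strict":
        # features that cover every position of the read
        result = set.intersection(*sets) if sets else set()
    elif mode == "intersection-nonempty":
        # same as strict, but positions with no feature are ignored
        nonempty = [s for s in sets if s]
        result = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not result:
        return "__no_feature"
    if len(result) > 1:
        return "__ambiguous"
    return next(iter(result))


# Toy read that hangs off the end of gene_A (last position has no feature)
overhang = [{"gene_A"}, {"gene_A"}, set()]
# Toy read inside gene_A that partly overlaps gene_B
overlap = [{"gene_A"}, {"gene_A", "gene_B"}]

for mode in ("union", "intersection-strict", "intersection-nonempty"):
    print(mode, assign_read(overhang, mode), assign_read(overlap, mode))
```

The two toy reads show where the modes disagree: the overhanging read is dropped only by intersection-strict, and the overlapping read is ambiguous only under union.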
Switch to hands-on tutorial
• https://github.com/erinosb/HTSF_workshop/blob/master/02_RNAseq_count.md
Assessing differential abundance
Assessing pairwise differential abundance, relatively simple
Anders and Huber, 2010
Identifying genes with shared patterns across multiple samples, complex
For today…
Anders and Huber, 2010
Many publications report performance comparisons of the different packages
• Seyednasrollah et al., 2013
– http://bib.oxfordjournals.org/content/16/1/59.full.pdf+html
• Soneson et al., 2013
– http://www.biomedcentral.com/1471-2105/14/91
• Rapaport et al., 2013
– http://www.genomebiology.com/2013/14/9/r95
Why is this hard? Why is this different from other types of data?
• Your question
• The data
– Discreteness
– Small numbers of replicates
– Large dynamic range
– Outliers
– Data is overdispersed
• Variance does not scale linearly with mean
• Breaks the assumptions of some inference tests
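To make the overdispersion point concrete, the sketch below compares the variance expected under a Poisson model (variance equals the mean) with the negative binomial form used in the DESeq2 paper (variance = mean + dispersion × mean²). The function names and the dispersion value of 0.1 are hypothetical choices for illustration; real dispersions are estimated per gene from the data.

```python
def poisson_variance(mean):
    """Under a Poisson model the variance equals the mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Negative binomial variance: mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


# Hypothetical dispersion, chosen only for illustration
dispersion = 0.1

for mean in (10, 100, 1000):
    print(mean, poisson_variance(mean), negative_binomial_variance(mean, dispersion))
```

The gap between the two variances widens quickly as the mean grows, which is why tests that assume Poisson noise overstate significance for highly expressed genes.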
Anders and Huber, 2010
Why DESeq?
• Original paper
– http://www.genomebiology.com/content/11/10/R106
• DESeq2 paper
– http://www.genomebiology.com/2014/15/12/550
• Bioconductor
– http://bioconductor.org/packages/release/bioc/html/DESeq2.html
• Vignette
– https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
A final word about the fate of your data
• You will need to deposit your raw and processed files in a repository PRIOR to submitting your paper for publication.
• Keep track of what you did!
– Module versions
– Conversion & transformation steps
– Settings/options
Switch to hands-on tutorial
• https://github.com/erinosb/HTSF_workshop/blob/master/02_RNAseq_count.md
Key Quality Control Metrics
• Gene coverage
– CEAS
• Over-amplification
– FASTQC
• Complexity
– TOPHAT output
• Reproducibility