Part 4 of RNA-seq for DE analysis: Extracting count table and QC
DESCRIPTION
Fourth part of the training session 'RNA-seq for Differential expression analysis'. We explain how we get a count table from a mapping result. We show how to do quality control on the count table. Interested in following this session? Please contact http://www.jakonix.be/contact.html
TRANSCRIPT
![Page 1: Part 4 of RNA-seq for DE analysis: Extracting count table and QC](https://reader033.vdocuments.us/reader033/viewer/2022052908/55944e941a28ab3f6f8b47a2/html5/thumbnails/1.jpg)
This presentation is available under the Creative Commons Attribution-ShareAlike 3.0 Unported License. Please refer to http://www.bits.vib.be/ if you use this presentation or parts hereof.
RNA-seq for DE analysis training
Generating the count table and validating assumptions
Joachim Jacob
22 and 24 April 2014
Overview
http://www.nature.com/nprot/journal/v8/n9/full/nprot.2013.099.html
Bioinformatics analysis will take most of your time
Quality control (QC) of raw reads
Preprocessing: filtering of reads and read parts, to help our goal of differential detection.
QC of preprocessing
Mapping to a reference genome (alternative: to a transcriptome)
QC of the mapping
Count table extraction
QC of the count table
DE test
Biological insight
Goal
We need to summarize the read counts per gene from a mapping result.
The outcome is a raw count table on which we can perform some QC, to validate the experimental setup.
This table is used by the differential expression algorithm to detect DE genes.
Status
[Figure: status of the analysis so far; labels: 20M, 25M, 15M and ~16%, ~5%, ~10%]
Tools to count 'features'
● 'Features' = type of annotation on a genome = exons in our case.
● Different tools exist to accomplish this
http://wiki.bits.vib.be/index.php/RNAseq_toolbox#Feature_counting
The challenge in counting
'Exons' are the type of features used here. They are summarized per 'gene'.
Concept:
GeneA = exon 1 + exon 2 + exon 3 + exon 4 = 215 reads
GeneB = exon 1 + exon 2 + exon 3 = 180 reads
No normalization yet! Just pure counts, aka 'raw counts'.
[Figure: mapping result of RNA-seq data; labels: "Overlaps no feature", "Alt splicing"]
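The concept can be sketched in a few lines of Python. The per-exon numbers are hypothetical and were chosen only so that they add up to the gene totals on the slide (215 and 180 reads); the function name is illustrative as well.

```python
# Hypothetical per-exon read counts; only the gene totals come from the slide.
exon_counts = {
    "GeneA": {"exon1": 60, "exon2": 45, "exon3": 70, "exon4": 40},
    "GeneB": {"exon1": 80, "exon2": 55, "exon3": 45},
}


def raw_gene_counts(exon_counts):
    """Sum the reads of all exons belonging to each gene (no normalization)."""
    return {gene: sum(exons.values()) for gene, exons in exon_counts.items()}


gene_counts = raw_gene_counts(exon_counts)
print(gene_counts)
```

The result is one raw count per gene, which is exactly what one column of the count table holds.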
Dealing with ambiguity
● Genes often consist of different isoforms. These contain different exons, some shared between them, some not. Furthermore...
● Reads that do not overlap a feature, but appear in introns: should they be taken into account?
● Reads that align to more than one gene? Transcripts can overlap, perhaps on different strands (PE and strandedness can partially resolve this).
● Reads that partially overlap a feature, not following known annotations.
The tool HTSeq-count has 3 modes
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
HTSeq-count recommends the 'union mode'. But depending on your genome, you may opt for the 'intersection_strict mode'. Galaxy allows experimenting!
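To make the difference between the three modes concrete, here is a simplified sketch of the overlap rule as described in the HTSeq-count documentation: for every aligned base of a read we take the set of genes overlapping that base, and the mode decides how these sets are combined. This is not HTSeq's own code; the function name and the toy reads are assumptions, and the mode names use the hyphenated spelling of the HTSeq documentation.

```python
def assign_read(position_feature_sets, mode="union"):
    """Return the gene a read is assigned to, or 'no_feature' / 'ambiguous'.

    position_feature_sets: one set of gene IDs per aligned base of the read.
    """
    sets = [set(s) for s in position_feature_sets]
    if mode == "union":
        # every gene touched by at least one base
        candidates = set().union(*sets)
    elif mode == "intersection-strict":
        # only genes covering every base of the read
        candidates = set.intersection(*sets) if sets else set()
    elif mode == "intersection-nonempty":
        # like strict, but bases outside any feature are ignored
        nonempty = [s for s in sets if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not candidates:
        return "no_feature"
    if len(candidates) > 1:
        return "ambiguous"
    return next(iter(candidates))


# Toy reads (hypothetical): one runs from an exon of gene A into an intron,
# the other lies in gene A and partly overlaps gene B.
into_intron = [{"A"}, {"A"}, set()]
two_genes = [{"A"}, {"A", "B"}]
for mode in ("union", "intersection-strict", "intersection-nonempty"):
    print(mode, assign_read(into_intron, mode), assign_read(two_genes, mode))
```

The same read can thus be counted, discarded as 'no_feature', or flagged as 'ambiguous' depending on the mode, which is why it pays to experiment.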
Indicate the SE or PE nature of your data (note: 'mate-pair' is not appropriate naming here).
The annotation file with the coordinates of the features to be counted.
Mode.
Check with mapping QC (see earlier).
For RNA-seq DE we summarize over 'exons' grouped by 'gene_id'. Make sure these fields are correct in your GTF file.
Reverse stranded: check with mapping visualization.
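One way to check that the GTF file really contains 'exon' features carrying a 'gene_id' attribute is a small script like the following. It is a minimal sketch that assumes the standard nine-column, tab-separated GTF layout (feature type in column 3, attributes in column 9); the function name and the two example records are hypothetical.

```python
import re
from collections import Counter


def summarize_gtf(lines, feature_type="exon", id_attribute="gene_id"):
    """Count feature types and how many features of the wanted type carry the ID attribute."""
    types = Counter()
    with_id = 0
    pattern = re.compile(rf'{re.escape(id_attribute)} "([^"]+)"')
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a valid GTF record
        types[fields[2]] += 1
        if fields[2] == feature_type and pattern.search(fields[8]):
            with_id += 1
    return types, with_id


# Two hypothetical GTF records, for illustration only.
example = [
    'chrI\tsource\texon\t100\t200\t.\t+\t.\tgene_id "GeneA"; transcript_id "GeneA.1";\n',
    'chrI\tsource\tCDS\t120\t200\t.\t+\t0\tgene_id "GeneA"; transcript_id "GeneA.1";\n',
]
types, with_id = summarize_gtf(example)
print(types["exon"], with_id)
```

If the number of 'exon' records and the number of exons with a 'gene_id' differ, or if there are no 'exon' records at all, the counting step will not summarize as intended. For a real file, pass an open file handle instead of the example list.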
Resulting count table column
One sample!
Merging to create experiment count table
Tool 'Column join'
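What such a join does can be sketched as follows. This is not the code of the Galaxy tool; the function name, the sample names and the choice to fill genes missing from a sample with 0 are assumptions made for illustration.

```python
def join_count_columns(per_sample_counts):
    """Merge {sample: {gene: count}} into a header and rows keyed by gene ID.

    Genes missing from a sample get a count of 0 (an assumption of this sketch).
    """
    samples = list(per_sample_counts)
    genes = sorted({g for counts in per_sample_counts.values() for g in counts})
    header = ["gene_id"] + samples
    rows = [[g] + [per_sample_counts[s].get(g, 0) for s in samples] for g in genes]
    return header, rows


# Hypothetical single-sample count columns.
per_sample = {
    "sample1": {"GeneA": 215, "GeneB": 180},
    "sample2": {"GeneA": 190, "GeneB": 202, "GeneC": 3},
}
header, rows = join_count_columns(per_sample)
for row in [header] + rows:
    print("\t".join(str(x) for x in row))
```

Each row is one gene and each column one sample, which is the layout expected for the experiment count table.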
Resulting count table
Quality control of count table
In the end, we used about 70% of the reads. Check for your experiment.
[Figure panels: relative numbers; absolute numbers]
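The fraction of reads that ended up in the count table can be computed from one count column. The sketch below assumes that the special counters in the output are prefixed with two underscores (as in recent HTSeq versions; older versions omit the prefix), and the numbers are hypothetical, chosen to give 70%.

```python
def fraction_counted(counts):
    """Fraction of reads assigned to a gene, given one htseq-count style column.

    Keys starting with '__' are treated as special counters (an assumption
    about the output format; older HTSeq versions omit the prefix).
    """
    assigned = sum(n for name, n in counts.items() if not name.startswith("__"))
    total = sum(counts.values())
    return assigned / total if total else 0.0


# Hypothetical numbers chosen so that 70% of the reads are counted.
column = {
    "GeneA": 400,
    "GeneB": 300,
    "__no_feature": 150,
    "__ambiguous": 50,
    "__alignment_not_unique": 100,
}
print(f"{fraction_counted(column):.0%}")
```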
Quality control of count table
2 types of QC:
● General metrics
● Sample-specific quality control
QC: general metrics
● General numbers
Total number of counted reads
QC: general metrics
● General numbers
QC: general metrics
Which genes are most highly present? Which fractions do they occupy?
42 genes (0.63%) of the 6665 genes take 25% of all counts.
This graph can be constructed from the count table.
Gene Counts
TEF1alpha, putative ribo prot,...
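How many top-ranked genes are needed to reach a given share of all counts can be derived from the count table as in this sketch. The toy table is hypothetical; only the figures 42 and 6665 come from the slide (42 of 6665 genes is about 0.63% of the genes).

```python
def genes_for_fraction(gene_counts, fraction=0.25):
    """Smallest number of top-ranked genes whose counts reach `fraction` of the total."""
    total = sum(gene_counts.values())
    running = 0
    for n, count in enumerate(sorted(gene_counts.values(), reverse=True), start=1):
        running += count
        if running >= fraction * total:
            return n
    return len(gene_counts)


# Hypothetical toy table: one dominant gene and many weakly expressed ones.
toy = {"TEF1alpha": 500, **{f"gene{i}": 10 for i in range(100)}}
n_top = genes_for_fraction(toy)
print(n_top, f"{n_top / len(toy):.2%} of genes")

# The slide's numbers: 42 of 6665 genes, expressed as a percentage.
print(f"{42 / 6665:.2%}")
```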
QC: general metrics
● We can plot the counts per sample: filter out the zeros and transform to the log2 scale.
log2(count)
The bulk of the genes have counts in the hundreds.
A few are extremely highly expressed.
A minority have extremely low counts.
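The filtering and transformation step can be sketched like this. The counts are hypothetical, and a simple text histogram with bins of width 1 stands in for the density plot on the slide.

```python
import math
from collections import Counter


def log2_nonzero(counts):
    """Drop zero counts and return log2 of the remaining ones."""
    return [math.log2(c) for c in counts if c > 0]


def histogram(values, bin_width=1.0):
    """Bin values into intervals of `bin_width`; returns {bin_start: n}."""
    bins = Counter(math.floor(v / bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


# Hypothetical counts for one sample.
sample = [0, 0, 1, 3, 120, 250, 300, 480, 512, 900, 65000]
logged = log2_nonzero(sample)
for start, n in histogram(logged).items():
    print(f"[{start:>4.0f}, {start + 1:>4.0f})  {'#' * n}")
```

Zeros have to be removed first because log2(0) is undefined; the bulk of this toy sample falls in the bins between 6 and 10, i.e. counts in the hundreds.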
QC: log2 density graph
● We can do this for all samples, and merge
Strange deviation here
All samples show nice overlap, peaks are similar
QC: log2 merging samples
Here, we take one sample and plot the log2 density graph, then add the counts of another sample and plot again, and so on until we have merged all samples.
You can draw different conclusions depending on whether a horizontal or a vertical shift of the graph appears.
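The merging procedure can be sketched as a running sum over samples. The sample names and counts are hypothetical; each step yields the log2 values that would go into the density graph for that step.

```python
import math


def cumulative_log2(samples):
    """Yield (label, log2 values) after adding each sample's counts to the running sum."""
    running = {}
    names = []
    for name, counts in samples.items():
        names.append(name)
        for gene, n in counts.items():
            running[gene] = running.get(gene, 0) + n
        # zeros are dropped before the log2 transformation
        yield " + ".join(names), [math.log2(c) for c in running.values() if c > 0]


# Hypothetical counts for three samples.
samples = {
    "A": {"g1": 8, "g2": 0, "g3": 64},
    "B": {"g1": 8, "g2": 2, "g3": 64},
    "C": {"g1": 16, "g2": 2, "g3": 128},
}
steps = list(cumulative_log2(samples))
for label, values in steps:
    print(label, values)
```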
QC: rarefaction curve
Code:
library(ggplot2)  # provides ggplot(), aes(), geom_line() and labs()
ggplot(data = nonzero_counts, aes(total, counts)) + geom_line() + labs(x = "total number of sequenced reads", y = "number of genes with counts > 0")
What is the total number of detected features, and how does the feature space increase with each additional sample?
There should be saturation, but here there is none.
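The points of such a curve can be computed from the count table as sketched below. The counts are hypothetical, and the total of counted reads is used as a stand-in for the total number of sequenced reads on the x-axis, which is an assumption of this sketch.

```python
def rarefaction_points(samples):
    """Return (total counted reads, genes with count > 0) after adding each sample."""
    running = {}
    points = []
    for counts in samples:
        for gene, n in counts.items():
            running[gene] = running.get(gene, 0) + n
        total = sum(running.values())
        detected = sum(1 for n in running.values() if n > 0)
        points.append((total, detected))
    return points


# Hypothetical counts: later samples add reads but few new genes.
samples = [
    {"g1": 10, "g2": 0, "g3": 5, "g4": 0},
    {"g1": 12, "g2": 3, "g3": 4, "g4": 0},
    {"g1": 9, "g2": 2, "g3": 6, "g4": 0},
]
points = rarefaction_points(samples)
print(points)
```

When the number of detected genes stops growing while the total keeps increasing, as between the second and third point here, the curve has saturated.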
QC: rarefaction curve
Saturation: OK!
[Figure x-axis: Sample A; Sample A + sample B; Sample A + sample B + sample C; etc.]
Alternative to log2 transformations
● Log2 transformations suffer from bloated variance for genes with low counts.
http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
VST, rLog, Log2
Not normalizations!
http://www.biomedcentral.com/1471-2105/14/91
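Why the variance is bloated on the log2 scale can be seen with a small worked example. Two hypothetical genes have replicate counts with the same absolute spread; the pseudocount of 1 is an assumption of this sketch, and the code does not implement VST or rLog.

```python
import math
from statistics import pvariance


def log2_variance(replicates):
    """Variance of log2(count + 1) across replicate counts of one gene."""
    return pvariance([math.log2(c + 1) for c in replicates])


# Hypothetical replicate counts with the same absolute spread (+/- 3 reads).
low_gene = [1, 4, 7]
high_gene = [1001, 1004, 1007]

print(log2_variance(low_gene), log2_variance(high_gene))
```

The same difference of a few reads is a large change on the log2 scale for the weakly expressed gene and a negligible one for the highly expressed gene, so low counts dominate the variance after a plain log2 transformation.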
QC: count transformations
● Other transformations do not have this behavior, especially VST.
http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
VST, rLog, Log2
Not normalizations!
http://www.biomedcentral.com/1471-2105/14/91
Alternative to log2 transformations
Regularized log (rLog) and 'Variance Stabilizing Transformation' (VST) as alternatives to log2.
http://www.bioconductor.org/packages/2.12/bioc/html/DESeq2.html
rLog VST
Beyond simple metrics QC
● We can also include condition information, to interpret our QC better. For this, we need to gather sample information.
● Make a separate file in which sample info is provided (metadata)
QC with condition information
What do the differences in counts in each sample depend on? Here: counts depend on the treatment and the strain. These must match the sample descriptions file.
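A quick consistency check between the count table and the sample descriptions file could look like this. It is a minimal sketch: the tab-separated layout, the column name 'sample' and all sample names are assumptions; only the factors 'treatment' and 'strain' are taken from the slide.

```python
import csv
import io


def check_samples(count_header, sample_sheet_text):
    """Compare count-table sample columns with the names in a tab-separated sample sheet."""
    reader = csv.DictReader(io.StringIO(sample_sheet_text), delimiter="\t")
    described = [row["sample"] for row in reader]
    counted = count_header[1:]  # first column holds the gene IDs
    missing = [s for s in counted if s not in described]
    unused = [s for s in described if s not in counted]
    return missing, unused


# Hypothetical sample descriptions with the two factors named on the slide.
sheet = (
    "sample\ttreatment\tstrain\n"
    "s1\tuntreated\tWT\n"
    "s2\ttreated\tWT\n"
    "s3\tuntreated\tmutant\n"
)
missing, unused = check_samples(["gene_id", "s1", "s2", "s4"], sheet)
print("not described:", missing, "| not in count table:", unused)
```

Both lists should be empty before moving on to QC with condition information.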
QC with condition info
Clustering of the distance between samples based on transformed counts can reveal sample errors.
VST transformed; rLog transformed
Colour scale of the distance measure between samples. Similar conditions should cluster together.
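The distance measure behind such a heatmap can be sketched as follows. Euclidean distance is an assumption of this sketch, the transformed values and sample names are hypothetical, and a nearest-neighbour lookup stands in for the full clustering.

```python
import math


def distance_matrix(transformed):
    """Euclidean distance between every pair of samples.

    transformed: {sample: [transformed count per gene]}, same gene order everywhere.
    """
    names = list(transformed)
    return {a: {b: math.dist(transformed[a], transformed[b]) for b in names} for a in names}


def nearest(dist, sample):
    """Name of the sample closest to `sample` (excluding itself)."""
    return min((b for b in dist[sample] if b != sample), key=lambda b: dist[sample][b])


# Hypothetical transformed counts for two conditions with two replicates each.
data = {
    "WT_1": [5.0, 8.0, 2.0],
    "WT_2": [5.2, 7.9, 2.1],
    "treated_1": [9.0, 3.0, 6.0],
    "treated_2": [8.8, 3.2, 6.1],
}
dist = distance_matrix(data)
print(nearest(dist, "WT_1"), nearest(dist, "treated_1"))
```

If the nearest neighbour of a sample belongs to a different condition, that is a hint of a swapped or mislabeled sample.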
QC with condition info
Clustering of transformed counts can reveal sample errors.
VST transformed; rLog transformed
Biological samples should cluster together
QC with condition info
Principal component (PC) analysis makes it possible to display the samples in a 2D scatterplot based on the variability between them. Samples that lie close to each other resemble each other more.
Collect enough metadata
Why do these lie so close together?
You can never collect enough
During library preparation, collect as much information as possible to add to the sample descriptions. Pay particular attention to differences between samples, e.g. day of preparation, centrifuges used, ...
Collect enough metadata
In the QC of the count table, you can map this additional info to the PC graph. In this case, library prep on a different day had an effect on the WT samples (batch effect).
Additional metadata
Collect enough metadata
Day 1
Day 2
Collect enough metadata
Days are included and give us more insight
Next step
Now that we know our data inside out, we can run a DE algorithm on the count table!
Keywords
Raw counts
Count table
Overlapping features
Density graph
Rarefaction curve
Count transformation
VST
Sample metadata
PCA plot
Write in your own words what the terms mean
Exercises
● → Extracting counts and doing QC
Break