New RNA-seq workflows
Charlotte Soneson, University of Zurich
Brixen 2016
The traditional workflow
ALIGNMENT → COUNTING → ANALYSIS
[Figure: count matrix with one row per gene (Gene A, Gene B, ..., Gene X)]
Concerns with the traditional workflow
• slow
• unnecessary?
• inaccurate?
• doesn’t account for all uncertainty in abundance estimates
Abundance quantification
[Figure: a gene with two transcripts, T1 (length = L) and T2 (length = 2L), shown for sample 1 and sample 2]
Gene-level read counts
[Figure: transcripts T1 (length = L) and T2 (length = 2L); gene length = 2.6L; 150 reads in sample 1 and 150 reads in sample 2]
Gene    S1    S2
Count   150   150
What can we do?
• Consider another abundance unit that better reflects the underlying abundances (“number of transcript molecules”)
• Include “adjustment” of gene counts to reflect underlying isoform composition
How can we get such values?
How could such adjustment be done?
Are they any good?
We need transcript-level information!
Abundance units
• $c_i$: read count for transcript $i$
• $\ell_i$: length of transcript $i$
• $r$: fragment length
• $\sum_k c_k$: library size

$$t_i = \frac{c_i \, r}{\ell_i}$$

$$\mathrm{TPM}_i = 10^6 \cdot \frac{t_i}{\sum_k t_k}$$

$$\mathrm{RPKM}_i = 10^9 \cdot \frac{c_i}{\ell_i \sum_k c_k} = 10^9 \cdot \frac{t_i}{\sum_k (t_k \ell_k)}$$

$$\mathrm{TPM}_i \propto \mathrm{RPKM}_i, \qquad \sum_i \mathrm{TPM}_i = 10^6$$
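The abundance-unit definitions can be checked numerically. The following is a minimal Python sketch, not code from the talk: the counts, lengths and fragment length are made-up illustration values, and the function names `tpm` and `rpkm` are hypothetical.

```python
def tpm(counts, lengths, frag_len=1.0):
    """TPM_i = 1e6 * t_i / sum_k t_k, with t_i = c_i * r / l_i."""
    t = [c * frag_len / l for c, l in zip(counts, lengths)]
    total = sum(t)
    return [1e6 * ti / total for ti in t]


def rpkm(counts, lengths):
    """RPKM_i = 1e9 * c_i / (l_i * sum_k c_k)."""
    library_size = sum(counts)
    return [1e9 * c / (l * library_size) for c, l in zip(counts, lengths)]


# Hypothetical toy data: three transcripts, read counts and lengths in bases.
counts = [100, 200, 700]
lengths = [1000, 2000, 3500]

tpm_values = tpm(counts, lengths, frag_len=100)
rpkm_values = rpkm(counts, lengths)

# TPMs sum to one million by construction.
print(sum(tpm_values))
# Within a sample, TPM_i / RPKM_i is the same constant for every transcript.
ratios = [t / r for t, r in zip(tpm_values, rpkm_values)]
print(ratios)
```

The constant ratio holds only within one sample; the RPKM sum differs between samples, which is why TPM and RPKM are proportional per sample but not interchangeable across samples.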
Offsets (“average transcript lengths”)
• Similar to correction factors for library size, but sample- and gene-specific
• Weighted average of transcript lengths, weighted by estimated abundances (TPMs)
• Average transcript length for gene g in sample s:

$$\mathrm{ATL}_{gs} = \sum_{i \in g} \theta_{is} \bar{\ell}_{is}, \qquad \sum_{i \in g} \theta_{is} = 1$$

where $\bar{\ell}_{is}$ is the effective length of isoform $i$ (in sample $s$) and $\theta_{is}$ is the relative abundance of isoform $i$ in sample $s$.
Average transcript lengths
[Figure: a gene with two isoforms, T1 (length = L) and T2 (length = 2L)]

With isoform weights (1, 0) in sample 1 and (0, 1) in sample 2:
$$\mathrm{ATL}_{g1} = 1 \cdot L + 0 \cdot 2L = L$$
$$\mathrm{ATL}_{g2} = 0 \cdot L + 1 \cdot 2L = 2L$$

With isoform weights (0.75, 0.25) in sample 1 and (0.5, 0.5) in sample 2:
$$\mathrm{ATL}_{g1} = 0.75 \cdot L + 0.25 \cdot 2L = 1.25L$$
$$\mathrm{ATL}_{g2} = 0.5 \cdot L + 0.5 \cdot 2L = 1.5L$$
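The ATL values above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name is hypothetical, L = 1000 is an arbitrary choice, and the nominal lengths L and 2L stand in for effective lengths.

```python
def average_transcript_length(rel_abundances, eff_lengths):
    """ATL_gs = sum over isoforms i in g of theta_is * effective length."""
    if abs(sum(rel_abundances) - 1.0) > 1e-9:
        raise ValueError("relative abundances must sum to 1 within a gene")
    return sum(theta * length
               for theta, length in zip(rel_abundances, eff_lengths))


L = 1000.0                # hypothetical value for the length unit L
iso_lengths = [L, 2 * L]  # T1 has length L, T2 has length 2L

# Each sample expresses a single isoform.
atl_s1 = average_transcript_length([1.0, 0.0], iso_lengths)
atl_s2 = average_transcript_length([0.0, 1.0], iso_lengths)

# Both isoforms are expressed in both samples.
atl_mix_s1 = average_transcript_length([0.75, 0.25], iso_lengths)
atl_mix_s2 = average_transcript_length([0.5, 0.5], iso_lengths)

# Equal gene-level counts (150 reads per sample) correspond to different
# length-adjusted abundances when the ATLs differ.
gene_count = 150
print(gene_count / atl_s1, gene_count / atl_s2)
```

Dividing the count by the ATL is used here only to show the direction of the adjustment; in a count-based analysis the ATL would enter as an offset rather than by rescaling the counts.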
The “modern” workflow
“MAPPING” → ESTIMATION → ANALYSIS
[Figure: count matrix with one row per gene (Gene A, Gene B, ..., Gene X)]
The “mapping” step
• Does not provide “full” alignment information (i.e., no exact base-by-base alignment).
• Rather, finds all transcripts (and positions) that a read is compatible with.
• Comes in various flavors:
  • pseudoalignment (kallisto)
  • lightweight alignment (Salmon)
  • quasimapping (Sailfish, RapMap)
Bray et al. 2016; Patro et al. 2014; Patro et al. 2015; Srivastava et al. 2016
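The idea of finding the transcripts a read is compatible with, without a base-by-base alignment, can be illustrated with a toy k-mer lookup. This is a deliberately simplified Python sketch and not the algorithm of any of the tools named above: it ignores reverse complements, sequencing errors and read positions, and the sequences and function names are made up.

```python
from collections import defaultdict


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers in the read."""
    compatible = None
    for pos in range(len(read) - k + 1):
        hits = index.get(read[pos:pos + k], set())
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Hypothetical toy transcriptome: T1 and T2 share a prefix, then diverge.
transcripts = {
    "T1": "ACGTACGGTCA",
    "T2": "ACGTACGCCAT",
}
index = build_kmer_index(transcripts, k=5)

shared_read = "ACGTACG"   # lies in the shared prefix
unique_read = "TACGGTC"   # spans the T1-specific part
shared_hits = compatible_transcripts(shared_read, index, k=5)
unique_hits = compatible_transcripts(unique_read, index, k=5)
print(shared_hits, unique_hits)
```

Reads with the same set of compatible transcripts can then be grouped together, which is the input to the estimation step.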
The “estimation” step
• Input: for each read, the “equivalence class” of compatible transcripts
• Probabilistic modeling of read generation process, with transcript abundance as parameter
• EM algorithm
• Output: estimated abundance of each transcript
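A toy version of this estimation step can be written as a short EM loop over equivalence-class counts. The sketch below is a simplified illustration under stated assumptions, not the implementation used by kallisto or Salmon: the read-generation model is reduced to "abundance divided by effective length", there is no bias modeling or convergence check, and all names and numbers are hypothetical.

```python
def em_abundances(eq_classes, eff_lengths, n_iter=200):
    """Estimate per-transcript read counts from equivalence-class counts.

    eq_classes: dict mapping a tuple of transcript names to a read count.
    eff_lengths: dict mapping transcript name to effective length.
    """
    names = list(eff_lengths)
    total_reads = sum(eq_classes.values())
    # Start from a uniform split of the reads.
    est = {name: total_reads / len(names) for name in names}
    for _ in range(n_iter):
        new_est = {name: 0.0 for name in names}
        for members, count in eq_classes.items():
            # E-step: share the class count in proportion to the current
            # abundance per unit of effective length.
            weights = [est[m] / eff_lengths[m] for m in members]
            norm = sum(weights)
            if norm == 0:
                continue
            for m, w in zip(members, weights):
                new_est[m] += count * w / norm
        # M-step: the expected counts become the new estimates.
        est = new_est
    return est


# Hypothetical input: 30 reads compatible only with T1, 10 only with T2,
# and 60 compatible with both. Equal lengths keep the example simple.
eq_classes = {("T1",): 30, ("T2",): 10, ("T1", "T2"): 60}
eff_lengths = {"T1": 1000.0, "T2": 1000.0}

estimates = em_abundances(eq_classes, eff_lengths)
print(estimates)
```

The ambiguous reads are split according to the evidence from the unambiguous ones, so the 60 shared reads are not divided evenly between T1 and T2.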
Step 1: build transcriptome index
• kallisto: name of index, transcriptome fasta file
• Salmon: name of index, transcriptome fasta file, number of cores
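The commands themselves did not survive in this transcript. As a hedged sketch, the arguments listed above could be assembled as follows in Python; the flags (`-i` for kallisto, `-t`, `-i` and `-p` for Salmon) come from the tools' command-line documentation rather than from the slides, and the file and index names are hypothetical.

```python
import shlex


def kallisto_index_cmd(index_name, transcriptome_fasta):
    """Argument list for building a kallisto index."""
    return ["kallisto", "index", "-i", index_name, transcriptome_fasta]


def salmon_index_cmd(index_name, transcriptome_fasta, n_cores):
    """Argument list for building a Salmon index."""
    return ["salmon", "index", "-t", transcriptome_fasta,
            "-i", index_name, "-p", str(n_cores)]


# Hypothetical file and index names.
fasta = "transcripts.fa"
kallisto_cmd = kallisto_index_cmd("kallisto_index.idx", fasta)
salmon_cmd = salmon_index_cmd("salmon_index", fasta, n_cores=4)

# Print the commands; pass a list to subprocess.run(cmd, check=True)
# to actually execute it.
print(shlex.join(kallisto_cmd))
print(shlex.join(salmon_cmd))
```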
Where to find transcript fasta?
www.ensembl.org/info/data/ftp/index.html
The same page also lists the reference files for the alignment-based workflow.
Step 2: quantify
• kallisto: name of index, output folder, # bootstraps, number of cores, input fastq files
• Salmon: name of index, libtype, input fastq files, number of cores, output folder, # bootstraps
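As with the index step, the original commands are not preserved here. The Python sketch below shows one way the listed arguments could map onto the command lines for paired-end reads; the flags are taken from the kallisto and Salmon command-line documentation, not from the slides, and all file names, the bootstrap count and the libtype value are hypothetical placeholders.

```python
import shlex


def kallisto_quant_cmd(index_name, out_dir, fastq_files,
                       n_bootstraps, n_cores):
    """Argument list for kallisto quant on paired-end reads."""
    return ["kallisto", "quant", "-i", index_name, "-o", out_dir,
            "-b", str(n_bootstraps), "-t", str(n_cores), *fastq_files]


def salmon_quant_cmd(index_name, libtype, fastq_1, fastq_2, out_dir,
                     n_bootstraps, n_cores):
    """Argument list for salmon quant on paired-end reads."""
    return ["salmon", "quant", "-i", index_name, "-l", libtype,
            "-1", fastq_1, "-2", fastq_2, "-p", str(n_cores),
            "-o", out_dir, "--numBootstraps", str(n_bootstraps)]


# Hypothetical inputs; choose the libtype from the Salmon documentation.
reads = ["sample_1.fastq.gz", "sample_2.fastq.gz"]
kallisto_cmd = kallisto_quant_cmd("kallisto_index.idx", "kallisto_out",
                                  reads, n_bootstraps=30, n_cores=4)
salmon_cmd = salmon_quant_cmd("salmon_index", "A", reads[0], reads[1],
                              "salmon_out", n_bootstraps=30, n_cores=4)

# Print the commands; pass a list to subprocess.run(cmd, check=True)
# to actually execute it.
print(shlex.join(kallisto_cmd))
print(shlex.join(salmon_cmd))
```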
Salmon LIBTYPE argument
http://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype
Output
• kallisto: abundance.tsv
• Salmon: quant.sf
Comparison to traditional workflow
Salmon/kallisto…
• … are considerably faster than traditional alignment + counting, which makes bootstrapping feasible
• … provide more highly resolved estimates (transcripts rather than genes), which can be aggregated to the gene level
• … can use a larger fraction of the reads
• … don’t give precise alignments (e.g., for visualization in a genome browser), but avoid large alignment files
kallisto and Salmon gene counts overall similar
[Figure: scatter plot for sample SRR1039508; x-axis: kallisto (log(counts + 1)); y-axis: Salmon (log(counts + 1))]
Gene-level counts mostly similar to traditional approach
[Figure: scatter plot for sample SRR1039508; x-axis: featureCounts (log(counts + 1)); y-axis: Salmon (log(counts + 1))]
kallisto and Salmon can use slightly more reads
[Figure: number of mapped reads per sample (SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516, SRR1039517, SRR1039520, SRR1039521) for featureCounts, kallisto and Salmon]
How to get the estimated values into R?
• TPMs
• counts
• “ATL” offsets
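The R code from these slides is not preserved in the transcript; the reference list points to the tximport package for this step. To make the three quantities concrete, the following Python sketch reads a Salmon-style `quant.sf` table and summarizes it to the gene level. It only mirrors the idea and is not tximport itself: the table contents and the transcript-to-gene map are made up, and genes with zero total TPM are simply skipped when computing the offset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of a Salmon quant.sf file (tab-separated).
QUANT_SF = """Name\tLength\tEffectiveLength\tTPM\tNumReads
T1\t1000\t800\t300000\t120
T2\t2000\t1800\t100000\t90
T3\t1500\t1300\t600000\t390
"""

# Hypothetical transcript-to-gene map.
TX2GENE = {"T1": "geneA", "T2": "geneA", "T3": "geneB"}


def summarize_to_gene(quant_text, tx2gene):
    """Aggregate transcript-level TPMs, counts and ATL offsets per gene."""
    tpm = defaultdict(float)
    counts = defaultdict(float)
    weighted_len = defaultdict(float)
    reader = csv.DictReader(io.StringIO(quant_text), delimiter="\t")
    for row in reader:
        gene = tx2gene[row["Name"]]
        tx_tpm = float(row["TPM"])
        tpm[gene] += tx_tpm
        counts[gene] += float(row["NumReads"])
        weighted_len[gene] += tx_tpm * float(row["EffectiveLength"])
    # ATL: TPM-weighted average of the effective transcript lengths.
    atl = {g: weighted_len[g] / tpm[g] for g in tpm if tpm[g] > 0}
    return dict(tpm), dict(counts), atl


gene_tpm, gene_counts, gene_atl = summarize_to_gene(QUANT_SF, TX2GENE)
print(gene_tpm)
print(gene_counts)
print(gene_atl)
```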
A word of warning
• Abundance estimates for lowly expressed transcripts are highly variable and should be interpreted with caution
• Problematic when coverage of region defining an isoform is low
• When aggregated to the gene level, abundance estimates are less variable
[Figure: estimated TPM vs. true TPM]
Soneson, Love & Robinson, F1000 Research 2016
Differential analysis types for RNA-seq
• Has the total output of a gene changed? → DGE (differential gene expression)
• Has the expression of individual transcripts changed? → DTE (differential transcript expression)
• Has any isoform of a given gene changed? → DTE+G
• Has the isoform composition for a given gene changed? → DTU/DEU (differential transcript/exon usage)
These questions need different abundance quantifications of transcriptomic features (genes, transcripts, exons).
References
• Srivastava et al.: RapMap: a rapid, sensitive and accurate tool for mapping RNA-seq reads to transcriptomes. Bioinformatics 32:i192-i200 (2016) - RapMap
• Patro et al.: Accurate, fast, and model-aware transcript expression quantification with Salmon. bioRxiv http://dx.doi.org/10.1101/021592 (2015) - Salmon
• Bray et al.: Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34(5):525-527 (2016) - kallisto
• Patro et al.: Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nature Biotechnology 32:462-464 (2014) - Sailfish
• Pimentel et al.: Differential analysis of RNA-Seq incorporating quantification uncertainty. bioRxiv http://dx.doi.org/10.1101/058164 (2016) - sleuth
• Wagner et al.: Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory in Biosciences 131:281-285 (2012) - TPM vs FPKM
• Soneson et al.: Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Research 4:1521 (2016) - ATL offsets (tximport package)
• Li et al.: RNA-seq gene expression estimation with read mapping uncertainty. Bioinformatics 26(4):493-500 (2010) - TPM, RSEM