rnaseq’analysis’ - open source at scilifelab · rnaseq’analysis ’!it’s ... •...
TRANSCRIPT
Defining functional DNA elements in the human genome Kellis M et al. PNAS 2014;111:6131-6138
RNA reads are not enough to iden:fy func:onal RNAs
Depending on the steps from sample to RNA seq will give different results
AAAAAAAA
enrichments -‐>
reads -‐>
library -‐>
RNA-‐> PolyA (mRNA) RiboMinus (-‐ rRNA) Size <50 nt (miRNA ) …..
Size of fragment Strand specific 5’ end specific 3’ end specific …..
Single end (1 read per fragment) Paired end (2 reads per fragment)
Mapping (Pär Engström)
• Use RNA specific mapper • Use a two-‐pass workflow • STAR or HISAT • For long (PacBio) reads, STAR, BLAT or GMAP can be used
Promises and pi_alls Long reads
• Low throughput (-‐) • Complete transcripts (++) • Not quan:ta:ve (-‐) • Only highly expressed genes (-‐) • Expensive (-‐) • Low background noise (+) • Easy downstream analysis (++)
short reads • High throughput (++) • Quan:ta:ve (++) • Frac:ons of transcripts (-‐) • Full dynamic range (+-‐) • Unlimited dynamic range (+) • Cheap (+) • Low background noise (+) • Strand specificity (+) • Re-‐sequencing (+)
RNA-‐seq analysis workflow RNA$seq(analysis(workflow(
Reads(
Reference(
Mapping( Mapped(reads(STAR(
genome.fa(
mappedReads.bam(reads.fastq.gz(
Gene(expression(
Gene(annotaCon:(ref.bed(/(ref.ga(
Do a lot of QC
Reads(
Reference(
Mapping( Mapped(reads(STAR(
genome.fa(
mappedReads.bam(reads.fastq.gz(
Gene(expression(
Gene(annotaCon:(ref.bed(/(ref.ga(
Read(QC($(FastQC(
Mapping(staCsCcs(
Mapping(QC($(RseQC(
Outlier(detecCon,(
sample(swaps(
More varia:on when using top hat 2 with default seengs than when using STAR or
Stampy with default seeng
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●
●● ●
●
●
●
● ● ●
●
●
● ● ●●
●
●
● ● ●
●
● ●
● ●
●
●
● ● ●
●
●
●
● ● ●
●●
● ● ● ●●
●
● ● ●
●
● ●
●●
●
●
● ● ●
60
70
80
90
Cg4
a_KS
2C
g88_
14_K
S3C
g88_
44_K
S2C
g89_
16_K
S2C
g89_
3_KS
1C
g8c_
KS2
Cg8
f_KS
1C
g94_
6_KS
1C
r1G
R1_
2_KS
1C
r1G
R1_
2_KS
2C
r1G
R1_
2_KS
3C
r23_
9_KS
2C
r39_
1_TS
1_KS
1C
r75_
2_3_
KS1
Cr7
9_29
_ext
ra1
Cr8
4_21
_KS1
CrN
2256
1_KS
2In
ter3
_1In
ter4
_1_1
Inte
r4_1
_2In
ter4
_1_3
Inte
r4_1
_4In
ter5
_1In
tra6_
3In
tra7_
2_1
Intra
7_2_
2In
tra7_
2_3
Intra
8_2
species
Perc
ent p
rope
rly m
appe
d pa
ired
end
read
s
Colour●
●
●
StampySTARTophat2
Shape● Flower
Leaf
Compare mapping efficacy of different RNA−seq assemblers
C.rube
lla
C.grandiflo
ra
hybrid
C.grandiflo
ra
Principal component 1 separates samples from flowers and leaves
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
−150 −100 −50 0 50 100 150
−50
050
PC1
PC2
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Cg4a_KS2_FCg4a_KS2_LCg88_14_KS3_FCg88_14_KS3_LCg88_44_KS2_FCg88_44_KS2_LCg89_16_KS2_FCg89_16_KS2_LCg89_3_KS1_FCg89_3_KS1_LCg8c_KS2_FCg8c_KS2_LCg8f_KS1_FCg8f_KS1_LCg94_6_KS1_FCg94_6_KS1_LCr1GR1_2_KS1_FCr1GR1_2_KS1_LCr1GR1_2_KS2_FCr1GR1_2_KS2_LCr1GR1_2_KS3_FCr1GR1_2_KS3_LCr23_9_KS2_FCr23_9_KS2_LCr39_1_TS1_KS1_FCr39_1_TS1_KS1_LCr75_2_3_KS1_FCr75_2_3_KS1_LCr79_29_extra1_FCr79_29_extra1_LCr84_21_KS1_FCr84_21_KS1_LCrN22561_KS2_FCrN22561_KS2_LInter3_1_FInter3_1_LInter4_1_1_FInter4_1_1_LInter4_1_2_FInter4_1_2_LInter4_1_3_FInter4_1_4_LInter5_1_FInter5_1_LIntra6_3_FIntra6_3_LIntra7_2_1_FIntra7_2_1_LIntra7_2_2_FIntra7_2_2_LIntra7_2_3_FIntra7_2_3_LIntra8_2_FIntra8_2_L
C. rubella
Principal component 2 and 3 separates the different species
C. rubella
−60 −40 −20 0 20 40 60 80
−50
050
100
PC2
PC3
C. grandiflora
C. rubella
hybrid
Differen:al expression analysis Mikael Huss
The iden:fica:on of genes (or other types of genomic features, such as transcripts or exons) that are expressed in significantly different quan::es in dis:nct groups of samples, be it biological condi:ons (drug-‐treated vs. controls), diseased vs. healthy individuals, different :ssues, different stages of development, or something else.
Typically univariate analysis (one gene at a :me) – even though we know that genes are not independent
Gene-‐set analysis (GSA)
Immune response
Pyruvate
Gene-‐level data Gene-‐set data (results)
PPARG
Gene-‐set analysis GO-‐terms Pathways Chromosomal loca?ons Transcrip?on factors Histone modifica?ons Diseases etc…
Samples
Gen
es
We will focus on transcriptomics and differen:al expression analysis However, GSA can in principle be used on all types of genome-‐wide data.
Analysis regarding Type II Diabetes p−values up
0
0.00017
0.00033
5e−04
0.00067
0.00083
0.001
p−values down
0
0.00017
0.00033
5e−04
0.00067
0.00083
0.001
mean, distinct−directional (both)
●
●
●
●
●
●
●
●
●●
●
● ●
●
●
KEGG_GLYCOLYSIS_GLUCONEOGENESIS
KEGG_PENTOSE_PHOSPHATE_PATHWAY
KEGG_STEROID_BIOSYNTHESIS
KEGG_OXIDATIVE_PHOSPHORYLATION
KEGG_PURINE_METABOLISM
KEGG_PYRIMIDINE_METABOLISMKEGG_GLUTATHIONE_METABOLISM
KEGG_LINOLEIC_ACID_METABOLISM
KEGG_GLYCOSPHINGOLIPID_BIOSYNTHESIS_LACTO_AND_NEOLACTO_SERIES
KEGG_TERPENOID_BACKBONE_BIOSYNTHESIS
KEGG_AMINOACYL_TRNA_BIOSYNTHESISKEGG_RIBOSOME
KEGG_DNA_REPLICATION
KEGG_PROTEASOME
KEGG_PROTEIN_EXPORT
KEGG_CELL_CYCLE
KEGG_CARDIAC_MUSCLE_CONTRACTION
KEGG_FOCAL_ADHESIONKEGG_ECM_RECEPTOR_INTERACTION
KEGG_ALDOSTERONE_REGULATED_SODIUM_REABSORPTION
KEGG_ALZHEIMERS_DISEASEKEGG_PARKINSONS_DISEASE
KEGG_AMYOTROPHIC_LATERAL_SCLEROSIS_ALSKEGG_HUNTINGTONS_DISEASE
KEGG_VIBRIO_CHOLERAE_INFECTION
Max node size:177 genes
Min node size:15 genes
Max edge width:84 genes
Min edge width:1 genes
Exercises • Mapping
– STAR – HISAT2
• Tutorial for reference guided assembly – Cufflinks – String:e
• Tutorial for de novo assembly – Trinity
• Visualise mapped reads and assembled transcripts on reference – IGV
• RNA quality controll – Tutorial for RNA seq Quality Control
• Differen:al expression analysis – DEseq2 – Calisto and Sleuth – mul: variate analysis in SIMCA
• small RNA analysis – miRNA analysis
• Introductory – Introduc:on to the RNA seq data provided – Short introduc:on to R – Short introduc:on to IGV
• Beta labs – Single cell RNA PCA and clustering – Gene set analysis
• UPPMAX – sbatch script example