Bioinformatics and OMICs Group Meeting: Reference-Guided RNA Sequencing
TRANSCRIPT
Hi
Name: David Oliver
Advisor: Dr. Shtutman
Research: Understanding the role of COPZ2 silencing in cancer progression, using RNA-seq to identify transcriptional changes caused by the loss of COPZ2 and its encoded microRNA.
Experience: Microarray analysis; multiple RNA-seq analyses, including long-read (PacBio) and short-read (Illumina) sequencing experiments.
Why RNA-seq
What’s the question?
◦ Differential expression
◦ Differential splicing
Advantage over other technologies
◦ Increased sensitivity
◦ Increased reproducibility
RNA-Seq vs Dual- and Single-Channel Microarray Data: Sensitivity Analysis for Differential Expression and Clustering. Alina Sîrbu, Gráinne Kerr, Martin Crane, Heather J. Ruskin. Published: December 10, 2012. DOI: 10.1371/journal.pone.0050986
Before You Start
◦ Consult a statistician
◦ Consult your sequencing core
Actually Doing RNA-seq
Minimum Requirements
◦ Have consulted a statistician and your sequencing core
◦ Know that your question can be answered using sequencing technology and that the experimental design is appropriate
◦ > 10,000,000 reads per sample
  ◦ Much more depth required for differential splicing
◦ ≥ 3 biological replicates
◦ Access to a decent amount of computing power
  ◦ Can be done on a laptop but it takes ~ 3 weeks (ask me how I know)
◦ Basic knowledge of Unix systems and R
  ◦ Or, know someone who is willing to help you
Actually Doing RNA-seq
Suggested Pipeline
◦ Quality assessment:
  ◦ FastQC
  ◦ FastX toolkit
◦ Alignment:
  ◦ Bowtie2/Tophat2
  ◦ STAR
  ◦ NovoAlign
◦ Counting reads:
  ◦ FeatureCounts
  ◦ Gencode annotation
◦ Differential expression analysis:
  ◦ edgeR
◦ Manipulating sequencing files:
  ◦ Samtools, bamtools
[Workflow diagram: Biological System → Total RNA or mRNA → RNA-Seq → Raw Reads → Quality Filtering (fastQC) → Align to genome (Bowtie2, STAR, or NovoAlign, against the Target Genome) → Read Counting (FeatureCounts, Gencode) → Normalization/Quantification (edgeR) → RNA expression levels]
RNA-seq Walkthrough
◦ Check some quality markers
◦ Get the target genome
◦ Build aligner-specific indexed genome
◦ Perform alignment
◦ Do some file manipulation
◦ Get the annotation file
◦ Count reads
◦ Perform normalization and quantification
Check some quality markers
FastQC
◦ Basic tool for generating reports
◦ Java based
◦ Does not provide tools for correcting errors (see FastX toolkit)
◦ http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Other tools
◦ FASTX toolkit: for fixing some problems with datasets (adapter trimming, readthrough error correction, etc.)
◦ SAMstat: a tool for alignment QC
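To make "quality markers" a bit more concrete: the per-base quality that FastQC plots comes from the fourth line of each FASTQ record. The sketch below decodes that line into Phred scores. It assumes the common Phred+33 encoding, and the record is a made-up example, not data from this analysis.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    Assumes Phred+33 encoding (offset=33); older files may use a different offset.
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Hypothetical four-line FASTQ record (illustrative only).
record = [
    "@read1",
    "GATTACA",
    "+",
    "IIII5+!",
]

scores = phred_scores(record[3])
mean_q = sum(scores) / len(scores)

print(scores)                 # per-base Phred scores
print(round(mean_q, 1))       # mean quality of the read
print(error_probability(20))  # Q20 means a 1 in 100 chance the base call is wrong
```

A drop in scores toward the 3' end of reads, as in this toy record, is the kind of pattern that trimming tools such as the FASTX toolkit are meant to address.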
Getting the target genome
http://genome.ucsc.edu/
Build aligner-specific indexed genome
This step is performed by the aligner and takes a variable amount of time depending on the type of index used and the size of the genome to be indexed.
Perform alignment
tophat2 -p 12 --no-coverage-search --b2-N 1 --b2-L 32 --b2-i S,1,0.5 --b2-D 250 --b2-R 25 -o $RNAwork/ $RNAwork/Indexes/hg38_index $RNAwork/sample1.fastq
Reads:
Input : 20889144
Mapped : 18935684 (90.6% of input)
of these: 2674218 (14.1%) have multiple alignments (436 have >20)
90.6% overall read mapping rate.
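The percentages in this summary follow directly from the read counts. A quick check, using only the numbers reported above (the variable names are just for illustration):

```python
# Read counts copied from the alignment summary above.
input_reads = 20889144
mapped_reads = 18935684
multi_mapped = 2674218

# Overall mapping rate: mapped reads as a share of all input reads.
mapped_pct = 100 * mapped_reads / input_reads

# Multi-mapping rate: reads with more than one alignment, as a share of mapped reads.
multi_pct = 100 * multi_mapped / mapped_reads

# Reads that aligned to exactly one location.
unique_reads = mapped_reads - multi_mapped

print(f"{mapped_pct:.1f}% of input reads mapped")
print(f"{multi_pct:.1f}% of mapped reads have multiple alignments")
print(f"{unique_reads} uniquely mapped reads")
```

Note that the 14.1% is relative to mapped reads, not to input reads, which is easy to misread in the summary.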
Do some file manipulation
Depending on which aligner you choose to use, the aligned sequences may be output as a BAM or SAM file.
◦ Sequence alignment/map format (SAM)
  ◦ Contains all the alignment information plus room for user-defined information about the alignments
◦ Binary alignment/map format (BAM)
  ◦ A binary version of the SAM file
  ◦ Added benefit of being much smaller and quickly accessed by other software
  ◦ Not all software can manage the conversion from BAM back to SAM
To manipulate these formats (e.g., sort, remove duplicates, remove unaligned sequences), use either samtools or bamtools.
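As a rough illustration of what "remove unaligned sequences" means at the file level, the sketch below filters a few made-up SAM records by the FLAG field (column 2), where bit 0x4 marks a read as unmapped, and then sorts the survivors by position. In practice samtools or bamtools do this for you, and on BAM rather than text; this is only meant to show the idea.

```python
UNMAPPED = 0x4  # SAM FLAG bit that is set when the read has no alignment


def mapped_records(sam_lines):
    """Yield the fields of every alignment record whose unmapped bit is not set."""
    for line in sam_lines:
        if line.startswith("@"):  # header lines carry no alignments
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if not flag & UNMAPPED:
            yield fields


# Hypothetical SAM content (two header lines, three reads); not real data.
toy_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@SQ\tSN:chr1\tLN:1000",
    "read3\t16\tchr1\t300\t50\t7M\t*\t0\t0\tTTTTTTT\tIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tCCCCCCC\tIIIIIII",
    "read1\t0\tchr1\t100\t50\t7M\t*\t0\t0\tGATTACA\tIIIIIII",
]

# Keep mapped reads, then sort by reference name and leftmost position.
kept = sorted(mapped_records(toy_sam), key=lambda f: (f[2], int(f[3])))
names = [f[0] for f in kept]
print(names)
```

Here read2 is dropped because its FLAG is 4, and the remaining reads come out in coordinate order.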
Get the annotation file
Annotation files are readily available from multiple sources:
◦ Gencode ( http://www.gencodegenes.org/releases/ )
◦ Ensembl ( http://useast.ensembl.org/info/data/ftp/index.html?redirect=no )
◦ Vega ( http://vega.sanger.ac.uk/info/about/data_access.html )
◦ RefSeq ( http://www.ncbi.nlm.nih.gov/refseq/ )
These annotation sources mainly vary in the number of non-coding RNAs which have been annotated.
◦ RefSeq < Gencode < Ensembl < Vega
We use Gencode.
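Annotation files from these sources are typically distributed as GTF: nine tab-separated columns, with the last column holding key "value" attribute pairs. The sketch below parses one such line. The line itself is invented for illustration (the gene ID and name are placeholders, not real Gencode entries), and the parser only picks up quoted attribute values.

```python
import re


def parse_gtf_line(line):
    """Split one GTF line into its coordinates and a dict of quoted attributes."""
    cols = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand, frame, attrs = cols
    # Attributes look like: gene_id "X"; gene_name "Y"; ...
    # Unquoted values (e.g. numeric ones) are ignored by this simple pattern,
    # and a repeated key keeps only its last value.
    attributes = dict(re.findall(r'(\S+) "([^"]*)"', attrs))
    return {
        "seqname": seqname,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Hypothetical GTF line; identifiers are placeholders.
toy_line = (
    "chr1\tHAVANA\tgene\t1000\t5000\t.\t+\t.\t"
    'gene_id "ENSG_EXAMPLE.1"; gene_type "protein_coding"; '
    'gene_name "EXAMPLE1"; level 2;'
)

gene = parse_gtf_line(toy_line)
length = gene["end"] - gene["start"] + 1  # GTF coordinates are 1-based, inclusive
print(gene["attributes"]["gene_name"], gene["attributes"]["gene_type"], length)
```

The gene_type attribute is where the non-coding RNA categories mentioned above would show up.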
Count Reads
FeatureCounts
◦ We used to use HTseq-Count, which was quite nice, but we’ve switched to FeatureCounts because it is much, much, much faster.
◦ Also comes as an R package (bioc::Rsubread)
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
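Conceptually, read counting asks which annotated gene each aligned read overlaps. The toy sketch below is not how FeatureCounts or HTseq-Count are implemented (they use indexed lookups and handle strands, exons, and paired ends); it only illustrates the basic idea, under the assumption that reads overlapping more than one gene are left uncounted. All coordinates and gene names are invented.

```python
from collections import Counter

# Hypothetical gene intervals: name -> (chromosome, start, end), 1-based inclusive.
genes = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 450, 900),
    "geneC": ("chr2", 100, 300),
}

# Hypothetical aligned reads: (chromosome, start, end).
reads = [
    ("chr1", 120, 170),  # inside geneA
    ("chr1", 200, 250),  # inside geneA
    ("chr1", 460, 510),  # overlaps geneA and geneB: ambiguous
    ("chr1", 600, 650),  # inside geneB
    ("chr2", 150, 200),  # inside geneC
    ("chr2", 400, 450),  # overlaps no gene
]


def overlapping_genes(read, genes):
    """Return the names of all genes whose interval overlaps the read."""
    chrom, start, end = read
    return [
        name
        for name, (g_chrom, g_start, g_end) in genes.items()
        if g_chrom == chrom and start <= g_end and end >= g_start
    ]


def count_reads(reads, genes):
    """Count reads per gene, skipping reads that hit zero or several genes."""
    counts = Counter({name: 0 for name in genes})
    for read in reads:
        hits = overlapping_genes(read, genes)
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts


counts = count_reads(reads, genes)
print(dict(counts))
```

The resulting gene-by-sample count table is what goes into the normalization step.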
Perform normalization and quantification
EdgeR:
library(edgeR)
### read the count matrix (assumes a comma-separated file with gene IDs in the first column and one column per sample)
counts <- read.csv(file = "All_counts.csv", row.names = 1)
counts <- na.omit(counts)
counts <- counts[rowSums(counts) > 0, ] ### drop genes with zero counts in every sample

### start edgeR ###
group <- factor(rep(c("DU145.miR1","DU145.miR148a","DU145.miR148b","DU145.miR152"), each = 3))
y <- DGEList(counts = counts, group = group) ### convert count matrix to a DGEList object
design <- model.matrix(~0+group) ### experimental design: one column per group
keep <- which(rowMeans(cpm(y)) > 10); y <- y[keep,] ### remove genes with really low counts per million
y$samples$lib.size <- colSums(y$counts) ### re-calculate the library size after removing genes with low CPM
y <- calcNormFactors(y) ### calculate between-sample normalization factors
y <- estimateGLMRobustDisp(y, design) ### estimate gene-wise dispersions (robust to outliers)
fit <- glmFit(y, design) ### fit the "massaged data" to a generalized linear model

### perform Likelihood Ratio Test on each contrast ###
### the design has four columns (one per group), so each contrast needs four entries; DU145.miR1 is the baseline
lrt.du145.mir148a <- glmLRT(fit, contrast=c(-1,1,0,0))
lrt.du145.mir148b <- glmLRT(fit, contrast=c(-1,0,1,0))
lrt.du145.mir152 <- glmLRT(fit, contrast=c(-1,0,0,1))

### generate a user-friendly output table ###
tt.du145.mir148a <- topTags(lrt.du145.mir148a, n = Inf, sort.by = "none")
tt.du145.mir148b <- topTags(lrt.du145.mir148b, n = Inf, sort.by = "none")
tt.du145.mir152 <- topTags(lrt.du145.mir152, n = Inf, sort.by = "none")
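The low-count filter in the script keeps genes whose mean counts per million (CPM) across samples exceeds 10. As a sketch of the arithmetic behind that step, using an invented three-gene, three-sample count table: CPM is a gene's count divided by the sample's library size, times one million. This mirrors what cpm() does before any normalization factors are calculated; it is not a substitute for edgeR.

```python
# Hypothetical raw counts: gene -> counts in three samples (not real data).
counts = {
    "geneA": [500, 450, 520],
    "geneB": [2, 0, 1],
    "geneC": [999498, 999550, 999479],
}

# Library size of each sample = total counts in that column.
lib_sizes = [sum(col) for col in zip(*counts.values())]

# Counts per million: scale each count by its sample's library size.
cpm = {
    gene: [c * 1e6 / lib for c, lib in zip(row, lib_sizes)]
    for gene, row in counts.items()
}

# Keep genes whose mean CPM across samples is above the threshold used in the script.
threshold = 10
keep = [gene for gene, row in cpm.items() if sum(row) / len(row) > threshold]

print(lib_sizes)
print(cpm["geneB"])
print(keep)
```

geneB is dropped because it averages 1 CPM, which is why the library sizes are recalculated afterwards in the edgeR script.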
Expected Results
Gene      du145-148a  du145-148b  du145-152
MAL2      -3.559      -1.869      -4.668
CDH1      -2.634      -2.173      -4.030
ERRFI1    -1.209      -0.824      -1.595
PPP6R1    -1.015      -0.546      -1.082
NTSR1     -0.954      -2.126      -1.314
ITGA5     -0.865      -0.928      -1.077
PPAP2B    -0.616      -0.476      -1.413
MCAM       0.407       0.702       1.622
IGFBP5     1.897       1.398       2.848
GPC4       2.106       2.415       3.420
CCL2       2.114       2.758       2.956
Long Read (> 1 kb) RNA-seq
Long read analysis is performed with essentially the same workflow.
For alignment, STAR or GMAP work equally well.
Questions?