Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Upload: gertrude-foster

Post on 28-Dec-2015

Page 1: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Bioinformatics and OMICs Group Meeting
REFERENCE GUIDED RNA SEQUENCING

Page 2: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Hi

Name: David Oliver

Advisor: Dr. Shtutman

Research: Understanding the role of COPZ2 silencing in cancer progression using RNA-seq to identify transcriptional changes caused by the loss of COPZ2 and its encoded microRNA.

Experience: Microarray analysis, multiple RNA-seq analyses including long-read (PacBio) and short-read (Illumina) sequencing experiments.

Page 3: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Why RNA-seq

What's the question?
◦ Differential expression
◦ Differential splicing

Advantage over other technologies
◦ Increased sensitivity
◦ Increased reproducibility

RNA-Seq vs Dual- and Single-Channel Microarray Data: Sensitivity Analysis for Differential Expression and Clustering. Alina Sîrbu, Gráinne Kerr, Martin Crane, Heather J. Ruskin. Published: December 10, 2012. DOI: 10.1371/journal.pone.0050986

Page 4: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Before You Start
◦ Consult a statistician
◦ Consult your sequencing core

Page 8: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Actually Doing RNA-seq
Minimum Requirements
◦ Have consulted a statistician and your sequencing core
◦ Know that your question can be answered using sequencing technology and that the experimental design is appropriate
◦ > 10,000,000 reads per sample
    ◦ Much more depth required for differential splicing
◦ ≥ 3 biological replicates
◦ Access to a decent amount of computing power
    ◦ Can be done on a laptop, but it takes ~3 weeks (ask me how I know)
◦ Basic knowledge of Unix systems and R
    ◦ Or know someone who is willing to help you
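
One rough way to check the read-depth requirement above is to count the records in each FASTQ file before starting. The sketch below assumes standard four-line FASTQ records with no empty sequence lines; the function names and the threshold constant are illustrative and do not come from the slides.

```python
import gzip
from typing import Iterable

MIN_READS = 10_000_000  # threshold from the slide: > 10,000,000 reads per sample


def count_fastq_reads(lines: Iterable[str]) -> int:
    """Count records in FASTQ text, assuming exactly four lines per read."""
    n_lines = sum(1 for line in lines if line.strip())
    if n_lines % 4 != 0:
        raise ValueError(f"{n_lines} non-empty lines is not a multiple of 4")
    return n_lines // 4


def has_enough_depth(n_reads: int, minimum: int = MIN_READS) -> bool:
    """True when the sample exceeds the minimum read count."""
    return n_reads > minimum


def count_reads_in_file(path: str) -> int:
    """Count reads in a plain or gzip-compressed FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_fastq_reads(handle)
```

For example, `has_enough_depth(count_reads_in_file("sample1.fastq"))` would report whether a sample clears the bar for differential expression; differential splicing needs considerably more depth than this minimum.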

Page 9: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Actually Doing RNA-seq
Suggested Pipeline
◦ Quality assessment:
    ◦ FastQC
    ◦ FastX toolkit
◦ Alignment:
    ◦ Bowtie2/Tophat2
    ◦ STAR
    ◦ NovoAlign
◦ Counting reads:
    ◦ FeatureCounts
    ◦ Gencode annotation
◦ Differential expression analysis
    ◦ edgeR
◦ Manipulating sequencing files
    ◦ Samtools, bamtools

[Figure: pipeline flowchart. Biological System → Total RNA or mRNA → RNA-Seq → Raw Reads → Quality Filtering (fastQC) → Align to genome (NovoAlign, BowTie2, or STAR, against the Target Genome) → Read Counting (FeatureCount, Gencode) → Normalization/Quantification (edgeR) → RNA expression levels]

Page 10: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

RNA-seq Walkthrough
◦ Check some quality markers
◦ Getting the target genome
◦ Build aligner-specific indexed genome
◦ Perform alignment
◦ Do some file manipulation
◦ Get the annotation file
◦ Count reads
◦ Perform normalization and quantification

Page 12: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Check some quality markers

FastQC
◦ Basic tool for generating reports
◦ Java based
◦ Does not provide tools for correcting errors (use the FastX toolkit for that)
◦ http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Other tools
◦ FASTX toolkit: for fixing some problems with datasets (adapter trimming, readthrough error correction, etc.)
◦ SAMstat: a tool for alignment QC
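
To give a feel for what a per-base quality report summarizes, here is a minimal sketch that computes the mean Phred score at each read position. It is not how FastQC is implemented; it assumes four-line FASTQ records and Phred+33 quality encoding, and the function name is made up for illustration.

```python
from typing import Iterable, List


def mean_quality_per_position(lines: Iterable[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position for four-line FASTQ records."""
    totals: List[int] = []   # summed quality at each position
    counts: List[int] = []   # number of reads long enough to reach each position
    for i, line in enumerate(lines):
        if i % 4 != 3:       # the quality string is the fourth line of each record
            continue
        for pos, char in enumerate(line.rstrip("\r\n")):
            if pos == len(totals):
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]
```

A steady drop in these means toward the 3' end of the reads is the kind of pattern that would prompt trimming with a tool such as the FASTX toolkit.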

Page 14: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Getting the target genome
http://genome.ucsc.edu/

Page 19: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Build aligner-specific indexed genome

This step is performed by the aligner and takes a variable amount of time depending on the type of index used and the size of the genome to be indexed.
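
As a sketch of what "performed by the aligner" looks like in practice, the helper below assembles the index-building command for two of the aligners named in the pipeline. The helper and the file names are hypothetical; the flags are the ones documented for `bowtie2-build` and STAR's `genomeGenerate` mode, but check them against the installed version before relying on them.

```python
from typing import List


def index_command(aligner: str, fasta: str, out: str, threads: int = 1) -> List[str]:
    """Return the argument list that builds a genome index for the given aligner."""
    if aligner == "bowtie2":          # TopHat2 aligns against a Bowtie2 index
        cmd = ["bowtie2-build"]
        if threads > 1:               # older bowtie2-build releases may lack --threads
            cmd += ["--threads", str(threads)]
        return cmd + [fasta, out]     # out is the index basename, e.g. Indexes/hg38_index
    if aligner == "star":
        return ["STAR", "--runMode", "genomeGenerate",
                "--runThreadN", str(threads),
                "--genomeDir", out,   # for STAR, out is a directory
                "--genomeFastaFiles", fasta]
    raise ValueError(f"no index recipe for aligner {aligner!r}")
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building the index is a one-time cost per genome and aligner, so the result is worth keeping for later experiments.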

Page 21: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Perform alignment

tophat2 -p 12 --no-coverage-search --b2-N 1 --b2-L 32 --b2-i S,1,0.5 --b2-D 250 --b2-R 25 -o $RNAwork/ $RNAwork/Indexes/hg38_index $RNAwork/sample1.fastq

Reads:
    Input    : 20889144
    Mapped   : 18935684 (90.6% of input)
    of these : 2674218 (14.1%) have multiple alignments (436 have >20)
90.6% overall read mapping rate.
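
The percentages in this summary can be recomputed from the raw counts, which is a handy sanity check when collecting summaries across many samples. The sketch below parses the text shown on the slide; the regular expressions are written against that text only and are an assumption about the layout, not a general TopHat2 parser.

```python
import re
from typing import Dict

SUMMARY = """\
Reads:
    Input    : 20889144
    Mapped   : 18935684 (90.6% of input)
    of these : 2674218 (14.1%) have multiple alignments (436 have >20)
90.6% overall read mapping rate.
"""


def parse_summary(text: str) -> Dict[str, int]:
    """Extract the raw read counts from the alignment summary text."""
    patterns = {
        "input": r"Input\s*:\s*(\d+)",
        "mapped": r"Mapped\s*:\s*(\d+)",
        "multi": r"of these\s*:\s*(\d+)",
    }
    return {key: int(re.search(pat, text).group(1)) for key, pat in patterns.items()}


def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place, as in the summary."""
    return round(100.0 * part / whole, 1)


counts = parse_summary(SUMMARY)
mapping_rate = percent(counts["mapped"], counts["input"])   # share of input reads that aligned
multi_rate = percent(counts["multi"], counts["mapped"])     # share of aligned reads with several hits
```

Note that the multiple-alignment percentage is taken relative to the mapped reads, not the input reads.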

Page 23: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Do some file manipulation

Depending on which aligner you choose to use, the aligned sequences may be output as a BAM or SAM file.
◦ Sequence alignment/map format (SAM)
    ◦ Contains all the alignment information plus room for user-defined information about the alignments
◦ Binary alignment/map format (BAM)
    ◦ A binary version of the SAM file
    ◦ Added benefit of being much smaller and quickly accessed by other software
    ◦ Not all software can manage the conversion from BAM back to SAM

To manipulate these formats (e.g., sort, remove duplicates, remove unaligned sequences), use either samtools or bamtools.
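
To illustrate what "remove unaligned sequences" means at the file level, here is a small sketch that filters SAM text on the FLAG field. In practice samtools or bamtools would do this (and BAM cannot be read as text like this); the function name is illustrative. The only facts relied on are from the SAM format: header lines start with "@", FLAG is the second tab-separated column, and bit 0x4 marks an unmapped read.

```python
from typing import Iterable, Iterator

UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped


def drop_unmapped(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines and aligned records; skip records flagged as unmapped."""
    for line in sam_lines:
        if line.startswith("@"):          # header lines pass through unchanged
            yield line
            continue
        flag = int(line.split("\t")[1])   # FLAG is the second tab-separated column
        if not flag & UNMAPPED:
            yield line
```

This mirrors the effect of `samtools view -h -F 4` on a SAM file, which is the more usual way to apply the same filter.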

Page 25: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Get the annotation file

Annotation files are readily available from multiple sources
◦ Gencode ( http://www.gencodegenes.org/releases/ )
◦ Ensembl ( http://useast.ensembl.org/info/data/ftp/index.html?redirect=no )
◦ Vega ( http://vega.sanger.ac.uk/info/about/data_access.html )
◦ RefSeq ( http://www.ncbi.nlm.nih.gov/refseq/ )

These annotation sources mainly vary in the number of non-coding RNAs which have been annotated.
◦ RefSeq < Gencode < Ensembl < Vega

We use Gencode
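
One way to see how an annotation source treats non-coding RNAs is to tally the gene records in its GTF file by biotype. The sketch below assumes a Gencode-style GTF, where each line has nine tab-separated columns and the last column holds `key "value";` attributes including `gene_type` (Ensembl files name this attribute `gene_biotype`). The function name is illustrative.

```python
import re
from collections import Counter
from typing import Iterable

ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')   # matches key "value" pairs in column 9


def gene_type_counts(gtf_lines: Iterable[str], key: str = "gene_type") -> Counter:
    """Count 'gene' records in GTF text, grouped by the given biotype attribute."""
    counts: Counter = Counter()
    for line in gtf_lines:
        if line.startswith("#"):              # skip comment/header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue                          # only count gene-level records
        attributes = dict(ATTRIBUTE.findall(fields[8]))
        counts[attributes.get(key, "unknown")] += 1
    return counts
```

Running this over two annotation files for the same genome build gives a quick side-by-side view of how many non-coding genes each one would contribute to read counting.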

Page 27: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Count Reads

FeatureCounts
◦ We used to use HTseq-Count, which was quite nice, but we've switched to FeatureCounts because it is much, much, much faster.
◦ Also comes as an R package (bioc::Rsubread)

HTSeq-count documentation: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
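
Conceptually, read counting assigns each aligned read to the gene it overlaps. The toy sketch below shows that idea with a plain linear scan; it is not the algorithm used by FeatureCounts or HTSeq-Count, it ignores strand and exon structure, and the names and coordinates are invented for illustration. Reads that overlap more than one gene are left unassigned here, which is how such counters commonly treat ambiguous reads by default.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Interval = Tuple[str, int, int]   # (chromosome, start, end), 1-based inclusive


def count_reads(genes: Dict[str, Interval], reads: Iterable[Interval]) -> Counter:
    """Count reads per gene; reads hitting zero or several genes are not assigned."""
    counts: Counter = Counter({name: 0 for name in genes})
    for chrom, start, end in reads:
        hits = [name for name, (g_chrom, g_start, g_end) in genes.items()
                if g_chrom == chrom and start <= g_end and end >= g_start]
        if len(hits) == 1:        # unambiguous overlap with exactly one gene
            counts[hits[0]] += 1
    return counts
```

Checking every read against every gene is far too slow for millions of reads, which is why speed differences between counting tools matter as much as the slide suggests.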

Page 29: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Perform normalization and quantification

EdgeR:
library(edgeR)

### read the count table (assumes comma-separated values, a header row, and gene IDs in the first column) ###
counts <- read.table(file = "All_counts.csv", header = TRUE, sep = ",", row.names = 1)
counts <- na.omit(counts)
counts <- counts[rowSums(counts) > 0, ]   ### drop genes with zero counts in every sample

### start edgeR ###
group <- factor(rep(c("DU145.miR1","DU145.miR148a","DU145.miR148b","DU145.miR152"), each = 3))
y <- DGEList(counts = counts, group = group)   ### convert count matrix to a DGEList object
design <- model.matrix(~0+group)   ### experimental design: one column per group
keep <- which(rowMeans(cpm(y)) > 10); y <- y[keep,]   ### remove genes with really low counts per million
y$samples$lib.size <- colSums(y$counts)   ### re-calculate the library size after removing genes with low CPM
y <- calcNormFactors(y)   ### calculate between-sample normalization factors
y <- estimateGLMRobustDisp(y, design)   ### estimate robust per-gene dispersions
fit <- glmFit(y, design)   ### fit the "massaged data" to a generalized linear model

### perform Likelihood Ratio Test on each contrast (each miRNA against the first group, DU145.miR1) ###
### the design above has four columns, so each contrast vector needs four entries ###
lrt.du145.mir148a <- glmLRT(fit, contrast = c(-1,1,0,0))
lrt.du145.mir148b <- glmLRT(fit, contrast = c(-1,0,1,0))
lrt.du145.mir152 <- glmLRT(fit, contrast = c(-1,0,0,1))

### generate a user-friendly output table ###
tt.du145.mir148a <- topTags(lrt.du145.mir148a, n = Inf, sort.by = "none")
tt.du145.mir148b <- topTags(lrt.du145.mir148b, n = Inf, sort.by = "none")
tt.du145.mir152 <- topTags(lrt.du145.mir152, n = Inf, sort.by = "none")
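
The low-expression filter in the code above keeps genes whose mean counts per million (CPM) across samples exceeds 10. As a worked illustration of that arithmetic outside R, here is a minimal Python sketch. It uses plain library sizes (column sums) with no normalization factors, which is the situation before normalization factors are calculated. The function names and toy numbers are made up, and it assumes every sample has a non-zero library size.

```python
from typing import Dict, List


def cpm(counts: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """Counts per million: each count divided by its sample's library size, times 1e6."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {gene: [1e6 * c / lib_sizes[j] for j, c in enumerate(row)]
            for gene, row in counts.items()}


def filter_low_expression(counts: Dict[str, List[int]],
                          min_mean_cpm: float = 10.0) -> Dict[str, List[int]]:
    """Keep genes whose mean CPM across samples is above the threshold."""
    values = cpm(counts)
    return {gene: row for gene, row in counts.items()
            if sum(values[gene]) / len(values[gene]) > min_mean_cpm}
```

Because CPM scales by library size, the same raw count can pass the filter in a shallow library and fail it in a deep one, which is why the threshold is applied to CPM and not to raw counts.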

Page 30: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Expected Results

Gene      du145-148a   du145-148b   du145-152
MAL2        -3.559       -1.869       -4.668
CDH1        -2.634       -2.173       -4.030
ERRFI1      -1.209       -0.824       -1.595
PPP6R1      -1.015       -0.546       -1.082
NTSR1       -0.954       -2.126       -1.314
ITGA5       -0.865       -0.928       -1.077
PPAP2B      -0.616       -0.476       -1.413
MCAM         0.407        0.702        1.622
IGFBP5       1.897        1.398        2.848
GPC4         2.106        2.415        3.420
CCL2         2.114        2.758        2.956

Page 31: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Long Read (> 1kb) RNA-seq

Long read analysis is performed with essentially the same workflow.

For alignment, STAR or GMAP work equally well.

Page 32: Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING

Questions?