Introduction to RNAseq

Upload: imogene-bell

Post on 17-Jan-2016

Page 1: Introduction to RNAseq. NGS - Quick Recap Many applications -> research intent determines technology platform choice High volume data BUT error prone

Introduction to RNAseq

NGS - Quick Recap

• Many applications -> research intent determines technology platform choice

• High volume data BUT error prone

• FASTQ is accepted format standard

• Must assess quality scores before proceeding

• ‘Bad’ data can be rescued
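To make the quality-score check concrete, here is a minimal Python sketch that decodes the quality line of one FASTQ record and reports its mean Phred score. It assumes Phred+33 encoding (Sanger / Illumina 1.8+); older Illumina pipelines used an offset of 64. The record, the function names, and the cutoff of 20 are illustrative assumptions, not part of the slides.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 assumes Sanger / Illumina 1.8+ encoding (an assumption).
    """
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line, offset=33):
    """Mean Phred score of one read."""
    scores = phred_scores(quality_line, offset)
    return sum(scores) / len(scores)


# A made-up four-line FASTQ record: header, sequence, separator, qualities.
record = ["@read1", "GATTACA", "+", "IIIII##"]
print(phred_scores(record[3]))
print(mean_quality(record[3]) >= 20)  # 20 is a hypothetical cutoff
```

In practice a dedicated QC tool would be run over the whole file; the sketch only shows what the quality line encodes.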

The Central Dogma of Molecular Biology

[Figure: central dogma diagram; labeled step: Reverse Transcription]

RNAseq Protocols

• cDNA, not RNA sequencing

• Types of libraries available:
  – Total RNA sequencing (not advised)
  – polyA+ RNA sequencing
  – Small RNA sequencing (specific size range targeted)

cDNA Synthesis

Genome-scale Applications

• Transcriptome analysis

• Identifying new transcribed regions

• Expression profiling

• Resequencing to find genetic polymorphisms:
  – SNPs, micro-indels
  – CNVs
  – Question: Why even bother with exome sequencing then?

What about microarrays??!!!

• Assumes we know all transcribed regions and that spliceforms are not important

• Cannot find anything novel

• BUT may be the best choice depending on QUESTION

Arrays vs RNAseq (1)

• Correlation of fold change between arrays and RNAseq is similar to correlation between array platforms (0.73)

• Technical replicates almost identical

• Extra analysis: prediction of alternative splicing, SNPs

• Low- and high-expressed genes do not match

RNA-Seq promises/pitfalls

• Can reveal in a single assay:
  – new genes
  – splice variants
  – quantify genome-wide gene expression

• BUT
  – Data is voluminous and complex
  – Need scalable, fast and mathematically principled analysis software and LOTS of computing resources

Experimental considerations

• Comparative conditions must make biological sense

• Biological replicates are always better than technical ones

• Aim for at least 3 replicates per condition

• ISOLATE the target mRNA species you are after

Analysis strategies

• De novo assembly of transcripts:
  + re-constructs actual spliced transcripts
  + does not require genome sequence
  + easier to work with post-transcriptional modifications
  - requires huge computational resources (RAM)
  - low sensitivity: hard to capture low abundance transcripts

• Alignment to the genome => Transcript assembly
  + computationally feasible
  + high sensitivity
  + easier to annotate using genomic annotations
  - need to take special care of splice junctions

Basic analysis flowchart

[Flowchart. Box labels recovered from the slide, in approximate order; the arrows are not recoverable:]

• Illumina reads
• Remove artifacts (AAA..., ...N...)
• Clip adapters (small RNA)
• Pre-filter: low complexity, synthetic
• Count and discard
• "Collapse" identical reads
• Align to the genome (outputs: mapped, un-mapped)
• Re-align with different number of mismatches etc. (outputs: mapped, un-mapped)
• Assemble: contigs (exons) + connectivity
• Filter out low confidence contigs (singletons)
• Annotate
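Two of the flowchart steps, removing artifact reads and collapsing identical reads, can be sketched in a few lines of Python. This is an illustration only: the function names, the example reads, and both thresholds (no N calls allowed, 90% of a read being one base counts as a homopolymer artifact) are assumptions, not values from the slides.

```python
from collections import Counter


def is_artifact(read, max_n=0, homopolymer_fraction=0.9):
    """Flag reads with N calls or dominated by a single base (e.g. AAA...).

    Both thresholds are illustrative assumptions.
    """
    if not read:
        return True
    if read.count("N") > max_n:
        return True
    most_common_count = Counter(read).most_common(1)[0][1]
    return most_common_count / len(read) >= homopolymer_fraction


def collapse_reads(reads):
    """Drop artifact reads, then count how often each sequence occurs."""
    return Counter(r for r in reads if not is_artifact(r))


# Made-up reads: one duplicate pair, one poly-A read, one read with an N.
reads = ["ACGTACGT", "ACGTACGT", "AAAAAAAA", "ACGNACGT", "TTGACCAT"]
for sequence, count in collapse_reads(reads).items():
    print(sequence, count)
```

Collapsing before alignment means each distinct sequence is aligned once, with its count carried along for later quantification.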

Software

• Short-read aligners
  • BWA, Novoalign, Bowtie, TOPHAT (eukaryotes)

• Data preprocessing
  • Fastx toolkit, samtools

• Expression studies
  • Cufflinks package, R packages (DESeq, edgeR, more…)

• Alternative splicing
  • Cufflinks, Augustus

The ‘Tuxedo’ protocol

• TOPHAT + CUFFLINKS

• TopHat aligns reads to genome and discovers splice sites

• Cufflinks predicts transcripts present in dataset

• Cuffdiff identifies differential expression

Very widely adopted suite
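As a rough orientation, the order in which the tools are chained can be sketched as a Python function that only builds the command lines (it runs nothing). All file and directory names here are hypothetical, and only a few basic options are shown (-o for the output directory, -g and -s for the reference annotation and genome in cuffmerge, -L for sample labels in cuffdiff); check each tool's manual for your installed version before relying on it.

```python
def tuxedo_commands(index, reads_by_sample, reference_gtf, genome_fasta):
    """Return TopHat -> Cufflinks -> Cuffmerge -> Cuffdiff command lines.

    Nothing is executed; each command is a list of arguments.
    All file and directory names are hypothetical.
    """
    commands = []
    bams = []
    for sample, fastq in reads_by_sample.items():
        tophat_dir = f"{sample}_tophat"
        # TopHat: spliced alignment of the reads against a Bowtie index.
        commands.append(["tophat", "-o", tophat_dir, index, fastq])
        bam = f"{tophat_dir}/accepted_hits.bam"
        bams.append(bam)
        # Cufflinks: assemble transcripts from the aligned reads.
        commands.append(["cufflinks", "-o", f"{sample}_cufflinks", bam])
    # Cuffmerge: merge the per-sample assemblies; assemblies.txt is assumed
    # to list each sample's transcripts.gtf, one path per line.
    commands.append(
        ["cuffmerge", "-g", reference_gtf, "-s", genome_fasta, "assemblies.txt"]
    )
    # Cuffdiff: compare expression between the labeled samples.
    labels = ",".join(reads_by_sample)
    commands.append(
        ["cuffdiff", "-o", "diff_out", "-L", labels, "merged_asm/merged.gtf", *bams]
    )
    return commands


for cmd in tuxedo_commands(
    "genome", {"q1": "q1.fastq", "q2": "q2.fastq"}, "genes.gtf", "genome.fa"
):
    print(" ".join(cmd))
```

With replicates, cuffdiff takes the alignment files of one condition as a comma-separated group; the sketch uses a single file per condition to stay short.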

Read alignment with TopHat

• Uses BOWTIE aligner to align reads to genome

• BOWTIE cannot deal with large gaps (introns)

• TopHat splits reads that remain unaligned into smaller segments

• Smaller segments mostly end up aligning
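The segmentation idea can be illustrated with a short Python sketch. It simply cuts a read into consecutive fixed-length pieces and keeps each piece's offset within the read; the segment length of 25 and the function name are assumptions made for this example, and the real tool's handling of read ends is more involved.

```python
def split_into_segments(read, segment_length=25):
    """Split a read into consecutive segments of at most segment_length bases.

    Returns (offset, segment) pairs. The length of 25 is an assumed value
    for this sketch.
    """
    return [
        (start, read[start:start + segment_length])
        for start in range(0, len(read), segment_length)
    ]


# A made-up 75-base read yields three 25-base segments.
read = "A" * 25 + "C" * 25 + "G" * 25
for offset, segment in split_into_segments(read):
    print(offset, segment)
```

Each segment is short enough that it is unlikely to span an intron, which is why the segments tend to align even when the full read does not.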

Read alignment with TopHat (2)

Read alignment with TopHat (3)

• When there is a large gap between segments of same read -> probable INTRON

• TopHat uses this to build an index of probable splice sites

• Allows accurate measurement of spliceform expression
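The gap-implies-intron reasoning can be written down as a small Python sketch. It takes the genomic start positions at which the consecutive segments of one read aligned and reports any large gap between neighbouring segments as a candidate intron. The function name, the segment length of 25, and the minimum gap of 50 bases are assumptions for illustration; it also assumes the splice junction falls exactly on a segment boundary, which a real aligner cannot assume.

```python
def infer_introns(segment_starts, segment_length=25, min_intron=50):
    """Report genomic gaps between consecutive segments of one read.

    segment_starts: 0-based genomic start positions of the read's segments,
    in read order, all on the same chromosome and strand.
    A gap of at least min_intron bases between the end of one segment and
    the start of the next is returned as a candidate intron (start, end),
    with end exclusive. Both thresholds are illustrative assumptions.
    """
    introns = []
    for left, right in zip(segment_starts, segment_starts[1:]):
        left_end = left + segment_length
        if right - left_end >= min_intron:
            introns.append((left_end, right))
    return introns


# Segments 1 and 2 are adjacent; segment 3 aligns 2,000 bases downstream.
print(infer_introns([1000, 1025, 3050]))
```

Candidate introns collected over many reads are what would feed an index of probable splice sites.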

Cufflinks package

• http://cufflinks.cbcb.umd.edu/

• Cufflinks:
  – Expression values calculation
  – Transcripts de novo assembly

• Cuffdiff:
  – Differential expression analysis

Cufflinks: Transcript assembly

• Assembles individual transcripts based on aligned reads

• Infers likely spliceforms of each gene

• Quantifies expression level of each

Cuffmerge

• Merges transfrags into transcripts where appropriate

• Also performs a reference-based assembly of transcripts using known transcripts

• Produces single annotation file which aids downstream analysis

Cuffdiff: Differential expression

• Calculates expression level in two or more samples

• Expression level relates to read abundance

• Because of bias sources, cuffdiff tries to model the variance in its significance calculation

FPKM (RPKM): Expression Values

C = the number of reads mapped onto the gene's exons
N = total number of reads in the experiment
L = the sum of the exons in base pairs
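With these three quantities, the standard RPKM definition (reads per kilobase of exon per million reads) is RPKM = 10^9 × C / (N × L). FPKM is the same calculation with fragments counted instead of reads, which matters for paired-end data. A minimal Python sketch follows; the function name and the example numbers are made up for illustration.

```python
def rpkm(c, n, l):
    """Reads per kilobase of exon per million reads.

    c: reads mapped onto the gene's exons
    n: total number of reads in the experiment
    l: summed exon length in base pairs
    """
    if n <= 0 or l <= 0:
        raise ValueError("n and l must be positive")
    return 1e9 * c / (n * l)


# Made-up example: 500 reads on a gene with 2,000 bp of exons,
# in an experiment of 20 million reads.
print(rpkm(500, 20_000_000, 2000))
```

Dividing by L corrects for longer genes collecting more reads, and dividing by N corrects for sequencing depth, so values are comparable across genes and samples.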

Cuffdiff (differential expression)

• Pairwise or time series comparison

• Normal distribution of read counts

• Fisher’s test

test_id          gene    locus                     sample_1  sample_2  status  value_1  value_2  ln(fold_change)  test_stat  p_value   significant
ENSG00000000003  TSPAN6  chrX:99883666-99894988    q1        q2        NOTEST  0        0        0                0          1         no
ENSG00000000005  TNMD    chrX:99839798-99854882    q1        q2        NOTEST  0        0        0                0          1         no
ENSG00000000419  DPM1    chr20:49551403-49575092   q1        q2        NOTEST  15.0775  23.8627  0.459116         -1.39556   0.162848  no
ENSG00000000457  SCYL3   chr1:169631244-169863408  q1        q2        OK      32.5626  16.5208  -0.678541        15.8186    0         yes
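The example output can be read programmatically with very little code. The Python sketch below parses the four rows shown on the slide, recomputes ln(fold_change) as ln(value_2 / value_1), and lists the genes flagged as significant. The rows are embedded as whitespace-separated text for the sake of a self-contained example (a real Cuffdiff file would be read from disk and split on tabs), and the function names are assumptions.

```python
import math

COLUMNS = [
    "test_id", "gene", "locus", "sample_1", "sample_2", "status",
    "value_1", "value_2", "ln_fold_change", "test_stat", "p_value",
    "significant",
]

# Rows copied from the slide's example output.
CUFFDIFF_ROWS = """\
ENSG00000000003 TSPAN6 chrX:99883666-99894988 q1 q2 NOTEST 0 0 0 0 1 no
ENSG00000000005 TNMD chrX:99839798-99854882 q1 q2 NOTEST 0 0 0 0 1 no
ENSG00000000419 DPM1 chr20:49551403-49575092 q1 q2 NOTEST 15.0775 23.8627 0.459116 -1.39556 0.162848 no
ENSG00000000457 SCYL3 chr1:169631244-169863408 q1 q2 OK 32.5626 16.5208 -0.678541 15.8186 0 yes
"""


def parse_rows(text):
    """Turn whitespace-separated rows into dictionaries keyed by column."""
    return [dict(zip(COLUMNS, line.split())) for line in text.strip().splitlines()]


def ln_fold_change(value_1, value_2):
    """Natural log of value_2 / value_1.

    Returns 0.0 when either value is 0, mirroring the rows above.
    """
    if value_1 <= 0 or value_2 <= 0:
        return 0.0
    return math.log(value_2 / value_1)


rows = parse_rows(CUFFDIFF_ROWS)
significant = [
    r["gene"] for r in rows if r["status"] == "OK" and r["significant"] == "yes"
]
print(significant)
for r in rows:
    print(r["gene"], round(ln_fold_change(float(r["value_1"]), float(r["value_2"])), 4))
```

Note that rows with status NOTEST were not tested at all, so their p-values should not be interpreted.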

Recommendations

• You can use BOWTIE or BOWTIE2 but

• Use CUFFDIFF2
  – Better statistical model
  – Detection of truly differentially expressed genes
  – VERY easy to parse output file (See example on course page)