Introduction to RNAseq

NGS - Quick Recap

• Many applications -> research intent determines technology platform choice

• High volume data BUT error prone

• FASTQ is accepted format standard

• Must assess quality scores before proceeding

• ‘Bad’ data can be rescued
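As a rough illustration of assessing quality scores before proceeding, the sketch below decodes FASTQ quality strings and keeps reads whose mean Phred score clears a threshold. It assumes 4-line records and Phred+33 encoding; the function names, the example reads, and the cutoff of 20 are illustrative choices, not part of the course material.

```python
from statistics import mean


def parse_fastq(text):
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Decode an ASCII quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


def passing_reads(text, min_mean_q=20):
    """Return IDs of reads whose mean Phred score is at least min_mean_q."""
    return [
        read_id
        for read_id, _seq, qual in parse_fastq(text)
        if mean(phred_scores(qual)) >= min_mean_q
    ]


# Two made-up reads: one high quality ('I' = Q40), one low quality ('#' = Q2).
EXAMPLE_FASTQ = """@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTACGT
+
########
"""

kept = passing_reads(EXAMPLE_FASTQ)
print(kept)
```

In practice a dedicated tool (e.g. the Fastx toolkit listed later) would do this per base position rather than per read.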

The Central Dogma of Molecular Biology

Reverse Transcription

RNAseq Protocols

• cDNA, not RNA sequencing

• Types of libraries available:
– Total RNA sequencing (not advised)
– polyA+ RNA sequencing
– Small RNA sequencing (specific size range targeted)

cDNA Synthesis

Genome-scale Applications

• Transcriptome analysis

• Identifying new transcribed regions

• Expression profiling

• Resequencing to find genetic polymorphisms:
– SNPs, micro-indels
– CNVs
– Question: Why even bother with exome sequencing?

What about microarrays??!!!

• Assumes we know all transcribed regions and that spliceforms are not important

• Cannot find anything novel

• BUT may be the best choice depending on QUESTION

Arrays vs RNAseq (1)

• Correlation of fold change between arrays and RNAseq is similar to correlation between array platforms (0.73)

• Technical replicates almost identical

• Extra analysis: prediction of alternative splicing, SNPs

• Low- and high-expressed genes do not

RNA-Seq promises/pitfalls

• Can reveal in a single assay:
– new genes
– splice variants
– quantify genome-wide gene expression

• BUT
– Data is voluminous and complex
– Need scalable, fast and mathematically principled analysis software and LOTS of computing resources

Experimental considerations

• Comparative conditions must make biological sense

• Biological replicates are always better than technical ones

• Aim for at least 3 replicates per condition

• ISOLATE the target mRNA species you are after

Analysis strategies

• De novo assembly of transcripts:
+ re-constructs actual spliced transcripts
+ does not require genome sequence
+ easier to work with post-transcriptional modifications
- requires huge computational resources (RAM)
- low sensitivity: hard to capture low abundance transcripts

• Alignment to the genome => Transcript assembly
+ computationally feasible
+ high sensitivity
+ easier to annotate using genomic annotations
- need to take special care of splice junctions

Basic analysis flowchart

[Flowchart; box labels: Illumina reads; Remove artifacts (AAA..., ...N...); Clip adapters (small RNA); Pre-filter: low complexity, synthetic; Count and discard; "Collapse" identical; Align to the genome (mapped / un-mapped); Re-align with different number of mismatches etc.; Assemble: contigs (exons) + connectivity; Filter out low confidence contigs (singletons); Annotate]
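Two of the pre-alignment steps in this flowchart, removing artifact reads (poly-A runs, reads with N calls) and collapsing identical reads, can be sketched as below. The thresholds and function names are assumptions made for illustration; real pipelines typically rely on dedicated tools for this.

```python
from collections import Counter


def is_artifact(read, max_n=0, max_single_base_frac=0.9):
    """Flag reads with too many N calls or dominated by one base (e.g. AAA...)."""
    if read.count("N") > max_n:
        return True
    most_common_count = Counter(read).most_common(1)[0][1]
    return most_common_count / len(read) >= max_single_base_frac


def collapse_reads(reads):
    """Drop empty and artifact reads, then count each distinct sequence."""
    clean = [r for r in reads if r and not is_artifact(r)]
    return Counter(clean)


# Made-up reads: a duplicate pair, a poly-A artifact, a read with an N, one unique read.
reads = ["ACGTACGTAC", "ACGTACGTAC", "AAAAAAAAAA", "ACGTNCGTAC", "TTGACCGTAA"]
collapsed = collapse_reads(reads)
print(collapsed)
```

Collapsing keeps the counts, so each distinct sequence only needs to be aligned once while its abundance is preserved.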

Software

• Short-read aligners
– BWA, Novoalign, Bowtie, TopHat (eukaryotes)

• Data preprocessing
– Fastx toolkit, samtools

• Expression studies
– Cufflinks package, R packages (DESeq, edgeR, more…)

• Alternative splicing
– Cufflinks, Augustus

The ‘Tuxedo’ protocol

• TopHat + Cufflinks

• TopHat aligns reads to genome and discovers splice sites

• Cufflinks predicts transcripts present in dataset

• Cuffdiff identifies differential expression

Very widely adopted suite
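A minimal sketch of how the Tuxedo steps chain together, written as a Python function that only builds the command lines (it does not run them). All file names are hypothetical, and the options shown (`-G`, `-o`, `-g`, `-s`, `-L`) and default output names are as I recall them from the published protocol; check them against the versions you have installed.

```python
def tuxedo_commands(index, gtf, genome_fa, samples):
    """Build command lines for a TopHat -> Cufflinks -> Cuffmerge -> Cuffdiff run.

    samples maps a condition label to a list of FASTQ files, one per replicate.
    """
    commands = []
    bams_by_condition = {}
    for condition, fastqs in samples.items():
        for rep, fastq in enumerate(fastqs, start=1):
            name = f"{condition}_R{rep}"
            # Align reads to the genome and discover splice sites.
            commands.append(["tophat", "-G", gtf, "-o", f"{name}_thout", index, fastq])
            bam = f"{name}_thout/accepted_hits.bam"
            # Assemble transcripts for this sample from its alignments.
            commands.append(["cufflinks", "-o", f"{name}_clout", bam])
            bams_by_condition.setdefault(condition, []).append(bam)
    # assemblies.txt is assumed to list each sample's transcripts.gtf (caller writes it).
    commands.append(["cuffmerge", "-g", gtf, "-s", genome_fa, "assemblies.txt"])
    # Replicates of one condition are comma-joined; conditions are separate arguments.
    commands.append(
        ["cuffdiff", "-o", "diff_out", "-L", ",".join(bams_by_condition),
         "merged_asm/merged.gtf"]
        + [",".join(bams) for bams in bams_by_condition.values()]
    )
    return commands


# Hypothetical inputs: two conditions with one replicate each.
cmds = tuxedo_commands(
    "genome", "genes.gtf", "genome.fa",
    {"C1": ["C1_R1.fq"], "C2": ["C2_R1.fq"]},
)
for cmd in cmds:
    print(" ".join(cmd))
```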

Read alignment with TopHat

• Uses BOWTIE aligner to align reads to genome

• BOWTIE cannot deal with large gaps (introns)

• TopHat splits reads that remain unaligned into smaller segments

• Smaller segments mostly end up aligning

Read alignment with TopHat (2)

Read alignment with TopHat (3)

• When there is a large gap between segments of same read -> probable INTRON

• TopHat uses this to build an index of probable splice sites

• Allows accurate measurement of spliceform expression
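The segment idea can be illustrated with a toy example: a read that fails to align as a whole is cut into segments, each segment is placed independently, and a large gap between neighbouring segments is reported as a probable intron. This is a deliberately simplified sketch, not TopHat's algorithm: it uses exact string matching on a made-up genome and assumes the junction falls exactly on a segment boundary.

```python
def split_read(read, segment_len):
    """Cut a read into consecutive fixed-length segments."""
    return [read[i:i + segment_len] for i in range(0, len(read), segment_len)]


def infer_introns(genome, read, segment_len, min_intron=10):
    """Place each segment by exact match and report large gaps between neighbours."""
    if genome.find(read) != -1:
        return []  # the read aligns contiguously, so there is nothing to infer
    positions = [genome.find(seg) for seg in split_read(read, segment_len)]
    introns = []
    for left, right in zip(positions, positions[1:]):
        if left == -1 or right == -1:
            continue  # one of the two segments could not be placed
        left_end = left + segment_len
        if right - left_end >= min_intron:
            introns.append((left_end, right))  # half-open interval of the gap
    return introns


# Made-up sequences: two exons separated by a 27 bp intron.
exon1 = "GATTACAGGCTTAACC"
intron = "GTAAGT" + "C" * 15 + "TTTCAG"
exon2 = "TGCATGCCAGTAAGCT"
genome = exon1 + intron + exon2

# A spliced read: last 8 bases of exon 1 joined to the first 8 bases of exon 2.
read = exon1[-8:] + exon2[:8]
introns = infer_introns(genome, read, segment_len=8)
print(introns)
```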

Cufflinks package

• http://cufflinks.cbcb.umd.edu/

• Cufflinks:
– Expression values calculation
– Transcripts de novo assembly

• Cuffdiff:
– Differential expression analysis

Cufflinks: Transcript assembly

• Assembles individual transcripts based on aligned reads

• Infers likely spliceforms of each gene

• Quantifies expression level of each

Cuffmerge

• Merges transfrags into transcripts where appropriate

• Also performs a reference based assembly of transcripts using known transcripts

• Produces single annotation file which aids downstream analysis

Cuffdiff: Differential expression

• Calculates expression level in two or more samples

• Expression level relates to read abundance

• Because of bias sources, cuffdiff tries to model the variance in its significance calculation

FPKM (RPKM): Expression Values

RPKM = 10^9 × C / (N × L)

C = the number of reads mapped onto the gene's exons
N = total number of reads in the experiment
L = the sum of the exons in base pairs
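A small worked example, assuming the standard definition RPKM = 10^9 × C / (N × L) with C, N and L as defined on this slide. The function name and the input numbers are made up for illustration.

```python
def rpkm(reads_on_exons, total_reads, exon_bp):
    """Reads per kilobase of exon per million mapped reads.

    reads_on_exons: C, reads mapped onto the gene's exons
    total_reads:    N, total number of reads in the experiment
    exon_bp:        L, summed exon length in base pairs
    """
    if total_reads <= 0 or exon_bp <= 0:
        raise ValueError("total_reads and exon_bp must be positive")
    return 1e9 * reads_on_exons / (total_reads * exon_bp)


# 500 reads on a gene with 2,000 bp of exons, in a run of 10 million reads.
value = rpkm(500, 10_000_000, 2_000)
print(value)
```

Dividing by both N and L is what makes values comparable across runs of different depth and across genes of different length.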

Cuffdiff (differential expression)

• Pairwise or time series comparison

• Normal distribution of read counts

• Fisher’s test

test_id          gene    locus                     sample_1  sample_2  status  value_1  value_2  ln(fold_change)  test_stat  p_value   significant
ENSG00000000003  TSPAN6  chrX:99883666-99894988    q1        q2        NOTEST  0        0        0                0          1         no
ENSG00000000005  TNMD    chrX:99839798-99854882    q1        q2        NOTEST  0        0        0                0          1         no
ENSG00000000419  DPM1    chr20:49551403-49575092   q1        q2        NOTEST  15.0775  23.8627  0.459116         -1.39556   0.162848  no
ENSG00000000457  SCYL3   chr1:169631244-169863408  q1        q2        OK      32.5626  16.5208  -0.678541        15.8186    0         yes
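A sketch of parsing Cuffdiff output of this shape, assuming a tab-delimited file with exactly the column names shown in the example (newer Cuffdiff versions name some columns differently, e.g. a log2 fold change). The two data rows are copied from the example; the function names are illustrative.

```python
import csv
import io
import math

EXAMPLE = (
    "test_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\t"
    "ln(fold_change)\ttest_stat\tp_value\tsignificant\n"
    "ENSG00000000419\tDPM1\tchr20:49551403-49575092\tq1\tq2\tNOTEST\t"
    "15.0775\t23.8627\t0.459116\t-1.39556\t0.162848\tno\n"
    "ENSG00000000457\tSCYL3\tchr1:169631244-169863408\tq1\tq2\tOK\t"
    "32.5626\t16.5208\t-0.678541\t15.8186\t0\tyes\n"
)


def significant_genes(handle):
    """Return (gene, ln fold change) for rows with status OK and significant == yes."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["gene"], float(row["ln(fold_change)"]))
        for row in reader
        if row["status"] == "OK" and row["significant"] == "yes"
    ]


def ln_fold_change(value_1, value_2):
    """Natural log of value_2 / value_1, matching the ln(fold_change) column."""
    return math.log(value_2 / value_1)


hits = significant_genes(io.StringIO(EXAMPLE))
print(hits)
print(round(ln_fold_change(32.5626, 16.5208), 4))
```

Rows with status NOTEST are skipped because no test was performed on them, so their p-values carry no information.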

Recommendations

• You can use BOWTIE or BOWTIE2 but

• Use CUFFDIFF2
– Better statistical model
– Detection of truly differentially expressed genes
– VERY easy to parse output file (See example on course page)
