Post on 17-Jan-2016
Introduction to RNAseq
NGS - Quick Recap
• Many applications -> research intent determines technology platform choice
• High volume data BUT error prone
• FASTQ is accepted format standard
• Must assess quality scores before proceeding
• ‘Bad’ data can be rescued
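The quality check in the recap can be sketched in a few lines of Python. This is a minimal illustration, assuming four-line FASTQ records and Phred+33 quality encoding; the function names, the example reads, and the mean-quality threshold of 20 are hypothetical choices, not part of the slides.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual

def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    return sum(ord(c) - offset for c in qual) / len(qual)

def filter_reads(handle, min_mean_q=20):
    """Keep reads whose mean quality reaches the (assumed) threshold."""
    return [(name, seq) for name, seq, qual in read_fastq(handle)
            if mean_phred(qual) >= min_mean_q]

# Two made-up reads: 'I' encodes Q40, '!' encodes Q0.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!I\n"
kept = filter_reads(io.StringIO(EXAMPLE))
print(kept)
```

Dropping whole reads is only one option; trimming low-quality ends is a common way to rescue otherwise usable data.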
The Central Dogma of Molecular Biology
Reverse Transcription
RNAseq Protocols
• cDNA, not RNA sequencing
• Types of libraries available:
  – Total RNA sequencing (not advised)
  – polyA+ RNA sequencing
  – Small RNA sequencing (specific size range targeted)
cDNA Synthesis
Genome-scale Applications
• Transcriptome analysis
• Identifying new transcribed regions
• Expression profiling
• Resequencing to find genetic polymorphisms:
  – SNPs, micro-indels
  – CNVs
  – Question: Why even bother with exome sequencing then?
What about microarrays??!!!
• Assumes we know all transcribed regions and that spliceforms are not important
• Cannot find anything novel
• BUT may be the best choice depending on QUESTION
Arrays vs RNAseq (1)
• Correlation of fold change between arrays and RNAseq is similar to correlation between array platforms (0.73)
• Technical replicates almost identical
• Extra analysis: prediction of alternative splicing, SNPs
• Low- and high-expressed genes do not match
RNA-Seq promises/pitfalls
• Can reveal in a single assay:
  – new genes
  – splice variants
  – genome-wide gene expression levels
• BUT
  – Data is voluminous and complex
  – Need scalable, fast and mathematically principled analysis software and LOTS of computing resources
Experimental considerations
• Comparative conditions must make biological sense
• Biological replicates are always better than technical ones
• Aim for at least 3 replicates per condition
• ISOLATE the target mRNA species you are after
Analysis strategies
• De novo assembly of transcripts:
  + re-constructs actual spliced transcripts
  + does not require genome sequence
  + easier to work with post-transcriptional modifications
  - requires huge computational resources (RAM)
  - low sensitivity: hard to capture low abundance transcripts
• Alignment to the genome => Transcript assembly
  + computationally feasible
  + high sensitivity
  + easier to annotate using genomic annotations
  - need to take special care of splice junctions
Basic analysis flowchart
(Flowchart figure; recoverable box labels, in approximate order)
• Illumina reads
• Remove artifacts (AAA..., ...N...)
• Clip adapters (small RNA)
• Pre-filter: low complexity, synthetic (count and discard)
• "Collapse" identical reads
• Align to the genome -> mapped / un-mapped
• Un-mapped: re-align with different number of mismatches, etc.
• Mapped: assemble contigs (exons) + connectivity
• Filter out low confidence contigs (singletons)
• Annotate
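The pre-filtering and "collapse" steps of the flowchart can be illustrated with a short Python sketch. It assumes reads are plain strings; the rule for calling a read low-complexity (one base making up at least 90% of the read) and all names are hypothetical simplifications, not the criteria used by any particular tool.

```python
from collections import Counter

def is_artifact(read, max_single_base_frac=0.9):
    """Flag reads containing N or dominated by one base (e.g. AAAA...)."""
    if not read or "N" in read:
        return True
    top_count = Counter(read).most_common(1)[0][1]
    return top_count / len(read) >= max_single_base_frac

def prefilter_and_collapse(reads):
    """Discard artifacts (counting them), then collapse identical reads."""
    kept = [r for r in reads if not is_artifact(r)]
    n_discarded = len(reads) - len(kept)
    collapsed = Counter(kept)  # unique sequence -> number of copies
    return collapsed, n_discarded

# Made-up example reads.
reads = ["ACGTACGT", "ACGTACGT", "AAAAAAAA", "ACGNACGT", "TTGCAAGC"]
collapsed, n_discarded = prefilter_and_collapse(reads)
print(collapsed, n_discarded)
```

Keeping the copy number for each collapsed sequence means each unique read is aligned once while the abundance information needed later for expression estimates is preserved.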
Software
• Short-read aligners
  – BWA, Novoalign, Bowtie, TOPHAT (eukaryotes)
• Data preprocessing
  – Fastx toolkit, samtools
• Expression studies
  – Cufflinks package, R packages (DESeq, edgeR, more…)
• Alternative splicing
  – Cufflinks, Augustus
The ‘Tuxedo’ protocol
• TOPHAT + CUFFLINKS
• TopHat aligns reads to genome and discovers splice sites
• Cufflinks predicts transcripts present in dataset
• Cuffdiff identifies differential expression
Very widely adopted suite
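The order of the steps can be sketched as a small Python helper that only builds the command lines, without running anything. It is an outline under stated assumptions: the sample labels, file names, and directory layout are hypothetical, only the basic options (`-o`, `-g`, `-L`) are used, and the text file listing the per-sample assemblies for cuffmerge has to be written separately.

```python
def tuxedo_commands(samples, bowtie_index, reference_gtf, out_root="tuxedo_out"):
    """Build TopHat -> Cufflinks -> Cuffmerge -> Cuffdiff command lines.

    samples: dict mapping a sample label to its FASTQ file (hypothetical names).
    Returns a list of argument lists suitable for subprocess.run().
    """
    commands = []
    bams = {}
    for label, fastq in samples.items():
        tophat_dir = f"{out_root}/tophat_{label}"
        cufflinks_dir = f"{out_root}/cufflinks_{label}"
        # TopHat aligns reads to the genome and writes accepted_hits.bam
        commands.append(["tophat", "-o", tophat_dir, bowtie_index, fastq])
        bams[label] = f"{tophat_dir}/accepted_hits.bam"
        # Cufflinks assembles transcripts from the alignments (transcripts.gtf)
        commands.append(["cufflinks", "-o", cufflinks_dir, bams[label]])
    # Cuffmerge combines the assemblies; assemblies.txt must list every
    # per-sample transcripts.gtf (assumed to be created by the user).
    merge_dir = f"{out_root}/merged"
    commands.append(["cuffmerge", "-o", merge_dir, "-g", reference_gtf,
                     f"{out_root}/assemblies.txt"])
    # Cuffdiff tests for differential expression between the samples
    commands.append(["cuffdiff", "-o", f"{out_root}/cuffdiff",
                     "-L", ",".join(bams), f"{merge_dir}/merged.gtf",
                     *bams.values()])
    return commands

cmds = tuxedo_commands({"q1": "q1.fastq", "q2": "q2.fastq"},
                       bowtie_index="genome_index", reference_gtf="genes.gtf")
for cmd in cmds:
    print(" ".join(cmd))
```

Building the commands as lists keeps the step order and file hand-offs explicit and easy to inspect before anything is executed.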
Read alignment with TopHat
• Uses BOWTIE aligner to align reads to genome
• BOWTIE cannot deal with large gaps (introns)
• TopHat splits reads that remain unaligned into smaller segments
• Smaller segments mostly end up aligning
Read alignment with TopHat (2)
Read alignment with TopHat (3)
• When there is a large gap between segments of same read -> probable INTRON
• TopHat uses this to build an index of probable splice sites
• Allows accurate measurement of spliceform expression
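The segment idea can be shown with a toy Python example. This is not TopHat's algorithm: it uses exact string matching on a made-up genome, places the exon junction exactly on a segment boundary, and uses an arbitrary minimum gap of 10 bases. All sequences and names are hypothetical.

```python
def find_probable_introns(read, genome, seg_len=8, min_gap=10):
    """Split a read into segments, locate each by exact match, and report
    large gaps between consecutive segments as probable introns.

    Returns a list of (intron_start, intron_end) half-open genome coordinates.
    """
    positions = [genome.find(read[i:i + seg_len])
                 for i in range(0, len(read) - seg_len + 1, seg_len)]
    introns = []
    for left, right in zip(positions, positions[1:]):
        if left < 0 or right < 0:
            continue  # one of the two segments did not align
        gap = right - (left + seg_len)
        if gap >= min_gap:
            introns.append((left + seg_len, right))
    return introns

# Toy genome: exon 1, a 24-base intron, exon 2.
exon1 = "ACGTTGCAAGGCTTAC"
intron = "GT" + "A" * 20 + "AG"
exon2 = "CCATGGATTCGAGCTA"
genome = exon1 + intron + exon2

# A spliced read covering the end of exon 1 and the start of exon 2.
read = exon1[-8:] + exon2[:8]

print(genome.find(read))                    # the whole read does not align
print(find_probable_introns(read, genome))  # but its two segments do
```

In real data the junction usually falls inside a segment and matches are not exact, which is why dedicated spliced aligners are needed; the sketch only shows why a large gap between segments of the same read points to an intron.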
Cufflinks package
• http://cufflinks.cbcb.umd.edu/
• Cufflinks:
  – Expression values calculation
  – Transcripts de novo assembly
• Cuffdiff:
  – Differential expression analysis
Cufflinks: Transcript assembly
• Assembles individual transcripts based on aligned reads
• Infers likely spliceforms of each gene
• Quantifies expression level of each spliceform
Cuffmerge
• Merges transfrags into transcripts where appropriate
• Also performs a reference based assembly of transcripts using known transcripts
• Produces single annotation file which aids downstream analysis
Cuffdiff: Differential expression
• Calculates expression level in two or more samples
• Expression level relates to read abundance
• Because of bias sources, cuffdiff tries to model the variance in its significance calculation
FPKM (RPKM): Expression Values
FPKM (RPKM) = 10^9 × C / (N × L)
C = the number of reads mapped onto the gene's exons
N = total number of reads in the experiment
L = the sum of the exon lengths in base pairs
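With C, N and L defined as above, the standard RPKM calculation (reads per kilobase of exon per million mapped reads) is 10^9 × C / (N × L). A minimal Python sketch follows; the function name and the example numbers are made up for illustration.

```python
def fpkm(c, n, l):
    """FPKM/RPKM from the slide's symbols.

    c: reads mapped onto the gene's exons
    n: total number of reads in the experiment
    l: summed exon length in base pairs
    """
    if n <= 0 or l <= 0:
        raise ValueError("n and l must be positive")
    # 10^9 = 10^3 (per kilobase) * 10^6 (per million reads)
    return 1e9 * c / (n * l)

# Hypothetical gene: 500 reads, 20 million reads in total, 2,500 bp of exons.
value = fpkm(500, 20_000_000, 2_500)
print(value)  # 10.0
```

Dividing by both N and L is what makes values comparable across genes of different length and across libraries of different sequencing depth.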
Cuffdiff (differential expression)
• Pairwise or time series comparison
• Normal distribution of read counts
• Fisher’s test
test_id          gene    locus                     sample_1  sample_2  status  value_1  value_2  ln(fold_change)  test_stat  p_value   significant
ENSG00000000003  TSPAN6  chrX:99883666-99894988    q1        q2        NOTEST  0        0        0                0          1         no
ENSG00000000005  TNMD    chrX:99839798-99854882    q1        q2        NOTEST  0        0        0                0          1         no
ENSG00000000419  DPM1    chr20:49551403-49575092   q1        q2        NOTEST  15.0775  23.8627  0.459116         -1.39556   0.162848  no
ENSG00000000457  SCYL3   chr1:169631244-169863408  q1        q2        OK      32.5626  16.5208  -0.678541        15.8186    0         yes
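An output table like this one can be filtered with Python's standard csv module. The sketch below assumes a tab-separated file with exactly the column names shown on the slide; header names can differ between Cuffdiff versions, so they should be checked against the real file. Two rows from the slide are embedded so the example runs on its own.

```python
import csv
import io
import math

HEADER = ["test_id", "gene", "locus", "sample_1", "sample_2", "status",
          "value_1", "value_2", "ln(fold_change)", "test_stat", "p_value",
          "significant"]
ROWS = [
    ["ENSG00000000419", "DPM1", "chr20:49551403-49575092", "q1", "q2",
     "NOTEST", "15.0775", "23.8627", "0.459116", "-1.39556", "0.162848", "no"],
    ["ENSG00000000457", "SCYL3", "chr1:169631244-169863408", "q1", "q2",
     "OK", "32.5626", "16.5208", "-0.678541", "15.8186", "0", "yes"],
]
# Rebuild the tab-separated text that would normally be read from a file.
TABLE = "\n".join("\t".join(row) for row in [HEADER] + ROWS) + "\n"

def significant_genes(handle):
    """Return (gene, ln fold change) for rows that were tested and significant."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["gene"], float(row["ln(fold_change)"]))
            for row in reader
            if row["status"] == "OK" and row["significant"] == "yes"]

hits = significant_genes(io.StringIO(TABLE))
print(hits)

# The reported value is the natural log of value_2 / value_1.
recomputed = math.log(16.5208 / 32.5626)
print(round(recomputed, 4))
```

Rows with status NOTEST were not tested, so the filter checks the status column as well as the significant column.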
Recommendations
• You can use BOWTIE or BOWTIE2 but
• Use CUFFDIFF2
  – Better statistical model
  – Detection of truly differentially expressed genes
  – VERY easy to parse output file (See example on course page)