Post on 13-Dec-2015

RNA-Seq Transcriptome Profiling

Before we start: Align sequence reads to the reference genome

The most time-consuming part of the analysis is aligning the reads (in Sanger FASTQ format) for all replicates against the reference genome.

Overview: This training module is designed to provide hands-on experience in using RNA-Seq for transcriptome profiling.

Question: How can we compare gene expression levels using RNA-Seq data in Arabidopsis WT and hy5 genetic backgrounds?

RNA-seq in the Discovery Environment

Scientific Objective

LONG HYPOCOTYL 5 (HY5) is a basic leucine zipper transcription factor (TF).

Mutations in the HY5 gene cause aberrant phenotypes in Arabidopsis morphology, pigmentation and hormonal response.

We will use RNA-seq to compare the transcriptomes of seedlings from WT and hy5 genetic backgrounds to identify HY5-regulated genes.

Samples

• Experimental data downloaded from the NCBI Short Read Archive (GEO:GSM613465 and GEO:GSM613466)

• Two replicates each of RNA-seq runs for Wild-type and hy5 mutant seedlings.

Specific Objectives

By the end of this module, you should

1) Be more familiar with the DE user interface

2) Understand the starting data for RNA-seq analysis

3) Be able to align short sequence reads with a reference genome in the DE

4) Be able to analyze differential gene expression in the DE

5) Be able to visualize RNA-Seq data in Atmosphere

RNA-Seq Conceptual Overview

Image source: http://www.bgisequence.com

RNA-Seq Data

@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
@SRR070570.12 HWUSI-EAS455:3:1:2:1592 length=41
GAGGCGTTGACGGGAAAAGGGATATTAGCTCAGCTGAATCT
+
@=:9>5+.5=?@<6>A?@6+2?:</7>,%1/=0/7/>48##
@SRR070570.13 HWUSI-EAS455:3:1:2:869 length=41
TGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCA
+
A;BAA6=A3=ABBBA84B<&78A@BA=(@B>AB2@>B@/9?
@SRR070570.32 HWUSI-EAS455:3:1:4:1075 length=41
CAGTAGTTGAGCTCCATGCGAAATAGACTAGTTGGTACCAC
+
BB9?A@>AABBBB@BCA?A8BBBAB4B@BC71=?9;B:3B?
@SRR070570.40 HWUSI-EAS455:3:1:5:238 length=41
AAAAGGGTAAAAGCTCGTTTGATTCTTATTTTCAGTACGAA
+
BBB?06-8BB@B17>9)=A91?>>8>*@<A<>>@1:B>(B@
@SRR070570.44 HWUSI-EAS455:3:1:5:1871 length=41
GTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAG
+
BBBCBCCBBBBBA@BBCCB+ABBCB@B@BB@:BAA@B@BB>
@SRR070570.46 HWUSI-EAS455:3:1:5:1981 length=41
GAACAACAAAACCTATCCTTAACGGGATGGTACTCACTTTC
+
?A>-?B;BCBBB@BC@/>A<BB:?<?B?=75?:9@@@3=>:
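Each FASTQ record in the data above consists of four lines: a header starting with "@", the base calls, a "+" separator, and one quality character per base. The following is a minimal Python sketch of reading such records, assuming Sanger encoding (Phred score = ASCII code minus 33), as mentioned at the start of this module. The names FastqRecord and parse_fastq are illustrative and are not part of the DE or of any tool in this workflow.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str        # header line without the leading "@"
    sequence: str    # base calls
    qualities: list  # Phred scores, one per base


def parse_fastq(lines) -> Iterator[FastqRecord]:
    """Yield records from an iterable of FASTQ lines (four lines per record)."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Assumption: Sanger FASTQ, where Phred score = ASCII code - 33.
        yield FastqRecord(header[1:], seq, [ord(c) - 33 for c in qual])


# First record from the data shown above; its quality string ends in
# 17 "#" characters, the lowest-quality calls in the read.
example = [
    "@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41",
    "CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC",
    "+",
    "BA?39AAA933BA05>A@A=?4,9" + "#" * 17,
]
record = next(parse_fastq(example))
print(record.name)
print(len(record.sequence), record.qualities[:3], record.qualities[-1])
```

Because "@" is also a valid quality character, records cannot be split reliably on "@" alone; counting lines in groups of four, as above, avoids that pitfall.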

…Now What?

Bioinformagician

$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq

$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam

$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt

$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,\
./C1_R3_thout/accepted_hits.bam \
./C2_R1_thout/accepted_hits.bam,\
./C2_R3_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
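The per-sample commands above differ only in the sample name, so they are easy to mistype when copied by hand. As a rough sketch, the TopHat command lines could instead be generated from the condition and replicate labels, which keeps both mate files tied to the same sample. The function name tophat_commands is hypothetical; the flags and file names are taken from the commands above, and the sketch only builds the command strings without running anything.

```python
def tophat_commands(conditions=("C1", "C2"), replicates=(1, 2, 3), threads=8):
    """Build one TopHat command line per paired-end sample."""
    commands = []
    for cond in conditions:
        for rep in replicates:
            sample = f"{cond}_R{rep}"
            # Both mate files are derived from the same sample name.
            commands.append(
                f"tophat -p {threads} -G genes.gtf -o {sample}_thout genome "
                f"{sample}_1.fq {sample}_2.fq"
            )
    return commands


for command in tophat_commands():
    print(command)
```

The same pattern would apply to the per-sample cufflinks commands, since they also vary only by sample name.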

Your RNA-Seq Data

Your transformed RNA-Seq Data

RNA-Seq Analysis Workflow

Tophat (bowtie)

Cufflinks

Cuffmerge

Cuffdiff

CummeRbund

Your Data

iPlant Data Store

FASTQ

Discovery Environment

Atmosphere

Quick Summary

Find Differentially Expressed genes

Align to Genome: TopHat

View Alignments: IGV

Differential Expression: CuffDiff

Download Reads from SRA

Export Reads to FASTQ

Import SRA data from NCBI SRA

Extract FASTQ files from the downloaded SRA archives

Pre-Configured: Getting the RNA-seq Data

Examining Data Quality with FastQC

RNA-Seq Workflow Overview

Align the four FASTQ files to the Arabidopsis genome using TopHat

Align Reads to the Genome

TopHat

• TopHat is one of many applications for aligning short sequence reads to a reference genome.

• It uses the Bowtie aligner internally.

• Alternatives include BWA, MAQ, OLego, Stampy, and Novoalign.

RNA-seq Sample Read Statistics

• Genome alignments from TopHat were saved as BAM files, the binary version of SAM (samtools.sourceforge.net/).

• Reads retained by TopHat are shown below

Sequence run    WT-1         WT-2         hy5-1        hy5-2
Reads           10,866,702   10,276,268   13,410,011   12,471,462
Seq. (Mbase)    445.5        421.3        549.8        511.3
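The "Seq. (Mbase)" row follows from the read counts: each read in this data set is 41 bases long (the "length=41" field in the FASTQ headers), so the sequence total is the number of reads times 41, expressed in millions of bases. A short Python check of that arithmetic, with the function name megabases chosen here purely for illustration:

```python
READ_LENGTH = 41  # bases per read, from "length=41" in the FASTQ headers

# Reads retained by TopHat, copied from the table above.
reads = {
    "WT-1": 10_866_702,
    "WT-2": 10_276_268,
    "hy5-1": 13_410_011,
    "hy5-2": 12_471_462,
}


def megabases(read_count, read_length=READ_LENGTH):
    """Total sequence in millions of bases, rounded to one decimal place."""
    return round(read_count * read_length / 1e6, 1)


for sample, count in reads.items():
    print(sample, megabases(count))
```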

ATG44120 (12S seed storage protein) is significantly down-regulated in the hy5 mutant background (> 9-fold, p=0). Compare with the gene on the right, which lacks differential expression.

CuffDiff

• Cufflinks is a program that assembles aligned RNA-Seq reads into transcripts, estimates their abundances, and tests for differential expression and regulation transcriptome-wide.

• CuffDiff is a program within the Cufflinks package that compares transcript abundance between samples.

Examining Differential Gene Expression

Examining the Gene Expression Data

Filter CuffDiff results for up or down-regulated gene expression in hy5 seedlings

Differentially expressed genes

Example filtered CuffDiff results generated with Filter_CuffDiff_Results to:

1) Select genes with minimum two-fold expression difference

2) Select genes with significant differential expression (q <= 0.05)

3) Add gene descriptions
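The first two filtering criteria can be sketched in a few lines of Python. This is an illustration of the logic only, not the implementation of Filter_CuffDiff_Results. It assumes a tab-delimited table with the columns log2(fold_change) and q_value, as in the gene_exp.diff file that Cuffdiff writes; a two-fold difference in either direction corresponds to an absolute log2 fold change of at least 1. The function name filter_diff and the example rows are made up.

```python
import csv
import io


def filter_diff(handle, min_abs_log2_fc=1.0, max_q=0.05):
    """Return rows with at least a two-fold change and q <= max_q."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        log2_fc = float(row["log2(fold_change)"])
        q_value = float(row["q_value"])
        if abs(log2_fc) >= min_abs_log2_fc and q_value <= max_q:
            kept.append(row)
    return kept


# Made-up rows for illustration only; not results from this experiment.
example = io.StringIO(
    "gene_id\tlog2(fold_change)\tq_value\n"
    "geneA\t-3.2\t0.001\n"
    "geneB\t0.4\t0.010\n"
    "geneC\t1.5\t0.200\n"
)
print([row["gene_id"] for row in filter_diff(example)])
```

Here geneB is dropped because its change is smaller than two-fold, and geneC because its q-value exceeds 0.05. Adding gene descriptions, the third step, would require a separate annotation table and is not shown.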
