Post on 11-Jan-2016


Introduction to RNA-Seq and Transcriptome Analysis

Hands-on activities (Fun with UNIX!)

PowerPoint: Jessica Kirkpatrick and Casey Hanson

RNA-Seq Lab | Jessica Kirkpatrick | 2015

Exercise

1. Use the Tuxedo Suite to:

a. Align RNA-Seq reads using TopHat (a splice-aware aligner).

b. Perform reference-based transcriptome assembly with Cufflinks.

c. Obtain a new transcriptome using Cufflinks and Cuffmerge.

d. Use Cuffdiff to obtain a list of differentially expressed genes.

e. Report a list of significantly differentially expressed genes.

2. Use a genome browser and visualization tool to observe the aligned data and the new transcriptome.

Tuxedo Suite

Trapnell et al., Nature Protocols, March 2012

TopHat uses either Bowtie or Bowtie2 to align reads in a splice-aware manner and aids the discovery of new splice junctions.

The Cufflinks package has 4 components; the 2 major ones are listed below:

Cufflinks does reference-based transcriptome assembly.

Cuffdiff does statistical analysis and identifies differentially expressed transcripts in a simple pairwise comparison, or in a series of pairwise comparisons in a time-course experiment.

Bowtie and Bowtie2 use Burrows-Wheeler indexing for aligning reads. With Bowtie2 there is no upper limit on the read length.

Premise

Question: Is there a difference in the results if the Tuxedo Suite is run 2 different ways?

1. Procedure:

Run 1: Allow TopHat to select splice junctions and proceed through the steps without giving the software any information about known genes/gene models.

Run 2: Force TopHat to use only known splice junctions (i.e. known genes/gene models) and proceed through the steps, making sure we do our analysis in the context of these gene models.

2. Evaluation:

a. 2 metrics: the number of mapped reads and the number of genes identified as significantly different.

b. Compare the new transcriptome to the known genes.

Input data

RNA-Seq: 100 bp, single-end data

sample        replicate #    fastq name               # reads
control       Replicate 1    thrombin_control.fastq   10,953
experiment    Replicate 1    thrombin_expt.fastq      12,027

Genome & gene information:

name              description
chr22.fa          FASTA file with the sequence of chromosome 22 from the human genome (hg19, UCSC); the reference genome
genes-chr22.gtf   GTF file with gene annotation, known genes (hg19, UCSC)
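The read counts in the table can be checked from the command line. The following is a minimal sketch, assuming each FASTQ record occupies exactly 4 lines (no wrapped sequence lines); the function name count_fastq_reads is our own and not part of any tool.

```shell
# Count reads in a FASTQ file, assuming the common 4-lines-per-record layout
# (header, sequence, "+", qualities) with no wrapped sequence lines.
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Example (path assumes the working directory used in this lab):
#   count_fastq_reads ~/rnaseq-mayo/thrombin_control.fastq
```

If the result does not match the table, or is not a whole number, the file is probably incomplete or uses wrapped records.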

Sign in to Galaxy

Go to https://galaxy.illinois.edu

Click on the button

Sign in using your classroom ID and password

How Galaxy works with the biocluster

Christopher Fields

Biocluster

Signing up - http://biocluster.igb.illinois.edu/

Usage and cost - http://help.igb.illinois.edu/Biocluster

Rename the History

Accessing the input files

The data are located in the following directory:

/home/classroom/rnaseq-mayo/

The rnaseq-mayo directory contains an input_data folder as well as a results folder.

(Note: “~” is a symbol in UNIX paths referring to your home directory.)

$ mkdir ~/rnaseq-mayo

# Make a working directory in your home directory.

$ cp /home/classroom/rnaseq-mayo/input_data/* ~/rnaseq-mayo/

# Copy the data to your working directory.

$ qsub -I -q classroom -l nodes=1:ppn=4

# Log in to a “classroom” node on the cluster in interactive mode, with 4 processors.
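After copying, it can be worth confirming that all four input files arrived. This is a small sketch under the assumption that the input_data folder holds exactly the files named in the input data table; check_inputs is a hypothetical helper name.

```shell
# Report any expected input file that is missing or empty in directory $1.
# The file names are the ones listed in the input data table of this lab.
# Returns 0 if everything is present, 1 otherwise.
check_inputs() {
    dir=$1
    missing=0
    for f in chr22.fa genes-chr22.gtf thrombin_control.fastq thrombin_expt.fastq; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_inputs ~/rnaseq-mayo && echo "all inputs present"
```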

Getting data into Galaxy (Method 4)

Click on the “Shared Data” pulldown menu

Click on “Published Histories”

Getting data

Click on the “Workshop FASTQs”

Getting data

Click on “Import History” at the top, towards the right

Now your current history is the imported history, called “imported: RNA-Seq Chr 22 Data”

In the top right corner of the history panel is a wheel; click on that wheel

Getting data

The pulldown menu that is revealed when you click on the wheel has many options that are worth exploring.

Right now we are interested in the “Copy Datasets” option.

We want to copy the data in this imported history to our previously created “RNA-Seq workshop” history.

Getting data into Galaxy (Method 4)

For your “Source History”, select the imported one and for your “Destination History”, select the RNA-Seq workshop

Select all the datasets that you want to copy to the “RNA-Seq workshop” history

Click on “Copy History Items”

A glimpse at the input data

• FASTA
  • chr22.fa

• GTF
  • genes-chr22.gtf

• FASTQ
  • thrombin_expt.fastq
  • thrombin_control.fastq

RUN 1: ALIGNMENT

Aligning reads using TopHat

We are not going to provide any genic structure information. TopHat will find splice junctions on its own.

Aligning reads using TopHat

• Always read the instructions before running software.

• In the left tools panel, search for tophat2.

• Click on tophat2; the central panel will then show all the options for tophat2.

• Remember that the quality values in your FASTQ need to be Phred+33 (Sanger scores).
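One way to check the quality encoding before aligning is to look at the lowest character used in the quality lines. The sketch below is a heuristic only, assuming unwrapped 4-line FASTQ records; both function names are our own. It relies on the fact that Phred+64 encodings never use characters below ASCII 59, so anything lower points to Phred+33.

```shell
# Print the smallest ASCII code found in the quality lines of a FASTQ file
# (every 4th line, assuming unwrapped 4-line records).
min_quality_ascii() {
    min_char=$(awk 'NR % 4 == 0' "$1" | fold -w 1 | LC_ALL=C sort | head -n 1)
    # A leading single quote makes printf print the character's numeric code.
    printf '%d\n' "'$min_char"
}

# Heuristic guess of the encoding; a file with only high qualities cannot be
# classified reliably this way, hence the hedged second message.
guess_phred_offset() {
    if [ "$(min_quality_ascii "$1")" -lt 59 ]; then
        echo "phred33"
    else
        echo "phred64 (or high-quality phred33; inspect by hand)"
    fi
}

# Example:
#   guess_phred_offset ~/rnaseq-mayo/thrombin_expt.fastq
```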

Aligning reads using TopHat2

• Run 1:
  • No genic structure information (i.e. no GTF file)
  • TopHat2 will find splice junctions on its own
  • Run this on experimental and control data
• Run 2:
  • Genic structure information will be used
  • Run this on experimental data

Alignment with Tophat2: Run 1

• Click “Execute” once you have made all the selections.

Now we want to start a new tophat2 run for the control FASTQ file in the RNA-Seq workshop history.

Since this is a “re-run”, all the parameters should be the same; this makes it easy to replicate runs and to go back and check run parameters.

Always re-label new files immediately with names that make sense to you, by clicking on the pencil and changing the attributes.

Rename Files

On Galaxy it's important to rename your files to something meaningful.

Evaluating alignment: Run 1

How many reads DID NOT align to the reference genome chr22?

RUN 2: INFORMED ALIGNMENT

Aligning reads using TopHat2

• Run 1:
  • No genic structure information (i.e. no GTF file)
  • TopHat2 will find splice junctions on its own
  • Run this on experimental and control data
• Run 2:
  • Genic structure information will be used
  • Run this on experimental data only

Alignment with Tophat2: Run 2

Now we want to start a new informed tophat2 run.

Aligning reads using gene information

• Click “Execute” once you have changed the selections shown above.
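For readers working on the cluster rather than in Galaxy, the two alignment modes can be sketched as TopHat command lines. This is an illustration under stated assumptions, not the exact call Galaxy makes: the Bowtie2 index base name "chr22", the output directory names, and the helper functions run, tophat_run1 and tophat_run2 are our own. The options -p (threads), -o (output directory), -G (GTF of known genes) and --no-novel-juncs (use only annotated junctions) are standard TopHat options.

```shell
# Print each command; execute it only when DRY_RUN is unset or empty.
run() {
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then
        "$@"
    fi
}

# Run 1: no annotation; TopHat discovers splice junctions on its own.
# $1 = FASTQ file, $2 = output directory (name of your choice).
tophat_run1() {
    run tophat -p 4 -o "$2" chr22 "$1"
}

# Run 2: supply the known gene models and disable novel junction discovery,
# so that only junctions present in genes-chr22.gtf are used.
tophat_run2() {
    run tophat -p 4 -G genes-chr22.gtf --no-novel-juncs -o "$2" chr22 "$1"
}

# Example (assumes a Bowtie2 index built beforehand, e.g. with
# `bowtie2-build chr22.fa chr22`):
#   tophat_run1 thrombin_expt.fastq expt_run1
#   tophat_run2 thrombin_expt.fastq expt_run2
```

Setting DRY_RUN=1 prints the commands without running them, which is a convenient way to review the parameters first.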

Rename Files

Rename your files and make sure their names are distinct from those of the last dataset.

Evaluating alignment: Run 2

Comparison of alignments

                                                     Unmapped reads
sample         fastq name             # reads    Run 1    Informed run (Run 2)
control        thrombin_control.txt   10,953     101      27*
experimental   thrombin_expt.txt      12,027     147      39

Conclusions

There are fewer unmapped reads with the informed alignment, or Run 2 (i.e. when we use the known genes and known splice sites)!

TopHat’s prediction of splice junctions is not working very well for this dataset. (This is likely due to the low number of reads in our dataset.)
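To put the counts in the comparison table on a common scale, the unmapped reads can be expressed as a percentage of the total reads per sample. A minimal sketch, using only the numbers from the table; pct_unmapped is a hypothetical helper name.

```shell
# Percentage of reads left unmapped: 100 * unmapped / total, two decimals.
pct_unmapped() {
    awk -v u="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * u / t }'
}

pct_unmapped 101 10953   # control, Run 1        -> 0.92
pct_unmapped 27  10953   # control, Run 2        -> 0.25
pct_unmapped 147 12027   # experimental, Run 1   -> 1.22
pct_unmapped 39  12027   # experimental, Run 2   -> 0.32
```

In both samples the informed run leaves roughly a quarter as many reads unmapped as the uninformed run.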

FINDING DIFFERENTIALLY EXPRESSED GENES

Tuxedo suite (Cufflinks)

Trapnell et al., Nature Protocols, March 2012

The Cufflinks package has 4 components; the 2 major ones are listed below:

Cufflinks does reference-based transcriptome assembly.

Cuffdiff does statistical analysis and identifies differentially expressed transcripts in a simple pairwise comparison, or in a series of pairwise comparisons in a time-course experiment.

Assembling transcripts using Cufflinks

• Run Cufflinks to obtain newly assembled gene transcripts from the aligned RNA-Seq reads.

There is no need to conduct this step for the informed alignment (Run 2) because the locations of known genes are known already.

Cufflinks: Expt data

• Click “Execute” once you have made all the selections.

Cufflinks: Control data

Now we want to start a new cufflinks run for the control dataset.

Since this is a “re-run”, all the parameters should be the same; this makes it easy to replicate runs and to go back and check run parameters.

Merging transcript sets using Cuffmerge

Run Cuffmerge in order to merge the assembled transcripts from the control and experimental samples. The output of this will be your transcriptome.

There is no need to conduct this step for the informed alignment.

Differential gene expression using Cuffdiff

• For Run 1 (uninformed), let's find out how many differentially expressed (DE) genes are present.
  • We need a gene (.gtf) file and both alignment (.bam) files (control and experimental).

• We could use Cuffdiff on the informed alignments (Run 2) as well, but we normally recommend using htseq-count and edgeR instead.
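The Run 1 steps from assembly to differential expression can be sketched on the command line as follows. This is an outline under assumptions, not the parameters used in Galaxy: the directory names (merged, cuffdiff_out), the condition labels and the helper functions are our own. The tool options shown (-p, -o, -s for the reference FASTA in cuffmerge, -L for condition labels in cuffdiff) are standard Cufflinks-suite options, and accepted_hits.bam is the alignment file TopHat writes to its output directory.

```shell
# Print each command; execute it only when DRY_RUN is unset or empty.
run() {
    echo "+ $*"
    if [ -z "${DRY_RUN:-}" ]; then
        "$@"
    fi
}

# Assemble transcripts from one TopHat output directory.
# $1 = TopHat output dir (contains accepted_hits.bam), $2 = Cufflinks output dir.
assemble() {
    run cufflinks -p 4 -o "$2" "$1/accepted_hits.bam"
}

# Merge the per-sample assemblies. $1 = text file listing one
# transcripts.gtf path per line. Writes merged/merged.gtf.
merge_assemblies() {
    run cuffmerge -p 4 -s chr22.fa -o merged "$1"
}

# Test for differential expression. $1 = control BAM, $2 = experimental BAM.
diff_expression() {
    run cuffdiff -p 4 -o cuffdiff_out -L control,expt merged/merged.gtf "$1" "$2"
}

# Example:
#   assemble ctrl_run1 ctrl_cufflinks
#   assemble expt_run1 expt_cufflinks
#   printf '%s\n' ctrl_cufflinks/transcripts.gtf expt_cufflinks/transcripts.gtf > assemblies.txt
#   merge_assemblies assemblies.txt
#   diff_expression ctrl_run1/accepted_hits.bam expt_run1/accepted_hits.bam
```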

Differential gene expression using Cuffdiff

• Once you have set your specifications, hit “Execute”.
• This results in many output files.
  • See the “Outputs” description below the Cuffdiff page for more details.
• We are interested in the differential expression of genes.
  • Look at the last column and count the number of yes’s.
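Counting the yes’s by eye is error-prone for longer tables. If the gene-level Cuffdiff output is downloaded as a tab-separated file (called gene_exp.diff when Cuffdiff is run on the command line; the Galaxy dataset will carry whatever name you gave it), the count can be sketched like this. count_significant is a hypothetical helper name.

```shell
# Count rows whose last tab-separated column is "yes", skipping the header row.
count_significant() {
    awk -F '\t' 'NR > 1 && $NF == "yes" { n++ } END { print n + 0 }' "$1"
}

# Example:
#   count_significant gene_exp.diff
```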

VISUALIZATION USING IGV

The Integrative Genomics Viewer (IGV) is a tool that supports the visualization of reads mapped to a reference genome, among other functionalities.

Download data

• Let's compare alignments and GTFs.
• Download 6 files to your computer:
  • thrombin_expt_accepted_hits
  • thrombin_expt_inform_accepted_hits
  • Cuffmerge results
  • genes-chr22.gtf
  • Index files for both alignment files

Start IGV and load data

Load Genome

1. Within IGV, click the File tab on the menu bar.

2. Click the ‘Load Genome from Server’ option.

3. In the browser window, search for “human”, and select the hg19 version.

Load Other Files

1. Within IGV, click the File tab on the menu bar.

2. Click the ‘Load from File’ option.

3. Select the files below (one at a time, or use the Ctrl key to make multiple selections).

ctrl_accepted_hits.bam

ctrl_genes_accepted_hits.bam

expt_accepted_hits.bam

expt_genes_accepted_hits.bam

first-cuffmerge_merged.gtf

genes-chr22.gtf

Visualization with IGV

Your browser window should look similar to the picture below:

Visualization with IGV

Click here and type the following location of a differentially expressed gene:

chr22:19960675-19963235

Move to the left and right of the gene. What do you see?
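The same region can also be inspected in the GTF files directly, which helps when comparing the merged transcriptome with the known gene models. A minimal sketch, assuming standard tab-separated GTF (column 1 = sequence name, columns 4 and 5 = 1-based inclusive start and end); gtf_overlaps is a hypothetical helper name.

```shell
# Print GTF features on chromosome $2 that overlap the interval [$3, $4].
# Two intervals overlap when each one starts before the other ends.
gtf_overlaps() {
    awk -F '\t' -v chr="$2" -v s="$3" -v e="$4" \
        '$1 == chr && $4 <= e && $5 >= s' "$1"
}

# Example:
#   gtf_overlaps genes-chr22.gtf chr22 19960675 19963235
#   gtf_overlaps first-cuffmerge_merged.gtf chr22 19960675 19963235
```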

Visualization with IGV

» It looks like the new transcriptome (first-cuffmerge_merged.gtf) compares poorly to the known gene models. This is very likely due to the very low number of reads in our dataset.

» We can see that there are many more reads for one dataset compared to the other. Hence, it makes sense that the gene was called as being differentially expressed.

» Note the intron-spanning reads.

Conclusion

Today we did the following:

1. Used the Tuxedo Suite to:

a. Align RNA-Seq reads using TopHat (a splice-aware aligner).

b. Perform reference-based transcriptome assembly with Cufflinks.

c. Obtain a new transcriptome using Cufflinks and Cuffmerge.

d. Obtain a list of differentially expressed genes with Cuffdiff.

e. Report a list of significantly differentially expressed genes.

2. Used a genome browser and visualization tool to observe the aligned data and the new transcriptome.

Useful links

Online resources for RNA-Seq analysis questions:

http://www.biostars.org/ - Biostar (Bioinformatics explained)

http://seqanswers.com/ - SEQanswers (the next generation sequencing community)

Most tools have dedicated lists.

Information about the various parts of the Tuxedo suite is available here:

http://ccb.jhu.edu/software.shtml

Genome browser tutorials:

http://www.broadinstitute.org/igv/QuickStart/ - IGV tutorials

http://www.openhelix.com/ucsc/ - UCSC browser tutorials

(OpenHelix is a great place for tutorials; UIUC has a campus-wide subscription.)

Contact us at:

[email protected]

[email protected]