dgaston dec-06-2012

Integrated Learning Session

Daniel Gaston, PhD

Dr. Karen Bedard Lab, Department of Pathology

Bioinformatics: Intro to RNA-Seq Analysis

December 6th, 2012

Overview

Introduction
Considerations for RNA-Seq
Computational Resources/Options
Analysis of RNA-Seq Data
  Principle of analyzing RNA-Seq
  General RNA-Seq analysis pipeline
  “Tuxedo” pipeline
  Alternative tools
Resources

http://www.slideshare.net/DanGaston

Before You Start: Considerations for RNA-Seq Analysis

Next-Generation Sequencing experiments generate a lot of raw data
  25-40 GB/sample/replicate for most transcriptomes/tissue types/cell lines/conditions

Analysis requires more computational resources than many labs routinely have available
  Multiple processing “cores” (8 minimum)
  Large amount of RAM (16GB+)
  Large amount of disk storage space for intermediate and final results files in addition to raw FastQ files
  Can take a significant amount of time per sample (days to a week)

Computational Options
  Local (large workstation or cluster)
  Remote computer/cluster (ComputeCanada/ACENet)
  Cloud services
    Amazon Web Services
  Cloud/local bioinformatics “portals”
    Galaxy
    Chipster
    GenomeSpace
    CloudBioLinux
    CloudMan
    BioCloudCentral (interface to CloudMan, CloudBioLinux, etc.)

RNA-Seq Analysis Workflow

So I Ran an RNA-Seq Experiment. Now What?

Need to go from raw “read” data to gene expression data

We now have:
  De-multiplexed FastQ files for each individual sample and replicate

We want lists of:
  Differentially expressed genes/transcripts
  Potentially novel genes/transcripts
  Potentially novel splice junctions
  Potential fusion events

Organize your data, programs, and additional resources (discussed later)

What is the Raw Data

A single lane of Illumina HiSeq 2000 sequencing produces ~250-300 million “reads”

Can be paired-end or single-end sequencing (paired-end preferred)

Various read lengths (number of sequencing cycles)
  2x50bp, 2x75bp, 2x100bp, 2x150bp most common
  Cost versus amount of usable data

True raw data is actually image data with colour intensities; these are converted into text (A, C, G, T and quality scores) stored in a format called FastQ

FastQ

FASTA-like text format: each read has a header line, a sequence line, a separator line (+), and a line of quality scores, one for every base in the sequenced read

In Paired-End Sequencing one file for each “end” of sequencing (Primer 1 and Primer 2)

Quality scores are encoded with a single character per base, each representing a number. The most common encoding scheme is called Phred33. Older Illumina software (versions 1.3 to 1.7) used Phred64, but the current generation does not. The encoding often needs to be set explicitly in alignment programs

@M00814:1:000000000-A2472:1:1101:14526:1866 1:N:0:1
TGGAACATGCGTGCGNAGCCGAAAGTGTGTCCCCACTTTCATATGAAGAAAGACTG
+
?????BBBBB9?+<+#,,6C>CAEEHHHFFHHHHFEHHHHHHHHHGHHHHHHHHHH
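To make the quality encoding concrete, here is a minimal Python sketch that splits one FastQ record into its four lines and decodes the quality characters as Phred33. The record is the first 20 bases of the example read above, shortened for brevity; the function names are illustrative and not part of any tool.

```python
# A FastQ record is four lines: header, sequence, separator (+), qualities.
# This record is the first 20 bases of the slide's example read.
record = """@M00814:1:000000000-A2472:1:1101:14526:1866 1:N:0:1
TGGAACATGCGTGCGNAGCC
+
?????BBBBB9?+<+#,,6C"""


def parse_fastq_record(text):
    """Split a single four-line FastQ record into (name, sequence, qualities)."""
    header, sequence, separator, qualities = text.strip().split("\n")
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FastQ record")
    if len(sequence) != len(qualities):
        raise ValueError("sequence and quality lines differ in length")
    return header[1:], sequence, qualities


def decode_phred(qualities, offset=33):
    """Convert quality characters to integer scores (offset 33 = Phred33)."""
    return [ord(char) - offset for char in qualities]


name, sequence, qualities = parse_fastq_record(record)
scores = decode_phred(qualities)

# Print each base next to its decoded quality score.
for base, score in zip(sequence, scores):
    print(base, score)
```

The uncalled base (N) lines up with the lowest score in the read. Decoding the same characters with offset=64 gives negative numbers, which is one quick way to spot that the wrong encoding was assumed.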

General Analysis Pipeline

Short-Read Alignment

Transcript Reconstruction

Abundance/Expression

Visualization / Statistics

What is Short-Read Alignment?

Section of Reference Chromosome

Paired-End Reads

What’s Special About RNA-Seq

Normally distance between paired-reads and size of insertions both constrained

With RNA-Seq the source is mRNA, not genomic DNA

Mapping to a reference genome, not transcriptome

Need to account for introns, pairs can be much further apart than expected

Transcript Reconstruction: Intron/Exon Junctions

Exon 1 Exon 2 Exon 3

Transcript Reconstruction: Alternative Splicing

Exon 1 Exon 2 Exon 3

Transcript Reconstruction: Novel Exon/Transcript Identification

Exon 1 Exon 2 Exon 3 Exon X

Transcript Reconstruction: Fusion Transcripts

Exon 1 Exon 2 Exon 3

Gene 2 Exon 4

Transcript Reconstruction: Differential Expression

Sample 1

Sample 2

What else can we look for?

Combine with ChIP-Seq to differentiate various levels of regulation

Integrative analyses to identify common elements (micro-RNA, transcription factors, molecular pathways, protein-DNA interactions)

Combine with whole-exome or whole-genome sequencing
  Allele-specific expression
  Allelic imbalance
  Loss of heterozygosity (LOH)
  Large genomic rearrangements/abnormalities

Caution

Need to differentiate between real data and artifacts

Differentiate between biologically meaningful data and “noise”

Sample selection, experimental design, biological replication (not technical replication), and robust statistical methods are important

Looking at your data “by eye” is useful, but needs to be backed up by stats

Avoid experimenter bias

Try to be holistic in your analyses

Visualizing with IGV

“Tuxedo” Analysis Pipeline

Bowtie

TopHat

Cufflinks

Cuffcompare / Cuffmerge

Cuffdiff

CummeRbund

What you need before you begin

The individual programs

Reference genome (hg19/GRCh37)
  FASTA file of the whole genome; each chromosome is a sequence entry

Bowtie2 index files for the reference genome
  Index files are compressed representations of the genome that allow reads to be aligned to the reference efficiently and in parallel

Gene/transcript annotation reference (UCSC, Ensembl, ENCODE, etc.)
  Gives information about the location of genes and important features such as introns, exons, splice junctions, etc.

Step 0: Bowtie

Bowtie forms the core of TopHat for short-read alignment

Initial mapping of a subset of reads (~5 million) to a reference transcriptome to estimate the inner-distance mean/median and standard deviation for TopHat

This information can be retrieved from the library prep stage, but it is actually better to estimate it from your final data

Sample command-line (Bowtie2 syntax; -x takes the basename of the index files):

bowtie2 -x /path/to/transcriptome_ref -q --phred33 --local -p 8 -1 read1.fastq -2 read2.fastq -S output.sam
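One possible way to turn the SAM output of this mapping step into the numbers TopHat needs is sketched below in Python. It reads the observed template length (TLEN, column 9 of a SAM line) for properly paired reads and subtracts both read lengths. The SAM lines are made up for illustration, the read length of 100 is an assumption, and with local alignment soft-clipped bases can shift TLEN slightly, so treat the result as an estimate.

```python
import statistics

READ_LENGTH = 100  # assumption: a 2x100bp run; change to match your data


def sam_line(name, flag, pos, mate_pos, tlen):
    """Build a minimal, made-up SAM alignment line (SEQ and QUAL given as '*')."""
    fields = [name, str(flag), "transcript_1", str(pos), "42", "100M",
              "=", str(mate_pos), str(tlen), "*", "*"]
    return "\t".join(fields)


# Three hypothetical read pairs; flags 99 and 147 mark the two mates of a proper pair.
example_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    sam_line("pair1", 99, 101, 291, 290),
    sam_line("pair1", 147, 291, 101, -290),
    sam_line("pair2", 99, 501, 701, 300),
    sam_line("pair2", 147, 701, 501, -300),
    sam_line("pair3", 99, 901, 1111, 310),
    sam_line("pair3", 147, 1111, 901, -310),
]


def inner_distances_from_sam(lines, read_length):
    """Collect one inner-distance value per properly paired fragment."""
    distances = []
    for line in lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        tlen = int(fields[8])
        # Bit 0x2 = properly paired; a positive TLEN counts each pair once.
        if flag & 0x2 and tlen > 0:
            distances.append(tlen - 2 * read_length)
    return distances


distances = inner_distances_from_sam(example_sam, READ_LENGTH)
mean_distance = statistics.mean(distances)
median_distance = statistics.median(distances)
std_dev = statistics.stdev(distances)
print(mean_distance, median_distance, std_dev)
```

On real data the same function can be fed an open file handle on output.sam; the mean (or median) and standard deviation are the values to pass to TopHat as the inner distance and its standard deviation.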

Step 1: TopHat

TopHat is a short-read mapper capable of aligning reads to a reference genome and finding exon-exon junctions

Can be provided a list of known junctions, do de novo junction discovery, or both

Also has an option to find potential fusion-gene transcripts

Sample command-line:

tophat -p 8 -G gene_annotations.gtf -r inner_distance --mate-std-dev std_dev -o Output.dir /path/to/bowtie2indexes/genome read1.fastq read2.fastq

About TopHat Options

-o: The path/name of a directory in which to place all of the TopHat output files

-G: The path to and name of an annotation file so TopHat can be aware of known junctions

Reference genome: given as path and “base name.” If the reference genome is saved as /genomes/hg19/genome.fa, then the relevant path and basename would be /genomes/hg19/genome

Inner Distance = Fragment size - (2 x read length)
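The inner-distance formula above is simple arithmetic; a small Python sketch is given here only to make the sign convention visible. The fragment sizes and read length are hypothetical values, not numbers from this experiment.

```python
def inner_distance(fragment_size, read_length):
    """Inner distance between mates: fragment size minus both read lengths."""
    return fragment_size - 2 * read_length


# Hypothetical library: 300 bp fragments sequenced as 2x100bp.
print(inner_distance(300, 100))

# Hypothetical short fragments: a negative value means the two mates overlap.
print(inner_distance(150, 100))
```

A negative inner distance is not an error; it simply means the fragments are shorter than the two reads combined.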

TopHat: Additional options

--no-mixed
--b2-very-sensitive
--fusion-search

Running with the above options on 6 processing cores took ~26 hours for one sample

Step 2: Cufflinks

Cufflinks performs gene and transcript discovery

Many possible options
  No novel discovery, use only a reference group of transcripts
  de novo mode (shown below, beginner’s default)
  Mixed reference-guided assembly and de novo discovery
  Options for more robust normalization methods and error correction

Sample command-line:

cufflinks -p 8 -o Cufflinks.out/ accepted_hits.bam

Step 3: Cuffmerge

Merges sample assemblies, estimates abundances, cleans up the transcriptome

Sample command-line:

cuffmerge -g gene_annotations.gtf -s /path/to/genome.fa -p 8 text_list_of_assemblies.txt
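The text_list_of_assemblies.txt argument is a plain text file with one assembly GTF path per line. A minimal Python sketch for generating it is shown here, assuming each sample was run through Cufflinks into its own output directory (Cufflinks writes its assembly as transcripts.gtf). The directory names are hypothetical.

```python
from pathlib import Path


def write_assembly_list(cufflinks_dirs, list_path):
    """Write one transcripts.gtf path per line, as Cuffmerge expects."""
    gtf_paths = [str(Path(directory) / "transcripts.gtf") for directory in cufflinks_dirs]
    Path(list_path).write_text("\n".join(gtf_paths) + "\n")
    return gtf_paths


# Hypothetical Cufflinks output directories, one per sample/replicate:
# write_assembly_list(
#     ["Cond1_rep1.cufflinks", "Cond1_rep2.cufflinks", "Cond2_rep1.cufflinks"],
#     "text_list_of_assemblies.txt",
# )
```

Generating the list from the same directory names used in the Cufflinks step helps avoid a sample being silently left out of the merge.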

Step 4: Cuffdiff

Calculates expression levels of transcripts in samples

Estimates differential expression between samples

Calculates a significance value for the difference in expression levels between samples

Also groups together transcripts that all start from the same start site, to identify genes under transcriptional/post-transcriptional regulation

Sample command-line:

cuffdiff -o Output.dir/ -b /path/to/genome.fa -p 8 -L Cond1,Cond2 -u merged.gtf cond1.bam cond2.bam

Cuffdiff Output

FPKM values for genes, isoforms, CDS, and groups of genes from the same Transcription Start Site for each condition
  FPKM (Fragments Per Kilobase of transcript per Million mapped fragments) is the normalized “expression value” used in RNA-Seq

Count files of the above

As above but on a per replicate basis

Differential expression test results for genes, CDS, primary transcripts, and spliced transcripts on a per sample (condition) comparison basis (each possible X vs Y comparison unless otherwise specified)
  Includes identifiers, expression levels, expression difference values, p-values, q-values, and a yes/no significance field

Differential splicing tests, differential coding output, differential promoter use
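As a rough illustration of what an FPKM value means, the textbook definition can be written in a few lines of Python. This is only the basic formula; Cuffdiff itself estimates abundances with a statistical model, so its reported values will not generally match a naive calculation like this one. The counts and lengths below are invented.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


# Hypothetical transcript: 500 fragments on a 2,000 bp transcript,
# in a sample with 25 million mapped fragments in total.
example_fpkm = fpkm(500, 2_000, 25_000_000)
print(example_fpkm)
```

Dividing by transcript length and by sequencing depth is what lets values be compared between long and short transcripts and between samples sequenced to different depths.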

Step 5: CummeRbund (R)

Trapnell et al., 2012

Visualization

Trapnell et al., 2012

Help!

Command X failed
  Keep calm
  Don’t blame the computer
  Check input files and formats
  Google/SeqAnswers/Biostars

Results look “weird”
  Check the raw data
  Re-check the commands you used

RNA-Seq analysis is an experiment:
  Maintain good records of what you did, like any other experiment
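One lightweight way to keep such records is to write every command to a log file before running it. The Python sketch below is one possible approach, not a feature of any of the tools above; the function name and the log file name are illustrative.

```python
import datetime
import shlex
import subprocess


def run_logged(command, log_path):
    """Append a timestamped copy of the command to a log file, then run it."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    with open(log_path, "a") as log:
        # One line per command: timestamp, tab, the exact command line used.
        log.write(f"{stamp}\t{shlex.join(command)}\n")
    # check=True raises an error if the program exits with a failure status.
    return subprocess.run(command, check=True)


# Example use with the Cufflinks command from Step 2 (log name is hypothetical):
# run_logged(
#     ["cufflinks", "-p", "8", "-o", "Cufflinks.out/", "accepted_hits.bam"],
#     "analysis_log.tsv",
# )
```

Because the command is logged before it runs, failed attempts are recorded too, which helps when re-checking what was actually done.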

Alternative tools

Alternative short-read alignment
  BWA -> Not splice-aware, so not suited to aligning RNA-Seq reads to a genome
  GSNAP
  STAR -> Requires minimum of 30GB of RAM

Alternative transcript reconstruction
  STAR
  Scripture

Alternative Expression/Abundance Estimation
  DESeq
  DEXSeq
  edgeR

Resources

Software Websites
  TopHat http://tophat.cbcb.umd.edu
  Cufflinks http://cufflinks.cbcb.umd.edu
  STAR http://gingeraslab.cshl.edu/STAR/
  Scripture http://www.broadinstitute.org/software/scripture/
  Bioconductor http://www.bioconductor.org/
    DEXSeq
    DESeq
    edgeR

Additional Resources

Trapnell et al. Differential gene and transcript expression analysis of RNA-Seq experiments with TopHat and Cufflinks (2012) Nature Protocols. 7(3)

www.biostars.org (Q&A site)

SeqAnswers Forum

GENCODE Gene Annotations
  http://www.gencodegenes.org/
  ftp://ftp.sanger.ac.uk/pub/gencode

TopHat / Illumina iGenomes References and Annotation Files:
  http://tophat.cbcb.umd.edu/igenomes.html

Acknowledgements

Dalhousie University
  Dr. Karen Bedard
  Dr. Chris McMaster
  Dr. Andrew Orr
  Dr. Conrad Fernandez
  Dr. Marissa Leblanc
  Mat Nightingale
  Bedard Lab
  IGNITE

Dr. Graham Dellaire

Stanford
  Dr. Stephen Montgomery
  Montgomery Lab

BHCRI CRTP Skills Acquisition Program

Experimental Data for Genes of Interest

UCSC Genome Browser

MetabolicMine

NCI Pathway Interaction Database

The Cancer Genome Atlas

Identify cancer subtypes, actionable driver mutations, personalized/genomic/precision medicine

More than $275 million in funding from NIH
Multiple research groups around the world
20 cancer types being studied
205 publications from the research network since late 2008

The Cancer Genome Atlas

UNIX/Linux command-line basics

What is UNIX?

UNIX and UNIX-like systems are a family of computer operating systems originally developed at AT&T’s Bell Labs
  Apple OS X and iOS (UNIX)
  Linux (UNIX-like)

Intro

The terminal (command-line) isn’t THAT scary.

Maintaining a Linux environment can be challenging, but most of these analyses can also be done in an OS X environment

Installing software can sometimes be cumbersome and confusing, however many standard bioinformatics programs and software libraries are fairly easy to set-up

Working with the programs from the command-line will often give you a better appreciation for what the program does and what it requires

Terms to Know

Path: The location of a directory, file, or command on the computer. Example: /Users/dan (OS X home directory)

The Commands You Need to Know

ls: Lists the files in the current directory. Directories (folders) are just a special type of file themselves

cd: Change directory

pwd: View the full path of the directory you are currently in

cat: Displays the contents of a file on the terminal screen

head / tail: Display the top or bottom of a file on the screen, respectively
