dgaston dec-06-2012

Integrated Learning Session Daniel Gaston, PhD Dr. Karen Bedard Lab, Department of Pathology Bioinformatics: Intro to RNA-Seq Analysis December 6th, 2012

Upload: dan-gaston

Post on 11-May-2015


DESCRIPTION

Intro primer on Bioinformatics and Gene Expression analysis in RNA-Seq using the Tuxedo pipeline

TRANSCRIPT

Page 1: Dgaston dec-06-2012

Integrated Learning Session

Daniel Gaston, PhD

Dr. Karen Bedard Lab, Department of Pathology

Bioinformatics: Intro to RNA-Seq Analysis

December 6th, 2012

Page 2: Dgaston dec-06-2012

Overview

Introduction
  Considerations for RNA-Seq
  Computational Resources/Options
Analysis of RNA-Seq Data
  Principle of analyzing RNA-Seq
  General RNA-Seq analysis pipeline
  “Tuxedo” pipeline
  Alternative tools
Resources

http://www.slideshare.net/DanGaston

Page 3: Dgaston dec-06-2012

Before You Start: Considerations for RNA-Seq Analysis

Next-Generation Sequencing experiments generate a lot of raw data
  25-40 GB/sample/replicate for most transcriptomes/tissue types/cell lines/conditions
Analysing the data requires more computational resources than many labs routinely have available
  At minimum several processing “cores” (8 minimum)
  Large amount of RAM (16GB+)
  Large amount of disk storage space for intermediate and final results files, in addition to the raw FastQ files
  Can take a significant amount of time per sample (days to a week)

Page 4: Dgaston dec-06-2012

Computational Options

Local (large workstation or cluster)
Remote computer/cluster (ComputeCanada/ACENet)
Cloud services
  Amazon Web Services
Cloud/local bioinformatics “portals”
  Galaxy
  Chipster
  GenomeSpace
  CloudBioLinux
  CloudMan
  BioCloudCentral (interface to CloudMan, CloudBioLinux, etc.)

Page 5: Dgaston dec-06-2012

RNA-Seq Analysis Workflow

Page 6: Dgaston dec-06-2012

So I Ran an RNA-Seq Experiment. Now What?

Need to go from raw “read” data to gene expression data
We now have:
  De-multiplexed FastQ files for each individual sample and replicate
We want lists of:
  Differentially expressed genes/transcripts
  Potentially novel genes/transcripts
  Potentially novel splice junctions
  Potential fusion events
Organize your data, programs, and additional resources (discussed later)

Page 7: Dgaston dec-06-2012

What is the Raw Data?

A single lane of Illumina HiSeq 2000 sequencing produces ~250-300 million “reads”
Can be paired-end or single-end sequencing (paired-end preferred)
Various read lengths (number of sequencing cycles)
  2x50bp, 2x75bp, 2x100bp, 2x150bp most common
  Cost versus amount of usable data
True raw data is actually image data with colour intensities, which are then converted into text (A, C, G, T and quality scores) in a format called FastQ

Page 8: Dgaston dec-06-2012

FastQ

FASTA-like text format: each read has a header line, a sequence line, a “+” separator line, and a line of quality scores (one per base in the sequenced read)

In Paired-End Sequencing there is one file for each “end” of sequencing (Primer 1 and Primer 2)

Quality scores are encoded with a single character representing a number
  Most common encoding scheme is called Phred33
  Old Illumina software (Illumina 1.3-1.7) used Phred64, but the current generation does not
  Often needs to be set explicitly in alignment programs

@M00814:1:000000000-A2472:1:1101:14526:1866 1:N:0:1
TGGAACATGCGTGCGNAGCCGAAAGTGTGTCCCCACTTTCATATGAAGAAAGACTG
+
?????BBBBB9?+<+#,,6C>CAEEHHHFFHHHHFEHHHHHHHHHGHHHHHHHHHH
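To make the encoding concrete, here is a minimal Python sketch that reads FastQ records and turns a quality string into numeric Phred scores. It assumes the plain four-lines-per-record layout (header, sequence, "+", qualities); the function names `decode_phred` and `read_fastq` are made up for this illustration and the record is a toy, not real data.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def decode_phred(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to integer Phred scores (offset 33 or 64)."""
    return [ord(ch) - offset for ch in quality]


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from a four-line-per-record FastQ file."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


# Toy record (not real data) to show the layout.
example = io.StringIO("@read1 1:N:0:1\nACGTN\n+\n?B9+#\n")
for name, seq, qual in read_fastq(example):
    print(name, seq, decode_phred(qual))
```

Under Phred33, "?" decodes to 30 and "#" to 2, which is why the uncalled base (N) in a read typically sits under a "#". Decoding the same string with `offset=64` would give negative numbers, a quick sign that the wrong encoding was assumed.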

Page 9: Dgaston dec-06-2012

General Analysis Pipeline

Visualization / Statistics

Abundance/Expression

Transcript Reconstruction

Short-Read Alignment

Page 10: Dgaston dec-06-2012

What is Short-Read Alignment?

Section of Reference Chromosome

Paired-End Reads

Page 11: Dgaston dec-06-2012

What’s Special About RNA-Seq

Normally distance between paired-reads and size of insertions both constrained

With RNA-Seq the source is mRNA, not genomic DNA

Mapping to a reference genome, not transcriptome

Need to account for introns, pairs can be much further apart than expected
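A small illustration of the intron point: in SAM output, a spliced alignment records each skipped intron as an `N` operation in the CIGAR string, so a short read can cover a much longer stretch of the reference. The Python sketch below is only an example; the function names and the CIGAR strings are invented for illustration.

```python
import re
from typing import List, Tuple

# CIGAR operations that consume reference bases, per the SAM format.
REF_CONSUMING = set("MDN=X")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment, introns included."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)


def intron_lengths(cigar: str) -> List[int]:
    """Lengths of skipped regions (N operations), i.e. putative introns."""
    return [n for n, op in parse_cigar(cigar) if op == "N"]


print(reference_span("100M"))         # unspliced 100 bp read
print(reference_span("40M2500N60M"))  # same read length, split across an intron
print(intron_lengths("40M2500N60M"))
```

The second read is still 100 bases long, but it covers 2,600 bases of the genome. The same effect stretches the apparent distance between mates of a pair.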

Page 12: Dgaston dec-06-2012

Transcript Reconstruction: Intron/Exon Junctions

Exon 1 Exon 2 Exon 3

Page 13: Dgaston dec-06-2012

Transcript Reconstruction: Alternative Splicing

Exon 1 Exon 2 Exon 3

Page 14: Dgaston dec-06-2012

Transcript Reconstruction: Novel Exon/Transcript Identification

Exon 1 Exon 2 Exon 3 Exon X

Page 15: Dgaston dec-06-2012

Transcript Reconstruction: Fusion Transcripts

Exon 1 Exon 2 Exon 3

Gene 2 Exon 4

Page 16: Dgaston dec-06-2012

Transcript Reconstruction: Differential Expression

Sample 1

Sample 2

Page 17: Dgaston dec-06-2012

What else can we look for?

Combine with ChIP-Seq to differentiate various levels of regulation
Integrative analyses to identify common elements (micro-RNA, transcription factors, molecular pathways, protein-DNA interactions)
Combine with whole-exome or whole-genome sequencing
  Allele-specific expression
  Allelic imbalance
  LOH
  Large genomic rearrangements/abnormalities

Page 18: Dgaston dec-06-2012

Caution

Need to differentiate between real data and artifacts
Differentiate between biologically meaningful data and “noise”
Sample selection, experimental design, biological replication (not technical replication), and robust statistical methods are important
Looking at your data “by eye” is useful, but needs to be backed up by stats
Avoid experimenter bias
Try to be holistic in your analyses

Page 19: Dgaston dec-06-2012

Visualizing with IGV

Page 20: Dgaston dec-06-2012

“Tuxedo” Analysis Pipeline

CummeRbund

Cufflinks

Cuffcompare

Cuffmerge

Cuffdiff

TopHat

Bowtie

Page 21: Dgaston dec-06-2012

What you need before you begin

The individual programs
Reference genome (hg19/GRCh37)
  FASTA file of whole genome, each chromosome is a sequence entry
Bowtie2 index files for reference genome
  Index files are compressed representations of the genome that allow reads to be aligned to the reference efficiently and in parallel
Gene/transcript annotation reference (UCSC, Ensembl, ENCODE, etc.)
  Gives information about the location of genes and important features such as introns, exons, splice junctions, etc.

Page 22: Dgaston dec-06-2012

Step 0: Bowtie

Bowtie forms the core of TopHat for short-read alignment
Initial mapping of a subset of reads (~5 million) to a reference transcriptome to estimate inner-distance mean/median and standard deviation for TopHat
This info can be retrieved from the library prep stage, but it is actually better to estimate it from your final data

Sample command-line (Bowtie2 syntax):

bowtie2 -x /path/to/transcriptome_ref.fa -q --phred33 --local -p 8 -1 read1.fastq -2 read2.fastq -S output.sam
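One way to get the mean and standard deviation out of the resulting SAM file is sketched below in Python. This is an assumption-laden illustration, not the author's script: it uses the TLEN column (column 9) of properly paired records, counts each pair once by keeping only positive TLEN values, and approximates inner distance as TLEN minus twice the read length. With local alignment, soft-clipped bases make this approximate. The helper `toy_record` only fabricates demonstration lines.

```python
import statistics
from typing import Iterable, List, Tuple

PROPER_PAIR = 0x2  # SAM flag bit: both mates aligned as a proper pair


def inner_distances(sam_lines: Iterable[str]) -> List[int]:
    """Approximate inner distance (TLEN - 2 * read length) per proper pair."""
    distances = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag, tlen, seq = int(fields[1]), int(fields[8]), fields[9]
        # Positive TLEN marks the leftmost mate, so each pair is counted once.
        if flag & PROPER_PAIR and tlen > 0:
            distances.append(tlen - 2 * len(seq))
    return distances


def summarize(distances: List[int]) -> Tuple[float, float]:
    """Mean and sample standard deviation (needs at least two pairs)."""
    return statistics.mean(distances), statistics.stdev(distances)


def toy_record(name: str, flag: int, tlen: int, read_len: int = 50) -> str:
    """Build a made-up SAM line for demonstration only."""
    return "\t".join([name, str(flag), "chr1", "100", "60", f"{read_len}M",
                      "=", "300", str(tlen), "A" * read_len, "I" * read_len])


toy_sam = ["@HD\tVN:1.0",
           toy_record("pairA", 99, 200), toy_record("pairA", 147, -200),
           toy_record("pairB", 99, 220), toy_record("pairC", 99, 180)]
mean, std_dev = summarize(inner_distances(toy_sam))
print(mean, std_dev)
```

The two numbers are what would be passed to TopHat as the inner distance and its standard deviation.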

Page 23: Dgaston dec-06-2012

Step 1: TopHat

TopHat is a short-read mapper capable of aligning reads to a reference genome and finding exon-exon junctions
Can be provided a list of known junctions, do de novo junction discovery, or both
Also has an option to find potential fusion-gene transcripts

Sample command-line:

tophat -p 8 -G gene_annotations.gtf -r inner_distance --mate-std-dev std_dev -o Output.dir /path/to/bowtie2indexes/genome read1.fastq read2.fastq

Page 24: Dgaston dec-06-2012

About TopHat Options

-o: The path/name of a directory in which to place all of the TopHat output files
-G: Path to and name of an annotation file, so TopHat can be aware of known junctions
Reference genome: Given as path and “base name.” If the reference genome is saved as /genomes/hg19/genome.fa, then the relevant path and basename would be /genomes/hg19/genome

Inner Distance = Fragment size - (2 x read length)
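A quick worked example of the two rules above, as a Python sketch. The 300 bp fragment size and 100 bp read length are assumed numbers chosen only for illustration, and the function names are hypothetical.

```python
import os


def inner_distance(fragment_size: int, read_length: int) -> int:
    """Inner Distance = Fragment size - (2 x read length)."""
    return fragment_size - 2 * read_length


def index_basename(fasta_path: str) -> str:
    """Strip the file extension to get the path-plus-basename form."""
    root, _extension = os.path.splitext(fasta_path)
    return root


print(inner_distance(300, 100))                   # 2x100bp reads, 300 bp fragments
print(index_basename("/genomes/hg19/genome.fa"))
```

Note that the result can be negative: with 2x150bp reads on 250 bp fragments the two mates overlap by 50 bases, and the formula gives -50.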

Page 25: Dgaston dec-06-2012

TopHat: Additional options

--no-mixed
--b2-very-sensitive
--fusion-search
Running the above options on 6 processing cores on one sample took ~26 hours

Page 26: Dgaston dec-06-2012

Step 2: Cufflinks

Cufflinks performs gene and transcript discovery
Many possible options
  No novel discovery, use only a reference group of transcripts
  de novo mode (shown below, beginner’s default)
  Mixed reference-guided assembly and de novo discovery
Options for more robust normalization methods and error correction

Sample command-line:

cufflinks -p 8 -o Cufflinks.out/ accepted_hits.bam

Page 27: Dgaston dec-06-2012

Step 3: Cuffmerge

Merges sample assemblies, estimates abundances, cleans up the transcriptome

Sample command-line:

cuffmerge -g gene_annotations.gtf -s /path/to/genome.fa -p 8 text_list_of_assemblies.txt
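The last argument is a plain text file listing one Cufflinks assembly (GTF) per line. A minimal Python sketch for generating it is shown below; it assumes each sample's Cufflinks output directory contains the usual `transcripts.gtf`, and the directory names and the function name `write_assembly_list` are placeholders for illustration.

```python
import os
import tempfile
from typing import Iterable, List


def write_assembly_list(cufflinks_dirs: Iterable[str], list_path: str) -> List[str]:
    """Write one transcripts.gtf path per line, one per Cufflinks output directory."""
    gtf_paths = [os.path.join(d, "transcripts.gtf") for d in cufflinks_dirs]
    with open(list_path, "w") as handle:
        for path in gtf_paths:
            handle.write(path + "\n")
    return gtf_paths


# Demonstration in a temporary directory; the sample directory names are made up.
with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, "text_list_of_assemblies.txt")
    write_assembly_list(["sample1_cufflinks", "sample2_cufflinks"], list_path)
    with open(list_path) as handle:
        manifest = handle.read()
print(manifest)
```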

Page 28: Dgaston dec-06-2012

Step 4: Cuffdiff

Calculates expression levels of transcripts in samples
Estimates differential expression between samples
Calculates significance value for difference in expression levels between samples
Also groups together transcripts that all start from the same start site, to identify genes under transcriptional/post-transcriptional regulation

Sample command-line:

cuffdiff -o Output.dir/ -b /path/to/genome.fa -p 8 -L Cond1,Cond2 -u merged.gtf cond1.bam cond2.bam

Page 29: Dgaston dec-06-2012

Cuffdiff Output

FPKM values for genes, isoforms, CDS, and groups of genes from the same Transcription Start Site for each condition
  FPKM (Fragments Per Kilobase of transcript per Million mapped fragments) is the normalized “expression value” used in RNA-Seq
Count files of above
As above but on a per replicate basis
Differential expression test results for genes, CDS, primary transcripts, spliced transcripts on a per sample (condition) comparison basis (each possible X vs Y comparison unless otherwise specified)
  Includes identifiers, expression levels, expression difference values, p-values, q-values, and yes/no significance field
Differential splicing tests, differential coding output, differential promoter use
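As an illustration of working with the differential expression results outside of R, the Python sketch below pulls the significant genes out of a tab-delimited test file. It assumes the column names used by Cufflinks 2.x in gene_exp.diff (`status`, `q_value`, `significant`, and so on); check the header of your own file, since this is an assumption. The rows in `TOY_DIFF` are invented values, not real results.

```python
import csv
import io
from typing import Dict, List, TextIO

# Invented rows in the assumed gene_exp.diff layout, for demonstration only.
TOY_DIFF = (
    "test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\t"
    "log2(fold_change)\ttest_stat\tp_value\tq_value\tsignificant\n"
    "XLOC_000001\tXLOC_000001\tGENE_A\tchr1:100-900\tCond1\tCond2\tOK\t10.0\t40.0\t"
    "2.0\t4.1\t0.0001\t0.002\tyes\n"
    "XLOC_000002\tXLOC_000002\tGENE_B\tchr1:2000-3000\tCond1\tCond2\tOK\t12.0\t13.0\t"
    "0.115\t0.3\t0.6\t0.8\tno\n"
    "XLOC_000003\tXLOC_000003\tGENE_C\tchr2:500-1500\tCond1\tCond2\tNOTEST\t0.1\t0.2\t"
    "1.0\t0\t1\t1\tno\n"
)


def significant_genes(handle: TextIO, max_q: float = 0.05) -> List[Dict[str, str]]:
    """Return rows that were tested, flagged significant, and pass the q-value cutoff."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        tested = row["status"] == "OK"
        if tested and row["significant"] == "yes" and float(row["q_value"]) <= max_q:
            hits.append(row)
    return hits


for hit in significant_genes(io.StringIO(TOY_DIFF)):
    # value_1 / value_2 hold the FPKM for the two conditions being compared.
    print(hit["gene"], hit["value_1"], hit["value_2"], hit["q_value"])
```

Filtering on `status` as well as `significant` keeps out genes that were never actually tested (for example, too few reads).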

Page 30: Dgaston dec-06-2012

Step 5: CummeRbund (R)

Trapnell et al., 2012

Page 31: Dgaston dec-06-2012

Visualization

Trapnell et al., 2012

Page 32: Dgaston dec-06-2012

Help!

Command X failed
  Keep calm
  Don’t blame the computer
  Check input files and formats
  Google/SeqAnswers/Biostars
Results look “weird”
  Check the raw data
  Re-check the commands you used
RNA-Seq analysis is an experiment:
  Maintain good records of what you did, like any other experiment
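One low-effort way to keep such records is to append every command you run, with a timestamp, to a log file kept alongside the results. The Python sketch below is just one possible approach; the function name `log_command` and the log file name are made up for illustration, and `shlex.join` needs Python 3.8 or newer.

```python
import datetime
import os
import shlex
import tempfile
from typing import List


def log_command(argv: List[str], log_path: str) -> str:
    """Append a timestamped, copy-pasteable command line to a log file."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    entry = f"{stamp}\t{shlex.join(argv)}"
    with open(log_path, "a") as handle:
        handle.write(entry + "\n")
    return entry


# Demonstration in a temporary directory; nothing is actually executed.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "analysis_log.tsv")
    entry = log_command(
        ["cufflinks", "-p", "8", "-o", "Cufflinks.out/", "accepted_hits.bam"],
        log_path,
    )
    with open(log_path) as handle:
        logged = handle.read()
print(logged)
```

Recording program versions and reference file names in the same log makes it much easier to work out later why two runs differ.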

Page 33: Dgaston dec-06-2012

Alternative tools

Alternative short-read alignment
  BWA -> Cannot align RNA-Seq data to a genome (not splice-aware)
  GSNAP
  STAR -> Requires minimum of 30GB of RAM
Alternative transcript reconstruction
  STAR
  Scripture
Alternative Expression/Abundance Estimation
  DESeq
  DEXSeq
  edgeR

Page 34: Dgaston dec-06-2012

Resources

Page 36: Dgaston dec-06-2012

Additional Resources

Differential gene and transcript expression analysis of RNA-Seq Experiments with TopHat and Cufflinks (2012) Nature Protocols. 7(3)
www.biostars.org (Q&A site)
SeqAnswers Forum
GENCODE Gene Annotations
  http://www.gencodegenes.org/
  ftp://ftp.sanger.ac.uk/pub/gencode
TopHat / Illumina iGenomes References and Annotation Files: http://tophat.cbcb.umd.edu/igenomes.html

Page 37: Dgaston dec-06-2012

Acknowledgements

Dalhousie University
Dr. Karen Bedard
Dr. Chris McMaster
Dr. Andrew Orr
Dr. Conrad Fernandez
Dr. Marissa Leblanc
Mat Nightingale
Bedard Lab
IGNITE
Dr. Graham Dellaire
Montgomery Lab
Stanford
Dr. Stephen Montgomery
BHCRI CRTP Skills Acquisition Program

Page 38: Dgaston dec-06-2012

Experimental Data for Genes of Interest

Page 39: Dgaston dec-06-2012

UCSC Genome Browser

Page 40: Dgaston dec-06-2012

UCSC Genome Browser

Page 41: Dgaston dec-06-2012

MetabolicMine

Page 42: Dgaston dec-06-2012

MetabolicMine

Page 43: Dgaston dec-06-2012

NCI Pathway Interaction Database

Page 44: Dgaston dec-06-2012

The Cancer Genome Atlas

Identify cancer subtypes, actionable driver mutations, personalized/genomic/precision medicine
More than $275 million in funding from NIH
Multiple research groups around the world
20 cancer types being studied
205 publications from the research network since late 2008

Page 45: Dgaston dec-06-2012

The Cancer Genome Atlas

Page 46: Dgaston dec-06-2012

The Cancer Genome Atlas

Page 47: Dgaston dec-06-2012

The Cancer Genome Atlas

Page 48: Dgaston dec-06-2012

UNIX/Linux command-line basics

Page 49: Dgaston dec-06-2012

What is UNIX?

UNIX and UNIX-like systems are a family of computer operating systems originally developed at AT&T’s Bell Labs
  Apple OS X and iOS (UNIX)
  Linux (UNIX-like)

Page 50: Dgaston dec-06-2012

Intro

The terminal (command-line) isn’t THAT scary.

Maintaining a Linux environment can be challenging, but most of these analyses can also be done in an OS X environment

Installing software can sometimes be cumbersome and confusing. However, many standard bioinformatics programs and software libraries are fairly easy to set up

Working with the programs from the command-line will often give you a better appreciation for what the program does and what it requires

Page 51: Dgaston dec-06-2012

Terms to Know

Path: The location of a directory, file, or command on the computer. Example: /Users/dan (OS X home directory)

Page 52: Dgaston dec-06-2012

The Commands You Need to Know

ls: Lists the files in the current directory. Directories (folders) are just a special type of file themselves
cd: Change directory
pwd: View the full path of the directory you are currently in
cat: Displays the contents of a file on the terminal screen
head / tail: Displays the top or bottom contents of a file to the screen, respectively