rnaseq analyses -- methods june 7, 2015. module 3 – expression and differential expression lecture...

17
RNAseq analyses -- methods June 7, 2015

Upload: esther-richard

Post on 12-Jan-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

RNAseq analyses -- methods

June 7, 2015

Page 2: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

Lecture Objectives

• Terminology for RNA-seq• Steps in RNAseq analysis• Downstream interpretation of expression and

differential estimates– multiple testing, clustering, heatmaps, classification,

pathway analysis, etc.

Page 3: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

Steps in RNAseq workflow

• Obtain raw data from sequencer• Align/assemble reads• Process alignment to quantify reads/gene feature• Conduct differential analysis• Summarize and visualize

– Create gene lists, prioritize candidates for validation, etc.

• Conduct gene enrichment or pathway analysis

Page 4: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

Some nomenclature

Fasta(sequence only)

Fastq(contains quality scores)

>SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

@SEQ_IDGATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT+!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Sequencing file types:

Page 5: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

Alignment files

• SAM – Sequence Alignment/MAP coordinate file– Tab-delimited ASCII file with all of the information needed for

the alignment of the reads to reference

HWI-ST155_0544:7:2:7658:34048#0 16 Chr1 2082 1 42M *0 0 GTAGAGGTAGGACCAACAAGGACCAAGTTTCCCTGTTCCAAC

ghghhhhgcghh_hhhhhhhhhhhhhhhhghhhhhghhhhhh NM:i:0 NH:i:4 CC:Z:Chr10CP:i:1057787HWI-ST155_0544:7:61:11040:141129#0 0 Chr1 2956 1 42M *

0 0 CTTTTCTGGCGTAACTTGGTTCCCTTTAGTTTGGAACAGATAhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhghhchfhhhh NM:i:0 NH:i:3 CC:Z:Chr10CP:i:1056913HWI-ST155_0544:7:8:11214:130793#0 0 Chr1 3645 1 42M *

0 0 CAAATAGTGGTTTGAAACCTATCAATCAAGTCACTTTCAAGTgggggggggggggggggeggdffffggggggggggggdggfc NM:i:0 NH:i:3 CC:Z:Chr10CP:i:1056224

• BAM – Binary version of SAM– Machine readable, smaller more compact version– More commonly used by viewers and analysis programs

Page 6: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

Reference guided RNAseq alignment

• You have a sequenced genome and a file with the gene model annotations

• Genome reference file is usually a fasta file with each chromosome as a separate entry

• GTF = gene transcript file• GFF = gene feature file (GFF3)• The GTF/GFF chromosome name must match the

chromosome name in the genome reference file

• In this case we are using the human hg19 genome build and the Refseq transcript file from 05-07-2015

Page 7: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

RNAseq alignment

Page 8: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

RPKM (FPKM)

• Reads (fragments) per Kilobase Per Million

RPKM =raw number of reads

exon lengthX

1,000,000

Number of reads mapped in the sample

• In RNA-Seq, the relative expression of a transcript is proportional to the number of cDNA fragments that originate from it. However: • The number of fragments is biased towards larger genes• Total number of fragments is related to total library depth

• Use of FPKM/RPKM normalizes for gene size and library depth

Page 9: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

Alternatives to FPKM

• Raw read counts – Instead of calculating FPKM, simply assign

reads/fragments to a defined set of genes/transcripts and determine “raw counts”

• HTSeq (htseq-count)– Python code that converts aligned reads to counts– You give it an alignment file and the associated transcript

file and it will output a list of counts by feature

• FeatureCounts (part of Subread package)– BAM file + GTF file => # reads/feature

• PGS also reports reads/feature– We can try differential expression analysis with the RPKM

and the reads file to explore the different outcomes

Page 10: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

How to quantify reads per feature?

Page 11: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

Counts vs FPKM(RPKM)?

• Count files can be analyzed by a number of different R programs– EdgeR– DESeq2– NOISeq

• More robust statistical methods for differential expression

• Accommodates more sophisticated experimental designs with appropriate statistical tests

Page 12: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

Gene expression differences

• Once you have your data in RPKM or counts format, you can use a variety of tools to compare the different conditions and determine what genes are differentially expressed

• Authors used Cufflinks/CuffDiff

Page 13: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

How does cuffdiff work?• The variability in fragment count for each

gene across replicates is modeled.• Fragment count for each isoform is estimated

in each replicate along with a measure of uncertainty in this estimate arising from ambiguously mapped reads– transcripts with more shared exons and few

uniquely assigned fragments will have greater uncertainty

• The algorithm combines estimates of uncertainty and cross-replicate variability under a negative binomial model of fragment count variability to estimate count variances for each transcript in each library

• These variance estimates are used during statistical testing to report significantly differentially expressed genes and transcripts.

More replicates = more confidence

Page 14: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

Multiple approaches advisable

Page 15: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

Multiple testing correction

• As more attributes are compared, it becomes more likely that the treatment and control groups will appear to differ on at least one attribute by random chance alone.

• Well known from array studies– 10,000s genes/transcripts– 100,000s exons

• With RNA-seq, more of a problem than ever– All the complexity of the transcriptome– Almost infinite number of potential features

• Genes, transcripts, exons, juntions, retained introns, microRNAs, lncRNAs, etc, etc

• Bioconductor multtest– http://www.bioconductor.org/packages/release/bioc/html/multtest.html

Page 16: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

Lessons learned from microarray days

• Hansen et al. “Sequencing Technology Does Not Eliminate Biological Variability.” Nature Biotechnology 29, no. 7 (2011): 572–573.

• Power analysis for RNA-seq experiments– http://euler.bc.edu/marthlab/scotty/scotty.php

• RNA-seq need for biological replicates– http://www.biostars.org/p/1161/

• RNA-seq study design– http://www.biostars.org/p/68885/

Page 17: RNAseq analyses -- methods June 7, 2015. Module 3 – Expression and Differential Expression Lecture Objectives Terminology for RNA-seq Steps in RNAseq

Module 3 – Expression and Differential Expression

Today in computer lab

• Analyze the BAM files downloaded from Google drive• Do QA/Qc• Do mRNA quantification• Do comparisons between groups

• The analysis will take several hours…plan accordingly