differential gene expression analysis using rna-seq · 2017. 8. 16. · complex & specialized...

31
Di erential gene expression analysis using RNA-seq Applied Bioinformatics Core, August 2017 https://abc.med.cornell.edu/ Friederike Dündar with Luce Skrabanek & Ceyda Durmaz

Upload: others

Post on 27-Aug-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Differential gene expression analysis using RNA-seq Applied Bioinformatics Core, August 2017

https://abc.med.cornell.edu/

Friederike Dündar with Luce Skrabanek & Ceyda Durmaz

Page 2: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Day 1: Introduction into high-throughput sequencing [many general concepts!]

1.  RNA isolation & library preparation

2.  Illumina’s sequencing by synthesis

3.  raw sequencing reads • download

• quality control

4.  experimental design

Page 3: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

RNA-seq is popular, but still developing

Reuter et al. ( 2015). Mol Cell. Goodwin, McPherson & McCombie (2016). Nat Gen, 17(6), 333–351

“RNA%seq)is)not$a$mature$technology.)It$is$undergoing$rapid$evolution$of)biochemistry)of)sample)preparation;)of)sequencing)platforms;)of)computational)pipelines;)and)of$subsequent$analysis$methods$that$include$statistical$treatments)and)transcript)model)building.)“) ENCODE&consortium&

Page 4: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

“Analysis paralysis” •  basically no generally

accepted standard reference

• myriad tools ! highly complex & specialized “pipelines”

“The (…) flexibility and seemingly infinite set of options (…) have

hindered its path to the clinic. (…) The fixed nature of probe sets with

microarrays or qRT-PCR offer an accelerated path (…) without the lure

of the latest and newest analysis methods.”

Byron et al., 2016

Byron et al. Nat Rev Genetics (2016)

Page 5: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

What to expect from the class

Sample type & quality

Sequencing •  Read length •  PE vs. SR •  Sequencing errors

Experimental design •  Controls •  No. of replicates •  Randomization

Library preparation •  Poly-A enrichment

vs. ribo minus •  Strand information

Bioinformatics •  Aligner •  Normalization •  DE analysis strategy

•  Expression quantification •  Alternative splicing •  De novo assembly needed •  mRNAs, small RNAs •  ….

Biological question

NOT COVERED: •  novel transcript

discovery •  transcriptome

assembly •  alternative splicing

analysis (see the course notes for

references to useful reviews)

Page 6: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

RNA-seq workflow overview

Sequencing

•  Fragmentation•  mRNA enrichment•  Library preparation

Bioinformatics

•  Cluster generation•  Sequencing by synthesis•  Image acquisition

Total RNA extractionRNA

cDNA with adapters

cells

fragments

Page 7: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Quality control of RNA extraction

28S:18S ratio

avoid degraded RNA junk

Page 8: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

RNA-seq library preparation

double-stranded cDNA synthesis followed by the same steps asdone for DNA-seq, i.e. end-repair, ligation of dsDNA adapters, andPCR amplification (Fig. 2). A major drawback of this method isthat strand information is not preserved. For this reason, manydifferent methods have been developed for strand-specific RNA-seq, which fall into two main classes. One class of methods relieson marking one strand by chemical modification. These modifica-tion methods essentially follow the standard protocol with theexception of these marking steps. A well-known example is thedUTP second strand marking method, in which dUTP is incorpo-rated in the second strand, preventing this strand from beingamplified by PCR, thus leading to the exclusive amplification ofthe first strand [22]. The second class relies on attaching differentadapters in a known orientation relative to the 5' and 3' ends ofthe original mRNA. A major representative of this class is theIllumina RNA ligation method, in which adapters are ligated tothe RNA in a sequential manner. Other methods of this class are

the “not so random” (NSR) method, which relies on first andsecond-strand cDNA synthesis from degenerate primers tailedwith adapter sequences [23] or the SMART method, in whichreverse transcription from an adapter-tailed primer is followed bytemplate switching at the 5' end of the RNA template. SMART isbased on the fact that reverse transcriptases can add three non-templated C residues to which an oligonuceotide with threeterminal G's can bind; this oligonucleotide will then serve as atemplate for continued cDNA synthesis [24]. A variant is theSMART-RNA ligation (hybrid) method, in which first strand cDNAsynthesis is preceded by 3' adapter ligation as in the RNA ligationmethod.

RNA-seq protocols are technically more challenging than DNA-seq protocols and are often biased procedures. Common types ofbias include low complexity (many reads with the same startingpoint), uneven coverage across different regions of transcriptionunits, and antisense artifacts in the case of stranded libraries.

3’ adapter ligation

random priming and reverse transcription

second strand synthesis

PCR

end repair, A-addition, adapter ligation

PCR

UUU

UU UU

reverse transcription

PCR

rRNA depletion/mRNA enrichment

fragmentation

5’ adapter ligation

end repair, A-addition, adapter ligation

U U

Fig. 2 – The most common RNA-seq protocols fall in three main classes. (A) Classical Illumina protocol. Random-primed double-stranded cDNA synthesis is followed by adapter ligation and PCR. (B) One class of strand-specific methods relies on marking onestrand by chemical modification. The dUTP second strand marking method follows basically the same procedure as the classicalprotocol except that dUTP is incorporated during second strand cDNA synthesis, preventing this strand from being amplified byPCR. Most current transcriptome library preparation kits follow the dUTP method. (C) The second class of strand-specific methodsrelies on attaching different adapters in a known orientation relative to the 5' and 3'ends of the RNA transcript. The Illuminaligation method is a well-know example of this class and is based on sequential ligation of two different adapters. Most currentsmall RNA library preparation kits follow the RNA ligation method.

E X P E R I M E N T A L C E L L R E S E A R C H 3 2 2 ( 2 0 1 4 ) 1 2 – 2 016

classical Illumina protocol (unstranded)

Van Dijk et al. (2014). Experimental Cell Research, 322(1), 12–20. doi:10.1016/j.yexcr.2014.01.008

sequential ligation of two different adapters

RNA extraction poly(A) enrichment or

ribo-depletion

dUTP stranded library preparation

QC!

Page 9: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

RNA-seq workflow overview

Sequencing

Total RNA extraction

RNA

cDNA with adapters

cells

fragments

flowcell with primers

http://informatics.fas.harvard.edu/test-tutorial-page/

Page 10: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Cluster generation

bridge amplification denaturation cluster generation removal of complementary

strands ! identical fragment copies remain

http://informatics.fas.harvard.edu/test-tutorial-page/

Page 11: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Sequencing by synthesis

5"

Illumina Sequencing Workflow

1.  extend 1st base 2.  read 3.  deblock

repeat for 50 – 100 bp generate base calls

Image from Illumina

labelled dNTP

Page 12: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Typical biases of Illumina sequencing •  sequencing errors •  miscalled bases •  PCR artifacts (library preparation)

•  duplicates (due to low amounts of starting material) •  length bias •  GC bias

Figure from Love et al. (2016). Nat Biotech, 34(12). More details & refs in course notes (esp. Table 6).

RNA-seq-specific

sample-specific problems!

Page 13: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

General sources of biases(not inherently sample-specific)

inclusion of multi-mapped reads exclusion of multi-mapped reads

•  issues with the reference •  CNV •  mappability

•  inappropriate data processing

Page 14: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

RAW SEQUENCING READS Let the data wrangling begin!

Page 15: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Images

Raw reads

Aligned reads

Read count table

Normalized read count table

List of fold changes & statistical values

Downstream analyses on DE genes

FASTQC

Base calling & demultiplexing

Mapping

Counting

DE test & multiple testing correction

Normalizing

Filtering

.tif

.fastq

Bioinformatics workflow of RNA-seq analysis

.sam/.bam

.txt

.Robj

.Robj, .txt

Raw reads

Bustard/RTA/OLB, CASAVA

STAR

featureCounts

DESeq2, edgeR

DESeq2, edgeR, limma

Customized scripts

Page 16: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Where are all the reads? International Nucleotide Sequence Database Collaboration

Sequence Read

Archive GenBank DDBJ

ENA

http://www.ncbi.nlm.nih.gov/genbank/ http://www.ddbj.nig.ac.jp/intro-e.html

https://www.ebi.ac.uk/ena/

The SRA is the main repository for publicly available DNA and RNA sequencing data of which three instances are maintained world-wide.

Page 17: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Let’s download!

• We will work with a data set submitted by Gierlinski et al.

•  they deposited the sequence files with SRA – we will retrieve it via ENA (https://www.ebi.ac.uk/ena/)

•  accession number: ERP004763 ls

mkdir

wget

cut

grep

awk

Course notes @ https://chagall.med.cornell.edu

See Section 2 (Raw Data) for download instructions etc.

Page 18: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

FASTQ file format = FASTA + quality scores

1 read " 4 lines!

2.2 Storing sequencing reads: FASTQ format

Example data Throughout the course, we will be working with sequencing reads from the most compre-hensive RNA-seq dataset to date that contains mRNA from 48 replicates of two S. cerevisiae populations:wildtype and snf2 knock-out mutants (Gierlinski et al., 2015; Schurch et al., 2015). All 96 samples weresequenced on one flowcell (Illumina HiSeq 2000); each sample was distributed over seven lanes, which meansthat there are seven technical replicates per sample. The accession number for the entire data set (consistingof 7 x 2 x 48 raw read files) is ERP004763.

? Use the information from the file ERP004763 sample mapping.tsv to download all FASTQ filesrelated to the biological replicates no. 1 of sample type “SNF2” as well as of sample type “WT”.Try to do it via the command line and make sure to create two folders (e.g., SNF2 rep1 andWT rep1) of which each should contain seven FASTQ files in the end.

2.2 Storing sequencing reads: FASTQ format

Currently, raw reads are most commonly stored as FASTQ files. However, details of the file formats may varywidely depending on the sequencing platform, the lab that released the data, or the data repository. Fora more comprehensive overview of possible file formats of raw sequencing data, see the NCBI’s file formatguide: https://www.ncbi.nlm.nih.gov/books/NBK242622/.

The FASTQ file format was derived from the simple text format for nucleic acid or protein sequences, FASTA.FASTQ bundles the sequence of every single read produced during a sequencing run together with the qualityscores. FASTQ files are uncompressed and quite large because they contain the following information for everysingle sequencing read:

1. @ followed by the read ID and possibly information about the sequencing run

2. sequenced bases

3. + (perhaps followed by the read ID again, or some other description)

4. quality scores for each base of the sequence (ASCII-encoded, see below)

Again: be aware that this is not a strictly defined file format – variations do exist and may cause havoc!

Here’s a real-life example snippet of a FASTQ file downloaded from ENA:⌥ ⌅1 $ zcat ERR459145.fastq.gz | head2 @ERR459145 .1 DHKW5DQ1 :219: D0PT7ACXX :2:1101:1590:2149/13 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC4 +5 @7 <DBADDDBH?DHHI@DH >HHHEGHIIIGGIFFGIBFAAGAFHA ’5?B@D6 @ERR459145 .2 DHKW5DQ1 :219: D0PT7ACXX :2:1101:2652:2237/17 GCAGCATCGGCCTTTTGCTTCTCTTTGAAGGCAATGTCTTCAGGATCTAAG8 +9 @@;BDDEFGHHHHIIIGBHHEHCCHGCGIGGHIGHGIGIIGHIIAHIIIGI

10 @ERR459145 .3 DHKW5DQ1 :219: D0PT7ACXX :2:1101:3245:2163/111 TGCATCTGCATGATCTCAACCATGTCTAAATCCAAATTGTCAGCCTGCGCG⌃ ⇧

! For paired-end (PE) sequencing runs, there will always be two FASTQ files – one for the forwardreads, one for the backward reads.Once you have downloaded the files for a PE run, make sure you understand how the origin ofeach read (forward or reverse read) is encoded in the read name information as some downstreamanalysis tools may require you to combine the two files into one.

? 1. Count the number of reads stored in a FASTQ file.

2. Extract just the quality scores of the first 10 reads of a FASTQ file.

3. Concatenate the two FASTQ files of a PE run.

c� 2015 Applied Bioinformatics Core | Weill Cornell Medical College Page 11 of 66

1 2 3 4

1.  @Read ID and sequencing run information 2.  sequence 3.  + (additional description possible) 4.  quality scores

Page 19: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Base quality score

2.2 Storing sequencing reads: FASTQ format

Example data Throughout the course, we will be working with sequencing reads from the most compre-hensive RNA-seq dataset to date that contains mRNA from 48 replicates of two S. cerevisiae populations:wildtype and snf2 knock-out mutants (Gierlinski et al., 2015; Schurch et al., 2015). All 96 samples weresequenced on one flowcell (Illumina HiSeq 2000); each sample was distributed over seven lanes, which meansthat there are seven technical replicates per sample. The accession number for the entire data set (consistingof 7 x 2 x 48 raw read files) is ERP004763.

? Use the information from the file ERP004763 sample mapping.tsv to download all FASTQ filesrelated to the biological replicates no. 1 of sample type “SNF2” as well as of sample type “WT”.Try to do it via the command line and make sure to create two folders (e.g., SNF2 rep1 andWT rep1) of which each should contain seven FASTQ files in the end.

2.2 Storing sequencing reads: FASTQ format

Currently, raw reads are most commonly stored as FASTQ files. However, details of the file formats may varywidely depending on the sequencing platform, the lab that released the data, or the data repository. Fora more comprehensive overview of possible file formats of raw sequencing data, see the NCBI’s file formatguide: https://www.ncbi.nlm.nih.gov/books/NBK242622/.

The FASTQ file format was derived from the simple text format for nucleic acid or protein sequences, FASTA.FASTQ bundles the sequence of every single read produced during a sequencing run together with the qualityscores. FASTQ files are uncompressed and quite large because they contain the following information for everysingle sequencing read:

1. @ followed by the read ID and possibly information about the sequencing run

2. sequenced bases

3. + (perhaps followed by the read ID again, or some other description)

4. quality scores for each base of the sequence (ASCII-encoded, see below)

Again: be aware that this is not a strictly defined file format – variations do exist and may cause havoc!

Here’s a real-life example snippet of a FASTQ file downloaded from ENA:⌥ ⌅1 $ zcat ERR459145.fastq.gz | head2 @ERR459145 .1 DHKW5DQ1 :219: D0PT7ACXX :2:1101:1590:2149/13 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTTCAGC4 +5 @7 <DBADDDBH?DHHI@DH >HHHEGHIIIGGIFFGIBFAAGAFHA ’5?B@D6 @ERR459145 .2 DHKW5DQ1 :219: D0PT7ACXX :2:1101:2652:2237/17 GCAGCATCGGCCTTTTGCTTCTCTTTGAAGGCAATGTCTTCAGGATCTAAG8 +9 @@;BDDEFGHHHHIIIGBHHEHCCHGCGIGGHIGHGIGIIGHIIAHIIIGI

10 @ERR459145 .3 DHKW5DQ1 :219: D0PT7ACXX :2:1101:3245:2163/111 TGCATCTGCATGATCTCAACCATGTCTAAATCCAAATTGTCAGCCTGCGCG⌃ ⇧

! For paired-end (PE) sequencing runs, there will always be two FASTQ files – one for the forwardreads, one for the backward reads.Once you have downloaded the files for a PE run, make sure you understand how the origin ofeach read (forward or reverse read) is encoded in the read name information as some downstreamanalysis tools may require you to combine the two files into one.

? 1. Count the number of reads stored in a FASTQ file.

2. Extract just the quality scores of the first 10 reads of a FASTQ file.

3. Concatenate the two FASTQ files of a PE run.

c� 2015 Applied Bioinformatics Core | Weill Cornell Medical College Page 11 of 66

base error probability p,

e.g. 10e-4!

Phred score, e.g.: 40

-10 x log10(p)

“FASTQ score”, e.g.: (

turn score into ASCII symbol

http://www.ascii-code.com/

Page 20: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Base quality scores

Indicator.[6] The Illumina manual[7] (page 30) states the following: If a read ends witha segment of mostly low quality (Q15 or below), then all of the quality values in thesegment are replaced with a value of 2 (encoded as the letter B in Illumina'stext-based encoding of quality scores)... This Q2 indicator does not predict a specificerror rate, but rather indicates that a specific final portion of the read should not beused in further analyses. Also, the quality score encoded as "B" letter may occurinternally within reads at least as late as pipeline version 1.6, as shown in thefollowing example:

@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTTGAGATTTGTTGGGGGAGACATTTTTGTGATTGCCTTGAT+HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1efcfffffcfeefffcffffffddf`feed]`]_Ba_^__[YBBBBBBBBBBRTT\]][]dddd`ddd^dddadd^BBBBBBBBBBBBBBBBBBBBBBBB

An alternative interpretation of this ASCII encoding has been proposed.[8] Also, in Illuminaruns using PhiX controls, the character 'B' was observed to represent an "unknown qualityscore". The error rate of 'B' reads was roughly 3 phred scores lower the mean observedscore of a given run.

Starting in Illumina 1.8, the quality scores have basically returned to the use of theSanger format (Phred+33).

For raw reads, the range of scores will depend on the technology and the base caller used,but will typically be up to 41 for recent Illumina chemistry. Since the maximum observedquality score was previously only 40, various scripts and tools break when they encounterdata with quality values larger than 40. For processed reads, scores may be even higher.For example, quality values of 45 are observed in reads from Illumina's Long ReadSequencing Service (previously Moleculo).

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS..................................................... ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...................... ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII...................... .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ | | | | | | 33 59 64 73 104 126 0........................26...31.......40 -5....0........9.............................40 0........9.............................40 3.....9.............................40 0.2......................26...31........41

S - Sanger Phred+33, raw reads typically (0, 40)X - Solexa Solexa+64, raw reads typically (-5, 40)I - Illumina 1.3+ Phred+64, raw reads typically (0, 40)J - Illumina 1.5+ Phred+64, raw reads typically (3, 40) with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) (Note: See discussion above).L - Illumina 1.8+ Phred+33, raw reads typically (0, 41)

Color space

For SOLiD data, the sequence is in color space, except the first position. The quality valuesare those of the Sanger format. Alignment tools differ in their preferred version of thequality values: some include a quality score (set to 0, i.e. '!') for the leading nucleotide,others do not. The sequence read archive includes this quality score.

FASTQ format - Wikipedia, the free encyclopedia

5 of 8

also see Table 2 in the course notes image from https://en.wikipedia.org/wiki/FASTQ_format

•  each base has a certain error probability (p) •  Phred score = -10 x log10(p)•  Phred scores are ASCII-encoded, e.g., “!” COULD represent Phred score 33

Page 21: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Quality control of raw reads: FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. The main functions of FastQC are: •  Import of data from BAM, SAM or FastQ files (any variant) •  Providing a quick overview to tell you in which areas there may be

problems •  Summary graphs and tables to quickly assess your data •  Export of results to an HTML based permanent report •  Offline operation to allow automated generation of reports without

running the interactive application

$ mat/software/FastQC/fastqc

$ mat/software/anaconda2/bin/multiqc

not specific for RNA-seq data!

Page 22: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

EXPERIMENTAL DESIGN How to avoid spurious signals and drowning in noise

Page 23: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

How deep is deep enough?

Goals that require more, longer, and possibly paired-end reads: •  quantification of lowly expressed genes •  identification of genes with small changes between conditions •  investigation of alternative splicing/isoform quantification •  identification of novel transcripts, chimeric transcripts •  de novo transcriptome assembly

for DGE (logFC~ 2) in mammals: 20 – 50 mio SR, 75 bp

Page 24: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Why do we need replicates?

“Samples are our windows to the population, and their statistics are used to estimate those of the population.” Martin Krzywinski & Naomi Altman

Goal: Identify differences in expression for every gene.

…and “differences” should preferably be due to our experiment, not noise!

doi:10.1038/nmeth.2613

Page 25: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Invest in replicates!

•  recommended: 6 biological replicates per condition for DGE of strongly changing genes (logFC >= 2) [based on insights from the fairly simple yeast transcriptome]

6

Outlier fraction

The poor correlations shown by some replicates are the result of a small proportion of genes with atypical read counts. These outliers can be identified by comparing each gene’s expression in an individual replicate with the trimmed mean across all repli-cates. Specifically, the 𝑛𝑡 largest and smallest values are trimmed from the set of rep-licates for a gene before calculating the mean (�̅�𝑔;𝑛𝑡) and standard deviation (𝑠𝑔;𝑛𝑡). Genes are then identified as outliers if |𝑥𝑔𝑖 − �̅�𝑔;𝑛𝑡| > 𝑛𝑠𝑠𝑔;𝑛𝑡 , where 𝑛𝑠 is a constant. Fig. 2b shows the fraction 𝑓𝑖 of all genes identified as outliers for each replicate 𝑖, for 𝑛𝑡 = 3 and 𝑛𝑠 = 5. As expected, the anomalous replicates with high outlier fraction in Fig. 2b correspond well with the poorly correlating replicates in Fig. 2a. Increasing 𝑛𝑡 and/or decreasing 𝑛𝑠 enhances the outlier fraction in replicates already identified as anomalous; for example, reducing the standard deviation limit to 𝑛𝑠 = 3 boosts the outlier fraction in these replicates by a factor ~2–3.

Gene read depth profiles

The atypical total read counts of outlier genes are, in our case, the result of an atypi-cal, strongly non-uniform, read depth gene profile. An example of the difference of the read depth profiles between clean and “bad” replicates is shown in Fig. 3 for the gene YHR215W. The distribution of reads from the example “bad” replicate (WT rep-licate 21, red line) is much less uniform than the mean read depth from the other “clean” replicates (black line) with distinct peaks in the distributions that are 50 bp long. These suspicious features are found universally in all outlier genes in “bad” rep-licates and can be also seen in some (but not all) non-outlier genes in these replicates. It is likely that the cause of this atypical read distribution is uneven priming during the PCR amplification step of the library preparation, prior to the sequencing. The level of gene non-uniformity in each replicate can be quantified with a reduced chi-squared statistic for each gene in each replicate, defined as

Fig. 3. Read depth profiles of YHR215W (PHO12). The black line indicates the mean read counts from all “clean” WT replicates for a given genomic position in the gene YHR215W (the set of “clean” replicates is defined in Section 3.1). The grey lines show the mean read depth plus/minus one standard deviation. The red line illustrates the read depth profile from a single example “bad” replicate (WT replicate 21). The block diagram at the bottom shows the simple gene structure of YHR215W.

log2

Cou

nts

10.16

10.18

10.20

10.22

10.24

10.26

condition 1 condition 2

Gene X

Gierliński et al. (2015). Bioinformatics, 31(22), 3625–3630. & Schurch et al. (2016) RNA.

Page 26: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Replicates Technical replicates

library prep

sequencing lane

RNA extraction

library prep sequencing lane

sequencing lane sequencing lane

RNA from an independent growth of cells/tissue

RNA extraction

RNA extraction

Biological replicates

sequencing lane library prep

sequencing lane

sequencing lane sequencing lane

also see course notes and Blainey et al. (2014) Nature Methods, 1(9) 879–880.

Page 27: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Gilad & Mizrahi-Man (2015). F1000Research 4:121

“Once$we$accounted$for$the$batch$effect$[i.e.,)mouse)and)human)samples)being)sequenced)on)two)different)machines])(…),)the)comparative)gene)expression)data)no)longer)clustered)by)species,)and)instead,)we)observed)a$clear$tendency$for$clustering$by$tissue.”))

Batch effects can happen everywhere

“Overall,)our)results)indicate)that)there)is)considerable$RNA$expression$diversity$between$humans$and$mice,)well)beyond)what)was)described)previously,)likely)reIlecting)the)fundamental)physiological)differences)between)these)two)organisms.)“)

Lin, Lin, and Snyder (2014). PNAS 111:48

Page 28: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

ENCODE’s* study design was not optimal

* not just ENCODE: see e.g. Leek et al. (2010) Nat Rev Gen 11(10) 733-739 or Jaffe & Irizarry (2014) Genome Biol 15(R31) 1–9

A very good read (including the reviews and comments that discuss many scientific as well as ethical issues: https://f1000research.com/articles/4-121/v1

Tissue was confounded with (at least): •  sequencer •  sex •  age •  tissue handling

human data: deceased organ donors mouse data: 10-week-old littermates

Page 29: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Avoiding bias

Block what you can, randomize what you cannot.

Completely randomized design

Restricted randomized design

WEIGHT �

Blocked & randomized design

What factors are of interest? Which ones might introduce noise? Which nuisance factors do you absolutely need to account for?

Krzywinski & Altman (2014) Nature Methods 11(7)

Page 30: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Auer & Doerge (2010). Genetics, 185(2), 405–16.

Make sure the sequencing core multiplexes all samples!

Typical RNA-seqset-up •  keep the technical nuisance

factors (harvest date, RNA extraction kit, sequencing date…) to a minimum

•  cover only as much of the biological variation as needed (just keep possible restrictions about your conclusions in mind for later)

Page 31: Differential gene expression analysis using RNA-seq · 2017. 8. 16. · complex & specialized “pipelines” “The (…) flexibility and seemingly infinite set of options (…)

Summary Day 1 • RNA-seq analysis is not a completely solved issue – but

DE analysis on a gene level is decently mature and the field seems to gravitate towards some sort of standard

•  no analysis tool can enforce (or replace!) common sense and knowledge about the biology behind the experiment

•  crap in, crap out • more replicates are often better investments than more

reads •  FastQC and multiQC are great tools to detect possible

technical nuisance factors