
Computational Methods for Analysis of Single Cell RNA-Seq Data

Ion Măndoiu
Computer Science & Engineering Department
University of Connecticut
ion@engr.uconn.edu

Outline

• Intro to RNA-Seq
  – Next-generation sequencing technologies
  – RNA-Seq applications
  – Analysis challenges for single cell data
• Typical analysis pipeline for single-cell RNA-Seq
  – Primary analysis: reads QC, mapping, and quantification
  – Secondary analysis: cells QC, normalization, clustering, and differential expression
  – Tertiary analysis: functional annotation

• Conclusions

2nd Gen. Sequencing: Illumina

2nd Gen. Sequencing: ION Torrent

• ION Torrent uses a high-density array of micro-machined wells to perform this biochemical process in a massively parallel way
• Each well holds a different DNA template generated by emulsion PCR. Beneath the wells is an ion-sensitive layer and beneath that a proprietary ION sensor
• The sequencer sequentially floods the chip with one nucleotide after another; in each cycle the voltage change recorded at a well is proportional to the number of incorporated bases
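Because the recorded voltage change is proportional to the number of incorporated bases, a flow signal can in principle be decoded by rounding it to a homopolymer length. The sketch below illustrates that idea only; the flow order, the normalized signal values, and the function name `call_bases` are hypothetical, and real base callers model noise far more carefully.

```python
def call_bases(flow_order, signals):
    """Decode per-flow signals into a base sequence.

    Each signal is assumed to be normalized so that 1.0 corresponds to
    one incorporated base; rounding gives the homopolymer length.
    """
    bases = []
    for nucleotide, signal in zip(flow_order, signals):
        run_length = max(0, round(signal))  # 0 means no incorporation in this flow
        bases.append(nucleotide * run_length)
    return "".join(bases)


flow_order = "TACGTACG"  # hypothetical order in which nucleotides flood the chip
signals = [1.04, 0.02, 1.96, 0.10, 0.05, 0.97, 0.03, 3.10]  # made-up values
sequence = call_bases(flow_order, signals)
print(sequence)  # TCCAGGG
```

Long homopolymers are the weak point of this scheme: the gap between a signal of 6 and 7 is much harder to resolve than the gap between 0 and 1.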

3rd Gen. Sequencing: PacBio SMRT

3rd Gen. Sequencing: Oxford Nanopore

http://www.technologyreview.com/article/427677/nanopore-sequencing/

Standard (Bulk) RNA-Seq

Reverse transcribe into cDNA & shatter into fragments

Sequence fragment ends


Map reads

Gene expression quantification

Isoform expression quantification

Transcriptome reconstruction


Alternative Splicing

Pal S. et al., Genome Research, June 2011

Transcriptome Reconstruction

Common Approaches

• De novo (genome independent reconstruction)
  – Trinity, Oases, TransABySS
    • de Bruijn k-mer graph
• Genome guided
  – Scripture
    • Reports “all” transcripts
  – Cufflinks, IsoLasso, SLIDE
    • Minimize set of transcripts explaining reads
• Annotation guided
  – RABT
    • Simulate reads from annotated transcripts
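As a rough illustration of the de Bruijn k-mer graph used by de novo assemblers, the sketch below builds the graph from a handful of reads: nodes are (k-1)-mers and each k-mer contributes one edge. The reads, the value of k, and the function name are made up for the example; real assemblers add error correction, coverage tracking, and graph simplification.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph


reads = ["ACGTC", "CGTCA", "GTCAG"]  # toy overlapping reads
graph = de_bruijn_graph(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Walking the single path AC -> CG -> GT -> TC -> CA -> AG spells out ACGTCAG, the sequence the three reads were drawn from.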

[Figure: exons 1–7 and four candidate transcripts]
t1: exons 1 2 3 4 5 6 7
t2: exons 1 3 4 5 6 7
t3: exons 1 2 3 4 5 7
t4: exons 1 3 4 5 7

Genome-Guided Transcriptome Reconstruction – Multiple Solutions

Which Solution is Most Likely?
• TRIP: select smallest set of transcripts with good statistical fit between the fragment length distributions
  – empirically determined during library preparation
  – implied by “mapping” read pairs


TRIP Results

• 100x coverage, 2x100bp pe reads; annotations for genes

[Figure: charts reporting TP, Sens, PPV, and F-Score]

Why Single Cell RNA-Seq?

Macaulay and Voet, PLOS Genetics, 2014

Challenges

• Low RNA input + low RT efficiency
  – Especially problematic for low expression genes

Macaulay and Voet, PLOS Genetics, 2014

Challenges

• Stochastic effects (e.g., transcriptional bursting) hard to distinguish from regulated transcriptional heterogeneity

• PCR amplification bias results in distortion of transcript abundances

SMARTer RNA-Seq Protocol

Islam et al. http://www.nature.com/nmeth/journal/v11/n2/full/nmeth.2772.html

Correcting PCR Bias using UMIs (STRT-C1)
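The core idea of UMI-based correction can be sketched in a few lines: reads that map to the same gene and carry the same UMI are treated as PCR copies of one original molecule, so only distinct UMIs are counted. The gene names, UMI sequences, and function name below are hypothetical, and the sketch ignores sequencing errors within UMIs, which practical pipelines have to handle.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count distinct UMIs per gene.

    `reads` is an iterable of (gene, umi) pairs, one per mapped read.
    """
    umis_per_gene = defaultdict(set)
    for gene, umi in reads:
        umis_per_gene[gene].add(umi)  # duplicates collapse inside the set
    return {gene: len(umis) for gene, umis in umis_per_gene.items()}


reads = [
    ("GeneA", "AACGT"), ("GeneA", "AACGT"), ("GeneA", "AACGT"),  # 3 PCR copies
    ("GeneA", "GGTCA"),
    ("GeneB", "TTAGC"),
]
molecule_counts = count_molecules(reads)
print(molecule_counts)  # {'GeneA': 2, 'GeneB': 1}
```

GeneA has four reads but only two molecules, so raw read counts would overstate its abundance relative to GeneB.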

Outline

• Intro to RNA-Seq
  – RNA-Seq applications
  – Analysis challenges for single cell data
• Typical analysis pipeline for single-cell RNA-Seq
  – Primary analysis: reads QC, mapping, and quantification
  – Secondary analysis: cells QC, normalization, clustering, and differential expression
  – Tertiary analysis: functional annotation

• Conclusions

Reads QC → Read mapping → Quantification → Cells QC → Normalization → Clustering → Differential expression → Functional analysis

[Figure: per-position read statistics for Lane 1, Lane 2, and Lane 3; x-axis: read position]

Tools to analyze and preprocess fastq files
• FASTX (http://hannonlab.cshl.edu/fastx_toolkit/)
  – Charts quality statistics
  – Filters sequences based on quality
  – Trims sequences based on quality
  – Collapses identical sequences into a single sequence
• PRINSEQ (http://prinseq.sourceforge.net/)
  – Generates read length and quality statistics
  – Filters reads based on length, quality, GC content and other criteria
  – Trims reads based on length/position or quality scores
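To make quality-based trimming concrete, here is a minimal sketch that removes low-quality bases from the 3' end of a read. It is not the algorithm of FASTX or PRINSEQ; the Phred+33 encoding, the threshold of 20, and the function names are assumptions chosen for the example.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Drop bases from the 3' end until one with quality >= min_quality is found."""
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# 'I' encodes Q40; '#' encodes Q2; '!' encodes Q0
seq, qual = trim_3prime("ACGTACGT", "IIIIII#!")
print(seq, qual)  # ACGTAC IIIIII
```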


http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png


RNA-Seq read mapping strategies:
– Ungapped mapping (with mismatches) to genome
  • Cannot align reads spanning exon-junctions
– Local alignment (Smith-Waterman) to genome
  • Very slow
– Spliced alignment to genome
  • Computationally harder than ungapped alignment, but much faster than local alignment
– Mapping on transcript libraries
  • Fastest, but cannot align reads from un-annotated transcripts
– Mapping on exon-exon junction libraries
  • Cannot align reads overlapping un-annotated exons
– Hybrid approaches
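The first and fourth strategies can be contrasted with a toy example: a read that spans an exon-exon junction has no ungapped placement on the genome, yet maps cleanly to the spliced transcript. The sequences below are invented, and the brute-force scan stands in for the indexed search a real mapper would use.

```python
def ungapped_map(read, reference, max_mismatches=1):
    """Return start positions where the read aligns without gaps
    with at most `max_mismatches` mismatches."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


exon1 = "ACGTTGCA"
intron = "GTAAGTCCCCTTTTAG"
exon2 = "TGGACCTA"
genome = exon1 + intron + exon2
transcript = exon1 + exon2  # annotated spliced transcript

junction_read = exon1[-5:] + exon2[:5]  # spans the exon-exon junction

genome_hits = ungapped_map(junction_read, genome)
transcript_hits = ungapped_map(junction_read, transcript)
print(genome_hits)      # []
print(transcript_hits)  # [3]
```

The transcript-library hit only exists because the transcript is annotated, which is exactly the limitation noted in the list above.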


Comparison of spliced read mapping tools


Kim et al. http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3317.html

• Cannot use raw read counts (why not?)


Islam et al. http://www.nature.com/nmeth/journal/v11/n2/full/nmeth.2772.html

• CPM = counts per million
  – Ignores multireads, so it underestimates expression of genes in large families
  – Does not normalize for gene length, so CPMs cannot be compared between genes
  – Comparing CPMs between samples assumes similar transcriptome size
• RPKM/FPKM = reads/fragments per kilobase per million
  – [Mortazavi et al. 08] Fractionally allocates multireads based on unique read estimates
  – Length for multi-isoform genes?
  – Comparing FPKM between samples assumes similar (weighted) transcriptome size
• TPM: transcripts per million
  – Still relative measure of expression, but comparable between samples
  – Most accurate estimation methods use multireads and isoform level estimation
• UMI counts
  – Absolute measure of expression?
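The three relative units differ only in how counts are scaled, which a small worked example makes visible. The counts and gene lengths below are invented, multireads are ignored, and a single length per gene is assumed (which sidesteps the multi-isoform length question raised above).

```python
def cpm(counts):
    """Counts per million mapped reads."""
    total = sum(counts.values())
    return {g: c * 1e6 / total for g, c in counts.items()}


def rpkm(counts, lengths):
    """Reads per kilobase of gene length per million mapped reads."""
    total = sum(counts.values())
    return {g: c * 1e9 / (total * lengths[g]) for g, c in counts.items()}


def tpm(counts, lengths):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = {g: c / lengths[g] for g, c in counts.items()}
    total_rate = sum(rates.values())
    return {g: r * 1e6 / total_rate for g, r in rates.items()}


counts = {"GeneA": 100, "GeneB": 300, "GeneC": 600}      # made-up read counts
lengths = {"GeneA": 1000, "GeneB": 3000, "GeneC": 2000}  # made-up lengths in bp

cpm_values = cpm(counts)
rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

GeneB has three times the CPM of GeneA, but after length normalization both have the same RPKM and TPM, because GeneB is also three times longer. TPM values always sum to one million, whereas the sum of RPKM values depends on the sample.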


Gene ambiguous reads

Isoform ambiguous reads

Expectation-maximization approach (IsoEM, RSEM)


EM Algorithm

1. Start with random transcript frequencies
2. Fractionally allocate reads to transcripts
3. Compute expected #reads for each transcript
4. Update transcript frequencies using maximum likelihood estimates
5. Repeat steps 2-4 until convergence
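The EM steps can be sketched on a toy instance as follows. This is a simplified illustration, not the IsoEM or RSEM implementation: transcript lengths, fragment length distributions, and base qualities are ignored, a fixed number of iterations replaces a convergence test, and a uniform start is used instead of a random one so the run is reproducible. All names are hypothetical.

```python
def em_frequencies(read_compat, transcripts, iterations=100):
    """Estimate transcript frequencies from read-transcript compatibilities.

    read_compat: list of non-empty sets; each set holds the transcripts
    one read is compatible with.
    """
    # Step 1: initial frequencies (uniform here, for reproducibility)
    freq = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(iterations):  # Step 5: repeat (fixed iteration count)
        expected = {t: 0.0 for t in transcripts}
        for compatible in read_compat:
            # Step 2: split the read among compatible transcripts
            weight = sum(freq[t] for t in compatible)
            for t in compatible:
                # Step 3: accumulate expected read counts
                expected[t] += freq[t] / weight
        # Step 4: maximum likelihood update of the frequencies
        total = sum(expected.values())
        freq = {t: expected[t] / total for t in transcripts}
    return freq


# 6 reads unique to t1, 2 unique to t2, 4 ambiguous between them
read_compat = [{"t1"}] * 6 + [{"t2"}] * 2 + [{"t1", "t2"}] * 4
freq = em_frequencies(read_compat, ["t1", "t2"])
print(round(freq["t1"], 6), round(freq["t2"], 6))  # 0.75 0.25
```

The ambiguous reads end up split 3:1, the same ratio as the unique reads, which is the fixed point of the update.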

Detected genes/cell -- main population


Detected genes/cell -- minor population

Detected genes/cell -- bi-modal distribution


Batch effects can be larger than biological effects, but can be corrected by normalization procedures

CPM & TPM datasets pre-quantile normalization

CPM & TPM datasets post-quantile normalization


Quantile normalization (Irizarry et al. 2002)
• Shifts CPM/FPKM/TPM values for each cell to match a reference distribution (e.g., distribution of means)
  – Highest value gets matched to highest value in reference
  – 2nd highest gets mapped to 2nd highest value in reference
  – And so on

Distribution of TPMs

Reference distribution
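A minimal sketch of quantile normalization, assuming the reference distribution is the mean of the sorted values across cells. The function name and the tiny two-cell, three-gene dataset are invented, and ties are broken by position rather than averaged, which is a simplification.

```python
def quantile_normalize(cells):
    """Quantile-normalize a list of equal-length expression vectors (one per cell)."""
    n_genes = len(cells[0])
    sorted_cells = [sorted(cell) for cell in cells]
    # Reference: mean of the k-th smallest values across cells, for each rank k
    reference = [sum(column) / len(cells) for column in zip(*sorted_cells)]

    normalized = []
    for cell in cells:
        # Gene indices ordered by increasing expression in this cell
        order = sorted(range(n_genes), key=lambda g: cell[g])
        result = [0.0] * n_genes
        for rank, gene in enumerate(order):
            result[gene] = reference[rank]  # k-th smallest gets k-th reference value
        normalized.append(result)
    return normalized


cells = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]  # rows: cells, columns: genes
normalized = quantile_normalize(cells)
print(normalized)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalization every cell contains exactly the same set of values; only their assignment to genes differs.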

Principal Component Analysis

• Linear transformation of the data:
  – 1st component = direction of max. variance
  – 2nd component = orthogonal on 1st, max. residual variance
• Used for dimensionality reduction (ignore high components)
  – Visualization for exploratory analysis
  – Feature selection
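The first principal component can be found as the dominant eigenvector of the covariance matrix. The sketch below uses plain power iteration to keep the example dependency-free; in practice a linear algebra library would be used. The data points and function name are made up, and the all-ones start vector is an assumption that fails only if it happens to be orthogonal to the first component.

```python
import math


def first_principal_component(data, iterations=200):
    """Return the unit vector along the direction of maximum variance."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d)
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # assumed start vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy data lying exactly on the line y = 2x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
component = first_principal_component(data)
print([round(x, 4) for x in component])  # [0.4472, 0.8944]
```

Here all variance lies along (1, 2), so a single component captures the data and the second, orthogonal component could be dropped without loss.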


What makes a good clustering?
• Homogeneity: Elements within a cluster are close to each other
• Separation: Elements in different clusters are further apart from each other

[Figure: bad clustering vs. good clustering]

Many clustering algorithms!

Algorithm: Parameters
• K-means: K = number of clusters
• Fuzzy c-means Clustering (FCM): K = number of clusters; d = degree of fuzziness
• Hierarchical Clustering (HCS): Metric = euclidean, seuclidean, cityblock, minkowski, chebychev, cosine, correlation, spearman; Method = average, centroid, complete, median, single
• EM Clustering: K = number of clusters; S = number of initial seeds; I = number of iterations
• SNN-Cliq: n = size of the nearest neighbor list; r = density threshold of quasi-cliques; m = threshold on the overlapping rate for merging

K-Means Clustering

• Goal: find K clusters minimizing the mean squared distance from data points to corresponding cluster centroids
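One common heuristic for this objective is Lloyd's algorithm, which alternates between assigning points to the nearest centroid and recomputing centroids. The sketch below is a minimal version under stated assumptions: initial centroids are passed in explicitly (rather than chosen at random) so the run is reproducible, and the points are invented. It converges to a local optimum, not necessarily the global one.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, initial_centroids, max_iterations=100):
    """Lloyd's algorithm starting from the given initial centroids."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # no change: converged
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (4.0, 4.5), (5.0, 5.0), (4.5, 4.0)]
assignment, centroids = k_means(points, [(1.0, 1.0), (5.0, 5.0)])
print(assignment)  # [0, 0, 0, 1, 1, 1]
```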

[Figure: K-means clustering iterations on a scatter plot; axis label: expression in condition 1]

Accuracy measures

• Purity: U = set of ground truth classes; V = set of the computed clusters; N = total # of objects in dataset
• Adjusted Rand Index (AR)
• Rand Index (RI): RI = (TP+TN)/(TP+FP+FN+TN)
• F1 Score: F1 Score = 2×TP/(2×TP+FP+FN)
• Mirkin’s index (MI): counts the number of disagreements in data pairs between two clusterings. It is the ratio of the number of disagreeing pairs to the total number of pairs. A lower value of Mirkin’s index indicates better clustering.
• Hubert’s index (HI): HI = RI – MI
• Corr: maximum weighted Pearson correlation between the average expression value of each class in the ground truth and each computed cluster
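Several of these measures can be computed from pair counts, as sketched below. Two assumptions are made because the slides do not spell them out: TP/FP/FN/TN follow the usual pair-counting convention (a pair of objects is a TP if it shares both cluster and class), and purity uses the standard definition, the fraction of objects that belong to the majority class of their cluster. The labels in the example are invented.

```python
from collections import Counter
from itertools import combinations


def pair_counts(truth, predicted):
    """Count object pairs by agreement between classes and clusters."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = predicted[i] == predicted[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def purity(truth, predicted):
    """Fraction of objects in the majority class of their cluster (assumed definition)."""
    clusters = {}
    for t, p in zip(truth, predicted):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(truth)


truth = ["a", "a", "a", "b", "b", "b"]
predicted = [0, 0, 1, 1, 1, 1]

tp, fp, fn, tn = pair_counts(truth, predicted)
total = tp + fp + fn + tn
rand_index = (tp + tn) / total
f1_score = 2 * tp / (2 * tp + fp + fn)
mirkin_index = (fp + fn) / total        # disagreeing pairs / all pairs
hubert_index = rand_index - mirkin_index
print(tp, fp, fn, tn)  # 4 3 2 6
```

For this example RI = 10/15, F1 = 8/13, MI = 5/15, HI = 5/15, and purity = 5/6.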


Accuracy comparison (Pollen et al. 2014, MiSeq)


Accuracy comparison (Pollen et al. 2014, HiSeq)


Accuracy comparison (Zeisel et al. 2015)


Tests for differential gene expression must take both fold change and statistical significance into account

[Figure: three example comparisons with FC = 2, FC = 2, and FC = 1.5]

• Many reliable DE methods for data with replicates
  – edgeR [Robinson et al., 2010]
  – DESeq [Anders et al., 2010]
• When no/few replicates are available, bootstrapping provides a robust alternative
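One way bootstrapping can stand in for replicates is to resample the reads of each sample with replacement and check how consistently a gene exceeds the minimum fold change. The sketch below is an illustration under that assumption and is not necessarily the procedure behind the results in these slides; the counts, the pseudocount of 1, the 200 bootstrap replicates, and all names are hypothetical.

```python
import random


def bootstrap_counts(counts, rng):
    """Resample one sample's reads with replacement; return new per-gene counts."""
    genes = list(counts)
    total = sum(counts.values())
    sampled = rng.choices(genes, weights=[counts[g] for g in genes], k=total)
    resampled = {g: 0 for g in genes}
    for g in sampled:
        resampled[g] += 1
    return resampled


def bootstrap_support(gene, counts1, counts2, min_fold_change=1.5,
                      n_bootstrap=200, seed=0):
    """Fraction of bootstrap pairs whose CPM fold change (either direction)
    is at least min_fold_change."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_bootstrap):
        b1 = bootstrap_counts(counts1, rng)
        b2 = bootstrap_counts(counts2, rng)
        # Pseudocount of 1 (an assumption) avoids division by zero
        cpm1 = (b1[gene] + 1) * 1e6 / sum(b1.values())
        cpm2 = (b2[gene] + 1) * 1e6 / sum(b2.values())
        if max(cpm1 / cpm2, cpm2 / cpm1) >= min_fold_change:
            hits += 1
    return hits / n_bootstrap


condition1 = {"GeneA": 400, "GeneB": 1000, "GeneC": 600}  # made-up counts
condition2 = {"GeneA": 100, "GeneB": 1000, "GeneC": 900}

support_a = bootstrap_support("GeneA", condition1, condition2)
support_b = bootstrap_support("GeneB", condition1, condition2)
```

A gene could then be called differentially expressed when its support exceeds a chosen threshold such as 0.95, which combines fold change and a measure of confidence in a single criterion.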


Sensitivity results on Illumina MCF-7 data with varying number of replicates and minimum fold change 1.5


[Figure: enrichment analysis workflow. Experimental data → gene expression table; a priori knowledge + existing experimental data → gene-set databases; both feed the enrichment test, which yields an enrichment table (e.g., Spindle 0.00001, Apoptosis 0.00025) used for interpretation & hypotheses]
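The slides do not say which enrichment test is used; a common choice is the one-sided hypergeometric test, sketched below as an assumption. It asks how likely it is to see at least the observed overlap between a selected gene list and a gene set if the list were drawn at random. The numbers in the example are invented.

```python
from math import comb


def hypergeometric_p_value(population, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn at random from
    `population` genes, of which `annotated` belong to the gene set."""
    upper = min(annotated, selected)
    favorable = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / comb(population, selected)


# 20 genes measured, 5 annotated to the gene set, 4 differentially
# expressed genes selected, 3 of which are in the gene set
p_value = hypergeometric_p_value(population=20, annotated=5, selected=4, overlap=3)
print(round(p_value, 4))  # 0.032
```

When many gene sets are tested, as in an enrichment table, the resulting p-values would normally be corrected for multiple testing.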


http://david.abcc.ncifcrf.gov/


http://www.genemania.org/

Outline

• Intro to RNA-Seq
  – RNA-Seq applications
  – Analysis challenges for single cell data
• Typical analysis pipeline for single-cell RNA-Seq
  – Primary analysis: reads QC, mapping, and quantification
  – Secondary analysis: cells QC, normalization, clustering, and differential expression
  – Tertiary analysis: functional annotation

• Conclusions

Conclusions
• The range of single-cell applications continues to expand, fueled by advances in microfluidics technology and library prep protocols
  – ATAC-Seq, GT-Seq, Methyl-Seq, …
• Primary analysis is compute intensive
  – Requires server/cluster/cloud + linux + scripting
  – Galaxy framework (https://usegalaxy.org/) provides web-based interface to many tools
• Most secondary/tertiary analyses can be done on PC/Mac using the R environment (some programming)
  – Many can be done using web-based tools and user-friendly apps (we’ll use JMP)

Conclusions
• Development of single-cell specific analysis methods critical for fully realizing the potential of the technology
  – Allele specific expression
  – Biomarker selection
  – Cell type assignment
  – Lineage reconstruction
  – Characterization of heterogeneity
• Joint analysis of bulk and single cell data still needed to get unbiased cell type frequencies
  – Can also identify and characterize cell types missed by current capture protocols

Single cells AND (not OR) computational deconvolution

Acknowledgements

Sahar Al Seesi
Marius Nicolae
Elham Sherafat
Craig Nelson
Adrian Caciula
Serghei Mangul
Yvette Temate Tiagueu
Alex Zelikovsky
Edward Hemphill
James Lindsay
