Post on 02-Jan-2016
Computational Methods for Analysis of Single Cell RNA-Seq Data
Ion Măndoiu
Computer Science & Engineering Department
University of Connecticut
ion@engr.uconn.edu
Outline
• Intro to RNA-Seq
– Next-generation sequencing technologies
– RNA-Seq applications
– Analysis challenges for single cell data
• Typical analysis pipeline for single-cell RNA-Seq
– Primary analysis: reads QC, mapping, and quantification
– Secondary analysis: cells QC, normalization, clustering, and differential expression
– Tertiary analysis: functional annotation
• Conclusions
2nd Gen. Sequencing: Illumina
2nd Gen. Sequencing: ION Torrent
• ION Torrent uses a high-density array of micro-machined wells to perform this biochemical process in a massively parallel way
• Each well holds a different DNA template generated by emulsion PCR. Beneath the wells is an ion-sensitive layer and beneath that a proprietary ION sensor
• The sequencer sequentially floods the chip with one nucleotide after another; in each cycle the voltage change recorded at a well is proportional to the number of incorporated bases
3rd Gen. Sequencing: PacBio SMRT
3rd Gen. Sequencing: Oxford Nanopore
http://www.technologyreview.com/article/427677/nanopore-sequencing/
Standard (Bulk) RNA-Seq
Figure: bulk RNA-Seq workflow
• Reverse transcribe into cDNA & shatter into fragments
• Sequence fragment ends
• Map reads
• Downstream analyses: gene expression quantification, isoform expression quantification, transcriptome reconstruction
Alternative Splicing
Pal S. et al., Genome Research, June 2011
Transcriptome Reconstruction
Common Approaches
• De novo (genome-independent reconstruction)
– Trinity, Oases, TransABySS
   • de Bruijn k-mer graph
• Genome guided
– Scripture
   • Reports “all” transcripts
– Cufflinks, IsoLasso, SLIDE
   • Minimize set of transcripts explaining reads
• Annotation guided
– RABT
   • Simulate reads from annotated transcripts
Genome-Guided Transcriptome Reconstruction – Multiple Solutions
Figure: splice graph over exons 1 to 7 with four candidate transcripts
t1: 1 2 3 4 5 6 7
t2: 1 3 4 5 6 7
t3: 1 2 3 4 5 7
t4: 1 3 4 5 7
Which Solution is Most Likely?
• TRIP: select the smallest set of transcripts with a good statistical fit between the fragment length distribution
– empirically determined during library preparation
– implied by “mapping” read pairs
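Figure: the same read pair mapped to two candidate transcripts (exons 1, 3 and exons 1, 2, 3; segment lengths of 200) implies different fragment lengths (300 vs. 500)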
TRIP Results
• 100x coverage, 2x100 bp paired-end reads; annotations for genes
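$$\mathrm{Sens} = \frac{TP}{TP + FN}, \qquad \mathrm{PPV} = \frac{TP}{TP + FP}, \qquad \mathrm{FScore} = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{Sens}}{\mathrm{PPV} + \mathrm{Sens}}$$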
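The three accuracy measures used for the TRIP results (sensitivity, PPV, and their harmonic mean, the F-score) can be computed from counts of true positives, false positives, and false negatives. The sketch below is illustrative only; the function name and the convention of returning 0 for an empty denominator are assumptions, not part of the slides.

```python
def accuracy_measures(tp: int, fp: int, fn: int) -> dict:
    """Sensitivity, PPV and F-score from transcript-level counts.

    tp: reconstructed transcripts that match an annotated transcript
    fp: reconstructed transcripts with no match
    fn: annotated transcripts that were not reconstructed
    Empty denominators yield 0.0 (an assumed convention).
    """
    sens = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    fscore = 2 * ppv * sens / (ppv + sens) if ppv + sens else 0.0
    return {"sens": sens, "ppv": ppv, "fscore": fscore}


if __name__ == "__main__":
    print(accuracy_measures(tp=80, fp=20, fn=20))
```

Because the F-score is a harmonic mean, it is pulled toward the lower of the two values, so a method cannot score well by reporting very many or very few transcripts.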
Why Single Cell RNA-Seq?
Macaulay and Voet, PLOS Genetics, 2014
Challenges
• Low RNA input + low RT efficiency
– Especially problematic for low expression genes
Macaulay and Voet, PLOS Genetics, 2014
Challenges
• Stochastic effects (e.g., transcriptional bursting) hard to distinguish from regulated transcriptional heterogeneity
• PCR amplification bias results in distortion of transcript abundances
SMARTer RNA-Seq Protocol
Islam et al. http://www.nature.com/nmeth/journal/v11/n2/full/nmeth.2772.html
Correcting PCR Bias using UMIs (STRT-C1)
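The idea behind UMI-based correction can be sketched as follows: reads that carry the same unique molecular identifier (UMI) and come from the same gene are treated as PCR copies of one original molecule and are counted once. This is a minimal sketch that assumes reads have already been assigned to genes and that ignores sequencing errors in the UMI; the function name and input format are hypothetical.

```python
from collections import defaultdict


def umi_counts(reads):
    """Count distinct UMIs per gene for one cell.

    reads: iterable of (gene, umi) pairs, one pair per mapped read.
    Returns {gene: number of distinct UMIs}.
    """
    umis_by_gene = defaultdict(set)
    for gene, umi in reads:
        umis_by_gene[gene].add(umi)  # duplicates collapse in the set
    return {gene: len(umis) for gene, umis in umis_by_gene.items()}


if __name__ == "__main__":
    reads = [("GeneA", "ACGT"), ("GeneA", "ACGT"), ("GeneA", "TTGA"),
             ("GeneB", "ACGT")]
    print(umi_counts(reads))
```

In the usage example, GeneA has three reads but only two distinct UMIs, so the heavily amplified molecule contributes a single count.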
Pipeline: Reads QC, Read mapping, Quantification, Cells QC, Normalization, Clustering, Differential expression, Functional analysis
Figure: percentage of reads with mismatches (y-axis, 0 to 2.5) by read position (x-axis, 1 to 71) for Lane 1, Lane 2, and Lane 3
Tools to analyze and preprocess fastq files
• FASTX (http://hannonlab.cshl.edu/fastx_toolkit/)
– Charts quality statistics
– Filters sequences based on quality
– Trims sequences based on quality
– Collapses identical sequences into a single sequence
• PRINSEQ (http://prinseq.sourceforge.net/)
– Generates read length and quality statistics
– Filters reads based on length, quality, GC content and other criteria
– Trims reads based on length/position or quality scores
http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png
RNA-Seq read mapping strategies:
– Ungapped mapping (with mismatches) to genome
   • Cannot align reads spanning exon junctions
– Local alignment (Smith-Waterman) to genome
   • Very slow
– Spliced alignment to genome
   • Computationally harder than ungapped alignment, but much faster than local alignment
– Mapping on transcript libraries
   • Fastest, but cannot align reads from un-annotated transcripts
– Mapping on exon-exon junction libraries
   • Cannot align reads overlapping un-annotated exons
– Hybrid approaches
Comparison of spliced read mapping tools
Kim et al. http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3317.html
• Cannot use raw read counts (why not?)
Islam et al. http://www.nature.com/nmeth/journal/v11/n2/full/nmeth.2772.html
• CPM = counts per million
– Ignores multireads, so it underestimates expression of genes in large families
– Does not normalize for gene length, so CPMs cannot be compared between genes
– Comparing CPMs between samples assumes similar transcriptome size
• RPKM/FPKM = reads/fragments per kilobase per million
– [Mortazavi et al. 08] Fractionally allocates multireads based on unique read estimates
– Length for multi-isoform genes?
– Comparing FPKM between samples assumes similar (weighted) transcriptome size
• TPM = transcripts per million
– Still a relative measure of expression, but comparable between samples
– Most accurate estimation methods use multireads and isoform-level estimation
• UMI counts
– Absolute measure of expression?
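The relationship between CPM, RPKM, and TPM can be made concrete with a small sketch. It assumes one read count and one length (in bases) per gene, and it sidesteps the multiread and multi-isoform issues raised above; function names and the toy numbers are illustrative only.

```python
def cpm(counts):
    """Counts per million: scale by total reads only."""
    total = sum(counts.values())
    return {g: c * 1e6 / total for g, c in counts.items()}


def rpkm(counts, lengths):
    """Reads per kilobase per million mapped reads."""
    total = sum(counts.values())
    return {g: c * 1e9 / (total * lengths[g]) for g, c in counts.items()}


def tpm(counts, lengths):
    """Transcripts per million: length-normalize first, then rescale
    so that the values of one sample always sum to one million."""
    rate = {g: c / lengths[g] for g, c in counts.items()}
    norm = sum(rate.values())
    return {g: r * 1e6 / norm for g, r in rate.items()}


if __name__ == "__main__":
    counts = {"A": 100, "B": 300}     # toy read counts
    lengths = {"A": 1000, "B": 3000}  # toy gene lengths in bases
    print(cpm(counts))            # B looks 3x higher than A
    print(rpkm(counts, lengths))  # equal after length normalization
    print(tpm(counts, lengths))   # equal, and sums to one million
```

In the toy example gene B gets three times the reads of gene A only because it is three times longer; CPM reports a 3x difference, while RPKM and TPM report equal expression.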
Figure: gene ambiguous reads and isoform ambiguous reads, illustrated on exons A B C D E and an isoform A C
Expectation-maximization approach (IsoEM, RSEM)
Figure: a read r compatible with two isoforms, i and j (exons A B C and A C), with read-isoform weights $w_{r,i}$ and fragment length terms $F_a(i)$, $F_a(j)$
EM Algorithm
1. Start with random transcript frequencies (example: 0.2, 0.2, 0.2, 0.2, 0.2)
2. Fractionally allocate reads to transcripts (example weights: 1, 1, 1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5)
3. Compute expected #reads for each transcript (example: 0.5, 2.5, 0.5, 1, 1.5)
4. Update transcript frequencies using maximum likelihood estimates (example: 0.5/6, 2.5/6, 0.5/6, 1/6, 1.5/6)
5. Repeat steps 2-4 until convergence
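The EM steps can be sketched in a few lines. This is a simplified illustration, not the IsoEM or RSEM implementation: it assumes each read is given as the list of transcript indices it is compatible with, it weights reads by the current frequencies only (no transcript lengths, fragment length distribution, or base qualities), it starts from uniform rather than random frequencies so that runs are reproducible, and it uses a fixed number of iterations instead of a convergence test.

```python
def em_frequencies(read_compat, n_transcripts, n_iter=100):
    """Estimate transcript frequencies from ambiguous reads.

    read_compat: list of lists; read_compat[r] holds the indices of the
                 transcripts that read r is compatible with.
    Returns a list of frequencies that sums to 1.
    """
    # Step 1: initial frequencies (uniform here, an assumption).
    freq = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        # Steps 2-3: fractionally allocate each read and accumulate
        # the expected number of reads per transcript.
        expected = [0.0] * n_transcripts
        for compat in read_compat:
            z = sum(freq[t] for t in compat)
            if z == 0.0:
                continue  # read has no support under current estimates
            for t in compat:
                expected[t] += freq[t] / z
        # Step 4: maximum likelihood update.
        total = sum(expected)
        freq = [e / total for e in expected]
    # Step 5 is the loop itself (fixed iteration count in this sketch).
    return freq


if __name__ == "__main__":
    # Two reads unique to transcript 0, one read shared with transcript 1.
    print(em_frequencies([[0], [0], [0, 1]], n_transcripts=2))
```

In the usage example the shared read is gradually reassigned to transcript 0, because nothing in the data requires transcript 1 to be expressed.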
Detected genes/cell -- main population
Detected genes/cell -- minor population
Detected genes/cell -- bi-modal distribution
Batch effects can be larger than biological effects, but can be corrected by normalization procedures
CPM & TPM datasets pre-quantile normalization
CPM & TPM datasets post-quantile normalization
Quantile normalization (Irizarry et al 2002)
• Shifts CPM/FPKM/TPM values for each cell to match a reference distribution (e.g., distribution of means)
– Highest value gets mapped to highest value in reference
– 2nd highest gets mapped to 2nd highest value in reference
– And so on
Distribution of TPMs
Reference distribution
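Quantile normalization against the distribution of means can be sketched as below. The sketch assumes every cell has a value for the same genes in the same order, and it breaks ties by gene order rather than averaging tied ranks, which is a simplification; the function name is illustrative.

```python
def quantile_normalize(cells):
    """cells: list of equal-length lists, one list of expression
    values per cell. Returns the normalized values in the same layout."""
    n_genes = len(cells[0])
    sorted_cells = [sorted(c) for c in cells]
    # Reference distribution: mean of the r-th smallest value over cells.
    reference = [sum(s[r] for s in sorted_cells) / len(cells)
                 for r in range(n_genes)]
    normalized = []
    for c in cells:
        # Gene indices ordered from lowest to highest value in this cell.
        order = sorted(range(n_genes), key=lambda g: c[g])
        out = [0.0] * n_genes
        for rank, g in enumerate(order):
            out[g] = reference[rank]  # r-th ranked gene gets r-th reference value
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    print(quantile_normalize([[5, 2, 3], [4, 1, 9]]))
```

After normalization every cell has exactly the same set of values; only the assignment of values to genes differs between cells.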
Principal Component Analysis
• Linear transformation of the data:
– 1st component = direction of max. variance
– 2nd component = orthogonal to the 1st, max. residual variance
• Used for dimensionality reduction (ignore higher-order components)
– Visualization for exploratory analysis
– Feature selection
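To illustrate what "direction of max. variance" means, the first principal component can be found by power iteration on the sample covariance matrix. This is a teaching sketch in pure Python, assuming at least two samples; in practice a library SVD routine would be used, and the start vector below is an arbitrary choice that would fail only if it happened to be orthogonal to the first component.

```python
import math


def first_principal_component(rows, n_iter=200):
    """rows: list of samples (cells), each a list of d features (genes).
    Returns (unit direction of maximum variance, per-sample scores)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # d x d sample covariance matrix.
    cov = [[sum(x[a] * x[b] for x in centered) / (n - 1)
            for b in range(d)] for a in range(d)]
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary start vector
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # data has no variance
        v = [x / norm for x in w]
    # Projection of each centered sample on the component.
    scores = [sum(x[j] * v[j] for j in range(d)) for x in centered]
    return v, scores


if __name__ == "__main__":
    direction, scores = first_principal_component(
        [[0, 0], [1, 1], [2, 2], [3, 3]])
    print(direction, scores)
```

Keeping only the scores on the first few components gives the low-dimensional view used for visualization.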
What makes a good clustering?
• Homogeneity: Elements within a cluster are close to each other
• Separation: Elements in different clusters are further apart from each other
Figure: bad clustering vs. good clustering
Many clustering algorithms!
Algorithm: Parameters
• K-means
– K = number of clusters
• Fuzzy c-means Clustering (FCM)
– K = number of clusters
– d = degree of fuzziness
• Hierarchical Clustering (HCS)
– Metric = euclidean, seuclidean, cityblock, minkowski, chebychev, cosine, correlation, spearman
– Method = average, centroid, complete, median, single
• EM Clustering
– K = number of clusters
– S = number of initial seeds
– I = number of iterations
• SNN-Cliq
– n = size of the nearest neighbor list
– r = density threshold of quasi-cliques
– m = threshold on the overlapping rate for merging
K-Means Clustering
• Goal: find K clusters minimizing the mean squared distance from data points to corresponding cluster centroids
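The K-means objective is usually optimized with the standard Lloyd iteration: assign each point to its nearest centroid, recompute each centroid as the mean of its points, and repeat until nothing changes. The slides do not name the procedure, so treat the sketch below as one common heuristic; it takes the initial centroids as an argument and only reaches a local optimum that depends on them.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)),
                key=lambda k: sq_dist(p, centroids[k]))
            for p in points]


def kmeans(points, init_centroids, max_iter=100):
    centroids = [list(c) for c in init_centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                # New centroid = coordinate-wise mean of the members.
                updated.append([sum(col) / len(members)
                                for col in zip(*members)])
            else:
                updated.append(old)  # keep the centroid of an empty cluster
        if updated == centroids:
            break  # centroids stopped moving
        centroids = updated
    return assign(points, centroids), centroids


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
    print(kmeans(pts, init_centroids=[(0, 0), (5, 5)]))
```

Because the result depends on the starting centroids, the iteration is typically run from several initializations and the solution with the lowest mean squared distance is kept.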
Figure (four slides): K-means iterations on a scatter plot of expression in condition 1 (x-axis, 0 to 5) vs. expression in condition 2 (y-axis, 0 to 5), showing centroids k1, k2, k3 at successive steps
Accuracy measures
• Purity
– U: set of ground truth classes; V: set of the computed clusters; N: total # of objects in dataset
• Adjusted Rand Index (AR)
• Rand Index (RI)
– RI = (TP+TN)/(TP+FP+FN+TN)
• F1 Score
– F1 Score = 2×TP/(2×TP+FP+FN)
• Mirkin’s index (MI)
– Counts the number of disagreements in data pairs between two clusterings: the ratio of the number of disagreeing pairs to the total number of pairs. A lower value indicates better clustering.
• Hubert’s index (HI)
– HI = RI – MI
• Corr
– Maximum weighted Pearson correlation between the average expression values of each ground truth class and each computed cluster
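Several of these measures are defined over pairs of objects. The sketch below assumes the usual pair-counting convention (a pair is a TP if the two objects share both a ground truth class and a computed cluster, an FP if they share only the cluster, an FN if they share only the class, and a TN otherwise) and the usual definition of purity (the fraction of objects that belong to the majority class of their cluster). Both conventions are assumptions about what the slide intends; the adjusted Rand index and Corr are not covered.

```python
from collections import Counter, defaultdict
from itertools import combinations


def pair_counts(truth, pred):
    """Return (TP, FP, FN, TN) over all unordered pairs of objects."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def pair_measures(truth, pred):
    tp, fp, fn, tn = pair_counts(truth, pred)
    total = tp + fp + fn + tn
    ri = (tp + tn) / total
    mi = (fp + fn) / total  # ratio of disagreeing pairs
    return {"RI": ri, "F1": 2 * tp / (2 * tp + fp + fn),
            "MI": mi, "HI": ri - mi}


def purity(truth, pred):
    by_cluster = defaultdict(Counter)
    for t, p in zip(truth, pred):
        by_cluster[p][t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(truth)


if __name__ == "__main__":
    truth = [0, 0, 0, 1, 1, 1]
    pred = [0, 0, 1, 1, 1, 1]
    print(pair_counts(truth, pred), pair_measures(truth, pred),
          purity(truth, pred))
```

Under these definitions MI equals 1 - RI, so HI = 2·RI - 1 carries the same ranking information as the Rand index.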
Accuracy comparison (Pollen et al. 2014, MiSeq)
Accuracy comparison (Pollen et al. 2014, HiSeq)
Accuracy comparison (Zeisel et al. 2015)
Tests for differential gene expression must take both fold change and statistical significance into account
Figure: three example comparisons (FC = 2, FC = 2, FC = 1.5), with DE calls marked by *
• Many reliable DE methods for data with replicates
– edgeR [Robinson et al., 2010]
– DESeq [Anders et al., 2010]
• When no or few replicates are available, bootstrapping provides a robust alternative
Sensitivity results on Illumina MCF-7 data with varying number of replicates and minimum fold change 1.5
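Figure: enrichment analysis workflow
• Inputs: gene expression table (from experimental data) and gene-set databases (from a priori knowledge + existing experimental data)
• Enrichment test produces an enrichment table (e.g., Spindle 0.00001, Apoptosis 0.00025)
• Output: interpretation & hypotheses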
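The slides do not say which statistic the enrichment test uses. One common choice, assumed here purely for illustration, is the one-sided hypergeometric test: given a list of selected genes (for example, differentially expressed ones), how likely is it to see at least the observed overlap with a gene set by chance? The function and parameter names below are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_universe, n_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric null.

    n_universe: number of genes measured
    n_set:      genes in the gene set (e.g., an annotation term)
    n_selected: genes in the selected list
    n_overlap:  selected genes that belong to the gene set
    """
    denom = comb(n_universe, n_selected)
    upper = min(n_set, n_selected)
    tail = sum(comb(n_set, k) * comb(n_universe - n_set, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / denom


if __name__ == "__main__":
    # 40 of 200 selected genes fall in a 300-gene set, out of 20000 genes.
    print(enrichment_pvalue(20000, 300, 200, 40))
```

When many gene sets are tested, the resulting p-values need a multiple-testing correction before they are used for interpretation.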
http://david.abcc.ncifcrf.gov/
http://www.genemania.org/
Conclusions
• The range of single-cell applications continues to expand, fueled by advances in microfluidics technology and library prep protocols
– ATAC-Seq, GT-Seq, Methyl-Seq, …
• Primary analysis is compute intensive
– Requires server/cluster/cloud + linux + scripting
– Galaxy framework (https://usegalaxy.org/) provides web-based interface to many tools
• Most secondary/tertiary analyses can be done on PC/Mac using
– R environment (some programming)
– Many can be done using web-based tools and user-friendly apps (we’ll use JMP)
Conclusions
• Development of single-cell specific analysis methods critical for fully realizing the potential of the technology
– Allele specific expression
– Biomarker selection
– Cell type assignment
– Lineage reconstruction
– Characterization of heterogeneity
• Joint analysis of bulk and single cell data still needed to get unbiased cell type frequencies
– Can also identify and characterize cell types missed by current capture protocols
Single cells AND (not OR) computational deconvolution
Acknowledgements
Sahar Al Seesi
Marius Nicolae
Elham Sherafat
Craig Nelson
Adrian Caciula
Serghei Mangul
Yvette Temate Tiagueu
Alex Zelikovsky
Edward Hemphill
James Lindsay