Post on 24-Aug-2020
Gene Expression Data Analysis
Qin Ma, Ph.D.
December 10, 2017
1
Bioinformatics
• This interdisciplinary science … is about providing computational support to studies on linking the behavior of cells, organisms and populations to the information encoded in the genomes.
– Temple Smith, Current Topics in Computational Molecular Biology (2002)
2
Bioinformatics
Genomics, Transcriptomics, Metabolomics, Metagenomics, Epigenomics, Proteomics, Interactomics, …
Omics data
Systems biology
Characteristics of Biological Big Data
Big Small Data
vs.
Small Big Data
3
• 36.8 million transactions per day on Amazon
Next Generation Sequencing Data
• Biomedical Data (behavioral outcomes in observational study)
The Hierarchical Structure of Computational Techniques
4
Models
Algorithms
Programs Tools Software
Central Dogma
• DNA → RNA → Protein
Intro to gene expression (central dogma). (n.d.). Retrieved November 05, 2017, from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/central-dogma-transcription/a/intro-to-gene-expression-central-dogma
5/46
6
Information derivable from gene expression data
Inference: genes with similar expression patterns might be functionally related, e.g., working in the same pathway or co-regulated
co-expression -> co-regulation
Inference: genes x, y are highly expressed under conditions W while genes a, b are not expressed
genome sequence
Inference: gene X is significantly more highly expressed in diseased cells than in normal cells; hence gene X could potentially serve as a marker of the disease – differentially expressed genes
genome sequence
Control
Treatment
Gene Expression Measurement
7
Read quality check (FastQC)
RNA-seq read mapping (BWA, Bowtie)
RNA-seq Assembly with reference genome (Cufflinks)
RNA-seq Assembly without reference genome (Trinity: De-novo assembly)
Microarray (GEO)
RNA-seq (SRA)
RNA-seq
• Process
• Purpose
  – Analysis of Big Genomic Data
  – Gene Expression Estimation
• Variations
  – Differential Gene Expression Analysis
  – Functional Enrichment Analysis
  – Network Analysis
Forde, B. M., & O’Toole, P. W. (2013). Next-generation sequencing technologies and their impact on microbial genomics. Briefings in functional genomics, 12(5), 440-453.
8/46
Non-trivial RNA-seq Analysis Pipeline
9
Quality Check
Read Mapping
Gene Read Count
Differential Expression Analysis
Functional Enrichment Analysis
RNA-seq Reads
Data Trimming
De-novo (Bi)-Clustering
Network Analysis & Modeling
(De-novo) Assembly
Operon Prediction
Non-trivial RNA-seq Analysis Tools
10
FastQC
HISAT
HTSeq
edgeR/DESeq
DAVID/GO
RNA-seq Reads
Btrim
MCL/QUBIC
NCA/GtrieScanner
Cufflinks/Trinity
DOOR/SeqTU
Existing RNA-seq Pipeline Tools
[Figure: timeline of RNA-seq tools, 2009–2017]
Tools shown: HISAT2, Bridger, HTSeq, RSEM, Cufflinks, Cutadapt, Novoalign, Bowtie 2, BWA, Bowtie, TopHat, GNUMAP, TopHat2, STAR, Trinity, DESeq2, GSNAP, edgeR, FastQC, FastX, kallisto, sleuth
11/46
ViDGER
• Tool to assist in interpreting and analyzing count matrices
• PCA, MDS, Clustering
• DGEA
• Visualizations
• Basic R package
• Shiny implementation
12/46
ViDGER Compatibility
• Count & condition matrix
• Popular DGE tools by citation count
  – Cuffdiff*
  – edgeR
  – DESeq2
  – DEGseq
  – limma
  – sleuth*
[Pie chart: 12021 (64%), 5200 (28%), 1607 (8%); legend: DGE & Visualization, Visualization Only, None]
13/46
Shiny Input
• Count Matrix
• Generates basic figures from matrix
• Initial Analyses
  – PCA
  – MDS
14/46
Differential Gene Expression
• Select DGE tool to analyze data
• Interactive results table
• DGE results visualizations for improved interpretation
• Interactivity between table & figures
15/46
Pitfall I: Popularity ≠ High Performance
Tool     MapSplice2   CRAC    GSNAP   Novoalign   TopHat2
Human    97.8%        86.1%   98.9%   90.3%       12.5%
Pitfall II: Gene expression estimation
28
Quality Check
Read Mapping
Gene Read Count
Differential Expression Analysis
Functional Enrichment Analysis
RNA-seq Reads
Data Trimming
De-novo (Bi)-Clustering
Network Analysis & Modeling
(De-novo) Assembly
Operon Prediction
Mapping uncertainty!
Pitfall II: Gene expression estimation
RNA-seq reads mapping uncertainty
29
Mapping Uncertainty Occurrences
• Plants
  – Highly duplicative nature of genome
• Animals
  – Alternative splicing
• Metagenomics
  – Sequencing of entire microbial communities simultaneously
  – Identical genes across different species
  – Similar, mutated or evolved genes
  – Currently other issues compounding mapping uncertainty
19/46
Pitfall II: How Serious?
30
Species                Group             Unique-mapped   Multi-mapped   Un-mapped
Arabidopsis thaliana   Diploid plant     77%~89%         8%~17%         2%~5%
Vitis vinifera         Diploid plant     55%~82%         10%~25%        8%~23%
Solanum lycopersicum   Diploid plant     49%~87%         6%~34%         5%~44%
Solanum tuberosum      Polyploid plant   55%~69%         18%~26%        12%~19%
Triticum aestivum      Polyploid plant   62%~69%         18%~25%        9%~18%
Similar things happen in Human (transcript) and Metagenome
Species                      Group             Datasets   Size (G)   Unique-mapped   Multi-mapped   Un-mapped   (Multi-mapped)/(Total mapped)
Arabidopsis thaliana         Diploid plant     10         153.7      69%~89%         8%~17%         2%~17%      8%~18%
Vitis vinifera               Diploid plant     10         152.3      55%~82%         9%~25%         8%~23%      10%~31%
Solanum lycopersicum         Diploid plant     10         151.8      52%~88%         5%~34%         4%~16%      6%~39%
Panicum virgatum             Polyploid plant   10         385.7      47%~66%         17%~33%        13%~25%     22%~39%
Triticum aestivum            Polyploid plant   13         348.1      61%~69%         17%~25%        9%~18%      21%~28%
Human genome                 Animal            11         249.9      55%~65%         21%~28%        12%~21%     25%~33%
Human transcriptome          Animal            11         249.9      10%~15%         23%~31%        55%~65%     61%~72%
Mus musculus genome          Animal            10         129.9      40%~70%         10%~38%        3%~31%      13%~48%
Mus musculus transcriptome   Animal            10         129.9      11%~27%         9%~42%         43%~67%     29%~77%
Total                                          95         1951       55%             22%            23%         29%
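The last row of the table follows from the unique-mapped and multi-mapped rows. A minimal sketch of that arithmetic, using the Total column (55% unique-mapped, 22% multi-mapped); the function name is illustrative and not part of any tool mentioned in the slides:

```python
def multi_mapped_share(unique_pct: float, multi_pct: float) -> float:
    """Return multi-mapped reads as a percentage of all mapped reads."""
    total_mapped = unique_pct + multi_pct
    if total_mapped <= 0:
        raise ValueError("no mapped reads")
    return 100.0 * multi_pct / total_mapped


# Total column of the table: 55% unique-mapped, 22% multi-mapped.
total_share = multi_mapped_share(55, 22)
print(round(total_share))  # 29, matching the table's last row
```

Because un-mapped reads are excluded from the denominator, this share is always at least as large as the raw multi-mapped percentage.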
Mapping Uncertainty in Real Data
21/46
Mapping Uncertainty in Plant Data
22/46
Mapping Uncertainty in Animal Data
23/46
Pitfall II: How to Proceed?
31
a) Ignore them: only consider unique mapping
   – 30%-70% of reads are discarded from further analysis in plants
b) Random mapping: if there are multiple equally good best matches, choose one at random
   – TopHat
c) Report all: try to keep more information
   – Cufflinks: distribute these multi-mapped reads uniformly or based on the expression level of uniquely mapped reads
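The three options can be compared on a toy example. The sketch below is a simplified illustration, not the actual logic of TopHat or Cufflinks; the function name, the input layout (one list of candidate genes per read), and the uniform fallback are assumptions made for this example.

```python
import random
from collections import Counter


def count_reads(alignments, strategy="unique", seed=0):
    """Estimate per-gene read counts.

    alignments: one list per read, holding the genes it maps to equally
    well (a single entry means the read is uniquely mapped).
    strategy: "unique" (a), "random" (b) or "proportional" (c).
    """
    if strategy not in {"unique", "random", "proportional"}:
        raise ValueError(f"unknown strategy: {strategy}")

    # Counts from uniquely mapped reads are shared by all three strategies.
    unique = Counter(genes[0] for genes in alignments if len(genes) == 1)
    counts = Counter(unique)
    if strategy == "unique":
        return dict(counts)

    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    for genes in alignments:
        if len(genes) < 2:
            continue
        if strategy == "random":
            counts[rng.choice(genes)] += 1
        else:
            weights = [unique[g] for g in genes]
            total = sum(weights)
            for gene, weight in zip(genes, weights):
                # Uniform split when no candidate has any unique reads.
                counts[gene] += weight / total if total else 1 / len(genes)
    return dict(counts)


reads = [["A"], ["A"], ["A"], ["B"], ["A", "B"], ["A", "B"], ["A", "B"], ["A", "B"]]
print(count_reads(reads, "unique"))        # {'A': 3, 'B': 1}
print(count_reads(reads, "proportional"))  # {'A': 6.0, 'B': 2.0}
```

In this toy input, option (a) discards half of the reads, while option (c) keeps all eight and preserves the 3:1 ratio seen in the unique reads.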
Pitfall II: How to Proceed?
32
It is an OPEN and challenging problem!
Quantifying Mapping Uncertainty
• Gene Expression Quality Check (GeneQC)
  – Computational program collecting relevant information from datasets
  – Interprets information in a meaningful way to provide quantification of mapping uncertainty
• Two levels of observations
  – Genomic level: Sequence similarity between two genomic locations
  – Transcriptomic level: Proportion of shared ambiguous reads
26/46
GeneQC
[Figure: GeneQC illustration with panels A, B, C, D]
27/46
D-score
• Allows for comparable metric of mapping uncertainty
• Combines three statistics
  – Maximum proportion of shared ambiguous reads
  – Maximum base-pair similarity
  – Number of gene pair interactions
• Normalized between 0 and 1 for each dataset
[Figure: example for gene $i$; values shown: 0.07, 0.18, 0.65, 0.95, 0.84]
28/46
Variables: $D_1$
• $D_1$: Sequence Similarity × Match Length
  – $D_1 = \max_{y} \{ ss_{i,y} \cdot l_{i,y} \}$
  – $ss_{i,y}$ = sequence similarity of gene $i$ and gene $y$
  – $l_{i,y}$ = match length
• Additional constraints for $D_1$
  – e-value < $10^{-6}$
  – SS × Match Length > 100
  – Mismatch < 5
  – Gap < 5
Example for gene $i$:
gene $y_1$: $ss_{i,1}$ = 65%; $l_{i,1}$ = 100
gene $y_2$: $ss_{i,2}$ = 85%; $l_{i,2}$ = 200
gene $y_3$: $ss_{i,3}$ = 85%; $l_{i,3}$ = 350
gene $y_4$: $ss_{i,4}$ = 85%; $l_{i,4}$ = 200
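The example values on this slide can be run through the definition of the first variable (the maximum of sequence similarity times match length). This is a minimal sketch under stated assumptions: only the "SS × Match Length > 100" constraint is applied, because the slide gives no e-value, mismatch, or gap numbers for the example genes, and returning 0 when no candidate passes is an assumption.

```python
def d1_score(candidates, min_product=100):
    """Return max(ss * l) over candidate genes that pass the product filter.

    candidates: dict mapping gene name -> (similarity in percent, match length).
    Returns 0.0 when no candidate passes (an assumption of this sketch).
    """
    products = {
        gene: ss_percent * length / 100
        for gene, (ss_percent, length) in candidates.items()
    }
    passing = [p for p in products.values() if p > min_product]
    return max(passing, default=0.0)


# Values from the slide's example for gene i.
example = {"y1": (65, 100), "y2": (85, 200), "y3": (85, 350), "y4": (85, 200)}
print(d1_score(example))  # 297.5, from gene y3 (y1 is filtered out: 65 <= 100)
```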
29/46
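Variables: $D_2$
• $D_2$: Max MMR percentage
  – $D_2 = \frac{|G_i \cap X|}{|G_i|}$
  – $G_i$ = reads aligned to gene $i$
  – $X = \arg\max_{Y} |G_i \cap Y|$
[Figure: read sets $G_i$, $X$, $Y_1$, $Y_2$ and the overlap $G_i \cap X$]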
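The second variable is a set computation: find the other gene whose read set overlaps most with the reads of gene i, then divide the overlap by the number of reads of gene i. A minimal sketch with made-up read identifiers (the function name and data layout are assumptions for illustration):

```python
def d2_score(reads_i, reads_by_gene):
    """Return (|G_i ∩ X| / |G_i|, X), where X shares the most reads with G_i."""
    if not reads_i or not reads_by_gene:
        return 0.0, None
    best_gene = max(reads_by_gene, key=lambda g: len(reads_i & reads_by_gene[g]))
    shared = len(reads_i & reads_by_gene[best_gene])
    return shared / len(reads_i), best_gene


# Hypothetical read IDs: gene i has 8 reads; Y1 shares 2 of them, Y2 shares 4.
g_i = {"r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"}
others = {"Y1": {"r1", "r2", "r9"}, "Y2": {"r3", "r4", "r5", "r6", "r10"}}
score, partner = d2_score(g_i, others)
print(score, partner)  # 0.5 Y2
```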
30/46
Variables: $D_3$
• $D_3$: Degree weight
  – $D_3 = \log_{10}(|S_i \cup M_i| + 1)$
  – $S_i$ = {genomic locations where $D_1 > 0$}
  – $M_i$ = {genomic locations where $D_2 > 0$}
• Separated into two populations
  – $D_2 = 0$
  – $D_2 \neq 0$
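The degree weight counts the distinct genomic locations linked to gene i by either sequence similarity or shared reads, on a log scale. A minimal sketch, assuming the reading log10(|S ∪ M| + 1) of the formula; the location names are hypothetical:

```python
import math


def d3_score(similar_locations, shared_read_locations):
    """Return log10(|S_i ∪ M_i| + 1) for two collections of location IDs."""
    combined = set(similar_locations) | set(shared_read_locations)
    return math.log10(len(combined) + 1)


# Hypothetical locations: 5 with sequence similarity, 6 with shared reads, 2 in both.
s_i = {"locA", "locB", "locC", "locD", "locE"}
m_i = {"locD", "locE", "locF", "locG", "locH", "locI"}
print(d3_score(s_i, m_i))  # 1.0, since the union holds 9 locations
```

The "+ 1" keeps the value defined, and equal to 0, for a gene with no linked locations.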
31/46
Variables by Species
32/46
D-score Development
• $D_1$, $D_2$, $D_3$ combined into one distinct value
• Regression-based approach to optimize effect of each parameter
$D = \alpha_1 D_1 + \alpha_2 D_2 + \alpha_3 D_3 + \alpha_4 D_1 D_2 + \alpha_5 D_1 D_3 + \alpha_6 D_2 D_3 + \alpha_7 D_1 D_2 D_3$
$SD = D_3(\alpha_1 D_1 + \alpha_2 D_2)$
• $D^*$ used as dependent variable to represent mapping uncertainty
  – $G_i$ = reads mapped to gene $i$ (all matches)
  – $U_i$ = reads uniquely mapped to gene $i$ (unique mapping)
  – Real alignment $R_i$ falls somewhere between: $|U_i| \le |R_i| \le |G_i|$
• $D^* = \frac{|G_i| - |R_i|}{|G_i|} = 1 - \frac{|R_i|}{|G_i|} = 1 - \frac{|G_i| + |U_i|}{2|G_i|}$
• Since $0 \le |U_i| \le |G_i|$, the midpoint choice for $|R_i|$ gives $0 \le D^* \le \frac{1}{2}$
• $D^*$ regressed upon ($D_1$, $D_2$, $D_3$) to determine optimized coefficients for each dataset
• Interpretations for each set of coefficients can be used to understand biological mechanisms behind species-specific mapping uncertainty
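The quantities on this slide can be evaluated directly once read counts and coefficients are known. The sketch below only evaluates the formulas; it does not perform the regression that estimates the coefficients. All coefficient values are placeholders chosen for the example, not fitted values from the slides.

```python
def d_star(total_mapped, uniquely_mapped):
    """(|G_i| - |R_i|) / |G_i|, with |R_i| taken midway between |U_i| and |G_i|."""
    if total_mapped <= 0 or not 0 <= uniquely_mapped <= total_mapped:
        raise ValueError("need 0 <= uniquely_mapped <= total_mapped and total_mapped > 0")
    real = (total_mapped + uniquely_mapped) / 2
    return (total_mapped - real) / total_mapped


def full_d_score(d1, d2, d3, alpha):
    """Seven-term model; alpha holds (alpha_1, ..., alpha_7)."""
    a1, a2, a3, a4, a5, a6, a7 = alpha
    return (a1 * d1 + a2 * d2 + a3 * d3
            + a4 * d1 * d2 + a5 * d1 * d3 + a6 * d2 * d3
            + a7 * d1 * d2 * d3)


def simplified_d_score(d1, d2, d3, alpha1, alpha2):
    """Simplified form SD = D3 * (alpha_1 * D1 + alpha_2 * D2)."""
    return d3 * (alpha1 * d1 + alpha2 * d2)


# 100 reads map to the gene, 60 of them uniquely.
print(d_star(100, 60))  # 0.2

# Placeholder coefficients (all 1.0), for illustration only.
print(full_d_score(0.5, 0.5, 1.0, (1.0,) * 7))       # 3.5
print(simplified_d_score(0.5, 0.25, 2.0, 1.0, 2.0))  # 2.0
```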
33/46
D-scores
34/46
Simplified D-score
35/46
Simplified D-score Distributions
• Density plots appear to show mixture distributions
• Individual distributions can help indicate categorizations for mapping uncertainty
36/46
Level of Mapping Uncertainty from D-scores
• Mixture model distributions fit to set of D-scores
  – Indicates level of mapping uncertainty for each annotated gene
  – Normal & Gamma distribution fitting
  – Variable number of distributions
• Mixture Model Fitting using Expectation-Maximization Algorithm
  – $P(X \mid \theta) = \sum_{k} \beta_k Y_k(X \mid \theta_k)$
  – $X = \{x_1, x_2, \ldots, x_N\}$ represents the set of D-scores
  – $\beta_k$ represents the weight of the $k^{th}$ component, with $\sum_{k} \beta_k = 1$
  – $Y_k(X \mid \theta_k)$ represents the distribution of the $k^{th}$ component
  – $\theta_k$ is the set of parameters for the $k^{th}$ component
37/46
Mixture Model Fitting: Initialization
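• Assume $Y_k(X \mid \theta_k) = N(X; \mu_k, \sigma_k^2)$
• Initial parameterization
  – K-means clustering to separate into k components
  – $\theta_k$, $\beta_k$ calculated for each component using MLE based on $N_k$
  – $MLE(\mu_k) = \frac{\sum_{j=1}^{N_k} x_{k,j}}{N_k}$
  – $MLE(\sigma_k^2) = \frac{\sum_{j=1}^{N_k} (x_{k,j} - \mu_k)^2}{N_k}$
  – $\beta_k = \frac{N_k}{N}$, with $N_k$ = number of data points in component $k$ and $\sum_{k} N_k = N$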
𝑘 = 4
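Given an initial hard partition of the D-scores (for example from k-means), the starting parameters are per-cluster maximum-likelihood estimates. A minimal sketch; the clustering step itself is not shown, and the two clusters below are made-up values standing in for its output:

```python
def init_params(clusters):
    """Return (means, variances, weights) from an initial hard partition."""
    if any(len(points) == 0 for points in clusters):
        raise ValueError("every cluster needs at least one point")
    n_total = sum(len(points) for points in clusters)
    means, variances, weights = [], [], []
    for points in clusters:
        n_k = len(points)
        mu = sum(points) / n_k
        # The MLE of the variance divides by N_k, not N_k - 1.
        sigma2 = sum((x - mu) ** 2 for x in points) / n_k
        means.append(mu)
        variances.append(sigma2)
        weights.append(n_k / n_total)
    return means, variances, weights


# Hypothetical partition of five D-scores into two clusters.
init_mu, init_sigma2, init_beta = init_params([[0.1, 0.2, 0.3], [0.8, 0.9]])
print(init_mu, init_sigma2, init_beta)
```

A cluster containing a single point gets variance 0 under this estimate, which a normal density cannot use, so a practical implementation would need a variance floor or a re-initialization.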
38/46
Mixture Model Fitting: Expectation & Maximization
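• Posterior probability of containment within each component is calculated for each D-score
$P(x_j \in k \mid x_j) = \frac{P(x_j \mid x_j \in k)\, P(k)}{P(x_j)} = \frac{N(x_j \mid \mu_k, \sigma_k)\, \frac{N_k}{N}}{\sum_{k'} \beta_{k'} N(x_j \mid \mu_{k'}, \sigma_{k'})} = \frac{\beta_k N(x_j \mid \mu_k, \sigma_k)}{\sum_{k'} \beta_{k'} N(x_j \mid \mu_{k'}, \sigma_{k'})}$
• Parameters for each component calculated after Expectation Step
$\mu_k = \frac{\sum_{j=1}^{N} P(x_j \in k \mid x_j)\, x_j}{\sum_{j=1}^{N} P(x_j \in k \mid x_j)}$
$\sigma_k^2 = \frac{\sum_{j=1}^{N} P(x_j \in k \mid x_j)\, (x_j - \mu_k)^2}{\sum_{j=1}^{N} P(x_j \in k \mid x_j)}$
$\beta_k = \frac{\sum_{j=1}^{N} P(x_j \in k \mid x_j)}{N}$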
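One Expectation step followed by one Maximization step for a one-dimensional normal mixture can be sketched as follows. This is an illustrative pure-Python version, not the R implementation mentioned in the slides; the data and starting parameters are made up, and no safeguards against zero variances or numerical underflow are included.

```python
import math


def normal_pdf(x, mu, sigma2):
    """Density of a normal distribution with mean mu and variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def e_step(xs, mus, sigma2s, betas):
    """Return resp[j][k], the posterior probability that x_j belongs to component k."""
    resp = []
    for x in xs:
        weighted = [b * normal_pdf(x, m, s2) for m, s2, b in zip(mus, sigma2s, betas)]
        total = sum(weighted)
        resp.append([w / total for w in weighted])
    return resp


def m_step(xs, resp):
    """Re-estimate means, variances and weights from the posteriors."""
    n_components = len(resp[0])
    mus, sigma2s, betas = [], [], []
    for k in range(n_components):
        weight = sum(r[k] for r in resp)
        mu = sum(r[k] * x for r, x in zip(resp, xs)) / weight
        sigma2 = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, xs)) / weight
        mus.append(mu)
        sigma2s.append(sigma2)
        betas.append(weight / len(xs))
    return mus, sigma2s, betas


# Hypothetical D-scores and starting parameters for two components.
xs = [0.1, 0.2, 0.3, 0.8, 0.9]
resp = e_step(xs, mus=[0.2, 0.85], sigma2s=[0.01, 0.01], betas=[0.5, 0.5])
new_mus, new_sigma2s, new_betas = m_step(xs, resp)
print(new_mus, new_sigma2s, new_betas)
```

Each row of `resp` sums to 1, and the updated weights sum to 1, which mirrors the constraint on the component weights.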
39/46
Mixture Model Fitting: Optimization
• Expectation and Maximization steps repeated until no significant improvement achieved after each iteration
  – log likelihood fails to substantially increase
• Implementation in R with $k \in \{1, \ldots, 9\}$
  – Best model fitting determined by lowest Bayesian Information Criterion (BIC)
𝑘 = 4
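The stopping rule and the model-selection criterion can be sketched as below. The BIC is computed in its standard form (number of free parameters times log N, minus twice the log likelihood), with 3k − 1 free parameters for a k-component one-dimensional normal mixture. The data, the parameter values, and the tolerance are assumptions for illustration; the slides' R implementation is not reproduced here.

```python
import math


def normal_pdf(x, mu, sigma2):
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def log_likelihood(xs, mus, sigma2s, betas):
    """Log likelihood of the data under a normal mixture."""
    return sum(
        math.log(sum(b * normal_pdf(x, m, s2) for m, s2, b in zip(mus, sigma2s, betas)))
        for x in xs
    )


def has_converged(previous_ll, current_ll, tol=1e-6):
    """Stop iterating when the log likelihood no longer increases by more than tol."""
    return current_ll - previous_ll < tol


def bic(log_lik, n_points, n_components):
    """Bayesian Information Criterion; lower is better."""
    n_params = 3 * n_components - 1  # k means, k variances, k - 1 free weights
    return n_params * math.log(n_points) - 2 * log_lik


# Hypothetical D-scores with hand-set parameters for a 1- and a 2-component model.
xs = [0.1, 0.2, 0.3, 0.8, 0.9]
ll_one = log_likelihood(xs, [0.46], [0.1064], [1.0])
ll_two = log_likelihood(xs, [0.2, 0.85], [0.02 / 3, 0.0025], [0.6, 0.4])
bic_one = bic(ll_one, len(xs), 1)
bic_two = bic(ll_two, len(xs), 2)
print(bic_one, bic_two)  # the two-component model has the lower BIC here
```

In a full run, each candidate number of components would be fitted with EM until `has_converged` returns true, and the candidate with the lowest BIC would be kept.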
40/46
Mixture Model Fitting
$k = 4$
The four distributions provide criteria for separating genes into 4 categories based on mapping uncertainty level
41/46
Addressing Mapping Uncertainty
• Co-expression Modules (CEMs)
  – Genes are typically co-expressed at certain rates with other genes, forming co-expression modules
  – Can use expression levels for known co-expressed genes (CEGs) to predict likely expression levels for the gene locations
  – This information can in turn be used to determine which location is most likely for any particular ambiguous read
• Can use existing information to gain insight into the likelihood of the correct location for alignment
• If no prior CEMs are available, biclustering of data can provide dataset-specific CEMs.
42/46
Pitfall III: T-test for differential expression analysis
Wilcoxon (nonparametric) test has better performance than T-test (parametric)
Bioinformatics. 2002 Nov;18(11):1454-61. Cited by 308
P-value < 0.01
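One situation in which a rank-based test behaves differently from a t-test is a single outlier. The sketch below computes Welch's t statistic (one common t-test variant, chosen here as an assumption) and the Mann-Whitney U statistic (equivalent to the Wilcoxon rank-sum test) on made-up expression values. It illustrates one scenario only and does not reproduce the cited study.

```python
import math


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))


def mann_whitney_u(a, b):
    """U statistic for sample a: pairs with a > b count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


# Hypothetical values: every treated value exceeds every control value,
# but one treated value is an extreme outlier.
control = [5.1, 4.9, 5.0, 5.2, 4.8]
treated = [6.1, 5.9, 6.0, 6.2, 50.0]

t_stat = welch_t(treated, control)
u_stat = mann_whitney_u(treated, control)
print(t_stat)  # small, because the outlier inflates the treated variance
print(u_stat)  # 25.0, the maximum possible for 5 x 5 observations
```

The outlier inflates the variance and shrinks the t statistic, while the rank-based statistic depends only on the ordering of the values and still shows complete separation.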
Pitfall IV: co-expression correlation
        chip1  chip2  chip3  chip4  chip5  chip6  chip7  chip8  chip9  chip10
Gene1     7.6    6.0   10.8    8.3    9.1    8.7    7.4    6.4   10.2     6.5
Gene2     8.1    7.2    7.0    8.4    8.9    8.8    6.5   10.4    6.9     7.5
[Figure panels: Pearson, Spearman]
• Pearson correlation measures linear relationship
• Spearman’s rank correlation measures monotonic relationship
Pearson or Spearman?
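Both coefficients can be computed for the two genes in the table. A minimal pure-Python sketch: Spearman's coefficient is obtained as the Pearson correlation of the ranks, with tied values given their average rank. The function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            result[idx] = average
        pos = end + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Expression values from the table (chip1 to chip10).
gene1 = [7.6, 6.0, 10.8, 8.3, 9.1, 8.7, 7.4, 6.4, 10.2, 6.5]
gene2 = [8.1, 7.2, 7.0, 8.4, 8.9, 8.8, 6.5, 10.4, 6.9, 7.5]
print(pearson(gene1, gene2), spearman(gene1, gene2))
```

For a monotonic but nonlinear relationship, such as y = x² on positive x, Spearman's coefficient is 1 while Pearson's is below 1, which is the distinction the two bullets describe.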
35
45
Pitfall V: Co-expression in LARGE data set
Genes are not necessarily co-expressed under all experimental conditions when we have a large data set!
[Figure: expression matrix with axes Genes and Conditions]
One dimensional clustering (genes or conditions)
Bi-clustering (genes & conditions)
more data!!
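The motivation for bi-clustering can be seen with two made-up gene profiles that move together under only half of the conditions. The sketch below is not a bi-clustering algorithm (tools such as QUBIC do far more); it only shows that correlation over all conditions can hide a strong relationship within a subset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical profiles over 10 conditions: gene_b is exactly 2 * gene_a
# in the first five conditions and unrelated in the last five.
gene_a = [1, 2, 3, 4, 5, 5, 1, 4, 2, 3]
gene_b = [2, 4, 6, 8, 10, 3, 9, 2, 8, 5]

overall = pearson(gene_a, gene_b)
subset = pearson(gene_a[:5], gene_b[:5])
print(overall)  # close to 0 across all ten conditions
print(subset)   # close to 1 within the first five conditions
```

Clustering genes over all conditions would miss this pair, while selecting genes and conditions jointly can recover it.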
Computer Lab Requirement
• Recent version of the following software
  – R
  – RStudio
  – MiKTeX (or TeX Live)
• Install the following R packages on your personal computer
  – edgeR
  – QUBIC
  – sand
46
Final Report Presentation
• 12 teams, 3 people per team
• For each team, a 15-minute team presentation
  – 12 minutes presentation
  – 3 minutes question-and-answer
• One score per team
47