
Gene Expression Data Analysis

Qin Ma, Ph.D.
December 10, 2017

Bioinformatics

• This interdisciplinary science … is about providing computational support to studies on linking the behavior of cells, organisms and populations to the information encoded in the genomes.

– Temple Smith, Current Topics in Computational Molecular Biology (2002)

Bioinformatics

Genomics, Transcriptomics, Metabolomics, Metagenomics, Epigenomics, Proteomics, Interactomics, …

Omics data

Systems biology

Characteristics of Biological Big Data

Big Small Data

Small Big Data

• 36.8 million transactions per day on Amazon

Next Generation Sequencing Data

• Biomedical Data (behavioral outcomes in observational study)

The Hierarchical Structure of Computational Techniques

Models

Algorithms

Programs Tools Software

Central Dogma

• DNA → RNA → Protein

Intro to gene expression (central dogma). (n.d.). Retrieved November 05, 2017, from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/central-dogma-transcription/a/intro-to-gene-expression-central-dogma

Information derivable from gene expression data

Inference: genes with similar expression patterns might be functionally related, e.g., working in the same pathway or co-regulated

co-expression -> co-regulation

Inference: genes x, y are highly expressed under conditions W while genes a, b are not expressed

genome sequence

Inference: gene X is significantly more highly expressed in diseased cells than in normal cells; hence gene X could potentially serve as a marker of the disease (differentially expressed genes)

genome sequence

Control

Treatment

Gene Expression Measurement

Read quality check (FastQC)

RNA-seq read mapping (BWA, Bowtie)

RNA-seq assembly with reference genome (Cufflinks)

RNA-seq assembly without reference genome (Trinity: de novo assembly)

Microarray (GEO)

RNA-seq (SRA)

RNA-seq

• Process
• Purpose
  – Analysis of Big Genomic Data
  – Gene Expression Estimation
• Variations
  – Differential Gene Expression Analysis
  – Functional Enrichment Analysis
  – Network Analysis

Forde, B. M., & O’Toole, P. W. (2013). Next-generation sequencing technologies and their impact on microbial genomics. Briefings in functional genomics, 12(5), 440-453.

Non-trivial RNA-seq Analysis Pipeline

Quality Check

Read Mapping

Gene Read Count

Differential Expression Analysis

Functional Enrichment Analysis

RNA-seq Reads

Data Trimming

De-novo (Bi)-Clustering

Network Analysis & Modeling

(De-novo) Assembly

Operon Prediction

Non-trivial RNA-seq Analysis Tools

FastQC

edgeR/DESeq, DAVID/GO

RNA-seq Reads

MCL/QUBIC, NCA/GtrieScanner

Cufflinks, Trinity

DOOR, SeqTU

Existing RNA-seq Pipeline Tools

HISAT2, Bridger, HTSeq, RSEM, Cufflinks, Cutadapt, Novoalign, Bowtie 2, BWA, Bowtie, TopHat, TopHat2, STAR, Trinity, DESeq2, GSNAP, edgeR, FastQC, FastX, kallisto, sleuth

ViDGER

• Tool to assist in interpreting and analyzing count matrices
  – PCA, MDS, Clustering
  – DGEA
  – Visualizations
• Basic R package
• Shiny implementation

ViDGER Compatibility
• Count & condition matrix
• Popular DGE tools by citation count
  – Cuffdiff*
  – edgeR
  – DESeq2
  – DEGseq
  – limma
  – sleuth*

[Pie chart. Slices: 12021 (64%), 5200 (28%), 1607 (8%). Legend: DGE & Visualization, Visualization Only, None]

Shiny Input
• Count Matrix
• Generates basic figures from matrix
• Initial Analyses
  – PCA
  – MDS

Differential Gene Expression
• Select DGE tool to analyze data
• Interactive results table
• DGE results visualizations for improved interpretation
• Interactivity between table & figures

Pitfall I: Popularity ≠ High Performance

Tool     MapSplice2   CRAC    GSNAP   Novoalign   TopHat2
Human    97.8%        86.1%   98.9%   90.3%       12.5%

Pitfall II: Gene expression estimation

[Pipeline diagram repeated from "Non-trivial RNA-seq Analysis Pipeline", with the callout: Mapping uncertainty!]

Pitfall II: Gene expression estimation

RNA-seq reads mapping uncertainty

Mapping Uncertainty Occurrences

• Plants
  – Highly duplicative nature of genome
• Animals
  – Alternative splicing
• Metagenomics
  – Sequencing of entire microbial communities simultaneously
  – Identical genes across different species
  – Similar, mutated or evolved genes
  – Currently other issues compounding mapping uncertainty

Pitfall II: How Serious?

Species                 Group       Unique-mapped   Multi-mapped   Un-mapped
Arabidopsis thaliana    Diploid     77%~89%         8%~17%         2%~5%
Vitis vinifera          Diploid     55%~82%         10%~25%        8%~23%
Solanum lycopersicum    Diploid     49%~87%         6%~34%         5%~44%
Solanum tuberosum       Polyploid   55%~69%         18%~26%        12%~19%
Triticum aestivum       Polyploid   62%~69%         18%~25%        9%~18%

Similar things happen in Human (transcript) and Metagenome

Group       Species                      Datasets   Size (G)   Unique-mapped   Multi-mapped   Un-mapped   (Multi-mapped)/(Total mapped)
Diploid     Arabidopsis thaliana         10         153.7      69%~89%         8%~17%         2%~17%      8%~18%
Diploid     Vitis vinifera               10         152.3      55%~82%         9%~25%         8%~23%      10%~31%
Diploid     Solanum lycopersicum         10         151.8      52%~88%         5%~34%         4%~16%      6%~39%
Polyploid   Panicum virgatum             10         385.7      47%~66%         17%~33%        13%~25%     22%~39%
Polyploid   Triticum aestivum            13         348.1      61%~69%         17%~25%        9%~18%      21%~28%
Animal      Human genome                 11         249.9      55%~65%         21%~28%        12%~21%     25%~33%
Animal      Human transcriptome          11         249.9      10%~15%         23%~31%        55%~65%     61%~72%
Animal      Mus musculus genome          10         129.9      40%~70%         10%~38%        3%~31%      13%~48%
Animal      Mus musculus transcriptome   10         129.9      11%~27%         9%~42%         43%~67%     29%~77%
Total                                    95         1951       55%             22%            23%         29%

Mapping Uncertainty in Real Data

Mapping Uncertainty in Plant Data

Mapping Uncertainty in Animal Data

Pitfall II: How to Proceed?

a) Ignore them: only consider unique mapping
   – 30%-70% of reads are discarded from further analysis in plants

b) Random mapping: if there are multiple equally good best matches, choose one at random
   – TopHat

c) Report all: try to keep more information
   – Cufflinks: distribute these multi-mapped reads uniformly or based on the expression level of uniquely mapped reads
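The three strategies can be compared on a toy example. The sketch below is only an illustration under simplifying assumptions: each read is represented as the list of genes it maps to equally well, and the function name `count_reads` and the strategy labels are hypothetical. It is not the code or interface of TopHat or Cufflinks.

```python
import random
from collections import Counter

STRATEGIES = ("unique", "random", "uniform", "proportional")

def count_reads(alignments, strategy, seed=0):
    """Return per-gene read counts under one multi-mapping strategy.

    alignments: list of lists; each inner list names the genes one read
    maps to equally well (length 1 means a uniquely mapped read).
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    rng = random.Random(seed)
    # Counts from uniquely mapped reads, used as weights by "proportional".
    unique = Counter(hits[0] for hits in alignments if len(hits) == 1)
    counts = Counter()
    for hits in alignments:
        if len(hits) == 1:
            counts[hits[0]] += 1.0
        elif strategy == "unique":
            continue  # (a) discard multi-mapped reads
        elif strategy == "random":
            counts[rng.choice(hits)] += 1.0  # (b) pick one location at random
        elif strategy == "uniform":
            for gene in hits:  # (c) split the read evenly
                counts[gene] += 1.0 / len(hits)
        else:
            total = sum(unique[gene] for gene in hits)
            for gene in hits:  # (c) split by unique-read expression level
                weight = unique[gene] / total if total else 1.0 / len(hits)
                counts[gene] += weight
    return dict(counts)

# Toy data: three unique reads for gene A, one for gene B, two ambiguous reads.
reads = [["A"], ["A"], ["A"], ["B"], ["A", "B"], ["A", "B"]]
for name in STRATEGIES:
    print(name, count_reads(reads, name))
```

Note how the estimated ratio between genes A and B changes with the strategy, even though the underlying reads are identical.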

Pitfall II: How to Proceed?

It is an OPEN and challenging problem!

Quantifying Mapping Uncertainty

• Gene Expression Quality Check (GeneQC)
  – Computational program collecting relevant information from datasets
  – Interprets information in a meaningful way to provide quantification of mapping uncertainty
• Two levels of observations
  – Genomic level: sequence similarity between two genomic locations
  – Transcriptomic level: proportion of shared ambiguous reads

GeneQC

D-score
• Allows for a comparable metric of mapping uncertainty
• Combines three statistics
  – Maximum proportion of shared ambiguous reads
  – Maximum base-pair similarity
  – Number of gene pair interactions
• Normalized between 0 and 1 for each dataset

Variables: $D_1$

• $D_1$: Sequence Similarity × Match Length
  – $D_1 = \max_{y} \{ ss_{i,y} \cdot l_{i,y} \}$
  – $ss_{i,y}$ = sequence similarity of gene $i$ and gene $y$
  – $l_{i,y}$ = match length
• Additional constraints for $D_1$
  – e-value < $10^{-6}$
  – SS × Match Length > 100
  – Mismatch < 5
  – Gap < 5

Example: gene $i$ compared with four genes
  gene $y_1$: $ss_{i,1}$ = 65%; $l_{i,1}$ = 100
  gene $y_2$: $ss_{i,2}$ = 85%; $l_{i,2}$ = 200
  gene $y_3$: $ss_{i,3}$ = 85%; $l_{i,3}$ = 350
  gene $y_4$: $ss_{i,4}$ = 85%; $l_{i,4}$ = 200
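As a worked version of the example above, the following sketch computes the maximum of similarity times match length over the four candidate genes. It assumes that similarity is expressed as a fraction (0.85 for 85%) and applies only the "SS × Match Length > 100" constraint; the e-value, mismatch, and gap filters are not modeled. The function name `d1` is hypothetical.

```python
def d1(candidates, min_product=100.0):
    """Maximum of similarity * match length over candidate genes.

    candidates: list of (similarity_fraction, match_length) pairs for
    gene i against each gene y. Pairs whose product does not exceed
    min_product are dropped (assumed reading of the slide's constraint).
    """
    products = [ss * length for ss, length in candidates if ss * length > min_product]
    return max(products, default=0.0)

# The four genes from the slide: y1, y2, y3, y4.
example = [(0.65, 100), (0.85, 200), (0.85, 350), (0.85, 200)]
print(d1(example))
```

Under these assumptions gene $y_1$ is filtered out (0.65 × 100 = 65), and gene $y_3$ determines the value because it has the longest match at the same similarity.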

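Variables: $D_2$

• $D_2$: Max MMR percentage
  – $D_2 = \frac{|G_i \cap X|}{|G_i|}$
  – $G_i$ = reads aligned to gene $i$
  – $X = \arg\max_{Y} |G_i \cap Y|$

[Venn diagram: $G_i$, $X$, and their overlap $G_i \cap X$]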
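A minimal sketch of this proportion using Python sets follows. It assumes each gene's aligned reads are available as a set of read identifiers and that the maximum is taken over the read sets of all other genes; the names `d2` and `reads_by_gene` are hypothetical.

```python
def d2(reads_by_gene, gene):
    """Largest fraction of a gene's reads that are shared with one other gene."""
    own = reads_by_gene[gene]
    if not own:
        return 0.0
    # Size of the largest overlap with any other gene's read set.
    best_overlap = max(
        (len(own & other) for name, other in reads_by_gene.items() if name != gene),
        default=0,
    )
    return best_overlap / len(own)

# Toy read sets: gene i shares reads 3 and 4 with y1, and read 4 with y2.
reads_by_gene = {"i": {1, 2, 3, 4}, "y1": {3, 4, 5}, "y2": {4, 9}}
print(d2(reads_by_gene, "i"))
```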

Variables: $D_3$

• $D_3$: Degree weight
  – $D_3 = \log_{10}(|S_i \cup M_i| + 1)$
  – $S_i$ = {genomic locations where $D_1 > 0$}
  – $M_i$ = {genomic locations where $D_2 > 0$}
• Separated into two populations
  – $D_2 = 0$
  – $D_2 \neq 0$
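The degree weight can be sketched as the logarithm of the number of distinct locations that interact with gene $i$ through either sequence similarity or shared reads. The base of the logarithm is garbled in the slide; base 10 is assumed here, with the +1 inside the logarithm so that a gene with no interactions scores 0. The function name `d3` is hypothetical.

```python
import math

def d3(similar_locations, shared_read_locations):
    """log10(|S ∪ M| + 1), assuming base 10 (the slide's base is unclear)."""
    degree = len(set(similar_locations) | set(shared_read_locations))
    return math.log10(degree + 1)

# Locations with sequence similarity (S) and with shared reads (M).
S = {"y2", "y3", "y4"}
M = {"y3", "y5"}
print(d3(S, M))  # the union holds four distinct locations
```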

Variables by Species

D-score Development

• $D_1$, $D_2$, $D_3$ combined into one distinct value
• Regression-based approach to optimize effect of each parameter

$D = \alpha_1 D_1 + \alpha_2 D_2 + \alpha_3 D_3 + \alpha_4 D_1 D_2 + \alpha_5 D_1 D_3 + \alpha_6 D_2 D_3 + \alpha_7 D_1 D_2 D_3$

$SD = D_3 (\alpha_1 D_1 + \alpha_2 D_2)$
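To make the two formulas concrete, the sketch below evaluates the full interaction model and the simplified score for one gene. The coefficient values are placeholders chosen for illustration only; in the slides they are obtained by regression for each dataset. The function names are hypothetical.

```python
def d_score(d1, d2, d3, alpha):
    """Full model: main effects plus all pairwise and three-way interactions."""
    a1, a2, a3, a4, a5, a6, a7 = alpha
    return (a1 * d1 + a2 * d2 + a3 * d3
            + a4 * d1 * d2 + a5 * d1 * d3 + a6 * d2 * d3
            + a7 * d1 * d2 * d3)

def simplified_d_score(d1, d2, d3, a1, a2):
    """Simplified score: D3 scales a weighted sum of D1 and D2."""
    return d3 * (a1 * d1 + a2 * d2)

# Placeholder inputs and coefficients (not fitted values).
print(d_score(0.5, 0.25, 2.0, [1.0] * 7))
print(simplified_d_score(0.5, 0.25, 2.0, a1=1.0, a2=2.0))
```

In the simplified form, a gene with $D_3 = 0$ (no interacting locations) always scores 0, regardless of the coefficients.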

• $D^*$ used as dependent variable to represent mapping uncertainty
  – $G_i$ = reads mapped to gene $i$ (all matches)
  – $U_i$ = reads uniquely mapped to gene $i$ (unique mapping)
  – Real alignment falls somewhere between
  – $|U_i| \le |R_i| \le |G_i|$

� 𝐷∗ = UV K rVUV

= 1 − rVUV= 1 − UV t uV

�.P≤ 𝐷∗ ≤ 1

• $D^*$ regressed upon $(D_1, D_2, D_3)$ to determine optimized coefficients for each dataset

� Interpretations for each set of coefficients can be used to understand biological mechanisms behind species-specific mapping uncertainty

D-scores

Simplified D-score

Simplified D-score Distributions
• Density plots appear to show mixture distributions
• Individual distributions can help indicate categorizations for mapping uncertainty

Level of Mapping Uncertainty from D-scores

• Mixture model distributions fit to set of D-scores
  – Indicates level of mapping uncertainty for each annotated gene
  – Normal & Gamma distribution fitting
  – Variable number of distributions
• Mixture model fitting using the Expectation-Maximization algorithm
  – $P(X \mid \theta) = \sum_{k} \beta_k Y_k(X \mid \theta_k)$
  – $X = \{x_1, x_2, \dots, x_N\}$ represents the set of D-scores
  – $\beta_k$ represents the weight of the $k$th component, with $\sum_{k} \beta_k = 1$
  – $Y_k(X \mid \theta_k)$ represents the distribution of the $k$th component
  – $\theta_k$ is the set of parameters for the $k$th component

Mixture Model Fitting: Initialization

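• Assume $Y_k(X \mid \theta_k) = N(X; \mu_k, \sigma_k^2)$
• Initial parameterization
  – K-means clustering to separate into $k$ components
  – $\theta_k$, $\beta_k$ calculated for each component using MLE based on $N_k$
  – $MLE(\mu_k) = \frac{\sum_{j} x_{k,j}}{N_k}$
  – $MLE(\sigma_k^2) = \frac{\sum_{j} (x_{k,j} - \mu_k)^2}{N_k}$
  – $\beta_k = \frac{N_k}{N}$, with $N_k$ = number of data points in component $k$ and $\sum_{k} N_k = N$

[Figure label: $k = 4$]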
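The initialization can be sketched as a one-dimensional k-means pass followed by the per-cluster maximum-likelihood estimates. This is an illustration only: the starting centers are spread over the sorted data (an assumption made here so the result is deterministic), and the function names are hypothetical. The slides state that the actual implementation is in R.

```python
def kmeans_1d(xs, k, iters=100):
    """Plain Lloyd's algorithm on scalars; returns a list of k clusters."""
    ordered = sorted(xs)
    # Assumed initialization: evenly spaced positions in the sorted data.
    centers = [ordered[(2 * c + 1) * len(xs) // (2 * k)] for c in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            clusters[nearest].append(x)
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return clusters

def init_params(xs, k):
    """MLE of weight, mean, and variance for each non-empty cluster."""
    params = []
    for cluster in kmeans_1d(xs, k):
        if not cluster:
            continue
        n_k = len(cluster)
        mu = sum(cluster) / n_k
        var = sum((x - mu) ** 2 for x in cluster) / n_k
        params.append({"beta": n_k / len(xs), "mu": mu, "var": var})
    return params

# Toy D-scores with two obvious groups.
scores = [0.10, 0.12, 0.08, 0.90, 0.88, 0.92]
example_params = init_params(scores, k=2)
for p in example_params:
    print(p)
```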

Mixture Model Fitting: Expectation & Maximization

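• Posterior probability of containment within each component for each D-score is calculated

$P(x_j \in k \mid x_j) = \frac{P(x_j \mid x_j \in k)\, P(k)}{P(x_j)} = \frac{N(x_j \mid \mu_k, \sigma_k)\, \frac{N_k}{N}}{\sum_{k'} \beta_{k'} N(x_j \mid \mu_{k'}, \sigma_{k'})} = \frac{\beta_k N(x_j \mid \mu_k, \sigma_k)}{\sum_{k'} \beta_{k'} N(x_j \mid \mu_{k'}, \sigma_{k'})}$

• Parameters for each component calculated after Expectation step

$\mu_k = \frac{\sum_{j=1}^{N} P(x_j \in k \mid x_j)\, x_j}{\sum_{j=1}^{N} P(x_j \in k \mid x_j)}$

$\sigma_k^2 = \frac{\sum_{j=1}^{N} P(x_j \in k \mid x_j)\, (x_j - \mu_k)^2}{\sum_{j=1}^{N} P(x_j \in k \mid x_j)}$

$\beta_k = \frac{1}{N} \sum_{j=1}^{N} P(x_j \in k \mid x_j)$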
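The Expectation and Maximization updates can be sketched in plain Python for a one-dimensional Gaussian mixture. This is a minimal illustration, not the implementation referred to in the slides (which is in R): the variance floor, the tolerance, the starting parameters, and all function names are assumptions introduced here.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def e_step(xs, params):
    """Posterior probability of each component for each data point."""
    resp = []
    for x in xs:
        weighted = [beta * normal_pdf(x, mu, var) for beta, mu, var in params]
        total = sum(weighted)
        resp.append([w / total for w in weighted])
    return resp

def m_step(xs, resp, var_floor=1e-6):
    """Re-estimate (weight, mean, variance) from the posteriors."""
    params = []
    for k in range(len(resp[0])):
        r_k = [r[k] for r in resp]
        n_k = sum(r_k)
        mu = sum(r * x for r, x in zip(r_k, xs)) / n_k
        var = sum(r * (x - mu) ** 2 for r, x in zip(r_k, xs)) / n_k
        # Floor on the variance is a practical safeguard, not from the slides.
        params.append((n_k / len(xs), mu, max(var, var_floor)))
    return params

def log_likelihood(xs, params):
    return sum(math.log(sum(b * normal_pdf(x, mu, var) for b, mu, var in params))
               for x in xs)

def fit(xs, params, tol=1e-8, max_iter=500):
    """Alternate E and M steps until the log-likelihood stops improving."""
    previous = log_likelihood(xs, params)
    for _ in range(max_iter):
        params = m_step(xs, e_step(xs, params))
        current = log_likelihood(xs, params)
        if current - previous < tol:
            break
        previous = current
    return params, current

# Toy D-scores with two groups; starting values are rough guesses.
data = [0.10, 0.12, 0.08, 0.11, 0.09, 0.90, 0.88, 0.92, 0.91, 0.89]
fitted, loglik = fit(data, [(0.5, 0.2, 0.05), (0.5, 0.8, 0.05)])
for beta, mu, var in fitted:
    print(round(beta, 3), round(mu, 3), round(var, 5))
```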

Mixture Model Fitting: Optimization

• Expectation and Maximization steps repeated until no significant improvement is achieved after each iteration
  – Log-likelihood fails to substantially increase
• Implementation in R with $k \in \{1, \dots, 9\}$
  – Best model fit determined by lowest Bayesian Information Criterion (BIC)

[Figure label: $k = 4$]
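Model selection by BIC can be sketched as follows. For a one-dimensional Gaussian mixture with $k$ components, the number of free parameters is taken as $3k - 1$ ($k$ means, $k$ variances, $k - 1$ independent weights), and BIC is computed as $p \ln n - 2 \ln L$. The log-likelihood values below are made-up placeholders for illustration, not results from the slides, and the function names are hypothetical.

```python
import math

def bic(log_lik, k, n):
    """BIC for a 1-D Gaussian mixture with k components fit to n points."""
    n_params = 3 * k - 1
    return n_params * math.log(n) - 2.0 * log_lik

def best_k(log_liks, n):
    """Pick the number of components with the lowest BIC."""
    return min(log_liks, key=lambda k: bic(log_liks[k], k, n))

# Placeholder log-likelihoods for k = 1..4 on n = 200 points.
placeholder_log_liks = {1: -120.0, 2: -80.0, 3: -78.5, 4: -78.0}
chosen = best_k(placeholder_log_liks, n=200)
print(chosen)
```

With these placeholder values the gain in likelihood beyond two components is too small to offset the penalty for extra parameters, which is the trade-off BIC is meant to capture.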

Mixture Model Fitting

$k = 4$: the four distributions provide criteria for separating genes into four categories based on mapping uncertainty level

Addressing Mapping Uncertainty

• Co-expression Modules (CEMs)
  – Genes are typically co-expressed at certain rates with other genes, forming co-expression modules
  – Can use expression levels for known co-expressed genes (CEGs) to predict likely expression levels for the gene locations
  – This information can in turn be used to determine which location is most likely for any particular ambiguous read
• Can use existing information to gain insight into the likelihood of the correct location for alignment
• If no prior CEMs are available, biclustering of data can provide dataset-specific CEMs.

Pitfall III: t-test for differential expression analysis

The Wilcoxon (nonparametric) test has better performance than the t-test (parametric)

Bioinformatics. 2002 Nov;18(11):1454-61. Cited by 308

P-value < 0.0134
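One possible reason a rank-based test can behave better on expression data is robustness to outliers. The sketch below, on made-up values, compares the Welch t statistic with the Mann-Whitney U statistic (the rank-sum form of the Wilcoxon test) before and after one measurement is replaced by an extreme value. It illustrates the general idea only and does not reproduce the analysis in the cited paper; the function names are hypothetical.

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))

def mann_whitney_u(a, b):
    """Number of (x, y) pairs with x > y; ties count one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)

# Made-up expression values for illustration.
control = [5.1, 4.8, 5.3, 5.0, 4.9]
treated = [6.2, 6.0, 6.4, 6.1, 6.3]
treated_outlier = [6.2, 6.0, 6.4, 6.1, 60.0]  # one extreme measurement

print(welch_t(treated, control), mann_whitney_u(treated, control))
print(welch_t(treated_outlier, control), mann_whitney_u(treated_outlier, control))
```

The U statistic depends only on the ordering of the values, so the outlier leaves it unchanged, while the inflated variance shrinks the t statistic.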

Pitfall IV: co-expression correlation

        chip1  chip2  chip3  chip4  chip5  chip6  chip7  chip8  chip9  chip10
Gene1   7.6    6.0    10.8   8.3    9.1    8.7    7.4    6.4    10.2   6.5
Gene2   8.1    7.2    7.0    8.4    8.9    8.8    6.5    10.4   6.9    7.5

Pearson Spearman

• Pearson correlation measures linear relationships
• Spearman’s rank correlation measures monotonic relationships

Pearson or Spearman?
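Both coefficients can be computed directly for the Gene1 and Gene2 values in the table above. The sketch below uses plain Python; Spearman's coefficient is obtained as the Pearson correlation of the ranks, with tied values given their average rank. The function names are chosen here for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

gene1 = [7.6, 6.0, 10.8, 8.3, 9.1, 8.7, 7.4, 6.4, 10.2, 6.5]
gene2 = [8.1, 7.2, 7.0, 8.4, 8.9, 8.8, 6.5, 10.4, 6.9, 7.5]
print("Pearson: ", pearson(gene1, gene2))
print("Spearman:", spearman(gene1, gene2))

# A monotonic but nonlinear relationship: Spearman is 1, Pearson is below 1.
x = [1, 2, 3, 4]
y = [1, 8, 27, 64]
print(pearson(x, y), spearman(x, y))
```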

Pitfall V: Co-expression in LARGE data set

Genes are not necessarily co-expressed under all experimental conditions when we have a large data set!

[Figure axes: Genes, Conditions]

One-dimensional clustering (genes or conditions)

Bi-clustering (genes & conditions): more data!!
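A small constructed example shows why correlation across all conditions can miss condition-specific co-expression. The two made-up gene profiles below move together under the first five conditions and in opposite directions under the last five, so the correlation over all ten conditions is zero even though the first block is perfectly correlated. This illustrates the motivation for biclustering only; it is not the QUBIC algorithm.

```python
import math

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Made-up profiles over ten conditions.
gene_a = [1, 2, 3, 4, 5, 3, 1, 5, 2, 4]
gene_b = [2, 4, 6, 8, 10, 6, 10, 2, 8, 4]

all_conditions = pearson(gene_a, gene_b)
first_block = pearson(gene_a[:5], gene_b[:5])
second_block = pearson(gene_a[5:], gene_b[5:])
print(all_conditions, first_block, second_block)
```

Clustering genes over all conditions would treat these two genes as unrelated, while a method that also selects a subset of conditions could recover the co-expressed block.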

Computer Lab Requirement

• Recent version of the following software
  – R
  – RStudio
  – MiKTeX (or TeX Live)

• Install the following R packages on your personal computer
  – edgeR
  – QUBIC
  – sand

Final Report Presentation

• 12 teams, 3 people per team

• For each team, a 15-minute team presentation
  – 12 minutes of presentation
  – 3 minutes of question-and-answer

• One score per team
