enabling reproducible gene expression analysis using biological pathways limsoon wong 7 april 2011...
TRANSCRIPT
Enabling Reproducible Gene Expression Analysis Using
Biological Pathways
Limsoon Wong
7 April 2011
(Joint work with Donny Soh, Difeng Dong, Yike Guo)
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
2
Plan
• An issue in gene expression analysis
• Comparing pathway sources: Comprehensiveness, Consistency, Compatibility
• Matching pathways in different sources
• Finding more consistent disease subnetworks
An Issue in Gene Expression Analysis
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
4
First, the good news..
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
5
• The subtypes look similar
• Conventional diagnosis– Immunophenotyping– Cytogenetics– Molecular diagnosticsÞ Unavailable in
developing countries
Childhood Acute Lymphoblastic Leukemia
• Major subtypes: T-ALL, E2A-PBX, TEL-AML, BCR-ABL, MLL genome rearrangements, Hyperdiploid>50
• Diff subtypes respond differently to same Tx
• Over-intensive Tx – Development of
secondary cancers– Reduction of IQ
• Under-intensiveTx – Relapse
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
6
Fold Change
T-test
x – Microarray value after drugy – Microarray value before drugi – Gene
x – Log2 value of treatmenty – Log2 value of controls – Standard errori – Gene
Individual Gene Testing
Golub et al, Science, 1999
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
7
Yeoh et al, Cancer Cell 2002
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
8
Conventional Tx:• intermediate intensity to allÞ 10% suffers relapseÞ 50% suffers side effectsÞ costs US$150m/yr
Our optimized Tx:• high intensity to 10%• intermediate intensity to 40%• low intensity to 50%• costs US$100m/yr
Copyright © 2004 by Jinyan Li and Limsoon Wong
•High cure rate of 80%• Less relapse
• Less side effects• Save US$51.6m/yr
Impact
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
9
Now, the bad news..
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
10
Percentage of Overlapping Genes
• Low % of overlapping genes from diff expt in general
– Prostate cancer• Lapointe et al, 2004• Singh et al, 2002
– Lung cancer• Garber et al, 2001• Bhattacharjee et al,
2001– DMD
• Haslett et al, 2002• Pescatori et al, 2007
Datasets DEG POG
ProstateCancer
Top 10 0.30
Top 50 0.14
Top100 0.15
LungCancer
Top 10 0.00
Top 50 0.20
Top100 0.31
DMDTop 10 0.20
Top 50 0.42
Top100 0.54
Zhang et al, Bioinformatics, 2009
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
11
Gene Regulatory Circuits
• Each disease subtype has underlying cause
• There is a unifying biological theme for genes that are truly associated with a disease subtype
• Uncertainty in selected genes can be reduced by considering biological processes of the genes
• The unifying biological theme is basis for inferring the underlying cause of disease subtype
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
12
Towards More Meaningful Genes
• ORA – Khatri et al– Genomics, 2002
• FCS– Pavlidis & Noble– PSB 2002
• GSEA– Subramanian et al– PNAS, 2005
• Pathway Express– Draghici et al – Genome Res, 2007
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
13
All of these newer methods rely on gene group or pathway information.
But how good are the available sources of pathway information?
Comparing Pathway Sources:Comprehensiveness, Consistency, &
Compatibility
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
15
Data Sources
• KEGG– Curated by a single lab– Long famous history– Used by many people
• Wikipathways– Community effort– new curation model
• Ingenuity– Commercial effort– Used by many biopharma’s
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
16
Low Comprehensiveness of Pathway Sources
0
50
100
150
200
250
300
350
400
450
500
Unified KEGG Ingenuity Wiki
# of Pathways
0
10000
20000
30000
40000
50000
60000
70000
Unified KEGG Ingenuity Wiki
# of Genes Pairs
0
5000
10000
15000
20000
25000
Unified KEGG Ingenuity Wiki
# of Genes
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
17
Gene Pair Overlap
Gene Overlap
Wiki vs KEGG Wiki vs Ingenuity KEGG vs Ingenuity
Wiki vs KEGG Wiki vs Ingenuity KEGG vs Ingenuity
Low Consistencyof Pathway Sources
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
18
Example: Apoptosis Pathway
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
19
Would Unifying Pathway Sources Help?
• Incompatibility Issues!• Data extraction method variations
• Format variations
• Data differences
• Gene/GeneID name differences
• Pathway name differences
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
20
The preceding analyses hide an intricate issue…
The same pathways in the different sources are often given different names.
So how do we even know two pathways are the same and should be compared / merged?
Intricacy of Pathway Matching
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
22
Possible Ways to Match Pathways
• Match based on name– Pathways w/ similar name should be the same
pathway– But annotations are very noisyÞ Likely to mismatch pathways?Þ Likely to match too many pathways?
• Are the followings good alternative approaches?– Match based on overlap of genes– Match based on overlap of gene pairs
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
23
Matching Pathways by Name
• LCS procedure– Given pathway X in
db A– Sort pathways in db
B by “longest common substring” with X
– Manually scan the ranked list to choose closest nomen-clatural match
• Issue: Accuracy– When LCS says two
pathways are the same one, are they really the same?
• Issue: Completeness– When LCS says two
pathways are different, are they really different?
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
24
LCS vs Gene-Agreement Matching
• Accuracy– 94% of LCS matches
are in top 3 gene agreement matches
– 6% of LCS matches not in top 3 of gene agreement matches; but their gene-pair agreement levels are higher
• Completeness– Let Pi be a pathway in
db A that LCS cannot find match in db B
– Let Qi be pathway in db B with highest gene agreement to Pi
– Gene-pair agreement of Pi-Qi is much lower than pathway pairs matched by LCS
LCS is better than gene-agreement based matching!
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
25
LCS vs Gene-Agreement Matching
• LCS consistently has higher gene-pair agreementÞ LCS is better than gene-agreement based matching!
gene overlap percentage
Gene-pair overlap percentage
LCS match
Gene-agreement match
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
26
LCS vs Gene-Pair Agreement Matching
8
2416
LCS
Gene-PairOverlap
The 8 pathway pairs singled out by LCS
The 24 pathway pairs singled out by maximal gene-pair overlap
Note: We consider only pathway pairs that have at least 20 reaction overlap.
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
27
LCS vs Gene-Pair Agreement Matching
• Gene-pair agreement match will miss when– Pathway P in db A has few overlap with pathway P in
db B due to incompleteness of db, even if pathway name matches perfectly!
– Example: wnt signaling pathway, VEGF signaling pathway, MAPK signaling pathway, etc. in KEGG don’t have largest gene-pair overlap w/ corresponding pathways in Wikipathways & Ingenuity
Þ Bad for getting a more complete unified pathway P
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
28
LCS vs Gene-Pair Agreement Matching
• Pathways having large gene-pair overlap are not necessarily the same pathways
• Examples– “Synaptic Long Term Potentiation” in Ingenuity vs
“calcium signalling” in KEGG – “PPAR-alpha/RXR-alpha Signaling” in Ingenuity vs
“TGF-beta signaling pathway” in KEGG
Þ Difficult to set correct gene-pair overlap threshold to balance against false positive matches
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
29
% o
verla
p w
ith L
CS
Top n% matching pathways based on gene / gene pair overlap
Gene vs Gene Pair Agreement Matching
• Pathways w/ higher gene/gene pair overlap have higher overlap w/ LCS
• Gene pair matching is better than gene matching• But both are not as good as LCS
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
30
PathwayAPI= KEGG
+ Wikipathways + Ingenuity
• Having found a good way to match up pathways in different datasources, we proceeded to build a big unified pathway db….
Donny Soh, Difeng Dong, Yike Guo, Limsoon Wong. Consistency, Comprehensiveness, and Compatibility of Pathway Databases. BMC Bioinformatics, 11:449, September 2010.
More Consistent Disease Subnetworks
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
32
But these methods still
don’t return the precise parts of a pathway that
are significant…
• ORA – Khatri et al– Genomics, 2002
• FCS– Pavlidis & Noble– PSB 2002
• GSEA– Subramanian et al– PNAS, 2005
• Pathway Express– Draghici et al – Genome Res, 2007
Test whole gene group at a time
Test a node and its immediate neighbours
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
33
The SNet Method
• Group samples into type D and D• Extract & score subnetworks for type D
– Get list of genes highly expressed in most D samples• These genes need not be differentially expressed!
– Put these genes into pathways– Locate connected components (ie., candidate
subnetworks) from these pathway graphs– Score subnetworks on D samples and on D samples
• For each subnetwork, compute t-statistics on the two sets of scores
• Determine significant subnetworks by permutations
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
34
SNet: Extract Subnetworks
Genes highly expressed in many type-D samples
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
35
SNet: Score Subnetworks
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
36
SNet: Significant Subnetworks
• Randomize patient samples many times
• Get t-score for subnetworks from the randomizations
• Use these t-scores to establish null distribution
• Filter for significant subnetworks from real samples
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
37
Let’s see whether SNet gives us subnetworks that are
(i) more consistent between datasets of the same types of disease samples
(ii) larger and more meaningful
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
38
Recall Examples from “Bad News”
• Low % of overlapping genes from diff expt in general
– Prostate cancer• Lapointe et al, 2004• Singh et al, 2002
– Lung cancer• Garber et al, 2001• Bhattacharjee et al,
2001– DMD
• Haslett et al, 2002• Pescatori et al, 2007
Datasets DEG POG
ProstateCancer
Top 10 0.30
Top 50 0.14
Top100 0.15
LungCancer
Top 10 0.00
Top 50 0.20
Top100 0.31
DMDTop 10 0.20
Top 50 0.42
Top100 0.54
Zhang et al, Bioinformatics, 2009
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
39
Better Subnetwork Overlap
• For each disease, take significant subnetworks from one dataset and see if it is also significant in the other dataset
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
40
Better Gene Overlaps
• For each disease, take significant subnetworks extracted independently from both datasets and see how much their genes overlap
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
41
Larger Subnetworks
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
42
Genes A, B, C are high in phenotype D
A is high in phenotype ~D but B and C are not
A
B
C
Conventional techniques: Gene B and Gene C are selected. Possible incorrect postulation of mutations in gene B and C
Key Insight # 1
• SNet does not require all the genes in subnet to be diff expressed
• It only requires the subnet as a whole to be diff expressed
• Able to capture entire relationship, postulating a mutation in gene A
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
43
A branch within pathway consisting of genes A, B, C, D and E are high in phenotype D
Genes C, D and E not high in phenotype ~D
30 other genes not diff expressed
A
B
C
Conventional techniques: Entire subnetwork is likely to be missed
D
E
30 other genes
Key Insight # 2
• SNet: Able to capture the entire subnetwork branch within the pathway
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
44
Genes A, B and C are present in two separate pathways
A, B and C are high in phenotype D, but not high in phenotype ~D
Conventional techniques:
Both pathways are scored equally. So both got selected, resulting in pathway 2 being a false positive
A
B
C
A
B
C
Pathway 1 Pathway 2
Key Insight # 3
• SNet: Able to select only pathway 1, which has the relevant relationship
Remarks
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
46
What have we learned?
• Significant lack of concordance betw db’s– Level of consistency for genes is 0% to 88%– Level of consistency for genes pairs is 0%-61%– Most db contains less than half of the pathways in
other db’s
• Matching pathways by name is better than matching by gene overlap or gene-pair overlap
• SNet method yields more consistent and larger disease subnetworks
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
47
Acknowledgements
• A*STAR AIP scholarship• A*STAR SERC PSF grant
Difeng DongDonny Soh Yike Guo
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011 Copyright 2011 © Limsoon Wong
48
References
• Eng-Juh Yeoh, Mary E. Ross, Sheila A. Shurtleff, W. Kent William, Divyen Patel, Rami Mahfouz, Fred G. Behm, Susana C. Raimondi, Mary V. Reilling, Anami Patel, Cheng Cheng, Dario Campana, Dawn Wilkins, Xiaodong Zhou, Jinyan Li, Huiqing Liu, Chin-Hon Pui, William E. Evans, Clayton Naeve, Limsoon Wong, James R. Downing. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1:133--143, March 2002.
• Donny Soh, Difeng Dong, Yike Guo, Limsoon Wong. Enabling More Sophisticated Gene Expression Analysis for Understanding Diseases and Optimizing Treatments. ACM SIGKDD Explorations, 9(1):3--14, June 2007.
• Donny Soh, Difeng Dong, Yike Guo, Limsoon Wong. Consistency, Comprehensiveness, and Compatibility of Pathway Databases. BMC Bioinformatics, 11:449, September 2010.
• Donny Soh, Understanding Pathways, PhD thesis, December 2010, Imperial College London