![Page 1: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/1.jpg)
For written notes on this lecture, please read chapter 14 of The Practical Bioinformatician.
CS2220: Introduction to Computational Biology
Unit 3: Gene Expression Analysis
Wong Limsoon
![Page 2: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/2.jpg)
CS2220, AY17/18 Copyright 2015 © Wong Limsoon
Plan
• Microarray background
• Gene expression profile classification
• Gene expression profile clustering
• Normalization
• Extreme sample selection
• Gene regulatory network inference
![Page 3: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/3.jpg)
Background on microarrays
![Page 4: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/4.jpg)
What is a microarray?
• Contain large numbers of DNA molecules spotted on glass slides, nylon membranes, or silicon wafers
• Detect what genes are being expressed or found in a cell of a tissue sample
• Measure expression of thousands of genes simultaneously
![Page 5: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/5.jpg)
Affymetrix GeneChip®
![Page 6: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/6.jpg)
Making Affymetrix GeneChip®
• Quartz is washed to ensure uniform hydroxylation across its surface and to attach linker molecules
• Exposed linkers become deprotected and are available for nucleotide coupling
Exercise: What is the other commonly used type of microarray? How is that one different from Affymetrix’s?
![Page 7: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/7.jpg)
Gene expression measurement by Affymetrix GeneChip®
Click to watch an interesting movie explaining the working of microarray
![Page 8: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/8.jpg)
Sample Affymetrix GeneChip® data file (U95A)
![Page 9: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/9.jpg)
Some advice on processing Affymetrix GeneChip® data
• Ignore AFFX genes
– These genes are control genes
• Ignore genes with “Abs Call” equal to “A” or “M”
– Measurement quality is suspect
• Upper bound 40000, lower bound 100
– Saturation of laser scanner
• Deal with missing values
Exercise: Suggest 2 ways to deal with missing values
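The filtering and thresholding advice above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(probe, value, abs_call)` record layout and the name `preprocess` are hypothetical, and the bounds are read here as clipping each value into the range 100 to 40000. Missing values are deliberately left out, since they are the subject of the exercise.

```python
def preprocess(records, low=100, high=40000):
    """Filter and threshold (probe, value, abs_call) records.

    Assumed layout: probe is the probe-set name, value the measured
    expression, abs_call one of "P", "M", "A".
    """
    kept = {}
    for probe, value, abs_call in records:
        if probe.startswith("AFFX"):
            continue  # control genes
        if abs_call in ("A", "M"):
            continue  # measurement quality is suspect
        # Clip into [low, high]; the upper bound reflects scanner saturation.
        kept[probe] = min(max(value, low), high)
    return kept


# Made-up example records, for illustration only.
records = [
    ("AFFX-BioB-5_at", 500, "P"),
    ("37720_at", 50, "P"),
    ("38028_at", 65000, "P"),
    ("38319_at", 900, "A"),
]
cleaned = preprocess(records)
```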
![Page 10: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/10.jpg)
Type of gene expression datasets
• Gene-Conditions or Gene-Sample (numeric or discretized)
• Gene-Sample-Time
• Gene-Time
100-500 rows, 1000-100,000 columns

          Class     Gene1   Gene2   Gene3   Gene4   Gene5   Gene6   Gene7   .....
Sample1   Cancer    0.12    -1.3    1.7     1.0     -3.2    0.78    -0.12
Sample2   Cancer    1.3
.         ~Cancer
SampleN   ~Cancer

[Figure: expression level plotted against time]
![Page 11: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/11.jpg)
Type of gene expression datasets
• Gene-Conditions or Gene-Sample (numeric or discretized)
• Gene-Sample-Time
• Gene-Time
100-500 rows, 1000-100,000 columns

          Class     Gene1   Gene2   Gene3   Gene4   Gene5   Gene6   Gene7   .....
Sample1   Cancer    1       0       1       1       1       0       0
Sample2   Cancer    1
.         ~Cancer
SampleN   ~Cancer

[Figure: expression level plotted against time]
![Page 12: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/12.jpg)
Application: Disease subtype diagnosis
[Figure: expression matrix with axes labeled genes and samples; four samples are labeled malign, four are labeled benign, and the unlabeled samples are marked ???]
![Page 13: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/13.jpg)
Application: Treatment prognosis
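[Figure: expression matrix with axes labeled genes and samples; four samples are labeled NR, four are labeled R, and the unlabeled samples are marked ???]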
![Page 14: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/14.jpg)
Type of gene expression datasets
• Gene-Conditions or Gene-Sample (numeric or discretized)
• Gene-Sample-Time
• Gene-Time
100-500 rows, 1000-100,000 columns

          Gene1   Gene2   Gene3   Gene4   Gene5   Gene6   Gene7
Cond1     0.12    -1.3    1.7     1.0     -3.2    0.78    -0.12
Cond2     1.3
.
CondN

[Figure: expression level plotted against time]
![Page 15: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/15.jpg)
Application: Drug-action detection
• Which group of genes does the drug affect? Why?
[Figure: expression matrix with axes labeled genes and conditions; four conditions are labeled Normal and four are labeled Drug]
Exercise #1
![Page 16: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/16.jpg)
Gene expression profile classification
Childhood acute lymphoblastic leukemia subtype diagnosis
![Page 17: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/17.jpg)
Childhood ALL
• Major subtypes: T-ALL, E2A-PBX, TEL-AML, BCR-ABL, MLL genome rearrangements, Hyperdiploid>50
• Different subtypes respond differently to the same treatment (Tx)
• Over-intensive Tx
– Development of secondary cancers
– Reduction of IQ
• Under-intensive Tx
– Relapse
• The subtypes look similar
• Conventional diagnosis
– Immunophenotyping
– Cytogenetics
– Molecular diagnostics
• Unavailable in most ASEAN countries
![Page 18: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/18.jpg)
Mission
• Conventional risk assignment procedure requires difficult, expensive tests and collective judgement of multiple specialists
• Generally available only in major advanced hospitals
⇒ Can we have a single-test easy-to-use platform instead?
![Page 19: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/19.jpg)
Single-test platform of microarray & machine learning
![Page 20: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/20.jpg)
Overall strategy
• For each subtype, select genes to develop classification model for diagnosing that subtype
• For each subtype, select genes to develop prediction model for prognosis of that subtype
Diagnosis of subtype
Subtype-dependent prognosis
Risk-stratified treatment intensity
![Page 21: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/21.jpg)
Subtype diagnosis by PCL
• Gene expression data collection
• Gene selection by χ²
• Classifier training by emerging pattern
• Classifier tuning (optional for some machine learning methods)
• Apply classifier for diagnosis of future cases by PCL
![Page 22: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/22.jpg)
Childhood ALL subtype diagnosis workflow
A tree-structured diagnostic workflow was recommended by our doctor collaborator
![Page 23: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/23.jpg)
Training and testing sets
![Page 24: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/24.jpg)
Signal selection basic idea
• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance
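One way to turn these two criteria into a single number is a signal-to-noise style score: the gap between the class means (inter-class distance) divided by the sum of the class standard deviations (intra-class spread). The sketch below is an illustrative choice of measure, not necessarily the one used in this course; the name `separation_score` and the toy values are made up.

```python
from statistics import mean, stdev


def separation_score(class_a, class_b):
    """Gap between class means relative to the spread within each class.

    Assumes each class has at least two values and nonzero spread.
    """
    return abs(mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


# Toy expression values of two genes in two classes (made-up numbers).
good_gene = separation_score([1, 2, 3], [11, 12, 13])  # tight classes, far apart
poor_gene = separation_score([1, 5, 9], [2, 6, 10])    # wide classes, overlapping
```

A gene with a higher score is a better candidate signal under this measure.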
![Page 25: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/25.jpg)
Signal selection by χ²
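The slide's own formula is not recoverable from the extracted text, so the following is only a sketch of the standard Pearson χ² statistic applied to a contingency table of a discretized gene against class labels. The table layout (rows are expression intervals, columns are classes) and the function name `chi_square` are assumptions.

```python
def chi_square(table):
    """Pearson chi-square statistic of a contingency table.

    table[i][j] = number of samples of class j whose expression falls in
    interval i. Assumes every row total and column total is nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up counts: a gene that splits the classes perfectly vs. one that does not.
discriminating = chi_square([[10, 0], [0, 10]])
uninformative = chi_square([[5, 5], [5, 5]])
```

Genes can then be ranked by this statistic, with larger values indicating a stronger association between expression interval and class.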
![Page 26: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/26.jpg)
Emerging patterns
• An emerging pattern is a set of conditions
– usually involving several features
– that most members of a class satisfy
– but none or few of the other class satisfy
• A jumping emerging pattern is an emerging pattern that
– some members of a class satisfy
– but no members of the other class satisfy
• We use only jumping emerging patterns
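The definition of a jumping emerging pattern can be checked mechanically. In this sketch, each sample is represented as the set of condition reference numbers it satisfies, and a pattern is a set of such numbers; that representation, the function names, and the toy samples are assumptions made for illustration.

```python
def frequency(pattern, samples):
    """Number of samples that satisfy every condition in the pattern."""
    return sum(1 for satisfied in samples if pattern <= satisfied)


def is_jumping_ep(pattern, home_class, other_class):
    """True if some home-class samples satisfy the pattern and no other-class sample does."""
    return frequency(pattern, home_class) > 0 and frequency(pattern, other_class) == 0


# Made-up samples, each given as the set of conditions it satisfies.
positive = [{9, 36, 4}, {9, 36}, {9, 23}]
negative = [{7, 21}, {7, 11, 9}]
```

Here `{9, 36}` is a jumping emerging pattern of the positive class, while `{9}` is not, because one negative sample also satisfies it.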
![Page 27: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/27.jpg)
Examples
Reference number 9: the expression of gene 37720_at > 215
Reference number 36: the expression of gene 38028_at ≤ 12

Patterns    Frequency (P)   Frequency (N)
{9, 36}     38 instances    0
{9, 23}     38              0
{4, 9}      38              0
{9, 14}     38              0
{6, 9}      38              0
{7, 21}     0               36
{7, 11}     0               35
{7, 43}     0               35
{7, 39}     0               34
{24, 29}    0               34
Easy interpretation
![Page 28: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/28.jpg)
PCL: Prediction by Collective Likelihood
![Page 29: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/29.jpg)
PCL learning
Top-ranked EPs in positive class: $EP_1^P$ (90%), $EP_2^P$ (86%), …, $EP_n^P$ (68%)
Top-ranked EPs in negative class: $EP_1^N$ (100%), $EP_2^N$ (95%), …, $EP_n^N$ (80%)
The idea of summarizing multiple top-ranked EPs is intended to avoid some rare tie cases
![Page 30: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/30.jpg)
PCL testing
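$$\mathrm{Score}^P = \frac{EP_1^{P'}}{EP_1^P} + \dots + \frac{EP_k^{P'}}{EP_k^P}$$
where $EP_1^{P'}$ is the most frequent EP of the positive class in the test sample, and $EP_1^P$ is the most frequent EP of the positive class.
Similarly,
$$\mathrm{Score}^N = \frac{EP_1^{N'}}{EP_1^N} + \dots + \frac{EP_k^{N'}}{EP_k^N}$$
If $\mathrm{Score}^P > \mathrm{Score}^N$, then positive class; otherwise negative class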
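The scoring rule can be sketched as follows. This is one reading of the slide, not the authors' implementation: each EP symbol in a ratio is taken to stand for that EP's frequency in its own class, a pattern is a set of condition reference numbers, and a test sample is the set of conditions it satisfies. The names `pcl_score` and `predict`, the choice `k=2`, and the test sample are hypothetical; the pattern frequencies are borrowed from the Examples slide.

```python
def pcl_score(class_eps, sample, k):
    """Sum of frequency ratios between the top-k EPs matched by the sample
    and the top-k EPs of the class overall.

    class_eps: list of (pattern, frequency) pairs for one class.
    """
    ranked = sorted(class_eps, key=lambda ep: ep[1], reverse=True)
    top = [freq for _, freq in ranked[:k]]
    matched = [freq for pattern, freq in ranked if pattern <= sample][:k]
    # zip stops early when the sample matches fewer than k EPs.
    return sum(m / t for m, t in zip(matched, top))


def predict(pos_eps, neg_eps, sample, k):
    """Positive if ScoreP > ScoreN, otherwise negative."""
    return "positive" if pcl_score(pos_eps, sample, k) > pcl_score(neg_eps, sample, k) else "negative"


pos_eps = [(frozenset({9, 36}), 38), (frozenset({9, 23}), 38), (frozenset({4, 9}), 38)]
neg_eps = [(frozenset({7, 21}), 36), (frozenset({7, 11}), 35), (frozenset({7, 43}), 35)]
sample = {4, 9, 36}  # made-up test sample
label = predict(pos_eps, neg_eps, sample, k=2)
```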
![Page 31: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/31.jpg)
Accuracy of PCL (vs. other classifiers)
The classifiers are all applied to the 20 genes selected by χ² at each level of the tree
![Page 32: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/32.jpg)
Understandability of PCL
• E.g., for T-ALL vs. OTHERS, one ideally discriminatory gene 38319_at was found, inducing these 2 EPs
• These give us the diagnostic rule
![Page 33: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/33.jpg)
Multidimensional scaling plot for subtype diagnosis
Obtained by performing PCA on the 20 genes chosen for each level
![Page 34: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/34.jpg)
Childhood ALL cure rates
• Conventional risk assignment procedure requires difficult, expensive tests and collective judgement of multiple specialists
⇒ Not available in less advanced ASEAN countries
![Page 35: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/35.jpg)
Childhood ALL treatment cost
• Treatment for childhood ALL over 2 yrs
– Intermediate intensity: US$60k
– Low intensity: US$36k
– High intensity: US$72k
• Treatment for relapse: US$150k
• Cost for side-effects: Unquantified
![Page 36: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/36.jpg)
Current situation (2000 new cases / yr in ASEAN)
• Intermediate intensity conventionally applied in less advanced ASEAN countries
• Over intensive for 50% of patients, thus more side effects
• Under intensive for 10% of patients, thus more relapse
• US$120m (US$60k * 2000) for intermediate intensity tx
• US$30m (US$150k * 2000 * 10%) for relapse tx
• Total US$150m/yr plus un-quantified costs for dealing with side effects
![Page 37: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/37.jpg)
Using our platform
• Low intensity applied to 50% of patients
• Intermediate intensity to 40% of patients
• High intensity to 10% of patients
⇒ Reduced side effects
⇒ Reduced relapse
⇒ 75-80% cure rates
• US$36m (US$36k * 2000 * 50%) for low intensity
• US$48m (US$60k * 2000 * 40%) for intermediate intensity
• US$14.4m (US$72k * 2000 * 10%) for high intensity
• Total US$98.4m/yr
⇒ Save US$51.6m/yr
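The arithmetic behind the two totals can be reproduced directly from the figures on the cost slides. The sketch below uses only those numbers; the variable and function names are made up. As on the slide, the platform total contains no relapse term.

```python
CASES_PER_YEAR = 2000
COST = {"low": 36_000, "intermediate": 60_000, "high": 72_000}  # US$ per patient
RELAPSE_COST = 150_000                                          # US$ per relapse


def treatment_cost(share_percent):
    """Yearly treatment cost when share_percent[level] percent of cases get that intensity."""
    return sum(CASES_PER_YEAR * pct // 100 * COST[level]
               for level, pct in share_percent.items())


# Current situation: intermediate intensity for everyone, 10% relapse.
current = treatment_cost({"intermediate": 100}) + CASES_PER_YEAR * 10 // 100 * RELAPSE_COST

# Using the platform: 50% low, 40% intermediate, 10% high intensity.
platform = treatment_cost({"low": 50, "intermediate": 40, "high": 10})

saving = current - platform
print(f"current US${current / 1e6:.1f}m, platform US${platform / 1e6:.1f}m, "
      f"saving US${saving / 1e6:.1f}m per year")
```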
![Page 38: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/38.jpg)
A nice ending…
• Asian Innovation Gold Award 2003
![Page 39: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/39.jpg)
Gene expression profile clustering
Novel disease subtype discovery
![Page 40: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/40.jpg)
Is there a new subtype?
• Hierarchical clustering of gene expression profiles reveals a novel subtype of childhood ALL
Exercise: Name and describe one bi-clustering method
![Page 41: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/41.jpg)
Hierarchical clustering
• Assign each item to its own cluster
– If there are N items initially, we get N clusters, each containing just one item
• Find the “most similar” pair of clusters, merge them into a single cluster, so we now have one less cluster
• Repeat previous step until all items are clustered into a single cluster of size N
More about this in a moment
![Page 42: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/42.jpg)
Gene expression profile clustering
Diagnosis via guilt-by-association
![Page 43: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/43.jpg)
Some patient samples
• Does Mr. A have cancer?
[Figure: expression matrix with axes labeled genes and samples; four samples are labeled malign and four benign, and Mr. A's sample is marked ???]
![Page 44: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/44.jpg)
Let’s rearrange the rows…
• Does Mr. A have cancer?
[Figure: the same expression matrix with its rows rearranged so that the four malign samples and the four benign samples are grouped; Mr. A's sample is marked ???]
![Page 45: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/45.jpg)
and the columns too…
• Does Mr. A have cancer?
[Figure: the expression matrix with both rows and columns rearranged; four samples are labeled malign and four benign, and Mr. A's sample is marked ???]
![Page 46: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/46.jpg)
Introduction to simple clustering methods
![Page 47: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/47.jpg)
What is cluster analysis?
• Finding groups of objects such that objects in a group are similar to one another and different from objects in other groups
Inter-cluster distances are maximized
Intra-cluster distances are minimized
![Page 48: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/48.jpg)
Notion of a cluster can be ambiguous
How many clusters?
[Figure: panels titled Two Clusters, Four Clusters, and Six Clusters]
![Page 49: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/49.jpg)
We can also have
![Page 50: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/50.jpg)
K-means clustering
• Partitional clustering approach
• Each cluster is associated with a centroid
• Each point is assigned to the cluster with the closest centroid
• # of clusters, K, must be specified
Assignment
Update
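The alternation between the assignment step and the update step can be sketched in plain Python. This is a minimal illustration, assuming Euclidean distance, caller-supplied initial centroids, and a stop when the centroids no longer change; the function name `kmeans` and the toy points are made up.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Basic K-means; K is the number of initial centroids supplied."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment: each point joins the cluster with the closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
```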
![Page 51: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/51.jpg)
K-means clustering illustration
[Figure: six scatter plots of y (0 to 3) against x (-2 to 2), titled Iteration 1 to Iteration 6]
![Page 52: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/52.jpg)
K-means clustering illustration
[Figure: six scatter plots of y (0 to 3) against x (-2 to 2), titled Iteration 1 to Iteration 6]
![Page 53: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/53.jpg)
Importance of choosing initial centroids
[Figure: five scatter plots of y (0 to 3) against x (-2 to 2), titled Iteration 1 to Iteration 5]
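A small one-dimensional example shows why the starting centroids matter: the same data can converge to clusterings of very different quality. The sketch below is illustrative only; the data, the function name `kmeans_1d`, and the use of the sum of squared errors (SSE) as the quality measure are assumptions rather than content from the slide.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """K-means on numbers; returns the final centroids and the SSE."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step (an empty cluster keeps its previous centroid).
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    sse = sum((v - m) ** 2 for m, c in zip(centroids, clusters) for v in c)
    return centroids, sse


values = [0, 1, 10, 11, 20, 21]            # three obvious pairs
_, sse_good = kmeans_1d(values, [0, 10, 20])  # one start per pair
_, sse_poor = kmeans_1d(values, [0, 1, 10])   # two starts inside the same pair
```

With the second start, the algorithm settles on a local optimum in which four points share one cluster, so its SSE is far larger than with the first start.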
![Page 54: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/54.jpg)
Hierarchical clustering
• Two main types of hierarchical clustering
– Agglomerative:
  • Start with the points as individual clusters
  • At each step, merge the closest pair of clusters until only one cluster (or k clusters) left
– Divisive:
  • Start with one, all-inclusive cluster
  • At each step, split a cluster until each cluster contains a point (or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time
![Page 55: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/55.jpg)
Agglomerative hierarchical clustering
• More popular hierarchical clustering technique
• Basic algorithm
    Compute the proximity matrix
    Let each data point be a cluster
    Repeat
        Merge the two closest clusters
        Update the proximity matrix
    Until only a single cluster remains
• Key is computation of proximity of two clusters
– Different approaches to defining the distance / similarity between clusters
[Figure: the Merge and Update steps]
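The basic algorithm can be sketched as below. For brevity, this version recomputes cluster distances from the points at every step instead of maintaining a proximity matrix, and it fixes the cluster distance to single linkage (minimum pairwise Euclidean distance); both are simplifying assumptions, and the name `agglomerative` and the toy points are made up.

```python
import math


def agglomerative(points, k=1):
    """Merge the two closest clusters until only k clusters remain."""

    def single_link(a, b):
        return min(math.dist(p, q) for p in a for q in b)

    clusters = [[p] for p in points]  # let each data point be a cluster
    merges = []                       # record of merges, in order
    while len(clusters) > k:
        # Find the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = agglomerative(points, k=2)
```

The recorded `merges` list is the information a dendrogram displays.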
![Page 56: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/56.jpg)
Visualization of agglomerative hierarchical clustering
[Figure: two panels over points p1 to p4, titled Traditional Hierarchical Clustering and Traditional Dendrogram]
![Page 57: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/57.jpg)
Single, complete, & average linkage
• Single linkage defines the distance between two clusters as the minimum distance between a point in one and a point in the other
• Complete linkage defines the distance between two clusters as the maximum distance between a point in one and a point in the other
Exercise: Give definition of “average linkage”
Image source: UCL Microcore Website
Exercise #2
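The two definitions given above translate directly into code. This sketch assumes Euclidean distance between points; the function names and the toy clusters are made up, and average linkage is left to the exercise.

```python
import math


def single_linkage(a, b):
    """Minimum distance over all pairs with one point from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_linkage(a, b):
    """Maximum distance over all pairs with one point from each cluster."""
    return max(math.dist(p, q) for p in a for q in b)


cluster_a = [(0, 0), (1, 0)]
cluster_b = [(3, 0), (5, 0)]
nearest = single_linkage(cluster_a, cluster_b)     # closest pair: (1, 0) and (3, 0)
farthest = complete_linkage(cluster_a, cluster_b)  # farthest pair: (0, 0) and (5, 0)
```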
![Page 58: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/58.jpg)
Simulation: Starting situation
• Start with clusters of individual points and a proximity matrix
[Figure: points p1, p2, p3, p4, ..., p9, p10, p11, p12 as individual clusters, beside a Proximity Matrix with rows and columns p1, p2, p3, p4, p5, ...]
![Page 59: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/59.jpg)
Intermediate situation
• After some merging steps, we have some clusters
[Figure: clusters C1 to C5 over points p1, p2, p3, p4, ..., p9, p10, p11, p12, beside a Proximity Matrix with rows and columns C1 to C5]
![Page 60: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/60.jpg)
Intermediate situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
[Figure: clusters C1 to C5 over points p1, ..., p12, alongside the proximity matrix with rows and columns indexed by C1, C2, C3, C4, C5]
![Page 61: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/61.jpg)
After merging
• The question is “How do we update the proximity matrix?”
[Figure: clusters C1, C3, C4 and the merged cluster C2 U C5; in the proximity matrix, the row and column for C2 U C5 are filled with “?”]
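One way to answer the update question is sketched below, assuming the proximity matrix stores distances and the min or max rule is used. The cluster names follow the slide, but the distance values, the dict-of-dicts layout, and the function names are hypothetical.

```python
def symmetric(pairs):
    """Build a symmetric dict-of-dicts distance matrix from {(x, y): d}."""
    m = {}
    for (x, y), d in pairs.items():
        m.setdefault(x, {})[y] = d
        m.setdefault(y, {})[x] = d
    return m


def merge_and_update(prox, a, b, combine=min):
    """Return a new matrix in which clusters a and b are replaced by "a U b".

    combine=min is the single-linkage update; combine=max is the
    complete-linkage update. Entries not involving a or b are copied as-is.
    """
    merged = f"{a} U {b}"
    others = [c for c in prox if c not in (a, b)]
    new = {c: {d: prox[c][d] for d in others if d != c} for c in others}
    new[merged] = {}
    for c in others:
        d_new = combine(prox[a][c], prox[b][c])
        new[merged][c] = d_new
        new[c][merged] = d_new
    return new


# Hypothetical distances between the five clusters; C2 and C5 are closest.
prox = symmetric({
    ("C1", "C2"): 0.5, ("C1", "C3"): 0.7, ("C1", "C4"): 0.8, ("C1", "C5"): 0.9,
    ("C2", "C3"): 0.4, ("C2", "C4"): 0.6, ("C2", "C5"): 0.1,
    ("C3", "C4"): 0.3, ("C3", "C5"): 0.6, ("C4", "C5"): 0.2,
})
by_min = merge_and_update(prox, "C2", "C5", combine=min)
by_max = merge_and_update(prox, "C2", "C5", combine=max)
print(by_min["C2 U C5"])
print(by_max["C2 U C5"])
```

Only the row and column of the merged cluster need new values, which is exactly the region marked “?” on the slide.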
![Page 62: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/62.jpg)
How to define inter-cluster similarity
• Min
• Max
• Group average
• Distance between centroids
[Figure: points p1 to p5 with the question “Similarity?”, alongside the proximity matrix with rows and columns indexed by p1, p2, p3, p4, p5, ...]
![Page 63: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/63.jpg)
![Page 64: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/64.jpg)
![Page 65: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/65.jpg)
![Page 66: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/66.jpg)
![Page 67: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/67.jpg)
Cluster similarity: Min / single linkage
• Similarity of two clusters is based on the two most similar (closest) points in the different clusters
– Determined by one pair of points, i.e., by one link in the proximity graph
[Figure: single-linkage dendrogram; leaf order 3, 6, 2, 5, 4, 1; vertical axis from 0 to 0.2]
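The full agglomerative procedure under the min rule can be sketched as follows. This is a naive illustration that favors clarity over efficiency; the function name, the Euclidean distance, and the four sample points are assumptions rather than the lecture's own example.

```python
from itertools import combinations
from math import dist


def single_linkage_clustering(points):
    """Agglomerative clustering with the MIN (single-linkage) rule.

    points: dict mapping a label to its coordinates.
    Returns the merges in order as (cluster_a, cluster_b, distance),
    where each cluster is a frozenset of labels.
    """
    def gap(pair):
        # Distance between two clusters = distance of their closest points.
        a, b = pair
        return min(dist(points[p], points[q]) for p in a for q in b)

    clusters = [frozenset([label]) for label in points]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=gap)
        merges.append((a, b, gap((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical 1-D points, written as 1-tuples so that math.dist applies.
points = {"p1": (0.0,), "p2": (1.0,), "p3": (3.0,), "p4": (7.0,)}
merges = single_linkage_clustering(points)
for a, b, d in merges:
    print(sorted(a), sorted(b), d)
```

The recorded merge distances are the heights at which a dendrogram would join the corresponding branches.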
![Page 68: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/68.jpg)
Hierarchical clustering: Min
[Figure: single-linkage clustering of points 1 to 6 drawn as nested clusters (left) and the single-linkage dendrogram (right; leaf order 3, 6, 2, 5, 4, 1; vertical axis from 0 to 0.2)]
![Page 69: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/69.jpg)
Food for thought
• What are the key strengths of single-linkage clustering?
• What are the key weaknesses of single-linkage clustering?
Exercise #3
![Page 70: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/70.jpg)
Cluster similarity: Max / complete linkage
• Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
– Determined by all pairs of points in the two clusters
[Figure: complete-linkage dendrogram; leaf order 3, 6, 4, 1, 2, 5; vertical axis from 0 to 0.4]
![Page 71: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/71.jpg)
Hierarchical clustering: Max
[Figure: complete-linkage clustering of points 1 to 6 drawn as nested clusters (left) and the dendrogram (right; leaf order 3, 6, 4, 1, 2, 5; vertical axis from 0 to 0.4)]
We still want to merge the two most similar clusters each time, but we define the distance between clusters based on MAX
![Page 72: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/72.jpg)
Food for thought
• What are the key strengths of complete-linkage clustering?
• What are the key weaknesses of complete-linkage clustering?
Exercise #4
![Page 73: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/73.jpg)
Cluster similarity: Group average
• Proximity of two clusters is the average of pairwise proximity between points in the two clusters
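$$
\mathrm{proximity}(\mathrm{Cluster}_i, \mathrm{Cluster}_j) = \frac{\displaystyle\sum_{p_i \in \mathrm{Cluster}_i,\; p_j \in \mathrm{Cluster}_j} \mathrm{proximity}(p_i, p_j)}{|\mathrm{Cluster}_i| \ast |\mathrm{Cluster}_j|}
$$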
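The group-average definition translates directly into code. In this sketch, Euclidean distance stands in for the pairwise proximity, and the function name and sample points are hypothetical.

```python
from math import dist


def group_average(cluster_i, cluster_j):
    """Average pairwise proximity between points of two clusters."""
    total = sum(dist(p, q) for p in cluster_i for q in cluster_j)
    return total / (len(cluster_i) * len(cluster_j))


# Hypothetical 2-D points, chosen only for illustration.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (6.0, 0.0)]
print(group_average(a, b))  # pairwise distances are 4, 6, 3, 5
```

For these points the closest cross-cluster pair is 3 apart and the farthest is 6 apart, so the group average falls between the min and max values.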
![Page 74: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/74.jpg)
Hierarchical clustering: Group average
[Figure: group-average clustering of points 1 to 6 drawn as nested clusters (left) and the group-average dendrogram (right)]
![Page 75: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/75.jpg)
Hierarchical clustering: Group average
• Compromise between single and complete linkage
• Strengths
– Less susceptible to noise and outliers
• Limitations
– Biased towards globular clusters
![Page 76: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/76.jpg)
Hierarchical clustering: Comparison
[Figure: nested clusters of points 1 to 6 produced by Min, Max, and Group average, shown side by side]
![Page 77: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/77.jpg)
Food for thought
• What are the space and time complexities of hierarchical clustering?
Exercise #5
![Page 78: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/78.jpg)
Normalization
![Page 79: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/79.jpg)
Sometimes, a gene expression study may involve batches of data collected over a long period of time…
[Figure: Time Span of Gene Expression Profiles; horizontal axis in quarterly steps from Jan-04 to Jan-10, vertical axis from 0 to 70]
Image credit: Dong Difeng
![Page 80: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/80.jpg)
In such a case, batch effect may be severe… to the extent that you can predict the batch that each sample comes from!
⇒ Need normalization to correct for batch effect
Image credit: Dong Difeng
![Page 81: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/81.jpg)
Normalization approaches
• Aim of normalization: Reduce variance without increasing bias
• Scaling method
– Intensities are scaled so that each array has the same average value
– E.g., Affymetrix’s
• Transform data so that the distribution of probe intensities is the same on all arrays
– E.g., Z = (x − µ) / σ
• Quantile normalization
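The scaling and Z-transform ideas can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, the common target is taken to be the mean of the array means, and σ is computed as the population standard deviation. It is not a description of any vendor's procedure.

```python
from statistics import mean, pstdev


def scale_to_common_mean(arrays, target=None):
    """Scale each array so that all arrays share the same average intensity."""
    if target is None:
        target = mean(mean(a) for a in arrays)  # assumed choice of target
    return [[x * target / mean(a) for x in a] for a in arrays]


def z_normalize(values):
    """Z = (x - mu) / sigma, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]


# Hypothetical intensities for two arrays; the second is twice as bright.
arrays = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
scaled = scale_to_common_mean(arrays)
print(scaled)

z = z_normalize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(z)
```

After scaling, both arrays have the same mean; after the Z-transform, the values have mean 0 and standard deviation 1.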
![Page 82: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/82.jpg)
Quantile normalization
• Given n arrays of length p, form X of size p × n where each array is a column
• Sort each column of X to give Xsort
• Take means across rows of Xsort and assign this mean to each elem in the row to get X’sort
• Get Xnormalized by arranging each column of X’sort to have same ordering as X
• Implemented in some microarray s/w, e.g., EXPANDER
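The four steps above can be sketched in plain Python. The function name and the small matrix are hypothetical, and ties within a column are broken by original row order, which is an assumption the slide does not address.

```python
def quantile_normalize(X):
    """X: list of p rows, each with n entries (one column per array)."""
    p, n = len(X), len(X[0])
    columns = [[X[i][j] for i in range(p)] for j in range(n)]

    # Sort each column of X to give X_sort.
    sorted_cols = [sorted(col) for col in columns]

    # Mean across each row of X_sort.
    row_means = [sum(sorted_cols[j][r] for j in range(n)) / n for r in range(p)]

    # Put the row means back in the original ordering of each column.
    normalized = [[0.0] * n for _ in range(p)]
    for j, col in enumerate(columns):
        order = sorted(range(p), key=lambda i: col[i])  # row indices by rank
        for rank, i in enumerate(order):
            normalized[i][j] = row_means[rank]
    return normalized


# Hypothetical 4 probes x 3 arrays, with no ties inside any column.
X = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 3.0, 6.0],
    [4.0, 2.0, 8.0],
]
result = quantile_normalize(X)
for row in result:
    print(row)
```

After normalization every column contains the same set of values, so all arrays share one intensity distribution while each probe keeps its rank within its array.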
![Page 83: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/83.jpg)
After quantile normalization
![Page 84: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/84.jpg)
Food for thought
• Given a cancer vs normal dataset
• Should you apply quantile normalization to the dataset as a whole or should you apply quantile normalization to the cancer and the normal part separately? Why?
Exercise #6
![Page 85: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/85.jpg)
Food for thought
• Given a cancer vs normal dataset
• Should you apply Z-normalization in a patient-wise or gene-wise manner? Why?
Exercise #7
![Page 86: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/86.jpg)
Selection of patient samples and genes for disease prognosis
![Page 87: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/87.jpg)
Gene expression profile + clinical data ⇒ outcome prediction
• Univariate & multivariate Cox survival analysis (Beer et al 2002, Rosenwald et al 2002)
• Fuzzy neural network (Ando et al 2002)
• Partial least squares regression (Park et al 2002)
• Weighted voting algorithm (Shipp et al 2002)
• Gene index and “reference gene” (LeBlanc et al 2003)
• ……
![Page 88: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/88.jpg)
Our approach
“Extreme” sample selection
ERCOF
Liu et al., “Use of extreme patient samples for outcome prediction from gene expression data”, Bioinformatics, 21(16):3377--3384, 2005
![Page 89: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/89.jpg)
Extreme sample selection
Short-term survivors vs. long-term survivors
T: sample
F(T): follow-up time
E(T): status (1: unfavorable; 0: favorable)
c1 and c2: thresholds of survival time
• Short-term survivors, who died within a short period
⇒ F(T) < c1 and E(T) = 1
• Long-term survivors, who were alive after a long follow-up time
⇒ F(T) > c2
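The two selection rules can be sketched as a simple filter. The sample records, the function name, and the threshold values below are hypothetical; only the conditions F(T) < c1 and E(T) = 1 (short-term) and F(T) > c2 (long-term) come from the slide.

```python
def select_extreme_samples(samples, c1, c2):
    """samples: list of (name, follow_up_time, status); status 1 = unfavorable.

    Returns (short_term, long_term) lists of sample names.
    """
    short_term = [name for name, f, e in samples if f < c1 and e == 1]
    long_term = [name for name, f, e in samples if f > c2]
    return short_term, long_term


# Hypothetical follow-up times in years.
samples = [
    ("T1", 0.5, 1),  # died early: short-term survivor
    ("T2", 0.8, 0),  # short follow-up but favorable status: not selected
    ("T3", 3.0, 1),  # neither extreme: not selected
    ("T4", 9.2, 0),  # long follow-up: long-term survivor
    ("T5", 8.5, 1),  # long follow-up despite unfavorable status: long-term
]
short_term, long_term = select_extreme_samples(samples, c1=1.0, c2=8.0)
print(short_term, long_term)
```

Note that the long-term rule places no condition on E(T), so a sample with a long follow-up time is selected regardless of its final status.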
![Page 90: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/90.jpg)
ERCOF: Entropy-Based Rank Sum Test & Correlation Filtering
• Remove genes whose expression values have no cut point (i.e., cannot be discretized)
• Calculate the Wilcoxon rank sum w(x) for each gene x. Remove gene x if w(x) ∈ [c_lower, c_upper]
• Group features by Pearson correlation. For each group, retain the top 50% with respect to class entropy
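The rank sum step can be sketched as follows. This is a minimal illustration, not the ERCOF implementation: the function names are hypothetical, tied values receive average ranks, the choice of which class's ranks to sum is an assumption, and the thresholds in the example are made up.

```python
def rank_sum(values, labels, target_class):
    """Wilcoxon rank sum: sum of the ranks of the target_class values.

    Ranks are 1-based over all samples; tied values share the average rank.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return sum(r for r, lab in zip(ranks, labels) if lab == target_class)


def keep_gene(w, c_lower, c_upper):
    """A gene is removed when w lies inside [c_lower, c_upper]."""
    return not (c_lower <= w <= c_upper)


# Hypothetical expression values of one gene across six samples.
values = [1.2, 3.4, 2.2, 5.0, 4.1, 2.2]
labels = ["short", "long", "short", "long", "long", "short"]
w = rank_sum(values, labels, "short")
print(w, keep_gene(w, c_lower=7.0, c_upper=14.0))
```

A rank sum far from the middle of its possible range means the gene's values separate the two classes well, which is why only genes outside the interval are kept.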
![Page 91: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/91.jpg)
Risk score construction
Linear kernel SVM regression function:
$$
G(T) = \sum_i a_i \, y_i \, K(T, x(i)) + b
$$
T: test sample, x(i): support vector, y_i: class label (1: short-term survivors; -1: long-term survivors)
Transformation function (posterior probability):
$$
S(T) = \frac{1}{1 + e^{-G(T)}}, \qquad S(T) \in (0, 1)
$$
S(T): risk score of sample T
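The risk score construction can be sketched numerically. The support vectors, coefficients, and bias below are hypothetical stand-ins for a trained SVM; only the form of G(T) and of the sigmoid transformation S(T) follows the slide.

```python
from math import exp


def linear_kernel(u, v):
    """K(u, v) = dot product of u and v."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(T, support_vectors, alphas, labels, b):
    """G(T) = sum_i a_i * y_i * K(T, x(i)) + b."""
    return sum(
        a * y * linear_kernel(T, x)
        for a, y, x in zip(alphas, labels, support_vectors)
    ) + b


def risk_score(g):
    """S(T) = 1 / (1 + exp(-G(T))), always strictly between 0 and 1."""
    return 1.0 / (1.0 + exp(-g))


# Hypothetical trained model with two support vectors.
support_vectors = [(1.0, 0.0), (0.0, 1.0)]
alphas = [0.5, 0.5]
labels = [1, -1]  # 1: short-term survivor, -1: long-term survivor
b = 0.0

T = (2.0, 0.0)  # hypothetical test sample
g = decision_value(T, support_vectors, alphas, labels, b)
print(g, risk_score(g))
```

A decision value of 0 maps to a risk score of 0.5; positive values of G(T) push the score towards 1 (short-term), negative values towards 0 (long-term).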
![Page 92: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/92.jpg)
Diffuse large B-cell lymphoma
• DLBC lymphoma is the most common type of lymphoma in adults
• Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients
⇒ DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy
• International Prognostic Index (IPI)
– Age, “Eastern Cooperative Oncology Group” performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ...
• Not very good for stratifying DLBC lymphoma patients for therapeutic trials
⇒ Use gene-expression profiles to predict outcome of chemotherapy?
![Page 93: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/93.jpg)
Rosenwald et al., NEJM 2002
• 240 data samples
– 160 in preliminary group
– 80 in validation group
– Each sample described by 7399 microarray features
• Rosenwald et al.’s approach
– Identify genes: Cox proportional-hazards model
– Cluster identified genes into four gene signatures
– Calculate for each sample an outcome-predictor score
– Divide patients into quartiles according to score
![Page 94: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/94.jpg)
Knowledge discovery from gene expression of “extreme” samples
[Diagram: knowledge discovery pipeline, with the following labels]
• 240 samples, 7399 genes
• “Extreme” sample selection: < 1 yr vs > 8 yrs
– 47 short-term survivors
– 26 long-term survivors
• Knowledge discovery from gene expression: 84 genes
• 80 samples
– T is long-term if S(T) < 0.3
– T is short-term if S(T) > 0.7
![Page 95: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/95.jpg)
Discussions: Sample selection

| Application | Data set | Status: Dead | Status: Alive | Total |
| --- | --- | --- | --- | --- |
| DLBCL | Original | 88 | 72 | 160 |
| DLBCL | Informative | 47+1(*) | 25 | 73 |

Number of samples in the original data and in the selected informative training set.
(*): Number of samples whose corresponding patient was dead at the end of follow-up time, but was selected as a long-term survivor.
![Page 96: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/96.jpg)
Discussions: Gene identification

| Gene selection | DLBCL |
| --- | --- |
| Original | 4937(*) |
| Phase I | 132 (2.7%) |
| Phase II | 84 (1.7%) |

Number of genes left after feature filtering for each phase.
(*): Number of genes after removing those that were absent in more than 10% of the experiments.
![Page 97: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/97.jpg)
Kaplan-Meier plot for 80 test cases
p-value of log-rank test: < 0.0001
Risk score thresholds: 0.7, 0.3
![Page 98: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/98.jpg)
(A) IPI low, p-value = 0.0063
(B) IPI intermediate, p-value = 0.0003
Improvement over IPI
![Page 99: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/99.jpg)
(A) Without sample selection (p = 0.38)
(B) With sample selection (p = 0.009)
There is no clear difference in the overall survival of the 80 samples in the validation group of the DLBCL study if no training sample selection is conducted
Merit of “extreme” samples
![Page 100: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/100.jpg)
About the inventor: Huiqing Liu
• Huiqing Liu
– PhD, NUS, 2004
– Currently Senior Scientist at Centocor
– Asian Innovation Gold Award 2003
– New Jersey Cancer Research Award for Scientific Excellence 2008
– Gallo Prize 2008
![Page 101: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/101.jpg)
Beyond disease diagnosis & prognosis
![Page 102: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/102.jpg)
Beyond classification of gene expression profiles
• After identifying the candidate genes by feature selection, do we know which ones are causal genes, which ones are surrogates, and which are noise?
[Figure: heat map of diagnostic ALL BM samples (n=327) against genes for class distinction (n=271); sample groups labeled TEL-AML1, BCR-ABL, Hyperdiploid >50, E2A-PBX1, MLL, T-ALL, Novel; color scale from -3σ to 3σ, where σ = std deviation from mean]
![Page 103: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/103.jpg)
Gene regulatory circuits
• Genes are “connected” in “circuit” or network
• Expression of a gene in a network depends on expression of some other genes in the network
• Can we “reconstruct” the gene network from gene expression and other data?
Source: Miltenyi Biotec
![Page 104: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/104.jpg)
Key questions
For each gene in the network:
• Which genes affect it?
• How do they affect it?
– Positively?
– Negatively?
– More complicated ways?
![Page 105: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/105.jpg)
Some techniques
• Bayesian Networks
– Friedman et al., JCB 7:601--620, 2000
• Boolean Networks
– Akutsu et al., PSB 2000, pages 293--304
• Differential equations
– Chen et al., PSB 1999, pages 29--40
• Classification-based method
– Soinov et al., “Towards reconstruction of gene network from expression data by supervised learning”, Genome Biology 4:R6.1--9, 2003
![Page 106: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/106.jpg)
A classification-based technique
Soinov et al., Genome Biology 4:R6.1--9, 2003
• Given a gene expression matrix X
– Each row is a gene
– Each column is a sample
– Each element x_ij is the expression of gene i in sample j
• Find the average value a_i of each gene i
• Denote s_ij as the state of gene i in sample j
– s_ij = up if x_ij > a_i
– s_ij = down if x_ij ≤ a_i
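The discretization into up and down states can be sketched as below. The function name and the small expression matrix are hypothetical; the rule itself (compare each value with the gene's own average) is the one stated on the slide.

```python
def discretize(X):
    """X: list of rows, one row per gene, one column per sample.

    Returns a matrix of the same shape with "up" where the value exceeds
    the gene's average and "down" otherwise (including equality).
    """
    states = []
    for row in X:
        a_i = sum(row) / len(row)  # average expression of this gene
        states.append(["up" if x > a_i else "down" for x in row])
    return states


# Hypothetical expression matrix: 2 genes x 3 samples.
X = [
    [1.0, 5.0, 3.0],  # average 3.0
    [2.0, 2.0, 8.0],  # average 4.0
]
states = discretize(X)
print(states)
```

A value exactly equal to the gene's average is labeled down, matching the ≤ in the definition.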
![Page 107: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/107.jpg)
A classification-based technique
Soinov et al., Genome Biology 4:R6.1--9, Jan 2003
• To see whether the state of gene g is determined by the state of other genes
– See whether ⟨s_ij | i ≠ g⟩ can predict s_gj
– If it can be predicted with high accuracy, then “yes”
– Any classifier can be used, such as C4.5, PCL, SVM, etc.
• To see how the state of gene g is determined by the state of other genes
– Apply C4.5 (or PCL or other “rule-based” classifiers) to predict s_gj from ⟨s_ij | i ≠ g⟩
– Extract the decision tree or rules used
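The idea of predicting one gene's state from the others can be illustrated with a deliberately tiny classifier. The sketch below is not C4.5 or PCL: it is a hypothetical one-gene rule learner used as a stand-in, it measures accuracy on the same samples it was fitted to (no cross-validation), and the state matrix is made up.

```python
def best_single_gene_rule(states, g):
    """Find the one-gene rule that best predicts the state of gene g.

    states: matrix of "up"/"down" values, one row per gene.
    Returns (gene index, relation, accuracy), where relation is
    "same" (target equals predictor) or "opposite" (target is inverted).
    """
    target = states[g]
    best = None
    for i, row in enumerate(states):
        if i == g:
            continue
        agree = sum(s == t for s, t in zip(row, target)) / len(target)
        for relation, acc in (("same", agree), ("opposite", 1 - agree)):
            if best is None or acc > best[2]:
                best = (i, relation, acc)
    return best


# Hypothetical states: 3 genes x 4 samples.
states = [
    ["up", "up", "down", "down"],  # gene 0 (target)
    ["down", "down", "up", "up"],  # gene 1: always opposite to gene 0
    ["up", "down", "up", "down"],  # gene 2: unrelated to gene 0
]
rule = best_single_gene_rule(states, g=0)
print(rule)
```

The returned rule plays the role of the extracted decision rule: high accuracy suggests the target gene's state is determined by the predictor, and the relation indicates whether the effect looks positive or negative.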
![Page 108: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/108.jpg)
Advantages of this method
• Can identify genes affecting a target gene
• Don’t need discretization thresholds?
• Each data sample is treated as an example
• Explicit rules can be extracted from the classifier (assuming C4.5 or PCL)
• Generalizable to time series
• Discuss the point “Don’t need discretization thresholds”. Is it true?
Exercise #8
![Page 109: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/109.jpg)
Concluding remarks
![Page 110: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/110.jpg)
Bcr-Abl
• Targeted drug development
– Know what molecular effect you want to achieve
  • E.g., inhibit a mutated form of a protein
– Engineer a compound that directly binds and causes the desired effect
• Gleevec (imatinib)
– 1st success for real drug
– Targets Bcr-Abl fusion protein (i.e., Philadelphia chromosome, Ph)
– NCI summary of clinical trial of imatinib for ALL at http://www.cancer.gov/clinicaltrials/results/ALLimatinib1109/print
![Page 111: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/111.jpg)
What have we learned?
• Technologies
– Microarray
– PCL, ERCOF
• Microarray applications
– Disease diagnosis by supervised learning
– Subtype discovery by unsupervised learning
– Disease diagnosis via guilt-by-association
– Gene network reconstruction
• Important tactic
– Extreme sample selection
![Page 112: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/112.jpg)
Useful packages
• EXPANDER (EXPression Analyser & DisplayER)
– http://acgt.cs.tau.ac.il/expander
• BRB-Array Tools
– http://linus.nci.nih.gov/BRB-ArrayTools.html
• NetProt
– http://rpubs.com/gohwils/204259
– https://github.com/gohwils/NetProt/releases/
![Page 113: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/113.jpg)
Any questions?
![Page 114: CS2220: Introduction to Computational Biology Unit …wongls/courses/cs2220/2017/...•Assign each item to its own cluster – If there are N items initially, we get N clusters, each](https://reader034.vdocuments.us/reader034/viewer/2022042204/5ea54b2811b80430a8334dc8/html5/thumbnails/114.jpg)
References
• E.-J. Yeoh et al., “Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling”, Cancer Cell, 1:133--143, 2002
• H. Liu, J. Li, L. Wong, “Use of extreme patient samples for outcome prediction from gene expression data”, Bioinformatics, 21(16):3377--3384, 2005
• L.D. Miller et al., “Optimal gene expression analysis by microarrays”, Cancer Cell, 2:353--361, 2002
• J. Li, L. Wong, “Techniques for Analysis of Gene Expression”, The Practical Bioinformatician, Chapter 14, pages 319--346, WSPC, 2004
• B. Bolstad et al., “A comparison of normalization methods for high density oligonucleotide array data based on variance and bias”, Bioinformatics, 19:185--193, 2003