cbcl/ai mit what is bioinformatics ? application of computing technology to providing statistical...
TRANSCRIPT
![Page 1: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/1.jpg)
CBCl/AI MIT
What is bioinformatics?
Application of computing technology to providing statistical and database solutions to problems in molecular biology.
Defining and addressing problems in molecular biology using methodologies from statistics and computer science.
The genome project, genome-wide analysis/screening of disease, genetic regulatory networks, analysis of expression data.
Pre 1995
Post 1995
![Page 2: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/2.jpg)
Central dogma of biology
DNA → RNA (pre-mRNA → mRNA) → Protein
Central dogma
![Page 3: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/3.jpg)
DNA: CGAACAAACCTCGAACCTGCT
Transcription
mRNA: GCU UGU UUA CGA
Translation
Polypeptide: Ala Cys Leu Arg
Basic molecular biology
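The translation step on this slide can be sketched in a few lines of Python. This is only an illustration: `CODON_TABLE` and `translate` are hypothetical names, and the table holds just the four codons shown on the slide rather than the full genetic code.

```python
# Partial genetic code: only the four codons that appear on the slide.
CODON_TABLE = {"GCU": "Ala", "UGU": "Cys", "UUA": "Leu", "CGA": "Arg"}


def translate(mrna):
    """Translate an mRNA string into amino acids, three bases at a time."""
    bases = mrna.replace(" ", "").upper()
    usable = len(bases) - len(bases) % 3  # ignore a trailing partial codon
    codons = [bases[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TABLE[codon] for codon in codons]


polypeptide = translate("GCU UGU UUA CGA")
print(" ".join(polypeptide))
```

A codon outside the partial table raises `KeyError`, which is a deliberate simplification here.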
![Page 4: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/4.jpg)
Transcription
End modification
Splicing
Transport
Translation
Less basic molecular biology
![Page 5: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/5.jpg)
Sequence information
Quantitative information
microarray
RT-PCR
protein arrays
yeast two-hybrid
Chemical screens
DNA
RNA
pre-mRNA
mRNA
Protein
transcription
splicing
translation, stability
transport, localization
Biological information
![Page 6: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/6.jpg)
• Splice sites and branch points in eukaryotic pre-mRNA
• Gene finding in prokaryotes and eukaryotes
• Promoter recognition (transcription and termination)
• Protein structure prediction
• Protein function prediction
• Protein family classification
Sequence problems
![Page 7: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/7.jpg)
• Predict tissue morphology
• Predict treatment/drug effect
• Infer metabolic pathway
• Infer disease pathways
• Infer developmental pathway
Gene expression problems
![Page 8: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/8.jpg)
Basic idea:
The state of the cell is determined by proteins. A gene codes for a protein, which is assembled via mRNA. Measuring the amount of a particular mRNA gives a measure of the amount of the corresponding protein. The number of copies of an mRNA is the expression of a gene.
Microarray technology allows us to measure the expression of thousands of genes at once (cf. Northern blot).
Measure the expression of thousands of genes under different experimental conditions and ask what is different and why.
Microarray technology
![Page 9: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/9.jpg)
[Figure: microarray workflow. Labels: Biological Sample, RNA, Test Sample, Reference, Cy3, Cy5, PE, cDNA Clone (Library), PCR Product, Oligonucleotide Synthesis, Array]
Ramaswamy and Golub, JCO
Microarray technology
![Page 10: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/10.jpg)
Lockhart and Winzler 2000
Oligonucleotide cDNA
Microarray technology
![Page 11: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/11.jpg)
Yeast experiment
Microarray experiment
![Page 12: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/12.jpg)
When the science is not well understood, resort to statistics:
Ultimate goal: discover the genetic pathways of cancers
Infer cancer genetics by analyzing microarray data from tumors
Curse of dimensionality: Far too few examples for so many dimensions to predict accurately
Immediate goal: models that discriminate tumor types or treatment outcomes and determine genes used in model
Basic difficulty: few examples (20-100) and high dimensionality (7,000-16,000 genes measured for each sample) make this an ill-posed problem
Analytic challenge
![Page 13: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/13.jpg)
38 examples of myeloid and lymphoblastic leukemias, Affymetrix human 6800 (7128 genes including control genes)
34 examples to test classifier
Results: 33/34 correct
[Figure: test data plotted by d, the perpendicular distance from the hyperplane]
Cancer Classification
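The quantity d on this slide can be computed as below. This is a minimal sketch, assuming a linear classifier with weight vector `w` and offset `b`; the function names and toy numbers are hypothetical, and the slides do not say how the original classifier normalized d.

```python
import math


def signed_distance(w, b, x):
    """Signed perpendicular distance of sample x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def classify(w, b, x):
    """Return (label, d): the label is the side of the hyperplane, d the distance."""
    d = signed_distance(w, b, x)
    return (1 if d >= 0 else -1), d


# Toy two-gene hyperplane, not taken from the leukemia data.
w, b = [3.0, 4.0], -5.0
label, d = classify(w, b, [3.0, 4.0])
print(label, d)
```

The sign of d gives the predicted class and its magnitude serves as a rough confidence, which is what the test-data plot displays.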
![Page 14: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/14.jpg)
Coregulation: the expression of two genes must be correlated for a protein to be made, so we need to look at pairwise correlations as well as individual expression
Size of feature space: if there are 7,000 genes, feature space is about 24 million features, so the fact that feature space is never computed is important
Two gene example: two genes measuring Sonic Hedgehog and TrkC
$$x = (e_{sh},\ e_{Trk})$$
$$\phi(x) = (e_{sh}^2,\ e_{Trk}^2,\ \sqrt{2}\,e_{sh}e_{Trk},\ \sqrt{2}\,e_{sh},\ \sqrt{2}\,e_{Trk},\ 1)$$
$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) = (x_i \cdot x_j + 1)^2$$
Coregulation and kernels
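The point that the feature space is never computed can be checked numerically. The sketch below assumes the standard degree-2 feature map with sqrt(2) coefficients on the cross and linear terms; the function names and expression values are illustrative only.

```python
import math


def poly_kernel(x, z):
    """Second-order polynomial kernel (x.z + 1)^2, computed in input space."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** 2


def phi(x):
    """Explicit degree-2 feature map for a two-gene sample (e_sh, e_trk)."""
    e_sh, e_trk = x
    r2 = math.sqrt(2.0)
    return [e_sh ** 2, e_trk ** 2, r2 * e_sh * e_trk, r2 * e_sh, r2 * e_trk, 1.0]


def n_features(n_genes):
    """Number of monomials of degree <= 2 in n_genes variables: C(n + 2, 2)."""
    return math.comb(n_genes + 2, 2)


x_i, x_j = (0.5, 2.0), (1.5, -1.0)
implicit = poly_kernel(x_i, x_j)
explicit = sum(a * b for a, b in zip(phi(x_i), phi(x_j)))
print(implicit, explicit)   # the two values agree up to rounding
print(n_features(7000))     # size of the explicit feature space for 7,000 genes
```

The kernel costs one dot product over the genes, while the explicit map for 7,000 genes would need tens of millions of coordinates per sample.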
![Page 15: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/15.jpg)
Nonlinear SVM helps when the most informative genes are removed. Genes are ranked as informative using signal to noise (Golub et al).

Errors by polynomial order:

| Genes removed | 1st order | 2nd order | 3rd order |
|---|---|---|---|
| 0 | 1 | 1 | 1 |
| 10 | 2 | 1 | 1 |
| 20 | 3 | 2 | 1 |
| 30 | 3 | 3 | 2 |
| 40 | 3 | 3 | 2 |
| 50 | 3 | 2 | 2 |
| 100 | 3 | 3 | 2 |
| 200 | 3 | 3 | 3 |
| 1500 | 7 | 7 | 8 |

Gene coregulation
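The signal-to-noise ranking attributed to Golub et al can be sketched as follows. The score is the difference of class means divided by the sum of class standard deviations; the use of the sample standard deviation and all names and numbers here are assumptions for illustration.

```python
import statistics


def signal_to_noise(class0, class1):
    """Score (mu0 - mu1) / (sigma0 + sigma1) for one gene's expression values."""
    mu0, mu1 = statistics.mean(class0), statistics.mean(class1)
    s0, s1 = statistics.stdev(class0), statistics.stdev(class1)
    return (mu0 - mu1) / (s0 + s1)


def rank_genes(expr0, expr1):
    """Return gene indices ordered from most to least informative.

    expr0[g] and expr1[g] hold the expression values of gene g in each class.
    """
    scores = [abs(signal_to_noise(a, b)) for a, b in zip(expr0, expr1)]
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)


# Toy data: gene 0 separates the classes, gene 1 does not.
expr_class0 = [[1.0, 2.0, 3.0], [1.0, 5.0, 9.0]]
expr_class1 = [[7.0, 8.0, 9.0], [2.0, 5.0, 8.0]]
print(rank_genes(expr_class0, expr_class1))
```

Removing the top-ranked genes and retraining, as in the table above, then amounts to dropping a prefix of this ranking. A gene that is constant in both classes would need special handling to avoid division by zero.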
![Page 16: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/16.jpg)
Golub et al classified 29 test points correctly and rejected 5, of which 2 were errors, using 50 genes
Need to introduce concept of rejects to SVM
[Figure: samples plotted on gene axes g1 and g2, with Normal, Cancer, and Reject regions]
Rejecting samples
![Page 17: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/17.jpg)
Rejecting samples
![Page 18: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/18.jpg)
Estimating a CDF
![Page 19: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/19.jpg)
The regularized solution
![Page 20: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/20.jpg)
[Figure: P(c=1 | d) plotted against 1/d, with the .95 level marked]
95% confidence or p = .05: d = .107
Rejections for SVMs
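A rejection rule of this kind can be sketched as a threshold on |d|. The default threshold 0.107 is the value quoted on the slide for 95% confidence; the function name, label strings, and sample distances are hypothetical.

```python
def predict_with_reject(d, threshold=0.107):
    """Classify by the sign of d, but reject samples too close to the hyperplane."""
    if abs(d) < threshold:
        return "reject"
    return "positive" if d > 0 else "negative"


# Hypothetical distances for five test samples.
distances = [0.9, -0.4, 0.05, -0.1, 0.107]
calls = [predict_with_reject(d) for d in distances]
print(calls)
```

Raising the threshold trades more rejected samples for fewer errors among the calls that are made.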
![Page 21: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/21.jpg)
Results: 31 correct, 3 rejected of which 1 is an error
[Figure: test data plotted by d]
Results with rejections
![Page 22: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/22.jpg)
SVMs as stated use all genes/features
Molecular biologists/oncologists seem to be convinced that only a small subset of genes is responsible for particular biological properties, so they want the genes most important in discriminating
Practical reasons: a clinical device with thousands of genes is not financially practical
Possible performance improvement
Wrapper method for gene/feature selection
Gene selection
![Page 23: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/23.jpg)
AML vs ALL: 40 genes, 34/34 correct, 0 rejects. 5 genes, 31/31 correct, 3 rejects of which 1 is an error.
B vs T cells for AML: 10 genes, 33/33 correct, 0 rejects.
[Figure: two plots of test data by d]
Results with gene selection
![Page 24: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/24.jpg)
Recursive feature elimination (RFE): based upon perturbation analysis, eliminate genes that perturb the margin the least
Optimize leave-one-out (LOO): based upon optimization of the leave-one-out error of an SVM; the leave-one-out error is unbiased
Two feature selection algorithms
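The leave-one-out error itself is simple to compute for any classifier. The sketch below uses a nearest-centroid classifier as a stand-in, not an SVM, purely to keep the example self-contained; all names and data are hypothetical.

```python
def fit_centroids(X, y):
    """Stand-in classifier (not an SVM): store the mean vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Predict the class whose centroid is closest to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def leave_one_out_error(X, y, fit, predict):
    """Hold out each sample once, train on the rest, and count the mistakes."""
    errors = 0
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        if predict(model, X[i]) != y[i]:
            errors += 1
    return errors / len(X)


X = [[0.0], [1.0], [4.0], [5.0], [0.5]]
y = [-1, -1, 1, 1, 1]
loo = leave_one_out_error(X, y, fit_centroids, predict_centroid)
print(loo)
```

With an SVM in place of the stand-in, this loop retrains once per sample, which is why bounds on the LOO error are attractive as a cheaper criterion.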
![Page 25: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/25.jpg)
1. Solve the SVM problem for vector w
2. Rank order elements of vector w by absolute value
3. Discard input features/genes corresponding to those vector elements with small absolute magnitude (for example, the smallest 10%)
4. Retrain SVM on reduced gene set and goto step (2)
Recursive feature elimination
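The elimination loop can be sketched with the weight-fitting step left pluggable. In this sketch `fit_weights` stands in for the linear SVM solve, and `mean_difference_weights` is a deliberately crude substitute (difference of class means), not an SVM; every name and number is an assumption for illustration.

```python
def rfe(X, y, fit_weights, n_keep, drop_fraction=0.10):
    """Recursive feature elimination over the columns of X.

    fit_weights(X, y) must return one weight per feature.
    Returns the original column indices of the surviving features.
    """
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_sub = [[row[j] for j in active] for row in X]
        w = fit_weights(X_sub, y)                                    # solve for w
        order = sorted(range(len(active)), key=lambda k: abs(w[k]))  # rank by |w|
        n_drop = max(1, int(drop_fraction * len(active)))            # smallest 10%
        n_drop = min(n_drop, len(active) - n_keep)
        dropped = set(order[:n_drop])                                # discard
        active = [f for k, f in enumerate(active) if k not in dropped]
    return active                                                    # loop = retrain


def mean_difference_weights(X, y):
    """Stand-in for a linear SVM: per-feature difference of class means."""
    pos = [row for row, label in zip(X, y) if label > 0]
    neg = [row for row, label in zip(X, y) if label < 0]
    return [sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
            for j in range(len(X[0]))]


X = [[2.0, 0.1, 1.0, 0.0], [1.8, -0.1, 1.2, 0.1],
     [-2.0, 0.0, -1.0, 0.1], [-1.9, 0.1, -0.9, 0.0]]
y = [1, 1, -1, -1]
selected = rfe(X, y, mean_difference_weights, n_keep=2)
print(selected)
```

Because the weights are refit after every elimination round, a feature's rank can change as correlated features are removed, which is the reason for iterating instead of ranking once.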
![Page 26: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/26.jpg)
Use leave-one-out (LOO) bounds for SVMs as a criterion to select features by searching over all possible subsets of n features for the subset that minimizes the bound.
When such a search is impossible because of combinatorial explosion, scale each feature by a real value variable and compute this scaling via gradient descent on the leave-one-out bound. One can then keep the features corresponding to the largest scaling variables.
The rescaling can be done in the input space or in a “Principal Components” space.
Optimizing the LOO
![Page 27: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/27.jpg)
Rescale features to minimize the LOO bound $R^2/M^2$
[Figure: in the original (x1, x2) space, radius R and margin M give $R^2/M^2 > 1$; after rescaling onto x2, M = R and $R^2/M^2 = 1$]
Pictorial demonstration
![Page 28: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/28.jpg)
Radius margin bound: simple to compute, continuous, very loose but often tracks LOO well
Jaakkola Haussler bound: somewhat tighter, simple to compute, discontinuous so need to smooth, valid only for SVMs with no b term
Span bound: tight, complicated to compute, discontinuous so need to smooth
Three LOO bounds
![Page 29: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/29.jpg)
We add a scaling parameter $\sigma$ to the SVM, which scales genes; genes corresponding to small $\sigma_j$ are removed.
The SVM function has the form:
$$f(x) = \sum_{i \in SV} \alpha_i y_i K_\sigma(x, x_i) + b,$$
where $K_\sigma(x, y) = K(x * \sigma,\ y * \sigma)$, $\sigma \in \mathbb{R}^n$, and $u * v$ denotes element by element multiplication.
Classification function with scaling
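The scaled kernel and the resulting decision function can be sketched as below, assuming a linear base kernel. The function names, support vectors, and coefficients are hypothetical and chosen only to make the effect of a zero scaling factor visible.

```python
def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def scaled_kernel(kernel, sigma):
    """Wrap a base kernel so each gene is multiplied by its scaling factor first."""
    def k_sigma(x, y):
        xs = [s * a for s, a in zip(sigma, x)]
        ys = [s * b for s, b in zip(sigma, y)]
        return kernel(xs, ys)
    return k_sigma


def decision_function(x, support_vectors, alphas, labels, b, k_sigma):
    """f(x) = sum_i alpha_i * y_i * K_sigma(x, x_i) + b over the support vectors."""
    return sum(a * yl * k_sigma(x, sv)
               for a, yl, sv in zip(alphas, labels, support_vectors)) + b


# A scaling factor of 0 removes the second gene from every kernel evaluation.
k_sigma = scaled_kernel(linear_kernel, [1.0, 0.0])
print(k_sigma([2.0, 5.0], [3.0, 7.0]))
f_value = decision_function([2.0, 9.0], [[1.0, 0.0], [-1.0, 0.0]],
                            [0.5, 0.5], [1, -1], 0.0, k_sigma)
print(f_value)
```

Treating the scaling factors as continuous variables is what lets feature selection be posed as a gradient-based optimization.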
![Page 30: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/30.jpg)
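The $\alpha$'s are computed by maximizing the following quadratic form:
$$W^2(\alpha, \sigma) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \left( K_\sigma(x_i, x_j) + \frac{1}{C}\delta_{ij} \right), \quad \alpha_i \ge 0.$$
For computing the radius around the data maximize:
$$R^2(\beta, \sigma) = \sum_i \beta_i K_\sigma(x_i, x_i) - \sum_{i,j} \beta_i \beta_j K_\sigma(x_i, x_j), \quad \sum_i \beta_i = 1,\ \beta_i \ge 0.$$
For the variance around the data:
$$V^2(\sigma) = \frac{1}{\ell} \sum_i K_\sigma(x_i, x_i) - \frac{1}{\ell^2} \sum_{i,j} K_\sigma(x_i, x_j).$$
Remember: the LOO estimate $T$ is built from the products $R^2(\beta, \sigma)\,W^2(\alpha, \sigma)$ and $V^2(\sigma)\,W^2(\alpha, \sigma)$.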
SVM and other functionals
![Page 31: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/31.jpg)
The following steps are used to compute $\alpha$ and $\sigma$:
1. Initialize $\sigma = (1, \ldots, 1)$.
2. Solve the standard SVM algorithm: $\alpha(\sigma) = \arg\max_\alpha W^2(\alpha, \sigma)$.
3. Minimize the estimate of error $T$ with respect to $\sigma$ with a gradient step.
4. If a local minimum of $T$ is not reached goto step 3.
5. Discard dimensions corresponding to small elements in $\sigma$ and return to step 2.
Algorithm
![Page 32: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/32.jpg)
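Computing gradients
The gradient with respect to the margin:
$$\frac{\partial W^2}{\partial \sigma_f} = -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \frac{\partial K_\sigma(x_i, x_j)}{\partial \sigma_f}.$$
The gradient with respect to the radius:
$$\frac{\partial R^2}{\partial \sigma_f} = \sum_i \beta_i \frac{\partial K_\sigma(x_i, x_i)}{\partial \sigma_f} - \sum_{i,j} \beta_i \beta_j \frac{\partial K_\sigma(x_i, x_j)}{\partial \sigma_f}.$$
The gradient with respect to the span:
$$\frac{\partial S_i^2}{\partial \sigma_f} = S_i^4 \left( \tilde{K}_{SV}^{-1} \frac{\partial \tilde{K}_{SV}}{\partial \sigma_f} \tilde{K}_{SV}^{-1} \right)_{ii}, \quad \text{where } \tilde{K}_{SV} = \begin{pmatrix} K_{SV} & \mathbf{1} \\ \mathbf{1}^T & 0 \end{pmatrix}.$$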
![Page 33: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/33.jpg)
Linear problem with 6 relevant dimensions of 202
Nonlinear problem with 2 relevant dimensions of 52
[Figure: two plots of error rate versus number of samples, one per problem]
Toy data
![Page 34: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/34.jpg)

| Dataset | Total Samples | Class 0 | Class 1 |
|---|---|---|---|
| Leukemia Morphology (train) | 38 | 27 ALL | 11 AML |
| Leukemia Morphology (test) | 34 | 20 ALL | 14 AML |
| Leukemia Lineage (ALL) | 23 | 15 B-Cell | 8 T-Cell |
| Lymphoma Outcome (AML) | 15 | 8 Low risk | 7 High risk |
| Lymphoma Morphology | 77 | 19 FSC | 58 DLCL |
| Lymphoma Outcome | 58 | 22 Low risk | 36 High risk |
| Brain Morphology | 41 | 14 Glioma | 27 MD |
| Brain Outcome | 50 | 38 Low risk | 12 High risk |

Hierarchy of difficulty:
1. Histological differences: normal vs. malignant, skin vs. brain
2. Morphologies: different leukemia types, ALL vs. AML
3. Lineage: B-Cell vs. T-Cell, follicular vs. large B-cell lymphoma
4. Outcome: treatment outcome, relapse, or drug sensitivity
Molecular classification of cancer
![Page 35: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/35.jpg)

| Dataset | Algorithm | Total Samples | Total errors | Class 1 errors | Class 0 errors | Number Genes |
|---|---|---|---|---|---|---|
| Leukemia Morphology (test) AML vs ALL | SVM | 35 | 0/35 | 0/21 | 0/14 | 40 |
| | WV | 35 | 2/35 | 1/21 | 1/14 | 50 |
| | k-NN | 35 | 3/35 | 1/21 | 2/14 | 10 |
| Leukemia Lineage (ALL) B vs T | SVM | 23 | 0/23 | 0/15 | 0/8 | 10 |
| | WV | 23 | 0/23 | 0/15 | 0/8 | 9 |
| | k-NN | 23 | 0/23 | 0/15 | 0/8 | 10 |
| Lymphoma FS vs DLCL | SVM | 77 | 4/77 | 2/32 | 2/35 | 200 |
| | WV | 77 | 6/77 | 1/32 | 5/35 | 30 |
| | k-NN | 77 | 3/77 | 1/32 | 2/35 | 250 |
| Brain MD vs Glioma | SVM | 41 | 1/41 | 1/27 | 0/14 | 100 |
| | WV | 41 | 1/41 | 1/27 | 0/14 | 3 |
| | k-NN | 41 | 0/41 | 0/27 | 0/14 | 5 |

Morphology classification
![Page 36: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/36.jpg)

| Dataset | Algorithm | Total Samples | Total errors | Class 1 errors | Class 0 errors | Number Genes |
|---|---|---|---|---|---|---|
| Lymphoma LBC treatment outcome | SVM | 58 | 13/58 | 3/32 | 10/26 | 100 |
| | WV | 58 | 15/58 | 5/32 | 10/26 | 12 |
| | k-NN | 58 | 15/58 | 8/32 | 7/26 | 15 |
| Brain MD treatment outcome | SVM | 50 | 7/50 | 6/12 | 1/38 | 50 |
| | WV | 50 | 13/50 | 6/12 | 7/38 | 6 |
| | k-NN | 50 | 10/50 | 6/12 | 4/38 | 5 |

Outcome classification
![Page 37: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/37.jpg)
Error rates ignore temporal information such as when a patient dies. Survival analysis takes temporal information into account. The Kaplan-Meier survival plots and statistics for the above predictions show significance.
[Figure: Kaplan-Meier survival plots. Lymphoma: p-val = 0.0015. Medulloblastoma: p-val = 0.00039.]
Outcome classification
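The Kaplan-Meier estimate behind such survival plots can be sketched as follows. This is a minimal product-limit estimator with hypothetical follow-up times; it does not reproduce the p-values above, which would require a separate test comparing the two predicted groups.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up time per patient.
    events: True if death was observed at that time, False if censored.
    Returns a list of (time, survival probability) at each observed death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical cohort: five patients, two of them censored.
curve = kaplan_meier([5, 8, 8, 12, 20], [True, False, True, True, False])
print(curve)
```

Censored patients never lower the curve themselves; they only shrink the number at risk for later death times, which is how the estimator uses the timing information that plain error rates ignore.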
![Page 38: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/38.jpg)
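
| Tumor | Abbrev. | Total | Train | Test |
|---|---|---|---|---|
| Breast | B | 11 | 8 | 3 |
| Prostate | P | 10 | 8 | 2 |
| Lung | L | 11 | 8 | 3 |
| Colorectal | CR | 13 | 8 | 5 |
| Lymphoma | Ly | 22 | 16 | 6 |
| Bladder | Bl | 11 | 8 | 3 |
| Melanoma | M | 10 | 8 | 2 |
| Uterus | U | 10 | 8 | 3 |
| Leukemia | Le | 30 | 24 | 6 |
| Renal | R | 11 | 8 | 3 |
| Pancreas | PA | 11 | 8 | 3 |
| Ovary | Ov | 11 | 8 | 3 |
| Mesothelioma | MS | 11 | 8 | 3 |
| Brain | C | 20 | 16 | 4 |

Note that most of these tumors came from secondary sources and were not at the tissue of origin.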
Multi tumor classification
![Page 39: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/39.jpg)
CNS, Lymphoma, Leukemia tumors separate
Adenocarcinomas do not separate
Clustering is not accurate
![Page 40: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/40.jpg)

| Class | G+R | B+R |
|---|---|---|
| R | +1 | +1 |
| Y | -1 | -1 |
| G | +1 | -1 |
| B | -1 | +1 |

Combination approaches:
All pairs
One versus all (OVA)
Multi tumor classification
![Page 41: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/41.jpg)
[Figure 2: a gene expression dataset feeds one OVA classifier per tumor class (Breast, Prostate, ..., CNS). Each classifier's hyperplane separates one class, e.g. breast tumors, from all other tumors and gives the test sample a confidence on a scale from -2 to +2. The final multiclass call is the class with the highest OVA prediction strength, e.g. Breast (high confidence).]
Supervised methodology
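The final multiclass call, taking the class whose OVA classifier reports the highest prediction strength, can be sketched as below. The confidence values and function names are hypothetical.

```python
def rank_classes(confidences):
    """Order class names from highest to lowest OVA classifier output."""
    return sorted(confidences, key=confidences.get, reverse=True)


def ova_call(confidences):
    """Return the winning class and its confidence for one test sample."""
    best = rank_classes(confidences)[0]
    return best, confidences[best]


# Hypothetical signed outputs of three OVA classifiers for one test sample.
sample = {"Breast": 1.7, "Prostate": -0.4, "CNS": -1.2}
print(ova_call(sample))
print(rank_classes(sample)[:2])  # top-2 prediction calls
```

The same ranking yields the first, top-2, and top-3 prediction calls, and the winning confidence can be thresholded to split calls into high and low confidence.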
![Page 42: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/42.jpg)
[Figure: panels for the train/cross-validation set and Test 1: accuracy and fraction of calls versus confidence (-1 to 4); correct calls and errors at low and high confidence; accuracy of the first, top 2, and top 3 prediction calls]

| Dataset | Sample Type | Validation Method | Sample Number | Total Accuracy | High Confidence Fraction | High Confidence Accuracy | Low Confidence Fraction | Low Confidence Accuracy |
|---|---|---|---|---|---|---|---|---|
| Train | Well Differentiated | Cross-val. | 144 | 78% | 80% | 90% | 20% | 28% |
| Test 1 | Well Differentiated | Train/Test | 54 | 78% | 78% | 83% | 22% | 58% |

Well differentiated tumors
![Page 43: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/43.jpg)
Feature selection hurts performance
![Page 44: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/44.jpg)
[Figure: accuracy and fraction of calls versus confidence (-1 to 4); correct calls and errors at low and high confidence; accuracy of the first, top 2, and top 3 prediction calls]

| Dataset | Sample Type | Validation Method | Sample Number | Total Accuracy | High Confidence Fraction | High Confidence Accuracy | Low Confidence Fraction | Low Confidence Accuracy |
|---|---|---|---|---|---|---|---|---|
| Test | Poorly Differentiated | Train/test | 20 | 30% | 50% | 50% | 50% | 10% |

Poorly differentiated tumors
![Page 45: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/45.jpg)
Predicting sample sizes
For a given number of existing samples, how significant is the performance of a classifier? Does the classifier work at all? Are the results better than what one would expect by chance?
Assuming we know the answers to the previous questions for a number of sample sizes, what would be the performance of the classifier when trained with additional samples? Will the accuracy of the classifier improve significantly? Is the effort to collect additional samples worthwhile?
![Page 46: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/46.jpg)
Predicting sample sizes
[Figure: subsampling procedure. From an input dataset of l samples, train/test realizations (i = 1, 2, …, T1) are drawn at training set sizes n1, n2, …, l; a classifier is trained and tested on each realization; the average error rates $e_{n_1}, e_{n_2}, \ldots, e_l$ go through a significance test (comparison with random predictors) that separates not significant from significant sample sizes; a learning curve of error rate versus sample size is then fitted.]
Learning curve: $e(n) = a n^{-\alpha} + b$
This model can be used to estimate sample size requirements
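As a sketch of how such a model yields a sample size requirement, assume the learning curve is the inverse power law e(n) = a * n^(-alpha) + b, with b the asymptotic error. The parameter values below are hypothetical, not fitted to any dataset in these slides.

```python
def learning_curve(n, a, alpha, b):
    """Predicted error rate after training on n samples."""
    return a * n ** (-alpha) + b


def samples_needed(target_error, a, alpha, b):
    """Invert the learning curve: the n at which the predicted error hits the target."""
    if target_error <= b:
        raise ValueError("target error is at or below the asymptotic error b")
    return ((target_error - b) / a) ** (-1.0 / alpha)


# Hypothetical fitted parameters.
a, alpha, b = 1.0, 0.5, 0.05
n_required = samples_needed(0.15, a, alpha, b)
print(n_required)
print(learning_curve(n_required, a, alpha, b))
```

In practice a, alpha, and b would be fitted to the average error rates measured at the significant sample sizes, and the inversion then indicates whether collecting more samples is worthwhile.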
![Page 47: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/47.jpg)
Statistical significance
a) b)
The statistical significance in the tumor vs. non-tumor classification for a) 15 samples and b) 30 samples.
![Page 48: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/48.jpg)
Learning curves
Tumor vs. normal
![Page 49: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/49.jpg)
Learning curves
Tumor vs. normal
a) b)
![Page 50: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/50.jpg)
Learning curves
AML vs. ALL
![Page 51: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/51.jpg)
Learning curves
Colon cancer
![Page 52: CBCl/AI MIT What is bioinformatics ? Application of computing technology to providing statistical and database solutions to problems in molecular biology](https://reader036.vdocuments.us/reader036/viewer/2022062301/5697bfa11a28abf838c95d61/html5/thumbnails/52.jpg)
Learning curves
Brain treatment outcome