data balancing for phenotype classification based on snps
Upload: asociacion-argentina-de-bioinformatica-y-biologia-computacional
Post on 04-Jul-2015
1 Facultad de Ingeniería - Universidad Nacional de Mar del Plata
2 Agencia Nacional de Promoción Científica y Tecnológica – FONCyT - PICT 2006
1er Congreso Argentino de Bioinformática
Data Balancing for Phenotype Classification Based on SNPs
Marcel Brun 1,2, Virginia Ballarín 1
Facultad de Ingeniería, Universidad Nacional de Mar del Plata
Single Nucleotide Polymorphism (SNP)
• A single nucleotide polymorphism is a position in the genome where two different alleles have been found in the population, each at a frequency greater than 1%.
• SNPs are responsible for most of the genetic variation between individuals.
• SNPs may occur in non-coding regions (SNPs) as well as in coding regions (cSNPs).
• Each polymorphism, or variant, occurs at a frequency greater than 1% in the population.
SNPs
SNPs Categorization
Categorization:
A) Functional SNPs that affect the phenotype directly via gene alterations.
B) Functional SNPs that produce a higher predisposition to phenotypic changes (via transcriptional changes or external factors).
D) Silent SNPs, non-functional.
Can we predict the phenotype based on a combination of SNPs?
Can we find SNPs associated with specific phenotype changes?
SNPs
• Coding Regions
• Non-Coding Regions
• Changes in Protein
• No Changes in Protein
• Changes in Transcription Factors
Classification based on SNP data
SNP Data Combinatorial search
Error tables for combinations of n SNPs as predictors of control/disease
35 3 1
1 2 3  0 0  0.423 0.430 0.423 0.248
1 2 4  0 0  0.340 0.430 0.340 0.284
1 2 5  0 0  0.351 0.430 0.351 0.279
1 2 6  0 0  0.205 0.430 0.205 0.342
1 2 7  0 0  0.323 0.430 0.323 0.291
1 3 4  0 0  0.344 0.430 0.344 0.282
1 3 5  0 0  0.328 0.430 0.328 0.289
1 3 6  0 0  0.333 0.430 0.333 0.287
1 3 7  0 0  0.314 0.430 0.314 0.295
1 4 5  0 0  0.267 0.430 0.267 0.315
1 4 6  0 0  0.130 0.430 0.130 0.374
1 4 7  0 0  0.267 0.430 0.267 0.315
Processed results in HTML pages and Excel datasheets
[Figure: SNP calls (SNiPer HD); axes RAS1 and RAS2, each from 0 to 1]
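The combinatorial search over SNP sets can be sketched as follows. This is only an illustration under stated assumptions: the toy genotype data, the resubstitution error of a majority-vote table as the score, and the names `table_error` and `rank_snp_sets` are all hypothetical and not taken from the slides.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import comb

def table_error(samples, labels, snp_idx):
    """Resubstitution error of a majority-vote decision table on the chosen SNPs."""
    cells = defaultdict(Counter)
    for genotypes, label in zip(samples, labels):
        key = tuple(genotypes[i] for i in snp_idx)
        cells[key][label] += 1
    # In each cell, every sample outside the majority class counts as an error.
    wrong = sum(sum(c.values()) - max(c.values()) for c in cells.values())
    return wrong / len(samples)

def rank_snp_sets(samples, labels, n):
    """Score every combination of n SNPs and sort by error (best first)."""
    n_snps = len(samples[0])
    scored = [(table_error(samples, labels, idx), idx)
              for idx in combinations(range(n_snps), n)]
    return sorted(scored)

# Hypothetical toy data: 6 samples, 4 SNPs each (not from the slides).
samples = [
    ["AA", "AB", "BB", "AA"],
    ["AA", "AA", "BB", "AB"],
    ["AB", "AB", "AA", "AA"],
    ["BB", "AB", "AA", "AB"],
    ["BB", "BB", "AB", "AA"],
    ["AB", "BB", "AB", "AB"],
]
labels = ["control", "control", "case", "case", "case", "control"]

ranking = rank_snp_sets(samples, labels, n=2)
print(len(ranking), "SNP pairs scored; best:", ranking[0])
print("sets of 3 out of 7 SNPs:", comb(7, 3))
```

The number of candidate sets grows quickly with n, which is one reason to keep n small. With 7 SNPs and n = 3 there are 35 sets, which would agree with the leading 35 in the error table if that number is indeed the count of combinations.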
Example: Bovine breed classification based on SNPs
• Breed classification based on small sets of SNPs
• European vs. indicine breeds
• Data from the Bovine HapMap Consortium
• Focus on regions of interest in the genome (chromosomes 4 and 9)
Classifier  SNP 1       SNP 2       SNP 3
Ψ29         BTA-140710  BTA-160695  rs29024708
Ψ4          BTA-117838  rs29019831  BTA-71641

             Ψ29      Ψ4
Mean FP      0.18     0.88
Mean FN      0.02     0.22
Mean TP      20.34    19.76
Mean TN      38.46    38.14
FPR          0.0046   0.022
FNR          0.00098  0.011
PPV          0.991    0.957
NPV          0.999    0.994
Specificity  0.995    0.974
Sensitivity  0.999    0.989
Error        0.00339  0.01864
• Mariela A. Gonzalez, Marcel Brun, Pablo M. Corva, Virginia Ballarin, “Análisis de señales genómicas para la clasificación de razas bovinas”, CAI 2009, 1er Congreso Argentino de Agroinformática, 38 JAIIO, Mar del Plata, Argentina, 24-28 August 2009.
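The rates in the table follow from the mean confusion counts by the usual definitions. A minimal sketch, assuming the standard formulas and using the mean counts reported for Ψ29; the function name `confusion_metrics` is hypothetical.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard rates derived from (possibly averaged) confusion counts."""
    total = tp + tn + fp + fn
    return {
        "error": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "fnr": fn / (tp + fn),
        "fpr": fp / (tn + fp),
    }

# Mean counts reported for classifier Ψ29 in the table above.
m = confusion_metrics(tp=20.34, tn=38.46, fp=0.18, fn=0.02)
for name, value in m.items():
    print(f"{name:12s} {value:.5f}")
```

The printed values agree with the Ψ29 column up to rounding, which also confirms how the flattened table columns pair up.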
Discrete Full Logic for SNP-based classification
• SNP calls can be considered discrete variables with three possible values: AA, AB and BB.
• SNPs can be used together to predict disease status against control. This status can be considered a Boolean variable: “1” for case and “0” for control.
• Training consists of learning the logic that determines the outcome as a function of the observed SNPs.
• The problem is constrained to a few observed SNPs to avoid over-fitting in training.
• Given a number of SNPs, a decision table defines a “full-logic” discrete classifier.
Example of decision table
SNP1  SNP2  Outcome
AA    AA    Control
AA    AB    Unknown
AA    BB    Unknown
AB    AA    Control
AB    AB    Control
AB    BB    Case
BB    AA    Case
BB    AB    Case
BB    BB    Control
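A full-logic classifier is just a lookup table with one row per genotype combination. The sketch below is illustrative only: the three filled-in entries are hypothetical, and `n_cells` and `classify` are names introduced here. It also shows why the number of SNPs has to stay small: the table has 3^n rows.

```python
from itertools import product

GENOTYPES = ("AA", "AB", "BB")

def n_cells(n_snps):
    """Number of rows in a full-logic table over n_snps three-valued SNPs."""
    return len(GENOTYPES) ** n_snps

# Hypothetical two-SNP decision table; None marks a cell with no decision.
decision_table = {cell: None for cell in product(GENOTYPES, repeat=2)}
decision_table[("AA", "AA")] = 0   # control
decision_table[("BB", "AA")] = 1   # case
decision_table[("BB", "AB")] = 1   # case

def classify(snp_calls):
    """Look up the outcome (1 = case, 0 = control, None = unknown)."""
    return decision_table[tuple(snp_calls)]

print(classify(["BB", "AA"]))
for n in (1, 2, 3, 5, 10):
    print(n, "SNPs ->", n_cells(n), "table rows")
```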
Estimation of error rate for 2 SNPs from data – Discrete Full logic

Train  SNP1  SNP2  Phenotype
S1     AA    AB    Control
S2     AA    AA    Case
S3     AB    AA    Control
S4     AA    AA    Case
S5     BB    AA    Case
S6     BB    AA    Case
S7     AA    AB    Case
S7     AA    AB    Control

Statistical inference of the optimal function, using multi-resolution to “generalize”

SNP1  SNP2  # Cases  # Control  Decision  Generalization
AA    AA    2        0          Case      Case
AA    AB    1        2          Control   Control
AA    BB    0        0          ???       Case
AB    AA    1        0          Case      Case
AB    AB    0        0          ???       Case
AB    BB    0        0          ???       Case
BB    AA    2        0          Case      Case
BB    AB    0        0          ???       Case
BB    BB    0        0          ???       Case

Test  SNP1  SNP2  Phenotype
S8    AA    AB    Control
S9    AA    AA    Case
S10   AB    BB    Control
S11   AA    AA    Case
S12   BB    AB    Case
S13   AA    AA    Control

2 mistakes on 6 ⇒ 66% accuracy
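The design and test steps on this slide can be reproduced in a few lines. This is a sketch under two assumptions that are not confirmed by the slides: the per-cell counts are listed in the order AA AA, AA AB, ..., BB BB, and empty cells fall back to the overall majority class, which stands in here for the multi-resolution generalization.

```python
from itertools import product

GENOTYPES = ("AA", "AB", "BB")
CELLS = list(product(GENOTYPES, repeat=2))  # (SNP1, SNP2), AA AA first

# Counts per cell as read from the slide; the row order is an assumption.
n_case    = [2, 1, 0, 1, 0, 0, 2, 0, 0]
n_control = [0, 2, 0, 0, 0, 0, 0, 0, 0]

# Assumed fallback for empty cells: the overall majority class.
default = "Case" if sum(n_case) >= sum(n_control) else "Control"

table = {}
for cell, ca, co in zip(CELLS, n_case, n_control):
    if ca == 0 and co == 0:
        table[cell] = default        # generalization step for unseen cells
    else:
        table[cell] = "Case" if ca > co else "Control"

# Test samples S8 to S13 as (SNP1, SNP2, phenotype).
test = [
    ("AA", "AB", "Control"), ("AA", "AA", "Case"), ("AB", "BB", "Control"),
    ("AA", "AA", "Case"), ("BB", "AB", "Case"), ("AA", "AA", "Control"),
]
mistakes = sum(table[(s1, s2)] != y for s1, s2, y in test)
print(f"{mistakes} mistakes on {len(test)} samples, "
      f"accuracy {1 - mistakes / len(test):.2f}")
```

Under these assumptions the sketch gives 2 mistakes on 6 test samples, in line with the accuracy quoted on the slide.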
Example of Truth table for a real case

SNP 1  SNP 6  SNP 41  # Control (Freq. Corrected)  # Case (Freq. Corrected)  Predicted
AA     AA     AA      0 (0%)                       2 (0.3%)                  Case
AA     AA     AB      0 (0%)                       0 (0%)                    Case
AA     AA     BB      0 (0%)                       0 (0%)                    Case
AA     AB     AA      0 (0%)                       28 (3.5%)                 Case
AA     AB     AB      0 (0%)                       11 (1.4%)                 Case
AA     AB     BB      0 (0%)                       0 (0%)                    Case
AA     BB     AA      3 (0.8%)                     22 (2.8%)                 Case
AA     BB     AB      0 (0%)                       11 (1.4%)                 Case
AA     BB     BB      0 (0%)                       0 (0%)                    Case
AB     AA     AA      8 (2.2%)                     5 (0.6%)                  Control
AB     AA     AB      2 (0.5%)                     5 (0.6%)                  Case
AB     AA     BB      1 (0.3%)                     1 (0.1%)                  Control
AB     AB     AA      14 (3.8%)                    60 (7.6%)                 Case
AB     AB     AB      5 (1.4%)                     35 (4.4%)                 Case
AB     AB     BB      1 (0.3%)                     3 (0.4%)                  Case
AB     BB     AA      11 (3%)                      49 (6.2%)                 Case
AB     BB     AB      6 (1.6%)                     31 (3.9%)                 Case
AB     BB     BB      2 (0.5%)                     3 (0.4%)                  Control
BB     AA     AA      25 (6.8%)                    8 (1%)                    Control
BB     AA     AB      4 (1.1%)                     9 (1.1%)                  Case
BB     AA     BB      1 (0.3%)                     1 (0.1%)                  Control
BB     AB     AA      49 (13.2%)                   41 (5.2%)                 Control
BB     AB     AB      11 (3%)                      15 (1.9%)                 Control
BB     AB     BB      0 (0%)                       2 (0.3%)                  Case
BB     BB     AA      34 (9.2%)                    31 (3.9%)                 Control
BB     BB     AB      5 (1.4%)                     21 (2.7%)                 Case
BB     BB     BB      3 (0.8%)                     2 (0.3%)                  Control
Total                 185                          396
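The corrected frequencies in this table are consistent with dividing each count by its class total and multiplying by 0.5, that is, with a balanced joint distribution. The sketch below checks a few rows under that reading; the helper name `corrected` and the rule of predicting Case on ties are assumptions.

```python
N_CASE, N_CONTROL = 396, 185  # class totals from the table

def corrected(count, class_total, target_prior=0.5):
    """Joint frequency of a cell after rescaling its class to target_prior."""
    return count / class_total * target_prior

# A few rows of the table as (SNP 1, SNP 6, SNP 41, # control, # case).
rows = [
    ("AB", "AA", "AA", 8, 5),
    ("AB", "BB", "BB", 2, 3),
    ("BB", "AB", "AA", 49, 41),
    ("AB", "AB", "AA", 14, 60),
]
predictions = []
for s1, s6, s41, n_ctrl, n_case in rows:
    f_ctrl = corrected(n_ctrl, N_CONTROL)
    f_case = corrected(n_case, N_CASE)
    predicted = "Case" if f_case >= f_ctrl else "Control"
    predictions.append(predicted)
    print(s1, s6, s41, f"{100 * f_ctrl:.1f}%", f"{100 * f_case:.1f}%", predicted)
```

The row AB BB BB is the instructive one: it has more cases (3) than controls (2) in raw counts, yet after correction the control frequency is larger, so the prediction is Control.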
Feature Selection – Sets of 3 SNPs
[Figure: selected SNPs and error at Step N and Step N+1]
Now The Issues
Why Balancing?
Why Balancing
The previous example shows 185 controls vs. 396 cases.
These numbers result from how the data sampling was done.
They do not represent population proportions.
But they affect the classifier design because of differences in the prior probabilities.
Why Balancing
[Figure: class 1 and class 2 distributions with two decision thresholds; values shown: 48%, 2.5%, 6.6%, 15%; axis marks 0 and 2.9]
• Using the threshold with the best error may yield undesirable FPR and FNR values.
• Changing the threshold provides “better” combined FPR and FNR at the cost of an increased error rate.
Error = 6.6%
Error = 14.2%
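The trade-off between error rate and the FPR/FNR pair can be illustrated with a toy model. Everything in the sketch below is an assumption made for illustration (two unit-variance Gaussian classes with priors 0.9 and 0.1); it does not reproduce the numbers on the slide.

```python
from math import erf, sqrt

def norm_cdf(x, mu, sigma=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Assumed toy model (not the slide's data): class 1 ~ N(0, 1) with prior 0.9,
# class 2 ~ N(2, 1) with prior 0.1; predict class 2 when x > threshold.
P1, P2, MU1, MU2 = 0.9, 0.1, 0.0, 2.0

def rates(threshold):
    fpr = 1.0 - norm_cdf(threshold, MU1)   # class 1 called class 2
    fnr = norm_cdf(threshold, MU2)         # class 2 called class 1
    return P1 * fpr + P2 * fnr, fpr, fnr

# Scan thresholds for the minimum-error one; the midpoint equalizes FPR and FNR.
grid = [i / 1000 for i in range(-2000, 5001)]
t_min_error = min(grid, key=lambda t: rates(t)[0])
t_equal = (MU1 + MU2) / 2

for name, t in (("min error", t_min_error), ("FPR = FNR", t_equal)):
    err, fpr, fnr = rates(t)
    print(f"{name:10s} t={t:.2f} error={err:.3f} FPR={fpr:.3f} FNR={fnr:.3f}")
```

In this toy model the minimum-error threshold misses more than half of the rare class, while the equalizing threshold brings FNR down to the level of FPR at the price of a higher overall error.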
Why Balancing
Usual learning algorithms assume:
a) The goal is to minimize the error rate
b) Training data and application data have the same distribution
Error rate may not be the best goal for many problems.
But usually the proportion of the two classes in the training samples does not reflect their population probabilities (priors).
Even if the samples represent the population proportions, the classifier with the best error rate may be very bad regarding FPR and FNR.
Why is this bad?
Designed classifier is sub-optimal (Red Line)
Estimated error may not predict true error (Blue Line)
• Fixed population
• Training data with different proportions of positive/negative samples
• Optimal classifiers for extreme cases have zero error (empirical error)
• The optimal classifier is obtained when the training proportions (60%-40%) are similar to the population's.
• The classifier trained on 50%-50% does not perform badly.
• In both cases the empirical estimate is a good estimator.
Balancing
Several techniques to balance continuous data artificially:
Replicate samples from the smaller set (upsampling)
Remove samples from the larger set (downsampling)
Threshold adjustment (LDA / Trees / etc.)
Taking new samples (resampling)
These techniques change the empirical joint distribution while avoiding changes in the conditional distributions.
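Upsampling and downsampling can be sketched with the standard library alone. The class sizes below reuse the 396 cases and 185 controls of the earlier example; the sample contents and the helper names `upsample` and `downsample` are hypothetical.

```python
import random

def upsample(minority, target_size, rng):
    """Replicate minority samples (drawn with replacement) up to target_size."""
    extra = rng.choices(minority, k=target_size - len(minority))
    return minority + extra

def downsample(majority, target_size, rng):
    """Keep a random subset of the majority class (no replacement)."""
    return rng.sample(majority, target_size)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical samples with the class sizes of the earlier example.
cases = [("case", i) for i in range(396)]
controls = [("control", i) for i in range(185)]

up = upsample(controls, len(cases), rng)
down = downsample(cases, len(controls), rng)
print(len(cases), len(up))       # balanced by upsampling
print(len(down), len(controls))  # balanced by downsampling
```

Both operations only replicate or drop samples within one class, so they change the class proportions but, on average, not the distribution of genotypes within each class.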
Balancing on discrete data

Unbalanced Data
SNP1  SNP2  # 0  # 1
AA    AA    85   19
AA    AB    0    19
AA    BB    45   8
AB    AA    90   3
AB    AB    90   15
AB    BB    50   19
BB    AA    60   16
BB    AB    10   12
BB    BB    95   9
Total       525  120

Classifier design
SNP1  SNP2  # 0     # 1     Classif
AA    AA    0.1318  0.0295  0
AA    AB    0.0000  0.0295  0
AA    BB    0.0698  0.0124  0
AB    AA    0.1395  0.0047  0
AB    AB    0.1395  0.0233  0
AB    BB    0.0775  0.0295  0
BB    AA    0.0930  0.0248  1
BB    AB    0.0155  0.0186  0
BB    BB    0.1473  0.0140  0
Total       0.814   0.1860
Balancing on discrete data
• Balancing without resampling.
• Changes in marginal distributions.
• Conditional distributions unchanged.
• Changes in the plug-in classifier and its error rates.

Unbalanced joint distribution and classifier Ψ
SNP1  SNP2  # 0     # 1     Ψ
AA    AA    0.1318  0.0295  0
AA    AB    0.0000  0.0295  0
AA    BB    0.0698  0.0124  0
AB    AA    0.1395  0.0047  0
AB    AB    0.1395  0.0233  0
AB    BB    0.0775  0.0295  0
BB    AA    0.0930  0.0248  1
BB    AB    0.0155  0.0186  0
BB    BB    0.1473  0.0140  0
Total       0.814   0.1860
ERROR = 25 %  FPR = 11 %  FNR = 86 %

Balanced joint distribution and classifier ΨB (column # 0: X*0.5/0.814; column # 1: X*0.5/0.1860)
SNP1  SNP2  # 0     # 1     ΨB
AA    AA    0.0810  0.0792  0
AA    AB    0.0000  0.0792  1
AA    BB    0.0429  0.0333  0
AB    AA    0.0857  0.0125  0
AB    AB    0.0857  0.0625  0
AB    BB    0.0476  0.0792  1
BB    AA    0.0571  0.0667  1
BB    AB    0.0095  0.0500  1
BB    BB    0.0905  0.0375  0
Total       0.5000  0.5000
ERROR = 34 %  FPR = 23 %  FNR = 45 %
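Balancing without resampling amounts to rescaling each class column of the joint table so that both priors become 0.5. A minimal sketch using the cell counts read from the slides; the list order (first cell AA AA) and the variable names are assumptions made here.

```python
# Cell counts from the slides; the first cell is (SNP1, SNP2) = (AA, AA).
n0 = [85, 0, 45, 90, 90, 50, 60, 10, 95]   # class 0
n1 = [19, 19, 8, 3, 15, 19, 16, 12, 9]     # class 1
N = sum(n0) + sum(n1)

# Unbalanced joint distribution and priors.
p0 = [c / N for c in n0]
p1 = [c / N for c in n1]
prior0, prior1 = sum(p0), sum(p1)

# Balancing: rescale each class column so that both priors become 0.5.
b0 = [x * 0.5 / prior0 for x in p0]
b1 = [x * 0.5 / prior1 for x in p1]

# Plug-in classifier on the balanced table: pick the larger joint probability.
psi_b = [1 if y > x else 0 for x, y in zip(b0, b1)]

# Rates of psi_b under the balanced distribution.
fpr = sum(x for x, d in zip(b0, psi_b) if d == 1) / 0.5
fnr = sum(y for y, d in zip(b1, psi_b) if d == 0) / 0.5
error = 0.5 * fpr + 0.5 * fnr
print(psi_b)
print(f"error={error:.3f} FPR={fpr:.3f} FNR={fnr:.3f}")
```

The resulting decisions coincide with the ΨB column, and the rates come out at about 34%, 23% and 45%, matching the figures quoted for the balanced design.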
Balancing on discrete data
• What happens if we exchange the classifiers (assuming a prior distribution different from the one used for design)?
[Tables: the balanced joint distribution paired with Ψ, and the unbalanced joint distribution paired with ΨB; probabilities and classifiers as on the previous slide]
Error of ΨB on the unbalanced distribution = 26% (against 25% for Ψ)
Error of Ψ on the balanced distribution = 49% (against 34% for ΨB)
• The balanced classifier ΨB is more robust against a wrong prior distribution!
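The exchange experiment can be checked by evaluating both classifiers under both priors. In the sketch below the two decision vectors are copied from the slides rather than re-derived, and the cell order (AA AA first) and function names are assumptions.

```python
n0 = [85, 0, 45, 90, 90, 50, 60, 10, 95]   # class 0 counts per cell
n1 = [19, 19, 8, 3, 15, 19, 16, 12, 9]     # class 1 counts per cell

# Classifiers copied from the slides (one decision per cell).
psi   = [0, 0, 0, 0, 0, 0, 1, 0, 0]   # shown with the unbalanced data
psi_b = [0, 1, 0, 0, 0, 1, 1, 1, 0]   # shown with the balanced data

def fpr_fnr(decisions):
    """Class-conditional error rates; they do not depend on the priors."""
    fpr = sum(c for c, d in zip(n0, decisions) if d == 1) / sum(n0)
    fnr = sum(c for c, d in zip(n1, decisions) if d == 0) / sum(n1)
    return fpr, fnr

def error(decisions, prior0):
    """Error rate of a classifier when class 0 has prior probability prior0."""
    fpr, fnr = fpr_fnr(decisions)
    return prior0 * fpr + (1 - prior0) * fnr

sample_prior0 = sum(n0) / (sum(n0) + sum(n1))   # about 0.814
for name, clf in (("psi", psi), ("psi_b", psi_b)):
    print(name,
          f"sample priors: {error(clf, sample_prior0):.3f}",
          f"balanced priors: {error(clf, 0.5):.3f}")
```

The computed errors are close to the quoted ones: about 25% and 49% for Ψ, and about 27% and 34% for ΨB (the slide quotes 26% for the former). Moving from the sample priors to balanced priors costs ΨB far less than it costs Ψ.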
Balancing - Simulations
Balancing samples allows for classifier design that is independent of the proportion of samples between cases and controls.
• Same example as before (same joint point-label distribution)
• Before training the classifier, samples are “balanced” artificially. We always obtain the same classifier.
• The error predicted from the training samples still does not correctly predict the true error (but it is bounded by FPR and FNR).
• Even with the correct proportions (60% class 0), the designed classifier does not reach the optimal error (because of balancing).
Another Advantage
When balancing the data, sampling from the same population generates the same classifier regardless of the number of samples drawn for each class!
Because, if the conditional probabilities are the same, after balancing we always reach the same joint distribution!
Therefore the error rate, FPR and FNR of the designed classifier do NOT depend on the proportion of samples used!
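The invariance can be seen directly on expected counts. In this sketch the conditional distributions and sample sizes are hypothetical; with real finite samples the equality holds only approximately, because the estimated conditionals fluctuate.

```python
def balanced_joint(n0, n1):
    """Balanced joint distribution from per-cell class counts."""
    b0 = [0.5 * c / sum(n0) for c in n0]
    b1 = [0.5 * c / sum(n1) for c in n1]
    return b0, b1

# Hypothetical conditional distributions over three cells.
cond0 = [0.6, 0.3, 0.1]   # P(cell | class 0)
cond1 = [0.1, 0.3, 0.6]   # P(cell | class 1)

def expected_counts(size0, size1):
    """Expected per-cell counts when size0 and size1 samples are drawn."""
    return [size0 * p for p in cond0], [size1 * p for p in cond1]

a = balanced_joint(*expected_counts(900, 100))   # 90% / 10% sampling
b = balanced_joint(*expected_counts(200, 800))   # 20% / 80% sampling
print(a)
print(b)
```

Both sampling schemes lead to the same balanced joint table, so any plug-in classifier built from it is the same as well.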
Another Advantage
Balancing tends to jointly minimize FPR and FNR,
because it minimizes Error = (FPR+FNR)/2.
The designed classifier has a smaller range for the estimated error.
[Figure panels: Non-Balanced Design, Balanced Design]
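The expression for the balanced error follows from writing the error rate as a prior-weighted sum of the two class-conditional rates. Here P(0) and P(1) denote the priors of control and case, a notation assumed for this note:

```latex
\mathrm{Error} = P(0)\,\mathrm{FPR} + P(1)\,\mathrm{FNR}
\qquad\Longrightarrow\qquad
\mathrm{Error}_{B} = \tfrac{1}{2}\,\mathrm{FPR} + \tfrac{1}{2}\,\mathrm{FNR}
                   = \frac{\mathrm{FPR} + \mathrm{FNR}}{2}
\quad\text{when } P(0) = P(1) = \tfrac{1}{2}.
```

Since FPR and FNR do not depend on the priors, the true error under any priors is a weighted average of the two and therefore lies between them.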
Another Advantage
Balancing the data is equivalent to changing the threshold so as to obtain the closest values of FPR and FNR.
It is done without the need to search for the best threshold.
[Figure panels: Threshold Change, Balancing]
Example of Application
Victor L. Boyartchuk, Karl W. Broman, Rebecca E. Mosher, Sarah E.F. D’Orazio, Michael N. Starnbach & William F. Dietrich, “Multigenic control of Listeria monocytogenes susceptibility in mice”, Nature Publishing Group, Brief Communications, 2001
Survival vs. No Survival Analysis
[Figure: genotype map of Samples 1 to 120, grouped as Surv. and Not Surv., over the markers D1M451, D1M504, D1M113, D1M355, D2M148, D3M265, D4M251, D5M257, D5M83, D5M357, D5M91, D5M338, D5M168, D6M284, D6M39, D6M254, D6M339, D6M294, D7M105, D8M94, D9M106, D10M233, D11M327, D13M99, D13M106, D13M147, D15M239, D18M186; label: Selected SNPs]
Results
Balanced vs. Unbalanced classifier design
                 Error Rate  FPR    FNR
Classic Design   29.7%       50.9%  19.8%
Balanced Design  30.3%       51.4%  9.7%
Small increase in FPR but large reduction in FNR
We expect this classifier to be more robust against varying prior proportions.
Conclusions
Data Balancing is a necessary step to ensure proper classifier design under unknown priors P(case) and P(control).
The designed classifier is more robust against the unknown priors.
Data Balancing for discrete data can be (and should be) easily applied via the estimated joint distribution, avoiding resampling.
Acknowledgments
UNMdP: Virginia Ballarin, Mariela Azul Gonzalez
INTA (Balcarce): Pablo Corva
FI-UNER: Inti Anabela Pagnuco
Agencia: FONCyT – PICT 2313
TGen: Edward Dougherty, Dietrich Stephan