Data Balancing for Phenotype Classification Based on SNPs

Marcel Brun (1,2), Virginia Ballarín (1)

1 Facultad de Ingeniería - Universidad Nacional de Mar del Plata

2 Agencia Nacional de Promoción Científica y Tecnológica – FONCyT - PICT 2006

1er Congreso Argentino de Bioinformática


Single Nucleotide Polymorphism (SNP)

• A single nucleotide polymorphism is a position in the genome where two different alleles have been found in the population, each at a frequency greater than 1%.

• SNPs are responsible for most of the genetic variation between individuals.

• SNPs may occur in non-coding regions (SNPs) as well as in coding regions (cSNPs).

• Each polymorphism, or variant, occurs at a frequency greater than 1% in the population.


SNPs


SNPs Categorization

Categorization:

A) Functional SNPs that affect the phenotype directly via gene alterations.

B) Functional SNPs that produce a higher predisposition to phenotypic changes (via transcriptional changes or external factors).

D) Silent, non-functional SNPs.

Can we predict the phenotype based on a combination of SNPs?

Can we find SNPs associated with specific phenotype changes?

[Diagram: SNPs occur in coding regions or non-coding regions; effects shown: changes in protein, no changes in protein, changes in transcription factors]


Classification based on SNP data

• SNP Data

• Combinatorial search

• Error tables for combinations of n SNPs as predictors of control/disease

35 3 1
1 2 3   0 0   0.423 0.430 0.423 0.248
1 2 4   0 0   0.340 0.430 0.340 0.284
1 2 5   0 0   0.351 0.430 0.351 0.279
1 2 6   0 0   0.205 0.430 0.205 0.342
1 2 7   0 0   0.323 0.430 0.323 0.291
1 3 4   0 0   0.344 0.430 0.344 0.282
1 3 5   0 0   0.328 0.430 0.328 0.289
1 3 6   0 0   0.333 0.430 0.333 0.287
1 3 7   0 0   0.314 0.430 0.314 0.295
1 4 5   0 0   0.267 0.430 0.267 0.315
1 4 6   0 0   0.130 0.430 0.130 0.374
1 4 7   0 0   0.267 0.430 0.267 0.315

• Processed results in HTML pages and Excel datasheets

• Calls – SNiPer HD

[Figure: scatter plot of calls, axes RAS1 and RAS2, both ranging from 0 to 1]
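The combinatorial search over SNP sets can be sketched as follows. This is a minimal illustration, not the code behind the slide: it assumes genotype calls stored as strings, scores each set of n SNPs by the resubstitution error of a majority-vote decision table, and uses hypothetical names (`table_error`, `search_combinations`) and invented toy data.

```python
from collections import Counter, defaultdict
from itertools import combinations


def table_error(rows, labels, snps):
    """Resubstitution error of a majority-vote decision table on the chosen SNPs."""
    cells = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in snps)
        cells[key][label] += 1
    # In each cell, every sample outside the majority class counts as an error.
    mistakes = sum(sum(c.values()) - max(c.values()) for c in cells.values())
    return mistakes / len(rows)


def search_combinations(rows, labels, n):
    """Score every set of n SNPs and return (error, snps) pairs, best first."""
    n_snps = len(rows[0])
    scored = [(table_error(rows, labels, snps), snps)
              for snps in combinations(range(n_snps), n)]
    return sorted(scored)


# Toy data (invented for illustration): 4 SNPs, 6 samples, 1 = case, 0 = control.
rows = [
    ("AA", "AB", "BB", "AA"),
    ("AA", "AA", "BB", "AB"),
    ("AB", "AB", "AA", "AA"),
    ("BB", "AB", "AA", "AB"),
    ("BB", "AA", "AB", "AA"),
    ("AB", "AA", "AB", "AB"),
]
labels = [0, 0, 1, 1, 1, 0]
ranking = search_combinations(rows, labels, 2)
best_error, best_snps = ranking[0]
```

With so few samples almost any pair of SNPs separates the classes perfectly on the training data, which is exactly the over-fitting risk that motivates restricting the search to small SNP sets and estimating the error on held-out samples.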


Example: Bovine breed classification based on SNPs

• Breed classification based on small sets of SNPs

• European vs. Indicine breeds

• Data from the Bovine HapMap Consortium

• Focus on regions of interest in the genome (chromosomes 4 and 9)

Classifier  SNP 1       SNP 2       SNP 3
Ψ29         BTA-140710  BTA-160695  rs29024708
Ψ4          BTA-117838  rs29019831  BTA-71641

             Ψ29      Ψ4
Error        0.00339  0.01864
Sensitivity  0.999    0.989
Specificity  0.995    0.974
NPV          0.999    0.994
PPV          0.991    0.957
FNR          0.00098  0.011
FPR          0.0046   0.022
Mean TN      38.46    38.14
Mean TP      20.34    19.76
Mean FN      0.02     0.22
Mean FP      0.18     0.88

• Mariela A. Gonzalez, Marcel Brun, Pablo M. Corva, Virginia Ballarin, “Análisis de señales genómicas para la clasificación de razas bovinas”, CAI 2009, 1er Congreso Argentino de Agroinformática, in the 38 JAIIO, Mar del Plata, Argentina, August 24-28, 2009.
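The performance measures in the table follow from the mean confusion-matrix counts. The sketch below uses the standard definitions (the function name `rates` is hypothetical) and plugs in the Ψ29 counts. Because the table reports averages over runs, a ratio of mean counts reproduces the tabulated values only up to rounding.

```python
def rates(tp, tn, fp, fn):
    """Standard performance measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "error": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "fnr": fn / (tp + fn),
        "fpr": fp / (tn + fp),
    }


# Mean counts reported for the classifier Ψ29.
psi29 = rates(tp=20.34, tn=38.46, fp=0.18, fn=0.02)
```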


Discrete Full Logic for SNP-based classification

• SNP calls can be considered discrete variables with three possible values: AA, AB and BB.

• SNPs can be used together to predict disease status against control. This status can be considered a Boolean variable: “1” for case and “0” for control.

• Training consists of learning the logic that determines the outcome as a function of the observed SNPs.

• The problem is constrained to a few observed SNPs to avoid over-fitting in training.

• Given a number of SNPs, a decision table defines a “full-logic” discrete classifier.

Example of decision table

SNP1  SNP2  Outcome
AA    AA    Control
AA    AB    Control
AA    BB    Unknown
AB    AA    Unknown
AB    AB    Control
AB    BB    Control
BB    AA    Case
BB    AB    Case
BB    BB    Case
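A decision table of this kind can be learned by majority vote per genotype combination. The sketch below is one possible reading of "learning the logic", not necessarily the authors' procedure: function names and toy samples are hypothetical, and ties or unseen combinations are labeled "Unknown".

```python
from collections import Counter, defaultdict


def train_full_logic(samples):
    """Build a decision table from (snp1, snp2, phenotype) training samples.

    Each observed genotype combination gets its majority phenotype;
    ties are left as "Unknown".
    """
    cells = defaultdict(Counter)
    for snp1, snp2, phenotype in samples:
        cells[(snp1, snp2)][phenotype] += 1
    table = {}
    for key, counts in cells.items():
        ranked = counts.most_common()
        tied = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        table[key] = "Unknown" if tied else ranked[0][0]
    return table


def classify(table, snp1, snp2):
    """Look up a genotype pair; combinations never seen in training are Unknown."""
    return table.get((snp1, snp2), "Unknown")


# Invented toy samples, for illustration only.
training = [
    ("AA", "AA", "Control"),
    ("AA", "AA", "Control"),
    ("AA", "AA", "Case"),
    ("BB", "AB", "Case"),
    ("AB", "AB", "Case"),
    ("AB", "AB", "Control"),
]
table = train_full_logic(training)
```

With two SNPs the table has at most 9 cells and with three SNPs at most 27, which is why the number of SNPs must stay small relative to the sample size.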


Estimation of error rate for 2 SNPs from data – Discrete Full logic

Train      S1       S2    S3       S4    S5    S6    S7    S7
SNP1       AA       AA    AB       AA    BB    BB    AA    AA
SNP2       AB       AA    AA       AA    AA    AA    AB    AB
Phenotype  Control  Case  Control  Case  Case  Case  Case  Control

Statistical inference of the optimal function, using multi-resolution to “generalize”

SNP1  SNP2  # Cases  # Control  Decision  Generalization
AA    AA    2        0          Case      Case
AA    AB    1        2          Control   Control
AA    BB    0        0          ???       Case
AB    AA    1        0          Case      Case
AB    AB    0        0          ???       Case
AB    BB    0        0          ???       Case
BB    AA    2        0          Case      Case
BB    AB    0        0          ???       Case
BB    BB    0        0          ???       Case

TEST       S8       S9    S10      S11   S12   S13
SNP1       AA       AA    AB       AA    BB    AA
SNP2       AB       AA    BB       AA    AB    AA
Phenotype  Control  Case  Control  Case  Case  Control

2 mistakes on 6 ⇒ 66% accuracy
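The test-set estimate can be reproduced by applying the generalized decision table to the six test samples. The snippet below is a small sketch: the table and samples are transcribed from the slide as read here (genotypes keyed as SNP1, SNP2), and the variable names are hypothetical.

```python
# Generalized decision table from the slide, keyed by (SNP1, SNP2).
generalized = {
    ("AA", "AA"): "Case", ("AA", "AB"): "Control", ("AA", "BB"): "Case",
    ("AB", "AA"): "Case", ("AB", "AB"): "Case", ("AB", "BB"): "Case",
    ("BB", "AA"): "Case", ("BB", "AB"): "Case", ("BB", "BB"): "Case",
}

# Test samples S8..S13 as (name, SNP1, SNP2, phenotype).
test_samples = [
    ("S8", "AA", "AB", "Control"),
    ("S9", "AA", "AA", "Case"),
    ("S10", "AB", "BB", "Control"),
    ("S11", "AA", "AA", "Case"),
    ("S12", "BB", "AB", "Case"),
    ("S13", "AA", "AA", "Control"),
]

# A mistake is any sample whose table lookup disagrees with its phenotype.
mistakes = [name for name, s1, s2, truth in test_samples
            if generalized[(s1, s2)] != truth]
accuracy = 1 - len(mistakes) / len(test_samples)
```

Both misclassified samples are controls that fall in cells labeled Case, so the hold-out error here is made up entirely of false positives.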


Example of Truth table for a real case

SNP 1  SNP 6  SNP 41  # Control (Freq. Corrected)  # Case (Freq. Corrected)  Predicted
AA     AA     AA      0 (0%)                       2 (0.3%)                  Case
AA     AA     AB      0 (0%)                       0 (0%)                    Case
AA     AA     BB      0 (0%)                       0 (0%)                    Case
AA     AB     AA      0 (0%)                       28 (3.5%)                 Case
AA     AB     AB      0 (0%)                       11 (1.4%)                 Case
AA     AB     BB      0 (0%)                       0 (0%)                    Case
AA     BB     AA      3 (0.8%)                     22 (2.8%)                 Case
AA     BB     AB      0 (0%)                       11 (1.4%)                 Case
AA     BB     BB      0 (0%)                       0 (0%)                    Case
AB     AA     AA      8 (2.2%)                     5 (0.6%)                  Control
AB     AA     AB      2 (0.5%)                     5 (0.6%)                  Case
AB     AA     BB      1 (0.3%)                     1 (0.1%)                  Control
AB     AB     AA      14 (3.8%)                    60 (7.6%)                 Case
AB     AB     AB      5 (1.4%)                     35 (4.4%)                 Case
AB     AB     BB      1 (0.3%)                     3 (0.4%)                  Case
AB     BB     AA      11 (3%)                      49 (6.2%)                 Case
AB     BB     AB      6 (1.6%)                     31 (3.9%)                 Case
AB     BB     BB      2 (0.5%)                     3 (0.4%)                  Control
BB     AA     AA      25 (6.8%)                    8 (1%)                    Control
BB     AA     AB      4 (1.1%)                     9 (1.1%)                  Case
BB     AA     BB      1 (0.3%)                     1 (0.1%)                  Control
BB     AB     AA      49 (13.2%)                   41 (5.2%)                 Control
BB     AB     AB      11 (3%)                      15 (1.9%)                 Control
BB     AB     BB      0 (0%)                       2 (0.3%)                  Case
BB     BB     AA      34 (9.2%)                    31 (3.9%)                 Control
BB     BB     AB      5 (1.4%)                     21 (2.7%)                 Case
BB     BB     BB      3 (0.8%)                     2 (0.3%)                  Control
Total                 185                          396


Feature Selection – Sets of 3 SNPs

[Figure: selected SNPs and error at Step N and at Step N+1]


Now The Issues

Why Balancing?


Why Balancing

The previous example shows 185 controls vs. 396 cases.

These numbers result from how the data sampling was done.

They do not represent population proportions.

But they affect the classifier design because of differences in the prior probabilities.


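[Figure: two panels showing the class 1 and class 2 distributions with different decision thresholds (axis marks at 0 and 2.9). One panel is annotated with 48% and 2.5%, Error = 6.6%; the other with 6.6% and 15%, Error = 14.2%]

• Using the threshold with the best error may yield undesirable FPR and FNR values.

• Changing the threshold provides “better” combined FPR and FNR at the cost of an increased error rate.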



Usual learning algorithms assume:

a) The goal is to minimize the error rate

b) Training data and application data have the same distribution

Error rate may not be the best goal for many problems.

Moreover, the proportions of the two classes in the training samples usually do not reflect their population probabilities (priors).

Even if the samples represent the population proportions, the classifier with the best error rate may be very bad regarding FPR and FNR.


Why is this bad?

The designed classifier is sub-optimal (red line)

The estimated error may not predict the true error (blue line)

• Fixed population

• Training data with different proportions of positive/negative samples

• Optimal classifiers for extreme cases have zero empirical error

• The optimal classifier is obtained when the training proportions (60%-40%) are similar to the population's

• The classifier trained on 50%-50% does not perform so badly

• In both cases the empirical estimate is a good estimator


Balancing

Several techniques to balance continuous data artificially:

Replicate samples from the smaller set (upsampling)

Remove samples from the larger set (downsampling)

Threshold adjustment (LDA / Trees / etc.)

Taking new samples (resampling)

These techniques change the empirical joint distribution while avoiding changes in the conditional distributions.
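The first two techniques can be sketched in a few lines. This is an illustrative helper (the name `balance_by_resampling` and its parameters are hypothetical) that equalizes two sample lists either by replicating random samples of the smaller class or by discarding random samples of the larger one.

```python
import random


def balance_by_resampling(class0, class1, mode="up", seed=0):
    """Return two equally sized sample lists.

    mode="up" replicates random samples of the smaller class (upsampling);
    mode="down" drops random samples of the larger class (downsampling).
    """
    rng = random.Random(seed)
    small, large = sorted((class0, class1), key=len)
    if mode == "up":
        extra = [rng.choice(small) for _ in range(len(large) - len(small))]
        small = small + extra
    elif mode == "down":
        large = rng.sample(large, len(small))
    else:
        raise ValueError("mode must be 'up' or 'down'")
    # Hand the lists back in the original (class0, class1) order.
    if len(class0) <= len(class1):
        return small, large
    return large, small


# Class sizes from the real example: 185 controls and 396 cases.
controls = list(range(185))
cases = list(range(1000, 1396))
up_controls, up_cases = balance_by_resampling(controls, cases, mode="up")
down_controls, down_cases = balance_by_resampling(controls, cases, mode="down")
```

Both variants are random, so repeated runs give slightly different empirical conditional distributions, and downsampling discards data.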


Balancing on discrete data

Unbalanced data: classifier design

SNP1  SNP2  # 0  # 1  Freq. 0  Freq. 1  Classif
AA    AA    85   19   0.1318   0.0295   0
AA    AB    0    19   0.0000   0.0295   0
AA    BB    45   8    0.0698   0.0124   0
AB    AA    90   3    0.1395   0.0047   0
AB    BB    90   15   0.1395   0.0233   0
AB    AB    50   19   0.0775   0.0295   0
BB    AA    60   16   0.0930   0.0248   1
BB    BB    10   12   0.0155   0.0186   0
BB    AB    95   9    0.1473   0.0140   0
Total       525  120  0.814    0.1860



              Unbalanced           Balanced
SNP1  SNP2    # 0     # 1     Ψ    # 0     # 1     ΨB
AA    AA      0.1318  0.0295  0    0.0810  0.0792  0
AA    AB      0.0000  0.0295  0    0.0000  0.0792  1
AA    BB      0.0698  0.0124  0    0.0429  0.0333  0
AB    AA      0.1395  0.0047  0    0.0857  0.0125  0
AB    BB      0.1395  0.0233  0    0.0857  0.0625  0
AB    AB      0.0775  0.0295  0    0.0476  0.0792  1
BB    AA      0.0930  0.0248  1    0.0571  0.0667  1
BB    BB      0.0155  0.0186  0    0.0095  0.0500  1
BB    AB      0.1473  0.0140  0    0.0905  0.0375  0
Total         0.814   0.1860       0.5000  0.5000

Balancing: each # 0 entry X becomes X*0.5/0.814 and each # 1 entry X becomes X*0.5/0.1860.

• Balancing without resampling.

• Changes in marginal distributions.

• Conditional distributions unchanged.

• Changes in the plug-in classifier and its error rates.

Ψ (unbalanced design): ERROR = 25%, FPR = 11%, FNR = 86%

ΨB (balanced design): ERROR = 34%, FPR = 23%, FNR = 45%
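The balancing step and the resulting plug-in classifier can be sketched directly on the cell counts. The code below is a minimal illustration under stated assumptions: the counts are listed in the order they appear on the slide (AA/AA cell first), the function names are hypothetical, and "plug-in" is taken to mean labeling a cell 1 wherever class 1 carries more probability mass than class 0.

```python
# Counts per genotype cell, AA/AA cell first, in slide order.
n0 = [85, 0, 45, 90, 90, 50, 60, 10, 95]   # class 0
n1 = [19, 19, 8, 3, 15, 19, 16, 12, 9]     # class 1


def balanced_joint(n0, n1):
    """Rescale each class to total mass 0.5 without touching P(x | class)."""
    t0, t1 = sum(n0), sum(n1)
    return [0.5 * c / t0 for c in n0], [0.5 * c / t1 for c in n1]


def plug_in(p0, p1):
    """Label a cell 1 where class 1 carries more mass than class 0."""
    return [1 if b > a else 0 for a, b in zip(p0, p1)]


def error_rates(psi, p0, p1):
    """Error, FPR and FNR of classifier psi under the joint (p0, p1)."""
    fp = sum(a for a, y in zip(p0, psi) if y == 1)
    fn = sum(b for b, y in zip(p1, psi) if y == 0)
    return fp + fn, fp / sum(p0), fn / sum(p1)


p0_bal, p1_bal = balanced_joint(n0, n1)
psi_b = plug_in(p0_bal, p1_bal)
error, fpr, fnr = error_rates(psi_b, p0_bal, p1_bal)
```

Under these assumptions the computed classifier matches the ΨB column, and the rates come out at roughly 34%, 23% and 45%, in line with the figures reported for the balanced design. No samples are replicated or discarded; only the two class columns are rescaled.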



• What happens if we exchange the classifiers (that is, if the prior distribution differs from the one assumed for design)?

              Balanced, with Ψ     Unbalanced, with ΨB
SNP1  SNP2    # 0     # 1     Ψ    # 0     # 1     ΨB
AA    AA      0.0810  0.0792  0    0.1318  0.0295  0
AA    AB      0.0000  0.0792  0    0.0000  0.0295  1
AA    BB      0.0429  0.0333  0    0.0698  0.0124  0
AB    AA      0.0857  0.0125  0    0.1395  0.0047  0
AB    BB      0.0857  0.0625  0    0.1395  0.0233  0
AB    AB      0.0476  0.0792  0    0.0775  0.0295  1
BB    AA      0.0571  0.0667  1    0.0930  0.0248  1
BB    BB      0.0095  0.0500  0    0.0155  0.0186  1
BB    AB      0.0905  0.0375  0    0.1473  0.0140  0
Total         0.5000  0.5000       0.814   0.1860

ΨB on the unbalanced distribution: ERROR = 26% (against 25% for Ψ)

Ψ on the balanced distribution: ERROR = 49% (against 34% for ΨB)

• The balanced classifier ΨB is more robust against a wrong prior distribution!
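The exchange experiment amounts to evaluating each classifier under the prior it was not designed for. The sketch below takes both classifiers as tabulated on the slides (they are not re-derived here) and uses hypothetical names; cell order is the slide order with the AA/AA cell first.

```python
# Class counts per genotype cell (AA/AA cell first).
n0 = [85, 0, 45, 90, 90, 50, 60, 10, 95]   # class 0
n1 = [19, 19, 8, 3, 15, 19, 16, 12, 9]     # class 1

# Classifiers as tabulated on the slides (taken as given).
psi = [0, 0, 0, 0, 0, 0, 1, 0, 0]      # Ψ, unbalanced design
psi_b = [0, 1, 0, 0, 0, 1, 1, 1, 0]    # ΨB, balanced design


def error(classifier, prior1):
    """Error of a classifier when class 1 has prior probability prior1."""
    fpr = sum(c for c, y in zip(n0, classifier) if y == 1) / sum(n0)
    fnr = sum(c for c, y in zip(n1, classifier) if y == 0) / sum(n1)
    return (1 - prior1) * fpr + prior1 * fnr


sample_prior = sum(n1) / (sum(n0) + sum(n1))   # about 0.186
results = {
    "psi, sample prior": error(psi, sample_prior),
    "psi_b, sample prior": error(psi_b, sample_prior),
    "psi, balanced prior": error(psi, 0.5),
    "psi_b, balanced prior": error(psi_b, 0.5),
}
```

The four values come out near 0.25, 0.27, 0.49 and 0.34. They agree with the slide up to rounding (the slide reports 26% for ΨB under the sample prior): ΨB loses about two points when the prior is wrong, while Ψ loses about fifteen.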


Balancing - Simulations

Balancing samples allows for a classifier design that is independent of the proportion of samples between cases and controls.

• Same example as before (same joint point-label distribution)

• Before training the classifier, samples are “balanced” artificially. We always obtain the same classifier.

• The error predicted from the training samples still does not correctly predict the true error (but it is bounded by FPR and FNR)

• Even with the correct proportions (60% class 0), the designed classifier does not reach the optimal error (because of balancing)


Other Advantages

When balancing the data, sampling from the same population generates the same classifier regardless of the number of samples drawn from each class!

Because, if the conditional probabilities are the same, after balancing we always reach the same joint distribution!

Therefore the error rate, FPR and FNR of the designed classifier do NOT depend on the proportion of samples used!
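The argument can be written out in one line. The notation is introduced here for illustration and is not from the slides: $n_{x,y}$ is the number of class-$y$ samples observed in genotype cell $x$, and $n_y$ is the total number of class-$y$ samples.

```latex
\begin{align*}
\hat{P}_B(x, y) \;=\; \tfrac{1}{2}\,\hat{P}(x \mid y)
                \;=\; \tfrac{1}{2}\,\frac{n_{x,y}}{n_y},
\qquad y \in \{0, 1\}.
\end{align*}
```

The balanced joint depends on the data only through the ratios $n_{x,y}/n_y$, which estimate $P(x \mid y)$ whatever the ratio $n_0 : n_1$ happens to be. Up to estimation noise, two samples with different class proportions therefore lead to the same balanced joint and the same plug-in classifier.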



Balancing tends to jointly minimize FPR and FNR

Because it minimizes Error = (FPR+FNR)/2

The designed classifier has a smaller range for the estimated error

[Figure panels: Non Balanced Design; Balanced Design]
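The expression Error = (FPR+FNR)/2 follows from the usual decomposition of the error over the two classes, with class 0 as the negative class and class 1 as the positive class:

```latex
\begin{align*}
\text{Error} &= P(0)\,\mathrm{FPR} + P(1)\,\mathrm{FNR} \\
             &= \tfrac{1}{2}\,\mathrm{FPR} + \tfrac{1}{2}\,\mathrm{FNR}
              = \frac{\mathrm{FPR} + \mathrm{FNR}}{2}
              \qquad \text{when } P(0) = P(1) = \tfrac{1}{2}.
\end{align*}
```

For any true prior $P(1) \in [0, 1]$ the error is a weighted average of FPR and FNR, so it always lies between the two rates. Keeping both rates moderate therefore narrows the range the true error can take.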



Balancing the data is equivalent to changing the threshold so as to obtain the closest values between FPR and FNR.

It is done without the need to search for the best threshold.

[Figure panels: Threshold Change; Balancing]


Example of Application

Victor L. Boyartchuk, Karl W. Broman, Rebecca E. Mosher, Sarah E.F. D’Orazio, Michael N. Starnbach & William F. Dietrich, “Multigenic control of Listeria monocytogenes susceptibility in mice”, Brief Communications, Nature Publishing Group, 2001.

Survival vs. No Survival Analysis

[Figure: heat map (color scale from -1 to 1) of Samples 1 to 120, grouped as Surv. and Not Surv., across the markers D1M451, D1M504, D1M113, D1M355, D2M148, D3M265, D4M251, D5M257, D5M83, D5M357, D5M91, D5M338, D5M168, D6M284, D6M39, D6M254, D6M339, D6M294, D7M105, D8M94, D9M106, D10M233, D11M327, D13M99, D13M106, D13M147, D15M239, D18M186; label: Selected SNPs]


Results

Balanced vs. Unbalanced classifier design

                 Error Rate  FPR    FNR
Classic Design   29.7%       50.9%  19.8%
Balanced Design  30.3%       51.4%  9.7%

Small increase in FPR but large reduction in FNR.

We expect this classifier to be more robust against varying prior proportions.


Conclusions

Data balancing is a necessary step to ensure proper classifier design under unknown priors P(case) and P(control).

The designed classifier is more robust against the unknown priors.

Data balancing for discrete data can be (and should be) easily applied via the estimated joint distribution, avoiding resampling.


Acknowledgments

UNMdP: Virginia Ballarin, Mariela Azul Gonzalez

INTA (Balcarce): Pablo Corva

FI-UNER: Inti Anabela Pagnuco

Agencia: FONCyT – PICT 2313

TGen: Edward Dougherty, Dietrich Stephan
