Metrics Matter: Examples from Binary and Multilabel Classification
TRANSCRIPT
Source: sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf
Metrics Matter: Examples from Binary and Multilabel Classification
Sanmi Koyejo
University of Illinois at Urbana-Champaign
Joint work with
B. Yan @UT Austin K. Zhong @UT Austin P. Ravikumar @CMU
N. Natarajan @MSR India I. Dhillon @UT Austin
Learning with complex metrics

Goal: Train a DNN to optimize F-measure.

F1 = 2TP / (2TP + FN + FP)

Candidate approaches:

Direct optimization: the F-measure is not an average, so naïve SGD is not valid; the sample F-measure is non-differentiable.

Mixed combinatorial optimization: e.g. the cutting-plane method (Joachims, 2005); may require exponential complexity; most statistical properties are unknown.

Convex lower bound: difficult to construct; most statistical properties are unknown.

Log-loss + thresholding: simple, the most common approach in practice, and it has good statistical properties!

Why does thresholding work?
The confusion matrix summarizes binary classifier mistakes

Y ∈ {0, 1} denotes labels and X ∈ X denotes instances; let (X, Y) ∼ P. The classifier is a map θ : X → {0, 1}.
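The four confusion-matrix entries can be tallied directly from labels and predictions; a minimal sketch (function name and data are illustrative, not from the talk):

```python
def confusion_matrix(y_true, y_pred):
    """Count the four outcomes for a binary classifier theta: X -> {0, 1}."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    return tp, fp, fn, tn

# toy example: 5 labeled instances
print(confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)
```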
Metrics trade off which kinds of mistakes are (most) acceptable

Case Study

A medical test determines that a patient has a 30% chance of having a fatal disease. Should the doctor treat the patient?

Choosing not to treat a sick patient (the test is a false negative) could lead to serious issues.

Choosing to treat a healthy patient (the test is a false positive) increases the risk of side effects.
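This tradeoff can be made concrete as an expected-cost decision rule; the weights below are hypothetical, chosen only to encode that a false negative is far costlier than a false positive:

```python
def should_treat(p_disease, w_fp=1.0, w_fn=10.0):
    """Treat when the expected cost of not treating (p * w_fn, a possible
    false negative) exceeds the expected cost of treating ((1 - p) * w_fp,
    a possible false positive).  The weights are illustrative."""
    return p_disease * w_fn > (1 - p_disease) * w_fp

print(should_treat(0.30))  # True: 0.3 * 10 = 3.0 > 0.7 * 1 = 0.7
```

With these weights the doctor treats at 30%; a different choice of w_fn/w_fp moves the decision boundary.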
Performance metrics

We express tradeoffs via a metric Φ : [0, 1]⁴ → R

Examples

Accuracy (fraction correct) = TP + TN

Error Rate = 1 − Accuracy = FP + FN

For the medical diagnosis example, consider the weighted error = w₁FP + w₂FN, where w₂ ≫ w₁

and many more:

Recall = TP / (TP + FN),   Precision = TP / (TP + FP),

Fβ = (1 + β²)TP / ((1 + β²)TP + β²FN + FP),   Jaccard = TP / (TP + FN + FP).
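These definitions translate directly into code; a small sketch with illustrative counts:

```python
def recall(tp, fp, fn, tn):
    return tp / (tp + fn)

def precision(tp, fp, fn, tn):
    return tp / (tp + fp)

def f_beta(tp, fp, fn, tn, beta=1.0):
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

def jaccard(tp, fp, fn, tn):
    return tp / (tp + fn + fp)

# counts from a hypothetical classifier: TP=8, FP=2, FN=4, TN=6
c = (8, 2, 4, 6)
print(recall(*c))     # 8/12
print(precision(*c))  # 8/10
print(f_beta(*c))     # 16/22 (F1)
print(jaccard(*c))    # 8/14
```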
No need for metrics if you can learn a "perfect" classifier!

When is a perfect classifier learnable?

the true mapping between inputs and labels is deterministic, i.e. there is no noise

the function class is sufficiently flexible (realizability) and the optimum is computable

we have sufficient data

In practice:

real-world uncertainty, e.g. hidden variables, measurement error

the true function is unknown, and optimization may be intractable

data are limited

Thus, in most realistic scenarios, all classifiers will make mistakes!
Utility & Regret

Population performance is measured via a utility

U(θ, P) = Φ(TP, FP, FN, TN)

We seek a classifier that maximizes this utility within some function class F.

The Bayes optimal classifier, when it exists, is given by:

θ∗ = argmax_{θ∈Θ} U(θ, P), where Θ = {f : X → {0, 1}}

The regret of the classifier θ is given by:

R(θ, P) = U(θ∗, P) − U(θ, P)
Towards analysis of the classification procedure

In practice P(X, Y) is unknown; instead we observe Dn = {(Xi, Yi) ∼ P}, i = 1, ..., n.

The classification procedure estimates a classifier θn | Dn.

Example

Empirical risk minimization via the SVM:

θn = sign( argmin_{f∈Hk} Σ_{(xi,yi)∈Dn} max(0, 1 − yi f(xi)) )
Consistency

Consider the sequence of classifiers {θn(x), n → ∞}.

A classification procedure is consistent when R(θn, P) → 0 as n → ∞, i.e. the procedure is eventually Bayes optimal.

Consistency is a desirable property: it implies stability of the classification procedure, and it is related to generalization performance.
Optimal Binary Classification with Decomposable Metrics
Consider the empirical accuracy:

ACC(θ, Dn) = (1/n) Σ_{(xi,yi)∈Dn} 1[yi = θ(xi)]

Observe that the classification problem

max_{θ∈F} ACC(θ, Dn)

is a combinatorial optimization problem; optimal classification is computationally hard for non-trivial F and Dn.
Bayes Optimal Classifier

Population Accuracy

E_{X,Y∼P}[ 1[Y=θ(X)] ]. Easy to show that θ∗(x) = sign(P(Y = 1|x) − 1/2).

Weighted Accuracy

E_{X,Y∼P}[ (1 − ρ) 1[Y=θ(X)=1] + ρ 1[Y=θ(X)=0] ]. Scott (2012) showed that θ∗(x) = sign(P(Y = 1|x) − ρ).
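A plug-in sketch of these thresholded Bayes rules (the η values are hypothetical):

```python
def bayes_classifier(eta, rho=0.5):
    """Plug-in rule: predict 1 iff P(Y=1|x) exceeds the threshold rho.
    rho = 0.5 is the accuracy-optimal rule; for the weighted accuracy,
    the optimal threshold is rho (Scott, 2012)."""
    return 1 if eta > rho else 0

etas = [0.1, 0.3, 0.6, 0.9]  # hypothetical values of P(Y=1|x)
print([bayes_classifier(e) for e in etas])             # [0, 0, 1, 1]
print([bayes_classifier(e, rho=0.25) for e in etas])   # [0, 1, 1, 1]
```

Lowering ρ trades false positives for false negatives, exactly as the weighted error suggests.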
Where do surrogates come from?

Observe that there is no need to estimate P; instead, optimize any surrogate loss function L(θ, Dn) where:

θn = sign( argmin_f L(f, Dn) ) → θ∗(x) as n → ∞

These are known as classification-calibrated surrogate losses (Bartlett et al., 2003; Scott, 2012).

Research can then focus on how to choose L and F to improve efficiency, sample complexity, robustness, ...

Surrogates are often chosen to be convex, e.g. the hinge loss or the logistic loss.
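The role of convex surrogates can be seen numerically: as functions of the margin m = y·f(x), both are convex upper bounds on the 0-1 loss (the base-2 scaling of the logistic loss is an assumed normalization so that it passes through 1 at m = 0):

```python
import math

def zero_one(m):   # the loss we care about; non-convex, non-differentiable
    return 1.0 if m <= 0 else 0.0

def hinge(m):      # convex surrogate behind the SVM
    return max(0.0, 1.0 - m)

def logistic(m):   # convex surrogate behind logistic regression
    return math.log2(1.0 + math.exp(-m))

# both surrogates dominate the 0-1 loss everywhere
for m in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert hinge(m) >= zero_one(m)
    assert logistic(m) >= zero_one(m)
print("surrogates upper-bound the 0-1 loss")
```

Minimizing a convex upper bound is tractable, and calibration guarantees the sign of the minimizer recovers θ∗.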
Non-decomposability

A common theme so far is decomposability, i.e. linearity w.r.t. the confusion matrix C: when Φ(C) = ⟨A, C⟩ for a fixed weight matrix A,

E[Φ(C)] = ⟨A, E[C]⟩ = Φ(E[C])

However, Fβ, Jaccard, AUC and other common utility functions are non-decomposable, i.e. non-linear w.r.t. C. This implies that the averaging trick is no longer valid:

E[Φ(C)] ≠ Φ(E[C])

This is the primary source of difficulty for analysis, optimization, ...
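The failure of the averaging trick for F1 can be checked exactly with two equally likely confusion matrices (the counts are illustrative):

```python
from fractions import Fraction

def f1(tp, fp, fn):
    return Fraction(2 * tp, 2 * tp + fp + fn)

# two equally likely confusion matrices (tp, fp, fn) for the same classifier
c1, c2 = (3, 1, 0), (1, 0, 3)

mean_of_f1 = (f1(*c1) + f1(*c2)) / 2               # E[F1(C)]
# F1 of the averaged counts; F1 is scale-invariant, so summing suffices
f1_of_mean = f1(*[a + b for a, b in zip(c1, c2)])  # F1(E[C])
print(mean_of_f1, f1_of_mean)  # 22/35 vs 2/3 -- not equal
```

Exact rational arithmetic makes the gap unambiguous: E[F1(C)] = 22/35 while F1(E[C]) = 2/3.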
Optimal Binary Classification with Non-decomposable Metrics
The unreasonable effectiveness of thresholding

Theorem (Koyejo et al., 2014; Yan et al., 2016)

Let ηx = P(Y = 1|X = x) and let U be differentiable w.r.t. the confusion matrix. Then there exists a δ∗ such that:

θ∗(x) = sign(ηx − δ∗)

is a Bayes optimal classifier almost everywhere.¹

The result does not require concavity of U or other "nice" properties.

¹ Condition: P(ηx = δ∗) = 0, easily satisfied e.g. when P(X) is continuous.
Proof Sketch

Let F = {f | f : X → [0, 1]} and Θ = {f | f : X → {0, 1}}.

Consider the relaxed problem:

θ∗F = argmax_{θ∈F} U(θ, P)

Show that the optimal "relaxed" classifier is θ∗F = sign(ηx − δ∗).

Observe that Θ ⊂ F, thus U(θ∗F, P) ≥ U(θ∗Θ, P).

As a result, θ∗F ∈ Θ implies that θ∗F ≡ θ∗Θ.
Some recovered and new results
Fβ (Ye et al., 2012), Monotonic metrics (Narasimhan et al., 2014)
Simulated examples

[Figure: Bayes optimal classifiers on a finite domain x ∈ {1, ..., 10}. Left panel: F1 = 2TP/(2TP + FP + FN), showing η(x), θ∗, and threshold δ∗ = 0.34. Right panel: Jaccard = TP/(TP + FP + FN), with threshold δ∗ = 0.39.]

Finite sample space X, so we can exhaustively search for θ∗.
Algorithm 1 (Koyejo et al., 2014)

Step 1: Conditional probability estimation. Estimate ηx via a proper loss (Reid and Williamson, 2010), then set

θδ(x) = sign(ηx − δ)

Step 2: Threshold search:

max_δ U(θδ, Dn)

This is one-dimensional and efficiently computable using exhaustive search (Sergeyev, 1998).

θδ is consistent.
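Step 2 can be sketched in a few lines; the probability estimates and labels below are illustrative stand-ins for the output of Step 1:

```python
def f1_score(y_true, y_pred):
    tp = sum(y and p for y, p in zip(y_true, y_pred))
    fp = sum((not y) and p for y, p in zip(y_true, y_pred))
    fn = sum(y and (not p) for y, p in zip(y_true, y_pred))
    d = 2 * tp + fp + fn
    return 2 * tp / d if d > 0 else 0.0

def threshold_search(etas, y_true, metric=f1_score):
    """Exhaustive 1-D threshold search: only the n estimated probabilities
    are candidate thresholds, so the search over delta is finite."""
    best_delta, best_u = 0.5, -1.0
    for delta in sorted(set(etas)):
        preds = [e >= delta for e in etas]
        u = metric(y_true, preds)
        if u > best_u:
            best_u, best_delta = u, delta
    return best_delta, best_u

etas   = [0.1, 0.2, 0.4, 0.6, 0.7, 0.9]  # estimated P(Y=1|x) from Step 1
labels = [0,   0,   1,   0,   1,   1]
print(threshold_search(etas, labels))  # best delta = 0.4, F1 = 6/7
```

Sorting the probabilities once makes the search O(n log n) in a careful implementation.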
Algorithm 2 (Koyejo et al., 2014)

Step 1: Weighted classifier estimation. For a classification-calibrated loss ℓδ (Scott, 2012),

fδ = argmin_{f∈F} Σ_{(xi,yi)∈Dn} ℓδ(f(xi), yi)

consistently estimates θδ(x) = sign(fδ(x)).

Step 2: Threshold search:

max_δ U(θδ, Dn)

θδ is consistent.
Algorithm 3 (Yan et al., 2016)

Under additional assumptions, U(θδ, P) is differentiable and strictly locally quasi-concave w.r.t. δ.

Online Algorithm: iteratively update

1. ηx via a proper loss (Reid and Williamson, 2010)

2. δt using normalized gradient ascent
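The normalized-ascent step can be illustrated on a toy one-dimensional utility; the shape of U below is assumed purely for illustration (the true U(θδ, P) is not available in closed form), with a peak placed at δ∗ = 0.34 to echo the simulated example:

```python
def U(delta):
    """Toy strictly locally quasi-concave (but non-concave) utility."""
    return 1.0 / (1.0 + abs(delta - 0.34) ** 0.5)

def normalized_ascent(delta=0.9, steps=200, h=1e-4):
    """In 1-D, the normalized gradient is just its sign; a diminishing
    step size lets the iterates settle near the maximizer."""
    for t in range(1, steps + 1):
        grad = (U(delta + h) - U(delta - h)) / (2 * h)  # finite difference
        if grad != 0:
            delta += (0.5 / t ** 0.5) * (1 if grad > 0 else -1)
    return delta

delta_hat = normalized_ascent()
print(abs(delta_hat - 0.34) < 0.1)  # True: iterates settle near delta*
```

Normalization matters because quasi-concavity bounds only the gradient's direction, not its magnitude.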
Online algorithm sample complexity

Let the η estimation error at step t be rt = ∫ |ηt − η| dµ. With an appropriately chosen step size,

R(θδt, P) ≤ C (Σ_{i=1}^{t} ri) / t

Example: Online logistic regression

The parameter converges at rate O(1/√n) via the averaged stochastic gradient algorithm (Bach, 2014). Thus, the online algorithm achieves O(1/√n) regret.
![Page 54: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/54.jpg)
Empirical Evaluation
![Page 55: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/55.jpg)
Datasets
| Dataset | default | news20 | rcv1 | epsilon | kdda | kddb |
|---|---|---|---|---|---|---|
| # features | 25 | 1,355,191 | 47,236 | 2,000 | 20,216,830 | 29,890,095 |
| # test | 9,000 | 4,996 | 677,399 | 100,000 | 510,302 | 748,401 |
| # train | 21,000 | 15,000 | 20,242 | 400,000 | 8,407,752 | 19,264,097 |
| % pos | 22% | 67% | 52% | 50% | 85% | 86% |
$\eta$ estimation: logistic regression and boosted trees

Baselines: threshold search (Koyejo et al., 2014), SVMperf, and STAMP/SPADE (Narasimhan et al., 2015)
![Page 56: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/56.jpg)
Batch algorithm
| Dataset / Metric | LR+Plug-in | LR+Batch | XGB+Plug-in | XGB+Batch |
|---|---|---|---|---|
| news20-Q-Mean | 0.948 (3.77s) | 0.948 (0.001s) | 0.874 (3.87s) | 0.875 (0.003s) |
| news20-H-Mean | 0.950 (3.70s) | 0.950 (0.003s) | 0.859 (3.61s) | 0.860 (0.003s) |
| news20-F1 | 0.949 (3.49s) | 0.948 (0.01s) | 0.872 (5.07s) | 0.874 (0.01s) |
| default-Q-Mean | 0.664 (14.3s) | 0.667 (0.19s) | 0.688 (13.7s) | 0.701 (0.22s) |
| default-H-Mean | 0.665 (12.1s) | 0.668 (0.17s) | 0.693 (12.4s) | 0.708 (0.18s) |
| default-F1 | 0.503 (14.2s) | 0.497 (0.19s) | 0.538 (16.2s) | 0.538 (0.15s) |
![Page 57: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/57.jpg)
Online Complex Metric Optimization (OCMO)
| Metric | Algorithm | RCV1 | Epsilon | KDD-A | KDD-B |
|---|---|---|---|---|---|
| F1 | OCMO | 0.952 (0.01s) | 0.804 (4.87s) | 0.934 (2.43s) | 0.941 (5.01s) |
| F1 | STAMP | 0.923 (14.44s) | 0.585 (133.23s) | - | - |
| F1 | SVMperf | 0.953 (1.72s) | 0.872 (20.39s) | - | - |
| H-Mean | OCMO | 0.964 (0.02s) | 0.891 (4.85s) | 0.764 (2.5s) | 0.733 (5.16s) |
| H-Mean | SPADE | 0.580 (15.74s) | 0.578 (135.26s) | - | - |
| H-Mean | SVMperf | 0.953 (1.72s) | 0.872 (20.39s) | - | - |
| Q-Mean | OCMO | 0.964 (0.01s) | 0.889 (4.87s) | 0.551 (2.11s) | 0.506 (4.27s) |
| Q-Mean | SPADE | 0.688 (15.83s) | 0.632 (136.46s) | - | - |
| Q-Mean | SVMperf | 0.950 (1.72s) | 0.872 (20.39s) | - | - |

'-' means the corresponding algorithm did not terminate within 100x the run time of OCMO.
![Page 58: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/58.jpg)
Figure: performance vs. run time for various online algorithms — (a) F1 measure on rcv1, (b) H-Mean on rcv1, (c) Q-Mean on rcv1.
![Page 59: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/59.jpg)
Optimal Multilabel Classification with Non-decomposable Averaged Metrics
![Page 60: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/60.jpg)
Multilabel Classification

Multiclass: only one class associated with each example.

Multilabel: multiple classes associated with each example.
![Page 61: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/61.jpg)
Applications
![Page 62: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/62.jpg)
The Multilabel Classification Problem

Inputs: $X \in \mathcal{X}$; Labels: $Y \in \mathcal{Y} = \{0, 1\}^M$ (with $M$ labels)

Classifier $\theta : \mathcal{X} \mapsto \mathcal{Y}$

Example: Hamming Loss

$U(\theta) = \mathbb{E}_{X,Y \sim \mathbb{P}} \left[ \sum_{m=1}^{M} \mathbb{1}[Y_m = \theta_m(X)] \right] = \sum_{m=1}^{M} \mathbb{P}(Y_m = \theta_m(X))$

Optimal Prediction for Hamming Loss

$\theta^*_m(x) = \operatorname{sign}\!\left( \mathbb{P}(Y_m = 1 \mid x) - \tfrac{1}{2} \right)$

Well-known convex surrogates apply, e.g. the hinge loss (Bartlett et al., 2006).
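The Hamming-optimal rule is plain per-label thresholding of the marginals at 1/2. A toy sketch (the function name is ours):

```python
def hamming_optimal(eta):
    """Bayes-optimal prediction under Hamming loss: threshold each
    marginal P(Y_m = 1 | x) at 1/2, independently for each label."""
    return [1 if p > 0.5 else 0 for p in eta]
```

Note that no label interactions enter the rule; each coordinate is decided from its own marginal.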
![Page 63: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/63.jpg)
Multilabel Confusion

Recall the binary confusion matrix. The same idea applies to multilabel classification, now indexed across both labels $m$ and examples $n$:

$C_{m,n} = \begin{pmatrix} \mathrm{TP}_{m,n} = \mathbb{1}[\theta_m(x^{(n)}) = 1,\, y^{(n)}_m = 1] & \mathrm{FP}_{m,n} = \mathbb{1}[\theta_m(x^{(n)}) = 1,\, y^{(n)}_m = 0] \\ \mathrm{FN}_{m,n} = \mathbb{1}[\theta_m(x^{(n)}) = 0,\, y^{(n)}_m = 1] & \mathrm{TN}_{m,n} = \mathbb{1}[\theta_m(x^{(n)}) = 0,\, y^{(n)}_m = 0] \end{pmatrix}$

We focus on linear-fractional metrics, e.g. Accuracy, $F_\beta$, Precision, Recall, Jaccard.
![Page 65: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/65.jpg)
Label Averaging

Most popular multilabel metrics are averaged metrics. Some notation: let $\eta_m(x) = \mathbb{P}(Y_m = 1 \mid x)$.

Macro-Averaging

Average over examples for each label:

$C_m = \frac{1}{N} \sum_{n=1}^{N} C_{m,n}, \qquad \Psi_{\mathrm{macro}} := \frac{1}{M} \sum_{m=1}^{M} \Psi(C_m).$

Bayes optimal classifier:

$\theta^*_m(x) = \operatorname{sign}(\eta_m(x) - \delta^*_m) \quad \forall m \in [M]$
![Page 69: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/69.jpg)
Instance Average

Average over labels for each example:

$C_n = \frac{1}{M} \sum_{m=1}^{M} C_{m,n}, \qquad \Psi_{\mathrm{instance}} := \frac{1}{N} \sum_{n=1}^{N} \Psi(C_n).$

Bayes optimal classifier:

$\theta^*_m(x) = \operatorname{sign}(\eta_m(x) - \delta^*) \quad \forall m \in [M]$

- Only the marginals $\eta_m(x)$ are required, i.e. label correlations have a weak effect on optimal classification.
- Note: the marginals may still be deterministically coupled across labels, e.g. via low-rank structure or a shared DNN representation.
- The threshold $\delta^*$ is shared across labels.
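A small self-contained sketch of the instance average with $\Psi = F_1$: pool the confusion counts over labels for each example, score each example, then average over examples (toy code, our naming):

```python
def instance_f1(Y, Yhat):
    """Instance average: pool the confusion counts C_{m,n} over the M
    labels for each example n, apply Psi (here F1) per example, then
    average the N per-example scores."""
    total = 0.0
    for y, yh in zip(Y, Yhat):
        tp = sum(1 for a, b in zip(y, yh) if a == 1 and b == 1)
        fp = sum(1 for a, b in zip(y, yh) if a == 0 and b == 1)
        fn = sum(1 for a, b in zip(y, yh) if a == 1 and b == 0)
        denom = 2 * tp + fp + fn
        total += 2 * tp / denom if denom else 0.0
    return total / len(Y)
```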
![Page 72: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/72.jpg)
Micro Average

Average over both examples and labels:

$C = \frac{1}{NM} \sum_{n=1}^{N} \sum_{m=1}^{M} C_{m,n}, \qquad \Psi_{\mathrm{micro}} := \Psi(C).$

Bayes optimal classifier:

$\theta^*_m(x) = \operatorname{sign}(\eta_m(x) - \delta^*) \quad \forall m \in [M]$

- The Bayes optimal classifier is identical to that for instance averaging.
- Only the marginals $\eta_m(x)$ are required, i.e. label correlations have a weak effect on optimal classification.
- The threshold $\delta^*$ is shared across labels.
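A small self-contained sketch of the micro average with $\Psi = F_1$: pool the confusion counts over all labels and examples first, then apply the metric once to the pooled counts (toy code, our naming):

```python
def micro_f1(Y, Yhat):
    """Micro average: pool the confusion counts C_{m,n} over all
    (label, example) pairs, then apply Psi (here F1) once."""
    tp = fp = fn = 0
    for y, yh in zip(Y, Yhat):
        for a, b in zip(y, yh):
            tp += 1 if a == 1 and b == 1 else 0
            fp += 1 if a == 0 and b == 1 else 0
            fn += 1 if a == 1 and b == 0 else 0
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Pooling before applying $\Psi$ is what makes the metric non-decomposable across labels, even though the optimal classifier still thresholds each marginal.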
![Page 77: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/77.jpg)
Simulated Micro-averaged F1

Figure: simulated micro-averaged $F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$ for two labels, plotting $\eta_0(x)$ and $\eta_1(x)$ over $x \in [1, 6]$ together with the optimal classifiers $\theta^*_0$ and $\theta^*_1$; both labels share the same optimal threshold $\delta^* = 0.40$.
![Page 78: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/78.jpg)
Empirical Evaluation
![Page 79: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/79.jpg)
| Dataset | BR (F1) | Plugin (F1) | Macro-Thres (F1) | BR (Jaccard) | Plugin (Jaccard) | Macro-Thres (Jaccard) |
|---|---|---|---|---|---|---|
| Scene | 0.6559 | 0.6847 | 0.6631 | 0.4878 | 0.5151 | 0.5010 |
| Birds | 0.4040 | 0.4088 | 0.2871 | 0.2495 | 0.2648 | 0.1942 |
| Emotions | 0.5815 | 0.6554 | 0.6419 | 0.3982 | 0.4908 | 0.4790 |
| Cal500 | 0.3647 | 0.4891 | 0.4160 | 0.2229 | 0.3225 | 0.2608 |

Table: Comparison of plug-in-estimator methods on multilabel F1 and Jaccard metrics. Reported values are the micro-averaged metric (F1 and Jaccard) computed on test data (with standard deviation over 10 random validation sets used for tuning thresholds). Plugin is consistent for micro-averaged metrics and consistently performs best across datasets.
![Page 80: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/80.jpg)
| Dataset | BR (F1) | Plugin (F1) | Macro-Thres (F1) | BR (Jaccard) | Plugin (Jaccard) | Macro-Thres (Jaccard) |
|---|---|---|---|---|---|---|
| Scene | 0.5695 | 0.6422 | 0.6303 | 0.5466 | 0.5976 | 0.5902 |
| Birds | 0.1209 | 0.1390 | 0.1390 | 0.1058 | 0.1239 | 0.1195 |
| Emotions | 0.4787 | 0.6241 | 0.6156 | 0.4078 | 0.5340 | 0.5173 |
| Cal500 | 0.3632 | 0.4855 | 0.4135 | 0.2268 | 0.3252 | 0.2623 |

Table: Comparison of plug-in-estimator methods on multilabel F1 and Jaccard metrics. Reported values are the instance-averaged metric (F1 and Jaccard) computed on test data (with standard deviation over 10 random validation sets used for tuning thresholds). Plugin is consistent for instance-averaged metrics and consistently performs best across datasets.
![Page 81: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/81.jpg)
| Dataset | BR (F1) | Plugin (F1) | Macro-Thres (F1) | BR (Jaccard) | Plugin (Jaccard) | Macro-Thres (Jaccard) |
|---|---|---|---|---|---|---|
| Scene | 0.6601 | 0.6941 | 0.6737 | 0.5046 | 0.5373 | 0.5260 |
| Birds | 0.3366 | 0.3448 | 0.2971 | 0.2178 | 0.2341 | 0.2051 |
| Emotions | 0.5440 | 0.6450 | 0.6440 | 0.3982 | 0.4912 | 0.4900 |
| Cal500 | 0.1293 | 0.2687 | 0.3226 | 0.0880 | 0.1834 | 0.2146 |

Table: Comparison of plug-in-estimator methods on multilabel F1 and Jaccard metrics. Reported values are the macro-averaged metric computed on test data (with standard deviation over 10 random validation sets used for tuning thresholds). Macro-Thres is consistent for macro-averaged metrics and is competitive in three out of four datasets. Though not consistent for macro-averaged metrics, Plugin achieves the best performance in three out of four datasets.
![Page 82: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/82.jpg)
Correlated Binary Decisions

The same procedure applies to more general correlated binary decisions evaluated with averaged metrics.

Figure: multi-scale models are used to identify hub genes of gene modules associated with a particular neural or behavioral trait of interest; the expression of single genes (or phenes) is perturbed and the outcomes measured (i.e. changes in neural network structure, gene module structure, etc.) to determine causal relationships.

Example application: point estimates of brain networks from posterior distributions.
![Page 83: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/83.jpg)
Conclusion
![Page 84: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/84.jpg)
Conclusion and open questions

Optimal classifiers for a large family of metrics have a simple threshold form $\operatorname{sign}(\mathbb{P}(Y = 1 \mid X) - \delta)$.

We proposed scalable algorithms for consistent estimation.

Open Questions:

- Can we elicit utility functions from feedback?
- Can we characterize the entire family of utility metrics with thresholded optimal decision functions?
- What about more general structured prediction?
![Page 90: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/90.jpg)
References
![Page 91: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/91.jpg)
Francis R. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15(1):595-627, 2014.

Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Large margin classifiers: Convex loss, low noise, and convergence rates. In NIPS, pages 1173-1180, 2003.

Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.

Elad Hazan, Kfir Levy, and Shai Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems, pages 1585-1593, 2015.

Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning, pages 377-384. ACM, 2005.

Oluwasanmi O. Koyejo, Nagarajan Natarajan, Pradeep K. Ravikumar, and Inderjit S. Dhillon. Consistent binary classification with generalized performance metrics. In Advances in Neural Information Processing Systems, pages 2744-2752, 2014.

Harikrishna Narasimhan, Rohit Vaish, and Shivani Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems, pages 1493-1501, 2014.

Harikrishna Narasimhan, Purushottam Kar, and Prateek Jain. Optimizing non-decomposable performance measures: A tale of two classes. In 32nd International Conference on Machine Learning (ICML), 2015.

Mark D. Reid and Robert C. Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387-2422, 2010.

Clayton Scott. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958-992, 2012.

Yaroslav D. Sergeyev. Global one-dimensional optimization using smooth auxiliary functions. Mathematical Programming, 81(1):127-146, 1998.

Bowei Yan, Kai Zhong, Oluwasanmi Koyejo, and Pradeep Ravikumar. Online classification with complex metrics. arXiv:1610.07116, 2016.

Nan Ye, Kian Ming A. Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing F-measures: A tale of two approaches. In Proceedings of the International Conference on Machine Learning, 2012.
![Page 92: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/92.jpg)
Backup Slides
![Page 93: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/93.jpg)
Two-Step Normalized Gradient Descent for Optimal Threshold Search

1: Input: training sample $\{X_i, Y_i\}_{i=1}^{n}$, utility measure $U$, conditional probability estimator $\hat{\eta}$, step size $\alpha$.
2: Randomly split the training sample into two subsets $\{X^{(1)}_i, Y^{(1)}_i\}_{i=1}^{n_1}$ and $\{X^{(2)}_i, Y^{(2)}_i\}_{i=1}^{n_2}$;
3: Estimate $\hat{\eta}$ on $\{X^{(1)}_i, Y^{(1)}_i\}_{i=1}^{n_1}$.
4: Initialize $\delta = 0.5$;
5: while not converged do
6: Evaluate $\mathrm{TP}, \mathrm{TN}$ on $\{X^{(2)}_i, Y^{(2)}_i\}_{i=1}^{n_2}$ with $f(x) = \operatorname{sign}(\hat{\eta} - \delta)$.
7: Calculate $\nabla U$;
8: $\delta \leftarrow \delta - \alpha \frac{\nabla U}{\|\nabla U\|}$.
9: end while
10: Output: $f(x) = \operatorname{sign}(\hat{\eta} - \delta)$.
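A rough executable sketch of the threshold-search loop above, with empirical F1 as $U$ and a central finite difference standing in for the analytic gradient $\nabla U$ (that substitution, the function names, and the default constants are our simplifications, not the listed algorithm):

```python
def empirical_f1(scores, labels, delta):
    # U evaluated at threshold delta: empirical F1 of f(x) = sign(eta - delta)
    tp = sum(1 for s, y in zip(scores, labels) if s > delta and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > delta and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= delta and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def ngd_threshold(scores, labels, alpha=0.02, iters=50, h=0.05):
    """Normalized gradient ascent on delta: take fixed-size steps of
    length alpha in the direction of the gradient sign (the 1-D form
    of grad / ||grad||), using a finite-difference gradient estimate."""
    delta = 0.5
    for _ in range(iters):
        grad = (empirical_f1(scores, labels, delta + h)
                - empirical_f1(scores, labels, delta - h)) / (2 * h)
        if grad == 0:
            break
        delta += alpha * (1 if grad > 0 else -1)
    return delta
```

Because the step length is fixed, the iterate eventually oscillates in a small neighborhood of the maximizing threshold rather than converging exactly, which matches the fixed-step normalized-gradient behavior.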
![Page 94: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/94.jpg)
Online Complex Metric Optimization (OCMO)

Require: online CPE with update $g$, metric $U$, step size $\alpha$;
1: Initialize $\hat{\eta}_0$, $\delta_0 = 0.5$;
2: while the data stream has points do
3: Receive data point $(x_t, y_t)$
4: $\hat{\eta}_t = g(\hat{\eta}_{t-1})$;
5: $\delta^{(0)}_t = \delta_t$, $\mathrm{TP}^{(0)}_t = \mathrm{TP}_{t-1}$, $\mathrm{TN}^{(0)}_t = \mathrm{TN}_{t-1}$;
6: for $i = 1, \dots, T_t$ do
7: if $\hat{\eta}_t(x_t) > \delta^{(i-1)}_t$ then
8: $\mathrm{TP}^{(i)}_t \leftarrow \frac{\mathrm{TP}_{t-1} \cdot (t-1) + (1+y_t)/2}{t}$, $\mathrm{TN}^{(i)}_t \leftarrow \mathrm{TN}_{t-1} \cdot \frac{t-1}{t}$;
9: else $\mathrm{TP}^{(i)}_t \leftarrow \mathrm{TP}_{t-1} \cdot \frac{t-1}{t}$, $\mathrm{TN}^{(i)}_t \leftarrow \frac{\mathrm{TN}_{t-1} \cdot t + (1-y_t)/2}{t+1}$;
10: end if
11: $\delta^{(i)}_t = \delta^{(i-1)}_t - \alpha \frac{\nabla G(\mathrm{TP}_t, \mathrm{TN}_t)}{\|\nabla G(\mathrm{TP}_t, \mathrm{TN}_t)\|}$, $\mathrm{TP}_t = \mathrm{TP}^{(i)}_t$, $\mathrm{TN}_t = \mathrm{TN}^{(i)}_t$;
12: end for
13: $\delta_{t+1} = \delta^{(T_t)}_t$;
14: $t = t + 1$;
15: end while
16: Output $(\hat{\eta}_t, \delta_t)$.
![Page 95: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/95.jpg)
Scaling up Classification with Complex Metrics
![Page 96: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/96.jpg)
Additional properties of U

Informal theorem (Yan et al., 2016)

Suppose $U$ is fractional-linear or monotonic. Then, under weak conditions on $P$ (namely, $\eta_x$ is differentiable w.r.t. $x$ and its characteristic function is absolutely integrable):

- $U(\theta_\delta, P)$ is differentiable w.r.t. $\delta$
- $U(\theta_\delta, P)$ is Lipschitz w.r.t. $\delta$
- $U(\theta_\delta, P)$ is strictly locally quasi-concave w.r.t. $\delta$
![Page 97: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/97.jpg)
Algorithms

Normalized Gradient Descent (Hazan et al., 2015)

Fix $\epsilon > 0$, let $f$ be strictly locally quasi-convex, and let $x^* \in \operatorname{argmin}_x f(x)$. The NGD algorithm with number of iterations $T \ge \kappa^2 \|x_1 - x^*\|^2 / \epsilon^2$ and step size $\eta = \epsilon / \kappa$ achieves $f(x_T) - f(x^*) \le \epsilon$.

Batch Algorithm

1. Estimate $\hat{\eta}_x$ via a proper loss (Reid and Williamson, 2010)
2. Solve $\max_\delta U(\theta_\delta, \mathcal{D}_n)$ using normalized gradient ascent

Online Algorithm

Interleave the $\hat{\eta}_t$ update and the $\delta_t$ update.
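The NGD guarantee can be seen on a toy 1-D quasi-convex function; the fixed-step sign update below is the one-dimensional special case of the normalized gradient step $-\nabla f / \|\nabla f\|$ (function names and constants are our own illustrative choices):

```python
def ngd_1d(grad, x0, step, iters):
    """Normalized gradient descent in one dimension: each update moves
    a fixed distance `step` in the direction -grad/|grad|."""
    x = x0
    for _ in range(iters):
        g = grad(x)
        if g == 0:
            break
        x -= step if g > 0 else -step
    return x

# Minimize f(x) = (x - 2)^2, a strictly quasi-convex function:
# 300 fixed steps of 0.01 from x0 = 0 land within one step of x* = 2.
x_T = ngd_1d(lambda x: 2 * (x - 2), x0=0.0, step=0.01, iters=300)
```

The key point, matching the theorem, is that only the gradient direction is used, so the method is insensitive to how flat or steep the quasi-convex function is away from the minimum.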
![Page 98: Metrics Matter, Examples from Binary and Multilabel Classi ...sanmi.cs.illinois.edu/documents/koyejo-metrics.pdf · Metrics tradeo which kinds of mistakes are (most) acceptable Case](https://reader033.vdocuments.us/reader033/viewer/2022042401/5f0fece17e708231d446921a/html5/thumbnails/98.jpg)
Sample Complexity

Batch Algorithm

With an appropriately chosen step size, $R(\hat{\theta}_{\hat{\delta}}, \mathbb{P}) \le C \int |\hat{\eta} - \eta| \, d\mu$.

Comparison to threshold search

- The complexity of NGD is $O(nt) = O(n/\epsilon^2)$, where $t$ is the number of iterations and $\epsilon$ is the precision of the solution.
- When $\log n \ge 1/\epsilon^2$, the batch algorithm has favorable computational complexity vs. threshold search.

Online Algorithm

Let the $\eta$ estimation error at step $t$ be $r_t = \int |\hat{\eta}_t - \eta| \, d\mu$. With an appropriately chosen step size, $R(\theta_{\delta_t}, \mathbb{P}) \le C \, \frac{\sum_{i=1}^{t} r_i}{t}$.