Efficient Model Selection for Regularized Classification by Exploiting Unlabeled Data

Georgios Balikas, Ioannis Partalas, Eric Gaussier, Rohit Babbar, Massih-Reza Amini
University Grenoble Alpes; Viseo R&D; Max Planck Institute for Intelligent Systems
Intelligent Data Analysis 2015, Saint-Étienne

1 Introduction

2 Quantification

3 The proposed approach

4 Experiment Framework

5 Conclusion

Model selection for text classification



Documents $d_1, \dots, d_N \in \mathbb{R}^d$

Select $h_\theta \in \mathcal{H}$.

$\theta$: hyper-parameters

$R(\theta) \in \mathbb{R}$

$\theta$ ?

The task

Efficiently select the hyper-parameter value that minimizes the generalization error (using the empirical error as a proxy).

Traditional Model Selection Methods

Valid. Train Train Train Train

Train Valid. Train Train Train

Train Train Train Train Valid.

Figure : 5-fold Cross Validation

Train Valid.

Figure : Hold-out

Extensions of the above such as Leave-one-out, etc.

M. Mohri et al., Foundations of Machine Learning, MIT Press, 2012
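The two splitting schemes shown above can be sketched in a few lines of plain Python. This is only an illustration of how the index sets are formed; the function names `k_fold_indices` and `hold_out_indices` are hypothetical, and in practice the examples would be shuffled before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs, one per fold."""
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


def hold_out_indices(n_samples, train_fraction=0.7):
    """Single split: the first part trains, the remainder validates."""
    cut = round(n_samples * train_fraction)
    return list(range(cut)), list(range(cut, n_samples))


# 5-fold CV trains five models per hyper-parameter value; hold-out trains one.
folds = list(k_fold_indices(10, 5))
hold_out_train, hold_out_valid = hold_out_indices(10)
```

Each candidate hyper-parameter value is trained once per fold, which is where the factor k in the cost of k-fold cross-validation comes from.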

The issues

In large scale problems:

Resource intensive: $\sim 10^6$ to $10^8$ free parameters. Optimized k-CV can take up to several days.

Power-law distribution of examples. Only a few instances for small classes; splitting them results in loss of information.

Figure : Labeled documents per class

R. Babbar, I. Partalas, E. Gaussier, M-R. Amini, Re-ranking approach to classification in large-scale power-law distributed category systems, SIGIR 2014

Our contribution

We propose a bound that motivates efficient model selection.

Leverages unlabeled data for model selection

Performs on par with, if not better than, traditional methods

Is k times faster than k-fold cross-validation.

In many classification scenarios, the real goal is determining the prevalence of each class in the test set, a task called quantification.

Given a dataset:

– How many people liked the new iPhone?

– How many instances belong to class $y_i$?

A. Esuli and F. Sebastiani, Optimizing text quantifiers for multivariate loss functions, arXiv preprint arXiv:1502.05491

Quantification using general purpose learners

Classify and Count

Aggregative method

Classify each instance first

Count instances/class

Probabilistic Classify and Count

Non-aggregative method

Get scores/probabilities for each instance

Sum over probabilities/class

G. Forman, Counting positives accurately despite inaccurate classification, ECML 2005
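As a minimal sketch of the two quantifiers described above, the following plain-Python functions turn classifier outputs into class prevalence estimates. The function names and the toy posterior values are assumptions made for illustration; any classifier that outputs labels or per-class probabilities could feed them.

```python
from collections import Counter


def classify_and_count(predicted_labels, classes):
    """Prevalence of each class = fraction of instances assigned to it."""
    counts = Counter(predicted_labels)
    n = len(predicted_labels)
    return {c: counts[c] / n for c in classes}


def probabilistic_classify_and_count(probabilities, classes):
    """Prevalence of each class = mean of its predicted probabilities."""
    n = len(probabilities)
    return {c: sum(p.get(c, 0.0) for p in probabilities) / n for c in classes}


classes = ["pos", "neg"]
# Toy posteriors for four instances (illustrative values only).
probs = [
    {"pos": 0.9, "neg": 0.1},
    {"pos": 0.6, "neg": 0.4},
    {"pos": 0.4, "neg": 0.6},
    {"pos": 0.3, "neg": 0.7},
]
hard_labels = [max(p, key=p.get) for p in probs]

cc = classify_and_count(hard_labels, classes)
pcc = probabilistic_classify_and_count(probs, classes)
```

On this toy input the two estimates differ: Classify and Count only sees the hard decisions, while the probabilistic variant also uses how confident each decision was.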

Our setting

Mono-label, multi-class classification

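Observations $x \in \mathcal{X} \subseteq \mathbb{R}^d$, labels $y \in \mathcal{Y}$, $|\mathcal{Y}| > 2$

$(x, y)$ i.i.d. according to a fixed, unknown distribution $D$ over $\mathcal{X} \times \mathcal{Y}$

$S_{train} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, $S = \{x^{(i)}\}_{i=N+1}^{M}$

Regularized classification: $w = \arg\min_{w} R_{emp}(w) + \lambda \, \mathrm{Reg}(w)$

$h_\theta \in \mathcal{H}$; e.g., for SVMs, $\theta = \lambda$, taken from a set $\lambda_{values}$

$\hat{p}_y$, $\hat{p}^{C(S)}_y$: class prior estimated on $S_{train}$, and class prior estimated using quantification on $S$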

Accuracy bound


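Let $S = \{x^{(j)}\}_{j=1}^{M}$ be a set generated i.i.d. with respect to $D_X$, $p_y$ the true prior probability for category $y \in Y$, and $\frac{N_y}{N} \triangleq \hat{p}_y$ its empirical estimate obtained on $S_{train}$.

We consider here a classifier $C$ trained on $S_{train}$ and we assume that the quantification method used is accurate in the sense that:

$$\exists \epsilon,\ \epsilon \ll \min\{p_y, \hat{p}_y, \hat{p}^{C(S)}_y\},\ \forall y \in Y:\ \left| \hat{p}^{C(S)}_y - \frac{M^{C(S)}_y}{|S|} \right| \le \epsilon$$

Let $B_A^{C(S)}$ be defined as:

$$\frac{\sum_{y \in Y} \min\{\hat{p}_y \times |S|,\ \hat{p}^{C(S)}_y \times |S|\}}{|S|} \triangleq B_A^{C(S)}$$

Then for any $\delta \in ]0, 1]$, with probability at least $(1 - \delta)$:

$$A^{C(S)} \le B_A^{C(S)} + |Y| \left( \sqrt{\frac{\log |Y| + \log \frac{1}{\delta}}{2N}} + \epsilon \right)$$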
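The additive term of the bound, the number of classes times the sum of a square-root deviation term and ε, can be evaluated directly. The sketch below assumes natural logarithms; the function name `bound_slack` and its argument names are hypothetical.

```python
import math


def bound_slack(num_classes, n_train, delta, eps):
    """Evaluate |Y| * (sqrt((log|Y| + log(1/delta)) / (2N)) + eps)."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in ]0, 1]")
    deviation = math.sqrt(
        (math.log(num_classes) + math.log(1.0 / delta)) / (2.0 * n_train)
    )
    return num_classes * (deviation + eps)


# The term shrinks as the number of training examples N grows.
slack_small_n = bound_slack(num_classes=10, n_train=1_000, delta=0.05, eps=0.0)
slack_large_n = bound_slack(num_classes=10, n_train=100_000, delta=0.05, eps=0.0)
```

The term does not depend on the hyper-parameter value, so when candidate values are compared on the same data only the first term of the bound changes.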

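$$B_A^{C(S)} \triangleq \frac{\sum_{y \in Y} \min\{\hat{p}_y \times |S|,\ \hat{p}^{C(S)}_y \times |S|\}}{|S|}$$

$\hat{p}_y$: prior prob. of $y$; $\hat{p}^{C(S)}_y$: estimated prob. of $y$ on $S$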

In power-law distributed category systems this is an upper bound:

– $\hat{p}_y$ will be used for large classes due to false positives, and

– $\hat{p}^{C(S)}_y$ will be used for small classes due to false negatives.
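A small numeric sketch may help to see which of the two quantities the minimum picks for each class. The function name `accuracy_bound_term` and the three-class toy priors are assumptions for illustration, not values from the experiments.

```python
def accuracy_bound_term(train_priors, quantified_priors, n_unlabeled):
    """Sum over classes of min(prior * |S|, quantified * |S|), divided by |S|."""
    total = 0.0
    for y, prior in train_priors.items():
        total += min(prior * n_unlabeled,
                     quantified_priors.get(y, 0.0) * n_unlabeled)
    return total / n_unlabeled


# Priors estimated on the training set (toy values).
train_priors = {"large": 0.7, "medium": 0.2, "small": 0.1}
# Prevalences estimated by quantification on S: the classifier
# over-predicts the large class and under-predicts the small ones.
quantified = {"large": 0.8, "medium": 0.15, "small": 0.05}

b_a = accuracy_bound_term(train_priors, quantified, n_unlabeled=1000)
```

Here the minimum selects the training prior for the over-predicted large class and the quantified prevalence for the under-predicted classes, giving 0.7 + 0.15 + 0.05. Since $|S|$ cancels, the value does not depend on the size of the unlabeled set.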

Model selection using the bound

Training Data

Estimate class priors

Quantification on unseen data


for $\lambda$ in $\lambda_{values}$ do
    Train on $S_{train}$
    Estimate $\hat{p}^{C(S)}_y$ on $S$
end for


Calculate the Bound

Select hyper-parameter value
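The procedure can be sketched as a single loop. The slides do not spell out the selection rule, so this sketch assumes that the value with the largest bound is kept; `train_fn` and `quantify_fn` are hypothetical callables standing in for the learner and the quantification method, and the toy table below replaces real training.

```python
def select_hyperparameter(lambda_values, train_priors, train_fn, quantify_fn):
    """Return (best_lambda, best_bound) over the candidate values.

    train_fn(lam)      -> model trained on the labeled set with value lam
    quantify_fn(model) -> dict class -> estimated prevalence on unlabeled data
    Assumption: the candidate with the largest bound term is selected.
    """
    best_lambda, best_bound = None, float("-inf")
    for lam in lambda_values:
        model = train_fn(lam)              # one training run per value
        estimated = quantify_fn(model)     # no labels needed here
        bound = sum(min(p, estimated.get(y, 0.0))
                    for y, p in train_priors.items())
        if bound > best_bound:
            best_lambda, best_bound = lam, bound
    return best_lambda, best_bound


# Toy stand-ins: the "model" is just lambda, and quantification is a lookup.
train_priors = {"a": 0.6, "b": 0.4}
toy_estimates = {
    0.01: {"a": 0.9, "b": 0.1},
    1.0: {"a": 0.65, "b": 0.35},
    100.0: {"a": 1.0, "b": 0.0},
}
best_lambda, best_bound = select_hyperparameter(
    [0.01, 1.0, 100.0], train_priors,
    train_fn=lambda lam: lam,
    quantify_fn=lambda model: toy_estimates[model],
)
```

Only one model is trained per candidate value, compared with k models per value in k-fold cross-validation.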

Dataset    #Training  #Quantification  #Test  #Features  #Parameters
dmoz250        1,542            2,401  1,023     55,610        13.9M
dmoz500        2,137            3,042  1,356     77,274        38.6M
dmoz1000       6,806           10,785  4,510    138,879       138.8M
dmoz1500       9,039           14,002  5,958    170,828       256.2M
dmoz2500      12,832           19,188  8,342    212,073       530.1M

– Similar experimental settings on Wikipedia data

– SVMs and logistic regression, $\lambda \in \{10^{-4}, \dots, 10^{4}\}$

– 5-CV, Hold-out (70%-30%), BoundUn, BoundTest

Results (1/2)

Figure : MaF measure optimization for wiki1500 for SVM (x-axis: λ values, $10^{-4}$ to $10^{3}$).

Results (2/2)

           BoundUn        BoundTest      Hold-out                          5-CV
Dataset    Acc    MaF     Acc    MaF     Acc             MaF               Acc    MaF
dmoz250    .8260  .6242   .8270  .6243   .8260 (±.0000)  .6242 (±.0000)    .8260  .6242
dmoz500    .7227  .5584   .7227  .5584   .7221 (±.0005)  .5558 (±.0022)    .7220  .5562
dmoz1000   .7302  .4883   .7302  .4892   .7301 (±.0001)  .4835 (±.0155)    .7299  .4883
dmoz1500   .7132  .4715   .7132  .4715   .6958 (±.0457)  .4065 (±.0998)    .7132  .4715
dmoz2500   .6352  .4301   .6350  .4306   .6350 (±.0001)  .3949 (±.0686)    .6352  .4301

wiki1500 for SVM on 4 cores: BoundUn (302 sec), 5-CV (1310 sec).

– Performs as well as or better than traditional model selection methods.

– Is k times faster than k-CV.

– It requires unlabeled data from the same distribution as the training data.

Thank you

This work is partially supported by the CIFRE N 28/2015 and by the LabEx PERSYVAL Lab ANR-11-LABX-0025.

