Margin-Sparsity Trade-off for the Set Covering Machine

ECML 2005

François Laviolette (Université Laval)
Mario Marchand (Université Laval)
Mohak Shah (Université d’Ottawa)
PLAN

- Margin-sparsity trade-off for sample-compressed classifiers
- The “classical” Set Covering Machine (classical SCM)
  - Definition
  - Tight risk bound and model selection
  - The learning algorithm
- The modified Set Covering Machine (SCM2)
  - Definition
  - A non-trivial margin-sparsity trade-off expressed by the risk bound
  - The learning algorithm
  - Empirical results
- Conclusions
The Sample Compression Framework

In the sample compression setting, each classifier is identified by two different sources of information:
- The compression set: an (ordered) subset of the training set
- A message string of additional information needed to identify a classifier

More precisely: in the sample compression setting, there exists a “reconstruction” function R that gives a classifier h = R(σ, S_i) when given a message string σ and a compression set S_i.
The Sample Compression Framework (2)

The examples are assumed i.i.d.

The risk (or generalization error) of a classifier h, noted R(h), is the probability that h misclassifies a new example.

The empirical risk on a training set S, noted R_S(h), is the frequency of errors of h on S.
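As a minimal illustration (a hypothetical helper, not from the talk), the empirical risk R_S(h) is just the frequency of errors of h on the sample S:

```python
import numpy as np

def empirical_risk(classifier, X, y):
    """Empirical risk R_S(h): the frequency of errors of h on the sample S = (X, y)."""
    predictions = np.array([classifier(x) for x in X])
    return float(np.mean(predictions != y))

# Toy usage: a trivial classifier that always predicts +1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1, 1, -1, -1])
always_positive = lambda x: 1
print(empirical_risk(always_positive, X, y))  # 0.5
```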
Examples of sample-compressed classifiers

- Set Covering Machines (SCM) [Marchand and Shawe-Taylor, JMLR 2002]
- Decision List Machines (DLM) [Marchand and Sokolova, JMLR 2005]
- Support Vector Machines (SVM)
- …
Margin-Sparsity Trade-off

There is a widespread belief that, in the sample compression setting, learning algorithms should somehow try to find a non-trivial margin-sparsity trade-off.

SVMs look for margin. Some efforts have been made to find sparser SVMs (Bennett (1999), Bi et al. (2003)), but this seems a difficult task.

SCMs look for sparsity. Forcing a classifier that is a conjunction of “geometric” Boolean features to have no training example within some distance of its decision surface seems a much easier task.

Moreover, we will see that in our setting, both sparsity and margin can be viewed as different forms of data compression.
The “Classical” Set Covering Machine (Marchand and Shawe-Taylor, 2002)

Construct the “smallest possible” conjunction of (Boolean-valued) features.

Each feature h is a ball identified by two training examples, the center (x_c, y_c) and the border point (x_b, y_b), and defined for any input example x as: h(x) = y_c if d(x, x_c) ≤ d(x_c, x_b), and h(x) = ¬y_c otherwise (so the border point sets the radius).

(Dually, one can construct the “smallest possible” disjunction of features, but we will only consider the conjunction case in this talk.)
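The ball feature above can be sketched as follows. This follows the definition in Marchand and Shawe-Taylor (2002), but details such as the small offset the original applies to the radius so the border point falls on the desired side are omitted; treat it as an illustrative sketch, not the paper's exact construction.

```python
import numpy as np

def ball_feature(x_center, y_center, x_border):
    """Boolean-valued ball feature of the classical SCM (sketch).
    The radius is the distance from the center to the border example;
    points inside the ball get the center's label, points outside
    the opposite label."""
    radius = np.linalg.norm(np.asarray(x_border) - np.asarray(x_center))
    def h(x):
        inside = np.linalg.norm(np.asarray(x) - np.asarray(x_center)) <= radius
        return y_center if inside else -y_center
    return h

# Usage: center at the origin with label +1, border point at distance 5
h = ball_feature(x_center=[0.0, 0.0], y_center=1, x_border=[3.0, 4.0])
print(h([1.0, 1.0]))   # 1  (inside the ball)
print(h([6.0, 0.0]))   # -1 (outside the ball)
```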
An Example of a “Classical”-SCM

[Figure: a 2-D training set of “+” and “−” examples; over several slides, balls are added one at a time until all “−” examples are covered.]
But the SCM is looking for sparsity!

[Figure: the same training set classified by a sparser SCM.]
A risk bound
For “classical” SCMs

If we choose the following prior, then Corollary 1 becomes a bound that almost expresses a symmetry between k and d: the message-string term P_M(Z_i)(σ) is small for the “classical” SCMs compared to the remaining terms, and likewise for ln(d+1).
Model selection by the bound

Empirical results showed that looking for an SCM that minimizes this risk bound is a slightly better model-selection strategy than the cross-validation approach.

The reasons are not totally clear:
- This bound is tight
- There is a symmetry between d and k ???
A Learning Algorithm for the “Classical” SCM

Ideally, we would like to find an SCM that minimizes the risk bound.

Unfortunately, this is NP-hard (at least).

We will therefore use a greedy heuristic based on the following observation.
When adding one ball at a time, a classification error on a “+” example cannot be fixed by adding other balls; but for a “−” example it is possible.

[Figure: a 2-D training set illustrating this observation.]
A Learning Algorithm for the “Classical” SCM (Marchand and Shawe-Taylor, 2002)

Define a list p_1, p_2, …, p_l and, for each such p (called the learning parameter), DO STEP 1.

STEP 1: Suppose i balls (B_p,0, B_p,1, …, B_p,i−1) have already been constructed by the algorithm. UNTIL every “−” example is correctly assigned by the SCM (B_p,0, B_p,1, …, B_p,i−1), DO: choose a new ball B_p,i that maximizes q_i − p · r_i, where
- q_i is the number of “−” examples correctly assigned by B_p,i but not by the SCM (B_p,0, B_p,1, …, B_p,i−1)
- r_i is the number of “+” examples not correctly assigned by B_p,i but correctly assigned by the SCM (B_p,0, B_p,1, …, B_p,i−1)
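The greedy loop of STEP 1 can be sketched as follows. To keep it self-contained, each candidate ball is abstracted as a pair of sets (negatives it classifies correctly, positives it misclassifies) — a simplification of the actual ball construction, not the paper's implementation.

```python
def greedy_scm(candidates, negatives, positives, p):
    """Greedy heuristic sketch for the classical SCM.
    Each candidate ball is a pair of sets:
      (negatives it classifies correctly, positives it misclassifies).
    At each step, pick the ball maximizing q - p*r, where q counts newly
    covered negatives and r counts newly misclassified positives."""
    chosen = []
    uncovered = set(negatives)   # negatives not yet correctly assigned
    ok_pos = set(positives)      # positives still correctly assigned
    remaining = list(candidates)
    while uncovered and remaining:
        def score(ball):
            covers, errs = ball
            return len(covers & uncovered) - p * len(errs & ok_pos)
        best = max(remaining, key=score)
        if score(best) <= 0:
            break                # no candidate improves the objective
        chosen.append(best)
        remaining.remove(best)
        uncovered -= best[0]
        ok_pos -= best[1]
    return chosen

# Toy usage: 4 negatives, 2 positives, 3 candidate balls
negs, poss = {1, 2, 3, 4}, {10, 11}
balls = [({1, 2}, set()), ({3, 4}, {10}), ({3}, set())]
print(greedy_scm(balls, negs, poss, p=0.5))
# small p tolerates the positive error; a large p avoids it at the
# cost of leaving a negative uncovered
```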
A Learning Algorithm for the “Classical” SCM (continued)

Among the resulting SCMs (one per value of p), OUTPUT the one that has the best risk bound.

Note: the algorithm can be adapted to a cross-validation approach.
SCM2: SCM with Radii Coded by Message Strings

In the “classical” SCM, centers and radii are defined by examples from the training set.

An alternative: code each radius value by a message string (but still use training examples to define the centers).

Objective: construct the “smallest possible” conjunction of balls, each of which has the “smallest possible” number of bits in the message string that defines its radius.
What kind of radius can be described with l bits?

Let us choose a scale R. Then with l = 0 bits, we can define the radius R/2.
What kind of radius can be described with l bits?

With the same scale R and l = 1 bit, we can define the radii R/4 and 3R/4.
What kind of radius can be described with l bits?

With l = 2 bits, we can define the radii R/8, 3R/8, 5R/8, and 7R/8.
More precisely

Given a parameter R (which will be our scale), the radius of any ball of an SCM2 will be coded by a pair (l, s) such that 0 < 2s − 1 < 2^(l+1). The code (l, s) means that the radius of the ball is (2s − 1)·R / 2^(l+1). Note that l is the number of bits of the radius code.

Thus the possible radius values for l = 2 are R/8, 3R/8, 5R/8 and 7R/8.
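The coding scheme above can be checked with a short sketch (the function names are illustrative, not from the talk); it reproduces the radius lists from the previous slides for l = 0, 1, 2:

```python
def radius_from_code(l, s, R=1.0):
    """Decode a radius pair (l, s) on scale R: radius = (2s - 1) * R / 2**(l + 1),
    subject to 0 < 2s - 1 < 2**(l + 1), i.e. s in 1..2**l.
    l is the number of bits used to specify the radius."""
    if not (0 < 2 * s - 1 < 2 ** (l + 1)):
        raise ValueError("invalid code: need 0 < 2s - 1 < 2^(l+1)")
    return (2 * s - 1) * R / 2 ** (l + 1)

def all_radii(l, R=1.0):
    """All radius values expressible with l bits on scale R."""
    return [radius_from_code(l, s, R) for s in range(1, 2 ** l + 1)]

print(all_radii(0))  # [0.5]                          -> R/2
print(all_radii(1))  # [0.25, 0.75]                   -> R/4, 3R/4
print(all_radii(2))  # [0.125, 0.375, 0.625, 0.875]   -> R/8, 3R/8, 5R/8, 7R/8
```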
[Figure: a 2-D training set with a large margin between the “+” and “−” examples.]

Observe that if we have a large margin, then among all the “interesting” balls there will be one whose radius code (l, s) uses a small number of bits.
For SCM2

If we choose the following priors, then Corollary 1 becomes a bound that expresses a non-trivial margin-sparsity trade-off!

The learning algorithm is similar to the classical one, except that it needs two extra learning parameters: R and the maximum number of bits allowed for the message strings (noted l*).
Empirical Results

SVMs and SCMs on UCI data sets. We observe:
- For SCMs, model selection by the bound is almost always better than by cross-validation
- SCM2 is almost always better than SCM1 (the classical SCM)
- SCM2 tends to produce more balls than SCM1; hence SCM2 sacrifices sparsity to obtain a larger margin
Conclusion

We have proposed:
- A new representation for the SCM that uses two distinct sources of information:
  - A compression set to represent the centers of the balls
  - A message string to encode the radius value of each ball
- A general data-compression risk bound that depends explicitly on these two information sources and exhibits a non-trivial trade-off between sparsity (the inverse of the compression set size) and margin (the inverse of the message length)
- The bound seems to be an effective guide for choosing the proper margin-sparsity trade-off of a classifier