VINCENZO RUSSO STATE-OF-THE-ART CLUSTERING TECHNIQUES: SUPPORT VECTOR METHODS AND MINIMUM BREGMAN INFORMATION PRINCIPLE
UNIVERSITÀ degli STUDI di NAPOLI FEDERICO II
Support Vector Methods and Minimum Bregman Information Principle
State-of-the-art Clustering Techniques
by
VINCENZO RUSSO
Supervisor: prof. Anna CORAZZA
Co-supervisor: prof. Ezio CATANZARITI
Introduction: what is clustering?

Clustering is unsupervised learning: it groups a set of objects into subsets called clusters. The objects are represented as points in a subspace of R^d, where d is the number of point components, also called attributes or features. There are several application domains: information retrieval, bioinformatics, cheminformatics, image retrieval, astrophysics, market segmentation, etc.

[Figure: non-structured data vs. a 3-clusters structure]
Goals: two state-of-the-art approaches

Support Vector Clustering (SVC) and Bregman Co-clustering.

Goal                                       Application domain
Robustness w.r.t. missing-valued data      Astrophysics
Robustness w.r.t. sparse data              Textual documents
Robustness w.r.t. high dimensionality      Textual documents
Robustness w.r.t. noise/outliers           Synthetic data

Other desirable properties: handling of nonlinearly separable problems; automatic detection of the number of clusters; independence from the application domain.
Support Vector Clustering: the idea

- A nonlinear mapping Φ takes the data from the data space to a higher dimensional feature space F.
- In F we find the Minimum Enclosing Ball (MEB), i.e. the sphere enclosing all feature-space images and having the minimum radius.
- Mapping back to the data space, the sphere splits into contours.
- The contours consist of Support Vectors (SVs) and describe the clusters.
6.1. INTRODUCTION TO SUPPORT VECTOR CLUSTERING
[Figure 6.1: A graphical example of a data mapping from input space X to feature space F using the MEB formulation.]
Unfortunately, finding the MEB is not enough. The process of finding such a sphere is only able to detect the cluster boundaries, which are modeled by mapping the sphere back to the data space. This first phase was called Cluster Description by Ben-Hur et al. (2001). A second stage for determining the membership of points to the clusters is needed. The authors called this step Cluster Labeling, even if it simply does a cluster assignment.⁵ In the following subsections we provide an overview of the Support Vector Clustering algorithm as originally proposed by Ben-Hur et al. (2001).
6.1.1 Cluster description

The SVM formulation for the smallest enclosing sphere was first presented to explain the Vapnik-Chervonenkis (VC) dimension (Vapnik, 1995). Later it was used for estimating the support of a high-dimensional distribution (Schölkopf et al., 1999). Finally, the MEB was used for the Support Vector Domain Description (SVDD), an SVM formulation for one-class classification (Tax, 2001; Tax and Duin, 1999a,b, 2004).⁶ The SVDD is the basic step of the SVC and allows describing the boundaries of the clusters. Let $X = \{\vec x_1, \vec x_2, \cdots, \vec x_n\}$ be a dataset of $n$ points, with $X \subseteq R^d$, the data space. We use a nonlinear transformation $\Phi: X \to F$ from the input space $X$ to some high dimensional feature space $F$, wherein we look for the smallest enclosing sphere of radius $R$. This is formalized as follows
⁵The name cluster labeling probably descends from the originally proposed algorithm, which is based on finding the connected components of a graph: the algorithms for finding the connected components usually assign the “component labels” to the vertices.
⁶An alternative SVM formulation for the same task, called One Class SVM, can be found in Schölkopf et al. (2000b) (see Appendix A).
$\Phi^{-1}: F \to X$
Phase I: Cluster description

Finding the Minimum Enclosing Ball (MEB) via the nonlinear Support Vector Domain Description (SVDD): a QP problem, with $O(n^3)$ worst-case computational complexity.
CHAPTER 6. SUPPORT VECTOR CLUSTERING
$$\min_{R,\vec a} R^2 \quad (6.1)$$

subject to $\|\Phi(\vec x_k) - \vec a\|^2 \le R^2$, $k = 1, 2, \cdots, n$,

where $\vec a$ is the center of the sphere. Soft constraints are incorporated by adding slack variables $\xi_k$

$$\min_{R,\vec a,\vec\xi} R^2 + C\sum_{k=1}^{n}\xi_k \quad (6.2)$$

subject to $\|\Phi(\vec x_k) - \vec a\|^2 \le R^2 + \xi_k$ and $\xi_k \ge 0$, $k = 1, 2, \cdots, n$.

To solve this problem we introduce the Lagrangian

$$L(R, \vec a, \vec\xi; \vec\beta, \vec\mu) = R^2 - \sum_k \left(R^2 + \xi_k - \|\Phi(\vec x_k) - \vec a\|^2\right)\beta_k - \sum_k \xi_k\mu_k + C\sum_k \xi_k \quad (6.3)$$

with Lagrange multipliers $\beta_k \ge 0$ and $\mu_k \ge 0$ for $k = 1, 2, \cdots, n$. The positive real constant $C$ provides a way to control the outlier percentage, allowing the sphere in feature space not to enclose all points of the target class. As in the case of supervised SVMs (see subsection 2.6.3), this could lead to some erroneous classifications, but it allows finding a solution to the problem. The solution is characterized by the saddle point of the Lagrangian

$$\max_{\vec\beta,\vec\mu}\ \min_{R,\vec a,\vec\xi} L(R, \vec a, \vec\xi; \vec\beta, \vec\mu) \quad (6.4)$$

Setting to zero the partial derivatives of $L$ with respect to $R$, $\vec a$ and $\xi_k$, respectively, leads to

$$\sum_k \beta_k = 1 \quad (6.5)$$

$$\vec a = \sum_k \beta_k \Phi(\vec x_k) \quad (6.6)$$

$$\beta_k = C - \mu_k \quad (6.7)$$

We still need to determine $R$. The KKT complementarity conditions⁷ (Boyd and Vandenberghe, 2004, sec. 5.5.3) result in

⁷It is a generalization of the method of the Lagrange multipliers. Also in this case, if the primal problem is a convex problem, strong duality holds.
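Eliminating $R$, $\vec a$ and $\vec\xi$ via Equations 6.5-6.7 turns the problem into a dual QP in the multipliers $\beta_k$ alone: maximize $\sum_k \beta_k K(\vec x_k, \vec x_k) - \sum_{k,l}\beta_k\beta_l K(\vec x_k, \vec x_l)$ subject to $\sum_k \beta_k = 1$ and $0 \le \beta_k \le C$. As a rough illustration (not the thesis implementation), this dual can be handed to a generic solver; `svdd_fit` and `gaussian_gram` are hypothetical names, and scipy's SLSQP routine stands in for a dedicated QP/SMO solver.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_gram(X, q):
    # Gaussian kernel K(x, y) = exp(-q * ||x - y||^2), for all pairs
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-q * d2)

def svdd_fit(X, q=1.0, C=1.0):
    """Solve the SVDD dual (illustrative sketch, generic NLP solver):
       maximize   sum_k b_k K(x_k, x_k) - b^T K b
       subject to sum(b) = 1,  0 <= b_k <= C."""
    n = len(X)
    K = gaussian_gram(X, q)
    neg_dual = lambda b: -(b @ np.diag(K) - b @ K @ b)
    res = minimize(neg_dual, np.full(n, 1.0 / n), method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints={"type": "eq", "fun": lambda b: b.sum() - 1.0})
    return res.x
```

Points with $0 < \beta_k < C$ come out as support vectors, points with $\beta_k = C$ as bounded support vectors, in line with the KKT conditions above.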
CHAPTER 6. SUPPORT VECTOR CLUSTERING
Definition 6.1 (Squared Feature Space Distance). Let $\vec x$ be a data point. We define the distance of its image in feature space $F$, $\Phi(\vec x)$, from the center of the sphere, $\vec a$, as follows

$$d_R^2(\vec x) = \|\Phi(\vec x) - \vec a\|^2 \quad (6.13)$$

In view of Equation 6.6 and the definition of the kernel, we have the kernelized version of the above distance

$$d_R^2(\vec x) = K(\vec x, \vec x) - 2\sum_{k=1}^{n} \beta_k K(\vec x_k, \vec x) + \sum_{k=1}^{n}\sum_{l=1}^{n} \beta_k \beta_l K(\vec x_k, \vec x_l) \quad (6.14)$$

Since the solution vector $\vec\beta$ is sparse, i.e. only the Lagrange multipliers associated to the support vectors are non-zero, we can rewrite the above equation as follows

$$d_R^2(\vec x) = K(\vec x, \vec x) - 2\sum_{k=1}^{n_{sv}} \beta_k K(\vec x_k, \vec x) + \sum_{k=1}^{n_{sv}}\sum_{l=1}^{n_{sv}} \beta_k \beta_l K(\vec x_k, \vec x_l) \quad (6.15)$$

where $n_{sv}$ is the number of support vectors. It is trivial to define the radius $R$ of the feature space sphere as the distance of any support vector from the center of the sphere

$$R = d_R(\vec x_k), \quad \forall \vec x_k \in SV \quad (6.16)$$

where $SV = \{\vec x_k \in X : 0 < \beta_k < C\}$ and $d_R(\cdot) = \sqrt{d_R^2(\cdot)}$. The contours that enclose the points in data space are defined by the level set⁸ of the function $d_R(\cdot)$

$$L_{d_R}(R) \equiv \{\vec x \mid d_R(\vec x) = R\} \quad (6.17)$$

Such contours are interpreted as forming cluster boundaries. In view of Equation 6.16, we can say that SVs lie on cluster boundaries, BSVs are outside, and all other points lie inside the clusters.

6.1.1.1 The kernel choice

All works⁹ about the Support Vector Clustering assume the Gaussian kernel (see subsubsection 2.6.1.1)

$$K(\vec x, \vec y) = e^{-q\|\vec x - \vec y\|^2} \overset{(q = 1/2\sigma^2)}{=} e^{-\frac{\|\vec x - \vec y\|^2}{2\sigma^2}} \quad (6.18)$$

as the only kernel available. There are several motivations for the exclusive employment of this kernel. The main reason is that Tax (2001) has investigated only two kernels for the SVDD, the polynomial and the Gaussian ones. The result

⁸More on level sets in Gray (1997).
⁹Actually, in one of the preliminary papers about SVC a Laplacian kernel was also employed (Ben-Hur et al., 2000a).
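Equation 6.15 translates almost directly into code. The sketch below (function names are mine, not the thesis') evaluates the kernelized squared distance for the Gaussian kernel, for which $K(\vec x, \vec x) = 1$.

```python
import numpy as np

def gaussian(u, v, q):
    # K(u, v) = exp(-q * ||u - v||^2)
    return np.exp(-q * np.sum((u - v) ** 2))

def sq_feature_dist(x, sv, beta, q):
    """d_R^2(x) of Eq. 6.15, using only the support vectors `sv`
    and their non-zero multipliers `beta`."""
    k_xx = 1.0                                   # Gaussian kernel: K(x, x) = 1
    k_xs = np.array([gaussian(s, x, q) for s in sv])
    k_ss = np.array([[gaussian(a, b, q) for b in sv] for a in sv])
    return k_xx - 2.0 * beta @ k_xs + beta @ k_ss @ beta
```

As a sanity check, a point coinciding with a single support vector carrying $\beta = 1$ has distance 0, consistent with Equation 6.13.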
The kernel needs to be normalized; the Gaussian kernel is the most widely used.
2.6. SUPPORT VECTOR MACHINES
[Figure 2.3: A nonlinearly separable problem (classes -1 and +1) in the data space X that becomes linearly separable in the feature space F.]
In polynomial kernels, the parameter $k$ is the degree. In the exponential-based kernels (and others), the parameter $q$ is called the kernel width. The kernel width has a different mathematical meaning depending on the kernel: in the case of the Gaussian kernel, it is a function of the variance

$$q = \frac{1}{2\sigma^2}$$

Kernel width¹³ is a general term indicating the scale at which the data is probed. Because of the aforementioned mathematical differences, a kernel width that is proper for one kernel function may not be proper for another kernel in the context of the same problem.
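The relation $q = 1/(2\sigma^2)$ means the two common parameterizations of the Gaussian kernel give identical values; a quick numerical check (helper names are mine):

```python
import math

def gaussian_q(dist2, q):
    # width expressed as q: K = exp(-q * ||x - y||^2)
    return math.exp(-q * dist2)

def gaussian_sigma(dist2, sigma):
    # width expressed as sigma: K = exp(-||x - y||^2 / (2 * sigma^2))
    return math.exp(-dist2 / (2.0 * sigma ** 2))

sigma = 2.0
q = 1.0 / (2.0 * sigma ** 2)   # q = 0.125
assert abs(gaussian_q(3.0, q) - gaussian_sigma(3.0, sigma)) < 1e-15
```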
2.6.2 Maximal margin classifier

The margin concept is a first important step towards understanding the formulation of Support Vector Machines (SVMs). As we can see in Figure 2.1, there exist several separating hyperplanes that separate the data in two classes, and some classifiers (such as the perceptron) just find one of them as the solution. On the other hand, we can also define a unique separating hyperplane according to some criterion. The SVMs in particular define the criterion to be looking for a decision surface that is maximally far away from any data point. This distance from the decision surface to the closest data point determines the margin of the classifier. An SVM is designed to maximize the margin around the separating hyperplane. This necessarily means that the decision function for an SVM is fully specified by a usually small subset of the data which defines the position of the separator. These points are referred to as the support vectors. The data points other than support vectors play no part in determining the decision surface.
13As we will see in the sequel, the kernel width is one of the SVM training hyper-parameters.
CHAPTER 2. MACHINE LEARNING ESSENTIALS
Theorem 2.4 Let $X \subseteq R$ be a non-empty set such that $X = \{\vec x_1, \vec x_2, \cdots, \vec x_n\}$. Let $\kappa: X \times X \to R$ be a positive semidefinite kernel. The following condition holds

$$\forall \vec x \in X, \quad \kappa(\vec x, \vec x) \ge 0$$

Theorem 2.5 Let $X \subseteq R$ be a non-empty set such that $X = \{\vec x_1, \vec x_2, \cdots, \vec x_n\}$. For any positive definite kernel $\kappa: X \times X \to R$ the following inequality holds

$$|\kappa(\vec x_i, \vec x_j)|^2 \le \kappa(\vec x_i, \vec x_i)\,\kappa(\vec x_j, \vec x_j) \quad (2.5)$$

where $i, j = 1, 2, \cdots, n$ and $i \ne j$.

Now we introduce the result that allows using Mercer kernels to compute inner products.

Theorem 2.6 Let $K$ be a symmetric function such that for all $\vec x_i, \vec x_j \in X = \{\vec x_1, \vec x_2, \cdots, \vec x_n\} \subseteq R$

$$K(\vec x_i, \vec x_j) = \Phi(\vec x_i) \cdot \Phi(\vec x_j) \quad (2.6)$$

where $\Phi: X \to F$ and $F$ is a higher dimensional inner product space, called the Feature Space. The function $K$ can be represented in terms of Equation 2.6 if and only if the Gram matrix $G = (K(\vec x_i, \vec x_j))_{i,j=1,2,\cdots,n}$ is positive semidefinite, i.e. $K(\cdot)$ is a Mercer kernel.

The kernel function $K(\cdot)$ defines an explicit mapping if $\Phi$ is known; otherwise the mapping is said to be implicit. In the majority of cases, the function $\Phi$ is unknown, so we implicitly perform an inner product in the feature space $F$ by means of a kernel $K$. Using nonlinear kernel transformations, we have a chance to transform a non-separable problem in data space into a separable one in feature space (see Figure 2.3).

2.6.1.1 Valid Mercer kernels in $R^n$ subspaces

There are several functions which are known to satisfy Mercer’s condition for $X \subseteq R^n$. Some of them are

- Linear kernel: $K(\vec x, \vec y) = \vec x \cdot \vec y$
- Polynomial kernel: $K(\vec x, \vec y) = (\vec x \cdot \vec y + r)^k$, $r \ge 0$, $k \in N$
- Gaussian kernel: $K(\vec x, \vec y) = e^{-q\|\vec x - \vec y\|^2}$, $q > 0$
- Exponential kernel: $K(\vec x, \vec y) = e^{-q\|\vec x - \vec y\|}$, $q > 0$
- Laplacian kernel: $K(\vec x, \vec y) = e^{-q|\vec x - \vec y|}$, $q > 0$

where $\|\cdot\|$ is the $L_2$ metric (also known as the Euclidean distance) and $|\cdot|$ is the $L_1$ metric (also known as the Manhattan distance). In addition, we can build other kernel functions: given two Mercer kernels $K_1$ and $K_2$, we can construct new Mercer kernels by properly combining $K_1$ and $K_2$ (Cristianini and Shawe-Taylor, 2000, sec. 3.3.2).
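Theorem 2.6 can be probed empirically: for a Mercer kernel, every Gram matrix built from it must be positive semidefinite. A small check with the Gaussian kernel (illustrative code, helper names are mine):

```python
import numpy as np

def gaussian(u, v, q):
    # Gaussian (Mercer) kernel: K(u, v) = exp(-q * ||u - v||^2)
    return np.exp(-q * np.sum((u - v) ** 2))

def gram_matrix(X, kernel):
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 3))
G = gram_matrix(X, lambda u, v: gaussian(u, v, 0.5))
# Mercer condition: all eigenvalues of the symmetric Gram matrix are >= 0
eigvals = np.linalg.eigvalsh(G)
assert eigvals.min() > -1e-10
```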
Algorithm 5 The general execution scheme for the SVC
 1: procedure SVC(X)
 2:     q ← initial kernel width
 3:     C ← initial soft constraint
 4:     while stopping criterion is not met do
 5:         β ← clusterDescription(X, q, C)    ▷ Find the MEB via SVDD
 6:         results ← clusterLabeling(X, β)
 7:         choose new q and/or C
 8:     end while
 9:     return results
10: end procedure
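A minimal Python sketch of Algorithm 5. `cluster_description` and `cluster_labeling` are passed in as callables because the scheme leaves them as sub-procedures (Phase I and Phase II); the stopping rule used here, halting when the cluster count stabilizes, is only a stand-in for the criteria discussed later, and all names are mine.

```python
def svc_scheme(X, cluster_description, cluster_labeling, widths, C):
    """Iterate Phase I (SVDD) and Phase II (labeling) over candidate
    kernel widths, stopping when the number of clusters stops changing."""
    results, prev_k = None, None
    for q in widths:
        beta = cluster_description(X, q, C)    # Phase I: find the MEB
        results = cluster_labeling(X, beta)    # Phase II: assign clusters
        k = len(set(results))
        if k == prev_k:                        # stand-in stopping criterion
            break
        prev_k = k
    return results
```

Passing the two phases as parameters keeps the driver independent of the SVDD solver and of the labeling strategy, mirroring how the thesis later swaps in alternative labeling algorithms.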
6.1.5 Complexity

We recall that the SVC is composed of two steps: the cluster description and the cluster labeling. To calculate the overall time and space complexity we have to analyze each step separately.
6.1.5.1 Cluster description complexity
The complexity of the cluster description is the complexity of the QP problem (see Equation 6.3) we have to solve for finding the MEB. Such a problem has $O(n^3)$ worst-case running time complexity and $O(n^2)$ space complexity. However, the QP problem can be solved through efficient approximation algorithms like Sequential Minimal Optimization (SMO) (Platt, 1998) and many other decomposition methods. These methods can practically scale down the worst-case running time complexity to (approximately) $O(n^2)$, whereas the space complexity can be reduced to $O(1)$ (Ben-Hur et al., 2001, sec. 5).¹²
6.1.5.2 Cluster labeling complexity
The cluster labeling is composed of two sub-steps: (i) the construction of the adjacency matrix $A$ (see Equation 6.19) and (ii) the computation of the connected components of the undirected graph induced by the matrix $A$. The size of such a matrix is $n' \times n'$, where $n' = n - n_{bsv}$. In the first sub-step we have to compute the sphere radius $R$, i.e. the distance $d_R(\vec s)$ where $\vec s$ is any one of the support vectors, and the distance $d_R(\vec y)$ for each point $\vec y$ sampled along the path connecting two points. The number $m$ of points sampled along a path is the same for all paths we have to check. Finally, due to the dimensions of the adjacency matrix $A$, we have to check $n'^2$ paths. Let us recall Equation 6.15

$$d_R^2(\vec x) = K(\vec x, \vec x) - 2\sum_{k=1}^{n_{sv}} \beta_k K(\vec x_k, \vec x) + \sum_{k=1}^{n_{sv}}\sum_{l=1}^{n_{sv}} \beta_k \beta_l K(\vec x_k, \vec x_l)$$

¹²In the sequel we discuss the problem of the SVM training complexity in more detail.
Parameters: q = kernel width; C = soft margin.
Phase II: Cluster labeling

- Phase I only describes the clusters’ boundaries.
- Phase II: finding the connected components of the graph induced by the matrix A. Each component is a cluster.
- The original Phase II is a bottleneck (worst case). Alternatives:
  - Cone Cluster Labeling: best performance/accuracy rate;
  - Gradient Descent.
was that the Gaussian kernel produces tighter descriptions (and so tighter contours in the SVC), whereas the polynomial kernel stretches the data in the high dimensional feature space, causing the data to become very hard to describe with a hypersphere. In subsubsection 6.1.3.2 we mention another important reason why the Gaussian kernel is the most common choice in this application. In all related works, the employment of the Gaussian kernel has been taken for granted. Nevertheless, in the sequel we will show that the SVC can also work with other types of kernels.
6.1.1.2 SVDD improvement
Tax and Juszczak (2003) have proposed a preprocessing method in order to enhance the performance of the SVDD. The SVDD (and, generally, any one-class classification formulation) is quite sensitive to data distributed in subspaces, which harms the performance. The preprocessing method is called kernel whitening and it consists of a way to scale the data such that the variances of the data are equal in all directions.
6.1.2 Cluster labeling

The cluster description algorithm does not differentiate between points that belong to different clusters. To build a decision method for cluster membership we use a geometric approach involving $d_R(\vec x)$, based on the following observation: given a pair of data points that belong to different clusters, any path that connects them must exit from the sphere in the feature space. Therefore, such a path contains a segment of points $\vec y$ such that $d_R(\vec y) > R$. This leads to the definition of the adjacency matrix $A$ between all pairs of points whose images lie in or on the sphere in feature space. Let $S_{ij}$ be the line segment connecting $\vec x_i$ and $\vec x_j$; then, for all $i, j = 1, 2, \cdots, n$,

$$A_{ij} = \begin{cases} 1 & \text{if } \forall \vec y \in S_{ij},\ d_R(\vec y) \le R \\ 0 & \text{otherwise.} \end{cases} \quad (6.19)$$

Clusters are now defined as the connected components of the graph induced by the matrix $A$. Checking the line segment is implemented by sampling a number $m$ of points between the starting point and the ending point. The exactness of $A_{ij}$ depends on the number $m$. Clearly, the BSVs are unclassified by this procedure, since their feature space images lie outside the enclosing sphere. One may decide either to leave them unclassified or to assign them to the cluster that they are closest to. Generally, the latter is the most appropriate choice.
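Equation 6.19 and the connected-components step can be sketched directly. In the code below (my own simplification) `d_r` is any callable implementing $d_R(\cdot)$, $m$ points are sampled per segment, and a plain BFS replaces a graph library.

```python
import numpy as np

def adjacency(X, d_r, R, m=10):
    """Eq. 6.19: x_i and x_j are adjacent when every one of m samples on
    the segment between them stays inside the sphere, i.e. d_R(y) <= R."""
    n = len(X)
    A = np.zeros((n, n), dtype=bool)
    ts = np.linspace(0.0, 1.0, m)
    for i in range(n):
        for j in range(i, n):
            inside = all(d_r((1 - t) * X[i] + t * X[j]) <= R for t in ts)
            A[i, j] = A[j, i] = inside
    return A

def connected_components(A):
    """Clusters are the connected components of the graph induced by A."""
    n = len(A)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]                  # BFS/DFS over adjacent vertices
        while stack:
            v = stack.pop()
            if labels[v] == -1:
                labels[v] = current
                stack.extend(int(w) for w in np.nonzero(A[v])[0])
        current += 1
    return labels
```

With a toy `d_r` that is small only near two distant blobs, the segments crossing between the blobs leave the "sphere", so the graph splits into two components, i.e. two clusters.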
6.1.3 Working with Bounded Support Vectors

The shape of the enclosing contours in data space is governed by two hyper-parameters (or simply parameters): the kernel width $q$, which determines the scale at which data
Parameters exploration

- The greater the kernel width q, the greater the number of support vectors (and so of clusters).
- C rules the number of outliers and allows dealing with strongly overlapping clusters.
- A brute-force approach is unfeasible. Approaches proposed in the literature: a secant-like algorithm for the q exploration; no theoretically rooted method for the C exploration.

Pseudo-hierarchical execution

- Data analysis is performed at different levels of detail.
- Pseudo-hierarchical: a strict hierarchy is not guaranteed when C < 1, due to the Bounded Support Vectors.
Proposed improvements

- Soft margin C parameter selection: heuristics successfully applied in 90% of cases; only 10 tests out of 100 needed further tuning; 10 datasets had a high percentage of missing values.
- New robust stop criterion, based upon relative evaluation criteria (C-index, Dunn index, ad hoc).
- Kernel width (q) selection: SVC integration; softening strategy heuristics; works for all normalized kernels.
- More kernels: Exponential ($K(\vec x, \vec y) = e^{-q\|\vec x - \vec y\|}$), Laplacian ($K(\vec x, \vec y) = e^{-q|\vec x - \vec y|}$).

[Slide complexity figures: $O(Qn^3)$, $O(n_{sv}^2)$]
Improvements - Stop criterion

The idea: the SVC outputs quality-increasing clusterings before reaching the optimal clustering; after that, it provides quality-decreasing partitionings. The bigger the validity index, the better the clustering found: the stop criterion halts the process when the index value starts to decrease.

Dataset   Detected clusters   Actual clusters   Validity index
Iris      1                   3                 1.00E-06
          3                   3                 0.13
          4                   3                 0.05
Breast    1                   2                 1.00E-05
          2                   2                 0.80
          4                   2                 0.27
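The stop criterion above can be sketched in a few lines (hypothetical helper of mine; the real criterion uses C-index/Dunn-style relative indices): iterate over candidate clusterings in exploration order, remember the best one, and halt as soon as the validity index starts to decrease.

```python
def stop_on_decrease(candidates):
    """`candidates` yields (clustering, validity_index) pairs in
    exploration order; return the clustering found just before the
    index starts to decrease."""
    best = None
    for clustering, index in candidates:
        if best is not None and index < best[1]:
            break                      # quality started to decrease: stop
        best = (clustering, index)
    return best
```

On the Iris run in the table above (indices 1.00E-06, 0.13, 0.05), this halts at the third candidate and returns the 3-cluster solution.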
Improvements - Kernel width selection
Dataset      Algorithm         Accuracy   Macroaveraging*   # iter   # potential “q”
Iris         SVC               88.00%     87.69%            2        9
             SVC + softening   94.00%     93.99%            1        13
             K-means           85.33%     85.11%            not applicable
Wine         SVC               87.07%     87.55%            3        7
             SVC + softening   93.26%     93.91%            2        6
             K-means           50.00%     51.78%            not applicable
B. Cancer    SVC               91.85%     11.00%            3        11
             SVC + softening   96.71%     2.82%             3        13
             K-means           60.23%     32.00%            not applicable
Syn02        SVC               88.80%     100.00%           8        18
             SVC + softening   88.00%     100.00%           4        15
             K-means           68.40%     63.84%            not applicable
Syn03        SVC               87.30%     100.00%           17       36
             SVC + softening   87.30%     100.00%           6        31
             K-means           39.47%     39.90%            not applicable

* For B. Cancer the second metric is the benign contamination.
Support Vector Clustering
Improvements - non-Gaussian kernels

Exponential kernel: improves the cluster separation in several cases.
Laplace kernel: improves/allows the cluster separation with normalized data.

Iris:
  Algorithm                      Accuracy   Macroaveraging   # iter   # potential "q"
  SVC + softening                94.00%     93.99%           1        13
  SVC + softening + Exp kernel   97.33%     97.33%           1        15
  K-means                        85.33%     85.11%           not applicable

CLA3:
  Algorithm                      Accuracy   Macroaveraging   # iter   # potential "q"
  SVC + softening                Failed - only one class out of 3 separated
  SVC + softening + Exp kernel   94.00%     93.99%           1        11
  K-means                        85.33%     85.11%           not applicable

Quad:
  Algorithm                          Accuracy   # iter   # potential "q"
  SVC + softening                    Failed - no class separated
  SVC + softening + Laplace kernel   99.94%     1        17
  K-means                            83.00%     not applicable

SG03:
  Algorithm                          Accuracy   # iter   # potential "q"
  SVC + softening                    73.15%     3        19
  SVC + softening + Laplace kernel   91.04%     1        16
  K-means                            50.24%     not applicable
Bregman divergence: $d_\phi(\vec{x}_1, \vec{x}_2) = \phi(\vec{x}_1) - \phi(\vec{x}_2) - \langle \vec{x}_1 - \vec{x}_2, \nabla\phi(\vec{x}_2) \rangle$, with $\phi$ convex, real, differentiable.
5.1. TOWARDS THE BREGMAN DIVERGENCES

This idea can be generalized to more than two points. We refer to a point of the form $\lambda_1\vec{x}_1 + \lambda_2\vec{x}_2 + \cdots + \lambda_p\vec{x}_p$, where $\lambda_1 + \lambda_2 + \cdots + \lambda_p = 1$, as an affine combination of the points $\vec{x}_1, \vec{x}_2, \cdots, \vec{x}_p$. Using induction from the definition of affine set, it can be shown that an affine set contains every affine combination of its points. The set of all affine combinations of points in some set $C \subseteq \mathbb{R}^n$ is called the affine hull of $C$, and denoted $\operatorname{aff}(C)$:

$$\operatorname{aff}(C) = \{\lambda_1\vec{x}_1 + \lambda_2\vec{x}_2 + \cdots + \lambda_p\vec{x}_p : \vec{x}_1, \vec{x}_2, \cdots, \vec{x}_p \in C,\ \lambda_1 + \lambda_2 + \cdots + \lambda_p = 1\}.$$

The affine hull is the smallest affine set that contains $C$, in the following sense: if $S$ is any affine set with $C \subseteq S$, then $\operatorname{aff}(C) \subseteq S$. Finally, we define the relative interior of the set $C$, denoted $\operatorname{ri}(C)$, as its interior² relative to $\operatorname{aff}(C)$:

$$\operatorname{ri}(C) = \{\vec{x} \in C : B(\vec{x}, r) \cap \operatorname{aff}(C) \subseteq C \text{ for some } r > 0\}, \quad (5.2)$$

where $B(\vec{x}, r)$ is the ball of radius $r$ and center $\vec{x}$ (Boyd and Vandenberghe, 2004, chap. 2).

5.1.2 Bregman divergences

Let us define Bregman divergences (Bregman, 1967), which form a large class of well-behaved loss functions with a number of desirable properties.

Definition 5.1 (Bregman divergence) Let $\phi$ be a real-valued convex function of Legendre type³ defined on the convex set $S \equiv \operatorname{dom}(\phi) \subseteq \mathbb{R}^d$. The Bregman divergence $d_\phi : S \times \operatorname{ri}(S) \to \mathbb{R}_+$ is defined as

$$d_\phi(\vec{x}_1, \vec{x}_2) = \phi(\vec{x}_1) - \phi(\vec{x}_2) - \langle \vec{x}_1 - \vec{x}_2, \nabla\phi(\vec{x}_2) \rangle \quad (5.3)$$

where $\nabla\phi$ is the gradient of $\phi$, $\langle\cdot,\cdot\rangle$ is the dot product, and $\operatorname{ri}(S)$ is the relative interior of $S$.

Example 5.1 (Squared Euclidean Distance) Squared Euclidean distance is perhaps the simplest and most widely used Bregman divergence. The underlying function $\phi(\vec{x}) = \langle \vec{x}, \vec{x} \rangle$ is strictly convex, differentiable in $\mathbb{R}^d$ and

$$d_\phi(\vec{x}_1, \vec{x}_2) = \langle \vec{x}_1, \vec{x}_1 \rangle - \langle \vec{x}_2, \vec{x}_2 \rangle - \langle \vec{x}_1 - \vec{x}_2, \nabla\phi(\vec{x}_2) \rangle = \langle \vec{x}_1, \vec{x}_1 \rangle - \langle \vec{x}_2, \vec{x}_2 \rangle - \langle \vec{x}_1 - \vec{x}_2, 2\vec{x}_2 \rangle = \langle \vec{x}_1 - \vec{x}_2, \vec{x}_1 - \vec{x}_2 \rangle = \|\vec{x}_1 - \vec{x}_2\|^2 \quad (5.4)$$

²The interior of a set $C$ consists of all points of $C$ that are intuitively not on the "edge" of $C$ (Boyd and Vandenberghe, 2004, app. A).
³A proper, closed, convex function $\phi$ is said to be of Legendre type if: (i) $\operatorname{int}(\operatorname{dom}(\phi))$ is non-empty, (ii) $\phi$ is strictly convex and differentiable on $\operatorname{int}(\operatorname{dom}(\phi))$, and (iii) $\forall \vec{z}_b \in \operatorname{bd}(\operatorname{dom}(\phi))$, $\lim_{\vec{z} \to \vec{z}_b,\ \vec{z} \in \operatorname{dom}(\phi)} \|\nabla\phi(\vec{z})\| \to \infty$, where $\operatorname{dom}(\phi)$ is the domain of $\phi$, $\operatorname{int}(\operatorname{dom}(\phi))$ is the interior of the domain of $\phi$ and $\operatorname{bd}(\operatorname{dom}(\phi))$ is the boundary of the domain of $\phi$ (Banerjee et al., 2005c).
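Definition 5.1 and Example 5.1 can be checked numerically; a minimal sketch (the function names are ours), where choosing $\phi(\vec x) = \langle \vec x, \vec x \rangle$ recovers the squared Euclidean distance:

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x1, x2):
    # d_phi(x1, x2) = phi(x1) - phi(x2) - <x1 - x2, grad_phi(x2)>   (5.3)
    return phi(x1) - phi(x2) - np.dot(x1 - x2, grad_phi(x2))

# Example 5.1: phi(x) = <x, x> yields the squared Euclidean distance.
phi = lambda x: float(np.dot(x, x))
grad_phi = lambda x: 2.0 * x

x1 = np.array([3.0, 1.0])
x2 = np.array([1.0, 2.0])
d = bregman_divergence(phi, grad_phi, x1, x2)
print(d)  # ||x1 - x2||^2 = 2^2 + (-1)^2 = 5.0
```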
Bregman Co-clustering (BCC)
Minimum Bregman Information Principle

Co-clustering: simultaneous clustering of both rows and columns of a data matrix.

Bregman framework:
- Generalizes the K-means strategy
- Large class of divergences: Bregman divergences
- Minimum Bregman Information (MBI) principle
- Meta-algorithm
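In the one-sided special case the framework reduces to Bregman hard clustering, the direct K-means generalization: assign each point to the representative with smallest Bregman divergence, then update each representative to the cluster mean, which is the optimal (MBI) representative for any Bregman divergence. A minimal sketch of this special case, not of the full co-clustering meta-algorithm (names are ours):

```python
import numpy as np

def bregman_kmeans(X, k, d_phi, n_iter=20, seed=0):
    """Bregman hard clustering sketch: K-means with the squared
    Euclidean distance replaced by an arbitrary Bregman divergence
    d_phi; the cluster representatives remain the cluster means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest representative under d_phi
        labels = np.array([np.argmin([d_phi(x, c) for c in centers])
                           for x in X])
        # update step: the MBI representative is the cluster mean
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

sq_euclid = lambda a, b: float(np.sum((a - b) ** 2))
X = np.array([[0.0], [0.1], [5.0], [5.1]])
labels, centers = bregman_kmeans(X, 2, sq_euclid)
```

With the squared Euclidean divergence this is exactly K-means; swapping in another Bregman divergence changes only the assignment step.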
$$E_\pi[d_\phi(X, \vec{s})] \quad (5.6)$$

that depends on the choice of the representative $\vec{s}$ and can be optimized by picking the right representative. We call this optimal distortion-rate function the Bregman Information (BI) of the random variable $X$ for the Bregman divergence $d_\phi$, and denote it by $I_\phi(X)$, i.e.

$$I_\phi(X) = \min_{\vec{s} \in \operatorname{ri}(S)} E_\pi[d_\phi(X, \vec{s})] = \min_{\vec{s} \in \operatorname{ri}(S)} \sum_{i=1}^n \pi_i\, d_\phi(\vec{x}_i, \vec{s}) \quad (5.7)$$

The optimal vector $\vec{s}$ which achieves the minimal distortion will be called the Bregman representative or, simply, the representative of $X$. The following proposition⁴ states that such a representative always exists, is uniquely determined and does not depend on the choice of Bregman divergence. In fact, the minimizer is just the expectation of the random variable $X$.

Proposition 5.1 Let $X$ be a random variable that takes values in $\mathcal{X} = \{\vec{x}_i\}_{i=1}^n \subset S \subseteq \mathbb{R}^d$ following a positive probability distribution measure $\pi$ such that $E_\pi[X] \in \operatorname{ri}(S)$.⁵ Given a Bregman divergence $d_\phi : S \times \operatorname{ri}(S) \to [0, \infty)$, the problem

$$\min_{\vec{s} \in \operatorname{ri}(S)} E_\pi[d_\phi(X, \vec{s})] \quad (5.8)$$

has a unique solution given by $\vec{s}^* = \vec{\mu} = E_\pi[X]$.

Using the proposition above, we can now give a more direct definition of the Bregman Information (BI).

Definition 5.2 (Bregman Information) Let $X$ be a random variable that takes values in $\mathcal{X} = \{\vec{x}_i\}_{i=1}^n \subset S \subseteq \mathbb{R}^d$ following a positive probability distribution measure $\pi$. Let $\vec{\mu} = E_\pi[X] = \sum_{i=1}^n \pi_i \vec{x}_i \in \operatorname{ri}(S)$ and let $d_\phi : S \times \operatorname{ri}(S) \to [0, \infty)$ be a Bregman divergence. Then the Bregman Information of $X$ in terms of $d_\phi$ is defined as

$$I_\phi(X) = E_\pi[d_\phi(X, \vec{\mu})] = \sum_{i=1}^n \pi_i\, d_\phi(\vec{x}_i, \vec{\mu}) \quad (5.9)$$

Example 5.3 (Variance) Let $\mathcal{X} = \{\vec{x}_i\}_{i=1}^n$ be a set in $\mathbb{R}^d$, and consider the uniform measure over $\mathcal{X}$, i.e. $\pi_i = 1/n$. The Bregman Information of $X$ with squared Euclidean distance as Bregman divergence is actually the variance

$$I_\phi(X) = \sum_{i=1}^n \pi_i\, d_\phi(\vec{x}_i, \vec{\mu}) = \frac{1}{n}\sum_{i=1}^n \|\vec{x}_i - \vec{\mu}\|^2 \quad (5.10)$$

where

$$\vec{\mu} = \sum_{i=1}^n \pi_i \vec{x}_i = \sum_{i=1}^n \frac{1}{n}\vec{x}_i$$

⁴A proof is available in Banerjee et al. (2005c).
⁵The assumption that $E_\pi[X] \in \operatorname{ri}(S)$ is not restrictive since a violation can occur only when $\operatorname{co}(\mathcal{X}) \subseteq \operatorname{bd}(S)$, i.e. the entire convex hull of $\mathcal{X}$ is on the boundary of $S$.
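Example 5.3 can be verified directly: with uniform weights and the squared Euclidean distance, the Bregman Information is the variance. A minimal sketch (function names are ours):

```python
import numpy as np

def bregman_information(points, weights, d_phi):
    # I_phi(X) = sum_i pi_i d_phi(x_i, mu)  with  mu = E_pi[X]   (5.9)
    mu = sum(w * x for x, w in zip(points, weights))
    return sum(w * d_phi(x, mu) for x, w in zip(points, weights)), mu

# Example 5.3: uniform measure + squared Euclidean distance = variance.
sq_euclid = lambda a, b: float(np.sum((a - b) ** 2))
pts = [np.array([0.0]), np.array([2.0]), np.array([4.0])]
ws = [1/3, 1/3, 1/3]
info, mu = bregman_information(pts, ws, sq_euclid)
print(mu, info)  # mu = [2.], info = 8/3, the variance of {0, 2, 4}
```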
Bregman divergence and Bregman Information

  Divergence         Information          MBI principle     Algorithm
  Euclidean          Variance             Least Squares     K-means
  Relative Entropy   Mutual Information   Maximum Entropy   unnamed
  Itakura-Saito      unnamed              unnamed           Linde-Buzo-Gray
Other experiments
Sparse data and missing-valued data

Star/Galaxy data with missing values:

  Dataset          SVC      BCC      K-means   # attr. affected   % obj. affected
  MV5000 (25D)     99.02%   94.00%   71.08%    10                 27.0%
  MV10000 (25D)    96.10%   95.60%   75.12%    10                 29.0%
  AMV5000 (15D)    91.76%   79.46%   74.90%    6                  30.0%
  AMV10000 (15D)   90.31%   83.51%   68.20%    6                  30.0%

Textual document data: sparsity and high "dimensionality":

  Dataset            SVC      BCC       K-means
  CLASSIC3 (3303D)   99.80%   100.00%   49.80%
  SCI3 (9456D)       failed   89.39%    39.15%
  PORE (13821D)      failed   82.68%    45.91%
Outliers

  Dataset      SVC       Best BCC   K-means   # objects   # outliers
  SynDECA 02   100.00%   94.18%     68.04%    1000        112
  SynDECA 03   100.00%   49.00%     39.47%    10000       1270
Other experiments
9.8. MISSING VALUES IN ASTROPHYSICS: THE DATA

Figure 9.2: Since the dataset is 3-dimensional, this plot exactly represents Syndeca 03 data. The features are unnamed, so in the plot we have written "col1", "col2", and "col3" for the first, the second, and the third feature respectively. The outliers are represented by empty circles.

Each object is represented by 25 features:²⁷ brightness (Point Spread Function flux), size (Petrosian radius, in arcsec), texture (small-scale roughness of object), and two shape features, all observed at 365 nm. The two shape features (10 actual attributes) are the ones that can have missing values.

The guidelines used to extract the objects from the SDSS allowed us to get data that are fairly separable based on the features we used, so we were able to focus on and accurately test the missing-values robustness of our two clustering algorithms.

In the second session data were extracted in a similar way, but the actual features are 15 because we avoided features with missing values; we then created different missing-valued versions of the various datasets by means of a simple homemade program that removes values on specific features. The items affected by this procedure are randomly chosen and their number is specified as an input parameter.

Finally, to evaluate our clustering results we considered an object to be a Star if the "type_u, type_g, type_r, type_i, type_z" attributes in the SDSS catalogue assume the value 6. An object is a Galaxy if the aforesaid attributes assume the value 3.²⁸

²⁷Five attributes, each over the u, g, r, i, z bands.
²⁸The "type" attributes assume values as a function of the Stellarity index returned by SExtractor. SExtractor (Source-Extractor) is a program that builds a catalogue of objects from an astronomical image. It is particularly oriented towards reduction of large scale galaxy-survey data, but it also performs well on moderately crowded star fields. For more information see http://sextractor.sourceforge.net/.
SynDECA 02 SynDECA 03
Conclusions
Conclusions and future works

Support Vector Clustering achieves the goals. Bregman Co-clustering achieves the same goals, but the following problems still hold: estimating the number of clusters, and outliers handling.

  Goals                                       Application domain
  Robustness w.r.t. Missing-valued Data       Astrophysics
  Robustness w.r.t. Sparse Data               Textual documents
  Robustness w.r.t. High "dimensionality"     Textual documents
  Robustness w.r.t. Noise/Outliers            Synthetic data
  Other properties                            Whole experimental stage

Other properties: automatic detection of the number of clusters; application domain independence; nonlinear separable problems handling; arbitrary-shaped clusters handling.
Contribution

- SVC was made applicable in practice: complexity reduction for the kernel width selection, soft margin C parameter estimation, new effective stop criterion.
- Non-Gaussian kernels: the kernel width selection was shown to be applicable to all normalized kernels; exponential and Laplace kernels successfully used.
- Improved accuracy: softening strategy for the kernel width selection.
Conclusions and future works
Future works

- Minimum Enclosing Bregman Ball (MEBB): generalization of the Minimum Enclosing Ball (MEB) problem and of the Bădoiu-Clarkson (BC) algorithm to Bregman divergences.
- Core Vector Machines (CVM): SVMs reformulated as a MEB problem; they make use of the BC algorithm.
- MEBB + CVM = Bregman Vector Machines: new implications for vector machines and for SVC.
- Adapting cluster labeling algorithms to the Bregman divergences.

Conclusions and future works
10.3. FUTURE WORK

Itakura-Saito · $L_2^2$ · Kullback-Leibler

Fig. 2. Examples of Bregman Balls, for d = 2. Blue dots are the centers of the balls.

Here, $\nabla F$ is the gradient operator of $F$. A Bregman divergence has the following properties: it is convex in $\vec{x}'$, always non-negative, and zero iff $\vec{x} = \vec{x}'$. Whenever $F(\vec{x}) = \sum_{i=1}^d x_i^2 = \|\vec{x}\|_2^2$, the corresponding divergence is the squared Euclidean distance ($L_2^2$): $D_F(\vec{x}', \vec{x}) = \|\vec{x} - \vec{x}'\|_2^2$, with which is associated the common definition of a ball in a Euclidean metric space:

$$B_{c,r} = \{\vec{x} \in \mathcal{X} : \|\vec{x} - \vec{c}\|_2^2 \le r\}, \quad (2)$$

with $\vec{c} \in S$ the center of the ball, and $r \ge 0$ its (squared) radius. Eq. (2) suggests a natural generalization to the definition of balls for arbitrary Bregman divergences. However, since a Bregman divergence is usually not symmetric, any $\vec{c} \in S$ and any $r \ge 0$ actually define two dual Bregman balls:

$$B_{c,r} = \{\vec{x} \in \mathcal{X} : D_F(\vec{c}, \vec{x}) \le r\}, \quad (3)$$

$$B'_{c,r} = \{\vec{x} \in \mathcal{X} : D_F(\vec{x}, \vec{c}) \le r\}. \quad (4)$$

Remark that $D_F(\vec{c}, \vec{x})$ is always convex in $\vec{c}$ while $D_F(\vec{x}, \vec{c})$ is not always, but the boundary $\partial B_{c,r}$ is not always convex (it depends on $\vec{x}$, given $\vec{c}$), while $\partial B'_{c,r}$ is always convex. In this paper, we are mainly interested in $B_{c,r}$ because of the convexity of $D_F$ in $\vec{c}$. The conclusion of the paper extends some results to build $B'_{c,r}$ as well. Figure 2 presents some examples of Bregman balls for three popular Bregman divergences (see Table 1 for the analytic expressions of the divergences). Let $S \subseteq \mathcal{X}$ be a set of $m$ points that were sampled from $\mathcal{X}$. A smallest enclosing Bregman ball (SEBB) for $S$ is a Bregman ball $B_{c^*,r^*}$ with $r^*$ the minimal real such that $S \subseteq B_{c^*,r^*}$. With a slight abuse of language, we will refer to $r^*$ as the radius of the ball. Our objective is to approximate as best as possible the SEBB of $S$, which amounts to minimizing the radius of the enclosing ball we build. As a simple matter of fact indeed, the SEBB is unique.

Lemma 1. The smallest enclosing Bregman ball $B_{c^*,r^*}$ of $S$ is unique.
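The asymmetry behind the two dual balls (3)-(4) is easy to observe numerically; a minimal sketch using the Kullback-Leibler divergence on probability vectors (helper names are ours):

```python
import numpy as np

def kl_divergence(p, q):
    # Bregman divergence of F(x) = sum_i x_i log x_i, restricted to
    # probability vectors: D_F(p, q) = sum_i p_i log(p_i / q_i)
    return float(np.sum(p * np.log(p / q)))

def in_ball(x, c, r):        # B_{c,r}:  D_F(c, x) <= r   (Eq. 3)
    return kl_divergence(c, x) <= r

def in_dual_ball(x, c, r):   # B'_{c,r}: D_F(x, c) <= r   (Eq. 4)
    return kl_divergence(x, c) <= r

c = np.array([0.5, 0.5])
x = np.array([0.7, 0.3])
# KL is not symmetric: the same center and radius carve out two
# different balls, so a point can lie in one but not the other.
print(kl_divergence(c, x), kl_divergence(x, c))
```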
Figure 10.1: Examples of Bregman balls. The two on the left are balls obtained by means of the Itakura-Saito distance. The middle one is a classic Euclidean ball. The other two are obtained by employing the Kullback-Leibler distance (Nock and Nielsen, 2005, fig. 2).

tion of the data, therefore we can take it into account for much research on the SVC and generally on the SVM. In fact, we wish to recall that the classical BC algorithm is the optimization algorithm exploited by the already mentioned CVMs. The CVMs reformulate the SVMs as a MEB problem and we already expressed our intention to test such machines for the cluster description stage of the SVC (see section 6.12). Since the BC algorithm has been generalized to Bregman divergences, the research about vector machines (and therefore about the SVC) could have very interesting implications. We definitely intend to explore this direction.

10.3.2 Improve and extend the SVC software

For the sake of accuracy, and in order to perform more robust comparisons with other clustering algorithms, an improved and extended software for the Support Vector Clustering (SVC) is needed. More stability and reliability are necessary. Moreover, it is important to implement all the key contributions to this promising technique proposed all around the world. In fact, all the tests have currently been performed by exploiting only some of the characteristics and/or special contributions at a time.
Bregman Balls: Itakura-Saito, Squared Euclidean, Kullback-Leibler

The End