Random Forest for Metric Learning with Pairwise Position Dependence


TRANSCRIPT

Random Forest for Metric Learning with Pairwise Position Dependence

Caiming Xiong, David Johnson, Ran Xu, Jason J. Corso
Department of Computer Science and Engineering, SUNY at Buffalo
{cxiong, davidjoh, rxu2, jcorso}@buffalo.edu
Presented at ACM SIGKDD 2012 in Beijing, China

Good afternoon, everyone. I am Caiming Xiong from SUNY at Buffalo. I will present our random forest metric learning method; this is joint work with David Johnson, Ran Xu, and my advisor Jason J. Corso.

Distance Function

- Distance functions are widely used in machine learning and data mining problems:
  - Classification: the Gaussian kernel in SVMs.
  - Clustering: spectral clustering, k-means.
- Is a predefined metric, such as the Euclidean distance, reliable?
- Distance metric learning is needed...

[Figure: a 2D spiral dataset with points a, b, c; Euc(a,b) = Euc(a,c), which is not what we want.]

As we know, distance functions are very important and widely used in many machine learning and data mining problems, such as classification and clustering. For example, the Gaussian kernel for SVMs adopts the Euclidean distance inside the kernel. In clustering, the first step is commonly to define the distance function between any pair of data points.

In most cases we would use some predefined, general distance function, such as the Euclidean distance or the KL divergence for statistical data. But are they really reliable for any problem? According to the no-free-lunch theorem, the answer is no. Here is an example: a 2D spiral dataset, where color indicates class. Given three points a, b, and c, where a and c belong to the same class, the Euclidean distance gives d(a,b) = d(a,c), which is not what we want. Therefore we need a distance function specific to this dataset, and that is why distance metric learning is needed.

Outline

- Introduction
- Our Methods
- Experiments
- Conclusion

This is the outline; we will go through these one by one. First is an introduction to distance metric learning.

Distance Metric Learning

- Learn a single Mahalanobis metric, with representative methods: RCA, DCA, LMNN, ITML, PSD Boost.
  - Problem: a uniform Mahalanobis distance for ALL instances.
- Learn multiple metrics, with representative methods: FSM, FSSM (Frome et al. 2006, 2007), ISD (Zhan et al. 2009), Bregman distance (Wu et al. 2009).
  - Problem: high time and space complexity for training, and high complexity for testing.

There are two kinds of metric learning. The first is single global metric learning, with methods such as RCA, DCA, LMNN, ITML, and PSD boosting. These can be learned very efficiently, and the learned metric outperforms predefined, general distances in applications. But a global distance metric learns a single, identical metric for all instances, which does not satisfy the demands of many heterogeneous real-world datasets.

More recently, attention has shifted to multiple, position-based distance metric learning, with methods such as FSM, FSSM, ISD, and Bregman distance learning. These achieve better performance than single global metric learning methods, but their training and testing complexity is much higher.

Specifically, FSM estimates a specific distance per labeled training instance, and ISD extends this by propagating metrics to unlabeled points. The Bregman distance method learns a parametric metric that is efficient to train but expensive at test time. All of these achieve better performance than single metric learning methods.

Our Work

- Can we obtain a single distance function that achieves both the efficiency of the global methods and the specificity of the multi-metric methods?
- Yes: we learn an implicitly pairwise position dependent metric via random forests.

So there is a question: is it possible to learn a single distance function that integrates the efficiency of single metric learning with the specificity, that is, the position-based property, of multi-metric learning?

Yes: in this paper we propose a random forest metric learning method that is implicitly pairwise position dependent.

Outline: Introduction, Our Methods, Experiments, Conclusion.

Now I will introduce our method.

Distance Metric Learning Revisited

Given the instance set X = {x_1, ..., x_n} and two pairwise constraint sets, a must-link set S and a cannot-link set D.

The learned distance function must lie in a preset distance function space; for the Mahalanobis distance function this is the space of PSD matrices.

Then redefine d(x_i, x_j) = F(φ(x_i, x_j)), where F is a classification model and φ is a feature mapping function for the pair of points. The problem then becomes a classification problem with a function space constraint.

[Figure: must-link pairs {S(A,B)} and cannot-link pairs {D(A,B)}.]

First, let us revisit the supervised distance metric learning problem. A must-link constraint means the two points should be treated as similar (e.g., same class); a cannot-link constraint means they should be treated as dissimilar.

The goal of distance metric learning is to find a distance metric that satisfies the pairwise constraints. The objective can therefore be formulated as minimizing a loss function over the pairwise constraints, while requiring the learned distance function to belong to some preset distance function space.

For this objective we can make a simple revision: rewrite the two-argument distance function as a single-argument classification function F, whose input is the output of a mapping function φ applied to the pair of points. In this way the distance metric learning problem is transferred to a classification problem with a function space constraint. Next, using single Mahalanobis distance metric learning, we provide an example to demonstrate this transformation.

Example: Single Mahalanobis Metric Learning

Standard Mahalanobis-based methods learn a distance function of the form d_W(x_i, x_j) = (x_i - x_j)^T W (x_i - x_j), where W is a PSD matrix.

Using a hinge loss over the pairwise constraints, the learning objective becomes a standard classification loss.

Therefore, the metric learning problem is transferred to a classification problem with a PSD matrix constraint. The mapping function is defined as:

    φ(x_i, x_j) = (x_i - x_j)(x_i - x_j)^T    (1)

In Mahalanobis distance learning, the goal is to learn a distance function of the form d_W(x_i, x_j) = (x_i - x_j)^T W (x_i - x_j), where W is a PSD matrix. Following the previous slide, we rewrite the Mahalanobis distance as a linear function F applied to a mapping φ: using simple algebra, F can be obtained explicitly by setting φ(x_i, x_j) = (x_i - x_j)(x_i - x_j)^T, so that F(φ) = <W, φ> and F is a linear classifier/regressor in this case. Then, choosing a loss function such as the hinge loss, the metric learning problem is transferred to a classification problem.

Viewed from the other side, this example shows that Mahalanobis metric learning can be considered a classification problem with the specific φ function of Equation (1). And because Equation (1) only uses the relative position between the pair of points in feature space, i.e., the difference between the two data points, the single Mahalanobis metric is uniform for all instances: a global metric. So we ask: if we define a φ function that includes position information, can we learn a position-dependent distance metric?
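To make this reformulation concrete, here is a minimal numpy sketch (the vectors and the PSD matrix are arbitrary illustrative values, not anything from the paper) verifying that the Mahalanobis quadratic form equals a linear function, the Frobenius inner product with W, of the Equation (1) mapping:

```python
import numpy as np

# Hypothetical toy data to illustrate the identity on this slide:
# (x_i - x_j)^T W (x_i - x_j) == <W, (x_i - x_j)(x_i - x_j)^T>
rng = np.random.default_rng(0)
d = 5
x_i, x_j = rng.normal(size=d), rng.normal(size=d)

A = rng.normal(size=(d, d))
W = A @ A.T                       # a PSD matrix

diff = x_i - x_j
d_mahalanobis = diff @ W @ diff   # quadratic-form view
phi = np.outer(diff, diff)        # mapping of Eq. (1)
d_linear = np.sum(W * phi)        # linear view: Frobenius inner product <W, phi>

assert np.isclose(d_mahalanobis, d_linear)
```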

Mapping Function for Position Dependent Metric Learning

Now we define a simple and more general mapping function: φ(x_i, x_j) = [u; v], where u is the relative location/difference of the two points and v is the mean of the two point vectors.

- It allows our method to adapt to heterogeneous distributions of data.
- But we prove that this mapping function cannot be used in the single Mahalanobis metric learning methods!

First, what influences the distance between two points? We believe the distance is not based only on the difference between the two points, but is also related to their position in feature space.

Based on this assumption and the φ function discussed on the previous slides, we define a novel, simple mapping function. It consists of two parts, u and v: u is the relative difference of the pair, as in the Mahalanobis case, and v is the mean of the two point vectors. As the figure shows, this φ projects each pair of points into two spaces, and since the v space includes position information, it allows our metric to adapt to heterogeneous data better than the previous mapping function.

So can we simply substitute this mapping function for the one on the previous slides and learn a new Mahalanobis-style distance function? The answer is no. Due to time constraints, the proof is at the end of these slides; if you are interested, we can discuss it later. If we cannot use this mapping function with a Mahalanobis metric, what can we do?
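Below is a minimal sketch of this pairwise mapping. The v part (the mean of the two point vectors) follows the slide; the exact form of the u part (here the element-wise absolute difference) is an illustrative assumption:

```python
import numpy as np

def pair_mapping(x_i: np.ndarray, x_j: np.ndarray) -> np.ndarray:
    """Map a pair of points to a single feature vector [u; v].

    u encodes the relative location/difference of the pair (here the
    element-wise absolute difference, an illustrative choice) and
    v encodes the absolute position of the pair (the mean of the two points).
    """
    u = np.abs(x_i - x_j)        # relative/difference part
    v = (x_i + x_j) / 2.0        # position part: mean of the two points
    return np.concatenate([u, v])

# Example: two 3-dimensional points
phi = pair_mapping(np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.0]))
print(phi)  # -> [1.  2.  0.  1.5 1.  3. ]
```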

Tree-Structured Metric with Implicit Position Dependence

Instead of a Mahalanobis metric, we design a tree-structured metric:

- This tree metric can be learned with a decision tree.
- Drawback: a single tree learned greedily easily gets trapped in overfitting.

To use our new mapping function and make use of the position information, we propose a tree-structured metric.

In this tree, each node partitions the pair constraints, in the u space or the v space (since the mapping function projects pairs into these two spaces), into small cells based on their relative and absolute (mean) position information. In the figure, the red nodes partition the u space and the blue nodes partition the v space. Any pair of points traverses the tree along a path determined by its relative and position information until it reaches a leaf node, which outputs a distance based on which pair constraints arrived at that leaf. In our paper this tree metric is considered implicitly position dependent: implicit because we know position information is used, but we do not know where or how much for each test pair, and we cannot write an explicit closed-form function.

To learn this tree metric we can use a decision tree algorithm, but the greedy method easily results in overfitting, so we choose a better way.

Random Forest for Metric Learning

To make the algorithm more general, we adopt the random trees technique to learn our tree metric:

Using Breiman's random forest method, we name our distance the Random Forest Distance (RFD)!

So we adopt the random trees idea to learn our tree metric, which is more general. Following Breiman's random forest, we first compute the output of the mapping function, then train each tree metric in a randomized way: each tree uses a random subset of the pairs, and each node chooses among random directions/features. Each random tree can be considered a weak metric, and the final distance function is the sum of all the weak distance metrics. We name this distance function the Random Forest Distance.
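The authors' implementation is released at the URL on the conclusion slide; the following is only a minimal sketch of the idea using scikit-learn, where the pair labeling (0 = must-link, 1 = cannot-link), the helper names, and the use of the forest's dissimilar-vote fraction as the distance are illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pair_mapping(x_i, x_j):
    # [relative difference; mean position], as in the earlier sketch
    return np.concatenate([np.abs(x_i - x_j), (x_i + x_j) / 2.0])

def build_pair_data(X, must_link, cannot_link):
    """Turn pairwise constraints into a labeled dataset over phi(x_i, x_j).

    Label 0 = must-link (similar), label 1 = cannot-link (dissimilar).
    """
    feats, labels = [], []
    for (i, j) in must_link:
        feats.append(pair_mapping(X[i], X[j])); labels.append(0)
    for (i, j) in cannot_link:
        feats.append(pair_mapping(X[i], X[j])); labels.append(1)
    return np.vstack(feats), np.array(labels)

def train_rfd(X, must_link, cannot_link, n_trees=200):
    F, y = build_pair_data(X, must_link, cannot_link)
    # Each tree acts as a weak metric; trees are trained on bootstrap samples
    # of pairs with random feature subsets, and can be built in parallel.
    forest = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1, random_state=0)
    forest.fit(F, y)
    return forest

def rfd_distance(forest, x_i, x_j):
    """Distance = fraction of trees that consider the pair dissimilar."""
    phi = pair_mapping(x_i, x_j).reshape(1, -1)
    return forest.predict_proba(phi)[0, 1]
```

Because each tree is fit independently on its own sample of pairs, the trees can be trained in parallel (n_jobs=-1 above), which is the efficiency point made on the RFD comments slide.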

RFD Comments

- Pseudosemimetric: because of the nonlinearity of the tree structure and the new mapping function, our distance function does not satisfy the triangle inequality.
- Computational complexity: building the trees in parallel, training is O(n log n); the learning processes of nodes at the same depth can also be parallelized.

Because of the mapping function and the nonlinear nature of the random trees, the learned metric does not satisfy the triangle inequality, which is common among multiple metric learning methods; our distance function is therefore a pseudosemimetric.

Regarding computational complexity: since each weak tree metric can be learned independently, the trees can be built in parallel, so the overall complexity equals that of constructing a single tree, which is O(n log n). Within a single tree, nodes at the same depth are also independent and can be learned in parallel, so training is very efficient. We will show complexity results compared with other multiple metric learning methods in the experiments.

Outline: Introduction, Our Methods, Experiments, Conclusion.

Datasets and Methods

- Datasets: 10 UCI datasets; the Corel image database.
- Methods:
  - Ours: RFD.
  - Predefined: Euclidean, Mahalanobis (inverse of the covariance matrix).
  - Single metric learning methods: RCA, DCA, ITML, LMNN.
  - Multiple metric learning methods: FSM, FSSM, ISD.

The evaluation tasks are kNN classification and image retrieval.

The compared methods include ours, predefined distances, single global metric learning, and multiple metric learning methods.

Classification Performance

We run on the 10 datasets with different values of k for kNN. For each k and each dataset we rank the methods, then average the ranks for each method. RFD(-p) is our method without position information; RFD(+p), with position information, is clearly better than RFD(-p), and our method outperforms the other single global metrics. The second best is LMNN, a strong single global metric.
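As a hedged illustration of this evaluation protocol (not the paper's experiment code), a learned pairwise distance such as the rfd_distance sketch above can be plugged into kNN via a precomputed distance matrix; the helper name and signature below are assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_with_learned_distance(dist_fn, X_train, y_train, X_test, k=5):
    """kNN classification with an arbitrary learned pairwise distance.

    dist_fn(a, b) -> float could wrap the rfd_distance sketch above,
    e.g. lambda a, b: rfd_distance(forest, a, b).
    """
    def dist_matrix(A, B):
        # Pairwise distances between rows of A and rows of B.
        return np.array([[dist_fn(a, b) for b in B] for a in A])

    knn = KNeighborsClassifier(n_neighbors=k, metric="precomputed")
    knn.fit(dist_matrix(X_train, X_train), y_train)   # train x train distances
    return knn.predict(dist_matrix(X_test, X_train))  # test x train distances
```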

Classification Performance: Comparison with Global Mahalanobis Metric Learning Methods

[Table: results grouped into homogeneous, low-dimensional datasets and heterogeneous, high-dimensional datasets.]

In most cases our method is best or second best, while the performance of the other methods varies greatly with the dataset (e.g., DCA). Our method also does better on relatively complex datasets than on simple ones: on iris, DCA and ITML do somewhat better, though ours still reaches 96%, whereas on datasets like sonar ours is clearly stronger.

Classification Performance: Comparison with Position-Specific Multi-Metric Methods

Previously we showed that our method outperforms the single global metrics most of the time, owing to its implicit position dependence.

Here we compare with other position-specific multiple metric learning methods. In the upper table, our method RFD is comparable to the other multi-metric methods: we outperform FSM, FSSM, and ISD-L1, while ISD-L2 gives the best results but is similar to ours. The lower table shows the training time of ISD and our RFD (ISD being the strongest competitor in the upper table); it demonstrates the efficiency of our learning, which is at least 16 times faster. So our method combines the efficiency of global methods with the specificity of multi-metric learning.

Outline: Introduction, Our Methods, Experiments, Conclusion.

Conclusion

Main contribution: Overcoming the limitation of a single global metric, we incorporate the conventional relative position information as well as the absolute position of point pairs into the learned metric, and hence implicitly adapt the position-based metric through the feature space.

Our code is publicly released at http://www.cse.buffalo.edu/~cxiong/. Thank you!

Acknowledgements: This work was partially supported by the National Science Foundation CAREER grant (IIS-0845282), the Army Research Office (W911NF-11-1-0090), and DARPA CSSG (HR0011-09-1-0022 and D11AP00245). We also thank the KDD reviewers for their good suggestions on our paper.

Multiple Metric Learning

- [Frome et al., NIPS 06, ICCV 07]: estimate a distance metric per instance or exemplar.
- [Zhan et al., ICML 09]: propagate metrics learned on training exemplars to learn a metric matrix for each unlabeled point.
  - Drawback: high time and space complexity, due to the need to learn and store O(N) d-by-d metric matrices.
- [Wu et al., NIPS 09]: learns a Bregman distance function for semi-supervised clustering, but it takes O(N) time to compute the distance of a single pair of points, which is impractical for large-scale datasets.

Frome, in 2006 and 2007, proposed two multiple-metric methods that estimate a specific distance per training instance: when comparing any test point with a training point, there is a specific distance function for that pair. But if both points in a pair are unseen, i.e., both come from the test set, what can they do? So in 2009 Zhan proposed an extended version that also learns a metric matrix for each unlabeled point by propagating the metrics learned on the training exemplars. All three of these methods share the same major drawback: they need to learn and store O(N) d-by-d metric matrices, which is infeasible to learn, or even to store, when d or N is large.

In NIPS 2009, Wu proposed a Bregman distance function for semi-supervised clustering that is efficient to store and learn, but it costs O(N) time to compute the distance for a single pair of points, which is impractical at large scale.

Retrieval Performance

We do not include LMNN for the retrieval problem because it requires a label for each training point, which we do not have.

In the classification experiments we found that our method becomes more robust as k increases, which is a good characteristic for retrieval. We test our method on the Corel image dataset, which has 10 classes, and RFD clearly outperforms the other methods.

Why not Mahalanobis?

For any two points x_1 and x_2 such that x_2 = c · x_1, where c is a scalar, [...]

Thus, as [...],

This is a nonsensical result, and clearly undesirable in a metric.
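The formulas on this backup slide did not survive into the transcript. As a hedged sketch of how such a failure can arise, assume for illustration a linear score F(φ) = w^T φ over the new mapping with u = |x_1 - x_2| and v = (x_1 + x_2)/2 (these specific forms are assumptions, not necessarily the slide's exact argument):

```latex
% Illustrative assumption: linear score over phi = [u; v],
% with u = |x_1 - x_2| (element-wise) and v = (x_1 + x_2)/2.
\[
  F\bigl(\phi(x_1, x_2)\bigr)
    = w_u^{\top} \lvert x_1 - x_2 \rvert
    + w_v^{\top} \tfrac{1}{2}(x_1 + x_2).
\]
\[
  \text{Setting } x_2 = c\,x_1:\qquad
  F = \lvert 1 - c \rvert\, w_u^{\top} \lvert x_1 \rvert
    + \tfrac{1+c}{2}\, w_v^{\top} x_1
  \;\longrightarrow\; w_v^{\top} x_1 \quad (c \to 1),
\]
% which is nonzero in general: two coinciding points get a nonzero "distance".
```

Under this reading, the position term keeps the distance away from zero even as the two points coincide, which matches the nonsensical result noted on the slide.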

Varying Forest Size
