Signal Processing 165 (2019) 186–196
Contents lists available at ScienceDirect: Signal Processing. Journal homepage: www.elsevier.com/locate/sigpro

Adaptive graph weighting for multi-view dimensionality reduction

Xinyi Xu (a), Yanhua Yang (c), Cheng Deng (a,*), Feiping Nie (b)
a School of Electronic Engineering, Xidian University, Xi'an 710071, China
b OPTIMAL, Northwestern Polytechnical University, Xi'an 710072, China
c School of Computer Science and Technology, Xidian University, Xi'an 710071, China
* Corresponding author. E-mail address: [email protected] (C. Deng).

Article history: Received 8 January 2019; Revised 8 June 2019; Accepted 16 June 2019; Available online 24 June 2019.
Keywords: Multi-view learning; Adaptive graph weighting; Dimensionality reduction; Semi-supervised learning; Unsupervised learning

Abstract

Multi-view learning has become a flourishing topic in recent years, since it can discover various informative structures with respect to disparate statistical properties. However, multi-view data fusion remains challenging when exploring a proper way to find shared yet complementary information. In this paper, we present an adaptive graph weighting scheme to conduct semi-supervised multi-view dimensionality reduction. In particular, we construct a Laplacian graph for each view, and the final graph is approximately regarded as a centroid of these single-view graphs with different weights. Based on the learned graph, a simple yet effective linear regression function is employed to project data into a low-dimensional space. In addition, our proposed scheme can be extended to an unsupervised version within a unified framework. Extensive experiments on varying benchmark datasets illustrate that our proposed scheme is superior to several state-of-the-art semi-supervised/unsupervised multi-view dimensionality reduction methods. Last but not least, we demonstrate that our proposed scheme provides a unified view to explain and understand a family of traditional schemes.

© 2019 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.sigpro.2019.06.026
0165-1684/© 2019 Elsevier B.V. All rights reserved.

1. Introduction

Multi-view learning is of vital significance owing to the abundant information it takes advantage of [1]. Multi-view information can typically be collected by various sensors, such as depth information, infrared data, and RGB data, which are employed in object detection, classification, and scene understanding to boost generalization performance [2–4]. Alternatively, it can also be described by distinct feature subsets, such as HOG, SIFT [5], and GIST [6]. These different features characterize partly independent information from various perspectives [7,8]. Owing to this comprehensive representation capability, multi-view learning has gained increasing popularity [9,10].

Complying with the consensus and complementarity philosophy, multi-view learning strives to combine heterogeneous information into an integrated form, and a collection of approaches has been proposed in the past decades. A naive one is to concatenate all views and apply single-view learning algorithms directly, which neglects the complementary nature and specific statistical properties of the different views. To alleviate this issue, Hotelling [11] proposed canonical correlation analysis (CCA) to leverage two views of the same underlying semantic object to extract a common representation. Co-training [12] is one of the earliest methods for multi-view learning, based on which co-EM [13], co-testing [14], and robust co-training [15] were proposed. These methods use multiple redundant views to learn from the data by training a set of classifiers defined in each view, under the assumption that the multi-view features are conditionally independent. However, in most real-world applications the independence assumption is invalid, so these methods cannot work effectively [16]. From the kernel perspective, multiple kernel functions can be used to map the data of multiple views into a unified space, which can easily and effectively combine the information contained in the various views. As a result, multiple kernel learning approaches [17–19] have gained a lot of attention.

Along with the powerful representation capability of multi-view data, the curse of dimensionality still exists, causing huge computation and memory consumption in real applications. Therefore, reducing dimensionality while retaining the important information contained in the original high-dimensional data becomes necessary [20,21]. Depending on the availability of labeled training examples, dimensionality reduction techniques can be divided into three categories: supervised, semi-supervised, and unsupervised methods [22,23]. When conducting dimensionality reduction, we desire to enhance the discriminative capability of the low-dimensional data. To achieve this goal, a collection of supervised algorithms has been proposed, such as linear discriminant analysis (LDA) [24],




Fig. 1. Comparison among supervised learning, unsupervised learning and semi-supervised learning.


Table 1
Notations.

Notation          Description
N                 Number of total data samples
n                 Number of labeled data samples
m                 Number of views
k                 Number of categories
f_v               Dimensionality of the v-th view
f                 Total dimensionality across all m views
X ∈ R^{f×N}       The training data set
W ∈ R^{f×k}       The projection matrix
b ∈ R^{k×1}       The bias vector
y ∈ R^N           The label vector of the labeled data
Y ∈ R^{N×k}       The coded binary label matrix
F ∈ R^{N×k}       The prediction label matrix
A^v ∈ R^{N×N}     The manually established affinity matrix of the v-th view
S ∈ R^{N×N}       The optimal affinity matrix across all m views
L_S ∈ R^{N×N}     The Laplacian matrix
D ∈ R^{N×N}       The diagonal degree matrix
I ∈ R^{N×N}       The identity matrix
G                 The Laplacian graph
0                 The vector or matrix whose elements are all 0
1                 The vector or matrix whose elements are all 1


local fisher discriminant analysis (LFDA) [25], marginal fisher analysis (MFA) [26,27], and maximum margin criterion (MMC) [28]. Without any label information, unsupervised dimensionality reduction algorithms aim to maintain the original structure of the high-dimensional data to the greatest extent [29]. Some representative methods are principal component analysis (PCA) [30], neighborhood preserving embedding (NPE) [31], locality preserving projections (LPP), and sparsity preserving projections (SPP) [32]. As shown in Fig. 1, semi-supervised algorithms are a compromise between supervised and unsupervised approaches, which focus on learning the intrinsic structure revealed by unlabeled data together with a small amount of labeled data [33]. Semi-supervised dimensionality reduction (SSDR) [34] exploits both cannot-link and must-link constraints together with unlabeled data, while semi-supervised local fisher discriminant analysis (SLFDA) [35] bridges PCA and LFDA, such that the global structure retained by the former in unsupervised scenarios can complement the latter.

In this paper, we focus on the multi-view dimensionality reduction problem and propose a novel Laplacian graph framework to cope with it under both semi-supervised and unsupervised scenarios. The essential motivations are two-fold: (1) adaptively weighted multi-graph integration can learn a persistent manifold and complement incomplete information by assigning a proper significance to each individual graph; (2) graph-based learning can be formulated as a transductive semi-supervised method, so that the labels of unlabeled data can be deduced according to their pairwise similarity to the available labeled ones. In detail, we present a novel graph weighting technique tailored for multi-view input, where the target graph is considered as a centroid of the graphs built for every single view, in terms of their relative significance. A family of weights is assigned to the views and optimized jointly, which frees us from predefining hyper-parameters empirically. Furthermore, our scheme employs a linear regression function for dimensionality reduction and constrains the discrepancy between the low-dimensional representation and the prediction vector on the graph. To this end, dimensionality reduction can be well incorporated into the graph-based semi-supervised multi-view learning process. Extensive experiments on varying benchmark datasets demonstrate that our proposed scheme is superior to several multi-view dimensionality reduction methods, including semi-supervised and unsupervised ones.

2. Related work

In this section, we introduce the notations used throughout the paper, and then briefly review several representative methods.

2.1. Notations

The generally involved notations are summarized in Table 1. Specifically, we write matrices as bold uppercase letters and vectors as bold lowercase letters. Given a matrix W = [W_ij], the i-th row and j-th column are denoted as W^i and W_j, respectively. Y is a binary label matrix with Y_ij = 1 if y_i = j and Y_ij = 0 otherwise. When a data set X and an affinity matrix S are given, an undirected weighted graph G is formulated as G = {X, S}, whose vertices are the data points X_i ∈ X; two vertices (X_i, X_j) are connected by a weighted edge S_ij ∈ S. The Laplacian matrix L_S can be computed by L_S = D − S, where D is a diagonal matrix with diagonal elements D_ii = Σ_j S_ij.

2.2. Dimensionality reduction

Yan et al. [26] developed a unified graph embedding framework for the dimensionality reduction task, which unifies a collection of traditional algorithms such as PCA, LDA, LE, and Eigenmap. In that framework, the statistical or geometric properties of a dimensionality reduction scheme are integrated into direct graph embedding, along with three kinds of transformations of the high-dimensional vector: linear projection, nonlinear kernel projection, and tensorization.

The optimization problem of direct graph embedding is designed as

F = argmin_{F, F^T V F = I} Tr(F^T L_S F),   (1)

where V is another Laplacian matrix that satisfies the constraints V1 = 0 and 1^T V = 0^T, so as to avoid a trivial solution to the objective function. Direct graph embedding is capable of learning an F that lies on a low-dimensional manifold of the training data. The low-dimensional vector F can be computed by linear mapping, F = X^T W; by kernel graph embedding, in which the kernel Gram matrix K is available, F = b^T K; or by tensorization, which is designed for higher-order structured data, F_i = X_i ×_1 W^1 ×_2 W^2 ⋯ ×_n W^n. We refer interested readers to [26] for details. However, the main drawback of direct graph embedding is that it fails to map out-of-sample data points.
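To make problem (1) concrete: it reduces to a generalized eigenvalue problem on the pencil (L_S, V). The NumPy sketch below is an illustration only, not the authors' code; as assumptions, it takes V to be the centering matrix H_c = I − (1/N)11^T (the Laplacian of a uniformly weighted complete graph, so V1 = 0 as required) and adds a small ridge so that the pencil is definite.

```python
import numpy as np

def direct_graph_embedding(S, k, V=None, ridge=1e-8):
    """Sketch of Eq. (1): argmin Tr(F^T L_S F) s.t. F^T V F = I.

    S: (N, N) symmetric affinity matrix; k: embedding dimension.
    V defaults to H_c = I - (1/N) 1 1^T (an assumption for this demo);
    a small ridge keeps the constraint matrix positive definite.
    """
    N = S.shape[0]
    L = np.diag(S.sum(axis=1)) - S            # L_S = D - S
    if V is None:
        V = np.eye(N) - np.ones((N, N)) / N
    R = np.linalg.cholesky(V + ridge * np.eye(N))
    Rinv = np.linalg.inv(R)
    M = Rinv @ L @ Rinv.T                     # whitened symmetric pencil
    M = (M + M.T) / 2
    vals, vecs = np.linalg.eigh(M)            # ascending eigenvalues
    return Rinv.T @ vecs[:, :k]               # k smallest generalized eigenvectors

# toy usage: two loosely linked pairs of points
S = np.array([[0., 1., .1, 0.],
              [1., 0., 0., .1],
              [.1, 0., 0., 1.],
              [0., .1, 1., 0.]])
F = direct_graph_embedding(S, 2)
```

By construction F satisfies the (ridged) constraint F^T V F = I, which is the role the constraint matrix plays in Eq. (1).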


2.3. LGC and GFHF

LGC [36] and GFHF [37] focus on learning a prediction label matrix F ∈ R^{N×k} on a graph. These two algorithms enforce manifold smoothness (i.e., F should be smooth over the entire graph, in terms of both labeled and unlabeled data) and label fitness (i.e., F should be close to the labels of the labeled points). The objective functions E_L and E_G are

E_L(F) = (1/2) Σ_{i,j=1}^{N} || F_i / √(D_ii) − F_j / √(D_jj) ||_2^2 · S_ij + λ · Σ_{i=1}^{N} || F_i − Y_i ||_2^2,

E_G(F) = (1/2) Σ_{i,j=1}^{N} || F_i − F_j ||_2^2 · S_ij + λ_∞ · (1/2) Σ_{i=1}^{N} || F_i − Y_i ||_2^2,   (2)

where λ is the balance coefficient of the two terms, and λ_∞ is a very large number that enforces Σ_{i=1}^{N} || F_i − Y_i ||_2^2 = 0.

2.4. LapRLS/L

LapRLS/L [38] is a manifold regularization-based dimensionality reduction approach associated with regression. The regression function is defined as h(X_i) = W^T X_i + b, where W ∈ R^{f×k} is the mapping matrix and b ∈ R^{k×1} is the bias term. LapRLS/L minimizes the ridge regression errors and preserves the manifold smoothness at the same time, where the loss function is defined as

E_La(W, b) = λ_n ||W||_F^2 + λ_s Tr(W^T X L_S X^T W) + (1/N) Σ_{i=1}^{N} || W^T X_i + b − Y_i^T ||_2^2,   (3)

where λ_n and λ_s are the balance coefficients among the regularization term, the manifold smoothness, and the regression error.
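For concreteness, the smoothness-plus-fitness trade-off in the LGC objective of Eq. (2) admits a closed-form minimizer of the form F = λ'(L_norm + λ'I)^{−1} Y, with the normalized Laplacian L_norm = I − D^{−1/2} S D^{−1/2} and λ' proportional to λ. The sketch below is an illustration (not the original authors' code) that propagates two seed labels over a toy two-cluster graph:

```python
import numpy as np

def lgc_propagate(S, Y, lam=1.0):
    """Closed-form minimizer of an LGC-style objective (cf. Eq. (2)):
    F = lam * (L_norm + lam * I)^{-1} Y, L_norm = I - D^{-1/2} S D^{-1/2}."""
    N = S.shape[0]
    d = S.sum(axis=1)
    Dh = np.diag(1.0 / np.sqrt(d))
    L_norm = np.eye(N) - Dh @ S @ Dh
    return lam * np.linalg.solve(L_norm + lam * np.eye(N), Y)

# two 2-node clusters; one labeled point per cluster
S = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 1.],
              [0., 0., 1., 0.]])
Y = np.array([[1., 0.],   # node 0 labeled as class 0
              [0., 0.],
              [0., 1.],   # node 2 labeled as class 1
              [0., 0.]])
F = lgc_propagate(S, Y)
# the unlabeled nodes (1 and 3) inherit the label of their cluster
```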

3. Methodology

We construct a collection of graphs, one for each view, and the final integrated graph is approximately regarded as a centroid of them. Following the manner of formulating Gaussian similarity from Euclidean distance [39], we initialize an affinity matrix A^v for every single-view graph:

A^v_{i,j} = exp(−||X^v_i − X^v_j||_2^2 / t),  if ||X^v_i − X^v_j||_2^2 ≤ ε;
A^v_{i,j} = 0,  otherwise.   (4)

One element paramount to the success of graph-based multi-view learning is learning an optimally integrated affinity matrix S. We achieve this goal by employing a constraint between the integrated graph and the weighted sum of the multiple single-view graphs, Σ_v d_v ||S − A^v||_F^2. The underlying philosophy is to find a group of parameters d_v that weight every single view's affinity matrix A^v and maximize the matching degree between the integrated affinity matrix S and the view affinity matrices A^v. We employ Tr(F^T L_S F) to smooth the prediction label matrix F, and Tr((F − Y)^T U (F − Y)) to constrain the label fitness in the semi-supervised setting, which implies that F should be close to the semi-supervised label information Y. The matrix U = [I; 0], where I ∈ R^{n×n} and 0 ∈ R^{(N−n)×(N−n)}. In addition, the linear projection function X^T W + 1 b^T is used to map the data X into the prediction label space. Both manifold ranking and linear discriminant projection are used to learn F by constraining the distance between them to be as small as possible. For W, the mapping matrix of X, we add a 2-norm regularization term ||W||_F^2 to avoid over-fitting.

Therein, the loss function is formulated as

E(S, d, W, b, F) = Σ_v d_v ||S − A^v||_F^2 + γ ||d||_2^2 + Tr(F^T L_S F) + Tr((F − Y)^T U (F − Y)) + μ (β ||X^T W + 1 b^T − F||_2^2 + ||W||_F^2),
s.t. S1 = 1, S ≥ 0, d1 = 1, d ≥ 0,   (5)

where d is the view-weight vector and d_v is the weight of the v-th view.
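The graph construction of Eq. (4) and the uniform-weight initialization S = Σ_v d_v A^v used later in Algorithm 1 can be sketched as follows (a minimal NumPy illustration; the bandwidth t and threshold ε are free parameters that the formulation leaves to the user):

```python
import numpy as np

def view_affinity(Xv, t=1.0, eps=4.0):
    """Eq. (4): epsilon-thresholded Gaussian affinity for one view.
    Xv has shape (f_v, N): columns are samples, matching X in R^{f x N}."""
    sq = ((Xv[:, :, None] - Xv[:, None, :]) ** 2).sum(axis=0)  # pairwise squared distances
    return np.where(sq <= eps, np.exp(-sq / t), 0.0)

def init_integrated_graph(views):
    """Uniform view weights d_v = 1/m and S = sum_v d_v A^v (initialization step)."""
    m = len(views)
    d = np.full(m, 1.0 / m)
    A = [view_affinity(Xv) for Xv in views]
    S = sum(dv * Av for dv, Av in zip(d, A))
    return S, A, d

# two toy views of N = 3 samples
X1 = np.array([[0.0, 0.1, 2.0]])            # 1-D view
X2 = np.array([[0.0, 0.2, 1.9],
               [1.0, 1.1, 3.0]])            # 2-D view
S, A, d = init_integrated_graph([X1, X2])
```

Each A^v is symmetric with unit diagonal (a point has zero distance to itself), so the weighted centroid S inherits both properties.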

3.1. Optimization

To solve the problem in Eq. (5), we use an alternating optimization strategy that optimizes one variable while fixing the others. Initializing S = Σ_v d_v A^v, the detailed iterative process is as follows.

1. Fix F, and then compute W and b. Setting the derivatives of the loss function Eq. (5) with respect to b and W to zero, we have

b = (1/N)(F^T 1 − W^T X 1),
W = β (β X (I − (1/N) 1 1^T) X^T + I)^{−1} X (I − (1/N) 1 1^T) F.   (6)

Setting H_c = I − (1/N) 1 1^T and B = β (β X H_c X^T + I)^{−1} X H_c, Eq. (6) is formulated as

W = B F.   (7)

2. Fix W and b, and then compute F. According to Eqs. (6) and (7), we can develop X^T W + 1 b^T as

X^T W + 1 b^T = X^T B F + (1/N)(1 1^T F − 1 1^T X^T B F)
             = (I − (1/N) 1 1^T) X^T B F + (1/N) 1 1^T F
             = (H_c X^T B + (1/N) 1 1^T) F.   (8)

Setting C = H_c X^T B + (1/N) 1 1^T, Eq. (8) is formulated as

X^T W + 1 b^T = C F.   (9)

Substituting the expressions of the optimal W and b into Eq. (5), we can compute F by minimizing the objective function (5) with respect to F:

F = argmin_F Σ_v d_v ||S − A^v||_F^2 + γ ||d||_2^2 + Tr(F^T L_S F) + Tr((F − Y)^T U (F − Y)) + μ (Tr(F^T B^T B F) + β Tr((C F − F)^T (C F − F))).   (10)

By setting the derivative of Eq. (10) with respect to F equal to 0, the prediction matrix F is updated by

F = (U + L_S + μβ (C − I)^T (C − I) + μ B^T B)^{−1} U Y.   (11)

For H_c, we have

H_c H_c = (I − (1/N) 1 1^T)(I − (1/N) 1 1^T)
        = I − (1/N) 1 1^T − (1/N) 1 1^T + (1/N^2) 1 1^T 1 1^T
        = I − (1/N) 1 1^T − (1/N) 1 1^T + (1/N^2) · 1 · N · 1^T
        = H_c = H_c^T.   (12)
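The two closed-form updates above can be checked numerically. The sketch below (an illustration under the paper's definitions, not released code) implements Eqs. (6)-(7) and verifies the identity X^T W + 1 b^T = C F of Eq. (9) on random data:

```python
import numpy as np

def update_W_b(X, F, beta):
    """Eqs. (6)-(7): W = B F with B = beta (beta X H_c X^T + I)^{-1} X H_c,
    and b = (1/N)(F^T 1 - W^T X 1)."""
    f, N = X.shape
    Hc = np.eye(N) - np.ones((N, N)) / N
    B = beta * np.linalg.solve(beta * X @ Hc @ X.T + np.eye(f), X @ Hc)
    W = B @ F
    b = (F.T @ np.ones(N) - W.T @ (X @ np.ones(N))) / N
    return W, b, B

rng = np.random.default_rng(0)
f, N, k, beta = 5, 8, 3, 0.5
X = rng.standard_normal((f, N))
F = rng.standard_normal((N, k))
W, b, B = update_W_b(X, F, beta)

# Eq. (9): X^T W + 1 b^T equals C F with C = H_c X^T B + (1/N) 1 1^T
Hc = np.eye(N) - np.ones((N, N)) / N
C = Hc @ X.T @ B + np.ones((N, N)) / N
assert np.allclose(X.T @ W + np.outer(np.ones(N), b), C @ F)
```

The identity holds exactly (up to floating point), since substituting b into 1 b^T centers X^T W and recovers Eq. (8) term by term.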


We can also easily obtain the following equation:

μβ B^T X H_c X^T B + μ B^T B = μ B^T (β X H_c X^T + I) B = μβ B^T X H_c.   (13)

Moreover, the equation B^T X H_c = H_c X^T B can be derived according to the properties of the matrix transpose. By means of Eq. (12), Eq. (13), and B^T X H_c = H_c X^T B, the term μβ (C − I)^T (C − I) + μ B^T B is rewritten as

μβ (C − I)^T (C − I) + μ B^T B
= μβ (H_c X^T B + (1/N) 1 1^T − I)^T (H_c X^T B + (1/N) 1 1^T − I) + μ B^T B
= μβ (H_c X^T B − H_c)^T (H_c X^T B − H_c) + μ B^T B
= μβ B^T X H_c X^T B − 2μβ H_c X^T B + μβ H_c + μ B^T B
= μβ H_c X^T B − 2μβ H_c X^T B + μβ H_c
= −μβ H_c X^T B + μβ H_c
= −μβ^2 H_c X^T (β X H_c X^T + I)^{−1} X H_c + μβ H_c.   (14)

Defining X_c = X H_c, the prediction label matrix F is finally updated by

F = (U + L_S + μβ H_c − μβ^2 N)^{−1} U Y,   (15)

where N = X_c^T (β X_c X_c^T + I)^{−1} X_c = X_c^T X_c (β X_c^T X_c + I)^{−1}.

3. Fix F and d, then compute S. According to Eq. (5), S does not depend on W or b. Therefore the optimization problem with respect to S is specified as

min_{S1 = 1, S ≥ 0} Σ_v d_v ||S − A^v||_F^2 + Tr(F^T L_S F).   (16)

Our target is to find an optimal integrated affinity matrix S when fixing the prediction label matrix F, so problem (16) becomes

min_{S_{i,j} ≥ 0, S1 = 1} Σ_{v=1}^{m} Σ_{i,j=1}^{N} d_v (S_{i,j} − A^v_{i,j})^2 + Σ_{i,j=1}^{N} ||F_i − F_j||_2^2 S_{i,j}.   (17)

Since Eq. (17) is independent for each i, we solve the following minimization problem separately for each i:

min_{S_{i,j} ≥ 0, S_i 1 = 1} Σ_{j=1}^{N} Σ_{v=1}^{m} d_v (S_{i,j} − A^v_{i,j})^2 + Σ_{j=1}^{N} ||F_i − F_j||_2^2 S_{i,j}.   (18)

For simplicity, we denote u_{i,j} = ||F_i − F_j||_2^2 and let u_i be the vector whose j-th element is u_{i,j}. Replacing ||F_i − F_j||_2^2 with u_{i,j} in problem (18), we have

min_{S_i 1 = 1, S_{i,j} ≥ 0} || S_i − (Σ_v d_v A^v_i − u_i / 2) / (Σ_v d_v) ||_2^2,   (19)

which can be optimized by the projection algorithm of Duchi et al. [40].

4. Fix S, then compute d. The problem can be simplified as

min_{d^T 1 = 1, d_v ≥ 0} Σ_v d_v ||S − A^v||_F^2 + γ ||d||_2^2.   (20)

We define e_v = ||S − A^v||_F^2, so the optimization problem (20) can be converted to

min_{d^T 1 = 1, d_v ≥ 0} Σ_v d_v e_v + γ ||d||_2^2,   (21)

where the second term smooths the weight distribution. Without the regularization term (γ → 0), a trivial solution is obtained in which the best view is assigned weight 1 and all other weights are 0. On the contrary, when γ → ∞, equal weights are obtained. As a result, (21) can be converted to

min_{d^T 1 = 1, d_v ≥ 0} || d + e / (2γ) ||_2^2,   (22)

whose optimization is the same as that for S and can likewise be solved by the algorithm of Duchi et al. [40]. In conclusion, the complete optimization procedure is summarized in Algorithm 1.
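Both subproblems (19) and (22) are Euclidean projections onto the probability simplex, which is why a single routine (the projection algorithm of Duchi et al. [40]) serves both updates. A compact sketch (illustrative; the helper names are our own, not from the paper):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, 1^T x = 1} (Duchi et al. [40])."""
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def update_d(S, A_list, gamma):
    """Problem (22): d = proj_simplex(-e / (2 gamma)), with e_v = ||S - A^v||_F^2."""
    e = np.array([np.linalg.norm(S - Av) ** 2 for Av in A_list])
    return project_simplex(-e / (2.0 * gamma))

def update_S_row(A_list, d, u_i, i):
    """Problem (19) for one row i: project the weighted target row onto the simplex."""
    target = sum(dv * Av[i] for dv, Av in zip(d, A_list))
    return project_simplex((target - 0.5 * u_i) / d.sum())
```

As the derivation predicts, a view whose affinity matrix lies closer to S (smaller e_v) receives a larger weight d_v.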

Algorithm 1. The learning procedure of the semi-supervised scheme.

Input:
  N-sample matrix {X_i}_{i=1}^{N} with m different views;
  affinity matrices {A^v}_{v=1}^{m}, computed for each view by Eq. (4);
  number of labeled samples n (the remaining N − n samples are unlabeled);
  corresponding binary label matrix Y ∈ R^{N×k} with Y_{i,j} = 1 if the i-th sample belongs to the j-th category;
  diagonal matrix U whose first n diagonal elements are 1 and the rest 0;
  balance coefficients μ, β;
  termination parameter ε (usually set to 10^{−4});
  maximum number of iterations t.
1: Initialization: d = (1/m) * ones(m, 1), S = Σ_{v=1}^{m} d_v · A^v
2: for j = 1; j < t; j++ do
3:   Update W and b by Eq. (6) and Eq. (7), respectively;
4:   Update the predicted label matrix F by Eq. (15);
5:   Update the integrated affinity matrix S by solving problem (19);
6:   Update the view weight d by solving problem (22);
7:   if the loss residual between two adjacent iterations is smaller than ε then
8:     break;
9:   end if
10: end for
Output: transformation matrix W, bias vector b, label matrix F, integrated affinity matrix S, and view weight vector d.

3.2. Developing an unsupervised version

We can obtain an unsupervised version of the proposed objective function by setting all elements of U equal to 0. We still focus on optimizing the projection matrix W, the bias term b, and the prediction label matrix F, so Eq. (5) is rewritten as

E_u(S, d, W, b, F) = Σ_v d_v ||S − A^v||_F^2 + γ ||d||_2^2 + Tr(F^T L_S F) + μ (||W||_F^2 + β ||X^T W + 1 b^T − F||_2^2),
s.t. S1 = 1, S ≥ 0, d1 = 1, d ≥ 0, F^T V F = I.   (23)

For the prediction label matrix F, we add the constraint F^T V F = I to make F lie on a sphere after the centering operation, which avoids the trivial solution F = 0 [39,41]. We set V = H_c in this paper. Actually, the objective function (23) is a general formulation, which can easily be developed into a supervised problem by using different L_S and V.

Following the optimization process of the semi-supervised dimensionality reduction, we set the derivatives of the objective function in Eq. (23) with respect to W and b to zero; W and b can then be updated by Eq. (7). We can further rewrite Eq. (23) by replacing W and b:

E_u(S, d, F) = Σ_v d_v ||S − A^v||_F^2 + γ ||d||_2^2 + Tr(F^T L_S F) + μ (Tr(F^T B^T B F) + β Tr((C F − F)^T (C F − F))),
s.t. F^T H_c F = I.   (24)

Fix S and d, then compute F. F has no relation to the first term, and thus we can update F by minimizing Eq. (24) with respect to F:

F = argmin_{F, F^T H_c F = I} Tr(F^T L_S F) + μ (Tr(F^T B^T B F) + β Tr((C F − F)^T (C F − F))).   (25)

According to Eq. (14), Eq. (25) develops into

F = argmin_{F, F^T H_c F = I} Tr(F^T (L_S + μβ H_c − μβ^2 N) F)
  = argmin_{F, F^T H_c F = I} Tr(F^T (L_S − μβ^2 N) F),   (26)

where the term μβ Tr(F^T H_c F) = μβ k is constant under the constraint and can therefore be dropped. Problem (26) can be optimized by generalized eigenvalue decomposition [26]. S and d are still updated by solving problems (19) and (22), respectively. In conclusion, the optimization algorithm of the unsupervised scheme is shown in Algorithm 2.
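Concretely, problem (26) asks for the k eigenvectors of the pencil (L_S − μβ² N, H_c) with smallest generalized eigenvalues. Since H_c is singular (1 lies in its null space), the sketch below adds a small ridge so the pencil is definite; this is a practical workaround of ours, not part of the paper's derivation:

```python
import numpy as np

def unsupervised_F(X, S, mu, beta, k, ridge=1e-8):
    """Sketch of Eq. (26): argmin Tr(F^T (L_S - mu beta^2 N) F) s.t. F^T H_c F = I."""
    Nsamp = S.shape[0]
    Hc = np.eye(Nsamp) - np.ones((Nsamp, Nsamp)) / Nsamp
    L = np.diag(S.sum(axis=1)) - S
    Xc = X @ Hc
    Nmat = Xc.T @ np.linalg.solve(beta * Xc @ Xc.T + np.eye(X.shape[0]), Xc)
    M = L - mu * beta**2 * Nmat
    # generalized eigenproblem M f = lam (H_c + ridge I) f, solved by whitening
    R = np.linalg.cholesky(Hc + ridge * np.eye(Nsamp))
    Rinv = np.linalg.inv(R)
    Mw = Rinv @ ((M + M.T) / 2) @ Rinv.T
    vals, vecs = np.linalg.eigh(Mw)
    return Rinv.T @ vecs[:, :k]        # k smallest generalized eigenvectors
```

The returned F satisfies the (ridged) constraint F^T H_c F = I by construction.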

Algorithm 2. The learning procedure of the unsupervised scheme.

Input:
  N-sample matrix {X_i}_{i=1}^{N} with m different views;
  affinity matrices {A^v}_{v=1}^{m}, computed for each view by Eq. (4);
  balance coefficients μ, β;
  termination parameter ε (usually set to 10^{−4});
  maximum number of iterations t.
1: Initialization: d = (1/m) * ones(m, 1), S = Σ_{v=1}^{m} d_v · A^v
2: for j = 1; j < t; j++ do
3:   Update W and b by Eq. (6) and Eq. (7), respectively;
4:   Update the predicted label matrix F by solving problem (26) via generalized eigenvalue decomposition;
5:   Update the integrated affinity matrix S by solving problem (19);
6:   Update the view weight d by solving problem (22);
7:   if the loss residual between two adjacent iterations is smaller than ε then
8:     break;
9:   end if
10: end for
Output: transformation matrix W, bias vector b, label matrix F, integrated affinity matrix S, and view weight vector d.

4. Theoretical analysis

In this section, we analyze the convergence and complexity of our proposed method.

4.1. Convergence analysis

We first prove the convergence of Algorithm 1, which naturally implies the convergence of Algorithm 2. We begin by proving that the objective function is convex w.r.t. F, W, and b. We formulate E_1(W, b, F) as

E_1(W, b, F) = Tr(F^T L_S F) + Tr((F − Y)^T U (F − Y)) + μ (β ||X^T W + 1 b^T − F||_2^2 + ||W||_F^2).   (27)

Proof. We know that U, L_S ∈ R^{N×N}, F, Y ∈ R^{N×k}, W ∈ R^{f×k}, and b ∈ R^{k×1}. We claim that if the matrices U and L_S are positive semi-definite, μ ≥ 0, and β ≥ 0, then E_1(W, b, F) is jointly convex with respect to F, W, and b. Removing the constant term Tr(Y^T U Y) from E_1(W, b, F), it can be rewritten in matrix form as

E_1(W, b, F) = Tr( [F; W; b^T]^T P [F; W; b^T] ) − Tr( [F; W; b^T]^T [2UY; 0; 0] ),   (28)

where

P = [ μβ I + L_S + U    −μβ X^T           −μβ 1    ;
      −μβ X             μ I + μβ X X^T     μβ X 1  ;
      −μβ 1^T           μβ 1^T X^T         μβ N    ].   (29)

Therefore, in order to prove that E_1(W, b, F) is jointly convex w.r.t. F, W, and b, we only need to prove that the matrix P is positive semi-definite. For any vector z = [z_1^T, z_2^T, z_3]^T ∈ R^{(N+f+1)×1}, where z_1 ∈ R^{N×1}, z_2 ∈ R^{f×1}, and z_3 is a scalar, we have

z^T P z = z_1^T (μβ I + L_S + U) z_1 − 2μβ z_1^T X^T z_2 − 2μβ z_1^T 1 z_3
          + z_2^T (μ I + μβ X X^T) z_2 + 2μβ z_2^T X 1 z_3 + μβ N z_3^2
        = z_1^T (L_S + U) z_1 + μ z_2^T z_2 + μβ (z_1^T z_1 − 2 z_1^T X^T z_2 − 2 z_1^T 1 z_3 + z_2^T X X^T z_2 + 2 z_2^T X 1 z_3 + N z_3^2)
        = z_1^T (L_S + U) z_1 + μ z_2^T z_2 + μβ (z_1 − X^T z_2 − 1 z_3)^T (z_1 − X^T z_2 − 1 z_3).   (30)

So if U and L_S are positive semi-definite, μ ≥ 0, and β ≥ 0, then z^T P z ≥ 0 for any z, and thus P is positive semi-definite. Therefore, E_1(W, b, F) is jointly convex w.r.t. F, W, and b.
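The positive semi-definiteness of P in Eq. (29) can also be confirmed numerically. The sketch below (an illustration only) builds P for random data with a valid graph Laplacian and label-indicator U, and checks its smallest eigenvalue:

```python
import numpy as np

def build_P(X, L, U, mu, beta):
    """Assemble the block matrix P of Eq. (29)."""
    f, N = X.shape
    one = np.ones((N, 1))
    top = np.hstack([mu * beta * np.eye(N) + L + U, -mu * beta * X.T, -mu * beta * one])
    mid = np.hstack([-mu * beta * X, mu * np.eye(f) + mu * beta * X @ X.T, mu * beta * X @ one])
    bot = np.hstack([-mu * beta * one.T, mu * beta * one.T @ X.T, mu * beta * N * np.ones((1, 1))])
    return np.vstack([top, mid, bot])

rng = np.random.default_rng(1)
f, N = 3, 6
X = rng.standard_normal((f, N))
S = np.abs(rng.standard_normal((N, N))); S = (S + S.T) / 2
L = np.diag(S.sum(axis=1)) - S                    # a PSD graph Laplacian
U = np.diag([1., 1., 0., 0., 0., 0.])             # first n = 2 samples labeled
P = build_P(X, L, U, mu=0.7, beta=0.3)
assert np.linalg.eigvalsh(P).min() >= -1e-8       # P is positive semi-definite
```

This matches Eq. (30): the quadratic form decomposes into three non-negative pieces, so the minimum eigenvalue cannot be negative.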

Since E_1(W, b, F) is jointly convex w.r.t. F, W, and b, we have

∑_v (d^v)_{t+1} ‖S_{t+1} − A^v‖²_F + Tr(F_{t+1}^T L_{S_{t+1}} F_{t+1}) + g(W_{t+1}, b_{t+1}, F_{t+1}) ≤ ∑_v (d^v)_{t+1} ‖S_{t+1} − A^v‖²_F + Tr(F_t^T L_{S_{t+1}} F_t) + g(W_t, b_t, F_t),   (31)

where g(W, b, F) = E_1(W, b, F) − Tr(F^T L_S F). For the affinity matrix S, assume that S_{t+1} is the value updated from S_t; hence we have

S_{t+1} = argmin_{S_{i,j} ≥ 0, S1 = 1} ∑_v d^v ‖S − A^v‖²_F + Tr(F_t^T L_S F_t).   (32)

And we only need to prove

∑_v (d^v)_{t+1} ‖S_{t+1} − A^v‖²_F + Tr(F_t^T L_{S_{t+1}} F_t) + g(W_t, b_t, F_t) ≤ ∑_v (d^v)_t ‖S_t − A^v‖²_F + Tr(F_t^T L_{S_t} F_t) + g(W_t, b_t, F_t).   (33)

Removing the common term g(W_t, b_t, F_t), the inequality in (33) can be rewritten as

∑_v d^v ‖S_{t+1} − A^v‖²_F + Tr(F_t^T L_{S_{t+1}} F_t) ≤ ∑_v d^v ‖S_t − A^v‖²_F + Tr(F_t^T L_{S_t} F_t).   (34)


X. Xu, Y. Yang and C. Deng et al. / Signal Processing 165 (2019) 186–196 191


According to the iteration step of d in Algorithm 1, we have (d^v)_t = 1/(2‖S_t − A^v‖_F), and then we can know that

∑_v ‖S_{t+1} − A^v‖²_F / (2‖S_t − A^v‖_F) + Tr(F_t^T L_{S_{t+1}} F_t) ≤ ∑_v ‖S_t − A^v‖²_F / (2‖S_t − A^v‖_F) + Tr(F_t^T L_{S_t} F_t).   (35)

Nie [42] has proved that for any nonzero values F and F_t, the following inequality holds:

(√F − √F_t)² ≥ 0 ⇒ F − 2√(F F_t) + F_t ≥ 0 ⇒ √F − F/(2√F_t) ≤ √F_t − F_t/(2√F_t).   (36)
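This scalar inequality is easy to verify numerically; a minimal check over random positive values F and F_t (a sketch, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(1)
F = rng.random(1000) + 1e-6    # arbitrary positive values
Ft = rng.random(1000) + 1e-6
lhs = np.sqrt(F) - F / (2 * np.sqrt(Ft))
rhs = np.sqrt(Ft) - Ft / (2 * np.sqrt(Ft))
print(bool(np.all(lhs <= rhs + 1e-12)))  # True
```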

Therefore, we have

∑_v ‖S_{t+1} − A^v‖_F − ∑_v ‖S_{t+1} − A^v‖²_F / (2‖S_t − A^v‖_F) ≤ ∑_v ‖S_t − A^v‖_F − ∑_v ‖S_t − A^v‖²_F / (2‖S_t − A^v‖_F).   (37)

Adding the above two inequalities, we can easily determine that

∑_v ‖S_{t+1} − A^v‖_F + Tr(F_t^T L_{S_{t+1}} F_t) ≤ ∑_v ‖S_t − A^v‖_F + Tr(F_t^T L_{S_t} F_t).   (38)

If we denote d^v_{t+1} = 1/(2‖S_{t+1} − A^v‖_F), we will have the following result according to (20) and (38):

∑_v d^v_{t+1} ‖S_{t+1} − A^v‖²_F + γ‖d_{t+1}‖²_2 + Tr(F_t^T L_{S_{t+1}} F_t) ≤ ∑_v d^v_t ‖S_t − A^v‖²_F + γ‖d_t‖²_2 + Tr(F_t^T L_{S_t} F_t).   (39)

Finally, combining inequalities (33) and (39), we have

∑_v (d^v)_{t+1} ‖S_{t+1} − A^v‖²_F + γ‖d_{t+1}‖²_2 + Tr(F_{t+1}^T L_{S_{t+1}} F_{t+1}) + g(W_{t+1}, b_{t+1}, F_{t+1}) ≤ ∑_v (d^v)_t ‖S_t − A^v‖²_F + γ‖d_t‖²_2 + Tr(F_t^T L_{S_t} F_t) + g(W_t, b_t, F_t).   (40)

That is to say, the objective in Eq. (5) will converge to a locally optimal solution.
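The core of this proof is the standard re-weighting trick: alternating d^v = 1/(2‖S − A^v‖_F) with a weighted centroid update monotonically decreases ∑_v ‖S − A^v‖_F. A toy unconstrained sketch (ignoring the S1 = 1, S ≥ 0 constraints and the graph term of the actual algorithm) illustrates the monotone decrease:

```python
import numpy as np

rng = np.random.default_rng(0)
A = [rng.random((6, 6)) for _ in range(4)]   # per-view affinity graphs
S = sum(A) / len(A)                          # initialize at the plain average

obj = []
for _ in range(20):
    d = [1.0 / (2 * np.linalg.norm(S - Av)) for Av in A]   # weight update
    S = sum(dv * Av for dv,Av in zip(d, A)) / sum(d)      # centroid update
    obj.append(sum(np.linalg.norm(S - Av) for Av in A))

print(all(x >= y - 1e-12 for x, y in zip(obj, obj[1:])))  # non-increasing
```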

4.2. Complexity analysis

Now we consider the computational complexity of our proposed method. Specifically, the computational complexities of steps 3, 4, 5, and 6 in Algorithm 1 scale as O(fN² + Nf² + fNk), O(N³ + N²k), O(N), and O(m), respectively. When the feature dimension f is the largest among the number of classes k, the number of views m, and the number of samples N, the computational complexity of Algorithm 1 scales as O(Nf²); otherwise, the computational complexity scales as O(N³) if N is the largest.

5. Relations to previous algorithms

In this section, we first develop our proposed framework into different classical models associated with multi-view information fusion: the semi-supervised algorithms LGC [36], GFHF [37], and LapRLS/L [38]. We further discuss the relation of our proposed unsupervised form to the graph embedding framework [26] and spectral regression [43].

5.1. Relation to other semi-supervised learning algorithms

• Relation 1: Ours and LGC, GFHF.

We first rewrite the two objective functions of LGC and GFHF into a unified formulation:

Tr(F^T P F) + Tr((F − Y)^T U(F − Y)),   (41)

where P ∈ R^{N×N} is a normalized graph Laplacian matrix and U is a diagonal matrix. For LGC, the diagonal elements are all λ, whereas for GFHF the first n diagonal elements are ∞ and the remaining N − n are zero.

If we set μ = 0, the objective function (5) turns into

E(F, S) = ‖S − ∑_v d^v A^v‖²_F + Tr(F^T L_S F) + Tr((F − Y)^T U(F − Y)),   (42)

which is a general form of LGC and GFHF combined with multi-view integration. It is easily concluded that LGC and GFHF are two special cases of our scheme when taking the same strategy for multi-view data integration.
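For reference, the unified objective (41) has the closed-form minimizer F = (P + U)^{-1} U Y, obtained by setting the gradient 2PF + 2U(F − Y) to zero. A minimal sketch on a toy graph, with the GFHF-style infinite weights approximated by a large finite value (the graph and labels are illustrative stand-ins):

```python
import numpy as np

def propagate(P, U, Y):
    """Minimize Tr(F^T P F) + Tr((F-Y)^T U (F-Y)); zeroing the
    gradient 2PF + 2U(F-Y) gives the linear system (P + U) F = U Y."""
    return np.linalg.solve(P + U, U @ Y)

# toy example: 4 nodes, 2 classes, first 2 nodes labeled
rng = np.random.default_rng(0)
W = rng.random((4, 4)); W = (W + W.T) / 2
D = np.diag(W.sum(1))
P = D - W                       # (unnormalized) graph Laplacian
U = np.diag([1e6, 1e6, 0, 0])   # GFHF-style: huge weight on labeled nodes
Y = np.array([[1, 0], [0, 1], [0, 0], [0, 0]], float)
F = propagate(P, U, Y)
print(np.allclose(F[:2], Y[:2], atol=1e-4))  # labeled rows are clamped
```

With U = λI instead, the same solver reproduces the LGC-style solution.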

• Relation 2: LapRLS/L and Ours.

Setting μ = λ_1/λ_2 and β → ∞ (that is, μβ → ∞) in Eq. (5), we have F = X^T W + 1b^T. Eq. (5) is then rewritten as

E(W, b, S) = ∑_v d^v ‖S − A^v‖²_F + μ‖W‖²_F + Tr((X^T W + 1b^T)^T L_S (X^T W + 1b^T)) + Tr((X^T W + 1b^T − Y)^T U(X^T W + 1b^T − Y)).   (43)

If we set the first n and the remaining N − n diagonal elements of the matrix U to 1/(nλ_s) and 0 respectively, then E(W, b, S) − ∑_v d^v ‖S − A^v‖²_F is equal to (1/λ_s) E_{La}(W, b) in Eq. (3). Therefore, the approach LapRLS/L is another special case of our proposed scheme for multi-view data.

5.2. Relation to graph embedding framework

• Relation 3: Direct Graph Embedding and Ours.

As far as we know, the objective function of a direct graph embedding framework is formulated as Tr(F^T L_S F). Setting μ = 0 in (23), our scheme reduces to ∑_v d^v ‖S − A^v‖²_F + γ‖d‖²_2 + Tr(F^T L_S F), which is identical to the graph embedding framework combined with our multi-view data integration algorithm. Therefore, the graph embedding framework under the multi-view scenario is a special case of our framework.
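Concretely, minimizing Tr(F^T L_S F) subject to F^T F = I is solved by the k eigenvectors of L_S with the smallest eigenvalues; a minimal spectral-embedding sketch on a random graph (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((8, 8)); W = (W + W.T) / 2   # symmetric affinity
L = np.diag(W.sum(1)) - W                   # graph Laplacian
k = 3

# minimizing Tr(F^T L F) s.t. F^T F = I -> k smallest eigenvectors of L
vals, vecs = np.linalg.eigh(L)              # eigh returns ascending eigenvalues
F = vecs[:, :k]
print(np.allclose(F.T @ F, np.eye(k)))  # True: columns are orthonormal
```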

6. Experiment

In this section, we evaluate the effectiveness of our proposed scheme on four public benchmark datasets for both semi-supervised and unsupervised multi-view dimensionality reduction tasks. We first introduce the experimental setup, which covers the datasets, baselines, and evaluation metrics. We then analyze the quantitative results for both tasks. Finally, the hyperparameters are discussed.


Table 2
Datasets description with the details of feature types, number of classes, and number of images.

View      COIL            UMIST           YALE-B          CMU
1         Color/1050      Color/1050      Color/210       Color/420
2         GIST/512        GIST/512        GIST/512        GIST/512
3         HOG_2×2/1050    HOG_2×2/1050    HOG_2×2/1050    HOG_2×2/1050
4         HOG_3×3/1050    HOG_3×3/1050    HOG_3×3/210     HOG_3×3/1050
5         LBP/1239        LBP/1239        LBP/1239        LBP/1239
6         SIFT/1050       SIFT/1050       SIFT/1050       SIFT/1050
#Classes  20              20              38              68
#Images   1440            575             2432            3329

Fig. 2. Ten randomly selected images of CMU, YALE-B, COIL, and UMIST, from top to bottom respectively.


6.1. Experimental setup

A. Benchmark Datasets

To evaluate the proposed scheme, four widely used image datasets are involved: one for objects and the remaining three for human faces, namely CMU PIE, COIL-20, UMIST, and YALE-B.

The CMU PIE dataset [44] includes about 40,000 facial samples of 68 identities. The images were collected over diverse poses, under various illumination, and with different facial expressions. In our experiment, we select the images from the frontal pose; each identity has about 49 images under varying illumination conditions and facial expressions. We call this dataset CMU for short in the following.

COIL-20 [45] consists of images of 20 objects; each object has 72 images acquired from diverse angles at five-degree intervals. We call this dataset COIL for short in the following.

UMIST [46] consists of 575 multi-view samples of 20 subjects, covering a variety of poses from profile to frontal views.

YALE-B [47] contains 38 identities used in this experiment, with each subject having about 64 nearly frontal images under various illumination conditions.

A collection of different feature descriptors is extracted to represent the samples, and every single descriptor is regarded as one view. Referring to the Computer Vision Feature Extraction Toolbox,1 we extract 6 kinds of features in total: 'color', 'gist', 'hog2x2', 'hog3x3', 'lbp', and 'sift'; thus a 6-view representation is obtained. Detailed information on the datasets and features is summarized in Table 2, and Fig. 2 gives some examples from the four datasets. In addition, we adopt a cross-validation protocol which randomly chooses half of the samples or 5 samples per category for training, corresponding to the semi-supervised and unsupervised tasks respectively, with the remainder for testing; this process is repeated 20 times. We report the mean accuracy and standard deviation as the quantitative result.

B. Baselines

1 https://github.com/adikhosla/feature-extraction/ .


There are two sets of baselines, separately for semi-supervised and unsupervised multi-view dimensionality reduction. For the semi-supervised dimensionality reduction task, we compare our approach with 10 state-of-the-art methods: (a) Margin Fisher Analysis (MFA) [26], (b) Gaussian Fields and Harmonic Functions (GFHF) [37], (c) Local and Global Consistency (LGC) [36], (d) Transductive Component Analysis (TCA) [48], (e) Semi-supervised Discriminant Analysis (SDA) [49], (f) Linear Laplacian Regularized Least Squares (LapRLS/L) [38], (g) Flexible Manifold Embedding (FME) [50], (h) Multiple View Semi-supervised Dimensionality Reduction (MVSSDR), (i) Parameter-Free Auto-Weighted Multiple Graph Learning: A Framework for Multiview Clustering and Semi-Supervised Classification (FMCSC) [51], and (j) Multi-view Clustering and Semi-supervised Classification with Adaptive Neighbours (MVAN) [8]. For the unsupervised multi-view dimensionality reduction task, we compare our scheme with (a) Principal Component Analysis (PCA) [30], (b) Locality Preserving Projections (LPP) [41], and (c) Spectral Regression LPP (LPP-SR) [43].

C. Evaluation Metrics

Classification accuracy is applied as the evaluation metric throughout the experiments, which is defined as

Accuracy = (∑_{i=1}^n δ(y_i, ŷ_i)) / n,   (44)

where y_i is the ground-truth label of each sample, ŷ_i is the corresponding predicted label, and n is the number of test samples. The function δ(·, ·) compares its two arguments, returning "1" if they are equal and "0" otherwise.
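Eq. (44) amounts to the fraction of exact label matches; a one-line sketch:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Eq. (44): fraction of test samples whose predicted label matches
    the ground truth (delta returns 1 on a match, 0 otherwise)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

print(accuracy([0, 1, 2, 1], [0, 2, 2, 1]))  # 0.75
```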

6.2. Semi-supervised experiments

To verify the superiority of our proposed semi-supervised dimensionality reduction algorithm, only p samples per category are given label information [49] in the training set. We report recognition accuracy for the unlabeled data and the testing set under three different values of p (p = 1/2/3), denoted as 'Unlabeled' and 'Test'. There are two parameters in our proposed algorithm: μ and β. For fair comparison, we set the range of the candidate values to


Fig. 3. The convergence curves on the four datasets. Our proposed algorithm converges stably within a few iterations.

Table 3
Recognition performance (mean accuracy ± standard deviation, %) on CMU over 20 random splits. We report the results on unlabeled samples and testing samples under three settings (1/2/3 labeled sample(s) per category). The best results are underlined.

Data Method 1 labeled 2 labeled 3 labeled

Unlabeled Test Unlabeled Test Unlabeled Test

CMU MFA – – 78.1 ± 3.6 76.8 ± 2.5 85.0 ± 2.9 86.1 ± 1.8

GFHF 40.9 ± 3.3 – 57.8 ± 2.6 – 65.8 ± 3.1 –

LGC 39.5 ± 3.1 – 50.9 ± 2.3 – 57.9 ± 0.9 –

TCA 62.3 ± 3.7 63.5 ± 3.7 79.9 ± 2.5 83.4 ± 2.3 86.9 ± 1.1 88.6 ± 1.1

SDA 63.4 ± 3.2 69.1 ± 2.8 83.5 ± 2.1 84.2 ± 2.0 89.6 ± 1.2 89.8 ± 1.1

LapRLS/L 59.4 ± 3.0 54.5 ± 2.8 84.1 ± 2.2 84.0 ± 1.8 89.8 ± 1.1 89.7 ± 1.1

FME 65.2 ± 2.5 64.7 ± 2.4 83.6 ± 2.0 84.5 ± 1.6 89.8 ± 1.1 91.9 ± 1.0

MVSSDR 73.1 ± 3.1 71.7 ± 2.9 83.9 ± 2.1 85.2 ± 2.0 89.0 ± 1.7 89.9 ± 0.9

FMCSC 72.3 ± 2.9 70.3 ± 2.9 85.6 ± 2.2 83.2 ± 2.1 89.5 ± 1.9 89.3 ± 1.4

MVAN 73.8 ± 3.9 72.0 ± 3.7 86.3 ± 2.9 85.2 ± 2.9 90.1 ± 1.3 88.6 ± 1.2

Ours/v1 71.9 ± 2.1 70.4 ± 1.9 83.8 ± 1.4 82.4 ± 1.3 89.7 ± 0.9 88.9 ± 1.2

Ours/v2 60.0 ± 1.9 58.5 ± 1.6 76.6 ± 1.5 75.0 ± 1.5 84.9 ± 1.4 83.5 ± 1.6

Ours/v3 56.0 ± 2.3 55.0 ± 1.7 71.5 ± 1.7 69.9 ± 1.5 80.0 ± 1.5 78.9 ± 1.6

Ours/v4 44.1 ± 1.9 42.8 ± 1.4 61.5 ± 2.1 59.1 ± 1.4 72.5 ± 1.4 71.1 ± 2.0

Ours/v5 59.5 ± 2.3 58.3 ± 1.7 77.0 ± 1.2 75.6 ± 1.6 86.0 ± 1.3 85.0 ± 1.4

Ours/v6 42.9 ± 3.1 40.4 ± 1.9 60.8 ± 1.4 65.2 ± 1.3 87.7 ± 0.9 88.9 ± 1.3

Ours 75.2 ± 1.8 72.0 ± 1.7 89.2 ± 1.2 87.1 ± 1.3 94.5 ± 0.8 93.2 ± 1.1

Table 4
Recognition performance (mean accuracy ± standard deviation, %) on COIL over 20 random splits. We report the results on unlabeled samples and testing samples under three settings (1/2/3 labeled sample(s) per category). The best results are underlined.

Data Method 1 labeled 2 labeled 3 labeled

Unlabeled Test Unlabeled Test Unlabeled Test

COIL MFA – – 71.1 ± 2.7 72.1 ± 3.2 75.5 ± 2.4 75.2 ± 2.2

GFHF 77.7 ± 2.0 – 84.2 ± 2.1 – 86.5 ± 2.0 –

LGC 79.4 ± 2.7 – 81.9 ± 2.1 – 84.9 ± 2.2 –

TCA 71.5 ± 2.8 71.4 ± 2.7 79.1 ± 2.8 78.9 ± 2.1 81.6 ± 2.2 81.5 ± 2.8

SDA 59.9 ± 2.3 59.8 ± 2.2 74.1 ± 2.6 74.3 ± 2.1 79.3 ± 2.2 78.7 ± 2.4

LapRLS/L 64.5 ± 3.2 63.4 ± 3.3 75.6 ± 2.9 73.7 ± 2.5 79.6 ± 2.6 78.8 ± 2.5

FME 76.5 ± 3.4 77.7 ± 3.0 84.5 ± 2.9 86.9 ± 3.4 88.1 ± 2.3 90.6 ± 2.6

MVSSDR 76.2 ± 3.8 76.7 ± 3.5 85.4 ± 2.0 86.7 ± 2.3 90.0 ± 1.7 91.9 ± 1.1

FMCSC 79.3 ± 2.5 77.3 ± 2.4 88.7 ± 2.0 88.7 ± 2.0 91.8 ± 2.4 91.0 ± 2.0

MVAN 81.1 ± 2.7 81.6 ± 2.2 89.4 ± 2.5 88.9 ± 2.3 91.5 ± 1.6 90.3 ± 1.8

Ours/v1 76.8 ± 3.4 76.7 ± 3.0 85.8 ± 2.3 85.8 ± 2.4 88.5 ± 1.4 89.1 ± 1.6

Ours/v2 29.7 ± 12.1 39.1 ± 12.8 21.9 ± 9.5 33.9 ± 9.5 58.2 ± 7.5 70.6 ± 5.7

Ours/v3 71.4 ± 3.5 71.7 ± 3.6 79.1 ± 2.7 80.1 ± 2.2 85.1 ± 2.3 86.0 ± 2.7

Ours/v4 65.2 ± 4.6 66.8 ± 4.1 75.2 ± 2.5 77.4 ± 2.4 81.7 ± 2.3 83.9 ± 2.3

Ours/v5 22.9 ± 3.6 52.3 ± 3.0 23.4 ± 4.6 67.7 ± 2.9 18.2 ± 2.0 77.6 ± 3.2

Ours/v6 75.2 ± 3.0 74.7 ± 2.8 82.8 ± 2.1 83.0 ± 2.2 87.0 ± 1.7 86.9 ± 1.9

Ours 82.7 ± 2.0 83.2 ± 2.2 90.9 ± 1.7 91.3 ± 2.3 94.2 ± 1.8 94.9 ± 2.1


{10^{−2}, 10^{−1}, 10^{0}, 10^{1}, 10^{2}} and pick the best performance for all of these methods.

We plot the convergence curves for the four datasets in Fig. 3, which illustrates that our proposed algorithm converges quickly and stably. The fast convergence benefits from using a linear transformation function for dimensionality reduction.

We use k-nearest-neighbor classification on the samples after dimensionality reduction, and Tables 3–6 report the mean top-1 recognition accuracy ± standard deviation over 20 random splits for the four datasets. We have the following observations:

a. Compared with the traditional semi-supervised methods, our proposed scheme significantly outperforms them no matter how many labeled samples are given.

b. Our scheme achieves better performance than any single view in most cases, which demonstrates that our proposed multi-


Table 5
Recognition performance (mean accuracy ± standard deviation, %) on UMIST over 20 random splits. We report the results on unlabeled samples and testing samples under three settings (1/2/3 labeled sample(s) per category). The best results are underlined.

Data Method 1 labeled 2 labeled 3 labeled

Unlabeled Test Unlabeled Test Unlabeled Test

UMIST MFA – – 72.5 ± 5.2 73.8 ± 3.1 86.1 ± 2.9 84.6 ± 3.2

GFHF 65.6 ± 5.2 – 78.1 ± 3.9 – 87.9 ± 2.5 –

LGC 66.5 ± 5.5 – 82.3 ± 2.6 – 83.8 ± 3.7 –

TCA 65.2 ± 5.2 64.9 ± 5.1 79.9 ± 4.1 81.9 ± 4.2 83.9 ± 4.1 83.6 ± 3.9

SDA 58.2 ± 5.3 58.9 ± 4.1 75.9 ± 3.6 76.1 ± 3.5 85.1 ± 3.3 86.7 ± 3.5

LapRLS/L 56.1 ± 5.6 58.9 ± 5.1 76.9 ± 4.3 76.3 ± 4.2 85.1 ± 2.9 84.7 ± 2.1

FME 65.0 ± 4.5 67.1 ± 5.4 82.7 ± 4.3 79.9 ± 4.2 88.9 ± 2.2 87.1 ± 3.1

MVSSDR 71.4 ± 3.5 73.5 ± 2.7 85.0 ± 2.9 83.0 ± 2.7 87.0 ± 1.7 86.9 ± 1.9

FMCSC 75.5 ± 3.3 75.9 ± 3.1 87.1 ± 2.9 85.6 ± 2.5 89.9 ± 1.0 91.2 ± 1.0

MVAN 75.7 ± 3.3 76.7 ± 3.8 85.5 ± 3.0 87.3 ± 2.9 90.1 ± 2.3 89.9 ± 3.3

Ours/v1 75.7 ± 4.0 75.8 ± 5.0 89.6 ± 2.6 89.3 ± 3.1 92.5 ± 3.1 93.1 ± 3.1

Ours/v2 36.9 ± 7.4 41.1 ± 5.0 66.6 ± 3.8 68.3 ± 2.9 73.7 ± 4.0 79.5 ± 4.2

Ours/v3 46.4 ± 3.6 45.4 ± 3.4 68.34 ± 4.2 66.3 ± 3.4 77.3 ± 4.1 77.1 ± 3.8

Ours/v4 48.7 ± 3.7 46.8 ± 3.8 69.2 ± 3.7 68.2 ± 3.2 78.7 ± 4.6 78.0 ± 3.1

Ours/v5 8.7 ± 3.3 25.6 ± 4.5 21.7 ± 5.9 55.8 ± 4.3 30.7 ± 6.0 72.3 ± 3.5

Ours/v6 46.0 ± 3.6 45.5 ± 3.2 68.2 ± 4.8 67.0 ± 3.5 78.9 ± 4.2 79.3 ± 3.3

Ours 77.1 ± 4.8 77.0 ± 5.3 88.8 ± 3.0 89.4 ± 2.5 91.6 ± 3.2 92.9 ± 2.8

Table 6
Recognition performance (mean accuracy ± standard deviation, %) on YALE-B over 20 random splits. We report the results on unlabeled samples and testing samples under three settings (1/2/3 labeled sample(s) per category). The best results are underlined.

Data Method 1 labeled 2 labeled 3 labeled

Unlabeled Test Unlabeled Test Unlabeled Test

YALE-B MFA – – 51.6 ± 4.8 49.1 ± 4.2 69.9 ± 2.4 70.6 ± 2.8

GFHF 28.5 ± 2.5 – 45.9 ± 3.3 – 55.2 ± 3.5 –

LGC 32.2 ± 3.2 – 45.1 ± 3.1 – 49.6 ± 3.3 –

TCA 42.5 ± 3.0 42.6 ± 3.1 73.6 ± 3.3 75.4 ± 3.1 82.5 ± 2.9 84.2 ± 2.3

SDA 39.2 ± 3.0 43.0 ± 3.1 75.2 ± 2.6 75.9 ± 3.2 84.8 ± 2.1 85.0 ± 2.1

LapRLS/L 55.1 ± 3.2 51.2 ± 3.3 74.1 ± 3.2 75.6 ± 2.9 83.2 ± 2.6 85.8 ± 2.1

FME 57.9 ± 3.3 51.2 ± 2.9 77.1 ± 2.2 75.0 ± 2.6 86.9 ± 2.1 87.6 ± 2.0

MVSSDR 55.2 ± 3.0 52.7 ± 2.8 82.8 ± 2.1 75.0 ± 2.2 87.1 ± 1.7 85.9 ± 1.9

FMCSC 61.9 ± 3.6 51.3 ± 3.5 82.1 ± 3.3 74.3 ± 3.3 87.3 ± 2.7 85.0 ± 2.8

MVAN 60.1 ± 4.2 51.4 ± 4.1 81.6 ± 3.8 72.0 ± 3.2 90.0 ± 2.6 86.4 ± 2.9

Ours/v1 55.8 ± 3.1 51.2 ± 3.5 80.0 ± 2.5 75.8 ± 1.2 88.6 ± 1.3 85.3 ± 1.0

Ours/v2 60.8 ± 3.5 55.2 ± 3.5 80.0 ± 2.2 76.8 ± 2.0 88.5 ± 1.2 86.3 ± 1.3

Ours/v3 41.3 ± 2.3 39.2 ± 2.7 63.6 ± 2.3 60.4 ± 2.7 75.5 ± 2.4 73.3 ± 2.6

Ours/v4 47.3 ± 3.3 49.1 ± 2.7 67.6 ± 1.5 65.4 ± 2.6 76.4 ± 1.4 77.6 ± 1.8

Ours/v5 21.9 ± 3.3 46.2 ± 2.9 41.6 ± 2.5 60.2 ± 3.6 55.4 ± 1.3 77.1 ± 0.8

Ours/v6 47.5 ± 2.4 44.4 ± 2.7 71.0 ± 2.0 66.7 ± 2.7 81.9 ± 1.8 79.3 ± 2.2

Ours 63.4 ± 2.6 52.8 ± 2.7 84.1 ± 1.6 76.1 ± 2.4 91.6 ± 1.5 87.6 ± 1.6

Table 7
Recognition performance (mean accuracy ± standard deviation, %) of PCA, LPP, LPP-SR, and our proposed unsupervised scheme over 20 random splits. The best results are underlined.

Method CMU UMIST YALE-B

PCA 63.4 ± 1.6 (80) 85.5 ± 3.4 (50) 63.1 ± 1.3 (60)

LPP 87.7 ± 1.3 (80) 83.4 ± 3.5 (50) 58.3 ± 3.1 (60)

LPP-SR 84.7 ± 2.5 (80) 85.5 ± 3.2 (50) 63.5 ± 3.0 (60)

Ours 93.2 ± 2.5 (80) 90.4 ± 3.0 (50) 88.4 ± 2.5 (60)


view integration algorithm can choose the useful features and

filter out the interfering ones.

c. With more labeled samples, the standard deviation shows a decreasing tendency in most cases, which demonstrates that more label information contributes to more stable performance.
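The evaluation pipeline used above can be sketched as follows; the projection (W, b) here is a random stand-in for the learned one, and knn_classify implements the top-1 (k = 1) rule reported in the tables:

```python
import numpy as np

def knn_classify(train_Z, train_y, test_Z, k=1):
    """k-NN classification in the reduced space (majority vote over k)."""
    preds = []
    for z in test_Z:
        d = np.linalg.norm(train_Z - z, axis=1)    # Euclidean distances
        idx = np.argsort(d)[:k]
        vals, counts = np.unique(train_y[idx], return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)

rng = np.random.default_rng(0)
f, k_dim = 20, 3
W = rng.standard_normal((f, k_dim))    # stand-in for the learned projection
b = rng.standard_normal(k_dim)
X_train = rng.standard_normal((30, f)); y_train = rng.integers(0, 3, 30)
X_test = rng.standard_normal((10, f))

Z_train = X_train @ W + b   # rows of X^T W + 1 b^T
Z_test = X_test @ W + b
pred = knn_classify(Z_train, y_train, Z_test, k=1)
print(pred.shape)  # (10,)
```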

6.3. Unsupervised experiments

When conducting the unsupervised multi-view dimensionality reduction experiments, we use the same (β, μ) as in the semi-supervised experiments and verify the performance over multiple feature dimensions: {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}. Table 7 illustrates the results of the four approaches, where the numbers in brackets are the optimal dimensions for the datasets. We can easily conclude that our proposed scheme achieves a significant performance boost over the other unsupervised dimensionality reduction approaches.

6.4. Hyperparameter analysis

Finally, we analyze the influence of the hyperparameters μ and β in the semi-supervised task. We set the number of labeled samples to 3 and vary one of the parameters in the range {10^{−2}, 10^{−1}, 10^{0}, 10^{1}, 10^{2}} while fixing the other. Fig. 4 shows the performance tendency with respect to β and μ, in which the top row corresponds to the unlabeled data and the second row to the test data. In most situations, the bigger β, the higher the performance we can achieve, which demonstrates the importance of optimizing the label fitness. On the contrary, the recognition accuracy shows a negative tendency with respect to μ. We set {β, μ} = {10, 0.1}, {10, 1}, {10, 0.01}, {100, 0.1} for COIL, CMU, UMIST, and YALE-B respectively. From Fig. 4 we can see that the performance is not very stable with respect to various hyperparameters; however,


Fig. 4. The accuracy with varying β. The first four plots show the accuracy on the unlabeled data, while the last four show the accuracy on the unseen data, for four different datasets: (a) CMU, (b) COIL, (c) UMIST, (d) YALE-B.


there exists a consistent rule throughout the eight experiments: a small μ combined with a large β leads to high accuracy. We can further fix a single group {β, μ} = {10, 0.1} for all the experiments at the cost of sacrificing a small amount of accuracy. This phenomenon illustrates the relative importance of the dimensionality reduction term ‖X^T W + 1b^T − F‖²_2 versus the regularization term ‖W‖²_F.
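The grid search described above (five candidate values per hyperparameter, pick the best pair) can be sketched as follows; evaluate is a hypothetical stand-in for one train/validate run of the model:

```python
import itertools
import numpy as np

def evaluate(beta, mu):
    """Hypothetical stand-in for one train/validate run; for this demo
    it simply peaks at beta = 10, mu = 0.1."""
    return -abs(np.log10(beta) - 1) - abs(np.log10(mu) + 1)

grid = [10.0 ** p for p in range(-2, 3)]   # candidate values {1e-2, ..., 1e2}
best = max(itertools.product(grid, grid), key=lambda bm: evaluate(*bm))
print(best)  # (10.0, 0.1)
```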

7. Conclusion

In this paper, we proposed a unified and effective scheme to solve the multi-view dimensionality reduction problem under both semi-supervised and unsupervised scenarios. Particularly, an adaptively weighted multi-view graph is employed to integrate various information, which can discover persistent and complementary patterns. We further learn a linear regression projection for dimensionality reduction and penalize the discrepancy between the low-dimensional representation and the prediction vector to maximize their matching degree. In optimization, we combine multi-view fusion and dimensionality reduction with graph regularization, which are jointly optimized and cooperate with each other well. Comprehensive experiments on four benchmark databases clearly demonstrate that our proposed scheme outperforms existing approaches.

Declaration of Competing Interest

None.

Acknowledgments

Our work was supported in part by the National Natural Science Foundation of China under Grants 61572388 and 61703327, in part by the Key R&D Program (The Key Industry Innovation Chain of Shaanxi) under Grants 2017ZDCXL-GY-05-04-02, 2017ZDCXL-GY-05-02, and 2018ZDXM-GY-176, and in part by the National Key R&D Program of China under Grant 2017YFE0104100.

References

[1] M. Yang, C. Deng, F. Nie, Adaptive-weighting discriminative regression for multi-view classification, Pattern Recogn. 88 (4) (2019) 236–245.
[2] H. Zhang, V.M. Patel, R. Chellappa, Hierarchical multimodal metric learning for multimodal classification, in: IEEE International Conference on Computer Vision (CVPR), 2017, pp. 3057–3065.
[3] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from RGBD images, in: European Conference on Computer Vision (ECCV), 2012, pp. 746–760.
[4] S. Song, S.P. Lichtenberg, J. Xiao, SUN RGB-D: a RGB-D scene understanding benchmark suite, in: IEEE International Conference on Computer Vision (CVPR), 5, 2015, p. 6.
[5] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision (IJCV) 60 (2) (2004) 91–110.
[6] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vision (IJCV) 42 (3) (2001) 145–175.
[7] J. Xu, J. Han, F. Nie, X. Li, Re-weighted discriminatively embedded k-means for multi-view clustering, IEEE Trans. Image Process. (TIP) 26 (6) (2017) 3016–3027.
[8] F. Nie, G. Cai, X. Li, Multi-view clustering and semi-supervised classification with adaptive neighbours, in: American Association for Artificial Intelligence (AAAI), 2017, pp. 2408–2414.
[9] X. Liu, L. Huang, C. Deng, B. Lang, D. Tao, Query-adaptive hash code ranking for large-scale multi-view visual search, IEEE Trans. Image Process. (TIP) 25 (10) (2016) 4514–4524.
[10] X. Liu, L. Huang, C. Deng, J. Lu, B. Lang, Multi-view complementary hash tables for nearest neighbor search, in: IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1107–1115.
[11] H. Hotelling, Relations between two sets of variates, Biometrika 28 (3/4) (1936) 321–377.
[12] A. Blum, T.M. Mitchell, Combining labeled and unlabeled data with co-training, in: Conference on Computational Learning Theory (COLT), 1998, pp. 92–100.
[13] U. Brefeld, T. Scheffer, Co-EM support vector learning, in: International Conference on Machine Learning (ICML), 2004.
[14] I.A. Muslea, Active learning with multiple views, J. Artif. Intell. Res. (JAIR) 27 (1) (2011) 203–233.
[15] S. Sun, F. Jin, Robust co-training, Int. J. Pattern Recognit. Artif. Intell. (IJPRAI) 25 (07) (2011) 1113–1126.
[16] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res. (JMLR) 7 (2006) 2399–2434.
[17] S. Yu, T. Falck, A. Daemen, L.-C. Tranchevent, J.A. Suykens, B. De Moor, Y. Moreau, L2-norm multiple kernel learning and its application to biomedical data fusion, BMC Bioinf. 11 (1) (2010) 309.
[18] J.A.K. Suykens, T.V. Gestel, J.D. Brabanter, B.D. Moor, J. Vandewalle, Least squares support vector machines, Int. J. Circuit Theory Appl. (IJCTA) 27 (6) (2002) 605–615.


[19] J. Ye, S. Ji, J. Chen, Multi-class discriminant kernel learning via convex programming, J. Mach. Learn. Res. (JMLR) 9 (2008) 719–758.
[20] W. Liu, I.W. Tsang, Making decision trees feasible in ultrahigh feature and label dimensions, J. Mach. Learn. Res. 18 (2017) 81:1–81:36.
[21] W. Liu, D. Xu, I.W. Tsang, W. Zhang, Metric learning for multi-output tasks, IEEE Trans. Pattern Anal. Mach. Intell. 41 (2) (2019) 408–422.
[22] W. Liu, I.W. Tsang, K.-R. Müller, An easy-to-hard learning paradigm for multiple classes and multiple labels, J. Mach. Learn. Res. 18 (2017) 94:1–94:38.
[23] X. Peng, J. Feng, S. Xiao, W.-Y. Yau, J.T. Zhou, S. Yang, Structured AutoEncoders for subspace clustering, IEEE Trans. Image Process. 27 (10) (2018) 5076–5086, doi: 10.1109/TIP.2018.2848470.
[24] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 19 (7) (1997) 711–720.
[25] M. Sugiyama, Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis, J. Mach. Learn. Res. (JMLR) 8 (2007) 1027–1061.
[26] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, S. Lin, Graph embedding and extensions: a general framework for dimensionality reduction, IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 29 (1) (2007) 40–51.
[27] Z. Huang, H. Zhu, J.T. Zhou, X. Peng, Multiple marginal Fisher analysis, IEEE Trans. Industr. Electron. (2018), doi: 10.1109/TIE.2018.2870413.
[28] H. Li, T. Jiang, K. Zhang, Efficient and robust feature extraction by maximum margin criterion, IEEE Trans. Neural Netw. (TNN) 17 (1) (2006) 157–165.
[29] E. Yang, C. Deng, T. Liu, W. Liu, D. Tao, Semantic structure-based unsupervised deep hashing, in: International Joint Conferences on Artificial Intelligence (IJCAI), 2018, pp. 1064–1070.
[30] S. Wold, K. Esbensen, P. Geladi, Principal component analysis, Chemometrics Intell. Lab. Syst. 2 (1–3) (1987) 37–52.
[31] X. He, D. Cai, S. Yan, H.-J. Zhang, Neighborhood preserving embedding, in: IEEE International Conference on Computer Vision (ICCV), 2, 2005, pp. 1208–1213.
[32] L. Qiao, S. Chen, X. Tan, Sparsity preserving projections with applications to face recognition, Pattern Recognit. (PR) 43 (1) (2010) 331–341.
[33] C. Deng, R. Ji, W. Liu, D. Tao, X. Gao, Visual reranking through weakly supervised multi-graph learning, in: IEEE International Conference on Computer Vision (ICCV), 2013, pp. 2600–2607.
[34] D. Zhang, Z.-H. Zhou, S. Chen, Semi-supervised dimensionality reduction, in: Industrial Conference on Data Mining (ICDM), 2007, pp. 629–634.
[35] M. Sugiyama, T. Idé, S. Nakajima, J. Sese, Semi-supervised local Fisher discriminant analysis for dimensionality reduction, Mach. Learn. (ML) 78 (1–2) (2010) 35.
[36] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in: Conference and Workshop on Neural Information Processing Systems (NeurIPS), 2004, pp. 321–328.
[37] X. Zhu, J. Lafferty, Z. Ghahramani, Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions, in: International Conference on Machine Learning (ICML) Workshop, 3, 2003.
[38] V. Sindhwani, P. Niyogi, M. Belkin, S. Keerthi, Linear manifold regularization for large scale semi-supervised learning, in: International Conference on Machine Learning (ICML) Workshop, 28, 2005.
[39] M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: Conference and Workshop on Neural Information Processing Systems (NeurIPS), 2002, pp. 585–591.
[40] J. Duchi, S. Shalev-Shwartz, Y. Singer, T. Chandra, Efficient projections onto the l1-ball for learning in high dimensions, in: International Conference on Machine Learning (ICML), 2008, pp. 272–279.
[41] X. He, S. Yan, Y. Hu, P. Niyogi, H.J. Zhang, Face recognition using Laplacianfaces, IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) (2005) 328–340.
[42] F. Nie, H. Huang, C. Xiao, C. Ding, Efficient and robust feature selection via joint l2,1-norms minimization, in: International Conference on Neural Information Processing Systems, 2010.
[43] D. Cai, X. He, J. Han, Spectral regression for efficient regularized subspace learning, in: IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
[44] T. Sim, S. Baker, M. Bsat, The CMU Pose, Illumination, and Expression Database, IEEE Computer Society, 2003.
[45] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), Technical Report CUCS-005-96, Columbia University, 1996.
[46] D.B. Graham, N.M. Allinson, Characterising Virtual Eigensignatures for General Purpose Face Recognition, Springer Berlin Heidelberg, 1998.
[47] A. Georghiades et al., From few to many: illumination cone models for face recognition under variable lighting and pose, 2001, pp. 643–660.
[48] W. Liu, D. Tao, J. Liu, Transductive component analysis, in: Industrial Conference on Data Mining (ICDM), 2008, pp. 433–442.
[49] D. Cai, X. He, J. Han, Semi-supervised discriminant analysis, in: IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–7.
[50] F. Nie, D. Xu, I.W.-H. Tsang, C. Zhang, Flexible manifold embedding: a framework for semi-supervised and unsupervised dimension reduction, IEEE Trans. Image Process. (TIP) 19 (7) (2010) 1921–1932.
[51] F. Nie, J. Li, X. Li, et al., Parameter-free auto-weighted multiple graph learning: a framework for multiview clustering and semi-supervised classification, in: International Joint Conferences on Artificial Intelligence (IJCAI), 2016, pp. 1881–1887.