
Radial Basis Function Network for Multi-task Learning

Xuejun Liao and Lawrence Carin

{xjliao,lcarin}@ee.duke.edu

Department of Electrical and Computer Engineering

Duke University

Durham, NC 27708-0291, USA

Xuejun Liao and Lawrence Carin, "Radial Basis Function Network for Multi-task Learning", Annual Conference on Neural Information Processing Systems 2005. Slides: people.ee.duke.edu/~xjliao/talk/RBFN_NIPS2005_slides_final.pdf


Overview

Multi-task learning

Multi-task radial basis function (RBF) networks: network topology and mathematical formulation

Supervised learning of multi-task RBF networks

Active learning of multi-task RBF networks

Experimental results

Summary


Multi-task Learning

Solving multiple related tasks simultaneously under a unified representation

Making each task easier to solve by transferring what is learned from one task to another correlated task

Particularly useful when each individual task has scarce training data

Multi-task learning becomes superfluous when
the tasks are identical - in this case, we simply pool them together
the tasks are independent - then, there is no correlation to utilize and we learn each task separately


Multi-task RBF Network

Topology

[Figure: network topology. Input layer x = [x_1, ..., x_d]^T (size specified by the data dimensionality); hidden layer of shared basis functions φ_1(x), ..., φ_N(x) plus a bias node φ_0 (to be learned by the algorithms); output layer f_1(x), ..., f_K(x), one node per task (specified by the number of tasks), each with its own hidden-to-output weights w_1, ..., w_K.]

Figure 1: Each of the output nodes represents a unique task. Each task has its own hidden-to-output weights but all the tasks share the same hidden nodes. The activation of hidden node n is characterized by a basis function φ_n(x) = φ(||x − c_n||, σ_n). A typical choice of φ is φ(z, σ) = exp(−z²/(2σ²)), which gives the Gaussian RBF.


Multi-task RBF Network (cont’d)

Mathematical formulation

Denote w_k = [w_{0k}, w_{1k}, ..., w_{Nk}]^T as the weights connecting the hidden nodes to the k-th output node. The output for the k-th task, in response to input x, takes the form

f_k(x) = w_k^T φ(x)    (1)

where φ(x) = [φ_0(x), φ_1(x), ..., φ_N(x)]^T is a column of N + 1 basis functions, with φ_0(x) ≡ 1 a dummy basis accounting for the bias in Figure 1.
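As a concrete illustration, the network response (1) with Gaussian basis functions takes only a few lines of NumPy. This is a minimal sketch, not code from the paper; the array names (centers, sigmas, w_k) are assumptions.

```python
import numpy as np

def rbf_features(X, centers, sigmas):
    """phi(x) = [1, phi_1(x), ..., phi_N(x)]^T for each row of X, with Gaussian bases."""
    # squared distances ||x - c_n||^2, shape (number of points, N)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    Phi = np.exp(-d2 / (2.0 * sigmas[None, :] ** 2))
    # prepend the dummy basis phi_0(x) = 1 that carries the bias
    return np.hstack([np.ones((X.shape[0], 1)), Phi])

def predict_task(X, centers, sigmas, w_k):
    """Task-k response f_k(x) = w_k^T phi(x), as in (1)."""
    return rbf_features(X, centers, sigmas) @ w_k
```

For a single input x, predict_task(x[None, :], centers, sigmas, w_k) returns a one-element array holding f_k(x).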


Supervised Learning

Assume K tasks, each associated with a data set D_k = {(x_{1k}, y_{1k}), ..., (x_{J_k k}, y_{J_k k})}, where y_{ik} is the target (desired output) of x_{ik}.

We are interested in learning the functions f_k(x) for the K tasks, based on minimizing the squared error

e(φ, w) = ∑_{k=1}^K { ∑_{i=1}^{J_k} (w_k^T φ_{ik} − y_{ik})² + ρ ||w_k||² }    (2)

where φ_{ik} = φ(x_{ik}) for notational simplicity. The regularization terms ρ ||w_k||², k = 1, ..., K, are used to prevent singularity of the A matrices defined in (3), with ρ a small positive number.


Supervised Learning (cont’d)

For fixed φ's, the w's are solved as

w_k = A_k^{-1} ∑_{i=1}^{J_k} y_{ik} φ_{ik}   and   A_k = ∑_{i=1}^{J_k} φ_{ik} φ_{ik}^T + ρ I    (3)

for k = 1, ..., K.

The basis functions φ are obtained by minimizing

e(φ) = ∑_{k=1}^K ∑_{i=1}^{J_k} (y_{ik}² − y_{ik} w_k^T φ_{ik})    (4)

which is obtained by substituting (3) into (2). Note that e(φ) is a functional of φ only, as the w_k's are related to φ by (3).
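A minimal sketch of the closed-form solve (3) and the residual error (4), assuming each task's basis responses are stacked into a design matrix; the function and argument names are hypothetical, not from the paper.

```python
import numpy as np

def fit_task_weights(Phi_list, y_list, rho=1e-6):
    """Solve (3) for every task and accumulate the training error (4).

    Phi_list[k]: (J_k, N+1) matrix whose i-th row is phi(x_ik)^T
    y_list[k]:   (J_k,) targets of task k
    """
    w_list, A_list, err = [], [], 0.0
    for Phi, y in zip(Phi_list, y_list):
        A = Phi.T @ Phi + rho * np.eye(Phi.shape[1])   # A_k in (3)
        w = np.linalg.solve(A, Phi.T @ y)              # w_k in (3)
        err += y @ y - y @ (Phi @ w)                   # task-k term of e(phi) in (4)
        w_list.append(w)
        A_list.append(A)
    return w_list, A_list, err
```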


Supervised Learning (cont’d)

Recalling that φ_{ik} is an abbreviation of φ(x_{ik}) = [1, φ_1(x_{ik}), ..., φ_N(x_{ik})]^T, this amounts to determining N, the number of basis functions, and the functional form of each basis function φ_n(·), n = 1, ..., N.

Consider the candidate functions

{φ^{nm}(x) = φ(||x − x_{nm}||, σ) : n = 1, ..., J_m, m = 1, ..., K}

We learn the RBF network structure by selecting the φ(·) from these candidate functions such that e(φ) in (4) is minimized. The following theorem tells us how to perform the selection in a sequential way; the proof is given in the Appendix of the paper.
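The candidate set above is simply every training point of every task used as a Gaussian center with a common width σ. The sketch below (assumed names, not the paper's code) evaluates all candidates on every task's inputs, which the sequential selection described next repeatedly uses.

```python
import numpy as np

def candidate_columns(X_list, sigma):
    """Evaluate every candidate phi^{nm}(x) = phi(||x - x_nm||, sigma) on each task's inputs.

    X_list[k]: (J_k, d) inputs of task k.  The candidate centers x_nm are all training
    points pooled across tasks; one (J_k, number of candidates) block is returned per task.
    """
    centers = np.vstack(X_list)
    blocks = []
    for X in X_list:
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        blocks.append(np.exp(-d2 / (2.0 * sigma ** 2)))
    return blocks
```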


Supervised Learning (cont’d)

Theorem 1 Let φ(x) = [1, φ_1(x), ..., φ_N(x)]^T and let φ_{N+1}(x) be a single basis function. Let the A matrices associated with φ and [φ, φ_{N+1}]^T be non-degenerate. Then

δe(φ, φ_{N+1}) = e(φ) − e([φ, φ_{N+1}]^T) = ∑_{k=1}^K (c_k^T w_k − ∑_{i=1}^{J_k} y_{ik} φ_{N+1,ik})² q_k^{-1} ≥ 0    (5)

where φ_{N+1,ik} = φ_{N+1}(x_{ik}) and

c_k = ∑_{i=1}^{J_k} φ_{ik} φ_{N+1,ik},   d_k = ∑_{i=1}^{J_k} (φ_{N+1,ik})² + ρ,   q_k = d_k − c_k^T A_k^{-1} c_k    (6)

and the w_k's and A_k's are as given in (3).
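In code, the error reduction (5) for a single candidate follows directly from (6); the sketch below is illustrative (hypothetical names, not from the paper) and assumes the per-task design matrices, targets, weights w_k and matrices A_k from (3), plus the candidate evaluated at each task's inputs.

```python
import numpy as np

def delta_error(Phi_list, y_list, w_list, A_list, cand_cols, rho):
    """Error reduction (5) obtained by appending one candidate basis function.

    cand_cols[k]: (J_k,) values of the candidate phi_{N+1} at task k's inputs.
    """
    total = 0.0
    for Phi, y, w, A, b in zip(Phi_list, y_list, w_list, A_list, cand_cols):
        c = Phi.T @ b                          # c_k in (6)
        d = b @ b + rho                        # d_k in (6)
        q = d - c @ np.linalg.solve(A, c)      # q_k in (6)
        total += (c @ w - y @ b) ** 2 / q      # k-th term of (5)
    return total
```

Greedy selection then amounts to evaluating this quantity for every remaining candidate and keeping the maximizer, which is what steps 3-5 of Table 1 below do.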


Supervised Learning (cont’d)

Because δe(φ, φ_{N+1}) ≥ 0, adding φ_{N+1} to φ generally makes the squared error decrease.

The decrease δe(φ, φ_{N+1}) depends on φ_{N+1}. By sequentially selecting the basis function that brings the maximum error reduction, we pursue the goal of minimizing e(φ). The details of the learning algorithm are summarized in Table 1.


Supervised Learning (cont'd)

Table 1: Learning Algorithm of Multi-Task RBF Network

Input: {(x_{1k}, y_{1k}), ..., (x_{J_k,k}, y_{J_k,k})}_{k=1:K}, φ(·, σ), σ, and ρ; Output: φ(·) and {w_k}_{k=1}^K.

1. For m=1:K, for n=1:J_m, for k=1:K, for i=1:J_k, compute φ̂^{nm}_{ik} = φ(||x_{nm} − x_{ik}||, σ);
2. Let N = 0, φ(·) = 1, e_0 = ∑_{k=1}^K [∑_{i=1}^{J_k} y_{ik}² − (J_k + ρ)^{-1} (∑_{i=1}^{J_k} y_{ik})²];
   For k=1:K, compute A_k = J_k + ρ, w_k = (J_k + ρ)^{-1} ∑_{i=1}^{J_k} y_{ik};
3. For m=1:K, for n=1:J_m,
   if φ̂^{nm} is not marked as "deleted":
   for k=1:K, compute c_k = ∑_{i=1}^{J_k} φ_{ik} φ̂^{nm}_{ik}, q_k = ∑_{i=1}^{J_k} (φ̂^{nm}_{ik})² + ρ − c_k^T A_k^{-1} c_k;
   if there exists k such that q_k = 0, mark φ̂^{nm} as "deleted"; else, compute δe(φ, φ̂^{nm}) using (5).
4. If the φ̂^{nm}, n=1:J_m, m=1:K, are all marked as "deleted", go to 10.
5. Let (n*, m*) = arg max over φ̂^{nm} not marked as "deleted" of δe(φ, φ̂^{nm}); mark φ̂^{n*m*} as "deleted".
6. Tune the RBF parameter: σ_{N+1} = arg max_σ δe(φ, φ(||· − x_{n*m*}||, σ));
7. Let φ_{N+1}(·) = φ(||· − x_{n*m*}||, σ_{N+1}); update φ(·) ← [φ^T(·), φ_{N+1}(·)]^T;
8. For k=1:K, compute A_k^{new} and w_k^{new} respectively by (A-1) and (A-3) in the appendix of the paper; update A_k ← A_k^{new}, w_k ← w_k^{new};
9. Let e_{N+1} = e_N − δe(φ, φ_{N+1}); if the sequence {e_n}_{n=0:(N+1)} has converged, go to 10; else update N ← N + 1 and go back to 3.
10. Exit and output φ(·) and {w_k}_{k=1}^K.


Active Learning

Assume the data in D_k are initially unsupervised (only x's are available, without access to the associated y's).

We want to actively select a subset from D_k to be supervised (y's acquired) such that the resulting network generalizes well to the remaining data in D_k.

We first learn the basis functions φ from the unsupervised data.

We then select data to be supervised, based on φ.

Both of these steps are based on the following theorem, the proof of which is given in the Appendix of the paper.


Active Learning (cont'd)

Theorem 2 Let there be K tasks, and let the data set of the k-th task be D_k ∪ D̃_k, where D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k} and D̃_k = {(x_{ik}, y_{ik})}_{i=J_k+1}^{J_k+J̃_k}. Let there be two multi-task RBF networks, whose output nodes are characterized by f_k(·) and f̃_k(·), respectively, for task k = 1, ..., K. The two networks have the same given basis functions (hidden nodes) φ(·) = [1, φ_1(·), ..., φ_N(·)]^T, but different hidden-to-output weights. The weights of f_k(·) are trained with D_k ∪ D̃_k, while the weights of f̃_k(·) are trained using D̃_k. Then for k = 1, ..., K, the squared errors committed on D_k by f_k(·) and f̃_k(·) are related by

0 ≤ [det Γ_k]^{-1} ≤ λ_{max,k}^{-1} ≤ [∑_{i=1}^{J_k} (y_{ik} − f_k(x_{ik}))²] / [∑_{i=1}^{J_k} (y_{ik} − f̃_k(x_{ik}))²] ≤ λ_{min,k}^{-1} ≤ 1    (7)

where Γ_k = [I + Φ_k^T (ρ I + Φ̃_k Φ̃_k^T)^{-1} Φ_k]², with Φ_k = [φ(x_{1k}), ..., φ(x_{J_k k})] and Φ̃_k = [φ(x_{J_k+1,k}), ..., φ(x_{J_k+J̃_k,k})], and λ_{max,k} and λ_{min,k} are respectively the largest and smallest eigenvalues of Γ_k.


Active Learning (cont’d)

Specializing Theorem 2 to the case J̃k = 0, we have

Corollary 1 Let there be K tasks, and let the data set of the k-th task be D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k}. Let the RBF network, whose output nodes are characterized by f_k(·) for task k = 1, ..., K, have given basis functions (hidden nodes) φ(·) = [1, φ_1(·), ..., φ_N(·)]^T, and let the hidden-to-output weights of task k be trained with D_k. Then for k = 1, ..., K, the squared error committed on D_k by f_k(·) is bounded as

0 ≤ [det Γ_k]^{-1} ≤ λ_{max,k}^{-1} ≤ [∑_{i=1}^{J_k} (y_{ik} − f_k(x_{ik}))²] / [∑_{i=1}^{J_k} y_{ik}²] ≤ λ_{min,k}^{-1} ≤ 1

where Γ_k = (I + ρ^{-1} Φ_k^T Φ_k)², with Φ_k = [φ(x_{1k}), ..., φ(x_{J_k k})], and λ_{max,k} and λ_{min,k} are respectively the largest and smallest eigenvalues of Γ_k.


Active Learning (cont’d)

We are interested in selecting the basis functions φ that minimize the squared errors for all tasks, before seeing the y's.

By Corollary 1, this is accomplished by maximizing the eigenvalues of Γ_k for k = 1, ..., K.

Expanding det Γ_k as

det Γ_k = det[(I_{J_k×J_k} + ρ^{-1} Φ_k^T Φ_k)²]
        = [det(ρ I_{(N+1)×(N+1)} + Φ_k Φ_k^T)]² [det(ρ I_{(N+1)×(N+1)})]^{-2}
        = [det A_k]² [det(ρ I_{(N+1)×(N+1)})]^{-2}    (8)

the problem then boils down to selecting φ to maximize det A_k for all tasks simultaneously, i.e., selecting φ to maximize ∏_{k=1}^K det A_k.
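The second equality in (8) is an instance of Sylvester's determinant identity, det(I + ρ^{-1} Φ_k^T Φ_k) = det(ρ I + Φ_k Φ_k^T) / det(ρ I). A quick numerical check of (8), written as a sketch with arbitrary sizes rather than anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, rho = 4, 12, 1e-3
Phi = rng.normal(size=(N + 1, J))              # columns are phi(x_ik)

A = Phi @ Phi.T + rho * np.eye(N + 1)          # A_k from (3)
M = np.eye(J) + Phi.T @ Phi / rho
Gamma = M @ M                                  # Gamma_k = (I + rho^{-1} Phi^T Phi)^2

# compare log-determinants of both sides of (8)
lhs = np.linalg.slogdet(Gamma)[1]
rhs = 2.0 * (np.linalg.slogdet(A)[1] - (N + 1) * np.log(rho))
print(np.isclose(lhs, rhs))                    # True
```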


Active Learning (cont’d)

The selection proceeds in a sequential manner:

Suppose we have selected basis functions φ = [1, φ_1, ..., φ_N]^T, with A_k = ∑_{i=1}^{J_k} φ_{ik} φ_{ik}^T + ρ I_{(N+1)×(N+1)}, k = 1, ..., K.

Augmenting the basis functions to [φ^T, φ_{N+1}]^T, the A matrices change to A_k^{new} = ∑_{i=1}^{J_k} [φ_{ik}^T, φ_{N+1,ik}]^T [φ_{ik}^T, φ_{N+1,ik}] + ρ I_{(N+2)×(N+2)}.

Using the determinant formula for block matrices, we get ∏_{k=1}^K (det A_k^{new})^{-2} = ∏_{k=1}^K (q_k det A_k)^{-2}, where q_k is as given in (6). As the A_k's do not depend on φ_{N+1}, the left-hand side is minimized by maximizing ∏_{k=1}^K q_k².

The selection is easily implemented by making the following two minor modifications in Table 1: (a) in step 2, compute e_0 = ∑_{k=1}^K ln(J_k + ρ)^{-2}; (b) in step 3, compute δe(φ, φ̂^{nm}) = ∑_{k=1}^K ln q_k². Employing the logarithm gains additivity and does not affect the maximization.
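Concretely, the modified score needs only the q_k's, so no targets enter; a small sketch (assumed names, not the paper's code):

```python
import numpy as np

def unsupervised_score(Phi_list, A_list, cand_cols, rho):
    """Score sum_k ln q_k^2 of one candidate basis function; larger is better, no y's needed."""
    score = 0.0
    for Phi, A, b in zip(Phi_list, A_list, cand_cols):
        c = Phi.T @ b                                  # c_k in (6)
        q = b @ b + rho - c @ np.linalg.solve(A, c)    # q_k in (6)
        score += np.log(q ** 2)
    return score
```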


Active Learning (cont’d)

Based on the basis functions φ determined above, we proceed to selecting the data to be supervised.

The selection of data is based on an iterative use of Corollary 2, given on the next slide, which is a specialization of Theorem 2, as is Corollary 1.


Active Learning (cont’d)

Corollary 2 Let there be K tasks, and let the data set of the k-th task be D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k}. Let there be two RBF networks, whose output nodes are characterized by f_k(·) and f_k^+(·), respectively, for task k = 1, ..., K. The two networks have the same given basis functions φ(·) = [1, φ_1(·), ..., φ_N(·)]^T, but different hidden-to-output weights. The weights of f_k(·) are trained with D_k, while the weights of f_k^+(·) are trained using D_k^+ = D_k ∪ {(x_{J_k+1,k}, y_{J_k+1,k})}. Then for k = 1, ..., K, the squared errors committed on (x_{J_k+1,k}, y_{J_k+1,k}) by f_k(·) and f_k^+(·) are related by

[f_k^+(x_{J_k+1,k}) − y_{J_k+1,k}]² = [γ(x_{J_k+1,k})]^{-1} [f_k(x_{J_k+1,k}) − y_{J_k+1,k}]²

where γ(x_{J_k+1,k}) = [1 + φ^T(x_{J_k+1,k}) A_k^{-1} φ(x_{J_k+1,k})]² ≥ 1 and A_k = ∑_{i=1}^{J_k} φ(x_{ik}) φ^T(x_{ik}) + ρ I is the same as in (3).


Active Learning (cont’d)

Two observations are made from Corollary 2:

If γ(x_{J_k+1,k}) ≈ 1, seeing y_{J_k+1,k} does not affect the error on x_{J_k+1,k}, indicating that D_k already contains sufficient information about (x_{J_k+1,k}, y_{J_k+1,k}).

If γ(x_{J_k+1,k}) ≫ 1, seeing y_{J_k+1,k} greatly decreases the error on x_{J_k+1,k}, indicating that x_{J_k+1,k} is significantly dissimilar (novel) to D_k and must be supervised to reduce the error.


Active Learning (cont’d)

Data Selection proceeds in a sequential fashion:

Given that D_k = {(x_{ik}, y_{ik})}_{i=1}^{J_k} has been selected, the data point selected next should be

x_{J_k+1,k} = arg max_{i>J_k, k=1,...,K} γ(x_{ik}) = arg max_{i>J_k, k=1,...,K} [1 + φ^T(x_{ik}) A_k^{-1} φ(x_{ik})]²

After x_{J_k+1,k} is selected, A_k is updated as A_k ← A_k + φ(x_{J_k+1,k}) φ^T(x_{J_k+1,k}), and the next selection begins.

As the iteration advances, γ decreases until it converges, upon which we stop selecting data.

Finally, we use (3) to compute the w's from the selected x's and their associated targets y, completing the learning of the multi-task RBF network.
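A sketch of one step of this selection loop (hypothetical data structures, not the paper's code): keep a pool of feature vectors of the not-yet-supervised points, pick the one with the largest γ, and apply the rank-one update to its task's A matrix.

```python
import numpy as np

def select_next(A_list, pool_phi, pool_task):
    """One active-learning step: return the pool index with the largest gamma(x).

    pool_phi[j]:  (N+1,) feature vector phi(x) of the j-th not-yet-supervised point
    pool_task[j]: task index k that point belongs to
    """
    gammas = [
        (1.0 + phi @ np.linalg.solve(A_list[k], phi)) ** 2
        for phi, k in zip(pool_phi, pool_task)
    ]
    j = int(np.argmax(gammas))
    k = pool_task[j]
    # rank-one update A_k <- A_k + phi phi^T once the point is chosen for labeling
    A_list[k] = A_list[k] + np.outer(pool_phi[j], pool_phi[j])
    return j, gammas[j]
```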


Experimental Results

We consider three types of RBF networks to learn K tasks, each with its data set D_k:

In the first, which we call the "one RBF network", we let the K tasks share both the basis functions φ (hidden nodes) and the hidden-to-output weights w; thus we do not distinguish the K tasks and design a single RBF network to learn a union of them.

The second is the multi-task RBF network, where the K tasks share the same φ but each has its own w.

In the third, we have K independent networks, each designed for a single task.


Experimental Results (cont’d)

We use a school data set from the Inner London Education Authority, consisting of examination records of 15362 students from 139 secondary schools. The data are available at

http://multilevel.ioe.ac.uk/intro/datasets.html

The goal is to predict the exam scores of the students.

Treating each school as a task, we perform multi-task learning.

After converting each categorical variable to a number of binary variables, we obtain a total of 27 input variables, i.e., x ∈ R^27.


Experimental Results (cont’d)

Some details of the implementation:

The multi-task RBF network is implemented as shown in Figure 1 and trained with the learning algorithm in Table 1.

The "one RBF network" is implemented as a special case of Figure 1, with a single output node, and trained using the union of supervised data from all 139 schools.

139 independent RBF networks are designed, each having a single output node and trained for a single school.

The Gaussian RBF φ_n(x) = exp(−||x − c_n||²/(2σ_n²)) is used, where the c_n's are selected from the training data points and the σ_n's are initialized to 20 and optimized as described in Table 1. The regularization parameter ρ is set to 10^{-6}.


Experimental Results (cont’d)

Training and testing

We randomly take 75% of the 15362 data points as training (supervised) data and the remaining 25% as test data.

The generalization performance is measured by the squared error (f_k(x_{ik}) − y_{ik})² averaged over all test data x_{ik} of tasks k = 1, ..., K.

We ran 10 independent trials, each randomly splitting the data into training and test sets; the squared error, averaged over the test data of all 139 schools and over the trials, is shown in Table 2 for the three types of RBF networks.


Experimental Results (cont’d)

Table 2: Squared error averaged over the test data of all 139 schools and the 10 independent trials of randomly splitting the school data into training (75%) and test (25%) sets.

Multi-task RBF network: 109.89 ± 1.8167
Independent RBF networks: 136.41 ± 7.0081
One RBF network: 149.48 ± 2.8093


Experimental Results (cont’d)

Table 2 shows that the multi-task RBF network outperforms the other two types of RBF networks by a considerable margin. This is attributed to the following:

The "one RBF network" ignores the differences between the tasks, and the independent RBF networks ignore the tasks' correlations; therefore both perform worse.

The multi-task RBF network uses the shared hidden nodes (basis functions) to capture the common internal representation of the tasks, while using the independent hidden-to-output weights to learn the statistics specific to each task.


Experimental Results (cont’d)

We apply active learning to split the data into training and test sets using the following two-step procedure:

First, we learn the basis functions φ of the multi-task RBF network using all 15362 data points, which are unsupervised, i.e., their targets y are not required.

Based on φ, we then select the data x to be supervised, collect the y's associated with them, and use them as training data to learn the hidden-to-output weights w.

To make the results comparable, we use the same training data to learn the other two types of RBF networks (including learning their own φ and w).

The networks are then tested on the remaining data, with the results shown in Figure 2.


Experimental Results (cont’d)

[Plot: three curves, for the multi-task RBF network, the independent RBF networks, and the one RBF network; x-axis: number of training (supervised) data, 5000 to 12000; y-axis: squared error averaged over test data, roughly 100 to 260.]

Figure 2: Squared error averaged over the test data of all 139 schools, as a function of the number of training (supervised) data. The data are split into training and test sets via active learning.


Experimental Results (cont’d)

Discussion of Figure 2:

Clearly, the multi-task RBF network maintains its superior performance all the way down to 5000 training data points, whereas the performance of the independent RBF networks degrades seriously as the training data diminish.

This demonstrates the increasing advantage of multi-task learning as the number of training data decreases.

The "one RBF network" also seems insensitive to the number of training data, but it ignores the inherent dissimilarity between the tasks, which makes its performance inferior.


Summary

We have presented the structure and learning algorithms for multi-task learning with radial basis function (RBF) networks.

Supervised learning of the network structure is achieved via simple and efficient evaluation of candidate basis functions.

Unsupervised learning of the network structure enables us to actively split the data x into training and test sets without accessing the y's.

The proposed unsupervised criteria select novel data into the training set; hence what remains in the test set is similar to the training set. This improves the generalization of the resulting network to the test data.
