arXiv:1910.05878v2 [cs.HC] 29 Feb 2020
Manifold Embedded Knowledge Transfer for Brain-Computer Interfaces
Wen Zhang and Dongrui Wu
Abstract—Transfer learning makes use of data or knowledge in one problem to help solve a different, yet related, problem. It is particularly useful in brain-computer interfaces (BCIs), for coping with variations among different subjects and/or tasks. This paper considers offline unsupervised cross-subject electroencephalogram (EEG) classification, i.e., we have labeled EEG trials from one or more source subjects, but only unlabeled EEG trials from the target subject. We propose a novel manifold embedded knowledge transfer (MEKT) approach, which first aligns the covariance matrices of the EEG trials in the Riemannian manifold, extracts features in the tangent space, and then performs domain adaptation by minimizing the joint probability distribution shift between the source and the target domains, while preserving their geometric structures. MEKT can cope with one or multiple source domains, and can be computed efficiently. We also propose a domain transferability estimation (DTE) approach to identify the most beneficial source domains, in case there are a large number of source domains. Experiments on four EEG datasets from two different BCI paradigms demonstrated that MEKT outperformed several state-of-the-art transfer learning approaches, and DTE can reduce more than half of the computational cost when the number of source subjects is large, with little sacrifice of classification accuracy.
Index Terms—Brain-computer interfaces, electroencephalogram, Riemannian manifold, transfer learning
I. INTRODUCTION
A brain-computer interface (BCI) provides a direct commu-
nication pathway between a user’s brain and a computer [1],
[2]. Electroencephalogram (EEG), a multi-channel time-series,
is the most frequently used BCI input signal. There are three
common paradigms in EEG-based BCIs: motor imagery (MI)
[3], event-related potentials (ERPs) [4], and steady-state visual
evoked potentials [2]. The first two are the focus of this paper.
In MI tasks, the user needs to imagine the movements
of his/her body parts, which causes modulations of brain
rhythms in the involved cortical areas. In ERP tasks, the user is
stimulated by a majority of non-target stimuli and a few target
stimuli; a special ERP pattern appears in the EEG response
after the user perceives a target stimulus. EEG-based BCI
systems have been widely used to help people with disabilities,
and also the able-bodied [1].
A standard EEG signal analysis pipeline consists of tem-
poral (band-pass) filtering, spatial filtering, and classification
[5]. Spatial filters such as common spatial patterns (CSP) [6]
W. Zhang and D. Wu are with the Ministry of Education Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China. Email: [email protected], [email protected].
Dongrui Wu is the corresponding author.
are widely used to enhance the signal-to-noise ratio. Recently,
there is a trend to utilize the covariance matrices of EEG
trials, which are symmetric positive definite (SPD) and can be
viewed as points on a Riemannian manifold, in EEG signal
analysis [7]–[9]. For MI tasks, the discriminative information
is mainly spatial, and can be directly encoded in the covariance
matrices. In contrast, the main discriminative information
of ERP trials is temporal. A novel approach was proposed in
[10] to augment each EEG trial by the mean of all target trials
that contain the ERP, and then compute their covariance
matrices. However, Riemannian space based approaches are
computationally expensive, and not compatible with Euclidean
space machine learning approaches.
A major challenge in BCIs is that different users have
different neural responses to the same stimulus, and even the
same user can have different neural responses to the same
stimulus at different time/locations. Besides, when calibrat-
ing the BCI system, acquiring a large number of subject-
specific labeled training examples for each new subject is time-
consuming and expensive. Transfer learning [11]–[15], which
uses data/information from one or more source domains to
help the learning in a target domain, can be used to address
these problems. Some representative applications of transfer
learning in BCIs can be found in [16]–[21]. Many researchers
[19]–[21] attempted to seek a set of subject-invariant CSP
filters to increase the signal-to-noise ratio. Another pipeline
is Riemannian geometry based. Zanini et al. [22] proposed
a Riemannian alignment (RA) framework to align the EEG
covariance matrices from different subjects. He and Wu [23]
extended RA to Euclidean alignment (EA) in the Euclidean
space, so that any Euclidean space classifier can be used after
it.
To utilize the excellent properties of the Riemannian ge-
ometry and avoid its high computational cost, as well as to
leverage knowledge learned from the source subjects, this
paper proposes a manifold embedded knowledge transfer
(MEKT) framework, which first aligns the covariance matrices
of the EEG trials in the Riemannian manifold, then performs
domain adaptation in the tangent space by minimizing the
joint probability distribution shift between the source and the
target domains, while preserving their geometric structures,
as illustrated in Fig. 1. Additionally, we propose a domain
transferability estimation (DTE) approach to select the most
beneficial subjects in multi-source transfer learning. Experi-
ments on four datasets from two different BCI paradigms (MI
and ERP) verified the effectiveness of MEKT and DTE.
Fig. 1. Illustration of our proposed MEKT. Squares and circles represent examples from different classes. Different colors represent different domains. All domains are first aligned on the Riemannian manifold, and then mapped onto a tangent space. A and B are projection matrices of the source and the target domains, respectively.

The remainder of this paper is organized as follows: Section II introduces related work on spatial filters, Riemannian
geometry, tangent space mapping, RA, EA, and subspace
adaptation. Section III describes the details of the proposed
MEKT and DTE. Section IV presents experiments to compare
the performance of MEKT with several state-of-the-art data
alignment and transfer learning approaches. Finally, Section V
draws conclusions.
II. RELATED WORK
This section introduces background knowledge on spatial
filters, Riemannian geometry, tangent space mapping, RA,
EA, and subspace adaptation, which will be used in the next
section.
A. Spatial Filters
Spatial filtering can be viewed as a data-driven dimension-
ality reduction approach that promotes the variance difference
between two conditions [24]. It is common in MI-based BCIs
to use CSP filters [25] to simultaneously diagonalize the two
intra-class covariance matrices.
Consider a binary classification problem. Let (X_i, y_i) be the i-th labeled training example, where X_i \in \mathbb{R}^{c \times t}, in which c is the number of EEG channels, and t the number of time domain samples. For Class k (k = 0, 1), CSP finds a spatial filter matrix W_k^* \in \mathbb{R}^{c \times f}, where f is the number of spatial filters, to maximize the variance difference between Class k and Class 1-k:

W_k^* = \arg\max_{W \in \mathbb{R}^{c \times f}} \frac{\mathrm{tr}(W^\top \Sigma_k W)}{\mathrm{tr}(W^\top \Sigma_{1-k} W)},    (1)

where \Sigma_k is the mean covariance matrix of all EEG trials in Class k, and \mathrm{tr}(\cdot) is the trace of a matrix. The solution W_k^* is the concatenation of the f leading eigenvectors of the matrix (\Sigma_{1-k})^{-1} \Sigma_k.

Finally, we concatenate the 2f spatial filters from both classes to obtain the complete CSP filters:

W^* = [W_0^*, W_1^*] \in \mathbb{R}^{c \times 2f},    (2)

and compute the spatially filtered X_i by:

X_i' = (W^*)^\top X_i \in \mathbb{R}^{2f \times t}.    (3)

The log-variances of the filtered trial X' can be extracted:

x = \log\left( \frac{\mathrm{diag}(X' X'^\top)}{\mathrm{tr}(X' X'^\top)} \right),    (4)

and used as input features in classification.
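To make the above pipeline concrete, the following is a minimal NumPy/SciPy sketch of (1)-(4); the function names and the default f = 3 are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(X, y, f=3):
    """CSP filters of Eqs. (1)-(2). X: (n_trials, c, t), y: labels in {0, 1}."""
    Sigma = [np.mean([Xi @ Xi.T for Xi in X[y == k]], axis=0) for k in (0, 1)]
    Ws = []
    for k in (0, 1):
        # Generalized eigenproblem Sigma_k w = lambda * Sigma_{1-k} w; its f
        # leading eigenvectors maximize the Rayleigh quotient in Eq. (1).
        lam, V = eigh(Sigma[k], Sigma[1 - k])
        Ws.append(V[:, np.argsort(lam)[::-1][:f]])
    return np.hstack(Ws)  # W*: (c, 2f), Eq. (2)

def log_variance_features(X, W):
    """Log-variance features of Eq. (4) for the filtered trials of Eq. (3)."""
    feats = []
    for Xi in X:
        Xp = W.T @ Xi                     # Eq. (3)
        P = Xp @ Xp.T
        feats.append(np.log(np.diag(P) / np.trace(P)))
    return np.array(feats)
```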
B. Riemannian Geometry
All SPD matrices P ∈ Rc×c form a differentiable Rieman-
nian manifold. Riemannian geometry is used to manipulate
them. Some basic definitions are provided below.
The Riemannian distance between two SPD matrices P_1 and P_2 is:

\delta(P_1, P_2) = \left\| \log\left( P_1^{-1} P_2 \right) \right\|_F,    (5)

where \|\cdot\|_F is the Frobenius norm, and \log(\cdot) denotes the matrix logarithm, which takes the logarithm of the eigenvalues of P_1^{-1} P_2.
The Riemannian mean of \{P_i\}_{i=1}^{n} is:

M_R = \arg\min_P \sum_{i=1}^{n} \delta^2(P, P_i).    (6)

The Euclidean mean is:

M_E = \frac{1}{n} \sum_{i=1}^{n} P_i.    (7)

The Log-Euclidean mean [7] is:

M_L = \exp\left( \sum_{i=1}^{n} w_i \log P_i \right),    (8)

where w_i is usually set to 1/n.
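As a quick illustration of (5), (7) and (8), a minimal SciPy sketch (uniform weights w_i = 1/n assumed; the Riemannian mean (6) has no closed form and is usually found by a fixed-point iteration, which is omitted here):

```python
import numpy as np
from scipy.linalg import expm, logm

def riemannian_distance(P1, P2):
    """Eq. (5): Frobenius norm of the matrix log of P1^{-1} P2."""
    return np.linalg.norm(logm(np.linalg.solve(P1, P2)), 'fro')

def euclidean_mean(Ps):
    """Eq. (7): arithmetic mean of the SPD matrices."""
    return np.mean(Ps, axis=0)

def log_euclidean_mean(Ps):
    """Eq. (8) with w_i = 1/n: exp of the mean of the matrix logs."""
    return expm(np.mean([logm(P) for P in Ps], axis=0))
```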
C. Tangent Space Mapping

Tangent space mapping, also known as the logarithmic mapping, maps a Riemannian space SPD matrix P_i to a Euclidean tangent space vector x_i around an SPD matrix M, which is usually the Riemannian or Euclidean mean:

x_i = \mathrm{upper}\left( \log_M(M_{\mathrm{ref}} P_i M_{\mathrm{ref}}) \right),    (9)

where \mathrm{upper}(\cdot) takes the upper triangular part of a c \times c SPD matrix and forms a vector x_i \in \mathbb{R}^{1 \times c(c+1)/2}, and M_{\mathrm{ref}} is a reference matrix. To obtain a tangent space locally homomorphic to the manifold, M_{\mathrm{ref}} = M^{-1/2} is needed [24].

Congruent transform and congruence invariance [26] are two important properties in the Riemannian space:

\mathcal{M}(F P_1 F, F P_2 F) = F \cdot \mathcal{M}(P_1, P_2) \cdot F,    (10)

\delta(G^\top P_1 G, G^\top P_2 G) = \delta(P_1, P_2),    (11)

where \mathcal{M} is the Euclidean or Riemannian mean operation, F is a nonsingular square matrix, and G \in \mathbb{R}^{c \times c} is an invertible symmetric matrix. (11) suggests that the Riemannian distance between two SPD matrices does not change, if both are left and right multiplied by an invertible symmetric matrix.
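A minimal sketch of the mapping in (9), assuming M_ref = M^{-1/2} and a plain upper(·) that stacks the upper triangular entries (some toolboxes additionally weight the off-diagonal entries by √2 to preserve the norm; that variant is omitted here):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power, logm

def tangent_space_vector(P, M):
    """Eq. (9): map SPD matrix P to the tangent space at M."""
    Mref = fractional_matrix_power(M, -0.5)   # M_ref = M^{-1/2}
    S = logm(Mref @ P @ Mref)                 # symmetric log-mapped matrix
    return S[np.triu_indices_from(S)]         # upper(.), length c(c+1)/2
```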
D. Riemannian Alignment (RA)
RA [22] first computes the covariance matrices of some
resting (or non-target) trials, {Pi}ni=1, in which the subject is
not performing any task (or not performing the target task),
and then the Riemannian mean MR of these matrices, which
is used as the reference matrix to reduce the inter-session or
inter-subject variations, by the following transformation:
P_i' = M_R^{-1/2} P_i M_R^{-1/2},    (12)
where Pi is the covariance matrix of the i-th trial, and P ′i the
corresponding aligned covariance matrix. Then, all P ′i can be
classified by a minimum distance to mean (MDM) classifier
[8].
E. Euclidean Alignment (EA)
Although RA-MDM has demonstrated promising perfor-
mance, it still has some limitations [23]: 1) it processes the
covariance matrices in the Riemannian space, whereas there
are very few Riemannian space classifiers; 2) it computes the
reference matrix from the non-target stimuli in ERP-based
BCIs, which requires some labeled trials from the new subject.
EA [23] extends RA and solves the above problems by
transforming an EEG trial Xi in the Euclidean space:
X_i' = M_E^{-1/2} X_i,    (13)

where M_E is the Euclidean mean of the covariance matrices of all EEG trials, computed by (7).

However, EA only considers the marginal probability distribution shift, and works best when the number of EEG channels is small. When there are a large number of channels, computing M_E^{-1/2} may be numerically unstable.
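A minimal sketch of EA per (13), with the reference matrix computed from one subject's trials (an illustrative helper, not the reference implementation of [23]):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def euclidean_alignment(X):
    """Eq. (13): whiten all trials of one subject by M_E^{-1/2}.

    X: (n, c, t) EEG trials; returns the aligned trials.
    """
    ME = np.mean([Xi @ Xi.T for Xi in X], axis=0)  # Euclidean mean, Eq. (7)
    R = fractional_matrix_power(ME, -0.5)
    return np.stack([R @ Xi for Xi in X])
```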
F. Subspace Adaptation
Tangent space vectors usually have very high dimension-
ality, so they cannot be used easily in transfer learning. An
intuitive approach is to align them in a lower dimensional
subspace. Pan et al. [11] proposed transfer component analysis
(TCA) to learn the transferable components across domains
in a reproducing kernel Hilbert space using maximum mean
discrepancy (MMD) [27]. Joint distribution adaptation (JDA)
[14] improves TCA by considering the conditional distribution
shift using pseudo label refinement. Joint geometrical and
statistical alignment (JGSA) [15] further improves JDA by
adding two regularization terms, which minimize the within-
class scatter matrix and maximize the between-class scatter
matrix.
III. MANIFOLD EMBEDDED KNOWLEDGE TRANSFER
(MEKT)
This section proposes the MEKT approach. Its goal is to
use one or multiple source subjects’ data to help the target
subject, given that they have the same feature space and label
space. For the ease of illustration, we focus on a single source
domain first.
Assume the source domain has n_S labeled instances \{(X_{S,i}, y_{S,i})\}_{i=1}^{n_S}, where X_{S,i} \in \mathbb{R}^{c \times t} is the i-th feature matrix, and y_{S,i} \in \{1, ..., l\} is the corresponding label, in which c, t and l denote the number of channels, time domain samples, and classes, respectively. Let y_S = [y_{S,1}; \cdots; y_{S,n_S}] \in \mathbb{R}^{n_S \times 1} be the label vector of the source domain. Assume also the target domain has n_T unlabeled feature matrices \{X_{T,i}\}_{i=1}^{n_T}, where X_{T,i} \in \mathbb{R}^{c \times t}.

MEKT consists of the following three steps:
1) Covariance matrix centroid alignment (CA): Align the centroids of the covariance matrices of \{X_{S,i}\}_{i=1}^{n_S} and \{X_{T,i}\}_{i=1}^{n_T}, so that their marginal probability distributions are close.
2) Tangent space feature extraction: Map the aligned covariance matrices to tangent space feature matrices X_S \in \mathbb{R}^{d \times n_S} and X_T \in \mathbb{R}^{d \times n_T}, where d = c(c+1)/2 is the dimensionality of the tangent space features.
3) Mapping matrices identification: Find projection matrices A \in \mathbb{R}^{d \times p} and B \in \mathbb{R}^{d \times p}, where p \ll d is the dimensionality of a shared subspace, such that A^\top X_S and B^\top X_T are close.

After MEKT, a classifier can be trained on (A^\top X_S, y_S) and applied to B^\top X_T to obtain the target pseudo labels y_T.
Next, we describe the details of the above three steps.
A. Covariance Matrix Centroid Alignment (CA)
CA serves as a preprocessing step to reduce the marginal
probability distribution shift of different domains, and enables
transfer from multiple source domains.
Let P_{S,i} = X_{S,i} X_{S,i}^\top be the i-th covariance matrix in the source domain, and M_{\mathrm{ref}} = M^{-1/2}, where M can be the Riemannian mean in (6), the Euclidean mean in (7), or the Log-Euclidean mean in (8). Then, we align the covariance matrices by

P_{S,i}' = M_{\mathrm{ref}} P_{S,i} M_{\mathrm{ref}}, \quad i = 1, ..., n_S.    (14)

Likewise, we can obtain the aligned covariance matrices \{P_{T,i}'\}_{i=1}^{n_T} of the target domain.
CA has two desirable properties:
1) Marginal probability distribution shift minimization. From the properties of congruent transform and congruence invariance, we have

\mathcal{M}(M_{\mathrm{ref}}^\top P_1 M_{\mathrm{ref}}, ..., M_{\mathrm{ref}}^\top P_{n_S} M_{\mathrm{ref}}) = M_{\mathrm{ref}}^\top \mathcal{M}(P_1, ..., P_{n_S}) M_{\mathrm{ref}} = M_{\mathrm{ref}}^\top M M_{\mathrm{ref}} = I,    (15)

i.e., if we choose M as the Riemannian (or Euclidean) mean, then different domains' geometric (or arithmetic) centers all equal the identity matrix. Therefore, the marginal distributions of the source and the target domains are brought closer on the manifold.
2) EEG trial whitening. In the following, we show that each aligned covariance matrix is approximately an identity matrix after CA.

If we decompose the reference matrix as M_{\mathrm{ref}} = [w_1, ..., w_c], then the (m,n)-th element of P_{S,i}' is:

P_{S,i}'(m,n) = w_m^\top P_{S,i} w_n.    (16)

From (15) we have

w_m^\top \mathcal{M}(P_1, ..., P_{n_S}) w_n = \begin{cases} 1, & m = n \\ 0, & m \neq n \end{cases}.    (17)
The above equation holds no matter whether \mathcal{M} is the Riemannian mean or the Euclidean mean.

For CA using the Euclidean mean, the average of the m-th diagonal element of \{P_{S,i}'\}_{i=1}^{n_S} is

\frac{1}{n_S} \sum_{i=1}^{n_S} P_{S,i}'(m,m) = w_m^\top \mathcal{M}(P_1, ..., P_{n_S}) w_m = 1.    (18)

Meanwhile, for each diagonal element, we have P_{S,i}'(m,m) = \|X_{S,i}^\top w_m\|_2^2 > 0, therefore the diagonal elements of P_{S,i}' are around 1. Similarly, the off-diagonal elements of P_{S,i}' are around 0. Thus, P_{S,i}' is approximately an identity matrix, i.e., the aligned EEG trials are approximately whitened.

CA with the Riemannian mean is an iterative process initialized by the Euclidean mean. CA with the Log-Euclidean mean is an approximation of CA with the Riemannian mean, with reduced computational cost [7]. So, (18) also holds approximately for these two means.

This whitening effect will also be experimentally demonstrated in Section IV-E.
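A minimal sketch of CA per (14) with the Euclidean mean (the Riemannian or Log-Euclidean mean can be substituted for M; the function name is illustrative):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def centroid_align(Ps):
    """Eq. (14): align one domain's covariance matrices so their mean is I.

    Ps: (n, c, c) SPD covariance matrices; M is the Euclidean mean, Eq. (7).
    """
    Mref = fractional_matrix_power(np.mean(Ps, axis=0), -0.5)
    return np.stack([Mref @ P @ Mref for P in Ps])
```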
B. Tangent Space Feature Extraction
After covariance matrix CA, we map each covariance matrix to a tangent space feature vector in \mathbb{R}^{d \times 1}:

x_{S,i} = \mathrm{upper}(\log(P_{S,i}')), \quad i = 1, ..., n_S,    (19)

x_{T,i} = \mathrm{upper}(\log(P_{T,i}')), \quad i = 1, ..., n_T.    (20)
Note that this is different from the original tangent space
mapping in (9), in that (9) uses the same reference matrix Mref
for all subjects, whereas our approach uses a subject-specific
Mref for each different subject.
Next, we form new feature matrices X_S = [x_{S,1}, ..., x_{S,n_S}] and X_T = [x_{T,1}, ..., x_{T,n_T}].
C. Mapping Matrices Identification
CA does not reduce the conditional probability distribution discrepancies. We next find projection matrices A, B \in \mathbb{R}^{d \times p}, which map X_S and X_T to lower dimensional matrices A^\top X_S and B^\top X_T, with the following desirable properties:
1) Joint probability distribution shift minimization. In traditional domain adaptation [11], [14], MMD is frequently used to reduce the marginal and conditional probability distribution discrepancies between the source and the target domains, i.e.,

D_{S,T} \approx D(Q(X_S), Q(X_T)) + D(Q(y_S|X_S), Q(y_T|X_T)) = \left\| \frac{1}{n_S} \sum_{i=1}^{n_S} A^\top x_{S,i} - \frac{1}{n_T} \sum_{j=1}^{n_T} B^\top x_{T,j} \right\|_F^2 + \sum_{k=1}^{l} \left\| \frac{1}{n_S^k} \sum_{i=1}^{n_S^k} A^\top x_{S,i}^k - \frac{1}{n_T^k} \sum_{j=1}^{n_T^k} B^\top x_{T,j}^k \right\|_F^2,    (21)

where x_{S,i}^k and x_{T,j}^k are the tangent space vectors in the k-th (k = 1, ..., l) class of the source domain and the target domain, respectively, and n_S^k and n_T^k are the numbers of examples in the k-th class of the source domain and the target domain, respectively.
Next, we propose a new measure, joint probability MMD, to quantify the probability distribution shift between the source and the target domains, by considering the joint probability directly, instead of the marginal and the conditional probabilities separately.

Then, the joint probability MMD between the source and the target domains is:

D'_{S,T} = D(Q(X_S, y_S), Q(X_T, y_T)) = D(Q(X_S|y_S) Q(y_S), Q(X_T|y_T) Q(y_T)) \approx \sum_{k=1}^{l} \left\| \frac{P(y_S^k)}{n_S^k} \sum_{i=1}^{n_S^k} A^\top x_{S,i}^k - \frac{P(y_T^k)}{n_T^k} \sum_{j=1}^{n_T^k} B^\top x_{T,j}^k \right\|_F^2 = \sum_{k=1}^{l} \left\| \frac{1}{n_S} \sum_{i=1}^{n_S^k} A^\top x_{S,i}^k - \frac{1}{n_T} \sum_{j=1}^{n_T^k} B^\top x_{T,j}^k \right\|_F^2.    (22)

Let the one-hot encoding matrix of the source domain label vector¹ y_S be Y_S, and the one-hot encoding matrix of the predicted target label vector y_T be Y_T. (22) can be simplified as

D'_{S,T} = \left\| N_S^\top X_S^\top A - N_T^\top X_T^\top B \right\|_F^2,    (23)

where

N_S = \frac{Y_S}{n_S}, \quad N_T = \frac{Y_T}{n_T}.    (24)

¹For example, for binary classification with two classes 1 and 2, if y_S = [2; 1; 2], then Y_S = [0 1; 1 0; 0 1].
The joint probability MMD is based on the joint probability rather than the conditional probability, so in theory it can handle more types of probability distribution shift (a numerical sketch of (23) is given after this list).
2) Source domain discriminability. During subspace mapping, the discriminating ability of the source domain can be preserved by:

\min_A \mathrm{tr}(A^\top S_w A) \quad \text{s.t.} \quad A^\top S_b A = I,    (25)

where S_w = \sum_{k=1}^{l} \sum_{i=1}^{n_S^k} x_{S,i}^k (x_{S,i}^k)^\top h_k is the within-class scatter matrix, in which h_k = 1 - \frac{1}{n_S^k}, and S_b = \sum_{k=1}^{l} n_k (m_k - m)(m_k - m)^\top is the between-class scatter matrix, in which m_k is the mean of the samples from Class k, and m is the mean of all samples.
3) Target domain locality preservation. We also introduce a graph-based regularization term to preserve the local structure in the target domain. Under the manifold assumption [28], if two samples x_{T,i} and x_{T,j} are close in the original target domain, then they should also be close in the projected subspace.

Let S \in \mathbb{R}^{n_T \times n_T} be a similarity matrix:

S_{ij} = \begin{cases} e^{-\|x_{T,i} - x_{T,j}\|_2^2 / (2\sigma^2)}, & \text{if } x_{T,i} \in N_p(x_{T,j}) \text{ or } x_{T,j} \in N_p(x_{T,i}) \\ 0, & \text{otherwise} \end{cases}    (26)

where N_p(x_{T,i}) is the set of the p-nearest neighbors of x_{T,i}, and \sigma is a scaling parameter, which usually equals 1 [29].

Using the normalized graph Laplacian matrix L = I - D^{-1/2} S D^{-1/2}, where D is a diagonal matrix with D_{ii} = \sum_{j=1}^{n_T} S_{ij}, the graph regularization is expressed as:

\sum_{i,j=1}^{n_T} \|B^\top x_{T,i} - B^\top x_{T,j}\|_2^2 S_{ij} = \mathrm{tr}(B^\top X_T L X_T^\top B).    (27)

To remove the scaling effect, we add a constraint on the target embedding [30]:

\min_B \mathrm{tr}(B^\top X_T L X_T^\top B) \quad \text{s.t.} \quad B^\top X_T H X_T^\top B = I,    (28)

where H = I - \frac{1}{n_T} \mathbf{1}_{n_T} is the centering matrix, in which \mathbf{1}_{n_T} \in \mathbb{R}^{n_T \times n_T} is an all-one matrix.
4) Parameter transfer and regularization. Since the source
and the target domains have the same feature space, and
CA has brought their probability distributions closer,
we want the projection matrix B to be similar to
the projection matrix A learned in the source domain.
Additionally, for better generalization performance, we
want to ensure that A and B do not include extreme
values. Thus, we have the following constraints on the
projection matrices:
\min_{A,B} \left( \|B - A\|_F^2 + \|B\|_F^2 \right).    (29)
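As forward-referenced in property 1), the sketch below evaluates the joint probability MMD of (23)-(24) for given projections A and B; the function name and the one-hot label matrices are illustrative assumptions:

```python
import numpy as np

def joint_probability_mmd(XS, XT, YS, YT, A, B):
    """Eq. (23): || NS^T XS^T A - NT^T XT^T B ||_F^2.

    XS: (d, nS), XT: (d, nT) tangent features;
    YS: (nS, l), YT: (nT, l) one-hot (pseudo) label matrices.
    """
    NS = YS / XS.shape[1]   # Eq. (24)
    NT = YT / XT.shape[1]
    M = NS.T @ XS.T @ A - NT.T @ XT.T @ B
    return np.linalg.norm(M, 'fro') ** 2
```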
D. The Overall Loss Function of MEKT
Integrating all the regularization terms and constraints above, the formulation of MEKT is:

\min_{A,B} \; \alpha \, \mathrm{tr}(A^\top S_w A) + \beta \, \mathrm{tr}(B^\top X_T L X_T^\top B) + D'_{S,T} + \rho \left( \|B - A\|_F^2 + \|B\|_F^2 \right)
\quad \text{s.t.} \quad B^\top X_T H X_T^\top B = I, \quad A^\top S_b A = I,    (30)
where α, β and ρ are trade-off parameters to balance the
importance of the source domain discriminability, the target
domain locality, and the parameter regularization, respectively.
Let W = [A; B]. Then, the Lagrange function is

J = \mathrm{tr}\left( W^\top (\alpha P + \beta L + \rho U + R) W + \eta (I - W^\top V W) \right),    (31)

where

P = \begin{bmatrix} S_w & 0 \\ 0 & 0 \end{bmatrix}, \quad L = \begin{bmatrix} 0 & 0 \\ 0 & X_T L X_T^\top \end{bmatrix},    (32)

U = \begin{bmatrix} I & -I \\ -I & 2I \end{bmatrix}, \quad V = \begin{bmatrix} S_b & 0 \\ 0 & X_T H X_T^\top \end{bmatrix},    (33)

R = \begin{bmatrix} X_S N_S N_S^\top X_S^\top & -X_S N_S N_T^\top X_T^\top \\ -X_T N_T N_S^\top X_S^\top & X_T N_T N_T^\top X_T^\top \end{bmatrix}.    (34)

Setting the derivative \nabla_W J = 0, we have

(\alpha P + \beta L + \rho U + R) W = \eta V W.    (35)
(35) can be solved by generalized eigen-decomposition, and W consists of the p trailing eigenvectors. Since Y_T is needed in N_T [see (24)], and hence in R, we use a general expectation-maximization-like pseudo label refinement procedure [14] to refine the estimation, as shown in Algorithm 1.
Note that for the clarity of explanation, Algorithm 1 only considers one source domain. When there are multiple source domains, we perform CA and compute the tangent space feature vectors X_S^{(i)} \in \mathbb{R}^{d \times n_S^{(i)}} for each source domain separately, and then assemble their feature vectors into a single source domain feature matrix X_S = [X_S^{(1)}, ..., X_S^{(z)}] \in \mathbb{R}^{d \times n^*}, where n_S^{(i)} is the number of trials in the i-th source domain, z is the number of source domains, and n^* = \sum_{i=1}^{z} n_S^{(i)}.

Algorithm 1: Manifold Embedded Knowledge Transfer (MEKT)
Input: n_S source domain samples \{(X_{S,i}, y_{S,i})\}_{i=1}^{n_S}, where X_{S,i} \in \mathbb{R}^{c \times t} and y_{S,i} \in \{1, ..., l\}; n_T target domain feature matrices \{X_{T,i}\}_{i=1}^{n_T}, where X_{T,i} \in \mathbb{R}^{c \times t}; number of iterations N; weights \alpha, \beta, \rho; dimensionality of the shared subspace, p.
Output: y_T \in \mathbb{R}^{n_T \times 1}, the labels for \{X_{T,i}\}_{i=1}^{n_T}.
Calculate the covariance matrices \{P_{S,i}\}_{i=1}^{n_S} and their mean matrix M in the source domain, using (6), (7), or (8);
Calculate \{P_{S,i}'\}_{i=1}^{n_S} using (14);
Map each P_{S,i}' to a tangent space feature vector x_{S,i} \in \mathbb{R}^{d \times 1} using (19), where d = c(c+1)/2;
Repeat the above procedure to get x_{T,i} \in \mathbb{R}^{d \times 1} using (20);
Form X_S = [x_{S,1}, ..., x_{S,n_S}] and X_T = [x_{T,1}, ..., x_{T,n_T}];
Construct P, L, U, V and R in (32)-(34);
for n = 1, ..., N do
    Solve (35), and construct W \in \mathbb{R}^{2d \times p} as the p trailing eigenvectors;
    Construct A as the first d rows of W, and B as the last d rows;
    Train a classifier f on (A^\top X_S, y_S) and apply it to B^\top X_T to update y_T;
    Update R in (34);
end
return y_T.
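The core step of Algorithm 1, the generalized eigen-decomposition of (35), can be sketched as follows. This is a hypothetical illustration rather than the authors' released implementation: the function name, input conventions, and the small ridge added to keep V positive definite for the solver are all assumptions (the default weights follow the values used in Section IV):

```python
import numpy as np
from scipy.linalg import eigh

def mekt_projections(XS, XT, YS, YT, Sw, Sb, Lap,
                     alpha=0.01, beta=0.1, rho=20, p=10):
    """One pseudo-label-refinement iteration of MEKT: solve Eq. (35).

    XS: (d, nS), XT: (d, nT) tangent features; YS: (nS, l) one-hot source
    labels; YT: (nT, l) one-hot pseudo labels; Sw, Sb: source scatter
    matrices; Lap: normalized graph Laplacian of the target domain.
    """
    d, nS = XS.shape
    nT = XT.shape[1]
    NS, NT = YS / nS, YT / nT                       # Eq. (24)
    Z = np.zeros((d, d))
    P = np.block([[Sw, Z], [Z, Z]])                 # Eq. (32)
    L = np.block([[Z, Z], [Z, XT @ Lap @ XT.T]])
    I = np.eye(d)
    U = np.block([[I, -I], [-I, 2 * I]])            # Eq. (33)
    H = np.eye(nT) - np.ones((nT, nT)) / nT         # centering matrix
    V = np.block([[Sb, Z], [Z, XT @ H @ XT.T]])
    R = np.block([[XS @ NS @ NS.T @ XS.T, -XS @ NS @ NT.T @ XT.T],
                  [-XT @ NT @ NS.T @ XS.T, XT @ NT @ NT.T @ XT.T]])  # Eq. (34)
    # p trailing (smallest-eigenvalue) generalized eigenvectors of Eq. (35);
    # the tiny ridge on V is a numerical-stability assumption.
    w, Wmat = eigh(alpha * P + beta * L + rho * U + R,
                   V + 1e-8 * np.eye(2 * d))
    W = Wmat[:, :p]
    return W[:d], W[d:]                             # A, B
```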
E. Kernelization Analysis
Nonlinear MEKT can be achieved through kernelization in a reproducing kernel Hilbert space [15].

Let the kernel function be \phi: x \mapsto \phi(x). Define \Phi(X) = [\phi(x_1), ..., \phi(x_n)] \in \mathbb{R}^{d \times n}, where n = n_S + n_T. We use the Representer Theorem [31], A = \Phi(X)\tilde{A} and B = \Phi(X)\tilde{B}, to kernelize MEKT, where X = [X_S, X_T], and \tilde{A} \in \mathbb{R}^{n \times p} and \tilde{B} \in \mathbb{R}^{n \times p} are two projection matrices to be optimized.

Let K_S = \Phi(X)^\top \Phi(X_S) and K_T = \Phi(X)^\top \Phi(X_T). Then, all x are replaced by \phi(x), X_S by \Phi(X_S), and X_T by \Phi(X_T) in the above derivations. The optimization problem becomes

\min_{\tilde{A},\tilde{B}} \; \alpha \, \mathrm{tr}(\tilde{A}^\top \tilde{S}_w \tilde{A}) + \beta \, \mathrm{tr}(\tilde{B}^\top K_T L K_T^\top \tilde{B}) + \left\| N_S^\top K_S^\top \tilde{A} - N_T^\top K_T^\top \tilde{B} \right\|_F^2 + \rho \, \mathrm{tr}(\tilde{W}^\top \tilde{U} \tilde{W})
\quad \text{s.t.} \quad \tilde{B}^\top K_T H K_T^\top \tilde{B} = I, \quad \tilde{A}^\top \tilde{S}_b \tilde{A} = I,    (36)
where \tilde{S}_w = \sum_{k=1}^{l} K_S^k H_S^k (K_S^k)^\top, in which K_S^k is the part of K_S from Class k only, and H_S^k = I - \frac{1}{n_S^k} \mathbf{1} is the centering matrix. The Laplacian matrix L is constructed in the original data space. In \tilde{S}_b, m_k is the mean of K_S^k, and m the mean of K = [K_S, K_T]. \tilde{U} is obtained by replacing I in (33) with K. (36) can be optimized in a similar way as (30).
F. Domain Transferability Estimation (DTE)
When there are a large number of source domains, estimat-
ing domain transferability can advise which domains are more
important, and also reduce the computational cost. In BCIs,
DTE can be used to find subjects which have low correlations
to the tasks and hence may cause negative transfer. Although
source domain selection is important, it is very challenging,
and hence very few publications can be found in the literature
[4], [13], [32], [33].
Next, we propose an unsupervised DTE strategy.

Assume there are z labeled source domains \{S_i\}_{i=1}^{z} = \{X_S^{(i)}, y_S^{(i)}\}_{i=1}^{z}, where X_S^{(i)} is the feature matrix of the i-th source domain, and y_S^{(i)} is the corresponding label vector. Assume also there is a target domain T with unlabeled feature matrix X_T. Let S_b^{S_i} be the between-class scatter matrix of the i-th source domain, similar to S_b in (25), and S_b^{S_i,T} be the scatter matrix between the source and the target domains. We define the discriminability of the i-th source domain as DIS(S_i) = \|S_b^{S_i}\|_1, and the difference between the source domain and the target domain as DIF(S_i, T) = \|S_b^{S_i,T}\|_1.

Then, the transferability of Source Domain S_i is computed as:

r(S_i, T) = \frac{DIS(S_i)}{DIF(S_i, T)}.    (37)

We then select z^* \in (1, z) source subjects with the highest r(S_i, T).
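A minimal sketch of DTE per (37). Two details are assumptions on our part, since the text does not fully specify them: S_b^{S_i,T} is computed by treating the source and the target domains as two classes of a between-class scatter, and ‖·‖₁ is taken entrywise:

```python
import numpy as np

def between_class_scatter(X, y):
    """S_b = sum_k n_k (m_k - m)(m_k - m)^T, as in Eq. (25). X: (d, n)."""
    m = X.mean(axis=1, keepdims=True)
    Sb = np.zeros((X.shape[0], X.shape[0]))
    for k in np.unique(y):
        Xk = X[:, y == k]
        mk = Xk.mean(axis=1, keepdims=True)
        Sb += Xk.shape[1] * (mk - m) @ (mk - m).T
    return Sb

def dte_rank(sources, XT, z_star):
    """Rank source domains by the transferability r of Eq. (37).

    sources: list of (XS, yS) pairs; returns the indices of the z_star
    source domains with the highest r(S_i, T).
    """
    scores = []
    for XS, yS in sources:
        DIS = np.abs(between_class_scatter(XS, yS)).sum()   # entrywise L1
        Xst = np.hstack([XS, XT])
        dom = np.r_[np.zeros(XS.shape[1]), np.ones(XT.shape[1])]
        DIF = np.abs(between_class_scatter(Xst, dom)).sum()
        scores.append(DIS / DIF)
    return np.argsort(scores)[::-1][:z_star]
```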
IV. EXPERIMENTS
In this section, we evaluate our method for both single-
source to single-target (STS) transfers and multi-source to
single-target (MTS) transfers. The code is available online².
A. Datasets
We used two MI datasets and two ERP datasets in our
experiments. Their statistics are summarized in Table I.
TABLE I
STATISTICS OF THE TWO MI AND TWO ERP DATASETS.

Dataset   Number of   Number of   Number of      Trials per   Class-
          Subjects    Channels    Time Samples   Subject      Imbalance
MI1       7           59          300            200          No
MI2       9           22          750            144          No
RSVP      11          8           45             368-565      Yes
ERN       16          56          260            340          Yes
For both MI datasets, a subject sat in front of a computer
screen. At the beginning of a trial, a fixation cross appeared on
the black screen to prompt the subject to be prepared. Shortly
after, an arrow pointing to a certain direction was presented
as a visual cue for a few seconds, during which the subject
performed a specific MI task. Then, the visual cue disappeared,
and the next trial started after a short break. EEG signal was
recorded during the experiment, and used to classify which
MI the user was performing. Usually, EEG shortly after the
visual cue onset is highly related to the MI task.
0 1 2 3 5 6 74 8 t (s)
Fixation CrossVisual Cue: Motor Imagery
Break Next Trial
Fig. 2. Timing scheme of the motor imagery tasks in the first two datasets.
For the first MI dataset³ (MI1), 59-channel EEGs were recorded at 100 Hz from seven healthy subjects, each with 100 left hand MIs and 100 right hand MIs. For the second MI dataset⁴ (MI2), 22-channel EEGs were recorded at 250 Hz from nine healthy subjects, each with 72 left hand MIs and 72 right hand MIs. Both datasets were used for two-class classification.

²https://github.com/chamwen/MEKT
³http://www.bbci.de/competition/iv/desc_1.html
⁴http://www.bbci.de/competition/iv/desc_2a.pdf
The first ERP dataset⁵ contained 8-channel EEG recordings from 11 healthy subjects in a rapid serial visual presentation (RSVP) experiment. The images were presented at different rates (5, 6, and 10 Hz) in three different experiments; we only used the 5 Hz version. The goal was to classify from the EEG whether the subject had seen a target image (with an airplane) or a non-target image (without an airplane). The number of images for different subjects varied between 368 and 565, and the target to non-target ratio was around 1:9. The sampling rate was 2048 Hz, and the RSVP data had been band-pass filtered to 0.15-28 Hz.
The second ERP dataset⁶ was recorded from a feedback error-related negativity (ERN) experiment [34], which was used in a Kaggle competition for two-class classification. It was collected from 26 subjects and partitioned into a training set (16 subjects) and a test set (10 subjects). We only used the 16 subjects in the training set, as we did not have access to the test set. The average target to non-target ratio was around 1:4.
The 56-channel EEG data had been downsampled to 200 Hz.
B. EEG Data Preprocessing
EEG signals from all datasets were preprocessed using the
EEGLAB toolbox [35]. We followed the same preprocessing
procedures in [23], [36].
For the two MI datasets, a causal 50-order 8-30 Hz⁷ finite impulse response (FIR) band-pass filter was used to remove muscle artifacts and direct current drift, and hence to obtain cleaner MI signals. Next, EEG signals between [0.5, 3.5] seconds after the cue onsets were extracted as trials. The RSVP signal was downsampled to 64 Hz to reduce the computational cost, and epoched to 0.7 s intervals immediately after the stimulus onsets as trials. The ERN signal was band-pass filtered to 1-40 Hz, and epoched to 1.3 s intervals immediately after the feedbacks (which contained the ERP associated with the user's response to the feedback event) as trials.
MI1 had 59 EEG channels, which resulted in a very large number of tangent space features. Thus, we reduced the number of its tangent space features to the number of source domain samples (200), according to their F values in one-way ANOVA. For the ERN dataset, we used xDAWN [38] to reduce the number of channels from 56 to 6.
The dimensionalities of different input spaces are shown in
Table II. ni is the number of samples in the i-th domain, and
c the number of selected channels for the two ERP datasets.
Specifically, for RSVP, c = 8 and ni varies from 368 to 565;
for ERN, c = 6 and ni = 340. Augmented covariance matrices
[10] were used in the Riemannian space for ERP, so they
had dimensionality of 2c × 2c. Only the c × c upper right
block of the augmented covariance matrix contains temporal
information [10], so these c2 elements were selected as the
tangent space features.
⁵https://www.physionet.org/physiobank/database/ltrsvp/
⁶https://www.kaggle.com/c/inria-bci-challenge
⁷We band-pass filtered the EEG signals to 8-30 Hz because MI is mainly indicated by changes in the mu rhythm (about 8-13 Hz) and the beta rhythm (about 14-30 Hz) [37].

TABLE II
INPUT SPACE DIMENSIONALITIES IN DIFFERENT STS TASKS.

            MI1         MI2         ERP (RSVP and ERN)
Euclidean   6×200       6×144       20×n_i
Tangent     200×200     253×144     c²×n_i
Riemannian  59×59×200   22×22×144   2c×2c×n_i

Next, we describe how the Euclidean space features were determined. For the two MI datasets, six log-variance features of the CSP filtered trials [see (4)] were used as features. For the two ERP datasets, after spatial filtering by xDAWN, we assembled each EEG trial (which is a matrix) into a vector,
performed principal component analysis on all vectors from
the source subjects, and extracted the scores for the first 20
principal components as features.
C. Baseline Algorithms
We compared our MEKT approaches (MEKT-R: the Rie-
mannian mean is used as the reference matrix; MEKT-E: the
Euclidean mean is used as the reference matrix; MEKT-L:
the Log-Euclidean mean is used as the reference matrix) with
seven state-of-the-art baseline algorithms for BCI classifica-
tion. According to the feature space type, these baselines can
be divided into three categories:
1) Euclidean space approaches:
a) CSP-LDA (linear discriminant analysis) [39] for
MI, and CSP-SVM (support vector machine) [40]
for ERP.
b) EA-CSP-LDA for MI, and EA-xDAWN-SVM for
ERP, i.e., we performed EA [23] as a preprocessing
step before spatial filtering and classification.
2) Riemannian space approach: RA-MDM [22] for MI, and
xDAWN-RA-MDM for ERP.
3) Tangent space approaches, which were proposed for computer vision applications and have not been used in BCIs before. CA was used before each of them. In
each learned subspace, the sLDA classifier [41] was used
for MI, and SVM for ERP.
a) CA (centroid alignment).
b) CA-CORAL (correlation alignment) [12].
c) CA-GFK (geodesic flow kernel) [13].
d) CA-JDA (joint distribution adaptation) [14].
e) CA-JGSA (joint geometrical and statistical align-
ment) [15].
Hyper-parameters of all baselines were set according to the recommendations in their corresponding publications. For MEKT, N = 5, α = 0.01, β = 0.1, ρ = 20, and p = 10 were used.
D. Experimental Settings
We evaluated unsupervised STS and MTS transfers. In STS,
one subject was selected as the target, and another as the
source. Let z be the number of subjects in a dataset. Then,
there were z(z − 1) different STS tasks. In MTS, one subject
was used as the target, and all others as the sources, so there
were z different MTS tasks. For example, MI1 included seven
subjects, so we had 7 × 6 = 42 STS tasks, e.g., S2→S1 (Subject 2 as the source, and Subject 1 as the target), S3→S1, S4→S1, S5→S1, S6→S1, S7→S1, ..., S6→S7, and seven MTS tasks, e.g., {S2, S3, S4, S5, S6, S7}→S1, ..., {S1, S2, S3, S4, S5, S6}→S7.
The balanced classification accuracy (BCA) was used as the performance measure:

\mathrm{BCA} = \frac{1}{l} \sum_{k=1}^{l} \frac{tp_k}{n_k},    (38)

where tp_k and n_k are the number of true positives and the number of samples in Class k, respectively.
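Eq. (38) is simply the mean per-class recall; a minimal sketch:

```python
import numpy as np

def balanced_classification_accuracy(y_true, y_pred):
    """Eq. (38): average of tp_k / n_k over the l classes."""
    classes = np.unique(y_true)
    return np.mean([(y_pred[y_true == k] == k).mean() for k in classes])
```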
E. Visualization
As explained in Section III-A, CA makes the aligned
covariance matrices approximate the identity matrix, no matter
whether the Riemannian mean, or the Euclidean mean, or
the Log-Euclidean mean, is used as the reference matrix. To
demonstrate that, Fig. 3 shows the raw covariance matrix of the
first EEG trial of Subject 1 in MI2, and the aligned covariance
matrices using different references. The raw covariance matrix
is nowhere close to identity, but after CA, the covariance ma-
trices are approximately identity, and hence the corresponding
EEG trials are approximately whitened.
Fig. 3. The raw covariance matrix (Trial 1, Subject 1, MI2), and those after CA using different reference matrices.
Next, we used t-SNE [42] to reduce the dimensionality of
the EEG trials to two, and visualize if MEKT can bring the
data distributions of the source and the target domains together.
Fig. 4 shows the results on transferring Subject 2’s data to
Subject 1 in MI2, before and after different data alignment
approaches. Before CA, the source domain and target domain
samples do not overlap at all. After CA, the two sets of
samples have identical mean, but different variances. CA-GFK
and CA-JDA make the variance of the source domain samples
and the variance of the target domain samples approximately
identical, but different classes are still not well separated.
MEKT-R not only makes the overall distributions of the source domain samples and the target domain samples consistent, but also brings samples from the same class in the two domains close, which should benefit the classification.
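The visualization can be reproduced with any t-SNE implementation; a minimal sketch with scikit-learn (the toolbox choice is an assumption; the paper does not specify one):

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(XS, XT, seed=0):
    """2-D t-SNE embedding of stacked source/target tangent features.

    XS: (d, nS), XT: (d, nT); returns (nS + nT, 2) coordinates.
    """
    X = np.hstack([XS, XT]).T
    return TSNE(n_components=2, random_state=seed).fit_transform(X)
```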
Fig. 4. t-SNE visualization of the data distributions before and after CA, and with different transfer learning approaches, when transferring Subject 2's data (source) to Subject 1 (target) in MI2. (Panels: Raw, CA, CA-GFK, CA-JDA, MEKT-R; legend: Source Class 1, Source Class 2, Target Class 1, Target Class 2.)
F. Classification Accuracies
The means and standard deviations of the BCAs on the four datasets with STS and MTS transfers are shown in Tables III and IV, respectively. All MEKT-based approaches achieved the best (in bold) or the second best (underlined) performance in all scenarios, compared with the baselines.
Fig. 5 shows the BCAs of all tangent space based ap-
proaches when different reference matrices were used in CA.
The Riemannian mean obtained the best BCA in four out of
the six approaches, and also the best overall performance.
We also performed paired t-tests on the BCAs to check if the performance improvements of MEKT-R over the others were statistically significant. Before each t-test, we performed a Lilliefors test [43] to verify that the null hypothesis that the data come from a normal distribution could not be rejected. Then, we performed false discovery rate corrections [44] by a linear step-up procedure under a fixed significance level (α = 0.05) on the paired p-values of each task.
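For reference, the linear step-up (Benjamini-Hochberg) adjustment of [44] can be sketched as follows (an illustrative helper, not the authors' code):

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Linear step-up FDR correction [44]: returns adjusted p-values (q-values)."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    q = np.empty(m)
    # adjusted p-value for rank i: min over ranks j >= i of m * p_(j) / j
    prev = 1.0
    for rank, idx in list(enumerate(order, start=1))[::-1]:
        prev = min(prev, p[idx] * m / rank)
        q[idx] = prev
    return q
```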
The false discovery rate adjusted p-values (q-values) are
shown in Table V.
TABLE III
MEAN (%) AND STANDARD DEVIATION (%; IN PARENTHESIS) OF THE BCAS IN STS TRANSFERS. FOR THE CA-BASED APPROACHES, THE SLDA CLASSIFIER WAS USED FOR MI, AND SVM FOR ERP.

              MI1            MI2            Avg
CSP-LDA       57.23 (10.56)  58.70 (11.58)  57.97
EA-CSP-LDA    66.85 (10.56)  65.00 (14.06)  65.93
RA-MDM        64.98 (10.37)  66.60 (12.60)  65.79
CA            66.17 (9.93)   66.02 (13.14)  66.10
CA-CORAL      67.69 (10.68)  67.26 (13.34)  67.48
CA-GFK        66.62 (10.53)  65.54 (13.56)  66.08
CA-JDA        66.01 (12.55)  66.59 (15.28)  66.30
CA-JGSA       65.81 (13.06)  65.90 (16.73)  65.85
MEKT-E        69.19 (12.84)  68.34 (15.51)  68.77
MEKT-L        70.74 (12.28)  68.56 (15.66)  69.65
MEKT-R        70.99 (12.46)  68.73 (15.73)  69.86

              RSVP           ERN            Avg
CSP-LDA       58.58 (7.98)   54.34 (5.87)   56.46
EA-CSP-LDA    58.76 (7.51)   55.57 (6.26)   57.17
RA-MDM        60.37 (8.05)   56.22 (6.89)   58.30
CA            58.34 (6.98)   56.97 (7.06)   57.66
CA-CORAL      58.45 (6.84)   57.04 (7.00)   57.75
CA-GFK        59.93 (7.61)   57.24 (7.34)   58.59
CA-JDA        60.27 (7.75)   57.56 (7.63)   58.92
CA-JGSA       55.23 (6.74)   57.17 (7.72)   56.20
MEKT-E        61.08 (8.59)   58.01 (7.76)   59.55
MEKT-L        61.15 (8.44)   57.91 (7.74)   59.53
MEKT-R        61.24 (8.36)   57.85 (7.75)   59.55
TABLE IV
MEAN (%) AND STANDARD DEVIATION (%; IN PARENTHESIS) OF THE BCAS IN MTS TRANSFERS.

              MI1            MI2            Avg
CSP-LDA       59.71 (12.93)  67.75 (12.92)  63.73
EA-CSP-LDA    79.79 (6.57)   73.53 (15.96)  76.66
RA-MDM        73.29 (9.25)   72.07 (9.88)   72.68
CA            76.29 (9.66)   71.84 (13.89)  74.07
CA-CORAL      78.86 (8.73)   72.38 (13.38)  75.62
CA-GFK        76.79 (12.57)  72.99 (15.82)  74.89
CA-JDA        81.07 (11.19)  74.15 (15.77)  77.61
CA-JGSA       76.79 (12.35)  73.07 (16.33)  74.93
MEKT-E        81.29 (10.18)  76.00 (17.61)  78.65
MEKT-L        83.07 (9.30)   76.54 (16.72)  79.81
MEKT-R        83.42 (9.55)   76.31 (16.76)  79.87

              RSVP           ERN            Avg
CSP-LDA       65.36 (9.32)   61.87 (4.51)   63.62
EA-CSP-LDA    69.07 (9.05)   64.63 (5.86)   66.85
RA-MDM        67.29 (8.38)   62.90 (6.79)   65.10
CA            67.35 (7.52)   65.89 (7.30)   66.62
CA-CORAL      66.94 (7.46)   66.17 (7.74)   66.56
CA-GFK        67.75 (7.48)   66.03 (7.50)   66.89
CA-JDA        66.06 (6.18)   64.64 (6.50)   65.35
CA-JGSA       64.57 (5.79)   57.68 (8.04)   61.13
MEKT-E        67.92 (6.70)   66.70 (8.00)   67.31
MEKT-L        68.40 (6.40)   65.98 (7.94)   67.19
MEKT-R        68.38 (6.36)   66.17 (7.68)   67.28
Fig. 5. Average BCAs (%) of the tangent space approaches (CA, CA-CORAL, CA-GFK, CA-JDA, CA-JGSA, MEKT) on the four datasets, when different reference matrices were used in CA.
MEKT-R significantly outperformed all baselines in almost all STS transfers. The performance improvements became less significant when there were multiple source domains, which is reasonable, because generally in machine learning the differences between different algorithms diminish as the amount of training data increases.
TABLE V
FALSE DISCOVERY RATE ADJUSTED p-VALUES IN PAIRED t-TESTS (α = 0.05).

MEKT-R vs           MI1     MI2     RSVP    ERN
STS
  CSP-LDA           .0000   .0000   –       –
  xDAWN-SVM         –       –       .0002   .0000
  EA-CSP-LDA        .0030   .0003   –       –
  EA-xDAWN-SVM      –       –       .0000   .0000
  RA-MDM            .0003   .0340   .0412   .0004
  CA                .0000   .0006   .0000   .0010
  CA-CORAL          .0005   .0340   .0000   .0014
  CA-GFK            .0000   .0001   .0016   .0130
  CA-JDA            .0003   .0183   .0386   .2627
  CA-JGSA           .0021   .0006   .0000   .0241
MTS
  CSP-LDA           .0329   .1140   –       –
  xDAWN-SVM         –       –       .2077   .0306
  EA-CSP-LDA        .2808   .1636   –       –
  EA-xDAWN-SVM      –       –       .5733   .2632
  xDAWN-RA-MDM      .0824   .1636   .5347   .0632
  CA                .0329   .1260   .4727   .8380
  CA-CORAL          .0897   .1636   .3477   .9914
  CA-GFK            .0824   .1260   .5347   .9117
  CA-JDA            .2379   .1636   .0349   .0632
  CA-JGSA           .1344   .1636   .0323   .0018
We also considered linear and radial basis function (RBF;
kernel width 0.1) kernels in MEKT-R, and repeated the above
experiments. The results are shown in Table VI, where Primal
denotes the primal MEKT-R without kernelization. The primal
MEKT-R achieved the best (in bold) or the second best (under-
lined) performance in all scenarios. However, the differences
among the three approaches were very small.
G. Computational Cost
This subsection empirically checks the computational cost of different algorithms, which were implemented in Matlab 2018a on a laptop with an i7-8550U CPU @ 1.80 GHz and 8 GB memory, running 64-bit Windows 10 Education Edition.
TABLE VI
AVERAGE BCAS (%) OF THE PROPOSED MEKT UNDER DIFFERENT KERNELS.

           Primal   Linear   RBF
STS  MI1   70.99    70.99    70.37
     MI2   68.73    68.73    68.37
     RSVP  61.24    60.49    61.78
     ERN   57.85    57.44    58.45
MTS  MI1   83.42    83.36    78.21
     MI2   76.31    76.31    76.08
     RSVP  68.38    68.41    68.22
     ERN   66.17    66.02    65.49
Avg        69.14    68.97    68.37

For simplicity, we only selected one transfer task in each dataset. For STS transfer, the first subject in each dataset was selected as the target domain, and the second subject as the source domain. For MTS transfer, the first subject was used as the target domain, and all other subjects as the source domains. We repeated the experiment 20 times, and show the average computing time in Table VII. In summary, EA was the most efficient. RA-MDM, CA-JDA and MEKT-R had similar computational costs. MEKT-L and MEKT-E had classification performance comparable with MEKT-R (Tables III and IV), but much lower computational cost. MEKT-L achieved the best compromise between classification accuracy and computational cost.
TABLE VII
COMPUTING TIME (SECONDS) OF DIFFERENT APPROACHES IN STS AND MTS TRANSFERS.

           RA-MDM   EA     CA-JDA   MEKT-E   MEKT-L   MEKT-R
STS  MI1   5.49     0.44   5.45     2.53     2.75     5.42
     MI2   0.48     0.27   0.54     0.43     0.47     0.60
     RSVP  0.42     0.05   0.45     0.23     0.27     0.43
     ERN   0.54     0.47   0.53     0.38     0.42     0.53
MTS  MI1   13.61    0.94   9.24     11.06    11.48    12.96
     MI2   1.01     0.69   1.35     1.13     1.20     1.29
     RSVP  3.13     1.08   8.64     5.61     5.98     6.74
     ERN   5.49     7.95   14.92    10.39    10.74    11.95
H. Effectiveness of the Joint Probability MMD
To validate the superiority of the joint probability MMD over the traditional MMD, we replaced the joint probability MMD term D'_{S,T} in (30) by the traditional MMD term D_{S,T} in (21), and repeated the experiments. The results are shown in Table VIII. The joint probability MMD outperformed the traditional MMD in six out of the eight tasks. We expect that the joint probability MMD should also be advantageous in other applications where the traditional MMD is currently used.
I. Effectiveness of DTE
This subsection validates our DTE strategy on MTS tasks
to select the most beneficial source subjects.
Table IX shows the BCAs when different source domain selection approaches were used: RAND randomly selected round[(z−1)/2] source subjects (because of the randomness, we repeated the experiment 20 times, and report the mean and standard deviation in parentheses); ROD was the approach proposed in [13]; and ALL used all z source subjects.
TABLE VIII
AVERAGE BCAS (%) WHEN D_{S,T} IN (21) OR D'_{S,T} IN (22) WAS USED IN (30).

           D_{S,T}   D'_{S,T}
STS  MI1   65.33     70.99
     MI2   66.78     68.73
     RSVP  61.11     61.24
     ERN   58.62     57.85
MTS  MI1   73.86     83.42
     MI2   74.23     76.31
     RSVP  69.33     68.38
     ERN   65.59     66.17
Avg        66.86     69.14
Table X shows the computational cost of the different approaches.
Tables IX and X show that the proposed DTE outperformed RAND and ROD in terms of classification accuracy. Although its BCAs were generally slightly worse than those of ALL, its computational cost was much lower, especially when z was large: when z ≫ 1, it saved over 50% of the computational cost.
TABLE IX
AVERAGE BCAS (%) WITH DIFFERENT SOURCE DOMAIN SELECTION APPROACHES. RAND, ROD AND DTE EACH SELECTED round[(z−1)/2] SOURCE SUBJECTS; ALL USED ALL SOURCE SUBJECTS.

       z    RAND           ROD     DTE     ALL
MI1    7    81.53 (1.19)   81.86   82.14   83.42
MI2    9    75.05 (1.06)   74.38   76.23   76.31
RSVP   11   67.48 (0.31)   67.79   68.70   68.38
ERN    16   65.31 (0.52)   65.36   65.51   66.17
TABLE X
COMPUTING TIME (SECONDS) OF DIFFERENT SOURCE DOMAIN SELECTION APPROACHES. RAND, ROD AND DTE EACH SELECTED round[(z−1)/2] SOURCE SUBJECTS; ALL USED ALL SOURCE SUBJECTS.

       z    RAND    ROD     DTE     ALL
MI1    7    11.55   12.46   11.77   12.84
MI2    9    0.90    1.11    0.94    1.24
RSVP   11   3.08    3.22    3.10    6.80
ERN    16   6.27    6.42    6.29    11.57
V. CONCLUSIONS
Transfer learning is popular in EEG-based BCIs to cope
with variations among different subjects and/or tasks. This
paper has considered offline unsupervised cross-subject EEG
classification, i.e., we have labeled EEG trials from one or
more source subjects, but only unlabeled EEG trials from the
target subject. We proposed a novel MEKT approach, which
has three steps: 1) align the covariance matrices of the EEG
trials in the Riemannian manifold; 2) extract tangent space
features; and, 3) perform domain adaptation by minimizing
the joint probability distribution shift between the source
and the target domains, while preserving their geometric
structures. An optional fourth step, DTE, was also proposed
to identify the most beneficial source domains, and hence
to reduce the computational cost. Experiments on four EEG
datasets from two different BCI paradigms demonstrated that
MEKT outperformed several state-of-the-art transfer learning
approaches. Moreover, DTE can reduce more than half of
the computational cost when the number of source subjects
is large, with little sacrifice of classification accuracy.
REFERENCES
[1] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan, "Brain-computer interfaces for communication and control," Clinical Neurophysiology, vol. 113, no. 6, pp. 767–791, Jun. 2002.
[2] R. P. Rao, Brain-Computer Interfacing: An Introduction. Cambridge, England: Cambridge University Press, 2013.
[3] B. He, B. Baxter, B. J. Edelman, C. C. Cline, and W. W. Ye, "Noninvasive brain-computer interfaces based on sensorimotor rhythms," Proc. of the IEEE, vol. 103, no. 6, pp. 907–925, May 2015.
[4] D. Wu, "Online and offline domain adaptation for reducing BCI calibration effort," IEEE Trans. on Human-Machine Systems, vol. 47, no. 4, pp. 550–563, Sep. 2017.
[5] F. Lotte, L. Bougrain, A. Cichocki, M. Clerc, M. Congedo, A. Rakotomamonjy, and F. Yger, "A review of classification algorithms for EEG-based brain-computer interfaces: a 10 year update," Journal of Neural Engineering, vol. 15, no. 3, p. 031005, Apr. 2018.
[6] Z. J. Koles, M. S. Lazar, and S. Z. Zhou, "Spatial patterns underlying population differences in the background EEG," Brain Topography, vol. 2, no. 4, pp. 275–284, Jun. 1990.
[7] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache, "Fast and simple calculus on tensors in the log-Euclidean framework," in Proc. Int'l Conf. on Medical Image Computing and Computer-Assisted Intervention, Palm Springs, CA, Oct. 2005, pp. 115–122.
[8] A. Barachant, S. Bonnet, M. Congedo, and C. Jutten, "Multiclass brain-computer interface classification by Riemannian geometry," IEEE Trans. on Biomedical Engineering, vol. 59, no. 4, pp. 920–928, Apr. 2012.
[9] F. Yger, M. Berar, and F. Lotte, "Riemannian approaches in brain-computer interfaces: a review," IEEE Trans. on Neural Systems and Rehabilitation Engineering, vol. 25, no. 10, pp. 1753–1762, Nov. 2017.
[10] L. Korczowski, M. Congedo, and C. Jutten, "Single-trial classification of multi-user P300-based brain-computer interface using Riemannian geometry," in Proc. 37th Annu. Int'l Conf. IEEE Eng. Med. Biol. Soc., Milan, Italy, Aug. 2015, pp. 1769–1772.
[11] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, "Domain adaptation via transfer component analysis," IEEE Trans. on Neural Networks, vol. 22, no. 2, pp. 199–210, Feb. 2011.
[12] B. Sun, J. Feng, and K. Saenko, "Return of frustratingly easy domain adaptation," in Proc. 30th AAAI Conf. on Artificial Intelligence, Arizona, Feb. 2016.
[13] B. Gong, Y. Shi, F. Sha, and K. Grauman, "Geodesic flow kernel for unsupervised domain adaptation," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Providence, RI, Jun. 2012, pp. 2066–2073.
[14] M. Long, J. Wang, G. Ding, J. Sun, and P. S. Yu, "Transfer feature learning with joint distribution adaptation," in Proc. IEEE Int'l Conf. on Computer Vision, Sydney, Australia, Dec. 2013, pp. 2200–2207.
[15] J. Zhang, W. Li, and P. Ogunbona, "Joint geometrical and statistical alignment for visual domain adaptation," in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Hawaii, Jul. 2017, pp. 1859–1867.
[16] D. Wu, B. J. Lance, and T. D. Parsons, "Collaborative filtering for brain-computer interaction using transfer learning and active class selection," PLOS One, vol. 8, no. 2, p. e56624, Feb. 2013.
[17] D. Wu, V. J. Lawhern, W. D. Hairston, and B. J. Lance, "Switching EEG headsets made easy: Reducing offline calibration effort using active weighted adaptation regularization," IEEE Trans. on Neural Systems and Rehabilitation Engineering, vol. 24, no. 11, pp. 1125–1137, Mar. 2016.
[18] V. Jayaram, M. Alamgir, Y. Altun, B. Scholkopf, and M. Grosse-Wentrup, "Transfer learning in brain-computer interfaces," IEEE Computational Intelligence Magazine, vol. 11, no. 1, pp. 20–31, Jan. 2016.
[19] H. Kang, Y. Nam, and S. Choi, "Composite common spatial pattern for subject-to-subject transfer," IEEE Signal Processing Letters, vol. 16, no. 8, pp. 683–686, Aug. 2009.
[20] F. Lotte and C. Guan, "Learning from other subjects helps reducing brain-computer interface calibration time," in Proc. IEEE Int'l Conf. on Acoustics, Speech and Signal Processing, Dallas, TX, Mar. 2010, pp. 614–617.
[21] Y. Jin, M. Mousavi, and V. R. de Sa, "Adaptive CSP with subspace alignment for subject-to-subject transfer in motor imagery brain-computer interfaces," in Proc. 6th Int'l Conf. on Brain-Computer Interface (BCI), GangWon, South Korea, Jan. 2018, pp. 1–4.
[22] P. Zanini, M. Congedo, C. Jutten, S. Said, and Y. Berthoumieu, "Transfer learning: a Riemannian geometry framework with applications to brain-computer interfaces," IEEE Trans. on Biomedical Engineering, vol. 65, no. 5, pp. 1107–1116, Aug. 2018.
[23] H. He and D. Wu, "Transfer learning for brain-computer interfaces: A Euclidean space data alignment approach," IEEE Trans. on Biomedical Engineering, vol. 67, no. 2, pp. 399–410, 2020.
[24] A. Barachant, S. Bonnet, M. Congedo, and C. Jutten, "Classification of covariance matrices using a Riemannian-based kernel for BCI applications," Neurocomputing, vol. 112, pp. 172–178, Jul. 2013.
[25] H. Ramoser, J. Muller-Gerking, and G. Pfurtscheller, "Optimal spatial filtering of single trial EEG during imagined hand movement," IEEE Trans. on Rehabilitation Engineering, vol. 8, no. 4, pp. 441–446, Dec. 2000.
[26] R. Bhatia, Positive Definite Matrices. New Jersey: Princeton University Press, 2009.
[27] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Scholkopf, and A. Smola, "A kernel two-sample test," Journal of Machine Learning Research, vol. 13, no. 3, pp. 723–773, Mar. 2012.
[28] M. Belkin and P. Niyogi, "Semi-supervised learning on Riemannian manifolds," Machine Learning, vol. 56, no. 1-3, pp. 209–239, Jul. 2004.
[29] D. Cai, X. He, and J. Han, "Document clustering using locality preserving indexing," IEEE Trans. on Knowledge and Data Engineering, vol. 17, no. 12, pp. 1624–1637, Oct. 2005.
[30] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Computation, vol. 15, no. 6, pp. 1373–1396, Jun. 2003.
[31] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," Journal of Machine Learning Research, vol. 7, pp. 2399–2434, Nov. 2006.
[32] D. Wu, V. J. Lawhern, S. Gordon, B. J. Lance, and C.-T. Lin, "Driver drowsiness estimation from EEG signals using online weighted adaptation regularization for regression (OwARR)," IEEE Trans. on Fuzzy Systems, vol. 25, no. 6, pp. 1522–1535, Nov. 2017.
[33] C.-S. Wei, Y.-P. Lin, Y.-T. Wang, T.-P. Jung, N. Bigdely-Shamlo, and C.-T. Lin, "Selective transfer learning for EEG-based drowsiness detection," in Proc. IEEE Int'l Conf. on Systems, Man and Cybernetics, Hong Kong, Oct. 2015, pp. 3229–3232.
[34] P. Margaux, M. Emmanuel, D. Sébastien, B. Olivier, and M. Jérémie, "Objective and subjective evaluation of online error correction during P300-based spelling," Advances in Human-Computer Interaction, vol. 2012, p. 4, Dec. 2012.
[35] A. Delorme and S. Makeig, "EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics including independent component analysis," Journal of Neuroscience Methods, vol. 134, no. 1, pp. 9–21, Mar. 2004.
[36] X. Zhang and D. Wu, "On the vulnerability of CNN classifiers in EEG-based BCIs," IEEE Trans. on Neural Systems and Rehabilitation Engineering, vol. 27, no. 5, pp. 814–825, Apr. 2019.
[37] Q. Ai, Q. Liu, W. Meng, and S. Q. Xie, Advanced Rehabilitative Technology. Academic Press, 2018.
[38] B. Rivet, A. Souloumiac, V. Attina, and G. Gibert, "xDAWN algorithm to enhance evoked potentials: application to brain-computer interface," IEEE Trans. on Biomedical Engineering, vol. 56, no. 8, pp. 2035–2043, Aug. 2009.
[39] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY: Springer, 2006.
[40] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. on Intelligent Systems and Technology, vol. 2, no. 3, p. 27, Apr. 2011.
[41] R. Peck and J. Van Ness, "The use of shrinkage estimators in linear discriminant analysis," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 4, no. 5, pp. 530–537, May 1982.
[42] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, Nov. 2008.
[43] H. W. Lilliefors, "On the Kolmogorov-Smirnov test for normality with mean and variance unknown," Journal of the American Statistical Association, vol. 62, no. 318, pp. 399–402, Jun. 1967.
[44] Y. Benjamini and Y. Hochberg, "Controlling the false discovery rate: a practical and powerful approach to multiple testing," Journal of the Royal Statistical Society: Series B (Methodological), vol. 57, no. 1, pp. 289–300, Jan. 1995.