journal of ?, vol., no., ? ? 1 unseen face presentation … · 2020. 1. 1. · with the problem,...

JOURNAL OF ?, VOL. ?, NO. ?, ? ? 1

Unseen Face Presentation Attack Detection UsingClass-Specific Sparse One-Class Multiple Kernel

Fusion RegressionShervin Rahimzadeh Arashloo

Abstract—The paper addresses face presentation attack detec-tion in the challenging conditions of an unseen attack scenariowhere the system is exposed to novel presentation attacks thatwere not present in the training step. For this purpose, a pureone-class face presentation attack detection approach based onkernel regression is developed which only utilises bona fide(genuine) samples for training. In the context of the proposedapproach, a number of innovations, including multiple kernelfusion, client-specific modelling, sparse regularisation and proba-bilistic modelling of score distributions are introduced to improvethe efficacy of the method. The results of experimental evaluationsconducted on the OULU-NPU, Replay-Mobile, Replay-Attackand MSU-MFSD datasets illustrate that the proposed methodcompares very favourably with other methods operating in anunseen attack detection scenario while achieving very competitiveperformance to multi-class methods (benefiting from presentationattack data for training) despite using only bona fide samples fortraining.

Index Terms—Face presentation attack detection, spoofing, un-seen (novel) attacks, subject-specific modelling, novelty detection,one-class classification, kernel regression, sparse regularisation,multiple kernel fusion.

I. INTRODUCTION

THE functionality of face biometric systems in practicalsituations is compromised by their susceptibility to pre-

sentation attacks (PA’s) where an illegitimate trial is made toget access to the system by presenting fake biometric traitsto system sensors. Due to potential security risks associatedwith the problem, face presentation attack detection (a.k.a.anti-spoofing) has received much attention over the past years,resulting in a variety of countermeasures [1], [2]. A majorityof the existing approaches assume that the face PAD problemis a closed-set recognition task and subsequently formulate andtrain a binary classifier using the available positive (bona fide)and negative (PA) training samples. Nevertheless, even withthe impressive performances reported on some databases, thetechnology is not mature yet. The discrepancies in samplesdue to varying image acquisition settings, including sensorinteroperability issues, different environmental conditions, etc.degrade the performance of face PAD techniques. In thiscontext, a particularly challenging facet of the problem isdue to previously ”unseen” attack types for which no similartraining samples in terms of the presentation attack instruments

S.R. Arashloo is with the department of computer engineering, Fac-ulty of Engineering, Bilkent University, Ankara, 06800, Turkey. E-mail:[email protected].

Manuscript received ? ?, 2020; revised ? ?, 2020.

are available at training time. In these situations, the commonclosed-set two-class formulation of the problem tends to gen-eralise poorly, degrading the performance of face PAD meth-ods. While earlier benchmark datasets only included a singleattack type (typically printed paper), more recent databasesincorporate a more diverse set of attacks, including digitalphoto attacks, replay attacks, mask attacks, make-up attacks,etc. Although introducing new datasets [3], [4] that covera wider variety of possible presentation attack mechanismsand instruments is desirable, yet, it may not completely solvethe problem. The limitation arises from the fact that not allpossible attack scenarios can be anticipated and covered in thedatasets since there is always the possibility of developing anew attack strategy to bypass the existing countermeasures.Consequently, error rates obtained on one or more datasetscannot be generalised and regarded as representative error ratescorresponding to a real-world operation of the system.

In practice, presentation attacks may potentially appear asfairly diverse, or novel and unpredictable while bona fidesample distributions tend to have relatively less diversity.Notwithstanding other strategies, an alternative approach tothe problem is to try to capture the distribution of bonafide samples. This objective may be realised through a one-class classification (OCC) approach to identify patterns froma target class, conforming to a specific condition, and differ-entiate them from non-target objects. OCC differs from theconventional multi-class formulation in that it mainly relieson samples from a single class for training. In the context ofbiometric PAD, this approach has been examined, for instance,in speech [5], fingerprint [6], or iris anti-spoofing [7] as wellas in face PAD [8], [9] by considering bona fide samples astarget objects and then trying to detect PA’s as anomalies w.r.tthe population of target observations. The merits of a one-classstrategy against the conventional binary-class formulation maybe outlined as follows:

• The multi-modal nature of presentation attack data com-plicates learning an effective decision boundary as thebinary classifier, in this case, tries to associate a po-tentially very diverse set of attack samples to a singlecluster/class. In contrast, since a one-class face PADmethod primarily utilises positive (bona fide) samplesfor training, it would reduce the undesirable impacts ofthe diversity of presentation attack data on performance.In this respect, the difficulties associated with a sensiblechoice of the negative (non-target) set common to the

arX

iv:1

912.

1327

6v1

[cs

.CV

] 3

1 D

ec 2

019


conventional binary techniques would be also avoided byvirtue of a one-class formulation.

• In a real-world scenario, attackers may devise inventiveways to fool the system which may not have been avail-able during the training stage of the PAD system. Thiswould in turn have an adverse impact on the performanceof the PAD sub-system as the two-class classifiers whenexposed to novel (unseen) attack types, tend to exhibitpoor generalisation performance. On the contrary, a one-class method is expected to be able to naturally deal withirregular (unknown) patterns by its inherent capability todeal with never seen, non-target observations, and hence,is better equipped to detect unknown presentation attacks.

• Training a two-class presentation attack detection systemrequires both genuine as well as attack samples. Whilecollecting genuine access data is relatively easy, increas-ing the number of presentation attack data for trainingis more complicated and laborious. Hence, the designof presentation attack detection systems based on a two-class formulation may be affected by a class imbalanceproblem. In contrast, in a one-class anomaly detection-based approach it is relatively easier to extend the trainingset as the decision boundary, in this case, is primarilydetermined by single-class (i.e. bona fide) observations.

• The merits of using client-specific solutions over client-independent models in biometrics have been explored indifferent studies [10], [11], [12]. However, in the commonPAD formulations, the two-class learning scheme inhibitsthe use of subject-specific solutions in open datasetscenarios, as it is not viable to access subject-specificpresentation attack data at the enrolment stage. Theone-class formulation, by contrast, is readily amenableto subject-specific model construction in open datasetsettings.

• During the operational course of a biometrics systemmore samples would become available over time. Acommon approach in this case is to utilise the thusobtained samples to update the model. While the two-class PAD systems may be not be able to effectivelybenefit from such data to enhance the performance, apure anomaly detection-based formulation may readilyutilise such data as for training only positive samples arerequired.

• And last but not least, from a general pattern classificationperspective, since a one-class formulation makes lessassumptions regarding the problem at hand, it tendsto have a higher capacity to cope with the problemsembedded in the nature of the data such as feature andlabel noise, small sample size, contamination of the data,etc.

In practice, based on the availability of training samples, dif-ferent avenues may be followed to build a one-class classifier.In this study, we conform to the strictest protocol, avoidingthe use of PA data altogether (zero-shot [13]) and design theproposed approach in a pure anomaly detection frameworkutilising only bona fide samples for training.

A. Contributions

The main contributions of the current work may be sum-marised as follows:

1) Following a one-class pure anomaly detection formalismand using only bona fide samples for system training, anew approach to face PAD based on kernel regressionis presented. As only bona fide samples are used fortraining, in the evaluation phase, the detection system isnaturally exposed to unseen attacks.

2) A sparse representation-based variant of the proposedone-class face PAD approach is introduced where fordecision making at the operational phase, only a smallfraction of the bona fide training samples (as few as5 frames) may be used, resulting in multiple orders ofspeed-up gain with respect to its non-sparse counterpart.

3) A client-specific variant of the proposed face PADapproach is introduced by training a separate modelfor each individual client enrolled in the system. Theefficacy of the class-specific solution compared withthe client-independent scheme (applying a single globalmodel to detect PA’s against all subjects) is validated ondifferent datasets.

4) In the context of the proposed sparse subject-specificone-class model for face PAD, we present a multiplekernel fusion anomaly detection scheme to combine thecomplementary information provided by different viewsof the problem for improved detection performance.

5) In terms of the proposed approach, it is illustratedthat using generic pre-trained off-the-shelf deep CNNmodels, it is possible to obtain state-of-the-art perfor-mance, even in the challenging conditions of ”unseen”presentation attacks.

6) While the proposed approach is applicable to each indi-vidual frame in isolation, in order to derive a consensusdecision for an entire video sequence, we examine twoprocedures. The first one applies a mean fusion rule overraw scores corresponding to individual frames. The otherassumes a mean fusion over a probabilistic measure ofnormality. Both alternatives are illustrated to be effectivein improving detection performance for a video sequencecompared with a single frame.

7) And, lastly, a thorough evaluation of the proposed facePAD approach is carried out on four publicly availabledatasets in an unseen (zero-shot) presentation attackdetection scenario (not utilizing any presentation attackdata for training), demonstrating the competitive perfor-mance of the proposed approach against other one-classas well as multi-class face PAD techniques.

B. Organisation

The rest of the paper is organised as follows: Section IIreviews related work with an emphasis on unseen face PADapproaches. In Section III, some background material on linearregression and its extension to the reproducing kernel Hilbertspace is provided. In Section IV, the proposed one-classapproach for unseen face PAD is introduced where sparse reg-ularisation of the solution, client-specific model construction,


multiple kernel fusion, frame and video-level decision makingare discussed. The results of an experimental evaluation of theproposed approach are presented and discussed in Section V.Finally, conclusions are drawn in Section VI.

II. RELATED WORK

Different countermeasures including hardware-based,software-based and challenge-response methods [14] havebeen proposed for face presentation attack detection. Thesoftware-based approaches try to classify an image (sequence)based on different features derived from image content. Inthis study, we follow a software-based approach to face PADusing a single modality (i.e. visible spectrum RGB images)for PA detection in contrast to some other studies operatingbeyond the visible spectrum [15]. Among the cues conveyedby an image/image sequence for face PAD including texture,motion, frequency, colour, shape or reflectance, texture isthe most commonly used feature for face PAD [16], [17].Deployment of texture for face PAD is partly fuelled by theassumption that face PA’s introduce certain texture patternsthat do not occur in bona fide samples. A different categoryof face PAD methods constitutes motion-based approachesconsidering either intra-face variations [18], [19] or analysingthe consistency of a subject’s motion with the environment[20]. A alternative category of approaches corresponds tofrequency-based methods to discern image artefacts in theFourier domain either from a single image [21] or from animage sequence [22], [20]. Colour characteristics [23], [24]as well as shape information are also employed as differentcues to deal with presentation attacks [25]. Motivated by theassumption that bona fide and PA attempts behave differentlyunder the same illumination conditions, some other methods[26] use reflectance for face PAD. Other work [27] focuseson using a statistical model of image noise for face PAD. Arecent category of approaches corresponds to deep learningmethods. Inspired by the successful of such models, and inparticular, deep convolutional neural networks (CNN’s) indifferent image analysis applications, different methods havebeen proposed to utilise the discriminatory information ofdeep CNN representations for face PAD [28], [29], [30].

In terms of the two-class classifiers used for decision mak-ing, different alternatives have been examined. These includediscriminant classifiers with the Support Vector Machinesbeing the most frequently used approach [31], [32]. Otherclassifiers following a discriminative procedure include thelinear discriminant analysis [23], [33], neural networks [19],convolutional neural networks [28], [29], Bayesian networks[34] as well as Adaboost [35]. An alternative category includesregression-based approaches trying to project input featuresonto their labels [36]. Methods trying to learn a distance metric[37] also exist in the literature. Some heuristics have been alsoexamined for classification in face PAD [18], [38].

In contrary to the common close-set two-class formulationof the problem, there also exists a different category ofapproaches trying to address the face PAD problem in anunseen attack scenario where in the evaluation phase, thesystem is exposed to PA’s which were not available during the

training stage of the system. One main group of these methodsformulates the problem as an OCC problem to detect unseenPA’s [8]. In our earlier study [8], a new evaluation protocolas well as a one-class formulation of the face PAD problembased on the anomaly detection concept was presented wherethe training data came from the positive class only. The eval-uation and comparison of 20 different one-class and two-classmethods on different databases illustrated that in the presenceof unseen PA’s, one-class formalism may be considered as apromising direction for investigation. Other work following aone-class formulation [9] considers a Gaussian mixture model(GMM) one-class learner for classification operating on imagequality measures. Other work [39], analyses One-Class SVMand AutoEncoder-based classifiers to address unseen facePAD. Both approaches, together with the Local Binary Patternfeature are evaluated and compared on four face PAD datasets.The study in [40] examined a new strategy to utilise theclient-specific information to train one-class classifiers usingonly bona fide data. It was found that one-class classifierstrained with client-specific information appeared to be morerobust to unseen PA’s compared to client-independent andmulti-class methods. In an alternative study [13], detectionof unknown PA’s was addressed in a Zero-Shot Face Anti-spoofing (ZSFA) scenario via a deep tree network (DTN)to partition the PA samples into semantic sub-groups in anunsupervised fashion. Upon the arrival of new a sample,whether known or unknown in terms of attack type, DTNroutes it to the most similar PA cluster to make a binarydecision. The work in [41] introduces a method where a deepmetric learning model is proposed using a triplet focal loss asregularisation for the so-called metric-softmax approach. Thebenefits of the proposed deep anomaly detection architecturewere demonstrated by introducing a few-shot a posterioriprobability estimation mechanism.

For a more detailed review of the face PAD methods onemay consult [1], [42], [2].

III. BACKGROUND

In this section, a brief background on linear regression andits extension to the reproducing kernel Hilbert space (kernelregression) shall be provided. In Section IV, kernel regressionis developed as a tool for face PA anomaly detection.

A. Linear Regression

Let us assume there exists a set of points x1, . . . , xnin a d-dimensional space, i.e. xi ∈ Rd, collected into thematrix Xn×d which are to be projected onto the real numbersy1, . . . , yn, i.e. yi ∈ R, collectively represented as vectoryn×1. A commonly followed approach to the problem above isthat of linear least squares regression [43] where one assumesthat the responses yi’s are generated through a linear processg(xi) = α>xi and then tries to estimate α by minimising asum of squared errors:

αopt = argminα

(Xα− y)>(Xα− y) (1)


The solution to the problem above may be found by takingthe derivative of the cost function (Xα−y)>(Xα−y) w.r.t.α and setting it to zero which yields

α = (X>X)−1X>y (2)

Given α, function g(.) at an arbitrary point z may then beevaluated as

g(z) = α>z = y>X(X>X)−1z = y>(XX>)−1Xz

= [〈z, x1〉, . . . , 〈z, xn〉](XX>)−1y (3)

where 〈., .〉 denotes inner (dot) product and .> represents thetranspose operation.

B. Kernel Regression

Let F be a feature space induced by a non-linear mappingφ : Rd → F . For a suitably chosen mapping, an inner product〈., .〉 on F may be represented as 〈φ(xi),φ(xj)〉 = κ(xi, xj),where κ(., .) is a positive semi-definite kernel function. Inkernel regression, each point x is first projected onto φ(x)followed by seeking a real-valued function g(φ(x)) = f(x)minimising a sum of squared differences between the expectedand the generated responses. Using Eq. 3, the relation for f(z)may be written as

f(z) = [〈φ(z), φ(x1)〉, . . . , 〈φ(z), φ(xn)〉](φ(X)φ(X)>)−1y

= [κ(z, x1), . . . κ(z, xn)]K−1y (4)

where we have used the notation K = φ(X)φ(X)> to denote

the so-called kernel matrix. Denoting α = K−1y, functionf(.) may be represented as

f(.) = [κ(., x1), . . . κ(., xn)]α

=

n∑i=1

αiκ(., xi) (5)

The full responses on the training set X may be derived asf(X) = Kα and the corresponding cost function in this caseis ‖Kα− y‖2 where ‖.‖2 denotes L2-norm.

IV. KERNEL REGRESSION FOR ONE-CLASS FACE PAD

In a one-class classification task, it is desirable to havenormal samples forming a compact cluster while being distantfrom anomalies. In a reproducing kernel Hilbert space (RKHS)and in the absence of outlier training data, in a one-classclassification paradigm, it is common practice to consider theorigin as an artificial exemplar outlier with respect to thedistribution of positive samples [44].

In this work, a projection function (defined in terms ofkernel regression) is sought such that it maps bona fidesamples onto a compact cluster, distant from a hypotheticalnon-target observation lying at the origin. This objective maybe achieved by setting the responses for all target observationsto a common fixed real number, distinct from zero, i.e. yi = c,for all i, s.t. c 6= 0. In this case, the kernel regressionapproach would form a compact cluster of target samplesas they would be all mapped onto the same point. Notethat kernel regression performs an exact interpolation when

the parameters characterising the regression (i.e. α) can beuniquely determined which is the case when the kernel matrixis positive-definite. Since by assumption c 6= 0, the projectednormal observations would lie away from the (hypothetical)outlier, i.e. the origin. Without loss of generality one may setc = 1, as the exact value for c would only act as a scalingfactor. Having set the response vector y to 1n×1, one maysolve for α as α = K−1y = K−11n×1.

The procedure discussed above provides the best separabil-ity of normal samples from outliers with respect to the Fishercriterion, stated formally in the following theorem.

Theorem Assuming the origin as a hypothetical outlier, thekernel regression approach with the responses for all targetsamples set to a common fixed real number other than zero,yields a zero within-class scatter while providing a positivebetween-class scatter, and thus corresponds to the optimalkernel Fisher criterion for classification, i.e. kernel Fishernull-space.

Proof In a Fisher classifier, the criterion function to bemaximised is

J =ϕ>sBϕ

ϕ>sWϕ(6)

where sB is the between-class scatter matrix, sW stands forthe within-class scatter matrix and ϕ denotes one axis of thesubspace. For the two-category case, the Fisher criterion maybe equivalently written as the ratio of the between-class scatterto the total within-class scatter as [43]

J =sBsW

=(m1 −m2)

2

s21 + s22(7)

where m1 and m2 stand for the means of the two classes whiles21 and s22 represent the corresponding within-class scatters.As noted earlier, the Fisher analysis, originally developed fora binary classification problem, may be applied to the one-class scenario by assuming the origin as an exemplar outlier.In this case, the within-class scatter of the transformed bonafide samples is

s21 =∑C1

(yi −m1)2 = y>y − 1

ny>1n×ny (8)

where C1 denotes the target class while yi denotes the pro-jection of sample xi ∈ C1 onto an optimal feature subspaceand m1 stands for the mean of the positive class. y is a vectorcollection of the responses for transformed data points of thetarget class while n denotes the number of bona fide samples.

As for the negative (presentation attack) class, it is assumedthat only a single hypothetical sample exists at the origin, thusm2 = 0 and s2 = 0. The total within-class scatter in this casewould then be sW = s21 + s22 = s21.

Regarding the between-class scatter (numerator of J in Eq.7) we have

sB = (m1 −m2)2 = (

1

n

n∑i=1

yi − 0)2 =1

n2y>1n×ny (9)


As it is assumed that all target observations are mapped onto 1,by substituting y = (1, . . . , 1)> in the relation for the within-class scatter one obtains

sW |y=(1,...,1)> = y>y − 1

ny>1n×ny = 0 (10)

Using y = (1, . . . , 1)>, the between-class scatter may bederived as

sB |y=(1,...,1)> =1

n2y>1n×ny = 1 (11)

Thus, the one-class kernel regression when the responses ofall target (bona fide) samples are equal and distinct fromzero, corresponds to a projection function (i.e. f(.)) leadingto sW = 0 and sB = 1, and consequently, represents a kernelFisher null-space analysis [45], [46]. �

The foregoing kernel regression-based classification ap-proach may be considered as a one-class novelty detectionvariant of our earlier study [47] introduced for face PAD andthat of [10] previously developed for face verification.

A. Regularisation

Formulating a one-class classifier based on the kernelregression formalism opens the possibility to regularise thesolution. Regularising the solution of a regression problemmay be driven by different objectives. The first case whereregularisation may be applied is when the number of observa-tions is smaller than the number of variables which makesthe least-squares problem ill-posed. In this case, regulari-sation introduces additional constraints to uniquely specifythe solution. The second case is when the model suffersfrom poor generalisation performance. In these circumstances,regularisation serves to improve the generalisation capabilityof the model by imposing a limitation on the available functionspace by introducing a penalty to discourage certain regionsof the solution space. In such cases, regularisation correspondsto priors on the solution to maintain a trade-off between datafidelity and some constraint on the solution.

A regularised kernel regression problem may be expressedas finding the vector minimser of the following cost function:

P (α) = ‖Kα− y‖2 +R(α) (12)

where, as before, ‖Kα − y‖2 measures the closeness of thegenerated responses to the expected responses y while R(α)encodes a desired regularisation on the solution α.

Regularisation schemes favouring sparseness of the solutionto derive the simplest possible explanation of an observationin terms of as few as possible atoms from a dictionary areamong the widely applied techniques [48]. Sparsity of thesolution to a least squares problem may be encouraged viaan L1-norm, the thus obtained problem being known as abasis pursuit in signal processing and Lasso in statistics. Inaddition to enhanced generalisation performance, sparse L1-norm models lead to scalable algorithms that can be appliedto problems with a large number of parameters. In this work,encouraging sparseness of the solution of the one-class kernelregression is realised by prescribing an L1-regulariser on α,

i.e. R(α) = δ∑ni=1 |αi|. The objective function for the sparse

one-class kernel regression may then be expressed as

P (α) = ‖Kα− y‖2 + δ

n∑i=1

|αi| (13)

where parameter δ controls the trade-off between data fidelityand sparseness of the solution. By imposing a strong sparse-ness prior on α, each response yi would be characterised byonly a few samples from among the training set, an importantimplication of which is a reduction in the computational costof the algorithm in the operational stage.

B. Class-Specific ModellingThe common methodology in the literature for face PAD is

to learn a single classifier applicable to all samples, ignoringthe identity information. In other words, the PAD sub-system istypically designed in a client-independent fashion. The client-independent design scheme rests on the (implicit) assump-tion that the class label (bona fide/PA) of an observationis independent of the client identity. The client-independentformalism has been re-evaluated in different studies [49],[40], [50] where it has been observed that utilisation ofsubject-specific information leads to significant improvementsin detection performance. Some other work utilising subject-specific information include [36] and [51]. More specifically,in a class-specific one-class approach for face PAD, one tries todetermine what is anomalous with respect to a specific subjectas opposed to the class-independent formulation where thegoal is to decide what is an outlier in general, irrespectiveof any specific identity. Training a client-specific face PADmodel, however, necessitates subject identities to be accessibleboth during the training as well as the operational phaseof the system. As discussed in [49], such information isreadily available to a face PAD engine as it works alongsidea face recognition system. As such, in the training phase, theenrolment data associated with each client may be employedto build a subject-specific PAD model. In the operational phaseof a face verification system, a test subject claims an identity,while in an identification scenario, the test sample is com-pared against multiple models stored in the database, whoseidentities are already known. In both operational modes, theidentity of the target class is known and may be deployed bythe PAD sub-system to compare a test observation against themodel of a specific client. Our earlier study [40], examininga client-specific one-class approach validated the merits ofincorporating subject-specific information for the face PADproblem.

Driven by these observations, the current work advocates aclient-specific one-class kernel regression approach to the facePAD problem. More specifically, in the present study, duringthe training phase, a separate one-class kernel regressionclassifier is built for each subject using the corresponding bonafide data while in the operational stage, the test sample ismatched against a subject-specific model.

C. Multiple Kernel FusionThe kernel function plays an important role in kernel-based

methods as it specifies the embedding of the data in the


(a) (b) (c) (d) (e)

Fig. 1: Multiple regions used to derive local face representa-tions. (a): image obtained from face detection; (b): region R1;(c): region R2; (d): region R3; (e): region R4.

feature space. While ideally the embedding is to be learntdirectly from training samples, in practice, a relaxed versionof the problem is considered by trying to learn an optimalcombination of multiple kernels providing different viewsof the problem at hand. The coefficients characterising thecombination may then be learnt using training instances frommultiple classes [52]. In a one-class novelty detection approachand in the absence of non-target (PA) training samples (i.e.unseen face PAD), we opt for the average fusion of multiplebase kernels. In this context, diverse views of the face PADproblem are constructed following two mechanisms: use ofmultiple (local) face image regions and deployment of multiplerepresentations derived from these regions, discussed next.

1) Multiple Regions: In addition to the whole face imageproviding discriminatory information at a global level, differ-ent local regions of the face image convey distinct informationfor decision making while introducing diversity and providingthe possibility to identify facial parts which possess face PAcharacteristics. In order to benefit from local information,along with the whole face image cropped tightly from facedetection bounding box to minimise the background effects(identified as region 1), three additional local regions are alsoconsidered. These include eyes and nose as region 2, the areassurrounding the nose as region3 and region 4 which focuseson the areas surrounding the nose and the mouth, Fig. 1.

2) Multiple Image Representation: The second source ofdiversity for multiple kernel fusion is derived through usingdifferent image representations. While a wide range of dif-ferent image features including LBP and its variants [53],[10], image quality measures [53], [9], etc. are applied toface PAD, more recently, a great deal of research has beendirected towards investigating the applicability of deep convo-lutional neural network (CNN) representations for face PAD[30] illustrating that such features may provide discrimina-tive information for detection of presentation attacks. Theapplication of CNN-based representation may also simplifythe overall design architecture of biometric systems throughadoption of common representation for the face matchingproblem. Following the same methodology, the current studyutilises CNN representations as features for face PAD. Forthis purpose, features obtained from the penultimate layers ofpre-trained GoogleNet [54], ResNet50 [55] and VGG16 [56]networks are used to construct multiple representations foreach facial region.

3) Learning the discriminant: Once multiple representa-tions from different regions are derived, the learning process ofthe proposed subject-specific sparse one-class multiple kernel

fusion regression approach for subject c is given as

αc = argminα

‖ 1

RN(

R∑r=1

N∑n=1

Kcrn)α− y‖2 + δ

n∑i=1

|αi| (14)

where R = 4 and N = 3 denote the number of facial regionsand deep CNN networks used for face image representation,respectively while Kc

rn stands for the kernel matrix associatedwith region r whose representation is obtained via CNN withindex n. In a subject-specific approach, the kernel matricesKcrn’s are constructed by using the training instances (i.e.

features extracted from bona fide frames) of subject c only. Inthis study, Eq. 14, is solved using the Least Angle Regression(LARS) algorithm [57] which facilitates deriving solutionswith all possible cardinalities on α.

D. Decision Strategy

Once the projection parameter αc is inferred for client c,the projection of a test sample (z) onto the feature subspaceof subject c is given as

f c(z) =

M∑i=1

αci (1

RN

R∑r=1

N∑n=1

κrn(z,xci )) (15)

where αci denotes the ith (non-zero) element of the discrimi-nant in the Hilbert space for the cth subject while xci denotesthe ith training instance of subject c. M is the total numberof non-zero elements of α. κrn(z,xci ) is the kernel function,capturing the similarity between the rth region of the testsample z and that of the ith training instance of subject cbased on the representation derived through the nth deep CNNnetwork.

1) Raw score fusion: Eq. 15 provides the raw projectionscore for a single frame of a test video sequence. In orderto derive a score for the whole test video, one possibility isto simply average the raw scores corresponding to individualframes comprising the video, leading to a decision rule as

1

F

F∑f=1

f c(zf ) ≥ τ c bona fide

1

F

F∑f=1

f c(zf ) < τ c PA

(16)

where F denotes the total number of frames in a videosequence while τ c is the threshold for decision making forsubject c.

2) Fusion of probabilistic scores: Next, we approximatethe probability density function of the score distributions cor-responding to bona fide samples using a Gaussian function anduse the cumulative density function of the inferred Gaussian as


a probabilistic measure of normality. In this case, the decisionrule for a video sequence reads

1

F

F∑f=1

∫ zf

−∞Nµ,σ(zf ) ≥ τ c bona fide

1

F

F∑f=1

∫ zf

−∞Nµ,σ(zf ) < τ c PA

(17)

where Nµ,σ denotes a normal distribution with mean µ andstandard deviation σ.

V. EXPERIMENTAL EVALUATION

In this section, the results of an experimental evaluation ofthe proposed approach on four publicly available datasets inan ”unseen” attack scenario are presented. Once the imple-mentation details and the performance metrics are discussed,the requirements for a dataset to be used in a client-specificmodelling are outlined. The discussion is then followed by abrief description of the datasets employed in this study. Next,the effects of multiple kernel fusion, client-specific modelling,sparse regularisation and temporal aggregation are presented.Finally, the proposed approach is compared with the state-of-the-art methods in different scenarios. While the proposedapproach operates in a zero-shot (unseen attack) scenario, weprovide comparisons against other methods which operate inan unseen attack scenario (where no attack data is used fortraining) as well as multi-class approaches where both bonafide and attack samples are utilise for training.

A. Implementation Details

A number of details regarding implementation of the pro-posed approach are in order. Each frame is initially pre-processed using the photometric normalisation method of [58]to compensate for illumination variations. The face detectionbounding boxes provided along with each dataset are usedto locate the face in each frame where in the case of amissing bounding box for a frame, the coordinates of the lastdetected face in the corresponding video sequence is used. Inorder to select facial regions (Fig. 1) in a consistent fashionacross different frames, the OpenFace library [59] is used todetect landmarks around facial features. For the extraction ofdeep CNN representations, their implementations in MATLABare used, yielding 1024-, 2048- and 4096-dimensional featurevectors for GoogLenet (N1), ResNet50 (N2) and VGG16 (N3),respectively. The thus obtained feature vectors are normalisedto have a unit L2-norm. The kernel function used is that of aGaussian (i.e. κ(xi, z) = exp(−θ‖xi − z‖2) kernel yielding apositive definite kernel matrix where θ is set to the reciprocalof the average Euclidean distance between bona fide trainingsamples.

B. Database Requirements for Client-Specific Modelling

While different datasets are available for the evaluation offace PAD methods, we have chosen four datasets for evaluationof the proposed approach in an unseen PAD scenario. The

criteria for this selection not only include presence of differentPA instruments, different illumination conditions, differentacquisition devices, but also: 1) availability of data for subject-specific model construction, as well as 2) availability of arelatively large number of evaluation results of other face PADtechniques in an unseen scenario for comparison. Regardingthe first requirement, one should note that the majority ofthe existing databases lack a dedicated separate enrolmentset of data for subject-specific training. This requirementrules out the possibility to use some data sets including theSiW, SiW-M and other datasets [3], [26], [4] in the contextof a client-specific modelling approach. The few exceptionsare the Replay-Attack [60], Replay-Mobile [61] and OULU-NPU datasets [62]. While in the Replay-Attack and Replay-Attack datasets a separate enrolment set is provided, the mostchallenging protocol of the OULU-NPU database (protocol 4)is defined in a way that allows for client-specific modelling.Considering the second criterion for database selection, onenotices that a relatively large number of unseen face PADevaluation results are available on the MSU-MFSD dataset[63].

C. Datasets

The datasets used in the current work are briefly introducednext.

1) The Replay-Mobile dataset: The Replay-Mobile dataset[61] includes 1190 video recordings of both bona fide andattack samples of 40 individuals recorded under differentillumination conditions using two different acquisition devices.Five different lighting conditions are considered for recordingbona fide access attempts. In order to produce the presentationattacks, high-resolution photos and videos from each subjectsare captured under two different illumination conditions. Thedataset is divided into three disjoint subsets of training, devel-opment and testing and an additional partition correspondingto enrolment data. The training set contains 120 bona fide and192 attack samples while the development set incorporates 160real-accesses and 256 attacks. The test set is comprised of 110bona fide videos and 192 presentation attacks. The enrolmentset includes 160 bona fide video recordings.

2) The Replay-Attack dataset: The Replay-Attack database[60] includes 1300 video recordings of bona fide and attacksamples of 50 different individuals. The videos are capturedunder two different illumination and background conditions.Attacks are created either using a printed image, a mobilephone or a high definition iPad screen. The video recordings inthis database are divided randomly into three subject-disjointsubsets for training (60 bona fide and 300 PA samples),development (60 bona fide and 300 PA samples) and testing(80 bona fide and 400 PA samples) purposes. In addition to thetraining, development and test sets, the Replay-Attack databasecomes with an extra set of videos recorded in a separatesession to serve as enrolment data for each of the 50 clients.

3) The OULU-NPU dataset: The OULU-NPU database[62] consists of 4950 bona fide and attack video recordings of55 subjects recorded using six different cameras in three ses-sions with different illumination conditions and backgrounds.


The dataset includes previously unseen input sensors, attacktypes and acquisition conditions. The videos of the 55 subjectsin this database are divided into three subject-disjoint subsetsfor training (360 bona fide and 1440 PA samples), development(270 bona fide and 1080 PA samples) and testing (360 bonafide and 1440 PA samples) purposes. For the evaluationpurposes, four protocols are designed amongst which, the forthprotocol is the most challenging one as in this protocol theevaluation is performed against previously unseen illuminationconditions and background scenes, attacks and input sensors.As the test set of the forth evaluation protocol is comprised ofsamples corresponding to the third session, in order to buildclient-specific models, the videos from other two sessions maybe used.

4) The MSU-MFSD dataset: The MSU MFSD database[63] includes 440 videos of photo and video attack attemptsof 55 individuals recorded using two different cameras. Thepublicly available subset, however, includes 35 subjects. Thedataset is divided into two subject-disjoint sets for training(30 bona fide and 90 PA’s) and testing (40 bona fide and 120PA’s) purposes. While for the previous three datasets thereexists independent bona fide data for subject-specific modelconstruction, in this dataset, no such data exists. Thus, for theconstruction of class-specific models on this dataset, we use5% of the bona fide data (roughly corresponding to 26 framesper subject) for training and the remaining 95% for testingpurposes.

D. Performance Metrics

For performance reporting ISO metrics BSISO-IEC30107-3-2017 are used in this study [64]. These are: 1) attack presen-tation classification error rate (APCER) which represents theproportion of attack presentations using the same PA instru-ment species incorrectly classified as bona fide presentations;and, 2) bona fide presentation classification error rate (BPCER)which corresponds to proportion of bona fide presentationsincorrectly classified as PA’s. When considering how well aPAD subsystem performs in detecting PA’s, the APCER ofthe most successful PA instrument species (PAIS) is reported:

APCER = maxPAIS

(APCERPAIS) (18)

More specifically, the APCER is computed separately for eachpresentation attack instrument (e.g. display or print) and theoverall presentation attack detection performance correspondsto the most successful attack, i.e. the attack with the highestAPCER.

In order to summarised the overall performance of the PADsubsystem as a single number, the Average Classification ErrorRate (ACER) defined as the average of the APCER and theBPCER at a specific decision threshold (the threshold wherethe false rejection rate (FRR) and the false acceptance rate(FAR) are balanced) may be used:

ACER =maxPAIS(APCERPAIS) +BPCER

2(19)

The ACER measure has been deprecated and would be usedin here for the sake of completeness and in order to enable acomparison between our results to some other approaches.

TABLE I: Effect of Multiple Kernel Fusion in terms of ACER(%). R.M.:Replay-Mobile; R.A.: Replay-Attack; M.M.: MSU-MFSD; O.N.: OULU-NPU; All: average over all datasets.

Kernel R.M. R.A. M.M. O.N. AllR1N1 20.00 3.33 0.04 13.75 9.28R2N1 15.23 3.12 0.04 10.42 7.20R3N1 18.18 8.13 0.04 7.50 8.46R4N1 18.41 8.12 0.04 9.17 8.93R1N2 30.45 1.67 0.04 11.25 10.85R2N2 25.68 0.21 0.04 25.42 12.83R3N2 18.64 0.21 0.04 20.42 9.82R4N2 19.09 0 0.04 12.08 7.80R1N3 20.45 0.42 0.04 10.42 7.83R2N3 24.55 0 0.04 8.33 8.23R3N3 18.64 0.42 0.04 11.67 7.69R4N3 19.90 0.63 0.04 7.50 7.01Fusion 15.68 0 0 6.67 5.58

In addition to the ISO metrics, the performance of theproposed approach is also reported in terms of the Area Underthe ROC Curve (AUC) as well as the Half Total Error Rate(HTER) and the Equal Error Rate (EER), whenever requiredfor a comparison to other techniques.

E. Effect of Multiple Kernel Fusion

First, the effect of fusing multiple views of the problemvia a kernel fusion strategy is analysed. For this purpose andin order to summarise the performance of each individualkernel to facilitate a comparison, ACER’s corresponding todifferent facial regions and different CNN representationsare reported in Table I where RxNy denotes the represen-tation for region x ∈ {1, . . . , 4} derived from deep CNNy ∈ {GoogLenet, ResNet50, VGG16}. From the table, thefollowing observations may be made. First, the best perform-ing regional representation among others in terms of averageACER is R4N3 (i.e. VGG16 applied to the region focusing onthe areas surrounding the nose and the mouth). Interestingly,the performance of R4 is better than the whole face image (R1)using the same deep CNN representation. Second, among threedifferent deep representations one observes that the VGG16network provides the most discriminative features for presen-tation attack detection. Third, regardless of the region andthe deep CNN features employed, the Replay-Mobile and theOULU-NPU (forth protocol) appear to be more challengingas compared with other datasets. In terms of the kernel fusionstrategy, the average ACER for the fusion is 5.58% whereasthe best performing single kernel provides an average ACERof 7.01% while the worst performing kernel yields an averageACER of 12.83%. That is, the kernel fusion improves theaverage ACER w.r.t. the best single kernel by more than 20%and by more than 56% w.r.t. the worst performing singlekernel.

F. Effect of Client-Specific Modelling

In this section, the efficacy of subject-specific modellingcompared to the subject-independent scheme is investigated.In the client-specific approach, a separate one-class kernel


TABLE II: Effect of Client-Specific Modelling in termsof ACER (%). R.M.:Replay-Mobile; R.A.: Replay-Attack;M.M.: MSU-MFSD; O.N.: OULU-NPU; All: average over alldatasets.

R.M. R.A. M.M. O.N. AllClient-Independent 24.77 4.60 3.75 43.33 19.11Client-Specific 15.68 0.0 0.0 6.67 5.58

regression classifier is built for each subject and subsequentlyused for decision making. In contrast, in the class-independentapproach, a single one-class classifier applicable to all subjectsenrolled into the dataset is trained. The performances of thetwo different approaches in terms of ACER are reportedin Table II. From the table, it may be observed that onall four datasets, the client-specific modelling improves theperformance. While on the Replay-Attack and MSU-MFSDdatasets the client-specific modelling leads to perfect detectionperformance (zero ACER’s), on the Replay-Mobile dataset,it leads to more than 36% improvement in terms of theACER. The reduction in the error rate on the OULU-NPUdataset is even larger; While the client-independent approachyields an ACER of 43.33% on this dataset, the subject-specific methodology provides an ACER of 6.67%. That is,a tremendous improvement of more than 84% is achieved. Interms of the average ACER on the four datasets, the subject-independent approach yields an ACER of 19.11% while theACER corresponding to the class-specific approach is 5.58%,amounting to 70% improvement in ACER, on average.

G. Effect of Sparse Regularisation

Next, we examine the effectiveness of sparse regularisationon the one-class kernel regression approach. For this purpose,using the LARS algorithm [57], solutions (α) with differentcardinalities (number of non-zero elements) of 2, 3, 4, 5, 10,20, 30 and 50 are obtained. Cardinalities of higher than 50are found not to improve the overall average ACER. Theperformances corresponding to different cases in terms ofACER are reported in Table III. The last column of the table(ARSG), reports the Average Relative Speed-up Gain achievedcompared to the non-sparse solution in the test phase. From thetable, it may be observed that, interestingly, even by using 2training frames, one may achieve an impressive overall averageACER of 8.10%. Increasing the cardinality from 2 towards 5,a reduction in the average ACER over four datasets is achievedwhere by using only 5 bona fide training frames, the proposedapproach yields an average ACER of 5.58%. Increasing thecardinality beyond 5 towards 50, no further improvement interms of the ACER is obtained. In terms of ARSG, it may beobserved that the best performing method in terms of averageACER with NNZ=5 (NNZ: Number of Non-Zero elements inα), yields an impressive 160× speed-up gain compared to thenon-sparse solution.

H. Effect of Temporal Aggregation

In this section, the effect of temporal aggregation of frame-level scores to derive a video-level decision using a sum

TABLE III: Effect of Sparsity in terms of ACER (%). NNZ:number of non-zero elements of α ; R.M.: Replay-Mobile;R.A.: Replay-Attack; M.M.: MSU-MFSD; O.N.: OULU-NPU;All: average over all datasets; ARSG: average relative speedup gain.

NNZ R.M. R.A. M.M. O.N. All ARSG2 21.36 0.0 5.63 5.42 8.10 ≈ 390×3 18.86 0.0 0.0 5.83 6.17 ≈ 260×4 17.50 0.0 0.0 6.67 6.04 ≈ 200×5 15.68 0.0 0.0 6.67 5.58 ≈ 160×10 16.14 0.0 0.0 7.50 5.91 ≈ 80×20 15.68 0.0 0.0 9.17 6.21 ≈ 40 ×30 15.23 0.0 0.0 9.17 6.10 ≈ 25×50 15.23 0.0 0.0 10.0 6.30 ≈ 15×

TABLE IV: Effect of Temporal Aggregation in terms of ACER(%). R.M.:Replay-Mobile; R.A.: Replay-Attack; M.M.: MSU-MFSD; O.N.: OULU-NPU; All: average over all datasets.

R.M. R.A. M.M. O.N. AllFrame-Level 15.92 1.91 4.01 14.14 8.99Video-Level (raw) 15.68 0.0 0.0 6.67 5.58Video-Level (prob.) 13.64 0.0 0.0 6.25 4.97

fusion rule over raw scores and over probabilistic scores isanalysed. The frame-level and video-level performances ofthe proposed approach on different datasets along with theaverage ACER’s are reported in Table IV. From the table itmay be observed that the sum fusion over raw scores (Video-level (raw)) improves the average ACER on four datasetsfrom 8.99% to 5.58%. That is, more than 37% reduction inthe average ACER. The average fusion rule applied to theprobabilistic scores yields an even further reduction in theaverage ACER, from 5.58% to 4.97%, corresponding to morethan 10% improvement.

I. Comparison to Other Methods

In this section, the performance of the proposed sparse one-class kernel fusion regression approach (NNZ=5) is comparedagainst other methods on four datasets. Note that the proposedapproach does not utilise any samples for training (i.e. operatesin a zero-shot attack scenario). Nevertheless, we providecomparisons between the proposed method and both one-classas well as multi-class approaches.

In addition to other existing face PAD techniques in the lit-erature, in order to provide further baseline performances andenable a more critical assessment of the proposed technique,three other kernel-based one-class classifiers are includedfor comparison. These are Support Vector Data Description(SVDD) [65], Kernel Principal Component Analysis (KPCA)[66] and the Gaussian Process (GP) [67]. For a fair compar-ison, kernel matrices are computed and shared between theproposed method and the other three kernel-based one-classapproaches.

1) OULU-NPU: The results of a comparison between theproposed approach and some other methods on the fourth pro-tocol of the OULU-NPU dataset are reported in Table V. Fromthe table, it may be observed that the average ACER of the


TABLE V: Comparison of the performance of the proposedapproach to other methods (including multi-class methods) onprotocol 4 of the OULU-NPU dataset. (SVDD, KPCA and GPbenefit from the same multiple kernel fusion strategy as thatof this work.)

Method APCER(%) BPCER(%) ACER (%)Massy HNU [68] 35.8±35.3 8.3±4.1 22.1±17.6GRADIANT[68] 5.0±4.5 15.0±7.1 10.0±5.0FAS-BAS[3] 9.3±5.6 10.4±6.0 9.5±6.0LBP-SVM[53] 41.67±27.03 55.0±21.21 48.33±6.07IQM-SVM[53] 34.17±25.89 39.17±23.35 36.67±12.13DeepPixBiS[53] 36.67±29.67 13.33±16.75 25.0±12.67SVDD 25.0±17.32 8.33±6.83 16.67±10.68KPCA 13.33±14.72 11.67±11.25 12.5±12.94GP 15.83±16.25 2.5±4.18 9.17±8.76This work 11.67±13.66 8.3±2.04 6.25±6.85

proposed approach on the OULU-NPU dataset (6.25 ± 6.85)is better than other existing methods in spite of the fact thatall other methods in Table V use PA data for training. Thesecond best performing method on this dataset in terms ofACER is FAS-BAS approach [3] with an ACER of 9.5± 6.0.The GRADIANT method [68] with an ACER of 10.0 ± 5.0represents the best performing method on the fourth protocolof the OULU-NPU dataset in the ”Competition on GeneralizedSoftware-based Face Presentation Attack Detection in MobileScenarios” [68]. The proposed approach also performs betterthan other kernel-based (SVDD, KPCA and GP) methods.

2) Replay-Attack: The results of a comparison between theproposed approach and other methods in an unseen attackdetection scenario on the Replay-Attack dataset are presentedin Table VI. As it may be observed from the table, theproposed approach achieves perfect detection performance onthis dataset (AUC=100%) while the best unseen face PADmethod from the literature obtains a detection performance of99.8% in terms of AUC. The proposed method also performsbetter than the SVDD approach while yielding a similarperformance as those of KPCA and GP which both benefitfrom a similar multiple kernel fusion strategy.

Table VII presents a comparison between the state of the artmulti-class methods and the proposed approach on the Replay-Attack dataset in terms of HTER. The proposed approachachieves a zero HTER, outperforming the majority of multi-class methods. The only method among others achieving azero HTER is that of [69].

3) MSU-MFSD: Table VIII presents a comparison betweenthe proposed approach and other methods operating in anunseen scenario in terms of AUC. The following observationsmay be made from the table. The purposed method performsbetter than other existing face PAD methods, achieving aperfect detection performance as compared to an AUC of 93%for the DTL method [13]. Compared with SVDD, KPCA andGP, the proposed solution performs better than SVDD, whileKPCA and GP, benefiting from a similar multiple kernel fusionstrategy, achieve a similar performance as that of the proposedapproach.

The results of a comparison between this work and the stateof the art multi-class methods in terms of EER are presented

TABLE VI: Comparison of the performance of the proposedapproach to some other methods on the Replay-Attack datasetin an unseen attack scenario in terms of AUC (%).(SVDD,KPCA and GP benefit from the same multiple kernel fusionstrategy as that of this work.)

Method AUC (%)OCSVM+IMQ [8] 80.76OCSVM+BSIF [8] 81.94NN+LBP [39] 91.26GMM+LBP [39] 90.06OCSVM+LBP [39] 87.90AE+LBP [39] 86.12DTL [13] 99.80SVDD 97.50KPCA 100GP 100This work 100

TABLE VII: Comparison of the performance of proposedapproach to the state-of-the-art multi-class methods on theReplay-Attack dataset in terms of HTER.

Method HTER(%)Boulkenafet et al. [70] 2.90lsCNN [28] 2.50lsCNN Traditionally Trained [28] 1.75Chingovska et al. [71] 6.29Boulkenafet et al. [24] 3.50LBP + GS-LBP [17] 3.13Patch-based CNN [72] 1.25Depth-based CNN [72] 0.75Fusion of the two Patch and Depth CNNs [72] 0.72Image Quality Assessment [73] 0.03Deep Learning [69] 0This work 0

in Table IX. As it may be observed from the table, theproposed approach (achieving a perfect detection performance)outperforms many multi-class methods. Among others, onlythe RD method [51] provides a similar performance.

4) Replay-Mobile: A similar comparison as those on otherdatasets is performed on the Replay-Mobile dataset. In anunseen scenario, the results are reported in Table X andcompared against others in terms of HTER (HTER is chosenfor comparison as the majority of the existing unseen PADresults on this dataset have reported their performances interms HTER). From the table, it may be observed that theproposed approach betters better than other methods in anunseen PAD scenario reported in the literature, achieving aHTER of 13.64 compared to the best performing unseenmethod of [40] with a HTER of 14.34. The proposed one-class method also performs better than SVDD, KPCA and GPapproaches for face PAD in an unseen attack scenario.

A comparison of the proposed approach to the state of theart multi-class methods is made in Table XI. As it may beobserved from the table, the proposed method performs betterthan some other multi-class methods, yet performs inferiorcompared to some multi-class techniques.

J. Summary of Performance Evaluations

Based on the evaluations conducted on four datasets, theproposed approach, when compared to other methods operat-ing in an unseen attack scenario (not using PA samples for


TABLE VIII: Comparison of the performance of the proposedapproach to some other methods on the MSU-MFSD datasetin an unseen attack scenario in terms of AUC. (SVDD, KPCAand GP benefit from the same multiple kernel fusion strategyas that of this work.)

Method AUC(%)OCSVM+IMQ [8] 67.77OCSVM+BSIF [8] 75.64NN+LBP [39] 81.59GMM+LBP [39] 81.34OCSVM+LBP [39] 84.47AE+LBP [39] 87.63DTL [13] 93.00SVDD 97.5KPCA 100GP 100This work 100

TABLE IX: Comparison of the performance of proposedapproach to the state-of-the-art multi-class methods on theMSU-MFSD dataset in terms of EER (%).

Method EER (%)Texture analysis [70] 4.9HSV+YCbCr fusion [74] 2.2Multiscale Fusion [75] 6.9IDA [63] 8.5Colour LBP [24] 10.8HRLF [76] 0.04RD [51] 0.0This work 0.0

training) obtains the-state-of-the-art performance. Moreover,the performance of the proposed method is also very competi-tive to the state-of-the-art multi-class methods. In this context,on three out of four datasets, the proposed approach achievesbetter or similar performance compared to methods that benefitfrom PA training samples for training.

K. A Note on Inter-Dataset Evaluation

As discussed previously, one of the key contributions of thepresent work is the introduction of class-specific modellingfor face PAD in the context of sparse one-class kernel fusionregression. For this purpose, naturally, subject-specific data isrequired to build client-specific classifiers. However, as thereis not overlap between subjects from different datasets, client-specific approach prevents an inter-dataset evaluation. Whilethe cross-dataset evaluation is expected to be more challengingthan the intra-dataset evaluation, it should be noted that thecurrent study has addressed a different, and possibly morechallenging problem associated with the face PAD problem,i.e. unseen attack detection. In this respect, the difficultiesassociated with an inter-dataset evaluation attributed to, forinstance, different imaging sensors, different lighting condi-tions, etc. are somewhat included in the experiments conductedon different datasets such as OULU-NPU which naturallyincorporate such variations.

L. Observations

• In this study, the methodology was confined to utilisingonly bona fide samples for training to investigate the

TABLE X: Comparison of the performance of the proposedapproach to some other methods on the Replay-Mobile datasetin an unseen attack scenario in terms of HTER.(SVDD, KPCAand GP benefit from the same multiple kernel fusion strategyas that of this work.)

Method HTER (%)GoogleNet+SVDD [40] 14.34ResNet50+SVDD [40] 21.76VGG16+SVDD [40] 18.78GoogleNet+MD [40] 13.70ResNet50+MD [40] 21.81VGG16+MD [40] 19.84GoogleNet+GMM [40] 14.21ResNet50+GMM [40] 21.53VGG16+GMM [40] 18.05SVDD 16.14KPCA 17.05GP 16.36This work 13.64

TABLE XI: Comparison of the performance of proposedapproach to the state-of-the-art multi-class methods on theReplay-Mobile dataset in terms of HTER.

Method HTERtwo-class SVM + LBP [9] 17.2two-class SVM + Motion [9] 10.4two-class SVM + Gabor [9] 9.13two-class SVM + IQM [9] 4.10This work 13.64

potential of a pure anomaly detection approach for facePAD. Nevertheless, in the presence of attack samples fortraining, further refinements to a one-class learner throughdifferent mechanisms including a feature selection fromamong deep CNN representations obtained from localfacial regions, a supervised multiple kernel learning ap-proach to learn weights associated with different kernels,etc. may be considered.

• A further observation corresponds to the Gaussianityassumption made regarding bona fide score distributions(Eq. 17). In practice, as observed from the results pre-sented earlier, despite being a simplistic assumption, itled to improvements in detection performance. How-ever, a more suitable distribution model such as thosecorresponding to Extreme Value Theory [77] may beconsidered for improved modelling.

• A different observation corresponds to a general char-acteristic of one-class learners. In a novelty detectiontask where no information regarding anomaly samples isaccessible, it is challenging to set the decision thresholds.A common practice in this case is to set the decisionthreshold at a pre-specified confidence level such thata small proportion of bona fide samples is rejected. Inthe case of availability of a separate development setspecific for each subject, one would be able to set thethreshold such that a desired trade-off between FAR andFRR is maintained. In the anomaly approach followedin this work, the HTER corresponds to the confidencelevel point closest to the EER setting. However, settingthe decision threshold in an OCC approach to ensure ansuitable practical performance remains an open problem,


subject to future investigation.• Finally, we note that training a one-class learner over

deep CNN representations may be viewed as a new CNNmodel whose classification layers are constructed usinga one-class classifier, where for training, all networkparameters except the regression layers are freezed whichis considered to be a useful mechanism in the presenceof limited training samples.

VI. CONCLUSION

The paper addressed the face presentation attack detectionproblem in the challenging conditions of an unseen attackdetection setting. For this purpose, a one-class novelty de-tection approach, based on kernel regression, was presented.Benefiting from generic deep CNN representations, additionalmechanisms including a multiple kernel fusion approach,sparse regularisation of the regression solution, client-specific,and probabilistic modelling were introduced to improve theperformance of the system. Experimental evaluation of theproposed approach on four publicly available datasets in anunseen face PAD setting illustrated that the proposed methodoutperforms other methods operating in an unseen scenariowhile competing closely with the state-of-the-art multi-classmethods. In the case of availability of PA training samples,further improvements may be achieved by refining the solutionof a one-class learner through different mechanisms includingfor instance a feature selection procedure, or multiple kernellearning using multi-class training samples, etc. which are leftas future directions of investigation.

REFERENCES

[1] S. Bhattacharjee, A. Mohammadi, A. Anjos, and S. Marcel, RecentAdvances in Face Presentation Attack Detection. Cham: SpringerInternational Publishing, 2019, pp. 207–228.

[2] S. Marcel, M. S. Nixon, J. Fierrez, and N. W. D. Evans, Eds., Handbookof Biometric Anti-Spoofing - Presentation Attack Detection, SecondEdition, ser. Advances in Computer Vision and Pattern Recognition.Springer, 2019.

[3] Y. Liu, A. Jourabloo, and X. Liu, “Learning deep models for faceanti-spoofing: Binary or auxiliary supervision,” in 2018 IEEE/CVFConference on Computer Vision and Pattern Recognition, June 2018,pp. 389–398.

[4] A. J. X. L. Yaojie Liu, Joel Stehouwer, “Deep tree learning for zero-shot face anti-spoofing,” in In Proceeding of IEEE Computer Vision andPattern Recognition (CVPR 2019), Long Beach, CA, 2019.

[5] A. O. E. L. J. Villalba, A. Miguel, “Spoofing detection with dnn and one-class svm for the asvspoof 2015 challenge,” in 16th Annual Conferenceof the International Speech Communication Association, September2015.

[6] Y. Ding and A. Ross, “An ensemble of one-class svms for fingerprintspoof detection across different fabrication materials,” in 2016 IEEEInternational Workshop on Information Forensics and Security (WIFS),Dec 2016, pp. 1–6.

[7] A. F. Sequeira, S. Thavalengal, J. Ferryman, P. Corcoran, and J. S.Cardoso, “A realistic evaluation of iris presentation attack detection,” in2016 39th International Conference on Telecommunications and SignalProcessing (TSP), June 2016, pp. 660–664.

[8] S. Arashloo, J. Kittler, and W. Christmas, “An anomaly detectionapproach to face spoofing detection: A new formulation and evaluationprotocol,” IEEE Access, vol. 5, pp. 13 868–13 882, 2017.

[9] O. Nikisins, A. Mohammadi, A. Anjos, and S. Marcel, “On effectivenessof anomaly detection approaches against unseen presentation attacksin face anti-spoofing,” in 2018 International Conference on Biometrics(ICB), Feb 2018, pp. 75–81.

[10] S. R. Arashloo and J. Kittler, “Class-specific kernel fusion of multipledescriptors for face verification using multiscale binarised statistical im-age features,” IEEE Transactions on Information Forensics and Security,vol. 9, no. 12, pp. 2100–2109, Dec 2014.

[11] S. R. Arashloo, “Multiscale binarised statistical image features forsymmetric face matching using multiple descriptor fusion based onclass-specific lda,” Pattern Analysis and Applications, vol. 20, no. 1,pp. 113–126, Feb 2017.

[12] A. Iosifidis and M. Gabbouj, “Neural class-specific regression for faceverification,” IET Biometrics, vol. 7, pp. 63–70(7), January 2018.

[13] Y. Liu, J. Stehouwer, A. Jourabloo, and X. Liu, “Deep tree learning forzero-shot face anti-spoofing,” CoRR, vol. abs/1904.02860, 2019.

[14] T. Edmunds, “Protection of 2D face identification systems againstspoofing attacks.” Theses, Univ. Grenoble Alpes, Jan. 2017.

[15] R. Tolosana, M. Gomez-Barrero, C. Busch, and J. Ortega-Garcia,“Biometric presentation attack detection: Beyond the visible spectrum,”IEEE Transactions on Information Forensics and Security, vol. 15, pp.1261–1275, 2020.

[16] Z. Boulkenafet, J. Komulainen, and A. Hadid, “On the generalizationof color texture-based face anti-spoofing,” Image and Vision Computing,vol. 77, pp. 1 – 9, 2018.

[17] F. Peng, L. Qin, and M. Long, “Face presentation attack detection usingguided scale texture,” Multimedia Tools and Applications, vol. 77, no. 7,pp. 8883–8909, Apr 2018.

[18] K. Kollreider, H. Fronthaler, and J. Bigun, “Non-intrusive livenessdetection by face images,” Image and Vision Computing, vol. 27, no. 3,pp. 233 – 244, 2009, special Issue on Multimodal Biometrics.

[19] L. Feng, L.-M. Po, Y. Li, X. Xu, F. Yuan, T. C.-H. Cheung, and K.-W.Cheung, “Integration of image quality and motion cues for face anti-spoofing: A neural network approach,” Journal of Visual Communicationand Image Representation, vol. 38, pp. 451 – 460, 2016.

[20] A. Pinto, H. Pedrini, W. R. Schwartz, and A. Rocha, “Face spoofingdetection through visual codebooks of spectral temporal cubes,” IEEETransactions on Image Processing, vol. 24, no. 12, pp. 4726–4740, Dec2015.

[21] G. Kim, S. Eum, J. K. Suhr, D. I. Kim, K. R. Park, and J. Kim, “Faceliveness detection based on texture and frequency analyses,” in 2012 5thIAPR International Conference on Biometrics (ICB), March 2012, pp.67–72.

[22] A. Pinto, W. R. Schwartz, H. Pedrini, and A. d. R. Rocha, “Usingvisual rhythms for detecting video-based facial spoof attacks,” IEEETransactions on Information Forensics and Security, vol. 10, no. 5, pp.1025–1038, May 2015.

[23] J. Galbally and S. Marcel, “Face anti-spoofing based on general imagequality assessment,” in 2014 22nd International Conference on PatternRecognition, Aug 2014, pp. 1173–1178.

[24] Z. Boulkenafet, J. Komulainen, and A. Hadid, “Face anti-spoofing basedon color texture analysis,” in 2015 IEEE International Conference onImage Processing (ICIP), Sept 2015, pp. 2636–2640.

[25] T. Wang, J. Yang, Z. Lei, S. Liao, and S. Z. Li, “Face livenessdetection using 3d structure recovered from a single camera,” in 2013International Conference on Biometrics (ICB), June 2013, pp. 1–6.

[26] X. Tan, Y. Li, J. Liu, and L. Jiang, “Face liveness detection from a singleimage with sparse low rank bilinear discriminative model,” in ComputerVision – ECCV 2010, K. Daniilidis, P. Maragos, and N. Paragios, Eds.Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 504–517.

[27] H. P. Nguyen, A. Delahaies, F. Retraint, and F. Morain-Nicolier, “Facepresentation attack detection based on a statistical model of imagenoise,” IEEE Access, vol. 7, pp. 175 429–175 442, 2019.

[28] G. B. de Souza, J. P. Papa, and A. N. Marana, “On the learning ofdeep local features for robust face spoofing detection,” CoRR, vol.abs/1806.07492, 2018.

[29] J. Yang, Z. Lei, and S. Z. Li, “Learn convolutional neural network forface anti-spoofing,” CoRR, vol. abs/1408.5601, 2014.

[30] A. George, Z. Mostaani, D. Geissenbuhler, O. Nikisins, A. Anjos, andS. Marcel, “Biometric face presentation attack detection with multi-channel convolutional neural network,” IEEE Transactions on Informa-tion Forensics and Security, vol. 15, pp. 42–55, 2020.

[31] G. Heusch and S. Marcel, Remote Blood Pulse Analysis for Face Pre-sentation Attack Detection. Cham: Springer International Publishing,2019, pp. 267–289.

[32] L. Li, X. Feng, Z. Xia, X. Jiang, and A. Hadid, “Face spoofing detectionwith local binary pattern network,” Journal of Visual Communicationand Image Representation, vol. 54, pp. 182 – 192, 2018.

[33] B. Hamdan and K. Mokhtar, “The detection of spoofing by 3d mask in a2d identity recognition system,” Egyptian Informatics Journal, vol. 19,no. 2, pp. 75 – 82, 2018.


[34] T. Edmunds and A. Caplier, “Motion-based countermeasure againstphoto and video spoofing attacks in face recognition,” Journal of VisualCommunication and Image Representation, vol. 50, pp. 314 – 332, 2018.

[35] M. Killiolu, M. Takiran, and N. Kahraman, “Anti-spoofing in facerecognition with liveness detection using pupil tracking,” in 2017 IEEE15th International Symposium on Applied Machine Intelligence andInformatics (SAMI), Jan 2017, pp. 000 087–000 092.

[36] J. Yang, Z. Lei, D. Yi, and S. Z. Li, “Person-specific face antispoofingwith subject domain adaptation,” IEEE Transactions on InformationForensics and Security, vol. 10, no. 4, pp. 797–809, April 2015.

[37] E. E. A. Abusham and H. K. Bashir, “Face recognition using local graphstructure (lgs),” in Human-Computer Interaction. Interaction Techniquesand Environments, J. A. Jacko, Ed. Berlin, Heidelberg: Springer BerlinHeidelberg, 2011, pp. 169–175.

[38] D. C. Garcia and R. L. de Queiroz, “Face-spoofing 2d-detection basedon moire-pattern analysis,” IEEE Transactions on Information Forensicsand Security, vol. 10, no. 4, pp. 778–786, April 2015.

[39] F. Xiong and W. AbdAlmageed, “Unknown presentation attack detectionwith face rgb images,” in 2018 IEEE 9th International Conference onBiometrics Theory, Applications and Systems (BTAS), Oct 2018, pp. 1–9.

[40] S. Fatemifar, S. R. Arashloo, M. Awais, and J. Kittler, “Spoofingattack detection by anomaly detection,” in ICASSP 2019 - 2019 IEEEInternational Conference on Acoustics, Speech and Signal Processing(ICASSP), May 2019, pp. 8464–8468.

[41] D. Prez-Cabo, D. Jimnez-Cabello, A. Costa-Pazo, and R. Lpez-Sastre,“Deep anomaly detection for generalized face anti-spoofing,” 04 2019.

[42] R. Ramachandra and C. Busch, “Presentation attack detection methodsfor face recognition systems: A comprehensive survey,” ACM Comput.Surv., vol. 50, no. 1, pp. 8:1–8:37, Mar. 2017.

[43] E. Alpaydin, Introduction to Machine Learning, 3rd ed., ser. AdaptiveComputation and Machine Learning. MIT Press, 2014.

[44] B. Scholkopf, J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C.Williamson, “Estimating the support of a high-dimensional distribution,”Neural Comput., vol. 13, no. 7, pp. 1443–1471, Jul. 2001.

[45] J. Liu, Z. Lian, Y. Wang, and J. Xiao, “Incremental kernel null spacediscriminant analysis for novelty detection,” in 2017 IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR), July 2017, pp.4123–4131.

[46] P. Bodesheim, A. Freytag, E. Rodner, M. Kemmler, and J. Denzler,“Kernel null space methods for novelty detection,” in 2013 IEEEConference on Computer Vision and Pattern Recognition, June 2013,pp. 3374–3381.

[47] S. R. Arashloo, J. Kittler, and W. Christmas, “Face spoofing detectionbased on multiple descriptor fusion using multiscale dynamic binarizedstatistical image features,” IEEE Transactions on Information Forensicsand Security, vol. 10, no. 11, pp. 2396–2407, Nov 2015.

[48] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust facerecognition via sparse representation,” IEEE Transactions on PatternAnalysis and Machine Intelligence, vol. 31, no. 2, pp. 210–227, Feb2009.

[49] I. Chingovska and A. R. dos Anjos, “On the use of client identityinformation for face antispoofing,” IEEE Transactions on InformationForensics and Security, vol. 10, no. 4, pp. 787–796, April 2015.

[50] S. Fatemifar, M. Awais, S. R. Arashloo, and J. Kittler, “Combiningmultiple one-class classifiers for anomaly based face spoofing attackdetection,” in ICB 2019 - 2019 IEEE International Conference onBiometrics (ICB), June 2019.

[51] T. Edmunds and A. Caplier, “Fake face detection based on radiometricdistortions,” in 2016 Sixth International Conference on Image ProcessingTheory, Tools and Applications (IPTA), Dec 2016, pp. 1–6.

[52] F. Yan, J. Kittler, K. Mikolajczyk, and A. Tahir, “Non-sparse multiplekernel fisher discriminant analysis,” J. Mach. Learn. Res., vol. 13, no. 1,pp. 607–642, Mar. 2012.

[53] A. George and S. Marcel, “Deep pixel-wise binary supervision forface presentation attack detection,” in International Conference onBiometrics, 2019.

[54] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”in 2015 IEEE Conference on Computer Vision and Pattern Recognition(CVPR), June 2015, pp. 1–9.

[55] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for imagerecognition,” 2016 IEEE Conference on Computer Vision and PatternRecognition (CVPR), pp. 770–778, 2016.

[56] K. Simonyan and A. Zisserman, “Very deep convolutional networks forlarge-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.

[57] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angleregression,” The Annals of Statistics, vol. 32, no. 2, pp. 407–451, 2004.

[58] V. Struc and N. Pavesic, Photometric normalization techniques forillumination invariance. IGI-Global, 2011, pp. 279–300.

[59] B. Amos, B. Ludwiczuk, and M. Satyanarayanan, “Openface: A general-purpose face recognition library with mobile applications,” CMU-CS-16-118, CMU School of Computer Science, Tech. Rep., 2016.

[60] I. Chingovska, A. Anjos, and S. Marcel, “On the effectiveness of localbinary patterns in face anti-spoofing,” in 2012 BIOSIG - Proceedingsof the International Conference of Biometrics Special Interest Group(BIOSIG), Sept 2012, pp. 1–7.

[61] A. Costa-Pazo, S. Bhattacharjee, E. Vazquez-Fernandez, and S. Marcel,“The replay-mobile face presentation-attack database,” in Proceedingsof the International Conference on Biometrics Special Interests Group(BioSIG), Sep. 2016.

[62] Z. Boulkenafet, J. Komulainen, L. Li, X. Feng, and A. Hadid, “Oulu-npu:A mobile face presentation attack database with real-world variations,”in 2017 12th IEEE International Conference on Automatic Face GestureRecognition (FG 2017), May 2017, pp. 612–618.

[63] D. Wen, H. Han, and A. K. Jain, “Face spoof detection with imagedistortion analysis,” IEEE Transactions on Information Forensics andSecurity, vol. 10, no. 4, pp. 746–761, April 2015.

[64] “Information technology biometric presentation attack detection part 3:Testing and reporting,” International Organization for Standardization,Standard, 2017.

[65] D. M. Tax and R. P. Duin, “Support vector data description,” MachineLearning, vol. 54, no. 1, pp. 45–66, Jan 2004.

[66] H. Hoffmann, “Kernel pca for novelty detection,” Pattern Recognition,vol. 40, no. 3, pp. 863 – 874, 2007.

[67] M. Kemmler, E. Rodner, E.-S. Wacker, and J. Denzler, “One-classclassification with gaussian processes,” Pattern Recognition, vol. 46,no. 12, pp. 3507 – 3518, 2013.

[68] Z. Boulkenafet, J. Komulainen, Z. Akhtar, A. Benlamoudi, D. Samai,S. E. Bekhouche, A. Ouafi, F. Dornaika, A. Taleb-Ahmed, L. Qin,F. Peng, L. B. Zhang, M. Long, S. Bhilare, V. Kanhangad, A. Costa-Pazo, E. Vazquez-Fernandez, D. Perez-Cabo, J. J. Moreira-Perez,D. Gonzalez-Jimenez, A. Mohammadi, S. Bhattacharjee, S. Marcel,S. Volkova, Y. Tang, N. Abe, L. Li, X. Feng, Z. Xia, X. Jiang, S. Liu,R. Shao, P. C. Yuen, W. R. Almeida, F. Andalo, R. Padilha, G. Bertocco,W. Dias, J. Wainer, R. Torres, A. Rocha, M. A. Angeloni, G. Folego,A. Godoy, and A. Hadid, “A competition on generalized software-basedface presentation attack detection in mobile scenarios,” in 2017 IEEEInternational Joint Conference on Biometrics (IJCB), Oct 2017, pp. 688–696.

[69] I. Manjani, S. Tariyal, M. Vatsa, R. Singh, and A. Majumdar, “Detectingsilicone mask-based presentation attack via deep dictionary learning,”IEEE Transactions on Information Forensics and Security, vol. 12, no. 7,pp. 1713–1723, July 2017.

[70] Z. Boulkenafet, J. Komulainen, and A. Hadid, “Face spoofing detec-tion using colour texture analysis,” IEEE Transactions on InformationForensics and Security, vol. 11, no. 8, pp. 1818–1830, Aug 2016.

[71] I. Chingovska, “Trustworthy biometric verification under spoofing at-tacks: Application to the face mode,” Ph.D. dissertation, Ecole Polytech-nique Federale de Lausanne, Nov. 2015, these EPFL, no 6879 (2016).

[72] Y. Atoum, Y. Liu, A. Jourabloo, and X. Liu, “Face anti-spoofingusing patch and depth-based cnns,” in 2017 IEEE International JointConference on Biometrics (IJCB), Oct 2017, pp. 319–328.

[73] E. Fourati, W. Elloumi, and A. Chetouani, “Anti-spoofing in facerecognition-based biometric authentication using image quality assess-ment,” Multimedia Tools and Applications, Oct 2019.

[74] Z. Boulkenafet, J. Komulainen, and A. Hadid, “Face antispoofing usingspeeded-up robust features and fisher vector encoding,” IEEE SignalProcessing Letters, vol. 24, no. 2, pp. 141–145, Feb 2017.

[75] Z. Boulkenafet, J. Komulainen, Xiaoyi Feng, and A. Hadid, “Scale spacetexture analysis for face anti-spoofing,” in 2016 International Conferenceon Biometrics (ICB), June 2016, pp. 1–6.

[76] U. Muhammad and A. Hadid, “Face anti-spoofing using hybrid residuallearning framework,” in 2019 International Conference on Biometrics(ICB), June 2019.

[77] S. Coles, An introduction to statistical modeling of extreme values, ser.Springer Series in Statistics. London: Springer-Verlag, 2001.

journal of ?, vol., no., ? ? 1 unseen face presentation … · 2020. 1. 1. · with the problem,...

Documents