

Pattern Recognition 45 (2012) 2075–2084


journal homepage: www.elsevier.com/locate/pr

0031-3203/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2011.04.025

* Corresponding author at: NeuroInformatics Laboratory (NILab), Bruno Kessler Foundation and University of Trento, via delle Regole 101, 38121, Trento, Italy. Tel.: +39 0461882760.
E-mail address: [email protected] (E. Olivetti).

Bayesian hypothesis testing for pattern discrimination in brain decoding

Emanuele Olivetti a,b,*, Sriharsha Veeramachaneni c, Ewa Nowakowska d

a NeuroInformatics Laboratory (NILab), Bruno Kessler Foundation and University of Trento, via delle Regole 101, 38121, Trento, Italy
b Center for Mind and Brain Sciences (CIMeC), University of Trento, Italy
c WindLogics Inc., St. Paul (MN), USA
d Multivariate Statistics Group, GfK Polonia, Warszawa, Poland

Article info

Available online 11 May 2011

Keywords:

Interpretability and validation

Brain decoding

Generalization error

Hypothesis testing


Abstract

Research in cognitive neuroscience and in brain–computer interfaces (BCI) is frequently concerned with finding evidence that a given brain area processes, or encodes, given stimuli. Experiments based on neuroimaging techniques consist of a stimulation protocol presented to a subject while his or her brain activity is being recorded. The question is then whether there is enough evidence of brain activity related to the stimuli within the recorded data. Finding a link between brain activity and stimuli has recently been proposed as a classification task, called brain decoding. A classifier that can accurately predict which stimuli were presented to the subject provides support for a positive answer to the question. However, it is only the answer for a given data set and the question still remains whether it is a general rule that will apply also to new data. In this paper we try to reliably answer the neuroscientific question about the presence of a significant link between brain activity and stimuli once we have the classification results. The proposed method is based on a Beta-Binomial model for the population of generalization errors of classifiers from multi-subject studies within the Bayesian hypothesis testing framework. We present an application on nine brain decoding investigations from a real functional magnetic resonance imaging (fMRI) experiment about the relation between mental calculation and eye movements.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Classification-based approaches for data analysis are provoking wide interest and increasing adoption within the cognitive neuroscience and brain–computer interface (BCI) communities. One of the core problems of these investigations is hypothesis testing, i.e. finding evidence that the brain, or part of it, encodes/processes given stimuli from the observation of neuroimaging data. A classification algorithm is trained on the recorded data to learn how to discriminate between those stimuli [1]. Then the misclassification rate of the classifier is estimated and used within a statistical test. In this work we propose a principled approach to provide a reliable answer to the hypothesis testing problem based on modeling the relation between the observed error of the classifier and its generalization error. The model addresses the multi-subject case in order to make inferential statements about the whole target population of the neuroscientific study.


During a neuroimaging-based investigation a sequence of stimuli is usually presented to the subject at specific time intervals and he or she is requested to accomplish an active or passive task accordingly. As an example, during the eye movement (saccades) task described in [2] an arrow pointing to the left or to the right is presented at regular intervals on the screen and the subject is requested to track it constantly. Functional magnetic resonance imaging (fMRI) data are recorded simultaneously to capture the concurrent brain activity. The portion of data related to the bilateral posterior superior parietal lobule (PSPL) is used to predict whether the subject moved his or her eyes rightwards or leftwards. Accurate prediction would support the hypothesis of PSPL being involved in eye-movement processing.

The use of classification-based techniques in the field of neuroimaging is known as brain decoding and has recently been recognized as a valuable alternative to the more established brain encoding approach [3]. In brain encoding, direct inference is done by estimating parametric models that reconstruct observed neuroimaging data from stimuli and incorporate prior knowledge of the underlying physics. Conversely, brain decoding corresponds to inverse inference, where stimuli are predicted from neuroimaging data. The brain decoding approaches proposed in the literature, as well as a portion of the brain encoding ones (see for example [4,5]), are multivariate and exploit the inherent distributed nature of brain activity. This is an advantage of multivariate approaches over the more traditional mass-univariate encoding ones which are prevalent in the neuroscientific community.

The brain decoding approach can be used for extracting stimulus-related information from brain data and to provide answers to the neuroscientific question. This question is a hypothesis testing problem that can be studied both on a single-subject basis or in a multi-subject study. In the first case we want to know whether the classifier designed on data from a given brain area of one subject is accurate enough to claim that the area is actually processing the stimuli of interest for that subject. In the multi-subject case we want to extend the claim to the whole target population of the study, e.g. healthy subjects, by conducting the neuroimaging experiment and data analysis on a group of subjects. In this case the claim is about how accurate the classification algorithm is over the target population. The multi-subject case is the one of primary interest for cognitive neuroscience and BCI research and the main target of this work.

Making inferences on the expected accuracy of a classification algorithm over the target population must take into account the variability coming from different sources. First, the data collected from a neuroimaging experiment are a small sample with very high dimensionality. The observed misclassification rate of a classifier over a small sample is usually a poor estimate of the generalization error because of the high variance within the sample. Second, the few subjects involved in the study are themselves a sample of the target population, which may or may not represent it accurately. Third, the classifiers designed from data of each subject depend on the variability of the small high-dimensional samples on which they were trained. Small training samples may be sub-optimal for reaching accurate classification. In order to get a reliable answer from the hypothesis test, the methodology to implement it needs to take into account these sources of variance.

In this work we rely on statistical hypothesis testing as the framework to provide an answer to the neuroscientific question about the presence of stimulus-related information within brain data. The literature in statistics has addressed hypothesis testing with two main approaches: the classical/frequentist one and the Bayesian one. To the best of our knowledge the brain decoding literature has relied on the classical/frequentist hypothesis testing framework and on the Student's t-test in the case of multi-subject studies. In this work we present a solution based on the Bayesian hypothesis testing framework and provide motivations to support this choice over the classical one. A Beta-Binomial hierarchical model is proposed to account for the different sources of variance affecting the relation between the observed number of errors per subject and the generalization error over the target population. We show how only this model accounts for the amount of available data when estimating the generalization error and relies on valid bounds for the error rates of the classifiers. Moreover we provide a simple and effective way to compute the proposed test and we show its application in a multi-subject fMRI study where nine investigations about the relation of eye movements and mental calculation are carried out on 15 subjects.

In the following sections we provide a brief overview of the literature related to this work. In Section 2 we provide a detailed formal description of the pattern discrimination task in brain decoding. We then introduce the classical and Bayesian solutions to the hypothesis testing problem. We conclude Section 2 by describing the Beta-Binomial hierarchical model which links the observed quantities to the population parameters. In Section 3 we show an application on real data from an fMRI study about the relation between mental calculation and eye movements. In Section 4 we describe our conclusions and propose future directions of research.

1.1. Related works

This work touches three different scientific areas: brain decoding, generalization error bounds in classification, and hypothesis testing. Among the increasing literature on brain decoding, the works of Haynes and Rees [3] and of Pereira et al. [1] are mainly devoted to introducing the topic in the context of fMRI experiments. Applications of brain decoding techniques are now frequent in all the different neuroimaging modalities, like magnetoencephalography (MEG) [6,7], fMRI [8,9,2] and electroencephalography (EEG) [10,7].

The problem of studying the generalization error of a classifier from the number of errors on a limited amount of predictions has a long tradition in the field of machine learning. Kohavi [11] proved empirically the unbiasedness of cross-validation and bootstrap estimators on large datasets, while Langford [12] and Kääriäinen and Langford [13] proposed the test set bound and further bounds based on the Binomial distribution of the observed number of errors given the generalization error. The recent work of Isaksson et al. [14] proved the inadequacy of cross-validation and bootstrap estimators in small sample classification and proposed a Bayesian approach building on the ideas of the test set bound.

The theory of statistical hypothesis testing has a long history. An accessible comparison of the classical/frequentist and Bayesian approaches in the context of medical applications has been described by Goodman [15,16], and a link between the two approaches is described for example in [17]. The Bayesian hypothesis testing framework was first introduced by Jeffreys [18]. A main element in this framework is the Bayes factor, which is extensively described in the context of practical applications by the thorough review of Kass and Raftery [19].

2. Methods

In this section we introduce the notation and basic concepts to give a formal description of the pattern discrimination problem in brain decoding. We then introduce the classical and Bayesian hypothesis testing frameworks, the first in the case of the Student's t-test and the second oriented to the Bayes factor. In order to compute the Bayes factor and use it for testing the hypotheses of the brain decoding problem, we first derive the posterior probability distribution of the true error of a classifier for one subject given the observed number of errors made, as well as the posterior distribution of the observed number of errors under a Binomial distribution with a Beta prior. As the second and final step we propose two related hierarchical models, one for each hypothesis, to handle the multi-subject case. From these models we derive the marginal likelihoods in order to compute the Bayes factor and the posterior distribution of the population parameters.

2.1. Notation and basic definitions

During a neuroimaging experiment a sequence of stimuli is usually presented to the subject at specific time intervals and he or she is requested to accomplish an active or passive task accordingly. In the case of fMRI neuroimaging experiments the data consist of multiple consecutive 3D images where the value of each voxel, the 3D pixel, represents the local blood oxygenation level dependent (BOLD) contrast [20], which is an indirect measure of local neural activity. In the case of EEG and MEG experiments the data consist of multiple segments, called trials, of multichannel recordings at high temporal resolution, e.g. 1 kHz, of the local electrical and magnetic fields, respectively, which are related to the local neural activity. The data are first pre-processed in order to reduce noise and artifacts. The pre-processing steps differ between fMRI and EEG or MEG data. For example fMRI data usually require head movement correction and detrending, while EEG or MEG data require band-pass filtering, downsampling and eye-blink detection and removal. See [1,21] for pre-processing details in the case of fMRI data. See [10,7,6] for examples of the EEG and MEG cases.

After pre-processing the data can be represented as a set of n instances {r_1, …, r_n} described by d real random variables X = (X_1, …, X_d) ∈ R^d. Typical dimensions for fMRI data are d ≈ 10^3–10^5, which correspond to the number of voxels of one or more regions of interest (ROIs) or of the whole segmented brain. In the case of EEG and MEG data the random vector X is built by concatenating pre-processed data from one trial for a subset or all of the channels of the multichannel recording, thus yielding d ≈ 10^3–10^4.¹ The number of instances n greatly depends on the pre-processing steps and on the specific stimulation protocol. In the case of fMRI data it can correspond to the number of 3D images recorded during an fMRI session, typically n ≈ 10^2–10^3, at the frequency of one every 2–3 s. Or it can be the number of blocks or events in a block or event-related design of the stimulation protocol, i.e. n ≈ 10^1–10^2. See [1] for extensive details on the available alternatives. In the EEG and MEG cases the number of instances usually corresponds to the number of trials of the neuroimaging experiment, typically n ≈ 10^2–10^3.

The instances are associated with the respective stimuli, which are their class labels (Y_1, …, Y_n). This association is straightforward in the case of EEG or MEG experiments, i.e. the class label is the stimulus given during the trial. In the fMRI case the association holds similarly, but it must take into account the nature of the BOLD response, which is delayed with respect to the stimulus onset and lasts for 15–20 s after the stimulus offset. See [1] for further details on association schemas. Even though stimulation protocols usually involve more than two kinds of stimuli, within the brain decoding analysis they are frequently investigated in pairs. For this reason this work focuses on Boolean class labels, i.e. Y_i ∈ {0, 1}. We defer the extension of the proposed method to more than two class labels to future work. To conclude, a neuroimaging experiment produces a dataset of class-labeled examples S = {(x_1, y_1), …, (x_n, y_n)}, where the joint distribution D of variables and class labels is unknown.

Pereira et al. [1] define the problem of the pattern discrimination task in brain decoding as that of establishing whether or not brain data carry information about the task of interest. This task differs from that of pattern localization, whose aim is to identify where in the brain the stimulus-related information is localized. Even though pattern discrimination can be used for localization purposes, e.g. by restricting the whole analysis to a ROI, the problem is inherently hypothesis testing, while pattern localization typically aims at producing brain maps. This work is about pattern discrimination and we restrict our attention to that problem from now on.

In pattern discrimination one of the key elements is the binary classifier c : R^d → {0, 1}, which is a function that, given a segment of brain data, i.e. an instance, predicts which was the stimulus, i.e. its class label, that generated that brain response. We denote C as the set of classifiers of interest, e.g. C is the set of linear functions. A classification algorithm G designs one classifier c ∈ C from training examples according to a criterion such as maximum likelihood or empirical risk minimization.

¹ The number of channels for these kinds of recordings ranges between a few tens and a few hundreds.

A classifier that is able to accurately predict the class labels of future examples provides evidence for the presence of information between instances and class labels. Among the many metrics to measure the performance of a classifier we focus on the most popular one, i.e. the misclassification rate, also known as error rate. Different kinds of error rates can be defined: the true error, also known as generalization error, of c is defined as

ε = E_D[|c(X) − Y|]   (1)

which is the probability of the classifier predicting incorrectly when drawing examples from D. Throughout this work the classifier c ∈ C denotes the one designed from a finite amount of training data.

In practice D is unknown so ε is not observable. Given a sample S′ = {(x′_1, y′_1), …, (x′_m, y′_m)}, called the test set, of class-labeled instances and of size m = |S′|, the observed number of errors e is

e = Σ_{(x′_i, y′_i) ∈ S′} |c(x′_i) − y′_i|   (2)

which is an observable quantity related to ε. If S′ is drawn independently from S and the pairs of the test set {(x′_1, y′_1), …, (x′_m, y′_m)} are independent and identically distributed (i.i.d.) observations drawn from D, then

ê_m = e / m   (3)

is an unbiased estimator of ε, and the following distribution-free probabilistic bound (see for example [22], Corollary 8.1), called the Chernoff–Hoeffding bound, holds: for every ξ > 0

p(|ε − ê_m| > ξ | S′) ≤ 2 e^{−2mξ²}   (4)

The i.i.d. assumption on the test set is reasonable in pre-processed data from EEG or MEG experiments where stimulation protocols are based on randomization, effects are assumed to last less than the trial, trials are separated by a resting period and low-pass filtering is applied. The fMRI case is more complex since the measured signal, i.e. the BOLD contrast, quantifies the hemodynamic response process, which is correlated to neuronal activity but has both delay and strong temporal autocorrelation. The hemodynamic response to an impulse stimulus has a peak after 4–6 s, almost fades out after 15 s and disappears completely after 20 s. In the brain decoding literature it is common practice to estimate error rates when examples are 10 s or more apart from each other, sometimes violating the i.i.d. assumption to some degree. In this work we assume that the i.i.d. assumption holds or is mildly violated without major effects. To the best of our knowledge the brain decoding literature does not provide results about the quantification of the effects in error estimation due to this issue.
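As a concrete illustration of the bound in Eq. (4), the following sketch (plain Python; the function names are ours, not the paper's) computes the bound for a given test-set size and, by inverting it, the deviation ξ guaranteed at a given confidence level:

```python
import math

def hoeffding_bound(m, xi):
    # Eq. (4): p(|eps - e/m| > xi) <= 2 * exp(-2 * m * xi^2)
    return 2.0 * math.exp(-2.0 * m * xi ** 2)

def xi_at_confidence(m, delta):
    # smallest xi such that the bound above equals delta, i.e.
    # |eps - e/m| <= xi holds with probability >= 1 - delta
    return math.sqrt(math.log(2.0 / delta) / (2.0 * m))
```

For a test set of m = 100 examples, ξ at 95% confidence is about 0.136, which shows how loose the guarantee is for test-set sizes typical of neuroimaging experiments.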

In all practical cases only a finite dataset S is available both for designing the classifier and estimating its error rate. The specific error rate that is estimated is the true error ε of the classifier trained on the whole of S. Various point estimators of ε have been proposed in the literature, each with a different strategy to allocate data between the train set and the test set: the re-substitution, or empirical, estimator uses all the data for designing c and re-uses the same data to compute e. This estimator is known to be strongly biased toward underestimating the true error because train and test sets are not independent. The holdout estimator splits data into two groups whose sizes are in a given ratio, e.g. 70/30, and uses the first to design the classifier and the second to estimate its error. This estimator does not suffer the bias of dependency between train and test but uses a reduced train set which might not be sufficient for designing an accurate classifier. The cross-validation (CV) and bootstrap (BTS) estimators are based on resampling schemes but they are reliable only for


Table 1. Guidelines for the interpretation of the Bayes factor B10 in terms of the strength of evidence in favor of H1 and against H0, from [19].

B10:       <1         1–3    3–20      20–150    >150
Strength:  Negative   Weak   Positive  Strong    Decisive

² Interpreting the p-value directly as the probability that the null hypothesis is true given the data is a misconception termed the "p-value fallacy".


large datasets [11]. Martin et al. [23] discussed the issues in CV and BTS estimates for small samples. Isaksson et al. [14] proved empirically that both cross-validation and bootstrap are unreliable in small sample classification.
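To make the allocation strategies concrete, here is a minimal sketch (our own illustration, not the paper's code) of the holdout and k-fold cross-validation estimators, with a deliberately simple nearest-centroid classifier standing in for a real brain decoder:

```python
import numpy as np

def centroid_classifier(X_tr, y_tr):
    # toy classifier: predict the class of the nearest class centroid
    c0, c1 = X_tr[y_tr == 0].mean(axis=0), X_tr[y_tr == 1].mean(axis=0)
    return lambda X: (np.linalg.norm(X - c1, axis=1)
                      < np.linalg.norm(X - c0, axis=1)).astype(int)

def holdout_error(X, y, train_frac=0.7, seed=0):
    # split once: the first part designs c, the second estimates its error
    idx = np.random.default_rng(seed).permutation(len(y))
    split = int(train_frac * len(y))
    tr, te = idx[:split], idx[split:]
    c = centroid_classifier(X[tr], y[tr])
    return float(np.mean(c(X[te]) != y[te]))

def cv_error(X, y, k=5, seed=0):
    # k-fold CV: every example is tested exactly once
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), k)
    errs = []
    for i in range(k):
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        c = centroid_classifier(X[tr], y[tr])
        errs.append(np.mean(c(X[folds[i]]) != y[folds[i]]))
    return float(np.mean(errs))
```

Both return an estimated error rate in [0, 1]; as the section notes, neither estimate is reliable for the very small samples common in neuroimaging.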

2.2. Hypothesis testing

Neuroscientific investigations are frequently concerned with drawing conclusions on the whole population of subjects they address, e.g. healthy subjects. For this reason the experiment is conducted over a group of N subjects and inductive reasoning is necessary to make inferences. Moreover the mere information about the observed error rate values over the N subjects is sometimes not enough to draw any general conclusion. Statistical inference is necessary to go beyond the data and allow for result generalization. The usual statistical framework used in this context refers to the classical hypothesis testing approach.

2.2.1. Student’s t-test on estimated errors

A classical approach to test the null hypothesis of no stimulus-related information within brain data over the whole population of subjects is based on the Student's independent one-sample t-test carried out on the estimated errors ê = {ê_1, …, ê_N}. The null hypothesis H0 is that the average true error of the population of subjects is equal to μ = 1/2, which corresponds to random guessing. The t-test computes the probability of observing a value of the t statistic equal to or more extreme than the actual one computed on observed data if the null hypothesis were true. In the Student's independent one-sample t-test the t statistic is

t = (m̂ − 1/2) / (ŝ / √N)   (5)

where m̂ is the sample mean over the N subjects:

m̂ = (1/N) Σ_{i=1}^{N} ê_i   (6)

and the sample standard deviation is

ŝ = √( Σ_{i=1}^{N} (ê_i − m̂)² / (N − 1) )   (7)

Under the assumption of independence and normality of {ê_i}, and assuming that ŝ follows a χ² distribution, the t statistic follows the Student's t distribution with N − 1 degrees of freedom. If the probability of observing such a t or a more extreme value is below a given significance level (e.g. 0.05), then the null hypothesis is rejected and the alternative complementary hypothesis H1 of presence of information is accepted.

Notice that the t-test assumes that the estimates of the error rates {ê_i}_{i=1,…,N} are normally distributed, but this assumption conflicts with the fact that error rates are bounded between 0 and 1: it is not possible to observe error rates outside those bounds despite the non-null probability density assigned to them by the tails of the normal distribution. Moreover the t-test does not account for the size m of the test sets, which means that it does not account for the confidence we have in the error rate estimates as expressed, for example, by Eq. (4).
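The t statistic of Eqs. (5)–(7) is simple to compute; the sketch below (our own illustration, standard library only) implements it directly:

```python
import math

def t_statistic(errors, mu0=0.5):
    # Eqs. (5)-(7): one-sample t statistic of the estimated error rates
    # against the chance level mu0 = 1/2; accurate classifiers (errors
    # below 1/2) yield large negative values of t.
    N = len(errors)
    mean = sum(errors) / N                                          # Eq. (6)
    s = math.sqrt(sum((e - mean) ** 2 for e in errors) / (N - 1))   # Eq. (7)
    return (mean - mu0) / (s / math.sqrt(N))                        # Eq. (5)
```

For example, estimated error rates of {0.40, 0.45, 0.35, 0.50, 0.30} over N = 5 subjects give t ≈ −2.83, roughly at the two-sided 0.05 threshold for 4 degrees of freedom.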

2.2.2. Bayesian hypothesis testing

A frequent criticism (see for example [15]) of the classical hypothesis testing setup is that it is based on calculating deductively the probability of observing the data under a given hypothesis, p(data|Hi), but the question we want to address is instead inductive, specifically how likely a hypothesis is given observed data, i.e. p(Hi|data). The Bayesian hypothesis testing framework addresses the inductive question while the classical framework does not [24]. Moreover it can incorporate prior information when available and it does not require the two hypotheses to be complementary in order to conduct the test. Therefore we decided to turn to the Bayesian hypothesis testing framework for the proposed solution.

In order to compare two alternative hypotheses the Bayesian hypothesis testing framework [18,19] proposes to study the ratio of how likely the two hypotheses are given data and prior information. This is called the posterior odds ratio O:

O10 = p(H1|data, I) / p(H0|data, I) = [p(data|H1, I) p(H1|I)] / [p(data|H0, I) p(H0|I)] = B10 · p(H1|I) / p(H0|I)   (8)

where B10 is called the Bayes factor and p(Hi|I) is the probability of each hypothesis before seeing the data, based on prior information I. It has to be noted that the marginal likelihood p(data|Hi, I) is obtained by integrating (not maximizing) over the parameter space on which we model the dependency between observations and the hypothesis:

p(data|Hi, I) = ∫ p(data|Θ_i, Hi, I) p(Θ_i|Hi, I) dΘ_i   (9)

where Θ_i = {θ_i1, …, θ_ik} is the set of parameters of the i-th model and p(Θ_i|Hi, I) is its prior probability.
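To illustrate how the integral in Eq. (9) turns into a marginal likelihood, consider a deliberately simplified single-subject version of the test (this is our own sketch, not the paper's hierarchical multi-subject model, which is developed below): under H0 the error is fixed at ε = 1/2, while under H1 it is uniform on [0, 1/2). Both marginals then have closed forms, because the incomplete Beta integral with integer parameters reduces to a Binomial tail probability:

```python
from math import comb

def bayes_factor_single_subject(e, m):
    # Simplified single-subject illustration of Eqs. (8)-(9):
    # H0: eps = 1/2 exactly; H1: eps ~ Uniform[0, 1/2).
    # (The paper's actual proposal is the hierarchical Beta-Binomial
    # model of Section 2.4, not this sketch.)
    p_h0 = comb(m, e) * 0.5 ** m            # Binomial(m, 1/2) likelihood
    # marginal under H1: 2 * integral over [0, 1/2) of the Binomial
    # likelihood; with a uniform prior this equals
    # (2 / (m + 1)) * P(Binomial(m + 1, 1/2) >= e + 1)
    tail = sum(comb(m + 1, j) for j in range(e + 1, m + 2)) * 0.5 ** (m + 1)
    p_h1 = 2.0 / (m + 1) * tail
    return p_h1 / p_h0
```

For example, 2 errors out of m = 20 test predictions give a Bayes factor above 150 (decisive evidence against chance level in the scale of Table 1), while 10 errors out of 20 give a Bayes factor below 1, i.e. evidence in favor of H0.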

Guidelines for the interpretation of Bayes factor values in terms of evidence in favor of one hypothesis or the other were first introduced by Jeffreys [18]. In Table 1, taken from [19], we illustrate commonly accepted categories associated with the numerical value of B10. These categories can be thought of as related to those commonly adopted for the p-value of the classical hypothesis testing framework, e.g. 0.05, 0.01 and 0.001. In analogy with the classical framework, once a threshold on the Bayes factor B10 is decided the hypothesis H0 is rejected when the evidence is greater than the threshold. A quantitative relationship between the p-value of the classical framework and the evidence of the Bayes factor is not straightforward. The main reason is that they belong to different frameworks, each targeting a different concept, as explained in Section 2.2.1 and at the beginning of this section. A bridge between the two frameworks is proved in [17], where a calibration is proposed in order to interpret the p-value as the probability that the null hypothesis is true in the light of the data.² Specifically, when p < 1/e an upper bound on the Bayes factor provided by the data is

B10^up = −1 / (e p ln p)   (10)

meaning that for a given p-value the evidence provided by the data in favor of H1 over H0 is no more than B10^up. Table 2 shows some examples of p-values and the corresponding upper bound of the evidence.

2.3. Bayesian hypothesis testing for brain decoding

In the hypothesis testing problem of brain decoding the hypothesis H0 corresponds to the case where there is no evidence of stimulus-related brain activity within the recorded data, and in particular this statement is meant about the whole population of

Page 5: Bayesian hypothesis testing for pattern discrimination in brain decoding

Table 2Calibration of p-values as upper bound (B10

up) of the evidence in favor of H1 with

respect to H0, following Eq. (10). Adapted from [17].

p-Value 0.1 0.05 0.01 0.005 0.001

B10up 1.6 2.5 8 14 53

E. Olivetti et al. / Pattern Recognition 45 (2012) 2075–2084 2079

the study. Conversely the hypothesis H1 corresponds to evidenceof stimuli-related brain activity within the data. The observeddata of Eq. (8) on which the Bayes factor will be computed are theset E¼{e1,y,eN} of the observed number of errors made on thetest sets of each subject by the corresponding classifiers. TheBayes factor becomes then:

B_{10} = \frac{p(e_1, \ldots, e_N \mid H_1)}{p(e_1, \ldots, e_N \mid H_0)} = \frac{\int p(e_1, \ldots, e_N \mid \Theta_1, H_1) \, p(\Theta_1 \mid H_1) \, d\Theta_1}{\int p(e_1, \ldots, e_N \mid \Theta_0, H_0) \, p(\Theta_0 \mid H_0) \, d\Theta_0} \quad (11)

where \Theta_i are the parameters of the model underlying hypothesis H_i; as will be explained in detail in the next sections, they are the generalization errors {ε_1, ..., ε_N} of the classifiers of each subject, together with the mean μ and variance σ² of the population. In this work we propose two alternative models to describe the observations, each related to one of the two hypotheses. In analogy to Section 2.2.1, the proposed model under H0 will assume that the expected generalization error over the population is μ = 1/2, i.e. random chance, while the one under H1 will assume that 0 ≤ μ < 1/2.

In the following sections we first introduce a Beta-Binomial model to describe the relationship between the observed number of errors the classifier makes on data from a single subject and the generalization error of that classifier. Then we introduce the two alternative models about the population of subjects as a multi-subject extension of the Beta-Binomial model. To conclude, we provide a simple method to estimate the Bayes factor.

2.4. Single subject: the Beta-Binomial model

Many authors, like Breiman [25] and Langford [12], observed that under the sole i.i.d. assumption the number of observed errors e for the binary classifier c is binomially distributed given the number of predictions m and the true error rate ε:

p(e \mid \epsilon, m) = \mathrm{Bin}(e \mid \epsilon, m) = \binom{m}{e} \epsilon^{e} (1 - \epsilon)^{m - e} \quad (12)

This property is the building block of the binomial test used in the brain decoding literature (see for example [1]) for testing whether or not the number of misclassified examples e provides sufficient support to reject the null hypothesis of no stimuli-related information within brain data for a single subject. The binomial test falls within the classical hypothesis testing framework, which means that, under the hypothesis of no information and given a threshold γ for the p-value, the test becomes

p(k \le e) = \sum_{k=0}^{e} \mathrm{Bin}\left(k \,\middle|\, m, \tfrac{1}{2}\right) \le \gamma \quad (13)
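As a sketch of Eq. (13), the single-subject binomial test can be computed with the binomial cumulative distribution function (here via SciPy; the counts m and e are hypothetical):

```python
from scipy.stats import binom

# Hypothetical single-subject result: e misclassified examples out of m.
m, e = 108, 40

# One-sided p-value under the null epsilon = 1/2: probability of
# observing e or fewer errors by chance (Eq. (13)).
p_value = binom.cdf(e, m, 0.5)

gamma = 0.05  # threshold on the p-value
reject_null = p_value <= gamma
print(f"p-value = {p_value:.4f}, reject H0: {reject_null}")
```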

Here we focus on the full posterior distribution of the true error, deduced by the application of Bayes' theorem:

p(\epsilon \mid e, m) = \frac{p(e \mid \epsilon, m) \, p(\epsilon \mid m)}{p(e \mid m)} = \frac{p(e \mid \epsilon, m) \, p(\epsilon \mid m)}{\int p(e \mid \epsilon, m) \, p(\epsilon \mid m) \, d\epsilon} \quad (14)

It is both meaningful and convenient, for conjugacy reasons, to assume a Beta prior distribution for the true error:

p(\epsilon) = \mathrm{Beta}(\epsilon \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \epsilon^{a-1} (1 - \epsilon)^{b-1} \quad (15)

then

p(\epsilon \mid e, m) = \frac{\binom{m}{e} \epsilon^{e} (1 - \epsilon)^{m - e} \, \mathrm{Beta}(\epsilon \mid a, b)}{\int \binom{m}{e} \epsilon^{e} (1 - \epsilon)^{m - e} \, \mathrm{Beta}(\epsilon \mid a, b) \, d\epsilon} \quad (16)

and the posterior distribution can be derived exactly:

p(\epsilon \mid e, m) = \mathrm{Beta}(\epsilon \mid \bar{a}, \bar{b}) = \frac{\Gamma(\bar{a} + \bar{b})}{\Gamma(\bar{a})\Gamma(\bar{b})} \epsilon^{\bar{a}-1} (1 - \epsilon)^{\bar{b}-1} \quad (17)

where \bar{a} = e + a and \bar{b} = m - e + b. Further motivations to support the choice of the Beta prior distribution in this context are presented in [23].
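Thanks to conjugacy, the posterior of Eq. (17) is available in closed form and can be manipulated directly, e.g. with SciPy; in this sketch the counts and the uniform Beta(1, 1) prior are illustrative assumptions:

```python
from scipy.stats import beta

# Hypothetical data: e errors out of m predictions, Beta(a, b) prior.
m, e = 108, 40
a, b = 1.0, 1.0  # uniform prior over the true error rate

# Conjugate update of Eq. (17): Beta(e + a, m - e + b).
posterior = beta(e + a, m - e + b)

print(f"posterior mean = {posterior.mean():.3f}")  # (e + a) / (m + a + b)
print(f"95% credible interval = {posterior.interval(0.95)}")
```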

As a final step it is possible to derive a closed-form solution for the probability of observing a given number of errors e out of m predictions given a Beta(a, b) prior, which is the Beta-binomial distribution:

p(e \mid m, a, b) = \mathrm{Betabin}(e \mid m, a, b) = \int_0^1 \mathrm{Bin}(e \mid m, \epsilon) \, \mathrm{Beta}(\epsilon \mid a, b) \, d\epsilon = \binom{m}{e} \frac{B(e + a, m - e + b)}{B(a, b)} \quad (18)

where B(x, y) = \int_0^1 t^{x-1} (1 - t)^{y-1} \, dt is the beta function.
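The Beta-binomial distribution is available directly as `scipy.stats.betabinom` (SciPy ≥ 1.4); as a sanity check, the closed form of Eq. (18) can be compared against it (the parameter values below are illustrative):

```python
from math import comb
from scipy.special import beta as beta_fn
from scipy.stats import betabinom

def betabin_pmf(e, m, a, b):
    """Probability of e errors out of m predictions, Eq. (18):
    C(m, e) * B(e + a, m - e + b) / B(a, b)."""
    return comb(m, e) * beta_fn(e + a, m - e + b) / beta_fn(a, b)

m, a, b = 54, 2.0, 3.0  # illustrative values
for e in (0, 10, 27):
    assert abs(betabin_pmf(e, m, a, b) - betabinom.pmf(e, m, a, b)) < 1e-12
print("closed form of Eq. (18) matches scipy.stats.betabinom")
```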

2.5. Multiple subjects: population models under H1 and H0

We assume that the neuroscientific study involves N independent subjects for which the investigator collected brain data while presenting the same stimulation protocol. After designing N classifiers C = {c_1, ..., c_N}, one for each subject, from training data, the investigator computes the related numbers of misclassified examples E = {e_1, ..., e_N} on each subject-specific test set of size m.

We assume a Beta(a, b) distribution for the generalization errors of the population. The mean μ and variance σ² of the distribution can be expressed in terms of the parameters a and b:

\mu = \frac{a}{a + b} \quad (19)

\sigma^2 = \frac{ab}{(a + b)^2 (a + b + 1)} \quad (20)

and equivalently a and b can be expressed in terms of mean and variance:

a = \frac{\mu(\mu - \mu^2 - \sigma^2)}{\sigma^2} \quad (21)

b = \frac{\mu(\mu^2 - 2\mu + \sigma^2 + 1)}{\sigma^2} - 1 \quad (22)
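The mappings of Eqs. (19)–(22) are inverses of each other on the valid region 0 < σ² < μ(1 − μ), which a quick round trip confirms (a sketch; the helper names are ours):

```python
def beta_moments(a, b):
    """Mean and variance of a Beta(a, b) distribution (Eqs. (19)-(20))."""
    mu = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mu, var

def beta_params(mu, var):
    """Inverse mapping of Eqs. (21)-(22), valid when 0 < var < mu * (1 - mu)."""
    a = mu * (mu - mu ** 2 - var) / var
    b = mu * (mu ** 2 - 2 * mu + var + 1) / var - 1
    return a, b

# Round trip: (a, b) -> (mu, sigma^2) -> (a, b).
a, b = 3.0, 7.0
mu, var = beta_moments(a, b)
a2, b2 = beta_params(mu, var)
print(f"mu = {mu:.4f}, var = {var:.6f}, recovered a = {a2:.4f}, b = {b2:.4f}")
```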

We propose the following hierarchical models [26] to link the observations E = {e_1, ..., e_N} to the generalization errors {ε_1, ..., ε_N} and to the parameters of the population μ and σ². In the case of H1 we assume a non-informative prior for μ, uniformly distributed in [0, 1/2), meaning that the expected generalization error of the population is better (lower) than that of predicting at random, i.e. 1/2. That range for μ and the fact that a and b must be non-negative lead to σ² ∈ [0, μ(1 − μ)]. According to [27], a non-informative prior for variance parameters in hierarchical models is the uniform distribution on the standard deviation σ. The proposed hierarchical model under the hypothesis H1 is then:

\mu \sim U\left(0, \tfrac{1}{2}\right), \qquad \sigma \sim U\left(0, \sqrt{\mu(1 - \mu)}\right) \quad (23)

3 In all cases described in [2] the classification algorithm was support vector machines (SVMs) with a linear kernel and parameter C = 1.

and

\epsilon_1, \ldots, \epsilon_N \sim \mathrm{Beta}(a, b)

e_1 \sim \mathrm{Bin}(m, \epsilon_1), \ \ldots, \ e_N \sim \mathrm{Bin}(m, \epsilon_N) \quad (24)

where a and b are those of Eqs. (21) and (22). Under the hypothesis H0 the expected generalization error of the population is that of predicting at random, i.e. μ = 1/2. Then, according to the Beta prior, the population is parametrized only by σ². Plugging μ = 1/2 into Eqs. (21)–(23) we obtain the proposed model under H0:

\sigma \sim U\left(0, \tfrac{1}{2}\right) \quad (25)

where a = \frac{1}{8\sigma^2} - \frac{1}{2}, b = a, and the remaining part of the hierarchical model is the same as in Eq. (24).

2.6. Computation of the Bayes factor and posterior probabilities

We now present the derivations of the Bayes factor B10 and the posterior probabilities of the population parameters μ and σ given the observations and the models described in Section 2.5. Additionally, we propose a simple yet efficient way to evaluate them numerically. We start by writing the detailed form of B10 from Eq. (11), using the independence assumption between subjects:

B_{10} = \frac{p(e_1, \ldots, e_N \mid H_1)}{p(e_1, \ldots, e_N \mid H_0)} \quad (26)

B_{10} = \frac{\int p(e_1, \ldots, e_N \mid m, \epsilon_1, \ldots, \epsilon_N, \mu, \sigma) \, p(\epsilon_1, \ldots, \epsilon_N \mid \mu, \sigma) \, p(\mu, \sigma) \, d\epsilon_1 \cdots d\epsilon_N \, d\mu \, d\sigma}{\int p(e_1, \ldots, e_N \mid m, \epsilon_1, \ldots, \epsilon_N, \sigma) \, p(\epsilon_1, \ldots, \epsilon_N \mid \sigma) \, p(\sigma) \, d\epsilon_1 \cdots d\epsilon_N \, d\sigma} \quad (27)

B_{10} = \frac{\int \prod_{i=1}^{N} p(e_i \mid m, \epsilon_i, \mu, \sigma) \, p(\epsilon_i \mid \mu, \sigma) \, p(\mu, \sigma) \, d\epsilon_1 \cdots d\epsilon_N \, d\mu \, d\sigma}{\int \prod_{i=1}^{N} p(e_i \mid m, \epsilon_i, \sigma) \, p(\epsilon_i \mid \sigma) \, p(\sigma) \, d\epsilon_1 \cdots d\epsilon_N \, d\sigma} \quad (28)

where p(μ, σ) is the prior distribution of the population parameters under H1 (Eq. (23)) and p(σ) that under H0 (Eq. (25)). We note that the term ε_i can be integrated out:

\int p(e_i \mid m, \epsilon_i, \mu, \sigma) \, p(\epsilon_i \mid \mu, \sigma) \, p(\mu, \sigma) \, d\epsilon_i = p(e_i \mid m, \mu, \sigma) \, p(\mu, \sigma) \quad (29)

giving

B_{10} = \frac{\int \prod_{i=1}^{N} p(e_i \mid m, \mu, \sigma) \, p(\mu, \sigma) \, d\mu \, d\sigma}{\int \prod_{i=1}^{N} p(e_i \mid m, \sigma) \, p(\sigma) \, d\sigma} \quad (30)

We recall the Beta-binomial distribution of Eq. (18) together with Eqs. (21) and (22) and derive the final form of the Bayes factor as

B_{10} = \frac{\int \prod_{i=1}^{N} \mathrm{Betabin}(e_i \mid m, a(\mu, \sigma), b(\mu, \sigma)) \, p(\mu, \sigma) \, d\mu \, d\sigma}{\int \prod_{i=1}^{N} \mathrm{Betabin}\left(e_i \,\middle|\, m, a\left(\tfrac{1}{2}, \sigma\right), b\left(\tfrac{1}{2}, \sigma\right)\right) \, p(\sigma) \, d\sigma} \quad (31)

which can be approximated by simple Monte Carlo integration:

B_{10} \approx \frac{\frac{1}{K} \sum_{k=1}^{K} \prod_{i=1}^{N} \mathrm{Betabin}(e_i \mid m, a(\mu^{(k)}, \sigma^{(k)}), b(\mu^{(k)}, \sigma^{(k)}))}{\frac{1}{K} \sum_{k=1}^{K} \prod_{i=1}^{N} \mathrm{Betabin}\left(e_i \,\middle|\, m, a\left(\tfrac{1}{2}, \sigma^{(k)}\right), b\left(\tfrac{1}{2}, \sigma^{(k)}\right)\right)} \quad (32)

given a sample \{(\mu^{(1)}, \sigma^{(1)}), \ldots, (\mu^{(K)}, \sigma^{(K)})\} drawn from the prior distribution p(μ, σ) for H1 and a sample \{\sigma^{(1)}, \ldots, \sigma^{(K)}\} drawn from the prior distribution p(σ) for H0.
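The Monte Carlo approximation of Eq. (32) is simple to implement. The following sketch (assuming NumPy and `scipy.stats.betabinom`, SciPy ≥ 1.4) is not the authors' released implementation but mirrors the structure of the estimator; the function names (`beta_params`, `bayes_factor`) and the error counts are illustrative assumptions, and products of likelihoods are computed in log space for numerical stability:

```python
import numpy as np
from scipy.stats import betabinom

def beta_params(mu, var):
    """Beta parameters from population mean and variance (Eqs. (21)-(22))."""
    a = mu * (mu - mu ** 2 - var) / var
    b = mu * (mu ** 2 - 2 * mu + var + 1) / var - 1
    return a, b

def log_marginal(errors, m, mus, sigmas):
    """Log of the Monte Carlo average over K prior samples (mu^(k), sigma^(k))
    of the product of per-subject Beta-binomial likelihoods."""
    a, b = beta_params(mus, sigmas ** 2)
    # (N, K) matrix of per-subject log-likelihoods, summed over subjects
    loglik = betabinom.logpmf(np.asarray(errors)[:, None], m, a, b).sum(axis=0)
    return np.logaddexp.reduce(loglik) - np.log(len(mus))

def bayes_factor(errors, m, K=50_000, seed=0):
    """Eq. (32): B10 from the observed error counts of N subjects."""
    rng = np.random.default_rng(seed)
    # H1 prior: mu ~ U(0, 1/2), sigma ~ U(0, sqrt(mu * (1 - mu)));
    # small positive lower bounds avoid division by zero in beta_params.
    mu1 = rng.uniform(1e-6, 0.5, K)
    sigma1 = rng.uniform(1e-6, np.sqrt(mu1 * (1 - mu1)))
    # H0 prior: mu = 1/2 fixed, sigma ~ U(0, 1/2)
    mu0 = np.full(K, 0.5)
    sigma0 = rng.uniform(1e-6, 0.5, K)
    return np.exp(log_marginal(errors, m, mu1, sigma1)
                  - log_marginal(errors, m, mu0, sigma0))

# Ten hypothetical subjects with roughly 30% observed error out of m = 100.
print(bayes_factor([28, 30, 33, 29, 31, 35, 27, 30, 32, 34], m=100))
```

Under equal prior probabilities of the two hypotheses, the returned value can be read directly as the odds in favor of H1.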

The posterior probabilities of μ and σ given the observations E = {e_1, ..., e_N} follow from the application of Bayes' theorem. In the case of H1:

p(\mu, \sigma \mid e_1, \ldots, e_N, H_1) = \frac{p(e_1, \ldots, e_N \mid \mu, \sigma, H_1) \, p(\mu, \sigma \mid H_1)}{p(e_1, \ldots, e_N \mid H_1)} \quad (33)

p(\mu, \sigma \mid e_1, \ldots, e_N, H_1) = \frac{\prod_{i=1}^{N} \mathrm{Betabin}(e_i \mid m, a(\mu, \sigma), b(\mu, \sigma)) \, p(\sigma \mid \mu, H_1) \, p(\mu \mid H_1)}{p(e_1, \ldots, e_N \mid H_1)} \quad (34)

p(\mu, \sigma \mid e_1, \ldots, e_N, H_1) = \frac{\prod_{i=1}^{N} \mathrm{Betabin}(e_i \mid m, a(\mu, \sigma), b(\mu, \sigma)) \, \frac{2}{\sqrt{\mu(1 - \mu)}}}{p(e_1, \ldots, e_N \mid H_1)} \quad (35)

where p(\sigma \mid \mu, H_1) = 1/\sqrt{\mu(1 - \mu)} and p(\mu \mid H_1) = 2 in the range of interest.

For the case of H0 the derivation of the posterior distribution of σ is analogous to the case of H1:

p(\sigma \mid e_1, \ldots, e_N, H_0) = \frac{\prod_{i=1}^{N} \mathrm{Betabin}\left(e_i \,\middle|\, m, a\left(\tfrac{1}{2}, \sigma\right), b\left(\tfrac{1}{2}, \sigma\right)\right) \cdot 2}{p(e_1, \ldots, e_N \mid H_0)} \quad (36)

To conclude, both posterior distributions p(μ, σ | e_1, ..., e_N, H_1) and p(σ | e_1, ..., e_N, H_0) can be easily evaluated once B10 has been computed and the marginals p(e_1, ..., e_N | H_1) and p(e_1, ..., e_N | H_0) are available.
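Alternatively, the posterior of Eq. (35) can be normalized numerically on a grid instead of through the marginal likelihood. A sketch (the error counts are hypothetical and the helper names `beta_params` and `unnorm_posterior_h1` are ours):

```python
import numpy as np
from scipy.stats import betabinom

def beta_params(mu, var):
    """Eqs. (21)-(22)."""
    a = mu * (mu - mu ** 2 - var) / var
    b = mu * (mu ** 2 - 2 * mu + var + 1) / var - 1
    return a, b

def unnorm_posterior_h1(mu, sigma, errors, m):
    """Numerator of Eq. (35): product of Beta-binomial likelihoods times
    the prior density p(mu, sigma | H1) = 2 / sqrt(mu * (1 - mu)),
    zero outside the support sigma < sqrt(mu * (1 - mu))."""
    if not (0 < mu < 0.5 and 0 < sigma < np.sqrt(mu * (1 - mu))):
        return 0.0
    a, b = beta_params(mu, sigma ** 2)
    lik = np.prod(betabinom.pmf(np.asarray(errors), m, a, b))
    return lik * 2.0 / np.sqrt(mu * (1 - mu))

# Hypothetical error counts of five subjects, m = 54 predictions each.
errors, m = [20, 22, 25, 19, 23], 54
mus = np.linspace(0.01, 0.49, 60)
sigmas = np.linspace(0.005, 0.3, 60)
density = np.array([[unnorm_posterior_h1(mu, s, errors, m) for mu in mus]
                    for s in sigmas])
density /= density.sum() * (mus[1] - mus[0]) * (sigmas[1] - sigmas[0])
# The posterior mode should sit near the observed mean error rate (~0.40).
i, j = np.unravel_index(density.argmax(), density.shape)
print(f"posterior mode at mu = {mus[j]:.3f}, sigma = {sigmas[i]:.3f}")
```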

3. Results

We tested the proposed method on data from an fMRI study about the relation between mental calculation and eye movements at the cortical level, as described in [2]. During the experiment the subjects were presented visual stimuli of leftward or rightward arrows on a screen and they were requested to track them by moving their eyes (saccades) to the left side or the right side, respectively. In the second part of the experiment the same subjects were presented stimuli about mental calculation: additions or subtractions. Each subject was requested to work out the mental computation immediately after the presentation of each stimulus. The main neuroscientific question was whether brain cortex areas processing saccadic eye movements encode simple mental calculation tasks in the same way. In order to answer this question a subject-specific classifier (see footnote 3) was trained on the eye-movement dataset and then tested on the mental calculation dataset, as if a rightward arrow stimulus corresponded to an addition and a leftward one to a subtraction.

A second investigation within the same study was about the relation between symbolic and non-symbolic processing of numerals for the brain regions processing eye movements. The stimuli presented during the calculation task were further divided into two groups: those using numerals represented as dotted patterns (non-symbolic) and those using Arabic numerals. The aim of the investigation was twofold: first, to test the ability of the brain cortex to generalize from saccades to calculation in each notation separately, and second, to test the sharing of cortical resources between notations. In the first case a classifier was trained on the saccades dataset and tested on the portion of the calculation dataset related to Arabic numerals and on that of non-symbolic numerals, separately. In the second case a classifier was trained on the brain data related to Arabic numerals and tested on brain data related to non-symbolic numerals (and vice versa).

The setting of the whole experiment does not exactly match the typical one of brain decoding experiments, where the subject is requested to perform the same task for the train and the test sets. Nevertheless, the method we propose does not require this assumption and the brain decoding approach applies identically.

It must be noted that the difference between the tasks of the train set and the test set could introduce a further difficulty in designing an accurate classifier: the classifier trained on the first task might not reach accurate results even when predicting future data of the first task itself, and it is expected to be even less accurate when used on a different task, i.e. the second one. Nevertheless, these facts do not invalidate our approach, since hypothesis testing can only be used to reject the hypothesis of absence of information within a brain region and not to favor it. When the null hypothesis cannot be rejected we cannot be sure whether the reason was the absence of stimuli-related brain activity rather than the lack of enough data, or the limited ability of the classification algorithm to extract the information.

Table 3
Student's one-sample t-test results from [2]. For each area and task the sample mean of the estimated errors {e_i/m}_{i=1,...,15} and the standard error of the mean (s/√15) over the 15 subjects of the study are reported, together with the corresponding t-value and p-value. Bold face p-values correspond to rejection of the null hypothesis at the threshold of 0.05 adopted in [2].

Exp. nr.  Area  Task               Mean    s/√15   t       p (dof = 14)
1         PSPL  saccades–calc.     0.45    0.018    2.78   0.015
2         FEF   saccades–calc.     0.51    0.053   −0.18   0.86
3         lFEF  saccades–calc.     0.508   0.063   −0.12   0.9
4         M1    saccades–calc.     0.498   0.019    0.09   0.93
5         hIPS  saccades–calc.     0.515   0.016   −0.94   0.36
6         PSPL  saccades–Arabic    0.458   0.02     2.10   0.054
7         PSPL  saccades–nonsymb.  0.446   0.02     2.67   0.018
8         PSPL  Arabic–nonsymb.    0.393   0.025    4.37   7×10^−4
9         PSPL  nonsymb.–Arabic    0.378   0.021    5.71   5×10^−5

Table 4
Results of the Bayes factor computation on real data. For each area and task the size of the test set (m), the Bayes factor and its interpretation in terms of strength of evidence against the null hypothesis are reported. Bold face strength indicates a Bayes factor greater than 3, i.e. Positive strength (see Table 1).

Exp. nr.  Area  Task               m    B10    Strength
1         PSPL  saccades–calc.     108  2.5    Weak
2         FEF   saccades–calc.     108  0.11   Negative
3         lFEF  saccades–calc.     108  0.13   Negative
4         M1    saccades–calc.     108  0.05   Negative
5         hIPS  saccades–calc.     108  0.03   Negative
6         PSPL  saccades–Arabic    54   0.73   Negative
7         PSPL  saccades–nonsymb.  54   2.3    Weak
8         PSPL  Arabic–nonsymb.    54   58     Strong
9         PSPL  nonsymb.–Arabic    54   710    Decisive

A group of 15 subjects was involved in the study in [2]. The first part of the experiment, about saccadic movements, consisted of 82 arrow presentations, equally divided between leftward ones (41 times) and rightward ones (41 times). The second part of the experiment, about mental calculation, consisted of 108 presentations, equally divided between subtraction (54 times) and addition (54 times) and between Arabic numerals (54 times) and non-symbolic numerals (54 times). Concurrent fMRI data were collected from the brain of each subject. The analysis was restricted to the following brain regions: the bilateral posterior superior parietal lobule (PSPL), the bilateral frontal eye fields (FEF), two clusters lateral to the FEF (lFEF), the motor hand area (M1) and the horizontal segment of the intraparietal sulcus (hIPS). Classification was attempted on the tasks of the nine investigations previously described. The size of the test sets corresponded to the number of presentations of the stimuli considered by each pattern discrimination task, i.e. either 108 or 54. The BOLD values of each example were taken from a portion of the 3D images recorded immediately after each presentation, such that consecutive examples were separated by 15 s, providing fair independence between examples. Full details of the construction of the dataset and of the pre-processing steps are provided in the supplementary materials of [2].

Table 3 shows the list of the results presented in [2] that we re-computed from single-subject values kindly provided to us by the authors of that paper.4 The results are based on the Student's one-sample t-test of the classical hypothesis testing framework, as described in Section 2.2.1. For each area and task the average estimated error over the 15 subjects, its standard error, the t-value and the p-value are reported. The p-value is shown in bold face if p < 0.05, according to the null hypothesis rejection threshold adopted in [2].

Table 4 shows the results based on the Bayes factor of the Bayesian hypothesis testing framework described in Section 2.2.2 and on the proposed models described in Section 2.5. For each area and task the table reports the size of the test set (m), the Bayes factor computed through the proposed Beta-Binomial model (Eq. (32)) and the strength of evidence according to Table 1. The Bayes factor B10 and the strength of evidence are shown in bold face when the evidence is at least Positive, i.e. B10 > 3, which is the rejection threshold for H0 that we adopted in analogy to [2]. The Bayes factors were computed from the original sets of single-subject error rates of the neuroscientific study. In this application we assume that the prior probabilities of the two hypotheses are equal, i.e. p(H1) = p(H0). For this reason B10 represents the odds in favor of H1 according to Eq. (8). Fig. 1 shows the posterior densities p(μ, σ | e_1, ..., e_N, m, H_1) of the population parameters μ and σ of each of the nine experiments under hypothesis H1. Fig. 2 reports the posterior densities p(σ | e_1, ..., e_N, m, H_0) of the nine experiments under H0. Posterior densities are computed from Eqs. (35) and (36), respectively. The implementation of the simple Monte Carlo integration and of the posterior densities of the population parameters was done in the Python language on top of the scientific libraries NumPy and SciPy.5 The code is freely available at https://github.com/emanuele/Bayes-factor-multi-subject together with graphs from simulated data about the frequency of Type I and II errors as a function of the Bayes factor threshold, which we do not present here for lack of space. The computation of the Bayes factor and posterior probabilities of each experiment required 10^5 samples in order to get the results shown in Table 4 and in Figs. 1 and 2. The time required for the computations of each experiment was in the order of 1 s on a standard computer.

4 The values in the entries of experiments 6 and 7 of Table 3 slightly differ from those published in [2]. These differences do not influence our results.

4. Discussion

In this work we present the first Bayesian test for brain decoding which aims at answering the question about the presence of stimuli-related information within brain data at the level of the population of interest, i.e. when the data come from a neuroimaging experiment on multiple subjects. Differently from the classical framework of hypothesis testing, the answer of the Bayesian hypothesis test is directed to the central problem of deciding which one of the two hypotheses is favored by the data, as explained in Section 2.2.2.

The proposed test overcomes some of the approximations and limitations of the commonly used Student's one-sample t-test. Specifically, in the proposed test we model the confidence of the estimated error rate of each subject by accounting for the size of the test set. Moreover, we model the generalization errors of the population as a bounded distribution matching their range of validity, i.e. [0, 1], instead of relying on the unbounded normal distribution, as the t-test does. In particular, the proposed test relies on the more tenable assumption of a Beta prior for the generalization error of the population of interest.

5 http://www.scipy.org

Fig. 1. Posterior densities p(μ, σ | e_1, ..., e_N, m, H_1) over the nine experiments from [2] under hypothesis H1 and following Eq. (35). Contour lines are equally spaced at 70 units of probability density starting from zero.

Fig. 2. Posterior densities p(σ | e_1, ..., e_N, m, H_0) over the nine experiments from [2] under hypothesis H0 and following Eq. (36).

The relationship between the t-test and the Bayes factor is not straightforward. The two tests belong to different frameworks, they target different goals and they rely on different assumptions. In this work we report guidelines for the interpretation of the Bayes factor and an upper bound for it given the p-value, both taken from the literature, in order to illustrate their relation.

A further contribution of this work is a simple and effective way to compute the Bayes factor and the posterior probabilities of the parameters of the population for the brain decoding problem in a multi-subject study. The algorithms to compute the marginal likelihoods require an amount of time in the order of 1 s for evaluating the proposed test on the classification results of a typical fMRI experiment on a group of subjects. We published the code implementing the proposed method and we distribute it as free/open-source software.

In Section 3 we showed an application on real data from an fMRI study involving nine investigations. On those data we presented the results of the Student's one-sample t-test and of the proposed Bayesian test. The results of the two tests reported in Tables 3 and 4 mostly agree over the nine experiments, with the exception that, for experiments 1 and 7, the evidence in favor of H1 is considered weak according to the interpretation guidelines of the Bayes factor. This fact is not a surprise given the bound of Eq. (10) and it supports the position expressed in [17] that the commonly used threshold of 0.05 for the p-value does not provide substantial support for experimental claims.

4.1. Future work

Our future research on Bayesian hypothesis testing for brain decoding will address the multi-class case, i.e. experiments in which more than two kinds of stimuli are presented to the subject and the classification problem is multi-class.

A known problem specific to fMRI data is the autocorrelation in time of the signal, which leads to non-i.i.d. examples when the stimulation protocol does not provide sufficiently separated stimuli to the subject. In practice many neuroimaging experiments adopt stimulation protocols which violate the i.i.d. assumption to some degree. An interesting direction for future research is the empirical assessment of the effects of the different degrees of violation. Simulated fMRI data might provide a useful means to quantify these effects on error rate estimates and on the results of the hypothesis tests.

Acknowledgment

The authors are grateful to Susanne Greiner, Paolo Avesani and the anonymous reviewers for the helpful feedback, and to Andre Knops and Bertrand Thirion for providing the original classification results computed for each subject involved in the study in [2].

References

[1] F. Pereira, T. Mitchell, M. Botvinick, Machine learning classifiers and fMRI: a tutorial overview, NeuroImage 45 (1) (2009) 199–209. doi:10.1016/j.neuroimage.2008.11.007.
[2] A. Knops, B. Thirion, E.M. Hubbard, V. Michel, S. Dehaene, Recruitment of an area involved in eye movements during mental arithmetic, Science (New York, NY) 324 (5934) (2009) 1583–1585. doi:10.1126/science.1171599.
[3] J.-D. Haynes, G. Rees, Decoding mental states from brain activity in humans, Nature Reviews Neuroscience 7 (7) (2006) 523–534. doi:10.1038/nrn1931.
[4] M.W. Woolrich, M. Jenkinson, J.M. Brady, S.M. Smith, Fully Bayesian spatio-temporal modeling of FMRI data, IEEE Transactions on Medical Imaging 23 (2) (2004) 213–231. doi:10.1109/TMI.2003.823065.
[5] W. Penny, N. Trujillo-Barreto, K. Friston, Bayesian fMRI time series analysis with spatial priors, NeuroImage 24 (2) (2005) 350–362. doi:10.1016/j.neuroimage.2004.08.034.
[6] M. van Gerven, O. Jensen, Attention modulations of posterior alpha as a control signal for two-dimensional brain–computer interfaces, Journal of Neuroscience Methods 179 (1) (2009) 78–84. doi:10.1016/j.jneumeth.2009.01.016.
[7] A.M. Chan, E. Halgren, K. Marinkovic, S.S. Cash, Decoding word and category-specific spatiotemporal representations from MEG and EEG, NeuroImage 54 (4) (2011) 3028–3039. doi:10.1016/j.neuroimage.2010.10.073.
[8] T.M. Mitchell, R. Hutchinson, R.S. Niculescu, F. Pereira, X. Wang, M. Just, S. Newman, Learning to decode cognitive states from brain images, Machine Learning 57 (1) (2004) 145–175. doi:10.1023/B:MACH.0000035475.85309.1b.
[9] Y. Kamitani, F. Tong, Decoding the visual and subjective contents of the human brain, Nature Neuroscience 8 (5) (2005) 679–685. doi:10.1038/nn1444.
[10] J.R. Wolpaw, N. Birbaumer, W.J. Heetderks, D.J. McFarland, P.H. Peckham, G. Schalk, E. Donchin, L.A. Quatrano, C.J. Robinson, T.M. Vaughan, Brain–computer interface technology: a review of the first international meeting, IEEE Transactions on Rehabilitation Engineering 8 (2) (2000) 164–173. doi:10.1109/TRE.2000.847807.
[11] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, vol. 2, 1995, pp. 1137–1143.
[12] J. Langford, Tutorial on practical prediction theory for classification, Journal of Machine Learning Research 6 (2005) 273–306.
[13] M. Kaariainen, J. Langford, A comparison of tight generalization error bounds, in: ICML '05: Proceedings of the 22nd International Conference on Machine Learning, ACM, New York, NY, USA, 2005, pp. 409–416. doi:10.1145/1102351.1102403.
[14] A. Isaksson, M. Wallman, H. Goransson, M. Gustafsson, Cross-validation and bootstrapping are unreliable in small sample classification, Pattern Recognition Letters 29 (14) (2008) 1960–1965. doi:10.1016/j.patrec.2008.06.018.
[15] S.N. Goodman, Toward evidence-based medical statistics. 1: the P value fallacy, Annals of Internal Medicine 130 (12) (1999) 995–1004.
[16] S.N. Goodman, Toward evidence-based medical statistics. 2: the Bayes factor, Annals of Internal Medicine 130 (12) (1999) 1005–1013.
[17] T. Sellke, M.J. Bayarri, J.O. Berger, Calibration of p values for testing precise null hypotheses, The American Statistician 55 (1) (2001) 62–71. doi:10.2307/2685531.
[18] H. Jeffreys, Theory of Probability, third ed., Oxford University Press, USA, 1961.
[19] R.E. Kass, A.E. Raftery, Bayes factors, Journal of the American Statistical Association 90 (430) (1995) 773–795. doi:10.2307/2291091.
[20] N.K. Logothetis, J. Pauls, M. Augath, T. Trinath, A. Oeltermann, Neurophysiological investigation of the basis of the fMRI signal, Nature 412 (6843) (2001) 150–157. doi:10.1038/35084005.
[21] S.C. Strother, Evaluating fMRI preprocessing pipelines, IEEE Engineering in Medicine and Biology Magazine 25 (2) (2006) 27–41. doi:10.1109/MEMB.2006.1607667.
[22] L. Devroye, L. Gyorfi, G. Lugosi, A Probabilistic Theory of Pattern Recognition (Stochastic Modelling and Applied Probability), corrected ed., Springer, 1996.
[23] J.K. Martin, D.S. Hirschberg, Small sample statistics for classification error rates II: confidence intervals and significance tests, Technical Report 96-22, University of California, Irvine, 1996.
[24] R. Hubbard, M.J. Bayarri, Confusion over measures of evidence (p's) versus errors (alpha's) in classical statistical testing, The American Statistician 57 (3) (2003) 171–178. doi:10.1198/0003130031856.
[25] L. Breiman, J. Friedman, C.J. Stone, R.A. Olshen, Classification and Regression Trees, first ed., Chapman and Hall/CRC, 1984.
[26] A. Gelman, J.B. Carlin, H.S. Stern, D.B. Rubin, Bayesian Data Analysis, second ed., Chapman & Hall/CRC Texts in Statistical Science, Chapman and Hall/CRC, 2003.
[27] A. Gelman, Prior distributions for variance parameters in hierarchical models, Bayesian Analysis 1 (2006) 1–19.

Emanuele Olivetti received his master's degree in physics and his Ph.D. in computer science from the University of Trento, Italy. He is a researcher at the Bruno Kessler Foundation (FBK), working on machine learning for neuroimaging experiments jointly with the local Center for Mind and Brain Sciences (CIMeC) within the University of Trento. His research interests include brain decoding, learning algorithms for diffusion MRI data, joint analysis of multiple neuroimaging data sources, active learning and Bayesian inference.


Sriharsha Veeramachaneni conducts research in machine learning applied to natural language processing and in data mining at the Research and Development division of the Thomson Reuters Corporation. He obtained his Ph.D. in computer engineering from the Rensselaer Polytechnic Institute in 2002 and subsequently worked as a researcher at the Fondazione Bruno Kessler, Trento, Italy. His research interests include data compression, online learning, and semi-supervised and domain-adaptive learning.

Ewa Nowakowska received her M.Sc. degrees in mathematics (2007) and psychology (2006) from the University of Warsaw. She is currently pursuing the Ph.D. degree at the Warsaw School of Economics, Section of Mathematical Statistics. She has worked as a statistical consultant at the Multivariate Statistics Group, GfK Polonia, for five years. Her research interests focus on Bayesian inference (non-parametric Bayesian approaches, hierarchical models, empirical Bayes, Bayesian clustering) as well as classical multivariate data analysis (finding structure in data, robust clustering, clusterability assessment, robustness to collinearity within the data, Shapley value regression).