
Pattern component modeling: A flexible approach for understanding the representational structure of brain activity patterns

Jörn Diedrichsen a,b,c,*, Atsushi Yokoi a,e, Spencer A. Arbuckle a,d

a Brain and Mind Institute, Western University, Canada
b Department of Statistical and Actuarial Sciences, Western University, Canada
c Department of Computer Science, Western University, Canada
d Department of Neuroscience, Western University, Canada
e Graduate School of Frontier Biosciences, Osaka University, Japan

Abstract

Representational models specify how complex patterns of neural activity relate to visual stimuli, motor actions, or abstract thoughts. Here we review pattern component modeling (PCM), a practical Bayesian approach for evaluating such models. Similar to encoding models, PCM evaluates the ability of models to predict novel brain activity patterns. In contrast to encoding models, however, the activities of individual voxels across conditions (activity profiles) are not directly fitted. Rather, PCM integrates over all possible activity profiles and computes the marginal likelihood of the data under the activity profile distribution specified by the representational model. By using an analytical expression for the marginal likelihood, PCM allows the fitting of flexible representational models, in which the relative strength and form of the encoded feature spaces can be estimated from the data. We present here a number of different ways in which such flexible representational models can be specified, and how models of different complexity can be compared. We then provide a number of practical examples from our recent work in motor control, ranging from fixed models to more complex non-linear models of brain representations. The code for the fitting and cross-validation of representational models is provided in an open-source software toolbox.

Keywords: Multi-voxel pattern analysis, fMRI, Bayesian models, motor representations

1. Introduction

The study of brain representations aims to illuminate the relationship between complex patterns of activity in the brain and "things in the world" - be it objects, actions, or abstract concepts. By understanding the internal syntax of brain representations, and especially how the structure of representations varies across different brain regions, we ultimately hope to gain insight into the way the brain processes information.

* Corresponding author: Brain and Mind Institute, Natural Sciences Centre, Western University, London, Ontario, N6A 5B7, Canada

Email address: [email protected] (Jörn Diedrichsen)

Preprint submitted to NeuroImage August 29, 2017


Central to the definition of representation is the concept of decoding (deCharms and Zador, 2000). A feature (i.e. a variable that describes some aspect of the "things in the world") that can be decoded from the ongoing neural activity in a region may be said to be represented there. For example, a feature could be the direction of a movement, the orientation and location of a visual stimulus, or the semantic meaning of a word. Of course, if we allow the decoder to be arbitrarily complex, we would use the term representation in the most general sense. For example, using a computer vision algorithm, one may be able to identify objects based on activity in primary visual cortex. However, we may not necessarily conclude that object identity is represented in V1 - at least not explicitly. Therefore, it makes sense to restrict our definition of a representation explicitly to features that can be linearly decoded from the population activity (DiCarlo et al., 2012; DiCarlo and Cox, 2007; Kriegeskorte, 2011; Diedrichsen and Kriegeskorte, 2017).

While decoding approaches are very popular in the study of multi-voxel activity patterns (Haxby et al., 2001; Norman et al., 2006; Pereira et al., 2009), they are not particularly useful when making inferences about the nature of brain representations. The fact that we can decode feature X well from region A does not imply that the representation in A is well characterized by feature X - there may be many other features that better describe the activity patterns in this region.

Encoding analysis offers a solution to this problem, as it characterizes how well the activities in a specific region can be explained by a given set of features. Each column in the data matrix (Fig. 1a) is an activity profile of a single voxel. Note that in the following we will use the term voxel interchangeably with the more general term measurement channel, which could, depending on the measurement modality, refer to a single neuron, an electrode, or a sensor. In encoding analyses, each activity profile is modeled as the linear combination of a set of features. Each voxel therefore has its own set of parameters (W) that determine the weight of each feature. This approach can be visualized by plotting the activity profile of each voxel in the space spanned by the experimental conditions (Fig. 1b). Each dot refers to the activity profile of one voxel, indicating how strongly the voxel is activated by each condition. Estimating the weights is equivalent to a projection of each of the activity profiles onto the feature vectors. The quality of the model can then be evaluated by determining how well unseen activity data can be predicted. When estimating the weights, encoding models often use some form of regularization, which essentially imposes a prior on the feature weights. This prior is an important component of the model: it determines a predicted distribution of the activity profiles (Diedrichsen and Kriegeskorte, 2017). An encoding model that best matches the real distribution of activity profiles will show the best prediction performance.




Figure 1: Decoding, encoding, and representational models. (A) Activity data are measured across a number of conditions and voxels (or, more generally, measurement channels). The data can be viewed in terms of activity patterns (rows) or activity profiles (columns). The data can be used to decode specific features that describe the experimental conditions (decoding). Alternatively, a set of features can be used to predict the activity data (encoding). Rather than fitting voxel parameters, representational models work at the level of a sufficient statistic (the second moment, G) of the activity profiles. Models are also formulated in terms of their predicted second moment. Different feature models can be flexibly combined using second-level parameters (θ). (B) In encoding analysis, the activity profile of each voxel is viewed as a point in the space of the experimental conditions. The distribution of activity profiles is then described by a set of features (red vectors). (C) In PCM, the distribution of activity profiles is directly fitted using a multivariate normal distribution. (D) Representational similarity analysis (RSA) provides an alternative view by plotting the activity patterns in the space defined by the different voxel activities. The distances between activity patterns serve here as the sufficient statistic, which is fully defined by the second moment matrix.


The interpretational problem for encoding models is that for each feature set that predicts the data well, there is an infinite number of other (rotated) feature sets that describe the same distribution of activity profiles and, hence, predict the data equally well. The argument may be made that to understand brain representations, we should not think about specific features that are encoded, but rather about the distribution of activity profiles. This can be justified by considering a read-out neuron that receives input from a population of neurons. From the standpoint of this neuron, it does not matter which neuron has which activity profile (as long as it can adjust its input weights), nor which features were chosen to describe these activity profiles - all that matters is that the information can be read out from the code. Thus, from this perspective it may be argued that the formulation of specific feature sets and the fitting of feature weights for each voxel are unnecessary distractions in understanding the information content of the activity patterns.

Therefore, the approach of pattern component modeling (PCM) abstracts from specific activity patterns. This is done by summarizing the data using a suitable summary statistic (Fig. 1a) that describes the shape of the activity profile distribution (Fig. 1c). The critical characteristic is the covariance matrix (or, more generally, the second moment) of the activity profile distribution. The second moment determines how well we can linearly decode any feature from the data. If, for example, the activity measured for two experimental conditions is highly correlated in all voxels, then the difference between these two conditions will be very difficult to decode. If, however, the activities are uncorrelated, then decoding will be very easy. Thus, the second moment is a central statistical quantity that determines the representational content of the brain activity patterns of an area (Diedrichsen and Kriegeskorte, 2017).

Similarly, a representational model is formulated in PCM not by its specific feature set, but by its predicted second moment matrix. If two feature sets have the same second moment matrix, then the two models are equivalent. Thus, PCM makes hidden equivalences between encoding models explicit. To evaluate models, PCM simply compares the likelihood of the data under the distribution predicted by the model. To do so, we rely on a generative model of brain activity data, which fully specifies the distribution and relationship between the random variables. Specifically, the true activity profiles are assumed to have a multivariate Gaussian distribution, and the noise is also assumed to be Gaussian with known covariance structure. Having a fully specified generative model allows us to calculate the marginal likelihood of the data under the model, averaged over all possible values of the feature weights. This results in the so-called model evidence, which can be used to compare different models directly, even if they have different numbers of features. In summarizing the data using a sufficient statistic, PCM is closely linked to representational similarity analysis (RSA), which characterizes the second moment of the activity profiles in terms of the distances between activity patterns (Fig. 1d).


In a previous paper, we have extensively compared these three approaches (Diedrichsen and Kriegeskorte, 2017) in their ability to make inferences on so-called fixed representational models, i.e. models that consist of one specific feature set or predicted representational structure. We showed that, if the assumptions of the generative model are met, PCM provides the likelihood-ratio test between different models and therefore outperforms both encoding analysis and RSA in the number of correct model inferences, as predicted by the Neyman-Pearson Lemma (Neyman and Pearson, 1933).

In this paper, we focus on a different advantage of PCM, namely that it removes the requirement to fit and cross-validate individual voxel weights by providing a single analytical expression for the marginal likelihood of the data under the model. This enables the user to easily fit second-level parameters, namely model parameters that determine the shape of the distribution of activity profiles. From the perspective of encoding models, these would be parameters that change the form of the feature matrix. For example, we can fit the distribution of activity profiles using a weighted combination of three different feature sets (Fig. 1a). Such component models (see section 2.2.2) are extremely useful if we hypothesize that a region cares about different groups of features (e.g. colour, size, orientation), but we do not know how strongly each feature is represented. In encoding models, this would be equivalent to providing separate scaling factors for different parts of the feature matrix (de Heer et al., 2017).

In section 2, we will present the fundamentals of the generative approach taken in PCM and outline different ways in which flexible representational models with free parameters can be specified. We will then discuss methods for model fitting and model evaluation. In section 3, we provide three illustrative examples from our work on finger representations in primary sensory and motor cortices, highlighting different representational models and ways of making inferences with them. We also show that PCM continues to make stable inferences when the assumption of Gaussianity is violated. The methods presented in this paper are available as an open-source Matlab toolbox (Diedrichsen et al., 2016).

2. Methods

2.1. Generative Model

Central to PCM is a generative model of the measured brain activity data Y, an N × P matrix of activity measurements, referring to N time points (or trials) and P voxels (for notation, see Appendix A). The data can refer to the minimally preprocessed raw activity data, or to already deconvolved activity estimates, such as those obtained as beta weights from a first-level time series model. U is the matrix of true activity patterns (a conditions × voxels matrix) and Z the design matrix. Also influencing the data are effects of no interest B and noise:


Y = ZU + XB + ε    (1)

u_p ∼ N(0, G)    (2)

ε_p ∼ N(0, Sσ²)    (3)
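
To make the notation concrete, the following sketch simulates data from this generative model under simplifying assumptions (i.i.d. noise, S = I, and run-wise intercepts as the only effects of no interest). All variable names and sizes are illustrative; this is not code from the PCM toolbox.

```python
import numpy as np

# Illustrative simulation of the generative model in Eqs. 1-3 (not the PCM toolbox).
rng = np.random.default_rng(0)
n_cond, n_run, n_vox = 5, 8, 100          # conditions, imaging runs, voxels
N = n_cond * n_run                        # number of activity measurements

# Design matrix Z: each row (trial) indicates its condition
Z = np.kron(np.ones((n_run, 1)), np.eye(n_cond))
# Fixed effects X: one intercept per run (an effect of no interest)
X = np.kron(np.eye(n_run), np.ones((n_cond, 1)))

# Hypothetical second moment matrix G of the activity profiles (positive semi-definite)
A = rng.standard_normal((n_cond, n_cond))
G = A @ A.T / n_cond

# True activity profiles u_p ~ N(0, G); U is conditions x voxels
U = np.linalg.cholesky(G + 1e-8 * np.eye(n_cond)) @ rng.standard_normal((n_cond, n_vox))
B = rng.standard_normal((n_run, n_vox))   # effects of no interest
sigma2 = 1.0
E = np.sqrt(sigma2) * rng.standard_normal((N, n_vox))  # noise with S = I

Y = Z @ U + X @ B + E                     # Eq. 1
```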

There are five assumptions in this generative model. First, the activity profiles (u_p, the columns of U) are considered to be random variables. Representational models therefore do not specify the exact activity profiles of specific voxels, but simply the characteristics of the distribution from which they originate. Said differently, PCM is not interested in which voxel has which activity profile - it ignores their spatial arrangement. This makes sense considering that activity patterns can vary widely across different participants (Ejaz et al., 2015) and do not directly impact what can be decoded from a region. For this, only the distribution of these activity profiles in the region is important. The decision to treat u_p as a random variable makes PCM different from classical linear multivariate models, such as MANOVA or MANCOVA. Rather, the imposition of a prior on u_p puts PCM in the class of hierarchical Bayesian linear models.

The second assumption is that the mean activity pattern (the mean activity of each voxel across conditions) is modeled using the effects of no interest. Therefore, X should include an intercept term that models the condition-independent mean of each activity profile. While one could also artificially remove the mean of each condition across voxels (Walther et al., 2016), this approach would remove differences that, from the perspective of decoding and representation, are highly meaningful (Diedrichsen and Kriegeskorte, 2017).

The third assumption is that both the true activity profiles and the noise come from a multivariate Gaussian distribution. This assumption is common to many analysis approaches to fMRI data, and it enables the analytical derivation of the marginal likelihood. The main consequence of the Gaussian assumption is that it causes PCM to focus on the second moment as the sufficient statistic, as the first two moments completely determine the multivariate Gaussian distribution. We believe that this assumption is sensible for three reasons. First, even if the true distribution of the activity profiles is better described by a non-Gaussian distribution, the focus on the second moment makes sense, as it characterizes the linear decodability of any feature of the stimuli (Diedrichsen and Kriegeskorte, 2017). Secondly, while real fMRI data typically show some deviations from Gaussianity, these violations are usually mild due to the large-scale averaging inherent in the methods. Finally, inferences with PCM remain robust and superior relative to other competing methods, even when the data are non-Gaussian (see section 3.4).

Fourthly, the model assumes that different voxels are independent of each other. If we used raw data, this assumption would clearly be violated, given the strong spatial correlation of noise processes in fMRI data. To reduce these dependencies we typically use spatially pre-whitened data, which is divided by an estimate of the spatial covariance matrix (Walther et al., 2016; Diedrichsen and Kriegeskorte, 2017). One complication here is that spatial pre-whitening usually does not remove spatial dependencies completely, given the estimation error in the spatial covariance matrix (Diedrichsen et al., 2016). While residual dependencies can be taken into account when calculating the likelihood (Diedrichsen et al., 2016), this will not alter the rank-ordering of different model likelihoods, but simply their scaling.

Finally, we assume that the temporal covariance of the activity data is known up to a constant term σ². The original formulation of PCM used a model which assumed that the noise is also temporally independent and identically distributed (i.i.d.) across different trials, i.e. S = I. However, as pointed out recently (Cai et al., 2016), this assumption is often violated in non-random experimental designs, with strong biasing consequences for estimates of the covariance matrix. PCM therefore allows the user to provide an estimate of the true covariance structure of the noise (S), or to derive parts of the noise structure within the model-fitting procedure (see section 2.3).

2.2. Representational Model Types

Given our definition (in section 2.1), a representational model is fully specified by the second-moment matrix of the activity profiles (G). Many models, including encoding models with a fixed feature set, predict a specific structure of this matrix (fixed models). In other situations we may wish to estimate the most likely structure of G from the data without constraints (free models). The most interesting cases, however, are models that impose some constraints on the possible structure of G, with the exact form depending on some additional model parameters θ.

2.2.1. Fixed models

In fixed models, the second moment matrix G is exactly predicted by the model. The simplest example is the Null model, which states that G = 0. This is equivalent to assuming that there is no difference between the activity patterns measured under any of the conditions. The Null model is useful if we want to test whether there are any differences between experimental conditions.

Fixed models also occur when the representational structure is predicted from independent data. An example for this is shown in section 3.1, where we predict the structure of finger representations directly from the correlational structure of finger movements in everyday life (Ejaz et al., 2015). Importantly, fixed models only predict the second moment matrix up to a proportional constant. The width of the distribution will vary with the overall signal-to-noise level (assuming we use pre-whitened data). Thus, when evaluating fixed models we allow the predicted second moment matrix to be scaled by an arbitrary positive constant, exp(θ_s). Encoding models with a specific feature set are also equivalent to a fixed PCM model. The estimation of the common ridge coefficient in encoding analysis takes the place of the determination of the signal (θ_s) and noise (θ_ε) parameters in PCM.

2.2.2. Component models

A more flexible approach is to express the second moment matrix as a linear combination of different components. For example, the representational structure of activity patterns in the human object recognition system in inferior temporal cortex can be compared to the response of a convolutional neural network that is shown the same stimuli (Khaligh-Razavi and Kriegeskorte, 2014). Each layer of the network predicts a specific structure of the second moment matrix and therefore constitutes a fixed model. However, the real representational structure seems to be best described by a mixture of multiple layers. In this case, the overall predicted second moment matrix is a weighted sum of component matrices:

G = Σ_h exp(θ_h) G_h.    (4)

The weights for each component need to be positive - allowing negative weights would not guarantee that the overall second moment matrix would be positive definite. Therefore we use the exponential of the weighting parameter here, such that we can use unconstrained optimization to estimate the parameters.
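
As a minimal illustration of Eq. 4, the following sketch builds the predicted second moment matrix from a set of hypothetical components; the two example component matrices are arbitrary and not taken from the text.

```python
import numpy as np

def component_G(theta, G_components):
    """Predicted second moment matrix G = sum_h exp(theta_h) * G_h (Eq. 4).

    theta        : vector of unconstrained component weights
    G_components : list of (conditions x conditions) positive semi-definite matrices
    """
    return sum(np.exp(th) * Gh for th, Gh in zip(theta, G_components))

# Example with two hypothetical components for 5 conditions
G1 = np.eye(5)            # e.g. all conditions equally distinct
G2 = np.ones((5, 5))      # e.g. a component shared by all conditions
G = component_G(np.array([0.0, -1.0]), [G1, G2])
```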

2.2.3. Feature models

A representational model can also be formulated in terms of the features that are thought to be encoded in the voxels, as done in encoding approaches. Features are hypothetical tuning functions, i.e. models of what activation profiles of single neurons could look like. Examples of features would be Gabor elements for lower-level vision models (Kay et al., 2008), elements with cosine tuning functions for different movement directions for models of motor areas (Eisenberg et al., 2010), and semantic features for association areas (Huth et al., 2016). The actual activity profile of each voxel is a weighted combination of the features, u_p = M w_p. To derive the predicted second moment matrix of the activity profiles from the feature set, we need to make an assumption about the distribution of w_p. Analogous to the use of ridge regression in encoding analysis (Diedrichsen and Kriegeskorte, 2017), one option is to assume that all features are equally strongly and independently encoded, i.e. E(w_p w_p^T) = I. Under this assumption the second moment becomes G = MM^T. A feature model can now be flexibly parametrized by expressing the feature matrix as a weighted sum of linear components:

M = Σ_h θ_h M_h    (5)


Each parameter θ_h determines how strongly the corresponding set of features is represented across the population of voxels. Note that this parameter is different from the actual feature weights W. Under this model, the second moment matrix becomes

G = MM^T = Σ_h θ_h² M_h M_h^T + Σ_i Σ_{j≠i} θ_i θ_j M_i M_j^T.    (6)

From the last expression we can see that, if features that belong to different components are independent of each other, i.e. M_i M_j^T = 0, then a feature model is equivalent to a component model with G_h = M_h M_h^T. The only technical difference is that we use the square of the parameter θ_h, rather than its exponential, to enforce non-negativity. Thus, component models assume that the different features underlying each component are encoded independently in the population of voxels - i.e. knowing something about the tuning to a feature of component A does not tell you anything about the tuning to a feature of component B. If this cannot be assumed, then the representational model is better formulated as a feature model.

In summary, feature and component models test how well combinations of feature sets or feature spaces can explain the neuronal data, very similar to the variance partitioning approach for encoding models (de Heer et al., 2017; Lescroart et al., 2015). The advantage of performing this analysis in PCM is that the optimal weighting of each feature space in the combination can be easily determined. Note also that MANOVA models with factorial designs (Allefeld and Haynes, 2014) can be written as component or feature models, with one component for each main effect or interaction.
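
The equivalence between feature and component models noted above can be checked numerically. The sketch below uses two small, hypothetical feature sets whose rows occupy orthogonal feature dimensions, so that M_1 M_2^T = 0 and the cross terms in Eq. 6 vanish.

```python
import numpy as np

# Two hypothetical feature sets for 4 conditions that occupy orthogonal feature
# dimensions, so that M1 @ M2.T = 0.
M1 = np.array([[1., 0.], [1., 0.], [-1., 0.], [-1., 0.]])
M2 = np.array([[0., 1.], [0., -1.], [0., 1.], [0., -1.]])
theta = np.array([0.7, 1.3])

# Feature model: M = sum_h theta_h * M_h, G = M M^T (Eqs. 5 and 6)
M = theta[0] * M1 + theta[1] * M2
G_feature = M @ M.T

# Equivalent component model: G = sum_h theta_h^2 * M_h M_h^T
G_component = theta[0]**2 * (M1 @ M1.T) + theta[1]**2 * (M2 @ M2.T)

assert np.allclose(G_feature, G_component)   # cross terms vanish because M1 @ M2.T = 0
```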

2.2.4. Nonlinear models

The most flexible way of defining a representational model is to express the second moment matrix as a non-linear (matrix-valued) function, G = F(θ_m). While often a representational model can be expressed as a component or feature model, sometimes this is not possible. One example is a representational model in which the width of the tuning curve (or the width of the population receptive field) is a free parameter (Dumoulin and Wandell, 2008). Such parameters would influence the features, and hence also the second-moment matrix, in a non-linear way. Computationally, such non-linear models are not much more difficult to estimate than component or feature models, assuming that one can analytically derive the matrix derivatives ∂G/∂θ for each non-linear parameter. We provide an example of a very useful non-linear model below (see section 3.3).


2.2.5. Free models

The most flexible representational model is the free model, in which the predicted second moment matrix is unconstrained. Thus, the estimation of this model simply derives the maximum-likelihood estimate of the second-moment matrix in the presence of noise. A free model can be useful for a number of reasons. First, we may want to estimate the likelihood of the data under a free model to obtain a noise ceiling - i.e. an estimate of how well the most flexible model can fit the data (see section 2.8). Secondly, we may want an estimate of the second moment matrix to derive the corrected correlation between different patterns, which is less influenced by noise than the simple correlation estimate (Cai et al., 2016; Diedrichsen et al., 2011). Note, however, that this estimator still has the problem of being biased in small samples (see section 3.2). In estimating an unconstrained G, it is important to ensure that the estimate will still be a positive definite matrix. For this purpose, we express the second moment as the product of an upper-triangular matrix with its transpose, G = AA^T (Cai et al., 2016; Diedrichsen et al., 2011). The parameters θ_m are then simply all the upper-triangular entries of A.
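
A minimal sketch of this parameterization, mapping a parameter vector θ_m onto the entries of an upper-triangular matrix A so that G = AA^T is positive (semi-)definite by construction; the function and variable names are illustrative only.

```python
import numpy as np

def free_model_G(theta, n_cond):
    """Free model: G = A A^T with A upper-triangular (theta holds its non-zero entries)."""
    A = np.zeros((n_cond, n_cond))
    A[np.triu_indices(n_cond)] = theta     # fill the upper triangle row by row
    return A @ A.T

n_cond = 5
n_param = n_cond * (n_cond + 1) // 2       # number of free parameters
theta = np.random.default_rng(1).standard_normal(n_param)
G = free_model_G(theta, n_cond)
assert np.all(np.linalg.eigvalsh(G) >= -1e-10)   # positive semi-definite by construction
```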

2.3. Noise assumptions

To complete our generative model, we need to introduce some assumptions about the distribution of the noise. In general, we assume that the noise comes from a multivariate normal distribution with covariance matrix Sσ² (see section 3.4 for violations of this assumption). What is a reasonable noise structure to assume? First, the data can usually be assumed to be independent across imaging runs. If the data are regression estimates from a first-level model, and if the design of the experiment is balanced, then it is usually also permissible to make the assumption that the noise is independent within each imaging run, S = I (Diedrichsen et al., 2011). Usually, however, the regression coefficients from a single imaging run show positive correlations with each other. This is due to the fact that the regression weights measure the activation during a condition as compared to a resting baseline, and the resting baseline is common to all conditions within the run (Diedrichsen et al., 2011). To account for this, one can model the mean activation (across conditions) for each voxel with a separate fixed effect for each run. This effectively accounts for any uniform correlation.

Usually, assuming equal correlations of the activation estimates within a run is only a rough approximation to the real covariance structure. A better estimate can be obtained by using an estimate derived from the design matrix and the estimated temporal autocorrelation of the raw signal. As pointed out recently (Cai et al., 2016), the structure of some experimental designs can have a substantial biasing influence on the estimation of the second moment matrix. This is especially evident in cases where the sequence of trials is not random, but in which one specific condition is more likely to be followed by another specific condition. In these cases, the accuracy of inference hinges critically on the quality of our estimate of the temporal auto-covariance structure of the true noise S. Note that it has been recently demonstrated that, especially for high sampling rates, a simple autoregressive model of the noise is insufficient (Eklund et al., 2012).

The last option is to estimate the covariance structure of the noise from the data itself within the PCM model. This can be achieved by introducing random effects into the generative model (Eq. 1) that account for the noise covariance structure. One example used here is to assume that the data are independent within each imaging run, but share an unknown covariance, which is then estimated as a part of the covariance matrix (Diedrichsen et al., 2011). While this approach is similar to just removing the run mean from the data as a fixed effect (see above), it is a good strategy if we actually want to model the difference of each activation pattern against the resting baseline. When treating the mean activation pattern in each run as a random effect, the algorithm finds a compromise between how much of the shared pattern in each run to ascribe to random run-to-run fluctuations, and how much to ascribe to a stable mean activation. This approach is used in example 3.3.

2.4. Likelihood and Optimization

Given our generative model, we can derive the likelihood of the data under the model. Importantly, we do not want the likelihood for specific values of the estimates of the true activity patterns U. This is a key difference from encoding approaches, in which we would estimate the values of U by estimating the feature weights W. In PCM, we want to assess how likely the data are under any possible value of U, as specified by the prior distribution. Thus, we wish to calculate the marginal likelihood

p(Y|θ) = ∫ p(Y|U, θ) p(U|θ) dU.    (7)

Given that all the involved distributions are normal, the marginal distribution of a single measured activity profile y_i (given the fixed effects estimates b_i) can be derived as

p(y_i | θ, b_i) = N(Xb_i, V(θ))
V(θ) = Z G(θ) Z^T + S σ²_ε
σ²_ε = exp(θ_ε)    (8)

To make our marginal likelihood unconditional on the fixed effects, we need to take into account that we estimate b_i from the data and then use the residuals to calculate the likelihood. This removal can be written as a pre-multiplication with the residual-forming matrix

R(θ) = I − X (X^T V(θ)^{-1} X)^{-1} X^T V(θ)^{-1},    (9)

with the residual after removing the fixed effects being RY. With this in hand, our final marginal log-likelihood of all data (assuming independence of the voxels) is

log p(Y|θ) = −(NP/2) ln(2π) − (P/2) ln|V(θ)| − (1/2) trace(YY^T V(θ)^{-1} R(θ)) − (P/2) ln|X^T V(θ)^{-1} X|,    (10)

where the last term accounts for the removal of the fixed effects. The interested reader can find a full derivation of the log-likelihood and its derivatives with respect to the parameters in the manual accompanying our PCM toolbox. For maximum likelihood estimation, we can use conjugate gradient descent, Newton-Raphson (Lindstrom and Bates, 1988), or Expectation-Maximization (McLachlan and Krishnan, 1997). Fast implementations of these algorithms are available and described in the toolbox (Diedrichsen et al., 2016).
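
For readers who want to see the computation spelled out, the following naive implementation evaluates Eqs. 8-10 directly with numpy for a given G, assuming S = I; it is an illustration only, not the (faster) routines provided in the toolbox.

```python
import numpy as np

def pcm_marginal_loglik(theta_eps, G, Y, Z, X):
    """Naive evaluation of the marginal log-likelihood in Eq. 10, assuming S = I.

    G : predicted second moment matrix (conditions x conditions)
    Y : data (N x P), Z : design matrix (N x conditions), X : fixed effects (N x Q)
    """
    N, P = Y.shape
    sigma2 = np.exp(theta_eps)                        # noise variance (Eq. 8)
    V = Z @ G @ Z.T + sigma2 * np.eye(N)              # V = Z G Z^T + S sigma^2
    Vi = np.linalg.inv(V)
    XtViX = X.T @ Vi @ X
    R = np.eye(N) - X @ np.linalg.solve(XtViX, X.T @ Vi)   # residual-forming matrix (Eq. 9)
    return (-N * P / 2 * np.log(2 * np.pi)
            - P / 2 * np.linalg.slogdet(V)[1]
            - 0.5 * np.trace(Y @ Y.T @ Vi @ R)
            - P / 2 * np.linalg.slogdet(XtViX)[1])
```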

2.5. Cross-validated estimate of the second moment matrix

Instead of the full optimization of Eq. 10, we can also obtain simpler estimates of the second moment matrix. The first option is to estimate the average activity patterns for each condition using linear regression, and then to calculate the second moment matrix of these activity patterns:

Û = (Z^T S^{-1} Z)^{-1} Z^T S^{-1} Y    (11)

Ĝ_simple = ÛÛ^T / P    (12)

The problem with this naive estimate is that it is positively biased by the measurement noise. In the absence of fixed effects, the expected value of this estimate is

E(Ĝ_simple) = G + (Z^T S^{-1} Z)^{-1} σ²_ε.    (13)

An unbiased estimate of G can be obtained by cross-validation. For each of the independent partitions of the data m, we estimate the activity patterns from that run, Û_(m), and from all other runs, Û_(∼m), separately. Then, we can calculate the average matrix product over these independent partitions of the data:

Ĝ_cv = (1/M) Σ_m Û_(m) Û_(∼m)^T / P.    (14)


Because in each product we multiply noise terms of independent partitions of the data, we obtain an unbiased estimate of the second moment matrix:

E(Ĝ_cv) = G.    (15)

While this estimate of the second moment matrix is unbiased, it is not guaranteed to be positive definite. Thus it is not equivalent to the maximum-likelihood estimate obtained by maximizing Eq. 10. For data visualization and starting values, however, this estimate is extremely useful. Furthermore, distances calculated from this estimate are equivalent to the crossnobis distance estimator, described in detail elsewhere (Diedrichsen and Kriegeskorte, 2017; Walther et al., 2016; Diedrichsen et al., 2016).
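
A sketch of the cross-validated estimator in Eq. 14, assuming spatially pre-whitened data so that ordinary least squares can be used within each partition (S = I) and that every run contains all conditions; the function and variable names are illustrative, not the toolbox interface.

```python
import numpy as np

def crossval_G(Y, Z, part):
    """Cross-validated second moment matrix (Eq. 14), assuming pre-whitened data (S = I).

    Y    : data (N x P)
    Z    : condition design matrix (N x K)
    part : length-N vector of partition (imaging run) labels
    """
    P = Y.shape[1]
    runs = np.unique(part)
    G = 0.0
    for m in runs:
        train, test = part != m, part == m
        # OLS estimates of the condition patterns from the held-out run and the remaining runs
        U_m = np.linalg.lstsq(Z[test], Y[test], rcond=None)[0]
        U_not_m = np.linalg.lstsq(Z[train], Y[train], rcond=None)[0]
        G = G + U_m @ U_not_m.T / P
    # In practice the result is often symmetrized, e.g. (G + G.T) / 2.
    return G / len(runs)
```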

2.6. Visualization

After fitting a representational model to the data, it is important to visualize the result. The first obvious visualization is to plot the values of the model parameters θ_m: how important are the different feature components for explaining the representation in an area?

The second, more comprehensive approach is to visualize the second-moment matrix itself. Specifically, we want to compare the fitted second moment matrix to a direct estimate of the second moment matrix from the data. For the latter, we can either use the estimate from a free model, or we can use the cross-validated estimate (Eq. 14). A very powerful way to visualize the second moment matrix is to plot the different conditions into a space spanned by the most important eigenvectors of the second moment matrix (for an example, see Fig. 3b, Fig. 5d-i). The distances between different conditions will then show the approximate distances between the patterns. Note that this yields exactly the same visualization as when performing classical multi-dimensional scaling (Borg and Groenen, 2005) on the Euclidean distances between activity patterns (Diedrichsen and Kriegeskorte, 2017).
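
This projection can be computed directly from any (e.g. cross-validated) estimate of the second moment matrix via an eigendecomposition; a minimal sketch, with the actual plotting left to the reader.

```python
import numpy as np

def project_conditions(G, n_dim=2):
    """Project conditions onto the leading eigenvectors of the second moment matrix.

    Distances between the returned points approximate the distances between the
    condition patterns (classical MDS on the Euclidean distances).
    """
    G = (G + G.T) / 2                      # symmetrize a (possibly cross-validated) estimate
    eigval, eigvec = np.linalg.eigh(G)
    order = np.argsort(eigval)[::-1][:n_dim]
    # coordinates: eigenvectors scaled by the square root of their (non-negative) eigenvalues
    return eigvec[:, order] * np.sqrt(np.maximum(eigval[order], 0))
```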

Finally, we can also visualize the actual estimated activity patterns for each condition across different voxels. This is usually done in encoding approaches to inspect the spatial distribution of activity profiles across the cortical surface. In PCM, the true activation profiles (the random effects) are never explicitly calculated, as they are integrated over when calculating the marginal likelihood (Eq. 7). However, the estimates can be derived as

Û = G(θ) Z^T V(θ)^{-1} R(θ) Y.    (16)

Similarly, for a feature model, the voxel-feature weights can be calculated as¹

Ŵ = M(θ)^T Z^T V^{-1} R Y.    (17)

¹ It is important to keep in mind that the second moment matrix (and hence the correlations or distances) calculated from the pattern estimates Û does not equal the one derived directly from our estimate G(θ). This is because the former does not reflect the uncertainty in the activation estimates, which plays a role in determining the second moment. In short, the second moment of the mean estimates does not equal the mean second moment estimate. Therefore, the activity estimates should only be used to visualize the spatial arrangement of activity patterns, not to make inferences about the representational structure.

2.7. Inference

While visualization is important to further understand the structure of neuronal representations, the main purpose of applying models to data is statistical inference. PCM leaves open a number of possibilities here. How exactly to perform inference on a representational model, especially for groups of participants, is still a subject of debate and development. As we will see, the discussion shares a lot of the problems and issues with model inference for other multivariate fMRI models, such as DCM (Penny et al., 2004). In principle, there are two ways of performing inferences from the fit of a representational model. First, we can perform inference on the parameter estimates. Alternatively, we can use the marginal likelihood of the data (Eq. 10) as an approximation of the model evidence to compare which of our candidate models provides the most appropriate description of our data.

2.7.1. Inference based on individual parameter estimates

First, we may make inferences based on the parameters of a single fitted model. The parameter may be the weight of a specific component or another metric derived from the second moment matrix. For example, the estimated correlation coefficient between conditions 1 and 2 would be r_{1,2} = G_{1,2} / √(G_{1,1} G_{2,2}). We may want to test whether the correlation between the patterns is larger than zero, or whether a parameter differs between two subject groups or two regions, or whether it changes with experimental treatments.
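
Both derived statistics mentioned in this section are simple functions of the entries of G; two small helper functions, purely for illustration.

```python
import numpy as np

def corr_from_G(G, i, j):
    """Correlation between the patterns of conditions i and j: G_ij / sqrt(G_ii * G_jj)."""
    return G[i, j] / np.sqrt(G[i, i] * G[j, j])

def dist_from_G(G, i, j):
    """(Squared) Euclidean distance between conditions i and j: G_ii + G_jj - 2 G_ij."""
    return G[i, i] + G[j, j] - 2 * G[i, j]
```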

The simplest way of testing parameters would be to use the point estimates from the model fit of each subject and apply frequentist statistics to test different hypotheses, for example using a t- or F-test. Alternatively, one can obtain estimates of the posterior distribution of the parameters using MCMC sampling (Murphy, 2012) or a Laplace approximation (Friston et al., 2007). This allows the application of Bayesian inference, such as the reporting of credible intervals.

One important limitation to keep in mind is that parameter estimates from PCM are not unbiased in small samples. This is because estimates of G are constrained to be positive definite, which means that the variance of each feature must be larger than or equal to zero. Thus, if we want to determine whether a single activity pattern is different from baseline activity, we cannot simply test our variance estimate (i.e. elements of G) against zero - it will trivially always be larger, even if the true variance is zero. Similarly, another important statistic that measures the pattern separability or classifiability of two activity patterns is the (squared) Euclidean distance, which can be calculated from the second moment matrix as d = G_{1,1} + G_{2,2} − 2G_{1,2}. Again, given that our estimate of G is positive definite, any distance estimate is constrained to be positive. To determine whether two activity patterns are reliably different, we cannot simply test these distances against zero, as the estimate will trivially be larger than zero. A better solution for inferences from individual parameter estimates is therefore to use a cross-validated estimate of the second moment matrix (Eq. 14) and the associated distances (Walther et al., 2016; Diedrichsen et al., 2016). In this case the expected value of the distances will be zero if the true value is zero. As a consequence, variance and distance estimates can become negative. These techniques, however, take us out of the domain of PCM and into the domain of representational similarity analysis (Kriegeskorte et al., 2008; Diedrichsen and Kriegeskorte, 2017).

2.7.2. Inference using model evidence

As an alternative to parameter-based inference, we can fit multiple models and compare them according to their model evidence - the likelihood of the data given the models (integrated over all parameters). In encoding models, the weights W are directly fitted to the data, and hence it is important to use cross-validation to compare models with different numbers of features. The marginal likelihood (Eq. 10) already integrates over all likely values of U, and hence W, thereby removing the bulk of the free parameters. Thus, in practice the marginal likelihood will already be close to the true model evidence.

Our marginal likelihood (Eq. 10), however, still depends on the free parameters θ. So, when comparing models, we still need to account for the risk of overfitting the model to the data. For fixed models, there are only two free parameters: one relating to the strength of the noise (θ_ε) and one relating to the strength of the signal (θ_s). This compares very favorably to the vast number of free parameters one would have in an encoding model, which is the size of W - the number of features times the number of voxels. However, even these few model parameters still need to be accounted for. We consider here four ways of doing so.

The first option is to use empirical Bayes or Type-II maximum likelihood. This means that we simply replace the unknown parameters with the point estimates that maximize the marginal likelihood. This is in general a feasible strategy if the number of free parameters is low and all models have the same number of free parameters, which is for example the case when we are comparing different fixed models. The two free parameters here determine the signal-to-noise ratio. For models with different numbers of parameters we can penalize the likelihood by (1/2) d_θ log(n), where d_θ is the number of free parameters, yielding the Bayes information criterion (BIC) as the approximation to the model evidence.




Figure 2: Cross-validation for testing representational models. (A) Within-subject cross-validation. Both the model parameters (θ_m) and the noise (θ_ε) and signal (θ_s) parameters are fitted on the training data set. The likelihood of the test data (Y_1,test) is then evaluated under those parameters. (B) Group cross-validation, here shown with the data from three subjects (Y_1-3). The model parameters are fit jointly to all but one subject, allowing each subject a separate noise and signal parameter. The likelihood of the data from the left-out subject is then evaluated under the group model parameters, after optimizing that individual's noise and signal parameters.

As an alternative option, we can use cross-validation within the individual (Fig. 2a) to prevent overfitting for more complex flexible models, as is also currently common practice for encoding models (Naselaris et al., 2011). Taking one imaging run of the data as the test set, we can fit the parameters to the data from the remaining runs. We then evaluate the likelihood of the left-out run under the distribution specified by the estimated parameters. By using each imaging run as a test set in turn, and adding the log-likelihoods (assuming independence across runs), we can thus obtain an approximation to the model evidence. Note, however, that for a single (fixed) encoding model, cross-validation is not necessary under PCM, as the activation parameters for each voxel (W or U) are integrated out in the likelihood. Therefore, it can be handled with the first option we described above.

For the third option, if we want to test the hypothesis that the representational structure in the same region is similar across subjects, we can perform cross-validation across participants (Fig. 2b). We can estimate the parameters that determine the representational structure using the data from all but one participant, and then evaluate the likelihood of the data from the left-out subject under this distribution. When performing cross-validation within individuals, a flexible model can fit the representational structure of individual subjects in different ways, making the results hard to interpret. When using the group cross-validation strategy, the model can only fit a structure that is common across participants. Different from encoding models, representational models can be generalized across participants, as we do not fit the actual activity patterns, but rather the representational structure. In a sense, this method is performing "hyperalignment" (Guntupalli et al., 2016) without explicitly calculating the exact mapping into voxel space (Eq. 17). When using this approach, we still allow each participant to have their own signal and noise parameters, because the signal-to-noise ratio is idiosyncratic to each participant's data. When evaluating the likelihood of left-out data under the estimated model parameters, we therefore plug in the ML estimates of these two parameters for each subject.

Finally, the last option is to implement a full Bayesian approach, imposing priors on all parameters and then using a Laplace approximation to estimate the model evidence (Kass and Raftery, 1995; Friston et al., 2007). While it can certainly be argued that this is the most elegant approach, we find that cross-validation at the level of model parameters provides us with a practical, straightforward, and transparent way of achieving a good approximation.

Each of these inference strategies supplies us with an estimate of the model evidence. To compare models, we then calculate the log Bayes factor, which is the difference between the log model evidences:

log B_{12} = log[ p(Y|M_1) / p(Y|M_2) ] = log p(Y|M_1) − log p(Y|M_2)    (18)

Log Bayes factors over 1 are usually considered positive evidence, and those above 3 strong evidence, for one model over the other (Kass and Raftery, 1995).

2.7.3. Group inference

How to perform group inference in the context of Bayesian model comparison is a topic of ongoing debate in neuroimaging. A simple approach is to assume that the data of each subject are independent (a very reasonable assumption) and that the true model is the same for each subject (a perhaps less reasonable assumption). This motivates the use of the log Group Bayes Factor (GBF), which is simply the sum of the individual log Bayes factors across all subjects n:

log GBF = Σ_n log B_n.    (19)

Performing inference on the GBF is basically equivalent to a fixed-effects analysis in neuroimaging, in which we combine all time series across subjects into a single data set, assuming they were all generated by the same underlying model. A large GBF could therefore potentially be driven by one or a few outliers. We believe that the GBF therefore does not provide a desirable way of performing inference on representational models - even though it has been widely used in the comparison of DCM models (Friston et al., 2003).

At the very least, the distribution of individual log Bayes factors should be reported for each model. When evaluating model evidences against a Bayesian criterion, it can be useful to use the average log Bayes factor, rather than the sum. This stricter criterion is independent of sample size, and therefore provides a useful estimate of effect size: it expresses how much better the favored model is expected to perform on a new, unseen subject. We can also use the individual log Bayes factors as independent observations that are then submitted to a frequentist test, using either a t-, F-, or nonparametric test. This provides a simple, practical approach that we will use in our examples here. Note, however, that in the context of group cross-validation, the log Bayes factors across participants are not strictly independent.
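
To make this strategy concrete, the sketch below summarizes per-subject log model evidences for two models by the individual log Bayes factors, their sum (the GBF of Eq. 19), their mean, and a one-sample t-test across subjects; the function name and the use of scipy here are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def group_bayes_summary(logev_m1, logev_m2):
    """Summarize per-subject log Bayes factors for model 1 vs. model 2 (Eqs. 18-19)."""
    log_bf = np.asarray(logev_m1) - np.asarray(logev_m2)   # one log Bayes factor per subject
    return {
        'log_GBF': log_bf.sum(),          # group Bayes factor (fixed-effects-like, Eq. 19)
        'mean_log_BF': log_bf.mean(),     # stricter, sample-size-independent criterion
        't_test': stats.ttest_1samp(log_bf, 0.0),   # frequentist test across subjects
    }
```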

Finally, it is also possible to build a full Bayesian model on the group level, assuming that the winning model is different for each subject and comes from a multinomial distribution with unknown parameters (Stephan et al., 2009).

2.8. Noise ceilings

Showing that a model provides a better explanation of the data than a simpler Null model is an important step. Equally important, however, is to determine how much of the data the model does not explain. Noise ceilings (Nili et al., 2014) provide us with an estimate of how much systematic structure (either within or across participants) is present in the data, and what proportion is truly random. In the context of PCM, this can be achieved by fitting a fully flexible model, i.e. a free model in which the second moment matrix can take any form. The non-cross-validated fit of this model provides an absolute upper bound - no simpler model will achieve a higher average likelihood. As this estimate is clearly inflated (it does not account for the parameter fit), we can also evaluate the free model using cross-validation. Importantly, we need to employ the same cross-validation strategy (within/between subjects) as used with the models of interest. If the free model performs better than our model of interest even when cross-validated, then we know that there are definitely aspects of the representational structure that the model did not capture. If the free model performs worse, it is overfitting the data, and our currently best model provides a more concise description of the data. In this sense, the performance of the free model in the cross-validated setting provides a "lower bound" to the noise ceiling. It may still be the case that there is a better model that will beat the currently best model, but at least the current model already provides an adequate description of the data. Because they are so useful, noise ceilings should become a standard reporting requirement when fitting representational models to fMRI data, as they already are in other fields of neuroscientific inquiry. The Null model and the upper noise ceiling also allow us to normalize the log model evidence to be between 0 (Null model) and 1 (noise ceiling), effectively obtaining a pseudo-R².
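
The normalization described in the last sentence amounts to a simple rescaling of the log model evidences; a one-line sketch, assuming the log evidences of the null model and of the (cross-validated) free model are available.

```python
def pseudo_r2(logev_model, logev_null, logev_ceiling):
    """Normalize the log model evidence to lie between 0 (null model) and 1 (noise ceiling)."""
    return (logev_model - logev_null) / (logev_ceiling - logev_null)
```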

3. Examples and Results

In this section we provide examples of how to build and evaluate representational models, drawn from our recent studies of movement representations. We did not always use PCM in the original papers - partly because some of the methods were not fully developed at that point, and partly because other strategies for obtaining similar results (i.e. using RSA with cross-validated distances) were more common at the time and easier to communicate. We show here how to perform these analyses using PCM in a concise manner - often allowing for more powerful inferences.

3.1. Fixed and component models: Representational structure of finger movements

In a recent paper (Ejaz et al., 2015), we studied the activity patterns associated with single finger movements in primary sensory and motor cortices. The actual patterns of neural activity were quite variable across different participants. However, when we studied the second moment matrix of the activity profile distribution (Fig. 3a), we found that these matrices were highly correlated across different individuals. As can be seen from a multi-dimensional scaling of the activity patterns (Fig. 3b), the thumb (d1) showed the most unique pattern, while the patterns of the other fingers (d2-d5) were arranged according to their neighborhood relationship. Interestingly, the pattern for the little finger (d5) was already more similar to the thumb pattern than the middle or ring finger - thus the structure was not well explained by a simple somatotopic ordering of the fingers on the cortical sheet.


Figure 3: Comparison of representational models of the structure of finger representations using PCM. (A) A cross-validated

estimate of the second moment of the activity profile distribution. The rows and columns of the matrix refer to digit 1-5.

(B) Classical multi-dimensional scaling of the representational structure. Shown are the finger patterns projected on the

eigenvectors of the second moment matrix. (C) Predicted second-moment matrix under the muscle model. (D) Predicted

second-moment matrix under the natural statistics of hand movements. (E) Pseudo-R2 based on the log-Bayes factor for

the models against a null model. The gray bar indicates the area between the lower and upper noise ceiling.


What can explain this highly invariant representation? We tested a number of hypotheses. First,

we considered the possibility that two cortical activity patterns could be more similar because the two movements involved similar sets of muscles. We therefore measured the muscle activity involved in making the single finger movements and predicted the second-moment matrix of the patterns directly from the correlation matrix of the measured muscle patterns (Fig. 3c). Alternatively, we measured the movements of the fingers during normal every-day activities (Ingram et al., 2008), hypothesizing that two fingers that commonly move together would also exhibit similar activity patterns. Thus, we predicted the

second-moment matrix of the patterns directly from the correlation matrix of the natural statistics of

finger movements (Fig. 3d).

In the original paper (Ejaz et al., 2015), we used RSA to compare the models. However, we have recently shown that the same model comparison can be performed in a more powerful fashion through PCM (Diedrichsen and Kriegeskorte, 2017). The two models give us a classic example of "fixed" models, in which we fully predict the distribution of activity profiles (up to a signal scaling constant) using external data. Per subject, we therefore only estimated the signal and noise parameters to optimize the likelihood (Eq. 10). Because both models had only these two free parameters, we used Type-II maximum likelihood (empirical Bayes) as an estimate of the model evidence.
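To make this step concrete, the following sketch shows how the Type-II maximum likelihood of a fixed model could be computed. It assumes prewhitened activity estimates Y (N x P, fixed effects removed), a condition design matrix Z, the model-predicted second moment G_model, and a simple isotropic noise model; the function names and structure are our own illustrative choices, not the PCM toolbox interface:

```python
# Minimal sketch of evaluating a fixed representational model under
# an assumed independent-noise model (illustrative only).
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(theta, Y, Z, G_model):
    """Negative marginal log-likelihood, columns of Y treated as iid N(0, V)."""
    N, P = Y.shape
    s, sigma2 = np.exp(theta)                       # signal and noise variances
    V = s * Z @ G_model @ Z.T + sigma2 * np.eye(N)  # predicted covariance of the data
    sign, logdet = np.linalg.slogdet(V)
    quad = np.trace(np.linalg.solve(V, Y @ Y.T))
    return 0.5 * (N * P * np.log(2 * np.pi) + P * logdet + quad)

def fit_fixed_model(Y, Z, G_model):
    """Optimize signal and noise parameters; return log-evidence and estimates."""
    res = minimize(neg_log_lik, x0=np.zeros(2), args=(Y, Z, G_model))
    return -res.fun, np.exp(res.x)
```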

To scale the model evidence we also fit a null model. In this case the null model predicted that the

second moment matrix would be identity - that is, all finger patterns are equally far apart from each

other. We then expressed each subject’s model evidence relative to the null model by taking the difference

between log model evidences (see section 2.7.2). Note that we could also have used a null model in which there are no differences between the activity patterns at all. This would have given us much larger log-Bayes factors against the null model, but would have left the differences between models unchanged.

We also calculated a noise ceiling by fitting a free model of the second moment matrix either to all subjects (upper bound) or to all subjects excluding the one that the fit was evaluated on (lower bound, see section 2.8).

We can then display the model evidence in the form of a “Pseudo-R2”, with zero referring to the null

model and 1 to the upper noise ceiling. As can be seen from Figure 3e, the usage model was consistently associated with higher evidence than the muscle model. This was the case in all seven participants. The average log-Bayes factor for a single individual was 81.56 (SE=24.801), suggesting overwhelming evidence for the usage over the muscle model. The usage model did not, however, reach the lower noise

ceiling, indicating that there is still systematic structure in the representations that is unexplained by the

model.

We then explored whether a combination of the two models would explain the data better than the


hand usage model alone. Thus, we built a component model, where the two component matrices (Eq.

4) corresponded to the predicted second moment matrices from the muscle and usage models, respectively.

Such a model tests the idea that some voxels in the population may be better described as tuned to

muscle activations independent of natural usage, whereas others reflect the natural statistics, with the

exact proportions of these populations unknown. While we combine only two components here, in ongoing work we are using many more components to describe the representation of movement sequences.

Adding more possible components will of course lead to a better fit to the data. Thus, to evaluate the model evidence, we here use cross-validation across subjects (Fig. 2b), estimating the mixture proportions on data from all but one subject and then evaluating the fit on the left-out subject (only allowing free scaling and noise parameters, as for the other models). In this particular case, the combination model did not perform better than the usage model alone. The weight of the muscle model was usually very low, and the log-Bayes factor between the two models was ever so slightly in favour of the simpler

usage model, indicating that the combination model overfit the data.
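Under the assumptions of the fixed-model sketch above, such a component model only changes how the predicted second moment matrix is constructed. A possible sketch (exp-transformed weights keep the mixture positive; illustrative code rather than the toolbox implementation):

```python
# Minimal sketch of a two-component model, assuming G_muscle and G_usage are
# the 5 x 5 predicted second-moment matrices of Fig. 3c and 3d (illustrative).
import numpy as np

def G_component(theta, components):
    """Second moment as a positive mixture of fixed components,
    G(theta) = sum_h exp(theta_h) * G_h (cf. Eq. 4)."""
    return sum(np.exp(th) * G_h for th, G_h in zip(theta, components))

# Example call (hypothetical weights):
# G = G_component([0.5, -2.0], [G_usage, G_muscle])
```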

3.2. Feature models: Correlation between ipsilateral and contralateral finger representations

A common application of PCM (and the one that it was originally developed for) is to estimate the true

correspondence between two activity patterns. For example, in M1, movements of both the contralateral and ipsilateral fingers evoke activity patterns that allow decoding of which finger moved (Diedrichsen

et al., 2013). The patterns evoked by ipsilateral movement are weaker than those elicited by contralateral

movement, but they appear to match on a finger-by-finger basis.

While it is easy to establish that the patterns are more correlated than expected by chance (r > 0), the

true extent of the correspondence is notoriously hard to estimate. This is because correlation estimates

are biased by noise. From a neuroscientific perspective, however, it is important to know if the true correlation is r = 1, as this would indicate that the ipsilateral movements reactivate the exact patterns of the contralateral movements. In contrast, a slightly lower correlation, e.g. r = 0.8, would provide evidence that there exists a neuronal population that uniquely encodes ipsilateral finger movements. This in turn

would suggest that both contra- and ipsilateral M1 have a function in controlling fine finger movements.

Naively, we could just estimate the correlation between finger patterns from the covariance between the patterns for the contralateral (c) and ipsilateral (i) fingers:

r = cov(u_c, u_i) / sqrt(var(u_c) var(u_i)).          (20)

For such a correlation to be informative, we first need to subtract the mean activity patterns for all

fingers of each hand, to ensure that the correlations are not simply driven by a correlation between the


overall activity patterns. Using this simple covariance estimate (Eq. 11), we obtain an average correlation between finger pairs of r = 0.28 (±0.02) (Fig. 4d), which is significantly larger than zero. However, this estimate is severely biased by noise. To show this, we simulated data using a true correlation of 1 between contralateral and ipsilateral patterns. To make the simulation realistic, we used different signal-to-noise levels for the contralateral fingers (1, 0.5, 0.1), and a strongly reduced signal-to-noise level (0.2) for the ipsilateral hand. Even with these relatively low levels of measurement noise, the simple correlation estimates are strongly biased towards 0 (Fig. 4g).
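The bias can be reproduced with a few lines of simulation. A minimal sketch, assuming 5 finger conditions, 200 voxels, and arbitrary signal and noise levels, pooling the correlation across fingers and voxels (all numbers and names are illustrative, not the settings used for Fig. 4):

```python
# Minimal sketch of the noise bias of the naive correlation (Eq. 20).
import numpy as np

rng = np.random.default_rng(0)
K, P = 5, 200
signal_c, signal_i, noise = 1.0, 0.2, 1.0

true = rng.standard_normal((K, P))                       # shared finger patterns
Uc = signal_c * true + noise * rng.standard_normal((K, P))
Ui = signal_i * true + noise * rng.standard_normal((K, P))

# subtract each hand's mean pattern across fingers before correlating
Uc -= Uc.mean(axis=0, keepdims=True)
Ui -= Ui.mean(axis=0, keepdims=True)

r = np.sum(Uc * Ui) / np.sqrt(np.sum(Uc**2) * np.sum(Ui**2))
print(r)   # well below 1, despite a true pattern correlation of 1
```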

To obtain a better estimate, we could use a cross-validated estimate of the covariance matrix (Eq. 14).

This estimate of the covariance matrix is unbiased, but is not guaranteed to be positive definite (see section

2.5). When calculating correlations based on such covariance estimates, one can encounter a number of

problems. If one of the variance estimates (diagonal entries in the matrix) is negative, the correlation is

undefined. Even if the variance estimates are forced to be positive, the denominator of Eq. 20 may become very small, leading to correlations far outside the interval [−1, 1]. To make the correlation estimate more stable, we can force the matrix to be positive semi-definite by removing eigenvectors with negative eigenvalues. Using this technique, we obtain higher estimates of the correlation (Fig. 4e). Simulations,

however, demonstrate that this correlation is still strongly biased (Fig. 4h).

Using PCM, we can model a finger-by-finger correlation using a feature model with three parameters per finger (Fig. 4b). The patterns for the contralateral fingers are modeled by finger-specific patterns w_{1-5} with a variance of 1, each multiplied by a finger-specific parameter θ_{a,i}. Thus, the predicted variance (Fig. 4c) of each contralateral finger pattern is θ_{a,i}^2. The patterns for the ipsilateral fingers are modeled as the sum of the contralateral pattern, weighted by θ_{b,i}, and an independent ipsilateral finger-specific pattern (w_{6-10}), weighted by θ_{c,i}. Accordingly, the variance of the ipsilateral patterns will be θ_{b,i}^2 + θ_{c,i}^2 and the covariance θ_{a,i}θ_{b,i}. If θ_{c,i} = 0, then the correlation is 1 and the ipsilateral patterns are only a scaled version of the contralateral patterns. If θ_{b,i} = 0, then the correlation is zero. The estimated model correlations are higher than the ones obtained with cross-validation (Fig. 4f), but are still clearly biased (Fig. 4i). Note that this is equally the case when using an unconstrained covariance model, as has been suggested by Cai et al. (2016).
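From these parameters, the model-implied correlation for each finger follows directly from the variances and covariance given above. A minimal sketch, assuming fitted values theta_a, theta_b, theta_c are available as arrays (names are illustrative):

```python
# Minimal sketch of the correlation implied by the feature model of Fig. 4b.
import numpy as np

def implied_correlation(theta_a, theta_b, theta_c):
    """cov = theta_a*theta_b, var_contra = theta_a**2,
    var_ipsi = theta_b**2 + theta_c**2, per finger."""
    cov = theta_a * theta_b
    var_c = theta_a**2
    var_i = theta_b**2 + theta_c**2
    return cov / np.sqrt(var_c * var_i)

# theta_c = 0 gives r = 1; theta_b = 0 gives r = 0
print(implied_correlation(np.array([1.0]), np.array([0.5]), np.array([0.0])))
```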

PCM, however, offers the possibility to test whether contra- and ipsilateral patterns correspond perfectly (r = 1), or whether there is only partial correspondence (r < 1), by performing formal model comparison. A model in which all θ_{c,i} are forced to be zero restricts the correlation to be r = 1. We can compare this to a model in which the correlation is free to vary. To account for the added parameters in the second model, we can again perform cross-validation across participants. As a baseline, we also consider the null model with θ_{b,i} = 0, i.e. r = 0.



Figure 4: Determining the true correlation between activity patterns. (A) Cross-validated estimate of the covariance

matrix of fingers of the contra- and ipsilateral hands. The upper right 5x5 block of the matrix indicates the co-variances of

finger patterns across hands. The fact that the diagonal of this block is higher than the off-diagonal, indicates a positive

correlation of matching finger pairs. (B) Feature model, expressing the patterns of the contralateral fingers (uc,1 − uc,5)

and ipsilateral fingers (ui,1 − ui,5) as a combination of the shared finger pattern (weighted by θa,i and θb,i, respectively) and

one that is idiosyncratic to the ipsilateral hand, weighted by θc,i. Empty parts of the matrix are zero. To complete the

model, we also added a hand-specific pattern for the contra and ipsilateral hand, weighted by θd and θe respectively. (C)

Marginal log-likelihood of the flexible model (red) and the model constraining the correlation to one (blue, θc,i = 0), relative

to the log-likelihood of the zero-correlation model. (D) Distribution of simple correlation coefficients for the 12 hemispheres.

(E) Distribution of the correlation calculated from a cross-validated estimate of the covariance matrix. (F) Distribution of

correlation coefficients derived from the PCM model. (G) Simulation assuming that the contra- and ipsilateral patterns are

perfectly correlated (r=1), with unequal signal strengths for the different hands and fingers. Correlation estimates approach 1 as the signal-to-noise ratio (x-axis) increases. (H) The same simulation for the finger-specific feature model and (I) for cross-validated

estimates. The dots reflect the estimated correlation from single simulations. Red lines indicate the average.


As can be seen from the model evidence, there is overwhelming evidence for both main models over

the null-model, with an average log BF of 67.4 (SD = 33.6), confirming the result already obtained using

simple correlations. Using PCM, we can now compare the r = 1 model to the r < 1 model. While the latter model has a slight advantage over the more constrained model, the average log-BF was close to zero. Thus, our data provide no notable evidence that the correlation is smaller than 1. We can therefore conclude that in M1 proper, ipsilateral finger movements appear to reactivate the patterns associated with the corresponding contralateral finger. Using fMRI at the current resolution (2.3 mm), we could not detect a unique population code for the ipsilateral movement. Importantly, this conclusion is only made possible using model comparison with PCM - no parameter-based approach that we know of would enable such a

conclusion.

3.3. Nonlinear model: Scaling of finger patterns with movement speed

PCM is also helpful in addressing another important question that is often asked in the context of

fMRI analysis: How do tuning curves of individual voxels change when a modulatory factor (attention,

visual contrast, movement speed, amount of training, etc.) is altered? Often, the average regional activity

changes substantially over different levels of these factors. But what happens to the activity patterns /

activity profiles themselves?

Here we answer this question in the context of the tuning of voxels for individual finger movements in

M1, and how these tuning curves change with increasing movement frequency. It is known that overall

BOLD signal increases non-linearly with movement frequency, and that these non-linearities are at least in large part caused by neuronal factors (Hermes et al., 2012; Siero et al., 2013), namely that the activity elicited by a finger press declines as the time since the last press becomes shorter. However, we currently do not know how the individual tuning functions of the voxels change with increasing movement frequency. We therefore studied the BOLD signal patterns in M1 during tapping of the 5 fingers of the right hand at 2, 4, 8 or 16 presses over 6 s.

We then defined four models of how voxel tuning curves to individual finger movements change with

increasing movement frequency. The first possibility is that the voxel activities scale with the same

(arbitrary) factor for all fingers (Fig. 5a). This would mean that the entire pattern of BOLD activity for

each finger remains structurally the same, simply being scaled by a single common factor for each increase

in movement frequency. Using PCM, we can formalize this "scaling model" by


y_{i,j} = θ^s_j f_i + ε,          (21)

f_i ∼ N(0, Σ_f),          (22)

Σ_f = A(θ_f) A(θ_f)^T,          (23)

where the activity pattern y for finger i at movement frequency j is the product of the speed scaling parameter θ^s_j and the finger pattern f_i, plus Gaussian noise ε. The finger patterns are distributed according to an arbitrary second moment matrix Σ_f, which is fit in the same way as a free model, i.e. as the square of an upper-triangular matrix A. Thus, the parameters determining the distribution of finger tuning functions (θ_f) interact multiplicatively with the parameters that determine the common scaling of activity with frequency (θ^s). This makes the model inherently non-linear. When plotting this prediction in representational space (Fig. 5d and g), one can see that the distances between finger patterns scale by a certain amount with each increase in movement frequency, but that this scaling remains linear relative to the resting activity (cross).
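The predicted second moment matrix across all 20 conditions (5 fingers at 4 frequencies) has a simple structure under this model. A minimal sketch, assuming conditions are ordered frequency-by-finger (illustrative code, not the toolbox implementation):

```python
# Minimal sketch of the predicted second moment under the scaling model.
import numpy as np

def G_scaling(theta_s, Sigma_f):
    """G[(i,j),(i',j')] = theta_s[j] * theta_s[j'] * Sigma_f[i,i']."""
    return np.kron(np.outer(theta_s, theta_s), Sigma_f)

# Example with 4 frequencies and an arbitrary 5 x 5 finger structure:
# G = G_scaling(np.array([0.5, 0.8, 1.0, 1.3]), np.eye(5))   # 20 x 20 matrix
```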

Secondly, the increase of activity with movement frequency could be related to an independent additive

background pattern (Fig. 5b). The presence of such a component could indicate that there is some non-finger-specific input to M1 that increases with increasing movement frequency (e.g. attentional modulation). Alternatively, such an additive pattern could be caused by a non-specific spreading of the BOLD signal with increasing neural activity. Under this "additive model", the finger patterns are modeled as

y_{i,j} = f_i + θ^a_j α + ε,          (24)

where the activity pattern y of finger i at movement frequency j is the sum of the finger pattern f_i, an additive background pattern α that is scaled by the parameter θ^a_j, and Gaussian noise ε. Here we defined the background pattern to be the mean activity pattern across fingers at the same movement frequency. Under this model, the distances between finger patterns in representational space would remain the same across movement frequencies. Importantly, the additive pattern simply shifts the finger patterns away from baseline in this representational space - voxels become more tuned to all finger presses in a non-specific fashion (Fig. 5e and h).

Of course it is also possible to combine the above two models (Fig. 5c). In this case the activity of

each voxel for each finger is scaled up (such as in the scaling model) and the activity for each voxel would

increase by a set amount for all fingers (as in the additive model):


y_{i,j} = θ^s_j f_i + θ^a_j α + ε.          (25)

The predicted representational structure is then a combination of the additive and scaling effects (Fig.

5f and i). Importantly, this model places no constraints on the relative contributions of the scaling and additive components for each movement frequency. Therefore, it is for example possible for pattern changes at low movement frequencies to be driven by scaling, whereas changes at high frequencies are better described by an additive component.
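The corresponding predicted second moment combines the two components. A minimal sketch, using the same condition ordering as the scaling-model sketch above and treating the background pattern as independent of the finger patterns (an illustrative simplification, not the toolbox implementation):

```python
# Minimal sketch of the predicted second moment under the combination model,
# neglecting any covariance between the background pattern and the finger patterns.
import numpy as np

def G_combination(theta_s, theta_a, Sigma_f, alpha_var=1.0):
    """Scaling component plus an additive background component:
    G[(i,j),(i',j')] ≈ theta_s[j]*theta_s[j']*Sigma_f[i,i']
                       + theta_a[j]*theta_a[j']*alpha_var."""
    n_finger = Sigma_f.shape[0]
    G_scale = np.kron(np.outer(theta_s, theta_s), Sigma_f)
    G_add = np.kron(np.outer(theta_a, theta_a), np.ones((n_finger, n_finger))) * alpha_var
    return G_scale + G_add
```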

Finally, the representational structure could also change in more complex ways, deforming with in-

creasing movement frequency. We would expect such behavior if different voxels showed different points

of BOLD saturation or other non-linear distortions. Such complex behavior would make the interpretation of the representational structure as measured with fMRI difficult, as it would clearly indicate that it can change with the overall amount of activity in a region. For this possibility we considered a free,

unconstrained model, which also serves as the noise-ceiling. Thus, this model can capture any systematic

change of population tuning with movement frequency.

With PCM, we are effectively assessing the likelihood of the collection of all measured voxel tuning

curves in M1 under each of these models. As before, we used between-subject cross-validation to compare models with different numbers of parameters. The marginal likelihoods (Fig. 5k) show that the additive model is better than the scaling model, but that both models are clearly outperformed by the noise-ceiling model. The comparison to the free (noise-ceiling) model indicates that the representational structure is nearly completely modeled using the combination of scaling and additive components. Although there is a significant difference between the combination and the noise-ceiling model (t = 3.31, p < 0.01), the average Bayes factor separating these models was very small (0.014). These results indicate that despite

some systematic distortion, the changes in the population of voxel tuning curves in M1 are well described

by a common overall scaling factor plus a finger-independent background pattern. Whether this additive

pattern reflects increased general neuronal processing, or whether it is caused by an unspecific spatial

spread of the BOLD signal is currently not clear.

This example should hopefully demonstrate the power of PCM to test well-specified and complex models of population tuning. Such models can help us understand more clearly how attention or repetition influences individual voxel tuning. In the context of repetition suppression, fatigue, sharpening, and facilitation models have been discussed (Grill-Spector et al., 2006), which are structurally very similar to the scaling and additive models presented here. PCM allows for a robust, practical and powerful way of testing such models on fMRI data.



Figure 5: Scaling of finger activity patterns with movement frequency. (A-C) Example voxel tuning curves for the 5

individuated finger movements under the three models: Scaling, Additive, and Combination models. For a voxel with the

same baseline tuning (dashed line), each model predicts a different change in tuning as finger movement frequency increases

(solid line). (D-I) Assuming the same behavior across all voxels, the scaling model (D, G), the additive model (E, H), and

the combination model (F, I) predict a particular arrangement of the patterns in representational space. The space is

defined as the first three principal components that best capture the overall representational structure. Conditions with the

same movement frequency are connected by lines. The blue crosses indicate resting baseline. Model predictions are made for

the set of best-fitting parameters from the group fit. (J) The overall estimate of the second moment matrix. (K) We then

assessed the likelihood of the data under the three models. The combination model yielded the greatest likelihood of the

three models, and performed nearly as well as the unconstrained, free model. Plotted here is the cross-validated marginal

log-likelihood, standardized by the upper noise ceiling.


3.4. Sensitivity to the Gaussian-Gaussian assumption

After three examples that demonstrate the power and flexibility of PCM, we turn to an important additional question that has not been addressed in previous papers (Diedrichsen and Kriegeskorte, 2017; Diedrichsen et al., 2011): What happens to PCM inferences if the assumption that both the signal and

the noise come from a Gaussian distribution is violated? Typically, both raw fMRI data and activity

estimates from a first-level GLM show slightly heavier tails (more outlying values) than predicted by the

Gaussian distribution. To test how this influences model selection performance of PCM, we simulated

data either according to the muscle or the usage model as described in section 3.1. To vary the non-

Gaussianity of the data, we drew the signal or noise term not from a Gaussian distribution, but from a Student t-distribution with degrees of freedom (df) between 3 (very heavy tails) and 100 (practically Gaussian). In all simulations we scaled the data such that the second moment of the signal and noise distributions remained constant even when the distributional shape changed. Inspection of the empirical data underlying example 1 showed that a t-distribution with 10-20 df was an appropriate range to describe the positive kurtosis (heavier tails) of the activity estimates that form the input data to PCM.
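Keeping the second moment constant while changing the tails only requires a simple rescaling. A minimal sketch, using the fact that a Student t variable with df > 2 has variance df/(df-2), so multiplying by sqrt((df-2)/df) yields unit variance (illustrative code, not our simulation script):

```python
# Minimal sketch of heavy-tailed samples with a fixed second moment.
import numpy as np

def scaled_t(shape, df, rng):
    """Samples from a t-distribution rescaled to unit variance (df > 2)."""
    return rng.standard_t(df, size=shape) * np.sqrt((df - 2) / df)

rng = np.random.default_rng(1)
for df in (100, 10, 3):
    x = scaled_t((200, 1000), df, rng)
    print(df, np.var(x))   # variance stays near 1 while the tails become heavier
```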

We then used either PCM, voxel-wise encoding models, or 3 different variants of RSA to compare which

of the two models provided the better explanation for the data. For each approach, the proportion of correct

model decisions was recorded. The implementation details of the different methods, and the results for

Gaussian data can be found in Diedrichsen & Kriegeskorte (2017). For nearly Gaussian data (df = 100)

PCM was the method with the largest proportion of correct model choices (Fig. 6). This is expected, as - for Gaussian data - PCM implements the likelihood-ratio test between models, which by the Neyman-Pearson lemma constitutes a theoretical upper bound on the best achievable model selection accuracy (if the two models are equally likely a priori). However, even when we increased the non-Gaussianity of the noise (Fig. 6a) or the signal (Fig. 6b), PCM robustly remained the best method.

This result may seem counterintuitive at first, as PCM is the only method that explicitly assumes a Gaussian distribution of the data. However, note that this assumption merely leads PCM to focus on the second moment as a sufficient statistic of the data - a characteristic that it shares with all the other methods tested here (Diedrichsen and Kriegeskorte, 2017). While our simulations changed higher-order moments of the noise and signal distributions, the second moment was kept constant. Therefore, the inference using all methods was basically blind to these changes. In summary, our simulations show that PCM is not only

a very powerful method when conditions are met, but that it still can provide exact (and, compared to

other competing methods, superior) model inferences on more realistic data distributions.



Figure 6: Robustness of model selection accuracy of PCM, Encoding analysis and RSA for violations of Gaussianity. (A)

Model selection accuracy for simulated data coming from the muscle and usage model of example 1. The noise was drawn

from a Student t-distribution with df = 3-100. RSA models were evaluated using a correlation without intercept, Pearson

correlation or Spearman rank-correlation (Diedrichsen and Kriegeskorte, 2017). (B) Model selection accuracy for simulated

data where the signal was drawn from a t-distribution.

4. Discussion

In this paper we have presented a statistical approach for fitting and comparing representational

models. PCM is conceptually very closely linked to encoding models, as well as to RSA (Diedrichsen and

Kriegeskorte, 2017). Because it implements a likelihood-ratio test between models (Neyman and Pearson,

1933), it can be shown to be statistically more powerful than either competing approach (Diedrichsen and

Kriegeskorte, 2017). More importantly, the analytical tractability of PCM makes it especially attractive

for the fitting and comparison of flexible representational models.

In encoding models, a separate weight for each voxel and feature is explicitly fitted. To evaluate model

fit, the prediction accuracy for left-out data is used. In contrast, PCM evaluates the marginal likelihood of

the data under the distribution predicted by the model. In this expression, the unknown activity profiles

for each voxel are integrated out (Eq. 7). This means that we can evaluate different encoding models, even

with different numbers of features, by simply comparing their marginal likelihoods. The step of integrating

over the actual activity profiles is justified under the assumption that the exact spatial arrangement of

the different activity profiles does not matter for the representational content of a brain area - even if the

neurons were differently arranged, a fully connected read-out neuron could extract the same information.

Indeed, it can be shown that the spatial arrangement of patterns associated with finger movements is

variable in primary motor cortex, whereas the representational structure - and hence likely the function -

is remarkably invariant across participants (Ejaz et al., 2015).695


PCM uses the second moment as a summary statistic for the observed data. Analogously, it also uses

the second moment to specify the representational model. It therefore constitutes an abstraction from

the use of concrete feature sets. Using features to describe a population of activity profiles has a long

tradition in neuroscience. In motor control, the response properties of single neurons in primary motor

cortex have often been characterized as being tuned to movement direction (Georgopoulos et al., 1986).

However, many studies have made it clear that representations in M1 contain a complex mixture of

features, including position (Sergio and Kalaska, 1997), direction (Schwartz et al., 1988), force (Sergio

et al., 2005), or muscle activity (Kakei et al., 1999), often with non-linear interactions between them

(Sergio and Kalaska, 1997). Indeed, it has been questioned whether a description of population activity

in terms of semantically definable features is at all sensible (Shenoy et al., 2013).

This is also one of the key lessons from the emerging field of deep learning algorithms as a model of

neural processing (Marblestone et al., 2016). Most often, the learned features found in the hidden layers

defy simple semantic description. This means that we need tools to formulate and test hypotheses about

neural representations that describe the representational space itself, without getting distracted by the

relatively superficial issue of which axes should be used to describe the space. Indeed, the representational structure found in different layers of a deep neural network can be used as a model for representational structures found in the human brain (Khaligh-Razavi and Kriegeskorte, 2014). Even if it turns out that real neural representations are distinctly different from the activation states found in artificial neural networks, deep nets form an ideal testing ground for the methods we employ to understand biological systems. If our methods cannot provide insight into the transformations occurring between the layers of

these still relatively simple networks, they most likely will tell us very little about what happens in the

brain (Jonas and Kording, 2017).

PCM has a very tight relationship to RSA, which also uses a summary statistic to compare data and model predictions. In the case of RSA, the summary statistics are the distances between all pairs of neural activity patterns. These distances usually contain the same information as the second moment matrix. However, to approach the statistical power in disambiguating models that is inherent in PCM, one needs to take into account the co-dependence between these distance estimates (Diedrichsen and Kriegeskorte,

2017).

The key advantage of PCM, however, is having an analytical expression for the marginal likelihood.

This frees the user from having to perform fitting and cross-validation of voxel-feature weights at every step, and makes it easier to explore a larger space of representational models. Note that an encoding model with a specific feature set (despite all the free voxel-feature weight parameters) is a fixed PCM model, in which only the signal and noise parameters (corresponding to the ridge coefficient) are free to vary.


More complex models, in which the second moment matrix has some degree of flexibility, would correspond to extensions, differential scaling, combinations, or nonlinear changes of the model features. Having an

analytical expression for the marginal likelihood allows us to directly estimate these parameters. For

model comparison, we can use simple cross-validation (rather than the double cross-validation necessary

in encoding approaches). It is also possible to avoid cross-validation completely by using a fully Bayesian

approach, in which a normal prior is imposed on the parameters θ. Here we chose cross-validation as a

practical and robust approach for estimating the model evidence.
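The correspondence between an encoding feature set and a fixed PCM model can be made explicit in a few lines. A minimal sketch, assuming a K x Q feature matrix M with an isotropic Gaussian prior on the voxel-feature weights, in which case the implied second moment is G = M M^T and can be passed to the fixed-model likelihood sketched in section 3.1 (illustrative only):

```python
# Minimal sketch: feature matrix -> implied second moment of activity profiles.
import numpy as np

def G_from_features(M):
    """Second moment implied by feature matrix M under an isotropic weight prior."""
    return M @ M.T

# Two rotated feature sets spanning the same space imply the same G,
# so PCM treats them as the same representational model:
# M1 = np.array([[1., 0.], [0., 1.], [1., 1.]])
# R = np.array([[np.cos(0.3), -np.sin(0.3)], [np.sin(0.3), np.cos(0.3)]])
# np.allclose(G_from_features(M1), G_from_features(M1 @ R))   # -> True
```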

Despite these practical advantages, there are a number of drawbacks and limitations to our approach.

The reader should be cautioned that, although PCM attempts to separate the contribution of signal and

noise to the second moment of the data, the estimate of the true second moment matrix is not unbiased

when using small samples. This is due to the fact that the estimate of G is constrained to be positive

definite. Hence, when the true signal variance is zero, the variance estimates will be zero or positive, causing the mean of the estimates to be larger than zero. For unbiased results, one would need to use cross-validation (Eq. 14), which sometimes also results in negative variance estimates. Note, however, that although the cross-validated second moment matrix estimate is unbiased, correlation coefficients derived from this estimate are not. In general, the restriction to positive definite estimates prevents us from being able to determine, based on the second moment matrix alone, whether a certain feature is encoded above

chance. For this, we would need to compare a representational model with and without that feature

encoded (see section 3.1).

Another limitation of PCM arises from the fact that it uses only the second moment of the data as a

sufficient statistic. In this, PCM ignores certain aspects of the neural code. First, it neglects the spatial

arrangement of the different activity profiles on the cortical sheet. One of the coding principles in the

neocortex is that nearby locations also have similar activity profiles (Graziano and Aflalo, 2007). Therefore,

the exact distribution of activity profiles across the cortical surface may contain some important insights

into information coding. On the other hand, there is evidence that the exact arrangement of activity

patterns on the cortical surface may reflect mostly random biological variation that in itself does not

speak to the computational function of the region (Ejaz et al., 2015).

Similarly, PCM ignores all higher statistical moments of the activity profile distribution. In primary

visual cortex, for example, many voxels are tuned to one specific stimulus location in the visual field,

but very few respond to two separate locations (Dumoulin and Wandell, 2008). Thus, when we use

stimulus location as our axes, the activity profiles will cluster around the axes, and their distribution will

be distinctly non-Gaussian. This does not mean that the assumption of Gaussian-distributed activity profiles (as made in PCM) would automatically lead to invalid inference (see also section 3.4). Independent


of the shape of the distribution, the second moment of the activity profiles determines which stimulus

features can be read out from the neuronal code by a fully connected neuron. The question of whether

we can find evidence for clear non-Gaussian distributions of activity profiles, and what these may imply

for the neural representations in these regions, is a highly interesting topic that warrants further study

(Norman-Haignere et al., 2013).

5. Appendix A: Notation

Table 1: For non-scalars, the second column indicates the vector / matrix size.

K Number of conditions

P Number of measurement channels (voxels, electrodes, neurons)

N Overall number of measurements

Q Number of features in model

M Number of independent partitions of the data

B J × P Regression coefficients for the J regressors of no-interest

G K ×K Second moment of up

M K × Q Matrix of model features for all conditions

R N ×N Residual forming matrix for removal of fixed effects

S N ×N Temporal variance-covariance structure of the noise

U K × P Matrix of true activation patterns

up K × 1 True activity profile for voxel p

U(m) K × P Matrix of estimated activity patterns, based on data from partition m

U(∼m) K × P Matrix of estimated activity patterns, based on data independent of partition m

V(θ) N ×N Predicted variance-covariance matrix of the data

W Q× P Matrix of all voxel-feature weights

wp Q× 1 Vector of voxel-feature weights for voxel p

X N × J Design matrix containing J regressors of no-interest

Y N × P Matrix of brain measurements, concatenated activity estimates or time series data

Z N ×K Design matrix, indicating how measurements relate to activity patterns

θε Second-level parameter for noise variance

θh Second-level parameter for signal strength of component h

θm Vector of all second-level parameters that determine the structure of G

θ Vector of all second-level parameters


Funding

This work was supported by a Scholar award of the James McDonnell foundation, and an NSERC

discovery grant, RGPIN-2016-04890, both to JD.

References

Allefeld, C., Haynes, J. D., 2014. Searchlight-based multi-voxel pattern analysis of fMRI by cross-validated MANOVA. Neuroimage 89, 345–57. DOI http://www.ncbi.nlm.nih.gov/pubmed/24296330

Borg, I., Groenen, P., 2005. Modern Multidimensional Scaling: Theory and Applications, 2nd Edition. Springer-Verlag, New York.

Cai, M. B., Schuck, N. W., Pillow, J., Niv, Y., 2016. A Bayesian method for reducing bias in neural representational similarity analysis. pp. 4952–4960.

de Heer, W. A., Huth, A. G., Griffiths, T. L., Gallant, J. L., Theunissen, F. E., 2017. The hierarchical cortical organization of human speech processing. J Neurosci 37 (27), 6539–6557. DOI https://www.ncbi.nlm.nih.gov/pubmed/28588065

deCharms, R. C., Zador, A., 2000. Neural representation and the cortical code. Annu Rev Neurosci 23, 613–47. DOI https://www.ncbi.nlm.nih.gov/pubmed/10845077

DiCarlo, J. J., Cox, D. D., 2007. Untangling invariant object recognition. Trends Cogn Sci 11 (8), 333–41. DOI https://www.ncbi.nlm.nih.gov/pubmed/17631409

DiCarlo, J. J., Zoccolan, D., Rust, N. C., 2012. How does the brain solve visual object recognition? Neuron 73 (3), 415–34. DOI https://www.ncbi.nlm.nih.gov/pubmed/22325196

Diedrichsen, J., Kriegeskorte, N., 2017. Representational models: A common framework for understanding encoding, pattern-component, and representational-similarity analysis. PLoS Comput Biol 13 (4), e1005508. DOI https://www.ncbi.nlm.nih.gov/pubmed/28437426

Diedrichsen, J., Yokoi, A., Arbuckle, S. A., 2016. Pattern Component Modelling Toolbox. DOI https://github.com/jdiedrichsen/pcm_toolbox

Diedrichsen, J., Ridgway, G. R., Friston, K. J., Wiestler, T., 2011. Comparing the similarity and spatial structure of neural representations: a pattern-component model. Neuroimage 55 (4), 1665–78. DOI http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=21256225

Diedrichsen, J., Wiestler, T., Krakauer, J. W., 2013. Two distinct ipsilateral cortical representations for individuated finger movements. Cereb Cortex 23 (6), 1362–77. DOI http://www.ncbi.nlm.nih.gov/pubmed/22610393

Diedrichsen, J., Zareamoghaddam, H., Provost, S., 2016. The distribution of cross-validated Mahalanobis distances. ArXiv.

Dumoulin, S. O., Wandell, B. A., 2008. Population receptive field estimates in human visual cortex. Neuroimage 39 (2), 647–60. DOI http://www.ncbi.nlm.nih.gov/pubmed/17977024

Eisenberg, M., Shmuelof, L., Vaadia, E., Zohary, E., 2010. Functional organization of human motor cortex: directional selectivity for movement. J Neurosci 30 (26), 8897–905. DOI http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=20592212

Ejaz, N., Hamada, M., Diedrichsen, J., 2015. Hand use predicts the structure of representations in sensorimotor cortex. Nat Neurosci 18 (7), 1034–40. DOI http://www.ncbi.nlm.nih.gov/pubmed/26030847

Eklund, A., Andersson, M., Josephson, C., Johannesson, M., Knutsson, H., 2012. Does parametric fMRI analysis with SPM yield valid results? An empirical study of 1484 rest datasets. Neuroimage 61 (3), 565–78. DOI http://www.ncbi.nlm.nih.gov/pubmed/22507229

Friston, K., Mattout, J., Trujillo-Barreto, N., Ashburner, J., Penny, W., 2007. Variational free energy and the Laplace approximation. Neuroimage 34 (1), 220–34. DOI http://www.ncbi.nlm.nih.gov/pubmed/17055746

Friston, K. J., Harrison, L., Penny, W., 2003. Dynamic causal modelling. Neuroimage 19 (4), 1273–302. DOI http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=12948688

Georgopoulos, A. P., Schwartz, A. B., Kettner, R. E., 1986. Neuronal population coding of movement direction. Science 233 (4771), 1416–1419.

Graziano, M. S., Aflalo, T. N., 2007. Mapping behavioral repertoire onto the cortex. Neuron 56 (2), 239–51. DOI http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=17964243

Grill-Spector, K., Henson, R., Martin, A., 2006. Repetition and the brain: neural models of stimulus-specific effects. Trends Cogn Sci 10 (1), 14–23. DOI https://www.ncbi.nlm.nih.gov/pubmed/16321563

Guntupalli, J. S., Hanke, M., Halchenko, Y. O., Connolly, A. C., Ramadge, P. J., Haxby, J. V., 2016. A model of representational spaces in human cortex. Cereb Cortex 26 (6), 2919–34. DOI http://www.ncbi.nlm.nih.gov/pubmed/26980615

Haxby, J. V., Gobbini, M. I., Furey, M. L., Ishai, A., Schouten, J. L., Pietrini, P., 2001. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 293 (5539), 2425–30.

Hermes, D., Siero, J. C., Aarnoutse, E. J., Leijten, F. S., Petridou, N., Ramsey, N. F., 2012. Dissociation between neuronal activity in sensorimotor cortex and hand movement revealed as a function of movement rate. J Neurosci 32 (28), 9736–44. DOI https://www.ncbi.nlm.nih.gov/pubmed/22787059

Huth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E., Gallant, J. L., 2016. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 532 (7600), 453–8. DOI http://www.ncbi.nlm.nih.gov/pubmed/27121839

Ingram, J. N., Kording, K. P., Howard, I. S., Wolpert, D. M., 2008. The statistics of natural hand movements. Exp Brain Res 188 (2), 223–36. DOI http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=18369608

Jonas, E., Kording, K. P., 2017. Could a neuroscientist understand a microprocessor? PLoS Comput Biol 13 (1), e1005268. DOI https://www.ncbi.nlm.nih.gov/pubmed/28081141

Kakei, S., Hoffman, D. S., Strick, P. L., 1999. Muscle and movement representations in the primary motor cortex. Science 285 (5436), 2136–9. DOI http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&dopt=r&uid=10497133

Kass, R. E., Raftery, A. E., 1995. Bayes factors. Journal of the American Statistical Association 90 (430), 773–795. DOI http://amstat.tandfonline.com/doi/abs/10.1080/01621459.1995.10476572

Kay, K. N., Naselaris, T., Prenger, R. J., Gallant, J. L., 2008. Identifying natural images from human brain activity. Nature 452 (7185), 352–5. DOI http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=18322462

Khaligh-Razavi, S. M., Kriegeskorte, N., 2014. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Comput Biol 10 (11), e1003915. DOI http://www.ncbi.nlm.nih.gov/pubmed/25375136

Kriegeskorte, N., 2011. Pattern-information analysis: from stimulus decoding to computational-model testing. Neuroimage 56 (2), 411–21. DOI https://www.ncbi.nlm.nih.gov/pubmed/21281719

Kriegeskorte, N., Mur, M., Bandettini, P., 2008. Representational similarity analysis - connecting the branches of systems neuroscience. Front Syst Neurosci 2, 4. DOI http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=19104670

Lescroart, M. D., Stansbury, D. E., Gallant, J. L., 2015. Fourier power, subjective distance, and object categories all provide plausible models of BOLD responses in scene-selective visual areas. Front Comput Neurosci 9, 135. DOI https://www.ncbi.nlm.nih.gov/pubmed/26594164

Lindstrom, M. J., Bates, M. B., 1988. Newton-Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. Journal of the American Statistical Association 83 (404), 1014–1022.

Marblestone, A. H., Wayne, G., Kording, K. P., 2016. Toward an integration of deep learning and neuroscience. Front Comput Neurosci 10, 94. DOI https://www.ncbi.nlm.nih.gov/pubmed/27683554

McLachlan, G. J., Krishnan, T., 1997. The EM Algorithm and Extensions. Wiley Series in Probability and Statistics. Wiley, New York.

Murphy, K. P., 2012. Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, MA.

Naselaris, T., Kay, K. N., Nishimoto, S., Gallant, J. L., 2011. Encoding and decoding in fMRI. Neuroimage 56 (2), 400–10. DOI http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=20691790

Neyman, J., Pearson, E. S., 1933. On the problem of the most efficient test of statistical hypotheses. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 231, 289–337.

Nili, H., Wingfield, C., Walther, A., Su, L., Marslen-Wilson, W., Kriegeskorte, N., 2014. A toolbox for representational similarity analysis. PLoS Comput Biol 10 (4), e1003553. DOI http://www.ncbi.nlm.nih.gov/pubmed/24743308

Norman, K. A., Polyn, S. M., Detre, G. J., Haxby, J. V., 2006. Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends Cogn Sci 10 (9), 424–30.

Norman-Haignere, S., Kanwisher, N., McDermott, J. H., 2013. Cortical pitch regions in humans respond primarily to resolved harmonics and are located in specific tonotopic regions of anterior auditory cortex. J Neurosci 33 (50), 19451–69. DOI http://www.ncbi.nlm.nih.gov/pubmed/24336712

Penny, W. D., Stephan, K. E., Mechelli, A., Friston, K. J., 2004. Comparing dynamic causal models. Neuroimage 22 (3), 1157–72. DOI https://www.ncbi.nlm.nih.gov/pubmed/15219588

Pereira, F., Mitchell, T., Botvinick, M., 2009. Machine learning classifiers and fMRI: A tutorial overview. Neuroimage 45 (1 Suppl), S199–209. DOI http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=19070668

Schwartz, A. B., Kettner, R. E., Georgopoulos, A. P., 1988. Primate motor cortex and free arm movements to visual targets in three-dimensional space. I. Relations between single cell discharge and direction of movement. J Neurosci 8 (8), 2913–27. DOI http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=3411361

Sergio, L. E., Hamel-Paquet, C., Kalaska, J. F., 2005. Motor cortex neural correlates of output kinematics and kinetics during isometric-force and arm-reaching tasks. J Neurophysiol 94 (4), 2353–78. DOI http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=15888522

Sergio, L. E., Kalaska, J. F., 1997. Systematic changes in directional tuning of motor cortex cell activity with hand location in the workspace during generation of static isometric forces in constant spatial directions. J Neurophysiol 78 (2), 1170–4. DOI http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=9307146

Shenoy, K. V., Sahani, M., Churchland, M. M., 2013. Cortical control of arm movements: a dynamical systems perspective. Annu Rev Neurosci 36, 337–59. DOI https://www.ncbi.nlm.nih.gov/pubmed/23725001

Siero, J. C., Hermes, D., Hoogduin, H., Luijten, P. R., Petridou, N., Ramsey, N. F., 2013. BOLD consistently matches electrophysiology in human sensorimotor cortex at increasing movement rates: a combined 7T fMRI and ECoG study on neurovascular coupling. J Cereb Blood Flow Metab 33 (9), 1448–56. DOI https://www.ncbi.nlm.nih.gov/pubmed/23801242

Stephan, K. E., Penny, W. D., Daunizeau, J., Moran, R. J., Friston, K. J., 2009. Bayesian model selection for group studies. Neuroimage 46 (4), 1004–17. DOI https://www.ncbi.nlm.nih.gov/pubmed/19306932

Walther, A., Nili, H., Ejaz, N., Alink, A., Kriegeskorte, N., Diedrichsen, J., 2016. Reliability of dissimilarity measures for multi-voxel pattern analysis. Neuroimage 137, 188–200. DOI http://www.ncbi.nlm.nih.gov/pubmed/26707889