OPEN ACCESS
IOP PUBLISHING, JOURNAL OF NEURAL ENGINEERING
J. Neural Eng. 10 (2013) 056020 (14pp) doi:10.1088/1741-2560/10/5/056020

Hidden Markov model and support vector machine based decoding of finger movements using electrocorticography

Tobias Wissel1,9,10, Tim Pfeiffer1, Robert Frysch1, Robert T Knight2,3, Edward F Chang2, Hermann Hinrichs4,5,6,7, Jochem W Rieger3,8 and Georg Rose1

1 Chair for Healthcare Telematics and Medical Engineering, Otto-von-Guericke-University Magdeburg, Postfach 4120, D-39016 Magdeburg, Germany
2 Department of Neurological Surgery, University of California San Francisco, 505 Parnassus Ave, M-779, San Francisco, CA 94143-0112, USA
3 Department of Psychology and the Helen Wills Neuroscience Institute, University of California Berkeley, 132 Barker Hall, Berkeley, CA 94720-3190, USA
4 Clinic of Neurology, Otto-von-Guericke-University Magdeburg, Leipziger Straße 44, D-39120 Magdeburg, Germany
5 Leibniz-Institute for Neurobiology, Brenneckestraße 6, D-39118 Magdeburg, Germany
6 German Center for Neurodegenerative Diseases (DZNE), Leipziger Straße 44, D-39120 Magdeburg, Germany
7 Center of Behavioural Brain Sciences (CBBS), Universitätsplatz 2, D-39106 Magdeburg, Germany
8 Applied Neurocognitive Psychology, Faculty VI, Carl-von-Ossietzky University, D-26111 Oldenburg, Germany

E-mail: wissel@rob.uni-luebeck.de

Received 28 May 2013
Accepted for publication 28 August 2013
Published 18 September 2013
Online at stacks.iop.org/JNE/10/056020

Abstract
Objective. Support vector machines (SVM) have developed into a gold standard for accurate classification in brain–computer interfaces (BCI). The choice of the most appropriate classifier for a particular application depends on several characteristics in addition to decoding accuracy. Here we investigate the implementation of hidden Markov models (HMM) for online BCIs and discuss strategies to improve their performance. Approach. We compare the SVM, serving as a reference, and HMMs for classifying discrete finger movements obtained from electrocorticograms of four subjects performing a finger tapping experiment. The classifier decisions are based on a subset of low-frequency time domain and high gamma oscillation features. Main results. We show that decoding optimization between the two approaches is due to the way features are extracted and selected and less dependent on the classifier. An additional gain in HMM performance of up to 6% was obtained by introducing model constraints. Comparable accuracies of up to 90% were achieved with both SVM and HMM, with the high gamma cortical response providing the most important decoding information for both techniques. Significance. We discuss technical HMM characteristics and adaptations in the context of the presented data as well as for general BCI applications. Our findings suggest that HMMs and their characteristics are promising for efficient online BCIs.

9 Author to whom any correspondence should be addressed.
10 Present address: Institute for Robotics and Cognitive Systems, University of Luebeck, Ratzeburger Allee 160, D-23538 Luebeck, Germany.

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

S  Online supplementary data available from stacks.iop.org/JNE/10/056020/mmedia

(Some figures may appear in colour only in the online journal)

1. Introduction

Brain–computer interface (BCI) oriented research has a principal goal of aiding disabled people suffering from severe motor impairments (Hoffmann et al 2007, Palaniappan et al 2009). The majority of research on BCI has been based on electroencephalography (EEG) data and restricted to simple experimental tasks using a small set of commands. In these studies (Cincotti et al 2003, Lee and Choi 2003, Obermaier et al 1999), information was extracted from a limited number of EEG channels over scalp sites of the right and left hemisphere.

Recently, studies have employed invasive subdural electrocorticogram (ECoG) recordings for BCI (Ganguly and Carmena 2010, Schalk 2010, Zhao et al 2010, Shenoy et al 2007). ECoG is recorded for diagnosis in clinical populations, e.g. for localization of epileptic foci. The signal quality of ECoG-recorded brain activity outperforms EEG data with respect to higher amplitudes and higher signal-to-noise ratio (SNR), higher spatial resolution and broader bandwidth (0–500 Hz) (Crone et al 1998, Schalk 2010). Thus ECoG signals have the potential for improving earlier results of feature extraction and signal classification (Schalk 2010).

While approaches have been made to correlate continuous movement kinematics and brain activity exploiting regression techniques (e.g. Wiener filter, Kalman filter and others; Dethier et al 2011, Acharya et al 2010, Ball et al 2009, Kubanek et al 2009, Liang and Bougrain 2009, Pistohl et al 2008), the support vector machine (SVM) approach has been established as a gold standard for deriving class labels in cases of a discrete set of control commands (Quandt et al 2012, Liu et al 2010, Zhao et al 2010, Demirer et al 2009, Shenoy et al 2007). This is due to the high and reproducible classification performance as well as the robustness with a low number of training samples of SVM approaches (Guyon et al 2002).

However, depending on the individual problem, distinct properties of other machine learning methods may provide another viable BCI approach. Pascual-Marqui et al (1995) and Obermaier et al (1999) argued that brain states and state transitions can explain components of observed human brain activity. Obermaier et al applied HMMs in offline classification of non-invasive EEG recordings of brain activity to distinguish between two different imagined movements. In support of the idea of distinct mental states, they reported an improvement of BCI classification results in subjects who developed a consistent imagination strategy, resulting in more focused and reproducible brain activity patterns. The authors concluded that HMMs offer a promising means to classify mental activity in a BCI framework because of the inherent HMM flexibility in structural degrees of freedom and simplicity. Some studies, however, reveal unstable properties and lower HMM robustness in high dimensional feature spaces. The latter point is critically related to small training sets leading to degradation in the classifier's ability (Lotte et al 2007).

In this study we directly compare HMM and SVM classification of ECoG data recorded during a finger tapping experiment. This was done after a careful optimization of both classifiers: first focusing on improvements of the features, and secondly by introducing constraints to the HMMs, reducing the amount of required training data. Additionally, the problem of high dimensional feature spaces is addressed by feature selection. It is known that feature extraction and selection have a crucial impact on classifier performance (Pechenizkiy 2005, Guyon and Elisseeff 2003). However, the exact effect on the decoding accuracy for given data characteristics and representations of information in time and space remains to be investigated. Poorly constructed feature spaces cannot be compensated for by the classifier. However, an optimal feature space does not make the choice of the classifier irrelevant. A tendency toward unstable HMM behaviour, as well as the different ways in which HMM and SVM model the feature space, was addressed with the ECoG data set.

Our hypothesis was that, for ECoG classification of finger representations, differences in performance between the two approaches would be due to the way features are extracted and selected, and less influenced by the choice of the classifier. Here we demonstrate that constrained HMMs achieve decoding rates comparable to the SVM. We first provide an overview of how the data were acquired. We then review the analysis methods, involving feature extraction, HMMs and SVMs. Third, results on the feature space setup and feature selection are evaluated. Finally, we discuss the decoding performance of the HMM classifier and directly compare it to the one-vs-one SVM.

2. Material and methods

2.1. Patients, experimental setup and data acquisition

ECoG data were recorded from four patients who volunteered to participate (aged 18–35 years, right handed, male). All patients received subdural electrode implants for pre-surgical planning of epilepsy treatment at UC San Francisco, CA, USA. The electrode grid was placed solely based on clinical criteria and covered cortical pre-motor, motor, somatosensory and temporal areas. Details on the exact grid placement for each subject can be obtained from supplementary figures SI1–SI4, available from stacks.iop.org/JNE/10/056020/mmedia. The study was conducted in concordance with the protocol approved by the local institutional review board as well as the Declaration of Helsinki. All patients gave their informed consent before


Table 1. The number of recorded and rejected trials per subject, session and class (thumb/index finger/middle finger/little finger).

Subject   Session   Trials   Class breakdown     Rejected trials (breakdown)
S1        1         357      62/113/108/74       0 (0/0/0/0)
S1        2         240      33/90/72/45         1 (0/1/0/0)
S1        Total     587      95/203/180/119      1 (0/1/0/0)
S2        1         280      44/97/91/48         40 (5/13/13/9)
S2        2         308      51/103/111/43       0 (0/0/0/0)
S2        3         271      38/87/92/54         77 (13/23/26/15)
S2        4         273      32/88/88/65         41 (8/17/10/6)
S2        Total     1122     165/365/382/210     158 (26/53/49/30)
S3        1         234      54/70/70/40         114 (16/40/36/22)
S3        2         249      47/70/83/49         104 (15/31/38/20)
S3        Total     483      101/140/153/89      218 (31/71/74/42)
S4        1         359      57/113/116/73       6 (2/2/1/1)
S4        2         328      63/104/106/55       33 (6/11/11/5)
S4        Total     687      120/217/222/128     39 (8/13/12/6)

the recordings started. The recordings did not interfere with the treatment and entailed minimal additional risk for the participant. At the time of recording, all patients were off anti-epileptic medications; the medications had been withdrawn when the patients were admitted for ECoG monitoring.

The recordings were obtained from a 64-channel grid of 8 × 8 platinum–iridium electrodes with 1 cm centre-to-centre spacing. The diameter of the electrodes was 4 mm, of which 2.3 mm were exposed (except for subject 4, who had a 16 × 16 electrode grid with 4 mm centre-to-centre spacing). The ECoG was recorded with a hardware sampling frequency of 3051.7 Hz and was then down-sampled to 1017 Hz for storage and further processing. During the recordings, patients performed a serial reaction time task in which a number appeared on a screen indicating with which finger to press a key on the keyboard (1 indicated the thumb, 2 the index, 3 the middle and 5 the little finger; the ring finger was not used). The next trial was automatically initiated with a short randomized delay of (335 ± 36) ms after the previous key press (mean value for S1, session 1; others in a similar range). The patients were instructed to respond as rapidly and accurately as possible, with all fingers remaining on fixed keys during the whole 6–8 min run. For each trial, the requested as well as the actually used finger was recorded. The rate of correct button presses was at or close to 100% for all subjects. The classification trials were labelled according to the finger that was actually moved. Each patient participated in 2–4 of these sessions (table 1). Subjects performed slight finger movements requiring minimal force. Topographic finger representations extend approximately 55 mm along the motor cortex (Penfield and Boldrey 1937). The timing of stimulus presentations and button presses was recorded in auxiliary analogue channels synchronized with the brain data.

Due to clinical constraints, the time available for ECoG data acquisition is restricted, providing a challenging dataset for any classifier. The numbers of trials for classification are listed in table 1. For pre-processing, the time series was first highpass filtered using a cut-off frequency of 0.5 Hz as well as notch filtered around the power line frequency (60 Hz) using frequency domain filtering. The electric potentials on the grid were re-referenced to a common average reference (CAR).
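As a rough illustration of this pre-processing chain, the following sketch (our own minimal example, not the authors' code) applies a frequency-domain high-pass at 0.5 Hz, a 60 Hz notch and a common average reference to a placeholder data matrix; the exact filter shapes used in the study are not specified beyond the cut-offs.

```matlab
% Minimal pre-processing sketch; "ecog" is a placeholder channels x samples
% matrix at the stored sampling rate. Filter shapes are illustrative only.
fs   = 1017;                                   % sampling rate after down-sampling (Hz)
ecog = randn(64, 10*fs);                       % placeholder: 64 channels, 10 s of data
[nCh, nS] = size(ecog);
f    = (0:nS-1) * fs / nS;                     % FFT bin frequencies
keep = ones(1, nS);
keep(f < 0.5 | f > fs - 0.5) = 0;              % high-pass at 0.5 Hz (zero DC and mirror bins)
keep(abs(f - 60) < 1 | abs(f - (fs - 60)) < 1) = 0;   % 60 Hz notch (and its mirror)
X        = fft(ecog, [], 2);
ecogFilt = real(ifft(X .* repmat(keep, nCh, 1), [], 2));
% Common average reference: subtract the instantaneous mean over all channels.
ecogCar  = ecogFilt - repmat(mean(ecogFilt, 1), nCh, 1);
```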

Next, the time series were visually inspected for the remaining intervals, and artefacts caused by the measurement hardware, line noise, loose contacts and sections with signal variations not explained by normal physiology were excluded from the analysis. Channels with epileptic activity were removed, since epileptic spiking has massive high-frequency spectral activity and can distort informative features in this range. We also removed any epochs with spread of the epileptic activity to normal brain sites to avoid high-frequency artefacts. The procedure was chosen to ensure comparability across subjects and sessions. Trial rejection was carried out without any knowledge of the corresponding class labels and in advance of any investigations on classification accuracies. Finally, the trials were aligned with respect to the detected button press, being the response to each movement request.

2.2. Test framework

The methods described in the following assemble an overall test framework as illustrated in figure 1. The classifier, being the main focus here, is embedded as the core part among other supporting modules, including pre-processing, data grouping according to the testing scheme, channel selection and feature extraction. The framework includes two separate paths for training and testing data to ensure that no prior knowledge about the test data is used in the training process.

Figure 1. Block diagram of the implemented test framework. The decision for a particular class is provided by either SVM or HMM; in both cases they are exposed to an identical framework of pre-processing, channel selection and feature extraction.

2.3. Feature generation

Two different feature types have been extracted for this study. First, time courses of low-frequency components were obtained by spectral filtering in the Fourier domain. These time courses were limited to a specific interval of interest around the button press, containing the activation most predictive for the four classes. The chosen frequency range of the lowpass filter covers a band of low-frequency components that contain phase-locked event-related potentials (ERPs) and prominent local motor potentials (LMP) (Kubanek et al 2009). These are usually extracted by averaging out uncorrelated noise over multiple trials (Makeig 1993). Here we focus on single-trial analysis, which may not reflect LMP activity (Picton et al 1995). As a consequence, we refer to this feature type as low-frequency time domain (LFTD) features. Boundaries of the time interval (length and location relative to the button press) and the lowpass filter cut-off were selected by grid search for each session and are stated in the results. The data sampling was reduced to the Nyquist frequency.

Second, the event-related spectral perturbation (ERSP) was measured using a sliding Hann-window approach (window length nw = 260 ms). For each window, the square root of the power spectrum was computed by fast Fourier transform (FFT). The resulting coefficients were then averaged in a frequency band of interest. Taking the square root to obtain the amplitude spectrum assigned a higher weight to high-frequency components compared to the original power spectrum. The resulting feature sequence can be used as a measure for time-locked power fluctuation in the high gamma band, neglecting explicit phase information (Graimann et al 2004, Crone et al 1998, Makeig 1993).
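A compact sketch of this sliding-window amplitude feature is given below (our own reconstruction under stated assumptions; the window overlap and the 85–270 Hz band are example values taken from figure 4, not fixed choices of the method).

```matlab
% Sketch of the high-gamma (HG) amplitude feature for one channel of one
% trial; "x" is a placeholder signal segment around the button press.
fs      = 1017;
x       = randn(1, round(0.7*fs));           % placeholder: ~700 ms of data
nw      = round(0.260 * fs);                 % 260 ms Hann window
overlap = round(0.75 * nw);                  % example overlap (set by grid search in the study)
band    = [85 270];                          % example HG band in Hz (subject-specific)
w       = 0.5 * (1 - cos(2*pi*(0:nw-1)/(nw-1)));   % Hann window without toolbox functions
step    = nw - overlap;
starts  = 1:step:(numel(x) - nw + 1);
f       = (0:nw-1) * fs / nw;                % FFT bin frequencies
inBand  = f >= band(1) & f <= band(2);
hgFeat  = zeros(1, numel(starts));
for k = 1:numel(starts)
    seg       = x(starts(k):starts(k)+nw-1) .* w;
    amp       = sqrt(abs(fft(seg)).^2);      % square root of the power spectrum
    hgFeat(k) = mean(amp(inBand));           % average over the HG band
end
```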

The time interval relative to the button press, the window overlap and the exact low and high cut-off frequencies of the high gamma band most discriminative for the four classes were selected by grid search. The resulting parameters for each session are stated in the results section.

A combined feature space including both high gamma (HG) and LFTD features was also evaluated. The ratio p of how many LFTD feature sequences are used relative to the number of HG sequences was obtained by grid search. Subset selection beyond this ratio is described in the next subsection.

2.4. Feature selection

2.4.1. General remarks on subset selection. Feature sequences were computed for all 64 ECoG channels, giving 2 × 64 sequences in total. From these sequences, a subset of size O has been identified that contains the most relevant information about the finger labels. A preference towards certain channels is influenced by the grid placement as well as by distinct parts of the cortex being differentially activated by the task.

We tested several measures of feature selection, such as Fisher's linear discriminant analysis (Fisher 1936) and the Bhattacharyya distance (Bhattacharyya 1943), and found that the Davies–Bouldin (DB) index (Wissel and Palaniappan 2011, Davies and Bouldin 1979) yielded the most robust and stable subset estimates for accurate class separation. In addition to higher robustness, the DB index also represents the computationally fastest alternative. The DB index substantially outperformed other methods for both SVM and HMM. Nevertheless, it should be pointed out that the observed DB subsets were in good agreement with the ranking generated by the normal vector w of a linear SVM. Apart from so-called filter methods such as the DB index, an evaluation of these vector entries after training the SVM would correspond to feature selection using a wrapper method (Guyon and Elisseeff 2003). Our study did not employ any wrapper methods to avoid dependences between feature selection and classifier, which may have led to an additional bias towards SVM or HMM (Yu and Liu 2004). Based on the DB index, we defined an algorithm that automatically selects an O-dimensional subset of feature sequences as follows.

2.4.2. DB index. The index corresponds to a cluster separation index measuring the overlap of two clusters in an arbitrarily high dimensional space, where r ∈ R^T is one of N sample vectors belonging to cluster C_i:

    μ_i = (1/N) Σ_{r ∈ C_i} r,    (1)

    d_i = (1/N) Σ_{r ∈ C_i} ||r − μ_i||,    (2)

    R_ij = (d_i + d_j) / ||μ_i − μ_j||,   i, j ∈ N+, i ≠ j.    (3)

Using the cluster centroids μ_i ∈ R^T and spreads d_i, the matrix R = {R_ij}, i = 1, ..., Nc, j = 1, ..., Nc, contains the ratio between the within-class spread and the between-class spread for all class combinations. The smaller the index, the less two clusters overlap in that feature space. The index is computed for all features available.
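A direct transcription of equations (1)–(3) into code could look as follows (a sketch with illustrative names; the segment-wise evaluation and averaging described in section 2.4.3 are omitted here).

```matlab
% Pairwise Davies-Bouldin index for one candidate feature (save as db_index.m).
% "feat" is trials x T (one scalar sequence per trial); "labels" holds the
% class of each trial.
function R = db_index(feat, labels)
    classes = unique(labels);
    Nc      = numel(classes);
    mu      = zeros(Nc, size(feat, 2));
    d       = zeros(Nc, 1);
    for i = 1:Nc
        Ci      = feat(labels == classes(i), :);
        mu(i,:) = mean(Ci, 1);                                              % eq. (1)
        d(i)    = mean(sqrt(sum((Ci - repmat(mu(i,:), size(Ci,1), 1)).^2, 2)));  % eq. (2)
    end
    R = zeros(Nc);
    for i = 1:Nc
        for j = 1:Nc
            if i ~= j
                R(i,j) = (d(i) + d(j)) / norm(mu(i,:) - mu(j,:));           % eq. (3)
            end
        end
    end
end
```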

2.4.3. Feature selection procedure. Covering different aspects of class separation, we developed a feature selection algorithm based on the DB index taking multiple classes into account. The algorithm aims at minimizing the mutual class overlap for a subset of defined size O as follows. First, each feature sequence is segmented into three parts of equal length. The index matrix R, measuring the cluster overlap between all classes i and j, is then computed segment-wise for each feature under consideration. Subsequently, all three matrices are averaged (geometric mean, for a bias towards smaller indices) for further computing. The feature yielding the smallest mean index is selected, since it provides the best separation on average from one class to at least one other class. Second, for all class combinations, the features corresponding to the minimal index per combination are added. Each of them optimally separates a certain finger combination. Next, features with the smallest geometric mean per class are added if they are not part of the subset already. Finally, in case the predefined number of elements in the subset O admits further features, vacancies are filled according to a sorted ranking of indices averaged across class combinations per feature.

Figure 2. An HMM structure for one class consists of Q states emitting multi-dimensional observations o_t at each time step. The state as well as the output sequence is governed by a set of probabilities, including transition probabilities a_ij. To assign a sequence O_t to a particular model, the classifier decides for the highest probability P(O_t | model).

For each trial, this subset results in an O-dimensional feature vector varying over time t ∈ [0, T]. While this multivariate sequence was fed directly into the HMM classifier, the O × T matrix O_t had to be reshaped into a 1 × (T·O) vector for the SVM. The subset size O is obtained by exhaustive search on the training data. This makes the difference between both classifiers in data modelling obvious.

While an HMM technically distinguishes between the temporal development (columns in O_t) and the feature type (rows in O_t), the SVM concatenates all channels and their time courses within the same one-dimensional feature vector. The potential flexibility of HMMs with respect to temporal phenomena arises substantially from this fact.
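The difference in input format can be made concrete with a small sketch (illustrative only): the HMM consumes the O × T matrix as-is, whereas the SVM input is obtained by concatenating the time courses of all selected feature sequences into one row vector.

```matlab
O  = 8;  T = 26;                   % example subset size and sequence length
Ot = randn(O, T);                  % placeholder feature matrix of one trial
hmmInput = Ot;                     % multivariate sequence, fed to the HMM directly
svmInput = reshape(Ot', 1, []);    % 1 x (T*O): time course of feature 1, then feature 2, ...
```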

2.5. Classification

2.5.1. Hidden Markov models. Arising from a state machine structure, hidden Markov models (HMMs) comprise two stochastic processes. First, as illustrated in figure 2, a set of Q underlying states is used to describe a sequence O_t of T O-dimensional feature vectors. These states q_t ∈ {s_i}, i = 1, ..., Q, are assumed to be hidden, while each is associated with an O-dimensional multivariate Gaussian M-mixture distribution to generate the observed features, the second stochastic process. Both the state incident q_t as well as the observed output variable o_t at time t are subject to the Markov property, a conditional dependence only on the last step in time, t − 1.

Among other parameters, the state sequence structure and its probable variations are described by prior probabilities P(q_1 = s_i) and a transition matrix

    A = {a_ij}, with a_ij = P(q_{t+1} = s_j | q_t = s_i), i, j = 1, ..., Q.

A more detailed account of the methodological background is given by Rabiner (1989).

For each of the four classes, one HMM had to be defined, which was trained using the iterative Baum–Welch algorithm with eight expectation maximization (EM) steps (Dempster et al 1977). The classification on the test set was carried out following a maximum likelihood approach, as indicated by the curly bracket in figure 2.

In order to increase decoding performance, the above standard HMM structure was constrained and adapted to the classification problem. This aims at optimizing the model hypothesis to give a better fit and reduces the number of free parameters to deal with limited training data.

2.5.2. Model constraints. First, the transition matrix A was constrained to the Bakis model, commonly used for word modelling in automatic speech recognition (ASR) (Schukat-Talamazzini 1995). This model admits non-zero elements only on the upper triangular part of the transition matrix, specifically on the main as well as the first and second secondary diagonal. Thus, this particular structure implies only 'loop', 'next' and 'skip' transitions and suppresses state transitions on the remaining upper half. Preliminary results using a full upper triangular matrix instead revealed small probabilities, i.e. very unlikely state transitions, for these suppressed transitions throughout all finger models. This suggests only minor contributions to the signal characteristics described by that model. A full ergodic transition matrix generated poor decoding performances in general. Recently, a similar finding was presented by Lederman and Tabrikian for EEG data (Lederman and Tabrikian 2012).
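A sketch of such a Bakis-constrained initial transition matrix is shown below (our own example; the diagonal weighting c follows the initialization rule described in section 2.5.4).

```matlab
% Bakis-constrained transition matrix for Q states: non-zero entries only on
% the main diagonal ('loop') and the first ('next') and second ('skip')
% upper diagonals; rows are normalized to sum to one.
Q = 5;  T = 26;
c = T / Q;                                            % diagonal weighting (section 2.5.4)
A = c * eye(Q) + diag(ones(Q-1, 1), 1) + diag(ones(Q-2, 1), 2);
A = A ./ repmat(sum(A, 2), 1, Q);
% Entries that are zero here remain zero under Baum-Welch re-estimation,
% so the loop/next/skip structure is preserved throughout training.
```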

Given five states containing one Gaussian, the unleashed model would involve 4 × 1235 degrees of freedom, while the Bakis model uses 4 × 1219 for an exemplary selection of 15 channels. The significant increase of decoding performance achieved by dropping only a few parameters seems interesting with respect to the nature of the underlying signal structure and sheds remarkable light on the importance of feature selection for HMMs. An unleashed model using all 64 channels would have 4 × 20835 free parameters.
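For orientation, the 4 × 20835 figure is consistent with the following counting per class model with Q = 5 states, M = 1 Gaussian and O = 64 feature dimensions: Q·O² = 20480 full-covariance entries, Q·O = 320 mean entries, Q² = 25 transition probabilities, Q = 5 priors and Q·M = 5 mixture weights, i.e. 20835 parameters in total; the same counting gives 1235 for O = 15. This decomposition is our own reading of the stated totals, not spelled out by the authors.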

2.5.3. Model structure. Experiments for all feature spaces have been conducted in which the optimal number of states Q per HMM and the number of mixture distribution components M per state have been examined by grid search. Since increasing the degrees of freedom will degrade the quality of model estimation, high decoding rates have mainly been obtained for a small number of states containing only a few mixture components. Based on this search, a model structure of Q = 5 states and M = 1 mixture components has been chosen as the best compromise. Note that the actual optimum might vary slightly depending on the particular subject, the feature space and even on the session, which is in line with other studies (Obermaier et al 1999, 2001a, 2001b).

2.5.4. Initialization. The initial parameters of the multivariate probability densities have been set using a k-means clustering algorithm (Duda et al 2001). The algorithm temporally clusters the vectors of the O × T feature matrix O_t. The rows of this matrix were extended by a scaled temporal counter, adding a time stamp to each feature vector. This increases the probability of time-continuous state intervals in the cluster output. Then, for each of the Q·M clusters, the mean as well as the full covariance matrix have been calculated, excluding the additional time stamp. Finally, to assign these mixture distributions to specific states, the clusters were sorted according to the mean of their corresponding time stamp values. This temporal order is suggested by the constrained forward structure of the transition matrix. The transition matrix is initialized with probability values on the diagonal that are weighted to be c times as likely as probabilities off the diagonal. The constant c is calculated as the quotient of the number of samples for a feature, T, and the number of states Q. In the case of multiple mixture components (M > 1), the mixture weight matrix was initialized using the normalized number of sample vectors assigned to each cluster.
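The initialization can be sketched as follows (assuming MATLAB's Statistics Toolbox kmeans and a single trial for brevity; the scaling of the time stamp is an illustrative choice).

```matlab
% Temporally augmented k-means initialization for the emission densities.
O = 8;  T = 26;  Q = 5;  M = 1;
Ot   = randn(O, T);                         % placeholder feature matrix of one trial
tau  = (1:T) * 0.1;                         % scaled temporal counter (time stamp)
Xaug = [Ot; tau]';                          % T observations of dimension O+1
[idx, cent] = kmeans(Xaug, Q*M);            % requires the Statistics Toolbox
[~, order]  = sort(cent(:, end));           % sort clusters by their mean time stamp
mu0    = zeros(O, Q*M);
Sigma0 = zeros(O, O, Q*M);
for k = 1:Q*M
    members         = Xaug(idx == order(k), 1:O);   % drop the time stamp again
    mu0(:, k)       = mean(members, 1)';
    Sigma0(:, :, k) = cov(members);                 % full covariance per cluster
end
```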

2.5.5. Implementation. The basic implementation has been derived from the open source functionalities on HMMs provided by Kevin Murphy from the University of British Columbia (Murphy 2005). This framework, its modifications, as well as the entire code developed for this study were implemented using MATLAB R2012b from MathWorks.
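Under the conventions of that toolbox, training the four class-specific models and classifying a new trial could look roughly like this (a hedged sketch: mhmm_em and mhmm_logprob are the toolbox's Gaussian-mixture HMM routines, but the variable shapes and surrounding scaffolding are our assumptions, not the authors' code).

```matlab
% prior0: Q x 1 initial state distribution (e.g. uniform); transmat0 = A,
% mu0 and Sigma0 as in the sketches above; mixmat0 = ones(Q, M).
nClasses = 4;
models   = cell(nClasses, 1);
for c = 1:nClasses
    % trainSeq{c}: cell array of O x T feature matrices of the training trials of class c
    [~, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
        mhmm_em(trainSeq{c}, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', 8);
    models{c} = struct('prior', prior1, 'transmat', transmat1, ...
                       'mu', mu1, 'Sigma', Sigma1, 'mixmat', mixmat1);
end
% Maximum likelihood decision for a new trial Ot (O x T feature matrix):
logp = zeros(nClasses, 1);
for c = 1:nClasses
    m = models{c};
    logp(c) = mhmm_logprob(Ot, m.prior, m.transmat, m.mu, m.Sigma, m.mixmat);
end
[~, predictedFinger] = max(logp);
```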

2.5.6. Support vector machines. As a gold-standard reference we used a one-vs-one SVM for multi-class classification. In our study, training samples come in paired assignments (x_i, y_i), i = 1, ..., N, where x_i is a 1 × (T·O) feature vector derived from the ECoG data for one trial, with its class assignment y_i ∈ {−1, 1}. Equation (4) describes the classification rule of the trained classifier for new samples x. A detailed derivation of this decision rule and the rationale behind SVM classification is given, e.g., by Hastie et al (2009).

    F(x) = sgn( Σ_i α_i y_i K(x_i, x) + β_0 ).    (4)

The α_i ∈ [0, C] are weights for the training vectors and are estimated during the training process. Training vectors with weights α_i ≠ 0 contribute to the solution of the classification problem and are called support vectors. The offset of the separating hyperplane from the origin is proportional to β_0. The constant C generally controls the trade-off between the classification error on the training set and the smoothness of the decision boundary. The numbers of training samples per session (table 1) were in a similar range compared to the resulting dimension of the SVM feature space T·O (~130–400). Influenced by this fact, the decoding accuracy of the SVM revealed almost no sensitivity with respect to regularization. Since grid search yielded a stable plateau of high performance for C > 1, the regularization constant was fixed to C = 1000. The linear kernel, which has been used throughout the study, is denoted by K(x_i, x). We further used the LIBSVM package (www.csie.ntu.edu.tw/~cjlin/libsvm) for Matlab, developed by Chih-Chung Chang and Chih-Jen Lin (Chang and Lin 2011).
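With LIBSVM's MATLAB interface, the corresponding reference classifier reduces to a few lines (a sketch with placeholder data; LIBSVM applies its one-vs-one scheme internally when more than two labels are present).

```matlab
nTrials = 200;  T = 26;  O = 8;
Xtrain  = randn(nTrials, T*O);                 % each row: one trial reshaped to 1 x (T*O)
ytrain  = randi(4, nTrials, 1);                % placeholder finger labels 1..4
model   = svmtrain(ytrain, Xtrain, '-t 0 -c 1000');   % linear kernel, C = 1000
Xtest   = randn(50, T*O);
ytest   = randi(4, 50, 1);
[ypred, accuracy] = svmpredict(ytest, Xtest, model);
```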

2.6. Considerations on testing

In order to evaluate the findings presented in the following sections, a five-fold cross-validation (CV) scheme was used (Lemm et al 2011). The assignment of a trial to a specific fold set was performed using uniform random permutations.

Each CV procedure was repeated 30 times to average out fluctuations in the resulting recognition rates arising from random partitioning of the CV sets. The number of repetitions was chosen as a compromise to obtain stable estimates of the generalization error but also to avoid massive computational load. The training data were balanced between all classes to treat no class preferentially and to avoid the need for class prior probabilities. Feature subset selection and estimation of the parameters of each classifier were performed only on data sets allocated for training, in order to guarantee a fair estimate of the generalization error. Due to prohibitive computational effort, the feature spaces have been parameterized only once for each subject and session and are hence based on all samples of each subject-specific data set. In this context, preliminary experiments indicated minimal differences in the decoding performance (see table 2) when fixing the feature definitions based on the training sets only. These experiments used nested CV with an additional validation set for the feature space parameters.
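The testing scheme can be summarized by the following sketch (placeholder variables throughout; evaluateFold is a hypothetical helper standing in for feature selection, training and testing on one fold).

```matlab
nReps = 30;  nFolds = 5;
nTrials = numel(labels);                       % labels: class label of each trial (placeholder)
acc = zeros(nReps, nFolds);
for rep = 1:nReps
    folds = mod(randperm(nTrials), nFolds) + 1;    % uniform random fold assignment
    for f = 1:nFolds
        testIdx  = (folds == f);
        trainIdx = ~testIdx;
        % Class balancing of the training set and feature subset selection are
        % performed inside evaluateFold, using the training trials only.
        acc(rep, f) = evaluateFold(trials, labels, trainIdx, testIdx);
    end
end
meanAccuracy = mean(acc(:));                   % averaged over 30 x 5 folds
```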



Figure 3. Optimal parameters of both feature spaces for each subject and session. (a) Location and width of the ROIs containing the most discriminative information for LFTD features. (b) Optimal high cut-off frequency for the spectral lowpass filter to extract LFTD features. (c) Location and width of the ROIs containing the most discriminative information for HG features. (d) Low and high cut-off frequency for the optimal HG frequency band.

Table 2. HMM decoding accuracies (in %) for parameter optimization on the full data and on training subsets. Accuracies are given for subject 1, session 1. Parameter optimization was performed on the full dataset (standard CV routine) as well as on the training subset (nested CV routine).

Parameter optimization on   LFTD         HG           LFTD + HG
Full dataset                80.7 ± 0.3   97.9 ± 0.1   98.4 ± 0.1
Training subset             81.1 ± 1.0   97.8 ± 0.3   98.0 ± 0.2

3. Results

3.1. Feature and channel selection

In order to identify optimal parameters for feature extraction, a grid search for each subject and session was performed. For this optimization, a preliminary feature selection was applied, aiming at a fixed subset size of O = 8, which was found to be within the range of optimal subset sizes (see figure 6 and table 3). Results for all sessions are illustrated by the bar plots in figure 3. A sensitivity of the decoding rate has been found for temporal parameters such as the location and width of the time windows (ROIs) containing the most discriminative information about LFTD or HG features (figures 3(a) and (c)).

The highest impact on accuracy was found for the HG ROI (mean optimal ranges: S1 [−187.5, 337.5] ms, S3 [−400, 475] ms, S4 [−280, 420] ms).

Further sensitivity is given for spectral parameters, namely the high cut-off frequency of the LFTD lowpass filter and the low and high cut-offs of the HG bandpass (figures 3(b) and (d)). For the former, the optimal frequencies ranged from 11–30 Hz, and the optimal frequency bands for the latter all lay within 60–300 Hz, depending on the subject. There was hardly any impact on the decoding rate when changing the sliding window size and window overlap for the HG features. These were hence set to fixed values. According to these parameter settings, the mean numbers of samples T were 26.5 (subject 1), 21.3 (subject 2), 31 (subject 3) and 22 (subject 4).

An example of the grid search procedure is shown in supplementary figures SI5–SI7 (available at stacks.iop.org/JNE/10/056020/mmedia). The plots show varying decoding rates for S1, session 1, within individual parameter subspaces. The optimal parameters illustrated in figure 3 represent a single point in these spaces at peak decoding rate. Based on these parameters, figure 4 illustrates a typical result for an LFTD and HG feature sequence extracted from the shown raw ECoG signal (S1, session 2, channel 23, middle finger). All feature sequences for this trial, consisting of the entire set of channels, are given in supplementary figure SI8.

Using the resulting parameterization, subsequent CV runs sought to identify the optimal subset by tuning O based on the training set of the current fold. Figure 5 illustrates the rate at which a particular channel was selected for each subject.


Figure 4. The dashed grey time course shows a raw ECoG signal for a typical trial (S1, session 2, channel 23, middle finger). The extracted feature sequences for LFTD (cut-off at 24 Hz) and HG (frequency band [85 Hz, 270 Hz]) are overlaid as solid lines. Each HG feature has been assigned to the time point at the centre of its corresponding FFT window.

Figure 5. Channel maps for all four subjects. The channels are arranged according to the electrode grid and coloured with respect to their mean selection rate in repeated CV runs. Channels coloured like the top of the colour bar tend to have a higher probability of being selected.

It evaluates the optimal subsets O across all sessions, folds and several CV repetitions. Results are exemplarily shown for high gamma features. Full detail on subject-specific grid placement and channel maps for LFTD and HG features can be found in supplementary figures SI1–SI4 (available at stacks.iop.org/JNE/10/056020/mmedia).

The channel maps reveal localized regions of highly informative channels (dark red) comprising only a small subset within the full grid. While these localizations are distinct and focused for high gamma features, informative channels spread wider across the electrode grid for LFTD features. This finding is in line with an earlier study by Kubanek et al (2009) and implies that selected subsets for LFTD features tend to be less stable across CV runs. Combined subsets were selected from both feature spaces. The selections indicated a preference for a higher proportion of LFTD features (table 3). This optimal proportion was determined by grid search.

The optimal absolute subset size O was selected by successively adding channels to the subset, as defined by our selection procedure, and identifying the peak performance. For the high gamma case, figure 6 already shows a sharp increase in decoding performance for a subset size below five. Performance then increases less rapidly until it reaches a maximum, usually in the range of 6–8 channels for the HG features (see table 3 for detailed information on the results for the other features). This behaviour again reflects the fact that a lot of information is found in a small number of HG sequences.

Increasing the subset size beyond this optimal size leads to a decrease in performance for both SVM and HMM. This coincides with an increase in the number of free parameters in the classifier model, which depends quadratically on the number of channels for HMMs (figure 6, bottom).

3.2. Decoding performances

Based on the configuration of the feature spaces and feature selection, a combination space containing both LFTD and HG features yielded the best decoding accuracy across all sessions and subjects. Figure 7 summarizes the highest HMM performances for each subject, averaged across all sessions, and compares them with the corresponding SVM accuracies. Except for subject 2, these accuracies were close to or even exceeded 90% correct classifications. However, figure 8 shows that accuracies obtained only from the HG space come very close to this combined case, implying that LFTD features contribute limited new information. Using time domain features exclusively decreased performance in most cases. An exception is given by subject 2, where LFTD features outperformed HG features. This may be explained by the wider spread of LFTD features across the cortex and the suboptimal grid placement (little coverage of motor cortex, supplementary figure SI2, available at stacks.iop.org/JNE/10/056020/mmedia). The channel maps show informative channels at the very margin of the grid. This suggests that grid placement was not always optimal for capturing the most informative electrodes.

Differences in decoding accuracy compared to the SVM were found to be small (table 4) and revealed no clear superiority of either classifier (figure 9). Excluding subject 2, the mean performance difference (HMM−SVM) was −0.53% for LFTD, 0.1% for HG and −0.23% for combined features. Accuracy trends for individual subjects are similar across sessions and vary less than across subjects (figure 9). Note that, irrespective of the suboptimal grid placement, subject 2 took part in four experimental sessions. Here, subject-specific variables, e.g. grid placement, were important determinants of decoding success. Thus we interpret our results at the subject level and conclude that in three out of four subjects SVM and HMM decoding of finger movements led to comparable results.


Figure 6. Influence of channel selection on the HMM decoding accuracy using high gamma features only (S1, session 1). Changes in decoding accuracy when successively adding channels to the feature set, applying either random or DB-index channel selection (top). Number of free HMM parameters for each number of channels (bottom).

Table 3. Number of selected channels averaged across sessions. The averaged number of selected channels and the corresponding standard deviations are given for all subjects and all feature spaces. For the combined space, square brackets denote the corresponding breakdown into its components. The average is taken across all sessions and CV repetitions.

Subject   LFTD         HG          LFTD + HG
1         10.0 ± 0.0   6.0 ± 0.0   16.5 ± 0.5 ([10.0 ± 0.0] + [6.5 ± 0.7])
2         6.5 ± 1.9    7.3 ± 2.1   12.8 ± 3.4 ([7.0 ± 4.5] + [5.8 ± 1.5])
3         10.0 ± 0.0   8.0 ± 0.0   14.0 ± 1.0 ([6.0 ± 1.4] + [8.0 ± 0.0])
4         10.0 ± 0.0   6.0 ± 1.4   16.0 ± 0.7 ([9.5 ± 0.7] + [6.5 ± 0.7])

A clear superiority of SVM classification was only found for subject 2, for whom the absolute decoding performances were significantly worse than those of the other three subjects.

For the single spaces LFTD and HG, table 4 shows the mean performance loss for the unleashed HMM. This refers to the case wherein the Bakis model constraints are relaxed to the full model. For all except two sessions of subject 2 (the full model had a higher performance of 0.2% for session 3 and 0.7% for session 4), the Bakis model outperformed the full model. The results indicate a higher benefit for the LFTD features, with up to an 8.8% performance gain for subject 3, session 2. Higher accuracy gains were consistently achieved for feature subsets deviating from the optimal size. An example is illustrated for LFTD channel selection (subject 3, session 2) in supplementary figure SI9 (available at stacks.iop.org/JNE/10/056020/mmedia). The plot compares the Bakis and the full model, similar to the procedure in figure 6. While table 4 lists only the improvements for the optimal number of channels, supplementary figure SI9 indicates that the performance gain may be even higher when deviating from the optimal settings.

Finally, figure 10 illustrates the accuracy decompositions into true positive rates for each finger. The results reveal similar trends across subjects and feature spaces. While the thumb and little finger provided the fewest training samples, their true positive rates tend to be the highest among all fingers. An exception is given for the LFTD features of subject 2. Equivalent plots for the SVM are shown in supplementary figure SI10 (available at stacks.iop.org/JNE/10/056020/mmedia).

For the classification of one feature sequence, including all T samples, the HMM Matlab implementation took on average (3.522 ± 0.036) ms and the SVM implementation in C (0.0281 ± 0.0034) ms. The mean computation time for a feature sequence was 0.2 ms ± 2.1 μs (LFTD), 17.2 ms ± 379 μs (HG) and 17.9 ms ± 200 μs (combined space). The results were obtained with an Intel Core i5-2500K, 3.4 GHz, 16 GB RAM, Win 7 Pro 64-bit, Matlab 2012b, for ten selected channels.


Figure 7. Decoding accuracies (in %) for HMM (green) and SVM (beige) using the LFTD + HG feature space (best case). Average performances across all sessions are shown for all subjects (S1–S4) as the mean result of a 30 × 5 CV procedure, with their corresponding error bars.

Figure 8. Decoding accuracies (in %) for the HMMs for all feature spaces and sessions Bi. Performances are averaged across 30 repetitions of the five-fold CV procedure and displayed with their corresponding error bars.

4. Discussion

An evaluation of feature space configurations revealed a sensitivity of the decoding performance to the precise parameterization of individual feature spaces. Apart from these general aspects, the grid search primarily allows for individual parameter deviations across subjects and sessions. Importantly, a sensitivity has been identified for the particular time interval of interest used for the sliding window approach to extract HG features. This indicates varying temporal dynamics across subjects and may favour the application of HMMs in online applications. This sensitivity suggests that adapting to these temporal dynamics has an influence on the degree to which information can be deduced from the data by both HMM and SVM. However, our investigations indicate that HMMs are less sensitive to changes of the ROI and give nearly stable performances when changing the window overlap. As a model property, the HMM is capable of modelling informative sequences of different lengths and rates. The latter characteristic originates from the explicit loop transition within a particular model. This means the HMM may cover uninformative samples of a temporal sequence with an idle state before the informative part starts. Underlying temporal processes that may persist longer or shorter are modelled by more or fewer repetitions of basic model states. Lacking any time-warp properties, an SVM would expect a fixed sequence length and information rate, meaning a fixed association between a particular feature dimension o_i and a specific point in time t_i. This highlights the strict comparability of information within one single feature dimension explicitly required by the SVM.


Figure 9. Differences in decoding accuracy (in %) for the HMMs with respect to the SVM reference classifier (HMM−SVM). Results are displayed for all sessions Bi and feature spaces.

Figure 10. Mean HMM decoding rates decomposed into true positive rates for each finger. Results are shown for each feature space and averaged across sessions and CV repetitions.

Table 4. Final decoding accuracies (in %) listed for HMM (left) and SVM (right). Accuracies for the Bakis model are given for each subject-session and feature space by their mean and standard error across all 30 repetitions of the CV procedure. Additionally, for each subject and feature space, the mean across all sessions as well as the mean loss in performance for LFTD and HG when using an unconstrained HMM is shown.

                          HMM                                       SVM
Subject-session           LFTD         HG           LFTD + HG       LFTD         HG           LFTD + HG
S1-B1 (%)                 80.7 ± 0.3   97.9 ± 0.1   98.4 ± 0.1      85.8 ± 0.4   97.2 ± 0.2   98.3 ± 0.1
S1-B2 (%)                 78.8 ± 0.6   92.4 ± 0.4   95.3 ± 0.3      84.9 ± 0.5   93.4 ± 0.4   96.1 ± 0.3
Mean (%)                  79.8 ± 0.5   95.2 ± 0.3   96.9 ± 0.2      85.4 ± 0.5   95.3 ± 0.3   97.2 ± 0.2
Unconstrained model (%)   −3.1 ± 0.5   −0.9 ± 0.3   –               –            –            –
S2-B1 (%)                 44.0 ± 1.0   35.9 ± 0.7   44.4 ± 0.6      51.4 ± 0.8   38.0 ± 0.8   55.7 ± 0.8
S2-B2 (%)                 50.2 ± 1.0   48.1 ± 0.7   52.4 ± 0.8      55.5 ± 0.7   48.0 ± 0.6   60.2 ± 0.8
S2-B3 (%)                 40.2 ± 0.7   36.3 ± 0.8   40.1 ± 0.8      49.6 ± 0.7   39.5 ± 0.9   51.5 ± 0.8
S2-B4 (%)                 30.9 ± 0.7   36.4 ± 0.9   38.4 ± 0.9      39.1 ± 0.7   38.0 ± 1.0   42.7 ± 0.9
Mean (%)                  41.2 ± 0.9   39.2 ± 0.8   43.8 ± 0.8      48.9 ± 0.7   40.9 ± 0.9   52.5 ± 0.9
Unconstrained model (%)   −2.2 ± 0.9   −0.4 ± 0.8   –               –            –            –
S3-B1 (%)                 64.6 ± 0.7   85.1 ± 0.4   85.8 ± 0.4      61.2 ± 0.7   85.4 ± 0.5   88.8 ± 0.4
S3-B2 (%)                 80.1 ± 0.5   90.5 ± 0.3   93.7 ± 0.3      85.0 ± 0.4   91.4 ± 0.3   96.0 ± 0.3
Mean (%)                  72.4 ± 0.6   87.8 ± 0.4   89.8 ± 0.4      73.1 ± 0.6   88.4 ± 0.4   92.4 ± 0.3
Unconstrained model (%)   −5.9 ± 0.7   −2.5 ± 0.4   –               –            –            –
S4-B1 (%)                 71.4 ± 0.5   83.5 ± 0.5   87.2 ± 0.4      68.5 ± 0.6   82.1 ± 0.4   86.7 ± 0.4
S4-B2 (%)                 69.7 ± 0.7   83.1 ± 0.3   87.9 ± 0.4      64.3 ± 0.6   82.3 ± 0.3   84.4 ± 0.4
Mean (%)                  70.6 ± 0.7   83.3 ± 0.3   87.6 ± 0.4      66.4 ± 0.6   82.2 ± 0.3   85.6 ± 0.4
Unconstrained model (%)   −1.1 ± 0.6   −3.1 ± 0.5   –               –            –            –


This leads to the prediction of benefits for HMMs in online applications, where the time of the button press or other events may not be known beforehand. The classifier may act more robustly on activity that is not perfectly time-locked. Imaging paradigms for motor imagery or similar experiments may provide scenarios in which a subject carries out an action for an unspecified time interval. Given that varying temporal extent is likely, having to train the SVM for the entire range of possible ROI locations and widths is not desirable.

The SVM characteristics also imply a need to await the whole sequence before making a decision for a class. In contrast, the Viterbi algorithm allows HMMs to state, at each incoming feature vector, how likely an assignment to a particular class is. In online applications this striking advantage provides more prompt classification and may reduce computational overhead if only the baseline is employed. There is no need to await further samples for a decision. Note that this emphasizes the implicit adaptation of HMMs to real-time applications: while the SVM needs to classify the whole new subsequence again after each incoming sample, HMMs just update their current prediction according to the incoming information. They do not start from scratch. The implementation used here did not make use of this online feature, since only the classification of the entire trial was evaluated. Another reason for the SVM being faster than the HMM is the non-optimized Matlab implementation of the latter. The same applies to the feature extraction. Existing high performance libraries for these standard signal processing techniques, or even a hardware field-programmable gate array implementation, may significantly speed up the implementation. The average classification time for the HMMs implies a maximal possible signal rate of about 285 Hz when classifying after each sample. Thus, in terms of classification, real-time operation is perfectly feasible for most cases even with the current implementation. More sophisticated implementations are used in the field of speech processing, where classifications have to be performed on the basis of raw data with very high sampling rates of up to 40 kHz.

Second, tuning the features to an informative subset yielded less dependence on the particular classifier for the presented offline case. Preliminary analysis using compromised parameter sets for the feature spaces as well as for the feature subset size instead led to a higher performance gap between both classifiers. When adapting to each subject individually, a strict superiority of the HMMs over the SVM was not observed in subjects 1 and 3. For subject 4, the HMM actually outperformed the SVM in all cases. Interestingly, this coincides with the grid resolution, which was twice as high for this subject, supporting the importance of the degree and spatial precision of motor cortex coverage. For subject 2, decoding performances were significantly worse in comparison with the other subjects. This is probably due to the grid placement not being optimal for the classification (supplementary figure SI2, available at stacks.iop.org/JNE/10/056020/mmedia). Figure 5 shows no distinct informative region, and channels selected at a high rate are located at the outer boundary of the grid rather than over motor cortices. This hypothesis is further supported by the decoding accuracies, which are higher for LFTD than for HG features in subject 2. Since informative regions were spread more widely across the cortex for subject 2 (Kubanek et al 2009), regions exhibiting critical high gamma motor region activity may have been located outside the actual grid.

While the results suggest that high feature quality is the main factor for high performance in this ECoG data set, two key points have to be emphasized for the application of HMMs with respect to the present finger classification problem. First, due to the high number of free model parameters, feature selection is essential for achieving high decoding accuracy. Note that the SVM exhibits fewer free parameters to be estimated during training. These parameters mainly correspond to the weights α, whose number directly depends on the number of training samples. The functional relationship between features and labels even depends only on the non-zero weights of the support vectors. Thus the SVM exhibits higher robustness to variations of the subset size. Second, imposing constraints on the HMM model structure was found to be important for achieving high decoding accuracies. As stated in the methods section, the introduction of the Bakis model does not substantially influence the number of free parameters; the subset size has a much higher impact (figure 6) on the overall number. Nevertheless, a performance gain of up to 8.8% per session, or 5.9% mean per subject across all sessions, was achieved for the LFTD features. This suggests that the improvements may not be due to a mere reduction in the absolute number of parameters but also due to better compliance of the model with the underlying temporal data structure, i.e. the evolution of feature characteristics over time. For the LFTD space, information is encoded in the signal phase and hence in the temporal structure. The fact that the Bakis model constrains this temporal structure hence entails interesting implications. Higher overall accuracies for the HG space, along with only minor benefits from the Bakis model, indicate that the clear advantages of model constraints mainly arise if the relevant information does not appear as prominently in the data. This is further supported by a higher increase in HMM performance when not using the optimal subset size for LFTD features (see supplementary figure SI9, available at stacks.iop.org/JNE/10/056020/mmedia). This suggests that the Bakis approach tends to be more robust with respect to deviations from the optimal parameter setting.

Finally, it should be mentioned that the current test procedure does not fully allow for non-stationary behaviour. That means that the random trial selection does not take the temporal evolution of signal properties into account. These might change and blur the feature space representation over time (Lemm et al 2011). In this context, and depending on the extent of non-stationarity, the presented results may slightly overestimate the real decoding rates. This applies to both SVM and HMM. Non-stationarity should be addressed in an online setup by updating the HMM on-the-fly. Training data should only be used from past recordings, and the signal characteristics of future test set data remain unknown. While online model updates are generally unproblematic for the HMM training procedure, investigations on this side are still needed.


5. Conclusions

For a defined offline case we have shown that high decoding rates can be achieved with both the HMM and SVM classifiers in most subjects. This suggests that the general choice of classifier is of less importance once HMM robustness is increased by introducing model constraints. The performance gap between both machine learning methods is sensitive to feature extraction and particularly to electrode subset selection, raising interesting implications for future research. This study restricted HMMs to a rigid time window with a known event, a scenario that suits SVM requirements. This condition may be relaxed and classification improved by use of the HMM time-warp properties (Sun et al 1994). These allow for the same spatio-temporal pattern of brain activity occurring at different temporal rates.

By exploiting the actual potential of such models, we expect them to be individually adapted to specific BCI applications. The idea of adaptive online learning may be implemented for non-stationary signal properties.

Future research may encompass extended and additional model constraints for special cases, as successfully used in ASR, such as fixed transition probabilities or the sharing of mixture distributions among different states. Finally, the incorporation of other sources of prior knowledge and context awareness may further optimize decoding accuracy.

Acknowledgments

The authors acknowledge the work of Fanny Quandt, who substantially contributed to the acquisition of the ECoG data. Supported by NINDS grant NS21135, Land Sachsen-Anhalt grant MK48-2009003 and ECHORD 231143. This work is partly funded by the German Ministry of Education and Research (BMBF) within the Forschungscampus STIMULATE (grant 03FO16102A).

References

Acharya M Fifer M Benz H Crone N and Thakor N 2010Electrocorticographic amplitude predicts finger positionsduring slow grasping motions of the hand J Neural Eng7 13

Ball T Schulze-Bonhage A Aertsen A and Mehring C 2009Differential representation of arm movement direction inrelation to cortical anatomy and function J Neural Eng6 016

Bhattacharyya A 1943 On a measure of divergence between twostatistical populations defined by their probability distributionsBull Calcutta Math Soc 35 99ndash1909

Chang C-C and Lin C-J 2011 LIBSVM a library for support vectormachines ACM Tran Intell Syst Technol 2 1ndash27

Cincotti F et al 2003 Comparison of different feature classifiers forbrain computer interfaces Proc 1st Int IEEE EMBS Conf onNeural Engineering (Capri Island Italy) pp 645ndash7

Crone N Miglioretti D Gordon B and Lesser R 1998 Functionalmapping of human sensorimotor cortex withelectrocorticographic spectral analysis II event-relatedsynchronization in the gamma bands Brain 121 2301ndash15

Davies D and Bouldin D 1979 A cluster separation measure IEEETrans Pattern Anal Mach Intell PAMI-1 224ndash7

Demirer R Ozerdem M and Bayrak C 2009 Classification ofimaginary movements in ECoG with a hybrid approach basedon multi-dimensional Hilbert-SVM solution J NeurosciMethods 178 214ndash8

Dempster A Laird N and Rubin D B 1977 Maximum likelihoodfrom incomplete data via the EM algorithm J R Stat Soc B39 1ndash38

Dethier J Gilja V Nuyujukian P Elassaad S A Shenoy K Vand Boahen K 2011 Spiking neural network decoder forbrain-machine interfaces EMBSrsquo11 Proc IEEE 5th Int Confon Neural Engineering in Medicine and Biological Society(Cancun Mexico) pp 396ndash9

Duda R O Hart P E and Stork D G 2001 Pattern Classification 2nd edn (New York: Interscience)

Fisher R A 1936 The use of multiple measurements in taxonomic problems Ann Eugen 7 179–88

Ganguly K and Carmena J M 2010 Neural correlates of skill acquisition with a cortical brain-machine interface J Motor Behav 42 355–60

Graimann B Huggins J Levine S and Pfurtscheller G 2004 Toward a direct brain interface based on human subdural recordings and wavelet-packet analysis IEEE Trans Biomed Eng 51 954–62

Guyon I and Elisseeff A 2003 An introduction to variable and feature selection J Mach Learn Res 3 1157–82

Guyon I Weston J Barnhill S and Vapnik V 2002 Gene selection for cancer classification using support vector machines Mach Learn 46 389–422

Hastie T Tibshirani R and Friedman J 2009 The Elements of Statistical Learning 2nd edn (New York: Springer)

Hoffmann U Vesin J and Diserens K 2007 An efficient P300-based brain-computer interface for disabled subjects J Neurosci Methods 167 115–25

Kubanek J Miller K Ojemann J Wolpaw J and Schalk G 2009 Decoding flexion of individual fingers using electrocorticographic signals in humans J Neural Eng 6 14

Lederman D and Tabrikian J 2012 Classification of multichannel EEG patterns using parallel hidden Markov models Med Biol Eng Comput 50 319–28

Lee H and Choi S 2003 PCA+HMM+SVM for EEG pattern classification Proc 7th Int Symp on Signal Processing and Its Applications vol 1 pp 541–4

Lemm S Blankertz B Dickhaus T and Mueller K-R 2011 Introduction to machine learning for brain imaging NeuroImage 56 387–99

Liang N and Bougrain L 2009 Decoding finger flexion using amplitude modulation from band-specific ECoG ESANN: Eur Symp on Artificial Neural Networks pp 467–72

Liu C Zhao H B Li C S and Wang H 2010 Classification of ECoG motor imagery tasks based on CSP and SVM BMEI: 3rd Int Conf on Biomedical Engineering and Informatics vol 2 pp 804–7

Lotte F Congedo M Lecuyer A Lamarche F and Arnaldi B 2007 A review of classification algorithms for EEG-based brain-computer interfaces J Neural Eng 4 R1–R13

Makeig S 1993 Auditory event-related dynamics of the EEG spectrum and effects of exposure to tones Electroencephalogr Clin Neurophysiol 86 283–93

Murphy K 2005 Hidden Markov model (HMM) toolbox for Matlab www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html

Obermaier B Guger C Neuper C and Pfurtscheller G 2001a Hidden Markov models for online classification of single trial EEG data Pattern Recognit Lett 22 1299–309

Obermaier B Guger C and Pfurtscheller G 1999 Hidden Markov models used for the offline classification of EEG data Biomed Tech/Biomed Eng 44 158–62

Obermaier B Neuper C Guger C and Pfurtscheller G 2001b Information transfer rate in a five-classes brain-computer interface IEEE Trans Neural Syst Rehabil Eng 9 283–8

Palaniappan R Chanan R and Syan S 2009 Current practices in electroencephalogram based brain-computer interfaces Encyclopedia of Information Science and Technology II (Hershey, PA: IGI Global) pp 888–901

Pascual-Marqui R D Michel C M and Lehmann D 1995 Segmentation of brain electrical activity into microstates: model estimation and validation IEEE Trans Biomed Eng 42 658–65

Pechenizkiy M 2005 The impact of feature extraction on the performance of a classifier: kNN, Naive Bayes and C4.5 Proc 18th Canadian Society Conf on Advances in Artificial Intelligence (Berlin: Springer) pp 268–79

Penfield W and Boldrey E 1937 Somatic motor and sensory representation in the cerebral cortex of man as studied by electrical stimulation Brain 60 389–443

Picton T W Lins O G and Scherg M 1995 The recording and analysis of event-related potentials Handbook of Neuropsychology ed F Boller and J Grafman (Amsterdam: Elsevier) vol 10 pp 3–73

Pistohl T Ball T Schulze-Bonhage A Aertsen A and Mehring C 2008 Prediction of arm movement trajectories from ECoG-recordings in humans J Neurosci Methods 167 105–14

Quandt F Reichert C Hinrichs H Heinze H Knight R and Rieger J 2012 Single trial discrimination of individual finger movements on one hand: a combined MEG and EEG study NeuroImage 59 3316–24

Rabiner L 1989 A tutorial on hidden Markov models and selected applications in speech recognition Proc IEEE 77 257–86

Schalk G 2010 Can electrocorticography (ECoG) support robust and powerful brain-computer interfaces? Front Neuroeng 3 1

Schukat-Talamazzini E G 1995 Automatische Spracherkennung: Grundlagen, statistische Modelle und effiziente Algorithmen Künstliche Intelligenz (Braunschweig: Vieweg)

Shenoy P Miller K Ojemann J and Rao R 2007 Finger movement classification for an electrocorticographic BCI CNE'07: 3rd Int IEEE/EMBS Conf on Neural Engineering pp 192–5

Sun D X Deng L and Wu C F J 1994 State-dependent time warping in the trended hidden Markov model Signal Process 39 263–75

Wissel T and Palaniappan R 2011 Considerations on strategies to improve EOG signal analysis Int J Artif Life Res 2 6–21

Yu L and Liu H 2004 Efficient feature selection via analysis of relevance and redundancy J Mach Learn Res 5 1205–24

Zhao H B Liu C Wang H and Li C S 2010 Classifying ECoG signals using probabilistic neural network ICIE'10: WASE Int Conf on Information Engineering vol 1 pp 77–80


Page 2: Hidden Markov model and support vector machine based ...knightlab.berkeley.edu/statics/publications/2014/10/02/...et al 2011, Acharya et al 2010,Ballet al 2009, Kubanek et al 2009,LiangandBougrain2009,Pistohletal2008),thesupport

OPEN ACCESSIOP PUBLISHING JOURNAL OF NEURAL ENGINEERING

J Neural Eng 10 (2013) 056020 (14pp) doi1010881741-2560105056020

Hidden Markov model and support vectormachine based decoding of fingermovements using electrocorticographyTobias Wissel1910 Tim Pfeiffer1 Robert Frysch1 Robert T Knight23Edward F Chang2 Hermann Hinrichs4567 Jochem W Rieger38

and Georg Rose1

1 Chair for Healthcare Telematics and Medical Engineering Otto-von-Guericke-University MagdeburgPostfach 4120 D-39016 Magdeburg Germany2 Department of Neurological Surgery University of California San Francisco 505 Parnassus AveM-779 San Francisco CA 94143-0112 USA3 Department of Psychology and the Helen Wills Neuroscience Institute University of CaliforniaBerkeley 132 Barker Hall Berkeley CA 94720-3190 USA4 Clinic of Neurology Otto-von-Guericke-University Magdeburg Leipziger Straszlige 44D-39120 Magdeburg Germany5 Leibniz-Institute for Neurobiology Brenneckestraszlige 6 D-39118 Magdeburg Germany6 German Center for Neurodegenerative Diseases (DZNE) Leipziger Straszlige 44 D-39120 MagdeburgGermany7 Center of Behavioural Brain Sciences (CBBS) Universitatsplatz 2 D-39106 Magdeburg Germany8 Applied Neurocognitive Psychology Faculty VI Carl-von-Ossietzky University D-26111 OldenburgGermany

E-mail wisselrobuni-luebeckde

Received 28 May 2013Accepted for publication 28 August 2013Published 18 September 2013Online at stacksioporgJNE10056020

AbstractObjective Support vector machines (SVM) have developed into a gold standard for accurateclassification in brainndashcomputer interfaces (BCI) The choice of the most appropriate classifierfor a particular application depends on several characteristics in addition to decoding accuracyHere we investigate the implementation of hidden Markov models (HMM) for online BCIsand discuss strategies to improve their performance Approach We compare the SVM servingas a reference and HMMs for classifying discrete finger movements obtained fromelectrocorticograms of four subjects performing a finger tapping experiment The classifierdecisions are based on a subset of low-frequency time domain and high gamma oscillationfeatures Main results We show that decoding optimization between the two approaches is dueto the way features are extracted and selected and less dependent on the classifier Anadditional gain in HMM performance of up to 6 was obtained by introducing modelconstraints Comparable accuracies of up to 90 were achieved with both SVM and HMMwith the high gamma cortical response providing the most important decoding information forboth techniques Significance We discuss technical HMM characteristics and adaptations in

9 Author to whom any correspondence should be addressed10 Present address Institute for Robotics and Cognitive Systems University of Luebeck Ratzeburger Allee 160 D-23538 Luebeck Germany

Content from this work may be used under the terms of the Creative Commons Attribution 30 licenceAny further distribution of this work must maintain attribution to the author(s) and the title of the work journal citation and DOI

1741-256013056020+14$3300 1 copy 2013 IOP Publishing Ltd Printed in the UK amp the USA

J Neural Eng 10 (2013) 056020 T Wissel et al

the context of the presented data as well as for general BCI applications Our findings suggestthat HMMs and their characteristics are promising for efficient online BCIs

S Online supplementary data available from stacksioporgJNE10056020mmedia

(Some figures may appear in colour only in the online journal)

1 Introduction

Brainndashcomputer interface (BCI) oriented research has aprinciple goal of aiding disabled people suffering from severemotor impairments (Hoffmann et al 2007 Palaniappan et al2009) The majority of research on BCI has been based onelectroencephalography (EEG) data and restricted to simpleexperimental tasks using a small set of commands In thesestudies (Cincotti et al 2003 Lee and Choi 2003 Obermaieret al 1999) information was extracted from a limited number ofEEG channels over scalp sites of the right and left hemisphere

Recently studies have employed invasive subduralelectrocorticogram (ECoG) recordings for BCI (Ganguly andCarmena 2010 Schalk 2010 Zhao et al 2010 Shenoy et al2007) ECoG is recorded for diagnosis in clinical populationseg for localization of epileptic foci The signal quality ofECoG recorded brain activity outperforms the EEG data withrespect to higher amplitudes and higher signal-to-noise ratio(SNR) higher spatial resolution and broader bandwidth (0ndash500 Hz) (Crone et al 1998 Schalk 2010) Thus ECoG-signalshave the potential for improving earlier results of featureextraction and signal classification (Schalk 2010)

While approaches have been made to correlate continuousmovement kinematics and brain activity exploiting regressiontechniques (eg Wiener filter Kalman filter and others Dethieret al 2011 Acharya et al 2010 Ball et al 2009 Kubanek et al2009 Liang and Bougrain 2009 Pistohl et al 2008) the supportvector machine (SVM) approach has been established as a goldstandard for deriving class labels in cases of a discrete set ofcontrol commands (Quandt et al 2012 Liu et al 2010 Zhaoet al 2010 Demirer et al 2009 Shenoy et al 2007) This is dueto high and reproducible classification performance as well asrobustness with a low number of training samples using SVMapproaches (Guyon et al 2002)

However depending on the individual problem distinctproperties of other machine learning methods may provideanother viable BCI approach Pascual-Marqui et al (1995)and Obermaier et al (1999) argued that brain states and statetransitions can explain components of observed human brainactivity Obermaier et al applied HMMs in offline classificationof non-invasive EEG recordings of brain activity to distinguishbetween two different imagined movements In support of theidea of distinct mental states they reported an improvementof BCI classification results in subjects who developed aconsistent imagination strategy resulting in more focused andreproducible brain activity patterns The authors concludedthat HMMs offer a promising means to classify mental activityin a BCI framework because of the inherent HMM flexibilityin structural degrees of freedom and simplicity Some studieshowever reveal unstable properties and lower HMM robustness

in high dimensional feature spaces The latter point is criticallyrelated to small training sets leading to degradation in theclassifierrsquos ability (Lotte et al 2007)

In this study we directly compare HMM and SVMclassification of ECoG data recorded during a finger tappingexperiment This was done after a careful optimization of bothclassifiersmdashfirst focusing on improvements of the featuresand secondly by introducing constraints to the HMMs reducingthe amount of required training data Additionally the problemof high dimensional feature spaces is addressed by featureselection It is known that feature extraction and selectionhave a crucial impact on classifier performance (Pechenizkiy2005 Guyon and Elisseeff 2003) However the exact effecton the decoding accuracy for given data characteristics andrepresentations of information in time and space remains tobe investigated Poorly constructed feature spaces cannot becompensated for by the classifier However an optimal featurespace does not make the choice of the classifier irrelevantA tendency toward unstable HMM behaviour as well as thedifferent ways HMM and SVM model feature space wasaddressed with the ECoG data set

Our hypothesis was that for ECoG classification of fingerrepresentations differences in performance between the twoapproaches would be due to the way features are extracted andselected and less influenced by the choice of the classifier Herewe demonstrate that constrained HMMs achieve decodingrates comparable to the SVM We first provide an overviewof how the data were acquired We then review the analysismethods involving feature extraction HMMs and SVMsThird results on feature space setup and feature selection areevaluated Finally we discuss the decoding performance ofthe HMM classifier and directly compare it to the one-vs-oneSVM

2 Material and methods

21 Patients experimental setup and data acquisition

ECoG data were recorded from four patients who volunteeredto participate (aged 18ndash35 years right handed male) Allpatients received subdural electrode implants for pre-surgicalplanning of epilepsy treatment at UC San Francisco CA USAThe electrode grid was solely placed based on clinical criteriaand covered cortical pre-motor motor somatosensory andtemporal areas Details on the exact grid placement for eachsubject can be obtained from supplementary figures SI1ndashSI4available from stacksioporgJNE10056020mmedia Thestudy was conducted in concordance with the local institutionalreview board approved protocol as well as the Declarationof Helsinki All patients gave their informed consent before

2

J Neural Eng 10 (2013) 056020 T Wissel et al

Table 1 The number of recorded and rejected trials per subject session and class (thumbindex fingermiddle fingerlittle finger)

Subject Session Trials Class breakdown Rejected trials (breakdown)

S1 1 357 6211310874 0 (0000)2 240 33907245 1 (0100)Total 587 95203180119 1 (0100)

S2 1 280 44979148 40 (513139)2 308 5110311143 0 (0000)3 271 38879254 77 (13232615)4 273 32888865 41 (817106)Total 1122 165365382210 158 (26534930)

S3 1 234 54707040 114 (16403622)2 249 47708349 104 (15313820)Total 483 10114015389 218 (31717442)

S4 1 359 5711311673 6 (2211)2 328 6310410655 33 (611115)Total 687 120217222128 39 (813126)

the recordings started The recordings did not interfere withthe treatment and entailed minimal additional risk for theparticipant At the time of recording all patients were off anti-epileptic medications They were withdrawn when they wereadmitted for ECoG monitoring

The recordings were obtained from a 64-channel grid of8 times 8 platinumndashiridium electrodes with 1 cm centre-to-centrespacing The diameter of the electrodes was 4 mm of which23 mm were exposed (except for subject 4 who had a 16 times 16electrode grid 4 mm centre-to-centre spacing) The ECoG wasrecorded with a hardware sampling frequency of 30517 Hzand was then down-sampled to 1017 Hz for storage and furtherprocessing During the recordings patients performed a serialreaction time task in which a number appeared on a screenindicating with which finger to press a key on the keyboard(1 indicated thumb 2 index 3 middle 5 little finger thering finger was not used) The next trial was automaticallyinitiated with a short randomized delay of (335 plusmn 36) ms afterthe previous key press (mean value for S1 session 1 othersin similar range) The patients were instructed to respond asrapidly and accurately as possible with all fingers remainingon a fixed key during the whole 6ndash8 min run For each trialthe requested as well as actually used finger was recordedThe rate of correct button presses was at or close to 100 forall subjects The classification trials were labeled according tothe finger that was actually moved Each patient participatedin 2ndash4 of these sessions (table 1) Subjects performed slightfinger movements requiring minimal force Topographic fingerrepresentations extend approximately 55 mm along the motorcortex (Penfield and Boldrey 1937) The timing of stimuluspresentations and button presses was recorded in auxiliaryanalogue channels synchronized with the brain data

Due to clinical constraints the time available for ECoGdata acquisition is restricted providing a challenging datasetfor any classifier The numbers of trials for classification arelisted in table 1 For pre-processing the time series was firsthighpass filtered using a cut-off frequency at 05 Hz as well asnotch filtered around the power line frequency (60 Hz) usingfrequency domain filtering The electric potentials on the gridwere re-referenced to common average reference (CAR)

Next the time series were visually inspected for theremaining intervals and artefacts caused by the measurement

hardware line noise loose contacts and sections with signalvariations not explained by normal physiology were excludedfrom the analysis Channels with epileptic activity wereremoved since epileptic spiking has massive high-frequencyspectral activity and can distort informative features in thisrange We also removed any epochs with spread of theepileptic activity to normal brain sites to avoid high-frequencyartefacts The procedure was chosen to ensure comparabilityacross subjects and sessions Trial rejection was carried outwithout any knowledge of the corresponding class labels andin advance of any investigations on classification accuraciesFinally the trials were aligned with respect to the detectedbutton press being the response for each movement request

22 Test framework

The methods described in the following assemble an overalltest framework as illustrated in figure 1 The classifier beingthe main focus here is embedded as the core part among othersupporting modules including pre-processing data groupingaccording to the testing scheme channel selection and featureextraction The framework includes two separate paths fortraining and testing data to ensure that no prior knowledgeabout the test data is used in the training process

23 Feature generation

Two different feature types have been extracted for thisstudy First time courses of low-frequency components wereobtained by spectral filtering in the Fourier domain These timecourses were limited to a specific interval of interest aroundthe button press containing the activation most predictive forthe four classes The chosen frequency range of the lowpassfilter covers a band of low-frequency components that containphase-locked event-related potentials (ERPs) and prominentlocal motor potentials (LMP) (Kubanek et al 2009) Theseare usually extracted by averaging out uncorrelated noise overmultiple trials (Makeig 1993) Here we focus on single-trialanalysis which may not reflect LMP activity (Picton et al1995) As a consequence we refer to this feature type as low-frequency time domain (LFTD) features Boundaries of timeinterval (length and location relative to the button press) and

3

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 1 Block diagram of the implemented test framework The decision for a particular class is provided by either SVM or HMM in bothcases they are exposed to an identical framework of pre-processing channel selection and feature extraction

lowpass filter cut-off were selected by grid search for eachsession and are stated in the results The data sampling wasreduced to the Nyquist frequency

Second event-related spectral perturbation (ERSP) wasmeasured using a sliding Hann-window approach (windowlength nw = 260 ms) For each window the square root ofthe power spectrum was computed by fast Fourier transform(FFT) The resulting coefficients were then averaged ina frequency band of interest Taking the square root toobtain the amplitude spectrum assigned a higher weight tohigh-frequency components compared to the original powerspectrum The resulting feature sequence can be used as ameasure for time-locked power fluctuation in the high gammaband neglecting explicit phase information (Graimann et al2004 Crone et al 1998 Makeig 1993)

The time interval relative to the button press and thewindow overlap and exact low and high cut-off frequency ofthe high gamma band most discriminative for the four classeswere selected by grid search The resulting parameters for eachsession are stated in the results section

A combined feature space including both high gamma(HG) as well as LFTD features was also evaluated The ratio pof how many LFTD feature sequences are used relative to thenumber of HG sequences was obtained by grid search Subsetselection beyond this ratio is described in the next subsection

24 Feature selection

241 General remarks on subset selection Featuresequences were computed for all 64 ECoG channels giving2 times 64 sequences in total From these sequences a subset ofsize O has been identified that contains the most relevantinformation about the finger labels A preference towardscertain channels is influenced by the grid placement as well asdistinct parts of the cortex differentially activated by the task

We tested several measures of feature selection suchas Fisherrsquos linear discriminant analysis (Fisher 1936) andBhattacharyya distance (Bhattacharyya 1943) and found thatthe DaviesndashBouldin (DB) index (Wissel and Palaniappan2011 Davies and Bouldin 1979) yielded the most robustand stable subset estimates for accurate class separation Inaddition to higher robustness the DB index also represents thefastest alternative computationally The DB index substantiallyoutperformed other methods for both SVM and HMMNevertheless it should be pointed out that observed DB subsetswere in good agreement with the ranking generated by the

normal vector w of a linear SVM Apart from so-called filtermethods such as the DB index an evaluation of these vectorentries after training the SVM would correspond to featureselection using a wrapper method (Guyon and Elisseeff 2003)Our study did not employ any wrapper methods to avoiddependences between feature selection and classifier This mayhave led to an additional bias towards SVM or HMM (Yu andLiu 2004) Based on the DB index we defined an algorithmthat automatically selects an O-dimensional subset of featuresequences as follows

242 DB index The index corresponds to a clusterseparation index measuring the overlap of two clusters in anarbitrarily high dimensional space where r isin R

T is one of Nsample vectors belonging to cluster Ci

μi = 1

N

sumrisinCi

r (1)

di = 1

N

sumrisinCi

r minus μi (2)

Ri j = di + di

μi minus μji j isin N+ i = j (3)

Using the cluster centroids μi isin RT and spread di matrix R =

Ri ji=1Nc j=1Nc contains the rate between the within-classspread and the between-class spread for all class combinationsThe smaller the index the less two clusters overlap in thatfeature space The index is computed for all features available

243 Feature selection procedure Covering differentaspects of class separation we developed a feature selectionalgorithm based on the DB index taking multiple classesinto account The algorithm aims at minimizing the mutualclass overlap for a subset of defined size O as follows Firsteach feature sequence is segmented into three parts of equallength The index matrix R measuring the cluster overlapbetween all classes i and j is then computed segment-wisefor each feature under consideration Subsequently all threematrices are averaged (geometric mean for a bias towardssmaller indices) for further computing The feature yieldingthe smallest mean index is selected since it provides the bestseparation on average from one class to at least one other classSecond for all class combinations the features correspondingto the minimal index per combination are added Each of

4

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 2 An HMM structure for one class consists of Q states emitting multi-dimensional observations ot at each time step The state aswell as the output sequence is governed by a set of probabilities including transition probabilities aij To assign a sequence Ot to a particularmodel the classifier decides for the highest probability P(Ot|model)

them optimally separates a certain finger combination Nextfeatures with the smallest geometric mean per class are addedif they are not part of the subset already Finally in case thepredefined number of elements in the subset O admits furtherfeatures vacancies are filled according to a sorted ranking ofindices averaged across class combinations per feature

For each trial this subset results in an O-dimensionalfeature vector varying over time t isin [0 T ] While thismultivariate sequence was fed directly into the HMM classifierthe OtimesT matrix Ot had to be reshaped into a 1times (T middotO) vectorfor the SVM The subset size O is obtained by exhaustivesearch on the training data This makes the difference betweenboth classifiers in data modelling obvious

While an HMM technically distinguishes between thetemporal development (columns in Ot) and the feature type(rows in Ot) the SVM concatenates all channels and theirtime courses within the same one-dimensional feature vectorThe potential flexibility of HMMs with respect to temporalphenomena arises substantially from this fact

25 Classification

251 Hidden Markov models Arising from a state machinestructure hidden Markov models (HMMs) comprise twostochastic processes First as illustrated in figure 2 a setof Q underlying states is used to describe a sequence Ot ofT O-dimensional feature vectors These states qt isin sii=1Q

are assumed to be hidden while each is associated with anO-dimensional multivariate Gaussian M-mixture distributionto generate the observed featuresmdashthe second stochasticprocess Both the state incident qt as well as the observedoutput variable ot at time t are subject to the Markov propertymdasha conditional dependence only on the last step in time t minus 1

Among other parameters the state sequence structure andits probable variations are described by prior probabilitiesP(q1 = si) and a transition matrix

A = ai ji j=1Q = P(qt+1 = s j|qt = si)i j=1Q

A more detailed account of the methodological background isgiven by Rabiner (1989)

For each of the four classes one HMM had to be definedwhich was trained using the iterative BaumndashWelch algorithmwith eight expectation maximization (EM) steps (Dempsteret al 1977) The classification on the test set was carried outfollowing a maximum likelihood approach as indicated by thecurly bracket in figure 2

In order to increase decoding performance the abovestandard HMM structure was constrained and adapted to theclassification problem This aims at optimizing the modelhypothesis to give a better fit and reduces the number of freeparameters to deal with limited training data

252 Model constraints First the transition matrix A wasconstrained to the Bakis model commonly used for wordmodelling in automatic speech processing (ASR) (Schukat-Talamazzini 1995) This model admits non-zero elementsonly on the upper triangular part of the transition matrixmdashspecifically on the main as well as first and second secondarydiagonal Thus this particular structure implies only lsquolooprsquolsquonextrsquo and lsquoskiprsquo transitions and suppresses state transitionson the remaining upper half Preliminary results using a fullupper triangular matrix instead revealed small probabilitiesie very unlikely state transitions for these suppressedtransitions throughout all finger models This suggests onlyminor contributions to the signal characteristics describedby that model A full ergodic transition matrix generated

5

J Neural Eng 10 (2013) 056020 T Wissel et al

poor decoding performances in general Recently a similarfinding was presented by Lederman and Tabrikian for EEGdata (Lederman and Tabrikian 2012)

Given five states containing one Gaussian the unleashedmodel would involve 4 times 1235 degrees of freedom while theBakis model uses 4 times 1219 for an exemplary selection of15 channels The significant increase of decoding performanceby dropping only a few parameters seems interesting withrespect to the nature of the underlying signal structure andsheds remarkable light on the importance of feature selectionfor HMMs An unleashed model using all 64 channels wouldhave 4 times 20835 free parameters

253 Model structure Experiments for all feature spaceshave been conducted in which the optimal number of states Qper HMM and the number of mixture distribution componentsM per state have been examined by grid search Sinceincreasing the degrees of freedom will degrade the qualityof model estimation high decoding rates have mainly beenobtained for a small number of states containing only a fewmixture components Based on this search a model structureof Q = 5 states and M = 1 mixture components has beenchosen as the best compromise Note that the actual optimummight vary slightly depending on the particular subject thefeature space and even on the session which is in line withother studies (Obermaier et al 1999 2001a 2001b)

254 Initialization The initial parameters of themultivariate probability densities have been set using a k-means clustering algorithm (Duda et al 2001) The algorithmtemporally clusters the vectors of the O times T feature matrix OtThe rows of this matrix were extended by a scaled temporalcounter adding a time stamp to each feature vector Thisincreases the probability of time-continuous state intervals inthe cluster output Then for each of the Q middot M clusters themean as well as the full covariance matrix have been calculatedexcluding the additional time stamp Finally to assign thesemixture distributions to specific states the clusters were sortedaccording to the mean of their corresponding time stampvalues This temporal order is suggested by the constrainedforward structure of the transition matrix The transition matrixis initialized with probability values on the diagonal thatare weighted to be c-times as likely as probabilities off thediagonal The constant c is being calculated as the quotientof the number of samples for a feature T and the number ofstates Q In case of multiple mixture components (M gt 1) themixture weight matrix was initialized using the normalizednumber of sample vectors assigned to each cluster

255 Implementation The basic implementation has beenderived from the open source functionalities on HMMsprovided by Kevin Murphy from the University of BritishColumbia (Murphy 2005) This framework its modificationsas well as the entire code developed for this study wereimplemented using MATLAB R2012b from MathWorks

256 Support vector machines As a gold-standardreference we used a one-vs-one SVM for multi-classclassification In our study training samples come in pairedassignments (xi yi)i=1N where xi is a 1 times (T middot O) featurevector derived from the ECoG data for one trial with itsclass assignment yi = ndash11 Equation (4) describes theclassification rule of the trained classifier for new samplesx A detailed derivation of this decision rule and the rationalebehind SVM-classification is given eg by Hastie et al (2009)

F(x) = sgn(sum

iαiyiK(xi x) + β0

)(4)

The αi isin [0 C] are weights for the training vectors and areestimated during the training process Training vectors withweights αi = 0 contribute to the solution of the classificationproblem and are called support vectors The offset of theseparating hyperplane from the origin is proportional to β0The constant C generally controls the trade-off betweenclassification error on the training set and smoothness ofthe decision boundary The numbers of training samplesper session (table 1) were in a similar range compared tothe resulting dimension of the SVM feature space T middot O(sim130minus400) Influenced by this fact the decoding accuracyof the SVM revealed almost no sensitivity with respect toregularization Since grid search yielded a stable plateau ofhigh performance for C gt 1 the regularization constant wasfixed to C = 1000 The linear kernel which has been usedthroughout the study is denoted by K(xi x) We further usedthe LIBSVM (wwwcsientuedutwsimcjlinlibsvm) packagefor Matlab developed by Chih-Chung Chang and Chih-Jen Lin(Chang and Lin 2011)

26 Considerations on testing

In order to evaluate the findings presented in the followingsections a five-fold cross-validation (CV) scheme was used(Lemm et al 2011) The assignment of a trial to a specific foldset was performed using uniform random permutations

Each CV procedure was repeated 30 times to averageout fluctuations in the resulting recognition rates arising fromrandom partitioning of the CV sets The number of repetitionswas chosen as a compromise to obtain stable estimates of thegeneralization error but also to avoid massive computationalload The training data was balanced between all classes totreat no class preferentially and to avoid the need for classprior probabilities Feature subset selection and estimation ofthe parameters of each classifier was performed only on datasets allocated for training in order to guarantee a fair estimateof the generalization error Due to prohibitive computationaleffort the feature spaces have been parameterized only oncefor each subject and session and are hence based on all samplesof each subject-specific data set In this context preliminaryexperiments indicated minimal differences to the decodingperformance (see table 2) when fixing the feature definitionsbased on the training sets only These experiments used nestedCV with an additional validation set for the feature spaceparameters

6

J Neural Eng 10 (2013) 056020 T Wissel et al

(a) (b)

(c) (d)

Figure 3 Optimal parameters of both feature spaces for each subject and session (a) Location and width for ROIs containing mostdiscriminative information for LFTD features (b) Optimal high cut-off frequency for the spectral lowpass filter to extract LFTD features(c) Location and width for ROIs containing most discriminative information for HG features (d) Low and high cut-off frequency for theoptimal HG frequency band

Table 2 HMM-decoding accuracies (in ) for parameteroptimization on full data and on training subsets Accuracies aregiven for subject 1mdashsession 1 Parameter optimization wasperformed on the full dataset (standard-CV routine) as well as thetraining subset (nested-CV routine)

Parameter optimization on LFTD HG LFTD + HG

Full dataset 807 plusmn 03 979 plusmn 01 984 plusmn 01Training subset 811 plusmn 10 978 plusmn 03 980 plusmn 02

3 Results

31 Feature and channel selection

In order to identify optimal parameters for feature extractiona grid search for each subject and session was performed Forthis optimization a preliminary feature selection was appliedaiming at a fixed subset size of O = 8 which was foundto be within the range of optimal subset sizes (see figure 6and table 3) Results for all sessions are illustrated by the barplots in figure 3 A sensitivity of the decoding rate has beenfound for temporal parameters such as location and width forthe time windows (ROIs) containing the most discriminativeinformation about LFTD or HG features (figures 3(a) and (c))

The highest impact on accuracy was found for the HGROI (mean optimal ranges S1 [minus1875 3375] ms S3 [minus400475] ms S4 [minus280 420] ms)

Further sensitivity is given for spectral parameters namelythe high cut-off frequency for the LFTD lowpass filter and thelow and high cut-off for the HG bandpass (figures 3(b) and(d)) For the former optimal frequencies ranged from 11ndash30 Hz and the optimal frequency bands for the latter all laywithin 60ndash300 Hz depending on the subject There was hardlyany impact on the decoding rate when changing the slidingwindow size and window overlap for the HG features Thesewere hence set to fixed values According to these parametersettings the mean numbers of samples T were 265 (subject 1)213 (subject 2) 31 (subject 3) and 22 (subject 4)

An example of the grid search procedure isshown in supplementary figures SI5ndashSI7 (available atstacksioporgJNE10056020mmedia) The plots showvarying decoding rates for the S1 session 1 within individualparameter subspaces The optimal parameters illustrated infigure 3 represent a single point in these spaces at peakingdecoding rate Based on these parameters figure 4 illustrates atypical result for a LFTD and HG feature sequence extractedfrom the shown raw ECoG signal (S1 session 2 channel 23middle finger) All feature sequences for this trialmdashconsistingof the entire set of channelsmdashare given in supplementaryfigure SI8

Using the resulting parameterization subsequent CV runssought to identify the optimal subset by tuning O based on thetraining set of the current fold Figure 5 illustrates the rateat which a particular channel was selected for each subject

7

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 4 The dashed grey time course shows a raw ECoG signal fora typical trial (S1 session 2 channel 23 middle finger) Theextracted feature sequences for LFTD (cut-off at 24 Hz) and HG(frequency band [85 Hz 270 Hz]) are overlaid as solid lines EachHG feature has been assigned to the time point at the centre of itscorresponding FFT window

Figure 5 Channel maps for all four subjects The channels arearranged according to the electrode grid and coloured with respectto their mean selection rate in repeated CV-runs Channels colouredlike the top of the colour bar tend to have a higher probability ofbeing selected

It evaluates the optimal subsets O across all sessions foldsand several CV repetitions Results are exemplarily shownfor high gamma features Full detail on subject-specific gridplacement and channel maps for LFTD and HG featurescan be found in supplementary figures SI1ndashSI4 (available atstacksioporgJNE10056020mmedia)

The channel maps reveal localized regions of highlyinformative channels (dark red) comprising only a small subsetwithin the full grid While these localizations are distinct andfocused for high gamma features informative channels spreadwider across the electrode grid for LFTD features This finding

is in line with an earlier study done by Kubanek et al (2009) andimplies that selected subsets for LFTD features tend to be lessstable across CV runs Combined subsets were selected fromboth feature spaces The selections indicated a preference fora higher proportion of LFTD features (table 3) This optimalproportion was determined by grid search

The optimal absolute subset size O was selected bysuccessively adding channels to the subset as defined by ourselection procedure and identifying the peak performanceFor the high gamma case figure 6 already shows a sharpincrease in decoding performance for a subset size belowfive Performance then increases less rapidly until it reachesa maximum usually in the range of 6ndash8 channels for the HGfeatures (see table 3 for detailed information on the results forthe other features) This behaviour again reflects the fact that alot of information is found in a small number of HG sequences

Increasing the subset size beyond this optimal size leadsto a decrease in performance for both SVM and HMM Thiscoincides with an increase of the number of free parameters inthe classifier model depending quadratically on the number ofchannels for HMMs (figure 6 bottom)

32 Decoding performances

Based on the configuration of the feature spaces and featureselection a combination space containing both LFTD andHG features yielded the best decoding accuracy across allsessions and subjects Figure 7 summarizes the highestHMM performances for each subject averaged across allsessions and compares them with the corresponding SVMaccuracies Except for subject 2 these accuracies were closeto or even exceeded 90 correct classifications Howeverfigure 8 shows that accuracies obtained only from the HGspace come very close to this combined case implying thatLFTD features contribute limited new information Usingtime domain features exclusively decreased performance inmost cases An exception is given by subject 2 whereLFTD features outperformed HG features This may beexplained by the wider spread of LFTD features across thecortex and the suboptimal grid placement (little coverageof motor cortex supplementary figure SI2 available atstacksioporgJNE10056020mmedia) The channel mapsshow informative channels at the very margin of the gridThis suggests that grid placement was not always optimal forcapturing the most informative electrodes

Differences in decoding accuracy compared to the SVMwere found to be small (table 4) and revealed no clearsuperiority of any classifier (figure 9) Excluding subject 2the mean performance difference (HMM-SVM) was ndash053for LFTD 01 for HG and ndash023 for combined featuresAccuracy trends for individual subjects are similar acrosssessions and vary less than across subjects (figure 9) Notethat irrespective of the suboptimal grid placement subject 2took part in four experimental sessions Here subject-specificvariables eg grid placement were important determinants ofdecoding success Thus we interpret our results at the subjectlevel and conclude that in three out of four subjects SVMand HMM decoding of finger movements led to comparableresults

8

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 6 Influence of channel selection on the HMM decoding accuracy using high gamma features only (S1 session 1) Changes indecoding accuracy when successively adding channels to the feature set applying either random or DB-index channel selection (top)Number of free HMM parameters for each number of channels (bottom)

Table 3 Number of selected channels averaged across sessions The averaged number of selected channels and the corresponding standardderivations are given for all subjects and all feature spaces For the combined space square brackets denote the corresponding breakdowninto its components The average is taken across all sessions and CV repetitions

Subject LFTD HG LFTD + HG

1 100 plusmn 00 60 plusmn 00 165 plusmn 05 ([100 plusmn 00]+[65 plusmn 07])2 65 plusmn 19 73 plusmn 21 128 plusmn 34 ([70 plusmn 45] + [58 plusmn 15])3 100 plusmn 00 80 plusmn 00 140 plusmn 10 ([60 plusmn 14] + [80 plusmn 00])4 100 plusmn 00 60 plusmn 14 160 plusmn 07 ([95 plusmn 07] + [65 plusmn 07])

A clear superiority for SVM classification was only foundfor subject 2 for whom the absolute decoding performanceswere significantly worse than those of the other three subjects

For the single spaces LFTD and HG table 4 shows themean performance loss for the unleashed HMM That refersto a case wherein the Bakis model constraints are relaxedto the full model For all except two sessions of subject 2(the full model had a higher performance of 02 for session3 and 07 for session 4) the Bakis model outperformedthe full model The results indicate a higher benefit for theLFTD features with an up to 88 performance gain forsubject 3 session 2 Higher accuracies were consistentlyachieved for feature subsets deviating from the optimalsize An example is illustrated for LFTD channel selection(subject 3 session 2) in supplementary figure SI9 (availableat stacksioporgJNE10056020mmedia) The plot comparesthe Bakis and full model similar to the procedure in figure 6While table 4 lists only the improvements for the optimalnumber of channels supplementary figure SI9 indicates that

the performance gain may be even higher when deviating fromthe optimal settings

Finally figure 10 illustrates the accuracy decompositionsinto true positive rates for each finger The results reveal similartrends across subjects and feature spaces While the thumband little finger provided the fewest training samples theirtrue positive rates tend to be the highest among all fingers Anexception is given for LFTD features for subject 2 Equivalentplots for the SVM are shown in supplementary figure SI10(available at stacksioporgJNE10056020mmedia)

For the classification of one feature sequence includingall T samples the HMM Matlab implementation took onaverage (3522 plusmn 0036) ms and the SVM implementationin C (00281 plusmn 00034) ms The mean computation time fora feature sequence was 02 ms plusmn 21μs (LFTD) 172 ms plusmn379 μs (HG) and 179 ms plusmn 200 μs (combined space) Theresults were obtained with an Intel Core i5-2500 K 34 GHz16 GB RAM Win 7 Pro 64-Bit Matlab 2012b for ten selectedchannels

9

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 7 Decoding accuracies (in ) for HMM (green) and SVM (beige) using LFTD + HG feature space (best case) Averageperformances across all sessions are shown for all subjects (S1ndashS4) as a mean result of a 30 times 5 CV procedure with their correspondingerror bars

Figure 8 Decoding accuracies (in ) for the HMMs for all feature spaces and sessions Bi Performances are averaged across 30 repetitionsof the five-fold CV procedure and displayed with their corresponding error bars

4 Discussion

An evaluation of feature space configurations revealed asensitivity of the decoding performance to the preciseparameterization of individual feature spaces Apart from thesegeneral aspects the grid search primarily allows for individualparameter deviations across subjects and sessions Importantlya sensitivity has been identified for the particular time intervalof interest used for the sliding window approach to extractHG features This indicates varying temporal dynamics acrosssubjects and may favour the application of HMMs in onlineapplications This sensitivity suggests that adapting to thesetemporal dynamics has an influence on the degree to whichinformation can be deduced from the data by both HMMand SVM However our investigations indicate that HMMsare less sensitive to changes of the ROI and give nearly

stable performances when changing the window overlapAs a model property the HMM is capable of modellinginformative sequences of different lengths and rates Thelatter characteristic originates from the explicit loop transitionwithin a particular model This means the HMM may coveruninformative samples of a temporal sequence with an idlestate before the informative part starts Underlying temporalprocesses that may persist longer or shorter are modelled bymore or less repeating basic model states Lacking any time-warp properties an SVM would expect a fixed sequence lengthand information rate meaning a fixed association between aparticular feature dimension oi and a specific point in time tiThis highlights the strict comparability of information withinone single feature dimension explicitly required by the SVMThis leads to the prediction of benefits for HMMs in onlineapplications where the time for the button press or other

10

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 9 Differences in decoding accuracy (in ) for the HMMs with respect to the SVM reference classifier (HMM-SVM) Results aredisplayed for all sessions Bi and feature spaces

Figure 10 Mean HMM decoding rates decomposed into true positive rates for each finger Results are shown for each feature space andaveraged across sessions and CV repetitions

Table 4 Final decoding accuracies (in ) listed for HMM (left) and SVM (right) Accuracies for the Bakis model are given for eachsubject-session and feature space by its mean and standard error across all 30 repetitions of the CV procedure Additionally for each subjectand feature space the mean across all sessions as well as the mean loss in performance for LFTD and HG when using an unconstrainedHMM is shown

HMM SVMSubjectndashsession LFTD HG LFTD + HG LFTD HG LFTD + HG

S1ndashB1 () 807 plusmn 03 979 plusmn 01 984 plusmn 01 858 plusmn 04 972 plusmn 02 983 plusmn 01S1ndashB2 () 788 plusmn 06 924 plusmn 04 953 plusmn 03 849 plusmn 05 934 plusmn 04 961 plusmn 03Mean () 798 plusmn 05 952 plusmn 03 969 plusmn 02 854 plusmn 05 953 plusmn 03 972 plusmn 02Unconstrained model () minus31 plusmn 05 minus09 plusmn 03 ndash ndash ndash ndashS2ndashB1 () 440 plusmn 10 359 plusmn 07 444 plusmn 06 514 plusmn 08 380 plusmn 08 557 plusmn 08S2ndashB2 () 502 plusmn 10 481 plusmn 07 524 plusmn 08 555 plusmn 07 480 plusmn 06 602 plusmn 08S2ndashB3 () 402 plusmn 07 363 plusmn 08 401 plusmn 08 496 plusmn 07 395 plusmn 09 515 plusmn 08S2ndashB4 () 309 plusmn 07 364 plusmn 09 384 plusmn 09 391 plusmn 07 380 plusmn 10 427 plusmn 09Mean () 412 plusmn 09 392 plusmn 08 438 plusmn 08 489 plusmn 07 409 plusmn 09 525 plusmn 09Unconstrained model () minus22 plusmn 09 minus04 plusmn 08 ndash ndash ndash ndashS3ndashB1 () 646 plusmn 07 851 plusmn 04 858 plusmn 04 612 plusmn 07 854 plusmn 05 888 plusmn 04S3ndashB2 () 801 plusmn 05 905 plusmn 03 937 plusmn 03 850 plusmn 04 914 plusmn 03 960 plusmn 03Mean () 724 plusmn 06 878 plusmn 04 898 plusmn 04 731 plusmn 06 884 plusmn 04 924 plusmn 03Unconstrained model () minus59 plusmn 07 minus25 plusmn 04 ndash ndash ndash ndashS4ndashB1 () 714 plusmn 05 835 plusmn 05 872 plusmn 04 685 plusmn 06 821 plusmn 04 867 plusmn 04S4ndashB2 () 697 plusmn 07 831 plusmn 03 879 plusmn 04 643 plusmn 06 823 plusmn 03 844 plusmn 04Mean () 706 plusmn 07 833 plusmn 03 876 plusmn 04 664 plusmn 06 822 plusmn 03 856 plusmn 04Unconstrained model () minus11 plusmn 06 minus31 plusmn 05 ndash ndash ndash ndash

11

J Neural Eng 10 (2013) 056020 T Wissel et al

events may not be known beforehand The classifier may actmore robustly on activity that is not perfectly time lockedImaging paradigms for motor imagery or similar experimentsmay provide scenarios in which a subject carries out an actionfor an unspecified time interval Given that varying temporalextent is likely the need for training the SVM for the entirerange of possible ROI locations and widths is not desirable

The SVM characteristics also imply a need to await thewhole sequence to make a decision for a class In contrastthe Viterbi algorithm allows HMMs to state at each incomingfeature vector how likely an assignment to a particular class isIn online applications this striking advantage provides moreprompt classification and may reduce computational overheadif only the baseline is employed There is no need to awaitfurther samples for a decision Note that this emphasizes theimplicit adaptation of HMMs to real-time applications whilethe SVM needs to classify the whole new subsequence againafter each incoming sample HMMs just update their currentprediction according to the incoming information They donot start from scratch The implementation used here did notmake use of this online feature since only the classificationof the entire trial was evaluated Another reason for the SVMbeing faster than the HMM is given by the non-optimizedimplementation in Matlab code of the latter The same appliesto the feature extraction Existing high performance librariesfor these standard signal processing techniques or evena hardware field-programmable gate array implementationmay significantly speed up the implementation The averageclassification time for the HMMs implies a maximal possiblesignal rate of about 285 Hz when classifying after eachsample Thus in terms of classification even with the currentimplementation real-time is perfectly feasible for most casesMore sophisticated implementations are used in the field ofspeech processing where classifications have to be performedon the basis of raw data with very high sampling rates of up to40 kHz

Second tuning the features to an informative subsetyielded less dependence on the particular classifier forthe presented offline case Preliminary analysis usingcompromised parameter sets for the feature spaces as wellas feature subset size led instead to a higher performancegap between both classifiers Adapting to each subjectindividually a strict superiority was not observed for theHMMs versus the SVM in subjects 1 and 3 For subject4 the HMM actually outperformed the SVM in all casesInterestingly this coincides with the grid resolution whichwas twice as high for this subject supporting the importanceof the degree and spatial precision of motor cortex coverageFor subject 2 decoding performances were significantlyworse in comparison with the other subjects This isprobably due to the grid placement not being optimal forthe classification (supplementary figure SI2 available atstacksioporgJNE10056020mmedia) Figure 5 shows nodistinct informative region and channels selected at a highrate are located at the outer boundary of the grid rather thanover motor cortices This hypothesis is further supported by thedecoding accuracies which are higher for LFTD than for HGfeatures in subject 2 Since informative regions were spread

more widely across the cortex for subject 2 (Kubanek et al2009) regions exhibiting critical high gamma motor regionactivity may have been located outside the actual grid

While the results suggest that high feature quality is themain factor for high performances in this ECoG data settwo key points have to be emphasized for the applicationof HMMs with respect to the present finger classificationproblem First due to a high number of free model parametersfeature selection is essential for achieving high decodingaccuracy Note that the SVM exhibits fewer free parametersto be estimated during training These parameters mainlycorrespond to the weights α The number of these directlydepends on the number of training samples The functionalrelationship between features and labels even depends on thenon-zero weights of the support vectors only Thus SVMexhibits higher robustness to variations of the subset sizeSecond imposing constraints on the HMM model structurewas found to be important for achieving high decodingaccuracies As stated in the methods section the introductionof the Bakis model does not substantially influence the numberof free parameters The subset size has a much higher impact(figure 6) on the overall number Nevertheless a performancegain of up to 88 per session or 59 mean per subjectacross all sessions was achieved for the LFTD features Thissuggests that improvements may not be due to a mere reductionin the absolute number of parameters but also due to bettercompliance of the model with the underlying temporal datastructure ie the evolution of feature characteristics over timeFor the LFTD space information is encoded in the signalphase and hence the temporal structure The fact that theBakis model constrains this temporal structure hence entailsinteresting implications Higher overall accuracies for the HGspace along with only minor benefits from the Bakis modelindicate that the clear advantages of model constraints mainlyarise if the relevant information does not appear as prominentin the data This is further supported by a higher increase inHMM performance when not using the optimal subset sizefor LFTD features (see supplementary figure SI9 availableat stacksioporgJNE10056020mmedia) This suggests thatthe Bakis approach tends to be more robust with respect todeviations from the optimal parameter setting

Finally it should be mentioned that the current testprocedure does not fully allow for non-stationary behaviourThat means that the random trial selection does not taketemporal evolution of signal properties into account Thesemight change and blur the feature space representation overtime (Lemm et al 2011) In this context and depending on theextent of non-stationarity the presented results may slightlyoverestimate the real decoding rates This applies for bothSVM and HMM Non-stationarity should be addressed inan online setup by updating the HMM on-the-fly Trainingdata should only be used from past recordings and the signalcharacteristics of future test set data remain unknown Whileonline model updates are generally unproblematic for theHMM training procedure investigations on this side are stillneeded

12

J Neural Eng 10 (2013) 056020 T Wissel et al

5 Conclusions

For a defined offline case we have shown that high decodingrates can be achieved with both the HMM and SVM classifiersin most subjects This suggests that the general choice ofclassifier is of less importance once HMM robustness isincreased by introducing model constraints The performancegap between both machine learning methods is sensitive tofeature extraction and particularly electrode subset selectionraising interesting implications for future research This studyrestricted HMMs to a rigid time window with a known eventmdasha scenario that suits SVM requirements This condition maybe relaxed and classification improved by use of the HMMtime-warp property (Sun et al 1994) These allow for the samespatio-temporal pattern of brain activity occurring at differenttemporal rates

By exploiting the actual potential of such models weexpect them to be individually adapted to specific BCIapplications The idea of adaptive online learning may beimplemented for non-stationary signal properties

Future research may encompass extended and additional model constraints for special cases, as successfully used in ASR, such as fixed transition probabilities or the sharing of mixture distributions among different states. Finally, the incorporation of other sources of prior knowledge and context awareness may further optimize decoding accuracy.

Acknowledgments

The authors acknowledge the work of Fanny Quandt, who substantially contributed to the acquisition of the ECoG data. Supported by NINDS grant NS21135, Land Sachsen-Anhalt grant MK48-2009003 and ECHORD 231143. This work is partly funded by the German Ministry of Education and Research (BMBF) within the Forschungscampus STIMULATE (grant 03FO16102A).

References

Acharya M, Fifer M, Benz H, Crone N and Thakor N 2010 Electrocorticographic amplitude predicts finger positions during slow grasping motions of the hand J Neural Eng 7 13
Ball T, Schulze-Bonhage A, Aertsen A and Mehring C 2009 Differential representation of arm movement direction in relation to cortical anatomy and function J Neural Eng 6 016
Bhattacharyya A 1943 On a measure of divergence between two statistical populations defined by their probability distributions Bull Calcutta Math Soc 35 99–109
Chang C-C and Lin C-J 2011 LIBSVM: a library for support vector machines ACM Trans Intell Syst Technol 2 1–27
Cincotti F et al 2003 Comparison of different feature classifiers for brain computer interfaces Proc 1st Int IEEE EMBS Conf on Neural Engineering (Capri Island, Italy) pp 645–7
Crone N, Miglioretti D, Gordon B and Lesser R 1998 Functional mapping of human sensorimotor cortex with electrocorticographic spectral analysis: II. Event-related synchronization in the gamma bands Brain 121 2301–15
Davies D and Bouldin D 1979 A cluster separation measure IEEE Trans Pattern Anal Mach Intell PAMI-1 224–7
Demirer R, Ozerdem M and Bayrak C 2009 Classification of imaginary movements in ECoG with a hybrid approach based on multi-dimensional Hilbert-SVM solution J Neurosci Methods 178 214–8
Dempster A, Laird N and Rubin D B 1977 Maximum likelihood from incomplete data via the EM algorithm J R Stat Soc B 39 1–38
Dethier J, Gilja V, Nuyujukian P, Elassaad S A, Shenoy K V and Boahen K 2011 Spiking neural network decoder for brain-machine interfaces EMBS'11: Proc IEEE 5th Int Conf on Neural Engineering in Medicine and Biological Society (Cancun, Mexico) pp 396–9
Duda R O, Hart P E and Stork D G 2001 Pattern Classification 2nd edn (New York: Interscience)
Fisher R A 1936 The use of multiple measurements in taxonomic problems Ann Eugen 7 179–88
Ganguly K and Carmena J M 2010 Neural correlates of skill acquisition with a cortical brain-machine interface J Motor Behav 42 355–60
Graimann B, Huggins J, Levine S and Pfurtscheller G 2004 Toward a direct brain interface based on human subdural recordings and wavelet-packet analysis IEEE Trans Biomed Eng 51 954–62
Guyon I and Elisseeff A 2003 An introduction to variable and feature selection J Mach Learn Res 3 1157–82
Guyon I, Weston J, Barnhill S and Vapnik V 2002 Gene selection for cancer classification using support vector machines Mach Learn 46 389–422
Hastie T, Tibshirani R and Friedman J 2009 The Elements of Statistical Learning 2nd edn (New York: Springer)
Hoffmann U, Vesin J and Diserens K 2007 An efficient P300-based brain-computer interface for disabled subjects J Neurosci Methods 167 115–25
Kubanek J, Miller K, Ojemann J, Wolpaw J and Schalk G 2009 Decoding flexion of individual fingers using electrocorticographic signals in humans J Neural Eng 6 14
Lederman D and Tabrikian J 2012 Classification of multichannel EEG patterns using parallel hidden Markov models Med Biol Eng Comput 50 319–28
Lee H and Choi S 2003 PCA+HMM+SVM for EEG pattern classification Proc 7th Int Symp on Signal Processing and Its Applications vol 1 pp 541–4
Lemm S, Blankertz B, Dickhaus T and Mueller K-R 2011 Introduction to machine learning for brain imaging NeuroImage 56 387–99
Liang N and Bougrain L 2009 Decoding finger flexion using amplitude modulation from band-specific ECoG ESANN: Eur Symp on Artificial Neural Networks pp 467–72
Liu C, Zhao H B, Li C S and Wang H 2010 Classification of ECoG motor imagery tasks based on CSP and SVM BMEI: 3rd Int Conf on Biomedical Engineering and Informatics vol 2 pp 804–7
Lotte F, Congedo M, Lecuyer A, Lamarche F and Arnaldi B 2007 A review of classification algorithms for EEG-based brain-computer interfaces J Neural Eng 4 R1–R13
Makeig S 1993 Auditory event-related dynamics of the EEG spectrum and effects of exposure to tones Electroencephalogr Clin Neurophysiol 86 283–93
Murphy K 2005 Hidden Markov model (HMM) toolbox for Matlab www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
Obermaier B, Guger C, Neuper C and Pfurtscheller G 2001a Hidden Markov models for online classification of single trial EEG data Pattern Recognit Lett 22 1299–309
Obermaier B, Guger C and Pfurtscheller G 1999 Hidden Markov models used for the offline classification of EEG data Biomed Tech/Biomed Eng 44 158–62
Obermaier B, Neuper C, Guger C and Pfurtscheller G 2001b Information transfer rate in a five-classes brain-computer interface IEEE Trans Neural Syst Rehabil Eng 9 283–8
Palaniappan R, Chanan R and Syan S 2009 Current practices in electroencephalogram based brain-computer interfaces Encyclopedia of Information Science and Technology II (Hershey, PA: IGI Global) pp 888–901
Pascual-Marqui R D, Michel C M and Lehmann D 1995 Segmentation of brain electrical activity into microstates: model estimation and validation IEEE Trans Biomed Eng 42 658–65
Pechenizkiy M 2005 The impact of feature extraction on the performance of a classifier: kNN, Naive Bayes and C4.5 Proc 18th Canadian Society Conf on Advances in Artificial Intelligence (Berlin: Springer) pp 268–79
Penfield W and Boldrey E 1937 Somatic motor and sensory representation in the cerebral cortex of man as studied by electrical stimulation Brain 60 389–443
Picton T W, Lins O G and Scherg M 1995 The recording and analysis of event-related potentials Handbook of Neuropsychology vol 10, ed F Boller and J Grafman (Amsterdam: Elsevier) pp 3–73
Pistohl T, Ball T, Schulze-Bonhage A, Aertsen A and Mehring C 2008 Prediction of arm movement trajectories from ECoG-recordings in humans J Neurosci Methods 167 105–14
Quandt F, Reichert C, Hinrichs H, Heinze H, Knight R and Rieger J 2012 Single trial discrimination of individual finger movements on one hand: a combined MEG and EEG study NeuroImage 59 3316–24
Rabiner L 1989 A tutorial on hidden Markov models and selected applications in speech recognition Proc IEEE 77 257–86
Schalk G 2010 Can electrocorticography (ECoG) support robust and powerful brain-computer interfaces? Front Neuroeng 3 1
Schukat-Talamazzini E G 1995 Automatische Spracherkennung: Grundlagen, statistische Modelle und effiziente Algorithmen (Künstliche Intelligenz) (Braunschweig: Vieweg)
Shenoy P, Miller K, Ojemann J and Rao R 2007 Finger movement classification for an electrocorticographic BCI CNE'07: 3rd Int IEEE/EMBS Conf on Neural Engineering pp 192–5
Sun D X, Deng L and Wu C F J 1994 State-dependent time warping in the trended hidden Markov model Signal Process 39 263–75
Wissel T and Palaniappan R 2011 Considerations on strategies to improve EOG signal analysis Int J Artif Life Res 2 6–21
Yu L and Liu H 2004 Efficient feature selection via analysis of relevance and redundancy J Mach Learn Res 5 1205–24
Zhao H B, Liu C, Wang H and Li C S 2010 Classifying ECoG signals using probabilistic neural network ICIE'10: WASE Int Conf on Information Engineering vol 1 pp 77–80


Kubanek J Miller K Ojemann J Wolpaw J and Schalk G 2009Decoding flexion of individual fingers usingelectrocorticographic signals in humans J Neural Eng 6 14

Lederman D and Tabrikian J 2012 Classification of multichannelEEG patterns using parallel hidden Markov models Med BiolEng Comput 50 319ndash28

Lee H and Choi S 2003 PCA+HMM+SVM for EEG patternclassification 7th Proc Int Symp on Signal Processing and ItsApplications vol 1 pp 541ndash4

Lemm S Blankertz B Dickhaus T and Mueller K-R 2011Introduction to machine learning for brain imagingNeuroImage 56 387ndash99

Liang N and Bougrain L 2009 Decoding finger flexion usingamplitude modulation from band-specific ECoG ESANN EurSymp on Artificial Neural Networks pp 467ndash72

Liu C Zhao H B Li C S and Wang H 2010 Classification of ECoGmotor imagery tasks based on CSP and SVM BMEI 3rd IntConf on Biomedical Engineering and Informatics vol 2pp 804ndash7

Lotte F Congedo M Lecuyer A Lamarche F and Arnaldi B 2007 Areview of classification algorithms for EEG-basedbrain-computer interfaces J Neural Eng 4 R1ndashR13

Makeig S 1993 Auditory event-related dynamics of the EEGspectrum and effects of exposure to tones ElectroencephalogrClin Neurophysiol 86 283ndash93

Murphy K 2005 Hidden Markov model (HMM) toolbox for matlabwwwcsubccasimmurphykSoftwareHMMhmmhtml

Obermaier B Guger C Neuper C and Pfurtscheller G 2001a HiddenMarkov models for online classification of single trial EEGdata Pattern Recognit Lett 22 1299ndash309

Obermaier B Guger C and Pfurtscheller G 1999 Hidden Markovmodels used for the offline classification of EEG data BiomedTechBiomed Eng 44 158ndash62

Obermaier B Neuper C Guger C and Pfurtscheller G 2001bInformation transfer rate in a five-classes brain-computerinterface IEEE Trans Neural Syst Rehabil Eng 9 283ndash8

Palaniappan R Chanan R and Syan S 2009 Current practices inelectroencephalogram based brain-computer interfaces

13

J Neural Eng 10 (2013) 056020 T Wissel et al

Encyclopedia of Information Science and Technology II(Hershey PA IGI Global) pp 888ndash901

Pasqual-Marqui R D Michel C M and Lehmann D 1995Segmentation of brain electrical activity into microstatesmodel estimation and validation IEEE Trans Biomed Eng42 658ndash65

Pechenizkiy M 2005 The impact of feature extraction on theperformance of a classifier kNN Naive Bayes and C45 Proc18th Canadian Society Conf on Advances in ArtificialIntelligence (Berlin Springer) pp 268ndash79

Penfield W and Boldrey E 1937 Somatic motor and sensoryrepresentation in the cerebral cortex of man as studied byelectrical stimulation Brain 60 389ndash443

Picton T W Lins O G and Scherg M 1995 The recording andanalysis of event-related potentials Handbook ofNeuropsychology ed F Boller and J Grafman (AmsterdamElsevier) vol 10 pp 3ndash73

Pistohl T Ball T Schulze-Bonhage A Aertsen A and Mehring C2008 Prediction of arm movement trajectories fromECoG-recordings in humans J Neurosci Methods 167 105ndash14

Quandt F Reichert C Hinrichs H Heinze H Knight R and Rieger J2012 Single trial discrimination of individual finger

movements on one hand a combined MEG and EEG studyNeuroImage 59 3316ndash24

Rabiner L 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77 257ndash86

Schalk G 2010 Can electrocorticography (ECoG) support robust andpowerful brain-computer interfaces Front Neuroeng 3 1

Schukat-Talamazzini E G 1995 Automatische SpracherkennungmdashGrundlagen statistische Modelle und effiziente AlgorithmenKunstliche Intelligenz (Braunschweig Vieweg)

Shenoy P Miller K Ojemann J and Rao R 2007 Finger movementclassification for an electrocorticographic BCI CNErsquo07 3rdInt IEEEEMBS Conf on Neural Engineering pp 192ndash5

Sun D X Deng L and Wu C F J 1994 State-dependent timewarping in the trended hidden Markov model Signal Proc39 263ndash75

Wissel T and Palaniappan R 2011 Considerations on strategies toimprove EOG signal analysis Int J Artif Life Res 2 6ndash21

Yu L and Liu H 2004 Efficient feature selection via analysis ofrelevance and redundancy J Mach Learn Res 5 1205ndash24

Zhao H B Liu C Wang H and Li C S 2010 Classifying ECoGsignals using probabilistic neural network ICIErsquo10 WASE IntConf on Information Engineering vol 1 pp 77ndash80

14

  • 1 Introduction
  • 2 Material and methods
    • 21 Patients experimental setup and data acquisition
    • 22 Test framework
    • 23 Feature generation
    • 24 Feature selection
    • 25 Classification
    • 26 Considerations on testing
      • 3 Results
        • 31 Feature and channel selection
        • 32 Decoding performances
          • 4 Discussion
          • 5 Conclusions
          • Acknowledgments
          • References
Page 4: Hidden Markov model and support vector machine based ...knightlab.berkeley.edu/statics/publications/2014/10/02/...et al 2011, Acharya et al 2010,Ballet al 2009, Kubanek et al 2009,LiangandBougrain2009,Pistohletal2008),thesupport

J Neural Eng 10 (2013) 056020 T Wissel et al

Table 1. The number of recorded and rejected trials per subject, session and class (thumb/index finger/middle finger/little finger).

Subject  Session  Trials  Class breakdown     Rejected trials (breakdown)
S1       1        357     62/113/108/74       0 (0/0/0/0)
S1       2        240     33/90/72/45         1 (0/1/0/0)
S1       Total    587     95/203/180/119      1 (0/1/0/0)
S2       1        280     44/97/91/48         40 (5/13/13/9)
S2       2        308     51/103/111/43       0 (0/0/0/0)
S2       3        271     38/87/92/54         77 (13/23/26/15)
S2       4        273     32/88/88/65         41 (8/17/10/6)
S2       Total    1122    165/365/382/210     158 (26/53/49/30)
S3       1        234     54/70/70/40         114 (16/40/36/22)
S3       2        249     47/70/83/49         104 (15/31/38/20)
S3       Total    483     101/140/153/89      218 (31/71/74/42)
S4       1        359     57/113/116/73       6 (2/2/1/1)
S4       2        328     63/104/106/55       33 (6/11/11/5)
S4       Total    687     120/217/222/128     39 (8/13/12/6)

… the recordings started. The recordings did not interfere with the treatment and entailed minimal additional risk for the participant. At the time of recording, all patients were off anti-epileptic medications; these had been withdrawn when the patients were admitted for ECoG monitoring.

The recordings were obtained from a 64-channel grid of 8 × 8 platinum–iridium electrodes with 1 cm centre-to-centre spacing. The diameter of the electrodes was 4 mm, of which 2.3 mm were exposed (except for subject 4, who had a 16 × 16 electrode grid with 4 mm centre-to-centre spacing). The ECoG was recorded with a hardware sampling frequency of 3051.7 Hz and was then down-sampled to 1017 Hz for storage and further processing. During the recordings, patients performed a serial reaction time task in which a number appeared on a screen indicating with which finger to press a key on the keyboard (1 indicated thumb, 2 index, 3 middle and 5 little finger; the ring finger was not used). The next trial was automatically initiated with a short randomized delay of (335 ± 36) ms after the previous key press (mean value for S1, session 1; others in a similar range). The patients were instructed to respond as rapidly and accurately as possible, with all fingers remaining on a fixed key during the whole 6–8 min run. For each trial, the requested as well as the actually used finger was recorded. The rate of correct button presses was at or close to 100% for all subjects. The classification trials were labeled according to the finger that was actually moved. Each patient participated in 2–4 of these sessions (table 1). Subjects performed slight finger movements requiring minimal force. Topographic finger representations extend approximately 55 mm along the motor cortex (Penfield and Boldrey 1937). The timing of stimulus presentations and button presses was recorded in auxiliary analogue channels synchronized with the brain data.

Due to clinical constraints, the time available for ECoG data acquisition is restricted, providing a challenging dataset for any classifier. The numbers of trials for classification are listed in table 1. For pre-processing, the time series was first highpass filtered using a cut-off frequency at 0.5 Hz as well as notch filtered around the power line frequency (60 Hz) using frequency domain filtering. The electric potentials on the grid were re-referenced to a common average reference (CAR).
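As an illustration only, a minimal MATLAB sketch of this pre-processing chain (frequency-domain highpass and notch filtering followed by CAR) could look as follows; the variable names and the simple brick-wall filter masks are our own assumptions rather than the authors' implementation.

    % ecog: [nSamples x nChannels] matrix of one recording (assumed layout), fs in Hz
    fs = 1017;
    [nS, nCh] = size(ecog);
    f = (0:nS-1)' * fs / nS;                        % two-sided frequency axis of the FFT
    fFold = min(f, fs - f);                         % fold to [0, fs/2]
    mask = ones(nS, 1);
    mask(fFold < 0.5) = 0;                          % highpass: remove drifts below 0.5 Hz
    mask(abs(fFold - 60) < 1) = 0;                  % notch around the 60 Hz power line frequency
    X = fft(ecog);                                  % frequency domain filtering, channel-wise
    ecogFilt = real(ifft(X .* repmat(mask, 1, nCh)));
    car = mean(ecogFilt, 2);                        % common average reference
    ecogCar = ecogFilt - repmat(car, 1, nCh);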

Next, the time series were visually inspected for the remaining intervals, and artefacts caused by the measurement hardware, line noise, loose contacts and sections with signal variations not explained by normal physiology were excluded from the analysis. Channels with epileptic activity were removed, since epileptic spiking has massive high-frequency spectral activity and can distort informative features in this range. We also removed any epochs with spread of the epileptic activity to normal brain sites to avoid high-frequency artefacts. The procedure was chosen to ensure comparability across subjects and sessions. Trial rejection was carried out without any knowledge of the corresponding class labels and in advance of any investigations on classification accuracies. Finally, the trials were aligned with respect to the detected button press, being the response for each movement request.

2.2. Test framework

The methods described in the following assemble an overall test framework as illustrated in figure 1. The classifier, being the main focus here, is embedded as the core part among other supporting modules, including pre-processing, data grouping according to the testing scheme, channel selection and feature extraction. The framework includes two separate paths for training and testing data to ensure that no prior knowledge about the test data is used in the training process.

2.3. Feature generation

Figure 1. Block diagram of the implemented test framework. The decision for a particular class is provided by either SVM or HMM; in both cases they are exposed to an identical framework of pre-processing, channel selection and feature extraction.

Two different feature types have been extracted for this study. First, time courses of low-frequency components were obtained by spectral filtering in the Fourier domain. These time courses were limited to a specific interval of interest around the button press containing the activation most predictive for the four classes. The chosen frequency range of the lowpass filter covers a band of low-frequency components that contains phase-locked event-related potentials (ERPs) and prominent local motor potentials (LMP) (Kubanek et al 2009). These are usually extracted by averaging out uncorrelated noise over multiple trials (Makeig 1993). Here we focus on single-trial analysis, which may not reflect LMP activity (Picton et al 1995). As a consequence, we refer to this feature type as low-frequency time domain (LFTD) features. Boundaries of the time interval (length and location relative to the button press) and the lowpass filter cut-off were selected by grid search for each session and are stated in the results. The data sampling was reduced to the Nyquist frequency.
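A minimal MATLAB sketch of this LFTD extraction for a single channel and trial is given below; the 24 Hz cut-off and the interval of interest are example values only (cf. figure 4), and the brick-wall Fourier-domain filter is our own simplification.

    % x: pre-processed ECoG of one channel and trial (column vector), fs = 1017 Hz,
    % idxPress: sample index of the detected button press (assumed to be known).
    fs   = 1017;
    fCut = 24;                                   % lowpass cut-off found by grid search (example)
    roi  = idxPress + round([-0.5 0.5] * fs);    % interval of interest around the button press (example)
    nS   = numel(x);
    f    = (0:nS-1)' * fs / nS;
    mask = double(min(f, fs - f) <= fCut);       % brick-wall lowpass in the Fourier domain
    xLow = real(ifft(fft(x) .* mask));
    seg  = xLow(roi(1):roi(2));                  % restrict to the interval of interest
    step = round(fs / (2 * fCut));               % reduce the sampling to the Nyquist frequency
    lftd = seg(1:step:end);                      % LFTD feature sequence of this channel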

Second, the event-related spectral perturbation (ERSP) was measured using a sliding Hann-window approach (window length nw = 260 ms). For each window, the square root of the power spectrum was computed by fast Fourier transform (FFT). The resulting coefficients were then averaged in a frequency band of interest. Taking the square root to obtain the amplitude spectrum assigned a higher weight to high-frequency components compared to the original power spectrum. The resulting feature sequence can be used as a measure for time-locked power fluctuation in the high gamma band, neglecting explicit phase information (Graimann et al 2004, Crone et al 1998, Makeig 1993).
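The following MATLAB sketch illustrates this ERSP computation for one channel; the window overlap, the high gamma band limits and the spectral normalization are example assumptions (the band shown matches the example in figure 4).

    % x: pre-processed ECoG segment of one channel and trial (column vector), fs = 1017 Hz.
    fs      = 1017;
    nw      = round(0.260 * fs);                 % 260 ms sliding window
    overlap = 0.75;                              % window overlap (assumed value)
    band    = [85 270];                          % high gamma band from grid search (example)
    win  = 0.5 - 0.5 * cos(2 * pi * (0:nw-1)' / (nw - 1));   % Hann window
    hop  = max(1, round(nw * (1 - overlap)));
    starts = 1:hop:(numel(x) - nw + 1);
    f      = (0:nw-1)' * fs / nw;
    inBand = (f >= band(1) & f <= band(2));
    hg = zeros(1, numel(starts));
    for k = 1:numel(starts)
        seg   = x(starts(k):starts(k) + nw - 1) .* win;
        amp   = sqrt(abs(fft(seg)).^2 / nw);     % square root of the power spectrum
        hg(k) = mean(amp(inBand));               % average within the high gamma band
    end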

The time interval relative to the button press, the window overlap and the exact low and high cut-off frequency of the high gamma band most discriminative for the four classes were selected by grid search. The resulting parameters for each session are stated in the results section.

A combined feature space including both high gamma (HG) and LFTD features was also evaluated. The ratio p of how many LFTD feature sequences are used relative to the number of HG sequences was obtained by grid search. Subset selection beyond this ratio is described in the next subsection.

2.4. Feature selection

2.4.1. General remarks on subset selection. Feature sequences were computed for all 64 ECoG channels, giving 2 × 64 sequences in total. From these sequences, a subset of size O has been identified that contains the most relevant information about the finger labels. A preference towards certain channels is influenced by the grid placement as well as by distinct parts of the cortex differentially activated by the task.

We tested several measures of feature selection, such as Fisher's linear discriminant analysis (Fisher 1936) and the Bhattacharyya distance (Bhattacharyya 1943), and found that the Davies–Bouldin (DB) index (Wissel and Palaniappan 2011, Davies and Bouldin 1979) yielded the most robust and stable subset estimates for accurate class separation. In addition to higher robustness, the DB index also represents the computationally fastest alternative. The DB index substantially outperformed other methods for both SVM and HMM. Nevertheless, it should be pointed out that the observed DB subsets were in good agreement with the ranking generated by the normal vector w of a linear SVM. Apart from so-called filter methods such as the DB index, an evaluation of these vector entries after training the SVM would correspond to feature selection using a wrapper method (Guyon and Elisseeff 2003). Our study did not employ any wrapper methods to avoid dependences between feature selection and classifier, which may have led to an additional bias towards SVM or HMM (Yu and Liu 2004). Based on the DB index, we defined an algorithm that automatically selects an O-dimensional subset of feature sequences as follows.

2.4.2. DB index. The index corresponds to a cluster separation index measuring the overlap of two clusters in an arbitrarily high-dimensional space, where r ∈ ℝ^T is one of N sample vectors belonging to cluster C_i:

$$\mu_i = \frac{1}{N} \sum_{r \in C_i} r \qquad (1)$$

$$d_i = \frac{1}{N} \sum_{r \in C_i} \lVert r - \mu_i \rVert \qquad (2)$$

$$R_{ij} = \frac{d_i + d_j}{\lVert \mu_i - \mu_j \rVert}, \qquad i, j \in \mathbb{N}^+,\ i \neq j. \qquad (3)$$

Using the cluster centroids μ_i ∈ ℝ^T and spreads d_i, the matrix R = {R_ij}, i = 1…N_c, j = 1…N_c, contains the ratio between the within-class spread and the between-class spread for all class combinations. The smaller the index, the less two clusters overlap in that feature space. The index is computed for all features available.
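A compact MATLAB sketch of equations (1)–(3) for one candidate feature sequence is given below; it is an illustrative helper of our own, not the authors' code. F holds the feature sequence of each trial in its columns and y the class labels.

    function R = db_index(F, y)
    % F: [T x nTrials] values of one feature sequence across trials, y: labels 1..Nc
    Nc = max(y);
    mu = zeros(size(F, 1), Nc);  d = zeros(1, Nc);
    for i = 1:Nc
        Ci      = F(:, y == i);                              % all trials of class i
        mu(:,i) = mean(Ci, 2);                               % centroid, equation (1)
        d(i)    = mean(sqrt(sum(bsxfun(@minus, Ci, mu(:,i)).^2, 1)));  % spread, equation (2)
    end
    R = inf(Nc);
    for i = 1:Nc
        for j = 1:Nc
            if i ~= j
                R(i,j) = (d(i) + d(j)) / norm(mu(:,i) - mu(:,j));      % equation (3)
            end
        end
    end
    end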

Figure 2. An HMM structure for one class consists of Q states emitting multi-dimensional observations o_t at each time step. The state as well as the output sequence is governed by a set of probabilities, including transition probabilities a_ij. To assign a sequence O_t to a particular model, the classifier decides for the highest probability P(O_t|model).

2.4.3. Feature selection procedure. Covering different aspects of class separation, we developed a feature selection algorithm based on the DB index taking multiple classes into account. The algorithm aims at minimizing the mutual class overlap for a subset of defined size O as follows. First, each feature sequence is segmented into three parts of equal length. The index matrix R, measuring the cluster overlap between all classes i and j, is then computed segment-wise for each feature under consideration. Subsequently, all three matrices are averaged (geometric mean, for a bias towards smaller indices) for further computing. The feature yielding the smallest mean index is selected, since it provides the best separation on average from one class to at least one other class. Second, for all class combinations, the features corresponding to the minimal index per combination are added. Each of them optimally separates a certain finger combination. Next, features with the smallest geometric mean per class are added if they are not part of the subset already. Finally, in case the predefined number of elements in the subset O admits further features, vacancies are filled according to a sorted ranking of indices averaged across class combinations per feature.
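A simplified MATLAB sketch of the segment-wise ranking underlying this procedure is shown below; it covers only the geometric-mean ranking step (the pairwise per-class-combination additions are omitted for brevity), reuses the db_index helper sketched above and treats the subset size O as given.

    % feats: cell array, feats{c} is the [T x nTrials] matrix of candidate feature c; y: labels.
    nFeat = numel(feats);
    score = zeros(1, nFeat);
    for c = 1:nFeat
        T   = size(feats{c}, 1);
        cut = round(linspace(0, T, 4));              % three segments of equal length
        Rseg = ones(max(y));
        for s = 1:3
            Rs   = db_index(feats{c}(cut(s)+1:cut(s+1), :), y);
            Rseg = Rseg .* Rs;                       % accumulate for the geometric mean
        end
        Rseg     = Rseg .^ (1/3);                    % geometric mean across segments
        offDiag  = Rseg(~eye(size(Rseg)));
        score(c) = mean(offDiag);                    % average overlap across class combinations
    end
    [~, order] = sort(score, 'ascend');              % smallest index = best separation
    subset = order(1:O);                             % fill the O-dimensional subset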

For each trial, this subset results in an O-dimensional feature vector varying over time t ∈ [0, T]. While this multivariate sequence was fed directly into the HMM classifier, the O × T matrix O_t had to be reshaped into a 1 × (T · O) vector for the SVM. The subset size O is obtained by exhaustive search on the training data. This makes the difference between both classifiers in data modelling obvious.

While an HMM technically distinguishes between the temporal development (columns in O_t) and the feature type (rows in O_t), the SVM concatenates all channels and their time courses within the same one-dimensional feature vector. The potential flexibility of HMMs with respect to temporal phenomena arises substantially from this fact.

2.5. Classification

2.5.1. Hidden Markov models. Arising from a state machine structure, hidden Markov models (HMMs) comprise two stochastic processes. First, as illustrated in figure 2, a set of Q underlying states is used to describe a sequence O_t of T O-dimensional feature vectors. These states q_t ∈ {s_i}, i = 1…Q, are assumed to be hidden, while each is associated with an O-dimensional multivariate Gaussian M-mixture distribution to generate the observed features—the second stochastic process. Both the state incident q_t as well as the observed output variable o_t at time t are subject to the Markov property—a conditional dependence only on the last step in time t − 1.

Among other parameters, the state sequence structure and its probable variations are described by prior probabilities P(q_1 = s_i) and a transition matrix

$$A = \{a_{ij}\}_{i,j=1\ldots Q} = \{P(q_{t+1} = s_j \mid q_t = s_i)\}_{i,j=1\ldots Q}.$$

A more detailed account of the methodological background is given by Rabiner (1989).

For each of the four classes, one HMM had to be defined, which was trained using the iterative Baum–Welch algorithm with eight expectation maximization (EM) steps (Dempster et al 1977). The classification on the test set was carried out following a maximum likelihood approach, as indicated by the curly bracket in figure 2.
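A self-contained MATLAB sketch of this maximum likelihood decision is given below; it uses a scaled forward recursion with single-Gaussian emissions and is only illustrative—the study relied on Murphy's HMM toolbox for training and likelihood evaluation, and mvnpdf requires the Statistics Toolbox.

    % models{c}: trained HMM of class c with fields prior (Q x 1), A (Q x Q),
    % mu (O x Q) and Sigma (O x O x Q); Ot: [O x T] feature sequence of a test trial.
    logL = zeros(1, numel(models));
    for c = 1:numel(models)
        m = models{c};
        Q = numel(m.prior);  T = size(Ot, 2);
        B = zeros(Q, T);                              % state-conditional observation likelihoods
        for q = 1:Q
            B(q, :) = mvnpdf(Ot', m.mu(:, q)', m.Sigma(:, :, q))';
        end
        alpha = m.prior .* B(:, 1);                   % scaled forward recursion
        logL(c) = log(sum(alpha));  alpha = alpha / sum(alpha);
        for t = 2:T
            alpha = (m.A' * alpha) .* B(:, t);
            logL(c) = logL(c) + log(sum(alpha));
            alpha = alpha / sum(alpha);
        end
    end
    [~, predictedFinger] = max(logL);                 % decide for the most likely finger model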

In order to increase decoding performance, the above standard HMM structure was constrained and adapted to the classification problem. This aims at optimizing the model hypothesis to give a better fit and reduces the number of free parameters to deal with limited training data.

2.5.2. Model constraints. First, the transition matrix A was constrained to the Bakis model commonly used for word modelling in automatic speech recognition (ASR) (Schukat-Talamazzini 1995). This model admits non-zero elements only on the upper triangular part of the transition matrix—specifically on the main as well as the first and second secondary diagonal. Thus, this particular structure implies only 'loop', 'next' and 'skip' transitions and suppresses state transitions on the remaining upper half. Preliminary results using a full upper triangular matrix instead revealed small probabilities, i.e. very unlikely state transitions, for these suppressed transitions throughout all finger models. This suggests only minor contributions to the signal characteristics described by that model. A full ergodic transition matrix generated poor decoding performances in general. Recently, a similar finding was presented by Lederman and Tabrikian for EEG data (Lederman and Tabrikian 2012).
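For illustration, the Bakis structure can be set up in MATLAB as a simple mask (our own sketch); since Baum–Welch re-estimation keeps zero transition probabilities at zero, the constraint is preserved throughout training.

    Q = 5;                                       % number of states (cf. section 2.5.3)
    bakis = diag(ones(Q, 1)) + diag(ones(Q-1, 1), 1) + diag(ones(Q-2, 1), 2);
    A0 = bakis ./ repmat(sum(bakis, 2), 1, Q);   % row-normalized starting transition matrix
    % Only 'loop', 'next' and 'skip' transitions are non-zero; all other entries
    % remain zero throughout the EM iterations.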

Given five states containing one Gaussian, the unconstrained model would involve 4 × 1235 degrees of freedom, while the Bakis model uses 4 × 1219 for an exemplary selection of 15 channels. The significant increase of decoding performance obtained by dropping only a few parameters seems interesting with respect to the nature of the underlying signal structure and sheds remarkable light on the importance of feature selection for HMMs. An unconstrained model using all 64 channels would have 4 × 20835 free parameters.

2.5.3. Model structure. Experiments for all feature spaces have been conducted in which the optimal number of states Q per HMM and the number of mixture distribution components M per state were examined by grid search. Since increasing the degrees of freedom will degrade the quality of model estimation, high decoding rates have mainly been obtained for a small number of states containing only a few mixture components. Based on this search, a model structure of Q = 5 states and M = 1 mixture component has been chosen as the best compromise. Note that the actual optimum might vary slightly depending on the particular subject, the feature space and even on the session, which is in line with other studies (Obermaier et al 1999, 2001a, 2001b).

2.5.4. Initialization. The initial parameters of the multivariate probability densities have been set using a k-means clustering algorithm (Duda et al 2001). The algorithm temporally clusters the vectors of the O × T feature matrix O_t. The rows of this matrix were extended by a scaled temporal counter, adding a time stamp to each feature vector. This increases the probability of time-continuous state intervals in the cluster output. Then, for each of the Q · M clusters, the mean as well as the full covariance matrix have been calculated, excluding the additional time stamp. Finally, to assign these mixture distributions to specific states, the clusters were sorted according to the mean of their corresponding time stamp values. This temporal order is suggested by the constrained forward structure of the transition matrix. The transition matrix is initialized with probability values on the diagonal that are weighted to be c times as likely as probabilities off the diagonal. The constant c is calculated as the quotient of the number of samples for a feature, T, and the number of states Q. In case of multiple mixture components (M > 1), the mixture weight matrix was initialized using the normalized number of sample vectors assigned to each cluster.
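The following MATLAB sketch illustrates this initialization for a single-Gaussian model; the scaling of the temporal counter is an assumed value, kmeans is taken from the Statistics Toolbox, and the Bakis mask from the sketch in section 2.5.2 is reused.

    % Ot: [O x T] feature matrix used for initialization; Q states, M = 1.
    Q = 5;
    tStamp = 0.1 * (1:size(Ot, 2));                     % scaled temporal counter (scale assumed)
    idx = kmeans([Ot; tStamp]', Q, 'replicates', 5);    % temporal k-means clustering
    O  = size(Ot, 1);
    mu = zeros(O, Q);  Sigma = zeros(O, O, Q);  tMean = zeros(1, Q);
    for q = 1:Q
        Xq           = Ot(:, idx == q);                 % cluster members, time stamp excluded
        mu(:, q)     = mean(Xq, 2);
        Sigma(:,:,q) = cov(Xq');                        % full covariance matrix
        tMean(q)     = mean(tStamp(idx == q));
    end
    [~, order] = sort(tMean);                           % assign clusters to states in temporal order
    mu = mu(:, order);  Sigma = Sigma(:, :, order);
    c  = size(Ot, 2) / Q;                               % diagonal weighting c = T / Q
    A0 = bakis .* (1 + (c - 1) * eye(Q));               % diagonal entries c times as likely
    A0 = A0 ./ repmat(sum(A0, 2), 1, Q);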

2.5.5. Implementation. The basic implementation has been derived from the open source functionalities on HMMs provided by Kevin Murphy from the University of British Columbia (Murphy 2005). This framework, its modifications, as well as the entire code developed for this study were implemented using MATLAB R2012b from MathWorks.

2.5.6. Support vector machines. As a gold-standard reference we used a one-vs-one SVM for multi-class classification. In our study, training samples come in paired assignments (x_i, y_i), i = 1…N, where x_i is a 1 × (T · O) feature vector derived from the ECoG data for one trial with its class assignment y_i ∈ {−1, 1}. Equation (4) describes the classification rule of the trained classifier for new samples x. A detailed derivation of this decision rule and the rationale behind SVM classification is given, e.g., by Hastie et al (2009).

$$F(x) = \operatorname{sgn}\Big(\sum_i \alpha_i y_i K(x_i, x) + \beta_0\Big) \qquad (4)$$

The α_i ∈ [0, C] are weights for the training vectors and are estimated during the training process. Training vectors with weights α_i ≠ 0 contribute to the solution of the classification problem and are called support vectors. The offset of the separating hyperplane from the origin is proportional to β_0. The constant C generally controls the trade-off between the classification error on the training set and the smoothness of the decision boundary. The numbers of training samples per session (table 1) were in a similar range compared to the resulting dimension of the SVM feature space T · O (~130–400). Influenced by this fact, the decoding accuracy of the SVM revealed almost no sensitivity with respect to regularization. Since grid search yielded a stable plateau of high performance for C > 1, the regularization constant was fixed to C = 1000. The linear kernel, which has been used throughout the study, is denoted by K(x_i, x). We further used the LIBSVM package (www.csie.ntu.edu.tw/~cjlin/libsvm) for Matlab developed by Chih-Chung Chang and Chih-Jen Lin (Chang and Lin 2011).
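Training and testing such a linear one-vs-one SVM with the LIBSVM Matlab interface can be sketched as follows; the reshaping of each O × T feature matrix into a single row and the variable names are our own assumptions, while '-t 0' (linear kernel) and '-c 1000' mirror the settings stated above.

    % trainTrials / testTrials: cell arrays of [O x T] feature matrices (T, O as in section 2.4),
    % ytrain / ytest: column vectors with finger labels 1..4.
    Xtrain = zeros(numel(trainTrials), T * O);
    for n = 1:numel(trainTrials)
        Xtrain(n, :) = reshape(trainTrials{n}', 1, T * O);    % concatenate channels and time courses
    end
    Xtest = zeros(numel(testTrials), T * O);
    for n = 1:numel(testTrials)
        Xtest(n, :) = reshape(testTrials{n}', 1, T * O);
    end
    model = svmtrain(ytrain, Xtrain, '-s 0 -t 0 -c 1000');    % C-SVC, linear kernel, C = 1000
    [pred, acc, ~] = svmpredict(ytest, Xtest, model);          % one-vs-one multi-class by default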

2.6. Considerations on testing

In order to evaluate the findings presented in the following sections, a five-fold cross-validation (CV) scheme was used (Lemm et al 2011). The assignment of a trial to a specific fold set was performed using uniform random permutations.

Each CV procedure was repeated 30 times to average out fluctuations in the resulting recognition rates arising from random partitioning of the CV sets. The number of repetitions was chosen as a compromise to obtain stable estimates of the generalization error while avoiding massive computational load. The training data were balanced between all classes to treat no class preferentially and to avoid the need for class prior probabilities. Feature subset selection and estimation of the parameters of each classifier were performed only on data sets allocated for training in order to guarantee a fair estimate of the generalization error. Due to prohibitive computational effort, the feature spaces have been parameterized only once for each subject and session and are hence based on all samples of each subject-specific data set. In this context, preliminary experiments indicated minimal differences in the decoding performance (see table 2) when fixing the feature definitions based on the training sets only. These experiments used nested CV with an additional validation set for the feature space parameters.
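A minimal MATLAB sketch of this repeated five-fold CV scheme is given below; evaluateFold is a hypothetical helper standing in for feature selection, training and testing on one split, and nTrials denotes the number of available trials.

    nReps = 30;  nFolds = 5;
    acc = zeros(nReps, nFolds);
    for rep = 1:nReps
        perm  = randperm(nTrials);                       % uniform random permutation
        folds = mod(0:nTrials-1, nFolds) + 1;            % balanced fold sizes
        folds = folds(perm);                             % random assignment of trials to folds
        for k = 1:nFolds
            testIdx  = find(folds == k);
            trainIdx = find(folds ~= k);
            acc(rep, k) = evaluateFold(trainIdx, testIdx);   % hypothetical helper
        end
    end
    meanAcc = mean(acc(:));                              % averaged over folds and repetitions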

Figure 3. Optimal parameters of both feature spaces for each subject and session. (a) Location and width of the ROIs containing the most discriminative information for LFTD features. (b) Optimal high cut-off frequency of the spectral lowpass filter to extract LFTD features. (c) Location and width of the ROIs containing the most discriminative information for HG features. (d) Low and high cut-off frequency of the optimal HG frequency band.

Table 2. HMM decoding accuracies (in %) for parameter optimization on the full data set and on the training subsets. Accuracies are given for subject 1, session 1. Parameter optimization was performed on the full dataset (standard CV routine) as well as on the training subset (nested CV routine).

Parameter optimization on    LFTD          HG            LFTD + HG
Full dataset                 80.7 ± 0.3    97.9 ± 0.1    98.4 ± 0.1
Training subset              81.1 ± 1.0    97.8 ± 0.3    98.0 ± 0.2

3. Results

3.1. Feature and channel selection

In order to identify optimal parameters for feature extraction, a grid search for each subject and session was performed. For this optimization, a preliminary feature selection was applied aiming at a fixed subset size of O = 8, which was found to be within the range of optimal subset sizes (see figure 6 and table 3). Results for all sessions are illustrated by the bar plots in figure 3. A sensitivity of the decoding rate has been found for temporal parameters such as location and width of the time windows (ROIs) containing the most discriminative information about LFTD or HG features (figures 3(a) and (c)).

The highest impact on accuracy was found for the HG ROI (mean optimal ranges: S1 [−187.5, 337.5] ms, S3 [−400, 475] ms, S4 [−280, 420] ms).

Further sensitivity is given for spectral parameters, namely the high cut-off frequency of the LFTD lowpass filter and the low and high cut-offs of the HG bandpass (figures 3(b) and (d)). For the former, optimal frequencies ranged from 11–30 Hz, and the optimal frequency bands for the latter all lay within 60–300 Hz, depending on the subject. There was hardly any impact on the decoding rate when changing the sliding window size and window overlap for the HG features. These were hence set to fixed values. According to these parameter settings, the mean numbers of samples T were 26.5 (subject 1), 21.3 (subject 2), 31 (subject 3) and 22 (subject 4).

An example of the grid search procedure is shown in supplementary figures SI5–SI7 (available at stacks.iop.org/JNE/10/056020/mmedia). The plots show varying decoding rates for S1, session 1 within individual parameter subspaces. The optimal parameters illustrated in figure 3 represent a single point in these spaces at peak decoding rate. Based on these parameters, figure 4 illustrates a typical result for an LFTD and an HG feature sequence extracted from the shown raw ECoG signal (S1, session 2, channel 23, middle finger). All feature sequences for this trial—consisting of the entire set of channels—are given in supplementary figure SI8.

Using the resulting parameterization, subsequent CV runs sought to identify the optimal subset by tuning O based on the training set of the current fold. Figure 5 illustrates the rate at which a particular channel was selected for each subject.

Figure 4. The dashed grey time course shows a raw ECoG signal for a typical trial (S1, session 2, channel 23, middle finger). The extracted feature sequences for LFTD (cut-off at 24 Hz) and HG (frequency band [85 Hz, 270 Hz]) are overlaid as solid lines. Each HG feature has been assigned to the time point at the centre of its corresponding FFT window.

Figure 5. Channel maps for all four subjects. The channels are arranged according to the electrode grid and coloured with respect to their mean selection rate in repeated CV runs. Channels coloured like the top of the colour bar tend to have a higher probability of being selected.

It evaluates the optimal subsets O across all sessions, folds and several CV repetitions. Results are exemplarily shown for high gamma features. Full detail on subject-specific grid placement and channel maps for LFTD and HG features can be found in supplementary figures SI1–SI4 (available at stacks.iop.org/JNE/10/056020/mmedia).

The channel maps reveal localized regions of highly informative channels (dark red) comprising only a small subset within the full grid. While these localizations are distinct and focused for high gamma features, informative channels spread wider across the electrode grid for LFTD features. This finding is in line with an earlier study by Kubanek et al (2009) and implies that selected subsets for LFTD features tend to be less stable across CV runs. Combined subsets were selected from both feature spaces. The selections indicated a preference for a higher proportion of LFTD features (table 3). This optimal proportion was determined by grid search.

The optimal absolute subset size O was selected by successively adding channels to the subset as defined by our selection procedure and identifying the peak performance. For the high gamma case, figure 6 already shows a sharp increase in decoding performance for a subset size below five. Performance then increases less rapidly until it reaches a maximum, usually in the range of 6–8 channels for the HG features (see table 3 for detailed information on the results for the other features). This behaviour again reflects the fact that a lot of information is found in a small number of HG sequences.

Increasing the subset size beyond this optimal size leads to a decrease in performance for both SVM and HMM. This coincides with an increase of the number of free parameters in the classifier model, which depends quadratically on the number of channels for HMMs (figure 6, bottom).

3.2. Decoding performances

Based on the configuration of the feature spaces and feature selection, a combination space containing both LFTD and HG features yielded the best decoding accuracy across all sessions and subjects. Figure 7 summarizes the highest HMM performances for each subject averaged across all sessions and compares them with the corresponding SVM accuracies. Except for subject 2, these accuracies were close to or even exceeded 90% correct classifications. However, figure 8 shows that accuracies obtained only from the HG space come very close to this combined case, implying that LFTD features contribute limited new information. Using time domain features exclusively decreased performance in most cases. An exception is given by subject 2, where LFTD features outperformed HG features. This may be explained by the wider spread of LFTD features across the cortex and the suboptimal grid placement (little coverage of motor cortex, supplementary figure SI2, available at stacks.iop.org/JNE/10/056020/mmedia). The channel maps show informative channels at the very margin of the grid. This suggests that grid placement was not always optimal for capturing the most informative electrodes.

Differences in decoding accuracy compared to the SVM were found to be small (table 4) and revealed no clear superiority of either classifier (figure 9). Excluding subject 2, the mean performance difference (HMM − SVM) was −0.53% for LFTD, 0.1% for HG and −0.23% for combined features. Accuracy trends for individual subjects are similar across sessions and vary less than across subjects (figure 9). Note that, irrespective of the suboptimal grid placement, subject 2 took part in four experimental sessions. Here, subject-specific variables, e.g. grid placement, were important determinants of decoding success. Thus, we interpret our results at the subject level and conclude that in three out of four subjects SVM and HMM decoding of finger movements led to comparable results.

Figure 6. Influence of channel selection on the HMM decoding accuracy using high gamma features only (S1, session 1). Changes in decoding accuracy when successively adding channels to the feature set applying either random or DB-index channel selection (top). Number of free HMM parameters for each number of channels (bottom).

Table 3. Number of selected channels averaged across sessions. The averaged number of selected channels and the corresponding standard deviations are given for all subjects and all feature spaces. For the combined space, square brackets denote the corresponding breakdown into its components. The average is taken across all sessions and CV repetitions.

Subject    LFTD          HG           LFTD + HG
1          10.0 ± 0.0    6.0 ± 0.0    16.5 ± 0.5 ([10.0 ± 0.0] + [6.5 ± 0.7])
2          6.5 ± 1.9     7.3 ± 2.1    12.8 ± 3.4 ([7.0 ± 4.5] + [5.8 ± 1.5])
3          10.0 ± 0.0    8.0 ± 0.0    14.0 ± 1.0 ([6.0 ± 1.4] + [8.0 ± 0.0])
4          10.0 ± 0.0    6.0 ± 1.4    16.0 ± 0.7 ([9.5 ± 0.7] + [6.5 ± 0.7])

A clear superiority of SVM classification was only found for subject 2, for whom the absolute decoding performances were significantly worse than those of the other three subjects.

For the single spaces LFTD and HG, table 4 shows the mean performance loss for the unconstrained HMM. That refers to a case wherein the Bakis model constraints are relaxed to the full model. For all except two sessions of subject 2 (the full model had a higher performance of 0.2% for session 3 and 0.7% for session 4), the Bakis model outperformed the full model. The results indicate a higher benefit for the LFTD features, with an up to 8.8% performance gain for subject 3, session 2. Higher accuracies relative to the full model were consistently achieved for feature subsets deviating from the optimal size. An example is illustrated for LFTD channel selection (subject 3, session 2) in supplementary figure SI9 (available at stacks.iop.org/JNE/10/056020/mmedia). The plot compares the Bakis and full model, similar to the procedure in figure 6. While table 4 lists only the improvements for the optimal number of channels, supplementary figure SI9 indicates that the performance gain may be even higher when deviating from the optimal settings.

Finally, figure 10 illustrates the accuracy decompositions into true positive rates for each finger. The results reveal similar trends across subjects and feature spaces. While the thumb and little finger provided the fewest training samples, their true positive rates tend to be the highest among all fingers. An exception is given for LFTD features for subject 2. Equivalent plots for the SVM are shown in supplementary figure SI10 (available at stacks.iop.org/JNE/10/056020/mmedia).

For the classification of one feature sequence including all T samples, the HMM Matlab implementation took on average (3.522 ± 0.036) ms and the SVM implementation in C (0.0281 ± 0.0034) ms. The mean computation time for a feature sequence was 0.2 ms ± 21 μs (LFTD), 1.72 ms ± 379 μs (HG) and 1.79 ms ± 200 μs (combined space). The results were obtained with an Intel Core i5-2500K, 3.4 GHz, 16 GB RAM, Win 7 Pro 64-bit, Matlab 2012b, for ten selected channels.

Figure 7. Decoding accuracies (in %) for HMM (green) and SVM (beige) using the LFTD + HG feature space (best case). Average performances across all sessions are shown for all subjects (S1–S4) as a mean result of a 30 × 5 CV procedure with their corresponding error bars.

Figure 8. Decoding accuracies (in %) for the HMMs for all feature spaces and sessions Bi. Performances are averaged across 30 repetitions of the five-fold CV procedure and displayed with their corresponding error bars.

4. Discussion

An evaluation of feature space configurations revealed a sensitivity of the decoding performance to the precise parameterization of individual feature spaces. Apart from these general aspects, the grid search primarily allows for individual parameter deviations across subjects and sessions. Importantly, a sensitivity has been identified for the particular time interval of interest used for the sliding window approach to extract HG features. This indicates varying temporal dynamics across subjects and may favour the application of HMMs in online applications. This sensitivity suggests that adapting to these temporal dynamics has an influence on the degree to which information can be deduced from the data by both HMM and SVM. However, our investigations indicate that HMMs are less sensitive to changes of the ROI and give nearly stable performances when changing the window overlap. As a model property, the HMM is capable of modelling informative sequences of different lengths and rates. The latter characteristic originates from the explicit loop transition within a particular model. This means the HMM may cover uninformative samples of a temporal sequence with an idle state before the informative part starts. Underlying temporal processes that persist longer or shorter are modelled by repeating basic model states more or less often. Lacking any time-warp properties, an SVM would expect a fixed sequence length and information rate, meaning a fixed association between a particular feature dimension o_i and a specific point in time t_i. This highlights the strict comparability of information within one single feature dimension explicitly required by the SVM. This leads to the prediction of benefits for HMMs in online applications where the time of the button press or other events may not be known beforehand. The classifier may act more robustly on activity that is not perfectly time locked. Imaging paradigms for motor imagery or similar experiments may provide scenarios in which a subject carries out an action for an unspecified time interval. Given that a varying temporal extent is likely, having to train the SVM for the entire range of possible ROI locations and widths is not desirable.

Figure 9. Differences in decoding accuracy (in %) for the HMMs with respect to the SVM reference classifier (HMM − SVM). Results are displayed for all sessions Bi and feature spaces.

Figure 10. Mean HMM decoding rates decomposed into true positive rates for each finger. Results are shown for each feature space and averaged across sessions and CV repetitions.

Table 4. Final decoding accuracies (in %) listed for HMM (left) and SVM (right). Accuracies for the Bakis model are given for each subject–session and feature space by their mean and standard error across all 30 repetitions of the CV procedure. Additionally, for each subject and feature space, the mean across all sessions as well as the mean loss in performance for LFTD and HG when using an unconstrained HMM is shown.

                           HMM                                          SVM
Subject–session            LFTD         HG           LFTD + HG         LFTD         HG           LFTD + HG
S1–B1 (%)                  80.7 ± 0.3   97.9 ± 0.1   98.4 ± 0.1        85.8 ± 0.4   97.2 ± 0.2   98.3 ± 0.1
S1–B2 (%)                  78.8 ± 0.6   92.4 ± 0.4   95.3 ± 0.3        84.9 ± 0.5   93.4 ± 0.4   96.1 ± 0.3
Mean (%)                   79.8 ± 0.5   95.2 ± 0.3   96.9 ± 0.2        85.4 ± 0.5   95.3 ± 0.3   97.2 ± 0.2
Unconstrained model (%)    −3.1 ± 0.5   −0.9 ± 0.3   –                 –            –            –
S2–B1 (%)                  44.0 ± 1.0   35.9 ± 0.7   44.4 ± 0.6        51.4 ± 0.8   38.0 ± 0.8   55.7 ± 0.8
S2–B2 (%)                  50.2 ± 1.0   48.1 ± 0.7   52.4 ± 0.8        55.5 ± 0.7   48.0 ± 0.6   60.2 ± 0.8
S2–B3 (%)                  40.2 ± 0.7   36.3 ± 0.8   40.1 ± 0.8        49.6 ± 0.7   39.5 ± 0.9   51.5 ± 0.8
S2–B4 (%)                  30.9 ± 0.7   36.4 ± 0.9   38.4 ± 0.9        39.1 ± 0.7   38.0 ± 1.0   42.7 ± 0.9
Mean (%)                   41.2 ± 0.9   39.2 ± 0.8   43.8 ± 0.8        48.9 ± 0.7   40.9 ± 0.9   52.5 ± 0.9
Unconstrained model (%)    −2.2 ± 0.9   −0.4 ± 0.8   –                 –            –            –
S3–B1 (%)                  64.6 ± 0.7   85.1 ± 0.4   85.8 ± 0.4        61.2 ± 0.7   85.4 ± 0.5   88.8 ± 0.4
S3–B2 (%)                  80.1 ± 0.5   90.5 ± 0.3   93.7 ± 0.3        85.0 ± 0.4   91.4 ± 0.3   96.0 ± 0.3
Mean (%)                   72.4 ± 0.6   87.8 ± 0.4   89.8 ± 0.4        73.1 ± 0.6   88.4 ± 0.4   92.4 ± 0.3
Unconstrained model (%)    −5.9 ± 0.7   −2.5 ± 0.4   –                 –            –            –
S4–B1 (%)                  71.4 ± 0.5   83.5 ± 0.5   87.2 ± 0.4        68.5 ± 0.6   82.1 ± 0.4   86.7 ± 0.4
S4–B2 (%)                  69.7 ± 0.7   83.1 ± 0.3   87.9 ± 0.4        64.3 ± 0.6   82.3 ± 0.3   84.4 ± 0.4
Mean (%)                   70.6 ± 0.7   83.3 ± 0.3   87.6 ± 0.4        66.4 ± 0.6   82.2 ± 0.3   85.6 ± 0.4
Unconstrained model (%)    −1.1 ± 0.6   −3.1 ± 0.5   –                 –            –            –

The SVM characteristics also imply the need to await the whole sequence before making a decision for a class. In contrast, the Viterbi algorithm allows HMMs to state at each incoming feature vector how likely an assignment to a particular class is. In online applications this striking advantage provides more prompt classification and may reduce computational overhead if only the baseline is employed; there is no need to await further samples for a decision. Note that this emphasizes the implicit adaptation of HMMs to real-time applications: while the SVM needs to classify the whole new subsequence again after each incoming sample, HMMs just update their current prediction according to the incoming information and do not start from scratch. The implementation used here did not make use of this online feature, since only the classification of the entire trial was evaluated. Another reason for the SVM being faster than the HMM is the non-optimized Matlab implementation of the latter. The same applies to the feature extraction. Existing high-performance libraries for these standard signal processing techniques, or even a hardware field-programmable gate array implementation, may significantly speed up the implementation. The average classification time for the HMMs implies a maximal possible signal rate of about 285 Hz when classifying after each sample. Thus, in terms of classification, real-time operation is perfectly feasible for most cases even with the current implementation. More sophisticated implementations are used in the field of speech processing, where classifications have to be performed on the basis of raw data with very high sampling rates of up to 40 kHz.

Second, tuning the features to an informative subset yielded less dependence on the particular classifier for the presented offline case. Preliminary analysis using compromised parameter sets for the feature spaces as well as for the feature subset size instead led to a higher performance gap between both classifiers. When adapting to each subject individually, a strict superiority of the HMMs versus the SVM was not observed in subjects 1 and 3. For subject 4, the HMM actually outperformed the SVM in all cases. Interestingly, this coincides with the grid resolution, which was twice as high for this subject, supporting the importance of the degree and spatial precision of motor cortex coverage. For subject 2, decoding performances were significantly worse in comparison with the other subjects. This is probably due to the grid placement not being optimal for the classification (supplementary figure SI2, available at stacks.iop.org/JNE/10/056020/mmedia). Figure 5 shows no distinct informative region, and channels selected at a high rate are located at the outer boundary of the grid rather than over motor cortices. This hypothesis is further supported by the decoding accuracies, which are higher for LFTD than for HG features in subject 2. Since informative regions were spread more widely across the cortex for subject 2 (Kubanek et al 2009), regions exhibiting critical high gamma motor region activity may have been located outside the actual grid.

While the results suggest that high feature quality is the main factor for high performances in this ECoG data set, two key points have to be emphasized for the application of HMMs with respect to the present finger classification problem. First, due to a high number of free model parameters, feature selection is essential for achieving high decoding accuracy. Note that the SVM exhibits fewer free parameters to be estimated during training. These parameters mainly correspond to the weights α, whose number directly depends on the number of training samples. The functional relationship between features and labels even depends on the non-zero weights of the support vectors only. Thus, the SVM exhibits higher robustness to variations of the subset size. Second, imposing constraints on the HMM model structure was found to be important for achieving high decoding accuracies. As stated in the methods section, the introduction of the Bakis model does not substantially influence the number of free parameters; the subset size has a much higher impact (figure 6) on the overall number. Nevertheless, a performance gain of up to 8.8% per session, or 5.9% mean per subject across all sessions, was achieved for the LFTD features. This suggests that improvements may not be due to a mere reduction in the absolute number of parameters but also due to better compliance of the model with the underlying temporal data structure, i.e. the evolution of feature characteristics over time. For the LFTD space, information is encoded in the signal phase and hence the temporal structure. The fact that the Bakis model constrains this temporal structure therefore entails interesting implications. Higher overall accuracies for the HG space, along with only minor benefits from the Bakis model, indicate that the clear advantages of model constraints mainly arise if the relevant information does not appear as prominently in the data. This is further supported by a higher increase in HMM performance when not using the optimal subset size for LFTD features (see supplementary figure SI9, available at stacks.iop.org/JNE/10/056020/mmedia). This suggests that the Bakis approach tends to be more robust with respect to deviations from the optimal parameter setting.

Finally, it should be mentioned that the current test procedure does not fully allow for non-stationary behaviour. That means that the random trial selection does not take the temporal evolution of signal properties into account. These might change and blur the feature space representation over time (Lemm et al 2011). In this context, and depending on the extent of non-stationarity, the presented results may slightly overestimate the real decoding rates. This applies to both SVM and HMM. Non-stationarity should be addressed in an online setup by updating the HMM on-the-fly: training data should only be used from past recordings, while the signal characteristics of future test set data remain unknown. While online model updates are generally unproblematic for the HMM training procedure, investigations in this direction are still needed.

5. Conclusions

For a defined offline case, we have shown that high decoding rates can be achieved with both the HMM and SVM classifiers in most subjects. This suggests that the general choice of classifier is of less importance once HMM robustness is increased by introducing model constraints. The performance gap between both machine learning methods is sensitive to feature extraction and particularly to electrode subset selection, raising interesting implications for future research. This study restricted HMMs to a rigid time window with a known event—a scenario that suits SVM requirements. This condition may be relaxed, and classification improved, by use of the HMM time-warp property (Sun et al 1994), which allows for the same spatio-temporal pattern of brain activity occurring at different temporal rates.

By exploiting the actual potential of such models, we expect them to be individually adapted to specific BCI applications. The idea of adaptive online learning may be implemented to cope with non-stationary signal properties.

Future research may encompass extended and additional model constraints for special cases, as successfully used in ASR, such as fixed transition probabilities or the sharing of mixture distributions among different states. Finally, incorporation of other sources of prior knowledge and context awareness may further optimize decoding accuracy.

Acknowledgments

The authors acknowledge the work of Fanny Quandt, who substantially contributed to the acquisition of the ECoG data. Supported by NINDS grant NS21135, Land Sachsen-Anhalt grant MK48-2009003 and ECHORD 231143. This work is partly funded by the German Ministry of Education and Research (BMBF) within the Forschungscampus STIMULATE (grant 03FO16102A).

References

Acharya M Fifer M Benz H Crone N and Thakor N 2010 Electrocorticographic amplitude predicts finger positions during slow grasping motions of the hand J Neural Eng 7 13

Ball T Schulze-Bonhage A Aertsen A and Mehring C 2009 Differential representation of arm movement direction in relation to cortical anatomy and function J Neural Eng 6 016

Bhattacharyya A 1943 On a measure of divergence between two statistical populations defined by their probability distributions Bull Calcutta Math Soc 35 99–109

Chang C-C and Lin C-J 2011 LIBSVM: a library for support vector machines ACM Trans Intell Syst Technol 2 1–27

Cincotti F et al 2003 Comparison of different feature classifiers for brain computer interfaces Proc 1st Int IEEE EMBS Conf on Neural Engineering (Capri Island, Italy) pp 645–7

Crone N Miglioretti D Gordon B and Lesser R 1998 Functional mapping of human sensorimotor cortex with electrocorticographic spectral analysis: II. Event-related synchronization in the gamma band Brain 121 2301–15

Davies D and Bouldin D 1979 A cluster separation measure IEEE Trans Pattern Anal Mach Intell PAMI-1 224–7

Demirer R Ozerdem M and Bayrak C 2009 Classification of imaginary movements in ECoG with a hybrid approach based on multi-dimensional Hilbert-SVM solution J Neurosci Methods 178 214–8

Dempster A Laird N and Rubin D B 1977 Maximum likelihood from incomplete data via the EM algorithm J R Stat Soc B 39 1–38

Dethier J Gilja V Nuyujukian P Elassaad S A Shenoy K V and Boahen K 2011 Spiking neural network decoder for brain-machine interfaces EMBS'11: Proc IEEE 5th Int Conf on Neural Engineering in Medicine and Biological Society (Cancun, Mexico) pp 396–9

Duda R O Hart P E and Stork D G 2001 Pattern Classification 2nd edn (New York: Interscience)

Fisher R A 1936 The use of multiple measurements in taxonomic problems Ann Eugen 7 179–88

Ganguly K and Carmena J M 2010 Neural correlates of skill acquisition with a cortical brain-machine interface J Motor Behav 42 355–60

Graimann B Huggins J Levine S and Pfurtscheller G 2004 Toward a direct brain interface based on human subdural recordings and wavelet-packet analysis IEEE Trans Biomed Eng 51 954–62

Guyon I and Elisseeff A 2003 An introduction to variable and feature selection J Mach Learn Res 3 1157–82

Guyon I Weston J Barnhill S and Vapnik V 2002 Gene selection for cancer classification using support vector machines Mach Learn 46 389–422

Hastie T Tibshirani R and Friedman J 2009 The Elements of Statistical Learning 2nd edn (New York: Springer)

Hoffmann U Vesin J and Diserens K 2007 An efficient P300-based brain-computer interface for disabled subjects J Neurosci Methods 167 115–25

Kubanek J Miller K Ojemann J Wolpaw J and Schalk G 2009 Decoding flexion of individual fingers using electrocorticographic signals in humans J Neural Eng 6 14

Lederman D and Tabrikian J 2012 Classification of multichannel EEG patterns using parallel hidden Markov models Med Biol Eng Comput 50 319–28

Lee H and Choi S 2003 PCA+HMM+SVM for EEG pattern classification Proc 7th Int Symp on Signal Processing and Its Applications vol 1 pp 541–4

Lemm S Blankertz B Dickhaus T and Mueller K-R 2011 Introduction to machine learning for brain imaging NeuroImage 56 387–99

Liang N and Bougrain L 2009 Decoding finger flexion using amplitude modulation from band-specific ECoG ESANN: Eur Symp on Artificial Neural Networks pp 467–72

Liu C Zhao H B Li C S and Wang H 2010 Classification of ECoG motor imagery tasks based on CSP and SVM BMEI: 3rd Int Conf on Biomedical Engineering and Informatics vol 2 pp 804–7

Lotte F Congedo M Lecuyer A Lamarche F and Arnaldi B 2007 A review of classification algorithms for EEG-based brain-computer interfaces J Neural Eng 4 R1–R13

Makeig S 1993 Auditory event-related dynamics of the EEG spectrum and effects of exposure to tones Electroencephalogr Clin Neurophysiol 86 283–93

Murphy K 2005 Hidden Markov model (HMM) toolbox for Matlab www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html

Obermaier B Guger C Neuper C and Pfurtscheller G 2001a Hidden Markov models for online classification of single trial EEG data Pattern Recognit Lett 22 1299–309

Obermaier B Guger C and Pfurtscheller G 1999 Hidden Markov models used for the offline classification of EEG data Biomed Tech/Biomed Eng 44 158–62

Obermaier B Neuper C Guger C and Pfurtscheller G 2001b Information transfer rate in a five-classes brain-computer interface IEEE Trans Neural Syst Rehabil Eng 9 283–8

Palaniappan R Chanan R and Syan S 2009 Current practices in electroencephalogram based brain-computer interfaces Encyclopedia of Information Science and Technology II (Hershey, PA: IGI Global) pp 888–901

Pasqual-Marqui R D Michel C M and Lehmann D 1995 Segmentation of brain electrical activity into microstates: model estimation and validation IEEE Trans Biomed Eng 42 658–65

Pechenizkiy M 2005 The impact of feature extraction on the performance of a classifier: kNN, Naive Bayes and C4.5 Proc 18th Canadian Society Conf on Advances in Artificial Intelligence (Berlin: Springer) pp 268–79

Penfield W and Boldrey E 1937 Somatic motor and sensory representation in the cerebral cortex of man as studied by electrical stimulation Brain 60 389–443

Picton T W Lins O G and Scherg M 1995 The recording and analysis of event-related potentials Handbook of Neuropsychology vol 10 ed F Boller and J Grafman (Amsterdam: Elsevier) pp 3–73

Pistohl T Ball T Schulze-Bonhage A Aertsen A and Mehring C 2008 Prediction of arm movement trajectories from ECoG-recordings in humans J Neurosci Methods 167 105–14

Quandt F Reichert C Hinrichs H Heinze H Knight R and Rieger J 2012 Single trial discrimination of individual finger movements on one hand: a combined MEG and EEG study NeuroImage 59 3316–24

Rabiner L 1989 A tutorial on hidden Markov models and selected applications in speech recognition Proc IEEE 77 257–86

Schalk G 2010 Can electrocorticography (ECoG) support robust and powerful brain-computer interfaces? Front Neuroeng 3 1

Schukat-Talamazzini E G 1995 Automatische Spracherkennung—Grundlagen, statistische Modelle und effiziente Algorithmen (Künstliche Intelligenz) (Braunschweig: Vieweg)

Shenoy P Miller K Ojemann J and Rao R 2007 Finger movement classification for an electrocorticographic BCI CNE'07: 3rd Int IEEE/EMBS Conf on Neural Engineering pp 192–5

Sun D X Deng L and Wu C F J 1994 State-dependent time warping in the trended hidden Markov model Signal Process 39 263–75

Wissel T and Palaniappan R 2011 Considerations on strategies to improve EOG signal analysis Int J Artif Life Res 2 6–21

Yu L and Liu H 2004 Efficient feature selection via analysis of relevance and redundancy J Mach Learn Res 5 1205–24

Zhao H B Liu C Wang H and Li C S 2010 Classifying ECoG signals using probabilistic neural network ICIE'10: WASE Int Conf on Information Engineering vol 1 pp 77–80

14

  • 1 Introduction
  • 2 Material and methods
    • 21 Patients experimental setup and data acquisition
    • 22 Test framework
    • 23 Feature generation
    • 24 Feature selection
    • 25 Classification
    • 26 Considerations on testing
      • 3 Results
        • 31 Feature and channel selection
        • 32 Decoding performances
          • 4 Discussion
          • 5 Conclusions
          • Acknowledgments
          • References
Page 5: Hidden Markov model and support vector machine based ...knightlab.berkeley.edu/statics/publications/2014/10/02/...et al 2011, Acharya et al 2010,Ballet al 2009, Kubanek et al 2009,LiangandBougrain2009,Pistohletal2008),thesupport

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 1. Block diagram of the implemented test framework. The decision for a particular class is provided by either SVM or HMM; in both cases they are exposed to an identical framework of pre-processing, channel selection and feature extraction.

lowpass filter cut-off were selected by grid search for each session and are stated in the results. The data sampling was reduced to the Nyquist frequency.

Second, event-related spectral perturbation (ERSP) was measured using a sliding Hann-window approach (window length n_w = 260 ms). For each window, the square root of the power spectrum was computed by fast Fourier transform (FFT). The resulting coefficients were then averaged in a frequency band of interest. Taking the square root to obtain the amplitude spectrum assigned a higher weight to high-frequency components compared to the original power spectrum. The resulting feature sequence can be used as a measure for time-locked power fluctuation in the high gamma band, neglecting explicit phase information (Graimann et al 2004, Crone et al 1998, Makeig 1993).
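A minimal MATLAB sketch of this kind of sliding-window amplitude feature is given below; the window step, band edges and input variables are illustrative assumptions rather than the exact values used in the study (hann requires the Signal Processing Toolbox).

% Sketch: band-averaged amplitude feature via sliding Hann window and FFT.
% Assumed inputs: x (single-channel ECoG trace, row vector), fs (sampling rate in Hz).
nw     = round(0.260 * fs);            % window length of about 260 ms
step   = round(0.5 * nw);              % 50% overlap (illustrative value)
band   = [70 150];                     % example high gamma band of interest
w      = hann(nw).';                   % Hann taper as row vector
f      = (0:nw-1) * fs / nw;           % FFT bin frequencies
inBand = f >= band(1) & f <= band(2);
starts = 1:step:(numel(x) - nw + 1);
feat   = zeros(1, numel(starts));
for k = 1:numel(starts)
    seg     = x(starts(k):starts(k)+nw-1) .* w;     % windowed segment
    amp     = sqrt(abs(fft(seg)).^2 / nw);          % square root of the power spectrum
    feat(k) = mean(amp(inBand));                    % average within the band of interest
end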

The time interval relative to the button press, the window overlap, and the exact low and high cut-off frequencies of the high gamma band most discriminative for the four classes were selected by grid search. The resulting parameters for each session are stated in the results section.

A combined feature space including both high gamma (HG) and LFTD features was also evaluated. The ratio p of how many LFTD feature sequences are used relative to the number of HG sequences was obtained by grid search. Subset selection beyond this ratio is described in the next subsection.

2.4. Feature selection

2.4.1. General remarks on subset selection. Feature sequences were computed for all 64 ECoG channels, giving 2 × 64 sequences in total. From these sequences, a subset of size O has been identified that contains the most relevant information about the finger labels. A preference towards certain channels is influenced by the grid placement as well as by the distinct parts of the cortex differentially activated by the task.

We tested several measures for feature selection, such as Fisher's linear discriminant analysis (Fisher 1936) and the Bhattacharyya distance (Bhattacharyya 1943), and found that the Davies–Bouldin (DB) index (Wissel and Palaniappan 2011, Davies and Bouldin 1979) yielded the most robust and stable subset estimates for accurate class separation. In addition to its higher robustness, the DB index is also the computationally fastest alternative. The DB index substantially outperformed the other methods for both SVM and HMM. Nevertheless, it should be pointed out that the observed DB subsets were in good agreement with the ranking generated by the normal vector w of a linear SVM. Apart from so-called filter methods such as the DB index, an evaluation of these vector entries after training the SVM would correspond to feature selection using a wrapper method (Guyon and Elisseeff 2003). Our study did not employ any wrapper methods, to avoid dependences between feature selection and classifier that could have introduced an additional bias towards either SVM or HMM (Yu and Liu 2004). Based on the DB index, we defined an algorithm that automatically selects an O-dimensional subset of feature sequences as follows.

2.4.2. DB index. The index corresponds to a cluster separation index measuring the overlap of two clusters in an arbitrarily high dimensional space, where r ∈ R^T is one of N sample vectors belonging to cluster C_i:

\mu_i = \frac{1}{N} \sum_{r \in C_i} r \qquad (1)

d_i = \frac{1}{N} \sum_{r \in C_i} \lVert r - \mu_i \rVert \qquad (2)

R_{i,j} = \frac{d_i + d_j}{\lVert \mu_i - \mu_j \rVert}, \qquad i, j \in \mathbb{N}^{+},\; i \neq j \qquad (3)

Using the cluster centroids \mu_i ∈ R^T and spreads d_i, the matrix R = \{R_{i,j}\}_{i=1,\ldots,N_c;\, j=1,\ldots,N_c} contains the ratio between the within-class spread and the between-class spread for all class combinations. The smaller the index, the less two clusters overlap in that feature space. The index is computed for all features available.
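A minimal MATLAB sketch of the pairwise index in equation (3), assuming two sample matrices with one sample vector r per row (names are illustrative):

% Sketch: pairwise Davies-Bouldin style cluster separation index.
% Ci, Cj: matrices with one sample vector per row (N_i-by-T and N_j-by-T).
function Rij = db_pair_index(Ci, Cj)
    mui = mean(Ci, 1);                                        % centroid of cluster i
    muj = mean(Cj, 1);                                        % centroid of cluster j
    di  = mean(sqrt(sum(bsxfun(@minus, Ci, mui).^2, 2)));     % within-cluster spread of i
    dj  = mean(sqrt(sum(bsxfun(@minus, Cj, muj).^2, 2)));     % within-cluster spread of j
    Rij = (di + dj) / norm(mui - muj);                        % smaller value = less overlap
end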

2.4.3. Feature selection procedure. Covering different aspects of class separation, we developed a feature selection algorithm based on the DB index that takes multiple classes into account. The algorithm aims at minimizing the mutual class overlap for a subset of defined size O as follows. First, each feature sequence is segmented into three parts of equal length. The index matrix R, measuring the cluster overlap between all classes i and j, is then computed segment-wise for each feature under consideration. Subsequently, all three matrices are averaged (geometric mean, for a bias towards smaller indices) for further computing. The feature yielding the smallest mean index is selected, since it provides the best separation on average from one class to at least one other class. Second, for all class combinations, the features corresponding to the minimal index per combination are added; each of them optimally separates a certain finger combination. Next, features with the smallest geometric mean per class are added if they are not part of the subset already. Finally, in case the predefined number of elements in the subset O admits further features, vacancies are filled according to a sorted ranking of indices averaged across class combinations per feature.

Figure 2. An HMM structure for one class consists of Q states emitting multi-dimensional observations o_t at each time step. The state as well as the output sequence is governed by a set of probabilities, including the transition probabilities a_ij. To assign a sequence O_t to a particular model, the classifier decides for the highest probability P(O_t | model).

For each trial, this subset results in an O-dimensional feature vector varying over time t ∈ [0, T]. While this multivariate sequence was fed directly into the HMM classifier, the O × T matrix O_t had to be reshaped into a 1 × (T · O) vector for the SVM. The subset size O is obtained by exhaustive search on the training data. This makes the difference between both classifiers in data modelling obvious.

While an HMM technically distinguishes between the temporal development (columns in O_t) and the feature type (rows in O_t), the SVM concatenates all channels and their time courses within the same one-dimensional feature vector. The potential flexibility of HMMs with respect to temporal phenomena arises substantially from this fact.
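In MATLAB terms, the two data views described above might look as follows (a trivial sketch; Ot denotes the O × T feature matrix of one trial):

% Sketch: HMM versus SVM view of one trial's feature matrix Ot (O-by-T).
hmm_obs = Ot;                     % HMM: multivariate sequence, one column per time step
svm_vec = reshape(Ot.', 1, []);   % SVM: single 1-by-(T*O) vector, concatenating the
                                  % time course of each selected feature sequence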

2.5. Classification

2.5.1. Hidden Markov models. Arising from a state machine structure, hidden Markov models (HMMs) comprise two stochastic processes. First, as illustrated in figure 2, a set of Q underlying states is used to describe a sequence O_t of T O-dimensional feature vectors. These states q_t ∈ {s_i}_{i=1,...,Q} are assumed to be hidden, while each is associated with an O-dimensional multivariate Gaussian M-mixture distribution that generates the observed features (the second stochastic process). Both the state incident q_t and the observed output variable o_t at time t are subject to the Markov property, i.e. a conditional dependence only on the last step in time, t − 1.

Among other parameters, the state sequence structure and its probable variations are described by prior probabilities P(q_1 = s_i) and a transition matrix

A = \{a_{ij}\}_{i,j=1,\ldots,Q} = \{P(q_{t+1} = s_j \mid q_t = s_i)\}_{i,j=1,\ldots,Q}.

A more detailed account of the methodological background is given by Rabiner (1989).

For each of the four classes, one HMM had to be defined, which was trained using the iterative Baum–Welch algorithm with eight expectation maximization (EM) steps (Dempster et al 1977). The classification on the test set was carried out following a maximum likelihood approach, as indicated by the curly bracket in figure 2.
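A rough sketch of this training and decision scheme is shown below. It assumes the Gaussian-observation routines of the open-source HMM toolbox referenced in section 2.5.5 (mhmm_em, mhmm_logprob); the data layout (one O × T matrix per trial, collected in a cell array) and the init_model helper are assumptions made for illustration only.

% Sketch: one HMM per class, Baum-Welch training and maximum likelihood decision.
% trainData{c}: cell array of O-by-T observation matrices belonging to class c.
numClasses = 4; Q = 5; M = 1;
models = cell(1, numClasses);
for c = 1:numClasses
    % init_model is a hypothetical helper implementing the k-means based
    % initialization of section 2.5.4.
    [prior0, transmat0, mu0, Sigma0, mixmat0] = init_model(trainData{c}, Q, M);
    [~, prior, transmat, mu, Sigma, mixmat] = mhmm_em(trainData{c}, prior0, ...
        transmat0, mu0, Sigma0, mixmat0, 'max_iter', 8);   % eight EM steps
    models{c} = struct('prior', prior, 'transmat', transmat, ...
                       'mu', mu, 'Sigma', Sigma, 'mixmat', mixmat);
end
% Classify a test trial Ot by the model yielding the highest log-likelihood.
loglik = zeros(1, numClasses);
for c = 1:numClasses
    m = models{c};
    loglik(c) = mhmm_logprob(Ot, m.prior, m.transmat, m.mu, m.Sigma, m.mixmat);
end
[~, predictedClass] = max(loglik);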

In order to increase decoding performance, the above standard HMM structure was constrained and adapted to the classification problem. This aims at optimizing the model hypothesis to give a better fit and reduces the number of free parameters to deal with limited training data.

2.5.2. Model constraints. First, the transition matrix A was constrained to the Bakis model commonly used for word modelling in automatic speech recognition (ASR) (Schukat-Talamazzini 1995). This model admits non-zero elements only on the upper triangular part of the transition matrix, specifically on the main as well as the first and second secondary diagonals. Thus this particular structure implies only 'loop', 'next' and 'skip' transitions and suppresses state transitions on the remaining upper half. Preliminary results using a full upper triangular matrix instead revealed small probabilities, i.e. very unlikely state transitions, for these suppressed transitions throughout all finger models. This suggests that they contribute only marginally to the signal characteristics described by that model. A full ergodic transition matrix generated poor decoding performances in general. Recently, a similar finding was presented by Lederman and Tabrikian for EEG data (Lederman and Tabrikian 2012).
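A minimal sketch of such a Bakis-constrained transition matrix for Q states (plain MATLAB; the values are purely illustrative):

% Sketch: Bakis ('loop', 'next', 'skip') transition structure for Q states.
Q = 5;
A = zeros(Q);
for i = 1:Q
    A(i, i:min(i + 2, Q)) = 1;             % main diagonal plus first and second
end                                         % secondary diagonals are allowed
A = bsxfun(@rdivide, A, sum(A, 2));         % normalize rows to valid probabilities
% Entries initialized to zero remain zero under Baum-Welch re-estimation,
% so the constraint is preserved during training.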

Given five states containing one Gaussian each, the unconstrained model would involve 4 × 1235 degrees of freedom, while the Bakis model uses 4 × 1219 for an exemplary selection of 15 channels. The significant increase in decoding performance obtained by dropping only a few parameters is interesting with respect to the nature of the underlying signal structure and underlines the importance of feature selection for HMMs. An unconstrained model using all 64 channels would have 4 × 20 835 free parameters.

2.5.3. Model structure. Experiments for all feature spaces have been conducted in which the optimal number of states Q per HMM and the number of mixture distribution components M per state have been examined by grid search. Since increasing the degrees of freedom degrades the quality of model estimation, high decoding rates have mainly been obtained for a small number of states containing only a few mixture components. Based on this search, a model structure of Q = 5 states and M = 1 mixture component has been chosen as the best compromise. Note that the actual optimum might vary slightly depending on the particular subject, the feature space and even on the session, which is in line with other studies (Obermaier et al 1999, 2001a, 2001b).

2.5.4. Initialization. The initial parameters of the multivariate probability densities have been set using a k-means clustering algorithm (Duda et al 2001). The algorithm temporally clusters the vectors of the O × T feature matrix O_t. The matrix was extended by an additional row containing a scaled temporal counter, adding a time stamp to each feature vector. This increases the probability of time-continuous state intervals in the cluster output. Then, for each of the Q · M clusters, the mean as well as the full covariance matrix have been calculated, excluding the additional time stamp. Finally, to assign these mixture distributions to specific states, the clusters were sorted according to the mean of their corresponding time stamp values. This temporal order is suggested by the constrained forward structure of the transition matrix. The transition matrix is initialized with probability values on the diagonal that are weighted to be c times as likely as probabilities off the diagonal. The constant c is calculated as the quotient of the number of samples for a feature, T, and the number of states Q. In the case of multiple mixture components (M > 1), the mixture weight matrix was initialized using the normalized number of sample vectors assigned to each cluster.
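The density initialization could be sketched as follows (kmeans is from the Statistics Toolbox; the scaling of the time stamp and the single-trial handling are simplifying assumptions):

% Sketch: temporal k-means initialization of the state output densities.
% Ot: O-by-T feature matrix (one trial, or several trials concatenated in time).
[O, T]  = size(Ot);
Q = 5; M = 1; scale = 1;                        % scale of the time stamp is an assumption
stamped = [Ot; scale * (1:T)];                  % append scaled temporal counter as extra row
idx     = kmeans(stamped.', Q * M);             % cluster feature vectors (rows = samples)
mu = zeros(O, Q * M); Sigma = zeros(O, O, Q * M); meanStamp = zeros(1, Q * M);
for k = 1:Q * M
    members        = Ot(:, idx == k);           % cluster members without the time stamp
    mu(:, k)       = mean(members, 2);
    Sigma(:, :, k) = cov(members.');            % full covariance matrix
    meanStamp(k)   = mean(find(idx == k));      % mean time stamp of the cluster
end
[~, order] = sort(meanStamp);                   % assign clusters to states in temporal order
mu = mu(:, order); Sigma = Sigma(:, :, order);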

2.5.5. Implementation. The basic implementation has been derived from the open source functionalities on HMMs provided by Kevin Murphy from the University of British Columbia (Murphy 2005). This framework, its modifications, as well as the entire code developed for this study were implemented using MATLAB R2012b from MathWorks.

2.5.6. Support vector machines. As a gold-standard reference we used a one-vs-one SVM for multi-class classification. In our study, training samples come in paired assignments {(x_i, y_i)}_{i=1,...,N}, where x_i is a 1 × (T · O) feature vector derived from the ECoG data for one trial with its class assignment y_i ∈ {−1, 1}. Equation (4) describes the classification rule of the trained classifier for new samples x. A detailed derivation of this decision rule and the rationale behind SVM classification is given, e.g., by Hastie et al (2009).

F(x) = \operatorname{sgn}\Big( \sum_i \alpha_i y_i K(x_i, x) + \beta_0 \Big) \qquad (4)

The α_i ∈ [0, C] are weights for the training vectors and are estimated during the training process. Training vectors with weights α_i ≠ 0 contribute to the solution of the classification problem and are called support vectors. The offset of the separating hyperplane from the origin is proportional to β_0. The constant C generally controls the trade-off between the classification error on the training set and the smoothness of the decision boundary. The numbers of training samples per session (table 1) were in a similar range compared to the resulting dimension of the SVM feature space T · O (∼130−400). Influenced by this fact, the decoding accuracy of the SVM revealed almost no sensitivity with respect to regularization. Since grid search yielded a stable plateau of high performance for C > 1, the regularization constant was fixed to C = 1000. The linear kernel, which has been used throughout the study, is denoted by K(x_i, x). We further used the LIBSVM package for Matlab (www.csie.ntu.edu.tw/~cjlin/libsvm) developed by Chih-Chung Chang and Chih-Jen Lin (Chang and Lin 2011).
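A minimal usage sketch with LIBSVM's MATLAB interface (svmtrain and svmpredict are LIBSVM's own entry points; the variable names and the train/test split are placeholders):

% Sketch: linear multi-class SVM with LIBSVM (one-vs-one voting is handled internally).
% Xtrain: N-by-(T*O) matrix of reshaped feature vectors, ytrain: N-by-1 class labels (1..4).
model = svmtrain(ytrain, sparse(Xtrain), '-s 0 -t 0 -c 1000');   % C-SVC, linear kernel, C = 1000
[ypred, acc, ~] = svmpredict(ytest, sparse(Xtest), model);        % predicted labels and accuracy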

2.6. Considerations on testing

In order to evaluate the findings presented in the following sections, a five-fold cross-validation (CV) scheme was used (Lemm et al 2011). The assignment of a trial to a specific fold set was performed using uniform random permutations.

Each CV procedure was repeated 30 times to average out fluctuations in the resulting recognition rates arising from random partitioning of the CV sets. The number of repetitions was chosen as a compromise between obtaining stable estimates of the generalization error and avoiding massive computational load. The training data was balanced between all classes to treat no class preferentially and to avoid the need for class prior probabilities. Feature subset selection and estimation of the parameters of each classifier were performed only on the data sets allocated for training, in order to guarantee a fair estimate of the generalization error. Due to prohibitive computational effort, the feature spaces have been parameterized only once for each subject and session and are hence based on all samples of each subject-specific data set. In this context, preliminary experiments indicated minimal differences in the decoding performance (see table 2) when fixing the feature definitions based on the training sets only. These experiments used nested CV with an additional validation set for the feature space parameters.
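The repeated, randomly permuted CV scheme could be sketched as follows (nTrials and the train_and_test helper are hypothetical placeholders for the subject-specific data and the full feature-selection-plus-classification pipeline):

% Sketch: 30 repetitions of a five-fold cross-validation with random fold assignment.
nTrials = 120; nFolds = 5; nReps = 30;
acc = zeros(nReps, nFolds);
for rep = 1:nReps
    perm  = randperm(nTrials);                 % uniform random permutation of trials
    folds = mod(0:nTrials-1, nFolds) + 1;      % fold labels 1..5
    for f = 1:nFolds
        testIdx  = perm(folds == f);
        trainIdx = perm(folds ~= f);
        % Feature subset selection and classifier training use trainIdx only;
        % train_and_test is a hypothetical helper returning the fold accuracy.
        acc(rep, f) = train_and_test(trainIdx, testIdx);
    end
end
meanAccuracy = mean(acc(:));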

Figure 3. Optimal parameters of both feature spaces for each subject and session. (a) Location and width of the ROIs containing the most discriminative information for LFTD features. (b) Optimal high cut-off frequency of the spectral lowpass filter used to extract LFTD features. (c) Location and width of the ROIs containing the most discriminative information for HG features. (d) Low and high cut-off frequency of the optimal HG frequency band.

Table 2. HMM-decoding accuracies (in %) for parameter optimization on the full data and on training subsets. Accuracies are given for subject 1, session 1. Parameter optimization was performed on the full dataset (standard-CV routine) as well as on the training subset (nested-CV routine).

Parameter optimization on    LFTD          HG            LFTD + HG
Full dataset                 80.7 ± 0.3    97.9 ± 0.1    98.4 ± 0.1
Training subset              81.1 ± 1.0    97.8 ± 0.3    98.0 ± 0.2

3. Results

3.1. Feature and channel selection

In order to identify optimal parameters for feature extraction, a grid search for each subject and session was performed. For this optimization, a preliminary feature selection was applied aiming at a fixed subset size of O = 8, which was found to be within the range of optimal subset sizes (see figure 6 and table 3). Results for all sessions are illustrated by the bar plots in figure 3. A sensitivity of the decoding rate has been found for temporal parameters such as the location and width of the time windows (ROIs) containing the most discriminative information about LFTD or HG features (figures 3(a) and (c)).

The highest impact on accuracy was found for the HG ROI (mean optimal ranges: S1 [−187.5, 337.5] ms, S3 [−400, 475] ms, S4 [−280, 420] ms).

Further sensitivity is given for spectral parameters, namely the high cut-off frequency of the LFTD lowpass filter and the low and high cut-offs of the HG bandpass (figures 3(b) and (d)). For the former, optimal frequencies ranged from 11–30 Hz, and the optimal frequency bands for the latter all lay within 60–300 Hz, depending on the subject. There was hardly any impact on the decoding rate when changing the sliding window size and window overlap for the HG features. These were hence set to fixed values. According to these parameter settings, the mean numbers of samples T were 26.5 (subject 1), 21.3 (subject 2), 31 (subject 3) and 22 (subject 4).

An example of the grid search procedure is shown in supplementary figures SI5–SI7 (available at stacks.iop.org/JNE/10/056020/mmedia). The plots show varying decoding rates for S1, session 1, within individual parameter subspaces. The optimal parameters illustrated in figure 3 represent a single point in these spaces at the peak decoding rate. Based on these parameters, figure 4 illustrates a typical result for a LFTD and HG feature sequence extracted from the shown raw ECoG signal (S1, session 2, channel 23, middle finger). All feature sequences for this trial, consisting of the entire set of channels, are given in supplementary figure SI8.

Using the resulting parameterization, subsequent CV runs sought to identify the optimal subset by tuning O based on the training set of the current fold. Figure 5 illustrates the rate at which a particular channel was selected for each subject.

Figure 4. The dashed grey time course shows a raw ECoG signal for a typical trial (S1, session 2, channel 23, middle finger). The extracted feature sequences for LFTD (cut-off at 24 Hz) and HG (frequency band [85 Hz, 270 Hz]) are overlaid as solid lines. Each HG feature has been assigned to the time point at the centre of its corresponding FFT window.

Figure 5. Channel maps for all four subjects. The channels are arranged according to the electrode grid and coloured with respect to their mean selection rate in repeated CV runs. Channels coloured like the top of the colour bar tend to have a higher probability of being selected.

It evaluates the optimal subsets O across all sessions, folds and several CV repetitions. Results are exemplarily shown for high gamma features. Full detail on subject-specific grid placement and channel maps for LFTD and HG features can be found in supplementary figures SI1–SI4 (available at stacks.iop.org/JNE/10/056020/mmedia).

The channel maps reveal localized regions of highly informative channels (dark red) comprising only a small subset within the full grid. While these localizations are distinct and focused for high gamma features, informative channels spread wider across the electrode grid for LFTD features. This finding is in line with an earlier study by Kubanek et al (2009) and implies that selected subsets for LFTD features tend to be less stable across CV runs. Combined subsets were selected from both feature spaces. The selections indicated a preference for a higher proportion of LFTD features (table 3). This optimal proportion was determined by grid search.

The optimal absolute subset size O was selected by successively adding channels to the subset, as defined by our selection procedure, and identifying the peak performance. For the high gamma case, figure 6 already shows a sharp increase in decoding performance for a subset size below five. Performance then increases less rapidly until it reaches a maximum, usually in the range of 6–8 channels for the HG features (see table 3 for detailed information on the results for the other features). This behaviour again reflects the fact that a lot of information is found in a small number of HG sequences.

Increasing the subset size beyond this optimal size leads to a decrease in performance for both SVM and HMM. This coincides with an increase in the number of free parameters in the classifier model, which depends quadratically on the number of channels for HMMs (figure 6, bottom).

3.2. Decoding performances

Based on the configuration of the feature spaces and feature selection, a combination space containing both LFTD and HG features yielded the best decoding accuracy across all sessions and subjects. Figure 7 summarizes the highest HMM performances for each subject, averaged across all sessions, and compares them with the corresponding SVM accuracies. Except for subject 2, these accuracies were close to or even exceeded 90% correct classifications. However, figure 8 shows that accuracies obtained only from the HG space come very close to this combined case, implying that LFTD features contribute limited new information. Using time domain features exclusively decreased performance in most cases. An exception is given by subject 2, where LFTD features outperformed HG features. This may be explained by the wider spread of LFTD features across the cortex and the suboptimal grid placement (little coverage of motor cortex, supplementary figure SI2, available at stacks.iop.org/JNE/10/056020/mmedia). The channel maps show informative channels at the very margin of the grid. This suggests that grid placement was not always optimal for capturing the most informative electrodes.

Differences in decoding accuracy compared to the SVM were found to be small (table 4) and revealed no clear superiority of either classifier (figure 9). Excluding subject 2, the mean performance difference (HMM − SVM) was −0.53% for LFTD, 0.1% for HG and −0.23% for combined features. Accuracy trends for individual subjects are similar across sessions and vary less than across subjects (figure 9). Note that, irrespective of the suboptimal grid placement, subject 2 took part in four experimental sessions. Here, subject-specific variables, e.g. grid placement, were important determinants of decoding success. Thus we interpret our results at the subject level and conclude that in three out of four subjects SVM and HMM decoding of finger movements led to comparable results.

Figure 6. Influence of channel selection on the HMM decoding accuracy using high gamma features only (S1, session 1). Changes in decoding accuracy when successively adding channels to the feature set, applying either random or DB-index channel selection (top). Number of free HMM parameters for each number of channels (bottom).

Table 3. Number of selected channels averaged across sessions. The averaged number of selected channels and the corresponding standard deviations are given for all subjects and all feature spaces. For the combined space, square brackets denote the corresponding breakdown into its components. The average is taken across all sessions and CV repetitions.

Subject    LFTD          HG           LFTD + HG
1          10.0 ± 0.0    6.0 ± 0.0    16.5 ± 0.5 ([10.0 ± 0.0] + [6.5 ± 0.7])
2           6.5 ± 1.9    7.3 ± 2.1    12.8 ± 3.4 ([7.0 ± 4.5] + [5.8 ± 1.5])
3          10.0 ± 0.0    8.0 ± 0.0    14.0 ± 1.0 ([6.0 ± 1.4] + [8.0 ± 0.0])
4          10.0 ± 0.0    6.0 ± 1.4    16.0 ± 0.7 ([9.5 ± 0.7] + [6.5 ± 0.7])

A clear superiority of SVM classification was only found for subject 2, for whom the absolute decoding performances were significantly worse than those of the other three subjects.

For the single spaces LFTD and HG, table 4 shows the mean performance loss for the unconstrained HMM. That refers to a case wherein the Bakis model constraints are relaxed to the full model. For all except two sessions of subject 2 (the full model had a higher performance of 0.2% for session 3 and 0.7% for session 4), the Bakis model outperformed the full model. The results indicate a higher benefit for the LFTD features, with an up to 8.8% performance gain for subject 3, session 2. The gain was consistently higher for feature subsets deviating from the optimal size. An example is illustrated for LFTD channel selection (subject 3, session 2) in supplementary figure SI9 (available at stacks.iop.org/JNE/10/056020/mmedia). The plot compares the Bakis and the full model, similar to the procedure in figure 6. While table 4 lists only the improvements for the optimal number of channels, supplementary figure SI9 indicates that the performance gain may be even higher when deviating from the optimal settings.

Finally, figure 10 illustrates the accuracy decompositions into true positive rates for each finger. The results reveal similar trends across subjects and feature spaces. While the thumb and little finger provided the fewest training samples, their true positive rates tend to be the highest among all fingers. An exception is given by the LFTD features for subject 2. Equivalent plots for the SVM are shown in supplementary figure SI10 (available at stacks.iop.org/JNE/10/056020/mmedia).

For the classification of one feature sequence including all T samples, the HMM Matlab implementation took on average (3.522 ± 0.036) ms and the SVM implementation in C (0.0281 ± 0.0034) ms. The mean computation time for a feature sequence was 0.2 ms ± 2.1 μs (LFTD), 1.72 ms ± 37.9 μs (HG) and 1.79 ms ± 20.0 μs (combined space). The results were obtained with an Intel Core i5-2500K, 3.4 GHz, 16 GB RAM, Win 7 Pro 64-Bit, Matlab 2012b, for ten selected channels.

Figure 7. Decoding accuracies (in %) for HMM (green) and SVM (beige) using the LFTD + HG feature space (best case). Average performances across all sessions are shown for all subjects (S1–S4) as the mean result of a 30 × 5 CV procedure, with their corresponding error bars.

Figure 8. Decoding accuracies (in %) for the HMMs for all feature spaces and sessions Bi. Performances are averaged across 30 repetitions of the five-fold CV procedure and displayed with their corresponding error bars.

4. Discussion

An evaluation of feature space configurations revealed a sensitivity of the decoding performance to the precise parameterization of the individual feature spaces. Apart from these general aspects, the grid search primarily allows for individual parameter deviations across subjects and sessions. Importantly, a sensitivity has been identified for the particular time interval of interest used for the sliding window approach to extract HG features. This indicates varying temporal dynamics across subjects and may favour the application of HMMs in online applications. This sensitivity suggests that adapting to these temporal dynamics has an influence on the degree to which information can be deduced from the data by both HMM and SVM. However, our investigations indicate that HMMs are less sensitive to changes of the ROI and give nearly stable performances when changing the window overlap. As a model property, the HMM is capable of modelling informative sequences of different lengths and rates. The latter characteristic originates from the explicit loop transition within a particular model. This means the HMM may cover uninformative samples of a temporal sequence with an idle state before the informative part starts. Underlying temporal processes that persist longer or shorter are modelled by more or fewer repetitions of basic model states. Lacking any time-warp properties, an SVM expects a fixed sequence length and information rate, meaning a fixed association between a particular feature dimension o_i and a specific point in time t_i. This highlights the strict comparability of information within one single feature dimension explicitly required by the SVM. This leads to the prediction of benefits for HMMs in online applications where the time of the button press or other events may not be known beforehand. The classifier may act more robustly on activity that is not perfectly time locked. Paradigms for motor imagery or similar experiments may provide scenarios in which a subject carries out an action for an unspecified time interval. Given that varying temporal extent is likely, the need to train the SVM for the entire range of possible ROI locations and widths is not desirable.

Figure 9. Differences in decoding accuracy (in %) for the HMMs with respect to the SVM reference classifier (HMM − SVM). Results are displayed for all sessions Bi and feature spaces.

Figure 10. Mean HMM decoding rates decomposed into true positive rates for each finger. Results are shown for each feature space and averaged across sessions and CV repetitions.

Table 4. Final decoding accuracies (in %) listed for HMM (left) and SVM (right). Accuracies for the Bakis model are given for each subject-session and feature space by their mean and standard error across all 30 repetitions of the CV procedure. Additionally, for each subject and feature space, the mean across all sessions as well as the mean loss in performance for LFTD and HG when using an unconstrained HMM is shown.

                             HMM                                         SVM
Subject–session          LFTD         HG           LFTD + HG         LFTD         HG           LFTD + HG
S1–B1 (%)                80.7 ± 0.3   97.9 ± 0.1   98.4 ± 0.1        85.8 ± 0.4   97.2 ± 0.2   98.3 ± 0.1
S1–B2 (%)                78.8 ± 0.6   92.4 ± 0.4   95.3 ± 0.3        84.9 ± 0.5   93.4 ± 0.4   96.1 ± 0.3
Mean (%)                 79.8 ± 0.5   95.2 ± 0.3   96.9 ± 0.2        85.4 ± 0.5   95.3 ± 0.3   97.2 ± 0.2
Unconstrained model (%)  −3.1 ± 0.5   −0.9 ± 0.3   –                 –            –            –
S2–B1 (%)                44.0 ± 1.0   35.9 ± 0.7   44.4 ± 0.6        51.4 ± 0.8   38.0 ± 0.8   55.7 ± 0.8
S2–B2 (%)                50.2 ± 1.0   48.1 ± 0.7   52.4 ± 0.8        55.5 ± 0.7   48.0 ± 0.6   60.2 ± 0.8
S2–B3 (%)                40.2 ± 0.7   36.3 ± 0.8   40.1 ± 0.8        49.6 ± 0.7   39.5 ± 0.9   51.5 ± 0.8
S2–B4 (%)                30.9 ± 0.7   36.4 ± 0.9   38.4 ± 0.9        39.1 ± 0.7   38.0 ± 1.0   42.7 ± 0.9
Mean (%)                 41.2 ± 0.9   39.2 ± 0.8   43.8 ± 0.8        48.9 ± 0.7   40.9 ± 0.9   52.5 ± 0.9
Unconstrained model (%)  −2.2 ± 0.9   −0.4 ± 0.8   –                 –            –            –
S3–B1 (%)                64.6 ± 0.7   85.1 ± 0.4   85.8 ± 0.4        61.2 ± 0.7   85.4 ± 0.5   88.8 ± 0.4
S3–B2 (%)                80.1 ± 0.5   90.5 ± 0.3   93.7 ± 0.3        85.0 ± 0.4   91.4 ± 0.3   96.0 ± 0.3
Mean (%)                 72.4 ± 0.6   87.8 ± 0.4   89.8 ± 0.4        73.1 ± 0.6   88.4 ± 0.4   92.4 ± 0.3
Unconstrained model (%)  −5.9 ± 0.7   −2.5 ± 0.4   –                 –            –            –
S4–B1 (%)                71.4 ± 0.5   83.5 ± 0.5   87.2 ± 0.4        68.5 ± 0.6   82.1 ± 0.4   86.7 ± 0.4
S4–B2 (%)                69.7 ± 0.7   83.1 ± 0.3   87.9 ± 0.4        64.3 ± 0.6   82.3 ± 0.3   84.4 ± 0.4
Mean (%)                 70.6 ± 0.7   83.3 ± 0.3   87.6 ± 0.4        66.4 ± 0.6   82.2 ± 0.3   85.6 ± 0.4
Unconstrained model (%)  −1.1 ± 0.6   −3.1 ± 0.5   –                 –            –            –

The SVM characteristics also imply a need to await the whole sequence before making a decision for a class. In contrast, the Viterbi algorithm allows HMMs to state, at each incoming feature vector, how likely an assignment to a particular class is. In online applications this striking advantage provides more prompt classification and may reduce computational overhead if only the baseline is employed; there is no need to await further samples for a decision. Note that this emphasizes the implicit adaptation of HMMs to real-time applications: while the SVM needs to classify the whole new subsequence again after each incoming sample, HMMs just update their current prediction according to the incoming information and do not start from scratch. The implementation used here did not make use of this online feature, since only the classification of the entire trial was evaluated. Another reason for the SVM being faster than the HMM is the non-optimized Matlab implementation of the latter. The same applies to the feature extraction. Existing high performance libraries for these standard signal processing techniques, or even a hardware field-programmable gate array implementation, may significantly speed up the implementation. The average classification time for the HMMs implies a maximal possible signal rate of about 285 Hz when classifying after each sample. Thus, in terms of classification, real-time operation is perfectly feasible for most cases even with the current implementation. More sophisticated implementations are used in the field of speech processing, where classifications have to be performed on raw data with very high sampling rates of up to 40 kHz.
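Such an incremental evaluation could be realized with the scaled forward recursion, of which a minimal sketch is given below (not the implementation used in the study; the transition matrix A and the per-state observation likelihoods b are assumed to come from a trained model).

% Sketch: incremental per-sample log-likelihood update for one HMM.
% alpha: 1-by-Q scaled forward probabilities, logL: accumulated log-likelihood,
% A: Q-by-Q transition matrix, b: 1-by-Q likelihoods of the new observation o_t
% under each state's output density.
function [alpha, logL] = forward_step(alpha, logL, A, b)
    alpha = (alpha * A) .* b;     % propagate through transitions, weight by emissions
    c     = sum(alpha);           % scaling factor to avoid numerical underflow
    alpha = alpha / c;
    logL  = logL + log(c);        % running log P(o_1, ..., o_t | model)
end

Initializing alpha with the prior probabilities weighted by the first observation's likelihoods and calling forward_step for every new sample keeps an up-to-date class likelihood for each finger model.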

Second, tuning the features to an informative subset made the results less dependent on the particular classifier for the presented offline case. Preliminary analysis using compromised parameter sets for the feature spaces as well as for the feature subset size instead led to a higher performance gap between both classifiers. Adapting to each subject individually, no strict superiority of the HMMs over the SVM was observed in subjects 1 and 3. For subject 4, the HMM actually outperformed the SVM in all cases. Interestingly, this coincides with the grid resolution, which was twice as high for this subject, supporting the importance of the degree and spatial precision of motor cortex coverage. For subject 2, decoding performances were significantly worse in comparison with the other subjects. This is probably due to the grid placement not being optimal for the classification (supplementary figure SI2, available at stacks.iop.org/JNE/10/056020/mmedia). Figure 5 shows no distinct informative region, and channels selected at a high rate are located at the outer boundary of the grid rather than over motor cortices. This hypothesis is further supported by the decoding accuracies, which are higher for LFTD than for HG features in subject 2. Since informative regions were spread more widely across the cortex for subject 2 (Kubanek et al 2009), regions exhibiting critical high gamma motor region activity may have been located outside the actual grid.

While the results suggest that high feature quality is the main factor for high performances in this ECoG data set, two key points have to be emphasized for the application of HMMs to the present finger classification problem. First, due to the high number of free model parameters, feature selection is essential for achieving high decoding accuracy. Note that the SVM has fewer free parameters to be estimated during training. These parameters mainly correspond to the weights α, whose number directly depends on the number of training samples; the functional relationship between features and labels even depends only on the non-zero weights of the support vectors. Thus the SVM exhibits higher robustness to variations of the subset size. Second, imposing constraints on the HMM model structure was found to be important for achieving high decoding accuracies. As stated in the methods section, the introduction of the Bakis model does not substantially influence the number of free parameters; the subset size has a much higher impact on the overall number (figure 6). Nevertheless, a performance gain of up to 8.8% per session, or 5.9% mean per subject across all sessions, was achieved for the LFTD features. This suggests that the improvements may be due not merely to the reduction in the absolute number of parameters, but also to a better compliance of the model with the underlying temporal data structure, i.e. the evolution of feature characteristics over time. For the LFTD space, information is encoded in the signal phase and hence in the temporal structure. The fact that the Bakis model constrains this temporal structure hence entails interesting implications. Higher overall accuracies for the HG space, along with only minor benefits from the Bakis model, indicate that the clear advantages of model constraints mainly arise if the relevant information does not appear as prominently in the data. This is further supported by a higher increase in HMM performance when not using the optimal subset size for LFTD features (see supplementary figure SI9, available at stacks.iop.org/JNE/10/056020/mmedia). This suggests that the Bakis approach tends to be more robust with respect to deviations from the optimal parameter setting.

Finally, it should be mentioned that the current test procedure does not fully allow for non-stationary behaviour. That means that the random trial selection does not take the temporal evolution of signal properties into account. These might change and blur the feature space representation over time (Lemm et al 2011). In this context, and depending on the extent of non-stationarity, the presented results may slightly overestimate the real decoding rates. This applies to both SVM and HMM. Non-stationarity should be addressed in an online setup by updating the HMM on-the-fly. Training data should then only be used from past recordings, while the signal characteristics of future test set data remain unknown. While online model updates are generally unproblematic for the HMM training procedure, investigations on this side are still needed.

5. Conclusions

For a defined offline case, we have shown that high decoding rates can be achieved with both the HMM and SVM classifiers in most subjects. This suggests that the general choice of classifier is of less importance once HMM robustness is increased by introducing model constraints. The performance gap between both machine learning methods is sensitive to feature extraction and particularly to electrode subset selection, raising interesting implications for future research. This study restricted HMMs to a rigid time window with a known event, a scenario that suits SVM requirements. This condition may be relaxed, and classification improved, by use of the HMM time-warp properties (Sun et al 1994), which allow for the same spatio-temporal pattern of brain activity occurring at different temporal rates.

By exploiting the actual potential of such models, we expect them to be individually adapted to specific BCI applications. The idea of adaptive online learning may be implemented for non-stationary signal properties.

Future research may encompass extended and additional model constraints for special cases, as successfully used in ASR, such as fixed transition probabilities or the sharing of mixture distributions among different states. Finally, the incorporation of other sources of prior knowledge and context awareness may further optimize decoding accuracy.

Acknowledgments

The authors acknowledge the work of Fanny Quandt, who substantially contributed to the acquisition of the ECoG data. Supported by NINDS grant NS21135, Land Sachsen-Anhalt grant MK48-2009003 and ECHORD 231143. This work is partly funded by the German Ministry of Education and Research (BMBF) within the Forschungscampus STIMULATE (grant 03FO16102A).

References

Acharya M Fifer M Benz H Crone N and Thakor N 2010 Electrocorticographic amplitude predicts finger positions during slow grasping motions of the hand J Neural Eng 7 13

Ball T Schulze-Bonhage A Aertsen A and Mehring C 2009 Differential representation of arm movement direction in relation to cortical anatomy and function J Neural Eng 6 016

Bhattacharyya A 1943 On a measure of divergence between two statistical populations defined by their probability distributions Bull Calcutta Math Soc 35 99–109

Chang C-C and Lin C-J 2011 LIBSVM: a library for support vector machines ACM Trans Intell Syst Technol 2 1–27

Cincotti F et al 2003 Comparison of different feature classifiers for brain computer interfaces Proc 1st Int IEEE EMBS Conf on Neural Engineering (Capri Island, Italy) pp 645–7

Crone N Miglioretti D Gordon B and Lesser R 1998 Functional mapping of human sensorimotor cortex with electrocorticographic spectral analysis: II. Event-related synchronization in the gamma bands Brain 121 2301–15

Davies D and Bouldin D 1979 A cluster separation measure IEEE Trans Pattern Anal Mach Intell PAMI-1 224–7

Demirer R Ozerdem M and Bayrak C 2009 Classification of imaginary movements in ECoG with a hybrid approach based on multi-dimensional Hilbert-SVM solution J Neurosci Methods 178 214–8

Dempster A Laird N and Rubin D B 1977 Maximum likelihood from incomplete data via the EM algorithm J R Stat Soc B 39 1–38

Dethier J Gilja V Nuyujukian P Elassaad S A Shenoy K V and Boahen K 2011 Spiking neural network decoder for brain-machine interfaces EMBS'11: Proc IEEE 5th Int Conf on Neural Engineering in Medicine and Biological Society (Cancun, Mexico) pp 396–9

Duda R O Hart P E and Stork D G 2001 Pattern Classification 2nd edn (New York: Interscience)

Fisher R A 1936 The use of multiple measurements in taxonomic problems Ann Eugen 7 179–88

Ganguly K and Carmena J M 2010 Neural correlates of skill acquisition with a cortical brain-machine interface J Motor Behav 42 355–60

Graimann B Huggins J Levine S and Pfurtscheller G 2004 Toward a direct brain interface based on human subdural recordings and wavelet-packet analysis IEEE Trans Biomed Eng 51 954–62

Guyon I and Elisseeff A 2003 An introduction to variable and feature selection J Mach Learn Res 3 1157–82

Guyon I Weston J Barnhill S and Vapnik V 2002 Gene selection for cancer classification using support vector machines Mach Learn 46 389–422

Hastie T Tibshirani R and Friedman J 2009 The Elements of Statistical Learning 2nd edn (New York: Springer)

Hoffmann U Vesin J and Diserens K 2007 An efficient P300-based brain-computer interface for disabled subjects J Neurosci Methods 167 115–25

Kubanek J Miller K Ojemann J Wolpaw J and Schalk G 2009 Decoding flexion of individual fingers using electrocorticographic signals in humans J Neural Eng 6 14

Lederman D and Tabrikian J 2012 Classification of multichannel EEG patterns using parallel hidden Markov models Med Biol Eng Comput 50 319–28

Lee H and Choi S 2003 PCA+HMM+SVM for EEG pattern classification Proc 7th Int Symp on Signal Processing and Its Applications vol 1 pp 541–4

Lemm S Blankertz B Dickhaus T and Mueller K-R 2011 Introduction to machine learning for brain imaging NeuroImage 56 387–99

Liang N and Bougrain L 2009 Decoding finger flexion using amplitude modulation from band-specific ECoG ESANN: Eur Symp on Artificial Neural Networks pp 467–72

Liu C Zhao H B Li C S and Wang H 2010 Classification of ECoG motor imagery tasks based on CSP and SVM BMEI: 3rd Int Conf on Biomedical Engineering and Informatics vol 2 pp 804–7

Lotte F Congedo M Lecuyer A Lamarche F and Arnaldi B 2007 A review of classification algorithms for EEG-based brain-computer interfaces J Neural Eng 4 R1–R13

Makeig S 1993 Auditory event-related dynamics of the EEG spectrum and effects of exposure to tones Electroencephalogr Clin Neurophysiol 86 283–93

Murphy K 2005 Hidden Markov model (HMM) toolbox for Matlab www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html

Obermaier B Guger C Neuper C and Pfurtscheller G 2001a Hidden Markov models for online classification of single trial EEG data Pattern Recognit Lett 22 1299–309

Obermaier B Guger C and Pfurtscheller G 1999 Hidden Markov models used for the offline classification of EEG data Biomed Tech/Biomed Eng 44 158–62

Obermaier B Neuper C Guger C and Pfurtscheller G 2001b Information transfer rate in a five-classes brain-computer interface IEEE Trans Neural Syst Rehabil Eng 9 283–8

Palaniappan R Chanan R and Syan S 2009 Current practices in electroencephalogram based brain-computer interfaces Encyclopedia of Information Science and Technology II (Hershey, PA: IGI Global) pp 888–901

Pasqual-Marqui R D Michel C M and Lehmann D 1995 Segmentation of brain electrical activity into microstates: model estimation and validation IEEE Trans Biomed Eng 42 658–65

Pechenizkiy M 2005 The impact of feature extraction on the performance of a classifier: kNN, Naive Bayes and C4.5 Proc 18th Canadian Society Conf on Advances in Artificial Intelligence (Berlin: Springer) pp 268–79

Penfield W and Boldrey E 1937 Somatic motor and sensory representation in the cerebral cortex of man as studied by electrical stimulation Brain 60 389–443

Picton T W Lins O G and Scherg M 1995 The recording and analysis of event-related potentials Handbook of Neuropsychology vol 10, ed F Boller and J Grafman (Amsterdam: Elsevier) pp 3–73

Pistohl T Ball T Schulze-Bonhage A Aertsen A and Mehring C 2008 Prediction of arm movement trajectories from ECoG-recordings in humans J Neurosci Methods 167 105–14

Quandt F Reichert C Hinrichs H Heinze H Knight R and Rieger J 2012 Single trial discrimination of individual finger movements on one hand: a combined MEG and EEG study NeuroImage 59 3316–24

Rabiner L 1989 A tutorial on hidden Markov models and selected applications in speech recognition Proc IEEE 77 257–86

Schalk G 2010 Can electrocorticography (ECoG) support robust and powerful brain-computer interfaces? Front Neuroeng 3 1

Schukat-Talamazzini E G 1995 Automatische Spracherkennung: Grundlagen, statistische Modelle und effiziente Algorithmen (Kunstliche Intelligenz) (Braunschweig: Vieweg)

Shenoy P Miller K Ojemann J and Rao R 2007 Finger movement classification for an electrocorticographic BCI CNE'07: 3rd Int IEEE/EMBS Conf on Neural Engineering pp 192–5

Sun D X Deng L and Wu C F J 1994 State-dependent time warping in the trended hidden Markov model Signal Proc 39 263–75

Wissel T and Palaniappan R 2011 Considerations on strategies to improve EOG signal analysis Int J Artif Life Res 2 6–21

Yu L and Liu H 2004 Efficient feature selection via analysis of relevance and redundancy J Mach Learn Res 5 1205–24

Zhao H B Liu C Wang H and Li C S 2010 Classifying ECoG signals using probabilistic neural network ICIE'10: WASE Int Conf on Information Engineering vol 1 pp 77–80

14

  • 1 Introduction
  • 2 Material and methods
    • 21 Patients experimental setup and data acquisition
    • 22 Test framework
    • 23 Feature generation
    • 24 Feature selection
    • 25 Classification
    • 26 Considerations on testing
      • 3 Results
        • 31 Feature and channel selection
        • 32 Decoding performances
          • 4 Discussion
          • 5 Conclusions
          • Acknowledgments
          • References
Page 6: Hidden Markov model and support vector machine based ...knightlab.berkeley.edu/statics/publications/2014/10/02/...et al 2011, Acharya et al 2010,Ballet al 2009, Kubanek et al 2009,LiangandBougrain2009,Pistohletal2008),thesupport

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 2 An HMM structure for one class consists of Q states emitting multi-dimensional observations ot at each time step The state aswell as the output sequence is governed by a set of probabilities including transition probabilities aij To assign a sequence Ot to a particularmodel the classifier decides for the highest probability P(Ot|model)

them optimally separates a certain finger combination Nextfeatures with the smallest geometric mean per class are addedif they are not part of the subset already Finally in case thepredefined number of elements in the subset O admits furtherfeatures vacancies are filled according to a sorted ranking ofindices averaged across class combinations per feature

For each trial this subset results in an O-dimensionalfeature vector varying over time t isin [0 T ] While thismultivariate sequence was fed directly into the HMM classifierthe OtimesT matrix Ot had to be reshaped into a 1times (T middotO) vectorfor the SVM The subset size O is obtained by exhaustivesearch on the training data This makes the difference betweenboth classifiers in data modelling obvious

While an HMM technically distinguishes between thetemporal development (columns in Ot) and the feature type(rows in Ot) the SVM concatenates all channels and theirtime courses within the same one-dimensional feature vectorThe potential flexibility of HMMs with respect to temporalphenomena arises substantially from this fact

25 Classification

251 Hidden Markov models Arising from a state machinestructure hidden Markov models (HMMs) comprise twostochastic processes First as illustrated in figure 2 a setof Q underlying states is used to describe a sequence Ot ofT O-dimensional feature vectors These states qt isin sii=1Q

are assumed to be hidden while each is associated with anO-dimensional multivariate Gaussian M-mixture distributionto generate the observed featuresmdashthe second stochasticprocess Both the state incident qt as well as the observedoutput variable ot at time t are subject to the Markov propertymdasha conditional dependence only on the last step in time t minus 1

Among other parameters the state sequence structure andits probable variations are described by prior probabilitiesP(q1 = si) and a transition matrix

A = ai ji j=1Q = P(qt+1 = s j|qt = si)i j=1Q

A more detailed account of the methodological background isgiven by Rabiner (1989)

For each of the four classes one HMM had to be definedwhich was trained using the iterative BaumndashWelch algorithmwith eight expectation maximization (EM) steps (Dempsteret al 1977) The classification on the test set was carried outfollowing a maximum likelihood approach as indicated by thecurly bracket in figure 2

In order to increase decoding performance the abovestandard HMM structure was constrained and adapted to theclassification problem This aims at optimizing the modelhypothesis to give a better fit and reduces the number of freeparameters to deal with limited training data

252 Model constraints First the transition matrix A wasconstrained to the Bakis model commonly used for wordmodelling in automatic speech processing (ASR) (Schukat-Talamazzini 1995) This model admits non-zero elementsonly on the upper triangular part of the transition matrixmdashspecifically on the main as well as first and second secondarydiagonal Thus this particular structure implies only lsquolooprsquolsquonextrsquo and lsquoskiprsquo transitions and suppresses state transitionson the remaining upper half Preliminary results using a fullupper triangular matrix instead revealed small probabilitiesie very unlikely state transitions for these suppressedtransitions throughout all finger models This suggests onlyminor contributions to the signal characteristics describedby that model A full ergodic transition matrix generated

5

J Neural Eng 10 (2013) 056020 T Wissel et al

poor decoding performances in general Recently a similarfinding was presented by Lederman and Tabrikian for EEGdata (Lederman and Tabrikian 2012)

Given five states containing one Gaussian the unleashedmodel would involve 4 times 1235 degrees of freedom while theBakis model uses 4 times 1219 for an exemplary selection of15 channels The significant increase of decoding performanceby dropping only a few parameters seems interesting withrespect to the nature of the underlying signal structure andsheds remarkable light on the importance of feature selectionfor HMMs An unleashed model using all 64 channels wouldhave 4 times 20835 free parameters

253 Model structure Experiments for all feature spaceshave been conducted in which the optimal number of states Qper HMM and the number of mixture distribution componentsM per state have been examined by grid search Sinceincreasing the degrees of freedom will degrade the qualityof model estimation high decoding rates have mainly beenobtained for a small number of states containing only a fewmixture components Based on this search a model structureof Q = 5 states and M = 1 mixture components has beenchosen as the best compromise Note that the actual optimummight vary slightly depending on the particular subject thefeature space and even on the session which is in line withother studies (Obermaier et al 1999 2001a 2001b)

254 Initialization The initial parameters of themultivariate probability densities have been set using a k-means clustering algorithm (Duda et al 2001) The algorithmtemporally clusters the vectors of the O times T feature matrix OtThe rows of this matrix were extended by a scaled temporalcounter adding a time stamp to each feature vector Thisincreases the probability of time-continuous state intervals inthe cluster output Then for each of the Q middot M clusters themean as well as the full covariance matrix have been calculatedexcluding the additional time stamp Finally to assign thesemixture distributions to specific states the clusters were sortedaccording to the mean of their corresponding time stampvalues This temporal order is suggested by the constrainedforward structure of the transition matrix The transition matrixis initialized with probability values on the diagonal thatare weighted to be c-times as likely as probabilities off thediagonal The constant c is being calculated as the quotientof the number of samples for a feature T and the number ofstates Q In case of multiple mixture components (M gt 1) themixture weight matrix was initialized using the normalizednumber of sample vectors assigned to each cluster

255 Implementation The basic implementation has beenderived from the open source functionalities on HMMsprovided by Kevin Murphy from the University of BritishColumbia (Murphy 2005) This framework its modificationsas well as the entire code developed for this study wereimplemented using MATLAB R2012b from MathWorks

256 Support vector machines As a gold-standardreference we used a one-vs-one SVM for multi-classclassification In our study training samples come in pairedassignments (xi yi)i=1N where xi is a 1 times (T middot O) featurevector derived from the ECoG data for one trial with itsclass assignment yi = ndash11 Equation (4) describes theclassification rule of the trained classifier for new samplesx A detailed derivation of this decision rule and the rationalebehind SVM-classification is given eg by Hastie et al (2009)

F(x) = sgn(sum

iαiyiK(xi x) + β0

)(4)

The αi isin [0 C] are weights for the training vectors and areestimated during the training process Training vectors withweights αi = 0 contribute to the solution of the classificationproblem and are called support vectors The offset of theseparating hyperplane from the origin is proportional to β0The constant C generally controls the trade-off betweenclassification error on the training set and smoothness ofthe decision boundary The numbers of training samplesper session (table 1) were in a similar range compared tothe resulting dimension of the SVM feature space T middot O(sim130minus400) Influenced by this fact the decoding accuracyof the SVM revealed almost no sensitivity with respect toregularization Since grid search yielded a stable plateau ofhigh performance for C gt 1 the regularization constant wasfixed to C = 1000 The linear kernel which has been usedthroughout the study is denoted by K(xi x) We further usedthe LIBSVM (wwwcsientuedutwsimcjlinlibsvm) packagefor Matlab developed by Chih-Chung Chang and Chih-Jen Lin(Chang and Lin 2011)

2.6. Considerations on testing

In order to evaluate the findings presented in the following sections, a five-fold cross-validation (CV) scheme was used (Lemm et al 2011). The assignment of a trial to a specific fold set was performed using uniform random permutations.

Each CV procedure was repeated 30 times to average out fluctuations in the resulting recognition rates arising from the random partitioning of the CV sets. The number of repetitions was chosen as a compromise between obtaining stable estimates of the generalization error and avoiding massive computational load. The training data were balanced between all classes to treat no class preferentially and to avoid the need for class prior probabilities. Feature subset selection and estimation of the parameters of each classifier were performed only on the data sets allocated for training, in order to guarantee a fair estimate of the generalization error. Due to the prohibitive computational effort, the feature spaces were parameterized only once for each subject and session and are hence based on all samples of each subject-specific data set. In this context, preliminary experiments indicated minimal differences in decoding performance (see table 2) when fixing the feature definitions based on the training sets only. These experiments used nested CV with an additional validation set for the feature space parameters.
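A compact sketch of this testing scheme is given below; balanceClasses and trainAndTest are hypothetical placeholders for the class-balancing step and the classifier-specific training/decoding on one fold.

% Hedged sketch of the repeated five-fold CV scheme (30 repetitions).
nReps = 30; nFolds = 5; nTrials = numel(labels);
acc = zeros(nReps, nFolds);
for r = 1:nReps
    folds = mod(randperm(nTrials), nFolds) + 1;          % uniform random fold assignment
    for f = 1:nFolds
        trainIdx = balanceClasses(labels, folds ~= f);   % hypothetical: equalize class counts
        testIdx  = find(folds == f);
        acc(r, f) = trainAndTest(features, labels, trainIdx, testIdx);  % hypothetical helper
    end
end
meanAcc = mean(acc(:));                                  % estimate of the generalization error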


Figure 3. Optimal parameters of both feature spaces for each subject and session. (a) Location and width of the ROIs containing the most discriminative information for LFTD features. (b) Optimal high cut-off frequency of the spectral lowpass filter used to extract LFTD features. (c) Location and width of the ROIs containing the most discriminative information for HG features. (d) Low and high cut-off frequency of the optimal HG frequency band.

Table 2. HMM decoding accuracies (in %) for parameter optimization on the full data and on the training subsets. Accuracies are given for subject 1, session 1. Parameter optimization was performed on the full dataset (standard CV routine) as well as on the training subset only (nested CV routine).

Parameter optimization on    LFTD         HG           LFTD + HG
Full dataset                 80.7 ± 0.3   97.9 ± 0.1   98.4 ± 0.1
Training subset              81.1 ± 1.0   97.8 ± 0.3   98.0 ± 0.2

3. Results

3.1. Feature and channel selection

In order to identify optimal parameters for feature extraction, a grid search was performed for each subject and session. For this optimization, a preliminary feature selection was applied aiming at a fixed subset size of O = 8, which was found to be within the range of optimal subset sizes (see figure 6 and table 3). Results for all sessions are illustrated by the bar plots in figure 3. A sensitivity of the decoding rate has been found for temporal parameters such as the location and width of the time windows (ROIs) containing the most discriminative information about LFTD or HG features (figures 3(a) and (c)).

The highest impact on accuracy was found for the HG ROI (mean optimal ranges: S1 [−187.5, 337.5] ms, S3 [−400, 475] ms, S4 [−280, 420] ms).

Further sensitivity is given for spectral parameters, namely the high cut-off frequency of the LFTD lowpass filter and the low and high cut-offs of the HG bandpass (figures 3(b) and (d)). For the former, optimal frequencies ranged from 11–30 Hz, and for the latter the optimal frequency bands all lay within 60–300 Hz, depending on the subject. There was hardly any impact on the decoding rate when changing the sliding window size and window overlap for the HG features. These were hence set to fixed values. According to these parameter settings, the mean numbers of samples T were 26.5 (subject 1), 21.3 (subject 2), 31 (subject 3) and 22 (subject 4).
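To make the two feature types concrete, the following sketch extracts both sequences for a single channel; fs is the sampling rate, x a raw single-trial row vector, the cut-offs (24 Hz lowpass, 85–270 Hz band) are the example values shown in figure 4, and the window length and overlap are assumptions rather than the study's exact settings.

% Hedged sketch of LFTD and HG feature extraction for one channel.
[b, a] = butter(4, 24/(fs/2), 'low');             % assumed 4th-order lowpass for LFTD
lftd = filtfilt(b, a, x);                         % zero-phase low-frequency time course
win = round(0.1*fs); step = round(0.05*fs);       % assumed 100 ms window, 50% overlap
nWin = floor((numel(x) - win)/step) + 1;
f = (0:floor(win/2)) * fs/win;                    % one-sided FFT bin frequencies
band = f >= 85 & f <= 270;                        % example high gamma band
hg = zeros(1, nWin);
for k = 1:nWin
    seg = x((k-1)*step + (1:win)) .* hamming(win)';
    S = abs(fft(seg)).^2;                         % windowed power spectrum
    S = S(1:floor(win/2)+1);                      % keep the one-sided part
    hg(k) = mean(S(band));                        % mean HG band power per window
end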

An example of the grid search procedure is shown in supplementary figures SI5–SI7 (available at stacks.iop.org/JNE/10/056020/mmedia). The plots show varying decoding rates for S1, session 1, within individual parameter subspaces. The optimal parameters illustrated in figure 3 represent a single point in these spaces at the peak decoding rate. Based on these parameters, figure 4 illustrates a typical result for an LFTD and HG feature sequence extracted from the shown raw ECoG signal (S1, session 2, channel 23, middle finger). All feature sequences for this trial, covering the entire set of channels, are given in supplementary figure SI8.

Using the resulting parameterization, subsequent CV runs sought to identify the optimal subset by tuning O based on the training set of the current fold. Figure 5 illustrates the rate at which a particular channel was selected for each subject.


Figure 4. The dashed grey time course shows a raw ECoG signal for a typical trial (S1, session 2, channel 23, middle finger). The extracted feature sequences for LFTD (cut-off at 24 Hz) and HG (frequency band [85 Hz, 270 Hz]) are overlaid as solid lines. Each HG feature has been assigned to the time point at the centre of its corresponding FFT window.

Figure 5. Channel maps for all four subjects. The channels are arranged according to the electrode grid and coloured with respect to their mean selection rate in repeated CV runs. Channels coloured like the top of the colour bar tend to have a higher probability of being selected.

It evaluates the optimal subsets O across all sessions, folds and several CV repetitions. Results are shown exemplarily for high gamma features. Full detail on subject-specific grid placement and channel maps for LFTD and HG features can be found in supplementary figures SI1–SI4 (available at stacks.iop.org/JNE/10/056020/mmedia).

The channel maps reveal localized regions of highly informative channels (dark red) comprising only a small subset within the full grid. While these localizations are distinct and focused for high gamma features, informative channels spread more widely across the electrode grid for LFTD features. This finding is in line with an earlier study by Kubanek et al (2009) and implies that selected subsets for LFTD features tend to be less stable across CV runs. Combined subsets were selected from both feature spaces. The selections indicated a preference for a higher proportion of LFTD features (table 3). This optimal proportion was determined by grid search.

The optimal absolute subset size O was selected by successively adding channels to the subset, as defined by our selection procedure, and identifying the peak performance. For the high gamma case, figure 6 shows a sharp increase in decoding performance for subset sizes below five. Performance then increases less rapidly until it reaches a maximum, usually in the range of 6–8 channels for the HG features (see table 3 for detailed information on the results for the other features). This behaviour again reflects the fact that a lot of information is found in a small number of HG sequences.

Increasing the subset size beyond this optimum leads to a decrease in performance for both SVM and HMM. This coincides with an increase in the number of free parameters in the classifier model, which for HMMs depends quadratically on the number of channels (figure 6, bottom).
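The subset-size search described above reduces to tracking the cross-validated accuracy while channels are added in a fixed ranking order; in the sketch below, ranking (e.g. a DB-index ordering of channels) and cvAccuracy are hypothetical placeholders for the selection criterion and the classifier-specific evaluation.

% Hedged sketch of the forward subset-size search over ranked channels.
maxO = numel(ranking);
accBySize = zeros(1, maxO);
for o = 1:maxO
    subset = ranking(1:o);                 % first o channels of the ranking
    accBySize(o) = cvAccuracy(subset);     % hypothetical cross-validated accuracy
end
[bestAcc, bestO] = max(accBySize);         % peak accuracy defines the optimal subset size O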

3.2. Decoding performances

Based on the configuration of the feature spaces and the feature selection, a combination space containing both LFTD and HG features yielded the best decoding accuracy across all sessions and subjects. Figure 7 summarizes the highest HMM performances for each subject, averaged across all sessions, and compares them with the corresponding SVM accuracies. Except for subject 2, these accuracies were close to or even exceeded 90% correct classifications. However, figure 8 shows that accuracies obtained only from the HG space come very close to this combined case, implying that LFTD features contribute limited new information. Using time domain features exclusively decreased performance in most cases. An exception is given by subject 2, where LFTD features outperformed HG features. This may be explained by the wider spread of LFTD features across the cortex and the suboptimal grid placement (little coverage of motor cortex; supplementary figure SI2, available at stacks.iop.org/JNE/10/056020/mmedia). The channel maps show informative channels at the very margin of the grid. This suggests that grid placement was not always optimal for capturing the most informative electrodes.

Differences in decoding accuracy compared to the SVM were found to be small (table 4) and revealed no clear superiority of either classifier (figure 9). Excluding subject 2, the mean performance difference (HMM − SVM) was −0.53% for LFTD, 0.1% for HG and −0.23% for combined features. Accuracy trends for individual subjects are similar across sessions and vary less than across subjects (figure 9). Note that, irrespective of the suboptimal grid placement, subject 2 took part in four experimental sessions. Here, subject-specific variables, e.g. grid placement, were important determinants of decoding success. Thus we interpret our results at the subject level and conclude that in three out of four subjects SVM and HMM decoding of finger movements led to comparable results.


Figure 6. Influence of channel selection on the HMM decoding accuracy using high gamma features only (S1, session 1). Top: changes in decoding accuracy when successively adding channels to the feature set, applying either random or DB-index channel selection. Bottom: number of free HMM parameters for each number of channels.

Table 3. Number of selected channels averaged across sessions. The averaged number of selected channels and the corresponding standard deviations are given for all subjects and all feature spaces. For the combined space, square brackets denote the corresponding breakdown into its components. The average is taken across all sessions and CV repetitions.

Subject   LFTD         HG          LFTD + HG
1         10.0 ± 0.0   6.0 ± 0.0   16.5 ± 0.5 ([10.0 ± 0.0] + [6.5 ± 0.7])
2          6.5 ± 1.9   7.3 ± 2.1   12.8 ± 3.4 ([7.0 ± 4.5] + [5.8 ± 1.5])
3         10.0 ± 0.0   8.0 ± 0.0   14.0 ± 1.0 ([6.0 ± 1.4] + [8.0 ± 0.0])
4         10.0 ± 0.0   6.0 ± 1.4   16.0 ± 0.7 ([9.5 ± 0.7] + [6.5 ± 0.7])

A clear superiority of SVM classification was only found for subject 2, for whom the absolute decoding performances were significantly worse than those of the other three subjects.

For the single spaces LFTD and HG, table 4 also shows the mean performance loss for the unleashed HMM. This refers to the case wherein the Bakis model constraints are relaxed to the full model. For all but two sessions of subject 2 (the full model performed better by 0.2% for session 3 and 0.7% for session 4), the Bakis model outperformed the full model. The results indicate a higher benefit for the LFTD features, with an up to 8.8% performance gain for subject 3, session 2. Higher accuracy gains were consistently achieved for feature subsets deviating from the optimal size. An example is illustrated for LFTD channel selection (subject 3, session 2) in supplementary figure SI9 (available at stacks.iop.org/JNE/10/056020/mmedia). The plot compares the Bakis and the full model, similar to the procedure in figure 6. While table 4 lists only the improvements for the optimal number of channels, supplementary figure SI9 indicates that the performance gain may be even higher when deviating from the optimal settings.

Finally, figure 10 illustrates the decomposition of the accuracies into true positive rates for each finger. The results reveal similar trends across subjects and feature spaces. While the thumb and little finger provided the fewest training samples, their true positive rates tend to be the highest among all fingers. An exception is given for the LFTD features of subject 2. Equivalent plots for the SVM are shown in supplementary figure SI10 (available at stacks.iop.org/JNE/10/056020/mmedia).

For the classification of one feature sequence including all T samples, the HMM Matlab implementation took on average (3.522 ± 0.036) ms and the SVM implementation in C (0.0281 ± 0.0034) ms. The mean computation time for a feature sequence was 0.2 ms ± 2.1 μs (LFTD), 1.72 ms ± 37.9 μs (HG) and 1.79 ms ± 20.0 μs (combined space). The results were obtained with an Intel Core i5-2500K, 3.4 GHz, 16 GB RAM, Win 7 Pro 64-bit, Matlab 2012b, for ten selected channels.


Figure 7. Decoding accuracies (in %) for HMM (green) and SVM (beige) using the LFTD + HG feature space (best case). Average performances across all sessions are shown for all subjects (S1–S4) as the mean result of a 30 × 5 CV procedure, with their corresponding error bars.

Figure 8. Decoding accuracies (in %) for the HMMs for all feature spaces and sessions Bi. Performances are averaged across 30 repetitions of the five-fold CV procedure and displayed with their corresponding error bars.

4. Discussion

An evaluation of the feature space configurations revealed a sensitivity of the decoding performance to the precise parameterization of the individual feature spaces. Apart from these general aspects, the grid search primarily allows for individual parameter deviations across subjects and sessions. Importantly, a sensitivity has been identified for the particular time interval of interest used in the sliding window approach to extract HG features. This indicates varying temporal dynamics across subjects and may favour the application of HMMs in online applications. This sensitivity suggests that adapting to these temporal dynamics influences the degree to which information can be deduced from the data by both HMM and SVM. However, our investigations indicate that HMMs are less sensitive to changes of the ROI and give nearly stable performances when changing the window overlap. As a model property, the HMM is capable of modelling informative sequences of different lengths and rates. The latter characteristic originates from the explicit loop transition within a particular model. This means the HMM may cover uninformative samples of a temporal sequence with an idle state before the informative part starts. Underlying temporal processes that persist longer or shorter are modelled by more or fewer repetitions of basic model states. Lacking any time-warp properties, an SVM expects a fixed sequence length and information rate, meaning a fixed association between a particular feature dimension oi and a specific point in time ti. This highlights the strict comparability of information within one single feature dimension explicitly required by the SVM.


Figure 9. Differences in decoding accuracy (in %) for the HMMs with respect to the SVM reference classifier (HMM − SVM). Results are displayed for all sessions Bi and feature spaces.

Figure 10. Mean HMM decoding rates decomposed into true positive rates for each finger. Results are shown for each feature space and averaged across sessions and CV repetitions.

Table 4. Final decoding accuracies (in %) listed for HMM (left) and SVM (right). Accuracies for the Bakis model are given for each subject, session and feature space by their mean and standard error across all 30 repetitions of the CV procedure. Additionally, for each subject and feature space, the mean across all sessions as well as the mean loss in performance for LFTD and HG when using an unconstrained HMM is shown.

                          HMM                                      SVM
Subject–session           LFTD         HG           LFTD + HG      LFTD         HG           LFTD + HG
S1–B1 (%)                 80.7 ± 0.3   97.9 ± 0.1   98.4 ± 0.1     85.8 ± 0.4   97.2 ± 0.2   98.3 ± 0.1
S1–B2 (%)                 78.8 ± 0.6   92.4 ± 0.4   95.3 ± 0.3     84.9 ± 0.5   93.4 ± 0.4   96.1 ± 0.3
Mean (%)                  79.8 ± 0.5   95.2 ± 0.3   96.9 ± 0.2     85.4 ± 0.5   95.3 ± 0.3   97.2 ± 0.2
Unconstrained model (%)   −3.1 ± 0.5   −0.9 ± 0.3   –              –            –            –
S2–B1 (%)                 44.0 ± 1.0   35.9 ± 0.7   44.4 ± 0.6     51.4 ± 0.8   38.0 ± 0.8   55.7 ± 0.8
S2–B2 (%)                 50.2 ± 1.0   48.1 ± 0.7   52.4 ± 0.8     55.5 ± 0.7   48.0 ± 0.6   60.2 ± 0.8
S2–B3 (%)                 40.2 ± 0.7   36.3 ± 0.8   40.1 ± 0.8     49.6 ± 0.7   39.5 ± 0.9   51.5 ± 0.8
S2–B4 (%)                 30.9 ± 0.7   36.4 ± 0.9   38.4 ± 0.9     39.1 ± 0.7   38.0 ± 1.0   42.7 ± 0.9
Mean (%)                  41.2 ± 0.9   39.2 ± 0.8   43.8 ± 0.8     48.9 ± 0.7   40.9 ± 0.9   52.5 ± 0.9
Unconstrained model (%)   −2.2 ± 0.9   −0.4 ± 0.8   –              –            –            –
S3–B1 (%)                 64.6 ± 0.7   85.1 ± 0.4   85.8 ± 0.4     61.2 ± 0.7   85.4 ± 0.5   88.8 ± 0.4
S3–B2 (%)                 80.1 ± 0.5   90.5 ± 0.3   93.7 ± 0.3     85.0 ± 0.4   91.4 ± 0.3   96.0 ± 0.3
Mean (%)                  72.4 ± 0.6   87.8 ± 0.4   89.8 ± 0.4     73.1 ± 0.6   88.4 ± 0.4   92.4 ± 0.3
Unconstrained model (%)   −5.9 ± 0.7   −2.5 ± 0.4   –              –            –            –
S4–B1 (%)                 71.4 ± 0.5   83.5 ± 0.5   87.2 ± 0.4     68.5 ± 0.6   82.1 ± 0.4   86.7 ± 0.4
S4–B2 (%)                 69.7 ± 0.7   83.1 ± 0.3   87.9 ± 0.4     64.3 ± 0.6   82.3 ± 0.3   84.4 ± 0.4
Mean (%)                  70.6 ± 0.7   83.3 ± 0.3   87.6 ± 0.4     66.4 ± 0.6   82.2 ± 0.3   85.6 ± 0.4
Unconstrained model (%)   −1.1 ± 0.6   −3.1 ± 0.5   –              –            –            –


This leads to the prediction of benefits for HMMs in online applications where the time of the button press or other events may not be known beforehand. The classifier may act more robustly on activity that is not perfectly time locked. Imaging paradigms for motor imagery or similar experiments may provide scenarios in which a subject carries out an action for an unspecified time interval. Given that a varying temporal extent is likely, the need to train the SVM for the entire range of possible ROI locations and widths is not desirable.

The SVM characteristics also imply the need to await the whole sequence before making a decision for a class. In contrast, the Viterbi algorithm allows HMMs to state, at each incoming feature vector, how likely an assignment to a particular class is. In online applications this striking advantage provides more prompt classification and may reduce computational overhead if only the baseline is employed: there is no need to await further samples for a decision. Note that this emphasizes the implicit adaptation of HMMs to real-time applications, while the SVM needs to classify the whole new subsequence again after each incoming sample. HMMs just update their current prediction according to the incoming information; they do not start from scratch. The implementation used here did not make use of this online feature, since only the classification of the entire trial was evaluated. Another reason for the SVM being faster than the HMM is the non-optimized Matlab implementation of the latter. The same applies to the feature extraction. Existing high-performance libraries for these standard signal processing techniques, or even a hardware field-programmable gate array implementation, may significantly speed up the implementation. The average classification time for the HMMs implies a maximal possible signal rate of about 285 Hz when classifying after each sample. Thus, in terms of classification, real-time operation is perfectly feasible for most cases even with the current implementation. More sophisticated implementations are used in the field of speech processing, where classifications have to be performed on raw data with very high sampling rates of up to 40 kHz.
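As an illustration of such a per-sample update, the scaled forward recursion below refreshes one class-specific log-likelihood each time a new feature vector arrives (the Viterbi recursion is analogous, with a maximum in place of the sum); prior and A are that class's HMM parameters and emissionProb is a hypothetical helper returning the state-wise Gaussian likelihoods p(o_t | state).

% Hedged sketch: incremental class log-likelihood via the scaled forward pass.
b = emissionProb(model, obs(:, 1));          % Q x 1 state likelihoods for the first sample
alpha = prior .* b;                          % forward variables at t = 1
loglik = log(sum(alpha)); alpha = alpha / sum(alpha);
for t = 2:size(obs, 2)                       % one cheap update per incoming sample
    b = emissionProb(model, obs(:, t));
    alpha = (A' * alpha) .* b;               % propagate states and weight by emissions
    loglik = loglik + log(sum(alpha));       % running log p(o_1..t | class model)
    alpha = alpha / sum(alpha);              % rescale to avoid numerical underflow
    % loglik can be compared across the finger models after every sample
end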

Second, tuning the features to an informative subset made the result less dependent on the particular classifier for the presented offline case. Preliminary analyses using compromised parameter sets for the feature spaces as well as for the feature subset size instead led to a higher performance gap between both classifiers. When adapting to each subject individually, no strict superiority of the HMMs over the SVM was observed in subjects 1 and 3. For subject 4, the HMM actually outperformed the SVM in all cases. Interestingly, this coincides with the grid resolution, which was twice as high for this subject, supporting the importance of the degree and spatial precision of motor cortex coverage. For subject 2, decoding performances were significantly worse in comparison with the other subjects. This is probably due to the grid placement not being optimal for the classification (supplementary figure SI2, available at stacks.iop.org/JNE/10/056020/mmedia). Figure 5 shows no distinct informative region, and channels selected at a high rate are located at the outer boundary of the grid rather than over motor cortices. This hypothesis is further supported by the decoding accuracies, which are higher for LFTD than for HG features in subject 2. Since informative regions were spread more widely across the cortex for subject 2 (Kubanek et al 2009), regions exhibiting critical high gamma motor activity may have been located outside the actual grid.

While the results suggest that high feature quality is the main factor for high performance in this ECoG data set, two key points have to be emphasized for the application of HMMs to the present finger classification problem. First, due to the high number of free model parameters, feature selection is essential for achieving high decoding accuracy. Note that the SVM has fewer free parameters to be estimated during training. These parameters mainly correspond to the weights α, whose number directly depends on the number of training samples; the functional relationship between features and labels even depends only on the non-zero weights of the support vectors. Thus the SVM exhibits higher robustness to variations of the subset size. Second, imposing constraints on the HMM model structure was found to be important for achieving high decoding accuracies. As stated in the methods section, the introduction of the Bakis model does not substantially influence the number of free parameters; the subset size has a much higher impact on the overall number (figure 6). Nevertheless, a performance gain of up to 8.8% per session, or 5.9% on average per subject across all sessions, was achieved for the LFTD features. This suggests that the improvements are not due to a mere reduction in the absolute number of parameters but rather to a better compliance of the model with the underlying temporal data structure, i.e. the evolution of feature characteristics over time. For the LFTD space, information is encoded in the signal phase and hence in the temporal structure. The fact that the Bakis model constrains this temporal structure therefore entails interesting implications. Higher overall accuracies for the HG space, along with only minor benefits from the Bakis model, indicate that the clear advantages of model constraints mainly arise if the relevant information does not appear as prominently in the data. This is further supported by a higher increase in HMM performance when not using the optimal subset size for LFTD features (see supplementary figure SI9, available at stacks.iop.org/JNE/10/056020/mmedia). This suggests that the Bakis approach tends to be more robust with respect to deviations from the optimal parameter setting.

Finally, it should be mentioned that the current test procedure does not fully account for non-stationary behaviour. That means that the random trial selection does not take the temporal evolution of signal properties into account. These might change and blur the feature space representation over time (Lemm et al 2011). In this context, and depending on the extent of non-stationarity, the presented results may slightly overestimate the real decoding rates. This applies to both SVM and HMM. Non-stationarity should be addressed in an online setup by updating the HMM on the fly: training data should only be used from past recordings, while the signal characteristics of future test data remain unknown. While online model updates are generally unproblematic for the HMM training procedure, investigations on this side are still needed.


5. Conclusions

For a defined offline case we have shown that high decoding rates can be achieved with both the HMM and the SVM classifier in most subjects. This suggests that the general choice of classifier is of less importance once HMM robustness is increased by introducing model constraints. The performance gap between both machine learning methods is sensitive to feature extraction and particularly to electrode subset selection, raising interesting implications for future research. This study restricted the HMMs to a rigid time window with a known event, a scenario that suits SVM requirements. This condition may be relaxed, and classification improved, by use of the HMM time-warp properties (Sun et al 1994). These allow for the same spatio-temporal pattern of brain activity occurring at different temporal rates.

By exploiting the actual potential of such models, we expect them to be individually adapted to specific BCI applications. The idea of adaptive online learning may be implemented to cope with non-stationary signal properties.

Future research may encompass extended and additional model constraints for special cases, as successfully used in ASR, such as fixed transition probabilities or the sharing of mixture distributions among different states. Finally, the incorporation of other sources of prior knowledge and context awareness may further optimize decoding accuracy.

Acknowledgments

The authors acknowledge the work of Fanny Quandt, who substantially contributed to the acquisition of the ECoG data. Supported by NINDS grant NS21135, Land Sachsen-Anhalt grant MK48-2009003 and ECHORD 231143. This work is partly funded by the German Ministry of Education and Research (BMBF) within the Forschungscampus STIMULATE (grant 03FO16102A).

References

Acharya M Fifer M Benz H Crone N and Thakor N 2010 Electrocorticographic amplitude predicts finger positions during slow grasping motions of the hand J Neural Eng 7 13
Ball T Schulze-Bonhage A Aertsen A and Mehring C 2009 Differential representation of arm movement direction in relation to cortical anatomy and function J Neural Eng 6 016
Bhattacharyya A 1943 On a measure of divergence between two statistical populations defined by their probability distributions Bull Calcutta Math Soc 35 99–109
Chang C-C and Lin C-J 2011 LIBSVM: a library for support vector machines ACM Trans Intell Syst Technol 2 1–27
Cincotti F et al 2003 Comparison of different feature classifiers for brain computer interfaces Proc 1st Int IEEE EMBS Conf on Neural Engineering (Capri Island, Italy) pp 645–7
Crone N Miglioretti D Gordon B and Lesser R 1998 Functional mapping of human sensorimotor cortex with electrocorticographic spectral analysis: II. Event-related synchronization in the gamma band Brain 121 2301–15
Davies D and Bouldin D 1979 A cluster separation measure IEEE Trans Pattern Anal Mach Intell PAMI-1 224–7
Demirer R Ozerdem M and Bayrak C 2009 Classification of imaginary movements in ECoG with a hybrid approach based on multi-dimensional Hilbert-SVM solution J Neurosci Methods 178 214–8
Dempster A Laird N and Rubin D B 1977 Maximum likelihood from incomplete data via the EM algorithm J R Stat Soc B 39 1–38
Dethier J Gilja V Nuyujukian P Elassaad S A Shenoy K V and Boahen K 2011 Spiking neural network decoder for brain-machine interfaces EMBS'11: Proc IEEE 5th Int Conf on Neural Engineering in Medicine and Biological Society (Cancun, Mexico) pp 396–9
Duda R O Hart P E and Stork D G 2001 Pattern Classification 2nd edn (New York: Interscience)
Fisher R A 1936 The use of multiple measurements in taxonomic problems Ann Eugen 7 179–88
Ganguly K and Carmena J M 2010 Neural correlates of skill acquisition with a cortical brain-machine interface J Motor Behav 42 355–60
Graimann B Huggins J Levine S and Pfurtscheller G 2004 Toward a direct brain interface based on human subdural recordings and wavelet-packet analysis IEEE Trans Biomed Eng 51 954–62
Guyon I and Elisseeff A 2003 An introduction to variable and feature selection J Mach Learn Res 3 1157–82
Guyon I Weston J Barnhill S and Vapnik V 2002 Gene selection for cancer classification using support vector machines Mach Learn 46 389–422
Hastie T Tibshirani R and Friedman J 2009 The Elements of Statistical Learning 2nd edn (New York: Springer)
Hoffmann U Vesin J and Diserens K 2007 An efficient P300-based brain-computer interface for disabled subjects J Neurosci Methods 167 115–25
Kubanek J Miller K Ojemann J Wolpaw J and Schalk G 2009 Decoding flexion of individual fingers using electrocorticographic signals in humans J Neural Eng 6 14
Lederman D and Tabrikian J 2012 Classification of multichannel EEG patterns using parallel hidden Markov models Med Biol Eng Comput 50 319–28
Lee H and Choi S 2003 PCA+HMM+SVM for EEG pattern classification Proc 7th Int Symp on Signal Processing and Its Applications vol 1 pp 541–4
Lemm S Blankertz B Dickhaus T and Mueller K-R 2011 Introduction to machine learning for brain imaging NeuroImage 56 387–99
Liang N and Bougrain L 2009 Decoding finger flexion using amplitude modulation from band-specific ECoG ESANN: Eur Symp on Artificial Neural Networks pp 467–72
Liu C Zhao H B Li C S and Wang H 2010 Classification of ECoG motor imagery tasks based on CSP and SVM BMEI: 3rd Int Conf on Biomedical Engineering and Informatics vol 2 pp 804–7
Lotte F Congedo M Lecuyer A Lamarche F and Arnaldi B 2007 A review of classification algorithms for EEG-based brain-computer interfaces J Neural Eng 4 R1–R13
Makeig S 1993 Auditory event-related dynamics of the EEG spectrum and effects of exposure to tones Electroencephalogr Clin Neurophysiol 86 283–93
Murphy K 2005 Hidden Markov model (HMM) toolbox for Matlab www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
Obermaier B Guger C Neuper C and Pfurtscheller G 2001a Hidden Markov models for online classification of single trial EEG data Pattern Recognit Lett 22 1299–309
Obermaier B Guger C and Pfurtscheller G 1999 Hidden Markov models used for the offline classification of EEG data Biomed Tech/Biomed Eng 44 158–62
Obermaier B Neuper C Guger C and Pfurtscheller G 2001b Information transfer rate in a five-classes brain-computer interface IEEE Trans Neural Syst Rehabil Eng 9 283–8
Palaniappan R Chanan R and Syan S 2009 Current practices in electroencephalogram based brain-computer interfaces Encyclopedia of Information Science and Technology II (Hershey, PA: IGI Global) pp 888–901
Pascual-Marqui R D Michel C M and Lehmann D 1995 Segmentation of brain electrical activity into microstates: model estimation and validation IEEE Trans Biomed Eng 42 658–65
Pechenizkiy M 2005 The impact of feature extraction on the performance of a classifier: kNN, Naive Bayes and C4.5 Proc 18th Canadian Society Conf on Advances in Artificial Intelligence (Berlin: Springer) pp 268–79
Penfield W and Boldrey E 1937 Somatic motor and sensory representation in the cerebral cortex of man as studied by electrical stimulation Brain 60 389–443
Picton T W Lins O G and Scherg M 1995 The recording and analysis of event-related potentials Handbook of Neuropsychology vol 10 ed F Boller and J Grafman (Amsterdam: Elsevier) pp 3–73
Pistohl T Ball T Schulze-Bonhage A Aertsen A and Mehring C 2008 Prediction of arm movement trajectories from ECoG-recordings in humans J Neurosci Methods 167 105–14
Quandt F Reichert C Hinrichs H Heinze H Knight R and Rieger J 2012 Single trial discrimination of individual finger movements on one hand: a combined MEG and EEG study NeuroImage 59 3316–24
Rabiner L 1989 A tutorial on hidden Markov models and selected applications in speech recognition Proc IEEE 77 257–86
Schalk G 2010 Can electrocorticography (ECoG) support robust and powerful brain-computer interfaces? Front Neuroeng 3 1
Schukat-Talamazzini E G 1995 Automatische Spracherkennung: Grundlagen, statistische Modelle und effiziente Algorithmen (Künstliche Intelligenz) (Braunschweig: Vieweg)
Shenoy P Miller K Ojemann J and Rao R 2007 Finger movement classification for an electrocorticographic BCI CNE'07: 3rd Int IEEE/EMBS Conf on Neural Engineering pp 192–5
Sun D X Deng L and Wu C F J 1994 State-dependent time warping in the trended hidden Markov model Signal Process 39 263–75
Wissel T and Palaniappan R 2011 Considerations on strategies to improve EOG signal analysis Int J Artif Life Res 2 6–21
Yu L and Liu H 2004 Efficient feature selection via analysis of relevance and redundancy J Mach Learn Res 5 1205–24
Zhao H B Liu C Wang H and Li C S 2010 Classifying ECoG signals using probabilistic neural network ICIE'10: WASE Int Conf on Information Engineering vol 1 pp 77–80


Page 7: Hidden Markov model and support vector machine based ...knightlab.berkeley.edu/statics/publications/2014/10/02/...et al 2011, Acharya et al 2010,Ballet al 2009, Kubanek et al 2009,LiangandBougrain2009,Pistohletal2008),thesupport

J Neural Eng 10 (2013) 056020 T Wissel et al

poor decoding performances in general Recently a similarfinding was presented by Lederman and Tabrikian for EEGdata (Lederman and Tabrikian 2012)

Given five states containing one Gaussian the unleashedmodel would involve 4 times 1235 degrees of freedom while theBakis model uses 4 times 1219 for an exemplary selection of15 channels The significant increase of decoding performanceby dropping only a few parameters seems interesting withrespect to the nature of the underlying signal structure andsheds remarkable light on the importance of feature selectionfor HMMs An unleashed model using all 64 channels wouldhave 4 times 20835 free parameters

253 Model structure Experiments for all feature spaceshave been conducted in which the optimal number of states Qper HMM and the number of mixture distribution componentsM per state have been examined by grid search Sinceincreasing the degrees of freedom will degrade the qualityof model estimation high decoding rates have mainly beenobtained for a small number of states containing only a fewmixture components Based on this search a model structureof Q = 5 states and M = 1 mixture components has beenchosen as the best compromise Note that the actual optimummight vary slightly depending on the particular subject thefeature space and even on the session which is in line withother studies (Obermaier et al 1999 2001a 2001b)

254 Initialization The initial parameters of themultivariate probability densities have been set using a k-means clustering algorithm (Duda et al 2001) The algorithmtemporally clusters the vectors of the O times T feature matrix OtThe rows of this matrix were extended by a scaled temporalcounter adding a time stamp to each feature vector Thisincreases the probability of time-continuous state intervals inthe cluster output Then for each of the Q middot M clusters themean as well as the full covariance matrix have been calculatedexcluding the additional time stamp Finally to assign thesemixture distributions to specific states the clusters were sortedaccording to the mean of their corresponding time stampvalues This temporal order is suggested by the constrainedforward structure of the transition matrix The transition matrixis initialized with probability values on the diagonal thatare weighted to be c-times as likely as probabilities off thediagonal The constant c is being calculated as the quotientof the number of samples for a feature T and the number ofstates Q In case of multiple mixture components (M gt 1) themixture weight matrix was initialized using the normalizednumber of sample vectors assigned to each cluster

255 Implementation The basic implementation has beenderived from the open source functionalities on HMMsprovided by Kevin Murphy from the University of BritishColumbia (Murphy 2005) This framework its modificationsas well as the entire code developed for this study wereimplemented using MATLAB R2012b from MathWorks

256 Support vector machines As a gold-standardreference we used a one-vs-one SVM for multi-classclassification In our study training samples come in pairedassignments (xi yi)i=1N where xi is a 1 times (T middot O) featurevector derived from the ECoG data for one trial with itsclass assignment yi = ndash11 Equation (4) describes theclassification rule of the trained classifier for new samplesx A detailed derivation of this decision rule and the rationalebehind SVM-classification is given eg by Hastie et al (2009)

F(x) = sgn(sum

iαiyiK(xi x) + β0

)(4)

The αi isin [0 C] are weights for the training vectors and areestimated during the training process Training vectors withweights αi = 0 contribute to the solution of the classificationproblem and are called support vectors The offset of theseparating hyperplane from the origin is proportional to β0The constant C generally controls the trade-off betweenclassification error on the training set and smoothness ofthe decision boundary The numbers of training samplesper session (table 1) were in a similar range compared tothe resulting dimension of the SVM feature space T middot O(sim130minus400) Influenced by this fact the decoding accuracyof the SVM revealed almost no sensitivity with respect toregularization Since grid search yielded a stable plateau ofhigh performance for C gt 1 the regularization constant wasfixed to C = 1000 The linear kernel which has been usedthroughout the study is denoted by K(xi x) We further usedthe LIBSVM (wwwcsientuedutwsimcjlinlibsvm) packagefor Matlab developed by Chih-Chung Chang and Chih-Jen Lin(Chang and Lin 2011)

26 Considerations on testing

In order to evaluate the findings presented in the followingsections a five-fold cross-validation (CV) scheme was used(Lemm et al 2011) The assignment of a trial to a specific foldset was performed using uniform random permutations

Each CV procedure was repeated 30 times to averageout fluctuations in the resulting recognition rates arising fromrandom partitioning of the CV sets The number of repetitionswas chosen as a compromise to obtain stable estimates of thegeneralization error but also to avoid massive computationalload The training data was balanced between all classes totreat no class preferentially and to avoid the need for classprior probabilities Feature subset selection and estimation ofthe parameters of each classifier was performed only on datasets allocated for training in order to guarantee a fair estimateof the generalization error Due to prohibitive computationaleffort the feature spaces have been parameterized only oncefor each subject and session and are hence based on all samplesof each subject-specific data set In this context preliminaryexperiments indicated minimal differences to the decodingperformance (see table 2) when fixing the feature definitionsbased on the training sets only These experiments used nestedCV with an additional validation set for the feature spaceparameters

6

J Neural Eng 10 (2013) 056020 T Wissel et al

(a) (b)

(c) (d)

Figure 3 Optimal parameters of both feature spaces for each subject and session (a) Location and width for ROIs containing mostdiscriminative information for LFTD features (b) Optimal high cut-off frequency for the spectral lowpass filter to extract LFTD features(c) Location and width for ROIs containing most discriminative information for HG features (d) Low and high cut-off frequency for theoptimal HG frequency band

Table 2 HMM-decoding accuracies (in ) for parameteroptimization on full data and on training subsets Accuracies aregiven for subject 1mdashsession 1 Parameter optimization wasperformed on the full dataset (standard-CV routine) as well as thetraining subset (nested-CV routine)

Parameter optimization on LFTD HG LFTD + HG

Full dataset 807 plusmn 03 979 plusmn 01 984 plusmn 01Training subset 811 plusmn 10 978 plusmn 03 980 plusmn 02

3 Results

31 Feature and channel selection

In order to identify optimal parameters for feature extractiona grid search for each subject and session was performed Forthis optimization a preliminary feature selection was appliedaiming at a fixed subset size of O = 8 which was foundto be within the range of optimal subset sizes (see figure 6and table 3) Results for all sessions are illustrated by the barplots in figure 3 A sensitivity of the decoding rate has beenfound for temporal parameters such as location and width forthe time windows (ROIs) containing the most discriminativeinformation about LFTD or HG features (figures 3(a) and (c))

The highest impact on accuracy was found for the HGROI (mean optimal ranges S1 [minus1875 3375] ms S3 [minus400475] ms S4 [minus280 420] ms)

Further sensitivity is given for spectral parameters namelythe high cut-off frequency for the LFTD lowpass filter and thelow and high cut-off for the HG bandpass (figures 3(b) and(d)) For the former optimal frequencies ranged from 11ndash30 Hz and the optimal frequency bands for the latter all laywithin 60ndash300 Hz depending on the subject There was hardlyany impact on the decoding rate when changing the slidingwindow size and window overlap for the HG features Thesewere hence set to fixed values According to these parametersettings the mean numbers of samples T were 265 (subject 1)213 (subject 2) 31 (subject 3) and 22 (subject 4)

An example of the grid search procedure isshown in supplementary figures SI5ndashSI7 (available atstacksioporgJNE10056020mmedia) The plots showvarying decoding rates for the S1 session 1 within individualparameter subspaces The optimal parameters illustrated infigure 3 represent a single point in these spaces at peakingdecoding rate Based on these parameters figure 4 illustrates atypical result for a LFTD and HG feature sequence extractedfrom the shown raw ECoG signal (S1 session 2 channel 23middle finger) All feature sequences for this trialmdashconsistingof the entire set of channelsmdashare given in supplementaryfigure SI8

Using the resulting parameterization subsequent CV runssought to identify the optimal subset by tuning O based on thetraining set of the current fold Figure 5 illustrates the rateat which a particular channel was selected for each subject

7

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 4 The dashed grey time course shows a raw ECoG signal fora typical trial (S1 session 2 channel 23 middle finger) Theextracted feature sequences for LFTD (cut-off at 24 Hz) and HG(frequency band [85 Hz 270 Hz]) are overlaid as solid lines EachHG feature has been assigned to the time point at the centre of itscorresponding FFT window

Figure 5 Channel maps for all four subjects The channels arearranged according to the electrode grid and coloured with respectto their mean selection rate in repeated CV-runs Channels colouredlike the top of the colour bar tend to have a higher probability ofbeing selected

It evaluates the optimal subsets O across all sessions foldsand several CV repetitions Results are exemplarily shownfor high gamma features Full detail on subject-specific gridplacement and channel maps for LFTD and HG featurescan be found in supplementary figures SI1ndashSI4 (available atstacksioporgJNE10056020mmedia)

The channel maps reveal localized regions of highlyinformative channels (dark red) comprising only a small subsetwithin the full grid While these localizations are distinct andfocused for high gamma features informative channels spreadwider across the electrode grid for LFTD features This finding

is in line with an earlier study done by Kubanek et al (2009) andimplies that selected subsets for LFTD features tend to be lessstable across CV runs Combined subsets were selected fromboth feature spaces The selections indicated a preference fora higher proportion of LFTD features (table 3) This optimalproportion was determined by grid search

The optimal absolute subset size O was selected bysuccessively adding channels to the subset as defined by ourselection procedure and identifying the peak performanceFor the high gamma case figure 6 already shows a sharpincrease in decoding performance for a subset size belowfive Performance then increases less rapidly until it reachesa maximum usually in the range of 6ndash8 channels for the HGfeatures (see table 3 for detailed information on the results forthe other features) This behaviour again reflects the fact that alot of information is found in a small number of HG sequences

Increasing the subset size beyond this optimal size leadsto a decrease in performance for both SVM and HMM Thiscoincides with an increase of the number of free parameters inthe classifier model depending quadratically on the number ofchannels for HMMs (figure 6 bottom)

32 Decoding performances

Based on the configuration of the feature spaces and featureselection a combination space containing both LFTD andHG features yielded the best decoding accuracy across allsessions and subjects Figure 7 summarizes the highestHMM performances for each subject averaged across allsessions and compares them with the corresponding SVMaccuracies Except for subject 2 these accuracies were closeto or even exceeded 90 correct classifications Howeverfigure 8 shows that accuracies obtained only from the HGspace come very close to this combined case implying thatLFTD features contribute limited new information Usingtime domain features exclusively decreased performance inmost cases An exception is given by subject 2 whereLFTD features outperformed HG features This may beexplained by the wider spread of LFTD features across thecortex and the suboptimal grid placement (little coverageof motor cortex supplementary figure SI2 available atstacksioporgJNE10056020mmedia) The channel mapsshow informative channels at the very margin of the gridThis suggests that grid placement was not always optimal forcapturing the most informative electrodes

Differences in decoding accuracy compared to the SVMwere found to be small (table 4) and revealed no clearsuperiority of any classifier (figure 9) Excluding subject 2the mean performance difference (HMM-SVM) was ndash053for LFTD 01 for HG and ndash023 for combined featuresAccuracy trends for individual subjects are similar acrosssessions and vary less than across subjects (figure 9) Notethat irrespective of the suboptimal grid placement subject 2took part in four experimental sessions Here subject-specificvariables eg grid placement were important determinants ofdecoding success Thus we interpret our results at the subjectlevel and conclude that in three out of four subjects SVMand HMM decoding of finger movements led to comparableresults

8

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 6 Influence of channel selection on the HMM decoding accuracy using high gamma features only (S1 session 1) Changes indecoding accuracy when successively adding channels to the feature set applying either random or DB-index channel selection (top)Number of free HMM parameters for each number of channels (bottom)

Table 3 Number of selected channels averaged across sessions The averaged number of selected channels and the corresponding standardderivations are given for all subjects and all feature spaces For the combined space square brackets denote the corresponding breakdowninto its components The average is taken across all sessions and CV repetitions

Subject LFTD HG LFTD + HG

1 100 plusmn 00 60 plusmn 00 165 plusmn 05 ([100 plusmn 00]+[65 plusmn 07])2 65 plusmn 19 73 plusmn 21 128 plusmn 34 ([70 plusmn 45] + [58 plusmn 15])3 100 plusmn 00 80 plusmn 00 140 plusmn 10 ([60 plusmn 14] + [80 plusmn 00])4 100 plusmn 00 60 plusmn 14 160 plusmn 07 ([95 plusmn 07] + [65 plusmn 07])

A clear superiority for SVM classification was only foundfor subject 2 for whom the absolute decoding performanceswere significantly worse than those of the other three subjects

For the single spaces LFTD and HG table 4 shows themean performance loss for the unleashed HMM That refersto a case wherein the Bakis model constraints are relaxedto the full model For all except two sessions of subject 2(the full model had a higher performance of 02 for session3 and 07 for session 4) the Bakis model outperformedthe full model The results indicate a higher benefit for theLFTD features with an up to 88 performance gain forsubject 3 session 2 Higher accuracies were consistentlyachieved for feature subsets deviating from the optimalsize An example is illustrated for LFTD channel selection(subject 3 session 2) in supplementary figure SI9 (availableat stacksioporgJNE10056020mmedia) The plot comparesthe Bakis and full model similar to the procedure in figure 6While table 4 lists only the improvements for the optimalnumber of channels supplementary figure SI9 indicates that

the performance gain may be even higher when deviating fromthe optimal settings

Finally figure 10 illustrates the accuracy decompositionsinto true positive rates for each finger The results reveal similartrends across subjects and feature spaces While the thumband little finger provided the fewest training samples theirtrue positive rates tend to be the highest among all fingers Anexception is given for LFTD features for subject 2 Equivalentplots for the SVM are shown in supplementary figure SI10(available at stacksioporgJNE10056020mmedia)

For the classification of one feature sequence includingall T samples the HMM Matlab implementation took onaverage (3522 plusmn 0036) ms and the SVM implementationin C (00281 plusmn 00034) ms The mean computation time fora feature sequence was 02 ms plusmn 21μs (LFTD) 172 ms plusmn379 μs (HG) and 179 ms plusmn 200 μs (combined space) Theresults were obtained with an Intel Core i5-2500 K 34 GHz16 GB RAM Win 7 Pro 64-Bit Matlab 2012b for ten selectedchannels

9

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 7 Decoding accuracies (in ) for HMM (green) and SVM (beige) using LFTD + HG feature space (best case) Averageperformances across all sessions are shown for all subjects (S1ndashS4) as a mean result of a 30 times 5 CV procedure with their correspondingerror bars

Figure 8 Decoding accuracies (in ) for the HMMs for all feature spaces and sessions Bi Performances are averaged across 30 repetitionsof the five-fold CV procedure and displayed with their corresponding error bars

4 Discussion

An evaluation of feature space configurations revealed asensitivity of the decoding performance to the preciseparameterization of individual feature spaces Apart from thesegeneral aspects the grid search primarily allows for individualparameter deviations across subjects and sessions Importantlya sensitivity has been identified for the particular time intervalof interest used for the sliding window approach to extractHG features This indicates varying temporal dynamics acrosssubjects and may favour the application of HMMs in onlineapplications This sensitivity suggests that adapting to thesetemporal dynamics has an influence on the degree to whichinformation can be deduced from the data by both HMMand SVM However our investigations indicate that HMMsare less sensitive to changes of the ROI and give nearly

stable performances when changing the window overlapAs a model property the HMM is capable of modellinginformative sequences of different lengths and rates Thelatter characteristic originates from the explicit loop transitionwithin a particular model This means the HMM may coveruninformative samples of a temporal sequence with an idlestate before the informative part starts Underlying temporalprocesses that may persist longer or shorter are modelled bymore or less repeating basic model states Lacking any time-warp properties an SVM would expect a fixed sequence lengthand information rate meaning a fixed association between aparticular feature dimension oi and a specific point in time tiThis highlights the strict comparability of information withinone single feature dimension explicitly required by the SVMThis leads to the prediction of benefits for HMMs in onlineapplications where the time for the button press or other

10

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 9 Differences in decoding accuracy (in ) for the HMMs with respect to the SVM reference classifier (HMM-SVM) Results aredisplayed for all sessions Bi and feature spaces

Figure 10 Mean HMM decoding rates decomposed into true positive rates for each finger Results are shown for each feature space andaveraged across sessions and CV repetitions

Table 4 Final decoding accuracies (in ) listed for HMM (left) and SVM (right) Accuracies for the Bakis model are given for eachsubject-session and feature space by its mean and standard error across all 30 repetitions of the CV procedure Additionally for each subjectand feature space the mean across all sessions as well as the mean loss in performance for LFTD and HG when using an unconstrainedHMM is shown

HMM SVMSubjectndashsession LFTD HG LFTD + HG LFTD HG LFTD + HG

S1ndashB1 () 807 plusmn 03 979 plusmn 01 984 plusmn 01 858 plusmn 04 972 plusmn 02 983 plusmn 01S1ndashB2 () 788 plusmn 06 924 plusmn 04 953 plusmn 03 849 plusmn 05 934 plusmn 04 961 plusmn 03Mean () 798 plusmn 05 952 plusmn 03 969 plusmn 02 854 plusmn 05 953 plusmn 03 972 plusmn 02Unconstrained model () minus31 plusmn 05 minus09 plusmn 03 ndash ndash ndash ndashS2ndashB1 () 440 plusmn 10 359 plusmn 07 444 plusmn 06 514 plusmn 08 380 plusmn 08 557 plusmn 08S2ndashB2 () 502 plusmn 10 481 plusmn 07 524 plusmn 08 555 plusmn 07 480 plusmn 06 602 plusmn 08S2ndashB3 () 402 plusmn 07 363 plusmn 08 401 plusmn 08 496 plusmn 07 395 plusmn 09 515 plusmn 08S2ndashB4 () 309 plusmn 07 364 plusmn 09 384 plusmn 09 391 plusmn 07 380 plusmn 10 427 plusmn 09Mean () 412 plusmn 09 392 plusmn 08 438 plusmn 08 489 plusmn 07 409 plusmn 09 525 plusmn 09Unconstrained model () minus22 plusmn 09 minus04 plusmn 08 ndash ndash ndash ndashS3ndashB1 () 646 plusmn 07 851 plusmn 04 858 plusmn 04 612 plusmn 07 854 plusmn 05 888 plusmn 04S3ndashB2 () 801 plusmn 05 905 plusmn 03 937 plusmn 03 850 plusmn 04 914 plusmn 03 960 plusmn 03Mean () 724 plusmn 06 878 plusmn 04 898 plusmn 04 731 plusmn 06 884 plusmn 04 924 plusmn 03Unconstrained model () minus59 plusmn 07 minus25 plusmn 04 ndash ndash ndash ndashS4ndashB1 () 714 plusmn 05 835 plusmn 05 872 plusmn 04 685 plusmn 06 821 plusmn 04 867 plusmn 04S4ndashB2 () 697 plusmn 07 831 plusmn 03 879 plusmn 04 643 plusmn 06 823 plusmn 03 844 plusmn 04Mean () 706 plusmn 07 833 plusmn 03 876 plusmn 04 664 plusmn 06 822 plusmn 03 856 plusmn 04Unconstrained model () minus11 plusmn 06 minus31 plusmn 05 ndash ndash ndash ndash

11

J Neural Eng 10 (2013) 056020 T Wissel et al

events may not be known beforehand The classifier may actmore robustly on activity that is not perfectly time lockedImaging paradigms for motor imagery or similar experimentsmay provide scenarios in which a subject carries out an actionfor an unspecified time interval Given that varying temporalextent is likely the need for training the SVM for the entirerange of possible ROI locations and widths is not desirable

The SVM characteristics also imply a need to await thewhole sequence to make a decision for a class In contrastthe Viterbi algorithm allows HMMs to state at each incomingfeature vector how likely an assignment to a particular class isIn online applications this striking advantage provides moreprompt classification and may reduce computational overheadif only the baseline is employed There is no need to awaitfurther samples for a decision Note that this emphasizes theimplicit adaptation of HMMs to real-time applications whilethe SVM needs to classify the whole new subsequence againafter each incoming sample HMMs just update their currentprediction according to the incoming information They donot start from scratch The implementation used here did notmake use of this online feature since only the classificationof the entire trial was evaluated Another reason for the SVMbeing faster than the HMM is given by the non-optimizedimplementation in Matlab code of the latter The same appliesto the feature extraction Existing high performance librariesfor these standard signal processing techniques or evena hardware field-programmable gate array implementationmay significantly speed up the implementation The averageclassification time for the HMMs implies a maximal possiblesignal rate of about 285 Hz when classifying after eachsample Thus in terms of classification even with the currentimplementation real-time is perfectly feasible for most casesMore sophisticated implementations are used in the field ofspeech processing where classifications have to be performedon the basis of raw data with very high sampling rates of up to40 kHz

Second, tuning the features to an informative subset yielded less dependence on the particular classifier for the presented offline case. Preliminary analysis using compromised parameter sets for the feature spaces, as well as for the feature subset size, instead led to a larger performance gap between the two classifiers. Adapting to each subject individually, no strict superiority of the HMMs over the SVM was observed in subjects 1 and 3. For subject 4 the HMM actually outperformed the SVM in all cases. Interestingly, this coincides with the grid resolution, which was twice as high for this subject, supporting the importance of the degree and spatial precision of motor cortex coverage. For subject 2, decoding performances were significantly worse than for the other subjects. This is probably due to the grid placement not being optimal for the classification (supplementary figure SI2, available at stacks.iop.org/JNE/10/056020/mmedia). Figure 5 shows no distinct informative region, and channels selected at a high rate are located at the outer boundary of the grid rather than over motor cortices. This hypothesis is further supported by the decoding accuracies, which are higher for LFTD than for HG features in subject 2. Since informative regions were spread more widely across the cortex for subject 2 (Kubanek et al 2009), regions exhibiting critical high gamma motor activity may have been located outside the actual grid.

While the results suggest that high feature quality is the main factor for high performance in this ECoG data set, two key points have to be emphasized for the application of HMMs to the present finger classification problem. First, due to the high number of free model parameters, feature selection is essential for achieving high decoding accuracy. Note that the SVM has fewer free parameters to be estimated during training; these mainly correspond to the weights α, whose number depends directly on the number of training samples. The functional relationship between features and labels even depends only on the non-zero weights of the support vectors. Thus the SVM exhibits higher robustness to variations of the subset size. Second, imposing constraints on the HMM structure was found to be important for achieving high decoding accuracies. As stated in the methods section, the introduction of the Bakis model does not substantially influence the number of free parameters; the subset size has a much higher impact on the overall number (figure 6). Nevertheless, a performance gain of up to 8.8% per session, or 5.9% mean per subject across all sessions, was achieved for the LFTD features. This suggests that the improvements are not due to a mere reduction in the absolute number of parameters but also to better compliance of the model with the underlying temporal data structure, i.e. the evolution of feature characteristics over time. For the LFTD space, information is encoded in the signal phase and hence in the temporal structure, so the fact that the Bakis model constrains this temporal structure entails interesting implications. Higher overall accuracies for the HG space, along with only minor benefits from the Bakis model, indicate that the clear advantages of model constraints mainly arise if the relevant information does not appear as prominently in the data. This is further supported by a higher increase in HMM performance when not using the optimal subset size for LFTD features (see supplementary figure SI9, available at stacks.iop.org/JNE/10/056020/mmedia), which suggests that the Bakis approach tends to be more robust with respect to deviations from the optimal parameter setting.
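
As a rough illustration of these two points, the sketch below (in the same illustrative Python as above, not the published implementation) builds a Bakis-constrained transition matrix, whose zero entries remain zero under Baum–Welch re-estimation, and counts free parameters for a Gaussian(-mixture) HMM with full covariances; the emission terms grow quadratically with the number of selected channels, which is why the channel subset size dominates the total. The state, mixture and covariance settings are assumptions made only for the example.

```python
import numpy as np

def bakis_transitions(n_states, max_jump=1):
    """Left-to-right (Bakis) initialisation: self-loops plus forward jumps of at most
    `max_jump` states; the zero entries stay zero during EM re-estimation."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        j_max = min(i + max_jump, n_states - 1)
        A[i, i:j_max + 1] = 1.0 / (j_max - i + 1)
    return A

def hmm_free_parameters(n_states, n_channels, n_mix=1, bakis=True, max_jump=1, full_cov=True):
    """Approximate free-parameter count of a Gaussian(-mixture) HMM."""
    if bakis:
        trans = sum(min(max_jump, n_states - 1 - i) for i in range(n_states))
    else:
        trans = n_states * (n_states - 1)                  # full (ergodic) model
    start = n_states - 1
    weights = n_states * (n_mix - 1)
    means = n_states * n_mix * n_channels
    if full_cov:
        covs = n_states * n_mix * n_channels * (n_channels + 1) // 2   # quadratic in channels
    else:
        covs = n_states * n_mix * n_channels
    return trans + start + weights + means + covs

# enlarging the channel subset inflates the covariance terms far more than
# switching between a Bakis and a full transition matrix does:
print(hmm_free_parameters(5, 8, bakis=True), hmm_free_parameters(5, 8, bakis=False))
```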

Finally, it should be mentioned that the current test procedure does not fully account for non-stationary behaviour. That is, the random trial selection does not take the temporal evolution of signal properties into account. These properties might change and blur the feature space representation over time (Lemm et al 2011). In this context, and depending on the extent of non-stationarity, the presented results may slightly overestimate the real decoding rates. This applies to both SVM and HMM. Non-stationarity should be addressed in an online setup by updating the HMM on the fly. Training data should then come only from past recordings, and the signal characteristics of future test data remain unknown. While online model updates are generally unproblematic for the HMM training procedure, investigations on this side are still needed.
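
One simple way to respect this causality in an offline evaluation is to replace random trial selection with chronological splits, so that every test block is predicted only from earlier data. The sketch below assumes trials are stored in recording order; it is an illustrative alternative, not the cross-validation procedure used in this study, and fit_hmm/evaluate are hypothetical helper names.

```python
import numpy as np

def chronological_splits(n_trials, n_blocks=5):
    """Yield (train_idx, test_idx) pairs where the training trials always precede
    the test block in time, so non-stationary drift is not averaged away."""
    edges = np.linspace(0, n_trials, n_blocks + 1).astype(int)
    for k in range(1, n_blocks):
        yield np.arange(edges[k]), np.arange(edges[k], edges[k + 1])

# example use: retrain (or update) the model before each new block from past trials only
# for train_idx, test_idx in chronological_splits(n_trials=200):
#     model = fit_hmm(features[train_idx], labels[train_idx])      # hypothetical helper
#     accuracy = evaluate(model, features[test_idx], labels[test_idx])
```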

5 Conclusions

For a defined offline case we have shown that high decoding rates can be achieved with both the HMM and SVM classifiers in most subjects. This suggests that the general choice of classifier is of less importance once HMM robustness is increased by introducing model constraints. The performance gap between the two machine learning methods is sensitive to feature extraction and particularly to electrode subset selection, raising interesting implications for future research. This study restricted HMMs to a rigid time window with a known event, a scenario that suits SVM requirements. This condition may be relaxed, and classification improved, by use of the HMM time-warp property (Sun et al 1994), which allows the same spatio-temporal pattern of brain activity to occur at different temporal rates.

By exploiting the actual potential of such models, we expect them to be individually adapted to specific BCI applications. The idea of adaptive online learning may be implemented to handle non-stationary signal properties.

Future research may encompass extended and additional model constraints for special cases, as successfully used in ASR, such as fixed transition probabilities or the sharing of mixture distributions among different states. Finally, incorporating other sources of prior knowledge and context awareness may further optimize decoding accuracy.

Acknowledgments

The authors acknowledge the work of Fanny Quandt, who substantially contributed to the acquisition of the ECoG data. Supported by NINDS grant NS21135, Land Sachsen-Anhalt grant MK48-2009003 and ECHORD 231143. This work is partly funded by the German Ministry of Education and Research (BMBF) within the Forschungscampus STIMULATE (grant 03FO16102A).

References

Acharya M Fifer M Benz H Crone N and Thakor N 2010 Electrocorticographic amplitude predicts finger positions during slow grasping motions of the hand J Neural Eng 7 13
Ball T Schulze-Bonhage A Aertsen A and Mehring C 2009 Differential representation of arm movement direction in relation to cortical anatomy and function J Neural Eng 6 016
Bhattacharyya A 1943 On a measure of divergence between two statistical populations defined by their probability distributions Bull Calcutta Math Soc 35 99–109
Chang C-C and Lin C-J 2011 LIBSVM: a library for support vector machines ACM Trans Intell Syst Technol 2 1–27
Cincotti F et al 2003 Comparison of different feature classifiers for brain computer interfaces Proc 1st Int IEEE EMBS Conf on Neural Engineering (Capri Island, Italy) pp 645–7
Crone N Miglioretti D Gordon B and Lesser R 1998 Functional mapping of human sensorimotor cortex with electrocorticographic spectral analysis: II. Event-related synchronization in the gamma band Brain 121 2301–15
Davies D and Bouldin D 1979 A cluster separation measure IEEE Trans Pattern Anal Mach Intell PAMI-1 224–7
Demirer R Ozerdem M and Bayrak C 2009 Classification of imaginary movements in ECoG with a hybrid approach based on multi-dimensional Hilbert-SVM solution J Neurosci Methods 178 214–8
Dempster A Laird N and Rubin D B 1977 Maximum likelihood from incomplete data via the EM algorithm J R Stat Soc B 39 1–38
Dethier J Gilja V Nuyujukian P Elassaad S A Shenoy K V and Boahen K 2011 Spiking neural network decoder for brain-machine interfaces EMBS'11: Proc IEEE 5th Int Conf on Neural Engineering in Medicine and Biological Society (Cancun, Mexico) pp 396–9
Duda R O Hart P E and Stork D G 2001 Pattern Classification 2nd edn (New York: Interscience)
Fisher R A 1936 The use of multiple measurements in taxonomic problems Ann Eugen 7 179–88
Ganguly K and Carmena J M 2010 Neural correlates of skill acquisition with a cortical brain-machine interface J Motor Behav 42 355–60
Graimann B Huggins J Levine S and Pfurtscheller G 2004 Toward a direct brain interface based on human subdural recordings and wavelet-packet analysis IEEE Trans Biomed Eng 51 954–62
Guyon I and Elisseeff A 2003 An introduction to variable and feature selection J Mach Learn Res 3 1157–82
Guyon I Weston J Barnhill S and Vapnik V 2002 Gene selection for cancer classification using support vector machines Mach Learn 46 389–422
Hastie T Tibshirani R and Friedman J 2009 The Elements of Statistical Learning 2nd edn (New York: Springer)
Hoffmann U Vesin J and Diserens K 2007 An efficient P300-based brain-computer interface for disabled subjects J Neurosci Methods 167 115–25
Kubanek J Miller K Ojemann J Wolpaw J and Schalk G 2009 Decoding flexion of individual fingers using electrocorticographic signals in humans J Neural Eng 6 14
Lederman D and Tabrikian J 2012 Classification of multichannel EEG patterns using parallel hidden Markov models Med Biol Eng Comput 50 319–28
Lee H and Choi S 2003 PCA+HMM+SVM for EEG pattern classification Proc 7th Int Symp on Signal Processing and Its Applications vol 1 pp 541–4
Lemm S Blankertz B Dickhaus T and Mueller K-R 2011 Introduction to machine learning for brain imaging NeuroImage 56 387–99
Liang N and Bougrain L 2009 Decoding finger flexion using amplitude modulation from band-specific ECoG ESANN: Eur Symp on Artificial Neural Networks pp 467–72
Liu C Zhao H B Li C S and Wang H 2010 Classification of ECoG motor imagery tasks based on CSP and SVM BMEI: 3rd Int Conf on Biomedical Engineering and Informatics vol 2 pp 804–7
Lotte F Congedo M Lecuyer A Lamarche F and Arnaldi B 2007 A review of classification algorithms for EEG-based brain-computer interfaces J Neural Eng 4 R1–R13
Makeig S 1993 Auditory event-related dynamics of the EEG spectrum and effects of exposure to tones Electroencephalogr Clin Neurophysiol 86 283–93
Murphy K 2005 Hidden Markov model (HMM) toolbox for Matlab www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html
Obermaier B Guger C Neuper C and Pfurtscheller G 2001a Hidden Markov models for online classification of single trial EEG data Pattern Recognit Lett 22 1299–309
Obermaier B Guger C and Pfurtscheller G 1999 Hidden Markov models used for the offline classification of EEG data Biomed Tech/Biomed Eng 44 158–62
Obermaier B Neuper C Guger C and Pfurtscheller G 2001b Information transfer rate in a five-classes brain-computer interface IEEE Trans Neural Syst Rehabil Eng 9 283–8
Palaniappan R Chanan R and Syan S 2009 Current practices in electroencephalogram based brain-computer interfaces Encyclopedia of Information Science and Technology II (Hershey, PA: IGI Global) pp 888–901
Pascual-Marqui R D Michel C M and Lehmann D 1995 Segmentation of brain electrical activity into microstates: model estimation and validation IEEE Trans Biomed Eng 42 658–65
Pechenizkiy M 2005 The impact of feature extraction on the performance of a classifier: kNN, Naive Bayes and C4.5 Proc 18th Canadian Society Conf on Advances in Artificial Intelligence (Berlin: Springer) pp 268–79
Penfield W and Boldrey E 1937 Somatic motor and sensory representation in the cerebral cortex of man as studied by electrical stimulation Brain 60 389–443
Picton T W Lins O G and Scherg M 1995 The recording and analysis of event-related potentials Handbook of Neuropsychology vol 10 ed F Boller and J Grafman (Amsterdam: Elsevier) pp 3–73
Pistohl T Ball T Schulze-Bonhage A Aertsen A and Mehring C 2008 Prediction of arm movement trajectories from ECoG-recordings in humans J Neurosci Methods 167 105–14
Quandt F Reichert C Hinrichs H Heinze H Knight R and Rieger J 2012 Single trial discrimination of individual finger movements on one hand: a combined MEG and EEG study NeuroImage 59 3316–24
Rabiner L 1989 A tutorial on hidden Markov models and selected applications in speech recognition Proc IEEE 77 257–86
Schalk G 2010 Can electrocorticography (ECoG) support robust and powerful brain-computer interfaces? Front Neuroeng 3 1
Schukat-Talamazzini E G 1995 Automatische Spracherkennung: Grundlagen, statistische Modelle und effiziente Algorithmen (Künstliche Intelligenz) (Braunschweig: Vieweg)
Shenoy P Miller K Ojemann J and Rao R 2007 Finger movement classification for an electrocorticographic BCI CNE'07: 3rd Int IEEE/EMBS Conf on Neural Engineering pp 192–5
Sun D X Deng L and Wu C F J 1994 State-dependent time warping in the trended hidden Markov model Signal Process 39 263–75
Wissel T and Palaniappan R 2011 Considerations on strategies to improve EOG signal analysis Int J Artif Life Res 2 6–21
Yu L and Liu H 2004 Efficient feature selection via analysis of relevance and redundancy J Mach Learn Res 5 1205–24
Zhao H B Liu C Wang H and Li C S 2010 Classifying ECoG signals using probabilistic neural network ICIE'10: WASE Int Conf on Information Engineering vol 1 pp 77–80

  • 1 Introduction
  • 2 Material and methods
    • 2.1 Patients, experimental setup and data acquisition
    • 2.2 Test framework
    • 2.3 Feature generation
    • 2.4 Feature selection
    • 2.5 Classification
    • 2.6 Considerations on testing
  • 3 Results
    • 3.1 Feature and channel selection
    • 3.2 Decoding performances
  • 4 Discussion
  • 5 Conclusions
  • Acknowledgments
  • References
Page 9: Hidden Markov model and support vector machine based ...knightlab.berkeley.edu/statics/publications/2014/10/02/...et al 2011, Acharya et al 2010,Ballet al 2009, Kubanek et al 2009,LiangandBougrain2009,Pistohletal2008),thesupport

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 4 The dashed grey time course shows a raw ECoG signal fora typical trial (S1 session 2 channel 23 middle finger) Theextracted feature sequences for LFTD (cut-off at 24 Hz) and HG(frequency band [85 Hz 270 Hz]) are overlaid as solid lines EachHG feature has been assigned to the time point at the centre of itscorresponding FFT window

Figure 5 Channel maps for all four subjects The channels arearranged according to the electrode grid and coloured with respectto their mean selection rate in repeated CV-runs Channels colouredlike the top of the colour bar tend to have a higher probability ofbeing selected

It evaluates the optimal subsets O across all sessions foldsand several CV repetitions Results are exemplarily shownfor high gamma features Full detail on subject-specific gridplacement and channel maps for LFTD and HG featurescan be found in supplementary figures SI1ndashSI4 (available atstacksioporgJNE10056020mmedia)

The channel maps reveal localized regions of highlyinformative channels (dark red) comprising only a small subsetwithin the full grid While these localizations are distinct andfocused for high gamma features informative channels spreadwider across the electrode grid for LFTD features This finding

is in line with an earlier study done by Kubanek et al (2009) andimplies that selected subsets for LFTD features tend to be lessstable across CV runs Combined subsets were selected fromboth feature spaces The selections indicated a preference fora higher proportion of LFTD features (table 3) This optimalproportion was determined by grid search

The optimal absolute subset size O was selected bysuccessively adding channels to the subset as defined by ourselection procedure and identifying the peak performanceFor the high gamma case figure 6 already shows a sharpincrease in decoding performance for a subset size belowfive Performance then increases less rapidly until it reachesa maximum usually in the range of 6ndash8 channels for the HGfeatures (see table 3 for detailed information on the results forthe other features) This behaviour again reflects the fact that alot of information is found in a small number of HG sequences

Increasing the subset size beyond this optimal size leadsto a decrease in performance for both SVM and HMM Thiscoincides with an increase of the number of free parameters inthe classifier model depending quadratically on the number ofchannels for HMMs (figure 6 bottom)

32 Decoding performances

Based on the configuration of the feature spaces and featureselection a combination space containing both LFTD andHG features yielded the best decoding accuracy across allsessions and subjects Figure 7 summarizes the highestHMM performances for each subject averaged across allsessions and compares them with the corresponding SVMaccuracies Except for subject 2 these accuracies were closeto or even exceeded 90 correct classifications Howeverfigure 8 shows that accuracies obtained only from the HGspace come very close to this combined case implying thatLFTD features contribute limited new information Usingtime domain features exclusively decreased performance inmost cases An exception is given by subject 2 whereLFTD features outperformed HG features This may beexplained by the wider spread of LFTD features across thecortex and the suboptimal grid placement (little coverageof motor cortex supplementary figure SI2 available atstacksioporgJNE10056020mmedia) The channel mapsshow informative channels at the very margin of the gridThis suggests that grid placement was not always optimal forcapturing the most informative electrodes

Differences in decoding accuracy compared to the SVMwere found to be small (table 4) and revealed no clearsuperiority of any classifier (figure 9) Excluding subject 2the mean performance difference (HMM-SVM) was ndash053for LFTD 01 for HG and ndash023 for combined featuresAccuracy trends for individual subjects are similar acrosssessions and vary less than across subjects (figure 9) Notethat irrespective of the suboptimal grid placement subject 2took part in four experimental sessions Here subject-specificvariables eg grid placement were important determinants ofdecoding success Thus we interpret our results at the subjectlevel and conclude that in three out of four subjects SVMand HMM decoding of finger movements led to comparableresults

8

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 6 Influence of channel selection on the HMM decoding accuracy using high gamma features only (S1 session 1) Changes indecoding accuracy when successively adding channels to the feature set applying either random or DB-index channel selection (top)Number of free HMM parameters for each number of channels (bottom)

Table 3 Number of selected channels averaged across sessions The averaged number of selected channels and the corresponding standardderivations are given for all subjects and all feature spaces For the combined space square brackets denote the corresponding breakdowninto its components The average is taken across all sessions and CV repetitions

Subject LFTD HG LFTD + HG

1 100 plusmn 00 60 plusmn 00 165 plusmn 05 ([100 plusmn 00]+[65 plusmn 07])2 65 plusmn 19 73 plusmn 21 128 plusmn 34 ([70 plusmn 45] + [58 plusmn 15])3 100 plusmn 00 80 plusmn 00 140 plusmn 10 ([60 plusmn 14] + [80 plusmn 00])4 100 plusmn 00 60 plusmn 14 160 plusmn 07 ([95 plusmn 07] + [65 plusmn 07])

A clear superiority for SVM classification was only foundfor subject 2 for whom the absolute decoding performanceswere significantly worse than those of the other three subjects

For the single spaces LFTD and HG table 4 shows themean performance loss for the unleashed HMM That refersto a case wherein the Bakis model constraints are relaxedto the full model For all except two sessions of subject 2(the full model had a higher performance of 02 for session3 and 07 for session 4) the Bakis model outperformedthe full model The results indicate a higher benefit for theLFTD features with an up to 88 performance gain forsubject 3 session 2 Higher accuracies were consistentlyachieved for feature subsets deviating from the optimalsize An example is illustrated for LFTD channel selection(subject 3 session 2) in supplementary figure SI9 (availableat stacksioporgJNE10056020mmedia) The plot comparesthe Bakis and full model similar to the procedure in figure 6While table 4 lists only the improvements for the optimalnumber of channels supplementary figure SI9 indicates that

the performance gain may be even higher when deviating fromthe optimal settings

Finally figure 10 illustrates the accuracy decompositionsinto true positive rates for each finger The results reveal similartrends across subjects and feature spaces While the thumband little finger provided the fewest training samples theirtrue positive rates tend to be the highest among all fingers Anexception is given for LFTD features for subject 2 Equivalentplots for the SVM are shown in supplementary figure SI10(available at stacksioporgJNE10056020mmedia)

For the classification of one feature sequence includingall T samples the HMM Matlab implementation took onaverage (3522 plusmn 0036) ms and the SVM implementationin C (00281 plusmn 00034) ms The mean computation time fora feature sequence was 02 ms plusmn 21μs (LFTD) 172 ms plusmn379 μs (HG) and 179 ms plusmn 200 μs (combined space) Theresults were obtained with an Intel Core i5-2500 K 34 GHz16 GB RAM Win 7 Pro 64-Bit Matlab 2012b for ten selectedchannels

9

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 7 Decoding accuracies (in ) for HMM (green) and SVM (beige) using LFTD + HG feature space (best case) Averageperformances across all sessions are shown for all subjects (S1ndashS4) as a mean result of a 30 times 5 CV procedure with their correspondingerror bars

Figure 8 Decoding accuracies (in ) for the HMMs for all feature spaces and sessions Bi Performances are averaged across 30 repetitionsof the five-fold CV procedure and displayed with their corresponding error bars

4 Discussion

An evaluation of feature space configurations revealed asensitivity of the decoding performance to the preciseparameterization of individual feature spaces Apart from thesegeneral aspects the grid search primarily allows for individualparameter deviations across subjects and sessions Importantlya sensitivity has been identified for the particular time intervalof interest used for the sliding window approach to extractHG features This indicates varying temporal dynamics acrosssubjects and may favour the application of HMMs in onlineapplications This sensitivity suggests that adapting to thesetemporal dynamics has an influence on the degree to whichinformation can be deduced from the data by both HMMand SVM However our investigations indicate that HMMsare less sensitive to changes of the ROI and give nearly

stable performances when changing the window overlapAs a model property the HMM is capable of modellinginformative sequences of different lengths and rates Thelatter characteristic originates from the explicit loop transitionwithin a particular model This means the HMM may coveruninformative samples of a temporal sequence with an idlestate before the informative part starts Underlying temporalprocesses that may persist longer or shorter are modelled bymore or less repeating basic model states Lacking any time-warp properties an SVM would expect a fixed sequence lengthand information rate meaning a fixed association between aparticular feature dimension oi and a specific point in time tiThis highlights the strict comparability of information withinone single feature dimension explicitly required by the SVMThis leads to the prediction of benefits for HMMs in onlineapplications where the time for the button press or other

10

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 9 Differences in decoding accuracy (in ) for the HMMs with respect to the SVM reference classifier (HMM-SVM) Results aredisplayed for all sessions Bi and feature spaces

Figure 10 Mean HMM decoding rates decomposed into true positive rates for each finger Results are shown for each feature space andaveraged across sessions and CV repetitions

Table 4 Final decoding accuracies (in ) listed for HMM (left) and SVM (right) Accuracies for the Bakis model are given for eachsubject-session and feature space by its mean and standard error across all 30 repetitions of the CV procedure Additionally for each subjectand feature space the mean across all sessions as well as the mean loss in performance for LFTD and HG when using an unconstrainedHMM is shown

HMM SVMSubjectndashsession LFTD HG LFTD + HG LFTD HG LFTD + HG

S1ndashB1 () 807 plusmn 03 979 plusmn 01 984 plusmn 01 858 plusmn 04 972 plusmn 02 983 plusmn 01S1ndashB2 () 788 plusmn 06 924 plusmn 04 953 plusmn 03 849 plusmn 05 934 plusmn 04 961 plusmn 03Mean () 798 plusmn 05 952 plusmn 03 969 plusmn 02 854 plusmn 05 953 plusmn 03 972 plusmn 02Unconstrained model () minus31 plusmn 05 minus09 plusmn 03 ndash ndash ndash ndashS2ndashB1 () 440 plusmn 10 359 plusmn 07 444 plusmn 06 514 plusmn 08 380 plusmn 08 557 plusmn 08S2ndashB2 () 502 plusmn 10 481 plusmn 07 524 plusmn 08 555 plusmn 07 480 plusmn 06 602 plusmn 08S2ndashB3 () 402 plusmn 07 363 plusmn 08 401 plusmn 08 496 plusmn 07 395 plusmn 09 515 plusmn 08S2ndashB4 () 309 plusmn 07 364 plusmn 09 384 plusmn 09 391 plusmn 07 380 plusmn 10 427 plusmn 09Mean () 412 plusmn 09 392 plusmn 08 438 plusmn 08 489 plusmn 07 409 plusmn 09 525 plusmn 09Unconstrained model () minus22 plusmn 09 minus04 plusmn 08 ndash ndash ndash ndashS3ndashB1 () 646 plusmn 07 851 plusmn 04 858 plusmn 04 612 plusmn 07 854 plusmn 05 888 plusmn 04S3ndashB2 () 801 plusmn 05 905 plusmn 03 937 plusmn 03 850 plusmn 04 914 plusmn 03 960 plusmn 03Mean () 724 plusmn 06 878 plusmn 04 898 plusmn 04 731 plusmn 06 884 plusmn 04 924 plusmn 03Unconstrained model () minus59 plusmn 07 minus25 plusmn 04 ndash ndash ndash ndashS4ndashB1 () 714 plusmn 05 835 plusmn 05 872 plusmn 04 685 plusmn 06 821 plusmn 04 867 plusmn 04S4ndashB2 () 697 plusmn 07 831 plusmn 03 879 plusmn 04 643 plusmn 06 823 plusmn 03 844 plusmn 04Mean () 706 plusmn 07 833 plusmn 03 876 plusmn 04 664 plusmn 06 822 plusmn 03 856 plusmn 04Unconstrained model () minus11 plusmn 06 minus31 plusmn 05 ndash ndash ndash ndash

11

J Neural Eng 10 (2013) 056020 T Wissel et al

events may not be known beforehand The classifier may actmore robustly on activity that is not perfectly time lockedImaging paradigms for motor imagery or similar experimentsmay provide scenarios in which a subject carries out an actionfor an unspecified time interval Given that varying temporalextent is likely the need for training the SVM for the entirerange of possible ROI locations and widths is not desirable

The SVM characteristics also imply a need to await the whole sequence before making a decision for a class. In contrast, the Viterbi algorithm allows HMMs to state, at each incoming feature vector, how likely an assignment to a particular class is. In online applications this striking advantage provides more prompt classification and may reduce computational overhead if only the baseline is employed; there is no need to await further samples for a decision. Note that this emphasizes how naturally HMMs lend themselves to real-time applications: while the SVM needs to classify the whole new subsequence again after each incoming sample, HMMs simply update their current prediction with the incoming information; they do not start from scratch. The implementation used here did not make use of this online feature, since only the classification of the entire trial was evaluated. Another reason for the SVM being faster than the HMM is the non-optimized Matlab implementation of the latter; the same applies to the feature extraction. Existing high-performance libraries for these standard signal processing techniques, or even a hardware field-programmable gate array implementation, may significantly speed up the processing. The average classification time for the HMMs implies a maximal possible signal rate of about 285 Hz when classifying after each sample. Thus, in terms of classification, real-time operation is feasible for most cases even with the current implementation. More sophisticated implementations are used in the field of speech processing, where classifications have to be performed on raw data with sampling rates of up to 40 kHz.
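The per-sample update can be illustrated with the scaled forward recursion (Rabiner 1989). The following sketch is our own simplified Python version assuming a single Gaussian emission per state; the class and variable names are ours for illustration and do not mirror the toolbox used in the study. Each call to update() folds one new feature vector into the accumulated class evidence instead of re-scoring the whole window.

```python
import numpy as np
from scipy.stats import multivariate_normal

class OnlineHMMScorer:
    """Accumulates log P(o_1..o_t | model) one feature vector at a time."""

    def __init__(self, pi, A, means, covs):
        self.pi = np.asarray(pi)       # initial state distribution
        self.A = np.asarray(A)         # state transition matrix
        self.means = means             # per-state Gaussian means (sketch only)
        self.covs = covs               # per-state Gaussian covariances
        self.alpha = None              # scaled forward variable
        self.loglik = 0.0

    def update(self, o_t):
        # emission likelihoods b_j(o_t) for every state j
        b = np.array([multivariate_normal.pdf(o_t, mean=m, cov=c)
                      for m, c in zip(self.means, self.covs)])
        if self.alpha is None:
            alpha = self.pi * b                  # initialisation at t = 1
        else:
            alpha = (self.alpha @ self.A) * b    # forward induction step
        scale = alpha.sum()
        self.alpha = alpha / scale               # rescale to avoid underflow
        self.loglik += np.log(scale)             # accumulate the log-likelihood
        return self.loglik

# One scorer per finger class: after every new sample, the class whose
# scorer reports the highest accumulated log-likelihood is the current
# decision, without re-processing the samples seen so far.
```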

Second, tuning the features to an informative subset reduced the dependence on the particular classifier in the presented offline case. Preliminary analyses using compromised parameter sets for the feature spaces, as well as for the feature subset size, instead led to a larger performance gap between the two classifiers. When adapting to each subject individually, a strict superiority of the HMMs over the SVM was not observed in subjects 1 and 3. For subject 4, however, the HMM outperformed the SVM in all cases. Interestingly, this coincides with the grid resolution, which was twice as high for this subject, supporting the importance of the degree and spatial precision of motor cortex coverage. For subject 2, decoding performances were significantly worse than for the other subjects. This is probably due to the grid placement not being optimal for the classification task (supplementary figure SI2, available at stacks.iop.org/JNE/10/056020/mmedia): figure 5 shows no distinct informative region, and channels selected at a high rate are located at the outer boundary of the grid rather than over motor cortices. This hypothesis is further supported by the decoding accuracies, which are higher for LFTD than for HG features in subject 2. Since informative regions were spread more widely across the cortex for subject 2 (Kubanek et al 2009), regions exhibiting critical high gamma motor activity may have been located outside the actual grid.

While the results suggest that high feature quality is the main factor for high performance in this ECoG data set, two key points have to be emphasized for the application of HMMs to the present finger classification problem. First, due to the high number of free model parameters, feature selection is essential for achieving high decoding accuracy. Note that the SVM has fewer free parameters to be estimated during training; these parameters mainly correspond to the weights α, whose number depends directly on the number of training samples, and the functional relationship between features and labels depends only on the non-zero weights of the support vectors. Thus the SVM exhibits higher robustness to variations of the subset size. Second, imposing constraints on the HMM structure was found to be important for achieving high decoding accuracies. As stated in the methods section, the introduction of the Bakis model does not substantially influence the number of free parameters; the subset size has a much higher impact on the overall number (figure 6). Nevertheless, a performance gain of up to 8.8% per session, or 5.9% on average per subject across all sessions, was achieved for the LFTD features. This suggests that the improvements are not merely due to a reduction in the absolute number of parameters but also to better compliance of the model with the underlying temporal data structure, i.e. the evolution of feature characteristics over time. For the LFTD space, information is encoded in the signal phase and hence in the temporal structure; the fact that the Bakis model constrains this temporal structure therefore entails interesting implications. Higher overall accuracies for the HG space, along with only minor benefits from the Bakis model, indicate that the clear advantages of model constraints mainly arise if the relevant information does not appear as prominently in the data. This is further supported by a higher increase in HMM performance when not using the optimal subset size for LFTD features (see supplementary figure SI9, available at stacks.iop.org/JNE/10/056020/mmedia), suggesting that the Bakis approach tends to be more robust with respect to deviations from the optimal parameter setting.
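A rough, back-of-the-envelope sketch of this parameter-count argument is given below (our illustration; it assumes a single full-covariance Gaussian per state, which may differ from the exact emission model used in the study). It shows that the number of emission parameters grows quadratically with the number of selected feature dimensions, whereas the Bakis constraint only trims the comparatively small transition matrix.

```python
def hmm_free_parameters(n_states, n_features, bakis=True):
    """Rough free-parameter count for a Gaussian-emission HMM.

    Transitions: about 2 free entries per row for a Bakis model
    (self-loop, advance, skip; rows sum to one) versus N - 1 per row
    for a full ergodic model. Emissions dominate: each state carries a
    d-dimensional mean plus a symmetric d x d covariance.
    """
    transitions = 2 * n_states if bakis else n_states * (n_states - 1)
    emissions = n_states * (n_features + n_features * (n_features + 1) // 2)
    return transitions + emissions

# Example with 5 states: going from 5 to 10 selected feature dimensions
# roughly quadruples the emission count, while the Bakis constraint only
# removes a handful of transition entries.
print(hmm_free_parameters(5, 5), hmm_free_parameters(5, 10),
      hmm_free_parameters(5, 10, bakis=False))
```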

Finally, it should be mentioned that the current test procedure does not fully account for non-stationary behaviour: the random trial selection does not take the temporal evolution of signal properties into account. These properties might change and blur the feature space representation over time (Lemm et al 2011). In this context, and depending on the extent of non-stationarity, the presented results may slightly overestimate the real decoding rates; this applies to both SVM and HMM. Non-stationarity should be addressed in an online setup by updating the HMM on-the-fly, using training data only from past recordings, since the signal characteristics of future test data remain unknown. While online model updates are generally unproblematic for the HMM training procedure, investigations on this point are still needed.
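One way such an online evaluation could respect non-stationarity is a causal, chronological protocol in which every test trial is decoded by a model trained only on earlier trials. The sketch below is a hypothetical outline of that idea; train_hmm and classify are placeholder functions, not part of any implementation reported here.

```python
def rolling_evaluation(trials, labels, train_hmm, classify, min_train=50):
    """Causal evaluation: every test trial is decoded with a model that
    was trained exclusively on trials recorded earlier in the session."""
    correct, tested = 0, 0
    for t in range(min_train, len(trials)):
        model = train_hmm(trials[:t], labels[:t])   # past data only
        if classify(model, trials[t]) == labels[t]:
            correct += 1
        tested += 1
    return correct / tested
```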


5 Conclusions

For a defined offline case, we have shown that high decoding rates can be achieved with both the HMM and SVM classifiers in most subjects. This suggests that the general choice of classifier is of less importance once HMM robustness is increased by introducing model constraints. The performance gap between the two machine learning methods is sensitive to feature extraction, and particularly to electrode subset selection, raising interesting implications for future research. This study restricted HMMs to a rigid time window with a known event, a scenario that suits SVM requirements. This condition may be relaxed, and classification improved, by use of the HMM time-warp properties (Sun et al 1994), which allow the same spatio-temporal pattern of brain activity to occur at different temporal rates.

By exploiting the full potential of such models, we expect them to be adapted individually to specific BCI applications. The idea of adaptive online learning may be implemented to cope with non-stationary signal properties.

Future research may encompass extended and additional model constraints for special cases, as successfully used in ASR, such as fixed transition probabilities or the sharing of mixture distributions among different states. Finally, the incorporation of other sources of prior knowledge and of context awareness may further optimize decoding accuracy.

Acknowledgments

The authors acknowledge the work of Fanny Quandt, who substantially contributed to the acquisition of the ECoG data. Supported by NINDS grant NS21135, Land Sachsen-Anhalt grant MK48-2009003 and ECHORD 231143. This work is partly funded by the German Ministry of Education and Research (BMBF) within the Forschungscampus STIMULATE (grant 03FO16102A).

References

Acharya M Fifer M Benz H Crone N and Thakor N 2010 Electrocorticographic amplitude predicts finger positions during slow grasping motions of the hand J. Neural Eng. 7 13

Ball T Schulze-Bonhage A Aertsen A and Mehring C 2009 Differential representation of arm movement direction in relation to cortical anatomy and function J. Neural Eng. 6 016

Bhattacharyya A 1943 On a measure of divergence between two statistical populations defined by their probability distributions Bull. Calcutta Math. Soc. 35 99–109

Chang C-C and Lin C-J 2011 LIBSVM: a library for support vector machines ACM Trans. Intell. Syst. Technol. 2 1–27

Cincotti F et al 2003 Comparison of different feature classifiers for brain computer interfaces Proc. 1st Int. IEEE EMBS Conf. on Neural Engineering (Capri Island, Italy) pp 645–7

Crone N Miglioretti D Gordon B and Lesser R 1998 Functional mapping of human sensorimotor cortex with electrocorticographic spectral analysis: II. Event-related synchronization in the gamma band Brain 121 2301–15

Davies D and Bouldin D 1979 A cluster separation measure IEEE Trans. Pattern Anal. Mach. Intell. PAMI-1 224–7

Demirer R Ozerdem M and Bayrak C 2009 Classification of imaginary movements in ECoG with a hybrid approach based on multi-dimensional Hilbert-SVM solution J. Neurosci. Methods 178 214–8

Dempster A Laird N and Rubin D B 1977 Maximum likelihood from incomplete data via the EM algorithm J. R. Stat. Soc. B 39 1–38

Dethier J Gilja V Nuyujukian P Elassaad S A Shenoy K V and Boahen K 2011 Spiking neural network decoder for brain-machine interfaces EMBS'11: Proc. IEEE 5th Int. Conf. on Neural Engineering in Medicine and Biological Society (Cancun, Mexico) pp 396–9

Duda R O Hart P E and Stork D G 2001 Pattern Classification 2nd edn (New York: Interscience)

Fisher R A 1936 The use of multiple measurements in taxonomic problems Ann. Eugen. 7 179–88

Ganguly K and Carmena J M 2010 Neural correlates of skill acquisition with a cortical brain-machine interface J. Motor Behav. 42 355–60

Graimann B Huggins J Levine S and Pfurtscheller G 2004 Toward a direct brain interface based on human subdural recordings and wavelet-packet analysis IEEE Trans. Biomed. Eng. 51 954–62

Guyon I and Elisseeff A 2003 An introduction to variable and feature selection J. Mach. Learn. Res. 3 1157–82

Guyon I Weston J Barnhill S and Vapnik V 2002 Gene selection for cancer classification using support vector machines Mach. Learn. 46 389–422

Hastie T Tibshirani R and Friedman J 2009 The Elements of Statistical Learning 2nd edn (New York: Springer)

Hoffmann U Vesin J and Diserens K 2007 An efficient P300-based brain-computer interface for disabled subjects J. Neurosci. Methods 167 115–25

Kubanek J Miller K Ojemann J Wolpaw J and Schalk G 2009 Decoding flexion of individual fingers using electrocorticographic signals in humans J. Neural Eng. 6 14

Lederman D and Tabrikian J 2012 Classification of multichannel EEG patterns using parallel hidden Markov models Med. Biol. Eng. Comput. 50 319–28

Lee H and Choi S 2003 PCA+HMM+SVM for EEG pattern classification Proc. 7th Int. Symp. on Signal Processing and Its Applications vol 1 pp 541–4

Lemm S Blankertz B Dickhaus T and Mueller K-R 2011 Introduction to machine learning for brain imaging NeuroImage 56 387–99

Liang N and Bougrain L 2009 Decoding finger flexion using amplitude modulation from band-specific ECoG ESANN: Eur. Symp. on Artificial Neural Networks pp 467–72

Liu C Zhao H B Li C S and Wang H 2010 Classification of ECoG motor imagery tasks based on CSP and SVM BMEI: 3rd Int. Conf. on Biomedical Engineering and Informatics vol 2 pp 804–7

Lotte F Congedo M Lecuyer A Lamarche F and Arnaldi B 2007 A review of classification algorithms for EEG-based brain-computer interfaces J. Neural Eng. 4 R1–R13

Makeig S 1993 Auditory event-related dynamics of the EEG spectrum and effects of exposure to tones Electroencephalogr. Clin. Neurophysiol. 86 283–93

Murphy K 2005 Hidden Markov model (HMM) toolbox for Matlab www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html

Obermaier B Guger C Neuper C and Pfurtscheller G 2001a Hidden Markov models for online classification of single trial EEG data Pattern Recognit. Lett. 22 1299–309

Obermaier B Guger C and Pfurtscheller G 1999 Hidden Markov models used for the offline classification of EEG data Biomed. Tech./Biomed. Eng. 44 158–62

Obermaier B Neuper C Guger C and Pfurtscheller G 2001b Information transfer rate in a five-classes brain-computer interface IEEE Trans. Neural Syst. Rehabil. Eng. 9 283–8

Palaniappan R Chanan R and Syan S 2009 Current practices in electroencephalogram based brain-computer interfaces Encyclopedia of Information Science and Technology II (Hershey, PA: IGI Global) pp 888–901

Pascual-Marqui R D Michel C M and Lehmann D 1995 Segmentation of brain electrical activity into microstates: model estimation and validation IEEE Trans. Biomed. Eng. 42 658–65

Pechenizkiy M 2005 The impact of feature extraction on the performance of a classifier: kNN, Naive Bayes and C4.5 Proc. 18th Canadian Society Conf. on Advances in Artificial Intelligence (Berlin: Springer) pp 268–79

Penfield W and Boldrey E 1937 Somatic motor and sensory representation in the cerebral cortex of man as studied by electrical stimulation Brain 60 389–443

Picton T W Lins O G and Scherg M 1995 The recording and analysis of event-related potentials Handbook of Neuropsychology vol 10 ed F Boller and J Grafman (Amsterdam: Elsevier) pp 3–73

Pistohl T Ball T Schulze-Bonhage A Aertsen A and Mehring C 2008 Prediction of arm movement trajectories from ECoG-recordings in humans J. Neurosci. Methods 167 105–14

Quandt F Reichert C Hinrichs H Heinze H Knight R and Rieger J 2012 Single trial discrimination of individual finger movements on one hand: a combined MEG and EEG study NeuroImage 59 3316–24

Rabiner L 1989 A tutorial on hidden Markov models and selected applications in speech recognition Proc. IEEE 77 257–86

Schalk G 2010 Can electrocorticography (ECoG) support robust and powerful brain-computer interfaces? Front. Neuroeng. 3 1

Schukat-Talamazzini E G 1995 Automatische Spracherkennung: Grundlagen, statistische Modelle und effiziente Algorithmen (Künstliche Intelligenz) (Braunschweig: Vieweg)

Shenoy P Miller K Ojemann J and Rao R 2007 Finger movement classification for an electrocorticographic BCI CNE'07: 3rd Int. IEEE/EMBS Conf. on Neural Engineering pp 192–5

Sun D X Deng L and Wu C F J 1994 State-dependent time warping in the trended hidden Markov model Signal Process. 39 263–75

Wissel T and Palaniappan R 2011 Considerations on strategies to improve EOG signal analysis Int. J. Artif. Life Res. 2 6–21

Yu L and Liu H 2004 Efficient feature selection via analysis of relevance and redundancy J. Mach. Learn. Res. 5 1205–24

Zhao H B Liu C Wang H and Li C S 2010 Classifying ECoG signals using probabilistic neural network ICIE'10: WASE Int. Conf. on Information Engineering vol 1 pp 77–80


Page 10: Hidden Markov model and support vector machine based ...knightlab.berkeley.edu/statics/publications/2014/10/02/...et al 2011, Acharya et al 2010,Ballet al 2009, Kubanek et al 2009,LiangandBougrain2009,Pistohletal2008),thesupport

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 6 Influence of channel selection on the HMM decoding accuracy using high gamma features only (S1 session 1) Changes indecoding accuracy when successively adding channels to the feature set applying either random or DB-index channel selection (top)Number of free HMM parameters for each number of channels (bottom)

Table 3 Number of selected channels averaged across sessions The averaged number of selected channels and the corresponding standardderivations are given for all subjects and all feature spaces For the combined space square brackets denote the corresponding breakdowninto its components The average is taken across all sessions and CV repetitions

Subject LFTD HG LFTD + HG

1 100 plusmn 00 60 plusmn 00 165 plusmn 05 ([100 plusmn 00]+[65 plusmn 07])2 65 plusmn 19 73 plusmn 21 128 plusmn 34 ([70 plusmn 45] + [58 plusmn 15])3 100 plusmn 00 80 plusmn 00 140 plusmn 10 ([60 plusmn 14] + [80 plusmn 00])4 100 plusmn 00 60 plusmn 14 160 plusmn 07 ([95 plusmn 07] + [65 plusmn 07])

A clear superiority for SVM classification was only foundfor subject 2 for whom the absolute decoding performanceswere significantly worse than those of the other three subjects

For the single spaces LFTD and HG table 4 shows themean performance loss for the unleashed HMM That refersto a case wherein the Bakis model constraints are relaxedto the full model For all except two sessions of subject 2(the full model had a higher performance of 02 for session3 and 07 for session 4) the Bakis model outperformedthe full model The results indicate a higher benefit for theLFTD features with an up to 88 performance gain forsubject 3 session 2 Higher accuracies were consistentlyachieved for feature subsets deviating from the optimalsize An example is illustrated for LFTD channel selection(subject 3 session 2) in supplementary figure SI9 (availableat stacksioporgJNE10056020mmedia) The plot comparesthe Bakis and full model similar to the procedure in figure 6While table 4 lists only the improvements for the optimalnumber of channels supplementary figure SI9 indicates that

the performance gain may be even higher when deviating fromthe optimal settings

Finally figure 10 illustrates the accuracy decompositionsinto true positive rates for each finger The results reveal similartrends across subjects and feature spaces While the thumband little finger provided the fewest training samples theirtrue positive rates tend to be the highest among all fingers Anexception is given for LFTD features for subject 2 Equivalentplots for the SVM are shown in supplementary figure SI10(available at stacksioporgJNE10056020mmedia)

For the classification of one feature sequence includingall T samples the HMM Matlab implementation took onaverage (3522 plusmn 0036) ms and the SVM implementationin C (00281 plusmn 00034) ms The mean computation time fora feature sequence was 02 ms plusmn 21μs (LFTD) 172 ms plusmn379 μs (HG) and 179 ms plusmn 200 μs (combined space) Theresults were obtained with an Intel Core i5-2500 K 34 GHz16 GB RAM Win 7 Pro 64-Bit Matlab 2012b for ten selectedchannels

9

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 7 Decoding accuracies (in ) for HMM (green) and SVM (beige) using LFTD + HG feature space (best case) Averageperformances across all sessions are shown for all subjects (S1ndashS4) as a mean result of a 30 times 5 CV procedure with their correspondingerror bars

Figure 8 Decoding accuracies (in ) for the HMMs for all feature spaces and sessions Bi Performances are averaged across 30 repetitionsof the five-fold CV procedure and displayed with their corresponding error bars

4 Discussion

An evaluation of feature space configurations revealed asensitivity of the decoding performance to the preciseparameterization of individual feature spaces Apart from thesegeneral aspects the grid search primarily allows for individualparameter deviations across subjects and sessions Importantlya sensitivity has been identified for the particular time intervalof interest used for the sliding window approach to extractHG features This indicates varying temporal dynamics acrosssubjects and may favour the application of HMMs in onlineapplications This sensitivity suggests that adapting to thesetemporal dynamics has an influence on the degree to whichinformation can be deduced from the data by both HMMand SVM However our investigations indicate that HMMsare less sensitive to changes of the ROI and give nearly

stable performances when changing the window overlapAs a model property the HMM is capable of modellinginformative sequences of different lengths and rates Thelatter characteristic originates from the explicit loop transitionwithin a particular model This means the HMM may coveruninformative samples of a temporal sequence with an idlestate before the informative part starts Underlying temporalprocesses that may persist longer or shorter are modelled bymore or less repeating basic model states Lacking any time-warp properties an SVM would expect a fixed sequence lengthand information rate meaning a fixed association between aparticular feature dimension oi and a specific point in time tiThis highlights the strict comparability of information withinone single feature dimension explicitly required by the SVMThis leads to the prediction of benefits for HMMs in onlineapplications where the time for the button press or other

10

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 9 Differences in decoding accuracy (in ) for the HMMs with respect to the SVM reference classifier (HMM-SVM) Results aredisplayed for all sessions Bi and feature spaces

Figure 10 Mean HMM decoding rates decomposed into true positive rates for each finger Results are shown for each feature space andaveraged across sessions and CV repetitions

Table 4 Final decoding accuracies (in ) listed for HMM (left) and SVM (right) Accuracies for the Bakis model are given for eachsubject-session and feature space by its mean and standard error across all 30 repetitions of the CV procedure Additionally for each subjectand feature space the mean across all sessions as well as the mean loss in performance for LFTD and HG when using an unconstrainedHMM is shown

HMM SVMSubjectndashsession LFTD HG LFTD + HG LFTD HG LFTD + HG

S1ndashB1 () 807 plusmn 03 979 plusmn 01 984 plusmn 01 858 plusmn 04 972 plusmn 02 983 plusmn 01S1ndashB2 () 788 plusmn 06 924 plusmn 04 953 plusmn 03 849 plusmn 05 934 plusmn 04 961 plusmn 03Mean () 798 plusmn 05 952 plusmn 03 969 plusmn 02 854 plusmn 05 953 plusmn 03 972 plusmn 02Unconstrained model () minus31 plusmn 05 minus09 plusmn 03 ndash ndash ndash ndashS2ndashB1 () 440 plusmn 10 359 plusmn 07 444 plusmn 06 514 plusmn 08 380 plusmn 08 557 plusmn 08S2ndashB2 () 502 plusmn 10 481 plusmn 07 524 plusmn 08 555 plusmn 07 480 plusmn 06 602 plusmn 08S2ndashB3 () 402 plusmn 07 363 plusmn 08 401 plusmn 08 496 plusmn 07 395 plusmn 09 515 plusmn 08S2ndashB4 () 309 plusmn 07 364 plusmn 09 384 plusmn 09 391 plusmn 07 380 plusmn 10 427 plusmn 09Mean () 412 plusmn 09 392 plusmn 08 438 plusmn 08 489 plusmn 07 409 plusmn 09 525 plusmn 09Unconstrained model () minus22 plusmn 09 minus04 plusmn 08 ndash ndash ndash ndashS3ndashB1 () 646 plusmn 07 851 plusmn 04 858 plusmn 04 612 plusmn 07 854 plusmn 05 888 plusmn 04S3ndashB2 () 801 plusmn 05 905 plusmn 03 937 plusmn 03 850 plusmn 04 914 plusmn 03 960 plusmn 03Mean () 724 plusmn 06 878 plusmn 04 898 plusmn 04 731 plusmn 06 884 plusmn 04 924 plusmn 03Unconstrained model () minus59 plusmn 07 minus25 plusmn 04 ndash ndash ndash ndashS4ndashB1 () 714 plusmn 05 835 plusmn 05 872 plusmn 04 685 plusmn 06 821 plusmn 04 867 plusmn 04S4ndashB2 () 697 plusmn 07 831 plusmn 03 879 plusmn 04 643 plusmn 06 823 plusmn 03 844 plusmn 04Mean () 706 plusmn 07 833 plusmn 03 876 plusmn 04 664 plusmn 06 822 plusmn 03 856 plusmn 04Unconstrained model () minus11 plusmn 06 minus31 plusmn 05 ndash ndash ndash ndash

11

J Neural Eng 10 (2013) 056020 T Wissel et al

events may not be known beforehand The classifier may actmore robustly on activity that is not perfectly time lockedImaging paradigms for motor imagery or similar experimentsmay provide scenarios in which a subject carries out an actionfor an unspecified time interval Given that varying temporalextent is likely the need for training the SVM for the entirerange of possible ROI locations and widths is not desirable

The SVM characteristics also imply a need to await thewhole sequence to make a decision for a class In contrastthe Viterbi algorithm allows HMMs to state at each incomingfeature vector how likely an assignment to a particular class isIn online applications this striking advantage provides moreprompt classification and may reduce computational overheadif only the baseline is employed There is no need to awaitfurther samples for a decision Note that this emphasizes theimplicit adaptation of HMMs to real-time applications whilethe SVM needs to classify the whole new subsequence againafter each incoming sample HMMs just update their currentprediction according to the incoming information They donot start from scratch The implementation used here did notmake use of this online feature since only the classificationof the entire trial was evaluated Another reason for the SVMbeing faster than the HMM is given by the non-optimizedimplementation in Matlab code of the latter The same appliesto the feature extraction Existing high performance librariesfor these standard signal processing techniques or evena hardware field-programmable gate array implementationmay significantly speed up the implementation The averageclassification time for the HMMs implies a maximal possiblesignal rate of about 285 Hz when classifying after eachsample Thus in terms of classification even with the currentimplementation real-time is perfectly feasible for most casesMore sophisticated implementations are used in the field ofspeech processing where classifications have to be performedon the basis of raw data with very high sampling rates of up to40 kHz

Second tuning the features to an informative subsetyielded less dependence on the particular classifier forthe presented offline case Preliminary analysis usingcompromised parameter sets for the feature spaces as wellas feature subset size led instead to a higher performancegap between both classifiers Adapting to each subjectindividually a strict superiority was not observed for theHMMs versus the SVM in subjects 1 and 3 For subject4 the HMM actually outperformed the SVM in all casesInterestingly this coincides with the grid resolution whichwas twice as high for this subject supporting the importanceof the degree and spatial precision of motor cortex coverageFor subject 2 decoding performances were significantlyworse in comparison with the other subjects This isprobably due to the grid placement not being optimal forthe classification (supplementary figure SI2 available atstacksioporgJNE10056020mmedia) Figure 5 shows nodistinct informative region and channels selected at a highrate are located at the outer boundary of the grid rather thanover motor cortices This hypothesis is further supported by thedecoding accuracies which are higher for LFTD than for HGfeatures in subject 2 Since informative regions were spread

more widely across the cortex for subject 2 (Kubanek et al2009) regions exhibiting critical high gamma motor regionactivity may have been located outside the actual grid

While the results suggest that high feature quality is themain factor for high performances in this ECoG data settwo key points have to be emphasized for the applicationof HMMs with respect to the present finger classificationproblem First due to a high number of free model parametersfeature selection is essential for achieving high decodingaccuracy Note that the SVM exhibits fewer free parametersto be estimated during training These parameters mainlycorrespond to the weights α The number of these directlydepends on the number of training samples The functionalrelationship between features and labels even depends on thenon-zero weights of the support vectors only Thus SVMexhibits higher robustness to variations of the subset sizeSecond imposing constraints on the HMM model structurewas found to be important for achieving high decodingaccuracies As stated in the methods section the introductionof the Bakis model does not substantially influence the numberof free parameters The subset size has a much higher impact(figure 6) on the overall number Nevertheless a performancegain of up to 88 per session or 59 mean per subjectacross all sessions was achieved for the LFTD features Thissuggests that improvements may not be due to a mere reductionin the absolute number of parameters but also due to bettercompliance of the model with the underlying temporal datastructure ie the evolution of feature characteristics over timeFor the LFTD space information is encoded in the signalphase and hence the temporal structure The fact that theBakis model constrains this temporal structure hence entailsinteresting implications Higher overall accuracies for the HGspace along with only minor benefits from the Bakis modelindicate that the clear advantages of model constraints mainlyarise if the relevant information does not appear as prominentin the data This is further supported by a higher increase inHMM performance when not using the optimal subset sizefor LFTD features (see supplementary figure SI9 availableat stacksioporgJNE10056020mmedia) This suggests thatthe Bakis approach tends to be more robust with respect todeviations from the optimal parameter setting

Finally it should be mentioned that the current testprocedure does not fully allow for non-stationary behaviourThat means that the random trial selection does not taketemporal evolution of signal properties into account Thesemight change and blur the feature space representation overtime (Lemm et al 2011) In this context and depending on theextent of non-stationarity the presented results may slightlyoverestimate the real decoding rates This applies for bothSVM and HMM Non-stationarity should be addressed inan online setup by updating the HMM on-the-fly Trainingdata should only be used from past recordings and the signalcharacteristics of future test set data remain unknown Whileonline model updates are generally unproblematic for theHMM training procedure investigations on this side are stillneeded

12

J Neural Eng 10 (2013) 056020 T Wissel et al

5 Conclusions

For a defined offline case we have shown that high decodingrates can be achieved with both the HMM and SVM classifiersin most subjects This suggests that the general choice ofclassifier is of less importance once HMM robustness isincreased by introducing model constraints The performancegap between both machine learning methods is sensitive tofeature extraction and particularly electrode subset selectionraising interesting implications for future research This studyrestricted HMMs to a rigid time window with a known eventmdasha scenario that suits SVM requirements This condition maybe relaxed and classification improved by use of the HMMtime-warp property (Sun et al 1994) These allow for the samespatio-temporal pattern of brain activity occurring at differenttemporal rates

By exploiting the actual potential of such models weexpect them to be individually adapted to specific BCIapplications The idea of adaptive online learning may beimplemented for non-stationary signal properties

Future research may encompass extended and additionalmodel constraints for special cases as successfully used inASR such as fixed transition probabilities or the sharingof mixture distributions among different states Finallyincorporation of other sources of prior knowledge and contextawareness may further optimize decoding accuracy

Acknowledgments

The authors acknowledge the work of Fanny Quandt whosubstantially contributed to the acquisition of the ECoGdata Supported by NINDS grant NS21135 Land-Sachsen-Anhalt grant MK48-2009003 and ECHORD 231143 Thiswork is partly funded by the German Ministry of Educationand Research (BMBF) within the ForschungscampusSTIMULATE (grant 03FO16102A)

References

Acharya M Fifer M Benz H Crone N and Thakor N 2010Electrocorticographic amplitude predicts finger positionsduring slow grasping motions of the hand J Neural Eng7 13

Ball T Schulze-Bonhage A Aertsen A and Mehring C 2009Differential representation of arm movement direction inrelation to cortical anatomy and function J Neural Eng6 016

Bhattacharyya A 1943 On a measure of divergence between twostatistical populations defined by their probability distributionsBull Calcutta Math Soc 35 99ndash1909

Chang C-C and Lin C-J 2011 LIBSVM a library for support vectormachines ACM Tran Intell Syst Technol 2 1ndash27

Cincotti F et al 2003 Comparison of different feature classifiers forbrain computer interfaces Proc 1st Int IEEE EMBS Conf onNeural Engineering (Capri Island Italy) pp 645ndash7

Crone N Miglioretti D Gordon B and Lesser R 1998 Functionalmapping of human sensorimotor cortex withelectrocorticographic spectral analysis II event-relatedsynchronization in the gamma bands Brain 121 2301ndash15

Davies D and Bouldin D 1979 A cluster separation measure IEEETrans Pattern Anal Mach Intell PAMI-1 224ndash7

Demirer R Ozerdem M and Bayrak C 2009 Classification ofimaginary movements in ECoG with a hybrid approach basedon multi-dimensional Hilbert-SVM solution J NeurosciMethods 178 214ndash8

Dempster A Laird N and Rubin D B 1977 Maximum likelihoodfrom incomplete data via the EM algorithm J R Stat Soc B39 1ndash38

Dethier J Gilja V Nuyujukian P Elassaad S A Shenoy K Vand Boahen K 2011 Spiking neural network decoder forbrain-machine interfaces EMBSrsquo11 Proc IEEE 5th Int Confon Neural Engineering in Medicine and Biological Society(Cancun Mexico) pp 396ndash9

Duda R O Hart P E and Stork D G 2001 Pattern Classification 2ndedn (New York Interscience)

Fisher A R 1936 The use of multiple measurements in taxonomicproblems Ann Eugen 7 179ndash88

Ganguly K and Carmena J M 2010 Neural correlates of skillacquisition with a cortical brain-machine interface J MotorBehav 42 355ndash60

Graimann B Huggins J Levine S and Pfurtscheller G 2004 Towarda direct brain interface based on human subdural recordings andwavelet-packet analysis IEEE Trans Biomed Eng 51 954ndash62

Guyon I and Elisseeff A 2003 An introduction to variable andfeature selection J Mach Learn Res 3 1157ndash82

Guyon I Weston J Barnhill S and Vapnik V 2002 Gene selectionfor cancer classification using support vector machinesJ Mach Learn 46 389ndash422

Hastie T Tibshirani R and Friedman J 2009 The Elements ofStatistical Learning 2nd edn (New York Springer)

Hoffmann U Vesin J and Diserens K 2007 An efficient P300-basedbrain-computer interface for disabled subjects J NeurosciMethods 167 115ndash25

Kubanek J Miller K Ojemann J Wolpaw J and Schalk G 2009Decoding flexion of individual fingers usingelectrocorticographic signals in humans J Neural Eng 6 14

Lederman D and Tabrikian J 2012 Classification of multichannelEEG patterns using parallel hidden Markov models Med BiolEng Comput 50 319ndash28

Lee H and Choi S 2003 PCA+HMM+SVM for EEG patternclassification 7th Proc Int Symp on Signal Processing and ItsApplications vol 1 pp 541ndash4

Lemm S Blankertz B Dickhaus T and Mueller K-R 2011Introduction to machine learning for brain imagingNeuroImage 56 387ndash99

Liang N and Bougrain L 2009 Decoding finger flexion usingamplitude modulation from band-specific ECoG ESANN EurSymp on Artificial Neural Networks pp 467ndash72

Liu C Zhao H B Li C S and Wang H 2010 Classification of ECoGmotor imagery tasks based on CSP and SVM BMEI 3rd IntConf on Biomedical Engineering and Informatics vol 2pp 804ndash7

Lotte F Congedo M Lecuyer A Lamarche F and Arnaldi B 2007 Areview of classification algorithms for EEG-basedbrain-computer interfaces J Neural Eng 4 R1ndashR13

Makeig S 1993 Auditory event-related dynamics of the EEGspectrum and effects of exposure to tones ElectroencephalogrClin Neurophysiol 86 283ndash93

Murphy K 2005 Hidden Markov model (HMM) toolbox for matlabwwwcsubccasimmurphykSoftwareHMMhmmhtml

Obermaier B Guger C Neuper C and Pfurtscheller G 2001a HiddenMarkov models for online classification of single trial EEGdata Pattern Recognit Lett 22 1299ndash309

Obermaier B Guger C and Pfurtscheller G 1999 Hidden Markovmodels used for the offline classification of EEG data BiomedTechBiomed Eng 44 158ndash62

Obermaier B Neuper C Guger C and Pfurtscheller G 2001bInformation transfer rate in a five-classes brain-computerinterface IEEE Trans Neural Syst Rehabil Eng 9 283ndash8

Palaniappan R Chanan R and Syan S 2009 Current practices inelectroencephalogram based brain-computer interfaces

13

J Neural Eng 10 (2013) 056020 T Wissel et al

Encyclopedia of Information Science and Technology II(Hershey PA IGI Global) pp 888ndash901

Pasqual-Marqui R D Michel C M and Lehmann D 1995Segmentation of brain electrical activity into microstatesmodel estimation and validation IEEE Trans Biomed Eng42 658ndash65

Pechenizkiy M 2005 The impact of feature extraction on theperformance of a classifier kNN Naive Bayes and C45 Proc18th Canadian Society Conf on Advances in ArtificialIntelligence (Berlin Springer) pp 268ndash79

Penfield W and Boldrey E 1937 Somatic motor and sensoryrepresentation in the cerebral cortex of man as studied byelectrical stimulation Brain 60 389ndash443

Picton T W Lins O G and Scherg M 1995 The recording andanalysis of event-related potentials Handbook ofNeuropsychology ed F Boller and J Grafman (AmsterdamElsevier) vol 10 pp 3ndash73

Pistohl T Ball T Schulze-Bonhage A Aertsen A and Mehring C2008 Prediction of arm movement trajectories fromECoG-recordings in humans J Neurosci Methods 167 105ndash14

Quandt F Reichert C Hinrichs H Heinze H Knight R and Rieger J2012 Single trial discrimination of individual finger

movements on one hand a combined MEG and EEG studyNeuroImage 59 3316ndash24

Rabiner L 1989 A tutorial on hidden Markov models and selectedapplications in speech recognition Proc IEEE 77 257ndash86

Schalk G 2010 Can electrocorticography (ECoG) support robust andpowerful brain-computer interfaces Front Neuroeng 3 1

Schukat-Talamazzini E G 1995 Automatische SpracherkennungmdashGrundlagen statistische Modelle und effiziente AlgorithmenKunstliche Intelligenz (Braunschweig Vieweg)

Shenoy P Miller K Ojemann J and Rao R 2007 Finger movementclassification for an electrocorticographic BCI CNErsquo07 3rdInt IEEEEMBS Conf on Neural Engineering pp 192ndash5

Sun D X Deng L and Wu C F J 1994 State-dependent timewarping in the trended hidden Markov model Signal Proc39 263ndash75

Wissel T and Palaniappan R 2011 Considerations on strategies toimprove EOG signal analysis Int J Artif Life Res 2 6ndash21

Yu L and Liu H 2004 Efficient feature selection via analysis ofrelevance and redundancy J Mach Learn Res 5 1205ndash24

Zhao H B Liu C Wang H and Li C S 2010 Classifying ECoGsignals using probabilistic neural network ICIErsquo10 WASE IntConf on Information Engineering vol 1 pp 77ndash80

14

  • 1 Introduction
  • 2 Material and methods
    • 21 Patients experimental setup and data acquisition
    • 22 Test framework
    • 23 Feature generation
    • 24 Feature selection
    • 25 Classification
    • 26 Considerations on testing
      • 3 Results
        • 31 Feature and channel selection
        • 32 Decoding performances
          • 4 Discussion
          • 5 Conclusions
          • Acknowledgments
          • References
Page 11: Hidden Markov model and support vector machine based ...knightlab.berkeley.edu/statics/publications/2014/10/02/...et al 2011, Acharya et al 2010,Ballet al 2009, Kubanek et al 2009,LiangandBougrain2009,Pistohletal2008),thesupport

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 7 Decoding accuracies (in ) for HMM (green) and SVM (beige) using LFTD + HG feature space (best case) Averageperformances across all sessions are shown for all subjects (S1ndashS4) as a mean result of a 30 times 5 CV procedure with their correspondingerror bars

Figure 8 Decoding accuracies (in ) for the HMMs for all feature spaces and sessions Bi Performances are averaged across 30 repetitionsof the five-fold CV procedure and displayed with their corresponding error bars

4 Discussion

An evaluation of feature space configurations revealed asensitivity of the decoding performance to the preciseparameterization of individual feature spaces Apart from thesegeneral aspects the grid search primarily allows for individualparameter deviations across subjects and sessions Importantlya sensitivity has been identified for the particular time intervalof interest used for the sliding window approach to extractHG features This indicates varying temporal dynamics acrosssubjects and may favour the application of HMMs in onlineapplications This sensitivity suggests that adapting to thesetemporal dynamics has an influence on the degree to whichinformation can be deduced from the data by both HMMand SVM However our investigations indicate that HMMsare less sensitive to changes of the ROI and give nearly

stable performances when changing the window overlapAs a model property the HMM is capable of modellinginformative sequences of different lengths and rates Thelatter characteristic originates from the explicit loop transitionwithin a particular model This means the HMM may coveruninformative samples of a temporal sequence with an idlestate before the informative part starts Underlying temporalprocesses that may persist longer or shorter are modelled bymore or less repeating basic model states Lacking any time-warp properties an SVM would expect a fixed sequence lengthand information rate meaning a fixed association between aparticular feature dimension oi and a specific point in time tiThis highlights the strict comparability of information withinone single feature dimension explicitly required by the SVMThis leads to the prediction of benefits for HMMs in onlineapplications where the time for the button press or other

10

J Neural Eng 10 (2013) 056020 T Wissel et al

Figure 9 Differences in decoding accuracy (in ) for the HMMs with respect to the SVM reference classifier (HMM-SVM) Results aredisplayed for all sessions Bi and feature spaces

Figure 10 Mean HMM decoding rates decomposed into true positive rates for each finger Results are shown for each feature space andaveraged across sessions and CV repetitions

Table 4 Final decoding accuracies (in ) listed for HMM (left) and SVM (right) Accuracies for the Bakis model are given for eachsubject-session and feature space by its mean and standard error across all 30 repetitions of the CV procedure Additionally for each subjectand feature space the mean across all sessions as well as the mean loss in performance for LFTD and HG when using an unconstrainedHMM is shown

HMM SVMSubjectndashsession LFTD HG LFTD + HG LFTD HG LFTD + HG

S1ndashB1 () 807 plusmn 03 979 plusmn 01 984 plusmn 01 858 plusmn 04 972 plusmn 02 983 plusmn 01S1ndashB2 () 788 plusmn 06 924 plusmn 04 953 plusmn 03 849 plusmn 05 934 plusmn 04 961 plusmn 03Mean () 798 plusmn 05 952 plusmn 03 969 plusmn 02 854 plusmn 05 953 plusmn 03 972 plusmn 02Unconstrained model () minus31 plusmn 05 minus09 plusmn 03 ndash ndash ndash ndashS2ndashB1 () 440 plusmn 10 359 plusmn 07 444 plusmn 06 514 plusmn 08 380 plusmn 08 557 plusmn 08S2ndashB2 () 502 plusmn 10 481 plusmn 07 524 plusmn 08 555 plusmn 07 480 plusmn 06 602 plusmn 08S2ndashB3 () 402 plusmn 07 363 plusmn 08 401 plusmn 08 496 plusmn 07 395 plusmn 09 515 plusmn 08S2ndashB4 () 309 plusmn 07 364 plusmn 09 384 plusmn 09 391 plusmn 07 380 plusmn 10 427 plusmn 09Mean () 412 plusmn 09 392 plusmn 08 438 plusmn 08 489 plusmn 07 409 plusmn 09 525 plusmn 09Unconstrained model () minus22 plusmn 09 minus04 plusmn 08 ndash ndash ndash ndashS3ndashB1 () 646 plusmn 07 851 plusmn 04 858 plusmn 04 612 plusmn 07 854 plusmn 05 888 plusmn 04S3ndashB2 () 801 plusmn 05 905 plusmn 03 937 plusmn 03 850 plusmn 04 914 plusmn 03 960 plusmn 03Mean () 724 plusmn 06 878 plusmn 04 898 plusmn 04 731 plusmn 06 884 plusmn 04 924 plusmn 03Unconstrained model () minus59 plusmn 07 minus25 plusmn 04 ndash ndash ndash ndashS4ndashB1 () 714 plusmn 05 835 plusmn 05 872 plusmn 04 685 plusmn 06 821 plusmn 04 867 plusmn 04S4ndashB2 () 697 plusmn 07 831 plusmn 03 879 plusmn 04 643 plusmn 06 823 plusmn 03 844 plusmn 04Mean () 706 plusmn 07 833 plusmn 03 876 plusmn 04 664 plusmn 06 822 plusmn 03 856 plusmn 04Unconstrained model () minus11 plusmn 06 minus31 plusmn 05 ndash ndash ndash ndash

11

J Neural Eng 10 (2013) 056020 T Wissel et al

events may not be known beforehand The classifier may actmore robustly on activity that is not perfectly time lockedImaging paradigms for motor imagery or similar experimentsmay provide scenarios in which a subject carries out an actionfor an unspecified time interval Given that varying temporalextent is likely the need for training the SVM for the entirerange of possible ROI locations and widths is not desirable

The SVM characteristics also imply a need to await thewhole sequence to make a decision for a class In contrastthe Viterbi algorithm allows HMMs to state at each incomingfeature vector how likely an assignment to a particular class isIn online applications this striking advantage provides moreprompt classification and may reduce computational overheadif only the baseline is employed There is no need to awaitfurther samples for a decision Note that this emphasizes theimplicit adaptation of HMMs to real-time applications whilethe SVM needs to classify the whole new subsequence againafter each incoming sample HMMs just update their currentprediction according to the incoming information They donot start from scratch The implementation used here did notmake use of this online feature since only the classificationof the entire trial was evaluated Another reason for the SVMbeing faster than the HMM is given by the non-optimizedimplementation in Matlab code of the latter The same appliesto the feature extraction Existing high performance librariesfor these standard signal processing techniques or evena hardware field-programmable gate array implementationmay significantly speed up the implementation The averageclassification time for the HMMs implies a maximal possiblesignal rate of about 285 Hz when classifying after eachsample Thus in terms of classification even with the currentimplementation real-time is perfectly feasible for most casesMore sophisticated implementations are used in the field ofspeech processing where classifications have to be performedon the basis of raw data with very high sampling rates of up to40 kHz

Second tuning the features to an informative subsetyielded less dependence on the particular classifier forthe presented offline case Preliminary analysis usingcompromised parameter sets for the feature spaces as wellas feature subset size led instead to a higher performancegap between both classifiers Adapting to each subjectindividually a strict superiority was not observed for theHMMs versus the SVM in subjects 1 and 3 For subject4 the HMM actually outperformed the SVM in all casesInterestingly this coincides with the grid resolution whichwas twice as high for this subject supporting the importanceof the degree and spatial precision of motor cortex coverageFor subject 2 decoding performances were significantlyworse in comparison with the other subjects This isprobably due to the grid placement not being optimal forthe classification (supplementary figure SI2 available atstacksioporgJNE10056020mmedia) Figure 5 shows nodistinct informative region and channels selected at a highrate are located at the outer boundary of the grid rather thanover motor cortices This hypothesis is further supported by thedecoding accuracies which are higher for LFTD than for HGfeatures in subject 2 Since informative regions were spread

more widely across the cortex for subject 2 (Kubanek et al2009) regions exhibiting critical high gamma motor regionactivity may have been located outside the actual grid

While the results suggest that high feature quality is themain factor for high performances in this ECoG data settwo key points have to be emphasized for the applicationof HMMs with respect to the present finger classificationproblem First due to a high number of free model parametersfeature selection is essential for achieving high decodingaccuracy Note that the SVM exhibits fewer free parametersto be estimated during training These parameters mainlycorrespond to the weights α The number of these directlydepends on the number of training samples The functionalrelationship between features and labels even depends on thenon-zero weights of the support vectors only Thus SVMexhibits higher robustness to variations of the subset sizeSecond imposing constraints on the HMM model structurewas found to be important for achieving high decodingaccuracies As stated in the methods section the introductionof the Bakis model does not substantially influence the numberof free parameters The subset size has a much higher impact(figure 6) on the overall number Nevertheless a performancegain of up to 88 per session or 59 mean per subjectacross all sessions was achieved for the LFTD features Thissuggests that improvements may not be due to a mere reductionin the absolute number of parameters but also due to bettercompliance of the model with the underlying temporal datastructure ie the evolution of feature characteristics over timeFor the LFTD space information is encoded in the signalphase and hence the temporal structure The fact that theBakis model constrains this temporal structure hence entailsinteresting implications Higher overall accuracies for the HGspace along with only minor benefits from the Bakis modelindicate that the clear advantages of model constraints mainlyarise if the relevant information does not appear as prominentin the data This is further supported by a higher increase inHMM performance when not using the optimal subset sizefor LFTD features (see supplementary figure SI9 availableat stacksioporgJNE10056020mmedia) This suggests thatthe Bakis approach tends to be more robust with respect todeviations from the optimal parameter setting

Finally it should be mentioned that the current testprocedure does not fully allow for non-stationary behaviourThat means that the random trial selection does not taketemporal evolution of signal properties into account Thesemight change and blur the feature space representation overtime (Lemm et al 2011) In this context and depending on theextent of non-stationarity the presented results may slightlyoverestimate the real decoding rates This applies for bothSVM and HMM Non-stationarity should be addressed inan online setup by updating the HMM on-the-fly Trainingdata should only be used from past recordings and the signalcharacteristics of future test set data remain unknown Whileonline model updates are generally unproblematic for theHMM training procedure investigations on this side are stillneeded

12

J Neural Eng 10 (2013) 056020 T Wissel et al

5 Conclusions

For a defined offline case we have shown that high decodingrates can be achieved with both the HMM and SVM classifiersin most subjects This suggests that the general choice ofclassifier is of less importance once HMM robustness isincreased by introducing model constraints The performancegap between both machine learning methods is sensitive tofeature extraction and particularly electrode subset selectionraising interesting implications for future research This studyrestricted HMMs to a rigid time window with a known eventmdasha scenario that suits SVM requirements This condition maybe relaxed and classification improved by use of the HMMtime-warp property (Sun et al 1994) These allow for the samespatio-temporal pattern of brain activity occurring at differenttemporal rates

By exploiting the actual potential of such models weexpect them to be individually adapted to specific BCIapplications The idea of adaptive online learning may beimplemented for non-stationary signal properties

Future research may encompass extended and additionalmodel constraints for special cases as successfully used inASR such as fixed transition probabilities or the sharingof mixture distributions among different states Finallyincorporation of other sources of prior knowledge and contextawareness may further optimize decoding accuracy

Acknowledgments

The authors acknowledge the work of Fanny Quandt whosubstantially contributed to the acquisition of the ECoGdata Supported by NINDS grant NS21135 Land-Sachsen-Anhalt grant MK48-2009003 and ECHORD 231143 Thiswork is partly funded by the German Ministry of Educationand Research (BMBF) within the ForschungscampusSTIMULATE (grant 03FO16102A)

References

Acharya M Fifer M Benz H Crone N and Thakor N 2010Electrocorticographic amplitude predicts finger positionsduring slow grasping motions of the hand J Neural Eng7 13

Ball T Schulze-Bonhage A Aertsen A and Mehring C 2009Differential representation of arm movement direction inrelation to cortical anatomy and function J Neural Eng6 016

Bhattacharyya A 1943 On a measure of divergence between twostatistical populations defined by their probability distributionsBull Calcutta Math Soc 35 99ndash1909

Chang C-C and Lin C-J 2011 LIBSVM a library for support vectormachines ACM Tran Intell Syst Technol 2 1ndash27

Cincotti F et al 2003 Comparison of different feature classifiers forbrain computer interfaces Proc 1st Int IEEE EMBS Conf onNeural Engineering (Capri Island Italy) pp 645ndash7

Crone N Miglioretti D Gordon B and Lesser R 1998 Functional mapping of human sensorimotor cortex with electrocorticographic spectral analysis: II. Event-related synchronization in the gamma band Brain 121 2301–15

Davies D and Bouldin D 1979 A cluster separation measure IEEE Trans Pattern Anal Mach Intell PAMI-1 224–7

Demirer R Ozerdem M and Bayrak C 2009 Classification of imaginary movements in ECoG with a hybrid approach based on multi-dimensional Hilbert-SVM solution J Neurosci Methods 178 214–8

Dempster A Laird N and Rubin D B 1977 Maximum likelihood from incomplete data via the EM algorithm J R Stat Soc B 39 1–38

Dethier J Gilja V Nuyujukian P Elassaad S A Shenoy K V and Boahen K 2011 Spiking neural network decoder for brain-machine interfaces EMBS’11: Proc IEEE 5th Int Conf on Neural Engineering in Medicine and Biological Society (Cancun, Mexico) pp 396–9

Duda R O Hart P E and Stork D G 2001 Pattern Classification 2nd edn (New York: Wiley-Interscience)

Fisher R A 1936 The use of multiple measurements in taxonomic problems Ann Eugen 7 179–88

Ganguly K and Carmena J M 2010 Neural correlates of skill acquisition with a cortical brain-machine interface J Motor Behav 42 355–60

Graimann B Huggins J Levine S and Pfurtscheller G 2004 Toward a direct brain interface based on human subdural recordings and wavelet-packet analysis IEEE Trans Biomed Eng 51 954–62

Guyon I and Elisseeff A 2003 An introduction to variable and feature selection J Mach Learn Res 3 1157–82

Guyon I Weston J Barnhill S and Vapnik V 2002 Gene selection for cancer classification using support vector machines Mach Learn 46 389–422

Hastie T Tibshirani R and Friedman J 2009 The Elements of Statistical Learning 2nd edn (New York: Springer)

Hoffmann U Vesin J and Diserens K 2007 An efficient P300-based brain-computer interface for disabled subjects J Neurosci Methods 167 115–25

Kubanek J Miller K Ojemann J Wolpaw J and Schalk G 2009 Decoding flexion of individual fingers using electrocorticographic signals in humans J Neural Eng 6 066001

Lederman D and Tabrikian J 2012 Classification of multichannel EEG patterns using parallel hidden Markov models Med Biol Eng Comput 50 319–28

Lee H and Choi S 2003 PCA+HMM+SVM for EEG pattern classification Proc 7th Int Symp on Signal Processing and Its Applications vol 1 pp 541–4

Lemm S Blankertz B Dickhaus T and Mueller K-R 2011 Introduction to machine learning for brain imaging NeuroImage 56 387–99

Liang N and Bougrain L 2009 Decoding finger flexion using amplitude modulation from band-specific ECoG ESANN: Eur Symp on Artificial Neural Networks pp 467–72

Liu C Zhao H B Li C S and Wang H 2010 Classification of ECoG motor imagery tasks based on CSP and SVM BMEI: 3rd Int Conf on Biomedical Engineering and Informatics vol 2 pp 804–7

Lotte F Congedo M Lecuyer A Lamarche F and Arnaldi B 2007 A review of classification algorithms for EEG-based brain-computer interfaces J Neural Eng 4 R1–R13

Makeig S 1993 Auditory event-related dynamics of the EEG spectrum and effects of exposure to tones Electroencephalogr Clin Neurophysiol 86 283–93

Murphy K 2005 Hidden Markov model (HMM) toolbox for Matlab www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html

Obermaier B Guger C Neuper C and Pfurtscheller G 2001a Hidden Markov models for online classification of single trial EEG data Pattern Recognit Lett 22 1299–309

Obermaier B Guger C and Pfurtscheller G 1999 Hidden Markov models used for the offline classification of EEG data Biomed Tech/Biomed Eng 44 158–62

Obermaier B Neuper C Guger C and Pfurtscheller G 2001b Information transfer rate in a five-classes brain-computer interface IEEE Trans Neural Syst Rehabil Eng 9 283–8

Palaniappan R Chanan R and Syan S 2009 Current practices in electroencephalogram based brain-computer interfaces Encyclopedia of Information Science and Technology II (Hershey, PA: IGI Global) pp 888–901

Pasqual-Marqui R D Michel C M and Lehmann D 1995 Segmentation of brain electrical activity into microstates: model estimation and validation IEEE Trans Biomed Eng 42 658–65

Pechenizkiy M 2005 The impact of feature extraction on the performance of a classifier: kNN, Naive Bayes and C4.5 Proc 18th Canadian Society Conf on Advances in Artificial Intelligence (Berlin: Springer) pp 268–79

Penfield W and Boldrey E 1937 Somatic motor and sensory representation in the cerebral cortex of man as studied by electrical stimulation Brain 60 389–443

Picton T W Lins O G and Scherg M 1995 The recording and analysis of event-related potentials Handbook of Neuropsychology vol 10 ed F Boller and J Grafman (Amsterdam: Elsevier) pp 3–73

Pistohl T Ball T Schulze-Bonhage A Aertsen A and Mehring C 2008 Prediction of arm movement trajectories from ECoG-recordings in humans J Neurosci Methods 167 105–14

Quandt F Reichert C Hinrichs H Heinze H Knight R and Rieger J 2012 Single trial discrimination of individual finger movements on one hand: a combined MEG and EEG study NeuroImage 59 3316–24

Rabiner L 1989 A tutorial on hidden Markov models and selected applications in speech recognition Proc IEEE 77 257–86

Schalk G 2010 Can electrocorticography (ECoG) support robust and powerful brain-computer interfaces? Front Neuroeng 3 1

Schukat-Talamazzini E G 1995 Automatische Spracherkennung: Grundlagen, statistische Modelle und effiziente Algorithmen (Künstliche Intelligenz) (Braunschweig: Vieweg)

Shenoy P Miller K Ojemann J and Rao R 2007 Finger movement classification for an electrocorticographic BCI CNE’07: 3rd Int IEEE/EMBS Conf on Neural Engineering pp 192–5

Sun D X Deng L and Wu C F J 1994 State-dependent time warping in the trended hidden Markov model Signal Process 39 263–75

Wissel T and Palaniappan R 2011 Considerations on strategies to improve EOG signal analysis Int J Artif Life Res 2 6–21

Yu L and Liu H 2004 Efficient feature selection via analysis of relevance and redundancy J Mach Learn Res 5 1205–24

Zhao H B Liu C Wang H and Li C S 2010 Classifying ECoG signals using probabilistic neural network ICIE’10: WASE Int Conf on Information Engineering vol 1 pp 77–80


Figure 9. Differences in decoding accuracy (in %) for the HMMs with respect to the SVM reference classifier (HMM − SVM). Results are displayed for all sessions Bi and feature spaces.

Figure 10. Mean HMM decoding rates decomposed into true positive rates for each finger. Results are shown for each feature space and averaged across sessions and CV repetitions.

Table 4. Final decoding accuracies (in %) listed for HMM (left) and SVM (right). Accuracies for the Bakis model are given for each subject-session and feature space by their mean and standard error across all 30 repetitions of the CV procedure. Additionally, for each subject and feature space, the mean across all sessions as well as the mean loss in performance for LFTD and HG when using an unconstrained HMM is shown.

Subject-session            HMM LFTD      HMM HG        HMM LFTD+HG   SVM LFTD      SVM HG        SVM LFTD+HG
S1-B1 (%)                  80.7 ± 0.3    97.9 ± 0.1    98.4 ± 0.1    85.8 ± 0.4    97.2 ± 0.2    98.3 ± 0.1
S1-B2 (%)                  78.8 ± 0.6    92.4 ± 0.4    95.3 ± 0.3    84.9 ± 0.5    93.4 ± 0.4    96.1 ± 0.3
Mean (%)                   79.8 ± 0.5    95.2 ± 0.3    96.9 ± 0.2    85.4 ± 0.5    95.3 ± 0.3    97.2 ± 0.2
Unconstrained model (%)    -3.1 ± 0.5    -0.9 ± 0.3    –             –             –             –
S2-B1 (%)                  44.0 ± 1.0    35.9 ± 0.7    44.4 ± 0.6    51.4 ± 0.8    38.0 ± 0.8    55.7 ± 0.8
S2-B2 (%)                  50.2 ± 1.0    48.1 ± 0.7    52.4 ± 0.8    55.5 ± 0.7    48.0 ± 0.6    60.2 ± 0.8
S2-B3 (%)                  40.2 ± 0.7    36.3 ± 0.8    40.1 ± 0.8    49.6 ± 0.7    39.5 ± 0.9    51.5 ± 0.8
S2-B4 (%)                  30.9 ± 0.7    36.4 ± 0.9    38.4 ± 0.9    39.1 ± 0.7    38.0 ± 1.0    42.7 ± 0.9
Mean (%)                   41.2 ± 0.9    39.2 ± 0.8    43.8 ± 0.8    48.9 ± 0.7    40.9 ± 0.9    52.5 ± 0.9
Unconstrained model (%)    -2.2 ± 0.9    -0.4 ± 0.8    –             –             –             –
S3-B1 (%)                  64.6 ± 0.7    85.1 ± 0.4    85.8 ± 0.4    61.2 ± 0.7    85.4 ± 0.5    88.8 ± 0.4
S3-B2 (%)                  80.1 ± 0.5    90.5 ± 0.3    93.7 ± 0.3    85.0 ± 0.4    91.4 ± 0.3    96.0 ± 0.3
Mean (%)                   72.4 ± 0.6    87.8 ± 0.4    89.8 ± 0.4    73.1 ± 0.6    88.4 ± 0.4    92.4 ± 0.3
Unconstrained model (%)    -5.9 ± 0.7    -2.5 ± 0.4    –             –             –             –
S4-B1 (%)                  71.4 ± 0.5    83.5 ± 0.5    87.2 ± 0.4    68.5 ± 0.6    82.1 ± 0.4    86.7 ± 0.4
S4-B2 (%)                  69.7 ± 0.7    83.1 ± 0.3    87.9 ± 0.4    64.3 ± 0.6    82.3 ± 0.3    84.4 ± 0.4
Mean (%)                   70.6 ± 0.7    83.3 ± 0.3    87.6 ± 0.4    66.4 ± 0.6    82.2 ± 0.3    85.6 ± 0.4
Unconstrained model (%)    -1.1 ± 0.6    -3.1 ± 0.5    –             –             –             –


events may not be known beforehand. The classifier may act more robustly on activity that is not perfectly time locked. Imaging paradigms for motor imagery or similar experiments may provide scenarios in which a subject carries out an action for an unspecified time interval. Given that a varying temporal extent is likely, having to train the SVM for the entire range of possible ROI locations and widths is undesirable.

The SVM characteristics also imply a need to await the whole sequence before making a decision for a class. In contrast, the Viterbi algorithm allows HMMs to state, at each incoming feature vector, how likely an assignment to a particular class is. In online applications this striking advantage provides more prompt classification and may reduce computational overhead if only the baseline is employed: there is no need to await further samples for a decision. Note that this emphasizes the implicit suitability of HMMs for real-time applications. While the SVM needs to classify the whole new subsequence again after each incoming sample, HMMs just update their current prediction according to the incoming information; they do not start from scratch. The implementation used here did not make use of this online feature, since only the classification of the entire trial was evaluated. Another reason why the SVM was faster than the HMM is the non-optimized Matlab implementation of the latter; the same applies to the feature extraction. Existing high-performance libraries for these standard signal processing techniques, or even a hardware field-programmable gate array implementation, may significantly speed up the implementation. The average classification time for the HMMs implies a maximal possible signal rate of about 285 Hz when classifying after each sample. Thus, in terms of classification, real-time operation is perfectly feasible for most cases even with the current implementation. More sophisticated implementations are used in the field of speech processing, where classifications have to be performed on raw data with very high sampling rates of up to 40 kHz.
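
To make this incremental decoding scheme concrete, the sketch below (Python with NumPy/SciPy; parameter names are hypothetical, and this is not the Matlab implementation based on Murphy's HMM toolbox that was used in the study) keeps a running forward log-likelihood for one class-specific Gaussian-emission HMM and updates it with every incoming feature vector. The forward recursion is used here instead of Viterbi purely for brevity; the per-sample nature of the decision is the same.

    import numpy as np
    from scipy.stats import multivariate_normal
    from scipy.special import logsumexp

    class StreamingHMM:
        """Incremental forward-algorithm scorer for one class-specific HMM.
        pi: initial state probabilities (K,), A: transition matrix (K, K),
        means/covs: per-state Gaussian emission parameters (illustrative)."""
        def __init__(self, pi, A, means, covs):
            self.log_pi = np.log(np.asarray(pi, float) + 1e-300)
            self.log_A = np.log(np.asarray(A, float) + 1e-300)   # forbidden transitions stay near -inf
            self.means, self.covs = means, covs
            self.log_alpha = None                                 # running forward variables

        def _log_emission(self, x):
            return np.array([multivariate_normal.logpdf(x, m, c)
                             for m, c in zip(self.means, self.covs)])

        def update(self, x):
            """Absorb one feature vector and return log P(x_1..x_t | model)."""
            log_b = self._log_emission(x)
            if self.log_alpha is None:                            # first sample of the trial
                self.log_alpha = self.log_pi + log_b
            else:                                                 # alpha_t(j) = sum_i alpha_{t-1}(i) A_ij * b_j(x_t)
                self.log_alpha = logsumexp(self.log_alpha[:, None] + self.log_A, axis=0) + log_b
            return logsumexp(self.log_alpha)

    # Online use: keep one StreamingHMM per finger class; after each new feature
    # vector x the running decision is the class whose model.update(x) is largest.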

Second, tuning the features to an informative subset yielded less dependence on the particular classifier for the presented offline case. Preliminary analysis using compromise parameter sets for the feature spaces as well as for the feature subset size instead led to a larger performance gap between both classifiers. When adapting to each subject individually, a strict superiority of the HMMs over the SVM was not observed in subjects 1 and 3, whereas for subject 4 the HMM actually outperformed the SVM in all cases. Interestingly, this coincides with the grid resolution, which was twice as high for this subject, supporting the importance of the degree and spatial precision of motor cortex coverage. For subject 2, decoding performances were significantly worse than for the other subjects. This is probably due to the grid placement not being optimal for the classification (supplementary figure SI2 available at stacks.iop.org/JNE/10/056020/mmedia): figure 5 shows no distinct informative region, and channels selected at a high rate are located at the outer boundary of the grid rather than over motor cortices. This hypothesis is further supported by the decoding accuracies, which are higher for LFTD than for HG features in subject 2. Since informative regions were spread more widely across the cortex for subject 2 (Kubanek et al 2009), regions exhibiting critical high gamma motor activity may have been located outside the actual grid.

While the results suggest that high feature quality is the main factor for high performance in this ECoG data set, two key points have to be emphasized for the application of HMMs to the present finger classification problem. First, owing to the high number of free model parameters, feature selection is essential for achieving high decoding accuracy. Note that the SVM has fewer free parameters to be estimated during training; these mainly correspond to the weights α, and their number depends directly on the number of training samples. The functional relationship between features and labels even depends only on the non-zero weights of the support vectors. Thus the SVM exhibits higher robustness to variations of the subset size. Second, imposing constraints on the HMM structure was found to be important for achieving high decoding accuracies. As stated in the methods section, the introduction of the Bakis model does not substantially influence the number of free parameters; the subset size has a much higher impact on the overall number (figure 6). Nevertheless, a performance gain of up to 8.8% per session, or 5.9% mean per subject across all sessions, was achieved for the LFTD features. This suggests that the improvements may not be due merely to a reduction in the absolute number of parameters but also to better compliance of the model with the underlying temporal data structure, i.e. the evolution of feature characteristics over time. For the LFTD space, information is encoded in the signal phase and hence in the temporal structure, so the fact that the Bakis model constrains exactly this temporal structure entails interesting implications. Higher overall accuracies for the HG space, along with only minor benefits from the Bakis model, indicate that the clear advantages of model constraints mainly arise if the relevant information does not appear as prominently in the data. This is further supported by a larger increase in HMM performance when not using the optimal subset size for LFTD features (see supplementary figure SI9 available at stacks.iop.org/JNE/10/056020/mmedia), which suggests that the Bakis approach tends to be more robust with respect to deviations from the optimal parameter setting.
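
To make the constraint concrete, the following sketch (Python; a toy illustration under assumed state counts, not the authors' code) initializes a Bakis-type left-to-right transition matrix in which a state may only persist or advance by at most max_jump states, and compares the number of trainable (non-zero) transition entries with that of an unconstrained ergodic model. Since entries initialized to zero remain zero under Baum-Welch re-estimation, the structure imposed at initialization is preserved throughout training.

    import numpy as np

    def bakis_transitions(n_states, max_jump=2):
        """Left-to-right (Bakis) initialization: from state i only states
        i..i+max_jump are reachable; all other transitions are fixed to zero."""
        A = np.zeros((n_states, n_states))
        for i in range(n_states):
            j_max = min(i + max_jump, n_states - 1)
            A[i, i:j_max + 1] = 1.0 / (j_max - i + 1)
        return A

    def ergodic_transitions(n_states):
        """Unconstrained model: every transition is allowed and must be estimated."""
        return np.full((n_states, n_states), 1.0 / n_states)

    n = 6  # assumed number of hidden states, for illustration only
    print('trainable transitions, Bakis  :', int(np.count_nonzero(bakis_transitions(n))))   # 15
    print('trainable transitions, ergodic:', ergodic_transitions(n).size)                   # 36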

Finally, it should be mentioned that the current test procedure does not fully allow for non-stationary behaviour. That is, the random trial selection does not take the temporal evolution of signal properties into account; these might change and blur the feature space representation over time (Lemm et al 2011). In this context, and depending on the extent of non-stationarity, the presented results may slightly overestimate the real decoding rates. This applies to both SVM and HMM. Non-stationarity should be addressed in an online setup by updating the HMM on-the-fly: training data should only be used from past recordings, and the signal characteristics of future test set data remain unknown. While online model updates are generally unproblematic for the HMM training procedure, further investigation of this issue is still needed.
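
A minimal sketch of such a causal evaluation scheme is given below (Python; trial counts and variable names are assumptions, not taken from the study): training indices always precede the test indices in time, so the model can be refitted, or updated on-the-fly, as new labelled trials arrive without ever touching future data.

    import numpy as np

    def causal_splits(n_trials, initial_train=40, step=10):
        """Yield (train_idx, test_idx) pairs that respect temporal order:
        the classifier never sees trials recorded after its test block."""
        start = initial_train
        while start + step <= n_trials:
            yield np.arange(0, start), np.arange(start, start + step)
            start += step

    # Usage: refit or update the HMM on each growing block of past trials and
    # evaluate it on the next, strictly later block, so that non-stationary
    # drift is reflected in the reported decoding rates.
    for train_idx, test_idx in causal_splits(n_trials=120):
        pass  # fit on trials[train_idx], score trials[test_idx]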


5 Conclusions

For a defined offline case we have shown that high decoding rates can be achieved with both the HMM and SVM classifiers in most subjects. This suggests that the general choice of classifier is of less importance once HMM robustness is increased by introducing model constraints. The performance gap between both machine learning methods is sensitive to feature extraction and particularly to electrode subset selection, raising interesting implications for future research. This study restricted HMMs to a rigid time window with a known event, a scenario that suits SVM requirements. This condition may be relaxed, and classification improved, by use of the HMM time-warp property (Sun et al 1994), which allows the same spatio-temporal pattern of brain activity to occur at different temporal rates.

By exploiting the actual potential of such models, we expect them to be individually adapted to specific BCI applications. The idea of adaptive online learning may be implemented to cope with non-stationary signal properties.

Future research may encompass extended and additional model constraints for special cases, as successfully used in ASR, such as fixed transition probabilities or the sharing of mixture distributions among different states. Finally, the incorporation of other sources of prior knowledge and context awareness may further optimize decoding accuracy.

Acknowledgments

The authors acknowledge the work of Fanny Quandt, who substantially contributed to the acquisition of the ECoG data. Supported by NINDS grant NS21135, Land Sachsen-Anhalt grant MK48-2009003 and ECHORD 231143. This work is partly funded by the German Ministry of Education and Research (BMBF) within the Forschungscampus STIMULATE (grant 03FO16102A).

