


Classification of Clinical Data using Sequential Patterns: A case study in Amyotrophic Lateral Sclerosis

André V. Carreiro∗†, Susana Pinto‡, Mamede de Carvalho‡, Sara C. Madeira∗†, Cláudia Antunes∗

Abstract. Until recently, knowledge discovery would be restricted to a static analysis, disregarding any temporal or sequential relations within the data. In the last decade, temporal data mining developed into a hot topic of research, looking for those temporal dependencies and unveiling new insights in various areas of interest, including bioinformatics. Sequential pattern mining tries to achieve such goals by finding frequent patterns within a population and returning them to the user. However, to our knowledge, its application as the basis for a direct classification problem with clinical data has never been studied. Hence, this work uses discovered sequential patterns as features for standard classifiers, on a clinical dataset obtained from Amyotrophic Lateral Sclerosis (ALS) patients. The preliminary results are very promising, achieving a prediction accuracy over 83% with a very reduced set of features drawn from both the original data and the sequential patterns. Future work includes advancing from a classification problem to prognosis prediction.

1 Introduction

Knowledge discovery has been applied for many decades, with the goal of extracting useful hidden information from data. Nonetheless, most of the approaches in this area would, until more recently, be limited to a static analysis, disregarding possible (and sometimes clear) temporal dependencies within the data. In the last decades, however, significant advances have been made with regard to this issue, resulting in several techniques for temporal mining [1, 2, 3, 4, 5, 6].

There are many applications of temporal mining in real-life problems [7], ranging from market analysis [1], telecommunications [5] and multimedia [8], evolution of financial data [6], and genetics [5], to healthcare, including treatment effectiveness [2, 3], prognosis prediction and diagnosis support [3, 4]. These last domains gain particular importance, given that a faster response to disease progression, or even prior to its onset, will certainly be of great value to patients, their families and healthcare providers.

∗Instituto Superior Técnico (IST), Technical University of Lisbon, Portugal
†Knowledge Discovery and Bioinformatics (KDBIO) Group, INESC-ID, Portugal
‡Neuromuscular Unit, Institute of Molecular Medicine and Faculty of Medicine, University of Lisbon, Portugal

In this work we focus on sequential pattern mining (SPM) and its application to medical problems.

We use a clinical dataset from Amyotrophic Lateral Sclerosis (ALS) patients with the same goal as Amaral [9], where standard machine learning methods are used to predict the need for non-invasive ventilation (NIV), but without sequential patterns (SPs) as features.

This paper is organized as follows: first, the related work is presented, including some techniques for SPM. We then describe the dataset used and report the main results from the application of SPM to classification, discussing the main conclusions and future work.

2 Background

2.1 Sequential Pattern Mining In this context, a sequence is an ordered list of sets of elements called items; the sum of the numbers of items in each set corresponds to the sequence length. The sets of items are named itemsets, also called transactions due to their application to market data, or evaluations of a batch of exams in a time window (corresponding to a time point) in the clinical context. Allowing a gap of δ in sequence matching, a sequence a = ⟨a_1 a_2 ... a_n⟩ is a δ-distance subsequence of another sequence b = ⟨b_1 b_2 ... b_m⟩ if there are integers i_1 < i_2 < ... < i_n such that a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, ..., a_n ⊆ b_{i_n} and i_k − i_{k−1} ≤ δ for all k ∈ {1, ..., n}, in which case we write a ⊆ b [12]. Note that a 1-distance subsequence corresponds to a contiguous subsequence.

Given a database of sequences, D, and a minimum support threshold, σ, a sequence is considered frequent if it is a subsequence of (i.e., supported by) at least σ sequences in D; in practice this threshold is usually expressed as a proportion of the total number of sequences in D. Finally, a sequential pattern is a frequent maximal sequence, and the goal of SPM is to find all the SPs [10, 11, 12].
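As a concrete illustration of the definitions above, the following Python sketch (not part of the original work; the item strings and helper names are hypothetical) tests whether a sequence is a δ-distance subsequence of another and computes the support of a pattern in a small database. With delta = 1 the pattern must match contiguously after its first matched itemset, as stated in the text.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]


def is_delta_subsequence(a: Sequence[Itemset], b: Sequence[Itemset], delta: int) -> bool:
    """True if a is a delta-distance subsequence of b: there exist indices
    i_1 < i_2 < ... < i_n with a_k a subset of b[i_k] and i_k - i_{k-1} <= delta."""
    def match(k: int, prev: int) -> bool:  # try to place a[k] after position prev (1-based)
        if k == len(a):
            return True
        last = len(b) if prev == 0 else min(len(b), prev + delta)
        return any(
            a[k] <= b[pos - 1] and match(k + 1, pos)
            for pos in range(prev + 1, last + 1)
        )
    return match(0, 0)


def support(pattern: Sequence[Itemset], database: List[Sequence[Itemset]], delta: int) -> int:
    """Number of sequences in the database that contain the pattern."""
    return sum(is_delta_subsequence(pattern, seq, delta) for seq in database)


# Hypothetical toy database of two patients (items are 'Attribute=Value' strings).
db = [
    [frozenset({"ALSFRS=low", "NIV=no"}), frozenset({"ALSFRS=low"})],
    [frozenset({"ALSFRS=high"}), frozenset({"ALSFRS=low"})],
]
print(support([frozenset({"ALSFRS=low"})], db, delta=1))  # -> 2
```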

2.1.1 Apriori-Based Methods Generalized Sequential Patterns (GSP) [10] was one of the first approaches for mining sequential patterns; it is based on the Apriori method for mining association rules [11].


The idea is to generate candidate sequences and test them. The main difference to Apriori is in the candidate generation method: it joins two patterns to form a new candidate whenever the maximal prefix of one is equal to the maximal suffix of the other. Additionally, GSP introduced a set of features respecting the ordered nature of the data, including gap constraints. These constraints are useful for limiting the number of patterns, by only considering δ-distance subsequences with a fixed value for δ.

2.1.2 Pattern-Growth Methods These methods were developed more recently with the goal of avoiding the candidate generation step completely, while focusing the search on a reduced set of the initial database [12].

One such algorithm is PrefixSpan [13], based on a recursive construction of the SPs. This is possible by means of projected databases: an α-projected database is the set of subsequences in the database that are suffixes of sequences with prefix α, which can simply be an itemset or a subsequence. To account for gap constraints, Generalized PrefixSpan (GenPrefixSpan) was created [12]. This generalization redefines the projected database construction: instead of restricting the search to the first occurrence of the element, every occurrence of the element is considered. For its performance and its support for gap constraints, GenPrefixSpan was chosen for this work.
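To make the notion of a projected database more concrete, here is a simplified Python sketch (an illustration under assumptions, not the authors' implementation: it projects on a single item and ignores the remainder of the matched itemset), contrasting the first-occurrence projection used by PrefixSpan with the every-occurrence projection of GenPrefixSpan.

```python
def project(database, item, every_occurrence=False):
    """Simplified item-projected database: for each sequence, collect the suffix(es)
    following an itemset that contains `item`. PrefixSpan-style projection keeps only
    the suffix after the first occurrence; a GenPrefixSpan-style projection (needed
    when gap constraints are enforced) keeps a suffix for every occurrence."""
    projected = []
    for seq in database:
        for pos, itemset in enumerate(seq):
            if item in itemset:
                projected.append(seq[pos + 1:])  # suffix after this occurrence
                if not every_occurrence:
                    break                         # first occurrence only
    return projected
```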

2.2 Pruning Sequential Patterns The number of found SPs can exceed several million for lower minimum support thresholds. Such a quantity is impractical to analyse, whether for simple mining and interpretation or for classification purposes, thus leading to the need to reduce the set of found SPs.

The first reduction, considering the subsequent classification problem, is to remove the trivial SPs, which contain only a single transaction with a single item (a single evaluation of a single feature) and therefore carry no sequential or feature-clustering information.

A possible method of further pruning is based on a minimum improvement criterion [14]. The idea, shared with other methods, is to prune the SPs which are not significantly more informative than other returned SPs. More specifically, for a pattern p, if the following criterion is met for any supersequence q of p, then p is excluded:

∀p, q : p ⊆ q ∧ support(p) ≤ support(q) + minImp ⇒ exclude p

Note that support(p) represents the number of sequences (or patients) in the database containing the subsequence p.

It is interesting to note that for minImp = 0 the algorithm returns exactly the set of closed SPs, whereas minImp = ∞ returns exactly the set of maximal SPs (theorems and proofs in [14]). At this point, it is worth discussing these two types of SPs. Maximal SPs [1] are frequent patterns that are not a subsequence of any other SP. Their major limitation is that many highly supported (and possibly crucial) patterns are excluded from the result set. Another approach is to mine the closed SPs, which are not a subsequence of any other pattern with exactly the same support. One of the main advantages of closed patterns is that all frequent patterns (and respective supports) can be generated from them, thus forming a condensed representation of the result set. Nonetheless, mining only closed patterns can yield an almost unnoticeable reduction, especially for sparse datasets [14].
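The pruning step can be sketched in Python as follows (illustrative only; the subsequence test ignores gap constraints for brevity, and patterns are assumed to be given together with their supports). Setting min_imp = 0 retains the closed patterns, while min_imp = math.inf retains the maximal ones, mirroring the criterion above.

```python
import math


def is_subsequence(p, q):
    """Unconstrained subsequence test: the itemsets of p are contained, in order,
    in itemsets of q (gap constraints are ignored in this sketch)."""
    it = iter(q)
    return all(any(ps <= qs for qs in it) for ps in p)


def prune_min_improvement(patterns, min_imp):
    """patterns: list of (pattern, support) pairs, where a pattern is a list of
    frozensets. A pattern p is excluded whenever some other pattern q is a
    supersequence of p with support(p) <= support(q) + min_imp."""
    kept = []
    for p, sup_p in patterns:
        excluded = any(
            q is not p and is_subsequence(p, q) and sup_p <= sup_q + min_imp
            for q, sup_q in patterns
        )
        if not excluded:
            kept.append((p, sup_p))
    return kept
```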

2.3 Related Work of SPM on Clinical and Biological Data In 1970, Lasker [15] studied sequences of symptoms for a set of patients and was able to identify some of the diseases based on SP recognition. At the beginning of the last decade, Ramirez et al. [16] applied temporal pattern discovery to an HIV dataset. The main goal was to determine whether people with the same chronic illness shared a similar experience along the evaluations. Although the authors use a Decision Tree classifier to determine the health status of a patient, the goal is still to mine significant patterns, not to use the patterns for classifying the health status. In a similar context, Lin et al. [17] presented a new SPM technique aimed at discovering time dependency patterns in clinical pathways for brain stroke patients. With information on such patterns, the healthcare procedures for new patients become, in turn, more effective and efficient.

Concaro et al. [18], in 2007, applied SPM to unveil association patterns of diagnoses shared by hospitals in the USA. The identified SPs can provide information about the most frequent healthcare episodes in the country and even expose temporal precedence between diseases, suggesting possible causal relationships.

More recently, Choi et al. [19] studied the possibility of predicting a patient's revisit to a public health center and applied SPM to anticipate diseases of patients who do revisit such centers. Some of the SPs found among the diseases of revisiting patients can provide a better and earlier understanding of foreseeable diseases, as well as adequate precautionary measures.

Tseng and Lee [20] introduced a method called Classifying-By-Sequence (CBS), where probabilistic induction is used to extract inherent SPs which can then be efficiently used for classification, although this was tested only on synthetic time series data.


Nonetheless, to our knowledge, classification of clinical data using SPM had never been studied. It is clearly crucial to improve the efficiency of healthcare providers, as well as to develop and refine techniques that unveil ever more specific temporal relations (between patients, symptoms, etc.), in order to achieve not only a faster diagnosis but also a more confident prognosis.

3 Methods

In this section, the dataset used (and its preprocessing) is described. Then, we discuss the classification strategies.

3.1 Amyotrophic Lateral Sclerosis Dataset This dataset consists of clinical and laboratory data. Following the work of [9], with expert validation, it comprises 29 different features (excluding the patient identification code) for 506 patients, resulting in 2694 evaluations (or transactions) in total.

The distribution of the number of transactions per patient is shown in Figure 1 (left). The ALS dataset presents an average of 5.3 transactions per sequence (temporal evaluations per patient), ranging from 1 to 22 time points in length, thus confirming that this dataset is appropriate for SPM, since its time series are relatively long and consistent across patients. Figure 1 (right) shows the distribution of the mean number of items in each patient's evaluations (or transactions), presenting an approximately normal curve where, on average, 18 different features were assessed.

3.2 Data Preprocessing One of the best known applications of SPM is to market basket data, where an example of a sequence, for the list of products bought by a customer, can be:

(Book1; Book2)(Book3; Reading Glasses)

which means that the customer is associated with two transactions: two books and, later, a pair of reading glasses together with a third book. In order to apply such methods to our clinical datasets, they have to go through some transformations so that their representation resembles a registry of transactions. Hence, the first step was to discretize the data according to expert guidelines already studied in [9]. This is crucial in SPM, since it can only be applied to categorized data. Then, all features were binarized, so that the end result consists of sequences such as

(Att1 = Val11; Att2 = Val21)(Att1 = Val12)

where Valij is the j-th value of the i-th attribute.
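A minimal sketch of this binarization step, under the assumption that each visit is available as a dictionary of already-discretized attribute values (the attribute names below are hypothetical), could look as follows:

```python
def to_item_sequence(visits):
    """visits: chronologically ordered list of dicts mapping an attribute name to its
    discretized value for one patient. Returns a sequence of itemsets in the
    'Attribute=Value' representation used for sequential pattern mining."""
    return [
        frozenset(f"{attr}={val}" for attr, val in visit.items() if val is not None)
        for visit in visits
    ]


# Hypothetical example: two evaluations of one patient after discretization.
visits = [{"ALSFRS": "high", "MIP": ">=60"}, {"ALSFRS": "low"}]
print(to_item_sequence(visits))
# e.g. [frozenset({'ALSFRS=high', 'MIP=>=60'}), frozenset({'ALSFRS=low'})]
```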

3.3 Classification The main goal of this work is to perform classification on clinical datasets, using SPM to achieve the predictions. First, however, we must define which feature will be used as the class. In this case, there is a feature associated with the patient's need for NIV, so the class indicates whether a patient, at some point during his or her evaluations, requires NIV. The class distribution is as follows: 228 patients never needed NIV throughout their evaluation (45%), and 278 patients need NIV in the final evaluation (55%, considered the baseline for performance). This is a simple classification task, leaving harder problems such as prognosis for future work. The idea is to start by using SPs as features for standard classifiers, comparing the performance with the case where the original features are used, and then checking whether a combination of the two types of data yields any improvement. The standard classifiers used (initially with default parameters) are from the machine learning package Weka, available at http://www.cs.waikato.ac.nz. These include the Decision Tree (DT) J48, k-Nearest Neighbor (KNN), Support Vector Machine (SVM) with the SMO implementation, and the Radial Basis Function (RBF) Network. A 5x10-fold cross-validation scheme is used, with random seeds {1, 11, 21, 31, 41}. Regarding performance evaluation, we retrieve several metrics, such as the prediction accuracy, confusion matrix, kappa statistic, F-measure, precision and recall.
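The evaluation protocol can be sketched as follows with scikit-learn stand-ins for the Weka classifiers (an approximation, not the original Weka setup; the RBF-network analogue is omitted, and the KNN and SVM settings shown are assumptions):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def five_times_ten_fold(X, y):
    """5x10-fold cross-validation with the random seeds reported in the text."""
    classifiers = {
        "DT": DecisionTreeClassifier(),
        "KNN": KNeighborsClassifier(n_neighbors=1),
        "SVM": SVC(kernel="linear"),
    }
    results = {}
    for name, clf in classifiers.items():
        scores = []
        for seed in (1, 11, 21, 31, 41):
            cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
            scores.extend(cross_val_score(clf, X, y, cv=cv, scoring="accuracy"))
        results[name] = (np.mean(scores), np.std(scores))
    return results
```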

3.3.1 Using Original Features after Temporal Aggregation Since the original dataset consists of different rows corresponding to the different evaluations of each patient, the representation had to be transformed so that each patient's information is contained in a single row, for comparison purposes (and for use in the enrichment phase). The permanent features, such as demographic data, were maintained. The variable features were handled with two different approaches. The first consisted of analysing the nominal variation between the initial and final evaluations, assigning a symbol accordingly: {U - Up, D - Down, N - No change}. The second approach calculated the variation between the initial and final records and divided it by the number of evaluations to obtain a (normalized) slope per time point. The resulting matrix is given as input to the standard classifiers. An example of these transformations can be found in Figure 2.
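A sketch of both aggregations (assuming numeric, non-missing values for the variable features; the normalization by the number of evaluations is inferred from the values shown in Figure 2):

```python
def aggregate_patient(evaluations, variable_features):
    """evaluations: chronologically ordered list of dicts (one per visit) with numeric
    values for the variable features. Returns the nominal variation (U/D/N) and the
    normalized slope between the first and last records, as in Figure 2."""
    first, last, n = evaluations[0], evaluations[-1], len(evaluations)
    nominal, slope = {}, {}
    for feat in variable_features:
        diff = last[feat] - first[feat]
        nominal[feat] = "U" if diff > 0 else "D" if diff < 0 else "N"
        slope[feat] = diff / n  # change per evaluation
    return nominal, slope


# Matches the ALS-FRS / R example of Figure 2 (three evaluations).
evals = [{"ALS-FRS": 30, "R": 11}, {"ALS-FRS": 30, "R": 12}, {"ALS-FRS": 8, "R": 12}]
print(aggregate_patient(evals, ["ALS-FRS", "R"]))
# ({'ALS-FRS': 'D', 'R': 'U'}, {'ALS-FRS': -7.33..., 'R': 0.33...})
```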

3.3.2 Using Sequential Patterns This approach consists of using the found SPs to build a binary matrix, M, with P rows and N columns (respectively, the number of patients and of SPs), where Mij = 1 if and only if patient i supports pattern j.


[Figure 1 data]
Left - Distribution of Transactions per Patient:
#Transactions:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
#Patients:     69 63 70 55 54 45 44 18 19 14 13  9  8  8  3  4  2  2  3  1  1  1

Right - Distribution of Mean Number of Items per Patient:
Mean #Items:    9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
#Patients:      1  2  2  4 10 21 41 53 75 66 71 42 37 28 23 15  4  7  0  3  1

Figure 1: Left - Distribution of the number of transactions (temporal evaluations) per patient for the ALS dataset. Right - Distribution of the mean number of items per patient for the ALS dataset. The mean number corresponds to the average value of the size (number of items) of all the transactions.

[Figure 2 data]
Original Data:
Gender  BMI    Age at onset  ALS-FRS  R
2       20.83  67            30       11
2       20.83  67            30       12
2       20.83  67            8        12

Nominal Variation:
Gender  BMI    Age at onset  ALS-FRS  R
2       20.83  67            D        U

Slope:
Gender  BMI    Age at onset  ALS-FRS  R
2       20.83  67            -7.333   0.333

Figure 2: Example of the transformation of the original data: nominal variation and numerical slope. Gender = 2 is Female; BMI stands for Body Mass Index (at first visit); ALS-FRS is the Functional Rating Scale and R is a parameter related to this scale.

Matrix M is then used as input to the mentioned standard classifiers, and the patient is classified as requiring NIV or not. It is important to stress that any SP containing the NIV attribute is removed from the analysis.
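A sketch of how such a matrix can be built (the subsequence test below ignores the gap constraint for brevity; the real test should use the same δ as the mining step):

```python
import numpy as np


def contains(pattern, seq):
    """Unconstrained subsequence test (a simplification of the delta-distance test)."""
    it = iter(seq)
    return all(any(ps <= qs for qs in it) for ps in pattern)


def build_sp_matrix(patient_sequences, patterns):
    """Binary matrix M (patients x patterns): M[i, j] = 1 iff patient i's sequence
    contains pattern j. SPs involving the NIV attribute are assumed to have been
    removed beforehand."""
    M = np.zeros((len(patient_sequences), len(patterns)), dtype=np.int8)
    for i, seq in enumerate(patient_sequences):
        for j, pat in enumerate(patterns):
            if contains(pat, seq):
                M[i, j] = 1
    return M
```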

3.3.3 Using Enriched Data Taking into consideration the improvement obtained in [21], we decided to assess whether there would be any improvement in the classification results when the original features were enriched with the SPs. Thus, the two previous matrices are merged and then given as input to the standard classifiers, as in the previous situations.

4 Results and Discussion

In this section the results of applying SPM are presented and discussed, accounting for the data characteristics.

4.1 Sequential Patterns As explained in the previous section, the main parameters of the SPM algorithm are the minimum support and the allowed gap. Several experiments were performed in order to assess the impact of changing these parameters on the resulting sequences, both in number and in length. Figure 3 shows the total number of discovered frequent sequences and their distribution according to their length (number of time points) for variable support (a) and gap (b).

Note that, for easier visualization, the vertical axis in Figure 3(a) is on a logarithmic scale. A minimum support of 0.2 (or 20%) returns a much higher number of SPs, reaching a total of over 3 million, whereas a value of 0.6 returns a total of 97 SPs. Regarding sequence length, the log scale must again be taken into account: it may seem that SPs of a single time point are much more numerous, when that is not the case. In fact, for a support of 0.2 (see Figure 3(a)), the most common length is 4, whereas it is 3 for 0.3, 2 for 0.4, and finally 1 for 0.5 and 0.6.

The analysis presented in Figure 3(b) does not require a vertical log scale, and thus one can easily compare and interpret the total number of SPs and their distribution according to their length. As expected, higher values of allowed gap yield a higher number of SPs, although this total tends to stabilize towards the end. However, the distribution according to sequence length is largely preserved, with the greatest variations occurring for lengths 2 and 3.


[Figure 3 data]
a) Variable minimum support (fixed gap = 0) - number of frequent sequences by length:
Length      0.2     0.3    0.4   0.5   0.6
7          5420       0      0     0     0
6        114666      20      0     0     0
5        934220    3281     11     0     0
4       1086182   14594    260     3     0
3        655448   16005    926    91     5
2        220347   13118   1306   187    28
1         25371    3925    814   250    64

b) Variable allowed gap (fixed support = 0.3) - number of frequent sequences by length:
Length       0      1      3      5      7
6           20     24     28     28     28
5         3281   3865   3885   3893   3893
4        14594  17719  18715  18893  19088
3        16005  20729  25720  27138  27654
2        13118  15849  18614  19963  20527
1         3925   3925   3925   3925   3925

Figure 3: Total number of frequent sequences and their distribution according to length: a) with variable minimum support and a fixed gap = 0 (log scale in the original chart); b) with variable allowed gap and a fixed minimum support of 0.3.

Finally, we chose to proceed with a fixed gap and a variable support threshold, given that the latter proved to be the greater source of variation in the found SPs.

4.2 Pruning the Sequential Patterns As aforementioned, the number of obtained SPs can be extremely high (see Figure 3), and thus, to be able to perform classification, these data have to be reduced. The first observation is that the number of SPs obtained with a minimum support of 0.2 is far beyond acceptable, so we restrict our further analyses to the interval {0.3, 0.4, 0.5, 0.6}. However, even for the other values of minimum support, we still end up with a large number of SPs, many of which may be of little value for classification purposes. Hence, pruning is applied, as discussed before. The minimum improvement threshold was tested for only two values, 0 and infinity (in fact, the largest possible integer), corresponding, respectively, to finding the closed and the maximal SPs [14].

The results of pruning the SPs can be found in Figure 4. As expected from the minimum improvement criterion, the number of closed SPs is higher than that of maximal ones. However, the reduction is not very significant, and it remains fairly consistent across the different values of minimum support, with the exception of 0.6, which has far fewer original SPs (under 100). Consequently, there are still thousands of SPs for minimum supports of 0.3 and 0.4, which would be unbearable for further classification applications. This led to the decision of introducing a parameter to greatly reduce the number of SPs used: keeping only the N most supported SPs.

This value was varied in the set {30, 100, 200, 500, 1000, 2000, 5000}, although for further analyses, according to the interestingness of the results, we restricted the variation to the set {30, 200, 1000}.

[Figure 4 data]
Minimum support     0.3    0.4   0.5   0.6
Original          50943   3317   531    97
Maximal Seq Patt  37370   2272   373    60
Closed Patt       50875   3277   498    79


Figure 4: Results of pruning with a minimum improvement threshold of 0 (closed patterns) and infinity (maximal patterns) vs. minimum support (log scale in the original chart).

Still from Figure 4, one can see that, for example, a maximum of 200 or 1000 most supported SPs returns the same number of pruned patterns for a minimum support of 0.6 (79 closed and 60 maximal SPs). Thus, when it is said that the 200 most supported SPs were used, this is, in fact, the allowed maximum.

4.3 Classification One of the primary goals of this work was to evaluate the performance of different standard classifiers when SPs are used as features, thus directly introducing the temporal information. We note that, to our knowledge, this kind of application had not been attempted before with clinical data.


First, the results of the classification problem using the original features after temporal data aggregation are presented. Then, we compare these results to the ones obtained when the found SPs are used as features. Finally, we assess the classifiers' behavior when we enrich the feature space with both original features and the discovered SPs.

4.3.1 Temporal Data Aggregation Figure 5 shows the prediction accuracy values of the classification problem using the original features, after an appropriate temporal aggregation, so that each patient is characterized by a single row reflecting the whole set of temporal evaluations.

[Figure 5 data] Temporal Data Aggregation (prediction accuracy, %):
                    DT     KNN    SVM    RBF
Nominal Variation  77.67  65.02  73.12  71.34
Slope              71.74  52.17  68.18  67.19

Figure 5: Prediction accuracy obtained when using the original features, after temporal data aggregation with nominal variation (Up, Down, No change) or slope values (change per evaluation). The Weka classifiers Decision Tree J48 (DT), k-Nearest Neighbor (KNN), Support Vector Machine (SVM) with the SMO implementation and Radial Basis Function (RBF) Network were used, under 5x10-fold cross-validation. Standard deviation bars are also shown.

It is clear from Figure 5 that when the aggregation is performed with nominal variation, using the alphabet {U, D, N} to encode changes from the initial evaluation to the last, the classification performance is significantly better than when the (numerical) slope values are used, even when analyzing other metrics such as the confusion matrices, kappa statistics, precision and recall, not shown here due to space restrictions (and given the approximately balanced data).

4.3.2 Sequential Patterns Figure 6 shows the prediction accuracy values obtained when using the discovered SPs as features for different standard classifiers.

It is worth discussing some particular behavior that is not immediately clear from Figure 6.

[Figure 6 data] Classification using only Sequential Patterns (prediction accuracy, %):
           DT     KNN    SVM    RBF
30 MS     78.02  72.47  77.26  75.26
200 MS    78.64  69.50  80.61  75.08
1000 MS   78.23  68.52  80.21  72.54

Figure 6: Prediction accuracy obtained when using only a maximum of the 30, 200 and 1000 most supported sequential patterns. The Weka classifiers Decision Tree J48 (DT), k-Nearest Neighbor (KNN), Support Vector Machine (SVM) with the SMO implementation and Radial Basis Function (RBF) Network were used, under 5x10-fold cross-validation. Standard deviation bars are also shown.

Note that the results in Figure 6 correspond to a minimum support of 0.5, which returned the best overall results.

Nevertheless, it is interesting to analyse these results with the ones from Figure 4 in mind. For all the possible values, the results change considerably only when the maximal SPs are used, because the reduction to closed patterns is much less significant. There are also changes when the allowed maximum is higher than the total number of returned SPs. This happens, for instance, when a maximum of 200 most supported SPs is allowed: since the total number for a minimum support of 0.6 is under 100 (see Figure 4), there is a significant change in performance in this situation. Similarly, a maximum of 1000 most supported SPs is higher than the total number of patterns for minimum supports of 0.5 and 0.6. As aforementioned, given that the results for the values 0.3 and 0.4 are considerably worse, the prediction accuracy values shown in Figure 6 correspond to a minimum support of 0.5, and all of its patterns are used, without reaching the maximum of 1000. Finally, concerning the minimum improvement pruning criterion, there is no clear advantage in using either closed or maximal SPs: for a maximum of 30 most supported SPs, the best result is associated with the maximal patterns, whereas for 200 and 1000 most supported SPs, the best results are obtained using closed patterns.

With expert validation, the most supported SPs can be studied to verify whether any of them bring new insights on disease progression.


[Figure 7 data]
Slope + Sequential Patterns (prediction accuracy, %):
           DT     KNN    SVM    RBF
30 MS     74.11  69.37  77.27  73.72
200 MS    77.08  69.57  78.06  73.72
1000 MS   76.88  68.97  78.46  72.92

Nominal Var. + Sequential Patterns (prediction accuracy, %):
           DT     KNN    SVM    RBF
30 MS     78.06  72.92  78.46  74.90
200 MS    78.66  71.34  79.64  76.09
1000 MS   77.47  70.55  80.24  74.31

Figure 7: Prediction accuracy obtained when using the original features, after temporal data aggregation, enriched with the information from the sequential patterns (30, 200 and 1000 most supported (MS)). The Weka classifiers Decision Tree J48 (DT), k-Nearest Neighbor (KNN), Support Vector Machine (SVM) with the SMO implementation and Radial Basis Function (RBF) Network were used, under 5x10-fold cross-validation. Standard deviation bars are shown.

4.3.3 Enrichment As stated in a previous section, enrichment with temporal information in the form of SPs has been shown to improve results [21]. Figure 7 shows the prediction accuracy values obtained with both types of aggregated data, nominal variation and slope, and a maximum of 30, 200 and 1000 SPs.

The first observation is that, again, the nominal variation returns higher prediction accuracy values than the numerical slope, although in this case the differences are less pronounced. Assessing a possible improvement in classification performance, the results in Figure 7 are higher than the ones shown in Figure 5, which used only the original features. However, the differences with respect to the results in Figure 6, using only the SPs, are insignificant. This could mean that the classifiers are discarding the original features almost completely, relying solely on the SPs.

To further analyse this question, feature selection (FS) was applied in order to verify whether, in fact, the original features would be discarded as irrelevant to the classification, and, on the other hand, whether there is a reduced set of SPs that is still useful for the classification problem. Figure 8 shows the corresponding results when using the BestFirst algorithm in Weka, although other FS algorithms were also tested.

The results obtained with FS are considerably better than before for most of the classifiers, especially KNN. As somewhat anticipated, the selected features include only one to three of the original features. Nevertheless, these results, even if only slightly higher than when using only SPs, are very interesting, taking into account that the whole set of SPs is reduced to only about seven. The selected features are shown in Table 1 (note that the kept features were similar for nominal variation and slope).

It is interesting that the attribute (MND = 2), meaning that the patient has no family history of Motor Neuron Disease (MND), is present in most kept SPs, while the original feature MND is discarded.

Table 1: Features selected by Weka’s BestFirst algo-rithm. * Only for slope aggregation ** Not presentfor the 30 most supported SPs. MND stands for fam-ily history of Motor Neuron Disease (1 - yes, 2 - no).MIP stands for Maximal Inspiratory Pressure (%). Ris the respiratory section of the ALS-FRS-R (RevisedFunctional Rating Scale). Sp02mean is the mean valueof oxygen saturation, Dips/h being another respiratoryparameter.

Original features: R; Weight*

Sequential patterns:
(MND=2; R<11)(MND=2)
(MND=2)(MND=2)(MND=2)
(MND=2; MIP<60)(MND=2)
(R<11; ALS-FRS-R<36)**
(SpO2min>80; Dips/h<4)**
(MND=2)(MND=2)**

4.3.4 Sensitivity to Parameter Optimization Since all these results were obtained using the default parameters of the Weka classifiers, we proceeded to study how sensitive the SP-based classifiers are to parameter variation. Thus, a simple grid search was performed for each of the classifiers and each classification approach (original features, SPs and enriched data). The best prediction accuracy values can be found in Figure 9, where it can be seen that the results are close to the ones obtained with default parameters. The complexity factor c of the SVM classifier seems to be the parameter to which the results are most sensitive.


[Figure 8 data]
Nominal Var. + Seq. Patterns (FS) (prediction accuracy, %):
           DT     KNN    SVM    RBF
30 MS     78.46  80.83  78.85  79.45
200 MS    81.82  81.62  78.66  79.05
1000 MS   81.82  81.62  78.66  79.05

Slope + Seq. Patterns (FS) (prediction accuracy, %):
           DT     KNN    SVM    RBF
30 MS     78.85  67.19  78.26  77.47
200 MS    78.66  75.10  77.27  77.67
1000 MS   78.66  75.10  77.27  77.67

Figure 8: Prediction accuracy obtained when using the original features, after temporal data aggregation, enriched with the information from the sequential patterns (30, 200 and 1000 most supported (MS)), with feature selection (FS) performed using BestFirst in Weka. The Weka classifiers Decision Tree J48 (DT), k-Nearest Neighbor (KNN), Support Vector Machine (SVM) with the SMO implementation and Radial Basis Function (RBF) Network were used, under 5x10-fold cross-validation. Standard deviation bars are shown.

For example, for the enriched data with FS, we obtain a prediction accuracy of 83.36% with the SVM (c = 100; polynomial degree = 3), whereas the default parameters (c = 1; polynomial degree = 1) returned a prediction accuracy under 79%.
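A scikit-learn approximation of this grid search (not the Weka SMO code used in the paper; the grid values shown are illustrative) could be:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC


def tune_poly_svm(X, y):
    """Grid search over the SVM complexity factor C and the polynomial kernel degree,
    mirroring the parameters discussed in the text."""
    grid = {"C": [0.1, 1, 10, 100], "degree": [1, 2, 3]}
    search = GridSearchCV(
        SVC(kernel="poly"),
        grid,
        scoring="accuracy",
        cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=1),
    )
    search.fit(X, y)
    return search.best_params_, search.best_score_
```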

[Figure 9 data] Results after Parameter Optimization (prediction accuracy, %):
            DT     SVM    KNN    RBF
Enrich_FS  82.21  83.36  81.42  78.85
Enrich     79.09  81.54  75.85  76.25
SP_FS      78.93  77.55  79.21  78.30
SP         79.80  80.71  73.95  75.49
Nom_FS     75.38  76.48  74.19  74.74
Nom        77.51  73.04  70.59  72.37

Figure 9: Prediction accuracy obtained with simple parameter optimization, using the original features after nominal variation aggregation (Nom), sequential patterns (SP) and enriched (Enrich) data, with and without feature selection (FS). The Weka classifiers Decision Tree J48 (DT), k-Nearest Neighbor (KNN), Support Vector Machine (SVM) with the SMO implementation and Radial Basis Function (RBF) Network were used, under 5x10-fold cross-validation. Standard deviation bars are shown.

5 Conclusions and Future Work

SPM has been used with success in several different domains, including bioinformatics.

However, to our knowledge, it had never been successfully applied to the classification of clinical time series data.

The data statistics can be a good indicator of whether a dataset is a suitable candidate for classification based on SPM. As seen for the ALS dataset, there is a significant number of longer sequences (more than two time points) and a reasonable mean number of items per transaction. Another crucial aspect for the application of SPM is the possibility of an appropriate discretization, which, in this case, was based on expert knowledge.

Several statistics were then shown for the obtained SPs, using the original set and also the closed and maximal patterns, to assess their influence. In fact, because this is a somewhat sparse dataset, we conclude that the pruning of non-closed SPs is almost unnoticeable. Regarding the classification method, the binary information of whether a patient contains a given SP is very simple. Nonetheless, the obtained results were very interesting, especially in comparison to [9], where comparable prediction accuracy and other metrics were obtained, although with extensive parameter search and optimization.

As expected, taking the sequential nature of the data into consideration improved the results compared with a simple variation of the original features between the initial and final evaluations. Nevertheless, when the combination of both types of data was initially used, the results were similar to the ones using only the SPs. To assess the influence of each type of data, FS was performed, resulting in a (marginally) better performance with a significant reduction of features (only one or two of the original features, and three to seven SPs, were kept). This is one of the most interesting results, since direct interpretation of these features is possible, which might bring new insights into the mechanisms behind ALS progression.


Moreover, with more optimized parameters, the enriched data with FS returned a prediction accuracy over 83%, whereas this value was 80.71% when using only SPs (77.55% with FS). Nonetheless, as this is preliminary work, the classification approach might be improved, for example, by using a method similar to Classify-By-Sequence [20], where the classifiable sequences are extracted from each class group separately rather than from the whole dataset. Even so, the J48 Decision Tree performed well, even when compared to the SVM classifier. This also favours interpretability, since the tree can be retrieved and possibly used for decision making.

Finally, it is our future goal to move from simple classification to prognosis prediction, anticipating that we could answer clinical questions such as: "What is the probability that a particular patient will require NIV within the following 9 months?" or "What is the safe time frame until the next appointment without risk of respiratory failure?".

Acknowledgments. This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT), under project contract PTDC/EIA-EIA/111239/2009 (Neuroclinomics) and doctoral grant SFRH/BD/82042/2011 to AVC.

References

[1] R. Agrawal and R. Srikant, Mining sequential patterns, Proc. Int'l Conf. Data Engineering (ICDE), (1995), pp. 3–14.
[2] J. Caraca-Valente and I. Chavarrías, Discovering Similar Patterns in Time Series, KDD, (2000), pp. 497–505.
[3] A. V. Carreiro, O. Anunciação, J. A. Carriço and S. C. Madeira, Prognostic Prediction through Biclustering-Based Classification of Clinical Gene Expression Time Series, Journal of Integrative Bioinformatics, 8:3 (2011), pp. 175–191.
[4] A. V. Carreiro, A. J. Ferreira, M. A. T. Figueiredo and S. C. Madeira, Towards a Classification Approach using Meta-Biclustering: Impact of Discretization in the Analysis of Expression Time Series, Journal of Integrative Bioinformatics, 9:3 (2012), pp. 207–222.
[5] G. Das, H. Mannila and P. Smyth, Rule Discovery from Time Series, KDD, (1998), pp. 16–22.
[6] M. Gavrilov, D. Anguelov, P. Indyk and R. Motwani, Mining the Stock Market: Which Measure is Best?, KDD, (2000), pp. 487–496.
[7] C. M. Antunes and A. L. Oliveira, Temporal Data Mining: an Overview, KDD Workshop on Temporal Data Mining, (2001).
[8] V. C. Tseng, P. C. Tseng and K. W. C. Lin, Mining Temporal and Spatial Object Relations in Multimedia Contents, Int'l Conf. on Wireless Networks, Communications and Mobile Computing, 2 (2005), pp. 1371–76.
[9] P. M. T. Amaral, S. Pinto, M. de Carvalho, P. Tomas and S. C. Madeira, Predicting the need for non-invasive ventilation in patients with Amyotrophic Lateral Sclerosis, ACM SIGKDD Workshop on Health Informatics (HI-KDD), (2012).
[10] R. Srikant and R. Agrawal, Mining Sequential Patterns: Generalizations and Performance Improvements, Proc. Int'l Conf. Extending Database Technology, (1996), pp. 3–17.
[11] R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, Proc. 20th Int'l Conf. on Very Large Data Bases (VLDB), (1994), pp. 487–499.
[12] C. M. Antunes and A. L. Oliveira, Generalization of Pattern-Growth Methods for Sequential Pattern Mining with Gap Constraints, Machine Learning and Data Mining in Pattern Recognition, 2734 (2003), pp. 239–251.
[13] J. Pei, J. Han et al., PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth, Proc. Int'l Conf. Data Engineering (ICDE 01), (2001).
[14] S. Prinke, M. Wojciechowski and M. Zakrzewicz, Pruning discovered Sequential Patterns using Minimum Improvement Threshold, ADBIS Workshop on Data Mining and Knowledge Discovery (ADMKD'2005), (2005).
[15] G. E. Lasker, Application of Sequential Pattern-Recognition Technique to Medical Diagnostics, Int'l Journal of Bio-Medical Computing, 1:3 (1970), pp. 173–186.
[16] J. C. G. Ramirez et al., Temporal Pattern Discovery in Course-of-Disease Data, IEEE Engineering in Medicine and Biology, 19:4 (2000), pp. 63–71.
[17] F. Lin, S. Chou, S. Pan and Y. Chen, Mining time dependency patterns in clinical pathways, Int'l Journal of Medical Informatics, 62 (2001), pp. 11–25.
[18] S. Concaro, L. Sacchi and R. Bellazzi, Temporal data mining methods for the analysis of the AHRQ archives, Proc. Am. Med. Inform. Assoc. Annual Symposium, (2007).
[19] K. Choi, S. Chung, H. Rhee and Y. Suh, Classification and Sequential Pattern Analysis for Improving Managerial Efficiency and Providing Better Medical Service in Public Healthcare Centers, Healthc. Inform. Res., 16 (2010), pp. 67–76.
[20] V. S. M. Tseng and C.-H. Lee, CBS: A new classification method by using sequential patterns, Proc. of the 2005 SIAM International Data Mining Conference, (2005), pp. 596–600.
[21] J. Barracosa and C. M. Antunes, Anticipating Teachers' Performance, Proc. International Workshop on Knowledge Discovery on Educational Data at the ACM International Conference on Knowledge Discovery and Data Mining (KDinED@KDD), (2011).


Classification and Diagnosis of Myopathy from EMG Signals*

Brian D. Bue†

Erzsébet Merényi‡†

James M. Killian§

Abstract. We present a methodology to predict the presence of myopathy (muscle disease) from intramuscular electromyography (EMG) signals. Myopathy is a form of neuromuscular disease affecting skeletal muscle fibers, resulting in muscle weakness. Many types of myopathy are serious and debilitating conditions that are difficult to diagnose and treat. Early detection of such diseases can potentially reduce both patient suffering and medical costs. Intramuscular electromyography is a standard clinical method used to diagnose neuromuscular disorders such as myopathies. By evaluating the shape and frequency of electrical action potentials produced by muscle fibers and captured in EMG measurements, a physician can often detect both the presence and the severity of such disorders. However, EMG measurements can vary significantly across different subjects, different muscles, and according to session-specific characteristics such as muscle fatigue and degree of contraction. By considering normalized, fixed-duration (0.5-2 sec) samples of regions known to be diagnostic in EMG signals measured at full muscle contraction, we can automatically detect the presence of myopathies across different subjects and muscles with ~90% accuracy. We argue that our methodology is more generally applicable than existing methods that depend upon accurate segmentation of individual motor unit action potential (MUAP) waveforms. We present a rigorous evaluation of our technique across different subjects and several different muscles.

Keywords: EMG, myopathy, classification, diagnosis, FFT, frequency domain

1 Automated Diagnosis of Myopathy from EMG Signals

Myopathy (muscle disease) is a form of neuromuscular disorder that results in muscle weakness due to dysfunctioning skeletal muscle fibers [2]. A wide variety of both acquired and hereditary myopathies have been identified, many of which are serious and often debilitating conditions that are difficult to accurately diagnose and treat [3]. Early detection of these diseases by clinical examination and laboratory tests can greatly reduce patient suffering and medical costs.

Moreover, data gathered during such examinations may lead to an improved understanding of the nature and treatment of such diseases, and allow the development of automated systems that assist diagnosis.

In clinical practice, intramuscular EMG is a standard method used to assess the neurophysiologic characteristics of skeletal muscles in order to diagnose neuromuscular diseases. EMG records electrical action potentials generated by groups of muscle fibers controlled by the same motor nerve, called a motor unit. These motor units are the basic functional units of the muscle that can be voluntarily activated. The shape of individual motor unit action potential waveforms reflects the status and structure of a given motor unit. EMG measurements from patients with myopathy differ from those of healthy subjects in that their recruited MUAPs usually have shorter duration, lower amplitude, and increased polyphasicity. Figure 1 illustrates the difference between 0.5 second samples of EMG traces from the deltoid muscle of a healthy subject vs. the deltoid of a subject with myopathy. These, and many additional subtleties, characterize the differences between healthy and abnormal subjects, depending on the nature and severity of the pathology, and are extensively discussed in the literature.

In recent years, a number of techniques have been proposed to classify EMG signals for medical diagnosis. Several authors (e.g., [8, 9, 4, 5, 7, 10]) propose segmenting the EMG data in the temporal domain into individual MUAP waveforms, which are then labeled and classified based upon (features derived from) the segmented waveforms. However, such techniques are limited in that they assume that individual MUAPs can be extracted from the data in a consistent and reliable manner. Extracting individual MUAPs may be difficult or impossible, since MUAPs at high muscle contraction are often in superposition, while pathologies of interest may not be observable at low muscle contraction. Most previous works analyze data obtained with low (less than full) muscle contraction.

Figure 1: 0.5 second samples of EMG traces from the deltoid muscle of a healthy subject (top) vs. a subject with myopathy (bottom).

* This work was partially supported by the Wheeler fund from the Baylor College of Medicine.
† Department of Electrical and Computer Engineering, Rice University, Houston, TX. ([email protected])
‡ Department of Statistics, Rice University, Houston, TX. ([email protected])
§ Department of Neurology, Baylor College of Medicine, Houston, TX. ([email protected])


Moreover, the ratio of recruited MUAPs is another indicator of the presence or absence of myopathy, and should also be taken into account, which is often not the case in previous work.

Given the issues with time-domain MUAP segmentation, classifying EMG data in the frequency domain may be a more robust approach. Some recent work has shown good results in classifying neuromuscular disease from EMG data in the frequency domain. For instance, [6] demonstrated 85% overall accuracy in classifying the EMG signals of 59 subjects in the frequency domain into Normal, Myopathy and Neuropathy classes. In this work, we also analyze EMG data in the frequency domain. Our work, however, is distinct from previous research in the following:

a) We consider EMG data measured at full muscle contraction, which improves the objective evaluation of per-subject and per-muscle characteristics;

b) We classify diagnostic regions of the full EMG signal in the frequency domain, rather than pre-segmented and manually-labeled individual MUAP waveforms; and

c) In addition to evaluating classification performance on data across different subjects, we also evaluate the characteristics of different muscles for diagnostic purposes.

We provide a rigorous evaluation of generalization capabilities via cross-validation, in contrast to a number of existing works (e.g., [8, 9, 5, 6]).

2 EMG Data Description

Our EMG data was collected at the EMG Laboratory in the Department of Neurology of the Baylor College of Medicine in Houston, TX, by (or under the direction of) Dr. James Killian, M.D. The data we consider consists of 15 EMG sessions from 8 different subjects, measured on one or more different muscles. Three of the subjects are female and the remaining subjects are male. The mean age of the subjects is 56.63 (std. dev. = 16.4) years. The currently available data are from the biceps brachii, triceps brachii, deltoid and vastus lateralis (VL), selected for their diagnostic utility by the physician. We use the term trace to denote a record of a "full" EMG session for a single subject on a single muscle. Each trace is collected using the following methodology: a monopolar needle electrode is inserted into a designated skeletal muscle in a proximal arm or leg. The signal is processed through a differential preamplifier to a Cadwell Sierra EMG machine amplifier (Cadwell Laboratories, WA, USA), which transfers the signal to a computer display and loudspeaker for clinical evaluation. The subject then exerts maximum contraction of the muscle under study as the electrode is moved by several millimeters until an adequate interferential muscle pattern of firing motor units is noted on the screen. A 60 sec sample is then recorded.

The process is repeated on 4 to 6 separate muscles and the captured traces from each muscle are stored for subsequent signal analysis.

In a post-labeling session the physician designates each trace as a member of one of the following five classes, based upon the observed severity of the pathology in the EMG signal: Healthy/Normal (Nor), Borderline Myopathy (Myo1), Mild Myopathy (Myo2), Moderate Myopathy (Myo3), Severe Myopathy (Myo4). The basis of the clinical diagnostic grading of abnormal myopathic motor units (individual motor units with durations of activity under 6 ms) is the estimated percentage of myopathic units relative to the total number of firing motor units. Borderline: 0-10% abnormal units; mild: 10-25%; moderate: 25-50%; severe: above 50%. This is a subjective grading based on visual and auditory analysis by co-author JK of the different muscle samples. While our work, in general, does include discrimination of all these classes, here we focus on the methodology of classification from full signals (as opposed to MUAPs) in the frequency domain and demonstrate its effectiveness on two classes, in favor of keeping the focus on the methodology in this short paper. We present classification results for the above five classes, using the methodology described here, in a subsequent paper.

Portions of the traces are not diagnostic and/or are saturated due to insertional activity or instrument tuning effects. To eliminate the non-diagnostic portions of each trace, the physician manually defines the diagnostic regions in each trace, which are temporally contiguous segments of varying length. While automated identification and separation of non-diagnostic regions is important, it is outside the scope of the present work.

3 Methodology

3.1 Data Preprocessing

We split the diagnostic regions of each trace into fixed slices of ns seconds in duration. We subsequently refer to each of these slices as a sample. Each sample is an m-dimensional vector capturing a temporally contiguous portion of a diagnostic region. We normalize each sample by its L2 norm, which maps the amplitudes of the samples to a common range. This allows us to reconcile, to some degree, amplitude differences between measurements on different muscles and different subjects at varying contraction levels, while retaining other differences between the waveforms. We then map each normalized sample into the frequency domain using the Fast Fourier Transform in MATLAB. We discard the symmetric portion of the frequency-domain samples, resulting in sample vectors of dimensionality m/2. Table 1 gives a summary of the samples we consider with sample duration ns = 0.5 sec.
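A NumPy sketch of this preprocessing (an analogue of the MATLAB pipeline described above, not the authors' code; taking the magnitude of the spectrum and the sampling_rate parameter are assumptions):

```python
import numpy as np


def preprocess_trace(diagnostic_region, sampling_rate, ns):
    """Split a diagnostic region of an EMG trace into fixed ns-second samples,
    L2-normalize each sample, map it to the frequency domain with the FFT,
    and keep only the non-symmetric half of the spectrum (m/2 dimensions)."""
    m = int(ns * sampling_rate)                 # time-domain dimensionality per sample
    n_slices = len(diagnostic_region) // m
    features = []
    for k in range(n_slices):
        sample = np.asarray(diagnostic_region[k * m:(k + 1) * m], dtype=float)
        sample = sample / np.linalg.norm(sample)          # L2 normalization
        spectrum = np.abs(np.fft.fft(sample))[: m // 2]   # discard symmetric half
        features.append(spectrum)
    return np.array(features)
```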


3.2 Classification

In this study, we consider the problem of classifying the frequency-domain samples as Normal or Myopathic. To achieve this, we group all of the samples labeled Myo1-Myo4 into a single superclass Myo*. However, several classes are poorly represented in terms of the number of samples, particularly the Normal and the Borderline Myopathy (Myo1) classes, which represent only 14.94% and 9.2% of the total samples, respectively. To mitigate this issue, we first balance the sampling distributions of the five classes (Normal, Myo1-Myo4) by augmenting the training data with Nresamp,j = Nmax − Nj samples, drawn with replacement from the training samples of each class j, where Nmax is the number of samples of the class with the maximum number of samples and Nj is the number of samples in class j. This balancing step ensures that samples of varying severity are equally represented, but leads to a sampling bias between the Normal class and the Myo* superclass. Consequently, we perform an additional balancing step by adding Nnormal = Nall − NMyo* samples from the Normal class to the training set, again sampling with replacement, where Nall is the total number of samples and NMyo* is the number of samples in the Myo* class. After balancing, we have a total of 524 samples for each of the Normal and Myo* classes, with the Myo* class consisting of 131 samples from each of the Myo1-Myo4 classes.
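A minimal sketch of this two-stage balancing follows, assuming integer class labels with 0 = Normal and 1-4 = Myo1-Myo4; the function and variable names are illustrative. Since the second balancing formula admits more than one reading, the sketch simply resamples the Normal class until it matches the size of Myo*, which is consistent with the final counts reported above (524 samples in each superclass).

```python
import numpy as np

rng = np.random.default_rng(0)

def balance_training_samples(X, y):
    """Two-stage resampling with replacement.
    y holds the five original labels: 0 = Normal, 1..4 = Myo1..Myo4."""
    # Stage 1: bring every severity class up to the size of the largest class.
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = list(range(len(y)))
    for c, n_c in zip(classes, counts):
        extra = rng.choice(np.flatnonzero(y == c), size=n_max - n_c, replace=True)
        idx.extend(extra.tolist())
    X_b, y_b = X[idx], y[idx]

    # Stage 2: resample the Normal class until it matches the size of Myo*.
    n_extra = np.sum(y_b != 0) - np.sum(y_b == 0)
    extra = rng.choice(np.flatnonzero(y_b == 0), size=n_extra, replace=True)
    idx2 = np.concatenate([np.arange(len(y_b)), extra])
    return X_b[idx2], (y_b[idx2] != 0).astype(int)  # binary labels: 0 = Normal, 1 = Myo*
```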

EMG signals may vary between different subjects or on different muscles. Consequently, it is crucial to evaluate classification accuracy when data from different subjects and/or muscles is used as training and test data. To achieve this, after balancing the samples as described above, we perform ten cross-validation splits, where in each split we use data from half of the subjects as test data and divide the remaining samples into training (3/8 of the total samples) and validation (1/8 of the total samples) sets. We ensure by random stratified sampling that the training, test and validation sets each contain instances from each of the Normal and Myo* classes and from each muscle group. The classifier we use is a linear Support Vector Machine (SVM). We select the SVM regularization parameter C from the set {0.01, 0.1, 1, 10, 100, 1000} as the value that yields the highest accuracy on the validation set. We report the mean and standard deviation of the classification accuracies produced on the test data in each split.
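The evaluation protocol for a single split can be sketched as follows with scikit-learn, under the assumption that per-sample subject identifiers are available as an array; names such as subject_ids and one_cv_split are ours, and the sketch stratifies by class only, whereas the procedure described above additionally stratifies by muscle group.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def one_cv_split(X, y, subject_ids, rng):
    """One cross-validation split: half of the subjects are held out for
    testing, and the remaining samples are divided 3:1 into training
    (3/8 of the total) and validation (1/8 of the total) sets."""
    subjects = np.unique(subject_ids)
    test_subjects = rng.choice(subjects, size=len(subjects) // 2, replace=False)
    test_mask = np.isin(subject_ids, test_subjects)
    X_test, y_test = X[test_mask], y[test_mask]

    # Stratified split of the remaining samples (by class only in this sketch).
    X_tr, X_val, y_tr, y_val = train_test_split(
        X[~test_mask], y[~test_mask], test_size=0.25,
        stratify=y[~test_mask], random_state=0)

    best_acc, best_clf = -1.0, None
    for C in [0.01, 0.1, 1, 10, 100, 1000]:
        clf = LinearSVC(C=C).fit(X_tr, y_tr)            # linear SVM
        acc = accuracy_score(y_val, clf.predict(X_val))
        if acc > best_acc:                              # keep the best C on validation data
            best_acc, best_clf = acc, clf
    return accuracy_score(y_test, best_clf.predict(X_test))
```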

4 Classification Results and Evaluation

4.1 Classification Accuracy vs. Sample Duration

We first evaluate the classification accuracy with respect to the sample duration ns. We consider ns values in the set {0.05, 0.1, 0.2, 0.5, 1, 2}. Table 2 gives the number of balanced samples and the sample dimensionality m/2 for each value of ns, and the corresponding mean and standard deviation of classification accuracies across the ten cross-validation splits. We observe that classification accuracy increases with increasing sample duration. The standard deviation also typically decreases, with the exception of ns = 2, where the high dimensionality and small quantity of samples produce slightly less stable results. However, this generally suggests that longer sample durations are desirable, despite the high dimensionality of the resulting feature space. Additionally, our results indicate that it is possible to predict the presence or absence of myopathies from relatively short portions of a full EMG trace.

4.2 Per-class, Per-muscle and Per-subject Evaluation

We now evaluate the performance of our methodology on the individual classes, muscles and subjects we consider in this work. For this evaluation we fix the sample duration ns to 0.5 sec, as this duration provides a reasonable number of samples (1048) to evaluate, at fairly high dimensionality (16000 dimensions per sample), and yields very good classification accuracy (90.4% on average).
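The per-class, per-muscle and per-subject figures reported below amount to computing the test accuracy separately for each value of the corresponding metadata field. A minimal helper, with illustrative names of our own, could look like this:

```python
import numpy as np

def groupwise_accuracy(y_true, y_pred, groups):
    """Accuracy computed separately for each group value,
    e.g. per class, per muscle or per subject."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    return {g: float(np.mean(y_pred[groups == g] == y_true[groups == g]))
            for g in np.unique(groups)}
```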

With respect to the Normal vs. Myo* classes, we observe considerably higher classification accuracy on the Myo* class (mean = 0.959, std. dev. = 0.023) than on the Normal class (mean = 0.822, std. dev. = 0.070). This is due to the fact that our data includes significantly fewer subjects with normal conditions.

Table 1: Summary of the EMG data for each muscle with sample duration ns = 0.5 sec. The total number of seconds of data for each class is provided; values in parentheses give the number of unique subjects for each muscle with respect to each class.

ns (sec)   # samples   m/2     Accuracy (std. dev.)
0.05       10528       1600    0.760 (0.058)
0.1        5256        3200    0.815 (0.059)
0.2        2616        8000    0.878 (0.042)
0.5        1048        16000   0.904 (0.033)
1          512         32000   0.966 (0.028)
2          256         64000   0.971 (0.041)

Table 2: Number of balanced samples and sample dimensionality (m/2) with respect to sample duration ns, and the corresponding mean classification accuracy and standard deviation across the ten cross-validation splits.



When we consider individual muscles (Table 3), we observe that the samples from the biceps and deltoid muscles tend to be misclassified more often than those from the triceps and VL muscles. A possible reason for this is that the biceps and deltoid muscles appear similar to one another in terms of EMG signals, but appear different from the triceps and VL muscles. This is also suggested by the results in Bischoff et al. [1], but further investigation on additional data is necessary to confirm this hypothesis in our case.

Table 4 gives the classification accuracies for the individual subjects and their respective traces. Most notable are the results for subject S10, whose biceps and deltoid traces are classified with accuracies 28.5% and 16.1% lower than the respective per-muscle mean accuracies shown in Table 3. Subject S10 represents a case where some muscles exhibit no observable pathology, while other muscles show signs of myopathy. While it is difficult to state conclusively without data from additional patients with similarly mixed pathologies, according to the physician this case may be a result of a borderline myopathy, and the training labels may need revision once sufficient evidence is available.

5 Discussion and Future Work

In this work, we evaluated a novel methodology for classifying EMG signals in the frequency domain. By considering, as training samples, Fourier transforms of normalized, fixed-length segments of the diagnostic regions of the full signals (as opposed to extracted MUAPs) measured at full contraction, we demonstrated high average generalization performance by a linear SVM classifier across individual subjects and different muscles. The average classification accuracy on test data increases from approximately 80% to 97% with the duration of the samples (0.1 to 2 sec, respectively), while the reliability, determined from ten cross-validation folds, simultaneously increases (the standard deviation decreases). Our analysis also suggests that detecting the presence of myopathy can be accomplished with very short duration samples of a full EMG trace.

The long-term, primary goal of our work is to develop a system that captures the physician's capability to diagnose a variety of neuromuscular disorders from EMG data, as well as to distinguish among the severity degrees of diseases such as the classes of myopathies listed in Section 2. While our classification accuracies are fairly high, this is of course a two-class case. Classifying the samples according to their severities is a more challenging task, and will require more elaborate and sophisticated experiments.

We also aim to classify EMG signals of patients with neurogenic disorders using our methodology. Because our methodology yields results comparable to previous analyses of EMG data from myopathic and neurogenic diseases (e.g., [6]), and based upon our preliminary experiments with five classes, we anticipate that our method will generalize well to such scenarios.

While the results presented here are encouraging, much additional analysis and development is needed in order to achieve the above goals and to make our system useful for clinicians. This includes systematically designed experiments with increasing amounts and complexity of data (an increased variety of subjects, muscles and diseases), testing increasingly sophisticated classification techniques to better align with real-life circumstances such as highly imbalanced sample sets, and intelligent identification of the feature subsets necessary for producing high-quality (high-accuracy and high-fidelity) classifications. For fully automated processing, it will also be necessary to develop techniques to segment an EMG signal into diagnostic and non-diagnostic regions, or to incorporate learning constraints to identify various non-disease-related conditions.

Acknowledgements The authors thank Penny Gregg at the EMG Laboratory of the Department of Neurology, Baylor College of Medicine, for her assistance with data collection, and Rice University graduate students Kai Du and Du Nguyen for data preprocessing and software modification efforts in the early stages of this work.

References

[1] C. Bischoff, E. Stålberg, B. Falck, and K. E. Eeg-Olofsson. Reference values of motor unit action potentials obtained with multi-MUAP analysis. Muscle & Nerve, vol. 17, no. 8, pp. 842–851, Aug. 1994.

[2] A. S. Blum and S. B. Rutkove. The Clinical Neurophysiology Primer, vol. 388. Humana Press, 2007.

[3] F. Buchthal. An Introduction to Electromyography. Copenhagen: Gyldendal, 1957.

Biceps          Deltoid         Triceps         VL
0.907 (0.087)   0.852 (0.072)   1.000 (0.000)   1.000 (0.000)

Table 3: Per-muscle accuracies from all subjects for ns = 0.5.

Subject   Average          Trace     Class   Trace Accuracy
S02       0.936 (0.050)    Biceps    Myo*    0.936 (0.050)
S03       0.958 (0.037)    Deltoid   Myo*    0.937 (0.055)
                           Triceps   Myo*    1.000 (0.000)
S04       1.000 (0.000)    VL        Myo*    1.000 (0.000)
S07       0.986 (0.022)    Biceps    Myo*    0.972 (0.043)
                           Deltoid   Myo*    0.984 (0.025)
                           VL        Myo*    1.000 (0.000)
S08       0.888 (0.007)    Deltoid   Nor     0.888 (0.007)
S09       0.975 (0.035)    Biceps    Myo*    0.951 (0.068)
                           Deltoid   Myo*    1.000 (0.000)
                           Triceps   Myo*    1.000 (0.000)
S10       0.789 (0.128)    Biceps    Myo*    0.622 (0.171)
                           Deltoid   Nor     0.691 (0.056)
                           VL        Myo*    1.000 (0.000)
S15       0.852 (0.028)    Deltoid   Nor     0.852 (0.028)

Table 4: Per-subject/trace classification accuracies for ns = 0.5.



[4] C. I. Christodoulou and C. S. Pattichis. Unsupervised pattern recognition for the classification of EMG signals. IEEE Transactions on Biomedical Engineering, vol. 46, no. 2, pp. 169–178, Feb. 1999.

[5] N. F. Güler and S. Koçer. Classification of EMG signals using PCA and FFT. Journal of Medical Systems, vol. 29, no. 3, pp. 241–250, Jun. 2005.

[6] N. F. Güler and S. Koçer. Use of support vector machines and neural network in diagnosis of neuromuscular disorders. Journal of Medical Systems, vol. 29, no. 3, pp. 271–284, Jun. 2005.

[7] R. Merletti and D. Farina. Analysis of intramuscular electromyogram signals. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 367, no. 1887, pp. 357–368, Jan. 2009.

[8] C. S. Pattichis, C. N. Schizas, and L. T. Middleton. Neural network models in EMG diagnosis. IEEE Transactions on Biomedical Engineering, vol. 42, no. 5, May 1995.

[9] C. S. Pattichis and C. N. Schizas. Genetics-based machine learning for the assessment of certain neuromuscular disorders. IEEE Transactions on Neural Networks, vol. 7, no. 2, pp. 427–439, Jan. 1996.

[10] M. B. I. Reaz, M. S. Hussain, and F. Mohd-Yasin. Techniques of EMG signal analysis: detection, processing, classification and applications. Biological Procedures Online, vol. 8, no. 1, pp. 11–35, Dec. 2006.
