

Fuzzy Clustering of Time-variant and invariant Features:

Application to Sepsis Outcome Prediction

Marta C. Ferreira*

* Technical University of Lisbon, Instituto Superior Técnico, Dept. of

Mechanical Engineering, CIS/IDMEC – LAETA, Av. Rovisco Pais, 1049-001 Lisbon, Portugal 2014.

ARTICLE INFO

Keywords: Data Mining, Machine Learning, Clustering, Time Series Analysis, Mixed Data, Septic Shock

ABSTRACT

This dissertation proposes a novel clustering method based on fuzzy c-means that is capable of handling information from both time-variant and time-invariant features. The new method, Mixed Clustering, successfully aggregates both data components to identify systems in a wide range of application domains, such as Medical, Management or Energy Systems.

The flexible formulation of the proposed methodology adapts to data sets with multivariate time series and to different distance-based similarity measures. In addition to the Euclidean distance, a distance based on the popular Dynamic Time Warping (DTW) method is used for time series similarity search; it is capable of overcoming the temporal misalignment between series that is commonly found in these applications.

The contribution of the Mixed Clustering approach is demonstrated for forecasting and classification problems. The first is addressed through its application to a meteorological system, for temperature and humidity forecasting based on geographical location. The method’s performance as a binary classifier is demonstrated with a medical application, where the goal is to predict the outcome of a patient diagnosed with septic shock through the analysis of physiological variables measured during a sampling period and of the patient’s demography, which is constant during their stay in an Intensive Care Unit. The machine learning process is tested under unsupervised and supervised alternatives. The application of the method showed that when the temporal information of the patient is poorer, the demographic information can improve the classifier’s performance.

1. Introduction

1.1. Knowledge Discovery in Databases

Current developments in data warehousing enable the storage of increasingly large data sets, leading to growth both in the amount of information available about any given system and in the analytical possibilities it provides.

The Knowledge Discovery in Databases (KDD) process focuses on methodologies for extracting useful knowledge from the information available in databases (Fayyad, Piatetsky-Shapiro, & Smyth, 1996).

First, the data relevant to the system under identification (the target data) is selected from the available database. The target data is then pre-processed: the information is cleaned, missing values are handled, and the data is adapted to the requirements of the analysis.

Figure 1-1 KDD Process

The data is then transformed, that is, consolidated into structures appropriate for the data mining method to be applied, in this case Mixed Clustering, which identifies patterns in the data.


The results obtained from the mined patterns are then interpreted in the original system’s field, finally yielding the desired useful knowledge.

The focus of this dissertation is the proposal of a new, efficient data mining method based on clustering, designated Mixed Clustering, for databases combining time-variant and time-invariant features. The method is valid for forecasting and classification problems and applicable to a diverse range of domains, from medical problems, climate analysis and power management to economic studies.

1.2. Time Series Data Mining

This innovative data mining method searches for

patterns and similarities in both data components, time

variant and invariant, combining the extracted

information to better characterize the data objects.

The mining of time series, and particularly their clustering, attracts considerable interest from researchers. The complexity of this type of data requires careful examination of the proposed algorithms (Rani & Sikka, 2012). While the time-invariant features are easily compared by a common and simple distance function, the Euclidean distance, the time-variant features, represented by time series, require a more complex analysis (Rani & Sikka, 2012).

Thus, a more modern measure is implemented for similarity search of time series: Dynamic Time Warping (DTW).

Figure 1-2 Euclidean and DTW matching of Time Series

This similarity measure is able to handle temporally misaligned time series, identifying similar tendencies and patterns even if they are out of phase in time.
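The difference between the two measures can be illustrated with a minimal sketch. The code below is not the implementation used in this work; it assumes the classic dynamic-programming formulation of DTW with a squared point-to-point cost and no warping window, and the function names `euclidean` and `dtw` are illustrative only.

```python
import math

def euclidean(a, b):
    """Point-by-point Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Classic O(len(a) * len(b)) DTW with squared point cost (assumed variant)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] matched again
                                 cost[i][j - 1],      # b[j-1] matched again
                                 cost[i - 1][j - 1])  # one-to-one step
    return math.sqrt(cost[n][m])

# The same bump, occurring one sampling point earlier in b than in a.
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]
print(euclidean(a, b), dtw(a, b))
```

For this pair the Euclidean distance penalizes the shift, while DTW can align the two bumps and reports no difference between the series.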

This measure has been successfully applied in areas

such as handwriting and online signature matching,

time series database search, computer vision,

surveillance and signal processing, (Gaudin &

Nicoloyannis, 2006).

1.3. Outline

This work is structured as follows: in section 2, the

mixed clustering concept is described and the

methodology presented. In section 3, the use of the

method’s outputs to solve a forecasting problem is

presented and applied to a Meteorological System,

followed by a demonstration and discussion of the

results. The method’s contribution to a classification

problem is demonstrated in section 4, and applied to a

Medical System, followed by a demonstration and

discussion of the results achieved. Finally, in section 5 the results of the different applications are reviewed and compared with previous work on the subject, followed by a set of suggestions for future work.

2. Clustering

2.1. Concept

Clustering is a data mining technique that aims to group similar data objects, based on the patterns identified, while distinguishing objects with distinct behaviours. It divides the data into clusters so that intra-group differences are smaller than inter-group differences. This concept is useful in a wide range of applications, from image analysis, wireless sensor network applications and population segmentation to bioinformatics (Liao, 2005).

Often, the information that describes a system is not all represented by the same type of data: there may be categorical, numerical and text features, as well as constant and time-varying features. In such cases, a clustering method capable of reconciling distinct data types becomes necessary.

In (Izakian, Pedrycz, & Jamal, 2013), a clustering method to handle spatiotemporal systems is proposed. These systems are characterized not only by temporal features but also by the spatial location at which they were measured. Geography, climatology and epidemiology are examples of application areas that rely on spatiotemporal data.

The methodology proposed in (Izakian, Pedrycz, & Jamal, 2013) expands the Fuzzy C-Means (FCM) clustering technique (Bezdek, Ehrlich, & Full, 1984) to handle spatiotemporal data by adding a weighting factor 𝜆 that sets the importance given to the temporal component. This factor greatly benefits the algorithm’s flexibility, allowing it to search for the best balance between the temporal and spatial contributions.

The aim of this dissertation is to extend this notion of spatiotemporal data to any data set containing different types of data, constant and time-varying, that may require specific treatment, by generalizing the spatiotemporal clustering methodology to databases with mixed data and multivariate time series.

We will show that there are benefits in successfully combining both data components to model systems in a wide range of application domains, such as Medical Care, Finance, Management and Energy Systems.

2.2. Mixed Clustering Methodology

When working with a database with time-variant and invariant features, the input data is considered as a concatenation of both data components:

\[ x_i = [\, x_i^s \mid x_i^t \,], \quad i = 1, \dots, n \tag{2.1} \]

The invariant component, represented by numeric values, is structured as follows:

\[ x_i^s = [\, x_{i,1}^s, \dots, x_{i,r}^s \,] \tag{2.2} \]

where r is the number of invariant features.

The time-variant data component, represented by multivariate time series, is structured as a three-dimensional matrix:

\[ x^t = \left[\, x_{i,j,k}^t \,\right] \in \mathbb{R}^{n \times q \times f} \tag{2.3} \]

In this format, each value is defined by 3 coordinates: \(i = 1, \dots, n\), indicating the sample number; \(j = 1, \dots, q\), the sampling point; and \(k = 1, \dots, f\), the feature.

The clustering method defines a set of prototypes, or centers, for each of the c clusters, each comprising a time-variant and an invariant component. The invariant component’s prototypes are determined by:

\[ v_l^s = \frac{\sum_{i=1}^{n} u_{l,i}^m \, x_i^s}{\sum_{i=1}^{n} u_{l,i}^m} \tag{2.4} \]

The time-variant prototypes require an expansion to deal with the increased dimensionality of the data. A three-dimensional structure was defined, with dimensions [c × q × f]:

\[ v_{l,k}^t = \frac{\sum_{i=1}^{n} u_{l,i}^m \, x_{i,k}^t}{\sum_{i=1}^{n} u_{l,i}^m} \tag{2.5} \]

where the fuzziness parameter, m, makes the partition fuzzier or crisper. The value \(u_{l,i}\) is an element of the partition matrix, U, which defines the degree to which each sample belongs to each cluster. Since this is a fuzzy clustering method, the membership of a sample i in a cluster l is a value in the interval \(u_{l,i} \in [0,1]\), subject to \(\sum_{l=1}^{c} u_{l,i} = 1\) and \(0 < \sum_{i=1}^{n} u_{l,i} < n\).
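The prototype update and the constraints on the partition matrix can be sketched as follows. This is only an illustration under stated assumptions: the data is held in nested Python lists, the time-variant array is stored feature-major as `xt[i][k][j]` (sample, feature, sampling point) purely for convenience, and the function names are hypothetical.

```python
def is_fuzzy_partition(u, tol=1e-9):
    """Check the three constraints on U; u[l][i] is the membership of sample i in cluster l."""
    c, n = len(u), len(u[0])
    in_range = all(0.0 <= x <= 1.0 for row in u for x in row)
    cols_ok = all(abs(sum(u[l][i] for l in range(c)) - 1.0) <= tol for i in range(n))
    rows_ok = all(0.0 < sum(u[l]) < n for l in range(c))
    return in_range and cols_ok and rows_ok

def update_prototypes(u, xs, xt, m=2.0):
    """Membership-weighted means of the invariant and time-variant components.

    xs[i][j]    : invariant feature j of sample i
    xt[i][k][j] : time-variant feature k of sample i at sampling point j
    Returns (vs, vt) with vs[l][j] and vt[l][k][j].
    """
    c, n = len(u), len(xs)
    vs, vt = [], []
    for l in range(c):
        w = [u[l][i] ** m for i in range(n)]   # weights u_{l,i}^m
        total = sum(w)
        vs.append([sum(w[i] * xs[i][j] for i in range(n)) / total
                   for j in range(len(xs[0]))])
        vt.append([[sum(w[i] * xt[i][k][j] for i in range(n)) / total
                    for j in range(len(xt[0][k]))]
                   for k in range(len(xt[0]))])
    return vs, vt
```

With memberships of 0.5 for every sample, both prototypes collapse to the plain mean of the data; with crisp memberships each prototype coincides with the mean of the samples assigned to it.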

The similarity between a sample and a cluster is then measured by the sample’s augmented distance to the cluster’s center, given by:

\[ d_\lambda^2(v_l, x_i) = \left\| v_l^s - x_i^s \right\|^2 + \lambda \sum_{k=1}^{f} \delta\!\left(v_{l,k}^t, x_{i,k}^t\right) \tag{2.6} \]

where \(\delta(v_{l,k}^t, x_{i,k}^t)\) is the distance between the k-th time-variant feature of prototype l and that of sample i, calculated with the chosen distance function, and 𝜆 is a parameter that defines the influence given to the time-variant features. The optimal value of this parameter is determined by running the clustering process sequentially for different values of 𝜆 and choosing the one that yields the best performance.
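A minimal sketch of the augmented distance is given below. It assumes the squared Euclidean distance as the default choice for the per-feature distance δ (any other series distance, such as a DTW-based one, could be passed in instead); the function and argument names are illustrative.

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def augmented_sq_distance(vs_l, vt_l, xs_i, xt_i, lam, delta=sq_euclidean):
    """Invariant squared distance plus lam times the summed per-feature series distances.

    vs_l, xs_i : invariant parts of prototype l and sample i
    vt_l, xt_i : lists of f series (one per time-variant feature)
    delta      : series distance, assumed here to be the squared Euclidean distance
    """
    invariant = sq_euclidean(vs_l, xs_i)
    temporal = sum(delta(vt_l[k], xt_i[k]) for k in range(len(xt_i)))
    return invariant + lam * temporal

# lam = 0 ignores the time series entirely; larger values give them more weight.
d0 = augmented_sq_distance([0, 0], [[0, 0], [1, 1]], [3, 4], [[1, 1], [1, 3]], lam=0.0)
d1 = augmented_sq_distance([0, 0], [[0, 0], [1, 1]], [3, 4], [[1, 1], [1, 3]], lam=0.5)
print(d0, d1)
```

Searching for the best 𝜆 then amounts to repeating the clustering for each candidate value and keeping the one with the best score under the chosen criterion.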

By adding the distances of all features for each sample, the matrix of distances maintains its dimension [c × n], resulting in a meaningful partition matrix defined, as for a univariate time-series system, by:

\[ u_{l,i} = \frac{1}{\sum_{o=1}^{c} \left( \dfrac{d_\lambda(v_l, x_i)}{d_\lambda(v_o, x_i)} \right)^{2/(m-1)}} \tag{2.7} \]
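The membership update can be sketched as below, working directly on the [c × n] matrix of squared augmented distances. The handling of a sample that coincides with a prototype (zero distance) is an assumption added here to avoid a division by zero; it is not described in the text.

```python
def update_memberships(d2, m=2.0):
    """Compute u[l][i] from squared augmented distances d2[l][i]."""
    c, n = len(d2), len(d2[0])
    u = [[0.0] * n for _ in range(c)]
    for i in range(n):
        col = [d2[l][i] for l in range(c)]
        zeros = [l for l in range(c) if col[l] == 0.0]
        if zeros:
            # Assumed convention: full membership shared among coinciding prototypes.
            for l in zeros:
                u[l][i] = 1.0 / len(zeros)
            continue
        for l in range(c):
            # (d_l / d_o)^(2/(m-1)) equals (d_l^2 / d_o^2)^(1/(m-1))
            u[l][i] = 1.0 / sum((col[l] / col[o]) ** (1.0 / (m - 1.0))
                                for o in range(c))
    return u

u_demo = update_memberships([[1.0, 4.0], [3.0, 0.0]])
print(u_demo)
```

Each column of the result sums to one, so the output is a valid fuzzy partition whenever no cluster ends up empty.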

Since the objective function J depends directly only on the distances and membership degrees, it can be defined as for a univariate time-series system:

\[ J = \sum_{l=1}^{c} \sum_{i=1}^{n} u_{l,i}^m \, d_\lambda^2(v_l, x_i) \tag{2.8} \]

The clustering process continues until the objective function converges or the maximum number of iterations is reached.
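Putting the steps together, the iteration can be sketched as a compact loop. This is a simplified illustration rather than the implementation used in this work: it assumes a random initial partition, the squared Euclidean distance for both components, a small floor on the distances to avoid division by zero, and the feature-major layout `xt[i][k][j]`; the name `mixed_fcm` and its defaults are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mixed_fcm(xs, xt, c=2, lam=1.0, m=2.0, eps=1e-5, maxit=100, seed=0):
    """xs[i]: invariant vector of sample i; xt[i][k]: series of time-variant feature k."""
    rng = random.Random(seed)
    n, f = len(xs), len(xt[0])
    # Random initial partition, each column normalised to sum to 1.
    u = [[rng.random() + 1e-3 for _ in range(n)] for _ in range(c)]
    for i in range(n):
        s = sum(u[l][i] for l in range(c))
        for l in range(c):
            u[l][i] /= s
    prev_j = None
    for _ in range(maxit):
        # Prototypes: membership-weighted means of both components.
        vs, vt = [], []
        for l in range(c):
            w = [u[l][i] ** m for i in range(n)]
            tot = sum(w)
            vs.append([sum(w[i] * xs[i][j] for i in range(n)) / tot
                       for j in range(len(xs[0]))])
            vt.append([[sum(w[i] * xt[i][k][j] for i in range(n)) / tot
                        for j in range(len(xt[0][k]))] for k in range(f)])
        # Augmented squared distances between every prototype and sample.
        d2 = [[sq_dist(vs[l], xs[i])
               + lam * sum(sq_dist(vt[l][k], xt[i][k]) for k in range(f))
               for i in range(n)] for l in range(c)]
        # Membership update, with a floor on the distances (assumption).
        for i in range(n):
            col = [max(d2[l][i], 1e-12) for l in range(c)]
            for l in range(c):
                u[l][i] = 1.0 / sum((col[l] / col[o]) ** (1.0 / (m - 1.0))
                                    for o in range(c))
        # Objective function and stopping test on its variation.
        j_val = sum(u[l][i] ** m * d2[l][i] for l in range(c) for i in range(n))
        if prev_j is not None and abs(prev_j - j_val) < eps:
            break
        prev_j = j_val
    return u, vs, vt, j_val

# Toy data: two samples near 0 and two near 5, one invariant and one time-variant feature.
xs = [[0.0], [0.1], [5.0], [5.1]]
xt = [[[0.0, 0.0]], [[0.1, 0.0]], [[5.0, 5.0]], [[5.0, 5.1]]]
u, vs, vt, j = mixed_fcm(xs, xt, c=2, lam=1.0)
```

On this toy data the first two samples end up with their highest membership in one cluster and the last two in the other.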

3. Forecasting Problem – Meteorological System

3.1. Modelling

The Alberta Agriculture and Rural Development (ARD) organization provides current and historical weather data from approximately 340 meteorological stations located across the Canadian province of Alberta, mapped in Figure 3-1. The meteorological variables available include temperature, humidity, precipitation and solar radiation, and are of great interest to users such as epidemiologists seeking to better understand, for instance, the relationships between measures of environmental health and those of animal health. This platform, available at (ARD), is also valuable for environmental and agricultural analysis.

Figure 3-1 Map of the province of Alberta, Canada, the area where the meteorological stations are located

The Alberta province covers areas with different

geographical and meteorological profiles that

characterize these locations, including mountains,

valleys, lakes and arid areas.

For these experiments, the daily average temperature and daily average humidity records from 1/1/2009 to 12/31/2009 were considered, forming the time-variant input features. The time-invariant features consisted of the latitude and longitude of the station at which they were measured.

All stations in which all features were available and

had no missing values were considered, resulting in 168

samples.

The time series were represented by the Discrete

Fourier Transform (DFT).
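As an illustration of this representation, the sketch below computes the DFT of a series directly from its definition and keeps the magnitudes of the first few coefficients as a compact descriptor. The number of coefficients retained and the use of magnitudes are assumptions made here for the example; the text does not specify how the coefficients were used.

```python
import cmath

def dft(series):
    """Discrete Fourier Transform computed from the definition, O(n^2)."""
    n = len(series)
    return [sum(series[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def dft_features(series, n_coeffs):
    """Magnitudes of the first n_coeffs DFT coefficients (assumed descriptor)."""
    return [abs(z) for z in dft(series)[:n_coeffs]]

constant = dft_features([1.0, 1.0, 1.0, 1.0], 2)    # all energy in coefficient 0
wave = dft_features([1.0, 0.0, -1.0, 0.0], 2)       # all energy in coefficient 1
print(constant, wave)
```

For a year of daily values (q = 365), a fast transform from a numerical library would normally replace this direct implementation.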

Fuzziness parameter: 𝑚 = 2

Number of samples: 𝑛 = 168

Number of time invariant features: 𝑟 = 2

Number of time variant features: 𝑓 = 2

Time variant feature’s length: 𝑞 = 365

3.2. Experimental Setup

The Mixed Clustering methodology was applied to the Meteorological System under two distinct criteria. The first, the Reconstruction Criterion (RC), evaluates the cluster validity, while the second, the Prediction Criterion (PC), evaluates the method’s forecasting ability.

Reconstruction Criterion

The RC assesses the quality of the clusters by attempting to recreate the original data from them. Defining \(\hat{x}\) as the reconstructed data, its invariant and time-variant components are respectively defined as

\[ \hat{x}_i^s = \frac{\sum_{l=1}^{c} u_{l,i}^m \, v_l^s}{\sum_{l=1}^{c} u_{l,i}^m} \tag{3.1} \]

\[ \hat{x}_{i,k}^t = \frac{\sum_{l=1}^{c} u_{l,i}^m \, v_{l,k}^t}{\sum_{l=1}^{c} u_{l,i}^m}, \quad k \in [1, f] \tag{3.2} \]

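The Average Reconstruction Error (ARE) is calculated as:

\[ ARE(\lambda) = \frac{1}{n} \left( \frac{1}{r} \sum_{i=1}^{n} \sum_{j=1}^{r} \frac{\left(x_{i,j}^s - \hat{x}_{i,j}^s\right)^2}{\sigma_j^2} + \frac{1}{f \times q} \sum_{i=1}^{n} \sum_{k=1}^{f} \sum_{j=1}^{q} \frac{\left(x_{i,j,k}^t - \hat{x}_{i,j,k}^t\right)^2}{\sigma_j^2} \right) \tag{3.3} \]

where \(\sigma_j^2\) is the variance of the j-th feature.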
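The reconstruction step and the resulting error can be sketched as follows. Two simplifying assumptions are made for the example: the f time series of each sample are flattened into a single vector of length f × q, and the variance is computed per column of the data (that is, per invariant feature and per sampling point of each time-variant feature). Columns with zero variance are not handled.

```python
def reconstruct(u, v, i, m=2.0):
    """Membership-weighted blend of the prototypes v[l] for sample i."""
    w = [u[l][i] ** m for l in range(len(u))]
    tot = sum(w)
    return [sum(w[l] * v[l][j] for l in range(len(u))) / tot
            for j in range(len(v[0]))]

def variance(column):
    mu = sum(column) / len(column)
    return sum((x - mu) ** 2 for x in column) / len(column)

def normalised_sq_error(x, x_hat):
    """Sum over samples and columns of squared error divided by the column variance."""
    n, p = len(x), len(x[0])
    var = [variance([x[i][j] for i in range(n)]) for j in range(p)]
    return sum((x[i][j] - x_hat[i][j]) ** 2 / var[j]
               for i in range(n) for j in range(p))

def are(xs, xt, u, vs, vt, m=2.0):
    """Average reconstruction error; xt[i] and vt[l] are flattened series (length f*q)."""
    n, r, fq = len(xs), len(xs[0]), len(xt[0])
    xs_hat = [reconstruct(u, vs, i, m) for i in range(n)]
    xt_hat = [reconstruct(u, vt, i, m) for i in range(n)]
    return (normalised_sq_error(xs, xs_hat) / r
            + normalised_sq_error(xt, xt_hat) / fq) / n

xs = [[0.0], [2.0]]
xt = [[0.0, 1.0], [2.0, 5.0]]
crisp = are(xs, xt, [[1.0, 0.0], [0.0, 1.0]], xs, xt)   # prototypes equal the samples
fuzzy = are(xs, xt, [[0.5, 0.5], [0.5, 0.5]], xs, xt)   # every sample maps to the mean
print(crisp, fuzzy)
```

A perfectly crisp partition whose prototypes coincide with the samples reconstructs the data exactly, while a fully fuzzy one reconstructs every sample as the global mean and is penalized accordingly.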

Prediction Criterion

The aim of the PC is to predict the temporal

component of the data by using the available spatial

component of the data, minimizing the resulting error

by adjusting the temporal influence parameter 𝜆.

A partition matrix is estimated from the invariant data and prototypes:

\[ \tilde{u}_{l,i} = \frac{1}{\sum_{o=1}^{c} \left( \dfrac{\left\| v_l^s - x_i^s \right\|}{\left\| v_o^s - x_i^s \right\|} \right)^{2/(m-1)}} \tag{3.4} \]

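The Average Prediction Error (APE) is then calculated as:

\[ APE(\lambda) = \frac{1}{n \times f \times q} \sum_{i=1}^{n} \sum_{k=1}^{f} \sum_{j=1}^{q} \frac{\left(x_{i,j,k}^t - \tilde{x}_{i,j,k}^t\right)^2}{\sigma_j^2} \tag{3.5} \]

where \(\tilde{x}^t\) denotes the temporal component predicted from the estimated partition matrix \(\tilde{u}\).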
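The prediction step can be illustrated with the sketch below. It assumes that the predicted series is obtained by blending the temporal prototypes with the memberships estimated from the invariant part alone, in the same way as in the reconstruction step; this blending rule, the flattened layout of the prototypes and the function names are assumptions of the example.

```python
import math

def estimate_memberships(vs, xs_i, m=2.0):
    """Memberships of one sample computed from its invariant features only."""
    d = [math.dist(v, xs_i) for v in vs]
    if 0.0 in d:
        # Sample coincides with one or more prototypes (assumed convention).
        hits = d.count(0.0)
        return [1.0 / hits if x == 0.0 else 0.0 for x in d]
    return [1.0 / sum((d[l] / d[o]) ** (2.0 / (m - 1.0)) for o in range(len(d)))
            for l in range(len(d))]

def predict_series(vs, vt, xs_i, m=2.0):
    """Blend the temporal prototypes vt[l] using the estimated memberships."""
    u = estimate_memberships(vs, xs_i, m)
    w = [x ** m for x in u]
    tot = sum(w)
    return [sum(w[l] * vt[l][j] for l in range(len(vt))) / tot
            for j in range(len(vt[0]))]

# Hypothetical prototypes: two locations, each with a two-point temporal profile.
vs = [[0.0, 0.0], [4.0, 0.0]]
vt = [[10.0, 10.0], [20.0, 30.0]]
at_prototype = predict_series(vs, vt, [0.0, 0.0])
midway = predict_series(vs, vt, [2.0, 0.0])
print(at_prototype, midway)
```

A station located exactly at a prototype inherits that prototype's profile, and a station halfway between two prototypes receives their average.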

The stopping criteria for the clustering algorithm in this experiment were the following:

Minimal variation of the objective function: |ΔJ| < ε = 10^-5

Maximum number of iterations: maxit = 100

3.3. Results and Discussion

Reconstruction Criterion

The RC was applied to each time-variant feature, humidity or temperature, individually and to the combination of both in a multivariate approach, each using between 2 and 5 clusters, with both the Euclidean distance and DTW for similarity search.

It was observed that the multivariate alternative did not improve the quality of the clusters according to this criterion, and that the best results were obtained for the temperature feature, with 5 clusters and the Euclidean distance. Figure 3-2 plots the analysed stations by geographical location, coloured according to the cluster in which they have the highest membership degree, under the best RC conditions. Four stations in different regions are highlighted.

Figure 3-2 Geographical Distribution under best RC

conditions, c=5

It is clear that the method was capable of recognizing and distinguishing the areas with the most different climatic profiles.

Prediction Criterion

The PC was applied under the same experimental conditions as the RC: multivariate and univariate time series, with the Euclidean distance and DTW as similarity measures, here for a number of clusters between 2 and 10.

The best result was obtained using the multivariate approach, again with the Euclidean distance, and 8 clusters.

These conditions were used to forecast the temperature and humidity. The samples were separated into training and testing sets:

\[ x_{train} = [\, x_{train}^s \mid x_{train}^t \,] \tag{3.6} \]

and

\[ x_{test} = [\, x_{test}^s \mid x_{test}^t \,] \tag{3.7} \]

The procedure followed is described in Figure 3-3.

Figure 3-3 Workflow representing process for temporal

forecasting of test set

In this experiment, around 70% of the samples were

used as train set, 𝑛𝑇𝑟𝑎𝑖𝑛 = 117, while the rest was

used as test set. The forecasting results of humidity and

temperature of one exemplary test sample, under the

best conditions, are shown in Figure 3-4 and Figure 3-5,

respectively.

Figure 3-4 Humidity prediction under best PC conditions

Figure 3-5 Temperature prediction under best PC conditions

In the forecasting problem, DTW showed no improvement over the Euclidean distance as a similarity measure. The multivariate approach achieved the best forecasts of temperature and humidity during 2009 at the selected stations.

4. Classification Problem – Medical System

An analogy was drawn with the spatiotemporal concept: in medical applications, the geographical location becomes the patient’s demography (age, weight, height, sex, among other possibilities), while the temporal component comprises all time-varying features that characterize the system, such as heart rate, blood pressure and body temperature, measured over a period of time and represented as time series.


4.1. Modelling

Septic shock is a medical emergency that can occur as a reaction of the immune system to, for example, an operation. It is estimated to affect about 12% of patients in an Intensive Care Unit (ICU) and has a high death rate, which is reported to depend on the patient’s age and overall health.

The database used, MEDAN, comprises several

physiological features of patients diagnosed with

abdominal septic shock, uniformly sampled during the

whole period while the patient was at the ICU, (Paetz,

2003). This database was pre-processed by (Marques,

Moutinho, Vieira, & Sousa, 2011), who analysed the

most determinant features for outcome prediction,

creating a sub dataset of patients with measurements of

12 of the available features.

This data underwent further processing, resulting in a data set of 100 samples, each comprising:

• 2 time-invariant features: the patient’s age and weight, each represented by a numeric value;
• 12 time-variant features: physiological variables represented by time series with a sampling time of 24 hours, over the last 10 days of the patient’s stay in the ICU;
• 1 outcome: a binary value where 0 represents the patient’s survival and 1 the patient’s death.

4.2. Experimental Setup

The concept of classification based on clustering

assumes that similar objects will share outcomes, and

uses this knowledge to predict an object’s classification.

The classification approach proposed in this work is

based on this concept and defines an object as

belonging to a cluster if its membership degree is higher

than a certain threshold. It then assumes that objects

grouped together must share the same outcome. Thus,

this concept is only valid for binary classifiers using

two clusters, c=2.

To evaluate the method’s ability to predict an object’s outcome, a 5-fold cross-validation was performed.

At each fold, the training set is clustered to determine the optimal 𝜆∗ and the resulting prototypes 𝑣∗. The membership degrees of each test set sample are then determined from its distance to each cluster prototype, and the predicted outcome is assigned according to the highest membership degree.
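The classification step can be sketched as below. The text does not state how an outcome is attached to each cluster, so the example assumes a majority vote among the training samples that each cluster wins; this labelling rule and all function names are hypothetical.

```python
def memberships(d2, m=2.0):
    """Membership degrees of one sample from its squared distances to the prototypes."""
    if 0.0 in d2:
        hits = d2.count(0.0)
        return [1.0 / hits if x == 0.0 else 0.0 for x in d2]
    return [1.0 / sum((d2[l] / d2[o]) ** (1.0 / (m - 1.0)) for o in range(len(d2)))
            for l in range(len(d2))]

def label_clusters(u_train, y_train):
    """Assumed rule: each cluster takes the majority outcome of the samples it wins."""
    c = len(u_train)
    votes = [[0, 0] for _ in range(c)]
    for i, y in enumerate(y_train):
        winner = max(range(c), key=lambda l: u_train[l][i])
        votes[winner][y] += 1
    return [0 if v[0] >= v[1] else 1 for v in votes]

def predict_outcome(d2, cluster_labels, m=2.0):
    """Outcome of the cluster in which the test sample has the highest membership."""
    u = memberships(d2, m)
    winner = max(range(len(u)), key=lambda l: u[l])
    return cluster_labels[winner]

# Toy training partition with c = 2 and binary outcomes (0 = survival, 1 = death).
labels = label_clusters([[0.9, 0.8, 0.2], [0.1, 0.2, 0.8]], [0, 0, 1])
print(labels, predict_outcome([1.0, 9.0], labels), predict_outcome([9.0, 1.0], labels))
```

In a cross-validation setting, `label_clusters` would be applied to the training folds and `predict_outcome` to each sample of the held-out fold.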

The experiments described in this section share the following experimental conditions:

• Clustering conditions:
o Minimal variation of the objective function: |ΔJ| < ε = 10^-8
o Maximum number of iterations: maxit = 500
o Fuzziness parameter: m = 2
• Classification conditions:
o 5-fold cross-validation
o Class distribution: 44%/56%

The Mixed Clustering methodology was applied under two learning approaches: unsupervised and supervised. The first partitions the data without knowledge of its outcome, while the second uses labelled samples for training, following these steps:

i Unsupervised clustering of the training set to determine 𝜆∗;
ii Supervised clustering of the training set using 𝜆∗ to obtain the prototypes 𝑣∗;
iii Unsupervised classification of the test set using 𝑣∗.

The criteria used to evaluate the quality of the outcome prediction are frequently used in health care problems (Lavrač, 1999):

• Accuracy: the number of correct classifications out of all samples classified;
• Sensitivity: the number of correct positive classifications out of all positive samples;
• Specificity: the number of correct negative classifications out of all negative samples.
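These three measures follow directly from the confusion counts, as the short sketch below shows. The labels in the example are made up for illustration, with 1 taken as the positive class.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # correct positives out of all positives
        "specificity": tn / (tn + fp),   # correct negatives out of all negatives
    }

scores = classification_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
print(scores)
```

The function assumes that both classes are present in `y_true`; otherwise one of the denominators is zero.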

4.3. Results and Discussion

The experiments performed with Mixed Clustering covered two data representations, time domain (raw data) and frequency domain (DFT), and two similarity measures, the Euclidean distance and DTW. In addition to mixed clustering, an alternative designated Temporal Clustering, which uses only the time-variant features, was tested to assess the actual benefit of combining both information components.

A Forward Feature Selection method was used to

assess the quality of each time variant feature, under all

combinations of conditions described.
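A greedy forward selection can be sketched as follows. The `score` callable is a hypothetical stand-in for whatever evaluation is used (for instance, cross-validated accuracy of the clustering-based classifier on a feature subset), and the scores in the example are made-up values, not results from this work.

```python
def forward_selection(features, score):
    """Add one feature at a time while the score of the selected subset improves.

    features : candidate feature identifiers
    score    : hypothetical callable mapping a list of features to a number
    """
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        # Evaluate every candidate extension of the current subset.
        cand_score, cand = max((score(selected + [f]), f) for f in remaining)
        if cand_score <= best:
            break   # no candidate improves the current subset
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best

# Toy scoring table (illustrative values only).
toy = {frozenset([6]): 0.8, frozenset([8]): 0.6, frozenset([6, 8]): 0.7}
chosen, value = forward_selection([6, 8, 3], lambda s: toy.get(frozenset(s), 0.5))
print(chosen, value)
```

In this toy case the search keeps a single feature, because adding a second one lowers the score.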

It was observed that the superiority of a similarity measure or time series representation depended on the feature.

The benefit of mixed clustering over temporal clustering was also not universal. When the time-variant features were, by themselves, rich enough, the addition of the patient’s demography misled the algorithm, leading to weaker results. However, when the temporal feature was weaker, it benefited from the mixed clustering approach.

The best overall Unsupervised Mixed Clustering

result was obtained using the Euclidean Distance with

the DFT using one time variant feature, no. 6,

representative of the Central Venous Pressure.

Figure 4-1 shows the differences between the temporal and mixed alternatives, under unsupervised learning, for the best feature and for an example of a weaker temporal feature that benefited from the mixed clustering approach, feature 8: pH.

Figure 4-1 Unsupervised Mixed and Temporal Clustering

Accuracy for features 6 and 8

It is observable that while the addition of the patient’s

demography did not increase the performance of feature

6, the weaker feature 8 needed the increase of

information that came with it.

In Figure 4-2, the equivalent results are shown, for

the Supervised learning alternative.

Figure 4-2 Supervised Mixed and Temporal Clustering

Accuracy for features 6 and 8

The best result under supervised clustering was also achieved for feature 6, using DTW with the DFT representation. The figure also shows that, for these features, the supervised alternative improved on the results of the unsupervised alternative. This effect was not verified for all features; overall, however, supervised learning increased the performance of the features that were also the best under the unsupervised alternative, suggesting that the features most related to the outcome benefit from its inclusion in the learning process.

5. Conclusions and Future Work

A new expanded clustering algorithm was formulated to mine databases represented by both time-variant and invariant features, combining the information extracted to further characterize a given system. The results of the data mining and pattern recognition process were applied to machine learning purposes, where distinct methodologies were proposed to solve forecasting and classification problems, the first with a Meteorological System and the second with a Medical application, demonstrating the method’s wide applicability.


Different measures were implemented for similarity

search between time series, the commonly used

Euclidean Distance and the increasingly popular

Dynamic Time Warping. The benefit of the joint

clustering of different types of data was also

demonstrated, by comparing it to the clustering of

individual data types.

Table 5-1 compares the best Mixed Clustering results with the best results obtained in previous work on the same database.

It should be noticed that these results are not directly

comparable since the studies performed different

processing on the input data and the methods used are

different. The authors of (Cismondi, et al., 2012) used

multi-criteria Feature Selection with Fuzzy Models

(FM) and Neural Networks (NN) to predict the patient’s

outcome.

While the FM produced the best accuracy (ACC), Mixed Clustering produced comparable results using four times fewer features (3 instead of 12), 2 of which were numerical values that are significantly easier to measure and process.

Table 5-1 Best Mixed Clustering and best previous work result

Reference                 Method               No. features  ACC (%)  Sens. (%)  Spec. (%)
(Cismondi, et al., 2012)  NN, Max. Sens.       12            72.78    84.27      65.53
(Cismondi, et al., 2012)  NN, Max. Spec.       12            75.74    60.00      85.48
(Cismondi, et al., 2012)  NN, Parallel         12            79.25    85.21      69.64
(Cismondi, et al., 2012)  FM, Max. Sens.       12            74.45    84.18      68.81
(Cismondi, et al., 2012)  FM, Max. Spec.       12            79.17    63.53      88.83
(Cismondi, et al., 2012)  FM, Parallel         12            81.72    85.15      71.21
Mixed Clustering          Unsupervised         3*            76.00    88.83      66.09
Mixed Clustering          Supervised           3*            77.00    82.33      73.24

* The Mixed Clustering used two constant features, patient’s age and weight, combined with one time-variant feature.

In addition, the Mixed Clustering method has the

highest sensitivity, or true positive rate, crucial since the

positive class represents a deceased patient.

As future work, it would be interesting to expand the

clustering possibilities to any number of partitions and

to databases with any number of classes.

Since the DTW method is able to compare time series

of different length, the expansion of the method to form

prototypes of variable length would expand the

applicability of the mixed clustering method to

databases with time series of different length.

Also, a reformulation of the method should include

the possibility to use different similarity measures for

each feature, as well as the influence given to each

through the implementation of different temporal

influence parameters 𝜆𝑖, where 𝑖 = 1,2, … , 𝑓.

Even though one of the great advantages of data mining and soft computing techniques is their ability to treat any problem specific to a given field as a generalized system, the final step in the KDD approach is the interpretation of the results, bringing the problem back to its field and enabling practical conclusions. Thus, the medical application demonstrated would benefit from further analysis of the best features that resulted from the feature selection algorithms, possibly raising awareness of the importance of a feature in the medical community. In this context, a feature sensitivity study could also be performed on the time-variant and invariant features, pre-assessing the quality of the knowledge they contain.

The causes of septic shock are not yet fully understood; however, some risk factors have been studied (Fink, Abraham, Vincent, & Kochanek, 2005) and could be inserted in the Mixed Clustering method as time-invariant features.

Finally, the validation of the mixed clustering methodology requires its application to problems from different domains and fields, such as Financial, Power Consumption or Surveillance applications. The use of benchmark databases could demonstrate its value against different techniques. However, due to the specific characteristics of the mixed clustering inputs, there is a shortage of available databases (Keogh & Kasetty, 2003).

References

ARD. (n.d.). Current and Historical Alberta Weather Station Data Viewer. Retrieved May 2014, from http://agriculture.alberta.ca/acis/alberta-weather-data-viewer.jsp

Bezdek, J. C., Ehrlich, R., & Full, W. (1984). FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10, 191-203.

Cismondi, F., Horn, A. L., Fialho, A. S., Vieira, S. M., Reti, S. R., Sousa, J. M., et al. (2012). Multi-stage Modeling Using Fuzzy Multi-criteria Feature Selection to Improve Survival Prediction of ICU Septic Shock Patients. Expert Systems with Applications, 39, 12332-12339.

Devijver, P. A., & Kittler, J. (1982). Pattern Recognition: A Statistical Approach. Prentice-Hall.

Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17, 37-54.

Fink, M., Abraham, E., Vincent, J., & Kochanek, P. M. (2005). Septic Shock. In Textbook of Critical Care (5th ed.). Saunders Elsevier.

Gaudin, R., & Nicoloyannis, N. (2006). An Adaptable Time Warping Distance for Time Series Learning. 5th International Conference on Machine Learning and Applications (ICMLA 06). Orlando, USA.

Han, J., & Kamber, M. (2006). Data Mining: Concepts and Techniques (2nd ed.). Morgan Kaufmann Publishers.

Izakian, H., Pedrycz, W., & Jamal, I. (2013, October). Clustering Spatiotemporal Data: An Augmented Fuzzy C-Means. IEEE Transactions on Fuzzy Systems, 21.

Keogh, E., & Kasetty, S. (2003, October). On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. Data Mining and Knowledge Discovery, 7, pp. 349-371.

Lavrač, N. (1999). Artificial Intelligence in Medicine: Machine Learning for Data Mining in Medicine (Vol. 1620).

Liao, T. W. (2005, November). Clustering of time series data - a survey. Pattern Recognition, 1857-1874.

Marques, F. J., Moutinho, A., Vieira, S. M., & Sousa, J. M. (2011). Preprocessing of Clinical Databases to improve classification accuracy of patient diagnosis. World Congress, (pp. 14121-14126).

Paetz, J. (2003). Knowledge-based approach to septic shock patient data using a neural network with trapezoidal activation functions. Artificial Intelligence in Medicine, 28, 207-230.

Rani, S., & Sikka, G. (2012). Recent Techniques of Clustering of Time Series Data: A Survey. International Journal of Computer Applications, 52(15).