
Mining Sequence Patterns from Data Collected by Brain Damage Rehabilitation

by Julian Kuhnel (a1108558)

in Partial Fulfillment of the Requirements for the Degree of Bachelor of Science in Computer Science

University of Vienna
Scientific Computing

November 24th, 2014

Thesis Supervisor: Univ.-Prof. i.R. Dipl.-Ing. Dr. Peter Brezany

Abstract. Since the number of people affected by dementia is expected to grow considerably in the near future, IT support for the people involved is strongly needed. Several studies have shown that various forms of mental stimulation, which may include specific games, books, melodies, etc., are highly therapeutic for this group of patients. Individuals participating in the studies performed significantly better on measures of cognition; increases in alertness and awareness in the test subjects were reported. Physical exercise, lifelong learning and social interaction also help to slow down the course of the disease. This knowledge opens up a possible way to provide IT support.

IT support can be offered in various ways, e.g. patients can be supported by “Smart Houses” or GPS devices, and their families and carers by monitoring systems. Those devices produce large amounts of data, which can be analyzed and used to gain more knowledge about the illness and the behavior of the patients. Data collected from sensors monitoring the patients’ outdoor movement, relevant events registered indoors, the history of medication intake, etc. can also be used in the analysis process. Domain experts therefore need to be provided with analysis tools that make extensive use of data mining algorithms and can handle extremely heterogeneous data.

The goal of this thesis is to design a generic analysis tool that uses a sequence pattern discovery algorithm and a set of visualization methods to support the domain experts, which can lead to an optimization of appropriate stimulation methods. A special chapter is dedicated to explaining the principles of the selected data mining method “GSP” (Generalized Sequence Pattern Algorithm), which is used in the implemented prototype. The data mining platform Weka is used for loading, editing and selecting the necessary data and for applying the actual data mining algorithm. A comprehensive investigation of various known visualization methods was carried out, and a composition of the best solutions, which will support the domain experts in an efficient way, is presented.

A working prototype was implemented and tested on data generated by the eScrapBook, a novel concept for producing specific virtual books, developed within the European SPES project in order to slow down the course of the patient’s dementia via mental stimulation. It is an online platform enabling easy creation and use of multimedia scrapbooks. The books may contain text, images, audio and video clips and are viewable online or exportable as an archive to be used independently of the authoring application. The data generated by the eScrapBook describe the interactions of the user with the device itself.

Next, an example scenario is given in which a domain expert goes through the complete workflow of the prototype. Finally, we propose future research tasks that can extend the results we achieved.


Table of Contents

1 Introduction
1.1 Motivation
1.2 Thesis Objectives
1.3 Thesis Organization

2 Background Information
2.1 Research Context
2.2 Brain Damages and Rehabilitation
2.3 Data Mining
2.4 Sequence Pattern Mining

3 Software System Design
3.1 Waikato Environment for Knowledge Analysis (WEKA)
3.2 Architecture
3.3 Input Data Model
3.4 Visualization

4 Testing and Evaluation of the Prototype Workflow

5 Conclusions
5.1 Results Achieved and Discussion
5.2 Future Work

A Appendix

List of Figures

1 Smart House [1]
2 Two examples of sequence mining
3 Pseudo code of the “GSP” algorithm. [2]
4 Abstraction of the sequential pattern mining workflow. [3]
5 Use Case Diagram
6 Component Diagram
7 Overview Class Diagram
8 ER Diagram SPES project
9 Tree visualization
10 Visualization with graphs
11 Scatter plot example
12 Timeline example
13 Compare Timelines
14 First layer
15 Second layer.
16 Third layer.
17 Connection to the database.
18 Editing and selection of the data.
19 Second possibility to connect to the database and select the data.
20 Textual output of the data.
21 Settings for the algorithm.
22 Textual output of the “GSP” algorithm.
23 Attribute Table - Part 1
24 Attribute Table - Part 2


1 Introduction

In this section, the main goals of this thesis, its motivation and its structure are described. The thesis deals with IT devices and technologies that support the needs of people suffering from brain damage, and this section describes in detail the strategy for reaching its goals.

1.1 Motivation

Thanks to modern medicine, human life expectancy continues to improve, but with side effects. The fact that people grow older and older goes hand in hand with the sad truth that age-related diseases like dementia occur much more frequently. By 2050, the number of people affected by Alzheimer’s is predicted to reach 106.4 million, and several professions/actors, like patients, medical doctors, care providers and psychologists, may be involved in the medical care and rehabilitation of brain disorders [4]. Thus advanced IT-based support such as data science, big data, cloud technologies and visualization is strongly needed. The possibilities to support the domain experts, the patients and their carers technologically are manifold and depend on the needs of their specific roles. For example, patients suffering from dementia need help at home. This help could be provided by “Smart Houses”. According to [1], “Smart Houses” offer older people or people with disabilities the possibility to live an independent, or at least more independent, life. A “Smart House” combines multiple devices for different needs, as listed in [1]:

1. Devices for automation and control of the home environment (e.g. automatic kitchen equipment, light and door controllers, temperature controllers, home security devices, etc.);

2. Assistive devices (e.g. electro-mechanical devices for movement assistance, devices for physical rehabilitation and fitness, etc.);

3. Devices for health monitoring (e.g. monitoring of vital parameters, posture monitoring, behavior monitoring, advanced chemical analysis, etc.);

4. Devices for information exchange (e.g. systems for information access and telecommunication, systems for telemonitoring, teleinspection and remote control, etc.);

5. Leisure devices (e.g. virtual reality systems, emotional interactive entertainment robots, etc.)

An illustration of the interaction of these various devices in a “Smart House” is shown in Figure 1.

A “Smart House” covers a large set of needs of dementia patients, but it is very expensive, and only well-off people are able to afford it. There is, however, a way to use only the affordable parts that are also used by “Smart Houses” to support the patients and all other persons involved in a way that is affordable for everyone. A “GPS” device, for example, is not costly and is applicable in the field of dementia in different forms. It could be used as a key finder; slightly extended, it could recognize when the patient falls to the ground; and it even makes it possible to follow and find a patient who has wandered off in a confused phase.

Fig. 1. Smart House [1]

Telemedicine is another possibility to support the people involved. According to [5], telemedicine is defined as the use of telecommunication technologies to provide medical information and services. Although this definition includes medical uses of the telephone, personal computers, facsimile and distance education, telemedicine is increasingly being used as shorthand for remote electronic clinical consultation. With this idea in mind, the international “SPES” (Support Patients Through E-Service Solutions) [6] and “OLDES” (Older People’s E-Services At Home) [7] projects were founded; the “SPES” project was based on the findings and implementations of the “OLDES” project [6].


This thesis is based on the research and development results achieved in the “SPES” project.

1.2 Thesis Objectives

This thesis provides a basis for faster progress in making predictions about the course of the disease and in finding sequential patterns in the behavior of patient groups, in order to help domain experts to slow down or, in the best case, reverse the progression of the disease. The thesis is closely related to the “SPES” project and thus works in the field of telemedicine. According to [6], in the course of the “SPES” project a tele-health and entertainment platform was implemented, which focuses on the following areas:

1. Respiratory problems
2. Dementia
3. Handicapped people
4. Social exclusion

One part of this platform is the eScrapBook (see Section 2.1), a software system developed for dementia patients. The main goal of this thesis is to provide a tool for analyzing and visualizing the log data generated by this software with a sequence pattern discovery algorithm, to enable domain experts to gain a deeper insight into the behavior of patients who develop dementia. Furthermore, to allow easy adaptation to the increasing demand for IT-based methods in the medical sector, the tool should be easily extendable. Expandability is also good practice in general, since new technologies appear at increasingly short intervals and, according to [8], expandability helps to ensure that software can be used for a longer period of time. This will be realized by making individual parts of the implementation (e.g. visualization, algorithm, data source and database) easily replaceable.

Summarizing, the main goals of this thesis are:

1. Conceptual elaboration of the prototype and the overall platform.
2. Elaboration of a generic software architecture.
3. Implementation of a working prototype.
4. Evaluation and testing of the prototype.

Thus advanced IT-based support for the rehabilitation of brain disorders is a key challenge that this work addresses.

1.3 Thesis Organization

In the next section, background information is provided. This covers related projects, for example the European project “SPES”, as well as a brief overview of dementia, its different causes and the possibilities for rehabilitation. The section also provides information about data mining in general and finally gives a short overview of the sequence pattern discovery algorithm that is used in this thesis to fulfill part of its goals.

In Section 3, the software design developed in the context of this thesis is presented, including the architecture, the input data model and various output forms (e.g. textual and pictorial). The tool named “Weka”, which provides a comprehensive collection of machine learning algorithms and data preprocessing tools, is presented, and an abstraction of the sequential pattern mining workflow is given. UML class, use case and component diagrams are used for a better understanding of the system architecture developed within this work. Beyond that, the possibilities for visualizing sequential patterns found in the current literature are addressed and discussed. Finally, the approach used in this thesis is presented. It consists of different layers, which include the best of the visualizations for sequential patterns pointed out before.

Section 4 is divided into two parts. The first part describes the testing and evaluation of the prototype workflow and explains the prototype in summary. It describes how Weka works together with the tool developed in this thesis and defines exactly where the border between these two components lies. In the second part, the results achieved in the tests are analyzed and discussed.

The last section contains the conclusion of this thesis. It considers the challenges that occurred during this work and gives possible solutions for them. A summary of the content and purpose of the thesis is presented, and finally opportunities for future work are given.

2 Background Information

In this section, information about the surrounding knowledge areas, an introduction to the related work and a description of the relationships to other scientific areas are provided. Furthermore, essential knowledge about dementia and about the algorithm relevant for this thesis is explained.

2.1 Research Context

This work is directly related to the European project SPES. In the context of the SPES project, software was developed that provides mental stimulation for dementia patients. This software is called “eScrapBook”; it is a virtual book that can be filled individually with text, videos, pictures and audio. Whenever a patient interacts with the book, log data is created. Each patient has his own book and thus produces his own log files. In this work, the log files are analyzed for patterns to help the domain experts in their research. Furthermore, it would be possible to analyze data from additional data sources (e.g. positional tracking data). Also related to this thesis are the scientific data management projects Advanced Breath Gas Analysis (ABA) and LIGHT. For more detailed information see [9].


2.2 Brain Damages and Rehabilitation

Dementia is a syndrome that causes a significant loss in cognitive ability beyond what would normally be expected from aging. According to [10], modern medicine makes it possible for life expectancy to increase by three months every year. By 2050, the number of people affected by Alzheimer’s is predicted to reach 106.4 million [4]. Dementia has different causes. As [11] describes, the most common ones are lack of oxygen in the brain, head injury, brain haemorrhage, damage caused by excessive alcohol or drug consumption, and infection. The disease is divided into four phases. In the beginning, a decrease in short-term memory is noticed, followed by concentration disorders and disorientation. In phase four, the patients suffer from hallucinations and are no longer able to perform everyday tasks like washing or dressing themselves. The patients also suffer from depression and mood swings.

This work tries to support domain experts in slowing down the dementia process, reactivating cognition and thinking skills and increasing the ability to communicate, by providing an analysis and visualization tool for sequential patterns of patient groups. Additionally, the above-mentioned aspects lead to an increase in mental well-being and make it easier to counteract depression. There are different ways to treat the disease. The following methods are important for rehabilitating patients suffering from dementia. Physical exercise as well as lifelong learning have a positive effect on the progress of the illness. Social interaction also helps to slow down the course of the disease. A more aggressive variant is to stimulate the brain directly with electricity, magnets or implants. Another method, which is also the focus of this thesis, is to stimulate the brain mentally.

2.3 Data Mining

Data mining means that knowledge is extracted or “mined” from large amounts of data. [12] points out that knowledge discovery is a process that includes the following steps:

1. Data cleaning (remove noise and inconsistent data)

2. Data integration (where multiple data sources may be combined)

3. Data selection (where data relevant to the analysis task are retrieved from the database)

4. Data transformation (where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)

5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)

6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)

7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)


Weka provides functionality for all of these steps. The different data mining algorithms and the possibilities for data transformation and presentation are described in detail in Section 3.1.

2.4 Sequence Pattern Mining

Sequence pattern mining is similar to association rule mining. The difference between the two methods is that association rule mining does not take the order of the events into account, whereas sequence pattern mining does. According to [13], data mining, or knowledge discovery in databases, is becoming more and more interesting as more data are produced every day. It can be understood as an efficient way to discover patterns in large databases or data warehouses. One specific way to disclose such knowledge is to discover sequential patterns in event-based data with a sequence pattern algorithm. There are different sequence algorithms in use. The Needleman-Wunsch algorithm is used to compare two very long sequences, or one short with one long sequence, and the Smith-Waterman algorithm is used to compare one specific sequence with a large set of other sequences. A detailed description of these algorithms, together with a good solution for parallelizing them with OpenCL and CUDA, is given in [14]. Since the dataset used in this thesis consists of multiple sequences, each of which would have to be compared with every other, and since these algorithms compare two sequences with regard to their similarity, they are not suitable for this work. As stated in Section 1.2, the goal of this thesis is to determine subsequences in the dataset and not to compare the sequences for similarity. One algorithm that achieves this is the Generalized Sequence Pattern (“GSP”) algorithm (described later in this section); a parallelized version of it with further modifications is described in detail in [15]. The “GSP” algorithm is used for this work because it has proven successful, is consistent and provides the functionality this thesis is looking for. A set of other sequential pattern mining algorithms is given in [12]. In any case, because the software architecture allows great flexibility, the underlying algorithm can simply be exchanged, e.g. for a parallelized version.
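The difference between the order-insensitive view of association rule mining and the order-sensitive view of sequence pattern mining can be illustrated with a minimal Python sketch. It is not part of the prototype (which builds on Weka); the function names and the two example sessions are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def contains_as_set(events: Sequence[str], items: Sequence[str]) -> bool:
    """Association-rule view: only the presence of the items matters."""
    return set(items) <= set(events)


def contains_in_order(events: Sequence[str], pattern: Sequence[str]) -> bool:
    """Sequence-pattern view: the items must occur in the given order,
    although other events may lie in between."""
    remaining = iter(events)
    # `item in remaining` consumes the iterator up to the first match, so
    # each following item is searched only in the rest of the sequence.
    return all(item in remaining for item in pattern)


# Hypothetical interaction sessions of two patients.
session_a = ["photo", "video", "audio", "text"]
session_b = ["audio", "photo", "text", "video"]
pattern = ["photo", "audio"]

print(contains_as_set(session_a, pattern), contains_as_set(session_b, pattern))
print(contains_in_order(session_a, pattern), contains_in_order(session_b, pattern))
```

Both sessions contain the two items, but only the first one contains them in the order required by the pattern.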

“Event-based data are encountered daily in many disciplines and are used for various purposes. They are collections of ordered sequences of events where each event has a start time and a duration” [16]. Examples of such data include medical records, internet surfing records, transaction records, industrial process or system control records, and activity diary data. [15] provides a good enumeration of possible application areas for sequential patterns:

1. Discovering sequential relationships between different telecommunication switches and alarms triggering on them.

2. Analyzing data from scientific experiments conducted over a period of time.

3. Discovering relationships between stock market events (e.g. fluctuations happening on market indices, individual stock prices).

4. Analyzing medical records of patients for temporal patterns between diagnosis, treatment, symptoms, and examination results, etc.

5. Discovering patterns among different socio-economic events.

The problem is to find all sequential patterns with a user-specified minimum support, where the support of a sequential pattern is the percentage of data sequences that contain the pattern. For example, in a database where the log files from the “eScrapBook” of dementia patients are stored, each data sequence may be what one patient looked up in his book during one day, combined with background information. A sequential pattern from such data is given in [3] in a pseudo notation:

“Patients who primarily observe photos of their family and start to take the medicament M code are likely to extend their mental horizon within a month significantly. [minimum support = 10%]”

The data sequence of a patient who looked up other media in the book between the events described above still contains the sequential pattern, but if a patient looked up the same media in a different order, the sequences do not have a sequence pattern in common (see Figure 2 for an example). So, “all the items in an element of a sequence must be present in a single transaction for the data sequence to support the pattern.” [13]
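The support definition above can be made concrete with a small Python sketch. It assumes that a data sequence is an ordered list of transactions, each being a set of items, and it ignores time constraints; the media names and the example database are hypothetical and not taken from the eScrapBook data.

```python
from typing import FrozenSet, List, Sequence

Itemset = FrozenSet[str]
DataSequence = Sequence[Itemset]


def supports(data_sequence: DataSequence, pattern: Sequence[Itemset]) -> bool:
    """True if every element of `pattern` is a subset of one transaction of
    `data_sequence` and the matching transactions appear in the same order."""
    position = 0
    for element in pattern:
        # Skip transactions that do not contain all items of this element.
        while position < len(data_sequence) and not element <= data_sequence[position]:
            position += 1
        if position == len(data_sequence):
            return False
        position += 1  # the next element must match a later transaction
    return True


def support(database: List[DataSequence], pattern: Sequence[Itemset]) -> float:
    """Fraction of data sequences in `database` that contain `pattern`."""
    if not database:
        return 0.0
    return sum(supports(seq, pattern) for seq in database) / len(database)


def itemsets(*groups: set) -> List[Itemset]:
    return [frozenset(group) for group in groups]


# Hypothetical database: one data sequence per patient and day.
database = [
    itemsets({"family_photo"}, {"music"}, {"holiday_video"}),  # other media in between
    itemsets({"holiday_video"}, {"family_photo"}),             # same media, other order
    itemsets({"family_photo", "holiday_video"}),               # both in one transaction
    itemsets({"family_photo"}, {"holiday_video", "music"}),
]
pattern = itemsets({"family_photo"}, {"holiday_video"})

print(support(database, pattern))  # 0.5
```

Only the first and the last data sequence support the pattern, so with a minimum support of 50% the pattern would just be reported, while any higher threshold would discard it.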

The Algorithm The “Generalized Sequence Pattern Algorithm” (“GSP”), developed at IBM, finds sequence patterns as follows. In principle, the algorithm passes over the data in multiple iterations. The first iteration determines the support of each item, i.e. the number of data sequences that include the item. After the first iteration, the program “knows” which items are frequent. Each subsequent iteration starts with a seed, which consists of the frequent patterns found in the previous pass. The seed is used to generate candidate patterns that are one item longer; the candidates that turn out to be frequent form the seed for the next iteration. When no frequent candidates remain to extend the seed, the algorithm has finished. [2] provides a descriptive pseudo-code example of this algorithm (see Figure 3).
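The iteration scheme described above can be sketched in a few lines of Python. This is a strongly simplified illustration and neither the Weka implementation nor the full algorithm: it assumes that every element of a sequence is a single item and it leaves out time constraints, sliding windows and taxonomies. All names and the example data are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Pattern = Tuple[str, ...]


def is_subsequence(pattern: Sequence[str], events: Sequence[str]) -> bool:
    remaining = iter(events)
    return all(item in remaining for item in pattern)


def count_support(database: List[Sequence[str]], candidate: Pattern) -> int:
    """Number of data sequences that contain the candidate."""
    return sum(is_subsequence(candidate, seq) for seq in database)


def gsp(database: List[Sequence[str]], min_support: float) -> Dict[Pattern, int]:
    min_count = min_support * len(database)
    frequent: Dict[Pattern, int] = {}

    # First pass: determine the support of every single item.
    items = sorted({item for seq in database for item in seq})
    seed: List[Pattern] = []
    for item in items:
        count = count_support(database, (item,))
        if count >= min_count:
            frequent[(item,)] = count
            seed.append((item,))

    # Later passes: extend the seed by one item per iteration.
    while seed:
        seed_set = set(seed)
        # Join: s1 and s2 fit if s1 without its first item equals
        # s2 without its last item.
        candidates = {s1 + s2[-1:] for s1 in seed for s2 in seed if s1[1:] == s2[:-1]}
        # Prune: drop candidates with an infrequent shorter subsequence.
        candidates = {
            c for c in candidates
            if all(c[:i] + c[i + 1:] in seed_set for i in range(len(c)))
        }
        seed = []
        for candidate in sorted(candidates):
            count = count_support(database, candidate)
            if count >= min_count:
                frequent[candidate] = count
                seed.append(candidate)
    return frequent


database = [
    ["photo", "video", "audio"],
    ["photo", "audio"],
    ["video", "photo", "audio"],
    ["audio", "photo"],
]
for found, count in gsp(database, min_support=0.5).items():
    print(found, count)
```

With a minimum support of 50%, the sketch reports the three single items as well as the two-item patterns `("photo", "audio")` and `("video", "audio")`; no three-item candidate can be joined from these, so the loop stops after the third pass.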

There are four parameters used by this algorithm:

1. dataSeqID (The attribute number representing the data sequence ID. Default set to 0.)

2. debug (A boolean value. If set to true, the algorithm may output additional info to the console.)

3. filterAttributes (The attribute numbers (e.g. “0, 1”) used for result filtering; only sequences containing the specified attributes in each of their elements/itemsets will be output; -1 prints all.)

4. minSupport (Minimum support threshold)

Since each iteration depends on the findings of the previous iteration, parallelization is only possible within one iteration. This thesis uses the version of the algorithm that is made available by Weka. According to [17], some versions of Weka disabled the use of GeneralizedSequentialPatterns in the Explorer due to a bug. Versions later than 3.6.1 and 3.7.0 (or snapshots later than 02/07/2009) include a fix for this.

Fig. 2. Two examples of sequence mining

The first subsequence includes two attributes, here blue and orange, because the sequences (sequence 1 and sequence 2) share this pattern. In the second part, the sequences contain the same attributes but in a different order, so this subsequence is empty.

For further information about the sequence pattern algorithm and sequence patterns see [18], [12] and [2]; for information about the parallelization of this algorithm see [15].

3 Software System Design

The prototype designed and implemented in this thesis provides an interface that enables the user to fulfil the following tasks: establish a connection to a data source (for simplicity, only MySQL databases), select appropriate data and edit them if necessary, apply a sequence pattern mining algorithm to the modified dataset, and finally visualize the results.

Fig. 3. Pseudo code of the “GSP” algorithm. [2]

Weka is used for the database connection and the data mining algorithms. The first step in the program is to connect to a database. After that, there is the opportunity to combine tables, delete attributes and change values of the loaded dataset. The third step is to set the parameters for the “GSP” algorithm (e.g. minimum support). According to [13] and [18], this algorithm offers many fine-tuning options. An explanation of their effects is given in Section 4. The next step is the execution of the algorithm, which generates textual output. Here the Weka part is finished. The textual output from Weka is parsed and split into the information needed by the visualization component, to which it is handed over when the “Visualize” button is pressed. Here, too, there are multiple ways to show the data in different forms. Detailed explanations can be found in Section 3.4.
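The step in which the textual output is split into pieces for the visualization component could look roughly like the following Python sketch. The line format used here, `[n] <{item,...}{item,...}> (count)`, is an assumption made for illustration and should be checked against the output of the Weka version actually in use; the names `FoundPattern` and `parse_patterns` are hypothetical and do not stem from the prototype.

```python
import re
from typing import List, NamedTuple, Tuple


class FoundPattern(NamedTuple):
    elements: Tuple[Tuple[str, ...], ...]  # one tuple of items per element
    support_count: int


# Assumed line format: "[3] <{photo}{audio,video}> (2)"
PATTERN_LINE = re.compile(r"^\[\d+\]\s+<(?P<body>.*)>\s+\((?P<count>\d+)\)$")
ELEMENT = re.compile(r"\{([^}]*)\}")


def parse_patterns(text: str) -> List[FoundPattern]:
    """Extract all pattern lines and ignore headings and blank lines."""
    patterns = []
    for line in text.splitlines():
        match = PATTERN_LINE.match(line.strip())
        if match is None:
            continue
        elements = tuple(
            tuple(item.strip() for item in element.split(",") if item.strip())
            for element in ELEMENT.findall(match.group("body"))
        )
        patterns.append(FoundPattern(elements, int(match.group("count"))))
    return patterns


# Made-up sample in the assumed format.
sample_output = """
- 1-sequences

[1] <{photo}> (4)
[2] <{audio}> (4)

- 2-sequences

[1] <{photo}{audio,video}> (3)
"""

for found in parse_patterns(sample_output):
    print(found)
```

Keeping the parser separate from the visualization means that a change in the textual format only affects the two regular expressions.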

The overall workflow can be seen in Figure 4.

3.1 Waikato Environment for Knowledge Analysis (WEKA)

Fig. 4. Abstraction of the sequential pattern mining workflow. [3]

First, the data is collected from the input of the patients, from other applications and from background information. This data is stored in a database or a data warehouse. A data mining algorithm (in this case “GSP”) is used to gain knowledge from the data. The output of this algorithm can be textual or pictorial. This knowledge helps the persons involved to provide better support to the respective patient.

According to [19], the Waikato Environment for Knowledge Analysis, Weka for short, is a performance-oriented project that provides a comprehensive collection of machine learning algorithms and data preprocessing tools. It is a state-of-the-art tool with a graphical user interface as well as a library, widely used in all kinds of different (research) communities [20]. It provides functionality to quickly apply data mining methods to various datasets. According to [17], it offers not only sequential pattern mining algorithms (e.g. GSP), but also clustering, classification, regression, attribute selection and association rule mining algorithms, so the prototype of this thesis can be extended easily. Weka also provides a visualization for the output of some of the algorithms, but not for sequential patterns, for which only textual output is provided. This is, however, unsuitable for the application domain that we target.

The Weka project was funded by the New Zealand government in 1993 and is still active. In this thesis Weka 3.6.10 is used. The goal of the project is:

“The programme aims to build a state-of-the-art facility for developing techniques of machine learning and investigating their application in key areas of the New Zealand economy. Specifically we will create a workbench for machine learning, determine the factors that contribute towards its successful application in the agricultural industries, and develop new methods of machine learning and ways of assessing their effectiveness.” [20]

3.2 Architecture

This section describes the software architecture used to realize the implementation. The data processing part makes extensive use of the Weka library, which provides classes to integrate databases and coordinate interactions with them, as well as algorithms to mine the patterns.

The graphical user interface is realized in Java using the standard library Swing. The use cases, already described above, are displayed in Figure 5. In general, two components are used: one for the visualization and one for the complete management of the data used. For a better understanding see Figure 6.

Fig. 5. Use Case Diagram

The workflow of the domain expert is as follows: he connects to a database, loads the data and selects or modifies the data according to his needs. Next, he applies the “GSP” algorithm to the selected data and interactively explores the results using the visualization provided by the application.

The classes in the Data-Management component partially use available parts of the Weka library. The Visualization component gets all its information from the Data-Management component and also sends user inputs to it. The communication between the classes within the Visualization component is handled by the Main Window. It consists of multiple window part classes, which wait for user inputs and send any event information directly to the Main Window. If another window part needs this information (e.g. to enable a button), the Main Window passes it on, but in general the Main Window informs the Data-Management component about the user interaction, which can then update the data in order to present another view. One goal of this thesis is to make this tool easily expandable. To reach this goal, interfaces were used to make the classes providing the main functions easily interchangeable. For further information see Figure 7.
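The idea of making the main functions interchangeable through interfaces can be illustrated independently of the actual Java classes. The following Python sketch uses two hypothetical interfaces, `SequenceMiner` and `Visualizer`, together with deliberately trivial stand-in implementations; none of these names are taken from the prototype or from Weka.

```python
from collections import Counter
from typing import Dict, List, Protocol, Sequence, Tuple

Pattern = Tuple[str, ...]


class SequenceMiner(Protocol):
    def mine(self, database: List[Sequence[str]], min_support: float) -> Dict[Pattern, int]:
        ...


class Visualizer(Protocol):
    def render(self, patterns: Dict[Pattern, int]) -> str:
        ...


class FrequentItemMiner:
    """Stand-in miner that reports frequent single items only."""

    def mine(self, database: List[Sequence[str]], min_support: float) -> Dict[Pattern, int]:
        # Count each item at most once per data sequence.
        counts = Counter(item for seq in database for item in set(seq))
        min_count = min_support * len(database)
        return {(item,): n for item, n in counts.items() if n >= min_count}


class TextVisualizer:
    """Stand-in visualization that renders one line per pattern."""

    def render(self, patterns: Dict[Pattern, int]) -> str:
        lines = [f"{' -> '.join(p)}: {n}" for p, n in sorted(patterns.items())]
        return "\n".join(lines)


class AnalysisTool:
    """Depends only on the two interfaces, not on concrete classes."""

    def __init__(self, miner: SequenceMiner, visualizer: Visualizer) -> None:
        self.miner = miner
        self.visualizer = visualizer

    def run(self, database: List[Sequence[str]], min_support: float) -> str:
        return self.visualizer.render(self.miner.mine(database, min_support))


database = [["photo", "audio"], ["photo"], ["video"]]
tool = AnalysisTool(FrequentItemMiner(), TextVisualizer())
print(tool.run(database, min_support=0.5))  # photo: 2
```

Because `AnalysisTool` only knows the interfaces, a different algorithm or a graphical visualization can be plugged in without touching the rest of the code.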

3.3 Input Data Model

Fig. 6. Component Diagram

The log data produced by user interaction with the eScrapBook are stored in a MySQL database. Figure 8 on page 14 shows the entity-relationship diagram used in the project. The attributes are specified in more detail in Appendix A. As can be seen there, time-dependent information is stored together with all events produced by the user. Beyond the log data, the database also stores information about the patients and the book. Weka provides the functionality to combine these tables into one, which makes it easy to work with the aggregated data.
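To illustrate what such a combined table looks like, the following sketch joins a log table with a patient table. It is not Weka's mechanism and not the project's schema: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, and the table and column names (patient, log_event, phase, media_type, started_at) as well as the rows are assumptions made up for this example.

```python
import sqlite3

# In-memory stand-in database; schema and rows are purely illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, phase INTEGER);
    CREATE TABLE log_event (
        event_id   INTEGER PRIMARY KEY,
        patient_id INTEGER REFERENCES patient(patient_id),
        media_type TEXT,
        started_at TEXT
    );
    INSERT INTO patient VALUES (1, 2), (2, 4);
    INSERT INTO log_event VALUES
        (1, 1, 'picture', '2014-05-01 10:00:00'),
        (2, 1, 'video',   '2014-05-01 10:02:00'),
        (3, 2, 'picture', '2014-05-01 11:00:00');
""")

# One flat row per logged event, enriched with the patient's attributes.
rows = conn.execute("""
    SELECT p.patient_id, p.phase, e.media_type, e.started_at
    FROM log_event AS e
    JOIN patient AS p ON p.patient_id = e.patient_id
    ORDER BY p.patient_id, e.started_at
""").fetchall()

for row in rows:
    print(row)
```

Each resulting row carries both the event and the patient information, so a sequence mining step can group the rows by patient and order them by time.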

3.4 Visualization

Visualization is indispensable in the field of data mining, because it maps complex data, which would convey little to humans without a graphical representation, onto patterns. The aim of “Scientific Visualization” is to simplify the analysis, understanding and communication of data, models and concepts [22]. However, it is not always easy to find a visual representation of the data that allows an effective examination. The following sections give a brief overview and analysis of existing approaches.

Patterns in data mining take different forms, so different forms of visualization are necessary. Sequence patterns are related to association rules combined with the aspect of time or order, which means that a visualization of sequence patterns has to cover both categories.

Overview of existing visualization approaches

Trees The tree structure plays a prominent role in computer science as well as in data visualization. The visual representation is easy to understand and neatly arranged. Most people instinctively read the drawing from top to bottom, which makes it easy to follow the data’s story.

A tree always consists of nodes and branches. The topmost node is called the root. From this node, branches lead to further nodes; nodes without children are called leaf nodes [23]. A node represents an object or a state, whereas a branch connects a parent node with its children. Figure 9 on page 15 shows an example of a tree in the context of sequential patterns.
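As a minimal sketch of how sequences can be arranged in such a tree, the following Python code builds a prefix tree in which every path from the root downwards corresponds to the beginning of at least one sequence. The function names and the example sequences are assumptions for illustration only; the tree shown in Figure 9 may be constructed differently.

```python
def build_prefix_tree(sequences):
    """Nested dicts: each key is an event, each value the subtree of what follows it."""
    root = {}
    for seq in sequences:
        node = root
        for event in seq:
            # Reuse the child if this prefix was seen before, otherwise create it.
            node = node.setdefault(event, {})
    return root


def render(node, depth=0):
    """Return the tree as indented text lines, children sorted alphabetically."""
    lines = []
    for event in sorted(node):
        lines.append("  " * depth + event)
        lines.extend(render(node[event], depth + 1))
    return lines


tree = build_prefix_tree([["picture", "video"], ["picture", "audio"], ["audio"]])
print("\n".join(render(tree)))
```

Sequences that share a prefix share the corresponding branch, so common beginnings become visible as thick upper parts of the tree, while the entries with empty subtrees are the leaf nodes.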


Fig. 7. Overview Class Diagram

The interfaces are very important here, because they guarantee the extensibility of the program and the possibility of parallelizing it (e.g. with the fire-and-forget pattern, as described in [21]).

Graphs Directed and undirected graphs play a prominent role in computer science as well as in data visualization. The visual representation is easy to understand and neatly arranged, and if an adjacency matrix is provided alongside it, the user can read the data’s story unambiguously.

A graph always consists of nodes and edges. The starting node is called the root; it is marked either with a “1” or an “A”, or highlighted by an arrow. From this node, edges lead to other nodes. A node represents an object or a state, whereas an edge connects two nodes. Figure 10 on page 16 shows an example of a directed and an undirected graph in the context of sequential patterns.
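The adjacency matrix mentioned above can be derived directly from a logged sequence. The following Python sketch counts the transitions between consecutive events for the directed graph and merges both directions for the undirected one. The function names and the example sequence are assumptions for illustration and are not taken from the eScrapBook log files.

```python
def adjacency_matrix(sequence):
    """Count transitions between consecutive events (directed graph)."""
    nodes = sorted(set(sequence))
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for src, dst in zip(sequence, sequence[1:]):
        matrix[index[src]][index[dst]] += 1
    return nodes, matrix


def symmetrize(matrix):
    """Merge both directions of every edge (undirected graph); self-loops are kept once."""
    n = len(matrix)
    return [
        [matrix[i][j] if i == j else matrix[i][j] + matrix[j][i] for j in range(n)]
        for i in range(n)
    ]


nodes, directed = adjacency_matrix(["picture", "video", "picture", "audio"])
undirected = symmetrize(directed)
```

In the directed matrix, row i and column j hold the number of times event i was immediately followed by event j; the undirected matrix only records how often two events were neighbours, regardless of order.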

Scatter Plots A scatter plot is a graph of observed pairs of values that are plotted against each other on the axes. It is very often used for clustering in the field of data mining. Clustering algorithms try to find groups in the data by using two or more attributes. Figure 11 on page 17 provides an example with two attributes plotted.

Timelines There are different types of timelines, but they always consist of an axis that represents a process in chronological order. This axis should be labeled with timestamps to make its context easier to understand. Everything else is unspecified, so the designer is free to add any information that should be shown. As described in [24], the figures produced by this visualization are also very easy to compare with other patterns, e.g. those found by analyzing a different data subset (see Figure 13 on page 18). Figure 12 on page 17 offers an example. More detailed information and other forms of visualization are provided in [23].

Fig. 8. ER Diagram SPES project

Approach used One good solution in the special case of sequential patterns is to visualize the results in different layers of abstraction, because this provides a general overview of the whole dataset as well as detailed information on demand. The first and most abstract layer, which is the first visualization shown in the tool, uses timelines. In this layer all sequences are visualized at once. It is possible to show the sequences of the patients as well as the subsequences calculated by the “GSP” algorithm, giving a brief overview of the whole solution (see Figure 14 on page 19). The next layer visualizes only one timeline, enhanced with detailed information. This visualization focuses the user on just one subsequence. The tool offers a function to step through the sequences in ascending or descending order, see Figure 15 on page 20. Besides the subsequences, a sequence from one specific user can also be visualized. For the third and last layer the timeline was extended by an additional axis, in the same way as K. Vrotsou did in [16]. The user can decide which attribute the new axis shows, so this layer provides various visualizations for each dataset (for an example see Figure 16 on page 21). In addition to the layers, a mouse-over pop-up window provides background information, such as the corresponding patient group or the values of the other attributes. Figure 15 also provides an example of the mouse-over pop-up window.


Fig. 9. Tree visualization

This figure shows the result of the sequence pattern mining algorithm applied to all patients in Phase 2 and all patients in Phase 4 of dementia. A digital book containing individual pictures (e.g. of family members) was provided to them, and the time they spent looking at each picture was measured and used as the input for the sequence pattern algorithm. As can be seen, the patients in Phase 4 look at the pictures much longer than those in Phase 2.

This approach covers all the information provided by the results, is very easy to decode and captures the data’s story. Therefore, this type of Scientific Visualization supports users in gaining knowledge and understanding the concepts contained in the data.
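The difference between a plain timeline and a timeline extended by a further axis can be expressed as a small data transformation. The Python sketch below is only an assumption about how such a mapping might look; it does not reproduce the prototype's Swing drawing code, and the attribute names media and bookpage as well as the event values are hypothetical.

```python
def timeline_points(events, colour_by="media", y_attribute=None):
    """Return one (x, colour_key, y) triple per event; y is None for a plain timeline."""
    points = []
    for x, event in enumerate(events):
        y = event[y_attribute] if y_attribute is not None else None
        points.append((x, event[colour_by], y))
    return points


# Hypothetical sequence of one patient: three audio files in a row.
events = [
    {"media": "audio", "bookpage": 1},
    {"media": "audio", "bookpage": 1},
    {"media": "audio", "bookpage": 6},
]

flat = timeline_points(events)                             # plain timeline
layered = timeline_points(events, y_attribute="bookpage")  # timeline with a y-axis
```

On the plain timeline the three events are indistinguishable, whereas the additional axis separates the two events on page 1 from the one on page 6. Choosing a different value for y_attribute yields a different view of the same sequence.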

4 Testing and Evaluation of the Prototype Workflow

As already explained in Figure 4 on page 10, the data entered by the patient are first combined with other necessary information in a database. The prototype then connects to this database with the help of Weka, see Figure 17 on page 22. After that, the user has to select and edit the data, see Figure 18 on page 23. These first two steps require a modification of the Weka configuration file regarding the database used (e.g. database type and location); an introduction to this is given in [17]. To avoid that, an alternative way of performing the first two steps is also available, see Figure 19 on page 23. The chosen data are then printed in textual form, together with the attributes and their possible values, see Figure 20 on page 24. Next, the algorithm settings can be changed, see Figure 21 on page 24. The last part in which Weka is used is the presentation of the textual output of the sequences found by the “GSP” algorithm, see Figure 22 on page 25. This output is hard to decode, but it provides a coarse overview of the uncovered patterns. The part provided by Weka ends here. Its textual outputs (the selected data and the patterns found) are parsed and passed to the visualization component. The first layer is displayed and the visualization part described in Section 3.4 begins. In general, every user interaction is sent to the controller component, which computes the requested data and sends them back to the visualization tool. This functionality is inspired by the Model View Controller (“MVC”) pattern, described in [25].

Fig. 10. Visualization with graphs

At the top left, a sequence from the eScrapBook log files is given and visualized with a directed and an undirected graph. This visualization does not reveal much detailed information, but it gives an abstract overview of the patterns.

Synthetic test data sets can be generated in several ways. In general there are three types of test data generators [26]:

1. Random test data generators.
2. Path oriented test data generators.
3. Goal oriented test data generators.

IBM, for example, developed a data generator for a set of algorithms. Since IBM invented the “GSP” algorithm, this data generator also provides the functionality to generate synthetic sequential data. This software is open source. A detailed description of the IBM data generator is given in [27]. A second example is the PSDG (“parallel synthetic data generator”), a powerful high-speed data generator that produces very large data sets using cluster computing. A description of it can be found in [28]. In addition to the data generated by the eScrapBook, I decided to use a simple random test data generator.
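A random test data generator of the first type can be very small. The following Python sketch is not the generator used for the evaluation and not the IBM generator; it is a minimal illustration under assumptions. The function name, its parameters and the event alphabet are hypothetical.

```python
import random


def generate_sequences(n_sequences, min_len, max_len, alphabet, seed=None):
    """Draw n_sequences event sequences with uniformly random length and content."""
    rng = random.Random(seed)  # a private generator keeps the runs reproducible
    return [
        [rng.choice(alphabet) for _ in range(rng.randint(min_len, max_len))]
        for _ in range(n_sequences)
    ]


alphabet = ["picture", "video", "audio"]
data = generate_sequences(5, 3, 8, alphabet, seed=42)
for sequence in data:
    print(sequence)
```

Fixing the seed makes a test run repeatable, which is useful when the textual output and the visualization of the same data set are to be compared. Purely uniform data rarely contain strong patterns, so such a generator mainly serves to check that the tool runs correctly, not to evaluate the quality of the patterns found.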


Fig. 11. Scatter plot example

Each circle represents one specific file in the book, for example a picture named Holidays 1999:04:07.png. The yellow circles are pictures, the red ones are videos and the blue ones audio files. The patients looked up the entries with different frequencies. The plot shows how often videos, pictures and audio files in the book are chosen by men and women.

Fig. 12. Timeline example

Here a sequence from a patient is visualized as a timeline. Each color represents one specific medium from the eScrapBook that this patient looked up.


Fig. 13. Compare Timelines

As you can see, timelines are very easy to compare, and a lot of information can be read from this kind of visualization. This figure allows the sequences of multiple patients to be compared visually.


Fig. 14. First layer

Here all subsequences found are visualized. The number of events in each pattern is given on the left side. As you can see, several subsequences may have the same length. At the top right of the program, the user can switch between the visualization of the sequences and that of the patterns. In the menu below, he can choose which attribute is shown, and at the bottom right a legend describes the meaning of the colors.

A comparison between the textual output of Weka and the visualized data has led to the conclusion that the tool works as expected.

An example scenario would be: a domain expert receives a new dataset from a group of patients suffering from dementia. He starts the tool and loads the new data. Next, he edits the data and selects the attributes he is interested in. He runs the sequence pattern algorithm and presses the “Visualize” button. First, he inspects the first layer to get a brief overview of the whole dataset and the patterns found. After that, he may examine concrete sequences or patterns that seem to be of interest. In addition, he extends the currently viewed pattern or sequence with a y-axis to gain more information (e.g. the book page). By following these steps the domain expert can explore the results of the data mining algorithm in order to gain new insights.

The “GSP” algorithm has many modifiable options, and even a small change to some of the settings can change its output completely. Two settings make the biggest difference. The first one is “min Support”, which specifies the percentage of sequences that must contain the same subsequence for it to be recognized as a pattern. Modifying this option by just ten percent changes the number of subsequences found by a factor of at least twenty. The second one is the attribute that has to occur inside the subsequence. This setting can be left out, or one or more attributes can be selected at once. This also changes the output considerably.
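The role of the minimum support can be made concrete with a small worked example. The Python sketch below is a simplification under assumptions: every element of a sequence is a single event, time constraints are ignored, and the candidate patterns are given by hand instead of being generated level by level as “GSP” does. The function names and the data are hypothetical and do not reflect Weka's implementation.

```python
def contains(sequence, pattern):
    """True if pattern occurs in sequence in the same order, gaps allowed."""
    it = iter(sequence)
    # "event in it" advances the iterator, so matches must appear in order.
    return all(event in it for event in pattern)


def support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


def frequent(sequences, candidates, min_support):
    """Keep the candidates whose support reaches the threshold."""
    return [p for p in candidates if support(sequences, p) >= min_support]


sequences = [
    ["picture", "video", "audio"],
    ["picture", "audio"],
    ["video", "picture"],
    ["picture", "video"],
]
candidates = [
    ("picture",),
    ("picture", "video"),
    ("picture", "audio"),
    ("video", "audio"),
]

for threshold in (0.25, 0.5, 0.6):
    print(threshold, frequent(sequences, candidates, threshold))
```

In this toy data set all four candidates pass a threshold of 0.25, three pass 0.5 and only one passes 0.6. Since every frequent pattern is also a building block for longer candidates, the effect of the threshold on real data is considerably stronger than in this hand-made example.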


Fig. 15. Second layer.

Here you see one specific, selected pattern of length 9. To switch between the patterns, the buttons “prev.” and “next” on the right side are provided. A mouse-over pop-up window with further information is also shown. To switch to the next layer, the user has to tick the checkbox “Add Y-Axis”, which also activates the options beneath it and makes it possible to select the attribute for the y-axis.

The output of Weka in this case is purely textual, but the number of subsequences found, the attributes used and all other necessary information are shown at the very beginning. The extended visualization, which was added in the context of this thesis, provides a better way to analyze the patterns. I think it is best to consider both the textual and the visual solution. If the number of patterns found gets too large, finding interesting subsequences becomes a very hard undertaking.

5 Conclusions

5.1 Results Achieved and Discussion

Weka is a very performance-oriented tool for data mining. It provides many data mining algorithms as well as utilities for connecting to and processing databases, editing datasets, and various other helpful extensions. In the special case of the “GSP” algorithm, Weka does not provide a visualization of the results, which is absolutely necessary to find interesting information within the patterns. As said before, the main goal of this thesis is to give technical support to domain experts in dementia research. I think the prototype is a good start towards that goal. However, the user of this tool has to know something about databases and about Weka, at least some database commands like “SELECT attribute FROM table”.


Fig. 16. Third layer.

Here a timeline was extended by a y-axis. It is not so easy to compare two sequences when using this visualization, but much more information can be gained from it. Here, for example, the book page of the picture can also be read. In a normal timeline you would be able to see that the patient looked up audio files three times, but with the additional axis you recognize that it was the same file on page 1 twice and the file on page 6 once.


Fig. 17. Connection to the database.

If the “GSP” algorithm finds too many subsequences, it may be very hard to analyze the resulting dataset, even though it consists of patterns. In that case, I recommend taking the results and applying another data mining algorithm in Weka to gather knowledge from the extracted patterns (see Section 5.2). The conceptual elaboration of the prototype and the overall platform was successful, and I found a generic solution for a software architecture that makes it easy to modify and extend the functionality of the prototype. A working prototype was implemented, and the evaluation and tests have shown that all specified goals of this thesis were achieved.

5.2 Future Work

A comparison of the “GSP” algorithm with other sequential pattern mining algorithms, such as the modified algorithm from [15] or “SPADE” (an Apriori-based sequential pattern mining algorithm using a vertical data format) described in [12], would be very interesting. Since the output of the provided tool is sometimes a very large set of data, an extension of the functionality would also be of interest. For example, a further algorithm that investigates the output of the “GSP” algorithm in terms of how the patterns belong together would be one way to solve the problem of large result sets. A simple extension of the tool to other data mining algorithms (e.g. clustering, association rules, classification or regression) with the associated visualizations could also help to obtain new information for the domain experts. Since the data produced by the patients were not very large, a parallelized version of the “GSP” algorithm was not necessary. But to make the tool applicable to a larger range of data beyond the “eScrapBook” log files, an investigation of its parallelizability and its performance would be helpful. An alternative to Weka would be “RapidMiner”, which is also a common data mining and predictive analysis tool. A comparison of the effectiveness of “RapidMiner” and “Weka” would also be of interest. A detailed description of “RapidMiner” is given in [29].
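The follow-up algorithm for large result sets is left open here. One simple possibility, offered purely as an illustration and not as the method proposed in this thesis, is to keep only the maximal patterns, i.e. those that are not contained in any longer pattern of the result set. The Python sketch below assumes that patterns are tuples of single events; all names and example patterns are hypothetical.

```python
def is_subpattern(small, big):
    """True if small occurs in big in the same order, gaps allowed."""
    it = iter(big)
    return all(event in it for event in small)


def maximal_patterns(patterns):
    """Drop every pattern that is contained in another, longer pattern."""
    unique = list(dict.fromkeys(patterns))  # remove duplicates, keep the order
    return [
        p for p in unique
        if not any(p != q and is_subpattern(p, q) for q in unique)
    ]


patterns = [
    ("picture",),
    ("picture", "video"),
    ("picture", "video", "audio"),
    ("audio", "picture"),
]
reduced = maximal_patterns(patterns)
print(reduced)
```

The shorter patterns can always be recovered from the maximal ones, but their individual support values are lost, so such a reduction is better suited to a first overview than to a detailed analysis.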


Fig. 18. Editing and selection of the data.

In the first selection menu the group of patients is selected. There is a large set of different groups to select from, such as patients in different phases of their illness, gender groups, age groups, etc. The other two selections refer to attributes. Of course, the “GSP” algorithm could handle more than two attributes, but in this context one or two choices are enough to analyze the data efficiently.

Fig. 19. Second possibility to connect to the database and select the data.

Extending the prototype with this second possibility makes it possible to use it for data sources other than the log files of the “eScrapBook”.

While working on this thesis I had the idea of aggregating all collected data about persons affected by dementia in the cloud and of applying the analysis methods, as well as the transformation of the results into visualizations, via cloud computing. This could lead to a worldwide collaboration of researchers and domain experts, might result in more discoveries and might even help to slow down the course of the disease.


Fig. 20. Textual output of the data.

Fig. 21. Settings for the algorithm.

It is possible to save the selected options so that they can be reopened in the next run. The parameters selected in this figure are the defaults.


Fig. 22. Textual output of the “GSP” algorithm.

As you can see, the textual output alone is not enough to get an overview of the data. A visualization of these results can resolve this problem.


References

1. Stefanov, D.: The smart house for older persons and persons with physical disabilities: structure, technology arrangements, and perspectives. IEEE (June 2004)

2. Slimani, T., Lazzez, A.: Sequential mining: Patterns and algorithms analysis. CoRR abs/1311.0350 (2013)

3. Brezany, P., et al.: Management and analysis of big data produced in brain disorder rehabilitation (2014)

4. Perala, S., Makela, K., Salmenaho, A., Latvala, R.: Technology for elderly with memory impairment and wandering risk. E-Health Telecommunication Systems and Networks 2 (2013) 13

5. Douglas, A. JAMA (February 1995)

6. Spes project. http://www.spes-project.eu/index.php?id=0 Accessed: 2014-11-25.

7. Oldes project. www.oldes.eu Accessed: 2014-11-25.

8. Johansson, N., Lofgren, A.: Designing for extensibility: An action research study of maximizing extensibility by means of design principles. (2009)

9. Elsayed, I., et al.: Aba-cloud – support for collaborative breath research. (2013)

10. Fiori, C.: Reformvorschläge für das deutsche Rentensystem. GRIN Verlag (2002)

11. Froestl, H., Bickel, H., Kurz, A.: Alzheimer Demenz: Grundlagen, Klinik und Therapie. Springer Berlin Heidelberg (1999)

12. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann (2006)

13. Srikant, R., Agrawal, R.: Mining sequential patterns: Generalizations and performance improvements. In: Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology. EDBT ’96, London, UK, Springer-Verlag (1996) 3–17

14. Dilch, D.: Optimierung von life sciences algorithmen für gpus mit cuda/opencl. Master’s thesis, Universität Wien (2013)

15. Kumar, V., et al.: A universal formulation of sequential patterns. (1999)

16. Vrotsou, K.: Everyday mining: Exploring sequences in event-based data. Linköping Studies in Science and Technology. Dissertations No. 1331, Linköping University, Sweden (2010)

17. weka. http://weka.wikispaces.com Accessed: 2014-11-20.

18. Agrawal, R., Srikant, R.: Mining sequential patterns. In: Data Engineering, 1995. Proceedings of the Eleventh International Conference on. (Mar 1995) 3–14

19. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The WEKA data mining software: an update. ACM New York (June 2009)

20. Pfahringer, B.: Weka: A tool for exploratory data mining. University of Waikato, New Zealand (2007)

21. Subramaniam, S., Loh, G.H.: Fire-and-forget: Load/store scheduling with no store queue at all. In: Microarchitecture, 2006. MICRO-39. 39th Annual IEEE/ACM International Symposium on, IEEE (2006) 273–284

22. Schumann, H., Muller, W.: Anforderungen an eine visualisierung. In: Visualisierung. Springer Berlin Heidelberg (2000) 5–13

23. Brezany, P., Ivanov, R.: Advanced visualization of data mining and olap results. Technical report (August 2005)

24. Vrotsou, K., Ynnerman, A., Cooper, M.: Seeing beyond statistics: Visual exploration of productivity on a construction site. In: Visualisation, 2008 International Conference. (July 2008) 37–42


25. Bucanek, J.: Model-view-controller pattern. Learn Objective-C for Java Developers (2009) 353–402

26. Ferguson, R., Korel, B.: The chaining approach for software test data generation. ACM Trans. Softw. Eng. Methodol. 5(1) (January 1996) 63–86

27. Agrawal, R., Srikant, R.: Mining sequential patterns. In: Data Engineering, 1995. Proceedings of the Eleventh International Conference on, IEEE (1995) 3–14

28. Hoag, J.E., Thompson, C.W.: A parallel general-purpose synthetic data generator. ACM SIGMOD Record 36(1) (2007) 19–24

29. Burget, R., Karasek, J., Smekal, Z., Uher, V., Dostal, O.: Rapidminer image processing extension: A platform for collaborative research. In: The 33rd International Conference on Telecommunication and Signal Processing. (2010)

A Appendix

This appendix contains additional information as mentioned before.


Fig. 23. Attribute Table - Part 1

Fig. 24. Attribute Table - Part 2

