
A Hierarchical Semantically Enhanced Multimedia Data Warehouse

Andrei Vanea, Rodica Potolea
Technical University of Cluj-Napoca
Cluj-Napoca, Romania
[email protected], [email protected]

Abstract - Data warehouses are used in many domains. Their purpose is to store historical data and to assist in the decision making process. Multimedia data warehouses are used for storing files which contain text, graphics, video and sound. These kinds of files are produced in large quantities in fields such as medicine or space research. We propose a framework for building such a data warehouse, structuring the data in a way that is familiar to the warehouse user. We present a hierarchical way of structuring the data and the information extracted from it. We also propose a method of semantically enhancing the data and the information extraction process through the use of hierarchical metadata.

I. INTRODUCTION

A data warehouse is a repository of electronic data, designed for storing, aggregating and summarizing the data. Operational databases are used for storing daily business transactions, preserving data integrity and providing fast access to data. The data model used with success in operational databases is the relational model, which follows Codd's normalization rules. Data warehouses focus on business processes and the entities that describe them. They store historical data from all the areas of a business, rather than from a single department, the way operational databases do. As an example, an operational database stores the data produced at a specific store in a chain of stores, while a data warehouse stores the data produced at all the stores of that chain. Data warehouses are also used in the decision making process, and are therefore optimized for fast analysis of data. Usually, dimensional modeling is used in data warehouses. Dimensional modeling is a logical design technique in which many tables, called dimensions, describe a central table, called the fact table, i.e. the central table references them. A fact table is the primary table in a dimensional model, where the numerical performance measurements of the business are stored. Dimension tables are integral companions to a fact table and contain the textual descriptors of the business [1]. There is almost always a way of describing the data in the data warehouse, and this is done by using metadata, which is data that describes (central) data [2]. Sometimes, in order to ensure the significance (meaning) of the data, meta-metadata is used to describe the metadata itself [3]. Multimedia data is complex data in different formats: text, graphics, video, sound [4, 5]. Multimedia files capture different events or different descriptions of the same event. Therefore, the multimedia data needs to be stored so that it can later be processed and analyzed.

Although data warehouse technology for numerical and symbolic data is considered to be mature [5], there is much to do with regard to complex, multimedia data warehousing [6]. We humans can relatively easily extract information from different types of multimedia objects, such as text files, images and sounds. But systematic information extraction through an automated process needs adequate techniques, specific to each type of multimedia object from which knowledge is extracted. Why, if we can do it ourselves, do we have to bring computers into the knowledge extraction process? The answer is simple: because of the large amount of multimedia data that is created and captured. We humans do not have the ability to process such complex data in depth, nor to detect and extract knowledge that might be hidden in it. This can only be done with the aid of knowledge discovery processes, which are systematic and (semi-)automated.

Many complex fields, such as medicine, space research or weather forecasting, acquire data in many formats: text, audio, video. There is a need to store, retrieve and process this complex data, and one of the best solutions is the use of data warehouses. Traditional DBMSs are not really suited for complex, multimedia data, because relational databases require that the data they store have structure, whereas multimedia data is often semi-structured. New ways of storing and processing multimedia data therefore had to be developed and adopted. Because XML can represent any kind of structure, it became the self-evident way in which multimedia data could be stored and handled. Some DBMSs support XML files and others are XML-native, such as eXist. The existence of XML-based languages, such as XPath and XQuery, has further improved the use of XML as a multimedia data storage technology.
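As a brief illustration (ours, not the paper's) of the description-based access that XPath enables, the following minimal sketch queries a hypothetical XML dimension file with Python's standard library; the file name and element layout are assumptions, not the paper's actual schema:

    # Minimal sketch: description-based retrieval over a hypothetical
    # XML dimension file, using an XPath predicate supported by the
    # Python standard library.
    import xml.etree.ElementTree as ET

    tree = ET.parse("image_dimension.xml")   # hypothetical dimension file
    # Select all image records stored in PNG format.
    for record in tree.getroot().findall(".//image[@format='png']"):
        print(record.get("path"), record.get("size"))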


II. RELATED WORK

Data warehouses are now widely used in many fields, from economics to medicine and weather forecasting [1]. At the beginning, there were mostly numerical and textual data warehouses, primarily in business operations such as economics, marketing and sales. The large amount of data generated by these businesses was successfully integrated using the dimensional model. The relational model, used for regular storage in OLTP databases, was not suited for the decisions that had to be pulled from the existing data, because it is not optimized for aggregating the large amounts of data stored in data warehouses. The fact tables store summarized, aggregated data, which is later used by OLAP tools to aid in the decision making process. Symbolic objects have also become widespread in data warehouse environments. These objects are mostly character strings, and may be found in surveys and questionnaires [7, 8].

In [6] the authors describe a data warehouse for complex objects. The semi-structured format of these objects is captured via XML files, which are then parsed and validated against a minimum-requirements pattern. The communication between the user and the data warehouse is accomplished using the XQuery language, a language for XML-like structured data. A medical data warehouse focused on ECG signal recordings, containing image data and symbolic data, is described in [5].

Retrieving multimedia data from a database or data warehouse can be done in two ways: by content or by description [5]. Description-based retrieval uses attribute descriptions of the data (color, audio/video duration, number of instances a particular word is used), while content-based retrieval uses the actual data inside the files (clouds, ideas, theories). When dealing with multimedia, it is helpful to separate the types of media files during data retrieval or processing. Storing these files in a hierarchical way is one solution, as presented in [9]. It is important to understand the significance of the data stored in the warehouse, so the use of metadata is a crucial part of the data warehouse system.

Current trends are to semantically enhance the representation of data stored in warehouses. In [10] the authors propose a method to semantically translate conceptual models into their platform-specific counterparts, by using an OLAP algebra. The authors of [11] have built a data warehouse with two ontologies: one for the specific business terms and one for the technical terms specific to the aggregation and knowledge extraction tools. This requires a one-time collaboration between the business experts and the data warehouse designers, to produce a mapping between the two ontologies. As a result, whenever a new query is requested by the business analysts, the warehouse administrator can quickly create the appropriate data mart, without the need for long and repetitive meetings between the two expert teams.

In [12] the authors implement a system in which they analyze multimedia data, medical in nature, in order to extract knowledge from it and to assist the physicians.

III. THE PROPOSED MODEL

One problem that is more frequent in multimedia data warehouses than in numerical and symbolic warehouses is answering questions like "which are the entities that have some particular features?", which in a complex object warehouse focusing, for example, on medical records could translate to something like "which are the people that have had a heart attack?".

Our work focuses on creating a multimedia data warehouse which can represent the data in a familiar, top-down way, and process the stored data by knowing its connections and dependencies. The system must minimize human intervention in creating needed facts and dimensions that were not considered during the design and implementation processes. To instantiate our system we build a medical data warehouse.

The proposed model takes into consideration the fact that metadata for complex objects refers to information and descriptions of such things as file format, size, location, number of words, number of lines, width, height, (video) length, and so forth, thus needing a whole new way of representing metadata and the connections between data and the metadata describing it. The model also aims at improving the way a question like the one already presented can be answered.

Another goal of our model is to ensure the semantic value of metadata. Most data warehouses use metadata to help the user understand the stored data, making it easier to select the appropriate tools for summarizing, reporting or analyzing the data. As an example, consider a (multimedia) data warehouse used in a complex field, such as finance or medicine. If the beneficiaries want to get some results that the system was not designed to extract, they need to contact the data warehouse administrator(s). But most of the time, the administrators are experts neither in economics nor in medicine; they are specialized in computer science, or at least in database/data warehouse maintenance. So, a lot of time will be spent before the administrator understands the needs of the beneficiaries and, by using the metadata, creates the appropriate queries or data marts. With the use of rich semantic metadata, the system can resolve such requests automatically.

A. SYSTEM ARCHITECTURE

We structured our data warehouse in five blocks: the ETL tools block, the warehouse block, the semantic metadata block, the processing and metadata maintenance block and the query processor block (Fig. 1).


Figure 1. The system architecture, containing the five blocks: ETL Tools, Data Warehouse, Metadata, Query Processor, and Processing and Metadata Maintenance Tools.

1) The ETL Tools Block

The ETL tools block acquires the data and prepares it to be stored in the warehouse. It checks the type of the file that will be stored and gathers specific (meta)features of the data, such as name, file length or format. The information acquired in this step is loaded into XML files which store the dimensions and characteristics of the files.

2) The Metadata Block

To assure the semantics of the data, we use two repositories of metadata: one describing the terms specific to the business domain and one describing the technical terms that the system can extract and process. We also store the mappings between the business terms and the technical terms. These are gathered by both business and technical specialists at implementation time. All the information within these two repositories is represented in a hierarchical manner, using XML files and XML elements: lower level items are nested in upper level items. In this way, the query processor can resolve a query on a high level item by breaking it into lower level items. By finding out which items characterize (influence) other items, the query processor can access the corresponding fact tables. If the desired fact table does not exist, the query processor checks whether another existing fact table (or tables) can provide significant data for the computation of the query result. If not, the query builder can create the appropriate fact table, using the semantic metadata provided in the repositories. This creates a dynamic environment, which does not need the intervention of the data warehouse designer or administrator.
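To make the nesting idea concrete, here is a minimal sketch (ours; the tags and term names are invented for illustration) of how a high-level term could be broken into the lower level items nested under it:

    # Sketch: resolving a high-level term by recursively descending into
    # the nested lower-level terms. The XML layout is an assumption,
    # not the paper's actual schema.
    import xml.etree.ElementTree as ET

    METADATA = """
    <term name="lung_health">
      <term name="airflow_shape">
        <term name="fv_line_concavity"/>
      </term>
      <term name="peak_flow_reached"/>
    </term>
    """

    def leaf_terms(element):
        """Return the lowest-level terms that characterize `element`."""
        children = element.findall("term")
        if not children:                      # already a leaf item
            return [element.get("name")]
        leaves = []
        for child in children:                # break into lower-level items
            leaves.extend(leaf_terms(child))
        return leaves

    root = ET.fromstring(METADATA)
    print(leaf_terms(root))   # ['fv_line_concavity', 'peak_flow_reached']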

3) The Warehouse Block

We further propose a similar construct for the actual data warehouse, by modifying the classical dimensional model. Our model relies on the hierarchical composition of features. The block is composed of two other blocks, one containing the dimensions of the system and one containing the facts.

The facts block communicates with the dimensional block, which provides the dimensional data needed to extract the fact data. It contains one data mart for each type of multimedia data that is managed by the system: text, image, video, audio and database. Each data mart contains a hierarchy of facts, from level 1 to level n. So, in a way similar to metadata attributes, which depend on the attributes at a lower level, each fact table may be referenced by the fact table at the next level (Fig. 2). This means that each fact table becomes a dimension table for the upper level fact table, and the dimension tables become support tables. Each fact table is linked with the corresponding level of metadata, to allow faster access to the right fact table(s) needed to answer the query.

Figure 2. The layering of fact tables. Some facts may depend on other facts, i.e. they see other facts as dimensions (left), but some may not (right).
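A level-2 fact file under this scheme might look like the following sketch (purely illustrative element names and values; the paper does not publish its schema), where one of the dimension references points at a level-1 fact file:

    # Sketch of a level-2 fact file that references a level-1 fact as
    # one of its dimensions; all names and values are illustrative only.
    import xml.etree.ElementTree as ET

    LEVEL2_FACT = """
    <fact level="2" measure="mean_pef_percentage" value="87.5">
      <dimension ref="patient_dimension.xml" key="age_group_6_10"/>
      <dimension ref="pef_facts_level1.xml" key="pef_percentage"/>
    </fact>
    """

    fact = ET.fromstring(LEVEL2_FACT)
    # The second reference treats a lower level fact file as a dimension.
    print([d.get("ref") for d in fact.findall("dimension")])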

4) The Processing and Metadata Maintenance Tools Block

The processing and maintenance block is made up of the tools needed to compute aggregations of the data and to manipulate the metadata. Aggregation tools operate on the different types of media supported by the data warehouse. The metadata maintenance tools allow editing the metadata repositories, as well as the mappings between the business repository and the technical one. The mappings are directly influenced by what the processing tools can compute and extract in technical terms.

5) The Query Processor

The query processor acts like a controller. It is connected to the other blocks and resolves the semantics of the query that the user inputs. After the dependencies of the query are computed, it selects the corresponding aggregation tools, based on the technical mappings.

IV. MODEL IMPLEMENTATION

We build our system according to the architecture and the model proposed. The particular domain for which we instantiated the implementation is the medical one, more specifically pneumology (pulmonology). The data acquired so far is represented in two main formats: symbolic (text) and images. Symbolic data is used for storing the patient's name, id, date of birth and gender, and whether they are smokers, non-smokers or former smokers; it may also store the physician's comments about the medical state of the patient. Two types of time series are stored in the data warehouse: the first one contains the amount (volume) of air exhaled by the patient over time, and the second one contains the flow of the air volume.

The data warehouse user can view single patient data or submit a query. The query parameters that need to be specified are the type of aggregation requested, the domain specific term on which the aggregation function is to be applied and the characteristics of the data that is going to be processed.

The ETL tools in the ETL block of the system get the operational data and transform it into the structure used by the data warehouse. Each dimension is stored in one XML file, so for every new patient, or new data for an existing patient, a new record, i.e. an XML node, is appended to the appropriate XML dimension. If a new image containing a graphical representation of the functional respiratory tests of a patient is extracted, its characteristics, such as file name, format, size, path, type of time series and corresponding patient, are stored in specific dimensions.
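The append step could look like the following minimal sketch (our illustration; the file, tag and attribute names are hypothetical):

    # Sketch: the ETL step appending a new record (an XML node) to an
    # existing XML dimension file when a new test image arrives.
    import xml.etree.ElementTree as ET

    tree = ET.parse("image_dimension.xml")    # hypothetical dimension file
    record = ET.SubElement(tree.getroot(), "image")
    record.set("patient_id", "P-0042")
    record.set("format", "png")
    record.set("series", "flow-volume")       # type of time series
    record.set("path", "/data/images/P-0042_fv.png")
    tree.write("image_dimension.xml", encoding="utf-8")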

All the parameters that the user can select from to create the query are already defined in the metadata repository. The medical terms repository stores the terms that are supported by the warehouse in an XML file. Every medical term can be used to describe another medical term, i.e. it influences another term. This is accomplished by nesting medical terms (i.e. XML nodes) inside upper level terms. Every medical term has either a direct mapping to a technical term or an indirect one, via transitivity. The same property holds for the technical terms. The structures of the technical and medical metadata repositories are similar. An XML file contains the direct mappings between two terms, one belonging to each domain. Each XML node represents a mapping and has two child elements, representing the matching medical and technical terms. This mapping is viewed as the way in which a particular medical (i.e. business) term is represented as information. Therefore, knowing the medical term (issue) that the user is interested in, and how it is represented, the system can select the appropriate internal functions for analyzing the existing data.
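A sketch of such a mappings file and its lookup (the tags and terms are our assumptions, not the paper's actual vocabulary):

    # Sketch: reading a mappings file in which each mapping node has one
    # medical and one technical child element, as described above.
    import xml.etree.ElementTree as ET

    MAPPINGS = """
    <mappings>
      <mapping>
        <medical>exhaled air volume</medical>
        <technical>fvc_value</technical>
      </mapping>
      <mapping>
        <medical>airway obstruction sign</medical>
        <technical>fv_line_concavity</technical>
      </mapping>
    </mappings>
    """

    root = ET.fromstring(MAPPINGS)
    direct = {m.findtext("medical"): m.findtext("technical")
              for m in root.findall("mapping")}
    print(direct.get("exhaled air volume"))   # -> fvc_value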

Each fact level of each data mart is stored in a different XML file. In this way, we achieve the hierarchical structure of the data marts, in which a fact table may become a dimension for another fact table. Such a fact file contains the result of the aggregation and at least one set of references for each dimension used in the aggregation process. To speed up the process of automatically creating new queries, an XML file is created containing the technical terms associated with every existing fact table.

After the query parameters are submitted by the user, the query processor checks the medical term for an existing technical mapping in the mappings XML file. If a mapping is not found, the query processor looks for other medical terms that describe the current medical term and then checks for a mapping for all these lower level terms. This process is repeated until all the medical terms that describe the current query have a mapping to a technical term. After this step, the query processor checks whether the technical mappings found in the previous step have an associated fact table, i.e. a fact table that already stores the needed aggregation. If no satisfactory fact table is found, it then checks for existing aggregation tools which can extract the needed values from the data. A process similar to the one for the medical terms is implemented for the technical terms; the difference is that the query processor searches an XML file containing a mapping between a technical term and an existing processing procedure. Once the technical mappings have been fully resolved, the query processor computes the aggregation and, if no similar fact table exists where it can store the result, it automatically creates the fact table. The computed result is presented to the user.
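The control flow just described might be sketched as follows (toy in-memory stand-ins for the XML repositories; every name and value is invented for illustration):

    # Sketch of the resolution loop: map a medical term directly, descend
    # to lower level terms when no direct mapping exists, then reuse a
    # stored fact table or compute and store a new aggregation.
    DIRECT = {"fvc reached": "fvc_reached_flag",      # medical -> technical
              "fev1 reached": "fev1_reached_flag"}
    LOWER = {"restrictive disease sign": ["fvc reached", "fev1 reached"]}
    FACTS = {"fvc_reached_flag": 0.81}                # stored aggregations

    def resolve(term):
        """Map a medical term to technical terms, descending when needed."""
        if term in DIRECT:
            return [DIRECT[term]]
        return [t for sub in LOWER.get(term, []) for t in resolve(sub)]

    def compute_aggregation(tech):
        return 0.0                                    # placeholder tool

    def answer(term):
        out = {}
        for tech in resolve(term):
            if tech not in FACTS:                     # aggregate on demand
                FACTS[tech] = compute_aggregation(tech)
            out[tech] = FACTS[tech]
        return out

    print(answer("restrictive disease sign"))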

V. EXPERIMENTAL RESULTS

A. DESCRIBING THE DATA

Figure 3. (a) A typical air flow graph for a healthy patient; (b) a typical air flow graph for a sick patient: the PEF and FVC are not reached and the FV line is concave.

In defining a system to assist physicians in taking better medical decisions, an important step is to correctly understand the particularities of the problem under investigation and to identify the features that trigger a decision. Therefore, we should first detect the end user needs and transform them into possible knowledge that our data warehouse could offer. Such knowledge is extracted from basic information about patients and from the lung performance test results stored by the system. This knowledge could help in identifying existing lung problems via the data mining process. Such lung problems are encoded in the recorded images of respiratory tests. Fig. 3a shows a record of a flow-volume test on a healthy patient and Fig. 5a shows a record of a volume-over-time test, also on a healthy patient. With the domain expert (i.e. lung physician) we identified the significant aspects of the image that are good indicators of lung health. They are transformed into features characterizing the image, which represent the rough data stored in the data warehouse. Moreover, this allowed creating the list of mappings between the medical and technical terms. Automating the information extraction process is beneficial because: (1) manual extraction is time consuming, and (2) a domain expert barely handles large and complex data sets and might not (easily) see hidden relations among the components.

Our first investigations led us to a number of 35 features: 23 numeric, 7 Boolean and 5 nominal.

The data recorded consists of image data and symbolic and numeric data. The basic symbolic and numeric data relates to the known information about the patients, data that is usually found in medical records. In our particular problem under investigation, the specified medical records are: patient id, patient name, date of birth, weight, height, sex, smoking history and race (sometimes, the race influences the normal values that a test on a healthy patient should produce).

To identify patterns, help in determining the diagnosis and propose a treatment plan, a classification process is required. In order to do so, a set of features should be extracted from the images. The image data (Fig. 3a and 5a) presents graphs indicating features related to inhaled and exhaled air flow.

Figure 4. The dimensions of the system.

The first image type (Fig. 3a) plots the flow of air of a patient over the volume of exhaled and inhaled air. These features are presented below and they are all stored as numerical values:

- an angle formed by the ordinate and the first section of the graph;
- the Peak Expiratory Flow (PEF), the maximal flow (or speed) achieved during the maximally forced expiration initiated at full inspiration, measured in liters per second;
- the Normal Peak Expiratory Flow (NPEF), the computed normal value of the PEF for a healthy patient, given particular features such as weight, height, age, etc.;
- an angle between the initial curve of the exhaled air and the second portion of the graph;
- the Forced Expiratory Flow at 25-75% (FEF25-75), the average flow (or speed) of the air coming out of the lung during the middle portion of the expiration, measured in liters per second;
- the Normal Forced Expiratory Flow at 25-75% (NFEF25-75), the computed normal value of the FEF25-75 for a healthy patient, given particular features such as weight, height, age, etc.;
- the Forced Vital Capacity (FVC), the volume of air that can forcibly be blown out after a full inspiration, measured in liters;
- the Normal Forced Vital Capacity (NFVC), the computed normal value of the FVC for a healthy patient, given particular features such as weight, height, age, etc.;
- the Flow-Volume line (FV line), which plots the way the air is exhaled, from the PEF to the FVC.

Regarding the second type of image (Fig. 5a), which plots the air volume over time, the features of interest for the diagnosis process, in combination with the existing air flow image features, are these numerical values:

- the Forced Expiratory Volume in one second (FEV1), the maximum volume of air that can be forcibly blown out in the first second of the FVC maneuver;
- the Normal Forced Expiratory Volume in one second (NFEV1), the computed normal value of the FEV1 for a healthy patient, given particular features such as weight, height, age, etc.

The FVC, FEV1 and PEF have specific normal values, computed for each (healthy) individual according to height, age, sex, and sometimes race and weight. These characteristics are stored in the dimensions of the data warehouse (Fig. 4). The normal values are present in the images acquired by the ETL tools and stored in the data warehouse. Another interest is in determining whether the PEF and FVC features have reached the computed normal values (i.e. the corresponding values for the specified features on a healthy patient with the given characteristics). In the ETL step these conditions are checked and the results are stored in a dimension particular to the air flow images. If the computed normal values are not reached, we compute the percentage of the measured value with respect to the predicted one.

Figure 5. (a) A typical air volume graph for a healthy patient; (b) a typical air volume graph for a sick patient: the FEV1 is not reached.

Fig. 3a presents a typical air flow graph for a healthy patient. The angles are very small, the computed normal values for the FVC and PEF are reached, and the FV line is straight. In Fig. 5a we can notice that the value for the FEV1 is also reached.

Fig. 3b presents a typical lung problem air flow image. One can observe that the PEF and the FEV1 were not reached and that the FV line is concave, which indicates signs of illness and lung problems. There are three types of lung problems that could be identified from such images:

- obstructive lung disease: concave FV line and FEV1 not reached;
- restrictive lung disease: FVC and FEV1 not reached;
- mixed lung disease: PEF, FEV1 and FVC not reached.

Some factors that are technical in nature might influence the relevance and interpretability of the images containing the air flow. An incorrect test might not yield the most accurate information about the patient's health. An example is the way in which the patient blows out the air. If they blow out the air in one continuous stream, the test is successful, but if they stop exhaling and inhale for short periods, the test is not accurate. This is indicated by the shape of the FV line, i.e. whether it is smooth (continuous exhaling) or presents many spikes (exhaling and inhaling intermittently). If it is not smooth, then the test is not as relevant as it should be.

For features such as whether the computed values are reached, the smoothness of the curve of the exhaled air and the concavity of the graph, we build specific technical tools which we map to the corresponding technical terms. These technical terms are then mapped to their corresponding medical terms. The mappings to the medical terms were done with the help of the medical specialist.

The angles are computed by using the direct line between the origin and the PEF and the direct line between the PEF and the FVC. In order to determine the concavity, we use areas. We compute the area determined by the actual graph stored in the image, and the area of the graph obtained by considering the FV line as the straight line which connects the PEF and the FVC. This second area represents the area of a perfect lung exam result, in which the patient is healthy. We then subtract the actual area from the second area. If the difference is positive, the FV line is concave; otherwise, it is convex.
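A numerical sketch of this area test (the sample points and the trapezoid integration are ours; the paper does not specify its integration method):

    # Sketch of the concavity test: compare the area under the measured
    # FV curve with the area under the straight PEF-to-FVC line; a
    # positive difference marks a concave FV line. Data points are made up.
    def trapezoid_area(ys, xs):
        """Area under a sampled curve, by the trapezoid rule."""
        return sum((ys[i] + ys[i + 1]) * (xs[i + 1] - xs[i]) / 2.0
                   for i in range(len(xs) - 1))

    volume = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]          # liters
    flow = [8.0, 5.9, 4.3, 3.1, 2.1, 1.1, 0.0]            # measured, L/s
    n = len(flow)
    # Ideal FV line: straight from the PEF down to the FVC point.
    straight = [flow[0] + (flow[-1] - flow[0]) * i / (n - 1)
                for i in range(n)]

    concavity = (trapezoid_area(straight, volume)
                 - trapezoid_area(flow, volume))
    print(f"concavity degree: {concavity:.2f}")  # positive -> concave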

B. EXTRACTING AND STORING KNOWLEDGE

The physicians need both data and knowledge from such a system in order to be assisted in the decision making process. Therefore, intelligent queries that aggregate existing (legacy) data and extract significant information from it increase the impact of such systems. The power and the speed of the system are improved through the existence of information such as the mean values of the different results captured by the respiratory tests, on different patients, both healthy and sick, with particularities such as height, weight or age.

For the problem under investigation, which is to determine whether a patient is suffering from some kind of lung problem, the knowledge of interest is: the mean value of the percentage of the measured PEF value with respect to the normal PEF (NPEF) value, for a given age group; the mean value of the measured FVC value with respect to the NFVC (Fig. 6a); the mean value of the concavity degree of the FV line; the number of patients that have a concave FV line of a degree lower than the mean; and the number of patients with restrictive lung disease which reach at least 70% of the NFVC.

Figure 6. (a) An air flow image of a patient, containing the mean percentage of the PEF (horizontal) and FVC (vertical) with respect to their normal values; (b) an air flow image of a patient containing a percentage of lung-problem-diagnosed patients with a similar FV line.

We identified the relevant medical aspects and transformed them into queries, allowing us to store relevant information ready to be used by the physicians. We populated the warehouse with significant information before deploying it to the end user (but also in a dynamic fashion, afterwards). This represents an enhancement that is difficult to obtain by the warehouse administrator, who is not a (medical) domain specialist. It also reduces the response time of the system.


The first type of query, for the mean percentage of the measured PEF with respect to the NPEF for the 6 - 10 years age group, offers the physician basic knowledge about how much air is exhaled at those particular ages. The result was stored in a fact table which corresponds to that particular age group. This query type is important and relevant for all age groups. Therefore, the knowledge stored in the system was enhanced with the computed mean for the age group 11 - 20, the age group 6 - 20, and so forth (i.e. ranges suggested by the domain expert).

Figure 7. Sample taken from the technical metadata repository.

When the first query was submitted, the query processor checked for mappings for the medical term and then checked for existing results for the query. Because it did not find any existing results already stored in the data warehouse, it selected the corresponding records and computed the requested mean. The same was done for the second query. For the third query (the 6 - 20 mean) the query processor found in the technical metadata repository that a mean for the age group 6 - 20 can be computed using the means of the two age groups 6 - 10 and 11 - 20 (Fig. 7). After finding this information, the query processor looked for records containing those two means. After finding them, it decided to use them to resolve the query, instead of retrieving the required data for all the patients between 6 and 20 years and computing the mean.
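For two disjoint age groups, this composition reduces to a count-weighted average of the stored means; the sketch below assumes (our assumption) that each fact record also stores the number of patients it aggregates:

    # Sketch: deriving the 6-20 mean from the stored 6-10 and 11-20
    # means, weighted by patient counts. All numbers are illustrative,
    # and storing counts alongside means is our assumption.
    def combine_means(groups):
        """groups: (mean, count) pairs for disjoint age groups."""
        total = sum(count for _, count in groups)
        return sum(mean * count for mean, count in groups) / total

    mean_6_10, n_6_10 = 78.4, 120      # stored fact, ages 6-10
    mean_11_20, n_11_20 = 84.1, 95     # stored fact, ages 11-20
    print(combine_means([(mean_6_10, n_6_10), (mean_11_20, n_11_20)]))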

Because the system deals with images, querying the warehouse based on raw image data needs to be one of the supported features of the system. Inputting an image as a query, in order to get aggregate data relevant to the medical investigation, retrieves relevant knowledge from the system. Such knowledge is the percentage of patients diagnosed with some kind of lung problem that have a similar FV line (Fig. 6b). The similarity threshold is submitted with the query parameters. When the image was submitted, the concavity degree was computed and used in selecting the corresponding images of the sick patients.

VI. CONCLUSIONS AND FUTURE WORK

We have presented an architecture for a multimedia data warehouse which aims at a semantically rich environment. We also presented a data model for representing facts and dimensions according to the hierarchical structure of the entities captured in multimedia objects. The metadata model we proposed is also based on hierarchies and can be easily extended, to provide better performance.

As future work, we plan to expand the medical and technical term lists and to improve the way in which they are mapped. We also plan to develop more powerful ETL tools which can extract more information from the files before they are loaded into the warehouse, to improve the speed of query processing.

REFERENCES

[1] R. Kimball, The Data Warehouse Toolkit, 2nd Edition, Wiley and Sons, 2002.
[2] P. Vassiliadis, Data Warehouse Metadata, Encyclopedia of Database Systems, Springer, 2009.
[3] Object Management Group, Common Warehouse Metamodel (CWM) Specification, 2003.
[4] A. Tanasescu, O. Boussaid, F. Bentayeb, Towards Complex Data Warehousing: A New Approach for Integrating and Modeling Complex Data, 5th International Conference on Modeling, Computation and Optimization in Information Systems and Management Sciences, France, 2004.
[5] A. M. Arigon, M. Miquel, A. Tchounikine, Multimedia Data Warehouses: A Multiversion Model and a Medical Application, Multimedia Tools and Applications, vol. 35, 2007.
[6] H. Mahboubi, J. C. Ralaivao, S. Loudcher, O. Boussaid, F. Bentayeb, J. Darmont, X-WACoDa: An XML-based Approach for Warehousing and Analyzing Complex Data, Advances in Data Warehousing and Mining, IGI Publishing, 2009.
[7] E. Diday, L. Billard, Symbolic Data Analysis: Definitions and Examples, 2002.
[8] S. E. G. Cisaro, H. O. Nigro, Architecture for Symbolic Object Warehouse, Encyclopedia of Data Warehousing and Mining, 2nd Edition, IGI Global, 2009.
[9] J. You, Q. Li, On Hierarchical Content-based Image Retrieval by Dynamic Indexing and Guided Search, Proceedings of the 8th IEEE International Conference on Cognitive Informatics, 2009.
[10] J. Pardillo, J. N. Mazón, J. Trujillo, Bridging the Semantic Gap in OLAP Models: Platform-independent Queries, Proceedings of the ACM 11th International Workshop on Data Warehousing and OLAP, 2008.
[11] G. Xie, Y. Yang, S. Liu, Z. Qiu, Y. Pan, X. Zhou, EIAW: Towards a Business-friendly Data Warehouse Using Semantic Web Technologies, The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, 2007.
[12] M. L. Antonie, O. R. Zaiane, A. Coman, Application of Data Mining Techniques for Medical Image Classification, Proceedings of the Second International Workshop on Multimedia Data Mining, 2001.
