2010 last papers


Last papers published with regard to: UML, HL7 CDA, HL7 RIM, Clinical Datawarehouse and Semantic Interoperability.


Clinical Document Data Warehouse (CD2W)

Josep Vilalta Marzo a, Diego Kaminker b, Josep M. Picas Vidal c, M. Lluisa Bernard Antoranz d, Cristina Siles c, Rafael Rosa Prat a

a Vico Open Modeling S.L., b Kern Information Technology SRL, c Hospital de la Santa Creu i Sant Pau, d Institut Català de la Salut

Abstract

This paper describes the development of the Clinical Document Data Warehouse (CD2W) project and its implementation using CDA R2 (Clinical Document Architecture Release 2) documents. The main objective of this project was to prototype a web-based portal giving access to a clinical document repository and a data warehouse holding clinical information about patients from several medical organizations in Barcelona, Spain (a major hospital and four primary care centers), providing clinical patient information for primary care and leveraging the same standardized information to populate the data warehouse (secondary use). The project was developed during the first half of 2009 under the general direction of the Hospital de la Santa Creu i Sant Pau (HSCSP). Due to the prototype nature of the project, the scope was limited to patients with Congestive Heart Disease (CHD) who had consented to the use of their information for clinical research at the HSCSP, and to the current vocabularies used by the providers (ICD-9 and other local terminologies). The data warehouse was developed using basic HL7 RIM concepts. The project mission was to improve patient care through the use of global standards (CDA R2 documents), open technology, low operating cost and ease of use. During this project we used the SCRUM agile methodology, allowing a scalable, progressive and incremental software development process.

Introduction – Business Case

Answering the questions that arise when a patient presents several clinical issues is one of the greatest challenges for healthcare providers. Fast clinical decision making at the point of care requires all relevant and up-to-date clinical information to be available to support the process. Usually there is no efficient filtering to flag the critical issues on which to focus our attention. We encounter scenarios with data overload, demanding a big effort to synthesize useful information, or with sparse data, demanding imagination to connect the pieces and reach any conclusion. Another problem is that the available clinical information source is usually only the organization supporting the healthcare provider. Information generated through other channels where the patient was attended is usually brought in by the patient in various paper formats. Consolidation of relevant and up-to-date information from distinct healthcare organizations under different authorities is a pending issue, waiting for an agreement among the healthcare authorities and harmonization of different database schemas.

Nowadays there are several ongoing projects with the shared goal of consolidating clinical information, both at the national (Spain) level and at the autonomous community level. These projects also share a great level of complexity and cost in achieving their goals. This small project, CD2W, aspires to contribute to this consolidation of clinical information, with a focus on easing the task for healthcare providers and for organizations administering huge databases that are difficult to integrate and lack a reference information model.

Materials and Methods

Hospital de la Santa Creu i Sant Pau is a high-complexity hospital which dates back six centuries, making it the oldest hospital in Spain. Healthcare is centered on Barcelona but extends to the rest of Catalonia. The center plays a prominent role in Spain and is internationally renowned. The hospital has distinguished itself in the healthcare provided in many fields, making it a reference centre in several specialties. The center attends over 34,000 admissions each year and more than 150,000 emergencies. Approximately 300,000 people are seen at the ambulatory services annually, and the Day Hospital attends over 60,000 users. There are 71 day hospital beds, 634 hospitalization beds and 19 surgical rooms. Teaching and training programmes at the Hospital de la Santa Creu i Sant Pau cover many levels, comprising the UAB (Universitat Autonoma de Barcelona) Faculty of Medicine Teaching Unit, the University School of Nursing, participation in the State Residency Programmes to train specialists, masters and doctorate courses, continuing education, etc. In the field of research, Hospital de la Santa Creu i Sant Pau is one of the most prominent centers in Spain, as can be appreciated from the volume of papers published and their impact factor, the number and quality of projects which receive funding, and the grants awarded. The Hospital de la Santa Creu i Sant Pau is governed by the Patronat de la Fundació de Gestió Sanitària (FGSHSCSP), a board with representatives from the Regional Catalan Government (Generalitat de Catalunya), the City Hall of Barcelona and the Archbishopric of Barcelona.

The goals for this project were:

1. Define a scheme to integrate information from disjoint information platforms.
2. Implement a process for periodic data and document exchange with minimal workload implications for the primary care centers.
3. Generate a clinical data store enabling the coexistence of clinical documents and relational data in a longitudinal patient healthcare record.


4. Create a simple user interface to enable fast queries on the relevant clinical information about a patient.
5. Restrict access to clinical information to professionals authorized by the participating organizations (HSP and ICS).
6. Use open technology for the design, development and implementation of the data warehouse, minimizing operating costs, and use global interoperability standards to enable universal access.
7. Evaluate the impact of normalizing data from different organizations and code systems.
8. Evaluate the usability and added value of CD2W to physicians for healthcare decisions.
9. Evaluate the results of this prototype to study a possible extension of access to patients or to other professionals.
10. Evaluate the results for a new project with a broader scope.

Implementation, Methodology and Tools

The project has five main design aspects:

1. Clinical Datawarehouse Design
2. Standard Document Design
3. Datawarehouse Population Process Design
4. User Interface Design
5. Technological Implementation

Let's review each of them.

1. Clinical Datawarehouse Design

The model for CD2W was derived from the RIM base classes (Role, Entity, Act, Participation) [3], following a three-step methodology:

a. Development of a conceptual framework, to be discussed with the CD2W stakeholders, establishing which were the measures or facts and how the information should be classified (dimensions).

Figure 1 – CD2W conceptual model

b. Then, on a more technical level, a domain analysis model was generated.

Figure 2 – CD2W Domain Analysis

c. And finally, the derivation of a data warehouse model to store facts, dimensions and the supporting standard clinical documents. In RIM terms, facts are Acts, and the main dimensions are Entities and Roles and their attributes.
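The RIM-to-star-schema idea above can be sketched with a minimal example. This is an illustration only: the table and column names, codes and sample data are assumptions, not the actual CD2W schema, which the paper does not detail.

```python
import sqlite3

# Minimal sketch of a RIM-derived star schema: facts are Acts,
# dimensions are Entities and Roles. All names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (          -- dimension: people, organizations
    entity_id   INTEGER PRIMARY KEY,
    class_code  TEXT,          -- e.g. 'PSN' (person), 'ORG' (organization)
    name        TEXT
);
CREATE TABLE role (            -- dimension: Patient, AssignedPerson...
    role_id     INTEGER PRIMARY KEY,
    class_code  TEXT,          -- e.g. 'PAT' (patient)
    player_id   INTEGER REFERENCES entity(entity_id)
);
CREATE TABLE act (             -- fact table: encounters, observations
    act_id      INTEGER PRIMARY KEY,
    class_code  TEXT,          -- e.g. 'ENC' (encounter), 'OBS'
    code        TEXT,          -- e.g. an ICD-9 code
    effective_time TEXT,
    document_id TEXT           -- link back to the source CDA R2 document
);
CREATE TABLE participation (   -- links facts to dimensions
    act_id      INTEGER REFERENCES act(act_id),
    role_id     INTEGER REFERENCES role(role_id),
    type_code   TEXT           -- e.g. 'SBJ' (subject), 'PRF' (performer)
);
""")

# Example: one de-identified encounter for one pseudonymized patient
conn.execute("INSERT INTO entity VALUES (1, 'PSN', 'PATIENT-0001')")
conn.execute("INSERT INTO role VALUES (1, 'PAT', 1)")
conn.execute("INSERT INTO act VALUES (1, 'ENC', '428.0', '2009-03-01', 'DOC-1')")
conn.execute("INSERT INTO participation VALUES (1, 1, 'SBJ')")

# A typical warehouse aggregation: encounters per patient
rows = conn.execute("""
    SELECT e.name, COUNT(*) FROM act a
    JOIN participation p ON p.act_id = a.act_id
    JOIN role r ON r.role_id = p.role_id
    JOIN entity e ON e.entity_id = r.player_id
    GROUP BY e.name
""").fetchall()
print(rows)  # [('PATIENT-0001', 1)]
```

The document_id column is what lets relational facts coexist with the stored CDA R2 documents, as the paper requires.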

2. Standard Document Design.

The documents were the 'lingua franca' between the disparate systems used by the participating organizations (the primary care centers and the hospital EHR and discharge system) [1].

We designed two different clinical document templates, one for the evolution note from the primary care centers and one for the discharge note from the Hospital discharge system.

The templates shared the same information at the header level, but differed in their section contents. Since this data warehouse was intended for secondary use, patient and physician information was de-identified [2] (names, identifiers and addresses were removed or replaced).
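Header de-identification of this kind can be sketched as follows. The XML fragment is a heavily simplified stand-in, not a real CDA R2 header, and the element names and pseudonym scheme are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative de-identification for secondary use: identifiers and
# names are replaced with a pseudonym, addresses are removed.
DOC = """
<patientRole>
  <id extension="12345"/>
  <addr>Carrer Example 1, Barcelona</addr>
  <patient><name>Jane Doe</name></patient>
</patientRole>
"""

def deidentify(xml_text, pseudonym):
    root = ET.fromstring(xml_text)
    for id_el in root.iter("id"):     # replace real identifiers
        id_el.set("extension", pseudonym)
    for addr in root.iter("addr"):    # remove addresses
        addr.text = ""
    for name in root.iter("name"):    # replace names
        name.text = pseudonym
    return ET.tostring(root, encoding="unicode")

out = deidentify(DOC, "PATIENT-0001")
print("Jane" not in out and "12345" not in out)  # True
```

The pseudonym keeps records for the same patient linkable across documents without exposing identity.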

The information generated by the local provider applications was transformed into clinical documents using an XSL transformation, and these standard documents were processed and stored in the CD2W.
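The project used a customized XSL per center; as a dependency-free illustration of the same mapping idea, the sketch below transforms a shallow center-specific XML into a heavily simplified CDA-like structure using only the standard library. The source element names and the mapping are assumptions.

```python
import xml.etree.ElementTree as ET

# Sketch: map a shallow per-center encounter XML into a (simplified)
# CDA-like document. Real deployments used an XSL per center.
SHALLOW = """
<encounter center="PC-03">
  <date>20090415</date>
  <dx code="428.0" system="ICD-9"/>
</encounter>
"""

def to_cda(shallow_xml):
    src = ET.fromstring(shallow_xml)
    doc = ET.Element("ClinicalDocument")
    time = ET.SubElement(doc, "effectiveTime")
    time.set("value", src.findtext("date"))          # header-level data
    custodian = ET.SubElement(doc, "custodian")
    custodian.set("id", src.get("center"))           # originating center
    obs = ET.SubElement(doc, "observation")
    obs.set("code", src.find("dx").get("code"))      # coded diagnosis
    obs.set("codeSystemName", src.find("dx").get("system"))
    return doc

cda = to_cda(SHALLOW)
print(cda.find("observation").get("code"))  # 428.0
```

Keeping the per-center logic in one mapping function (or one XSL file) is what limits the integration workload for each new center.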

We tested our mapping, process and query interface with 16,125 clinical documents from the hospital and the primary care centers.


To guide the development team, a table was built with the main required elements from each primary care center and their location inside the standard clinical document.

Figure 4 – Transformation Table

3. Data warehouse population process

The process to populate the data warehouse included several steps.

[At the primary care center]

1. Select encounters for patients suffering from CHD with signed consents.
2. Create a basic, shallow XML for each primary care center encounter.

[At the data processing center]

3. Create a CDA-conformant document for each instance, using the mapping defined for the center.
4. Populate the CD2W database with the information from each document header, and with the document itself.

This process was triggered periodically by the primary care centers and HSP.
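The four steps above can be sketched as a periodic batch job. Every data structure and function name below is a placeholder invented for illustration; the real pipeline operated on XML extracts and center-specific XSL mappings.

```python
# Minimal sketch of the population cycle. All names are placeholders.
ENCOUNTERS = [
    {"patient": "P1", "dx": "CHD", "consent": True},
    {"patient": "P2", "dx": "CHD", "consent": False},   # no consent: skipped
    {"patient": "P3", "dx": "asthma", "consent": True}, # out of scope
]

def select_encounters(encounters):
    # Step 1 (primary care center): CHD patients with signed consent
    return [e for e in encounters if e["dx"] == "CHD" and e["consent"]]

def to_shallow_xml(encounter):
    # Step 2: a basic, shallow XML per encounter
    return "<encounter patient='%s'/>" % encounter["patient"]

def to_cda(shallow, center_mapping):
    # Step 3 (data processing center): apply the center's mapping
    return center_mapping(shallow)

def populate(db, cda_documents):
    # Step 4: store header data plus the document itself
    for doc in cda_documents:
        db.append({"header": len(doc), "document": doc})

db = []
wrap = lambda s: "<ClinicalDocument>%s</ClinicalDocument>" % s
selected = select_encounters(ENCOUNTERS)
populate(db, [to_cda(to_shallow_xml(e), wrap) for e in selected])
print(len(db))  # 1
```

Filtering for consent at the originating center, before any data leaves it, is what keeps the workload and the privacy exposure of the centers minimal.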

4. User interface design

The user interface design was use-case based. The identified use cases were:

A. App Administration

Application setup and parametrization

− Define healthcare agent
− Define healthcare agent type
− Define healthcare agent role
− Define app parameters
− Define service catalog
− Define service

B. Data Generation Process

Use Cases related to the DW population

− Login
− User validation
− Batch load
− Stat processing
− Document processing
− Update audit log

C. Queries

Data access and retrieval by the users:

− Retrieve app parameters
− Query control panel (Figure 3)
− Query Patient Monitor (Figure 4)
− Query Encounter Monitor
− Query Conditions Monitor
− Query Healthcare Centers
− Query Healthcare Professionals
− Query Healthcare Services
− Query Stats (Figures 5, 6)
− Query audit logs
− Browse CDA R2 document (Figure 7)

The following figures illustrate the main user interface forms for CD2W:

Figure 3 – CD2W Control Panel


Figure 4 – CD2W Patient Query

Figure 5 – Services for one patient

Figure 6 – Temporal series for a patient

Figure 7 – CDA R2 Document for an encounter

5. Technological Implementation

The system was implemented using the open source Apache application server, a MySQL database, a PHP development environment and the amCharts data analysis and graphing component.

Evaluation/Assessment/Lessons Learned

The HL7 Reference Information Model was very useful in modeling our specific domain and in generating a scheme to integrate all participating applications into a single information model.

The exchange process could be established, but we needed to educate the participating centers on the use of CDA R2 and on the rationale for requesting each piece of information. Nevertheless, we needed to bridge their data models to CDA R2 by providing each center with a customized XSL. Coexistence of the documents with the relational information needed to explore the embedded information was possible, although we need to test with more volume. The user interface was sufficient for our pilot users from three centers to meet their data exploration and verification needs. The use of open technologies and standards was a key factor in minimizing development time. Our normalization efforts were not finished; we ended up using ICD-9 and internal ICS vocabularies.

Future Plans

Extension of this tool to make it accessible to patients and other professionals will be studied, but current policies make it very difficult to gain access to patient data, even for this approved project. We also want to explore using native XML open source databases. A project with a greater scope will be studied; the overall approach and the generated model are also suitable for other domains.


Acknowledgements

We thank the Spanish Ministry of Health for Scholarship FIS Dossier P105/230.

Bibliography

[1] Dolin RH, et al. HL7 Clinical Document Architecture, Release 2. JAMIA 2006;13:30-39. doi:10.1197/jamia.M1888

[2] Ley Orgánica 15/1999, de 13 de diciembre, de Protección de Datos de Carácter Personal (BOE núm. 298, de 14-12-1999, pp. 43088-43099)

[3] Yang W-Y, Lee L-H, Gien H-L, Chu H-Y, Chou Y-T, Liou D-M. The Design of the HL7 RIM-based Sharing Components for Clinical Information Systems. World Academy of Science, Engineering and Technology 53, 2009.

Contact

Josep Vilalta Marzo
Francesc Layret 24, Badalona, Barcelona, Spain
Email: [email protected]


Improving the Usability of HL7 Information Models by Automatic Filtering

Antonio Villegas and Antoni Olive
Services and Information Systems Engineering Department
Universitat Politecnica de Catalunya
Barcelona, Spain
Email: avillegas, [email protected]

Josep Vilalta
HL7 Education & e-Learning Services
HL7 Spain (Health Level Seven International)
Barcelona, Spain
Email: [email protected]

Abstract—The amount of knowledge represented in the Health Level 7 International (HL7) information models is very large. The sheer size of those models makes them very useful for the communities for which they are developed. However, the size of the models and their overall organization makes it difficult to manually extract knowledge from them.

We propose to extract that knowledge by using a novel filtering method that we have developed. Our method is based on the concept of class interest as a combination of class importance and class closeness. The application of our method automatically obtains a filtered information model of the whole HL7 models according to the user preferences. We show that the use of a prototype tool that implements that method and produces such a filtered model improves the usability of the HL7 models due to its high precision and low computational time.

Keywords-Usability, Health Level Seven International, HL7, Models, Filtering, UML

I. INTRODUCTION

The Health Level Seven International (HL7) is a not-for-profit, ANSI-accredited standards developing organization dedicated to providing a comprehensive framework and related standards for the exchange, integration, sharing, and retrieval of electronic health information that supports clinical practice and the management, delivery and evaluation of health services [1].

HL7 develops specifications, the most widely used being a messaging standard that enables disparate healthcare applications to exchange key sets of clinical and administrative data. The HL7 standard specifications are unified by shared reference models of the healthcare and technical domains [2], [3].

The amount of knowledge represented in the HL7 information models is very large and continuously improved. The sheer size of those models makes them very useful to the communities for which they were developed: HL7 international affiliates with more than fifty HL7 active working groups (Structured Documents, Clinical Decision Support, Clinical Genomics...), large integrated healthcare delivery networks, government agencies and other organizations that use those models for the development of their enterprise information architecture of health systems [4], [5].

However, the size of HL7 information models and their organization makes it very difficult for those communities to manually extract knowledge from them. This problem is shared by other large models [6].

Currently, there is a lack of computer support to make those models usable for the goal of knowledge extraction. In this paper, we propose to extract that knowledge by using a novel filtering method that we have developed, and we show that the use of our prototype implementation of that method improves the usability of HL7 information models.

The structure of the paper is as follows. Section II introduces the HL7 models and describes the main UML constructs used to build them. Section III describes the concept of class importance and references the methods that can be used to compute it. Section IV describes the concept of class interest with respect to a filter set of classes and explains how to compute it. Section V presents our model filtering method. Section VI evaluates the use of the method in the context of the HL7 models. Finally, Section VII summarizes the conclusions and points out future work.

II. HL7 INFORMATION MODELS

Types of Models

The HL7 information models comprise three types of models. Each of the model types is based on the UML, although the concrete notation used differs depending on the model type. Also, the models differ from each other in terms of their information content, scope, and intended use. The following types of information models are defined:

• Reference Information Model (RIM) - The RIM is the information model that encompasses the HL7 domain of interest as a whole. The RIM is a coherent, shared information model that is the source for the data content of all HL7 interoperability artifacts: V2.x messages and XML clinical documents (CDA R2) [3].

• Domain Message Information Model (D-MIM) - A D-MIM is a refined subset of the RIM that includes a set of classes, attributes and relationships that can be used to create messages and structured clinical documents for a particular domain (a particular area of interest in healthcare). There are predefined D-MIMs for a set of over 15 universal domains, such as Accounting and Billing, Care Provision, Claims and Reimbursement, and so on.


Figure 1. Sample of HL7 RIM refinements related to ActAppointment class

• Refined Message Information Model (R-MIM) - The R-MIM is a subset of a D-MIM that is used to express the information content for a message/document or set of messages/documents with annotations and refinements that are message/document specific. The content of an R-MIM is drawn from the D-MIM for the specific domain in which the R-MIM is used.

Structure of the HL7 Information Models

The RIM, D-MIM and R-MIM models can be analyzed as if they were built using, in a particular way, a small subset of constructs provided by the UML [7]. Figure 1 illustrates the main UML constructs used with a very small fragment of the RIM and of one D-MIM. The RIM comprises six backbone classes: Act, Participation, Entity, Role, ActRelationship and RoleLink. Figure 1 shows the first four of these classes. Each one has a number of attributes with a defined multiplicity. Surprisingly, there are only eight main associations between the RIM classes, all of them binary and with their corresponding multiplicities. Figure 1 shows four of these associations.

Each of the RIM classes has many subclasses, although only a few of them are explicitly shown in the diagrams of the HL7 RIM specification. There are many specialization/generalization relationships (called IsA relationships, e.g. Organization IsA Entity) in the HL7 models. The number of RIM classes and subclasses is over 2,500. Figure 1 shows seven subclasses of four of the backbone RIM classes and seven IsA relationships.

D-MIM models refine the RIM in three ways:

1) The participants of one of the eight main associations defined between RIM classes are refined in the subclasses. This is the refinement most often used in the HL7 models. Note that it is not allowed to add new associations.

2) The multiplicities of an association defined between RIM classes are strengthened in the subclasses.

3) The multiplicity of an attribute of a RIM class is strengthened in a subclass. An optional attribute in a RIM class can be made mandatory or not allowed in a subclass. Note that it is not allowed to add new attributes.

R-MIM models refine D-MIM models in the same way. In all cases, the three kinds of refinements can be expressed using UML constructs.

Figure 1 shows a few refinements related to the ActAppointment class. The instances of this class are appointments (a particular kind of Act). There may be several kinds of participations in an appointment. Figure 1 shows only two of them: PerformerOfActAppointment and SubjectOfActAppointment. To indicate that when the act is an appointment then the participations must be instances of PerformerOfActAppointment or of SubjectOfActAppointment, we redefine the association Participation-Act as shown in the figure. Note that redefinition is a UML construct, which is very useful in situations like this one. The redefinition of the association Role-Participation is similar. The overall semantics of these redefinitions is that the performer of an appointment is a Person that plays the role AssignedPerson, and that the subject of an appointment is a Person that plays the role Patient.

Sometimes, the UML redefinition construct does not allow the graphical representation of the strengthening of association multiplicities. In these cases, the redefinition must be formally captured by OCL invariants. For example, in Figure 1, the refinement of act in SubjectOfActAppointment also implies that an instance of ActAppointment is associated with a non-empty set (1..*) of SubjectOfActAppointment. However, this cannot be expressed graphically, and an OCL invariant must be used instead [8].


Figure 1 also shows the redefinitions of the associations player-playedRole and scoper-scopedRole between Entity and Role. The player and the scoper of an AssignedPerson and of a Patient must be a Person and an Organization, respectively.

III. IMPORTANCE OF HL7 CLASSES

Our filtering method is based on the concept of class importance. The importance of a class is a real number that measures the relative importance of that class in a model. We will see in the next section that we use that importance to select which classes are shown to the users.

There exist different kinds of methods in the literature to compute the importance of classes. The simplest family of methods is based on occurrence counting [9]–[11], where the importance of a class is equal to the number of characteristics the class has represented in the model. These methods are class centered in the sense that the importance of a class depends only on the information the class has. Therefore, the more information about a class, the more important it will be.

Another family of methods are those based on link analysis [11], [12], where the importance of a class is defined as a combination of the importance of the classes connected to it through associations and/or IsA relationships. Such a recursive definition results in an equation system and means that the more important the classes connected to a class are, the more important that class will be. In these methods importance is shared through connections, changing from a class-centered philosophy to a more interconnected notion of importance. Iterative methods are required to solve the importance equation system, which increases the computational cost of this kind of method.
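The iterative resolution of such an equation system can be sketched with a generic power-iteration scheme over a toy class graph. This is an illustration in the spirit of the link-analysis methods cited, not the CEntityRank algorithm used by the paper; the graph and damping parameter are assumptions.

```python
# Toy, undirected-in-spirit graph over a few RIM backbone classes:
# each class lists the classes it points importance at.
NEIGHBOURS = {
    "Act": ["Role", "Participation", "ActRelationship"],
    "Role": ["Act", "Entity"],
    "Participation": ["Act"],
    "ActRelationship": ["Act"],
    "Entity": ["Role"],
}

def iterative_importance(graph, damping=0.85, iters=50):
    """Solve the recursive importance system by fixed-point iteration:
    a class's importance is a damped sum of shares received from the
    classes linking to it (a PageRank-style scheme)."""
    n = len(graph)
    imp = {c: 1.0 / n for c in graph}
    for _ in range(iters):
        new = {}
        for c in graph:
            share = sum(imp[d] / len(graph[d]) for d in graph if c in graph[d])
            new[c] = (1 - damping) / n + damping * share
        imp = new
    return imp

imp = iterative_importance(NEIGHBOURS)
top = max(imp, key=imp.get)
print(top)  # Act: the most connected backbone class dominates
```

Consistently with Table I below, the highly connected backbone class Act comes out on top even in this tiny example.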

Finally, there are some methods that even use information about the existing instances of the classes and associations of the model. Therefore, the importance they compute takes into account not only the structural part of the model but also the data that the classes instantiate. The problem with this family of instance-dependent methods [13], [14] is that without instances the method cannot be used.

As an example, Table I shows the 10 most important classes of the HL7 models¹ computed using the CEntityRank importance algorithm (see 3.6 of [15]). To compute this importance, the method takes into account the classes, the IsA relationships between them, the attributes and their multiplicities, the associations and their multiplicities, the association redefinitions and the OCL invariants.

The in-depth study of the computation of the importance of classes is beyond the scope of this paper. A review of methods to compute the importance taking into account different levels of knowledge is given in [15].

¹The results have been obtained taking into account the RIM and the following D-MIM models: Laboratory, Account and Billing, Scheduling and Medical Records [1].

Table I
TOP-10 MOST IMPORTANT CLASSES

Rank  Class                 Importance
 1    Act                   7.51
 2    Role                  5.11
 3    ActRelationship       4.03
 4    Participation         3.67
 5    Entity                3.5
 6    Observation           2.64
 7    InfrastructureRoot    1.81
 8    Organization          1.72
 9    RoleLink              1.59
10    FinancialTransaction  1.54

The filtering method described in the next sections canbe used in connection with any of the existing methods forcomputing the importance of classes.

IV. INTEREST OF HL7 CLASSES

The importance of a class is an absolute metric that depends only on the whole set of HL7 models. The metric is useful when a user wants to know which are the most important classes, but it is of little use when the user is interested in a specific subset of classes, independently of their importance. What is needed then is a metric that measures the interest of a class with respect to such a set, which we call the filter set.

A filter set FS of classes is a non-empty set of classes from the HL7 models. The filter set comprises the minimum set of classes in which a user is interested at a particular moment. For example, if the user wants to see what knowledge the models have about the classes Patient and ActAppointment, then she defines FS = {Patient, ActAppointment}. We will see in the next section that, starting from this filter set, our filtering method retrieves the knowledge represented in the models about Patient and ActAppointment that is likely to be of most interest to the user.

Additionally, it is possible to define a set of classes not to be considered by the filtering method. We call such a set the rejection set RS.

Intuitively, the interest to a user of a class c with respect to a filter set FS should take into account both the absolute importance of c (as explained in the previous section) and a closeness measure of c with regard to the classes in FS. For this reason, we define:

Φ(c, FS) = α × Ψ(c) + (1 − α) × Ω(c, FS)    (1)

where Φ(c, FS) is the interest of class c with respect to FS, Ψ(c) is the absolute importance of class c, and Ω(c, FS) is the closeness of class c with respect to FS.

Note that α is a balancing parameter in the range [0,1] that sets the preference between closeness and importance for the retrieved knowledge. An α > 0.5 favours importance over closeness, while an α < 0.5 does the opposite. The default α value is 0.5 and can be modified by the user.
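Equation (1) translates directly into code; the numeric values below are illustrative only.

```python
# Direct transcription of equation (1): interest is a convex
# combination of absolute importance and closeness to the filter set.
def interest(importance, closeness, alpha=0.5):
    assert 0.0 <= alpha <= 1.0
    return alpha * importance + (1 - alpha) * closeness

# With a high alpha, a globally important but distant class (e.g. an
# importance like Act's 7.51 from Table I) can outrank a class that is
# maximally close to the filter set but globally unimportant.
high_alpha = interest(7.51, 0.2, alpha=0.9) > interest(1.0, 1.0, alpha=0.9)
print(high_alpha)  # True
```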

There may be several ways to compute the closeness Ω(c, FS) of class c with respect to the classes of FS. Intuitively, the closeness of class c should be directly related to the inverse of the distance of c to the filter set FS. For this reason, we define:

Ω(c,FS) = |FS| / Σ_{c′∈FS} d(c, c′)    (2)

where |FS| is the number of classes of FS and d(c, c′) is the minimum distance between class c and a class c′ belonging to the filter set FS. Intuitively, those classes that are closer to more classes of FS will have a greater closeness Ω(c,FS).

We assume that a pair of classes c, c′ are directly connected to each other if there is a direct association (or redefinition of an association) between them or if one class is a direct subclass of the other. For these cases, d(c, c′) = 1. Otherwise, when c, c′ are not directly connected, d(c, c′) is defined as the length of the shortest path between them traversing associations and/or ascending/descending through class hierarchies.
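Assuming the HL7 models are represented as an undirected graph whose edges are direct associations and direct generalizations, the distance d(c, c′) and the closeness of equation (2) can be sketched as follows; the data structures and names are our own assumption, not the paper's implementation.

```python
from collections import deque

def shortest_distance(adjacency, source, target):
    """Breadth-first search over the class graph, where an edge joins two
    classes linked by a direct association (or redefinition) or a direct
    generalization, so directly connected classes are at distance 1."""
    if source == target:
        return 0
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return float("inf")  # unreachable class

def closeness(adjacency, c, filter_set):
    """Equation (2): |FS| divided by the sum of the minimum distances
    from class c to every class in the filter set."""
    total = sum(shortest_distance(adjacency, c, f) for f in filter_set)
    return len(filter_set) / total
```

For example, a class at distance 1 from every class of FS obtains the maximum closeness of 1.0, matching the value of SubjectOfActAppointment in Table II.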

As an example, Table II shows the top-10 classes with the greatest value of interest when the user defines FS = {Patient, ActAppointment} and α = 0.5.

Results in Table II indicate that the top-10 includes classes that are directly connected to all members of the filter set FS = {Patient, ActAppointment}, as in the case of SubjectOfActAppointment (Ω(SubjectOfActAppointment,FS) = 1.0), but also classes that are not directly connected to any class of FS (although they are close to them).

V. FILTERING HL7 INFORMATION MODELS

We have developed a method for filtering large models, and we have used the HL7 models as a case study for developing and experimenting with the method and its associated tool. The method consists of four consecutive steps. The characteristics of each step are detailed below. Figure 2 presents an overview of the method and its steps.

Intuitively, from a small subset of classes selected by the user, the method automatically obtains a filtered information model with the knowledge of interest.

Step 1: Setting the User Preferences

The first step of the method consists of preparing the required information to filter the HL7 information models according to the user preferences. Basically, the user focuses on a set of classes (the filter set) she is interested in, and our method surrounds them with additional knowledge from the HL7 models. Therefore, it is mandatory for the user to select

Figure 2. Method Overview.

a non-empty initial filter set FS. An example of a filter set to obtain knowledge about patients and appointments in HL7 is FS = {Patient, ActAppointment}.

In the same way, the user can specify a rejection set RS (which may be empty) with those classes that are of no interest to her.

In addition to the filter set, the user can decide the amount of knowledge she wants to obtain by indicating the number of additional classes (Cmax) the method has to select and include in the filtered information model.

Apart from that, the user can select which importance method (see Section III) is to be used in the following step. Also, she can include her preferences about closeness and importance by setting a value for the balancing parameter α (see (1) in Section IV).

Note that RS, Cmax, the importance method, and the parameter α have default values (RS = ∅, Cmax = 10, the default importance method is CEntityRank [15], and α = 0.5) and are therefore all optional.

The user interaction is required only in this initial step.
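The inputs and defaults of this step could be gathered in a single structure; the following is a minimal sketch under our own field naming, not the tool's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class FilterPreferences:
    """User preferences collected in step 1. Only the filter set is
    mandatory; the remaining fields carry the defaults stated above."""
    filter_set: set
    rejection_set: set = field(default_factory=set)  # RS = empty set
    c_max: int = 10                                  # Cmax = 10
    importance_method: str = "CEntityRank"           # default method [15]
    alpha: float = 0.5                               # balancing parameter
```

With this arrangement, a user supplies only FS, e.g. `FilterPreferences(filter_set={"Patient", "ActAppointment"})`, and overrides any other field when needed.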

Step 2: Compute Filtering Measures

The second step of the method consists in computing the metrics of importance (Ψ) and closeness (Ω) for the HL7 classes.

By definition, the importance Ψ(c) of a class c is an absolute metric that depends on the knowledge represented in the whole set of HL7 models. The filtering method computes the importance of each class in the HL7 models, but this computation must be done only once. The results are valid until the HL7 models change. In our current prototype, the time required for this computation is about 2 seconds.


Table II
MOST INTERESTING CLASSES WITH REGARD TO FS = {Patient, ActAppointment}.

Rank Class (c) Ψ(c) d(c, Patient) d(c, ActAppointment) Ω(c,FS) Φ(c,FS)

1 SubjectOfActAppointment 0.11 1 1 1.0 0.7003

2 Organization 1.72 1 3 0.5 0.3552

3 Person 1.22 1 3 0.5 0.3537

4 ServiceDeliveryLocation 0.79 2 2 0.5 0.3524

5 AssignedPerson 0.72 2 2 0.5 0.3522

6 ManufacturedDevice 0.55 2 2 0.5 0.3517

7 LocationOfActAppointment 0.26 3 1 0.5 0.3508

8 ReusableDeviceOfActAppointment 0.19 3 1 0.5 0.3506

9 SubjectOfAccountEvent 0.13 1 3 0.5 0.3504

10 AuthorOfActAppointment 0.12 3 1 0.5 0.3503

On the other hand, to compute the closeness Ω(c,FS) of an HL7 class with regard to the filter set FS it is required to know the minimum distances between classes in the HL7 models (see (2) in Section IV). However, it is only necessary to compute the distance from each class in the filter set to every class out of FS, which requires a lower computational cost. Note that the method computes the closeness only for those classes that are out of the filter set.

Step 3: Select Interest Set

The third step of the method consists in computing the interest (Φ) for each class out of FS. As previously shown in (1) of Section IV, the interest Φ(c,FS) of a candidate class c to be included in the output model is a linear combination of the importance Ψ(c) and the closeness Ω(c,FS), taking into account the balancing parameter α.

Note that if a non-empty rejection set RS was defined in the first step of our method, the classes included in that set will not be considered for the final result, nor will their interest Φ be computed.

The interest Φ produces a sorted ranking of HL7 classes, and the method selects the top classes of that ranking until reaching the limit Cmax specified in the first step. We call such a set of classes the Interest Set. The second column of Table II shows the classes that belong to the Interest Set according to FS = {Patient, ActAppointment} when Cmax = 10.

In case two or more classes get the same interest, our method is non-deterministic: it might select any of them. Some enhancements can be made to avoid selecting classes in a random manner, like prioritizing the classes with a higher value of closeness or importance (or any other measure) in case of ties.
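A deterministic version of this selection, with the closeness tie-break suggested above, might look as follows; the names are our own sketch, not the paper's code.

```python
def select_interest_set(candidates, interest, closeness, c_max,
                        rejection_set=frozenset()):
    """Rank candidate classes by interest, breaking ties by closeness so
    the selection is deterministic. Classes in the rejection set are
    never considered. Returns the top c_max classes (the Interest Set)."""
    ranked = sorted(
        (c for c in candidates if c not in rejection_set),
        key=lambda c: (interest[c], closeness[c]),
        reverse=True,
    )
    return ranked[:c_max]
```

Sorting on the pair (interest, closeness) means the secondary measure only matters when interest values are exactly equal, which is precisely the tie case described above.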

Step 4: Compute Filtered Information Model

Finally, the last step of the method takes the Interest Set of classes from the previous step and puts it together with the classes of the filter set FS in order to create a filtered information model with the classes of both sets.

The main goal of this step consists in filtering information from the whole set of HL7 information models involving classes in the filtered model. To achieve this goal, the method explores the associations, redefinitions of associations, and generalization/specialization relationships in the HL7 information models that are defined between those classes and includes them in the filtered model to obtain a connected model. The filtered information model for FS = {Patient, ActAppointment} and the previous Interest Set is shown in Figure 3.

Our method also takes into account associations that are specified between superclasses of classes included in the filtered information model, and brings them down to connect such subclasses. An example of that behaviour is the association between Participation and ActAppointment in Figure 3. Such an association is originally defined between Participation and Act (see Figure 1). Given that ActAppointment is a subclass of Act, the association is brought down to the context of ActAppointment to indicate that the connection with Participation exists although Act was not included in the Interest Set.

When bringing down an association, there exists the case that such an association could be repeated. Figure 3 shows the association between Participation and ActAppointment. Note that Participation is not a member of the Interest Set (see Table II). However, Participation has been included in the filtered information model as an auxiliary class (marked in Figure 3 with a light grey color). The rationale is that such an association would otherwise have to be brought down between ActAppointment and each of the five subclasses of Participation present in the Interest Set (SubjectOfActAppointment, AuthorOfActAppointment, ReusableDeviceOfActAppointment, LocationOfActAppointment and SubjectOfAccountEvent), which is not a UML-compliant situation.

To avoid repeated associations, our method finds the lowest common parent (LCP) of the previous subclasses, which in this case is Participation, includes it in the filtered information model as an auxiliary class, and brings the association down to that LCP class. The same situation occurs for RoleClassAssociative and RoleChoice, which are LCP classes included as auxiliary classes in the filtered information model of Figure 3.

Figure 3. Filtered Information Model for FS = {Patient, ActAppointment}.
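A sketch of the LCP computation, assuming single inheritance within the relevant fragment of the hierarchy; the parent map and class names are our own illustration.

```python
def lowest_common_parent(parents, classes):
    """Walk up the hierarchy from each class (parents maps a class to
    its single parent) and return the nearest ancestor shared by all of
    them, which may be one of the input classes itself."""
    def ancestors(c):
        chain = [c]
        while c in parents:
            c = parents[c]
            chain.append(c)
        return chain

    # Intersect the ancestor chains of all classes.
    common = set(ancestors(classes[0]))
    for c in classes[1:]:
        common &= set(ancestors(c))

    # The LCP is the first shared ancestor met when walking up from
    # any of the classes.
    for a in ancestors(classes[0]):
        if a in common:
            return a
    return None
```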

Besides, if there are two classes in the filtered information model such that one is an indirect subclass of the other in the HL7 models, our method creates an IsA relationship between them in the filtered information model (marked as indirect) to indicate such knowledge. Figure 3 shows that the five subclasses of Participation and the four of RoleClassAssociative are indirect subclasses by marking those IsA relationships in a light gray color. For the case of RoleChoice, its subclasses are directly connected to it by means of IsA relationships (marked with ordinary black color).

Finally, the filtered information model presented in Figure 3 shows information about two HL7 domains: the Scheduling domain and the Account and Billing domain. By using our filtering method, a user who wanted to know about patients and appointments discovers that patients are also related to account events. This way, the user can easily compose another filter set like FS = {Patient, SubjectOfAccountEvent} to get more knowledge about them in a new iteration of our method.

VI. EVALUATION

Our filtering method and prototype tool provide support for the task of extracting knowledge from the HL7 models, which has normally been done manually or with little support.

Finding a measure that reflects the ability of our method to satisfy the user is a complicated task. However, there exists related work [16], [17] about some measurable quantities in the field of information retrieval that can be applied to our context:

• The ability of the method to withhold non-relevant knowledge (precision)

• The interval between the request being made and the answer being given (time)

Precision Analysis

A correct method must retrieve the relevant knowledge according to the user preferences. The precision of a method is defined as the percentage of relevant knowledge presented to the user.

In our context, we use the concept of precision applied to HL7 universal domains (specified with D-MIMs). Each domain contains a main class which is the central point of knowledge for the users interested in that domain. The other classes presented in the domain constitute the relevant knowledge related to the main class.

HL7 professionals interested in a particular domain decide about the knowledge to incorporate in it through ballots. Thus, a common situation for a user is to focus on the main class of a domain and to navigate through the D-MIM to understand its related knowledge.

To measure the precision of our method, we simulate the generation of a D-MIM from its main class. We define a single-class filter set with that class and set Cmax to the size of the domain. This way, we obtain a filtered information model with the same number of classes as the domain.

In one iteration of our method, we obtain two groups of classes within the resulting filtered information model: the relevant classes to the user, that is, the ones that were originally defined in the D-MIM by experts, and the non-relevant ones. The precision of the result is defined as the fraction of the relevant classes over the total Cmax.
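The per-iteration precision described above can be sketched as a simple set intersection; the paper gives only the verbal definition, so the naming here is our own.

```python
def precision(filtered_classes, relevant_classes):
    """Fraction of classes in the filtered information model that were
    originally defined in the expert D-MIM (the relevant ones), over
    the total number of retrieved classes (Cmax)."""
    filtered = set(filtered_classes)
    return len(filtered & set(relevant_classes)) / len(filtered)
```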

To refine the obtained result, the non-relevant classes are included in a rejection set RS and the method is executed again, taking RS into account. It is expected that the filtered information model resulting from this step will have a greater precision.

Figure 4. Precision analysis for HL7 domains: precision (%) versus number of iterations for the Medical Records, Scheduling, Account and Billing, and Laboratory domains; the right panel zooms in on iterations 1-5.

In this manner, at each iteration the classes that are non-relevant to the user are rejected, and we know that in a finite number of steps our filtering method will obtain all the classes of the original domain. The smaller the number of iterations required to reach that domain, the better the method.

Figure 4 shows the number of iterations needed to reach the maximum precision for four of the HL7 domains. Note that the right side of Figure 4 zooms in on the first five iterations. The test reveals that to reach more than 80% of the relevant classes of a domain, only three iterations are required.

Time Analysis

It is clear that a good method does not only require precision; it also needs to present the results in a time acceptable to the user.

To find the time spent by our method it is only necessary to record the time lapse between the request for knowledge, i.e. once a filter set FS has been indicated by the user, and the receipt of the filtered information model.

Figure 5. Time analysis for different sizes of FS: average time (s, from 0 to 4) versus filter set size (from 1 to 40 classes).

It is expected that as we increase the size of the filter set, the time will increase linearly. Our method computes the distances from each class in the filter set to all the other classes. This computation requires (on average) the same time for each class in the filter set. Therefore, the more classes we have in a filter set, the more time our method spends computing distances.

In our experimentation, we set our prototype tool to apply the filtering method several times with an increasing number of classes in the filter set. The average results for sizes from a single-class filter set up to a 40-class filter set are presented in Figure 5.

According to the expected use of our method, having a filter set FS of 40 classes is not a common situation (although possible). Sizes of filter sets up to 10 classes are more realistic, in which case the average time does not exceed one second.

VII. CONCLUSIONS

HL7 information models are very large. The wealth of knowledge they contain makes them very useful to their potential target audience. However, the size and the organization of these models make it difficult to manually extract knowledge from them. This task is basic for the improvement of services provided by HL7 affiliates, vendors and other organizations that use those models for the development of health systems.

What is needed is a tool that makes the HL7 models more usable for that task. We have presented a method that makes it easier to automatically extract knowledge from the HL7 models. The input to our method is the set of classes the user is interested in. The method computes the interest of each class with respect to that set as a combination of its importance and closeness. Finally, the method selects the most interesting classes from the models, including the knowledge defined about them in the original models (e.g. associations, redefinitions of associations, IsA relationships).

The experiments we have done clearly show that the proposed method and its associated tool provide an easier way to extract knowledge from the models. Concretely, our prototype tool recovers more than 80% of the knowledge of a D-MIM in three iterations, with an average time per iteration that for common uses does not exceed one second.

We plan to continue our work along three directions. The first is to include all HL7 models in our tool to give full support to all HL7 communities. Currently we have four D-MIMs. Experimentation with the full set of models will allow us to improve the method.

We also plan to experiment with the latest definition and nomenclature of HL7 models published by HL7 International. Basically, it specifies a new level on top of the RIM that consists of a domain analysis model (DAM) to describe business processes and use cases, and a localized information model (LIM) at the bottom of the model types to adapt the R-MIMs to locale-specific requirements for structure and terminology. Taking these two new models into account is a challenge that will improve our work.

Finally, another research area to explore consists in generating traceability links from the elements in the filtered model to the original models, so that it is easy to find out the origin of each element. Keeping such backward links improves the integration of different models in an interoperability context. Also, our method and tool implementing traceability could be used as an aid in the design of implementation guides for HL7 interoperability artifacts (HL7 V3 messaging and CDA R2 documents).

ACKNOWLEDGMENT

The authors want to thank the collaboration of Diego Kaminker, HL7 Education WG co-chair and HL7 International Mentoring Committee co-chair, Carles Gallego, current Chair of HL7 Spain, and Dr. Joan Guanyabens, former Chair of HL7 Spain.

We would also like to thank the people of the GMC group for their useful comments on previous drafts of this paper. This work has been partly supported by the Ministerio de Ciencia y Tecnología under the project TIN2008-00444, Grupo Consolidado.

REFERENCES

[1] Health Level Seven International, "HL7 web," Feb. 2010. [Online]. Available: http://www.hl7.org

[2] R. Dolin, L. Alschuler, C. Beebe, P. Biron, S. Boyer, D. Essin, E. Kimber, T. Lincoln, and J. Mattison, "The HL7 clinical document architecture," Journal of the American Medical Informatics Association, vol. 8, no. 6, pp. 552–569, 2001.

[3] R. Dolin, L. Alschuler, S. Boyer, C. Beebe, F. Behlen, P. Biron, and A. Shabo, "HL7 clinical document architecture, release 2," Journal of the American Medical Informatics Association, vol. 13, no. 1, pp. 30–39, 2006.

[4] J. Conesa, V. C. Storey, and V. Sugumaran, "Usability of upper level ontologies: The case of ResearchCyc," Data & Knowledge Engineering, vol. 69, no. 4, pp. 343–356, 2010.

[5] A. Danko, R. Kennedy, R. Haskell, I. Androwich, P. Button, C. Correia, S. Grobe, M. Harris, S. Matney, and D. Russler, "Modeling nursing interventions in the act class of HL7 RIM Version 3," Journal of Biomedical Informatics, vol. 36, no. 4-5, pp. 294–303, 2003.

[6] J. Lyman, S. Pelletier, K. Scully, J. Boyd, J. Dalton, S. Tropello, and C. Egyhazy, "Applying the HL7 reference information model to a clinical data warehouse," in IEEE International Conference on Systems, Man and Cybernetics, vol. 5, 2003, pp. 4249–4255.

[7] OMG, Unified Modeling Language: Superstructure, version 2.1.1, Object Management Group, February 2007.

[8] OMG, Object Constraint Language, version 2.0, Object Management Group, May 2006.

[9] S. Castano, V. De Antonellis, M. G. Fugini, and B. Pernici, "Conceptual schema analysis: techniques and applications," ACM Transactions on Database Systems, vol. 23, no. 3, pp. 286–333, 1998.

[10] D. L. Moody and A. Flitman, "A methodology for clustering entity relationship models - a human information processing approach," in Conceptual Modeling - ER 1999, 18th International Conference on Conceptual Modeling, ser. Lecture Notes in Computer Science, vol. 1728. Springer, 1999, pp. 114–130.

[11] Y. Tzitzikas, D. Kotzinos, and Y. Theoharis, "On Ranking RDF Schema Elements (and its Application in Visualization)," Journal of Universal Computer Science, vol. 13, no. 12, pp. 1854–1880, 2007.

[12] Y. Tzitzikas and J.-L. Hainaut, "How to tame a very large ER diagram (using link analysis and force-directed drawing algorithms)," in Conceptual Modeling - ER 2005, 24th International Conference on Conceptual Modeling, ser. Lecture Notes in Computer Science, vol. 3716. Springer, 2005, pp. 144–159.

[13] C. Yu and H. V. Jagadish, "Schema summarization," in VLDB 2006, 32nd International Conference on Very Large Data Bases, 2006, pp. 319–330.

[14] X. Yang, C. M. Procopiuc, and D. Srivastava, "Summarizing relational databases," in VLDB 2009, 35th International Conference on Very Large Data Bases, 2009, pp. 634–645.

[15] A. Villegas and A. Olivé, "On computing the importance of entity types in large conceptual schemas," in Advances in Conceptual Modeling - Challenging Perspectives, ER 2009 Workshops, ser. Lecture Notes in Computer Science, vol. 5833. Springer, 2009, pp. 22–32.

[16] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley, 1999.

[17] C. Van Rijsbergen, "Information Retrieval," Cataloging & Classification Quarterly, vol. 22, no. 3, 1996. [Online]. Available: http://www.dcs.gla.ac.uk/Keith/Preface.html