XML Databases and Biomedical Informatics
A report by James Lindsay for UCONN CSE 300, Spring 2008
[email protected]



Introduction

XML is a language with many uses, primarily sharing data between heterogeneous systems via the internet. Since its introduction XML has grown very rapidly, being adopted by corporate users and standardization bodies alike. In fact XML's unique properties have allowed it to become the underlying language of hundreds of applications. It is recommended by the World Wide Web Consortium as a free and open standard [13-16]. In the field of biomedical informatics, one of the long-standing open problems is finding a way to share medical data across a variety of mediums. XML has emerged as a leading facilitator in the solution. Medical data is inherently generated by a multitude of sources. One individual can have medical data stored in dozens of locations, each of which utilizes a different system to store the data locally. Various other medical fields, such as genetics or medical imaging, are also generating massive quantities of digital information. The need to share this information is obvious, and yet there are very few practical implementations of national or worldwide medical data sharing. Storing medical data in XML format is an idea with long roots: medical data intuitively maps to the XML language. However, software packages that can handle XML data in a highly reliable, high-load environment are not yet widely available. The naive approach to storing and retrieving XML data has been to map it to traditional object-oriented models, storing it in relational databases. These methods have seen widespread use in the past few years, but, more importantly, native XML databases have been developed which accomplish the same goal. This paper requires no previous knowledge of XML or of specific medical domain topics. The paper will look at the properties of medical data in general which allow it to be efficiently stored and transferred using XML. Then XML will be briefly described in a non-technical manner.
The only topics mentioned will be prerequisites to our later discussion of databases. Following this will be an in-depth analysis of various information models and their relationship with XML. After understanding the structure of information, it is then possible to look at three ways in which XML can be effectively databased. Our focus will be on native XML database systems. Finally, several case studies will be examined to understand the state of XML databases in the medical industry today. The last of these is the UCONN Computer Science Biomedical Informatics system and how it utilizes XML database technology.

Medical Data Research

On June 3, 1999 the Working Group on Biomedical Computing, Advisory Committee to the Director of the National Institutes of Health, issued a charge [11].

“The biomedical community is increasingly taking advantage of the power of computing, both to manage and analyze data, and to model biological processes. The working group should investigate the needs of NIH-supported investigators for computing resources, including hardware, software, networking, algorithms, and training. It should take into account efforts to create a national information infrastructure, and look at working with other agencies (particularly NSF and DOE) to ensure that the research needs of the NIH-funded community are met.”

The group goes on to describe the principal barrier to better health care as the lack of knowledge. They point out that there is a paradigm shift taking place in which researchers are spending less time in the lab and more time behind a computer. Not only are researchers running intensive computational simulations to generate new knowledge, they are also increasingly using data mining techniques to the same end. The group's goal is to "discover, encourage, train and support the new kinds of scientists needed for tomorrow's science".


The second recommendation of the charge calls for a new program which endeavors to further our understanding of the principles and practices of information storage, curation, analysis and retrieval. There are new and old fields of study which are generating increasing amounts of information. The Human Genome Project, clinical trials, statistical analysis, population genetics, and medical imaging are just a few. The amount of raw data, taken as a whole, is enormous; subsequently, so too must be the infrastructure in place to make that data available. A single biomedical laboratory may generate up to 100 terabytes of information a year, which must be indexed, stored, and retrievable efficiently [11]. The group calls for advanced database technology which will help standardize medical information. In this way information will not only be transferable, but semantically interoperable between heterogeneous sources. Scientists can then have access to any information they could desire, enabling the creation of new technology faster.

Personal

The concept of a personal health record is not new; in fact there are published references to the idea as early as 1978. Biomedical informatics does not refer only to information used by a researcher; it also includes information used by an individual. Both kinds of medical information require the same improvements. Just as research data should be available in a consistent form for a researcher, an individual needs to make their personal health record available consistently. When a person moves, the majority of their medical data is stored on paper. This requires one doctor's office to fax or mail a copy of that information. By allowing the individual to maintain a digital copy of this information, this delay is eliminated. The delay is not only an inconvenience but also a barrier to life-saving information.
In emergencies, electronic record transfer may be the only way of retrieving information needed to save someone's life. President Bush announced in 2004 that within ten years every citizen would have access to electronic medical records. The Center for Medicare and Medicaid Services, a government organization, has been promoting the exploration of individual use of personal health records through grants and advertisement.

Structure of Medical Data

It is a difficult task to attempt to capture the structure of medical data. In fact it seems reasonable that there is no non-trivial definition of medical data as a whole. As one increases the granularity it becomes possible to describe certain types of data, for example monitoring the glucose level of a person with diabetes. The data consists of a person's name, the reading of a glucose meter, and possibly some observations. The structure of this data might be described as a few numbers and a string identifying the person, followed by human-generated text describing personal observations. This structure is certainly not compatible with a structure used to record human genome information. The point of this simple example is to illustrate that there exists no overarching structure to medical data. Traditionally medical data would be described as heterogeneous [6]. It is heterogeneous not only in its content, but also in its sources. A document may contain dozens of numerical values along with a qualitative analysis of their meaning, or precise dosing instructions including a photo, with remarks detailing ways of recognizing side effects. So within a single document there are a multitude of data types present.
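The glucose-monitoring example can be made concrete with a small sketch. The document below is hypothetical (the element and attribute names are invented for illustration), but it shows how a single record mixes numeric meter readings with free-form human observations; here it is parsed with Python's standard-library ElementTree:

```python
import xml.etree.ElementTree as ET

# A hypothetical record for the glucose-monitoring example above:
# numeric meter readings sit next to free-form human observations.
doc = """
<record patient="Jane Doe">
  <glucose unit="mg/dL">
    <reading time="08:00">102</reading>
    <reading time="12:30">145</reading>
  </glucose>
  <observations>Felt dizzy shortly after lunch.</observations>
</record>
"""

root = ET.fromstring(doc)
readings = [int(r.text) for r in root.iter("reading")]  # structured, numeric data
notes = root.findtext("observations")                   # unstructured, human-generated text
print(readings)  # [102, 145]
print(notes)     # Felt dizzy shortly after lunch.
```

Both kinds of data live comfortably in one document, which is precisely the heterogeneity described above.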


Also there is an incongruence between documents. A discharge report from one hospital may contain different information from another; how would one integrate the two to identify patterns? The structure of medical data has many similarities with a technology called XML. Very soon after XML was created in the mid 1990's, people began recognizing that it might be a suitable storage medium for medical data.

Description of XML

XML is an abbreviation for extensible markup language, and in many ways it is similar to HTML. Markup languages have been in use for decades, originating in the need for symbolic printer's instructions. XML however serves a slightly different role in that it is not solely intended to display information; it is also a medium to convey or store information. It must be noted that XML itself will do nothing, as it is simply a carrier; someone must create a computer program that will interpret the data within the XML and display it. The heart of XML is that it is extensible. In markup languages such as HTML, there exists a set of predefined tags which a writer must use. The tags serve as a way of defining what type of information is enclosed within them. There is no set of predefined tags for XML; each user must define their own. In this capacity XML is therefore a self-descriptive language [16]. Comparing XML to more traditional computer science data types, it would be classified as simply a string. XML is just plain text with a certain structure embedded within that text. The power of XML as a data transfer medium across heterogeneous systems comes from the variety of standards that exist for XML. There exists a variety of languages in which an XML schema can be defined. The schema is a set of rules and tags which a particular XML instance must abide by and utilize. Each language has advantages and disadvantages; in particular we will focus on the XML Schema (W3C) language, known as XSD (XML Schema Definition).
This language ensures that a particular document is valid if and only if it adheres to a particular set of rules. Additionally, the schema defines certain data types. In addition to laying out available data types or tags, the schema also serves to give a relationship between these elements, or a structure. The schema defines a tree-like structure over all the available tags. The root of the tree is always a tag declaring what schema a particular document is following. Beneath this root are more tags, which can be nested to create a very complex hierarchical structure. XML is often referred to as a canonical data model [2]. The specific mechanics of defining an XML schema are not the focus of this paper; the reader should simply be aware that there exist well-thought-out languages and specifications which can define an XML schema against which any XML instance can be validated. Also, the unambiguous hierarchical nature of XML is paramount to understanding how XML can be queried and databased.
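The tree structure just described can be seen directly by walking a document. The sketch below uses a tiny, invented record and Python's standard-library ElementTree to list every root-to-node path, making the hierarchy explicit:

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical document following the hierarchical structure described above.
doc = "<record><heart><medications><med>atenolol</med></medications></heart></record>"
root = ET.fromstring(doc)

def paths(elem, prefix=""):
    """Yield every root-to-node path in the document tree."""
    here = prefix + "/" + elem.tag
    yield here
    for child in elem:
        yield from paths(child, here)

print(list(paths(root)))
# ['/record', '/record/heart', '/record/heart/medications', '/record/heart/medications/med']
```

Every element has exactly one parent and one path from the root, which is the unambiguous hierarchy that query languages and databases later exploit.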

XML Queries

There are many methods of extracting information from XML. Some XML documents can be somewhat human readable, and just by looking at them a user will understand the content. However, when documents become large, containing numerous tags and elements, this becomes impossible. The W3C has developed several methods of efficiently retrieving information from XML.

Drawing 1: Glucose


The first of these methods is XPath. Simply put, XPath is a language for addressing parts of an XML document [15]. It is also able to provide basic functionality for manipulating strings, numbers and booleans. XPath is not XML-based, and instead uses a compact notation similar to a URL/URI. The reasoning behind making XPath notation so simple was to facilitate its use. XPath is one of the oldest XML query languages published by the W3C. At its heart, XPath is a way of traversing a structured XML document. Each element or tag within the document can be thought of as a node in the document tree. Starting from the root, an XPath query traverses down the tree, trying to match nodes to the query. A simple XPath query of an XML document which stores health information of a patient might be "'johndoe.xml'/heart/medications/", which would return a list of the patient's heart medications. XPath demonstrates the hierarchical nature of accessing XML documents, which will later influence how XML documents are databased. While traversing an XML document tree is very important, complex documents and users' desire for more control caused the W3C to create the XQuery language, which is a functional expansion of XPath. According to the published specification, "It is designed to be a language in which queries are concise and easily understood. It is also flexible enough to query a broad spectrum of XML information sources, including both databases and documents" [16]. There are really two important points which XQuery addresses, and these are also important when considering XML databases. First, XQuery queries are not limited to a single document as the XPath example was. Queries can span multiple documents, known as document collections. XQuery is even designed to work on document fragments. It follows that XQuery is a language suitable for querying all documents in an XML database. Additionally, XQuery allows a query not only to access, but also to modify information within the document.
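The XPath traversal described above can be sketched with Python's standard-library ElementTree, whose `findall` method accepts a limited XPath subset. The `johndoe.xml` document and its element names are hypothetical, echoing the example in the text:

```python
import xml.etree.ElementTree as ET

# A hypothetical johndoe.xml, reduced to an inline string for the example.
doc = """
<patient name="John Doe">
  <heart>
    <medications>
      <med>atenolol</med>
      <med>lisinopril</med>
    </medications>
  </heart>
</patient>
"""

root = ET.fromstring(doc)
# ElementTree supports a subset of XPath; this mirrors the
# /heart/medications/ query described in the text.
meds = [m.text for m in root.findall("./heart/medications/med")]
print(meds)  # ['atenolol', 'lisinopril']
```

Full XQuery, including its update facilities, is not part of the Python standard library; the sketch covers only the XPath-style tree traversal, while modification of stored documents is the capability XQuery adds on top.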
The ability to modify data is likewise a necessary feature for any database which allows modifications. In fact XQuery is often referred to as the SQL of XML. We now understand the methods by which information can be retrieved, utilized and modified; however, we have not explored how it can be stored. Before diving into XML databases, it is necessary that we understand what types of information will be stored using XML.

Models of Information and XML

This section title may seem out of place, and certainly requires explanation. In an earlier section the notion of the XML data model was addressed. There, the way in which XML captures information was briefly explained. The hierarchical nature of XML is referred to as the XML data model. That, however, is not the topic of the current section. Here we are dealing with models of information: what kind of information do we want to store in XML format, and how is that information structured? We can then attempt to map the model of information to the XML data model and gain an understanding of what types of information XML is best suited to handle. The subsequent section will expand upon the models of information we have defined and begin connecting them to databasing strategies. All XML files are referred to as documents; however, there is a key distinction between a data-centric document and a document-centric document. Therefore in XML there is a notion of information which is "data-centric" and "document-centric". The origin of these terms is mostly unknown, but they probably date back to 1997 and the XML-DEV mailing list [12]. Document-centric information is just as it would seem; examples include a book, an e-mail or


advertisement. The information is usually generated by hand and then converted to XML. It is also most likely human readable. The important distinction is that the order of each element within the document is important: the document would lose meaning if the introduction section of a book were placed at the end. The less intuitive model of information is the data-centric model. Here a document simply uses XML as a transportation medium and nothing else. The application utilizing the data is not aware of, or affected by, the fact that at some point the data was stored in XML. Generally speaking, information in this model is fine-grained, and its order should not be of much importance. For example, a person updating their medical record will define new values for certain elements. For glucose levels, the user will add a list of numbers, and the user may choose to update his email address as well. In this situation it is not important that the glucose levels came before an email update; the data are relatively independent except when validating the document. Typically this kind of data will come from two sources: from a database where XML is being used to expose that information, or from a source where the information is being generated. In reality most information will be a combination of the two models just described. For this paper, our case study is medical information. The focus is not on what types of medical information exist, but rather on how we can use XML to store it. A quick look at the variety of medical information out there, such as patient charts, hospital records, gene data, etc., should demonstrate that there is an enormous amount of heterogeneous medical information. While certainly some areas will be more data-centric, like genome data, there are also areas which are document-centric, like a hospital discharge report.
However, as a whole, medical information will fall somewhere between the two extremes, in a hybrid data/document-centric information model. Later on we will discuss some specific examples and explain these claims in more detail.

XML, Databases and Data-centric Information

Now that we have defined several information models, it is time to explore methods of storing them and leveraging the advantages that XML provides to each model. When XML contains data-centric information, a natural approach to storing it is by decomposing the XML data into a relational database. There are two approaches available to accomplish this mapping: the first is table-based, and the second is object-based [12]. In table-based mapping we are attempting to map an XML document schema to the table format of a relational database. This implies that the schema follows the format of a database table, or that someone hand codes a mapping which emulates table format. There exist middleware products which automate this mapping if the schema follows a specified convention. Labeling elements as table, row, or column data obviously allows simple automation of this mapping. Some products are even able to infer a mapping just by comparing the structure of the two mediums. One drawback is that this approach limits our XML schema to a simple tabular format. As stated before, biomedical information can be very complex and heterogeneous, which would seem a poor match for this approach. In the object, or more appropriately object-relational, mapping paradigm the hierarchical nature of XML is exploited. The root of a sub-tree in the document is considered to be a class, and its elements are scalars. Following this approach, and utilizing traditional object-relational techniques, the XML model can be mapped into a relational model and stored in a database [1, 12].
However, this mapping isn't limited just to the database; there exist middleware applications which will map the XML document model to objects in the application code using the same principles. The data will therefore exist as XML, as objects in application code, and finally reside in a database.
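A minimal sketch of this decomposition, using an invented data-centric document and Python's standard-library `sqlite3`: each repeating `<patient>` sub-tree is treated as a class/row, and its child elements become scalar columns.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A hypothetical data-centric document: each <patient> maps to one table row.
doc = """
<patients>
  <patient id="1"><name>Jane Doe</name><glucose>102</glucose></patient>
  <patient id="2"><name>John Doe</name><glucose>98</glucose></patient>
</patients>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT, glucose INTEGER)")

# Shred the document: sub-tree root -> row, child elements -> scalar columns.
for p in ET.fromstring(doc).iter("patient"):
    conn.execute("INSERT INTO patient VALUES (?, ?, ?)",
                 (int(p.get("id")), p.findtext("name"), int(p.findtext("glucose"))))

rows = conn.execute("SELECT name, glucose FROM patient ORDER BY id").fetchall()
print(rows)  # [('Jane Doe', 102), ('John Doe', 98)]
```

The mapping here is hand-coded; the middleware products described above automate exactly this step when the schema follows their expected conventions.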


The mapping of an XML schema to application code is a hot area in software development. This concept is generally known as XML data binding. Sun Microsystems really initiated this area of industry with their product Java Architecture for XML Binding, or JAXB. The exact method by which any product in this area accomplishes the mapping varies greatly. Generally, however, they all follow the concepts outlined previously, usually adding more flexibility or custom functionality depending on their focus. So far we have focused on the one-way mapping starting with an XML schema and transferring it to a relational database. However, data-centric information oftentimes already exists in some type of relational database. Again there is a myriad of software products that will accomplish the reverse mapping, each with its own slight variations, but they all follow a general pattern: each table in a database becomes an XML element type, every column from the table becomes an attribute of that XML element, and finally elements corresponding to primary and foreign keys are added, allowing an informed developer to recursively explore the dependencies. While mapping an XML schema to a relational database, or vice versa, is an intuitive and popular choice as a way of utilizing XML to store information, it does have serious drawbacks. In practice, automated mapping is never truly automated, and the larger the project, the more human intervention will be needed. The larger issue arises from fundamental differences between the relational database and XML paradigms. In the relational database world, consistency and performance are paramount, whereas XML allows for slackness in those areas because of its slightly different goals. It is often said that XML is a verbose language, which implies it carries extra or superfluous information. For example, often when looking at an XML document, there is a parent node which serves as a wrapper element.
This wrapper element serves as a way to data-type the information inside of it, and as a better way to organize the document (see Drawing 1). In a relational database this information would be stored using two tables: one for the glucose readings and one for the observations [12]. However, attempting to map the XML automatically would not yield that result; it would create an extra data structure for the wrapper element. An additional observation is that when following this approach a designer must use relational query languages like SQL to extract and build XML documents.

XML, Databases and Document-centric Information

When an XML document contains document-centric information, the operation of decomposing it to facilitate storage in a relational database is not desirable. Since document-centric information by definition requires that the order and structure of the document remain intact, we should attempt to database the document as a complete entity. There exists a variety of methods to accomplish this goal, including commercial content management applications. While a CMS built to manage XML will certainly do a good job, such systems are generally extremely specific to the area for which they were designed. Rather, we will focus on the two general approaches of using the file system and, again, a relational database. If someone has a set of XML documents, they may simply use their operating system's file system to store them. Modern operating systems have searching and indexing capabilities natively built into them. Even programs like CVS can add additional levels of security and maintenance. This simple approach is obviously limited because of a lack of query capability. These search programs are not XML-aware, and therefore inconsistent or irrelevant results might be returned. Also, management becomes infeasible when dealing with a large collection of documents. A more versatile approach would be to use an existing database and store the documents as a BLOB or


character BLOB. Here we gain many of the useful features that databases provide, such as security, transaction control and monitoring tools. Also, many databases offer full-text search, which allows someone to query entire XML documents. Recently, database designers have built XML awareness into their products. Some allow XML-based queries such as XQuery or XPath to be executed directly on a BLOB containing XML. The designer can create a custom method of indexing the documents in each BLOB. This approach is a simple and quick way of databasing XML documents, but we are still hindered by limited, non-native querying capabilities.

Native XML Databases

As we discussed earlier, most information is not entirely data- or document-centric; it usually falls somewhere in between those two extremes. Another approach to databasing XML documents, which captures not only the area in between but also the data- and document-centric information models, is through native XML databases. Although the technology behind native XML databases is extremely new, especially when compared to its relational competitor, many of the world's cutting-edge systems utilize it. They have been used in genetics and medical research, applications which are based on hierarchical data, and they can even be found on many popular websites [7, 9, 11]. The main motivation behind their use is that most information in the world is semi-structured, and they provide an efficient means of databasing XML documents which capture all that information. The research, development and creation of native XML databases is very much driven by industry. Therefore there is no clear-cut definition of what a native XML database is; however, leading developers in this area [1, 7] state that a native XML database (NXD) must specialize in storing XML data with all its components, including the document model, intact. It must accept XML documents as input and return XML documents as output. How these features get implemented internally, however, is irrelevant.
These few simple criteria may seem to apply to the XML-enabled databases we have described before. Adding the further criterion that providing the previously stated features must be the main goal of the database removes most of the ambiguity. Previously we have described ways in which existing databases can store XML documents; in the following discussion of native databases, however, the focus will not be on the low-level implementations, but rather on the high-level concepts that make native XML databases unique and useful. The low-level implementations will be touched upon again later, but only insofar as to discuss the performance of native XML databases.

Collections

The term collection was used previously in this paper without a formal definition. A collection of XML documents is what a native XML database calls a set of documents [7]. There is no explicit requirement that all documents in a collection be homogeneous, or follow the same schema. In fact most native XML databases do not require a schema to define a collection. Very few available native databases can validate a document against a reference schema before allowing insertion, or possibly even extraction. These collections are often called schema-independent. While this gives application developers flexibility in how they utilize an XML database, it certainly angers database administrators: a schema-independent collection has no intuitive notion of data integrity. A collection is analogous to an instance of a relational database. Note that just as there can exist interaction between relational databases, a similar notion applies to collections. An XML collection can even be nested within another collection. We now need to explore what useful things a native XML database can extract from a collection.

Query Languages

One defining point of native XML databases is their query languages. In a native XML database the primary query language is an XML query language.
As stated previously, there are dozens of published XML query languages, but the most widely used are XQuery and XPath. Various XML database implementations offer varying support for these and other query languages. The important point is to verify that the queries your application needs are supported by that database. It seems intuitive that any method of databasing should offer a way of updating and deleting information, but there exist no standards for these features in native XML databases. The notion of updating or deleting XML data is not as straightforward as its relational counterpart. One way is simple document replacement: if anything in a document is modified, just provide a new document. Some databases provide a live Document Object Model tree which allows real-time updates of the information.

Transactions, Locking and Concurrency

One of the conceptual difficulties behind establishing a method of updating and deleting XML data is how to deal with locking and concurrency. Any relational database can guarantee the well-known ACID properties; in XML these issues have not been solved. The issue can be explained by a simple example. Suppose someone is modifying the glucose reading node discussed previously in Drawing 1. To prevent anyone from changing this data before the person finishes updating it, that particular node is locked. To prevent someone from deleting that node, the parent node is locked, and so forth. This leads to document-level locking, which may be too coarse to be acceptable. This area of research is active not only in business but also in the academic realm. Patrick Lehti has proposed expanding XQuery with methods and mechanisms to deterministically insert and update node-level information [10]. Another technical paper addresses node-level concurrency by introducing a locking mechanism based on the query path being executed [5]. In this idea any node on the path from the root to the node in question will be locked to some degree.
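The coarseness of document-level locking can be illustrated with a minimal sketch. The `DocumentStore` class below is invented for illustration; it locks an entire stored document for the duration of any update, even one that touches a single node:

```python
import threading

class DocumentStore:
    """Minimal sketch of coarse, document-level locking (hypothetical class)."""

    def __init__(self):
        self._docs = {}
        self._locks = {}

    def put(self, name, text):
        self._docs[name] = text
        self._locks.setdefault(name, threading.Lock())

    def update(self, name, fn):
        # The entire document is locked for the duration of the update,
        # even when fn only touches a single node -- too coarse when many
        # writers want to edit different parts of the same document.
        with self._locks[name]:
            self._docs[name] = fn(self._docs[name])

    def get(self, name):
        return self._docs[name]

store = DocumentStore()
store.put("johndoe.xml", "<glucose>102</glucose>")
store.update("johndoe.xml", lambda text: text.replace("102", "110"))
print(store.get("johndoe.xml"))  # <glucose>110</glucose>
```

Node-level or path-based locking would replace the single per-document lock with finer-grained locks along the query path, at the cost of the extra query evaluation discussed next.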
However, actual implementations of such path-based locking are very limited because they involve evaluating the query many times, which is very expensive. Transactions, locking and concurrency are a key issue when considering biomedical information. Existing databases are massive, and they will certainly not shrink in size. Therefore, when considering a native XML database, it is important to understand how it handles these vital issues. In general, most free implementations of native XML databases do not offer node-level locking; however, a few commercial ones do, and it is certainly a planned feature in any product not supporting it now.

Round-Trip XML Documents

In many settings, especially medical information, it may be required by law to maintain exact copies of certain documents [12]. For example, a patient's health record may be used as evidence in a trial. In most cases the court would require that the exact record be presented in order to combat after-the-fact alterations. Various native XML databases support this requirement, each in their own way. A text-based implementation of a native XML database stores entire XML documents in some fashion, then performs queries by scanning the text. Therefore in a text-based native XML database, the exact same file that was inserted into the database will be retrieved (assuming no updates, of course). In this scenario the XML document is no different from any other legally binding digital document. On the other hand, a model-based implementation will only store the exact information up to some level of granularity. Recall that model-based native XML databases work by decomposing the XML document into its constituent pieces. Therefore, when a document is retrieved from the database, while any small window of the document is identical, in reality certain structures that bind the segments together


are created on the fly to piece the document together. In this scenario the ability to reproduce exact copies of documents might not satisfy legal or contractual obligations.

Application Programming Interfaces (APIs)

When relational databases were still developing, they suffered from proprietary APIs, causing developers to have to learn many conventions. Eventually standardized APIs were agreed upon, and although there is no penalty for straying from them, most vendors choose to stay close to keep customers happy. Native XML databases are suffering the same malaise. Each implementation varies not only in its capabilities but also in its interface. Learning from the mistakes of their predecessors, many leading XML database developers are forming standardization committees. The XML:DB API was an initial attempt at creating a standard API, and it included several native XML database developers and even the OpenHealth Care Group. It seems that this initiative has floundered, but many databases still support its API. A more active group, supported by Sun, Oracle, BEA, and Intel, to name a few, is developing JSR 225: XQuery API for Java (XQJ). While this standard is limited to Java, it is likely to be supported explicitly by many implementations of native XML databases. When considering a native XML database for a project with a long life span, as most medical projects will have, it is important to ensure that the interface to your database is as standardized as possible. In the long run, especially with native XML databases being in their infancy, there is still room in the industry for great change. By choosing a database that supports a standard API, you will reduce the amount of rewriting necessary if the database piece has to be changed.

Relational Database Topics Applied to Native XML Databases

This section will cover many of the cornerstone issues in relational databases and look at how they are handled in a native XML database.
It relies heavily on work published by R. P. Bourret, a widely cited consultant and major player in the XML world [12].

Normalization

In health care, the accuracy and consistency of information is a very serious concern; it can be a factor in the life or death of a person. In the relational world, normalization means designing a database such that information is not repeated. Health-care data originates from, and is utilized by, a variety of sources, so ensuring a consistent view of the information may be difficult, but the basic idea is to store only one instance of any piece of data. This reduces the chance of queries accessing different, unsynchronized versions of that data.

When utilizing native XML databases, normalization means the same thing: the goal is to design a schema which does not repeat information. There are several tools available to link one XML node to another in order to minimize repeated information. The W3C has developed the XLink language, which allows just this. Native XML databases often offer an early implementation of XLink, or their own proprietary method of joining data. Currently there is no accepted, implemented standard providing for data normalization; this feature will more than likely have to be added into the application logic if it is desired.

In fact, even though normalization is key in relational databases, the type of information stored in a native XML database may not require normalization, for a variety of reasons. Imagine the scenario of visiting a general practitioner's office. Each member of your family has a medical record in a filing cabinet, and when you arrive at the office the nurse finds that record and passes it to the doctor. Suppose we consider that record to be an XML document, and our schema cleverly implements normalization by linking the section of the document listing your address to another document. All your other family members' address tags link to that same document, eliminating repeated information. Now, instead of finding one file, the nurse must find two, reduce the external file to just your address, unify the two documents, and then pass the result to the doctor. There is a reason why doctors' offices repeat addresses in medical files: efficiency. The price of quick document retrieval in a native XML database might be the cost of ensuring that application logic accurately maintains consistent information.

Referential Integrity

Closely tied to normalization is referential integrity: the idea of ensuring the validity of pointers within the data. In a relational database, if a column in a table of glucose readings contains patient IDs, then it is important to ensure that every ID maps to a patient in the database. Foreign keys must map to valid elements. This concept is handled in two ways in native XML databases: intra-document integrity and inter-document integrity. The W3C schema language provides a few mechanisms to validate intra-document integrity. The ID/IDREF and KEY/KEYREF fields enable one element to reference another; the reference is checked when the document is validated against a schema. Inter-document integrity can be handled with methods like XLink, described above. The interesting point has to do with when the integrity is verified. The mechanisms mentioned are checked when the document is validated, which, if it is done at all, is usually only done upon insertion. In a native XML database, if someone alters a "foreign" key after the document is inserted, more than likely the document is not re-validated. XML database designers are well aware of this deficiency, and more than likely a solution will be added to native databases in the future.
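Because a native XML database typically validates references only on insertion, an application may need to enforce such checks itself. A minimal sketch of an ID/IDREF-style check in Python using the standard library's ElementTree (the element and attribute names, such as `patientRef`, are invented for illustration, not part of any real schema):

```python
import xml.etree.ElementTree as ET

chart = ET.fromstring("""<chart>
  <patient id="p1"><name>Jane Doe</name></patient>
  <reading patientRef="p1" glucose="95"/>
  <reading patientRef="p9" glucose="110"/>
</chart>""")

# Collect every declared id, then flag references that resolve to nothing.
ids = {el.get("id") for el in chart.iter() if el.get("id") is not None}
dangling = [el.get("patientRef") for el in chart.iter()
            if el.get("patientRef") is not None
            and el.get("patientRef") not in ids]

# The second reading points at a patient that does not exist.
assert dangling == ["p9"]
```

Running a sweep like this after every update approximates what schema validation on insertion cannot guarantee once documents are modified in place.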
Performance of Native XML Databases

The performance and scalability of any database management system is often the deal breaker. There is a notion that dealing with XML data is inherently slow because it is a text format. Binary formats are small and compact, and quickly transferred across networks; text is bulkier, requiring more space, and XML itself utilizes a redundant syntax, requiring more text from the beginning. While these statements are true, it is often found that the benefits of XML outweigh these costs. It should be noted that these remarks concern XML as a transfer format, not XML databases themselves.

An interesting paper by T. Fiebig and S. Helmer, of Software AG and Universität Mannheim respectively, entitled "Anatomy of a native XML base management system" and published in 2002, introduces a novel way of storing XML data natively [1]. They explain how properties of XML can be exploited to increase performance compared to various relational, and even XML-aware relational, database technologies. While the purpose of this paper is not to look in depth at any one native XML database, their implementation of a native XML database, called Natix, will be used to familiarize the reader with the low-level mechanisms of native XML technology that factor into performance.

The performance of any database management system is largely determined by its storage engine. The Natix system uses a B-Tree data structure as its storage engine. B-Trees allow logarithmic insertion, deletion and search, while requiring less balancing than similar structures. It seems intuitive for a native XML database system to use trees to store tree-structured XML documents. Storing the XML as a flat file (a CLOB in a database) favors linear data access. Storing the XML in a generated model involves decomposing the information, which allows quick retrieval of individual elements but requires a reconstruction step to retrieve whole documents.
Natix attempts to negate the downfalls of the previous approaches by linearizing subtrees of the XML document and storing them in a B-Tree. Similar subtrees are clustered, which attempts to minimize the time spent seeking when accessing information horizontally (with respect to the schema). Because subtrees are linearized, retrieving information vertically (with respect to the schema) does not require a full reconstruction step. The challenge in the Natix approach is determining how large each linear segment should be, compared to the amount of horizontal search required to find similar elements. The authors provide an algorithm for semantically decomposing large documents based on their tree structure, effectively partitioning one large tree into smaller subtrees. They create what they call proxy nodes to connect subtrees which are not linearized to the same location. By following all proxy nodes, any document can be reconstructed in its entirety.

Unfortunately they did not compare their native XML database implementation with any other, most notably a relational database. Direct comparisons are inherently difficult to execute because speed relies heavily on how the technologies are implemented. A well-coded relational database with hundreds of thousands of man-hours of maintenance and tweaking will surely perform much faster than a native XML database coded by a graduate student. Additionally, it should not be forgotten that a poorly designed database, XML or relational, is still a poorly designed database.

This paper is by no means an exhaustive exploration or comparison of the upper and lower performance bounds of relational and native XML databases. To understand performance, one must completely understand the information model being captured, and then design a good XML schema. One must also consider the data storage engine used by the chosen native XML database, and know what types of queries will be run against it. Only then is it possible to understand the performance capabilities of the native XML database.
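The partition-and-proxy idea can be sketched in a few lines of Python. This is a toy illustration, not the actual Natix design: plain tuples stand in for XML nodes, and the size threshold, record naming, and "proxy:" label convention are all invented. Oversized subtrees are split off into their own records, proxy nodes are left behind, and the whole document can be reconstructed by chasing proxies:

```python
# A node is (label, children). Subtrees bigger than MAX_NODES are
# stored as separate records and replaced in place by a proxy node.
MAX_NODES = 3
records = {}          # record id -> stored subtree
_counter = [0]

def size(node):
    label, children = node
    return 1 + sum(size(c) for c in children)

def partition(node):
    label, children = node
    out = []
    for child in children:
        child = partition(child)            # partition bottom-up
        if size(child) > MAX_NODES:
            rid = f"rec{_counter[0]}"; _counter[0] += 1
            records[rid] = child            # store subtree as its own record
            child = (f"proxy:{rid}", [])    # leave a proxy behind
        out.append(child)
    return (label, out)

def reconstruct(node):
    label, children = node
    if label.startswith("proxy:"):
        return reconstruct(records[label.split(":", 1)[1]])
    return (label, [reconstruct(c) for c in children])

doc = ("patient",
       [("demographics", [("name", []), ("address", [])]),
        ("visit", [("obs", []), ("obs", []), ("obs", [])])])

stored = partition(doc)          # the large "visit" subtree becomes a record
assert reconstruct(stored) == doc
```

Tuning `MAX_NODES` is the toy analogue of Natix's real challenge: larger segments favor vertical (whole-document) access, smaller ones favor horizontal access across clustered, similar subtrees.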
Case Studies

HL7, CDA and XML Standards

One standardization body within the medical community is the Health Level Seven (HL7) group, whose published specifications are ubiquitous in many US and international hospitals. Their current specification, known as Version 2, is obtuse and limited. It is an ASCII medium utilizing vertical-bar characters as the delimiter; if data is missing from a document, the delimiter must still be present even though nothing lies between it and the next. While the information is somewhat human-readable, reading it is by no means an easy task. HL7 has been working for several years on developing a new specification, and this new specification is certainly XML-compatible. Version 3 of their specification is built around a Reference Information Model (RIM). This model has four core classes: entity, role, participation and act. The model is designed to standardize specifications within the HL7 organization. In XML format, Version 3 is referred to as the Implementation Technology Specification.

HL7's Structured Documents Technical Committee is the creator of the standard known as the Clinical Document Architecture (CDA). The most recent version of the CDA leverages RIM to create its XML model, using elements such as observation, code, and value. The CDA, and similar XML standards, are beginning to be prevalent in the industry [9].

The area of clinical trials has been quick to adopt XML technology, as it enables interoperability between various systems, a critical notion in drug development. The Clinical Data Interchange Standards Consortium (CDISC) has several XML data models to be used by researchers conducting clinical trials. One such standard is the Operational Data Model, which is designed to bridge the gap between information collection points, such as a doctor's office, and a centralized database.

Another area where XML standards have taken a strong hold is genetics. This field is moving at an impressive rate and generating incredible amounts of information. Several organizations have begun centralizing this information around the world, and many of them even provide an XML-based model for the data. However, because each was created independently, interoperability is limited. Eventually, several more universal standards emerged for various problems within the field. The Bioinformatics Sequence Markup Language (BSML) standardized the way raw genetic sequences can be stored, or at least transferred. One group in particular is testing the size limits of information stored within XML: the Microarray Gene Expression group (MAGE) has developed a standard to store information taken from gene microarrays. This information takes the form of large image files and summaries which often occupy gigabytes of disk space.

A paper published by IBM researchers on XML and biomedical information [9] notes several common problems with XML and medical data. As just mentioned, XML files may be very large, and there are certain file-size limits imposed by operating and database systems. This is probably not a long-term problem; as XML databases mature, larger file support will become available. The more fundamental problem highlighted was the complexity of XML specifications. A biomedical XML schema can be very complex, requiring dozens of files just to describe itself. In fact, an attempt to transform the CDA schema into an object-model representation exhausted the 512 megabytes of memory allocated for the Java heap; tracing the execution of the conversion showed the system attempting to create nested objects dozens of levels deep. This level of complexity may demand not only a powerful system, but a powerful mind to utilize it effectively. The IBM paper diagrams the work flow needed to gather the necessary data to do genetic counseling.
The final product requires the integration of several XML specifications from a variety of separate information databases. While this task is difficult, the use of the specifications makes designing such a work flow possible.

CDA and the Mayo Clinic

The Mayo Clinic adopted the CDA specification very early on, most recently with the XML-based Version 3 [3]. They have successfully utilized CDA as a way of collecting inbound information from regional systems and transferring that information to a variety of heterogeneous systems. The two main uses are for clinical notes and genomic data. Inside the Mayo Clinic's system there are two XML document repositories which store the CDA documents for genomics and MICS (notes). Once documents are loaded into these two silos, they can be transferred through an interface engine which maps Mayo's CDA to the published XML specifications of the genomics and MICS repositories. The motivation behind choosing the CDA standard was entry automation, templated document entry and digital dictation. CDA also offered the clinic the additional benefits of readability, durability, shareability, and flexibility. XML is a textual language, so documents are both human- and machine-readable. The standardization ensures that when the supporting hardware changes, the replacements can be seamlessly integrated. Also, because of the standards, less effort has to be put into sharing information with outside partners, saving money. Finally, XML is also a web language, so CDA documents can be viewed over the internet.

Native XML Database and Clinical Document Research

A recent publication by several researchers at Columbia University details their use of a native XML database for clinical document research [0]. Their goal was to design a schema which captures clinical documents at a very fine level of detail, and to leverage a native XML database to efficiently retrieve information.
The study utilized two of the key benefits that XML offers over other technologies. They exploited XML's self-describing abilities to add meaning to all human-written text in the documents, and XML's hierarchical nature to perform quick and efficient queries. Their goal was to create a system which accommodated a broad range of researchers' needs: above all, the ability to rapidly process queries against the text and against markup added to the text. This was accomplished using XQuery as a means to search the XML elements and perform full-text searches on the element data. Other goals also accomplished via XQuery were to provide a standard method for querying the documents, the ability to select documents along many different axes of interest, the ability to deliver the correct level of granularity of information, and finally a flexible schema able to adapt to and query new information and annotations without destroying previous work.

The goals which could not be met with XQuery alone were fulfilled by the native XML database itself. The database provides security and reliability for the XML documents. Also, the collection mechanism allows non-homogeneous XML schemas to be stored concurrently, allowing for the seamless integration of information. The referencing and linking abilities of XLink and various database technologies allowed the researchers to structure their documents in such a way as to link words and concepts to external entities. They structured their schema to store documents at sentence-by-sentence granularity, thereby enabling the user to easily select portions of the text.

The native XML database used by these researchers, along with a well-thought-out schema, allowed them to create a system with unprecedented ease of access for their users. The native XML database let the design effort focus on the information model rather than on implementation details like mapping XML data to a relational model. In this situation, with this particular model of information, XML was very beneficial.
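The sentence-level granularity described above can be approximated even without a full XQuery engine. A hedged sketch using Python's ElementTree and its limited XPath support (the schema here, with `sentence` elements carrying concept annotations, is invented for illustration; it does not reproduce the Columbia system's actual schema or queries):

```python
import xml.etree.ElementTree as ET

note = ET.fromstring("""<note>
  <section name="history">
    <sentence concept="diabetes">Patient has type 2 diabetes.</sentence>
    <sentence>Denies chest pain.</sentence>
  </section>
</note>""")

# Select only the sentences annotated with the concept of interest.
hits = note.findall(".//sentence[@concept='diabetes']")
assert [s.text for s in hits] == ["Patient has type 2 diabetes."]
```

Because each sentence is its own element, the query returns exactly the granularity the researcher asked for, rather than the whole document.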
UCONN BMI and Native XML Database

As part of a system designed to integrate an individual's personal health records and medical data with a provider's health data, along with the latest relevant research information, the BMI class explored the use of native XML databases. The goal was to build into the system an XML database which would facilitate the transfer of medical information from a patient to a provider and vice versa. An XML database and specification could also serve as a way of standardizing the information researchers publish. This would enable better integration of that information into the system, allowing both patients and providers better access to the most recent findings.

The stated goals were quite large and would certainly take longer than one semester, so as a proof of concept it was decided that user registration would be done via XML. This goal was implemented successfully, although it was a bumpy road. In hindsight we can apply the principles described in this paper to better understand where our mistakes were. First, the information model we desired to capture with XML must be understood. User registration consisted of several fields, ranging from a user's name, phone number and address to which medical specialties (if the user was a provider or researcher) the user was interested in. This information has very few interdependencies, which is evident if we consider the implications of re-arranging the order of the elements in the XML document. In fact I would claim registration is very data-centric.

Next we can explore what type of XML storage best supports our data-centric information model. The answer was trivial, because all the registration fields were extracted from a relational database schema we had developed; data-centric information maps very easily to a relational model, by definition. This is where we made our first mistake: for the sake of a demonstration, we attempted to take data which fits natively in a relational schema and map it to an XML one. The consequences were immediately obvious when the developers began implementing the code. Since information in a web form is stored and transferred in text format, when the data entered the web server it had to be transformed into Java application data types. This conversion was hand-coded and led to incompatibilities in the Date data types. Next, Java code took the data we had just mapped and wrote it out as an XML document in a file stream (text). This conversion takes no mapping, but it changed the data type of the information again (Java types to text). Finally, using our native XML database's API, the document was inserted into the database (Illustration 1).
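The final hop of that pipeline, serializing the mapped registration data as an XML document, can be sketched as follows (the field names are illustrative, not the actual registration schema; Python's stdlib ElementTree stands in for the Java serialization code):

```python
import xml.etree.ElementTree as ET

# Registration fields as they might arrive from the web form (all text).
form = {"name": "Jane Doe", "phone": "555-0100", "specialty": "cardiology"}

# Serialize straight to an XML document for insertion into the database.
reg = ET.Element("registration")
for field, value in form.items():
    ET.SubElement(reg, field).text = value
document = ET.tostring(reg, encoding="unicode")

assert "<name>Jane Doe</name>" in document
```

Note that every hop in the real pipeline (form text to Java types to XML text) is a type conversion, and the Date incompatibility described above arose in exactly such a hand-coded hop.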

While our test was a success, it was difficult to see the benefits XML gives to the system. Rather than choosing such a data-centric example, a more heterogeneous one, such as transferring a patient's PHR to a researcher, might have worked better. Certainly there must be a smoother process for handling XML data in our system which does not require multiple mappings. Also, Illustration 1 does not address extracting information from the XML database. Illustration 2 provides another possible methodology to better utilize a native XML database.

In this illustration the system behaves much more like a service-oriented application. By providing a Javascript toolkit to the user's browser which transforms HTML form data into XML, we standardize how the server handles requests. To the server, there is no distinction between a user submitting a form and a partner system transferring data. Once the request is sent to the server, our information exists exclusively in the XML domain. Physically the server handles the XML document as a string data type, but this fact is unimportant, as we only use XML languages to manipulate it. Business logic can be performed using XQuery. In this phase the server may validate the document, or transform it slightly, because the public schema might differ slightly from the database's for security reasons. Finally, using the native database's Java API, the document is submitted. A similar process for document retrieval can be seen in Illustration 3. The important point to note in this illustration is that the server can perform an extra transformation step on the document before returning it to the requester. This step might obfuscate the database schema, or even prepare a fully functional XHTML document for display.

Illustration 1: Naive

Illustration 2: Input
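The outbound transformation step can be sketched as a small filter that strips database-internal details before a document leaves the server. This is a minimal sketch: the `dbId` attribute is a hypothetical internal field, not part of any real schema, and ElementTree stands in for whatever XML tooling the server actually uses:

```python
import xml.etree.ElementTree as ET

def outbound_transform(xml_text):
    """Remove internal attributes so the requester never sees
    database-side schema details."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        el.attrib.pop("dbId", None)   # hypothetical internal attribute
    return ET.tostring(root, encoding="unicode")

out = outbound_transform('<patient dbId="42"><name>Jane Doe</name></patient>')
assert "dbId" not in out
assert "<name>Jane Doe</name>" in out
```

The same hook could instead apply an XSLT-style transformation to produce a display-ready XHTML document, as described above.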

Conclusions

This document has attempted to provide a general synopsis of XML databases in biomedical informatics. Several specific case studies were introduced, but discussion of particular XML database products was avoided: a summary in this document would not provide enough information to differentiate the various existing products, and since XML databases are a very active field, such information would quickly become dated. If the reader desires such information, I suggest viewing the web page of [12] in the references section; it is arguably the most up-to-date and complete survey of products as of this writing.

Biomedical information is a natural fit for encapsulation in XML format. Some of XML's earliest proponents noticed this intuitive fact, and XML and medical information have therefore advanced together. It is important to fully understand the information model you are attempting to capture; XML is certainly not a panacea for all data-related issues. If the information is heterogeneous in nature, and/or there is a need for semantic interoperability along with transferability, then XML is a good choice. Once the information model you are attempting to capture is identified as compatible with XML, you can choose an XML database that best complements your data. There exist plenty of XML-aware DBMSs which can be easily utilized if the information is very data- or document-centric. These have the advantage of years of development, but do not fully utilize all the potential XML can offer. If the information is heterogeneous, like most medical information, then a native XML database is appropriate. Native XML databases attempt to keep the advantages of their XML-aware counterparts while fully utilizing every feature of XML. A native XML database uses XML as its fundamental storage medium, takes XML as input, and returns XML as output.
The performance of XML databases is tied to two major factors: a well-designed schema, and how the product implements its data storage layer. One popular approach is to linearize subtrees of a document, attempting to balance the speed of horizontal (across the schema) and vertical (with the schema) data access. Native XML databases are not new to the biomedical world; there have been many successful implementations in real-world systems, and there will certainly be many more as interest in, and development of, XML languages and databases increases.

Illustration 3: Output


References

[0] S. Johnson, D. Campbell. "A Native XML Database Design for Clinical Document Research". Dept. of Medical Informatics, Columbia University, New York, NY.
[1] T. Fiebig, S. Helmer, et al. "Anatomy of a native XML base management system". The VLDB Journal, Volume 11, Number 4. Springer, Berlin. December 2002.
[2] J. Boyer. "Canonical XML Version 1.0". W3C, 15 March 2001.
[3] C. Beebe. "Clinical Document Architecture Update and Preview". Mayo Clinic, 2002. <http://www.hl7.de/iamcda2004/finalmat/day1/Calvin%20Beebe%20CDA%20Update.pdf>
[4] "Extensible Markup Language (XML)". W3C, 1 January 2008. <http://www.w3.org/XML/>
[5] P. Lehti. "Design and Implementation of a Data Manipulation Processor for an XML Query Language". Technische Universität Darmstadt, August 2001.
[6] Orgun, J. Vu. "HL7 ontology and mobile agents for interoperability in heterogeneous medical information systems". Computers in Biology and Medicine, Volume 36, Issues 7-8, pages 817-836.
[7] K. Staken. "Introduction to Native XML Databases". O'Reilly XML.com, 31 October 2001. <http://www.xml.com/pub/a/2001/10/31/nativexmldb.html>
[8] "Overview of the CDISC Operational Data Model". CDISC, 26 April 2002.
[9] A. Shabo, et al. "Revolutionary impact of XML on biomedical information interoperability". IBM Systems Journal, Volume 45. 2 November 2006.
[10] S. Helmer, C.-C. Kanne, G. Moerkotte. "Evaluating lock-based protocols for cooperation on XML documents". SIGMOD Record 33(1), 2004.
[11] Working Group on Biomedical Computing, Advisory Committee to the Director of the National Institutes of Health. 3 June 1999. <http://www.nih.gov/about/director/060399.htm>
[12] R. Bourret. "XML and Databases". September 2007. <http://www.rpbourret.com/xml/XMLAndDatabases.htm>
[13] XML Schema Working Group. "XML Schema". W3C, 1 January 2008. <http://www.w3.org/XML/Schema>
[14] Brown, Fuchs, et al. "XML Schema: Formal Description". W3C, 25 September 2001. <http://www.w3.org/TR/xmlschema-formal/>
[15] "XML Path Language (XPath) 2.0". W3C Working Draft, 2 May 2003.
[16] S. Boag, et al. "XQuery 1.0: An XML Query Language". W3C, 17 January 2007. <http://www.w3.org/TR/xquery/>