
HCA201 Healthcare Data Acquisition and Management Lesson 6

Copyright © 2014 The Regents of the University of California

Slide 1: Healthcare Data Models Hello students, and welcome to the lesson titled “Healthcare Data Models” given by Bill Riedl. This lesson is about data storage strategies, and you are in good hands with Bill. He is an expert in the area of clinical databases and has years of experience on the leading edge of informatics and analytical projects at UC Davis. Bill carefully describes four storage strategies commonly used in today’s healthcare data storage landscape. These four strategies are what you are most likely to encounter when working in the area of healthcare data management. First, Bill will describe ontologies, the most abstract storage strategy. Second, “big data” is currently the buzz in the industry, so he will teach you what database types are used in this area. Third, hierarchical data stores offer great value to healthcare, especially when designing systems for providers to quickly access all of a patient’s information. Finally, Bill covers relational data stores, the most common storage paradigm you’ll encounter in healthcare and many other industries. For these four storage strategies, Bill offers insights in the following areas: • Describe the pros and cons of big data technologies • Suggest reasons why relational systems are the most widely adopted • Compare examples of different types of common relational data models • Illustrate a conceptual roadmap to follow when moving data between systems and storage paradigms. I am excited about this lesson because I think you will gain a very clear understanding of how relational databases work. This is critical because most health organizations use them. But it is also important to learn how other storage solutions work, since these may soon disrupt the current approaches. Listen carefully. The examples and concise words will help you communicate solutions in your organization. In my work, I constantly apply the information in this lesson.

Slide 2: Understanding Storage I would like for you, the students of this class, to think about common tasks in your life in which you arrange and record data. For example, I enjoy music, but I’m also very fickle. Many of the musical tools or apps today allow you to create playlists. Let’s say I’ve built a lot of playlists based specifically on artists, but it’s Wednesday, and I want to listen to a little bit of everything in the rock genre. My playlist won’t support that. The point being that at some level, your data organization strategy will be dictated by the intended use.

Slide 3: Understanding Storage

While we present storage as one of the four pillars of this course, I’d like for you to understand that it’s embedded within the continuum of the topics we cover, from data capture all the way through to analytics. What varies is the storage strategies. This course is designed to give you an overview about how to manage, understand, and reuse healthcare data. The first point you need to keep in mind is healthcare data is captured to support the business process of healthcare, not to be reused for reporting or analytics. This means there’s typically some effort involved in reusing the data.

For example, put yourself in your doctor’s shoes. Wouldn’t you expect the clinical application you’re working with to pull up patient information quickly, and to save any edits or newly entered information just as quickly? This use case is very different from trying to understand how many and what type of patients on medication X suffered side effect Y, which is a pretty typical use case for the reuse of healthcare data. In a typical healthcare setting, data is captured and saved by one or more systems. Often, answering a question such as our medication side effect example requires collecting data from several of these source systems, aligning it, and then selectively reporting the important elements. As the data moves through these steps, it will likely be moved from system to system and rearranged along the way. The second point I’d like to make is more of an ethical one. While you as health system representatives manage the systems that collect and manage healthcare data, it is technically the property of the patients you care for. While I will be focusing on the technical aspects of data storage and reuse for the remainder of this lecture, please be sure to keep in mind that healthcare data is sensitive information, and it must be handled appropriately.

Slide 4: The Mechanics of Storage Before we dive in, I’d like to take a moment to demystify data storage. All storage paradigms are simply a way to access data on disk, much the same way you save and retrieve documents on your computer. In this case, the location of the document is indexed in your memory, and the disk is simply saving a string of ones and zeroes that can be used to recapitulate your document by the program in which you created it. While the document is saved as ones and zeroes on disk, there’s no visibility into the document’s context. Only applications that can deal with that document type are able to open, recapitulate the document, and allow you to review and/or edit it. All storage paradigms function this way. They save ones and zeroes to disks. That’s it.

Slide 5: Data Description The differentiating factor lies in the level of formal data description. By describing the data, you make it more widely available to processes other than the one that created it. For example, let’s assume that your computer requires that you enter the number of paragraphs in your document upon saving. Imagine now searching for all documents on your computer that contain more than five paragraphs. You have gained some accessibility into the data, but you paid an upfront cost of entering additional descriptions. This is the art of data storage that is tough to teach. Maintaining a full description of your data may seem reasonable at the time, but often it is so time consuming that it cripples systemic effectiveness. However, without any description at all, one is left with only a naïve approach to the information contained within the data. Now that I’ve primed you with a relatable example, we’re going to take a step back and focus on each technology, one by one. I’ll do my best to provide real world examples of how each technology may best be utilized. Our first stop will be ontologies.
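
The paragraph-count example can be sketched in a few lines of Python. This is only an illustration: the file names, their contents, and the `paragraph_count` field are invented here, and a real system would record its descriptions however its storage layer allows.

```python
# Hypothetical documents: names and contents are invented for illustration.
documents = {
    "notes.txt": "First paragraph.\n\nSecond paragraph.",
    "report.txt": "\n\n".join(f"Paragraph {i}." for i in range(1, 8)),
}

def count_paragraphs(text):
    """Count blank-line-separated paragraphs (a simplifying assumption)."""
    return len([p for p in text.split("\n\n") if p.strip()])

# Upfront cost: describe each document once, at save time.
catalog = {name: {"paragraph_count": count_paragraphs(text)}
           for name, text in documents.items()}

# Payoff: the question is answered from the descriptions alone,
# without reopening any document.
long_documents = [name for name, meta in catalog.items()
                  if meta["paragraph_count"] > 5]
print(long_documents)
```

The trade-off is visible even at this scale: every save pays for the description, and only questions the description anticipates become cheap.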

Slide 6: Ontologies Defined Ontologies are highly descriptive. They seek to completely define concepts and relationships within a given domain. Additionally, they often come with a defined set of rules or sets of characteristics about concepts that further help us define what a concept is and is not. For this reason, ontologies are used mainly as referential data sets for classification and data exploration tasks.



Slide 7: Ontology Example An example of a classification task that can be performed with an ontology could look something like this. Let’s say we want to understand how often our doctors are prescribing aspirin or aspirin analogues for use in their patient population. Thousands of drugs meet the criteria of being either aspirin or an aspirin analogue. In this case, we would use the ontology to compare each and every medication order against the rules that describe what an aspirin or an aspirin analogue is, thus allowing us to classify each of our very different medication orders into a logical grouping of aspirin or aspirin analogues. From a data exploration perspective, we may wish to look at a patient’s medical record and see how many aspirin-analogue-related drugs they’re on. It is possible to use the intricately defined relationships within an ontology to understand how many other drugs the patient is already on that may have pharmaceutical effects similar to aspirin, thus empowering the physician to do a better job of assessing whether or not aspirin is truly indicated. Because of their descriptive power, the use of ontologies can be computationally intensive, especially as they grow large. It’s unlikely you’ll regularly use a true ontology in practice. However, the term ontology is often intertwined with medical terminologies, so you will hear the word thrown around in day-to-day practice. Likely, the first ontology that will be widely used in medicine is SNOMED-CT, which you may have heard about in the previous lecture, but that is years away from widespread adoption. If you’re interested in reading more about ontologies, please look up the Stanford Protégé project or SNOMED-CT itself.
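
To make the classification task concrete, here is a minimal sketch that treats an ontology as nothing more than a set of “is-a” links. Real ontologies carry far richer relationships and rules; the concept names and the hierarchy below are toy assumptions, not content from SNOMED-CT or any other terminology.

```python
# Toy "is-a" hierarchy; concept names are illustrative only and are
# not taken from SNOMED-CT or any real terminology.
IS_A = {
    "aspirin 81 mg tablet": "aspirin",
    "aspirin 325 mg tablet": "aspirin",
    "aspirin": "salicylate",
    "salicylate": "analgesic",
    "acetaminophen 500 mg tablet": "acetaminophen",
    "acetaminophen": "analgesic",
}

def ancestors(concept):
    """Walk the is-a links upward and collect every broader concept."""
    found = []
    while concept in IS_A:
        concept = IS_A[concept]
        found.append(concept)
    return found

def is_kind_of(concept, target):
    """True if concept is the target itself or a descendant of it."""
    return concept == target or target in ancestors(concept)

orders = [
    "aspirin 81 mg tablet",
    "acetaminophen 500 mg tablet",
    "aspirin 325 mg tablet",
]

# Classify each medication order against the "aspirin" grouping.
aspirin_orders = [o for o in orders if is_kind_of(o, "aspirin")]
print(aspirin_orders)
```

Even this toy version hints at the computational cost: every order is checked by walking relationships, and the walk grows with the size of the ontology.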

Slide 8: Defining Big Data The definition of big data is rapidly evolving. At one point in time, it was purely a volumetric assessment. Today, it refers to both large data volumes and inconsistent, rapidly changing data structure. For example, think about all forms of social media. Users are demanding more and more functionality in the form of new data types and multimedia. This level of volatility in the data types that need to be persisted, or saved, has never been seen before. As such, new technologies are arising to better handle this new reality. However, most of them do so at the expense of the formal modeling found in legacy data store systems. This means the data is merely saved. It is not broadly accessible. To relate this to medicine, consider progress notes written by your doctor. Most progress notes agree on the basic note sections, typically four: subjective, objective, assessment, and plan. However, the level of detail, the use of acronyms and abbreviations, and other language differences (consider regional influences, specialty-specific influences, et cetera), coupled with the continuously evolving nature of medicine, mean these notes will vary widely.



Slide 9: Big Data These new data storage technologies, commonly referred to as NoSQL or non-relational technologies and sometimes as big data appliances, were created specifically to account for these new and unique demands. Previous storage technologies require a formal description of the data, a process known as data modeling, prior to accepting and/or storing data. In these new big data appliances, data is usually stored in a more natural form, eliminating the need to incur the cost of the data-modeling step. This, however, as we’ve discussed before, comes at a great expense: since the storage strategy itself does not formally describe the data, the data is only valuable to the process that created it or to those willing to spend a lot of time analyzing it. Over the next few slides, I’ll be introducing the most common big data appliances at a very high level simply to provide some context. I encourage you to research them more deeply if you have specific interests. I hope only to introduce the major strength and major weakness of each one of these technologies.

Slide 10: Big Data Technologies I’ll be introducing to you three big data appliances: key:value stores, document stores, and graph databases. Key:value stores are the simplest of the three, offering extremely fast performance but absolutely no data visibility. Document stores are less blind to the data, but this comes at a slight performance and configuration cost compared to a key:value store. Graph databases are the best for modeling data in its natural state. However, gaining an understanding of the data that’s present and its relationships can be a very time consuming process. In general, these technologies don’t currently have much presence in the healthcare field. There is one exception: a specific technology known as Hadoop, a form of document data store, is beginning to have an impact in unstructured data analytics.

Slide 11: Big Data – Key:value Datastores

Key:value stores offer an extremely simple data model: a key uniquely associated with a value. They require no formal data modeling, and offer little more than fast, reliable storage. The use or understanding of the data must be done completely outside of the data store. This is very much akin to saving a document or any other file type on your computer. These characteristics have made key:value stores very popular for rapid, object oriented application development. The reason for this is that the key:value store does not place any limitations on what data type the value can be. It can range from a simple string all the way to a complex object. Oftentimes, object oriented programmers will simply save their objects into this database, making retrieval very fast.
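
As a rough sketch of this model, a plain Python dictionary can stand in for the key:value store. The key format, the record fields, and the `put` and `get` helpers are all hypothetical, and JSON is just one convenient way to turn an object into opaque bytes.

```python
import json

# A plain dict stands in for the key:value store in this sketch.
store = {}

def put(key, value):
    # The store sees only opaque bytes; it knows nothing about the contents.
    store[key] = json.dumps(value).encode("utf-8")

def get(key):
    # Interpretation happens entirely outside the store.
    return json.loads(store[key].decode("utf-8"))

# Hypothetical application object; the fields are invented for illustration.
put("patient:1001", {"name": "Jane Doe", "allergies": ["penicillin"]})

record = get("patient:1001")
print(record["name"])
```

Retrieval by key is immediate, but a question such as “which patients are allergic to penicillin?” cannot be answered by the store itself; the application would have to fetch and decode every value.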



Slide 12: Document Stores Document stores are similar to key:value stores except they are not blinded to the data. This comes at a configuration cost. The document model must be created in advance. Remember the data modeling we spoke about earlier? The advantage is that the data in the document is now query-able. I would like to illustrate with a fictitious example. I’m assuming each one of you in this course is at least familiar with Facebook. Facebook is basically a collection of pages. Each person has a page, sometimes groups have a page, sometimes businesses have pages. In this case, consider Facebook storing each page as a document inside of a document data store. However, Facebook also wants to support searching for pages with common themes, which is a characteristic on the page. In this case, a simple key:value store completely blinded to the data would be ineffective. However, a document data store, which has some data visibility, may be able to support this use case.
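
The page-search example might look like the following sketch, in which each page is a small document with named fields. The field names, the sample pages, and the `find` helper are assumptions made for illustration and do not reflect how any real document store or social network works.

```python
# Each page is a document with named fields (all values are invented).
pages = [
    {"id": 1, "owner": "Ann", "type": "person", "themes": ["hiking", "sushi"]},
    {"id": 2, "owner": "Trail Club", "type": "group", "themes": ["hiking"]},
    {"id": 3, "owner": "Noodle Bar", "type": "business", "themes": ["ramen"]},
]

def find(collection, field, value):
    """Return documents whose field equals value or, for list fields, contains it."""
    matches = []
    for doc in collection:
        stored = doc.get(field)
        if stored == value or (isinstance(stored, list) and value in stored):
            matches.append(doc)
    return matches

# Because the store can see inside the documents, it can answer this directly.
hiking_pages = [doc["owner"] for doc in find(pages, "themes", "hiking")]
print(hiking_pages)
```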

Slide 13: Hadoop As it turns out, data blindness is such a problem that a specialized document store known as Hadoop has been specifically optimized for document querying. Its selling point is that it can help you define your data through analysis after collection as opposed to incurring a data-modeling step prior to collecting information. This approach is very computationally intensive and was not possible until the recent era of powerful computer architectures.



Slide 14: Hadoop (Continued)

Hadoop is an implementation of the MapReduce algorithm, which runs very well on today’s modern, multi-core CPUs. It runs in two steps. Step 1 is the map function. Here the parallelism afforded by modern CPUs allows Hadoop to map the function to every document at the same time, as opposed to applying the function serially, one document at a time. Step 2 is the reduce function: the output from each document is reduced, or combined, to provide one answer. In this specific example, the challenge is to count how often the word John occurs across all of the documents that we have stored. The map step first counts the occurrences of the word John in each document. The reduce step then sums the counts from each output to provide a single answer.
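
The two steps can be imitated in plain Python. This sketch does not use Hadoop or its API; it only mirrors the map-then-reduce shape of the word-count example, and the three sample documents are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import re

# Invented sample documents.
documents = [
    "John saw the doctor. John felt better.",
    "Mary called John.",
    "No mention here.",
]

def count_john(text):
    """Map step: count whole-word occurrences of 'John' in one document."""
    return len(re.findall(r"\bJohn\b", text))

# Map: the same function is applied to every document independently.
# The thread pool only illustrates that the calls do not depend on each other.
with ThreadPoolExecutor() as pool:
    per_document = list(pool.map(count_john, documents))

# Reduce: combine the per-document outputs into a single answer.
total = reduce(lambda a, b: a + b, per_document, 0)
print(per_document, total)
```

Because no map call needs the result of another, the work can be spread across as many workers as are available, which is what makes the approach scale.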



Slide 15: Graph Databases

Graph databases are excellent at representing data in its natural form. They represent data as a series of nodes, or concepts, connected by their natural relationships. They are most effective at exposing secondary, tertiary, or even higher order relationships between concepts. For example, in the image presented to you here, it’s easy to visualize the common friend of two people who enjoy sushi. In general, graph databases have very little penetration into the healthcare field. This also closes our overview of big data appliances. Our next topic will be hierarchical databases, which, in contrast to those we’ve covered so far, have a big place in the healthcare data landscape.
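
A small sketch of the sushi example follows, with the graph held as Python dictionaries of nodes and relationships. The people, friendships, and food preferences are invented, and a real graph database would expose this through its own query language rather than set operations.

```python
# Nodes are people; edges are "friend of" relationships (invented data).
friends = {
    "Ann": {"Bob", "Cara"},
    "Bob": {"Ann", "Dan"},
    "Cara": {"Ann", "Dan"},
    "Dan": {"Bob", "Cara"},
}

# A second relationship type: "likes".
likes = {
    "Ann": {"pizza"},
    "Bob": {"sushi"},
    "Cara": {"sushi"},
    "Dan": set(),
}

def common_friends(a, b):
    """People connected to both a and b by a friendship edge."""
    return friends[a] & friends[b]

# First find the people who enjoy sushi, then follow their friendship edges.
sushi_fans = sorted(person for person, foods in likes.items() if "sushi" in foods)
shared = common_friends(sushi_fans[0], sushi_fans[1])
print(sushi_fans, sorted(shared))
```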



Slide 16: Hierarchical Databases

Hierarchical data stores are fairly intuitive. Consider, for example, the strict chain of command enforced within the military rank system. Each person reports up to one and only one person with a superior rank. In contrast, any given person may have many people reporting to them. In data terms, general concepts will be located near the top while specific concepts will be located near the bottom.

Slide 17: Hierarchical Database: Characteristics In technical terms, we can sum up the military rank example, as well as our example of general data descending to more specific data, with this quote: “Hierarchical models are well suited to modeling data in which the segments are composed of descending one to many relationships.” This intrinsic characteristic has made hierarchical data stores highly relevant to healthcare. Medicine is practiced one patient at a time. Therefore, clinical systems are optimized to present and interact with one patient at a time. Consider a hierarchical model with patients near the top. The intrinsic model of the data is optimized to return an entire patient’s record in one go. This offers massive scaling for clinical applications at a much lower cost than other storage paradigms. In fact, this model is so well suited for medicine that the programming language MUMPS, commonly known as M, was developed with a built-in hierarchical database specifically to house medical information. This took place all the way back in 1966.



Slide 18: Hierarchical Databases: Drawbacks

Hierarchical databases have their drawbacks. Due to their intrinsic structure, they must be accessed from their top node. Consider the challenge of counting the number of office visits last month. Remember, our hierarchy has patients at the top. Our analysis is subsequently forced to start with the first patient, descend down their tree, evaluate each encounter associated with that patient, test whether or not it occurred that month, update the tally, then ascend back up to the next patient, descend back down to encounters, and so on. This is not very efficient. In this case, wouldn’t it be easier if all the encounters at your health system were just stored in one location, and you could count them directly?
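
The traversal described above can be sketched with nested Python dictionaries standing in for the hierarchy. The patients, encounter types, and dates are invented, and the flat list at the end is only a stand-in for a store that keeps all encounters in one place.

```python
from datetime import date

# Hypothetical hierarchy: patients at the top, encounters beneath each patient.
patients = {
    "P1": {"encounters": [
        {"type": "office", "date": date(2014, 3, 4)},
        {"type": "office", "date": date(2014, 2, 20)},
    ]},
    "P2": {"encounters": [
        {"type": "office", "date": date(2014, 3, 18)},
        {"type": "inpatient", "date": date(2014, 3, 19)},
    ]},
}

def is_office_visit_in(enc, year, month):
    return (enc["type"] == "office"
            and enc["date"].year == year
            and enc["date"].month == month)

def count_hierarchical(tree, year, month):
    """Start at each patient, descend to encounters, test, tally, ascend."""
    tally = 0
    for patient in tree.values():
        for enc in patient["encounters"]:
            if is_office_visit_in(enc, year, month):
                tally += 1
    return tally

# For contrast: a store that already keeps every encounter in one collection.
# (It is built from the tree here only to keep the sketch short.)
all_encounters = [enc for p in patients.values() for enc in p["encounters"]]

def count_flat(encounters, year, month):
    return sum(1 for enc in encounters if is_office_visit_in(enc, year, month))

print(count_hierarchical(patients, 2014, 3), count_flat(all_encounters, 2014, 3))
```

Both functions return the same count; the difference is that the hierarchical version cannot reach any encounter without first passing through its patient.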

Slide 19: Relational Data Stores So far, I’ve discussed a variety of storage paradigms with specific utility. Here, we introduce relational data stores. You can consider a relational data store the jack of all trades. In fact, relational data stores are so flexible, it is possible to implement many of the previously discussed storage paradigms within them. This flexibility, however, comes at a cost. While it is possible to model information in virtually any configuration, it is nearly impossible to achieve the performance achieved with more niche systems. In general, a relational database could be optimized to perform well at closely related tasks.

Slide 20: What is a Relational Database? A relational database is a collection of data tables and formally described relationships. Each data table typically represents a collection of related data. A table is formally described as a series of columns representing the data elements that can be found in each record or row. Each row could be treated independently of other rows and should be identifiable by a key, the primary key. This is one or more column values which uniquely identify a row within a table. The power in a relational database is the ability to link or relate data in many tables together. For example, if records in two tables should be associated with one another, at least one of the tables should be given a column representing the primary key of the other table. This is known as a foreign key because it uniquely identifies a row in a foreign table, allowing these two tables to be joined or linked together.
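
Here is a minimal sketch of a primary key and a foreign key, using Python’s built-in `sqlite3` module with an in-memory database. The table names, column names, and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,   -- primary key: uniquely identifies a row
    name       TEXT NOT NULL
);
CREATE TABLE encounter (
    encounter_id INTEGER PRIMARY KEY,
    patient_id   INTEGER NOT NULL REFERENCES patient(patient_id),  -- foreign key
    visit_date   TEXT NOT NULL
);
INSERT INTO patient VALUES (1, 'Jane Doe'), (2, 'John Roe');
INSERT INTO encounter VALUES
    (10, 1, '2014-03-04'), (11, 1, '2014-03-18'), (12, 2, '2014-03-05');
""")

# The foreign key lets the two tables be linked back together.
rows = conn.execute("""
    SELECT p.name, e.visit_date
    FROM encounter AS e
    JOIN patient AS p ON p.patient_id = e.patient_id
    ORDER BY e.encounter_id
""").fetchall()
print(rows)

# An encounter that points at a patient who does not exist is rejected.
try:
    conn.execute("INSERT INTO encounter VALUES (13, 99, '2014-03-06')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```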



Slide 21: Relational Database: SQL Joins

When joining two tables in a relational database together, think of a Venn diagram. A logical join syntax exists to retrieve every possible Venn representation. This is best taught visually.
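
In place of the visual, two of the most common joins can be compared side by side. This sketch again uses `sqlite3` with invented tables: one patient has an encounter and one does not, so the two regions of the Venn diagram return different rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE encounter (
    encounter_id INTEGER PRIMARY KEY,
    patient_id   INTEGER,
    visit_date   TEXT
);
INSERT INTO patient VALUES (1, 'Jane Doe'), (2, 'John Roe');
INSERT INTO encounter VALUES (10, 1, '2014-03-04');
""")

# INNER JOIN: only the overlap, i.e. patients that have an encounter.
inner_rows = conn.execute("""
    SELECT p.name, e.visit_date
    FROM patient AS p
    INNER JOIN encounter AS e ON e.patient_id = p.patient_id
    ORDER BY p.patient_id
""").fetchall()

# LEFT JOIN: every patient, with NULL where no encounter matches.
left_rows = conn.execute("""
    SELECT p.name, e.visit_date
    FROM patient AS p
    LEFT JOIN encounter AS e ON e.patient_id = p.patient_id
    ORDER BY p.patient_id
""").fetchall()

print(inner_rows)
print(left_rows)
```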

Slide 22: Relational Databases: Normalization Let’s illustrate the concept of normalization with these classic examples. At the two extremes of the normalization continuum lie on one end transactional systems, and on the other end, analytical systems. Let’s start by first discussing transactional systems.

Slide 23: Transactional and Analytical Databases Transactional systems are optimized for data creation, updates, and deletions. These are typically highly normalized. The reason for this is there’s simply less accounting to be done when these operations take place. Highly normalized databases, as you’ll remember, have less data duplication. So for each data point that must be deleted, it’s optimal to only have to delete it once and be done. Conversely, analytical systems are designed for applying functions such as counting, summing, or averaging. Remember the example I discussed very early in this lecture about counting the number of patients on medication X with side effect Y? In this case, we are not concerned with optimizing for creations, updates, and deletions. Rather, we’re interested in optimizing to produce the count of patients that meet these specific criteria. Each one of these use cases was optimized for performance at its intended task. There is no single data model which will optimally perform both of these tasks within the SQL realm.

Slide 24: Data Warehouse/Repository Models Now that we’ve gained an understanding of what a relational data store is, I would like to introduce some common data models used for data aggregation and analytics. I will be focusing specifically on three models because they are the most commonly used in this space. They are the star schema, the snowflake schema, and the cube.



Slide 25: Data Warehouse: Star Schema

We will first discuss the star schema. It is aptly named because it looks like a star. At the center of the star is a fact which can be fully described by the intersection of many characteristics broken up into dimensions. Let’s illustrate by considering an office visit. An office visit requires a patient, a doctor, a physical location or clinic, and it takes place at a certain time. The intersection of that patient, that doctor, that clinic, and that time compose the fact of an office visit.
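
The office visit example could be laid out roughly as follows. This is a sketch under assumed names: the `dim_` and `fact_` prefixes, the columns, and the sample rows are invented, and a production warehouse would carry many more attributes per dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per dimension (the points of the star).
CREATE TABLE dim_patient (patient_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_doctor  (doctor_key  INTEGER PRIMARY KEY, name TEXT, specialty TEXT);
CREATE TABLE dim_clinic  (clinic_key  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, iso_date TEXT, month TEXT);

-- The fact (the center of the star): one row per office visit.
CREATE TABLE fact_office_visit (
    patient_key INTEGER REFERENCES dim_patient(patient_key),
    doctor_key  INTEGER REFERENCES dim_doctor(doctor_key),
    clinic_key  INTEGER REFERENCES dim_clinic(clinic_key),
    date_key    INTEGER REFERENCES dim_date(date_key)
);

INSERT INTO dim_patient VALUES (1, 'Jane Doe'), (2, 'John Roe');
INSERT INTO dim_doctor  VALUES (1, 'Dr. Lee', 'Cardiology'),
                               (2, 'Dr. Patel', 'Family Medicine');
INSERT INTO dim_clinic  VALUES (1, 'Main Street Clinic');
INSERT INTO dim_date    VALUES (20140304, '2014-03-04', '2014-03'),
                               (20140402, '2014-04-02', '2014-04');
INSERT INTO fact_office_visit VALUES
    (1, 1, 1, 20140304), (2, 2, 1, 20140304), (1, 2, 1, 20140402);
""")

# An analytical question: office visits per doctor in March 2014.
visits_by_doctor = conn.execute("""
    SELECT d.name, COUNT(*)
    FROM fact_office_visit AS f
    JOIN dim_doctor AS d ON d.doctor_key = f.doctor_key
    JOIN dim_date   AS t ON t.date_key   = f.date_key
    WHERE t.month = '2014-03'
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()
print(visits_by_doctor)
```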



Slide 26: Data Warehouse: Snowflake Schema

The snowflake schema is much like a star schema, except that it is more highly normalized. Like a star, it is aptly named because it looks like a snowflake. I’d like to illustrate the difference by revisiting our office visit example. Our star schema modeled the doctor in one single dimension. Let’s say one aspect of the physician we would like to record is physician specialty. When modeled as a single dimension, only one specialty can be stored. Some physicians have multiple specialties requiring a more complex data model. In the case of the snowflake, we can handle these nuances easily. However, this comes at greater cost, resulting from additional data modeling.
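
One way to picture the more normalized doctor dimension is sketched below. Specialty is moved out of the doctor table into its own table, with a linking table between them so a physician can have several. The table names and rows are assumptions for illustration, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_doctor    (doctor_key    INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_specialty (specialty_key INTEGER PRIMARY KEY, name TEXT);

-- Linking table: one row per (doctor, specialty) pair.
CREATE TABLE doctor_specialty (
    doctor_key    INTEGER REFERENCES dim_doctor(doctor_key),
    specialty_key INTEGER REFERENCES dim_specialty(specialty_key),
    PRIMARY KEY (doctor_key, specialty_key)
);

INSERT INTO dim_doctor    VALUES (1, 'Dr. Lee'), (2, 'Dr. Patel');
INSERT INTO dim_specialty VALUES (1, 'Cardiology'), (2, 'Internal Medicine'),
                                 (3, 'Family Medicine');
INSERT INTO doctor_specialty VALUES (1, 1), (1, 2), (2, 3);
""")

# Dr. Lee now carries two specialties, which a single column could not hold.
lee_specialties = [row[0] for row in conn.execute("""
    SELECT s.name
    FROM doctor_specialty AS ds
    JOIN dim_doctor    AS d ON d.doctor_key    = ds.doctor_key
    JOIN dim_specialty AS s ON s.specialty_key = ds.specialty_key
    WHERE d.name = 'Dr. Lee'
    ORDER BY s.name
""")]
print(lee_specialties)
```

The extra joins in the query are the additional modeling cost showing up at read time.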

Slide 27: Data Warehouse: Cube The last common data warehouse model we will discuss is the cube. Cubes are valuable when ad hoc analytical querying is not feasible due to high data dimensionality. Cubes speed up data retrieval by pre-computing measures at dimensional intersections. Typical measures include counts, sums, averages, minimums, or maximums.



Slide 28: Data Warehouse: Cube (Continued) I’d like to illustrate the power of a cube by example. Look at the cell circled in red. This is the intersection of three dimensions: customer, in this case C2; product, in this case B; and time, in this case the month of August. The intersect value located at the cell circled in red could represent the number of times customer C2 bought product B in August. Or perhaps it could represent the maximum price paid by customer C2 for product B within the month of August.

Slide 29: Data Warehouse: Cube (Continued) While the visual representation is helpful in presenting the content, in practice, cubes are accessed via set notation. For example, to retrieve the contents of the cell circled in red on the previous slide, we would supply the three indices which uniquely locate that cell within the cube. In this case, customer C2, product B, and the month of August act as our indices for accessing the value stored in that location. In sum, cubes are optimized for intersect value retrieval speed and are valuable when ad hoc querying performance is slow.
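
The idea of pre-computing measures and then retrieving them by index can be sketched with ordinary dictionaries keyed by a (customer, product, month) tuple. The purchase records are invented, and real cube engines use their own storage and query languages.

```python
from collections import defaultdict

# Invented purchase records: (customer, product, month, price paid).
purchases = [
    ("C1", "A", "Jul", 10.0),
    ("C2", "B", "Aug", 12.5),
    ("C2", "B", "Aug", 15.0),
    ("C2", "A", "Aug", 9.0),
    ("C3", "B", "Sep", 11.0),
]

# Pre-compute two measures at every dimensional intersection.
count_cube = defaultdict(int)
max_price_cube = {}
for customer, product, month, price in purchases:
    cell = (customer, product, month)
    count_cube[cell] += 1
    max_price_cube[cell] = max(price, max_price_cube.get(cell, price))

# Retrieval is a direct lookup using the three indices.
cell = ("C2", "B", "Aug")
print(count_cube[cell], max_price_cube[cell])
```

All of the aggregation work is paid once, when the cube is built, so answering the question later costs only a lookup.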

Slide 30: Moving Data Today, we’ve discussed a variety of systems and storage paradigms, any of which can be used to store your data. You know that you need to choose the best technology for the job, but what happens when the best technology is different from where the data reside today? The answer is that you will need to move the data. Here are a few considerations to keep in mind. Is the data organization strategy different in the destination than in the source? Does the destination support the same data types as the source? Think about date formats, numeric precision, et cetera. Will the destination tolerate missing data from the source?

Slide 31: Moving Data Between Systems Data movement operations are typically referred to as ETL processes. The E stands for extract, where you pull your data out of the source system. The T stands for transform, which I will refer to here as translation. This is an opportunity to manipulate the data to overcome structural or data type mismatches between the source and destination systems. The L stands for load, where you load the data into the destination system.



Slide 32: Data Translation

Translation is typically the most complicated part of the ETL process. A typical step in the translation process involves mapping differences between data types. Let’s use dates as an example. Some systems store date and time as two different data types, and therefore as separate values. Others store them as one. Communicating in one direction requires the two be concatenated together while going in the other direction requires the date and the time be split apart. Another common scenario is handling missing data. Some systems handle missing data well. Others do not. Sticking with the date example, one may need to choose a representative date that flags the information as missing. In these cases, you would typically want to choose a date very far in the past or sufficiently far in the future that it is unlikely to be present in your natural data. However, keep in mind that this is dangerous. Be sure to pick a standard and stick with it.
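
The date example can be turned into a small extract, translate, and load sketch. Everything named here is an assumption for illustration: the source rows, the field names, and in particular the sentinel date of 1900-01-01, which stands for whatever standard your organization picks.

```python
from datetime import date, datetime, time

# Assumed sentinel for "date missing"; pick one standard and stick with it.
MISSING_DATE = date(1900, 1, 1)

# Hypothetical source system: date and time stored as two separate values.
source_rows = [
    {"id": 1, "visit_date": "2014-03-04", "visit_time": "09:30"},
    {"id": 2, "visit_date": None, "visit_time": None},
]

def extract():
    """E: pull the rows out of the source system."""
    return list(source_rows)

def translate(row):
    """T: combine the separate date and time and flag missing dates."""
    if row["visit_date"] is None:
        day = MISSING_DATE
    else:
        day = date.fromisoformat(row["visit_date"])
    clock = time.fromisoformat(row["visit_time"]) if row["visit_time"] else time(0, 0)
    return {"id": row["id"], "visit_at": datetime.combine(day, clock)}

# Hypothetical destination system: a single combined date-time value.
destination = []

def load(rows):
    """L: write the translated rows into the destination system."""
    destination.extend(rows)

load(translate(row) for row in extract())
print(destination)
```

Going in the other direction would mean splitting `visit_at` back into separate date and time values, and any downstream report would need to exclude the sentinel date.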

Slide 33: Summary We covered a lot of ground today. In sum, there are a variety of storage paradigms, many of which I hope you now have a general sense for. The single most important takeaway from this lesson is understanding the intended use of the data you wish to store. This will always guide you to the optimal storage solution.
