Challenges In Transforming Observational Data
For Analysis
Don Griffin
Health Informatics Technology Director
Computer Sciences Corporation
May 20, 2009
OR
How To Call Into Question Your Observational Data
Without Even Trying
Health Informatics May 20, 2009 2
Objectives
Lofty Objective:
Present a complete Health Informatics solution:
• that is flexible enough to accommodate all of the types of source data that end users will require (even if they do not know what those data will be), and
• that is rich enough in functionality to support all of the data transformations and manipulations that end users will require to convert those source data into research-oriented knowledge on which they may confidently rely.
More Practical Objective:
Leave those in the audience with an appreciation for the things that must be done ahead-of-time to make multifarious, disparate, observational source data sets useful for analysis.
Definitions
Observational Data
– “... the outcomes of acts of measurement using particular protocols within the context of any objective scientific measurement activity.”
– “… the basic or atomic notion of an observation represents:
• the outcome of some measurement taken of a defined attribute or characteristic of some ‘entity’ (e.g., an organism ‘in the field,’ a specimen, a sample, an experimental treatment, etc.),
• within some context (possibly given by other observations).”
– “Every observation entails the measurement of one or more properties of some real-world entity or phenomenon.”
Biodiversity Information Standards – TDWG
For Our Purposes:
– we are most interested in observational data on drug exposures and medical conditions (but other data may interest us, too), and
– chief sources will be Medical Claims and Electronic Health Records (EHRs).
Definitions
Data Transformation
– “... the operation of changing (as by rotation or mapping) one configuration or expression into another in accordance with a mathematical rule; especially: a change of variables or coordinates in which a function of new variables or coordinates is substituted for each original variable or coordinate…”
– “… an operation that converts (as by insertion, deletion, or permutation) one grammatical string (as a sentence) into another…”
Merriam-Webster’s Dictionary
– One of the three pillars of data governance (along with compliance and integration). “… transformation is a goal unto itself, as well as an enabler for the goals of compliance and integration.”
The Data Warehousing Institute
For Our Purposes:
– we are most interested in reformatting data into a Common Data Model that allows portability of analysis methods across disparate source data sets, and
– in standardizing data representations to make analysis results from disparate source data sets readily comparable.
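A minimal sketch of what "portability of analysis methods" can look like in practice. Everything here is hypothetical: the two source layouts, the `DrugExposure` record, and the code map are invented for illustration and are not taken from any real vocabulary or from the OMOP specification. The point is only that each source gets its own small adapter, while the analysis function is written once against the common format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DrugExposure:
    """One row in a simplified, hypothetical common format."""
    person_id: str
    drug_code: str    # standardized code (made-up values below)
    start_date: date


# Hypothetical source-to-standard code map; a real one would come from
# a controlled medical vocabulary.
STANDARD_DRUG_CODE = {"NDC-0001": "DRUG_A", "LOCAL-77": "DRUG_A", "NDC-0002": "DRUG_B"}


def from_claims(row: dict) -> DrugExposure:
    """Claims-style source: member id, NDC-like code, ISO fill date."""
    return DrugExposure(
        row["member"],
        STANDARD_DRUG_CODE[row["ndc"]],
        date.fromisoformat(row["fill_date"]),
    )


def from_ehr(row: dict) -> DrugExposure:
    """EHR-style source: patient id, local code, separate y/m/d fields."""
    return DrugExposure(
        row["patient"],
        STANDARD_DRUG_CODE[row["local_code"]],
        date(row["y"], row["m"], row["d"]),
    )


def persons_exposed(rows, drug_code: str) -> set:
    """An 'analysis method' written once against the common format."""
    return {r.person_id for r in rows if r.drug_code == drug_code}


claims = [from_claims({"member": "P1", "ndc": "NDC-0001", "fill_date": "2008-03-01"})]
ehr = [from_ehr({"patient": "P2", "local_code": "LOCAL-77", "y": 2008, "m": 4, "d": 15})]
exposed = persons_exposed(claims + ehr, "DRUG_A")  # both persons are found
```

Because both sources resolve to the same standardized code, results computed from either one are directly comparable.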
Transforming Observational Data
Again, for our purposes, the process is rather simple. However, to do it correctly presents some challenges.
The IT View of the End User’s Goal
Skillful use of Common Data Model content to communicate “complex ideas… with clarity, precision, and efficiency” (and, ideally, unimpeachability)
– Show the data
– Induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, or something else
– Avoid distorting what the data have to say
– Present many numbers in a small space
– Make large data sets coherent
– Encourage the eye to compare different pieces of data
– Reveal the data at several levels of detail, from a broad overview to the fine structure
– Serve a reasonably clear purpose: description, exploration, tabulation, or decoration
– Be closely integrated with the statistical and verbal descriptions of a data set
Edward Tufte, The Visual Display of Quantitative Information
The IT View of IT’s Goals
Provide services necessary to populate the Common Data Model
– Data Architecture
– Data Collection
– Data Extraction, Transformation, and Loading (ETL)
– Data Management
Help (or do not hinder) end users in pursuit of their own goals
– Preserve the data (i.e., their native values, formats, etc.)
– Avoid distorting the data
– Maintain data detail
Foster the widespread understanding of the data
– What the data are and are not
– What the data can and cannot do
IT Issues/Challenges
[Figure: Source to Target (CDM). Labels: Technical, Philosophical. Challenge areas: Data Collection, Data Management, ETL Design, Data Architecture, Data Understanding.]
IT Issues/Challenges
Data Collection
– Batch vs. Stream
– Reception and Profiling
– Verification to Specification
– Culling and Cleansing
– Staging
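One way to picture "Verification to Specification" followed by culling and staging is sketched below. The field names and rules in `SPEC` are hypothetical, chosen only to illustrate the idea; a real specification would be agreed with the data supplier. Rows that conform are staged, and rows that do not are set aside together with the names of the failing fields.

```python
import csv
import io
from datetime import datetime


def is_date(value: str) -> bool:
    """True when the value is a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Hypothetical specification: field name -> rule that returns True on conformance.
SPEC = {
    "person_key": lambda v: v != "",
    "drug_code": lambda v: v != "",
    "fill_date": is_date,
    "days_supply": lambda v: v.isdecimal() and int(v) > 0,
}


def verify(stream):
    """Split delimited rows into (staged, rejected).

    Each rejected entry is (row, list of failing field names), so the
    reason for culling a row is kept for later review.
    """
    staged, rejected = [], []
    for row in csv.DictReader(stream):
        failures = [
            name for name, rule in SPEC.items()
            if not rule((row.get(name) or "").strip())
        ]
        if failures:
            rejected.append((row, failures))
        else:
            staged.append(row)
    return staged, rejected


sample = io.StringIO(
    "person_key,drug_code,fill_date,days_supply\n"
    "P1,D100,2008-01-15,30\n"
    "P2,,2008-02-30,30\n"      # missing drug code, impossible date
)
staged, rejected = verify(sample)
```

Keeping the rejected rows and their reasons, rather than silently dropping them, supports the later goal of understanding what the data are and are not.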
IT Issues/Challenges
Data Management
– Inventory and Tracking
– Privacy, Security, and Compliance
– Master/Reference Data Management
– Logging and Auditing
Privacy
Protected Health Information (PHI)
– Any information (not just textual data) in the medical record or designated data set that can be used to identify an individual, and
– That was created, used, or disclosed in the course of providing a health care service (e.g., diagnosis, treatment, etc.)
HIPAA regulations allow researchers to access and use PHI when necessary to conduct research. However, HIPAA only affects research that uses, creates, or discloses PHI that will be entered into the medical record or that will be used for the provision of health care services (e.g., treatment).
– Research studies involving review of existing medical records for research information, such as retrospective chart review, are subject to HIPAA regulations.
– Research studies that enter new PHI into the medical record (e.g., because the research includes rendering a health care service, such as diagnosing a health condition or prescribing a new drug or device for treating a health condition) are also subject to HIPAA regulations.
– If in doubt, stay away from the 18 “identifiers.”
Privacy
18 Identifiers
1. Names;
2. All geographical subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geocodes, except for the initial three digits of a zip code, if according to the current publicly available data from the Bureau of the Census: (1) The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and (2) The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.
3. All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;
4. Phone numbers;
5. Fax numbers;
6. Electronic mail addresses;
7. Social Security numbers;
Privacy
18 Identifiers
8. Medical record numbers;
9. Health plan beneficiary numbers;
10. Account numbers;
11. Certificate/license numbers;
12. Vehicle identifiers and serial numbers, including license plate numbers;
13. Device identifiers and serial numbers;
14. Web Universal Resource Locators (URLs);
15. Internet Protocol (IP) address numbers;
16. Biometric identifiers, including finger and voice prints;
17. Full face photographic images and any comparable images; and
18. Any other unique identifying number, characteristic, or code (note this does not mean the unique code assigned by the investigator to code the data)
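Identifiers 2 and 3 above describe generalization rules rather than simple removal, so a short sketch may help. It is illustrative only and is not a compliance tool: the `ZIP3_POPULATION` figures are invented (real values come from Census Bureau data), and the function names are hypothetical. The sketch also does not handle the part of identifier 3 that requires suppressing a year when the year itself reveals an age over 89.

```python
from datetime import date

# Hypothetical populations for three-digit ZIP areas.
ZIP3_POPULATION = {"191": 1_500_000, "036": 12_000}


def generalize_zip(zip_code: str) -> str:
    """Keep the first three digits only when the ZIP3 area has more than
    20,000 people; otherwise (or when the area is unknown) return 000."""
    zip3 = zip_code.strip()[:3]
    return zip3 if ZIP3_POPULATION.get(zip3, 0) > 20_000 else "000"


def generalize_age(age: int) -> str:
    """Aggregate all ages over 89 into a single '90+' category."""
    return "90+" if age > 89 else str(age)


def generalize_date(d: date) -> int:
    """Drop every element of a date except the year."""
    return d.year
```

For example, `generalize_zip("19104")` keeps `191` under the made-up populations above, while `generalize_zip("03601")` becomes `000` because its area falls at or below the 20,000 threshold.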
Privacy
De-identification is a possible solution. However, additional standards and criteria apply.
– Any code used to replace the identifiers in datasets cannot be derived from any information related to the individual, and neither the master codes nor the method used to derive the codes may be disclosed. For example, a subject's initials cannot be used to code his data because the initials are derived from his name.
– The researcher must not have actual knowledge that the subject could be re-identified from the remaining identifiers in the PHI used in the research study. That is, the information would still be considered identifiable if there were a way to identify the individual even though all of the 18 identifiers were removed.
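The first requirement above rules out codes computed from the person's own data (initials, hashed names, and so on). One possible approach, sketched here under that reading, is to draw each replacement code at random and keep the crosswalk as a separately secured master list. The function name `assign_study_codes` and the 16-character code length are assumptions made for illustration.

```python
import secrets


def assign_study_codes(person_keys):
    """Map each source key to a random code that is not derived from
    any information about the person.

    The returned crosswalk plays the role of the master code list, so it
    must be stored apart from the research data and not disclosed.
    """
    crosswalk = {}
    used = set()
    for key in person_keys:
        if key in crosswalk:
            continue                      # same person keeps the same code
        code = secrets.token_hex(8)       # 16 hex characters of randomness
        while code in used:               # extremely unlikely, but keep codes unique
            code = secrets.token_hex(8)
        used.add(code)
        crosswalk[key] = code
    return crosswalk


crosswalk = assign_study_codes(["MRN-001", "MRN-002", "MRN-001"])
```

Random assignment addresses only the first bullet; whether the remaining data could still re-identify someone is a separate question that code like this does not answer.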
Privacy
The following is NOT considered PHI, and therefore is not subject to HIPAA regulations.
– Health information absent the 18 identifiers.
– Data that would ordinarily be considered PHI, but which are not associated with or derived from a healthcare service event (treatment, payment, operations, medical records), not entered into the medical record, and not disclosed to the subject. Research health information that is kept only in the researcher’s records is not subject to HIPAA, but is regulated by other human subjects protection regulations.
Examples of research health information not subject to HIPAA include such studies as the use of aggregate data, diagnostic tests that do not go into the medical record because they are part of a basic research study and the results will not be disclosed to the subject, and testing done without the PHI identifiers.
– Some genetic basic research can fall into this category such as the search for potential genetic markers, promoter control elements, and other exploratory genetic research.
– In contrast, genetic testing for a known disease that is considered to be part of diagnosis, treatment and health care would be considered to use PHI and therefore subject to HIPAA regulations.
University of California, Berkeley, Committee for Protection of Human Subjects
IT Issues/Challenges
Data Extraction
– Form (e.g., ASCII vs. EBCDIC)
– Format (e.g., delimited, fixed-length, ragged right, etc.)
Data Transformation
– Reformatting (usually from flat to relational)
– Probabilistic Matching
– Augmentation (excluding Standardization)
– Master <fill in the blank> Indexing
– Standardization
Data Loading
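The extraction concerns listed above (form and format) can be sketched together: the same fixed-length record arrives either as ASCII or as EBCDIC bytes and has to be decoded before fields can be sliced out. The layout below is hypothetical, and `cp037` is used as one common EBCDIC code page; the actual code page depends on the source system. Packed-decimal or binary fields, which often appear in mainframe extracts, cannot be decoded as text and are outside this sketch.

```python
# Hypothetical fixed-length layout: (field name, start offset, length).
LAYOUT = [("person_key", 0, 6), ("drug_code", 6, 8), ("fill_date", 14, 8)]
RECORD_LENGTH = 22


def parse_fixed(data: bytes, encoding: str):
    """Decode a buffer of fixed-length records and slice out each field.

    Assumes a single-byte encoding, so byte offsets and character
    offsets are the same.
    """
    if len(data) % RECORD_LENGTH != 0:
        raise ValueError("buffer is not a whole number of records")
    records = []
    for offset in range(0, len(data), RECORD_LENGTH):
        text = data[offset:offset + RECORD_LENGTH].decode(encoding)
        records.append(
            {name: text[start:start + length].strip() for name, start, length in LAYOUT}
        )
    return records


line = "P00001" + "D100    " + "20080115"   # 6 + 8 + 8 = 22 characters
ascii_bytes = line.encode("ascii")
ebcdic_bytes = line.encode("cp037")         # different bytes, same content

from_ascii = parse_fixed(ascii_bytes, "ascii")
from_ebcdic = parse_fixed(ebcdic_bytes, "cp037")
```

Both calls yield the same record, which is the property the rest of the ETL process relies on: once extraction is done, downstream steps should not need to know which form the source used.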
Augmentation
[Figure: two person timelines. Top: exposures to Drug A (A1 to A4) and Drug B (B1, B2) are combined into Drug Era 1, Drug Era 2, and Drug Era 3 using a persistence window. Bottom: occurrences of Condition A (A1 to A4) and Condition B (B1, B2) are combined into Condition Era 1, Condition Era 2, and Condition Era 3 using a persistence window.]
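The slide shows era construction only as a diagram, so the rule below is an assumption about how a persistence window might be applied: consecutive exposures for one person and one drug are merged into the same era when the next exposure starts no later than the previous end date plus the window. The function name `build_eras` and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def build_eras(exposures, persistence_days):
    """Merge (start, end) exposure intervals for one person and one drug.

    Returns a list of (era_start, era_end, exposure_count). Two exposures
    share an era when the gap between them is within the persistence window.
    """
    window = timedelta(days=persistence_days)
    eras = []
    for start, end in sorted(exposures):
        if eras and start <= eras[-1][1] + window:
            era_start, era_end, count = eras[-1]
            eras[-1] = (era_start, max(era_end, end), count + 1)
        else:
            eras.append((start, end, 1))
    return eras


drug_a = [
    (date(2008, 1, 1), date(2008, 1, 30)),
    (date(2008, 2, 10), date(2008, 3, 10)),   # 11-day gap: within the window
    (date(2008, 6, 1), date(2008, 6, 30)),    # 83-day gap: starts a new era
]
eras = build_eras(drug_a, persistence_days=30)
# eras == [(date(2008, 1, 1), date(2008, 3, 10), 2),
#          (date(2008, 6, 1), date(2008, 6, 30), 1)]
```

The same function could be applied to condition occurrences, with a window chosen per condition type. The choice of window changes how many eras result, which is one reason augmentation decisions need to be documented for end users.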
IT Issues/Challenges
Data Architecture
– Common Data Model Design Paradigms
– “All models are wrong, but some are useful” George Box, Statistician
– Flexibility vs. Intuitiveness “Compromise”
OMOP Common Data Model (logical)
CDM Domain
PERSON
– Key: PERSON_ID
– Columns: SOURCE_PERSON_KEY, YEAR_OF_BIRTH, GENDER_CONCEPT_CODE (FK), RACE_CONCEPT_CODE (FK), LOCATION_CONCEPT_CODE (FK)
DRUG_EXPOSURE
– Key: DRUG_EXPOSURE_ID
– Columns: PERSON_ID (FK), DRUG_EXPOSURE_START_DATE, DRUG_EXPOSURE_END_DATE, DRUG_CONCEPT_CODE (FK), DRUG_EXPOSURE_TYPE (FK), SOURCE_DRUG_CODE, STOP_REASON, REFILLS, DRUG_QUANTITY, DAYS_SUPPLY
CONDITION_ERA
– Key: CONDITION_ERA_ID
– Columns: PERSON_ID (FK), CONDITION_CONCEPT_CODE (FK), CONDITION_START_DATE, CONDITION_END_DATE, CONDITION_OCCUR_TYPE (FK), CONDITION_OCCURRENCE_COUNT, CONFIDENCE
CONCEPT_PARENT_CHILD
– Key: CONCEPT_PARENT_CHILD_ID
– Columns: PARENT_CONCEPT_CODE (FK), CHILD_CONCEPT_CODE (FK)
CONCEPT
– Key: CONCEPT_CODE
– Columns: CONCEPT_NAME, CONCEPT_DESCRIPTION
CONCEPT_PROPERTY
– Key: CONCEPT_PROPERTY_ID
– Columns: CONCEPT_CODE (FK), PROPERTY_ID (FK), CONCEPT_PROPERTY_VALUE
CONCEPT_PROPERTY_QUALIFIER
– Key: CONCEPT_PROPERTY_QUALIFIER_ID
– Columns: CONCEPT_PROPERTY_ID (FK), QUALIFIER_ID (FK)
PROPERTY
– Key: PROPERTY_ID
– Columns: PROPERTY_NAME, PROPERTY_DESCRIPTION
QUALIFIER
– Key: QUALIFIER_ID
– Columns: QUALIFIER_NAME, QUALIFIER_DESCRIPTION
CONCEPT_ASSOCIATION_OR_ROLE
– Key: CONCEPT_ASSOCIATION_OR_ROLE_ID
– Columns: SUBJECT_CONCEPT_CODE (FK), ASSOCIATION_OR_ROLE_ID (FK), PREDICATE_CONCEPT_CODE (FK)
ASSOCIATION_OR_ROLE
– Key: ASSOCIATION_OR_ROLE_ID
– Columns: ASSOCIATION_OR_ROLE_NAME, ASSOCIATION_OR_ROLE_DESCRIPTION
OBSERVATION_PERIOD
– Key: OBSERVATION_PERIOD_ID
– Columns: PERSON_ID (FK), OBSERVATION_START_DATE, OBSERVATION_END_DATE, PERSON_STATUS_CONCEPT_CODE (FK), RX_DATA_AVAILABILITY
VISIT_OCCURRENCE
– Key: VISIT_OCCURRENCE_ID
– Columns: PERSON_ID (FK), VISIT_CONCEPT_CODE (FK), VISIT_START_DATE, VISIT_END_DATE, SOURCE_VISIT_CODE
PROCEDURE_OCCURRENCE
– Key: PROCEDURE_OCCURRENCE_ID
– Columns: PROCEDURE_CONCEPT_CODE (FK), PERSON_ID (FK), PROCEDURE_DATE, SOURCE_PROCEDURE_CODE, PROC_OCCUR_TYPE (FK)
DRUG_EXPOSURE_REF
– Key: DRUG_EXPOSURE_TYPE
– Columns: DRUG_EXPOSURE_TYPE_DESC, PERSISTENCE_WINDOW
CONDITION_OCCURRENCE_REF
– Key: CONDITION_OCCUR_TYPE
– Columns: CONDITION_OCCUR_TYPE_DESC, PERSISTENCE_WINDOW
DRUG_ERA
– Key: DRUG_ERA_ID
– Columns: PERSON_ID (FK), DRUG_ERA_START_DATE, DRUG_ERA_END_DATE, DRUG_EXPOSURE_TYPE (FK), DRUG_CONCEPT_CODE (FK), DRUG_EXPOSURE_COUNT
CONDITION_OCCURRENCE
– Key: CONDITION_OCCURRENCE_ID
– Columns: PERSON_ID (FK), CONDITION_CONCEPT_CODE (FK), CONDITION_OCCUR_TYPE (FK), SOURCE_CONDITION_CODE, CONDITION_START_DATE, CONDITION_END_DATE, STOP_REASON, DX_QUALIFIER
OBSERVATION
– Key: OBSERVATION_OCCURRENCE_ID
– Columns: PERSON_ID (FK), OBSERVATION_CONCEPT_CODE (FK), OBSERVATION_TYPE (FK), SOURCE_OBSERVATION_CODE, OBS_VALUE_AS_NUMBER, OBS_VALUE_AS_STRING, OBS_VALUE_AS_CONCEPT_CODE (FK), OBS_UNITS_CONCEPT_CODE (FK), OBSERVATION_DATE, OBS_RANGE_LOW, OBS_RANGE_HIGH
OBSERVATION_TYPE_REF
– Key: OBSERVATION_TYPE
– Columns: OBSERVATION_TYPE_DESC
PROC_OCCURRENCE_REF
– Key: PROC_OCCUR_TYPE
– Columns: PROC_OCCUR_TYPE_DESC, PERSISTENCE_WINDOW
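To make the logical model above more concrete, the sketch below creates three of its tables in an in-memory SQLite database and runs one query across them. Only a subset of columns is included, and the column types, concept codes, and sample rows are assumptions made for illustration; the slide gives names and keys but no physical design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CONCEPT (
    CONCEPT_CODE        TEXT PRIMARY KEY,
    CONCEPT_NAME        TEXT,
    CONCEPT_DESCRIPTION TEXT
);
CREATE TABLE PERSON (
    PERSON_ID           INTEGER PRIMARY KEY,
    SOURCE_PERSON_KEY   TEXT,
    YEAR_OF_BIRTH       INTEGER,
    GENDER_CONCEPT_CODE TEXT REFERENCES CONCEPT (CONCEPT_CODE)
);
CREATE TABLE DRUG_EXPOSURE (
    DRUG_EXPOSURE_ID         INTEGER PRIMARY KEY,
    PERSON_ID                INTEGER REFERENCES PERSON (PERSON_ID),
    DRUG_EXPOSURE_START_DATE TEXT,
    DRUG_EXPOSURE_END_DATE   TEXT,
    DRUG_CONCEPT_CODE        TEXT REFERENCES CONCEPT (CONCEPT_CODE),
    SOURCE_DRUG_CODE         TEXT,
    DAYS_SUPPLY              INTEGER
);
""")

# Made-up concept codes and source values.
conn.executemany(
    "INSERT INTO CONCEPT VALUES (?, ?, ?)",
    [("C-F", "Female", None), ("C-DRUG-A", "Drug A", None)],
)
conn.execute("INSERT INTO PERSON VALUES (1, 'SRC-001', 1950, 'C-F')")
conn.execute(
    "INSERT INTO DRUG_EXPOSURE VALUES "
    "(1, 1, '2008-01-01', '2008-01-30', 'C-DRUG-A', 'NDC-0001', 30)"
)

# The source code is preserved next to the standardized concept code,
# so an analyst can query by concept and still trace back to the source.
rows = conn.execute("""
    SELECT p.PERSON_ID, c.CONCEPT_NAME, d.SOURCE_DRUG_CODE, d.DRUG_EXPOSURE_START_DATE
    FROM DRUG_EXPOSURE d
    JOIN PERSON p  ON p.PERSON_ID = d.PERSON_ID
    JOIN CONCEPT c ON c.CONCEPT_CODE = d.DRUG_CONCEPT_CODE
""").fetchall()
```

Keeping `SOURCE_DRUG_CODE` alongside `DRUG_CONCEPT_CODE` reflects the earlier goal of preserving native values while still standardizing for analysis.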
Solution Framework
[Figure: layered solution framework]
CORE BUSINESS INTELLIGENCE SERVICES
– Queries
– Reports/Dashboards
– OLAP, ROLAP, MOLAP, HOLAP
– Process Models
– Statistical Analysis and Validation
– Business Rules/Predictive Models
– Optimization
FOUNDATIONAL DATA SERVICES
– Data Architecture: Database Management System; Data Models; Metadata
– Data Collection: Reception and Profiling; Verification to Specification; Culling and Cleansing; Staging for Integration
– Data Integration: Probabilistic Matching; Augmentation; Master Person Indexing; Controlled Medical Vocabularies
– Data Management: Inventory and Tracking; Privacy, Security, and Compliance; Master/Reference Data Maintenance; Logging and Auditing
SUPPORTING SERVICES
– Business Integration Services
– Presentation and Portal Services
– Systems Management Services
Solution Context
[Figure: the layered solution framework in its wider context]
LIFE SCIENCES SOLUTIONS
– Scientific Applications
– Protocol Feasibility
– Study Recruitment
– Health Outcomes & Economics
– Drug Safety Monitoring
– Exploratory Data Analysis
– Study Management
– Site Management
– Drug Safety Management
– Clinical Data Management
– Executive Dashboards
– Operational Reporting
– Licensing Intelligence
– Closed Loop Marketing
– Market Intelligence
– Marketing
OVERALL SOLUTION STEWARDSHIP
– Strategy
– Process
– Intelligence
– Governance
CORE BUSINESS INTELLIGENCE SERVICES
– Queries
– Reports/Dashboards
– OLAP, ROLAP, MOLAP, HOLAP
– Process Models
– Statistical Analysis and Validation
– Business Rules/Predictive Models
– Optimization
FOUNDATIONAL DATA SERVICES
– Data Architecture: Database Management System; Data Models; Metadata
– Data Collection: Reception and Profiling; Verification to Specification; Culling and Cleansing; Staging for Integration
– Data Integration: Probabilistic Matching; Augmentation; Master Person Indexing; Controlled Medical Vocabularies
– Data Management: Inventory and Tracking; Privacy, Security, and Compliance; Master/Reference Data Maintenance; Logging and Auditing
SUPPORTING SERVICES
– Business Integration Services
– Presentation and Portal Services
– Systems Management Services
Thank You
Don Griffin ([email protected])
Health Informatics Technology Director
Computer Sciences Corporation
May 20, 2009