De-identification: A Critical Success Factor in Clinical and Population Research

Steven Merahn MD
Dee Lang, RHIT
Prepared for 2007 APIII, Pittsburgh, PA, September 10, 2007



Major gaps exist today between patient care, clinical research and evidence-based medicine.

Sharing Data is the Key

“Amassing large quantities of anonymized clinical and non-clinical information from medical records and reports and analyzing that data for patterns and other observations (is the best way to) support continuous quality improvement, shape best practices and inform clinical and population-based decision making”

A Rapid Learning Health System, Health Affairs 26(2), January 2007

Processing Predicated on Protecting Patient Privacy

Clinical records can be an important source of information…most of the information in these records is in the form of free text and extracting useful information from them requires automatic processing (e.g., index, semantically interpret, and search). A prerequisite to the distribution of clinical records outside of hospitals, be it for Natural Language Processing (NLP) or medical research, is de-identification.

J Am Med Inform Assoc. 2007;14:550-563. DOI 10.1197/jamia.M2444.

Problems to Solve

• Sources of data
• Protecting patient privacy
• Creating and maintaining a corpus of HIPAA-compliant and searchable data
• Building collaborations; creating networks of institutions sharing data
• Emerging patient “data rights” issues

Sources of Data

• EMR/CIS systems
  • Large amounts of free text; not all data is parsed or field-limited
• Transcribed Records and Reports
  • Even in systems without CIS, most transcriptions are delivered as electronic files
    • Pathology Reports (cf caTIES)
    • Surgical Notes
    • Radiology Reports
    • Discharge Summaries
• No need to wait for an EMR to create an RLHS (Rapid Learning Health System)

Protecting Patient Privacy

• De-identification is a well-defined, but limited, step in a broader research workflow or protocol
• The defined nature of the step includes managing individually identifiable information in records and reports
• Such schemes include redaction, elimination, categorical replacement (e.g., place, age range), replacement with proxies (Dr X), and offsets (day 1)
• A process which must be constantly “tuned” in response to dynamic input variables and patterns of documentation
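The replacement schemes listed above can be illustrated with a minimal Python sketch. It is not the presenters' method. All function and class names are hypothetical, the ten-year age band and the "Dr X1" proxy format are assumptions, and the single ID pattern stands in for the much larger set of identifiers a real system must handle.

```python
import re
from datetime import date

# Assumed pattern for one kind of identifier (a US SSN-like number).
ID_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_ids(text: str) -> str:
    """Redaction: replace matching identifiers with a fixed marker."""
    return ID_PATTERN.sub("[REDACTED]", text)

def age_to_range(age: int, width: int = 10) -> str:
    """Categorical replacement: report an age band instead of an exact age."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def date_to_offset(event: date, anchor: date) -> str:
    """Offset: express a date relative to an anchor date, which counts as day 1."""
    return f"day {(event - anchor).days + 1}"

class ProxyNamer:
    """Proxy replacement: the same name always maps to the same placeholder."""

    def __init__(self) -> None:
        self._proxies: dict[str, str] = {}

    def proxy(self, name: str) -> str:
        if name not in self._proxies:
            # Hypothetical placeholder format: Dr X1, Dr X2, ...
            self._proxies[name] = f"Dr X{len(self._proxies) + 1}"
        return self._proxies[name]
```

Keeping the proxy mapping consistent within a record preserves the ability to tell that two mentions refer to the same clinician, which plain redaction would lose.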

[Figure: workflow diagram. Labels: CIS; Transcribed Reports; De-identification Methodology; QA; De-identified Data; De-identified Database; QA; Query Interface; FIREWALL; Trusted Proxy; RE-ID Method; Admin; NLP; Other processes]

Considerations

When choosing a de-identification methodology, four things need consideration:

• What is the reliability and validity of the methodology?
• Can the method maintain its specificity and sensitivity in local use?
• What are the limitations of the methodology?
• Can files be re-identified?
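One way to make the re-identification question concrete is a lookup table held only by a trusted party. The sketch below is an assumption for illustration, not a description of the presenters' RE-ID method. The `TrustedProxy` class, its method names, and the `authorized` flag are all hypothetical, and a real deployment would need proper access control and audit logging.

```python
import secrets

class TrustedProxy:
    """Holds the only link between random study IDs and real patient IDs."""

    def __init__(self) -> None:
        self._patient_to_study: dict[str, str] = {}
        self._study_to_patient: dict[str, str] = {}

    def study_id_for(self, patient_id: str) -> str:
        """Return a stable random study ID; it carries no patient information."""
        if patient_id not in self._patient_to_study:
            study_id = secrets.token_hex(8)
            self._patient_to_study[patient_id] = study_id
            self._study_to_patient[study_id] = patient_id
        return self._patient_to_study[patient_id]

    def re_identify(self, study_id: str, authorized: bool) -> str:
        """Look up the patient ID; refuse unless the request is authorized."""
        if not authorized:
            raise PermissionError("re-identification requires authorization")
        return self._study_to_patient[study_id]
```

Because the study ID is random rather than derived from the patient ID, a file can only be re-identified by someone with access to the proxy's table.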

Consistency, Reliability and Validity

• Fundamental problems are inter-record reliability, manpower resources and time constraints
• The issue then becomes the quality of the quality: over-marking (specificity) and under-marking (sensitivity)
• What are acceptable levels of sensitivity and specificity?
  • 100% sensitivity for names
• What is the benchmark?
• What is the value of consistency?
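As a hedged illustration of how under-marking and over-marking translate into numbers, the sketch below computes token-level sensitivity and specificity against a manually reviewed reference. The function name and the boolean-per-token representation are assumptions made for this example.

```python
def sensitivity_specificity(gold: list[bool], marked: list[bool]) -> tuple[float, float]:
    """Compare a method's token marks against a manual reference.

    gold[i] is True when token i really is identifying information;
    marked[i] is True when the method marked token i for removal.
    """
    if len(gold) != len(marked):
        raise ValueError("gold and marked must have the same length")
    pairs = list(zip(gold, marked))
    tp = sum(1 for g, m in pairs if g and m)          # identifiers caught
    fn = sum(1 for g, m in pairs if g and not m)      # under-marking
    fp = sum(1 for g, m in pairs if not g and m)      # over-marking
    tn = sum(1 for g, m in pairs if not g and not m)  # clinical text left alone
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Six tokens: one identifier missed, one ordinary token over-marked.
gold = [True, True, False, False, False, True]
marked = [True, False, False, True, False, True]
sens, spec = sensitivity_specificity(gold, marked)
```

Low sensitivity is a privacy risk, while low specificity removes clinical content that researchers need, so both figures matter when setting a benchmark.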

Automated Methodologies: As Good As?/Better?

• Classification of tokens
• Sequence tracking problem (using Hidden Markov Models or Conditional Random Fields)
• Rule-based system utilizing global features (sentence position), local features (lexical cues, special characters, and format patterns), and syntactic features
• Hybrid systems of rules, pattern matching algorithms, heuristics and dictionaries
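The hybrid approach in the last bullet can be sketched as a small token classifier that combines format patterns, a lexical cue, and a dictionary. This is a toy example under stated assumptions, not an HMM or CRF and not any published system. The patterns, the title list, the name dictionary, and the label names are all hypothetical.

```python
import re

# Format patterns (local features); both patterns are assumed examples.
DATE_PATTERN = re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")
PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")

# Lexical cues: a title before a capitalized token suggests a name.
TITLE_CUES = {"dr", "dr.", "mr", "mr.", "mrs", "mrs.", "ms", "ms."}

# Dictionary lookup; a local site would extend this with its own vocabulary.
NAME_DICTIONARY = {"jones"}

def classify_tokens(tokens: list[str]) -> list[str]:
    """Label each token as DATE, PHONE, NAME, or O (not identifying)."""
    labels = []
    for i, token in enumerate(tokens):
        previous = tokens[i - 1].lower() if i > 0 else ""
        if DATE_PATTERN.match(token):
            labels.append("DATE")
        elif PHONE_PATTERN.match(token):
            labels.append("PHONE")
        elif previous in TITLE_CUES and token[:1].isupper():
            labels.append("NAME")
        elif token.lower() in NAME_DICTIONARY:
            labels.append("NAME")
        else:
            labels.append("O")
    return labels

tokens = "Seen by Dr. Smith on 09/10/2007 call 412-555-0100".split()
labels = classify_tokens(tokens)
```

Here "Smith" is caught by the title cue even though it is absent from the dictionary, which shows why hybrid systems layer several kinds of evidence.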

Local Use

• Can your methodology be customized to meet local needs?
• While some methods may have good ‘numbers’, will they hold up in local use?
• Every community has its own acronyms, place names and other local vocabulary
• What is the protocol to manage local quality?
  • Regular checks against manual review
  • Formal evaluation research

“Data Rights” Issues

• Legal models exist
• Make “de-identified” data sharing part of informed consent
• Offer different tiers of consent
  • Publicly-funded research
  • Academic research
  • Commercial research
• Make the general public aware of the level of existing data sharing
  • Claims data already widely shared and sold

Building Collaboration

[Figure: diagram. Labels: De-identified Database; Query Interface; QA; FIREWALL]

Call to Action: Pathology Informatics Community

• caBIG and caTIES are models for cross-institutional data sharing
• Major institutions are establishing data repositories of pathology reports
• Help facilitate data aggregation among other departments
  • Radiology (Radiology Reports)
  • Surgery (Surgical Notes)
  • Medicine (Discharge Summaries)
• Establish cross-department “Rapid Learning” teams