
Synthetic Data Panel · 5/5/2019

Albert Lai – Chief Research Information Officer – Washington University in St. Louis

David Dorr – Chief Research Information Officer – Oregon Health and Science

Jeremy Harper – Chief Research Information Officer – Regenstrief Institute


What is synthetic data?

• Information that is artificially manufactured. It may or may not be based on real data.

• The dream – accurately representing healthcare datasets via statistical generation while preserving multivariate relationships.

• Being able to allow broader access to data


Example – Image

Left: original image. Right: image generated via a synthetic algorithm.

Synthetic data has been used in image science to develop realistic but unique images


Business Functions

• Data Accessibility

• Machine Learning

• Agile Development – Not waiting for data

• Research – Understanding Properties

• Financial Services – Fraud Protection/Detection

• Healthcare – Sensitive Data Protection


Synthetic Control Arm

• Generating a control arm through existing data resources representing normal patient statistics

• Increases speed of study

• Replaces placebos

• Potentially more fair to subjects


Security

• No real PII in dataset

– Possibility for privacy leakage


Accuracy

• Preserving Multivariate Relationships is HARD

– Evaluation methods abound: violin plots of variable distributions, multivariate correlations, and Kruskal-Wallis non-parametric distribution comparisons are examples of validation tests for the original vs. the generated dataset.
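As a rough illustration of two of the validation tests named on this slide, the sketch below computes a tie-corrected Kruskal-Wallis H statistic for one variable and compares a pairwise correlation between an original and a generated dataset. It uses only the Python standard library. The "age" and "sbp" columns and all numbers are invented stand-ins, and the p-value uses the usual chi-square approximation for two groups; it is not the procedure any panelist or vendor necessarily used.

```python
import math
import random
from itertools import chain


def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """Tie-corrected Kruskal-Wallis H statistic for two or more samples."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0


def p_value_two_groups(h):
    """Chi-square (1 degree of freedom) tail probability; two groups only."""
    return math.erfc(math.sqrt(max(h, 0.0) / 2))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy stand-ins for one variable pair (hypothetical "age" and "sbp" columns).
rng = random.Random(42)
real_age = [rng.gauss(60, 12) for _ in range(300)]
real_sbp = [100 + 0.5 * a + rng.gauss(0, 10) for a in real_age]
synth_age = [rng.gauss(60, 12) for _ in range(300)]
synth_sbp = [100 + 0.5 * a + rng.gauss(0, 10) for a in synth_age]

h = kruskal_wallis_h(real_age, synth_age)
print("age: H =", round(h, 3), "p =", round(p_value_two_groups(h), 3))
print("corr(age, sbp): real =", round(pearson(real_age, real_sbp), 3),
      "synthetic =", round(pearson(synth_age, synth_sbp), 3))
```

A small H (large p) only says the test did not detect a difference in that one variable's distribution; the correlation comparison is what speaks to multivariate relationships, and it would need to be repeated across many variable pairs.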


RI Setup

• Purchased Jan 2019

• Implementation Start Feb 2019

• Pilot Go Live May 5th 2019

• 4 Pilot Studies

• Enterprise Rollout Pending


The Methodology

• Raab GM, Nowok B, Dibben C. Guidelines for producing useful synthetic data. arXiv preprint. 2017.

• Snoke J, Raab GM, Nowok B, Dibben C, Slavkovic A. General and specific utility measures for synthetic data. Journal of the Royal Statistical Society: Series A. 2018;181:663-88.

• Hayes J. LOGAN: evaluating privacy leakage of generative models using generative adversarial networks. arXiv preprint arXiv:1705.07663. 2017.


Institute for Informatics | Washington University School of Medicine

Getting Ready for an Enterprise Deployment

Albert M. Lai, PhD, FAMIA

Chief Research Information Officer, Washington University School of Medicine

Deputy Director, Institute for Informatics

Associate Professor, Department of Medicine

Associate Professor, Department of Computer Science and Engineering


Current State

• Research Data Core (WUSTL clinical data repository) is deployed on premises

• Working with a vendor solution (MDClone) for synthetic data

• MDClone deployed in MS Azure using Cloudera Hadoop VMs

• Have a single environment


Evaluation Strategy

• Use case #1 – Pediatric Trauma, Dr. Jose Pineda

• Use case #2 – Sepsis Prediction, Dr. Andrew Michelson & Sean Yu, MS (PhD Candidate)

• Use case #3 – STI Infection Rates, Dr. Randi Foraker


MDClone Sepsis Risk Prediction Study

• Aim 1: Develop a machine learning approach to predict sepsis 6 hours ahead of clinical onset

• Aim 2: Use a novel platform (MDClone) to create synthetic data for sepsis prediction, thus accelerating research in this area

• Aim 3: Deploy the machine learning approach on both real and synthetic patient data and compare the results

Data acquisition

Data preprocessing

Cohort identification

Feature engineering

Develop prediction models

Assess models on patient data

Assess models on synthetic data
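The comparison in Aim 3 can be sketched in a few lines: train one model on real rows and another on synthetic rows, then score both on the same held-out real rows. Everything in this sketch is an assumption made for illustration: the two toy features, the invented distributions, the per-class Gaussian stand-in generator (which is not MDClone's method), and the hand-rolled k-nearest-neighbour classifier (KNN is one of the model families the study lists).

```python
import random
import statistics


def make_real(n, rng):
    """Toy 'patient' rows (heart_rate, temperature, label); all values invented."""
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < 0.3 else 0
        rows.append((rng.gauss(105 if label else 80, 12),
                     rng.gauss(38.3 if label else 37.0, 0.5),
                     label))
    return rows


def make_synthetic(real, n, rng):
    """Stand-in generator: per-class Gaussians fitted to the real rows."""
    params = {}
    for label in (0, 1):
        hrs = [r[0] for r in real if r[2] == label]
        temps = [r[1] for r in real if r[2] == label]
        params[label] = (statistics.mean(hrs), statistics.stdev(hrs),
                         statistics.mean(temps), statistics.stdev(temps))
    p_pos = sum(r[2] for r in real) / len(real)
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < p_pos else 0
        hr_mu, hr_sd, t_mu, t_sd = params[label]
        rows.append((rng.gauss(hr_mu, hr_sd), rng.gauss(t_mu, t_sd), label))
    return rows


def fit_knn(train, k=5):
    """k-nearest-neighbour classifier over features scaled by training SD."""
    hr_sd = statistics.stdev([r[0] for r in train])
    t_sd = statistics.stdev([r[1] for r in train])

    def predict(hr, temp):
        nearest = sorted(
            (((hr - r[0]) / hr_sd) ** 2 + ((temp - r[1]) / t_sd) ** 2, r[2])
            for r in train)[:k]
        return 1 if 2 * sum(label for _, label in nearest) > k else 0

    return predict


def scores(predict, test):
    """Accuracy, precision, recall and F-score of predict() on labelled rows."""
    tp = fp = fn = tn = 0
    for hr, temp, label in test:
        pred = predict(hr, temp)
        if pred and label:
            tp += 1
        elif pred:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(test), "precision": precision,
            "recall": recall, "f_score": f}


rng = random.Random(0)
real = make_real(600, rng)
train_real, test_real = real[:400], real[400:]      # hold out real rows for testing
synthetic = make_synthetic(train_real, 400, rng)    # generator sees training rows only

real_trained = scores(fit_knn(train_real), test_real)
synth_trained = scores(fit_knn(synthetic), test_real)
print("real-trained :", real_trained)
print("synth-trained:", synth_trained)
```

The design point is that the held-out real rows are never shown to the generator or to either model, so the two score sets are directly comparable.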

Slides adapted from: Andrew Michelson & Sean Yu


Data, Methods

[Timeline figure: Admission, -24 Hours, -6 Hours, Time of Sepsis]

Feature Selection:

• Demographics

• Vital signs

• Laboratory analyses

• Comorbidities

• Additional patient characteristics

Prediction Models:

• Logistic Regression

• SVM with various kernels

• KNN

Slides adapted from: Andrew Michelson & Sean Yu


61364 Inpatient Encounters

49493 Unique Patients

7462 Met SIRS criteria

4737 Met Suspicion of Infection criteria

1799 Met both criteria for Sepsis

415 Developed sepsis ≥24 hours after admission

377 With sufficient vital sign documentation

Slides adapted from: Andrew Michelson & Sean Yu


MDClone Sepsis Risk Prediction Study

SVM

                    real-trained   synth-trained
Train   Accuracy    0.925          0.911
        Precision   0.95           0.925
        Recall      0.817          0.799
        F-Score     0.879          0.858
Test    Accuracy    0.846          0.841
        Precision   0.836          0.845
        Recall      0.671          0.645
        F-Score     0.745          0.731

real-trained: trained on patient data, tested on patient data

synth-trained: trained on synthetic data, tested on patient data

Slides adapted from: Andrew Michelson & Sean Yu
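The reported F-Score values can be cross-checked from the precision and recall columns. The sketch below assumes the slide's F-Score is the usual harmonic mean of precision and recall (F1); the slide does not state this, but the recomputed values agree with the reported ones to within about 0.001, which is what rounding of the inputs would produce.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-Score) copied from the SVM results slide.
reported = {
    ("real-trained", "train"): (0.95, 0.817, 0.879),
    ("synth-trained", "train"): (0.925, 0.799, 0.858),
    ("real-trained", "test"): (0.836, 0.671, 0.745),
    ("synth-trained", "test"): (0.845, 0.645, 0.731),
}

for (model, split), (p, r, f_reported) in reported.items():
    f_recomputed = f_score(p, r)
    print(f"{model:13s} {split:5s} recomputed={f_recomputed:.4f} "
          f"reported={f_reported:.3f} diff={abs(f_recomputed - f_reported):.4f}")
```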


Getting ready for an Enterprise-grade deployment

• Hiring support staff

• Scientific Consultant

• Help desk

• Data Use Agreements

• Improving Robustness of infrastructure

– Moving to having 3 synthetic data environments (Dev, QA, Prod), likely on-premise


Future Considerations

• Development of cloud-native data lake & materialized EDW for School of Medicine

• Enables easier collection and access to data from diverse data assets for data analysis

• Easier to source data for MDClone

• Easier to manage bursts in storage and compute needs

• Pushing vendor to support Azure HDInsight rather than fully open source Hadoop stack


Synthetic data: what kinds of projects benefit?

David Dorr, MD, MS

Chief Research Information Officer

Professor and Vice Chair

Oregon Health & Science University


Oregon Health and Science University Chief Research Information Officer

Focusing on data, analytics, and use of information and knowledge in research,

• Listen for opportunities and current gaps

• Set a vision

1. OHSU has been transformed into a data-driven Learning Organization, integrating a Learning Health System, innovative experiential education, and research.

2. OHSU is established as a leader in data science, specifically in education, innovation, and dissemination.

3. OHSU has transformed data-related research information technology and core services to meet users' needs.

4. OHSU is a leader in making data, information, and knowledge from diverse sources findable, accessible, interoperable, and reusable (FAIR) and encouraging sharing and dissemination of data, software, and other research products.

• Support that vision

• Communicate!


The Mission of Care Management Plus is to better understand how data, information, and knowledge can assist in transforming health for our most vulnerable patient populations.

Identifying vulnerable people

Risk stratification and segmentation

Tailoring care to these needs

Improving outcomes


Intent

• Create a widely sharable data set

• With an important and challenging set of signals

• That had the right level of messiness

• Test it with trainees to understand what can be learned from the dataset

• Release both the educational tools and the data together to help people understand how to do it


https://www.biorxiv.org/content/10.1101/232611v2.full


Major issues to be addressed

1) Assess overall risk in the synthetic data and identify a cohort in which to predict CVD risk:

A. Focus on segments of the population for whom we don’t predict well;

B. Exploratory data analysis skills / formats;

C. Messiness of data.

2) Identify appropriate covariates to predict risk using machine learning techniques:

A. Genetic information in 10%

B. Younger age – increasing risk in that group

C. Some basic moderators and mediators worked in

3) Understand the prediction, especially compared to known standards


Synthetic solution: Bayesian Networks

Step 1: Random distributions based on extant frequencies

Step 2: Create Bayesian Networks to nudge probabilities for signals

Step 3: Tweak the heck out of them
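The three steps can be sketched as ancestral sampling from a very small Bayesian network: draw each root variable from its marginal frequencies, then draw the outcome from a conditional probability table (CPT) keyed on those draws. The variable names, frequencies, and CPT entries below are hypothetical placeholders chosen for illustration; they are not taken from the cvdRiskData package or from any real population.

```python
import random

# Step 1: marginal frequencies for the root variables (hypothetical values).
MARGINALS = {
    "age_group": {"young": 0.40, "middle": 0.35, "older": 0.25},
    "smoker": {"yes": 0.20, "no": 0.80},
}

# Step 2: CPT giving P(cvd | age_group, smoker); this is where signals are
# "nudged" (hypothetical values).
CVD_CPT = {
    ("young", "no"): 0.02, ("young", "yes"): 0.05,
    ("middle", "no"): 0.06, ("middle", "yes"): 0.12,
    ("older", "no"): 0.15, ("older", "yes"): 0.30,
}


def draw(dist, rng):
    """Draw one category from a {category: probability} mapping."""
    values, weights = zip(*dist.items())
    return rng.choices(values, weights=weights, k=1)[0]


def sample_patient(rng):
    """Sample parents first, then the outcome conditioned on them."""
    row = {name: draw(dist, rng) for name, dist in MARGINALS.items()}
    row["cvd"] = rng.random() < CVD_CPT[(row["age_group"], row["smoker"])]
    return row


def simulate(n, seed=0):
    """Generate n synthetic patient rows reproducibly."""
    rng = random.Random(seed)
    return [sample_patient(rng) for _ in range(n)]


def cvd_rate(rows, **conditions):
    """Observed CVD rate among rows matching the given variable values."""
    subset = [r for r in rows if all(r[k] == v for k, v in conditions.items())]
    return sum(r["cvd"] for r in subset) / len(subset) if subset else float("nan")


rows = simulate(5000, seed=1)
print("older smokers    :", round(cvd_rate(rows, age_group="older", smoker="yes"), 3))
print("young non-smokers:", round(cvd_rate(rows, age_group="young", smoker="no"), 3))
```

Step 3 then amounts to editing the CPT entries, regenerating, and checking whether the intended signal shows up in summaries like the two printed rates.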


Exploring the data


Results of machine learning on cohorts selected


Lessons learned

• Incorporating trainees into the process was

– Good for validation

– Not great for realistic results

• Bayesian Networks were twitchy but honestly great for this particular example (Thanks, Ted Laderas).

• More examples like this could prove quite helpful; our use of OMOP is a very good standard for this kind of direct-to-code access, and few other data models are so good.

• However, a FHIR-based access model may be okay in the future.


Data, generation script, and course material availability

• The current version of the synthetic dataset is available as an R package called cvdRiskData on GitHub (http://github.com/laderast/cvdRiskData). This package also includes the script, Bayesian network, and CPTs used to generate the dataset. Our course materials for teaching the workshop as well as the dataset simulation script are also available (http://github.com/laderast/cvdNight1 and http://github.com/laderast/cvdNight2).


Panel Questions

• Will synthetic data generation be important to a successful research enterprise?

• Is synthetic data secure?

– What does adding dirty/random noise do, and why might you want to introduce it in your synthetic datasets?

• Is it accurate? How are you going about confirming its accuracy for your organization?

• Education surrounding the usefulness of these systems?

• Enterprise implementation, what’s required for success?

– Data Use Agreements – use them or not?

– When is synthetic data good vs. de-identified data good?

– What types of problems can synthetic data be used to answer?

– Access rights