Synthetic Data Panel · 5/5/2019
Albert Lai – Chief Research Information Officer – Washington University in St. Louis
David Dorr – Chief Research Information Officer – Oregon Health & Science University
Jeremy Harper – Chief Research Information Officer – Regenstrief Institute
What is synthetic data?
• Information that is artificially manufactured. It may or may not be based on reality.
• The dream – accurately representing healthcare datasets via statistical generation, preserving multivariate relationships.
• Being able to allow broader access to data
Example – Image
Left: original image. Right: image generated via a synthetic algorithm.
Synthetic data has been used in image science to develop realistic but unique images
Business Functions
• Data Accessibility
• Machine Learning
• Agile Development – Not waiting for data
• Research – Understanding Properties
• Financial Services – Fraud Protection/Detection
• Healthcare – Sensitive Data Protection
Synthetic Control Arm
• Generating a control arm through existing data resources representing normal patient statistics
• Increases speed of study
• Replaces placebos
• Potentially more fair to subjects
Security
• No real PII in dataset
– Possibility for privacy leakage
Accuracy
• Preserving multivariate relationships is HARD
– Evaluation methods abound: violin plots of variable distributions, multivariate correlations, and Kruskal-Wallis nonparametric distribution comparisons are examples of validation tests for the original vs. the generated dataset.
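Two of the validation tests named above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the panel's actual validation code: the variable names (heart rate, respiratory rate), the toy numbers, and the jittered copy standing in for a synthetic dataset are all assumptions. With exactly two groups the Kruskal-Wallis statistic has one degree of freedom, so the chi-square tail probability reduces to `erfc(sqrt(H/2))`.

```python
import math
import random


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def kruskal_wallis_two_groups(a, b):
    """Tie-corrected Kruskal-Wallis H and p-value for exactly two samples."""
    pooled = list(a) + list(b)
    n = len(pooled)
    ranks = average_ranks(pooled)
    sum_a, sum_b = sum(ranks[:len(a)]), sum(ranks[len(a):])
    h = 12 / (n * (n + 1)) * (sum_a ** 2 / len(a) + sum_b ** 2 / len(b)) - 3 * (n + 1)
    tie_term = sum(c ** 3 - c for c in (pooled.count(v) for v in set(pooled)))
    correction = 1 - tie_term / (n ** 3 - n)
    h = max(h / correction, 0.0) if correction > 0 else 0.0
    # Two groups give 1 degree of freedom; the chi-square tail is erfc(sqrt(H/2)).
    return h, math.erfc(math.sqrt(h / 2))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Toy "original" data (invented numbers, not patient data).
rng = random.Random(0)
orig_hr = [rng.gauss(80, 12) for _ in range(300)]
orig_rr = [0.1 * hr + rng.gauss(10, 2) for hr in orig_hr]
# Hypothetical stand-in for a generator's output: a jittered copy.
synth_hr = [hr + rng.gauss(0, 3) for hr in orig_hr]
synth_rr = [rr + rng.gauss(0, 1) for rr in orig_rr]

h, p = kruskal_wallis_two_groups(orig_hr, synth_hr)
gap = abs(pearson(orig_hr, orig_rr) - pearson(synth_hr, synth_rr))
print(f"heart rate: H={h:.3f}, p={p:.3f}")
print(f"correlation gap (heart rate vs respiratory rate): {gap:.3f}")
```

A large p-value here only means the test found no distributional difference for that one variable; the correlation gap is a separate check on whether a pairwise relationship survived generation.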
RI Setup
• Purchased Jan 2019
• Implementation Start Feb 2019
• Implemented Pilot Go Live May 5th 2019
• 4 Pilot Studies
• Enterprise Rollout Pending
The Methodology
• Raab GM, Nowok B, Dibben C. Guidelines for producing useful synthetic data. arXiv preprint; 2017.
• Snoke J, Raab GM, Nowok B, Dibben C, Slavkovic A. General and specific utility measures for synthetic data. Journal of the Royal Statistical Society: Series A. 2018;181:663-88.
• Hayes J, et al. LOGAN: evaluating privacy leakage of generative models using generative adversarial networks. arXiv preprint arXiv:1705.07663; 2017.
Institute for Informatics | Washington University School of Medicine
Getting Ready for an Enterprise Deployment
Albert M. Lai, PhD, FAMIA
Chief Research Information Officer, Washington University School of Medicine
Deputy Director, Institute for Informatics
Associate Professor, Department of Medicine
Associate Professor, Department of Computer Science and Engineering
Current State
• Research Data Core (WUSTL clinical data repository) is deployed on premises
• Working with a vendor solution (MDClone) for synthetic data
• MDClone deployed in MS Azure using Cloudera Hadoop VMs
• Have a single environment
Evaluation Strategy
• Use case #1 – Pediatric Trauma, Dr. Jose Pineda
• Use case #2 – Sepsis Prediction, Dr. Andrew Michelson & Sean Yu, MS (PhD Candidate)
• Use case #3 – STI Infection Rates, Dr. Randi Foraker
MDClone Sepsis Risk Prediction Study
• Aim 1: Develop a machine learning approach to predict sepsis 6 hours ahead of clinical onset
• Aim 2: Use a novel platform (MDClone) to create synthetic data for sepsis prediction, thus accelerating research in this area
• Aim 3: Deploy the machine learning approach on both real and synthetic patient data and compare the results
Workflow: Data acquisition → Data preprocessing → Cohort identification → Feature engineering → Develop prediction models → Assess models on patient data → Assess models on synthetic data
Slides adapted from: Andrew Michelson & Sean Yu
Data, Methods
Timeline: Admission → -24 Hours → -6 Hours → Time of Sepsis
Feature Selection:
• Demographics
• Vital signs
• Laboratory analyses
• Comorbidities
• Additional patient characteristics
Prediction Models:
• Logistic Regression
• SVM with various kernels
• KNN
Slides adapted from: Andrew Michelson & Sean Yu
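The comparison in Aim 3 (fit the same model once on real data and once on synthetic data, then score both on held-out real data) can be sketched with KNN, one of the model families listed. This is a toy illustration only: the two-feature cohort, the generator function `make_toy_cohort`, and every number are invented, and a second draw from the same toy distribution stands in for the output of a synthetic-data platform. It is not the study's code or data.

```python
import random


def make_toy_cohort(rng, n):
    """Return (features, label) rows. Invented numbers, not clinical data."""
    rows = []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        center = 3.0 if label else 0.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows


def knn_predict(train, x, k=5):
    """Majority vote among the k training rows closest to x (squared Euclidean)."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x))
    )[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)


def accuracy(train, test, k=5):
    """Fraction of test rows whose label the KNN model predicts correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


rng = random.Random(42)
real_train = make_toy_cohort(rng, 200)
real_test = make_toy_cohort(rng, 100)
# Hypothetical stand-in for synthetic data: another draw from the toy generator.
synthetic_train = make_toy_cohort(rng, 200)

acc_real = accuracy(real_train, real_test)
acc_synth = accuracy(synthetic_train, real_test)
print(f"trained on real, tested on real:      {acc_real:.3f}")
print(f"trained on synthetic, tested on real: {acc_synth:.3f}")
```

Scoring both models on the same real test set is what makes the two accuracies comparable; a small gap suggests the synthetic data carried enough signal for this particular task.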
61364 Inpatient Encounters
49493 Unique Patients
7462 Met SIRS criteria
4737 Met Suspicion of Infection criteria
1799 Met both criteria for Sepsis
415 Developed sepsis ≥24 hours after admission
377 With sufficient vital sign documentation
Slides adapted from: Andrew Michelson & Sean Yu
MDClone Sepsis Risk Prediction Study
SVM
                  real-trained   synth-trained
Train
  Accuracy        0.925          0.911
  Precision       0.95           0.925
  Recall          0.817          0.799
  F-Score         0.879          0.858
Test
  Accuracy        0.846          0.841
  Precision       0.836          0.845
  Recall          0.671          0.645
  F-Score         0.745          0.731
real-trained: trained on patient data, tested on patient data
synth-trained: trained on synthetic data, tested on patient data
Slides adapted from: Andrew Michelson & Sean Yu
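As a quick consistency check on the reported numbers, the F-Score rows can be recomputed from the precision and recall rows. The sketch below assumes the slide's F-Score is the balanced F1 (the harmonic mean of precision and recall); the slide does not state which F-measure was used. Under that assumption, each recomputed value agrees with the reported one to within about 0.001, which is what rounding of the inputs would explain.

```python
def f_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-Score), copied from the SVM results slide.
reported = {
    ("train", "real-trained"): (0.95, 0.817, 0.879),
    ("train", "synth-trained"): (0.925, 0.799, 0.858),
    ("test", "real-trained"): (0.836, 0.671, 0.745),
    ("test", "synth-trained"): (0.845, 0.645, 0.731),
}

for (split, model), (precision, recall, reported_f) in reported.items():
    computed = f_score(precision, recall)
    print(f"{split:5s} {model:13s} computed={computed:.4f} reported={reported_f}")
```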
Getting ready for an Enterprise-grade deployment
• Hiring support staff
• Scientific Consultant
• Help desk
• Data Use Agreements
• Improving robustness of infrastructure
– Moving to having 3 synthetic data environments (Dev, QA, Prod), likely on-premises
Future Considerations
• Development of cloud-native data lake & materialized EDW for School of Medicine
• Enables easier collection and access to data from diverse data assets for data analysis
• Easier to source data for MDClone
• Easier to manage bursts in storage and compute needs
• Pushing vendor to support Azure HDInsight rather than fully open source Hadoop stack
Synthetic data: what kinds of projects benefit?
David Dorr, MD, MS
Chief Research Information Officer
Professor and Vice Chair
Oregon Health & Science University
Oregon Health and Science University Chief Research Information Officer
Focusing on data, analytics, and use of information and knowledge in research:
• Listen for opportunities and current gaps
• Set a vision
1. OHSU has been transformed into a data-driven Learning Organization, integrating a Learning Health System, innovative experiential education, and research.
2. OHSU is established as a leader in data science, specifically in education, innovation, and dissemination.
3. OHSU has transformed data-related research information technology and core services to meet users' needs.
4. OHSU is a leader in making data, information, and knowledge from diverse sources findable, accessible, interoperable, and reusable (FAIR) and encouraging sharing and dissemination of data, software, and other research products.
• Support that vision
• Communicate!
The Mission of Care Management Plus
is to better understand how data, information, and knowledge can assist in transforming health for our most vulnerable patient populations.
Identifying vulnerable people
Risk stratification and segmentation
Tailoring care to these needs
Improving outcomes
Intent
• Create a widely sharable data set
• With an important and challenging set of signals
• That had the right level of messiness
• Test it with trainees to understand what can be learned from the dataset
• Release both the educational tools and the data together to help people understand how to do it
https://www.biorxiv.org/content/10.1101/232611v2.full
Major issues to be addressed
1) Assess overall risk in the synthetic data and identify a cohort in which to predict CVD risk;
A. Focus on segments of the population for whom we don’t predict well;
B. Exploratory data analysis skills / formats;
C. Messiness of data.
2) Identify appropriate covariates to predict risk using machine learning techniques;
A. Genetic information in 10%
B. Younger age – increasing risk in that group
C. Some basic moderators and mediators worked in
3) Understand the prediction, especially compared to known standards
Synthetic solution: Bayesian Networks
Step 1: Random distributions based on extant frequencies
Step 2: Create Bayesian Networks to nudge probabilities for signals
Step 3: Tweak the heck out of them
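The three steps can be sketched as a tiny hand-built network in plain Python. This is only an illustration of the idea, not the generation script from the cvdRiskData package: the variables (`age_group`, `smoker`, `cvd`), the marginal frequencies, and the conditional probability table are all hypothetical values chosen for the example.

```python
import random

# Step 1: marginal frequencies for the root variables (hypothetical values).
AGE_GROUP_FREQ = {"<45": 0.4, "45-64": 0.4, "65+": 0.2}
SMOKER_FREQ = {True: 0.2, False: 0.8}

# Step 2: a conditional probability table, P(cvd | age_group, smoker),
# that nudges the outcome toward the signal we want learners to find.
# Step 3 ("tweak") amounts to editing these entries and regenerating.
CVD_CPT = {
    ("<45", False): 0.02,
    ("<45", True): 0.06,
    ("45-64", False): 0.08,
    ("45-64", True): 0.18,
    ("65+", False): 0.15,
    ("65+", True): 0.30,
}


def draw(rng, freq):
    """Draw one category according to its frequency."""
    values = list(freq)
    return rng.choices(values, weights=[freq[v] for v in values])[0]


def sample_patient(rng):
    """Sample parents first, then the child conditioned on them."""
    age_group = draw(rng, AGE_GROUP_FREQ)
    smoker = draw(rng, SMOKER_FREQ)
    cvd = rng.random() < CVD_CPT[(age_group, smoker)]
    return {"age_group": age_group, "smoker": smoker, "cvd": cvd}


def cvd_rate(rows, age_group, smoker):
    """Observed CVD rate within one (age_group, smoker) stratum."""
    stratum = [r for r in rows if r["age_group"] == age_group and r["smoker"] == smoker]
    return sum(r["cvd"] for r in stratum) / len(stratum)


rng = random.Random(2019)
cohort = [sample_patient(rng) for _ in range(5000)]
print("CVD rate, 65+ smokers:   ", round(cvd_rate(cohort, "65+", True), 3))
print("CVD rate, <45 non-smokers:", round(cvd_rate(cohort, "<45", False), 3))
```

Because every record is drawn from the tables rather than copied from a patient, the dataset can be shared widely, and the planted signals can be strengthened or weakened by changing a handful of probabilities.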
Exploring the data
Results of machine learning on cohorts selected
Lessons learned
• Incorporating trainees into the process was
– Good for validation
– Not great for realistic results
• Bayesian Networks were twitchy but honestly great for this particular example (Thanks, Ted Laderas).
• More examples like this could prove quite helpful; our use of OMOP is a very good standard for this kind of direct-to-code access – few other data models are so good;
• However, a FHIR-based access model may be okay in the future.
Data, generation script, and course material availability
• The current version of the synthetic dataset is available as an R package called cvdRiskData on GitHub (http://github.com/laderast/cvdRiskData). This package also includes the script, Bayesian network, and CPTs used to generate the dataset. Our course materials for teaching the workshop as well as the dataset simulation script are also available (http://github.com/laderast/cvdNight1 and http://github.com/laderast/cvdNight2).
Thanks!
• David Dorr
• Albert Lai
• Jeremy Harper
Panel Questions
• Will synthetic data generation be important to a successful research enterprise?
• Is synthetic data secure?
– What does adding dirty/random noise do, and why might you want to introduce it in your synthetic datasets?
• Is it accurate? How are you going about confirming its accuracy for your organization?
• Education surrounding the usefulness of these systems?
• Enterprise implementation: what’s required for success?
– Data Use Agreements – use them or not?
– When is synthetic data good vs. de-identified data good?
– What types of problems can synthetic data be used to answer?
– Access rights