Multi-centric learning from medical data
DESCRIPTION
Georgi Nalbantov's lecture at the ATIA Alan Turing Year Conference "AI & I: truly personalised medicine", 1 November 2012. TRANSCRIPT
Multi-centric learning from medical data
Nov 2012, Georgi Nalbantov
Multiple sets of medical data exist in different hospitals
Currently: models are built from data in 1 center only
Currently: external model validation requires standardization
Data privacy is an issue: data cannot leave hospitals (easily)
Multi-centric learning from medical data: why?
General Hypothesis:
Current way of learning from medical data is suboptimal, as modeling techniques do not have access to all available data
Specific Hypothesis:
A distributed learning environment (euroCAT), by giving local access to all data, can be used to produce optimal models, the same as if the data were centralized*
Multi-centric learning from medical data: why?
* For some modeling techniques
Learning from medical data: the current state
[Figure: five centers; each center Ci trains a model locally and validates it at the other four centers.]
We learn a model from one center only and validate at other centers (if possible)
We also check the predictions of the doctors. (The golden standard?)
Problem: suboptimal
Optimal solution: learning from centralized data
NOT FEASIBLE
Learning from medical data: the challenge
[Figure: center 1 holds the data to predict; centers 2-5 hold the data for learning, either centralized in one place or left decentralized.]
Decentralized data: how to achieve that?
Option 1: Centralized learning. Combine the data from sites 1, 2, 3 and 4 in one central database; that is, bring the data to the model: NOT FEASIBLE.
Option 2: Distributed learning. Apply "distributed learning" to the data from sites 1, 2, 3 and 4; that is, bring the model to the data: FEASIBLE.
Centralized Learning
Data centralization
Learning
Nalbantov and Wiessler, Oct 2012
Distributed Learning
Distributed Learning: doing it
- For distributed learning we need a statistical modeling technique that can learn in distributed mode, that is, without being able to "see" all the data at once.
- We choose one of them for this study: SVMs (support vector machines).
  - They have shown excellent results across a wide range of data-analysis problems.
  - They are robust to the inclusion of many features (bye-bye to the "15-1" rule of thumb).
  - They can be constructed in distributed-learning mode.
- There exist learning models that are able to find an optimal solution whether or not the data is scattered across different centers.
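The last bullet can be made concrete: for some techniques the distributed fit is exactly the centralized one, because the optimum depends on the data only through small summary statistics. A minimal sketch with ordinary least squares (the data, the three-site split, and all numbers below are illustrative, not from the trial):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Pretend the 100 rows live at three different "hospitals".
sites = [(X[:40], y[:40]), (X[40:70], y[40:70]), (X[70:], y[70:])]

# Centralized fit: solve the normal equations on the pooled data.
w_central = np.linalg.solve(X.T @ X, X.T @ y)

# Distributed fit: each site shares only its d x d Gram matrix and
# d-vector of moments -- never the patient-level rows themselves.
G = sum(Xi.T @ Xi for Xi, _ in sites)
m = sum(Xi.T @ yi for Xi, yi in sites)
w_distributed = np.linalg.solve(G, m)

print(np.allclose(w_central, w_distributed))  # True: exactly the same model
```

SVMs do not decompose into such simple summaries, which is why the talk turns to a dedicated distributed solver (ADMM) later on.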
Learning from medical data: distributed learning
SVMs
Model evaluation: ROC curve
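The area under the ROC curve (AUC) used throughout the talk has a direct probabilistic reading: the chance that a randomly chosen positive case is scored above a randomly chosen negative one. A minimal pure-Python sketch (the labels and scores are a toy example, not trial data):

```python
def roc_auc(labels, scores):
    """AUC = probability that a randomly chosen positive case is scored
    above a randomly chosen negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = survived two years, 0 = did not;
# scores are a model's predicted probability of survival.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(roc_auc(labels, scores))  # 8 of 9 positive/negative pairs ranked correctly
```

An AUC of 0.5 means ranking at chance level; 1.0 means every positive outranks every negative.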
The “Trial”: prediction of 2-year survival of lung cancer patients
Patients: 322 (Maastro) lung cancer patients, distributed across 5 sites:
Maastro 186
Liege 52
Hasselt/Genk 45
Aachen 7
Eindhoven 32
Endpoint: 2-year survival
Method: distributed learning SVMs (ref: Boyd, ADMM), euroCAT
Predictive features: gender, WHO performance status, FEV1, number of PLNSs, GTV (volume) and “EQD2,T”
The "Trial": prediction of 2-year survival of lung cancer patients
[Figure: the data matrix for the trial, one row per patient; columns: 2-year survival (endpoint), gender, WHO, FEV1, number of PLNSs, GTV (volume), EQD2,T.]
The data for the trial:
site          patients
Maastro       186
Liege         52
Hasselt/Genk  45
Aachen        7
Eindhoven     32
The "golden standard": doctors' predictions
prediction of 2-year survival of lung cancer patients
"Traditional" solution
prediction of 2-year survival of lung cancer patients
[Figure: the model is built at center 1 (Maastro, 186 patients) and validated at centers 2-5: Liege 52, Hasselt/Genk 45, Aachen 7, Eindhoven 32.]
Build the model at center 1.
Validate the model at centers 2, 3, 4 and 5.
Step 1. Build an SVM model from the data in center 1. There is no "one button" to press: it turns out SVM is a "family" of models, and the trained statistician has to choose one family member in much the same way as a surgeon has to choose from a variety of "knives".
Step 2. Model evaluation: how will our model perform outside my hospital? Perform cross-validation to find the optimal SVM from the "family".
Step 3. Build the final model using the "best-performing" SVM from the SVM family.
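Steps 1-3 can be sketched in code: a linear SVM fit by subgradient descent, with k-fold cross-validation used to pick lambda from the "family". Everything here (the synthetic data, the lambda grid, the optimizer settings) is an illustrative stand-in, not the trial's actual pipeline:

```python
import numpy as np

def train_svm(X, y, lam, epochs=500, lr=0.05):
    """L2-regularized linear SVM (hinge loss) fit by subgradient descent.
    lam is the regularization strength: the "lambda" indexing the SVM family."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        viol = y * (X @ w) < 1                      # margin violations
        grad = lam * w - (X[viol].T @ y[viol]) / n  # subgradient of the objective
        w -= lr * grad
    return w

def cv_accuracy(X, y, lam, k=5):
    """k-fold cross-validation accuracy for one member of the SVM family."""
    folds = np.array_split(np.arange(len(y)), k)
    accs = []
    for f in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[f] = False
        w = train_svm(X[mask], y[mask], lam)
        accs.append(np.mean(np.sign(X[f] @ w) == y[f]))
    return float(np.mean(accs))

# Synthetic stand-in for one center's patients: 6 features, labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))
X = np.hstack([X, np.ones((200, 1))])  # append a bias column

# Step 2: cross-validate each family member; Step 3 refits the winner on all data.
for lam in (0.01, 0.1, 1.0, 5.0):
    print(f"lambda={lam}: CV accuracy {cv_accuracy(X, y, lam):.3f}")
```

The trial used ROC/AUC rather than plain accuracy as the cross-validation criterion; accuracy keeps this sketch short.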
"Traditional" solution
prediction of 2-year survival of lung cancer patients
[Figure: cross-validation ROC curves for members of the SVM family, e.g. SVM with lambda = 1 and SVM with lambda = 5.]
Build the final SVM model on all data from center 1, that is, the "SVM with lambda = 2".
External validation
AUC of the chosen SVM from the family at each external center:
Center 2  0.723
Center 3  0.757
Center 4  0.671
Center 5  0.600
Learning from centralized data: optimal solution, but NOT FEASIBLE.
Distributed learning: optimal solution*, and FEASIBLE (example: euroCAT).
euroCAT solution:
prediction of 2-year survival of lung cancer patients
[Figure: two panels. Left ("bring the data to the model"): the decentralized data at centers 2-5 would have to be moved to center 1, which holds the data to predict. Right ("bring the model to the data"): the data stays decentralized at centers 2-5, yet BEHAVES like centralized*.]
*Using ADMM to solve the SVM
euroCAT: a breakthrough
Training site(s)  Predicted site  AUC*   color
2                 1               0.754  red
3                 1               0.678  green
4                 1               0.610  cyan
5                 1               0.723  pink
2,3,4,5           1               0.766  blue
1,2,3,4,5         world           ?      ?
*AUC of euroCAT learning on the predicted site (same result as centralized)
What is the potential benefit of multi-centric batch learning for predicting survival?
The future
How can patients/clinics profit from distributed learning medical environments?
- Use real multi-centric data for modeling
- Use multiple endpoints: survival, dyspnea, dysphagia, fibrosis, etc.
- Include more variables: imaging, DNA, etc.
- Use standardized data (and more data)
- Etc…
Thank you for your attention
Any questions? [email protected]