

PREDICTIVE ANALYSIS OF METABOLOMICS DATA IN CHRONIC KIDNEY DISEASE USING A SUBGROUP DISCOVERY ALGORITHM

Margaux Luck¹, Gildas Bertho²,³,⁴, Eric Thervet²,⁵, Philippe Beaune²,⁶,⁷, François d’Ormesson¹, Mathilde Bateson¹, Anastasia Yartseva¹, Cécilia Damon¹, Nicolas Pallet²,⁴,⁶,⁷

¹ Hypercube Institute, Paris, France; ² Paris Descartes University, Paris, France; ³ CNRS UMR 8601, Paris, France; ⁴ Magnetic Nuclear Resonance Platform “MetaboParis-Santé”, Paris, France; ⁵ Renal Division, Georges Pompidou European Hospital, Paris, France; ⁶ INSERM U1147, Paris, France; ⁷ Clinical Chemistry Department, Georges Pompidou European Hospital, Paris, France

CONTEXT

Chronic kidney disease (CKD)
- Renal function decrease: decrease of the glomerular filtration rate (GFR)
- Often detected quite late, with irreversible complications
- 1.7 to 2.5 million patients
- Heavy and invasive treatment (dialysis, graft)

MATERIALS AND METHODS

RESULTS

eGFR ≥ 60 ml/min/1.73 m²: low to mild CKD, 24 subjects

eGFR < 60 ml/min/1.73 m²: moderate to established CKD, 86 subjects

[Figure: signal processing. The temporal spectrum is converted into a frequency spectrum by Fourier transform.]

Supervised analysis

2) Feature selection

b) Rule mining

3) Predictive model

Normalized Mutual Information (nMI)
Measures all types of mutual dependence between features.

*eGFR stands for estimated glomerular filtration rate.

Objective
Identification of the metabolomic profiles predictive of CKD stages through an exhaustive local data mining approach with rule generation.

Dataset: 110 subjects (features: urine ¹H NMR spectra; target: CKD stages, 2 modalities)

1H NMR

1) Dimension reduction

High dimension: 32K points

- Equidistant binning of 0.04 ppm (AUC)
- Discretization of the features with a 10-bin quantization
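The two dimension-reduction steps above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the ppm axis is assumed to be sorted in increasing order, the area under the curve is approximated with trapezoids, and the 10-bin discretization is assumed to be quantile-based (the poster only says "10 bin quantization").

```python
from bisect import bisect_right


def bucket_auc(ppm, intensity, width=0.04):
    """Sum the trapezoidal area of a spectrum inside equidistant ppm buckets.

    `ppm` must be sorted in increasing order. Returns {bucket_start: area}.
    """
    start = ppm[0]
    areas = {}
    for k in range(len(ppm) - 1):
        # Trapezoid between two consecutive points of the spectrum.
        area = 0.5 * (intensity[k] + intensity[k + 1]) * (ppm[k + 1] - ppm[k])
        # Assign the trapezoid to the bucket that contains its left edge.
        index = int((ppm[k] - start) / width + 1e-9)
        key = round(start + index * width, 2)
        areas[key] = areas.get(key, 0.0) + area
    return areas


def quantile_bins(values, n_bins=10):
    """Map each value to a bin index in 1..n_bins using empirical quantiles
    (assumption: the poster does not say how the 10 bins are defined)."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut points between consecutive bins; a value equal to a cut point
    # goes to the upper bin.
    edges = [ordered[min(n - 1, (n * b) // n_bins)] for b in range(1, n_bins)]
    return [bisect_right(edges, v) + 1 for v in values]
```

In this sketch each bucket would be discretized across the subjects, so that every feature takes integer values from 1 to 10. With many tied values the bins are no longer equally populated.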

Feature selection

Rule mining

Predictive model

References:
[1] Loucoubar C et al. PLoS One (2011) 6(9):e24085
[2] Pallet N et al. Kidney International (2014) 85(5):1239-40

Supervised feature selection. Features (blue points) are selected according to:
- χ² test: -log(p-value) ≥ 3 => Xχ²
- nMI ≥ min[nMI(Xχ²)] = 0.14

Number of features selected = 36
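The selection rule reads as a two-step filter: the χ² criterion defines a first set of features, and the smallest nMI found in that set becomes the nMI threshold applied to all features. A minimal sketch follows, assuming the log is the natural logarithm (e⁻³ ≈ 0.05, which matches the p-value of 0.05 marked on the poster). The function name and the toy values are hypothetical.

```python
import math


def select_features(p_values, nmi, min_neg_log_p=3.0):
    """Two-step supervised selection.

    p_values, nmi: dicts mapping a feature name to its chi-squared p-value
    (assumed > 0) and to its nMI with the target.
    Returns (chi2_selected, threshold, selected).
    """
    # Step 1: features passing the chi-squared criterion, -ln(p) >= 3.
    chi2_selected = [f for f, p in p_values.items() if -math.log(p) >= min_neg_log_p]
    if not chi2_selected:
        return [], None, []
    # Step 2: the smallest nMI among those features becomes the threshold.
    threshold = min(nmi[f] for f in chi2_selected)
    selected = [f for f, value in nmi.items() if value >= threshold]
    return chi2_selected, threshold, selected


# Toy values, for illustration only (not taken from the study).
p_values = {"3.56": 0.001, "2.72": 0.01, "1.00": 0.2, "4.00": 0.5}
nmi = {"3.56": 0.30, "2.72": 0.14, "1.00": 0.20, "4.00": 0.05}
chi2_selected, threshold, selected = select_features(p_values, nmi)
```

The second step can only widen the selection, since every feature retained by χ² also passes the nMI threshold by construction.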

a) Data mining

[Figure: workflow. Data mining produces a model of prediction fR of the CKD stages, based on rule generation.]

Features : signal acquisition and pre-processing

Target : CKD Stages

Complex mixture of metabolites

Overlaps of some peaks

Our local data mining approach with rule generation, combined with logistic regression, supports the discriminant metabolites obtained in a previous study [2].

It also provides information on the distribution of the CKD stages with respect to ranges of spectral values.

Perspectives
- Include multi-source datasets
- Refine the GFR classes
- Build a predictive model on subgroups defined by rules (i.e. a local predictive model)

z-scores and normalized lift measures of the significant 1D rules generated for the descriptive features 3.56 and 2.72 ppm, for both output modalities (GFR < 60 and GFR ≥ 60 ml/min/1.73 m²). A segment is a rule characterized by its z-score (y-axis), its normalized lift level (color scale), and the range of variable values it covers (x-axis). The more interesting a rule is, the redder and the higher on the y-axis it appears.

Interpretation:
Bucket 3.56: number of subjects eGFR < 60 when AUC values.

Bucket 2.72: opposite effect.

Pearson’s chi-squared test (χ²)
Statistical test of goodness of fit and of independence, comparing the observed distribution of 2 discrete variables with a theoretical distribution.

Oi and Ei are respectively the observed and expected frequencies. r and c are respectively the numbers of rows and columns in the contingency table.
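The formula itself did not survive in this transcript. With the symbols defined above, the usual Pearson statistic is χ² = Σ (Oi - Ei)² / Ei, with (r - 1)(c - 1) degrees of freedom. A minimal sketch under that assumption (the function name is hypothetical; the table is assumed to have no empty row or column):

```python
def chi_squared(table):
    """Return (statistic, degrees_of_freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Toy 2 x 2 table, for illustration only.
statistic, dof = chi_squared([[10, 20], [20, 10]])
```

For a feature discretized in 10 bins against the 2 target modalities, the table is 10 x 2 and the test has 9 degrees of freedom. The p-value can then be read from the chi-squared survival function, for instance `scipy.stats.chi2.sf(statistic, dof)` if SciPy is available.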

p(x), p(y) and p(x,y) are respectively the X-marginal, Y-marginal and joint probability distributions. H stands for entropy.
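The nMI formula is also missing from the transcript. One common definition, assumed here, computes the mutual information as I(X;Y) = H(X) + H(Y) - H(X,Y) and divides it by sqrt(H(X) · H(Y)); other normalizations exist (arithmetic mean, minimum or maximum of the two entropies), and the poster does not say which one was used. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(x, y):
    """nMI between two discrete sequences of equal length, in [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    h_xy = entropy(list(zip(x, y)))          # joint entropy H(X, Y)
    mutual_information = h_x + h_y - h_xy
    # Assumed normalization: geometric mean of the marginal entropies.
    denominator = math.sqrt(h_x * h_y)
    return mutual_information / denominator if denominator > 0 else 0.0
```

Unlike a correlation coefficient, this quantity is sensitive to any kind of dependence between a discretized feature and the target, not only to monotonic trends.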

Local distribution of the patients’ subgroups with identical degrees of CKD severity on each feature, using HyperCube® [1].

1D rules: {Feature condition} => {Target modality}, where the feature condition is a range of values and the target modality is a CKD stage.

z-score: tests whether the target modality proportion in the rule (p) and in the total population (p0) are different (z-score ≥ 1.96).

CONCLUSION

Rules quality measures

Lift: measures the over-density of the target modality in the rule (Lift > 1).

[Figure: 1D rules. Subjects (O: eGFR < 60; X: eGFR ≥ 60) are plotted for each ppm feature (x-axis) against its values discretized in 10 bins (y-axis, 1 to 10).]

- Data split into a train (80%) and a test (20%) set
- Regularization parameter optimization (10-fold CV)
- Classification score (leave-one-out, weighted average F1 measure)
- Permutation test, 100 runs

Binary class L2 penalized logistic regression

Model evaluation:

Classification score : 0.79 (p-value : 0.001)
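The classification score is described as a weighted average F1 measure. A minimal sketch of that metric, assuming the usual definition (per-class F1 averaged with weights proportional to the number of true subjects in each class); the function name and the toy labels are hypothetical.

```python
def weighted_f1(y_true, y_pred):
    """Average of the per-class F1 scores, weighted by class support."""
    total = 0.0
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2 TP / (2 TP + FP + FN); set to 0 when the class is never hit.
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * (tp + fn) / len(y_true)
    return total


# Toy labels, for illustration only.
score = weighted_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

Weighting by support matters here because the two classes are unbalanced (86 versus 24 subjects): the majority class contributes more to the final score than the minority class.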

The 10 features selected with our method (except the bucket 2.48) were in the first 16 features obtained with the Orthogonal Partial Least Squares (O-PLS) model [2].

Our method provided information on the distribution of the target modalities with respect to ranges of spectral values (buckets AUC).

Metabolomic profiles of the CKD stages: GFR => [citrate] (metabolic disorders); [dimethyl sulfone] (clearance without compensation); [trigonelline] (protection against oxidative stress and apoptosis) …

Increase the number of subjects

yi is the target modality, Xi the explanatory features, w the weights and C the regularization coefficient.
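The equation that this legend refers to is not part of the transcript. A plausible form, assuming the usual objective of a binary L2-penalized logistic regression as implemented in LIBLINEAR and scikit-learn (labels coded as -1 and +1, intercept omitted for brevity), is given below. In that formulation C is the parameter tuned by cross-validation and acts as the inverse of the regularization strength.

```latex
\min_{w} \; \frac{1}{2}\, w^{\top} w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \, X_i^{\top} w\right)\right),
\qquad y_i \in \{-1, +1\}
```

A small C gives more weight to the penalty term and shrinks the weights w toward zero, while a large C lets the model fit the training data more closely.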

[Figure: supervised feature selection. Threshold at p-value = 0.05; 18 features selected with χ², 36 features with nMI.]

Relevant generated 1D rules were based on 10 out of the 36 previously selected features.