
Integrative Network Analysis of Differentially Methylated and Expressed Genes for Biomarker Identification in Leukemia

Robersy Sanchez* and Sally A. Mackenzie*

Departments of Biology and Plant Science, The Pennsylvania State University, University Park, PA 16802.

Running Title: Network Analysis in Leukemia

Corresponding Authors:

Robersy Sanchez
361 Frear North Bldg
Pennsylvania State University
University Park, PA 16802
Email: rus547@psu.edu

Sally Mackenzie
362 Frear North Bldg
Pennsylvania State University
University Park, PA 16802
Email: [email protected]

bioRxiv preprint doi: https://doi.org/10.1101/658948; this version posted June 3, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Abstract

Genome-wide DNA methylation and gene expression are commonly altered in pediatric acute lymphoblastic leukemia (PALL). Integrated analysis of cytosine methylation and expression datasets has the potential to provide deeper insights into the complex disease states and their causes than individual disconnected analyses. Studies of protein-protein interaction (PPI) networks of differentially methylated (DMGs) and expressed genes (DEGs) showed that gene expression and methylation consistently targeted the same gene pathways associated with cancer: Pathways in cancer, Ras signaling pathway, PI3K-Akt signaling pathway, and Rap1 signaling pathway, among others. Detected gene hubs and hub sub-networks are integrated by signature loci associated with cancer that include, for example, NOTCH1, RAC1, PIK3CD, BCL2, and EGFR. Statistical analysis disclosed a stochastic deterministic dependence between methylation and gene expression within the set of genes simultaneously identified as DEGs and DMGs, where larger values of gene expression changes are probabilistically associated with larger values of methylation changes. Concordance analysis of the overlap between enriched pathways in DEG and DMG datasets revealed statistically significant agreement between gene expression and methylation changes, reflecting a coordinated response of methylation and gene-expression regulatory systems. These results support the identification of reliable and stable biomarkers for cancer diagnosis and prognosis.

Introduction

Network-based modeling approaches have the potential to integrate and improve the perception of complex disease states and their root causes. To date, network analysis provides reliable and cost-effective approaches for early disease detection, prediction of co-occurring diseases and interactions, and drug design [1]. Although integrated genomic analyses of methylation and gene expression in leukemia have been performed [2–5], an integration including network analysis of methylation and gene expression is still missing.

Our study investigates protein-protein interaction (PPI) networks, which are exclusively focused on protein-protein associations and resulting cell activities. A PPI network can be defined as a (un)directed graph/network whose vertices are proteins (or protein-encoding genes) and whose edges are the interactions/associations between them. Associations are meant to be specific and biologically meaningful, i.e., two proteins (genes) are connected by an edge if they jointly contribute to a shared function, which does not necessarily reflect a physical binding interaction.

Within the network, some proteins act as hubs interacting with numerous partners. Biologically, hubs are key elements on which functionality of the cellular process modeled by the network depends. Consequently, it is reasonable to assume that a biomarker suitable to define specific disease states would likely be a hub or a hub regulator within a relevant network. Frequently, more than one interacting network model is possible, with each model carrying a different uncertainty level for the biological process under study. Integration of more than one network model can help to reduce the implicit uncertainty associated with each model prediction [6].
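The graph definition above can be made concrete with a minimal sketch, assuming a PPI network is supplied as a list of undirected protein pairs. The edge list is a hypothetical toy example (the gene symbols appear in the abstract, but the connections are invented for illustration and are not taken from STRING or from this study), and `build_graph` and `rank_by_degree` are illustrative helper names.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)

def rank_by_degree(adj):
    """Return (node, degree) pairs sorted from most to least connected."""
    return sorted(((node, len(neigh)) for node, neigh in adj.items()),
                  key=lambda item: (-item[1], item[0]))

# Hypothetical toy edges; not taken from STRING or from the study.
toy_edges = [
    ("NOTCH1", "EGFR"), ("NOTCH1", "BCL2"), ("NOTCH1", "RAC1"),
    ("NOTCH1", "PIK3CD"), ("EGFR", "RAC1"), ("BCL2", "PIK3CD"),
]
graph = build_graph(toy_edges)
ranking = rank_by_degree(graph)
print(ranking[0])  # the most connected node is a hub candidate
```

Node degree alone is only a first-pass indicator of a hub; it says nothing about how a node bridges otherwise distant parts of the network.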

Here, we address the hypothesis that disease-induced DNA methylation changes can serve as a source of reliable and stable biomarkers for cancer diagnosis and prognosis. Toward that aim, aberrant DNA methylation of key genes was reported in Acute Lymphoblastic Leukemia (ALL) [6]. We report on a reproducible approach integrating network analysis of DMGs, DEGs and DEGs-DMGs estimated within datasets from patients with pediatric ALL (PALL). Such an integration may provide the basis for robust identification of reliable and stable biomarkers for cancer diagnosis and prognosis.


Results

General methylation features of the study

Differentially methylated positions (DMPs) were estimated for control (four normal CD19+ blood cell donors) and patient (ALL cells from three patients) groups relative to a reference group of four independent normal CD19+ blood cell donors. Inclusion of a reference group permitted the evaluation of natural variation in healthy individuals and reduction of noise in a signal detection step of our methylation analysis pipeline. The distribution of methylation changes at DMPs along the chromosome revealed a genome-wide methylation re-patterning dominated by hypermethylation in PALL patients (Supplementary Fig. S1). Hypomethylated sites are visible in the genome browser after zooming (tracks available in the Supplementary File S1). Consistent with natural methylation variability in the population of healthy individuals, DMPs were observed in the control group as well.

DMGs were estimated from group comparisons for number of DMPs within gene-body regions between control (CD19+ blood cell donors) and ALL cells based on generalized linear regression. This analysis yielded a total of 4795 DMGs, including protein-coding regions (3338) and non-coding RNA genes (Supplementary Table S1). Of the 2360 B-cell DEGs reported in the original study [7], 1774 were DMGs as well (75.2%, Supplementary Table S2). The gene-body methylation signal detected in PALL patients coincided with a significant number of genes from the list of all cancer consensus genes (723) from the COSMIC database [8]: 254 DMGs and 126 DEGs, of which 112 were DEGs-DMGs.


Network analysis on a set of differentially methylated genes (DMGs)

When applying network analysis, not all DMGs and DEGs estimated from the experimental datasets integrate networks. A subset of the most relevant genes from the experimental dataset able to integrate networks is helpful to facilitate further network analysis. The preliminary application of network-based enrichment analysis (NBEA [9]) and network enrichment analysis test (NEAT [10]) on the set of DMGs permitted the identification of 285 network-related DMGs (Supplementary Tables S1 and S1). Similar analysis permitted the identification of 326 network-related DEGs (Supplementary Table S2-B, from the 2360 B-cell DEGs reported in Supplementary Table 3 of the original study [7]). These subsets were used to build the corresponding protein-protein interaction (PPI) networks with the STRING app of Cytoscape [11,12]. Alternatively, to bypass possible bias introduced by the heuristic used to subset the whole set of genes (NBEA [9] and NEAT [10]), sub-clusters of hubs were retrieved by applying the MCODE Cytoscape app on the whole set of DMGs.

The PPI network built on the set of 285 DMGs is presented in Supplementary Fig. S2. The analysis with available tools in Cytoscape [11] led to the identification of the main hubs from the PPI network (Fig. 1A and C). Sizes of nodes and labels, as well as their colors, are used for rapid identification of network hubs. Network hubs were confirmed based on betweenness-centrality and node degree [13], such that the size of each node is proportional to its value of betweenness-centrality and the label font size is proportional to its node degree.
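To illustrate the two hub criteria named above, the following is a minimal sketch of betweenness centrality for an unweighted, undirected graph using Brandes' algorithm, alongside node degree. It is not the Cytoscape implementation used in the study; the toy network and the function name `betweenness_centrality` are assumptions made for illustration, and the values are left unnormalised.

```python
from collections import deque

def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    adj maps each node to the set of its neighbours. Returns unnormalised
    betweenness: the number of shortest paths between other node pairs that
    pass through each node, with ties split fractionally.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair was visited from both endpoints.
    return {v: b / 2.0 for v, b in bc.items()}

# Hypothetical toy network: a star centred on "HUB" plus one extra edge A-B.
toy = {
    "HUB": {"A", "B", "C", "D"},
    "A": {"HUB", "B"},
    "B": {"HUB", "A"},
    "C": {"HUB"},
    "D": {"HUB"},
}
centrality = betweenness_centrality(toy)
degree = {v: len(neigh) for v, neigh in toy.items()}
print(centrality["HUB"], degree["HUB"])
```

In this toy graph every shortest path between two non-adjacent leaves runs through `HUB`, so it scores highest on both measures, which is the pattern the node size and label size in Fig. 1 are described as encoding.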

The main hub subnetworks in Fig. 1A and 1C were identified with the application of K-means clustering on the main networks shown in Supplementary Fig. S2 and S3, respectively, with network centralities measuring Degree, Betweenness-Centrality, Closeness-Centrality, Clustering-Coefficient, and Average-Shortest-Path. Network enrichment analysis of the subnetwork of hubs identified KEGG pathways involved in cancer development (Fig. 1B and D), supporting our findings with analysis of network centralities.
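As a rough illustration of clustering nodes by their centrality profiles, the sketch below runs Lloyd's K-means algorithm on standardised centrality vectors. The feature values, the node names, the choice of three features, and the use of z-score scaling with fixed initial centroids are all assumptions for the example; the text does not state how the features were scaled or how the clustering was initialised.

```python
import math

def zscore_columns(rows):
    """Standardise each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def kmeans(points, init_centers, n_iter=100):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centers = [list(c) for c in init_centers]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical centrality table: (degree, betweenness, closeness) per node.
nodes = ["g1", "g2", "g3", "g4", "g5", "g6"]
features = [
    [30, 900.0, 0.60], [28, 850.0, 0.58],   # hub-like nodes
    [12, 200.0, 0.40], [11, 180.0, 0.41],   # minor hubs
    [2, 5.0, 0.20], [3, 8.0, 0.22],         # peripheral nodes
]
scaled = zscore_columns(features)
labels, centers = kmeans(scaled, [scaled[0], scaled[2], scaled[4]])
print(dict(zip(nodes, labels)))
```

Standardising first keeps a large-valued measure such as betweenness from dominating the Euclidean distance; whether that matches the original analysis is not stated in the text.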

Figure 1. PPI subnetworks of hubs derived from subsets of network-related DMGs. A, main subnetwork of hubs obtained with the application of K-means clustering on the set of 285 network-related DMGs identified with NBEA [9] and NEAT [10] tests. The size of each node is proportional to its value of betweenness centrality and the label font size is proportional to its node degree. Node colors from light-green to red map the discrete scale of logarithm base 2 of fold changes in DMP numbers for the corresponding gene: light-green: [1, 2), cyan: [2, 3), blue: [3, 4), and red: 5 or more. Edge color is based on co-expression index from white (0.042) to red (0.842). B, enrichment analysis with Cytoscape [11] on KEGG pathway sets on network in A. C, main subnetwork of hubs obtained with the application of MCODE Cytoscape app and K-means clustering. D, enrichment analysis with Cytoscape [11] on KEGG pathway sets on the network in C.

K-means clustering split the network of 285 DMGs (Supplementary Fig. S2) into three clusters: i) the main subnetwork of hubs (46 DMGs, shown in Fig. 1A, Supplementary Table S1), ii) a subnetwork with minor hubs (101 DMGs, Supplementary Fig. S4, Supplementary Table S1), and iii) a cluster integrated by two subnetworks (139 DMGs, Supplementary Fig. S4, Supplementary Table S1). Results with MCODE Cytoscape app and K-means were consistent with those obtained by subsetting the whole set of DMGs via NBEA and NEAT [9,10] (Supplementary Fig. S3 and Supplementary Tables S2), with a notable enrichment of KEGG pathways associated with cancer development (Supplementary Fig. S5).

The scatter plots of network centrality measures (Fig. 2) suggest that the main subnetwork of hubs includes the most relevant network nodes/genes (in red) carrying the highest network centrality measurements. We noted a transition from a non-linear behavior, in clusters iii (nodes in blue) and ii (nodes in green), to a linear trend observed in cluster i (red points, Fig. 2). These analyses suggest that the subnetwork of hubs shown in Fig. 1C also involves genes with methylation signals that have a role in PALL development [14].

Figure 2. Scatter plots of network centrality measures. A general non-linear trend is notable for genes/nodes from clusters iii to ii, while the linear trend in cluster i can be visualized. The highest values of network centralities (degree, betweenness, centroid, stress, and radiality) are found in cluster i, which corresponds to the main subnetwork of hubs presented in Fig. 1B (consistent with the lowest values of average-shortest-path-length). Networks from clusters i, ii, and iii are shown in Supplementary Fig. S4.


Results of network enrichment analysis of DMG and DEG PPI networks built with STRING (Cytoscape) are shown in Fig. 3 (Supplementary Tables S1 and S2). The analyses indicate that DMG and DEG datasets targeted many of the same pathways, with an overlap of 80% (Fig. 3C). Pathways linked to cancer development and apoptosis are notable, and the KEGG "Pathways in cancer" pathway (hsa05200) showed pronounced enrichment, with more than 50 and 40 genes from the DMG and DEG datasets, respectively.

Figure 3. Network based enrichment analysis of protein-protein interaction (PPI) networks independently derived from DMGs and DEGs estimated in patients with PALL. A, PPI enriched network of DEGs with 15 or more genes. B, PPI enriched network of DMGs with 20 or more genes. C, Venn diagram with the overlapping of all PPI enriched networks of DMGs and DEGs with 7 or more genes. The PPI enriched network analysis was performed in STRING app on Cytoscape [11,12], and the analysis is limited to KEGG human pathways.

In the case of patients with PALL, enrichment for PI3K-Akt signaling pathway, MAPK signaling pathway, JAK-STAT signaling pathway, Wnt signaling pathway, and Focal adhesion (all included in KEGG pathway in cancer) was statistically significant for both DMG and DEG subsets. The Venn diagram shown in Fig. 3C implies a high level of concordance between the enriched KEGG pathways identified in PPI networks from DEGs and from DMGs.
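The overlap summarised by a Venn diagram can be quantified in more than one way, and the text does not say which definition underlies the reported 80%. The sketch below, under that caveat, computes two common choices (Jaccard index and overlap coefficient) for two sets of enriched pathways. The five shared pathway names come from the paragraph above; the two set-specific entries are hypothetical placeholders, and `overlap_summary` is an illustrative helper name.

```python
def overlap_summary(paths_a, paths_b):
    """Shared members plus two overlap measures for two pathway sets."""
    a, b = set(paths_a), set(paths_b)
    shared = a & b
    union = a | b
    return {
        "shared": sorted(shared),
        # Jaccard index: shared / all distinct pathways.
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Overlap coefficient: shared / size of the smaller set.
        "overlap_coefficient": (len(shared) / min(len(a), len(b))
                                if a and b else 0.0),
    }

shared_in_text = [
    "PI3K-Akt signaling pathway", "MAPK signaling pathway",
    "JAK-STAT signaling pathway", "Wnt signaling pathway", "Focal adhesion",
]
# Hypothetical set-specific entries, added only so the two sets differ.
dmg_pathways = shared_in_text + ["hypothetical DMG-only pathway"]
deg_pathways = shared_in_text + ["hypothetical DEG-only pathway"]

summary = overlap_summary(dmg_pathways, deg_pathways)
print(summary["jaccard"], summary["overlap_coefficient"])
```

The two measures can differ noticeably when the sets have unequal sizes, so a reported overlap percentage is only interpretable together with its definition.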


Figure 4. Graphical evaluation of the concordance between DEG and DMG enrichments on KEGG pathways. A, scatterplot of pathway ratings (see Eq. 1) from enriched pathways on the set of DMGs ($P_{DMG}$) and DEGs ($P_{DEG}$), respectively. The regression analysis shows the linear trend of the relationship $P_{DMG} > 0$ versus $P_{DEG} > 0$ (black dots). The identity dashed line (in blue) helps in gauging the degree of agreement between measurements [15]. Dots in red highlight pathways for which $P_{DMG} = 0$ or $P_{DEG} = 0$. B, Bland-Altman plot of the agreement, on targeting gene pathways, between responses from gene expression and methylation regulatory systems. The agreement between measurements can also be tested by values of Lin's concordance correlation coefficient ($\rho_{cc}$) and the Kendall coefficient of concordance ($\rho_{KC}$).

Figure 4 supports a strong concordance between the enriched KEGG pathways identified in PPI networks from DEGs and from DMGs. Bootstrap Bayesian estimation of Lin's concordance correlation coefficient ($\rho_{cc}$) yielded a value of $\rho_{cc} = 0.71$ with a confidence interval (C.I.) $0.52 \le \rho_{cc} \le 0.84$, and a Kendall coefficient of concordance $\rho_{KC} = 0.81$ (permutation p-value < 0.001). The linear regression analysis presented in Fig. 4A indicates a statistically significant linear relationship between the pathway score ($P_{DMG}$) of enriched KEGG pathways in the DMG PPI network (see definition at equation (1)) and the pathway score ($P_{DEG}$) of enriched KEGG pathways in the DEG PPI network. The proximity of most of the regression points (pairs of pathway scores) to the identity line (dashed line in blue) suggests significant agreement between methylation and gene expression regulatory systems, also indicated by a regression slope of 0.9. This concordance between gene expression and methylation is graphically corroborated by the Bland-Altman plot [15], where almost all the points are located between the mean − 2σ and mean + 2σ horizontal lines (Fig. 4B).
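For readers unfamiliar with the two agreement measures used here, the following minimal sketch computes the plain sample estimate of Lin's concordance correlation coefficient and the Bland-Altman agreement limits (mean difference ± 2 standard deviations). It does not reproduce the bootstrap Bayesian estimation reported in the text, and the paired scores are hypothetical values rather than the pathway scores of the study.

```python
import statistics

def lin_ccc(x, y):
    """Sample estimate of Lin's concordance correlation coefficient.

    rho_c = 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2),
    with variances and covariance computed with a 1/n divisor.
    """
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)

def bland_altman_limits(x, y, k=2.0):
    """Return (lower limit, mean difference, upper limit) for x - y."""
    diffs = [a - b for a, b in zip(x, y)]
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)
    return mean_d - k * sd_d, mean_d, mean_d + k * sd_d

# Hypothetical paired pathway scores (not values from the study).
score_dmg = [0.2, 0.5, 0.9, 1.4, 2.0]
score_deg = [0.3, 0.4, 1.0, 1.2, 2.1]

ccc = lin_ccc(score_dmg, score_deg)
lower, mean_diff, upper = bland_altman_limits(score_dmg, score_deg)
inside = sum(lower <= a - b <= upper for a, b in zip(score_dmg, score_deg))
print(round(ccc, 3), inside, "of", len(score_dmg), "pairs inside the limits")
```

Unlike the Pearson correlation, the concordance coefficient is penalised by a systematic shift between the two measurements, which is why it is read against the identity line rather than against any fitted line.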

DEG-DMG network analysis

NBEA and NEAT [9,10] were applied to identify network-related genes from the set of DEGs-DMGs (191 of 1774 genes). The PPI network of 191 DEGs-DMGs is shown in Supplementary Fig. S6 (Supplementary Table S2). Three clusters were detected by applying K-means clustering on the main PPI network of DEGs-DMGs; two of them integrated the subnetworks of hubs shown in Fig. 5B and D, while the third cluster gave rise to several subsets of subnetworks. Enrichments detected in the main PPI network of 191 DEGs-DMGs (Fig. 5A) and in the subnetworks (Fig. 5C and 5E, Supplementary Table S2) are consistent with previous results (Fig. 3) that focused i) only on the set of DMGs (not all of them DEGs, Fig. 3A) and ii) only on the set of DEGs (not all of them DMGs, Fig. 3B).

Group means of methylation level differences at each gene-body DMP for genes NOTCH1, CD44, and BCL2L1 (hubs from the DMGs-DEGs sub-network from Fig. 5B) are shown in Fig. 6A. NOTCH1 and CD44 have been reported to be epigenetically regulated [16–19] and, in particular, NOTCH1 has been proposed as a drug target for the treatment of T-cell acute lymphoblastic leukemia [17]. BCL2L1 is known to have roles in apoptosis and has been proposed as a drug target for cancer treatment [20]. Genes involved in activation of the mitogen-activated protein kinase (MAPK) pathway are frequently altered in cancer and have been proposed as drug targets as well [21].

Figure 5. Network enrichment for the network-related DEG-DMGs. A, bar-plots of the enriched KEGG pathways in the PPI-network of 191 DEG-DMGs (Supplementary Fig. S6). B and D, subnetworks integrated by gene-hubs identified with K-means clustering of the network from panel. C and E, bar-plots of the enriched KEGG pathways on the networks from panels B and D, respectively. In the networks, nodes with the same color belong to the same cluster obtained with K-Medoid clustering. Gene hubs were identified based on betweenness centrality and node degree, such that the size of each node is proportional to its value of betweenness centrality and the label font size is proportional to its node degree. Edge color is based on coexpression index from white (0.042) to red (0.938). The PPI network and the enrichment analyses were performed in STRING app on Cytoscape [11,12].


Figure 6. DEG-DMGs reported as cancer related gene (oncogene) lists. A, Group mean of methylation level differences at each cytosine identified in differentially methylated genes (DMGs). BCL2L1, CD44, MAP3K1 and NOTCH1 are linked to leukemia and other types of cancers. These genes were identified as “hubs” of PPI networks (Fig. 5B and D). Irregular distribution of methylation signal, hyper- and hypo-methylated, can be viewed. Traditional DMR-based approaches fail to detect these types of variation. Methylation level differences were computed for control and treatment individuals with respect to normal CD19+ methylome from four independent blood donors used as reference. This approach provides an estimation of the natural variability of methylation changes existing in the control population. B, Overlapping (≥500bp) between the differentially methylated enhancer regions (DMERs) and DEGs-DMGs. Although only 51 enhancers (DMERs) are activators of reported DEGs, the DMERs overlap with 159 DEGs-DMGs regions, from which 23 are reported oncogenes (see Methods). A total of 379 DEGs-DMGs are reported oncogenes.

Three members of the MAPK pathway are found in the DMG-DEG sub-network shown in Fig. 5D, and the DMP distribution on the MAP3K1 gene body is shown in Fig. 6A. In total, 379 identified DEG-DMGs have been reported as cancer-related genes (Fig. 6B).

Differentially methylated enhancer regions (DMERs)

Our initial analysis was limited to the methylation signal carried on gene-body regions. As suggested in Fig. 6, gene-associated methylation signal can also be present on genomic regions upstream and downstream of genes, including transcription enhancer regions [22]. Analysis of the methylation datasets identified 325 differentially methylated enhancer regions (DMERs).

Although only 51 enhancers from the 325 identified DMERs are activators of reported DEGs (Supplementary Table S2), the list of DEG-DMG regions covered by DMERs (by at least 500 bp) reaches a total of 159 (Fig. 6B), of which 23 were identified as oncogenes.

The top 29 genes with highest density variation of DMP number within enhancer regions are shown in Figure 7. Many of these genes have been reported to be associated with cancer development and were found in the sets of DMGs or DEGs. One example is the enhancer region influencing gene EPIDERMAL GROWTH FACTOR-LIKE DOMAIN 7 (EGFL7) and the microRNA MIR-126, both associated with cancer [23,24]. As shown in Figure 7B, MIR-126 resides within an intron of EGFL7 and their enhancer regions overlap.

Figure 7. DEGs with differentially methylated enhancer region. A, Top 29 genes with the highest density variation of DMP number (> 1.7 DMPs/kb) in the enhancer region. Bars in dark blue denote genes that have been reportedly associated with cancer development. B, Group mean of methylation level differences at each cytosine in the identified differentially methylated enhancer regions corresponding to the genes: SMARCA4, EGFL7, MIR126, NUDT1, and CDK9.


MIR-126 modulates vascular integrity and angiogenesis, and it has been reported that MIR-126 delivered via exosomes from endothelial cells promotes anti-tumor responses [25]. The hypomethylation pattern observed in the region spans a substantial part of gene AGPAT2, which was identified as a DMG and, although over-expressed in different types of cancer, was not reported as a DEG in the earlier PALL study [26]. AGPAT2 promotes survival and etoposide resistance of cancer cells under hypoxia [27].

Association between methylation and gene expression

Results to date suggest the existence of an association, or at least statistical inter-dependence, between methylation and gene expression. To investigate this association, density variations of the methylation signal were quantitatively expressed by different measurements: density of methylation level difference ($\Delta p_{density}$), density of total variation difference ($\Delta TV_{density}$), and $\Delta HD_{density}$ (see Methods section for variable descriptions). Gene expression was expressed as the absolute value of the logarithm base 2 of fold change, $|\log_2 FC|$.

The association between methylation and gene expression for the current study of patients with PALL is shown in Supplementary Fig. S7. This association is corroborated not only by a highly significant Spearman's rank correlation rho (p-value less than 0.001, Supplementary Fig. S7), but also by two-dimensional kernel density estimation (2D-KDE) and Farlie-Gumbel-Morgenstern (FGM) copula of joint probability distributions for each annotated pair of variables in the coordinate axes from the contour-plot plane (Supplementary Fig. S7).
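The rank correlation mentioned above can be sketched in a few lines, assuming paired per-gene values of a methylation density measure and of the absolute log2 fold change. The numbers below are hypothetical, and the sketch computes only the coefficient (as the Pearson correlation of average ranks), not the p-value reported in the text.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-gene values (not data from the study).
methylation_density = [0.1, 0.4, 0.2, 0.9, 0.7]
abs_log2_fc = [0.5, 1.2, 0.8, 3.0, 2.5]
rho = spearman_rho(methylation_density, abs_log2_fc)
print(rho)
```

Because it depends only on ranks, the coefficient captures any monotone association, which fits the later statement that larger methylation changes tend to accompany larger expression changes.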

Results indicate that methylation and gene expression show positive dependence. Roughly speaking, a bivariate distribution is considered to have a specific positive dependence property if larger values of either random variable are probabilistically associated with larger values of the other random variable 28. According to Lai 29, the FGM copulas shown in Supplementary Fig. S7 indicate CDM and gene expression to be positively quadrant dependent and positively regression dependent. In other words, if $X$ is the density of methylation level difference and $Y$ is the gene expression change, the regression $E(Y|X = x)$ is linear in $x$ 29. Thus, the conditional expected value of gene expression, given the density variation of the methylation signal $X$, is linear in $x$ (the possible values of $X$). This linear trend is observed with high joint probability in the outlined red regions of the contour plots (Supplementary Fig. S7).
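
As a reference point for the dependence properties cited above, the FGM copula and its regression function can be written in standard textbook form. The notation is an assumption of this sketch and is not taken from the source: $U$ and $V$ denote the probability-transformed (uniform) versions of the methylation and expression variables, and $\theta$ is the dependence parameter.

```latex
\begin{aligned}
C_\theta(u, v) &= u v \left[ 1 + \theta (1 - u)(1 - v) \right], \qquad \theta \in [-1, 1] \\
c_\theta(u, v) &= \frac{\partial^2 C_\theta(u, v)}{\partial u \, \partial v} = 1 + \theta (1 - 2u)(1 - 2v) \\
E(V \mid U = u) &= \int_0^1 v \, c_\theta(u, v) \, dv = \frac{1}{2} + \frac{\theta}{6} (2u - 1)
\end{aligned}
```

For $\theta > 0$, $C_\theta(u, v) \ge uv$ (positive quadrant dependence), and the regression function increases linearly in $u$ on this uniform scale.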

PC-score of DEG-DMGs

The identification of genes playing fundamental roles in cancer progression is limited by the availability of protein-protein interaction information in a database (STRING, in the current case). Consequently, results could be mostly populated with genes from network-associated diseases. To circumvent these possible limitations, principal component analysis (PCA) was applied to score genes according to their discriminatory power to discern the disease state from healthy. PCA was performed on the set of individuals, representing each in the 1775-dimensional space of DEG-DMGs, where each gene was represented by the density of an information divergence on the gene-body, which provides a normalized measurement of the intensity of the methylation signal. Two PC-scores were derived from two information divergences: 1) absolute difference of methylation levels and 2) Hellinger divergence.

The first principal component (PC1) was used to build the PC-scores for DMGs, since it carried 85% of the whole sample variance with an eigenvalue greater than 1 (Guttman-Kaiser criterion 30, see Methods). A list of the first 12 genes with top PPI-network PC-scores is presented in Table 1, indicating genes associated with cancer development and further confirming that, regardless of the approach followed, genes involved with cancer origin and progression are DEG-DMGs.

Discussion

Data from this study reflect non-random methylation repatterning that targets gene networks reportedly associated with cancer development and risk. The majority of DNA methylation changes fall within intergenic regions of the genome, and only 4795 (including non-coding) of the 57241 annotated human genes were identified as DMGs. This result suggests that in patients with PALL, the methylation machinery may selectively target specific genes. The methylation signal is observed not only within gene-body regions of DMGs, but also (and frequently with high intensity) in upstream and downstream domains.

Network analysis of DMGs identified several KEGG pathways and genes associated with cancer. Relevant genes were identified as network hubs and grouped into clusters of network hubs carrying the highest network centrality measurements (Fig. 1 and 5). Presumably, disruption of a network hub by altering the gene, or others that regulate the hub, could alter the entire gene network 14,31. Thus, identification of hubs offers candidate targets in the search for potential biomarkers. The strong linear trends observed in pairwise regressions between the centrality measurements (Fig. 2) in the main hub cluster (Fig. 1A) suggest that genes from the cluster are non-randomly targeted by the methylation regulatory machinery during PALL development 14.

Clusters of hubs integrating PPI subnetworks comprise the backbone of a network. The essentiality of gene hubs in preserving the integrity of the interacting network is quantitatively expressed in network centrality statistics. For subnetworks of hubs (Fig. 1 and 5), higher centrality values and linear relationships between the centrality statistics of the network hubs reflect a higher number of reported biologically meaningful associations between the hubs and the other genes in the subnetwork and the main network (Fig. 2).

Strong correspondence was seen in the network enrichment analyses derived from PPI networks in DMGs and DEGs (Fig. 3), supporting the non-random nature of methylation signals within protein-coding regions in signaling pathways linked to cancer development. Although not all DEGs are detected as DMGs and vice versa, the massive overlap of enriched KEGG pathways (Fig. 3) suggests a coordinated response of the methylation and gene-expression machineries. This concerted regulatory response was statistically supported by Lin's concordance correlation coefficient and the Kendall coefficient of concordance.

An example of the coordinated regulatory response of methylation and gene expression is seen in the case of the EGFR gene, identified as a hub in the DMG network (Fig. 1). EGFR is a tyrosine kinase that regulates autophagy via the PI3K/AKT1/mTOR, RAS/MAPK1/3 (enriched pathways shown in Fig. 3A and B, and in Fig. 5A and E), and STAT3 signaling pathways 32,33. Although EGFR was not a reported DEG, its activators, EPIDERMAL GROWTH FACTOR (EGF, Fig. 5B) and EGFL7, were identified as both DMGs and DEGs. EGFL7 is reported to be a key factor for the regulation of the EGFR signaling pathway 34. Additionally, EGFL7 is a secreted angiogenic factor that can result in pathologic angiogenesis and enhance tumor migration and invasion via the NOTCH signaling pathway 23 (a pathway enriched in the PPI-DMG network). The NOTCH pathway is a conserved intercellular signaling pathway that regulates interactions between physically adjacent cells. In the set of patients with PALL, NOTCH1 is reported as a DEG and DMG (Fig. 1A and 5B).

Another example of the gene network architecture of leukemia emerges by tracking up- and downstream interconnections of the genes PIK3CG (DEG-DMG) and PIK3CD (a DMG network hub, Fig. 1) from the PI3K/AKT signaling pathway (enriched in the set of DEG-DMGs, Fig. 5). Phosphatidylinositol-4,5-bisphosphate 3-kinase (PI3K) is a critical node in the B-cell receptor (BCR, a DEG-DMG) signaling pathway, and its isoforms PIK3CD and PIK3CG are involved in B-cell malignancy 35. Crosslinking CD19 with the BCR augments PI3K activation, and the VAV proteins VAV1 (DMG), VAV2 (DEG-DMG), and VAV (DEG-DMG) also contribute to PI3K activation downstream of BCR and related receptors 36. BCR and its downstream signaling pathways, including Ras/Raf/MAPK, JAK/STAT3, and PI3K/AKT (all enriched in PALL patients, Fig. 3 and 5), play important roles in malignant transformation of leukemia 37.

Our analysis also considered gene regulatory domains upstream and downstream of gene-body regions and, in particular, enhancer regions. The set of genes targeted by DMERs does not integrate into a PPI network, but these genes are found in signaling pathways or act as regulators of them. As in the previous analyses, enhancer methylation repatterning identifies genes known to be involved in cancer development (Fig. 6B). For example, SMARCA4 (Fig. 7) encodes an ATPase of the chromatin remodeling SWI/SNF complexes, is frequently found upregulated in tumors 38, and represents a DEG-DMG in patients with PALL. The product of this gene can bind BRCA1 (DEG-DMG) 39 and also regulates the expression of the tumorigenic protein CD44 (DEG-DMG) 40.

PPI networks are only models to identify highly interconnected players from the underlying web architecture of genes involved in a specific biological process. Thus, results from the application of more than one network model can complement one another, and different network models do not necessarily overlap 100% in the set of enriched pathways. Deriving subsets of the DEG-DMG dataset by applying MCODE clustering increased confidence in the previous results.


The integrative analyses of DMGs, DEGs, and the networks derived from them, as well as DMERs (graphically summarized in Fig. 1 to 7), provided consistent indication of a web of interacting genes in cancer development and an association between gene methylation repatterning and gene expression changes. This association was supported by Spearman's rank correlation rho and the bivariate FGM copula (Supplementary Fig. S7), which implies a linear dependence of the expected values of gene expression changes on methylation changes for the set of DEG-DMGs.

Our analysis uncovered a stochastic deterministic dependence relationship, where larger values of gene expression changes are probabilistically associated with larger values of methylation changes (in the whole set of 1772 DEG-DMGs). Within the set of DEG-DMGs, observed changes in gene expression were not statistically independent of the methylation changes, showing association with a significant linear trend (Supplementary Fig. S7). This result may indicate that the relationship between gene methylation repatterning and altered gene expression is also present at lower methylation density levels. Such a relationship can be overlooked with over-stringent filtering of methylome data. Three analytical approaches assist in discovering this association: i) signal detection for DMP identification, ii) GLM-based group comparison for DMG identification, and iii) copula modeling of stochastic dependence.

Our results demonstrate the potential of integrative network analysis of DMGs and DEGs for the identification of biologically relevant methylation biomarkers. Numerous clusters of interacting genes are detected in the subnetworks of hubs from PPI networks of DMGs and DEGs, a few of which are described here. More detailed analysis of these data has allowed us to propose three factors likely to be important to biomarker identification. A potential biomarker must 1) be a DMG or a DEG-DMG with one or more well-defined differential methylation pattern(s) on the gene-body, or upstream or downstream of the gene-body; 2) integrate one or more gene pathways that are biologically relevant for leukemia and, simultaneously, show enrichment in the PPI networks of DMGs and DEGs; and 3) represent a hub or be biologically connected to a relevant hub. Genes meeting these guidelines integrate the subnetworks of hubs shown in Figs. 1B and 4C-D, and the list of potential biomarkers can be extended using the information provided in the Supplementary Tables S1 and S2.

Intersection of the identified networks with available data from independent studies of cancer further supports the potential of our approach for identifying robust disease biomarkers. However, while intersection of methylome and gene expression data with cancer-relevant gene networks is compelling, we cannot eliminate the possibility that these outcomes may be influenced by the relative abundance of cancer-related networks within the various databases currently available. To help circumvent this limitation, we proposed ranking the DEG-DMGs based on their discriminatory power to discern disease state from healthy.

Potential biomarkers can be scored with the application of PCA (Table 1 and Supplementary Table S2). In this study, the first PC was sufficient to build a PC-score of DEG-DMGs based on gene-body methylation signal intensity. PC-scores identify cancer-related genes not identified by the PPI network approach, although not all relevant genes were identifiable, e.g., NOTCH1. Within a long gene like NOTCH1, the non-homogeneous distribution of the gene-body methylation signal (Fig. 6A) can result in an apparently low density of methylation signal globally, even when the signal is high locally. Nevertheless, the PC-score provides an acceptable complement to the PPI network approach. Results obtained with the approach proposed here support its application to the identification of reliable and stable biomarkers for cancer diagnosis and prognosis. Lists of genes relevant as biomarker candidates for leukemia (several of which have already been proposed as biomarkers by others) are provided in the Supplementary Tables online.

Materials and Methods

Methylation and gene expression datasets

The datasets of genome-wide methylated and unmethylated read counts (for each cytosine site) from normal CD19+ blood cell donors (NB) and from patients with pediatric acute lymphoblastic leukemia (PALL) were downloaded from the Gene Expression Omnibus (GEO) database. DMPs were estimated for controls (NB, GEO accessions GSM1978783 to GSM1978786) and for patients (ALL cells, GEO accessions GSM1978759 to GSM1978761) relative to a reference group of four independent normal CD19+ blood cell donors (GEO accessions GSM1978787 to GSM1978790). The datasets of DEGs from the group of patients with PALL were taken from the Supplementary information provided in the previous study 7.

A list of 2,579 cancer-related genes compiled by the Bushman Lab (http://www.bushmanlab.org/links/genelists) was used to identify oncogenes among the DEG-DMGs.

Methylation analysis

Methylation analysis was performed using our in-house pipeline Methyl-IT version 0.3.1 (an R package available at https://git.psu.edu/genomath/MethylIT). Estimation of differentially methylated positions (DMPs) is consistent with the classical approach using Fisher's exact test, except for a further application of signal detection (see examples of methylation analysis with MethylIT at https://github.com/genomaths/MethylIT). The need for signal detection in cancer research was pointed out decades ago 41. Here, signal detection was applied according to standard practice in current implementations of clinical diagnostic tests 42–44. That is, optimal cutoff values of the methylation signal were estimated on the receiver operating characteristic (ROC) curves based on the Youden index 42 and applied to identify DMPs. The decision of whether a DMP is detected by Fisher's exact test (or any other statistical test implemented in Methyl-IT) is based on the optimal cutoff value 43.
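
The cutoff selection described above can be sketched as follows. This is a minimal illustration, not the Methyl-IT implementation: it assumes one numeric signal value per cytosine site and a binary label (1 = patient, 0 = control), and the function name `youden_cutoff` and the toy values are hypothetical.

```python
import numpy as np

def youden_cutoff(signal, label):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A site is called positive when signal >= cutoff.
    """
    signal = np.asarray(signal, dtype=float)
    label = np.asarray(label, dtype=bool)
    best_cut, best_j = None, -np.inf
    for cut in np.unique(signal):            # candidate cutoffs: observed values
        called = signal >= cut
        sens = np.mean(called[label])        # true positive rate
        spec = np.mean(~called[~label])      # true negative rate
        j = sens + spec - 1.0
        if j > best_j:                       # keep the first maximum found
            best_cut, best_j = cut, j
    return float(best_cut), float(best_j)

# Toy data (hypothetical): lower signal in controls, higher in patients
signal = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
label = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, j = youden_cutoff(signal, label)
```

Sites with a signal at or above the returned cutoff would then be retained as DMP candidates; ties in J are resolved here in favor of the smallest cutoff, which is a design choice of the sketch.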

Estimation of differentially methylated regions (DMRs). Regression analysis with generalized linear models (GLMs) with logarithmic link, implemented in the MethylIT function countTest, was applied to test the difference between groups in the number of DMPs (counts) at specified genomic regions, regardless of the direction of the methylation change. Here, the concept of DMR is generalized and is not limited to any specific genomic region found with a specific clustering algorithm. It can be applied to any naturally or algorithmically defined genomic region. For example, an exon region identified statistically as differentially methylated by using a GLM is a DMR. In particular, a DMR spanning a whole gene-body region is called a DMG. DMGs were estimated from group comparisons of the number of DMPs on gene-body regions between control (CD19+ blood cell donors) and ALL cells based on generalized linear regression.

The model-fitting algorithms provided by the functions glm and glm.nb from the R packages stats and MASS, respectively, were used for Poisson (PR), Quasi-Poisson (QPR), and Negative Binomial (NBR) regression analyses. These algorithms are implemented in the Methyl-IT functions countTest and countTest2, which differ only in how the weights used in the GLM with NBR are estimated. The following countTest parameters were used: minimum DMP count per individual (8 DMPs), test P-value from a likelihood ratio test (test = “LRT”), P-value adjustment method (Benjamini & Hochberg 45), cutoff for P-value (α = 0.05), and log2 fold change for the difference between group mean DMP numbers greater than 1.
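
To make the group comparison concrete, the following sketch tests DMP counts in a single region under the simplest of the models named above: a Poisson GLM with log link and one two-level group factor. For that model the fitted means equal the group means, so the likelihood ratio test has a closed form. The function `poisson_lrt` and the toy counts are hypothetical; this is not the countTest implementation, and the Quasi-Poisson and Negative Binomial fits, the minimum-count filter, and the multiple-testing adjustment are omitted.

```python
import math

def poisson_lrt(control, treatment):
    """Likelihood ratio test for a two-group Poisson model with log link.

    Returns (log2 fold change of group means, LRT statistic, p-value).
    """
    def loglik(counts, mu):
        # Poisson log-likelihood without the -log(y!) term, which cancels in the LRT
        return sum(y * math.log(mu) - mu for y in counts) if mu > 0 else 0.0

    control, treatment = list(control), list(treatment)
    pooled = control + treatment
    mu_c = sum(control) / len(control)
    mu_t = sum(treatment) / len(treatment)
    mu_0 = sum(pooled) / len(pooled)

    ll_full = loglik(control, mu_c) + loglik(treatment, mu_t)
    ll_null = loglik(pooled, mu_0)
    stat = max(0.0, 2.0 * (ll_full - ll_null))
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2.0))
    log2fc = math.log2(mu_t / mu_c) if mu_c > 0 and mu_t > 0 else float("inf")
    return log2fc, stat, p_value

# Toy DMP counts per individual on one gene-body (hypothetical values)
log2fc, stat, p_value = poisson_lrt([2, 3, 1, 2], [12, 15, 9])
is_candidate = p_value < 0.05 and abs(log2fc) > 1
```

In a genome-wide analysis the p-values from all regions would be adjusted (e.g. Benjamini & Hochberg) before thresholding.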


The methylation analysis of genomic regions to identify differentially methylated enhancer regions (DMERs) was performed on a set of enhancers reported by Hnisz et al. 46. Usually, the size of the genomic region covered by an enhancer varies depending on the tissue type. In our current case, for each enhancer we analyzed the maximum region spanning all reported sizes for different tissues.

Network analysis

Protein-protein interaction (PPI) networks were built with the STRING app of Cytoscape 11,12. Network analyses were conducted in Cytoscape. When the number of genes exceeded 100 for network analysis, biologically meaningful web connections were difficult to visualize. Biologically relevant subsets of genes were obtained from the whole set of genes (DMGs, DEGs, or DEG-DMGs) by using the R packages NBEA and NEAT 9,10. Alternatively, the Cytoscape app MCODE was used for subsetting an entire network 47. PPI subnetworks from four network modules identified with MCODE are shown. The MCODE parameters were degree cutoff: 10, node density cutoff: 0.01, node score cutoff: 0.2, K-core: 10, and max. depth: 100. The K-means clustering algorithm was applied to each subnetwork to obtain subnetworks of hubs using the following node attributes for clustering: betweenness centrality, degree, closeness centrality, and clustering coefficient.

Network hubs were identified based on betweenness centrality and node degree, where the size of each node (in the PPI network) is proportional to its betweenness centrality and the label font size is proportional to its node degree. Network enrichment analysis in KEGG pathways follows each graphic subnetwork.
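
For illustration only, the two hub criteria named above (node degree and betweenness centrality) can be computed for a small undirected, unweighted graph as sketched below. The source obtained these statistics in Cytoscape; the function `degree_and_betweenness`, its Brandes-style accumulation, and the toy edge list with placeholder node names are assumptions of this sketch.

```python
from collections import deque

def degree_and_betweenness(edges):
    """Return (degree, betweenness) dicts for an undirected, unweighted graph.

    Betweenness is unnormalized: the number of shortest paths between other
    node pairs passing through a node (fractional when paths are tied).
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    betweenness = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        dist = dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate path dependencies from the farthest nodes
        delta = dict.fromkeys(adj, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                betweenness[w] += delta[w]

    # Each unordered pair was counted from both endpoints
    betweenness = {v: b / 2.0 for v, b in betweenness.items()}
    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    return degree, betweenness

# Toy network with placeholder node names (not genes from the study)
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("c", "d")]
degree, betweenness = degree_and_betweenness(edges)
```

Nodes that rank high on both measures would be the hub candidates in this simplified setting.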


Concordance test for DEG and DMG enrichments on KEGG pathways

The concordance between DEG and DMG enrichments on KEGG pathways, derived from the PPI network via the STRING app in Cytoscape, was evaluated with Lin's concordance correlation coefficient ($\rho_{cc}$) and the Kendall coefficient of concordance ($\rho_{KC}$). The R package agRee was used for the bootstrap Bayesian estimation of the $\rho_{cc}$ point value and confidence interval 48, while the R package vegan was used to compute $\rho_{KC}$ through a permutation test 49.
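
As a point of reference, the plug-in (sample) form of Lin's concordance correlation coefficient can be sketched as below. This is only the point estimate with 1/n moments; it does not reproduce the bootstrap or Bayesian interval estimation performed with agRee, and the function name `lin_ccc` is hypothetical. In the setting of this section, `x` and `y` would hold the DEG-based and DMG-based scores of the same pathways.

```python
import numpy as np

def lin_ccc(x, y):
    """Plug-in estimate of Lin's concordance correlation coefficient."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    cov_xy = np.mean((x - mean_x) * (y - mean_y))   # 1/n covariance
    # Penalizes both scatter around the fit and any shift between the means
    return float(2.0 * cov_xy / (x.var() + y.var() + (mean_x - mean_y) ** 2))
```

Unlike a correlation coefficient, the value drops below 1 when the two score vectors are perfectly correlated but shifted relative to each other.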

To perform the concordance test, a score was assigned to each enriched KEGG pathway from DEGs and DMGs based on the number of genes in the pathway and on the statistical significance of its enrichment, given by its FDR p-value. Only pathways with FDR p-value less than 0.0004 were considered. A new variable, statistical significance ($sig$), was defined according to the scale $sig = 1, 2, 3$ for p-values in the intervals $(10^{-5}, 10^{-4})$, $(10^{-6}, 10^{-5})$, and $(0, 10^{-6})$, respectively. The value $sig = 0$ was assigned to pathways not enriched in one of the groups, DEGs or DMGs. For example, the Phosphatidylinositol signaling system was not enriched in the set of PPI-DMGs and, consequently, $sig_{DMG} = 0$, but it was enriched in the set of PPI-DEGs with $sig_{DEG} = 3$. Then, a new variable, named pathway score, was defined according to the formula:

$$P = (\#\ \text{of genes in pathway}) \times sig \qquad (1)$$

We use the notation $P_k^i$ to indicate that the rating was performed for pathway $i$ identified on the gene set $k$ ($k$ = DMGs, DEGs). That is, the pathway score $P$ carries information not only on how many genes are found in each pathway but also on the statistical significance of the enrichment. The estimated values of $P_{DEG}^i$ and $P_{DMG}^i$ for each enriched pathway $i$ (from the DEG and DMG sets, respectively) were used in the concordance tests and in the Bland-Altman plot (Fig. 3E).
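
The scoring rule in Eq. (1) can be sketched as follows. The function names are hypothetical, and two points are assumptions of this sketch rather than statements from the source: interval boundaries are assigned to the lower significance level, and a pathway that is not enriched in a gene set is represented by `None`.

```python
def significance_level(fdr):
    """Map an FDR p-value to sig = 0, 1, 2, or 3 on the scale given in the text."""
    if fdr is None or fdr >= 1e-4:
        return 0        # not enriched in this gene set (or outside the scale)
    if fdr >= 1e-5:
        return 1
    if fdr >= 1e-6:
        return 2
    return 3

def pathway_score(n_genes, fdr):
    """Pathway score P = (number of genes in pathway) x sig, as in Eq. (1)."""
    return n_genes * significance_level(fdr)

# Toy example (hypothetical numbers): one pathway scored in both gene sets
score_deg = pathway_score(20, 1e-8)    # strongly enriched among DEGs
score_dmg = pathway_score(20, None)    # not enriched among DMGs
```

The paired scores, one pair per pathway, are the inputs to the concordance statistics.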


Stochastic association between methylation and gene expression

To investigate such an association, the methylation densities of gene regions simultaneously identified as DEGs and DMGs were expressed in terms of different magnitudes: 1) $p_{density}^{i}$, density of methylation levels ($i$: control or patients); 2) $TV_{density}^{i}$, density of the difference of methylation levels between each group (control or patients) and an independent group of four healthy individuals (reference group); 3) $TVD_{density}^{i}$, $TV$ with Bayesian correction; and 4) $HD_{density}^{i}$, density of Hellinger divergence, where $i$ denotes the group mean, control or patient. The density in 1000 bp of a variable $X$ at a given gene region was defined as the sum of the magnitude $X$ divided by the length of the region and multiplied by 1000. The differences of methylation densities between control and patient groups were estimated as the absolute difference of methylation levels $|\Delta X_{density}| = |X_{density}^{control} - X_{density}^{patient}|$, where $X_{density}^{i}$ represents one of the mentioned variables. The Methyl-IT R package provides the functions to obtain all variables mentioned here (https://github.com/genomaths/MethylIT and https://github.com/genomaths/MethylIT.utils).
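
The density definition above amounts to a short calculation, sketched here with hypothetical function names and toy values (this is not the Methyl-IT code).

```python
def density_per_kb(values, region_length_bp):
    """Sum of a magnitude over a region, per 1000 bp of region length."""
    return 1000.0 * sum(values) / region_length_bp

def abs_density_difference(control_values, patient_values, region_length_bp):
    """Absolute difference between control and patient densities for one region."""
    return abs(density_per_kb(control_values, region_length_bp)
               - density_per_kb(patient_values, region_length_bp))

# Toy gene-body of 2000 bp with per-site values (hypothetical)
delta = abs_density_difference([0.25], [0.5, 0.25], 2000)
```

Because the sum is divided by the full region length, a long gene with a locally concentrated signal yields a low density.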

Spearman's rank correlation $\rho$ (rho) was estimated to evaluate the association between the pairs of variables $|\Delta \log_2 FC|$ versus $|\Delta p_{density}|$, $|\Delta TV_{density}|$, $|\Delta TVD_{density}|$, and $|\Delta HD_{density}|$. Since correlation analysis only measures the degree of dependence (mainly linear) but does not clearly reveal the structure of dependence, we further investigated the structural dependence between these variables with the Farlie-Gumbel-Morgenstern (FGM) copula. FGM copula model estimation was performed with the R package copula 50.
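
Spearman's rho is the Pearson correlation of the ranks, which the following sketch makes explicit. The helper names are hypothetical, ties receive average ranks, and no p-value is computed here.

```python
import numpy as np

def rank_average(x):
    """Ranks starting at 1, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tie = x == value
        ranks[tie] = ranks[tie].mean()
    return ranks

def spearman_rho(x, y):
    """Spearman's rank correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(rank_average(x), rank_average(y))[0, 1])
```

Any monotone increasing relationship between the two variables gives a value of 1, which is why a copula model is needed to describe the shape of the dependence.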


Principal component analysis (PCA)

PCA is a standard statistical procedure to reduce data dimensionality, to represent the set of DMGs by new orthogonal (uncorrelated) variables, the principal components (PCs) 51, and to identify the variables with the main contribution to the PCs carrying most of the sample variance. Herein, a PC-based score (PC-score) was built by ranking the DEG-DMGs based on their discriminatory power to discern between the disease state and healthy. Each individual was represented as a vector in the 1775-dimensional space of DEG-DMGs. Two PC-scores were estimated: the first based on the density of Hellinger divergence on the gene-body and the second based on the density of the absolute value of the methylation level difference. The density of a magnitude $x$ is defined as the sum of $x$ at each DMP divided by the gene width (in base pairs). The first principal component (PC1) was used to build a PC-based score for the DEG-DMG set, since it had an eigenvalue (variance) greater than 1 and carried more than 85% of the whole sample variance (Guttman-Kaiser criterion 30). The PC-score was built using the absolute values of the coefficients (loadings) in PC1 for each variable (gene). Since the sum of the squared variable loadings over a principal component is equal to 1, the squared loadings give the relative contribution of each variable to the given principal component. Thus, the greater the PC-score value, the greater the discriminatory power carried by the gene.
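
The construction of a PC1-based score can be sketched with a singular value decomposition, as below. This is a simplified stand-in for the pcaLDA/prcomp workflow: it assumes centered but unscaled variables, an individuals-by-genes input matrix, and the hypothetical function name `pc1_scores`; the toy matrix is illustrative only.

```python
import numpy as np

def pc1_scores(data):
    """Return (absolute PC1 loadings per gene, fraction of variance carried by PC1).

    data: 2D array-like with individuals in rows and genes in columns.
    """
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are unit-norm loading vectors; s**2 is proportional to variance
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    return np.abs(vt[0]), float(explained[0])

# Toy matrix (hypothetical): 4 individuals x 3 genes; gene 0 separates two groups
toy = [[1, 2, 0], [1, 2, 1], [9, 2, 0], [9, 2, 1]]
scores, pc1_fraction = pc1_scores(toy)
ranking = np.argsort(-scores)   # genes ordered by decreasing PC-score
```

In the toy matrix the first gene receives the highest score because it carries nearly all of the between-individual variance.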

The density of HD on the gene-body was computed with the MethylIT function getGRegionsStat, and the principal components with the function pcaLDA, which applies PCA by calling the function prcomp from the R package ‘stats’.


Acknowledgements

We wish to thank Dr. Xiaodong Yang and Thomas Maher for valuable discussions during the development of these studies. This study was supported by a grant from the Bill and Melinda Gates Foundation (OPP1088661) to S.A.M.

Author contributions

R.S. designed experiments and conducted mathematical and statistical analyses. S.M. assessed experiments and edited the manuscript.

Competing interests

The authors declare no competing interests.

Data availability

All the methylome datasets and software used in this work are publicly available. The MethylIT R package used in the DMP and DMG estimations, as well as several examples on how to use Methyl-IT, are available at GitHub: https://github.com/genomaths/MethylIT. The datasets supporting the conclusions of this report are included within the Supplementary material.

562

References 563

1. Suresh, N. T. & Ashok, S. Comparative Strategy for the Statistical & Network based 564

Analysis of Biological Networks. Procedia Comput. Sci. 143, 165–180 (2018). 565

2. Hogan, L. E. et al. Integrated genomic analysis of relapsed childhood acute lymphoblastic 566

leukemia reveals therapeutic strategies. Blood 118, 5218–26 (2011). 567

3. Nordlund, J. et al. Genome-wide signatures of differential DNA methylation in pediatric 568

.CC-BY-NC-ND 4.0 International licenseacertified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under

The copyright holder for this preprint (which was notthis version posted June 3, 2019. ; https://doi.org/10.1101/658948doi: bioRxiv preprint

Page 28: certified by peer review) is the author/funder. It is made available … · 11 . Pennsylvania State University 12 . University Park, PA 16802 13 Email: rus547@psu.edu 14 . 15 . Sally

28

acute lymphoblastic leukemia. Genome Biol. 14, r105 (2013). 569

4. Chatterton, Z. et al. Epigenetic deregulation in pediatric acute lymphoblastic leukemia. Epigenetics 9, 459–67 (2014).

5. Nordlund, J. & Syvänen, A. C. Epigenetics in pediatric acute lymphoblastic leukemia. Semin. Cancer Biol. 51, 129–138 (2018).

6. Rahmani, M., Talebi, M., Hagh, M. F., Feizi, A. A. H. & Solali, S. Aberrant DNA methylation of key genes and Acute Lymphoblastic Leukemia. Biomedicine and Pharmacotherapy 97, 1493–1500 (2018).

7. Wahlberg, P. et al. DNA methylome analysis of acute lymphoblastic leukemia cells reveals stochastic de novo DNA methylation in CpG islands. Epigenomics 8, 1367–1387 (2016).

8. Tate, J. G. et al. COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res. (2018). doi:10.1093/nar/gky1015

9. Geistlinger, L. EnrichmentBrowser: Seamless navigation through combined results of set-based and network-based enrichment analysis. R package version 2.1.0. 1–15 (2015).

10. Signorelli, M. et al. NEAT: an efficient network enrichment analysis test. BMC Bioinformatics 17, 352 (2016).

11. Shannon, P. et al. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).

12. Szklarczyk, D. et al. The STRING database in 2017: Quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 45, D362–D368 (2017).

13. Jalili, M. et al. Evolution of centrality measurements for the detection of essential proteins in biological networks. Frontiers in Physiology 7, 375 (2016).

14. Pavlopoulos, G. A. et al. Using graph theory to analyze biological networks. BioData Mining 4, 10 (2011).

15. Bland, J. M. & Altman, D. G. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 327, 307–310 (1986).

16. Huang, Y.-C. C. et al. Epigenetic regulation of NOTCH1 and NOTCH3 by KMT2A inhibits glioma proliferation. Oncotarget 5, 63110–63120 (2017).

17. Waibel, M. et al. Epigenetic targeting of Notch1-driven transcription using the HDACi panobinostat is a potential therapy against T-cell acute lymphoblastic leukemia. Leukemia 32, 237–241 (2018).

18. Eberth, S. et al. Epigenetic regulation of CD44 in Hodgkin and non-Hodgkin lymphoma. BMC Cancer 10, 517 (2010).

19. Müller, I., Wischnewski, F., Pantel, K. & Schwarzenbach, H. Promoter- and cell-specific epigenetic regulation of CD44, Cyclin D2, GLIPR1 and PTEN by Methyl-CpG binding proteins and histone modifications. BMC Cancer 10, 297 (2010).

20. Chu, L. H. & Chen, B. Sen. Construction of a cancer-perturbed protein-protein interaction network for discovery of apoptosis drug targets. BMC Syst. Biol. 2, 56 (2008).

21. Xue, Z. et al. MAP3K1 and MAP2K4 mutations are associated with sensitivity to MEK inhibitors in multiple cancer models. Cell Res. 28, 719–729 (2018).

22. Lou, S. K. et al. Whole-genome bisulfite sequencing of multiple individuals reveals complementary roles of promoter and gene body methylation in transcriptional regulation. Genome Biol. 15, (2014).

23. Wang, J. et al. EGFL7 participates in regulating biological behavior of growth hormone–secreting pituitary adenomas via Notch2/DLL3 signaling pathway. Tumor Biol. 39, 1010428317706203 (2017).

24. Yang, C. et al. Increased expression of epidermal growth factor-like domain-containing protein 7 is predictive of poor prognosis in patients with hepatocellular carcinoma. J. Cancer Res. Ther. 14, 867–872 (2018).

25. Tomasetti, M. et al. MiR-126 in intestinal-type sinonasal adenocarcinomas: exosomal transfer of MiR-126 promotes anti-tumour responses. BMC Cancer 18, 896 (2018).

26. Song, L. et al. Silencing LPAATβ inhibits tumor growth of cisplatin-resistant human osteosarcoma in vivo and in vitro. Int. J. Oncol. 50, 535–544 (2017).

27. Triantafyllou, E.-A., Georgatsou, E., Mylonis, I., Simos, G. & Paraskeva, E. Expression of AGPAT2, an enzyme involved in the glycerophospholipid/triacylglycerol biosynthesis pathway, is directly regulated by HIF-1 and promotes survival and etoposide resistance of cancer cells under hypoxia. Biochim. Biophys. Acta - Mol. Cell Biol. Lipids 1863, 1142–1152 (2018).

28. Kimeldorf, G. & Sampson, A. R. A framework for positive dependence. Ann. Inst. Stat. Math. 41, 31–45 (1989).

29. Lai, C. D. Morgenstern’s bivariate distribution and its application to point processes. J. Math. Anal. Appl. 65, 247–256 (1978).

30. Jackson, D. A. Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical Approaches. Ecology 74, 2204–2214 (1993).

31. Zotenko, E., Mestre, J., O’Leary, D. P. & Przytycka, T. M. Why do hubs in the yeast protein interaction network tend to be essential: Reexamining the connection between the network topology and essentiality. PLoS Comput. Biol. 4, (2008).


32. Li, H., You, L., Xie, J., Pan, H. & Han, W. The roles of subcellularly located EGFR in autophagy. Cell. Signal. 35, 223–230 (2017).

33. Sooro, M. A., Zhang, N. & Zhang, P. Targeting EGFR-mediated autophagy as a potential strategy for cancer therapy. Int. J. Cancer 143, 2116–2125 (2018).

34. Liu, Q. et al. Role of EGFL7/EGFR-signaling pathway in migration and invasion of growth hormone-producing pituitary adenomas. Sci. China Life Sci. 61, 893–901 (2018).

35. Piddock, R. E. et al. PI3Kδ and PI3Kγ isoforms have distinct functions in regulating pro-tumoural signalling in the multiple myeloma microenvironment. Blood Cancer J. 7, e539–e539 (2017).

36. Deane, J. A. & Fruman, D. A. PHOSPHOINOSITIDE 3-KINASE: Diverse Roles in Immune Cell Activation. Annu. Rev. Immunol. 22, 563–598 (2004).

37. Burger, J. A. & Wiestner, A. Targeting B cell receptor signalling in cancer: preclinical and clinical advances. Nat. Rev. Cancer 18, 148–167 (2018).

38. Guerrero-Martínez, J. A. & Reyes, J. C. High expression of SMARCA4 or SMARCA2 is frequently associated with an opposite prognosis in cancer. Sci. Rep. 8, 2043 (2018).

39. Hill, D. A., De La Serna, I. L., Veal, T. M. & Imbalzano, A. N. BRCA1 interacts with dominant negative SWI/SNF enzymes without affecting homologous recombination or radiation-induced gene activation of p21 or Mdm2. J. Cell. Biochem. 91, 987–998 (2004).

40. Strobeck, M. W. et al. The BRG-1 Subunit of the SWI/SNF Complex Regulates CD44 Expression. J. Biol. Chem. 276, 9273–9278 (2001).

41. Youden, W. J. Index for rating diagnostic tests. Cancer 3, 32–35 (1950).

42. Carter, J. V., Pan, J., Rai, S. N. & Galandiuk, S. ROC-ing along: Evaluation and interpretation of receiver operating characteristic curves. Surgery 159, 1638–1645 (2016).


43. López-Ratón, M., Rodríguez-Álvarez, M. X., Cadarso-Suárez, C., Gude-Sampedro, F. & others. OptimalCutpoints: an R package for selecting optimal cutpoints in diagnostic tests. J. Stat. Softw. 61, 1–36 (2014).

44. Hippenstiel, R. D. Detection theory: applications and digital signal processing. (CRC Press, 2001).

45. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995).

46. Hnisz, D. et al. Super-enhancers in the control of cell identity and disease. Cell 155, 934–47 (2013).

47. Bader, G. D. & Hogue, C. W. V. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4, 2 (2003).

48. Feng, D., Baumgartner, R. & Svetnik, V. A Bayesian framework for estimating the concordance correlation coefficient using skew-elliptical distributions. Int. J. Biostat. 14, (2018).

49. Oksanen, J. et al. vegan: Community Ecology Package. (2018).

50. Yan, J. Enjoy the Joy of Copulas: With a Package copula. J. Stat. Softw. 21, 1–21 (2007).

51. Stevens, J. P. Applied Multivariate Statistics for the Social Sciences. (Routledge Academic, 2009).


Tables

Table 1. First 12 genes with the top PC-score based on density of methylation level differences and density of Hellinger divergences*.

Density of meth. level differences                  Density of Hellinger divergence
Gene      PC-score   Signal density variation†      Gene      PC-score   Signal density variation
COX8C     53.23      23.30                          COX8C     55.10      23.30
MSC       27.02      10.50                          MSC       22.14      10.50
MPEG1     16.11       8.87                          MPEG1     17.36       8.87
P2RY1     15.47       5.80                          BLACE     12.97       6.37
CLEC11A   15.20       6.60                          CTGF      11.96       3.75
BLACE     13.20       6.37                          UHRF1     11.26       5.26
UHRF1     12.08       5.26                          P2RY1     11.02       5.80
EGFL7     11.95       5.64                          CMTM2      9.52       3.68
ID4       11.80       5.15                          CXCR5      9.34       4.63
CDK5R1     9.50       6.76                          ID4        9.31       5.15
CTGF       9.13       3.75                          DDIT4L     8.77       2.65

*The entire table and details are given in Supplementary Table S2.
†Signal density variation for each gene is given in the output of the MethylIT function countTest2. This is the group mean difference of the normalized number of DMPs in 1 kb.
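As a rough illustration of the footnote's definition, the signal density variation can be sketched as a difference of group means of per-kilobase DMP counts. This is a minimal sketch under stated assumptions, not the countTest2 implementation: the function names, the patient-minus-control direction, the normalization by gene length, and the counts are all hypothetical.

```python
from statistics import mean


def dmp_density_per_kb(dmp_count, gene_length_bp):
    """Number of DMPs per 1 kb of gene length (assumed normalization)."""
    if gene_length_bp <= 0:
        raise ValueError("gene_length_bp must be positive")
    return 1000.0 * dmp_count / gene_length_bp


def signal_density_variation(control_counts, patient_counts, gene_length_bp):
    """Group mean difference of DMP density per kb.

    The direction (patients minus controls) is an assumption made for
    this sketch; it is not taken from the MethylIT documentation.
    """
    control = [dmp_density_per_kb(c, gene_length_bp) for c in control_counts]
    patient = [dmp_density_per_kb(c, gene_length_bp) for c in patient_counts]
    return mean(patient) - mean(control)


# Made-up DMP counts for a hypothetical 2 kb gene:
# four control samples and three patient samples.
variation = signal_density_variation([1, 0, 2, 1], [20, 24, 22], 2000)
print(variation)
```

A gene with many DMPs concentrated in a short region yields a large value, which is the sense in which the short genes at the top of Table 1 can rank highly.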


Supplementary Figures

Supplementary Figure S1. Distribution of methylation changes on chromosomes and gene bodies. A, Distribution of methylation changes at DMP positions on selected chromosomes as viewed within a genome browser. B, Boxplot of the means of methylation levels on chromosomes and at genes. In all cases, patient (P) data are in blue and control (C) data are in green.

Supplementary Figure S2. PPI networks built on the subset of 285 network-related DMGs. The size of each node is proportional to its betweenness centrality and the label font size is proportional to its node degree. Node colors from light-green to red map the discrete scale of the logarithm base 2 of fold change in DMP number for the corresponding gene: light-green: [1, 2), cyan: [2, 3), blue: [3, 4), and red: 5 or more. B, a subnetwork with minor hubs (101 DMGs). C, a cluster (139 DMGs) comprising two subnetworks.
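The two node statistics used for the visual encoding in Supplementary Figure S2 can be sketched on a toy graph. This is a minimal illustration, assuming an undirected, unweighted network; it uses Brandes' algorithm for betweenness rather than Cytoscape itself, and the node labels, edges, and size range are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    """Symmetric adjacency sets for an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def node_degree(adj):
    """Number of neighbours of each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm, BFS version)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}    # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                   # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair is counted from both endpoints, so halve the totals.
    return {v: b / 2.0 for v, b in bc.items()}


def scale(values, lo, hi):
    """Map values linearly onto [lo, hi] (hypothetical display range)."""
    vmax = max(values.values())
    return {k: lo + (hi - lo) * (v / vmax if vmax else 0.0) for k, v in values.items()}


# Toy network with placeholder labels (not interactions from the study).
adj = build_adjacency([("hub", "a"), ("hub", "b"), ("hub", "c"), ("c", "d")])
bc = betweenness(adj)
deg = node_degree(adj)
node_size = scale(bc, 20.0, 60.0)    # node size follows betweenness
label_size = scale(deg, 8.0, 24.0)   # label font size follows degree
```

Separating the two encodings lets a node that bridges subnetworks (high betweenness) stand out even when it has few direct partners (low degree).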

Supplementary Figure S3. PPI subnetwork module derived with the Cytoscape app MCODE from the PPI network of 1775 DEG-DMGs. Node colors from yellow to red map the discrete scale of the logarithm base 2 of fold changes in gene expression for the corresponding gene: yellow: less than or equal to -6, …, light-green: (-2, -1], …, cyan: (2, 3], …, blue: (4, 5], …, red: 10 or more.

Supplementary Figure S4. Sub-networks derived with K-means clustering from the subset of 285 network-related DMGs.

Supplementary Figure S5. Network enrichment analysis on KEGG pathways for the module derived with the Cytoscape app MCODE from the PPI network of 1775 DEG-DMGs.


Supplementary Figure S6. PPI networks on the set of 191 network-related DEG-DMGs. The PPI network was built with Cytoscape 11,12 from a subset of 191 DEG-DMGs previously obtained by applying network-based enrichment analysis 51. Nodes with the same color belong to the same cluster obtained by K-means clustering.

Supplementary Figure S7. Association between methylation and gene expression. A, Spearman's rank correlation rho between the absolute value of the logarithm base 2 of fold change (log2FC) in gene expression at DEG-DMGs and the differences in methylation densities (between control and patient groups). All correlations are statistically significant (p-value less than 0.001). The variables analyzed are the absolute difference (∆) of: 𝑝𝑝𝑑𝑑𝑑𝑑𝑑𝑑𝐷𝐷𝑑𝑑𝑑𝑑𝑑𝑑: density of methylation levels, 𝑇𝑇𝑇𝑇𝑑𝑑𝑑𝑑𝑑𝑑𝐷𝐷𝑑𝑑𝑑𝑑𝑑𝑑: density of the difference of methylation levels, 𝑇𝑇𝑇𝑇𝑇𝑇𝑑𝑑𝑑𝑑𝑑𝑑𝐷𝐷𝑑𝑑𝑑𝑑𝑑𝑑: TV with Bayesian correction, and 𝐻𝐻𝐻𝐻𝑑𝑑𝑑𝑑𝑑𝑑𝐷𝐷𝑑𝑑𝑑𝑑𝑑𝑑: density of Hellinger divergence of methylation levels. B, D, and F panels show two-dimensional kernel density estimations (2D-KDE) of the joint probability distribution for each annotated pair of variables in the coordinate axes from the contour-plot plane (see main text for variable description). C and E panels: Farlie-Gumbel-Morgenstern (FGM) copula joint probability distribution built from the estimated marginal distributions (XZ plane: gamma probability distribution; YZ plane: generalized gamma distribution). Together, panels A to F indicate that, in the current study of patients with PALL, methylation and gene expression are not statistically independent but are associated with a highly statistically significant linear trend, located with high joint probability in the outlined red contour-plot regions.
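For readers unfamiliar with the FGM copula named in panels C and E, the sketch below shows how a joint density is assembled from two marginals and a single dependence parameter. It is a minimal illustration under stated assumptions: exponential marginals stand in for the gamma and generalized gamma marginals of the figure, and the parameter value and all function names are hypothetical.

```python
import math


def fgm_density(u, v, theta):
    """FGM copula density c(u, v) = 1 + theta * (1 - 2u) * (1 - 2v), |theta| <= 1."""
    if abs(theta) > 1.0:
        raise ValueError("theta must lie in [-1, 1]")
    return 1.0 + theta * (1.0 - 2.0 * u) * (1.0 - 2.0 * v)


def fgm_cdf(u, v, theta):
    """FGM copula C(u, v) = u * v * (1 + theta * (1 - u) * (1 - v))."""
    if abs(theta) > 1.0:
        raise ValueError("theta must lie in [-1, 1]")
    return u * v * (1.0 + theta * (1.0 - u) * (1.0 - v))


def fgm_spearman_rho(theta):
    """Spearman's rho implied by an FGM copula (a standard result: theta / 3)."""
    return theta / 3.0


def joint_density(x, y, theta, pdf_x, cdf_x, pdf_y, cdf_y):
    """Joint density f(x, y) = f_X(x) * f_Y(y) * c(F_X(x), F_Y(y))."""
    return pdf_x(x) * pdf_y(y) * fgm_density(cdf_x(x), cdf_y(y), theta)


def exp_pdf(rate):
    """Exponential density, used here only as a stand-in marginal."""
    return lambda x: rate * math.exp(-rate * x) if x >= 0 else 0.0


def exp_cdf(rate):
    """Exponential distribution function, used here only as a stand-in marginal."""
    return lambda x: 1.0 - math.exp(-rate * x) if x >= 0 else 0.0


theta = 0.6  # hypothetical dependence parameter
density_at_point = joint_density(
    1.0, 0.5, theta, exp_pdf(1.0), exp_cdf(1.0), exp_pdf(2.0), exp_cdf(2.0)
)
implied_rho = fgm_spearman_rho(theta)
```

Because Spearman's rho under an FGM copula is bounded by 1/3 in absolute value, this family can only represent weak to moderate dependence, which is worth keeping in mind when comparing the copula surfaces with the rank correlations in panel A.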


Supplementary Tables

Supplementary Table S1: Excel files containing Tables S1-A to S1-G.

Supplementary Table S2: Excel files containing Tables S2-A to S2-I.

Supplementary File S1. Zip file containing the wig files with tracks for the group means of the differences of methylation levels between each group and the reference group (four independent normal CD19+ blood cell donors): control (four normal CD19+ blood cell donors) versus reference, and patients (ALL cells from three patients) versus reference.
