
Dependency Discovery via Multiscale Graph Correlation

Cencheng Shen

University of Delaware

Collaborators: Joshua T. Vogelstein, Carey E. Priebe, Shangsi Wang, Youjin Lee, Mauro Maggioni, Qing Wang, Alex Badea.

Acknowledgment: NSF DMS, DARPA SIMPLEX.

R package available on CRAN and at https://github.com/neurodata/MGC/

Matlab code available at https://github.com/neurodata/mgc-matlab


Overview

1. Motivation

2. Methodology

3. Theoretical Properties

4. Simulations and Experiments

5. Summary


Motivation


Given paired data $(\mathbf{X}_n, \mathbf{Y}_n) = \{(x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}^q, \text{ for } i = 1, \dots, n\}$,

• Are they related?

• How are they related?

X                      Y
brain connectivity     creativity / personality
brain shape            health
gene / protein         cancer
social networks        attributes
anything               anything else


Formal Definition of Independence Testing

$(x_i, y_i) \overset{\text{i.i.d.}}{\sim} F_{XY}, \quad i = 1, \dots, n$

$H_0: F_{XY} = F_X F_Y,$
$H_A: F_{XY} \neq F_X F_Y.$

A test is universally consistent if its power converges to 1 as $n \to \infty$ against any dependent $F_{XY}$.

Without loss of generality, we shall assume $F_{XY}$ has finite second moments.


Benchmarks

Correlation measures:

• Linear: Pearson, rank, CCA, RV

• Non-linear: MIC, Mantel

• Universally consistent: Dcorr, HSIC, HHG


Motivations

Modern data sets may be high-dimensional, nonlinear, noisy, of limited sample size, structured, or from disparate spaces. Thus we desire a test that

• is consistent against all dependencies;

• has good finite-sample testing performance;

• is easy to understand and efficient to implement;

• provides insights into the dependency structure.

To that end, we propose the multiscale graph correlation (MGC) in [Shen et al. (2018)].


Methodology


Flowchart of MGC

$(\mathbf{X}_n, \mathbf{Y}_n)$
→ Computing distances & centering: $A, B \in \mathbb{R}^{n \times n}$, with $\mathrm{Dcov} = \sum_{i \neq j} A_{ij} B_{ji}$
→ Incorporating locality: $\{A^k, B^l \in \mathbb{R}^{n \times n}, \text{ for } k, l \in [n]\}$, with $\mathrm{dCov}^{k,l} = \sum_{i \neq j} A^k_{ij} B^l_{ji} - \sum_{i \neq j} A^k_{ij} \sum_{i \neq j} B^l_{ij}$
→ All local correlations: $\{\mathrm{dCorr}^{k,l}\} \in [-1, 1]^{n \times n}$
→ Smoothed maximum: $c^* \in [-1, 1]$, and optimal scale $(k^*, l^*)$
→ P-value by permutation test


Computing Distance and Centering

Input: $\mathbf{X}_n = [x_1, \dots, x_n]$ as the data matrix, with each column representing one sample observation; similarly $\mathbf{Y}_n$.

Distance Computation: Let $A$ be the $n \times n$ Euclidean distance matrix of $\mathbf{X}_n$:

$$A_{ij} = \|x_i - x_j\|_2,$$

and similarly $B$ from $\mathbf{Y}_n$.

Centering: Then we center $A$ and $B$ by columns, with the diagonals excluded:

$$A_{ij} = \begin{cases} A_{ij} - \frac{1}{n-1} \sum_{s=1}^{n} A_{sj}, & \text{if } i \neq j, \\ 0, & \text{if } i = j; \end{cases} \qquad (1)$$

similarly for B.
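As a concrete illustration, here is a minimal R sketch of this step. It is our own helper, not the CRAN package's code, and it takes one observation per row (R's dist() convention) rather than per column as on the slide:

```r
# Minimal sketch of the distance + centering step, following equation (1).
center_distances <- function(x) {
  A <- as.matrix(dist(x))          # A[i, j] = ||x_i - x_j||_2
  n <- nrow(A)
  for (j in seq_len(n)) {
    # column mean over s = 1, ..., n; A[j, j] = 0, so this equals
    # the diagonal-excluded average in (1)
    A[, j] <- A[, j] - sum(A[, j]) / (n - 1)
  }
  diag(A) <- 0                     # re-zero the diagonal
  A
}
```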


Incorporating the Locality Principle

Ranking: Define $R^A_{ij}$ as the "rank" of $x_i$ relative to $x_j$, that is, $R^A_{ij} = k$ if $x_i$ is the $k$th closest point (or "neighbor") to $x_j$, as determined by ranking the set $\{A_{1j}, A_{2j}, \dots, A_{nj}\}$ in ascending order. Similarly define $R^B_{ij}$ for the $y$'s.

For any $(k, l) \in [n]^2$, define the rank-truncated matrices $A^k$ and $B^l$ as

$$A^k_{ij} = A_{ij} \, I(R^A_{ij} \le k), \qquad B^l_{ij} = B_{ij} \, I(R^B_{ij} \le l).$$

When ties occur, the minimum rank is recommended; e.g., if $Y$ takes only two values, $R^B_{ij}$ takes values in $\{1, 2\}$ only. We assume no ties for ease of presentation.
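A minimal R sketch of the ranking and truncation step (again our own helper, not the package API), using minimum rank for ties as recommended above:

```r
# Rank each column of the distance matrix in ascending order, then keep only
# the entries whose rank is at most k: A^k_{ij} = A_{ij} I(R^A_{ij} <= k).
rank_truncate <- function(A, k) {
  # R[i, j] = rank of x_i relative to x_j; since A[j, j] = 0,
  # each point is its own rank-1 neighbor in this sketch
  R <- apply(A, 2, rank, ties.method = "min")
  A * (R <= k)
}
```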


Local Distance Correlations

A Family of Local Correlations: Let $\circ$ denote the entry-wise product, $\prime$ denote the matrix transpose, and $E(\cdot) = \frac{1}{n(n-1)} \sum_{i \neq j} (\cdot)$ denote the diagonal-excluded sample mean of a square matrix. The sample local covariance, variances, and correlation are defined as

$$\mathrm{dCov}^{k,l}(\mathbf{X}_n, \mathbf{Y}_n) = E(A^k \circ B^{l\prime}) - E(A^k) E(B^l),$$
$$\mathrm{dVar}^{k}(\mathbf{X}_n) = E(A^k \circ A^{k\prime}) - E^2(A^k),$$
$$\mathrm{dVar}^{l}(\mathbf{Y}_n) = E(B^l \circ B^{l\prime}) - E^2(B^l),$$
$$\mathrm{dCorr}^{k,l}(\mathbf{X}_n, \mathbf{Y}_n) = \mathrm{dCov}^{k,l}(\mathbf{X}_n, \mathbf{Y}_n) \Big/ \sqrt{\mathrm{dVar}^{k}(\mathbf{X}_n) \cdot \mathrm{dVar}^{l}(\mathbf{Y}_n)}$$

for $k, l = 1, \dots, n$. If $\mathrm{dVar}^{k}(\mathbf{X}_n) \cdot \mathrm{dVar}^{l}(\mathbf{Y}_n) \le 0$, we set $\mathrm{dCorr}^{k,l}(\mathbf{X}_n, \mathbf{Y}_n) = 0$ instead.

There are at most $n^2$ distinct local correlations. At $k = l = n$, $\mathrm{dCorr}^{k,l}(\mathbf{X}_n, \mathbf{Y}_n)$ equals the "global" distance correlation $\mathrm{Dcorr}(\mathbf{X}_n, \mathbf{Y}_n)$ of Szekely et al. (2007).
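Combining the pieces, a minimal R sketch of a single local correlation $\mathrm{dCorr}^{k,l}$, assuming Ak and Bl are the centered, rank-truncated matrices produced by the two steps above (helper name ours):

```r
# Local correlation at one scale (k, l), given rank-truncated centered
# matrices Ak = A^k and Bl = B^l. The prime (transpose) in the definition
# becomes t() below, and Emean() is the diagonal-excluded sample mean E().
local_dcorr <- function(Ak, Bl) {
  n <- nrow(Ak)
  Emean <- function(M) (sum(M) - sum(diag(M))) / (n * (n - 1))
  dcov  <- Emean(Ak * t(Bl)) - Emean(Ak) * Emean(Bl)
  dvarx <- Emean(Ak * t(Ak)) - Emean(Ak)^2
  dvary <- Emean(Bl * t(Bl)) - Emean(Bl)^2
  if (dvarx * dvary <= 0) return(0)   # degenerate scale: define dCorr as 0
  dcov / sqrt(dvarx * dvary)
}
```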


Smoothed Maximum $c^*(\mathbf{X}_n, \mathbf{Y}_n)$

One would like to use the optimal local correlation for testing.

But directly taking the maximum local correlation

$$\max_{(k,l) \in [n]^2} \{\mathrm{dCorr}^{k,l}(\mathbf{X}_n, \mathbf{Y}_n)\}$$

yields a biased statistic under independence, i.e., the maximum is larger than 0 in expectation even when $X$ and $Y$ are independent!

Instead, we take a smoothed maximum: find a connected region in the local correlation map with significant local correlations; if such a region exists, use the maximum within that region.


Smoothed Maximum

Pick a threshold $\tau \ge 0$ (we choose it from an approximate null distribution of Dcorr, which is a symmetric beta distribution and converges to 0 as $n \to \infty$), compute the set

$$\{(k, l) \text{ such that } \mathrm{dCorr}^{k,l}(\mathbf{X}_n, \mathbf{Y}_n) > \max\{\tau, \mathrm{Dcorr}(\mathbf{X}_n, \mathbf{Y}_n)\}\},$$

and calculate the largest connected component $R$ of the set.

If $R$ contains sufficiently many elements ($> 2n$), take the maximum local correlation within $R$ as the MGC statistic $c^*(\mathbf{X}_n, \mathbf{Y}_n)$, and take the corresponding neighborhood pair as the optimal scale $(k^*, l^*)$.
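A minimal R sketch of this smoothed-maximum step, under two stated assumptions not spelled out above: connectivity is taken as 4-neighbor adjacency in the $(k, l)$ grid, and when no sufficiently large region exists the statistic falls back to the global correlation at $(n, n)$. The helper name smoothed_max is ours, not the package API:

```r
# Smoothed maximum: threshold the n x n local correlation map, find the
# largest 4-connected component, and take its maximum if it is large enough.
smoothed_max <- function(local_corr, tau) {
  n <- nrow(local_corr)
  global_corr <- local_corr[n, n]              # scale (n, n) is the global Dcorr
  keep <- local_corr > max(tau, global_corr)   # "significant" local correlations
  labels <- matrix(0L, n, n)
  lab <- 0L
  for (i in seq_len(n)) for (j in seq_len(n)) {
    if (keep[i, j] && labels[i, j] == 0L) {    # start a new component (BFS)
      lab <- lab + 1L
      queue <- list(c(i, j))
      labels[i, j] <- lab
      while (length(queue) > 0) {
        cur <- queue[[1]]; queue <- queue[-1]
        for (d in list(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))) {
          u <- cur[1] + d[1]; v <- cur[2] + d[2]
          if (u >= 1 && u <= n && v >= 1 && v <= n &&
              keep[u, v] && labels[u, v] == 0L) {
            labels[u, v] <- lab
            queue <- c(queue, list(c(u, v)))
          }
        }
      }
    }
  }
  if (lab == 0L)                               # no significant region at all
    return(list(stat = global_corr, scale = c(n, n)))
  sizes <- tabulate(labels[labels > 0L])
  big <- which.max(sizes)
  if (sizes[big] <= 2 * n)                     # region too small: use global scale
    return(list(stat = global_corr, scale = c(n, n)))
  region <- which(labels == big, arr.ind = TRUE)
  vals <- local_corr[region]
  best <- which.max(vals)
  list(stat = vals[best], scale = unname(region[best, ]))
}
```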


Permutation Test

To get a p-value from MGC for any given data, we use the permutation test: randomly permute the indices of the second data set $r$ times, compute the permuted MGC statistic $c^*(\mathbf{X}_n, \mathbf{Y}^{\pi}_n)$ for each permutation $\pi$, and estimate

$$\mathrm{Prob}(c^*(\mathbf{X}_n, \mathbf{Y}^{\pi}_n) \ge c^*(\mathbf{X}_n, \mathbf{Y}_n))$$

as the p-value.

This is a standard nonparametric testing procedure, employed by Mantel, Dcorr, HHG, and HSIC, whenever the null distribution of the dependency measure cannot be derived exactly.
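A minimal R sketch of this procedure; mgc_stat is a hypothetical helper returning the MGC statistic $c^*$ (the CRAN package exposes its own test interface):

```r
# Permutation test for MGC; mgc_stat(x, y) is a hypothetical helper that
# returns the statistic c* for data matrices x and y (one observation per row).
mgc_perm_test <- function(x, y, r = 1000) {
  observed <- mgc_stat(x, y)
  n <- nrow(y)
  permuted <- replicate(r, mgc_stat(x, y[sample(n), , drop = FALSE]))
  mean(permuted >= observed)   # estimated Prob(c*(Xn, Yn^pi) >= c*(Xn, Yn))
}
```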


Computation Complexity

• Distance computation takes $O(n^2 \max(p, q))$

• Centering takes $O(n^2)$

• Ranking takes $O(n^2 \log n)$

• All local correlations can be computed iteratively in $O(n^2)$

• The smoothed maximum takes $O(n^2)$

Overall, MGC can be computed in $O(n^2 \max(p, q, \log n))$, which is comparable to Dcorr, HHG, and HSIC.

The permutation test takes $O(n^2 \max(r, p, q, \log n))$ for $r$ random permutations.

There are a number of ways to speed up the method for big data: a faster implementation when $p = q = 1$, and null-distribution approximation by subsampling or spectral embedding.


Examples

[Figure: two example data sets. First: $\mathrm{Dcorr}(\mathbf{X}_n, \mathbf{Y}_n) = 0.15$, $\mathrm{MGC}(\mathbf{X}_n, \mathbf{Y}_n) = 0.15$, p-values $< 0.001$ for both. Second: $\mathrm{Dcorr}(\mathbf{X}_n, \mathbf{Y}_n) = 0.01$, $\mathrm{MGC}(\mathbf{X}_n, \mathbf{Y}_n) = 0.13$, p-values 0.3 vs. $< 0.001$.]


Theoretical Properties


Basic Properties of Sample MGC

Theorem 1 (Well-behaved Correlation Measure)

1. Boundedness: $c^*(\mathbf{X}_n, \mathbf{Y}_n) \in [-1, 1]$.

2. Symmetry: $c^*(\mathbf{X}_n, \mathbf{Y}_n) = c^*(\mathbf{Y}_n, \mathbf{X}_n)$.

3. Invariance: $c^*(\mathbf{X}_n, \mathbf{Y}_n)$ is invariant to any distance-preserving transformations $\phi, \delta$ applied to $\mathbf{X}_n$ and $\mathbf{Y}_n$ respectively (i.e., rotation, scaling, translation, reflection).

4. 1-Linear: $c^*(\mathbf{X}_n, \mathbf{Y}_n) = 1$ if and only if $F_X$ is non-degenerate and $(X, uY)$ are dependent via an isometry for some non-zero constant $u$.


Consistency of Sample MGC

Theorem 2 (Consistency)

1. 0-Indep: $c^*(\mathbf{X}_n, \mathbf{Y}_n) \xrightarrow{n \to \infty} 0$ if and only if $X$ and $Y$ are independent.

2. Valid Test: Under the permutation test, sample MGC is a valid test, i.e., it controls the type I error at level $\alpha$.

3. Consistency: At any type I error level $\alpha$, the testing power $\beta(c^*(\mathbf{X}_n, \mathbf{Y}_n)) \xrightarrow{n \to \infty} 1$ against any dependent $F_{XY}$.


Defining Population MGC

Suppose $(X, Y), (X', Y'), (X'', Y''), (X''', Y''')$ are i.i.d. from $F_{XY}$. Let $I(\cdot)$ be the indicator function, and define two random variables

$$I^{\rho_k}_{X,X'} = I\Big( \int_{B(X, \|X'-X\|)} dF_X(u) \le \rho_k \Big),$$
$$I^{\rho_l}_{Y',Y} = I\Big( \int_{B(Y', \|Y'-Y\|)} dF_Y(u) \le \rho_l \Big)$$

for $\rho_k, \rho_l \in [0, 1]$. Further define

$$d^{\rho_k}_{X} = (\|X - X'\| - \|X - X''\|) \, I^{\rho_k}_{X,X'},$$
$$d^{\rho_l}_{Y'} = (\|Y' - Y\| - \|Y' - Y'''\|) \, I^{\rho_l}_{Y',Y}.$$

The population local covariance can then be defined as

$$\mathrm{dCov}^{\rho_k,\rho_l}(X, Y) = E(d^{\rho_k}_{X} d^{\rho_l}_{Y'}) - E(d^{\rho_k}_{X}) E(d^{\rho_l}_{Y'}).$$

Normalizing and taking a smoothed maximum yield the population MGC.


Sample to Population

Alternatively, the population version can be equivalently defined via characteristic functions of $F_{XY}$:

$$\mathrm{dCov}^{\rho_k=1,\rho_l=1}(X, Y) = \int_{t,s} |g_{XY}(t, s) - g_X(t) g_Y(s)|^2 \, dw(t, s)$$

with respect to a non-negative weight function $w(t, s)$ on $(t, s) \in \mathbb{R}^p \times \mathbb{R}^q$. The weight function is defined as

$$w(t, s) = (d_p d_q |t|^{1+p} |s|^{1+q})^{-1},$$

where $d_p = \frac{\pi^{(1+p)/2}}{\Gamma((1+p)/2)}$ is a non-negative constant tied to the dimensionality $p$, and $\Gamma(\cdot)$ is the complete Gamma function.

This can be similarly adapted to the local correlations.


Theoretical Advantages of MGC

Theorem 3 (Convergence, Mean and Variance)

1. 0-Indep: $c^*(X, Y) = 0$ if and only if $X$ and $Y$ are independent.

2. Convergence: $c^*(\mathbf{X}_n, \mathbf{Y}_n) \xrightarrow{n \to \infty} c^*(X, Y)$.

3. Almost Unbiased: $E(c^*(\mathbf{X}_n, \mathbf{Y}_n)) = c^*(X, Y) + O(1/n)$.

4. Diminishing Variance: $\mathrm{Var}(c^*(\mathbf{X}_n, \mathbf{Y}_n)) = O(1/n)$.

The last three properties also hold for any local correlation via $(\rho_k, \rho_l) = \big( \frac{k-1}{n-1}, \frac{l-1}{n-1} \big)$.


Theoretical Advantages of MGC

Theorem 4 (Advantages of Population MGC vs. Dcorr)

1. For any dependent $F_{XY}$, $c^*(X, Y) \ge \mathrm{Dcorr}(X, Y)$.

2. There exists a dependent $F_{XY}$ such that $c^*(X, Y) > \mathrm{Dcorr}(X, Y)$.

As MGC and Dcorr share similar variance and the same mean under the null, the mean advantage under the alternative translates into better testing power.

Theorem 5 (Optimal Scale of MGC Implies Geometric Structure)

If the relationship is linear (or linear with independent noise), the global scale is always optimal and $c^*(X, Y) = \mathrm{Dcorr}(X, Y)$.

Conversely, an optimal scale that is local, i.e., $c^*(X, Y) > \mathrm{Dcorr}(X, Y)$, implies a nonlinear relationship.


MGC is applicable to similarity / kernel matrices

Theorem 6 (Transforming a kernel into a distance)

Given any characteristic kernel function k(·, ·), define an induced semi-metric as

d(·, ·) = 1 − k(·, ·)/max{k(·, ·)}.

Then d(·, ·) is of strong negative type, and the resulting MGC is universally consistent.

Namely, given a sample kernel matrix K of size n × n, one can compute the induced distance matrix

D = J − K/max{K(i, j) : i, j = 1, . . . , n},

where J is the n × n all-ones matrix, and apply MGC to the induced distance matrices.
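A minimal sketch of this conversion in Python, assuming a symmetric kernel matrix with a positive maximum; the function name is illustrative and not part of the released packages.

import numpy as np

def kernel_to_distance(K):
    # Induced distance of Theorem 6: D = J - K / max(K), with J the all-ones matrix.
    K = np.asarray(K, dtype=float)
    return np.ones_like(K) - K / K.max()

# Usage with a Gaussian (characteristic) kernel on toy data.
rng = np.random.default_rng(0)
x = rng.normal(size=(20, 3))
sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
K = np.exp(-sq / (2 * np.median(sq)))                     # median-heuristic bandwidth
D = kernel_to_distance(K)
assert np.allclose(np.diag(D), 0)                         # self-distances vanish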

Simulations and Experiments

Visualizations of 20 Simulation Settings

[Figure: scatter plots of the 20 simulation settings, each panel titled with its dependency type and a statistic value: Linear: 1, Exponential: 0.99, Cubic: 0.84, Joint Normal: 0.2, Step Function: 0.76, Quadratic: 0.67, W Shape: 0.42, Spiral: 0.3, Bernoulli: 0.93, Logarithmic: 0.67, Fourth Root: 0.65, Sine Period 4: 0.3, Sine Period 16: 0.14, Square: 0.08, Two Parabolas: 0.46, Circle: 0.52, Ellipse: 0.56, Diamond: 0.08, Multiplicative: 0.11, Independence: 0. Caption: MGC, Distance Correlation, and Pearson's Correlation for 20 Dependencies.]


Evaluation Criteria

• Power is the probability of rejecting the null when the alternative is true.

• The required sample size Nα,β(c) is the sample size needed to achieve power β at type-1 error level α using the statistic c.

Both quantities can be estimated numerically, as sketched below.
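Here is a minimal Python sketch that estimates power by Monte Carlo with a permutation test. It uses the squared sample distance correlation as a stand-in statistic, since MGC itself lives in the released packages; all function names here are illustrative assumptions.

import numpy as np

def dcorr2(x, y):
    # Squared sample distance correlation (for 1D samples);
    # a stand-in for MGC in this sketch.
    def centered(z):
        d = np.abs(z[:, None] - z[None, :])
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()
    A, B = centered(np.asarray(x, float)), centered(np.asarray(y, float))
    return (A * B).mean() / np.sqrt((A * A).mean() * (B * B).mean())

def perm_pvalue(x, y, stat=dcorr2, n_perm=200, rng=None):
    # Permutation p-value: shuffling y breaks any dependence, giving the null.
    rng = rng if rng is not None else np.random.default_rng()
    observed = stat(x, y)
    null = [stat(x, rng.permutation(y)) for _ in range(n_perm)]
    return (1 + sum(t >= observed for t in null)) / (1 + n_perm)

def power(sampler, stat=dcorr2, alpha=0.05, n_rep=200, rng=None):
    # Empirical power: fraction of simulated data sets with p-value <= alpha.
    rng = rng if rng is not None else np.random.default_rng(0)
    rejections = sum(perm_pvalue(*sampler(rng), stat=stat, rng=rng) <= alpha
                     for _ in range(n_rep))
    return rejections / n_rep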


Testing Power: Linear vs Nonlinear

[Figure: testing power versus noise level, two panels (Linear Relationship and Quadratic Relationship), comparing MGC, Distance Correlation, and Pearson's Correlation.]

n = 30, p = q = 1, X ∼ Uniform(−1, 1), ε ∼ Normal(0, noise), with Y = X + ε (linear) and Y = X² + ε (quadratic).
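Samplers for these two settings, matching the power estimator sketched earlier; treating the noise level as a standard-deviation multiplier is an assumption about the exact parameterization.

import numpy as np

def linear_sim(rng, n=30, noise=1.0):
    # Y = X + eps, X ~ Uniform(-1, 1); noise is assumed to scale the std. dev.
    x = rng.uniform(-1, 1, n)
    return x, x + noise * rng.normal(size=n)

def quadratic_sim(rng, n=30, noise=0.5):
    # Y = X^2 + eps: strongly nonlinear, where local scales shine.
    x = rng.uniform(-1, 1, n)
    return x, x ** 2 + noise * rng.normal(size=n)

# e.g., power(lambda rng: quadratic_sim(rng)) using the earlier sketch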


Required Sample Size

When noise = 1 and p = q = 1, the required sample size Nα=0.05,β=0.85(c) is: 40 for all three methods in the linear relationship; and 80 for MGC, 180 for Dcorr, and > 1000 for Pearson in the quadratic relationship.

Next we compute the required sample size for each simulation, and summarize by the median over the close-to-linear relationships (types 1-5) and the strongly nonlinear relationships (types 6-19).

We consider univariate (1D) and multivariate (10D) cases.
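One simple way to compute such a required sample size numerically is to scan n upward until the empirical power reaches β, reusing the power estimator and samplers sketched above; the grid and stopping rule here are assumptions, not the exact procedure used for the table below.

def required_sample_size(sim, alpha=0.05, beta=0.85, n_max=1000, step=10):
    # Smallest n on the grid whose empirical power reaches beta; returns None
    # when even n_max falls short, i.e., the "> 1000" entries in the table.
    for n in range(step, n_max + 1, step):
        if power(lambda rng: sim(rng, n=n), alpha=alpha) >= beta:
            return n
    return None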

Median Size Table

Testing Methods         1D Lin   1D Non-Lin   10D Lin   10D Non-Lin
MGC                     50       90           60        165
Dcorr                   50       250          60        515
Pearson / RV / CCA      50       >1000        50        >1000
HHG                     70       90           100       315
HSIC                    70       95           100       400
MIC                     120      180          n/a       n/a

Signal Subgraph via MGC

We consider predicting site and sex from functional magnetic resonance imaging (fMRI) graphs. The two datasets used are SWU4 and HNU1, with 467 and 300 samples respectively.

Each sample is an fMRI scan registered to the MNI152 template using the Desikan atlas, which has 70 regions.

We use an iterative screening method (similar to backward selection) via MGC from [Wang et al. (2018)] to extract the signal subgraph (in this case, the brain regions) most dependent on site and sex, and we verify the results by leave-one-out cross-validation with a k-nearest-neighbor classifier.
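The screening idea, in a deliberately simplified form: repeatedly drop whichever feature hurts the dependence with the label the least. This generic sketch operates on feature columns with distance correlation; the actual procedure in [Wang et al. (2018)] screens graph vertices via MGC and differs in its details.

import numpy as np
from scipy.spatial.distance import cdist

def dcorr2m(X, Y):
    # Squared sample distance correlation for multivariate data (rows = samples).
    def centered(Z):
        d = cdist(Z, Z)
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()
    A, B = centered(X), centered(Y)
    return (A * B).mean() / np.sqrt((A * A).mean() * (B * B).mean())

def backward_screen(X, y, k_keep):
    # Keep the k_keep columns of X whose joint dependence with y stays highest
    # as columns are eliminated one at a time (a sketch of iterative screening).
    keep = list(range(X.shape[1]))
    Y = np.asarray(y, float).reshape(-1, 1)
    while len(keep) > k_keep:
        scores = [dcorr2m(X[:, [j for j in keep if j != i]], Y) for i in keep]
        keep.pop(int(np.argmax(scores)))  # drop the feature whose removal scores best
    return keep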

Figure: A total of 22 regions are identified for the site difference; this subgraph size maximizes the MGC statistic and nearly minimizes the leave-one-out cross-validation error. This is no longer the case for sex, for which neither the MGC statistic nor the error is significant at any subgraph size.

Summary

MGC builds on distance correlation, the locality principle, and a smoothed maximum:

• A proper distance transformation ensures universal consistency.

• All local correlations are computed iteratively.

• The optimal local correlation is identified without inflating the sample bias.

Together, these make MGC advantageous in theory and in practice. A sketch of the smoothed-maximum step follows.
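A minimal sketch of the smoothed maximum, assuming a precomputed local-correlation map (e.g., from the earlier local_correlations sketch): rather than taking the raw maximum over all scales, which inflates the sample bias, it only trusts a local scale when it sits inside a sufficiently large connected region of high correlations. The threshold and the region-size rule here are simplifications of the published procedure.

import numpy as np
from scipy import ndimage

def smoothed_max(R):
    # R: n-by-n map of local correlations; R[-1, -1] is the global scale (Dcorr).
    R = np.asarray(R, float)
    n = R.shape[0]
    tau = max(0.0, np.percentile(R, 95))      # assumed significance threshold
    labeled, nlab = ndimage.label(R > tau)    # connected regions of large values
    if nlab == 0:
        return R[-1, -1]
    sizes = ndimage.sum(np.ones_like(R), labeled, index=range(1, nlab + 1))
    largest = 1 + int(np.argmax(sizes))
    # Trust a local scale only if its region is large enough to be reliable;
    # otherwise fall back to the global (distance correlation) scale.
    if sizes.max() >= 2 * n:
        return max(R[labeled == largest].max(), R[-1, -1])
    return R[-1, -1]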

Advantages of MGC

1. Performant under any joint distribution with finite second moments:

• Equals 0 asymptotically if and only if X and Y are independent.

• Amplifies the dependency signal while largely avoiding the sample bias.

• Superior finite-sample performance over all benchmarks, across linear / nonlinear / noisy / high-dimensional relationships.

2. It works for:

• Low- and high-dimensional data.

• Euclidean and structured data (e.g., images, networks, shapes).

• Any dissimilarity / similarity / kernel matrix.

3. Intuitive to understand and efficient to implement in O(n² log n).

References

1. C. Shen, C. E. Priebe, and J. T. Vogelstein, "From distance correlation to the multiscale graph correlation," Journal of the American Statistical Association, 2019.

2. J. T. Vogelstein, E. Bridgeford, Q. Wang, C. E. Priebe, M. Maggioni, and C. Shen, "Discovering and Deciphering Relationships Across Disparate Data Modalities," eLife, 2019.

3. Y. Lee, C. Shen, and J. T. Vogelstein, "Network dependence testing via diffusion maps and distance-based correlations," Biometrika, 2019.

4. S. Wang, C. Shen, A. Badea, C. E. Priebe, and J. T. Vogelstein, "Signal subgraph estimation via iterative vertex screening," under review.

5. C. Shen and J. T. Vogelstein, "The Exact Equivalence of Distance and Kernel Methods for Hypothesis Testing," under review.
