detecting novel associations in large data sets

Michele Filannino + You

Presented paper: D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334, no. 6062, pp. 1518-1524, 2011.

CS-GN-TEAM: internal presentation

Manchester, 05/03/2012


presentation my research taster project

where we are



Introduction



novel association


■ two variables, X and Y, are associated if there is a

relationship between them

● functional

● non-functional

■ novel: unknown



Data set 10x6

example

       f0     f1     f2      f3      f4     f5
s0   4.00  -0.76   5.00   12.00    8.22   1.83
s1   9.00   0.41  10.00   23.00   27.12   4.30
s2   3.00   0.14   4.00    0.00    0.56  -0.43
s3  10.00  -0.54  11.00  100.00   94.02   6.24
s4   5.00  -0.96   6.00   45.00   39.25   3.56
s5   2.00   0.91   3.00  123.00  125.73   2.97
s6   7.00   0.66   8.00    4.00    9.26   2.56
s7   8.00   0.99   9.00   -2.00    6.90   2.37
s8   1.00   0.84   2.00   36.00   37.68   1.58
s9   6.00  -0.28   7.00    0.00   -1.96   0.71






f2(x) = f0(x) + 1

scatter plot: f0 vs. f2







no relation

scatter plot: f0 vs. f1




correlation coefficients

        Pearson   Mutual Information   MI (normalised)
f0-f5     0.63          2.45                0.74
f0-f1    -0.17          1.57                0.47
f0-f2     1.00          3.32                1.00
f2-f3    -0.08          3.12                0.94
f0-f3    -0.08          3.12                0.94
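The Pearson and mutual information columns can be checked with a few lines of Python. The sketch below is an assumption about how such numbers can be obtained, not the presenter's code: it uses the textbook Pearson formula and a plug-in mutual information estimate (in bits) that treats every distinct value as its own category, normalised by log2 of the number of samples. The helper names `pearson`, `entropy` and `mutual_information` are illustrative.

```python
import math
from collections import Counter

# Columns f0..f3 of the example data set (samples s0..s9).
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]
f2 = [5, 10, 4, 11, 6, 3, 8, 9, 2, 7]
f3 = [12, 23, 0, 100, 45, 123, 4, -2, 36, 0]


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def entropy(values):
    """Plug-in entropy in bits, treating each distinct value as a category."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), all estimated from value counts."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


for name, a, b in [("f0-f2", f0, f2), ("f0-f3", f0, f3)]:
    mi = mutual_information(a, b)
    print(name, round(pearson(a, b), 2), round(mi, 2),
          round(mi / math.log2(len(a)), 2))
```

Under these assumptions the f0-f2 and f0-f3 rows of the table are reproduced (1.0, 3.32, 1.0 and -0.08, 3.12, 0.94). The other rows presumably involve a discretisation step that the slides do not specify, so they are not recomputed here.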



pros. & cons.


■ Pearson’s coeff.

✔ closed interval result

✖ only linear relations

✖ feature independency

■ Mutual Information

✔ non linear relations

✖ only categorical data

✖ biased towards higher-arity features


the new measure



motivations

■ generality:

● capture a wide range of interesting associations, not

limited to specific function types

■ equitability:

● give similar scores to equally noisy relationships of

different types




definition of MIC

■ Given a finite set D of ordered pairs, we can

partition the X-values of D into x bins and the Y-

values of D into y bins

■ We obtain a pair of partitions called x-by-y grid


D = (F0, F1)

F0 = (1.00, 2.00, 3.00 | 4.00, 5.00, 6.00 | 7.00, 8.00, 9.00, 10.00)

F1 = (-0.96, -0.76 | -0.54, -0.28 | 0.14 | 0.41 | 0.66, 0.84, 0.91, 0.99)



2-by-4 grid

x-by-y grid




definition of MIC

■ given the grid we could calculate D|G, the frequency

distribution induced by the points in D on the cells

of G

● different grids G result in different distributions D|G
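To make D|G concrete, the sketch below places the ten (f0, f1) pairs of the example data set on a grid and counts the points per cell. The cut points are hypothetical, chosen only for illustration and not taken from the slides; `bin_index` and `induced_distribution` are illustrative names.

```python
from bisect import bisect_right
from collections import Counter

# Columns f0 and f1 of the example data set.
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]


def bin_index(value, cuts):
    """Index of the bin that contains value, given sorted cut points."""
    return bisect_right(cuts, value)


def induced_distribution(points, x_cuts, y_cuts):
    """Relative frequency of the points in each (column, row) cell of the grid."""
    counts = Counter((bin_index(x, x_cuts), bin_index(y, y_cuts))
                     for x, y in points)
    n = len(points)
    return {cell: c / n for cell, c in counts.items()}


D = list(zip(f0, f1))

# Hypothetical 2-by-4 grid: one cut on the X axis, three cuts on the Y axis.
dist_a = induced_distribution(D, x_cuts=[5.5], y_cuts=[-0.5, 0.0, 0.5])

# A different (3-by-2) grid over the same points gives a different D|G.
dist_b = induced_distribution(D, x_cuts=[3.5, 7.5], y_cuts=[0.0])

for cell, freq in sorted(dist_a.items()):
    print(cell, freq)
```

Only non-empty cells appear in the returned dictionary; empty cells have frequency zero and do not contribute to the mutual information.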




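maximal MI over all grids

$I^*(D, x, y) = \max_G I(D|G)$

where the maximum is taken over all grids G with x columns (number of columns) and y rows (number of rows)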



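characteristic matrix

$M(D)_{x,y} = \dfrac{I^*(D, x, y)}{\log \min\{x, y\}}$

■ I*(D, x, y): maximal MI over all x-by-y grids

■ log min{x, y}: normalisation factor (derived from the MI definition)

■ Infinite matrix!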



Maximal Information Coeff.

$\mathrm{MIC}(D) = \max_{xy < B(n)} M(D)_{x,y}$

■ M(D): characteristic matrix of D

■ B(n): max grid size
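The definition can be illustrated with a deliberately naive sketch. It is not the algorithm from the paper: instead of maximising over all grids for each resolution, it tries a single equal-frequency grid per (x, y), so each normalised score is only a lower bound on the corresponding characteristic matrix entry. The bound B(n) = n^0.6 and the names `equal_frequency_bins`, `grid_mutual_information` and `naive_mic` are assumptions made for this example.

```python
import math
from collections import Counter


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding (almost) equally many points.

    Tied values may be split across neighbouring bins.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins


def grid_mutual_information(xbins, ybins):
    """Mutual information (bits) of the distribution induced on the grid cells."""
    n = len(xbins)
    joint = Counter(zip(xbins, ybins))
    px, py = Counter(xbins), Counter(ybins)
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def naive_mic(xs, ys, alpha=0.6):
    """Best normalised score over equal-frequency grids with x * y < n**alpha."""
    n = len(xs)
    bound = n ** alpha
    best = 0.0
    for x in range(2, int(bound) + 1):
        xbins = equal_frequency_bins(xs, x)
        for y in range(2, int(bound) + 1):
            if x * y >= bound:
                break
            mi = grid_mutual_information(xbins, equal_frequency_bins(ys, y))
            best = max(best, mi / math.log2(min(x, y)))
    return best


xs = list(range(100))
ys = [v + 1 for v in xs]   # noiseless linear relationship
print(naive_mic(xs, ys))
```

For the noiseless linear relationship above the 2-by-2 grid already carries one full bit, so the score is 1.0 (up to rounding). Because only one grid per resolution is tried, scores for more complex relationships can be lower than what a proper search over grids would find.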



matrix computation

■ space of grids grows exponentially

● $B(n) \le O(n^{1-\varepsilon})$ for $0 < \varepsilon < 1$

■ approximation of MIC

● heuristic dynamic programming
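How much of the characteristic matrix is actually explored depends on B(n). The sketch below lists the grid resolutions (x, y) that satisfy x * y < B(n), assuming B(n) = n^0.6 (the setting suggested at the end of these slides) and the strict inequality from the MIC definition. `admissible_resolutions` is an illustrative name, and real implementations may treat the bound differently for small n.

```python
def admissible_resolutions(n, alpha=0.6):
    """All grid resolutions (x, y) with x, y >= 2 and x * y < n**alpha."""
    bound = n ** alpha
    limit = int(bound) + 1
    return [(x, y)
            for x in range(2, limit)
            for y in range(2, limit)
            if x * y < bound]


for n in (10, 100, 2000):
    pairs = admissible_resolutions(n)
    print(n, round(n ** 0.6, 2), len(pairs))
```

With 10 samples the bound is about 3.98, so under this reading not even a 2-by-2 grid qualifies. With 2000 samples the bound is about 95.6 and grids as elongated as 2-by-47 are admitted.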




MIC summary



✔ all types of data

✖ B(n) is crucial

✖ too high: non-zero scores even for random data

✖ too low: we search only for simple patterns

✖ still univariate




B(n) behaviour



how to use it



https://github.com/ajmazurie/xstats.MINE

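python

import xstats.MINE as MINE

# None entries appear to stand for missing observations in this example.
x = [40, 50, None, 70, 80, 90, 100, 110, 120, 130, 140, 150,
     160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260]

y = [-0.07, -0.23, -0.1, 0.03, -0.04, None, -0.28, -0.44,
     -0.09, 0.12, 0.06, -0.04, 0.31, 0.59, 0.34, -0.28, -0.09,
     -0.44, 0.31, 0.03, 0.57, 0, 0.01]

# Python 2 print statement; analyze_pair returns a dictionary of scores.
print "x y", MINE.analyze_pair(x, y)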



python: result

{'MCN': 2.5849625999999999,

'MAS': 0.040419996,

'pearson': 0.31553724,

'MIC': 0.38196000000000002,

'MEV': 0.27117000000000002,

'non_linearity': 0.28239626000000001}
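Two of the reported fields can be related to the others by simple arithmetic. In the paper, non-linearity is defined as MIC minus the squared Pearson coefficient, and MCN (minimum cell number) is the base-2 logarithm of the number of grid cells needed. The check below assumes the library follows those definitions, which the numbers above are consistent with.

```python
# Result dictionary copied from the slide above.
result = {
    'MCN': 2.5849625999999999,
    'MAS': 0.040419996,
    'pearson': 0.31553724,
    'MIC': 0.38196000000000002,
    'MEV': 0.27117000000000002,
    'non_linearity': 0.28239626000000001,
}

# Assumed definition: non-linearity = MIC - (Pearson's coefficient)^2.
non_linearity = result['MIC'] - result['pearson'] ** 2

# Assumed definition: MCN = log2(number of grid cells).
cells = 2 ** result['MCN']

print(round(non_linearity, 6), round(cells, 3))
```

Here the recomputed non-linearity agrees with the reported value to about six decimal places, and the MCN corresponds to a grid of six cells (for example 2-by-3).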




correlation coefficients

        Pearson   Mutual Information   MI (normalised)   MIC
f0-f5     0.63          2.45                0.74         0.24
f0-f1    -0.17          1.57                0.47         0.24
f0-f2     1.00          3.32                1.00         1.00
f2-f3    -0.08          3.12                0.94         0.24
f0-f3    -0.08          3.12                0.94         0.24



MIC summary



✔ all types of data

✖ B(n) is crucial

✖ n is too low!

✖ still univariate




python

import math
import xstats.MINE as MINE

# 1999 samples of a sine wave on (0, 20)
x = [n * 0.01 for n in range(1, 2000)]
y = [math.sin(v) for v in x]

result = MINE.analyze_pair(x, y)

# Python 2 print statements
print "MIC:", result['MIC']
print "Pearson:", result['pearson']

>>> MIC: 0.99999
>>> Pearson: -0.16366038
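The Pearson side of this comparison can be cross-checked without xstats.MINE. The sketch below rebuilds the same x and y and applies the textbook Pearson formula in plain Python 3; the MIC value itself still needs the library. `pearson` is an illustrative helper, not part of xstats.MINE.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)


# Same data as on the slide: a sine wave sampled on (0, 20).
x = [n * 0.01 for n in range(1, 2000)]
y = [math.sin(v) for v in x]

r = pearson(x, y)
print("Pearson:", r)
```

The coefficient comes out weakly negative (roughly -0.16), even though y is a deterministic function of x, which is the point of the comparison with MIC.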



conclusion



relationship types

Source: paper

real application

Source: paper



suggestions


■ use MIC only when you have lots of samples

● samples > 2000

■ use $B(n) = n^{0.6}$

■ don’t use it for all the possible pairs of features

● it is slower than Pearson’s correlation coefficient or

Mutual Information

Thank you.



references

■ D. N. Reshef et al., “Detecting Novel Associations in

Large Data Sets,” Science, vol. 334, no. 6062, pp.

1518-1524, 2011.

■ D. N. Reshef et al., “Supporting Online Material for

Detecting Novel Associations in Large Data Sets”
