detecting novel associations in large data sets

Michele Filannino + You

Presented paper: D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334, no. 6062, pp. 1518-1524, 2011.

CS-GN-TEAM: internal presentation, Manchester, 05/03/2012

Upload: michele-filannino

Post on 18-Dec-2014


DESCRIPTION

Paper presentation: D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334, no. 6062, pp. 1518-1524, 2011.

TRANSCRIPT

Page 1: Detecting novel associations in large data sets

Michele Filannino + You

Presented paper: D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334, no. 6062, pp. 1518-1524, 2011.

CS-GN-TEAM: internal presentation

Manchester, 05/03/2012

detecting novel associations in large data sets

Page 2: Detecting novel associations in large data sets

2 / 36, 05/03/2012, Michele Filannino

presentation my research taster project

where we are

Page 3: Detecting novel associations in large data sets


Introduction

Page 4: Detecting novel associations in large data sets

novel association

■ two variables, X and Y, are associated if there is a relationship between them
● functional
● non-functional
■ novel: previously unknown

Page 5: Detecting novel associations in large data sets

example

Data set 10x6

        f0      f1      f2      f3      f4      f5
s0    4.00   -0.76    5.00   12.00    8.22    1.83
s1    9.00    0.41   10.00   23.00   27.12    4.30
s2    3.00    0.14    4.00    0.00    0.56   -0.43
s3   10.00   -0.54   11.00  100.00   94.02    6.24
s4    5.00   -0.96    6.00   45.00   39.25    3.56
s5    2.00    0.91    3.00  123.00  125.73    2.97
s6    7.00    0.66    8.00    4.00    9.26    2.56
s7    8.00    0.99    9.00   -2.00    6.90    2.37
s8    1.00    0.84    2.00   36.00   37.68    1.58
s9    6.00   -0.28    7.00    0.00   -1.96    0.71

Page 6: Detecting novel associations in large data sets

example: Data set 10x6 (table repeated from Page 5)

Page 7: Detecting novel associations in large data sets

scatter plot: f0 vs. f2

f2(x) = f0(x) + 1

Page 8: Detecting novel associations in large data sets

example: Data set 10x6 (table repeated from Page 5)

Page 9: Detecting novel associations in large data sets

scatter plot: f0 vs. f1

no relation

Page 10: Detecting novel associations in large data sets

correlation coefficients

         Pearson   Mutual Information   MI normalised
f0-f5       0.63                 2.45            0.74
f0-f1      -0.17                 1.57            0.47
f0-f2       1.00                 3.32            1.00
f2-f3      -0.08                 3.12            0.94
f0-f3      -0.08                 3.12            0.94
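The Pearson and mutual information columns can be approximated from the 10x6 example data with plain Python. The sketch below is not taken from the slides. It assumes that mutual information was estimated after rounding every value to the nearest integer (to obtain categories) and that the normalised score divides by log2 of the number of samples; both are guesses that happen to agree with the figures shown. The helper names `pearson` and `mutual_information` are illustrative.

```python
import math
from collections import Counter

# Columns f0, f1, f2 and f3 of the 10x6 example data set.
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]
f2 = [5, 10, 4, 11, 6, 3, 8, 9, 2, 7]
f3 = [12, 23, 0, 100, 45, 123, 4, -2, 36, 0]


def pearson(xs, ys):
    """Pearson's linear correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mutual_information(xs, ys):
    """Mutual information in bits between two discretised variables.

    Assumption: values are made categorical by rounding to the nearest integer.
    """
    xs = [round(x) for x in xs]
    ys = [round(y) for y in ys]
    n = len(xs)
    count_x, count_y, count_xy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


pairs = {"f0-f1": (f0, f1), "f0-f2": (f0, f2), "f0-f3": (f0, f3)}
for name, (a, b) in pairs.items():
    mi = mutual_information(a, b)
    # Assumed normalisation: divide by log2(number of samples).
    print(name, round(pearson(a, b), 2), round(mi, 2), round(mi / math.log2(len(a)), 2))
```

Under these assumptions f0-f3 scores almost as high as the exact relation f0-f2, simply because nearly every value is its own category.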

Page 11: Detecting novel associations in large data sets

pros and cons

■ Pearson’s coefficient
✔ closed interval result
✖ only linear relations
✖ feature independence
■ Mutual Information
✔ non-linear relations
✖ only categorical data
✖ biased towards higher-arity features

Page 12: Detecting novel associations in large data sets


the new measure

Page 13: Detecting novel associations in large data sets

motivations

■ generality:
● capture a wide range of interesting associations, not limited to specific function types
■ equitability:
● give similar scores to equally noisy relationships of different types

Page 14: Detecting novel associations in large data sets

definition of MIC

■ Given a finite set D of ordered pairs, we can partition the X-values of D into x bins and the Y-values of D into y bins
■ We obtain a pair of partitions called an x-by-y grid

D = (F0, F1)
F0 = (1.00, 2.00, 3.00 | 4.00, 5.00, 6.00 | 7.00, 8.00, 9.00, 10.00)
F1 = (-0.96, -0.76 | -0.54, -0.28 | 0.14 | 0.41 | 0.66, 0.84, 0.91, 0.99)

Page 15: Detecting novel associations in large data sets

x-by-y grid

figure: 2-by-4 grid

Page 16: Detecting novel associations in large data sets

definition of MIC

■ given the grid G, we can calculate D|G, the frequency distribution induced by the points in D on the cells of G
● different grids G result in different distributions D|G
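The step from a grid G to the distribution D|G, and from there to a mutual information score, can be sketched as follows. This is an illustration only: the helper names `grid_distribution` and `mutual_information` are hypothetical, the x cut points mirror the F0 partition shown earlier, and the y cut point at 0.0 is an arbitrary choice.

```python
import math
from bisect import bisect_right
from collections import Counter, defaultdict

f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]


def grid_distribution(xs, ys, x_cuts, y_cuts):
    """Return D|G: the fraction of points of D falling in each cell of G.

    x_cuts and y_cuts are sorted cut points; k cut points define k + 1 bins.
    """
    n = len(xs)
    cells = Counter(
        (bisect_right(x_cuts, x), bisect_right(y_cuts, y)) for x, y in zip(xs, ys)
    )
    return {cell: count / n for cell, count in cells.items()}


def mutual_information(dist):
    """Mutual information in bits of a joint distribution over grid cells."""
    p_col, p_row = defaultdict(float), defaultdict(float)
    for (i, j), p in dist.items():
        p_col[i] += p
        p_row[j] += p
    return sum(p * math.log2(p / (p_col[i] * p_row[j])) for (i, j), p in dist.items())


# Two different grids over the same data give two different distributions.
grid_a = grid_distribution(f0, f1, x_cuts=[3.5, 6.5], y_cuts=[0.0])  # 3-by-2 grid
grid_b = grid_distribution(f0, f1, x_cuts=[5.5], y_cuts=[0.0])       # 2-by-2 grid
print(grid_a, mutual_information(grid_a))
print(grid_b, mutual_information(grid_b))
```

Only non-empty cells are stored, which is enough for mutual information because empty cells contribute nothing to the sum.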

Page 17: Detecting novel associations in large data sets

maximal MI over all grids

formula labels: number of rows, number of columns

Page 18: Detecting novel associations in large data sets

characteristic matrix

normalisation factor (derived from the MI definition)

Infinite matrix!

Page 19: Detecting novel associations in large data sets

Maximal Information Coefficient

max grid size
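The formulas these three slides refer to were images and do not appear in the transcript. As a hedged reconstruction from the definitions in the presented paper (Reshef et al., 2011), with notation that may differ from the slides: the maximal mutual information over all x-by-y grids, the characteristic matrix with its normalisation factor, and MIC as the largest matrix entry within the maximal grid size B(n).

```latex
% Maximal mutual information over all grids G with x columns and y rows
I^{*}(D, x, y) = \max_{G} \, I(D|_{G})

% Characteristic matrix: entries normalised by \log \min\{x, y\}
M(D)_{x,y} = \frac{I^{*}(D, x, y)}{\log \min\{x, y\}}

% Maximal Information Coefficient: grid size limited by B(n), n = |D|
\mathrm{MIC}(D) = \max_{xy < B(n)} M(D)_{x,y}
```

Because mutual information on an x-by-y grid cannot exceed log min{x, y}, every entry of the matrix, and therefore MIC, lies in the interval [0, 1].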

Page 20: Detecting novel associations in large data sets

matrix computation

■ the space of grids grows exponentially
● B(n) ≤ O(n^(1-ε)) for 0 < ε < 1
■ approximation of MIC
● heuristic dynamic programming
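To make the search over grids concrete, here is a deliberately naive approximation. It is not the heuristic dynamic programming mentioned above: it only tries equal-frequency grids, so it can only underestimate the true MIC. The function names and the `alpha` parameter (the exponent in B(n) = n^alpha) are illustrative assumptions.

```python
import math
from collections import Counter


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding (almost) equally many points.

    Ties are broken by sort order, so equal values may land in different bins.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins


def mutual_information(xb, yb):
    """Mutual information in bits between two lists of bin labels."""
    n = len(xb)
    count_x, count_y, count_xy = Counter(xb), Counter(yb), Counter(zip(xb, yb))
    return sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def naive_mic(xs, ys, alpha=0.6):
    """Largest normalised MI over equal-frequency x-by-y grids with x*y < n**alpha."""
    n = len(xs)
    budget = n ** alpha  # B(n)
    best = 0.0
    for x in range(2, n + 1):
        if 2 * x >= budget:  # even the smallest grid with x columns is too large
            break
        xb = equal_frequency_bins(xs, x)
        for y in range(2, n + 1):
            if x * y >= budget:
                break
            yb = equal_frequency_bins(ys, y)
            score = mutual_information(xb, yb) / math.log2(min(x, y))
            best = max(best, score)
    return best


line = list(range(100))
print("linear:", naive_mic(line, [2 * v + 1 for v in line]))
wave_x = [i * 0.01 for i in range(1, 2000)]
print("sine:", naive_mic(wave_x, [math.sin(v) for v in wave_x]))
```

Restricting the search to equal-frequency grids keeps the sketch short and fast; the published algorithm instead optimises the cut points of one axis for each partition of the other.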

Page 21: Detecting novel associations in large data sets

MIC summary

✔ closed interval result
✔ non-linear relations
✔ all types of data
✖ B(n) is crucial
✖ B(n) too high: non-zero scores even for random data
✖ B(n) too low: we are searching only for simple patterns
✖ still univariate

Page 22: Detecting novel associations in large data sets

B(n) behaviour

Page 23: Detecting novel associations in large data sets

B(n) behaviour

Page 24: Detecting novel associations in large data sets


how to use it

Page 25: Detecting novel associations in large data sets

https://github.com/ajmazurie/xstats.MINE

python

import xstats.MINE as MINE

# Two lists of equal length; the None entries are kept as in the original example.
x = [40, 50, None, 70, 80, 90, 100, 110, 120, 130, 140, 150,
     160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260]
y = [-0.07, -0.23, -0.1, 0.03, -0.04, None, -0.28, -0.44,
     -0.09, 0.12, 0.06, -0.04, 0.31, 0.59, 0.34, -0.28, -0.09,
     -0.44, 0.31, 0.03, 0.57, 0, 0.01]

# analyze_pair returns a dictionary of statistics for the pair (x, y).
print("x y", MINE.analyze_pair(x, y))

Page 26: Detecting novel associations in large data sets

python: result

{'MCN': 2.5849625999999999,
 'MAS': 0.040419996,
 'pearson': 0.31553724,
 'MIC': 0.38196000000000002,
 'MEV': 0.27117000000000002,
 'non_linearity': 0.28239626000000001}

Page 27: Detecting novel associations in large data sets

correlation coefficients

         Pearson   Mutual Information   MI normalised    MIC
f0-f5       0.63                 2.45            0.74   0.24
f0-f1      -0.17                 1.57            0.47   0.24
f0-f2       1.00                 3.32            1.00   1.00
f2-f3      -0.08                 3.12            0.94   0.24
f0-f3      -0.08                 3.12            0.94   0.24

graph: one figure per row

Page 28: Detecting novel associations in large data sets

MIC summary

✔ closed interval result
✔ non-linear relations
✔ all types of data
✖ B(n) is crucial
✖ n is too low!
✖ still univariate

Page 29: Detecting novel associations in large data sets

python

import math
import xstats.MINE as MINE

# Sample the sine function on roughly [0, 20): a strong but non-linear relation.
x = [n * 0.01 for n in range(1, 2000)]
y = [math.sin(n) for n in x]
result = MINE.analyze_pair(x, y)

print("MIC:", result['MIC'])
print("Pearson:", result['pearson'])

>>> MIC: 0.99999
>>> Pearson: -0.16366038
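The Pearson figure can be checked without xstats.MINE. A minimal sketch in plain Python, where `pearson` is an illustrative helper and not part of any library:

```python
import math


def pearson(xs, ys):
    """Pearson's linear correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys))
    return cov / (spread_x * spread_y)


# Same data as on the slide: sin(x) sampled at 0.01, 0.02, ..., 19.99.
x = [n * 0.01 for n in range(1, 2000)]
y = [math.sin(v) for v in x]
r = pearson(x, y)
print("Pearson:", round(r, 2))
```

The result should be close to the -0.16 reported on the slide: a weak linear correlation, even though y is a deterministic function of x.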

Page 30: Detecting novel associations in large data sets


conclusion

Page 31: Detecting novel associations in large data sets

relationship types

Source: paper

Page 32: Detecting novel associations in large data sets

relationship types

Source: paper

Page 33: Detecting novel associations in large data sets

real application

Source: paper

Page 34: Detecting novel associations in large data sets

suggestions

■ use MIC only when you have lots of samples
● samples > 2000
■ use B(n) = n^0.6
■ don’t use it for all the possible pairs of features
● it is slower than Pearson’s correlation coefficient or Mutual Information
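A quick calculation puts numbers on these suggestions. The helper names `grid_budget` and `feature_pairs` are illustrative, and the reading of B(n) as a strict upper bound on the number of grid cells x*y follows the paper's definition.

```python
def grid_budget(n, alpha=0.6):
    """B(n) = n**alpha: upper bound on the number of cells x*y of a grid."""
    return n ** alpha


def feature_pairs(m):
    """Number of unordered pairs among m features."""
    return m * (m - 1) // 2


# How many grid cells MIC may use for different sample sizes.
for n in (10, 100, 2000):
    print(n, "samples -> B(n) =", round(grid_budget(n), 2))

# How many MIC computations an all-pairs analysis needs.
for m in (6, 1000):
    print(m, "features ->", feature_pairs(m), "pairs")
```

For the 10-sample example, B(10) is just under 4, the cell count of a 2-by-2 grid, which illustrates how little room such a small sample leaves for the grid search. The number of pairs grows quadratically with the number of features, so the cost of each MIC computation adds up quickly.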

Page 35: Detecting novel associations in large data sets

Thank you.

Page 36: Detecting novel associations in large data sets

references

■ D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334, no. 6062, pp. 1518-1524, 2011.
■ D. N. Reshef et al., “Supporting Online Material for Detecting Novel Associations in Large Data Sets.”