detecting novel associations in large data sets
DESCRIPTION
Paper presentation: D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334, no. 6062, pp. 1518-1524, 2011.
TRANSCRIPT
Michele Filannino
Presented paper: D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334, no. 6062, pp. 1518-1524, 2011.
CS-GN-TEAM: internal presentation
Manchester, 05/03/2012
detecting novel associations in large data sets
05/03/2012, Michele Filannino
presentation my research taster project
where we are
Introduction
novel association
■ two variables, X and Y, are associated if there is a relationship between them
● functional
● non-functional
■ novel: not previously known
Data set 10x6
example
f0 f1 f2 f3 f4 f5
s0 4.00 -0.76 5.00 12.00 8.22 1.83
s1 9.00 0.41 10.00 23.00 27.12 4.30
s2 3.00 0.14 4.00 0.00 0.56 -0.43
s3 10.00 -0.54 11.00 100.00 94.02 6.24
s4 5.00 -0.96 6.00 45.00 39.25 3.56
s5 2.00 0.91 3.00 123.00 125.73 2.97
s6 7.00 0.66 8.00 4.00 9.26 2.56
s7 8.00 0.99 9.00 -2.00 6.90 2.37
s8 1.00 0.84 2.00 36.00 37.68 1.58
s9 6.00 -0.28 7.00 0.00 -1.96 0.71
f2(x) = f0(x) + 1
scatter plot: f0 vs. f2
no relation
scatter plot: f0 vs. f1
correlation coefficients
        Pearson   Mutual Information   MI norm.
f0-f5    0.63     2.45                 0.74
f0-f1   -0.17     1.57                 0.47
f0-f2    1.00     3.32                 1.00
f2-f3   -0.08     3.12                 0.94
f0-f3   -0.08     3.12                 0.94
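The Pearson and Mutual Information columns can be reproduced with plain Python. This is a sketch under an assumption the slides do not state: the Mutual Information figures match when each feature is rounded to the nearest integer and the rounded values are treated as categories. The normalised column then corresponds to dividing by log2(10), the largest value obtainable with 10 samples. The helper names are illustrative.

```python
import math
from collections import Counter

# Columns of the 10x6 example data set (rows s0..s9).
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]
f2 = [5, 10, 4, 11, 6, 3, 8, 9, 2, 7]
f3 = [12, 23, 0, 100, 45, 123, 4, -2, 36, 0]
f5 = [1.83, 4.30, -0.43, 6.24, 3.56, 2.97, 2.56, 2.37, 1.58, 0.71]


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of categorical labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """MI in bits. Assumption: values are rounded to integers, then treated as categories."""
    xr = [round(v) for v in x]
    yr = [round(v) for v in y]
    return entropy(xr) + entropy(yr) - entropy(list(zip(xr, yr)))


pairs = {"f0-f5": (f0, f5), "f0-f1": (f0, f1), "f0-f2": (f0, f2),
         "f2-f3": (f2, f3), "f0-f3": (f0, f3)}
for name, (a, b) in pairs.items():
    mi = mutual_information(a, b)
    print(name, round(pearson(a, b), 2), round(mi, 2),
          round(mi / math.log2(len(a)), 2))
```

Rounding is only one discretisation that happens to agree with the table; a different binning would give different Mutual Information values, which is part of the motivation for MIC.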
pros. & cons.
■ Pearson’s coefficient
✔ closed interval result
✖ only linear relations
✖ feature independency
■ Mutual Information
✔ non-linear relations
✖ only categorical data
✖ biased towards higher-arity features
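The "only linear relations" limitation can be illustrated with a small sketch (not taken from the slides): for a parabola sampled symmetrically around zero, y is fully determined by x, yet the covariance cancels out and Pearson's coefficient is zero up to rounding. The `pearson` helper is illustrative.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# A perfectly deterministic, but non-linear, relationship.
x = list(range(-5, 6))
y = [v * v for v in x]
r = pearson(x, y)
print("Pearson for y = x^2 on a symmetric range:", r)
```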
the new measure
motivations
■ generality:
● capture a wide range of interesting associations, not limited to specific function types
■ equitability:
● give similar scores to equally noisy relationships of different types
definition of MIC
■ Given a finite set D of ordered pairs, we can partition the X-values of D into x bins and the Y-values of D into y bins
■ We obtain a pair of partitions called an x-by-y grid
D = (F0, F1)
F0 = (1.00, 2.00, 3.00 | 4.00, 5.00, 6.00 | 7.00, 8.00, 9.00, 10.00)
F1 = (-0.96, -0.76 | -0.54, -0.28 | 0.14 | 0.41 | 0.66, 0.84, 0.91, 0.99)
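As a minimal sketch of what such a partition does, the snippet below assigns each (f0, f1) pair of the example data set to a grid cell. The cut points are an assumption: they are placed halfway between the neighbouring values at each "|" in the partition written above (three X bins and five Y bins as listed).

```python
from bisect import bisect_right
from collections import Counter

# The (f0, f1) pairs of the example data set, rows s0..s9.
f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]

# Assumed cut points, midway between the values on either side of each "|".
x_cuts = [3.5, 6.5]
y_cuts = [-0.65, -0.07, 0.275, 0.535]


def cell(x, y):
    """Return the (column, row) index of the grid cell containing (x, y)."""
    return bisect_right(x_cuts, x), bisect_right(y_cuts, y)


# Number of points of D falling in each non-empty cell.
counts = Counter(cell(x, y) for x, y in zip(f0, f1))
for (col, row), c in sorted(counts.items()):
    print("column", col, "row", row, "points", c)
```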
2-by-4 grid
x-by-y grid
definition of MIC
■ given the grid, we can compute D|G, the frequency distribution induced by the points in D on the cells of G
● different grids G result in different distributions D|G
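A short sketch of this step, using the (f0, f1) pairs from the example data set: it builds D|G for two different grids and computes the mutual information of each induced distribution. The cut points and function names are illustrative assumptions, not values from the slides.

```python
import math
from bisect import bisect_right
from collections import Counter

f0 = [4, 9, 3, 10, 5, 2, 7, 8, 1, 6]
f1 = [-0.76, 0.41, 0.14, -0.54, -0.96, 0.91, 0.66, 0.99, 0.84, -0.28]


def grid_distribution(xs, ys, x_cuts, y_cuts):
    """Return D|G: the fraction of points falling in each (column, row) cell."""
    n = len(xs)
    counts = Counter((bisect_right(x_cuts, x), bisect_right(y_cuts, y))
                     for x, y in zip(xs, ys))
    return {cell: c / n for cell, c in counts.items()}


def mutual_information(dist):
    """Mutual information (in bits) of a joint distribution over grid cells."""
    px, py = Counter(), Counter()
    for (col, row), p in dist.items():
        px[col] += p  # marginal over columns
        py[row] += p  # marginal over rows
    return sum(p * math.log2(p / (px[col] * py[row]))
               for (col, row), p in dist.items())


# Two grids over the same data: a 3-by-5 grid and a 2-by-2 median split.
mi_fine = mutual_information(
    grid_distribution(f0, f1, [3.5, 6.5], [-0.65, -0.07, 0.275, 0.535]))
mi_coarse = mutual_information(grid_distribution(f0, f1, [5.5], [0.275]))
print("3-by-5 grid:", round(mi_fine, 3))
print("2-by-2 grid:", round(mi_coarse, 3))
```

The two grids give clearly different values for the same ten points, which is why the measure has to search over grids rather than fix one in advance.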
maximal MI over all grids
number of rows, number of columns
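In the presented paper (Reshef et al., 2011), this quantity is defined as the largest mutual information achievable by any grid with x columns and y rows, where D|_G is the distribution induced by D on the cells of G:

```latex
I^{*}(D, x, y) = \max_{G \,:\, G \text{ is an } x\text{-by-}y \text{ grid}} I\left(D|_{G}\right)
```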
characteristic matrix
normalisation factor (derived from the MI definition)
Infinite matrix!
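In the presented paper, each entry of the characteristic matrix is the maximal mutual information over all x-by-y grids, written I*(D, x, y), divided by its largest possible value, so that every entry lies between 0 and 1:

```latex
M(D)_{x,y} = \frac{I^{*}(D, x, y)}{\log \min\{x, y\}}
```

The matrix is infinite because x and y can be any integers greater than 1.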
Maximal Information Coeff.
max grid size
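In the presented paper, MIC is the largest entry of the characteristic matrix M(D) among grids whose number of cells stays below the bound B(n), where n is the number of samples:

```latex
\mathrm{MIC}(D) = \max_{xy < B(n)} \left\{ M(D)_{x,y} \right\}
```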
matrix computation
■ space of grids grows exponentially
● B(n) ≤ O(n^(1-ε)) for 0 < ε < 1
■ approximation of MIC
● heuristic dynamic programming
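To make the search concrete, here is a deliberately naive stand-in, not the authors' heuristic dynamic programming: it only tries equal-frequency partitions on both axes for every grid size with x*y < B(n), so it can only underestimate the true MIC. The function names and the `alpha` parameter (the exponent in B(n) = n^alpha) are illustrative.

```python
import math
from collections import Counter


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding nearly equal numbers of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def grid_mutual_information(xb, yb):
    """Mutual information (in bits) of the cell distribution given bin labels."""
    n = len(xb)
    px, py, pxy = Counter(xb), Counter(yb), Counter(zip(xb, yb))
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def naive_mic(x, y, alpha=0.6):
    """Lower bound on MIC using only equal-frequency grids with nx*ny < n**alpha."""
    b = len(x) ** alpha
    best = 0.0
    nx = 2
    while nx * 2 < b:
        xb = equal_frequency_bins(x, nx)
        ny = 2
        while nx * ny < b:
            yb = equal_frequency_bins(y, ny)
            # Normalise by log2(min(nx, ny)) as in the characteristic matrix.
            score = grid_mutual_information(xb, yb) / math.log2(min(nx, ny))
            best = max(best, score)
            ny += 1
        nx += 1
    return best


xs = list(range(-100, 100))
linear = naive_mic(xs, [2 * v + 1 for v in xs])
parabola = naive_mic(xs, [v * v for v in xs])
print("linear:", round(linear, 3), "parabola:", round(parabola, 3))
```

Even this crude search scores both noiseless relationships highly, but restricting the search to equal-frequency grids is exactly what the real algorithm avoids by optimising one axis partition with dynamic programming.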
MIC summary
✔ closed interval result
✔ non-linear relations
✔ all types of data
✖ B(n) is crucial
● too high: non-zero scores even for random data
● too low: only simple patterns are searched for
✖ still univariate
B(n) behaviour
how to use it
https://github.com/ajmazurie/xstats.MINE
python
import xstats.MINE as MINE

# Two series of equal length; missing values are given as None.
x = [40,50,None,70,80,90,100,110,120,130,140,150,
     160,170,180,190,200,210,220,230,240,250,260]
y = [-0.07,-0.23,-0.1,0.03,-0.04,None,-0.28,-0.44,
     -0.09,0.12,0.06,-0.04,0.31,0.59,0.34,-0.28,-0.09,
     -0.44,0.31,0.03,0.57,0,0.01]

# Print the statistics computed for the pair (x, y).
print("x y {}".format(MINE.analyze_pair(x, y)))
python: result
{'MCN': 2.5849625999999999,
'MAS': 0.040419996,
'pearson': 0.31553724,
'MIC': 0.38196000000000002,
'MEV': 0.27117000000000002,
'non_linearity': 0.28239626000000001}
correlation coefficients
        Pearson   Mutual Information   MI norm.   MIC
f0-f5    0.63     2.45                 0.74       0.24
f0-f1   -0.17     1.57                 0.47       0.24
f0-f2    1.00     3.32                 1.00       1.00
f2-f3   -0.08     3.12                 0.94       0.24
f0-f3   -0.08     3.12                 0.94       0.24
MIC summary
✔ closed interval result
✔ non linear relations
✔ all types of data
✖ B(n) is crucial
✖ n is too low!
✖ still univariate
python
import math
import xstats.MINE as MINE

# 1999 samples of a sine wave over (0, 20): a strong but non-linear relationship.
x = [n*0.01 for n in range(1,2000)]
y = [math.sin(n) for n in x]
result = MINE.analyze_pair(x, y)
print("MIC: {}".format(result['MIC']))
print("Pearson: {}".format(result['pearson']))
>>> MIC: 0.99999
>>> Pearson: -0.16366038
conclusion
Source: paper
relationship types
Source: paper
real application
suggestions
■ use MIC only when you have lots of samples
● samples > 2000
■ use B(n) = n^0.6
■ don’t use it for all the possible pairs of features
● it is slower than Pearson’s correlation coefficient or Mutual Information
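A quick piece of arithmetic behind these suggestions, as a sketch with illustrative helper names: B(n) = n^0.6 caps the number of grid cells, and the number of feature pairs grows quadratically with the number of features.

```python
def max_grid_cells(n, alpha=0.6):
    """Upper bound B(n) = n**alpha on the number of grid cells x*y."""
    return n ** alpha


def feature_pairs(m):
    """Number of distinct unordered pairs among m features."""
    return m * (m - 1) // 2


# How the grid budget grows with the number of samples.
for n in (10, 100, 2000):
    print("n =", n, "B(n) =", round(max_grid_cells(n), 2))

# Number of pairs to score for the 6-feature example and for 1000 features.
print(feature_pairs(6), feature_pairs(1000))
```

With the 10 samples of the example data set, B(n) is just under 4 cells, whereas with 2000 samples it is close to 96, which leaves far more room for grids that can follow a complex relationship.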
Thank you.
references
■ D. N. Reshef et al., “Detecting Novel Associations in Large Data Sets,” Science, vol. 334, no. 6062, pp. 1518-1524, 2011.
■ D. N. Reshef et al., “Supporting Online Material for Detecting Novel Associations in Large Data Sets.”