Nonparametric information estimation using rank-based Euclidean graph optimization methods
Barnabás Póczos
Department of Computing Science,
University of Alberta, Canada
Machine Learning Lunch
Carnegie Mellon University
Feb 8, 2010
Outline
GOAL: Information estimation
Background on information and entropy
Applications in machine learning
Graph optimization methods
TSP, MST, kNN
Copula transformation
Information estimation
Background on information and entropy
Who cares about dependence?
• Observing physical, biological, chemical systems
– Which observations are dependent / independent?
– Is there dependence between inputs and outputs?
• Neuroscience: dependence between stimuli and neural response
• Stock markets
– Which stocks are dependent on each other, and which are independent?
– Do they depend on other events?
What is dependence?
We want to measure dependence
Correlation?
… nope that’s not good enough…
Correlation can be zero even when the variables are dependent.
Isn’t this only an artificial problem?
It is a serious problem…
Correlation?… nope that’s not good enough
Correlation measures only pairwise dependencies.
What can we do when we have more variables?
[Figure: scatter plot of X1 vs. X2. The covariance matrix is I and the cross-correlation is 0, yet the variables are dependent.]
This is why PCA cannot separate mixed sources, and we need ICA methods that also account for higher-order dependencies (see the sketch below).
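A minimal numerical illustration of this point (my own example, not from the talk): take X1 ~ N(0,1) and X2 = s·X1 with an independent random sign s; the covariance matrix is the identity and the cross-correlation is zero, yet X2² = X1², so the two variables are strongly dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X1 is standard normal; X2 is X1 with an independent random sign flip.
x1 = rng.standard_normal(n)
s = rng.choice([-1.0, 1.0], size=n)
x2 = s * x1

# The covariance matrix is approximately the identity ...
print(np.cov(np.vstack([x1, x2])))

# ... yet x2**2 equals x1**2, a deterministic dependence that
# correlation is completely blind to.
print(np.corrcoef(x1**2, x2**2)[0, 1])  # = 1.0
```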
What is dependence?
Alfréd Rényi
1961, Hungary
α-divergence
Special cases:
• Rényi's α-information
• Shannon's mutual information
Theorem:
1967, Hungary
Imre Csiszár
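The formula itself was lost in the slide export; the standard definition of Rényi's α-divergence (my reconstruction, matching the notation used later in the talk) is:

```latex
D_\alpha(p \,\|\, q)
  \;=\; \frac{1}{\alpha - 1}\,
  \log \int p^{\alpha}(x)\, q^{1-\alpha}(x)\,\mathrm{d}x,
  \qquad \alpha > 0,\ \alpha \neq 1,
\qquad
\lim_{\alpha \to 1} D_\alpha(p \,\|\, q) \;=\; \mathrm{KL}(p \,\|\, q).
```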
Measuring Dependence and Uncertainty
Shannon mutual information
Measuring uncertainty (Shannon entropy)
1948, Princeton
Claude Shannon
Rényi’s information and entropy
Rényi’s entropy
Rényi’s information
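The slide equations are again missing; the standard definitions (my reconstruction) are:

```latex
H_\alpha(X) \;=\; \frac{1}{1-\alpha}\,\log \int f^{\alpha}(x)\,\mathrm{d}x,
\qquad
I_\alpha(X_1,\ldots,X_d) \;=\;
  D_\alpha\!\Big( f(x_1,\ldots,x_d) \,\Big\|\, \prod_{i=1}^{d} f_i(x_i) \Big).
```

As α → 1 these recover Shannon's entropy and mutual information.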
Applications of mutual information in machine learning
Feature selection (Peng et al., 2005)
Goal:
• find a set of features that has the largest dependence with the target class
• find dependent and independent features
Microarray data processing: data matrix X^T ∈ R^{n×d} with labels
d = number of genes
n = number of patients
Feature selection (Peng et al., 2005)
X^T ∈ R^{n×d} with labels; d = number of genes, n = number of patients
• Max dependency: select the feature set S that maximizes I(X_S; Y)
• Max relevance: maximize the average I(X_i; Y) over the selected features
• Min redundancy, Max relevance (mRMR): we don't want dependent features, so also penalize the average pairwise I(X_i; X_j) within S (a greedy sketch follows)
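A minimal greedy sketch of the mRMR idea (my illustration, not Peng et al.'s code), assuming a generic mutual_info(a, b) estimator; here a crude histogram plug-in estimate stands in for it:

```python
import numpy as np

def mutual_info(a, b, bins=16):
    """Crude plug-in MI estimate from a 2D histogram (illustration only)."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz])).sum())

def mrmr_select(X, y, num_features):
    """Greedy mRMR: at each step add the feature maximizing
    MI(feature, y) - mean MI(feature, already selected features)."""
    d = X.shape[1]
    relevance = [mutual_info(X[:, j], y) for j in range(d)]
    selected = [int(np.argmax(relevance))]        # start from max relevance
    while len(selected) < num_features:
        candidates = [j for j in range(d) if j not in selected]
        scores = [relevance[j]
                  - np.mean([mutual_info(X[:, j], X[:, s]) for s in selected])
                  for j in candidates]
        selected.append(candidates[int(np.argmax(scores))])
    return selected
```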
Independent Subspace Analysis
Observation: X = AS
Hidden, independent sources (subspaces): S = (S^1, …, S^M), where the subspace components S^m are independent of each other
Goal: estimate A and S observing samples from X only
Independent Subspace Analysis
In case of perfect separation, WA is a block permutation matrix.
Objective:
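Since the slide equations are missing, here is the standard ISA formulation (my reconstruction): the observation is a linear mixture of sources grouped into M mutually independent subspaces, and the objective drives the estimated subspace components toward independence.

```latex
X = AS, \qquad S = (S^1; \ldots; S^M), \quad
  S^1, \ldots, S^M \ \text{mutually independent subspaces},
\\[4pt]
Y = WX = (Y^1; \ldots; Y^M), \qquad
J(W) \;=\; I\big(Y^1, \ldots, Y^M\big) \;\to\; \min_W .
```

After whitening, with W restricted to be orthogonal, this is equivalent to minimizing the sum of the subspace entropies Σ_m H(Y^m).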
Mutual Information Applications
• Information theory, Channel capacity
• Information geometry
• Clustering
• Optimal experiment design, active learning
• Prediction of protein structure
• Drug design
• Boosting
• fMRI data processing
• …
The estimation problem
The estimation problem
GOAL: estimate Rényi's entropy and information from i.i.d. samples, without having to estimate the underlying densities.
How can we estimate these quantities?
Graph optimization methods
TSP, MST, kNN
History
• 1959, Beardwood, Halton, Hammersley
– Given uniform iid samples on [0,1]², what is the TSP tour length?
• 1976, Karp
– randomly distributed cities
– This asymptotic result can be used to develop a deterministic, polynomial-time, near-optimal (1+ε) TSP algorithm (a.s.)
– first to show that there exists an algorithm for a stochastic version of an NP-complete problem (TSP) that gives a near-optimal solution in polynomial time (a.s.)
John Hammersley
Oxford, UK
Richard Karp
Berkeley, CA
– uniform iid samples on [0,1]^d: what is the TSP tour length? (see the limit below)
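The answer shown on the slide did not survive the export; the Beardwood–Halton–Hammersley result states that for i.i.d. samples with density f on [0,1]^d, the length of the shortest TSP tour satisfies

```latex
\lim_{n \to \infty}
  \frac{L_{\mathrm{TSP}}(X_1, \ldots, X_n)}{n^{(d-1)/d}}
  \;=\; \beta_d \int_{[0,1]^d} f^{(d-1)/d}(x)\,\mathrm{d}x
  \qquad \text{a.s.}
```

In particular, for uniform samples on [0,1]² the tour length grows like β₂ √n.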
• 1981 – TSP ⇒ MST, minimal matching graphs; conjecture for kNN
• Rényi's α-entropy estimation?
Given: observations with density f on [0,1]^d
• Beardwood, Halton, Hammersley
Michael Steele Joseph Yukich Wansoo Rhee
1999, Hero, Costa & Michel
Steele, Yukich theorem for MST, TSP, Minimal Matching
(Fig. taken from J. Costa and A. Hero)
Sensitive to outliers…!
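A reconstruction of the quoted result: for quasi-additive Euclidean graph functionals (TSP, MST, minimal matching) with edge weights |e|^γ, 0 < γ < d, Steele and Yukich prove the same kind of almost-sure limit, and Hero, Costa & Michel turned it into a Rényi entropy estimator with α = (d-γ)/d:

```latex
\lim_{n\to\infty}
  \frac{L_\gamma(X_1,\ldots,X_n)}{n^{(d-\gamma)/d}}
  \;=\; \beta_{d,\gamma} \int f^{(d-\gamma)/d}(x)\,\mathrm{d}x
  \qquad \text{a.s.},
\\[4pt]
\widehat{H}_\alpha \;=\; \frac{1}{1-\alpha}
  \left[ \log \frac{L_\gamma(X_1,\ldots,X_n)}{n^{\alpha}}
         - \log \beta_{d,\gamma} \right]
  \;\longrightarrow\; H_\alpha(f)
  \quad \text{a.s.}, \qquad \alpha = \frac{d-\gamma}{d}.
```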
A new result: this procedure works on Set-Nearest Neighborhood graphs, too.
Calculate: the total edge length of the S-nearest-neighbor graph (a sketch follows).
S = {1,3,7} ⇒ consider the 1st, 3rd, and 7th nearest neighbors of each node
S = {k} ⇒ consider only the kth nearest neighbor of each node
S = {1,2,…,k} ⇒ kNN graphs
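A small sketch of the quantity being computed (my illustration, not the authors' code), using scipy's KD-tree: the total edge length of the S-nearest-neighbor graph, optionally with each edge length raised to a power γ.

```python
import numpy as np
from scipy.spatial import cKDTree

def snn_graph_length(points, S, gamma=1.0):
    """Total sum of |edge|**gamma over the Set-Nearest-Neighbor graph:
    each node is joined to its s-th nearest neighbor for every s in S."""
    tree = cKDTree(points)
    k = max(S) + 1                 # +1 because column 0 is the point itself
    dists, _ = tree.query(points, k=k)
    cols = sorted(S)               # the s-th neighbor sits in column s
    return float((dists[:, cols] ** gamma).sum())

X = np.random.default_rng(0).random((500, 2))
print(snn_graph_length(X, S={1, 3, 7}))         # 1st, 3rd, 7th neighbors
print(snn_graph_length(X, S=set(range(1, 6))))  # kNN graph with k = 5
```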
Rényi’s entropy estimators using Set-Nearest Neighborhood graphs
(Fig. taken from J. Costa and A. Hero)
How can we get information estimators from entropy estimators?
How can we make the estimators more robust?
The invariance trick
Information is preserved under strictly monotone transformations applied to each variable separately.
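In formulas (a standard fact): for strictly increasing functions g_1, …, g_d applied coordinate-wise,

```latex
I\big(X_1, \ldots, X_d\big)
  \;=\; I\big(g_1(X_1), \ldots, g_d(X_d)\big).
```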
Transformation to get uniform marginals
Is there a monotone transformation leading to uniform marginals?
Prob theory 101: if X_i has a continuous distribution function F_i, then F_i(X_i) is uniform on [0,1].
The transformation (copula transformation): monotone transform ⇒ uniform marginals
• The information estimation problem is reduced to Rényi's entropy estimation
• A little problem: we don't know the F_i distribution functions
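Written out (a standard derivation, reconstructed here): with Z_i = F_i(X_i) each Z_i is uniform on [0,1], the product of the marginals of Z is the uniform density on [0,1]^d, and the information turns into a (negative) entropy:

```latex
Z_i = F_i(X_i) \sim \mathcal{U}[0,1], \qquad
I_\alpha(X_1,\ldots,X_d)
  \;=\; D_\alpha\big( f_Z \,\big\|\, \mathbf{1}_{[0,1]^d} \big)
  \;=\; -H_\alpha(Z_1,\ldots,Z_d).
```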
Copula methods
• Very popular in financial analysis
– Capture dependence between marginal variables
– An alternative form of normalization
• … but not so widespread in machine learning…
Well, in financial analysis they led to the global financial crisis…
Maybe we should use them in ML more often!
The copula transformation
Monotone transform ⇒ uniform marginals
A little problem: we don't know the F_i distribution functions.
Solution:
Use the empirical distribution functions F_j^n and the empirical copula transform.
We need this in 1D only! ⇒ no curse of dimensionality.
Empirical copula transformation
We don't know the distribution functions F_1, …, F_d
⇒ estimate them with the empirical distribution functions
Empirical copula transformation
“true” copula transformation: Z = (F_1(X_1), …, F_d(X_d))
empirical copula transformation: replace each F_j by the empirical distribution function F_j^n(x) = (1/n) Σ_{i=1}^n 1{X_j^{(i)} ≤ x}
empirical copula transformation of the observations: Ẑ^{(i)} = (F_1^n(X_1^{(i)}), …, F_d^n(X_d^{(i)})), i.e., the coordinate-wise ranks divided by n
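A rank-based implementation of the empirical copula transform (my sketch): replace each coordinate by its within-column rank divided by n.

```python
import numpy as np
from scipy.stats import rankdata

def empirical_copula(X):
    """Apply the empirical distribution function of each coordinate:
    map every column of the n-by-d sample X to its ranks / n."""
    n = X.shape[0]
    return np.column_stack([rankdata(X[:, j]) / n for j in range(X.shape[1])])

X = np.random.default_rng(0).standard_normal((1000, 3))
Z = empirical_copula(X)   # marginals are (discretely) uniform on (0, 1]
```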
REGO Algorithm: Rank-based Euclidean Graph Optimization
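Putting the pieces together, a minimal sketch of the REGO pipeline under the assumptions above (an illustration, not the authors' implementation): empirical copula transform, then the total MST edge length with weights |e|^γ, then the graph-based Rényi entropy estimate, negated to give the α-information. The constant β from the limit theorem is unknown; here it is calibrated crudely by Monte Carlo on uniform samples.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial import distance_matrix
from scipy.stats import rankdata

def empirical_copula(X):
    """Coordinate-wise ranks / n (empirical copula transform)."""
    return np.column_stack([rankdata(col) / len(col) for col in X.T])

def mst_length(points, gamma=1.0):
    """Sum of MST edge lengths, each raised to the power gamma."""
    D = distance_matrix(points, points) ** gamma
    return float(minimum_spanning_tree(D).sum())

def rego_information(X, gamma=1.0, n_calib=20):
    """REGO sketch: alpha-information estimate with alpha = (d - gamma)/d."""
    n, d = X.shape
    alpha = (d - gamma) / d
    rng = np.random.default_rng(0)
    # Crude Monte Carlo calibration of log(beta), the unknown constant
    # from the Steele-Yukich limit, on uniform reference samples.
    log_beta = np.mean([np.log(mst_length(rng.random((n, d)), gamma) / n**alpha)
                        for _ in range(n_calib)])
    Z = empirical_copula(X)
    h_alpha = (np.log(mst_length(Z, gamma) / n**alpha) - log_beta) / (1 - alpha)
    return -h_alpha    # I_alpha(X) = -H_alpha(empirical copula of X)

X = np.random.default_rng(1).standard_normal((400, 2)) @ np.array([[1.0, 0.8],
                                                                   [0.0, 0.6]])
print(rego_information(X))   # larger values indicate stronger dependence
```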
Theoretical Results
consistency
Is consistency trivial?
Difficulties:
• The true copula transformation is not known… only the empirical
• Samples lie on a grid, not coming from the true copula distribution
[Figure: samples after the true copula transform vs. the empirical copula transform; the empirical version places the points on a regular grid.]
Is consistency trivial?
Difficulties:
• Empirical copula transformed points are not i.i.d.
⇒ Steele & Yukich's theorem cannot be applied
We need different proofs for the two cases:
• 0 < p ≤ 1: we have the triangle inequality for ‖·‖_p, but it is not Lipschitz
• 1 < p < d/2: ‖·‖_p is Lipschitz, but there is no triangle inequality
Consistency Results
Main theorem:
Note:
• Consistent MI estimator using ranks only
• Ranks only ⇒ robust
• We don't have theoretical results for other d and α values…
Empirical Results
Empirical results, consistency in 2D
[Figure: estimate vs. sample size n, with k_MST = 0.8n.]
Empirical results, consistency in 10D
Image registration
n = 4900 = 70 × 70 pixels, 200 outliers
Independent subspace analysis
Independent subspace analysis in the presence of outliers
Infinitesimal robustness (finite sample influence function)
The finite sample influence function measures the amount of change in the estimator caused by adding one outlier x (with some modification).
It cannot be arbitrarily big!
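The formula is missing from the slide; what I believe is meant (up to the slide's "some modification") is Tukey's finite-sample influence function, the rescaled change when one outlier x is appended to the sample:

```latex
\mathrm{IF}_n\big(x;\,\widehat{\theta}\big)
  \;=\; (n+1)\Big( \widehat{\theta}(X_1,\ldots,X_n,x)
                 - \widehat{\theta}(X_1,\ldots,X_n) \Big).
```

The claim is that for the rank-based estimator this quantity stays bounded in x.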
Conclusion
The marriage of seemingly unrelated mathematical areas produced a consistent and robust Rényi information estimator:
Graph optimization methods
TSP, MST, kNN
Copula transformation
Information estimation
Thanks for your attention!
“Hey Rege, Róka Rege, Hey REGŐ Rejtem…
I am calling my grandfather, I feel his soul here.
I am listening to his words, they are breaking the silence.
Impart your knowledge to your grandson please,
1000 years have already passed…,
Knowledge equals life, hey!”