Sparse Nonparametric Density Estimation in High Dimensions Using the Rodeo
Han Liu1,2 John Lafferty2,3 Larry Wasserman1,2
1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University
July 1st, 2006
Motivation
Research background
Rodeo is a general strategy for nonparametric inference. It has been successfully applied to solve sparse nonparametric regression problems in high dimensions by Lafferty & Wasserman (2005).
Our goal
We adapt the rodeo framework to nonparametric density estimation problems, so that we have a unified framework for both density estimation and regression that is computationally efficient and theoretically sound.
Outline
1. Background: nonparametric density estimation in high dimensions; sparsity assumptions for density estimation
2. Methodology and Algorithms: the main idea; the local rodeo algorithm for the kernel density estimator
3. Asymptotic Properties: the asymptotic running time and minimax risk
4. Extension and Variations: the global density rodeo and the reverse density rodeo; using other distributions as irrelevant dimensions
5. Experimental Results: empirical results on both synthetic and real-world datasets
Problem statement
Problem: estimate the joint density of a continuous d-dimensional random vector

$$X = (X_1, X_2, \ldots, X_d) \sim F, \qquad d \gg 3$$

where F is the unknown distribution with density function f(x).
This problem is intrinsically hard, since the high dimensionality causes both computational and theoretical problems.
Previous work
From a frequentist perspective
- Kernel density estimation and the local likelihood method
- Projection pursuit
- Log-spline models and the penalized likelihood method
From a Bayesian perspective
Mixture of normals with Dirichlet processes as prior
Difficulties of current approaches
- Some methods only work well for low-dimensional problems
- Some heuristics lack theoretical guarantees
- More importantly, they suffer from the curse of dimensionality
The curse of dimensionality
Characterizing the curse
In a Sobolev space of order k, minimax theory shows that the best convergence rate for the mean squared error is

$$R_{\text{opt}} = O\!\left(n^{-2k/(2k+d)}\right)$$

which is impractically slow when the dimension d is large.
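For concreteness, a quick plug-in with illustrative numbers (ours, not the slides'): with smoothness $k = 2$ and dimension $d = 20$,

$$R_{\text{opt}} = O\!\left(n^{-2 \cdot 2 / (2 \cdot 2 + 20)}\right) = O\!\left(n^{-1/6}\right),$$

so cutting the risk in half requires roughly $2^6 = 64$ times more samples.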
Combating the curse by some sparsity assumptions
If the high-dimensional data has a low-dimensional structure or satisfies a sparsity condition, we expect that methods can be developed to combat the curse of dimensionality. This motivates the development of the rodeo framework.
Rodeo for nonparametric regression (I)
Rodeo (regularization of derivative expectation operator) is a general strategy for nonparametric inference, which has been used for nonparametric regression.
For a regression problem
$$Y_i = m(X_i) + \epsilon_i, \qquad i = 1, \ldots, n$$

where $X_i = (X_{i1}, \ldots, X_{id}) \in \mathbb{R}^d$ is a d-dimensional covariate. If m is in a d-dimensional Sobolev space of order 2, the best convergence rate for the risk is

$$R^* = O\!\left(n^{-4/(4+d)}\right)$$

which shows the curse of dimensionality in the regression setting.
Rodeo for nonparametric regression (II)
Assume the true function depends on only r covariates ($r \ll d$):

$$m(x) = m(x_1, \ldots, x_r)$$

For any $\epsilon > 0$, the rodeo can simultaneously perform bandwidth selection and (implicitly) variable selection to achieve the better minimax convergence rate

$$R_{\text{rodeo}} = O\!\left(n^{-4/(4+r)+\epsilon}\right)$$

as if the r relevant variables had been explicitly isolated in advance.
Rodeo beats the curse of dimensionality in this sense. We expect to apply the same idea to solve density estimation problems.
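To see the gain concretely (again with illustrative numbers): with $d = 20$ ambient dimensions of which only $r = 3$ are relevant,

$$R^* = O\!\left(n^{-4/24}\right) = O\!\left(n^{-1/6}\right) \quad \text{versus} \quad R_{\text{rodeo}} = O\!\left(n^{-4/7+\epsilon}\right),$$

i.e., rodeo attains almost the rate of a 3-dimensional problem.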
Sparse density estimation
For many applications, the true density function can be characterized by some low-dimensional structure.
Sparsity assumption for density estimation problems
Assume $h_{jj}(x)$ is the second partial derivative of h with respect to the j-th variable. There exists some $r \ll d$ such that

$$f(x) \propto g(x_1, \ldots, x_r)\,h(x) \quad \text{where } h_{jj}(x) = 0 \text{ for } j = 1, \ldots, d,$$

where $x_R = (x_1, \ldots, x_r)$ are the relevant dimensions. This condition imposes that $h(\cdot)$ belongs to a family of very smooth functions (e.g., the uniform distribution).
h(·) can be generalized to be any parametric distribution!
Generalized sparse density estimation
We can generalize h(·) to other distributions (e.g. Gaussian).
General sparsity assumption for density estimation problems
Assume $h(\cdot)$ is any distribution (e.g., Gaussian) that we are not interested in:

$$f(x) \propto g(x_1, \ldots, x_r)\,h(x) \quad \text{where } r \ll d.$$

Thus, the density function $f(\cdot)$ factors into two parts: the relevant component $g(\cdot)$ and the irrelevant component $h(\cdot)$, where $x_R = (x_1, \ldots, x_r)$ are the relevant dimensions.
Under this framework, we can hope to achieve a better minimax rate
$$R^*_{\text{rodeo}} = O\!\left(n^{-4/(4+r)}\right)$$
Related work
Recent work that addressed this problem
- Minimum volume sets (Scott & Nowak, JMLR 2006)
- Non-Gaussian component analysis (Blanchard et al., JMLR 2006)
- Log-ANOVA models (Lin & Jeon, Statistica Sinica 2006)
Advantages of our approach:
- Rodeo can utilize well-established nonparametric estimators
- A unified framework for different kinds of problems
- Easy to implement and amenable to theoretical analysis
Density rodeo: the main idea
The key intuition: if a dimension is irrelevant, then changing the smoothing parameter of that dimension should only result in a small change in the whole estimator.
Basically, rodeo is just a regularization strategy:
- Use a kernel density estimator, starting with large bandwidths
- Calculate the gradient of the estimator w.r.t. the bandwidths
- Sequentially decrease the bandwidths in a greedy way, and freeze the decay process by a thresholding strategy to achieve a sparse solution
Density rodeo: the main idea
Fix a point x and let $\hat f_H(x)$ denote an estimator of f(x) based on the smoothing parameter matrix $H = \operatorname{diag}(h_1, \ldots, h_d)$. Let $M(h) = \mathbb{E}(\hat f_h(x))$ denote the mean of $\hat f_h(x)$, so that $f(x) = M(0) = \mathbb{E}(\hat f_0(x))$. Assume $\mathcal{P} = \{h(t) : 0 \le t \le 1\}$ is a smooth path through the set of smoothing parameters with $h(0) = 0$ and $h(1) = 1$. Then

$$f(x) = M(1) - (M(1) - M(0)) = M(1) - \int_0^1 \frac{dM(h(s))}{ds}\,ds = M(1) - \int_0^1 \langle D(s), \dot h(s) \rangle\,ds$$

where $D(h) = \nabla M(h) = \left(\frac{\partial M}{\partial h_1}, \ldots, \frac{\partial M}{\partial h_d}\right)^T$ is the gradient of $M(h)$ and $\dot h(s) = \frac{dh(s)}{ds}$ is the derivative of $h(s)$ along the path.
Density rodeo: the main idea
A biased, low-variance estimator of M(1) is $\hat f_1(x)$. An unbiased estimator of D(h) is

$$Z(h) = \left(\frac{\partial \hat f_H(x)}{\partial h_1}, \ldots, \frac{\partial \hat f_H(x)}{\partial h_d}\right)^T$$

This naive estimator has poor risk, due to the large variance of Z(h) for small bandwidths. However, the sparsity assumption on f suggests that there should be paths for which D(h) is also sparse. Along such a path, Z(h) can be replaced with an estimator $\hat D(h)$ that makes use of the sparsity assumption. The estimate of f(x) then becomes

$$\hat f_H(x) = \hat f_1(x) - \int_0^1 \langle \hat D(s), \dot h(s) \rangle\,ds$$
Kernel density estimator
Let $\hat f_H(x)$ denote the kernel density estimator of f(x) with bandwidth matrix H, assuming that K is a standard symmetric kernel, i.e.,

$$\int K(u)\,du = 1, \qquad \int u K(u)\,du = 0_d$$

while $K_H(\cdot) = \frac{1}{\det(H)} K(H^{-1} \cdot)$ denotes the kernel with bandwidth matrix $H = \operatorname{diag}(h_1, \ldots, h_d)$:

$$\hat f_H(x) = \frac{1}{n \det(H)} \sum_{i=1}^{n} K\!\left(H^{-1}(x - X_i)\right) = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{d} \frac{1}{h_j} K\!\left(\frac{x_j - X_{ij}}{h_j}\right)$$

Here, we assume that K is a product kernel.
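As a concrete reference, here is a minimal sketch of this product-kernel estimator in Python/NumPy (the Gaussian kernel is our illustrative choice; the slides only require a symmetric kernel):

```python
import numpy as np

def product_kernel_kde(X, x, h):
    """Product-kernel density estimate f_H(x) with H = diag(h_1, ..., h_d).

    X : (n, d) data matrix; x : (d,) evaluation point; h : (d,) bandwidths.
    Uses the Gaussian kernel K(t) = exp(-t^2/2) / sqrt(2*pi).
    """
    u = (x[None, :] - X) / h                      # (n, d) scaled differences
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # kernel factor per coordinate
    return np.mean(np.prod(K / h, axis=1))        # (1/n) sum_i prod_j K(.)/h_j
```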
The rodeo statistics for the kernel density estimator
The density estimation Rodeo is based on the statistics
$$Z_j = \frac{\partial \hat f_H(x)}{\partial h_j} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial h_j}\!\left[\frac{1}{h_j} K\!\left(\frac{x_j - X_{ij}}{h_j}\right)\right] \prod_{k \ne j} \frac{1}{h_k} K\!\left(\frac{x_k - X_{ik}}{h_k}\right) \equiv \frac{1}{n} \sum_{i=1}^{n} Z_{ji}$$

For the conditional variance term,

$$s_j^2 = \operatorname{Var}(Z_j \mid X_1, \ldots, X_n) = \operatorname{Var}\!\left(\frac{1}{n} \sum_{i=1}^{n} Z_{ji} \,\Big|\, X_1, \ldots, X_n\right) = \frac{1}{n} \operatorname{Var}(Z_{j1} \mid X_1, \ldots, X_n)$$

Here, we use the sample variance of the $Z_{ji}$ to estimate $\operatorname{Var}(Z_{j1})$.
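For the Gaussian kernel, differentiating $\frac{1}{h_j} K\!\left(\frac{x_j - X_{ij}}{h_j}\right)$ gives $Z_{ji} = \frac{(x_j - X_{ij})^2 - h_j^2}{h_j^3} \prod_{k=1}^{d} \frac{1}{h_k} K\!\left(\frac{x_k - X_{ik}}{h_k}\right)$, which leads to a direct vectorized sketch (our implementation, not the authors' code):

```python
import numpy as np

def rodeo_statistics(X, x, h):
    """Z_j and s_j for a Gaussian product kernel at evaluation point x.

    X : (n, d) data matrix; x : (d,) point; h : (d,) bandwidths.
    Returns (Z, s) with Z[j] = d f_H(x)/d h_j and s[j] the estimated
    standard deviation, s_j^2 = Var(Z_j1)/n via the sample variance.
    """
    n, d = X.shape
    u = x[None, :] - X                                            # x_k - X_ik
    K = np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2 * np.pi) * h)    # (1/h_k) K(u_k/h_k)
    prod = np.prod(K, axis=1)                                     # product kernel per sample
    # d/dh_j [(1/h_j) K(u_j/h_j)] = ((u_j^2 - h_j^2)/h_j^3) * (1/h_j) K(u_j/h_j)
    Zji = ((u ** 2 - h ** 2) / h ** 3) * prod[:, None]            # (n, d) per-sample terms
    Z = Zji.mean(axis=0)                                          # Z_j = (1/n) sum_i Z_ji
    s = Zji.std(axis=0, ddof=1) / np.sqrt(n)                      # s_j = sqrt(Var(Z_j1)/n)
    return Z, s
```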
Density rodeo algorithms
Density Rodeo: hard thresholding version

1. Select a parameter $0 < \beta < 1$ and an initial bandwidth $h_0 = c_0 / \log\log n$ for some constant $c_0$. Let $c_n$ be a sequence such that $c_n = O(1)$.
2. Initialize the bandwidths and activate all dimensions:
   (a) $h_j = h_0$, $j = 1, 2, \ldots, d$.
   (b) $\mathcal{A} = \{1, 2, \ldots, d\}$.
3. While $\mathcal{A}$ is nonempty, do for each $j \in \mathcal{A}$:
   (a) Compute the derivative and variance estimates $Z_j$ and $s_j$.
   (b) Compute the threshold $\lambda_j = s_j \sqrt{2 \log(n c_n)}$.
   (c) If $|Z_j| > \lambda_j$, set $h_j \leftarrow \beta h_j$; otherwise remove $j$ from $\mathcal{A}$.
4. Output the bandwidths $h^*$ and the estimator $\hat f(x) = \hat f_{H^*}(x)$.
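A minimal sketch of steps 1-4 in Python, reusing rodeo_statistics from the previous slide (beta, c0, and cn are illustrative tuning constants; the slides only require $0 < \beta < 1$, $h_0 = c_0/\log\log n$, and $c_n = O(1)$):

```python
import numpy as np

def density_rodeo(X, x, beta=0.9, c0=1.0, cn=1.0):
    """Hard-thresholding local density rodeo at evaluation point x."""
    n, d = X.shape
    h = np.full(d, c0 / np.log(np.log(n)))      # steps 1-2: h_j = h0 for all j
    active = set(range(d))                       # A = {1, ..., d}
    while active:                                # step 3
        Z, s = rodeo_statistics(X, x, h)         # derivative and variance estimates
        lam = s * np.sqrt(2 * np.log(n * cn))    # threshold lambda_j
        for j in list(active):
            if abs(Z[j]) > lam[j]:
                h[j] *= beta                     # shrink bandwidth of a relevant dim
            else:
                active.remove(j)                 # freeze: remove j from A
    return h                                     # h*, to plug into the kernel estimator
```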
The purpose of the analysis
The analysis aims to characterize the asymptotic aspects of:
- the selected bandwidths
- the asymptotic running time (efficiency)
- the convergence rate of the risk (accuracy)
To make the theoretical results more realistic, a key aspect of our analysis is that we allow the dimension d to increase with the sample size n! For this, we need to make some assumptions about the unknown density function to take the increasing dimension into account.
Assumptions
(A1) Kernel assumption: $\int u u^T K(u)\,du = v_2 I_d$ with $v_2 < \infty$, and $\int K^2(u)\,du = R(K) < \infty$.

(A2) Dimension assumption: $d \log d = O(\log n)$.

(A3) Initial bandwidth assumption: $h_j^{(0)} = c_0 / \log\log n$ for $j = 1, \ldots, d$. Combined with A2, this implies $\lim_{n \to \infty} n \prod_{j=1}^{d} h_j^{(0)} = \infty$.

(A4) Sparsity assumption: $f(x) \propto g(x_1, \ldots, x_r)\,h(x)$ where $h_{jj}(x) = 0$, with $r = O(1)$.

(A5) Hessian assumption: $\int \operatorname{tr}\!\left(H_R^T(u) H_R(u)\right) du < \infty$, and $f_{jj}(x) \ne 0$ for $j = 1, 2, \ldots, r$, where $H_R(u)$ denotes the Hessian of the relevant dimensions.
Derivatives of both relevant and irrelevant dimensions
Key Lemma: Under assumptions A1–A5, suppose that x is interior to the support of f, and that K is a product kernel with bandwidth matrix $H^{(s)} = \operatorname{diag}(h_1^{(s)}, \ldots, h_d^{(s)})$. Then

$$\mu_j^{(s)} = \frac{\partial}{\partial h_j^{(s)}}\, \mathbb{E}\!\left[\hat f_{H^{(s)}}(x) - f(x)\right] = o_P\!\left(h_j^{(s)}\right) \quad \text{for all } j \in R^c$$

For $j \in R$ we have

$$\mu_j^{(s)} = \frac{\partial}{\partial h_j^{(s)}}\, \mathbb{E}\!\left[\hat f_{H^{(s)}}(x) - f(x)\right] = h_j^{(s)} v_2 f_{jj}(x) + o_P\!\left(h_j^{(s)}\right).$$

Thus, for any integer $s > 0$ with $h^{(s)} = h^{(0)} \beta^s$, each $j > r$ satisfies $\mu_j^{(s)} = o_P(h_j^{(s)}) = o_P(h_j^{(0)})$.
Main theorem
Main Theorem: Suppose that $r = O(1)$ and (A1)–(A5) hold. In addition, suppose that $A_{\min} = \min_{j \le r} |f_{jj}(x)| = \Omega(1)$ and $A_{\max} = \max_{j \le r} |f_{jj}(x)| = O(1)$. Then, for every $\epsilon > 0$, the number of iterations $T_n$ until the rodeo stops satisfies

$$P\!\left(\frac{1}{4+r} \log_{1/\beta}\!\left(n^{1-\epsilon} a_n\right) \le T_n \le \frac{1}{4+r} \log_{1/\beta}\!\left(n^{1+\epsilon} b_n\right)\right) \longrightarrow 1$$

where $a_n = \Omega(1)$ and $b_n = O(1)$. Moreover, the algorithm outputs bandwidths $h^*$ that satisfy

$$P\!\left(h_j^* = h_j^{(0)} \text{ for all } j > r\right) \longrightarrow 1$$

Also, we have

$$P\!\left(h_j^{(0)} (n b_n)^{-1/(4+r)-\epsilon} \le h_j^* \le h_j^{(0)} (n a_n)^{-1/(4+r)+\epsilon} \text{ for all } j \le r\right) \longrightarrow 1$$

assuming that $h_j^{(0)}$ is defined as in A3.
Convergence rate of the risk
Theorem 2: Under the same conditions as the main theorem, the risk $R_{h^*}$ of the density rodeo estimator satisfies

$$R_{h^*} = \widetilde O_P\!\left(n^{-4/(4+r)+\epsilon}\right)$$

for every $\epsilon > 0$.

Here we write $Y_n = \widetilde O_P(a_n)$ to mean that $Y_n = O_P(b_n a_n)$ where $b_n$ is logarithmic in n. As noted earlier, we write $a_n = \Omega(b_n)$ if $\liminf_n \left|\frac{a_n}{b_n}\right| > 0$; similarly, $a_n = \widetilde\Omega(b_n)$ if $a_n = \Omega(b_n c_n)$ where $c_n$ is logarithmic in n.
Possible extensions
The density rodeo algorithm could be extended in many ways:
- the soft-thresholding version
- the global version
- the reverse version
- the bootstrapping version
- the rodeo algorithm for the local linear density estimator
- the greedier version
- using other distributions as irrelevant dimensions
Global density rodeo
The idea is to average the test statistics over multiple evaluation points $x_1, \ldots, x_m$ sampled from the empirical distribution.

To avoid the cancellation problem, the statistic is squared. Let $Z_j(x_k)$ denote the derivative at the k-th evaluation point with respect to the bandwidth $h_j$. We define the test statistic

$$T_j = \frac{1}{m} \sum_{k=1}^{m} Z_j^2(x_k), \qquad j = 1, \ldots, d$$

while

$$s_j = \sqrt{\operatorname{Var}(T_j)} = \frac{1}{m} \sqrt{2 \operatorname{tr}(C_j^2) + 4 \mu_j^T C_j \mu_j}$$

where $\mu_j = \frac{1}{m} \sum_{i=1}^{m} Z_j(x_i)$ and $C_j$ is the covariance matrix of $(Z_j(x_1), \ldots, Z_j(x_m))$. The threshold is

$$\lambda_j = s_j^2 + 2 s_j \sqrt{\log(n c_n)}$$
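A sketch of the global statistic, again reusing rodeo_statistics; note that we estimate $s_j$ by the empirical standard deviation of the $Z_j^2(x_k)$ values rather than through the $\operatorname{tr}(C_j^2)$ formula, which is a plain simplification on our part:

```python
import numpy as np

def global_rodeo_statistics(X, x_eval, h, cn=1.0):
    """T_j and its threshold for the global rodeo, averaging squared
    Z_j statistics over the m evaluation points in x_eval (list of (d,) arrays)."""
    m = len(x_eval)
    Zsq = np.stack([rodeo_statistics(X, x, h)[0] ** 2 for x in x_eval])  # (m, d)
    T = Zsq.mean(axis=0)                        # T_j = (1/m) sum_k Z_j^2(x_k)
    s = Zsq.std(axis=0, ddof=1) / np.sqrt(m)    # empirical stand-in for s_j
    n = X.shape[0]
    lam = s ** 2 + 2 * s * np.sqrt(np.log(n * cn))  # lambda_j = s_j^2 + 2 s_j sqrt(log(n c_n))
    return T, lam
```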
The bootstrapping version
Instead of deriving the expression explicitly, bootstrapping can be used to estimate the variance of $Z_j$.

Bootstrapping method to calculate $s_j^2$:

1. For $b = 1, \ldots, B$: draw a sample $X_1^*, \ldots, X_n^*$ of size n with replacement, and compute the estimate $Z_{jb}^*$ from $X_1^*, \ldots, X_n^*$.
2. Compute the bootstrapped variance $s_j^2 = \frac{1}{B} \sum_{b=1}^{B} \left(Z_{jb}^* - \bar Z_j\right)^2$, where $\bar Z_j = \frac{1}{B} \sum_{b=1}^{B} Z_{jb}^*$.
3. Output the resulting $s_j^2$.
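A minimal sketch of this bootstrap variance, once more reusing rodeo_statistics (B = 200 is an illustrative choice):

```python
import numpy as np

def bootstrap_variance(X, x, h, B=200, rng=None):
    """Bootstrap estimate of s_j^2 = Var(Z_j), as an alternative to the
    closed-form conditional variance."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    Zstar = np.empty((B, d))
    for b in range(B):
        idx = rng.integers(0, n, size=n)              # resample with replacement
        Zstar[b], _ = rodeo_statistics(X[idx], x, h)  # Z_j on the resample
    return Zstar.var(axis=0)          # (1/B) sum_b (Z*_jb - Zbar_j)^2
```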
The reverse and greedier version
Reverse density rodeo: instead of using a sequence of decreasing bandwidths, we can begin from a very small bandwidth and use a sequence of increasing bandwidths to estimate the optimal value.

Greedier density rodeo: instead of decaying all the bandwidths, only the bandwidth associated with the largest $Z_j / \lambda_j$ quantity is decayed.

Hybrid density rodeo: the different variations can be combined arbitrarily.
Using other distributions as irrelevant dimensions
We can use a general parametric distribution h(x) for the irrelevant dimensions. The key trick is that a new semiparametric density estimator is used:

$$\hat f_H(x) = \frac{h(x) \sum_{i=1}^{n} K_H(X_i - x)}{n \int K_H(u - x)\, h(u)\, du}$$

where h(x) is a parametric density estimate at the point x. The motivation for this estimator comes from the local likelihood method (Loader, 1996).

We see that, in the one-dimensional case, starting from a large bandwidth, if the true density is h(x), the algorithm will tend to freeze the bandwidth decay process immediately.
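A one-dimensional sketch of this estimator, with a standard Gaussian as the illustrative parametric baseline $h(\cdot)$ and numerical quadrature for the denominator integral:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def semiparametric_kde_1d(X, x, h, baseline=norm(0.0, 1.0)):
    """1-d sketch of the semiparametric estimator above: a Gaussian kernel
    reweighted by a parametric baseline density (an illustrative choice)."""
    n = len(X)
    Kh = lambda u: norm.pdf(u / h) / h                       # kernel with bandwidth h
    numer = baseline.pdf(x) * np.sum(Kh(X - x))              # h(x) * sum_i K_H(X_i - x)
    denom, _ = quad(lambda u: Kh(u - x) * baseline.pdf(u),   # int K_H(u - x) h(u) du
                    x - 8 * h, x + 8 * h)                    # kernel mass lives in +-8h
    return numer / (n * denom)
```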
Experimental design
The density rodeo algorithms were applied to both synthetic and real data, including one-dimensional, two-dimensional, high-dimensional, and very high-dimensional examples. To measure the distance between the estimated density function and the true density function, the Hellinger distance is used:

$$D(f \| \hat f) = \int \left(\sqrt{\hat f(x)} - \sqrt{f(x)}\right)^2 dx = 2 - 2 \int \hat f(x) \sqrt{\frac{f(x)}{\hat f(x)}}\, dx$$

Assuming that we have altogether m evaluation points drawn from the true density, the Hellinger distance can be numerically approximated by Monte Carlo integration:

$$D(f \| \hat f) \approx 2 - \frac{2}{m} \sum_{i=1}^{m} \sqrt{\frac{\hat f_H(X_i)}{f(X_i)}}$$
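A minimal Monte Carlo sketch of this approximation, assuming the evaluation points are sampled from the true density f:

```python
import numpy as np

def hellinger_mc(f_true, f_hat, X):
    """Monte Carlo estimate of D(f || f_hat) = 2 - 2 E_f[sqrt(f_hat/f)],
    where f_true and f_hat are vectorized density functions and the
    evaluation points X are drawn from the true density f."""
    return 2.0 - 2.0 * np.mean(np.sqrt(f_hat(X) / f_true(X)))
```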
One-dimensional example: strongly skewed density
Strongly skewed density: We simulated 200 samples from the skewed distribution. The boxplot of the Hellinger distance is produced from 100 simulations.

This density is chosen to resemble the lognormal distribution; it is distributed as
$$X \sim \sum_{i=0}^{7} \frac{1}{8}\, N\!\left(3\left(\left(\tfrac{2}{3}\right)^i - 1\right),\ \left(\tfrac{2}{3}\right)^{2i}\right)$$

The true density is plotted below.
[Figure: the true "Strongly skewed" density, x vs. density.]
Strongly skewed density: result
[Figure: four panels comparing the true density with each estimate: "Unbiased Cross Validation" (bandwidth = 0.0612), "Local kernel density Rodeo", "Global kernel density Rodeo" (bandwidth = 0.05), and "Bandwidth estimated"; plus a boxplot of the Hellinger distance for Unbiased CV, Local Rodeo, and Global Rodeo.]
Two-dimensional example: Combined Beta and Uniform
Combined Beta distribution with a uniform distribution as the irrelevant dimension. We simulate a 2-dimensional dataset with n = 500 points. The two dimensions are independently generated as
$$X_1 \sim \tfrac{2}{3}\,\text{Beta}(1, 2) + \tfrac{1}{3}\,\text{Beta}(10, 10), \qquad X_2 \sim \text{Uniform}(0, 1)$$

[Figure: the true density of the relevant dimension ("true relevant dimension", bandwidth = 0.0718).]
Combined Beta and Uniform: result
The first plot is the rodeo result; the second plot is the result fitted by the built-in function kde2d (MASS package in R).
[Figure: two perspective plots of the estimated density over the relevant and irrelevant dimensions: the rodeo estimate and the kde2d estimate.]
Combined Beta and Uniform: marginal distribution
[Figure: six panels of the numerically integrated marginal distributions based on the perspective plots of the two estimators (not normalized): "True relevant density", "True irrelevant dimension", "relevant by Rodeo", "irrelevant by Rodeo", "relevant by KDE2d", and "irrelevant by KDE2d".]
Two-dimensional example: geyser data
Geyser data: a version of the eruptions data from the "Old Faithful" geyser in Yellowstone National Park, Wyoming (Azzalini and Bowman, 1990), consisting of continuous measurements from August 1 to August 15, 1985.

There are altogether n = 299 samples and two variables:

"duration" = the numeric eruption time in minutes.

"waiting" = the waiting time to the next eruption.
Geyser data: result
The first plot is the rodeo result; the second plot is the result fitted by the built-in function kde2d (MASS package in R).
[Figure: two perspective plots of the estimated density over the first and second dimensions: the rodeo estimate and the kde2d estimate.]
Geyser data: contour plot
The first plot is the contour plot fitted by the built-in function kde2d (MASS package in R); the second one is fitted by the rodeo algorithm.
[Figure: two contour plots over the unit square ("first dimension" vs. "second dimension"): the kde2d estimate and the rodeo estimate.]
High dimensional example
30-dimensional example: We generate a 30-dimensional synthetic dataset with r = 5 relevant dimensions (n = 100, with 30 trials). The relevant dimensions are generated as

$$X_i \sim N\!\left(0.5,\ (0.02\,i)^2\right), \quad \text{for } i = 1, \ldots, 5$$

while the irrelevant dimensions are generated as

$$X_i \sim \text{Uniform}(0, 1), \quad \text{for } i = 6, \ldots, 30$$

The test point is $x = \left(\tfrac{1}{2}, \ldots, \tfrac{1}{2}\right)$.
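A sketch of this data-generating process (make_synthetic_30d is our hypothetical helper name, and the rodeo call reuses the density_rodeo sketch from the algorithm slide):

```python
import numpy as np

def make_synthetic_30d(n=100, rng=None):
    """30-dimensional dataset as described above: 5 relevant Gaussian
    dimensions, 25 irrelevant uniform dimensions."""
    rng = np.random.default_rng(rng)
    relevant = np.column_stack(
        [rng.normal(0.5, 0.02 * i, size=n) for i in range(1, 6)])
    irrelevant = rng.uniform(0.0, 1.0, size=(n, 25))
    return np.hstack([relevant, irrelevant])

# Test point x = (1/2, ..., 1/2); rodeo should shrink h only on X1..X5.
X = make_synthetic_30d()
x = np.full(30, 0.5)
h_star = density_rodeo(X, x)
```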
30-dimensional example: result
The rodeo path for the 30-dimensional synthetic dataset and the boxplot of the selected bandwidths over the 30 trials:

[Figure: left, bandwidth vs. rodeo step for each of the 30 dimensions; right, boxplot of the selected bandwidths per dimension (X1-X30).]
Very high dimensional example: image processing
The algorithm was run on 2200 grayscale images of 1s and 2s, each with 256 = 16 × 16 pixels and some unknown background noise; thus this is a 256-dimensional density estimation problem. A test point and the output bandwidth plot are shown here.
[Figure: a test-point image and the output bandwidth plot on the 16 × 16 pixel grid.]
Image processing example: evolution plot
The output bandwidth plots sampled at rodeo steps 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100, which visualize the evolution of the bandwidths and can be viewed as a dynamic feature selection process: the earlier a dimension's bandwidth decays, the more informative it is.
[Figure: ten bandwidth plots on the pixel grid, one per sampled rodeo step.]
An example using Gaussian as irrelevant
Using Gaussian as irrelevant dimensions: We apply the semiparametric rodeo algorithm to both 15-dimensional and 20-dimensional synthetic datasets with r = 5 relevant dimensions (n = 1000). When using Gaussian distributions as the irrelevant dimensions, the relevant dimensions are generated as

$$X_i \sim \text{Uniform}(0, 1), \quad \text{for } i = 1, \ldots, 5$$

while the irrelevant dimensions are generated as

$$X_i \sim N\!\left(0.5,\ (0.05\,i)^2\right), \quad \text{for } i = 6, \ldots, d$$

The test point is $x = \left(\tfrac{1}{2}, \ldots, \tfrac{1}{2}\right)$.
Using Gaussian as irrelevant: result
The rodeo path for the 15-dimensional synthetic data (left) and for the 20-dimensional data (right):
[Figure: bandwidth vs. rodeo step for the 15-dimensional (left) and 20-dimensional (right) datasets.]
Using Gaussian as irrelevant: one-dimensional case (I)
1000 one-dimensional data points with $x_i \sim \text{Uniform}(0, 1)$.

[Figure: four panels: "True density", "fitted by Rodeo", "Bandwidth estimated", and "By unbiased CV" (N = 1000, bandwidth = 0.02894).]
Using Gaussian as irrelevant: one-dimensional case (II)
1000 one-dimensional data points with $x_i \sim N(0, 1)$.

[Figure: four panels: "Gaussian" (the true density), "fitted by Rodeo", "Bandwidth estimated", and "correction factor".]
Summary
- This work adapts the general rodeo framework to solve density estimation problems.
- The sparsity assumption is crucial to guarantee the success of the density rodeo algorithm.
- The density rodeo algorithm is efficient on high-dimensional problems, both theoretically and empirically.
- The rodeo framework can utilize currently available density estimators, and the implementation is simple.
- Future work: develop rodeo algorithms for other kinds of problems, e.g., classification.