Quantile regression Two-step algorithm Oracle rules Simulation References
Quantile Regression for Extraordinarily Large Data

Shih-Kang Chao
Department of Statistics, Purdue University

November 2016. Joint work with Stanislav Volgushev and Guang Cheng
With the advance of technology, it is increasingly common that a data set is so large that it cannot be stored on a single machine.

Social media (views, likes, comments, images, ...)
Meteorological and environmental surveillance
Transactions in e-commerce
Others ...

Figure: A server room in Council Bluffs, Iowa. Photo: Google/Connie Zhou.
Divide and Conquer (D&C)

To take advantage of the opportunities in massive data, we need to deal with storage (disk) and computational (memory, processor) bottlenecks.

Divide and conquer paradigm: randomly divide the N samples into S groups, each of size n = N/S.

[Diagram: Problem(N) split into S subproblems of size n]
Aggregate the solutions of the subproblems to obtain a solution for the original problem.

Can be implemented on computational platforms such as Hadoop (White, 2012).

Data stored in distributed locations can be analyzed in a similar manner.

[Diagram: Problem(N) split into S subproblems of size n]
Is D&C a good fit for statistical analysis?

Sometimes it is, but sometimes it isn't...

In the following, we give two simple examples.
Example 1: sample mean

[Diagram: Problem(N) split into four subproblems of size n, with subsample means $\bar X_{n,1}, \bar X_{n,2}, \bar X_{n,3}, \bar X_{n,4}$]

$$\frac{1}{4}\sum_{s=1}^{4}\bar X_s \;=\; \frac{1}{4n}\sum_{s=1}^{4}\sum_{i=1}^{n} X_{is} \;=\; \frac{1}{N}\sum_{i=1}^{N} X_i \;=\; \bar X_N.$$

It fits!
Example 2: sample median

[Diagram: Problem(N) split into four subproblems of size n, with subsample medians $\bar X_1(0.5), \bar X_2(0.5), \bar X_3(0.5), \bar X_4(0.5)$]

$\bar X_s(0.5)$ = the middle value of the n ordered samples in group s; $\bar X(0.5)$ = overall median.

$$\frac{1}{4}\sum_{s=1}^{4}\bar X_s(0.5) \;\overset{?}{=}\; \bar X(0.5)$$
Example 2: sample median

Simulation 1: $X_i \sim N(0,1)$; Simulation 2: $X_i \sim \mathrm{Exponential}(1)$. $N = 2^{15}$. True median vs. simulated $S^{-1}\sum_{s=1}^{S}\bar X_s(0.5)$.

[Figure: averaged subsample medians against $\log_N(S)$; left panel τ = 0.5, N(0,1); right panel τ = 0.5, Exp(1)]
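The failure for the exponential case can be reproduced in a few lines. This is a sketch (not from the slides) using numpy; the group counts S = 8 and S = 4096 are illustrative choices:

```python
import numpy as np

# Averaging subsample medians works for small S, but is biased for skewed data
# when S is large (n = N/S small), since the sample median has O(1/n) bias.
rng = np.random.default_rng(0)
N = 2**15
x = rng.exponential(1.0, size=N)      # true median = log(2) ~ 0.693

def dc_median(x, S):
    """Average of the S subsample medians (divide-and-conquer estimate)."""
    return np.median(x.reshape(S, -1), axis=1).mean()

small_S = dc_median(x, S=8)           # n = 4096 per group: close to log(2)
large_S = dc_median(x, S=4096)        # n = 8 per group: visibly biased upward
```

For the symmetric N(0, 1) case the subsample medians are unbiased, which is why only the skewed example breaks down.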
Major issues

When does the D&C algorithm work? Especially for skewed and heavy-tailed distributions.

Statistical inference: asymptotic distribution; inference for the "whole" distribution?

Take advantage of the massive sample size to discover subtle patterns hidden in the "whole" distribution.
Outline
1 Quantile regression
2 Two-step algorithm: D&C and quantile projection
3 Oracle rules: linear model and nonparametric model
4 Simulation
Quantile

Response Y, predictors X. For τ ∈ (0, 1), the conditional quantile curve Q(·; τ) of Y ∈ R given X is defined through

$$P(Y \le Q(x;\tau)\,|\,X = x) = \tau \quad \forall x.$$

Q(x; τ) at τ = 0.1, 0.25, 0.5, 0.75, 0.9 under different models:
[Figure: three scatter plots with quantile curves at τ = 0.1, 0.25, 0.5, 0.75, 0.9 overlaid, for the models Y = 0.5*X + N(0,1); Y = X + (.5+X)*N(0,1); Y|X=x ~ Beta(1.5−x, .5+x), each with x on [0, 1]]
Quantile regression vs. mean regression

Mean regression: $Y_i = m(X_i) + \varepsilon_i$, $E[\varepsilon|X = x] = 0$.
m: regression function, object of interest. $\varepsilon_i$: 'errors'.

Quantile regression: $P(Y \le Q(x;\tau)|X = x) = \tau$.
No strict distinction between 'signal' and 'noise'.
Object of interest: properties of the conditional distribution of Y | X = x.
Contains much richer information than just the conditional mean.
[Figure: two scatter plots of Bone Mass Density (BMD) against Age (10 to 25)]
Quantile curves vs. conditional distribution

$$Q(x_0;\tau) = F_{Y|X}^{-1}(\tau\,|\,x_0),$$

where $F_{Y|X}(y|x)$ is the conditional distribution function of Y given X.
[Figure: left panel, quantile curves τ ↦ Q(Age = x0; τ) of BMD for x0 = 13, 18, 23; right panel, the corresponding conditional CDFs F(y | Age = x0) for x0 = 13, 18, 23]
Quantile Regression as Optimization

$\{(X_i, Y_i)\}_{i=1}^N$ independent and identically distributed samples in $\mathbb{R}^{d+1}$.

Koenker and Bassett (1978): if $Q(x;\tau) = \beta(\tau)^\top x$, estimate by

$$\hat\beta_{or}(\tau) := \arg\min_{b}\sum_{i=1}^{N}\rho_\tau(Y_i - b^\top X_i) \quad (1.1)$$

where $\rho_\tau(u) := \tau u_+ + (1-\tau)u_-$ is the 'check function'.

Optimization problem (1.1) is convex (but non-smooth), and can be solved by linear programming.

$\hat Q(x_0;\tau) := x_0^\top\hat\beta_{or}(\tau)$ for any $x_0$.

More generally, we can consider a series approximation model.
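For concreteness, a minimal sketch of solving (1.1) by linear programming; the function name `qr_fit` and the use of scipy's `linprog` are our own illustrative choices, not from the talk:

```python
import numpy as np
from scipy.optimize import linprog

def qr_fit(X, y, tau):
    """Quantile regression via the LP form of (1.1):
    minimize tau*1'u + (1-tau)*1'v  s.t.  X b + u - v = y,  u, v >= 0, b free,
    so that at the optimum u = (y - Xb)_+ and v = (y - Xb)_-.
    """
    n, d = X.shape
    c = np.concatenate([np.zeros(d), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * d + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:d]

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.uniform(size=n)])
y = X @ np.array([0.21, 0.5]) + rng.normal(scale=0.1, size=n)
beta_med = qr_fit(X, y, tau=0.5)   # estimates roughly (0.21, 0.5) at the median
```

Dedicated solvers (interior-point or simplex variants specialized for the check loss) are much faster in practice; the LP above just makes the convexity claim concrete.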
A unified framework:

$$Q(x;\tau) \approx Z(x)^\top\beta(\tau)$$

m := dim(Z) (it is possible that m → ∞). Solve

$$\hat\beta_{or}(\tau) := \arg\min_{b}\sum_{i=1}^{N}\rho_\tau\big\{Y_i - b^\top Z(X_i)\big\} \quad (1.2)$$

Examples of Z(x): linear model with fixed/increasing dimension, B-splines, polynomials, trigonometric polynomials.

$\hat Q(x;\tau) := Z(x)^\top\hat\beta_{or}(\tau)$.

Need to control the "bias" $Q(x;\tau) - Z(x)^\top\beta(\tau)$.
Summary for quantile regression

$\hat\beta_{or}(\tau)$ is computationally infeasible when N is so large that it cannot be handled by a single machine.

To infer the "whole" conditional distribution at a fixed $x_0$, use

$$F_{Y|X}(y|x_0) = \int_0^1 1\{Q(x_0;\tau) < y\}\,d\tau,$$

where 1(A) = 1 if A is true. To approximate the integral, we need to compute $\hat Q(x_0;\tau)$ for many τ.
D&C algorithm at fixed τ

[Diagram: Problem(N) split into four subproblems of size n, yielding $\hat\beta_1(\tau_1), \hat\beta_2(\tau_1), \hat\beta_3(\tau_1), \hat\beta_4(\tau_1)$]

$$\hat\beta_s(\tau) := \arg\min_{b\in\mathbb{R}^m}\sum_{i=1}^{n}\rho_\tau\big\{Y_{is} - b^\top Z(X_{is})\big\} \quad (2.1)$$

$$\bar\beta(\tau) := \frac{1}{S}\sum_{s=1}^{S}\hat\beta_s(\tau). \quad (2.2)$$

However, this is only for a fixed τ!
Quantile projection algorithm

To avoid repeatedly applying D&C, take $B := (B_1, ..., B_q)^\top$, a B-spline basis defined on equidistant knots $\{t_1, ..., t_G\} \subset T$ with degree $r_\tau \in \mathbb{N}$, and set

$$\hat\beta(\tau) := \hat\Xi^\top B(\tau). \quad (2.3)$$

Computation of $\hat\Xi$:

1 Define a grid of quantile levels $\{\tau_1, ..., \tau_K\}$ on $[\tau_L, \tau_U]$, K > q. For each $\tau_k$, compute $\bar\beta(\tau_k)$.

2 Compute, for j = 1, ..., m,

$$\hat\alpha_j := \arg\min_{\alpha\in\mathbb{R}^q}\sum_{k=1}^{K}\big(\bar\beta_j(\tau_k) - \alpha^\top B(\tau_k)\big)^2. \quad (2.4)$$

3 Set the matrix $\hat\Xi := [\hat\alpha_1\ \hat\alpha_2\ ...\ \hat\alpha_m]$.
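Steps 1-3 above amount to an ordinary least-squares fit of each coordinate of the grid estimates onto a B-spline basis. A sketch (the knot construction, grid sizes, and the toy curves standing in for real D&C estimates are our own illustrative choices):

```python
import numpy as np
from scipy.interpolate import BSpline

# Cubic B-spline basis on [tauL, tauU] with clamped equidistant knots.
deg, tauL, tauU, G = 3, 0.1, 0.9, 8
t = np.r_[[tauL] * deg, np.linspace(tauL, tauU, G), [tauU] * deg]
q = len(t) - deg - 1                     # number of basis functions

def B(taus):
    """len(taus) x q matrix of B-spline basis functions evaluated at taus."""
    return np.column_stack([BSpline(t, np.eye(q)[j], deg)(taus) for j in range(q)])

K = 30                                    # grid size, K > q as required
tau_grid = np.linspace(tauL, tauU, K)
# Step 1 would fill beta_bar with D&C estimates on the grid; here m = 2 and we
# substitute smooth toy curves so the example is self-contained.
beta_bar = np.column_stack([np.sin(tau_grid), tau_grid**2])        # K x m
Xi = np.linalg.lstsq(B(tau_grid), beta_bar, rcond=None)[0]         # step 2: q x m
beta_hat = lambda tau: Xi.T @ B(np.atleast_1d(tau)).ravel()        # step 3: (2.3)
```

After this projection, `beta_hat` can be evaluated at any τ in $[\tau_L, \tau_U]$ without revisiting the data.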
Computation of $\hat F_{Y|X}(y|x)$

Define $\hat Q(x;\tau) := Z(x)^\top\hat\beta(\tau) = Z(x)^\top\hat\Xi^\top B(\tau)$. Compute

$$\hat F_{Y|X}(y|x_0) := \tau_L + \int_{\tau_L}^{\tau_U} 1\{\hat Q(x_0;\tau) < y\}\,d\tau, \quad (2.5)$$

where $0 < \tau_L < \tau_U < 1$.
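The integral in (2.5) is one-dimensional and the integrand is an indicator, so a Riemann sum on a fine τ-grid suffices. A sketch (the grid size and the sanity check against the standard normal quantile function are our own choices):

```python
import numpy as np
from scipy.stats import norm

def cdf_hat(qhat, y, tauL=0.05, tauU=0.95, n_grid=2000):
    """Approximate (2.5): tauL + integral over [tauL, tauU] of 1{qhat(tau) < y},
    using the fraction of grid points where the quantile curve lies below y."""
    taus = np.linspace(tauL, tauU, n_grid)
    return tauL + (tauU - tauL) * np.mean(qhat(taus) < y)

# Sanity check with a known quantile function: for the standard normal,
# Q(tau) = Phi^{-1}(tau), so the recovered CDF at y should be Phi(y).
f_half = cdf_hat(norm.ppf, 0.0)    # close to 0.5
```

Monotonicity of τ ↦ Q(x0; τ) is what makes this indicator-integral recover the CDF on $(\hat Q(x_0;\tau_L), \hat Q(x_0;\tau_U))$.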
Remarks

Both S and K can grow with N; they determine the computational limits as well as the statistical performance.

Find S and K such that $\bar\beta(\tau)$ and $\hat\beta(\tau)$ are "close" to $\hat\beta_{or}(\tau)$ in some statistical sense. This yields "sharp" upper bounds on S and lower bounds on K, which impose computational limits.

The two-step procedure requires only one pass through the entire data.

Quantile projection requires only $\{\bar\beta(\tau_1), ..., \bar\beta(\tau_K)\}$, of size m × K, without the need to access the raw data set.
Oracle rules

For $T = [\tau_L, \tau_U] \subset (0,1)$, under regularity conditions (Chao, Volgushev and Cheng, 2016):

$$a_N u^\top\big(\hat\beta_{or}(\tau) - \beta(\tau)\big) \rightsquigarrow \mathcal N \text{ at any fixed } \tau\in T \quad (3.1)$$

$$a_N u^\top\big(\hat\beta_{or}(\cdot) - \beta(\cdot)\big) \rightsquigarrow \mathbb G(\cdot) \text{ as a process in } \tau\in T \quad (3.2)$$

where $\mathcal N$ is a centered normal distribution and $\mathbb G$ is a centered Gaussian process.

The oracle rule holds for $\bar\beta(\tau)$ and $\hat\beta(\tau)$ if $\bar\beta(\tau)$ satisfies (3.1) and $\hat\beta(\tau)$ satisfies (3.2).

Inference follows from the oracle rule.
Two leading models

Linear model: m = dim(Z(x)) is fixed, and $Q(x;\tau) = Z(x)^\top\beta(\tau)$;

Univariate spline nonparametric model: m = dim(Z(x)) → ∞ with N, and the approximation bias $c_N(\gamma_N) := \big|Q(x;\tau) - Z(x)^\top\gamma_N(\tau)\big| \ne 0$, where

$$\gamma_N(\tau) := \arg\min_{\gamma\in\mathbb{R}^m} E\big[(Z^\top\gamma - Q(X;\tau))^2 f(Q(X;\tau)|X)\big]. \quad (3.3)$$

$\gamma_N$ can be defined in many other ways; we do not go into detail here.

Conditions imposed throughout the talk: Assumption (A).

$$J_m(\tau) := E[ZZ^\top f_{Y|X}(Q(X;\tau)|X)]$$
Linear model with fixed dimension

$\mathcal P_1(\xi_m, M, \bar f, \bar f', f_{min})$: all distributions satisfying (A1)-(A3) with some constants $0 < \xi_m, M, \bar f, \bar f' < \infty$ and $f_{min} > 0$.

Theorem 3.1 (Oracle Rule for $\bar\beta(\tau)$)

Suppose that $S = o(N^{1/2}(\log N)^{-2})$. Then ∀τ ∈ T and $u \in \mathbb{R}^m$,

$$\frac{\sqrt N\, u^\top(\bar\beta(\tau) - \beta(\tau))}{\big(u^\top J_m(\tau)^{-1} E[Z(X)Z(X)^\top] J_m(\tau)^{-1} u\big)^{1/2}} \rightsquigarrow \mathcal N\big(0, \tau(1-\tau)\big).$$

If $S \gtrsim N^{1/2}$, then the weak convergence result above fails for some $P \in \mathcal P_1(1, d, \bar f, \bar f', f_{min})$.

$S = o(N^{1/2}(\log N)^{-2})$ is sharp: we only miss by a factor of $(\log N)^{-2}$.
$\Lambda_c^{\eta_\tau}(T)$: Hölder class with smoothness $\eta_\tau > 0$. $\ell^\infty(T)$: set of all uniformly bounded real functions on T.

Theorem 3.2 (Oracle Rule for $\hat\beta(\cdot)$)

Suppose that $\tau \mapsto Q(x_0;\tau) \in \Lambda_c^{\eta_\tau}(T)$ for an $x_0 \in \mathcal X$.

If $S = o(N^{1/2}(\log N)^{-2})$ and $K \gg G \gg N^{1/(2\eta_\tau)}$ with $r_\tau \ge \eta_\tau$, then the projection estimator $\hat\beta(\tau)$ defined in (2.3) satisfies

$$\sqrt N\big(Z(x_0)^\top\hat\beta(\cdot) - Q(x_0;\cdot)\big) \rightsquigarrow \mathbb G(\cdot) \text{ in } \ell^\infty(T), \quad (3.4)$$

where $\mathbb G$ is a centered Gaussian process (covariance function: see appendix).

If $G \lesssim N^{1/(2\eta_\tau)}$, the weak convergence in (3.4) fails for some $P \in \mathcal P_1(1, d, \bar f, \bar f', f_{min})$ with $\tau \mapsto Q(x_0;\tau) \in \Lambda_c^{\eta_\tau}(T)$.

$K \gg N^{1/(2\eta_\tau)}$ is sharp!
Splines nonparametric model

We require an additional condition:

(L) For each $x \in \mathcal X$, the vector Z(x) has zeroes in all but at most r consecutive entries, where r is fixed. Furthermore, $\sup_{x\in\mathcal X} E(Z(x), \gamma_N) = O(1)$.

This guarantees that the matrix $J_m(\tau)$ is a block matrix.

$$\mathcal P_L(Z, M, \bar f, \bar f', f_{min}, R) := \big\{\text{all sequences of distributions of } (X,Y) \text{ on } \mathbb{R}^{d+1} \text{ satisfying (A1)-(A3) with } M, \bar f, \bar f' < \infty,\ f_{min} > 0, \text{ and (L) for some } r < R;\ m^2(\log N)^6 = o(N),\ c_N(\gamma_N)^2 = o(N^{-1/2})\big\}. \quad (3.5)$$
Splines nonparametric model

Theorem 3.3 (Oracle rule for $\bar\beta(\tau)$)

Let $\{(X_i, Y_i)\}_{i=1}^N$ be distributed according to $P_N \in \mathcal P_L(Z, M, \bar f, \bar f', f_{min}, R)$, with $m \asymp N^\zeta$ for some $\zeta > 0$.

If $S = o\big((Nm^{-1}(\log N)^{-4})^{1/2}\big)$, then ∀τ ∈ T, $x_0 \in \mathcal X$,

$$\frac{\sqrt N\, Z(x_0)^\top(\bar\beta(\tau) - \gamma_N(\tau))}{\big(Z(x_0)^\top J_m(\tau)^{-1} E[ZZ^\top] J_m(\tau)^{-1} Z(x_0)\big)^{1/2}} \rightsquigarrow \mathcal N\big(0, \tau(1-\tau)\big).$$

If $S \gtrsim (N/m)^{1/2}$, the weak convergence above fails for some $P_N \in \mathcal P_L(Z, M, \bar f, \bar f', f_{min}, R)$.

Sharpness of $S = o\big((N/m)^{1/2}(\log N)^{-2}\big)$: the nonparametric rate is slower; again we miss by a factor of $(\log N)^{-2}$.
Theorem 3.4 (Oracle rule for $\hat\beta(\tau)$)

Let the assumptions of Theorem 3.3 hold, and assume that $\tau \mapsto Q(x_0;\tau) \in \Lambda_c^{\eta_\tau}(T)$ with $r_\tau \ge \eta_\tau$.

Suppose that $c_N(\gamma_N) = o(N^{-1/2}\|Z(x_0)\|)$, $K \gg G \gg N^{1/(2\eta_\tau)}\|Z(x_0)\|^{-1/\eta_\tau}$, and the limit $H(\tau_1,\tau_2) := \lim_{N\to\infty} K_N(\tau_1,\tau_2) > 0$ exists ∀$\tau_1,\tau_2 \in T$. Then

$$\frac{\sqrt N}{\|Z(x_0)\|}\big(Z(x_0)^\top\hat\beta(\cdot) - Q(x_0;\cdot)\big) \rightsquigarrow \mathbb G(\cdot) \text{ in } \ell^\infty(T),$$

where $\mathbb G$ is a centered Gaussian process with covariance structure $E[\mathbb G(\tau)\mathbb G(\tau')] = H(\tau,\tau')$.

If $G \lesssim N^{1/(2\eta_\tau)}\|Z(x_0)\|^{-1/\eta_\tau}$, the weak convergence fails for some $\tau \mapsto Q(x_0;\tau) \in \Lambda_c^{\eta_\tau}(T)$, for all S.
Distribution function

$$\hat F^{or}_{Y|X}(\cdot|x_0) := \tau_L + \int_{\tau_L}^{\tau_U} 1\{Z(x_0)^\top\hat\beta_{or}(\tau) < y\}\,d\tau$$

The oracle rule holds for both models.

Corollary 3.5

Under the same conditions as Theorem 3.2 (linear model) or Theorem 3.4 (nonparametric spline model), we have for any $x_0 \in \mathcal X$,

$$\sqrt N\big(\hat F_{Y|X}(\cdot|x_0) - F_{Y|X}(\cdot|x_0)\big) \rightsquigarrow -f_{Y|X}(\cdot|x_0)\,\mathbb G\big(F_{Y|X}(\cdot|x_0)\big)$$

in $\ell^\infty\big((Q(x_0;\tau_L), Q(x_0;\tau_U))\big)$, where $\mathbb G$ is the centered Gaussian process defined in the respective theorem. The same holds for $\hat F^{or}_{Y|X}(\cdot|x_0)$.
Phase transitions

[Figure: oracle-rule regions in the (S, K) plane. Linear model: S below $N^{1/2}/(\log N)^2$ with K above $N^{1/(2\eta_\tau)}$; spline model: S below $N^{1/2}/(m^{1/2}(\log N)^2)$ with K above $(N/m)^{1/(2\eta_\tau)}$; the necessary bounds on S are $N^{1/2}$ and $(N/m)^{1/2}$]

Figure: Regions of (S, K) for the oracle rule of the linear model and the spline nonparametric model. The "?" region is the discrepancy between the sufficient and necessary conditions.
Setting: linear model

Model: $Y_i = 0.21 + \beta_{m-1}^\top X_i + \varepsilon_i$, with m = 4, 16, 32.

$X_i$ follows a multivariate uniform distribution $U([0,1]^{m-1})$ with covariance matrix $\Sigma_X := E[X_iX_i^\top]$, where $\Sigma_{jk} = 0.12 \cdot 0.7^{|j-k|}$ for j, k = 1, ..., m − 1.

$\varepsilon \sim \mathcal N$ or $\varepsilon \sim \mathrm{Exp}$ (skewed).

$x_0 = (1, (m-1)^{-1/2}\mathbf{1}_{m-1}^\top)^\top$.

From Theorem 3.1, the coverage probability of the α = 95% confidence interval is

$$P\Big\{x_0^\top\beta(\tau) \in \big[x_0^\top\bar\beta(\tau) \pm N^{-1/2} f_{\varepsilon,\tau}^{-1}\sqrt{\tau(1-\tau)\,x_0^\top\Sigma_X^{-1}x_0}\;\Phi^{-1}(1-\alpha/2)\big]\Big\}$$

$S^*$: the value of S at which the coverage starts to drop.
$\bar\beta(\tau)$, $\varepsilon \sim \mathcal N(0, 0.1^2)$

[Figure: coverage probability against $\log_N(S)$ at τ = 0.1, 0.5, 0.9 (three panels), for $N = 2^{14}, 2^{22}$ and m = 4, 16, 32]
$\bar\beta(\tau)$, $\varepsilon \sim \mathrm{Exp}(0.8)$

[Figure: coverage probability against $\log_N(S)$ at τ = 0.1, 0.5, 0.9 (three panels), for $N = 2^{14}, 2^{22}$ and m = 4, 16, 32]
Summary for simulation of $\bar\beta(\tau)$

As m increases, $S^*$ decreases.

As N increases, $S^*$ gets closer to $N^{1/2}$.

For $\varepsilon \sim \mathcal N$, coverage is symmetric in τ; for $\varepsilon \sim \mathrm{Exp}$, coverage is asymmetric in τ.

The t distribution behaves similarly to the normal distribution.
Quantile projection setting

B: cubic B-spline basis with q = dim(B), defined on G = 4 + q equidistant knots on $[\tau_L, \tau_U]$. We require K > q so that $\hat\beta(\tau)$ is computable (see (2.4)).

$N = 2^{14}$.

$y_0 = Q(x_0;\tau)$, so that $F_{Y|X}(y_0|x_0) = \tau$.

From Theorem 3.2, the coverage probability at size α = 0.95 is

$$P\Big\{\tau \in \big[\hat F_{Y|X}(Q(x_0;\tau)|x_0) \pm N^{-1/2}\sqrt{\tau(1-\tau)\,x_0^\top\Sigma_X^{-1}x_0}\;\Phi^{-1}(1-\alpha/2)\big]\Big\}$$
$\hat F_{Y|X}(y|x)$, $\varepsilon \sim \mathcal N(0, 0.1^2)$, m = 4 for $\bar\beta(\tau)$

[Figure: coverage probability against $\log_N(S)$ at $y_0 = Q(x_0;\tau)$ for τ = 0.1, 0.5, 0.9 (three panels), for q = 7, 10, 15 and K = 20, 30]
$\hat F_{Y|X}(y|x)$, $\varepsilon \sim \mathcal N(0, 0.1^2)$, m = 32 for $\bar\beta(\tau)$

[Figure: coverage probability against $\log_N(S)$ at $y_0 = Q(x_0;\tau)$ for τ = 0.1, 0.5, 0.9 (three panels), for q = 7, 10, 15 and K = 20, 30]
$\hat F_{Y|X}(y|x)$, $\varepsilon \sim \mathrm{Exp}(0.8)$, m = 4 for $\bar\beta(\tau)$

[Figure: coverage probability against $\log_N(S)$ at $y_0 = Q(x_0;\tau)$ for τ = 0.1, 0.5, 0.9 (three panels), for q = 7, 10, 15 and K = 20, 30]
$\hat F_{Y|X}(y|x)$, $\varepsilon \sim \mathrm{Exp}(0.8)$, m = 32 for $\bar\beta(\tau)$

[Figure: coverage probability against $\log_N(S)$ at $y_0 = Q(x_0;\tau)$ for τ = 0.1, 0.5, 0.9 (three panels), for q = 7, 10, 15 and K = 20, 30]
Summary for simulation of $\hat F_{Y|X}(y|x)$

An increase in m lowers both $S^*$ and $q^*$ (the critical value of q for the oracle rule).

Either $S > S^*$ or $q < q^*$ (q = G − 4) leads to the collapse of the oracle rule.

Increasing q and K improves the coverage probability.

For $\varepsilon \sim \mathcal N$, coverage is no longer symmetric in τ.
Thank you for your attention
References I

Chao, S., Volgushev, S., and Cheng, G. (2016). Quantile process for semi and nonparametric regression models. arXiv preprint arXiv:1604.02130.

White, T. (2012). Hadoop: The Definitive Guide. O'Reilly Media / Yahoo Press.
Assumption (A): the data $(X_i, Y_i)_{i=1,...,N}$ form a triangular array and are row-wise i.i.d.

(A1) Assume that $\|Z_i\| \le \xi_m < \infty$, where $\xi_m = O(N^b)$ is allowed to increase to infinity, and that

$$1/M \le \lambda_{min}(E[ZZ^\top]) \le \lambda_{max}(E[ZZ^\top]) \le M$$

holds uniformly in n for some fixed constant M.

(A2) The conditional distribution $F_{Y|X}(y|x)$ is twice differentiable w.r.t. y. Denote the corresponding derivatives by $f_{Y|X}(y|x)$ and $f'_{Y|X}(y|x)$. Assume that

$$\bar f := \sup_{y,x}|f_{Y|X}(y|x)| < \infty, \qquad \bar f' := \sup_{y,x}|f'_{Y|X}(y|x)| < \infty$$

uniformly in n.

(A3) Assume that

$$0 < f_{min} \le \inf_{\tau\in T}\inf_{x} f_{Y|X}(Q(x;\tau)|x)$$

uniformly in n.
Covariance of the limit process (linear model):

$$E\big[\mathbb G(\tau)\mathbb G(\tau')\big] = Z(x_0)^\top J_m(\tau)^{-1} E\big[Z(X)Z(X)^\top\big] J_m(\tau')^{-1} Z(x_0)\,(\tau\wedge\tau' - \tau\tau'). \quad (5.1)$$

Covariance of the limit process (local polynomial):

$$K(\tau_1,\tau_2) := \|Z(x_0)\|^{-2} Z(x_0)^\top J_m^{-1}(\tau_1) E[ZZ^\top] J_m^{-1}(\tau_2) Z(x_0)\,(\tau_1\wedge\tau_2 - \tau_1\tau_2)$$
Simulation coefficients:

$$\beta_3 = (0.21, -0.89, 0.38)^\top;$$
$$\beta_{15} = (\beta_3^\top, 0.63, 0.11, 1.01, -1.79, -1.39, 0.52, -1.62, 1.26, -0.72, 0.43, -0.41, -0.02)^\top;$$
$$\beta_{31} = (\beta_{15}^\top, 0.21, \beta_{15}^\top)^\top. \quad (5.2)$$

$$\beta(\tau) = (0.21 + 0.1 \times \Phi^{-1}_{\sigma=0.1}(\tau), \beta_{m-1}^\top)^\top$$

$f_{\varepsilon,\tau} > 0$ is the height of the density of $\varepsilon_i$ evaluated at the τ-quantile of $\varepsilon_i$.