Quantile regression Two-step algorithm Oracle rules Simulation References Quantile Regression for Extraordinarily Large Data Shih-Kang Chao Department of Statistics Purdue University November, 2016 A joint work with Stanislav Volgushev and Guang Cheng

Post on 05-Jun-2020


Page 1: Quantile Regression for Extraordinarily Large Datafaculty.missouri.edu/~chaosh/doc/pubs/BDQR_Chao.pdf · 2018-08-16 · Quantile regressionTwo-step algorithmOracle rulesSimulationReferences

Quantile regression Two-step algorithm Oracle rules Simulation References

Quantile Regression for Extraordinarily Large Data

Shih-Kang Chao

Department of Statistics

Purdue University

November, 2016

A joint work with Stanislav Volgushev and Guang Cheng

With the advance of technology, it is increasingly common that a data set is so large that it cannot be stored on a single machine

Social media (views, likes, comments, images...)

Meteorological and environmental surveillance

Transactions in e-commerce

Others...

Figure: A server room in Council Bluffs, Iowa. Photo: Google/Connie Zhou.


Divide and Conquer (D&C)

To take advantage of the opportunities in massive data, we need to deal with storage (disk) and computational (memory, processor) bottlenecks.

Divide and conquer paradigm: randomly divide N samples into S groups, each of size n = N/S.

Diagram: Problem(N) is split into subproblems, each of size n.

Aggregate the solutions from the subproblems to get a solution for the original problem

Can be implemented on computational platforms such as Hadoop (White, 2012)

Data that are stored in distributed locations can be analyzed in a similar manner


Does D&C fit for statistical analysis?

Sometimes it does, but sometimes it doesn’t...

In the following, we give two simple examples.


Example 1: sample mean

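Diagram: Problem(N) is split into four subproblems of size n, with subsample means $\bar X_{n,1}, \bar X_{n,2}, \bar X_{n,3}, \bar X_{n,4}$.

$$\frac{1}{4}\sum_{s=1}^{4} \bar X_{n,s} = \frac{1}{4n}\sum_{s=1}^{4}\sum_{i=1}^{n} X_{is} = \frac{1}{N}\sum_{i=1}^{N} X_i = \bar X_N.$$

It fits!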
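The identity for the sample mean can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the group count, group size, seed, and variable names are illustrative assumptions.

```python
import random

random.seed(0)
S, n = 4, 1000  # assumed: 4 groups of 1000 samples, so N = 4000

# Simulate the S subsamples (each plays the role of one subproblem).
groups = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(S)]

# Step 1: each "machine" computes its own sample mean.
group_means = [sum(g) / n for g in groups]

# Step 2: aggregate by averaging the subsample means.
dc_mean = sum(group_means) / S

# Reference: the mean computed on the pooled data.
pooled = [x for g in groups for x in g]
full_mean = sum(pooled) / len(pooled)

print(dc_mean, full_mean)
```

The two numbers agree up to floating-point rounding because the groups have equal size; with unequal group sizes the aggregation step would need weights proportional to the group sizes.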


Example 2: sample median

Diagram: Problem(N) is split into four subproblems of size n, with subsample medians $X^1_{(0.5)}, X^2_{(0.5)}, X^3_{(0.5)}, X^4_{(0.5)}$.

$X^s_{(0.5)}$ = the middle value of the ordered n samples in group s;

$X_{(0.5)}$ = overall median

$$\frac{1}{4}\sum_{s=1}^{4} X^s_{(0.5)} \overset{??}{=} X_{(0.5)}$$


Simulation 1: $X_i \sim N(0, 1)$; Simulation 2: $X_i \sim \text{Exponential}(1)$. $N = 2^{15}$. True median vs. simulated $S^{-1}\sum_{s=1}^{S} X^s_{(0.5)}$

Figure: two panels, titled "τ = 0.5, N(0,1)" and "τ = 0.5, EXP(1)", each plotting the averaged subsample median against $\log_N(S)$.
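The effect shown in the exponential panel can be reproduced with a short simulation. This is a hedged sketch rather than the authors' code: the seed, the chosen values of S, and the helper name `dc_median` are assumptions made for illustration.

```python
import math
import random
import statistics

def dc_median(sample, S):
    """Average of the S subsample medians (groups are consecutive blocks)."""
    n = len(sample) // S
    return statistics.fmean(
        statistics.median(sample[s * n:(s + 1) * n]) for s in range(S)
    )

random.seed(1)
N = 2 ** 15
sample = [random.expovariate(1.0) for _ in range(N)]  # Exponential(1) data
true_median = math.log(2)  # median of Exponential(1)

for S in (1, 2 ** 5, 2 ** 10, 2 ** 13):
    print(S, round(dc_median(sample, S), 4), round(true_median, 4))
```

For small S the averaged median stays near log 2, while for S close to N each group holds only a handful of points and the skewness of the exponential distribution pushes the average upward. For symmetric distributions such as N(0, 1) this bias does not appear.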



Major issues

When does the D&C algorithm work?

Especially for skewed and heavy-tailed distributions

Statistical inference

Asymptotic distribution

Inference for the "whole" distribution?

Take advantage of the massive size to discover subtle patterns hidden in the "whole" distribution


Outline

1 Quantile regression

2 Two-step algorithm: D&C and quantile projection

3 Oracle rules: linear model and nonparametric model

4 Simulation


Quantile

Response Y, predictors X. For $\tau \in (0, 1)$, the conditional quantile curve $Q(\cdot\,; \tau)$ of $Y \in \mathbb{R}$ conditional on X is defined through

$$P(Y \le Q(X; \tau) \mid X = x) = \tau \quad \forall x.$$

$Q(x; \tau)$ at τ = 0.1, 0.25, 0.5, 0.75, 0.9 under different models

Figure: three scatterplot panels with quantile curves, titled "Y = 0.5*X + N(0,1)", "Y = X + (.5+X)*N(0,1)" and "Y|X=x ~ Beta(1.5−x,.5+x)"; horizontal axis x (0 to 1), vertical axis y.
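For the two Gaussian models named in the panel titles, the quantile curves have a closed form, which makes the definition concrete. The sketch below is illustrative only; the function names are assumptions, and it relies on the standard fact that a location-scale model Y = a(x) + b(x)·ε with b(x) > 0 has Q(x; τ) = a(x) + b(x)·Φ⁻¹(τ).

```python
from statistics import NormalDist

TAUS = (0.1, 0.25, 0.5, 0.75, 0.9)
PHI_INV = NormalDist().inv_cdf  # standard normal quantile function

def q_homoscedastic(x, tau):
    """Q(x; tau) for Y = 0.5*X + N(0,1): parallel lines in x."""
    return 0.5 * x + PHI_INV(tau)

def q_heteroscedastic(x, tau):
    """Q(x; tau) for Y = X + (0.5 + X)*N(0,1); requires 0.5 + x > 0."""
    return x + (0.5 + x) * PHI_INV(tau)

for tau in TAUS:
    print(tau,
          round(q_homoscedastic(1.0, tau), 3),
          round(q_heteroscedastic(1.0, tau), 3))
```

In the first model the curves are vertical shifts of each other; in the second they fan out as x grows, which is the kind of pattern a conditional mean alone cannot show.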


Quantile regression v.s. mean regression

Mean regression: $Y_i = m(X_i) + \varepsilon_i$, $E[\varepsilon \mid X = x] = 0$

m: regression function, the object of interest.

$\varepsilon_i$: 'errors'.

Quantile regression: $P(Y \le Q(x; \tau) \mid X = x) = \tau$

No strict distinction between 'signal' and 'noise'.

Object of interest: properties of the conditional distribution of $Y \mid X = x$.

Contains much richer information than just the conditional mean.

Figure: two panels plotting Bone Mass Density (BMD) against Age.


Quantile curves v.s. conditional distribution

$$Q(x_0; \tau) = F^{-1}_{Y|X}(\tau \mid x_0),$$

where $F_{Y|X}(y \mid x)$ is the conditional distribution function of Y given X

Figure: left panel, BMD quantile curves $\tau \mapsto Q(\text{Age } x_0; \tau)$ for $x_0$ = 13, 18, 23; right panel, the corresponding conditional CDFs $F(y \mid \text{Age } x_0)$ for $x_0$ = 13, 18, 23, plotted against BMD.


Quantile Regression as Optimization

$\{(X_i, Y_i)\}_{i=1}^{N}$: independent and identically distributed samples in $\mathbb{R}^{d+1}$

Koenker and Bassett (1978): if $Q(x; \tau) = \beta(\tau)^\top x$, estimate by

$$\hat\beta_{or}(\tau) := \arg\min_{b} \sum_{i=1}^{N} \rho_\tau(Y_i - b^\top X_i) \tag{1.1}$$

where $\rho_\tau(u) := \tau u^{+} + (1 - \tau) u^{-}$ is the 'check function'.

Optimization problem (1.1) is convex (but non-smooth) and can be solved by linear programming

$\hat Q(x_0; \tau) := x_0^\top \hat\beta_{or}(\tau)$ for any $x_0$

More generally, we can consider a series approximation model
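To see what the check function does, consider the simplest case with no predictors, where the minimizer of the check loss is a sample quantile. The sketch below is an illustration under that simplifying assumption (intercept only, brute-force search over the data points instead of linear programming); the function names are hypothetical.

```python
import random

def check_loss(u, tau):
    """rho_tau(u) = tau * u^+ + (1 - tau) * u^-."""
    return tau * max(u, 0.0) + (1.0 - tau) * max(-u, 0.0)

def fit_constant_quantile(ys, tau):
    """Minimise sum_i rho_tau(y_i - b) over b.

    The objective is piecewise linear and convex in b, so a minimiser
    can always be found among the observed values.
    """
    return min(ys, key=lambda b: sum(check_loss(y - b, tau) for y in ys))

random.seed(2)
ys = [random.gauss(0.0, 1.0) for _ in range(501)]
tau = 0.25
b_hat = fit_constant_quantile(ys, tau)

# Fraction of observations at or below the fitted value.
share_below = sum(y <= b_hat for y in ys) / len(ys)
print(round(b_hat, 3), round(share_below, 3))
```

The fitted constant leaves roughly a fraction τ of the observations below it. With predictors, the same loss is minimized over a coefficient vector, and a linear programming solver replaces the brute-force search.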

A unified framework:

$$Q(x; \tau) \approx Z(x)^\top \beta(\tau)$$

m := dim(Z) (it is possible that m → ∞). Solve

$$\hat\beta_{or}(\tau) := \arg\min_{b} \sum_{i=1}^{N} \rho_\tau\{Y_i - b^\top Z(X_i)\} \tag{1.2}$$

Examples of Z(x): linear model with fixed/increasing dimension, B-splines, polynomials, trigonometric polynomials

$\hat Q(x; \tau) := Z(x)^\top \hat\beta_{or}(\tau)$

Need to control the "bias" $Q(x; \tau) - Z(x)^\top \beta(\tau)$


Summary for quantile regression

$\hat\beta_{or}(\tau)$ is computationally infeasible when N is so large that the data cannot be handled by a single machine

To infer the "whole" conditional distribution at a fixed $x_0$, use

$$F_{Y|X}(y \mid x_0) = \int_{0}^{1} \mathbf{1}\{Q(x_0; \tau) < y\}\, d\tau$$

where $\mathbf{1}(A) = 1$ if A is true and 0 otherwise. To approximate the integral, we need to compute $Q(x_0; \tau)$ for many τ
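The integral representation can be approximated on a grid of quantile levels. The following sketch uses a known quantile function (the standard normal) purely as a stand-in for $Q(x_0; \cdot)$; the grid size and the function name are assumptions.

```python
from statistics import NormalDist

def cdf_from_quantiles(quantile_fn, y, grid_size=2000):
    """Midpoint Riemann sum for the integral over (0, 1) of 1{Q(tau) < y}."""
    taus = ((k + 0.5) / grid_size for k in range(grid_size))
    return sum(quantile_fn(t) < y for t in taus) / grid_size

std_normal = NormalDist()
approx = cdf_from_quantiles(std_normal.inv_cdf, 1.0)
exact = std_normal.cdf(1.0)
print(round(approx, 4), round(exact, 4))
```

Each grid point requires one evaluation of the quantile function, which is why a procedure that must rerun a full quantile regression for every τ becomes expensive.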


D&C algorithm at fixed τ

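Diagram: Problem(N) is split into four subproblems of size n, giving subsample estimates $\hat\beta^1(\tau_1), \hat\beta^2(\tau_1), \hat\beta^3(\tau_1), \hat\beta^4(\tau_1)$.

$$\hat\beta^s(\tau) := \arg\min_{b \in \mathbb{R}^m} \sum_{i=1}^{n} \rho_\tau\{Y_{is} - b^\top Z(X_{is})\} \tag{2.1}$$

$$\overline\beta(\tau) := \frac{1}{S}\sum_{s=1}^{S} \hat\beta^s(\tau). \tag{2.2}$$

However, this is only for a fixed τ!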


Quantile projection algorithm

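To avoid repeatedly applying D&C, take $B := (B_1, ..., B_q)^\top$, a B-spline basis defined on equidistant knots $\{t_1, ..., t_G\} \subset T$ with degree $r_\tau \in \mathbb{N}$, and set

$$\hat\beta(\tau) := \hat\Xi^\top B(\tau). \tag{2.3}$$

Computation of $\hat\Xi$:

1 Define a grid of quantile levels $\{\tau_1, ..., \tau_K\}$ on $[\tau_L, \tau_U]$, K > q. For each $\tau_k$, compute $\overline\beta(\tau_k)$

2 Compute for j = 1, ..., m

$$\hat\alpha_j := \arg\min_{\alpha \in \mathbb{R}^q} \sum_{k=1}^{K} \big(\overline\beta_j(\tau_k) - \alpha^\top B(\tau_k)\big)^2. \tag{2.4}$$

3 Set the matrix $\hat\Xi := [\hat\alpha_1\ \hat\alpha_2\ ...\ \hat\alpha_m]$.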
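The least-squares step (2.4) can be sketched for a single coordinate j. To keep the example short and dependency-free, it makes several assumptions that differ from the slides: it uses degree-1 B-splines (hat functions) with one basis function per knot instead of a general degree, solves the normal equations by Gaussian elimination, and feeds in the standard normal quantile function as a stand-in for the D&C values on the grid. All function names are hypothetical.

```python
from statistics import NormalDist

def hat_basis(tau, knots):
    """Degree-1 B-spline (hat function) basis on equidistant knots."""
    h = knots[1] - knots[0]
    return [max(0.0, 1.0 - abs(tau - t) / h) for t in knots]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def project(taus, values, knots):
    """alpha minimising sum_k (values[k] - alpha^T B(tau_k))^2, as in (2.4)."""
    rows = [hat_basis(t, knots) for t in taus]
    q = len(knots)
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(q)] for i in range(q)]
    rhs = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(q)]
    return solve(gram, rhs)

tau_l, tau_u, q, K = 0.1, 0.9, 5, 21  # assumed grid; note K > q
knots = [tau_l + (tau_u - tau_l) * g / (q - 1) for g in range(q)]
taus = [tau_l + (tau_u - tau_l) * k / (K - 1) for k in range(K)]
values = [NormalDist().inv_cdf(t) for t in taus]  # stand-in for the grid values
alpha = project(taus, values, knots)

def beta_hat(tau):
    """Projected coordinate: alpha^T B(tau), cheap to evaluate at any tau."""
    return sum(a * b for a, b in zip(alpha, hat_basis(tau, knots)))

print([round(a, 3) for a in alpha], round(beta_hat(0.33), 3))
```

Once the coefficients are stored, evaluating the fitted curve at a new quantile level costs only a basis evaluation, with no further access to the data. Repeating the projection for each of the m coordinates gives the columns of the coefficient matrix.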

Computation of $\hat F_{Y|X}(y \mid x)$

Define $\hat Q(x; \tau) := Z(x)^\top \hat\beta(\tau) = Z(x)^\top \hat\Xi^\top B(\tau)$.

Compute

$$\hat F_{Y|X}(y \mid x_0) := \tau_L + \int_{\tau_L}^{\tau_U} \mathbf{1}\{\hat Q(x_0; \tau) < y\}\, d\tau, \tag{2.5}$$

where $0 < \tau_L < \tau_U < 1$.


Remarks

Both S and K can grow with N, and they determine the computational limits as well as the statistical performance

Find S and K such that $\overline\beta(\tau)$ and $\hat\beta(\tau)$ are "close" to $\hat\beta_{or}(\tau)$ in some statistical sense

This results in a "sharp" upper bound on S and a lower bound on K, which impose computational limits

The two-step procedure requires only one pass through the entire data

Quantile projection requires only $\{\overline\beta(\tau_1), ..., \overline\beta(\tau_K)\}$, of size m × K, without the need to access the raw data set


Oracle rules

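For $T = [\tau_L, \tau_U] \subset (0, 1)$, under regularity conditions, Chao, Volgushev and Cheng (2016):

$$a_N u^\top\big(\hat\beta_{or}(\tau) - \beta(\tau)\big) \rightsquigarrow \mathcal{N} \quad \text{at any fixed } \tau \in T \tag{3.1}$$

$$a_N u^\top\big(\hat\beta_{or}(\tau) - \beta(\tau)\big) \rightsquigarrow \mathbb{G}(\tau) \quad \text{as a process in } \tau \in T \tag{3.2}$$

where $\mathcal{N}$ is a centered normal distribution and $\mathbb{G}$ is a centered Gaussian process

The oracle rule holds for $\overline\beta(\tau)$ and $\hat\beta(\tau)$ if $\overline\beta(\tau)$ satisfies (3.1) and $\hat\beta(\tau)$ satisfies (3.2)

Inference follows from the oracle rule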


Two leading models

Linear model: m = dim(Z(x)) is fixed, and $Q(x; \tau) = Z(x)^\top \beta(\tau)$;

Univariate spline nonparametric model: m = dim(Z(x)) → ∞ with N. $c_N(\gamma_N) := \big|Q(x; \tau) - Z(x)^\top \gamma_N(\tau)\big| \neq 0$,

$$\gamma_N(\tau) := \arg\min_{\gamma \in \mathbb{R}^m} E\big[(Z^\top \gamma - Q(X; \tau))^2 f(Q(X; \tau) \mid X)\big]. \tag{3.3}$$

$\gamma_N$ can be defined in many other ways. We do not go into detail here

Conditions imposed throughout the talk: Assumption (A)

$J_m(\tau) := E[ZZ^\top f_{Y|X}(Q(X; \tau) \mid X)]$


Linear model with fixed dimension

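$P_1(\xi_m, M, \bar f, \bar f', f_{min})$: all distributions satisfying (A1)-(A3) with some constants $0 < \xi_m, M, \bar f, \bar f' < \infty$ and $f_{min} > 0$

Theorem 3.1 (Oracle Rule for $\overline\beta(\tau)$)

Suppose that $S = o(N^{1/2}(\log N)^{-2})$. Then $\forall \tau \in T$ and $u \in \mathbb{R}^m$,

$$\frac{\sqrt{N}\, u^\top(\overline\beta(\tau) - \beta(\tau))}{\big(u^\top J_m(\tau)^{-1} E[Z(X)Z(X)^\top] J_m(\tau)^{-1} u\big)^{1/2}} \rightsquigarrow \mathcal{N}\big(0, \tau(1 - \tau)\big).$$

If $S \gtrsim N^{1/2}$, then the weak convergence result above fails for some $P \in P_1(1, d, \bar f, \bar f', f_{min})$.

$S = o(N^{1/2}(\log N)^{-2})$ is sharp: it misses only by a factor of $(\log N)^{-2}$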
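For a sense of scale, the two rates in Theorem 3.1 can be evaluated for illustrative sample sizes. This is only an arithmetic sketch: the o(·) condition is asymptotic and carries no explicit constant, so the numbers indicate orders of magnitude rather than usable thresholds, and the use of the natural logarithm is an assumption.

```python
import math

def sufficient_rate(N):
    """N^{1/2} / (log N)^2, the rate in S = o(N^{1/2} (log N)^{-2})."""
    return math.sqrt(N) / math.log(N) ** 2

def necessary_rate(N):
    """N^{1/2}, the rate at which the weak convergence is stated to fail."""
    return math.sqrt(N)

for N in (2 ** 14, 2 ** 22):  # assumed example sizes
    print(N, round(sufficient_rate(N), 2), round(necessary_rate(N), 1))
```

The gap between the two columns is exactly the factor (log N)^2, which grows slowly with N compared with the square-root term.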

$\Lambda_c^{\eta_\tau}(T)$: Hölder class with smoothness $\eta_\tau > 0$. $\ell^\infty(T)$: set of all uniformly bounded real functions on T

Theorem 3.2 (Oracle Rule for $\hat\beta(\cdot)$)

Suppose that $\tau \mapsto Q(x_0; \tau) \in \Lambda_c^{\eta_\tau}(T)$ for an $x_0 \in \mathcal{X}$.

If $S = o(N^{1/2}(\log N)^{-2})$ and $K \gg G \gg N^{1/(2\eta_\tau)}$ with $r_\tau \ge \eta_\tau$, then the projection estimator $\hat\beta(\tau)$ defined in (2.3) satisfies

$$\sqrt{N}\big(Z(x_0)^\top \hat\beta(\cdot) - Q(x_0; \cdot)\big) \rightsquigarrow \mathbb{G}(\cdot) \quad \text{in } \ell^\infty(T), \tag{3.4}$$

where $\mathbb{G}$ is a centered Gaussian process.

If $G \lesssim N^{1/(2\eta_\tau)}$, the weak convergence in (3.4) fails for some $P \in P_1(1, d, \bar f, \bar f', f_{min})$ with $\tau \mapsto Q(x_0; \tau) \in \Lambda_c^{\eta_\tau}(T)$.

$K \gg N^{1/(2\eta_\tau)}$ is sharp!


Splines nonparametric model

We require an additional condition:

(L) For each $x \in \mathcal{X}$, the vector Z(x) has zeroes in all but at most r consecutive entries, where r is fixed. Furthermore, $\sup_{x \in \mathcal{X}} E(Z(x), \gamma_N) = O(1)$.

which guarantees that the matrix $J_m(\tau)$ is a block matrix

$$P_L(Z, M, \bar f, \bar f', f_{min}, R) := \big\{\text{all sequences of distributions of } (X, Y) \text{ on } \mathbb{R}^{d+1} \text{ satisfying (A1)-(A3) with } M, \bar f, \bar f' < \infty,\ f_{min} > 0, \text{ and (L) for some } r < R;\ m^2(\log N)^6 = o(N),\ c_N(\gamma_N)^2 = o(N^{-1/2})\big\}. \tag{3.5}$$


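Theorem 3.3 (Oracle rule for $\overline\beta(\tau)$)

Let $\{(X_i, Y_i)\}_{i=1}^{N}$ be distributed according to $P_N \in P_L(Z, M, \bar f, \bar f', f_{min}, R)$, with $m \asymp N^{\zeta}$ for some ζ > 0.

If $S = o\big((N m^{-1}(\log N)^{-4})^{1/2}\big)$, then $\forall \tau \in T$, $x_0 \in \mathcal{X}$,

$$\frac{\sqrt{N}\, Z(x_0)^\top(\overline\beta(\tau) - \gamma_N(\tau))}{\big(Z(x_0)^\top J_m(\tau)^{-1} E[ZZ^\top] J_m(\tau)^{-1} Z(x_0)\big)^{1/2}} \rightsquigarrow \mathcal{N}\big(0, \tau(1 - \tau)\big).$$

If $S \gtrsim (N/m)^{1/2}$, the weak convergence above fails for some $P_N \in P_L(Z, M, \bar f, \bar f', f_{min}, R)$

Sharpness of $S = o\big((N/m)^{1/2}(\log N)^{-2}\big)$: the nonparametric rate is slower; again it misses only by a factor of $(\log N)^{-2}$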

Theorem 3.4 (Oracle rule for $\hat\beta(\tau)$)

Let the assumptions of Theorem 3.3 hold and assume that $\tau \mapsto Q(x_0; \tau) \in \Lambda_c^{\eta_\tau}(T)$, $r_\tau \ge \eta_\tau$.

Suppose that $c_N(\gamma_N) = o(N^{-1/2}\|Z(x_0)\|)$ and $K \gg G \gg N^{1/(2\eta_\tau)}\|Z(x_0)\|^{-1/\eta_\tau}$, and that the limit $H(\tau_1, \tau_2) := \lim_{N \to \infty} K_N(\tau_1, \tau_2) > 0$ $\forall \tau_1, \tau_2 \in T$. Then

$$\frac{\sqrt{N}}{\|Z(x_0)\|}\big(Z(x_0)^\top \hat\beta(\cdot) - Q(x_0; \cdot)\big) \rightsquigarrow \mathbb{G}(\cdot) \quad \text{in } \ell^\infty(T),$$

where $\mathbb{G}$ is a centered Gaussian process with covariance structure $E[\mathbb{G}(\tau)\mathbb{G}(\tau')] = H(\tau, \tau')$.

If $G \lesssim N^{1/(2\eta_\tau)}\|Z(x_0)\|^{-1/\eta_\tau}$, the weak convergence fails for some $\tau \mapsto Q(x_0; \tau) \in \Lambda_c^{\eta_\tau}(T)$, for all S.


Distribution function

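$$\hat F^{or}_{Y|X}(y \mid x_0) := \tau_L + \int_{\tau_L}^{\tau_U} \mathbf{1}\{Z(x_0)^\top \hat\beta_{or}(\tau) < y\}\, d\tau$$

The oracle rule holds for both models.

Corollary 3.5

Under the same conditions as Theorem 3.2 (linear model) or Theorem 3.4 (nonparametric spline model), we have for any $x_0 \in \mathcal{X}$,

$$\sqrt{N}\big(\hat F_{Y|X}(\cdot \mid x_0) - F_{Y|X}(\cdot \mid x_0)\big) \rightsquigarrow -f_{Y|X}(\cdot \mid x_0)\, \mathbb{G}\big(F_{Y|X}(\cdot \mid x_0)\big),$$

in $\ell^\infty\big((Q(x_0; \tau_L), Q(x_0; \tau_U))\big)$, where $\mathbb{G}$ is the centered Gaussian process defined in the respective theorem. The same holds for $\hat F^{or}_{Y|X}(\cdot \mid x_0)$.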


Phase transitions

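Figure: Regions of (S, K) for the oracle rule in the linear model and the spline nonparametric model. The "?" region is the discrepancy between the sufficient and necessary conditions. Thresholds marked in the diagrams: for the linear model, $N^{1/2}/\log^2 N$ and $N^{1/2}$ for S, and $N^{1/(2\eta_\tau)}$ for K; for the spline model, $N^{1/2}/(m^{1/2}\log^2 N)$ and $(N/m)^{1/2}$ for S, and $(N/m)^{1/(2\eta_\tau)}$ for K.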


Setting: linear model

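Model: $Y_i = 0.21 + \beta_{m-1}^\top X_i + \varepsilon_i$, m = 4, 16, 32

$X_i$ follows a multivariate uniform distribution $U([0, 1]^{m-1})$ with covariance matrix $\Sigma_X := E[X_i X_i^\top]$ with $\Sigma_{jk} = 0.1^2 \cdot 0.7^{|j-k|}$ for j, k = 1, ..., m − 1

$\varepsilon \sim N$ or $\varepsilon \sim EXP$ (skewed)

$x_0 = (1, (m - 1)^{-1/2}\, l_{m-1}^\top)^\top$

From Theorem 3.1, the coverage probability of the 1 − α = 95% confidence interval is

$$P\Big\{x_0^\top \beta(\tau) \in \Big[x_0^\top \overline\beta(\tau) \pm N^{-1/2} f_{\varepsilon,\tau}^{-1} \sqrt{\tau(1 - \tau)\, x_0^\top \Sigma_X^{-1} x_0}\; \Phi^{-1}(1 - \alpha/2)\Big]\Big\}$$

$S^*$: the value of S at which the coverage starts to drop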

$\overline\beta(\tau)$, $\varepsilon \sim N(0, 0.1^2)$

Figure: coverage probability against $\log_N(S)$ in three panels (τ = 0.1, 0.5, 0.9) for Normal(0, 0.1^2) errors. Curves: $N = 2^{14}$ and $N = 2^{22}$, each with m = 4, 16, 32.

$\overline\beta(\tau)$, $\varepsilon \sim Exp(0.8)$

Figure: coverage probability against $\log_N(S)$ in three panels (τ = 0.1, 0.5, 0.9) for Exp(0.8) errors. Curves: $N = 2^{14}$ and $N = 2^{22}$, each with m = 4, 16, 32.

Summary for simulation of $\overline\beta(\tau)$

As m increases, $S^*$ decreases

As N increases, $S^*$ gets close to $N^{1/2}$

For $\varepsilon \sim N$, coverage is symmetric in τ; for $\varepsilon \sim Exp$, coverage is asymmetric in τ

The t distribution behaves similarly to the normal distribution


Quantile projection setting

B: cubic B-spline with q = dim(B) defined on G = 4 + q equidistant knots on $[\tau_L, \tau_U]$. We require K > q so that $\hat\beta(\tau)$ is computable (see (2.4))

$N = 2^{14}$

$y_0 = Q(x_0; \tau)$ so that $F_{Y|X}(y_0 \mid x_0) = \tau$

From Theorem 3.2, the coverage probability at nominal level 1 − α = 0.95 is

$$P\Big\{\tau \in \Big[\hat F_{Y|X}(Q(x_0; \tau) \mid x_0) \pm N^{-1/2} \sqrt{\tau(1 - \tau)\, x_0^\top \Sigma_X^{-1} x_0}\; \Phi^{-1}(1 - \alpha/2)\Big]\Big\}$$

$\hat F_{Y|X}(y \mid x)$, $\varepsilon \sim N(0, 0.1^2)$, m = 4 for β(τ)

Figure: coverage probability against $\log_N(S)$ in three panels, $y_0 = Q(x_0; \tau)$ with τ = 0.1, 0.5, 0.9, for Normal(0, 0.1^2) errors and m = 4. Curves: q = 7, 10, 15, each with K = 20 and K = 30.

$\hat F_{Y|X}(y \mid x)$, $\varepsilon \sim N(0, 0.1^2)$, m = 32 for β(τ)

Figure: coverage probability against $\log_N(S)$ in three panels, $y_0 = Q(x_0; \tau)$ with τ = 0.1, 0.5, 0.9, for Normal(0, 0.1^2) errors and m = 32. Curves: q = 7, 10, 15, each with K = 20 and K = 30.

$\hat F_{Y|X}(y \mid x)$, $\varepsilon \sim Exp(0.8)$, m = 4 for β(τ)

Figure: coverage probability against $\log_N(S)$ in three panels, $y_0 = Q(x_0; \tau)$ with τ = 0.1, 0.5, 0.9, for Exp(0.8) errors and m = 4. Curves: q = 7, 10, 15, each with K = 20 and K = 30.

$\hat F_{Y|X}(y \mid x)$, $\varepsilon \sim Exp(0.8)$, m = 32 for β(τ)

Figure: coverage probability against $\log_N(S)$ in three panels, $y_0 = Q(x_0; \tau)$ with τ = 0.1, 0.5, 0.9, for Exp(0.8) errors and m = 32. Curves: q = 7, 10, 15, each with K = 20 and K = 30.

Summary for simulation of $\hat F_{Y|X}(y \mid x)$

An increase in m lowers both $S^*$ and $q^*$ (the critical point in q for the oracle rule)

Either $S > S^*$ or $q < q^*$ (q = G − 4) leads to the collapse of the oracle rule

An increase in q and K improves the coverage probability

For $\varepsilon \sim N$, coverage is no longer symmetric in τ


Thank you for your attention


References I

Chao, S., Volgushev, S., and Cheng, G. (2016). Quantile process for semi and nonparametric regression models. arXiv preprint arXiv:1604.02130.

White, T. (2012). Hadoop: The Definitive Guide. O'Reilly Media / Yahoo Press.

Assumption (A): data $(X_i, Y_i)_{i=1,...,N}$ form a triangular array and are row-wise i.i.d. with

(A1) Assume that $\|Z_i\| \le \xi_m < \infty$, where $\xi_m = O(N^b)$ is allowed to increase to infinity, and that

$$1/M \le \lambda_{min}(E[ZZ^\top]) \le \lambda_{max}(E[ZZ^\top]) \le M$$

holds uniformly in n for some fixed constant M.

(A2) The conditional distribution $F_{Y|X}(y \mid x)$ is twice differentiable w.r.t. y. Denote the corresponding derivatives by $f_{Y|X}(y \mid x)$ and $f'_{Y|X}(y \mid x)$. Assume that

$$\bar f := \sup_{y,x} |f_{Y|X}(y \mid x)| < \infty, \qquad \bar f' := \sup_{y,x} |f'_{Y|X}(y \mid x)| < \infty$$

uniformly in n.

(A3) Assume that

$$0 < f_{min} \le \inf_{\tau \in T} \inf_{x} f_{Y|X}(Q(x; \tau) \mid x)$$

uniformly in n.

$$E[\mathbb{G}(\tau)\mathbb{G}(\tau')] = Z(x_0)^\top J_m(\tau)^{-1} E[Z(X)Z(X)^\top] J_m(\tau')^{-1} Z(x_0)\, (\tau \wedge \tau' - \tau\tau'). \tag{5.1}$$

$$K_N(\tau_1, \tau_2) := \|Z(x_0)\|^{-2} Z(x_0)^\top J_m^{-1}(\tau_1) E[ZZ^\top] J_m^{-1}(\tau_2) Z(x_0)\, (\tau_1 \wedge \tau_2 - \tau_1\tau_2)$$

$$\begin{aligned}
\beta_3 &= (0.21, -0.89, 0.38)^\top;\\
\beta_{15} &= (\beta_3^\top, 0.63, 0.11, 1.01, -1.79, -1.39, 0.52, -1.62, 1.26, -0.72, 0.43, -0.41, -0.02)^\top;\\
\beta_{31} &= (\beta_{15}^\top, 0.21, \beta_{15}^\top)^\top.
\end{aligned} \tag{5.2}$$

$\beta(\tau) = (0.21 + 0.1 \times \Phi^{-1}_{\sigma=0.1}(\tau), \beta_{m-1}^\top)^\top$

$f_{\varepsilon,\tau} > 0$ is the height of the density of $\varepsilon_i$ evaluated at the τ quantile of $\varepsilon_i$.