
A Linear-Time Kernel Goodness-of-Fit Test
Wittawat Jitkrittum¹  Wenkai Xu¹  Zoltán Szabó²  Kenji Fukumizu³  Arthur Gretton¹

¹Gatsby Unit, University College London   ²CMAP, École Polytechnique   ³The Institute of Statistical Mathematics

Summary

• Given: $\{x_i\}_{i=1}^n \sim q$ (unknown), and a density $p$.

• Goal: Test $H_0: p = q$ vs $H_1: p \neq q$ quickly.

• New multivariate goodness-of-fit test (FSSD):

1. Nonparametric: arbitrary, unnormalized $p$; $x \in \mathbb{R}^d$.

2. Linear-time: $O(n)$ runtime complexity. Fast.

3. Interpretable: tells where $p$ does not fit the data.

Previous: Kernel Stein Discrepancy (KSD)

• Let $\xi(x, v) := \frac{1}{p(x)} \nabla_x [k(x, v)\, p(x)] \in \mathbb{R}^d$.

Stein witness function: $g(v) = \mathbb{E}_{x \sim q}[\xi(x, v)]$, where $g = (g_1, \ldots, g_d)$ and each $g_i \in \mathcal{F}$, an RKHS associated with the kernel $k$.

[Figure: a one-dimensional example of the densities p(x) and q(x) together with the Stein witness g(x).]

Known: Under some conditions, $\|g\|_{\mathcal{F}^d} = 0 \iff p = q$. [Chwialkowski et al., 2016, Liu et al., 2016]

Statistic:

$$\mathrm{KSD}^2 = \|g\|_{\mathcal{F}^d}^2 = \underbrace{\mathbb{E}_{x \sim q}\,\mathbb{E}_{y \sim q}\, h_p(x, y)}_{\text{double sums}} \approx \frac{2}{n(n-1)} \sum_{i<j} h_p(x_i, x_j),$$

where

$$h_p(x, y) := [\nabla_x \log p(x)]^\top [\nabla_y \log p(y)]\, k(x, y) + \nabla_x^\top \nabla_y k(x, y) + [\nabla_y \log p(y)]^\top \nabla_x k(x, y) + [\nabla_x \log p(x)]^\top \nabla_y k(x, y).$$

Characteristics of KSD:
✓ Nonparametric. Applicable to a wide range of p.
✓ Does not need the normalizer of p.
✗ Runtime: O(n²). Computationally expensive.

Linear-Time KSD (LKS) Test: [Liu et al., 2016]

$$\|g\|_{\mathcal{F}^d}^2 \approx \frac{2}{n} \sum_{i=1}^{n/2} h_p(x_{2i-1}, x_{2i}).$$

✓ Runtime: O(n).   ✗ High variance. Low test power.
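To make the contrast concrete, here is a minimal numpy sketch of both estimators (not the authors' code), assuming the model p = N(0, I_d), whose score ∇ₓ log p(x) = −x is available in closed form, with a Gaussian kernel; h_p, ksd2_ustat, and lks_stat are illustrative names:

```python
import numpy as np

def h_p(x, y, sigma2=1.0):
    """Stein kernel h_p(x, y) for p = N(0, I_d) (score: -x) and the
    Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma2))."""
    d = x.shape[0]
    diff = x - y
    k = np.exp(-(diff @ diff) / (2.0 * sigma2))
    sx, sy = -x, -y  # scores of p at x and y
    # [s(x)^T s(y)] k + s(y)^T grad_x k + s(x)^T grad_y k + tr(grad_x grad_y k),
    # using grad_x k = -(x - y)/sigma2 * k and grad_y k = +(x - y)/sigma2 * k.
    return ((sx @ sy)
            + (sy @ (-diff / sigma2))
            + (sx @ (diff / sigma2))
            + d / sigma2 - (diff @ diff) / sigma2**2) * k

def ksd2_ustat(X):
    """Quadratic-time U-statistic: 2/(n(n-1)) * sum_{i<j} h_p(x_i, x_j)."""
    n = len(X)
    total = sum(h_p(X[i], X[j]) for i in range(n) for j in range(i + 1, n))
    return 2.0 * total / (n * (n - 1))

def lks_stat(X):
    """Linear-time estimator: 2/n * sum over disjoint pairs (x_{2i-1}, x_{2i})."""
    n = len(X)
    return (2.0 / n) * sum(h_p(X[2 * i], X[2 * i + 1]) for i in range(n // 2))

rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, size=(200, 3))  # data from q = N(0.5, I) != p
print(ksd2_ustat(X), lks_stat(X))       # both positive in expectation, since p != q
```

Both functions estimate the same population quantity KSD²; the linear-time version simply averages h_p over n/2 disjoint pairs instead of all O(n²) pairs, which is why its variance is higher.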

The Finite Set Stein Discrepancy (FSSD)

Idea: Evaluate the witness g at J locations $\{v_1, \ldots, v_J\}$. Fast.

$$\mathrm{FSSD}^2 = \frac{1}{dJ} \sum_{j=1}^{J} \|g(v_j)\|_2^2.$$

Proposition (FSSD is a discrepancy measure). Main conditions:

1. (Nice kernel) The kernel k is $C_0$-universal and real analytic (its Taylor series at any point converges), e.g., the Gaussian kernel.

2. (Vanishing boundary) $\lim_{\|x\| \to \infty} p(x)\, g(x) = 0$.

3. (Avoid "blind spots") The locations $\{v_1, \ldots, v_J\}$ are drawn from a distribution $\eta$ which has a density.

Then, for any $J \geq 1$, $\eta$-almost surely, $\mathrm{FSSD}^2 = 0 \iff p = q$.

Characteristics of FSSD:
✓ Nonparametric.   ✓ Does not need the normalizer of p.
✓ Runtime: O(n).   ✓ Higher test power than LKS.
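A minimal sketch of the plug-in estimate (again for p = N(0, I_d) with a Gaussian kernel; the published test uses an unbiased variant, and xi and fssd2_plugin are illustrative names):

```python
import numpy as np

def xi(x, v, sigma2=1.0):
    """xi(x, v) = [grad_x log p(x)] k(x, v) + grad_x k(x, v), written out for
    p = N(0, I_d) (score: -x) and the Gaussian kernel k."""
    diff = x - v
    k = np.exp(-(diff @ diff) / (2.0 * sigma2))
    return (-x - diff / sigma2) * k  # a vector in R^d

def fssd2_plugin(X, V, sigma2=1.0):
    """Plug-in estimate of FSSD^2 = 1/(dJ) * sum_j ||g(v_j)||_2^2, with the
    witness g(v) replaced by the sample mean of xi(x_i, v) over the data."""
    n, d = X.shape
    J = len(V)
    g_hat = np.array([np.mean([xi(x, v, sigma2) for x in X], axis=0) for v in V])
    return np.sum(g_hat**2) / (d * J)

rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, size=(500, 2))  # sample from q = N(0.5, I) != p
V = rng.normal(size=(5, 2))             # J = 5 random test locations
print(fssd2_plugin(X, V))               # clearly positive, since p != q
```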

Model Criticism with FSSD

Proposal: Optimize the locations $\{v_1, \ldots, v_J\}$ and the kernel bandwidth by $\arg\max\ \text{score} = \mathrm{FSSD}^2 / \sigma_{H_1}$ (runtime: O(n)).

Proposition: This procedure maximizes the true positive rate $= \mathbb{P}(\text{detect difference} \mid p \neq q)$.
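As a sketch of what this optimization looks like: the paper ascends the objective with gradient methods, while the toy version below simply grid-searches candidate locations. The variance estimate σ̂²_{H₁} = 4 μ̂ᵀ Σ̂ μ̂ is a standard delta-method plug-in (our reading, not necessarily the authors' exact implementation); tau and power_score are illustrative names:

```python
import numpy as np

def tau(X, V, sigma2=1.0):
    """Row i is the flattened [xi(x_i, v_1), ..., xi(x_i, v_J)] / sqrt(dJ),
    so that FSSD^2 = ||E tau(x)||^2 (p = N(0, I_d), Gaussian kernel, as above)."""
    n, d = X.shape
    J = len(V)
    T = np.zeros((n, d * J))
    for j, v in enumerate(V):
        diff = X - v                                            # (n, d)
        k = np.exp(-np.sum(diff**2, axis=1) / (2.0 * sigma2))   # (n,)
        T[:, j * d:(j + 1) * d] = (-X - diff / sigma2) * k[:, None]
    return T / np.sqrt(d * J)

def power_score(X, V, sigma2=1.0, reg=1e-6):
    """FSSD^2 / sigma_H1, with FSSD^2 = ||mean(tau)||^2 and the delta-method
    variance sigma_H1^2 = 4 mu^T Sigma mu (Sigma = covariance of tau)."""
    T = tau(X, V, sigma2)
    mu = T.mean(axis=0)
    Sigma = np.cov(T, rowvar=False)
    return (mu @ mu) / np.sqrt(4.0 * (mu @ Sigma @ mu) + reg)

rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, size=(500, 2))    # data from q = N(0.5, I); model p = N(0, I)
candidates = rng.normal(size=(50, 1, 2))  # 50 candidate single locations (J = 1)
best_V = max(candidates, key=lambda V: power_score(X, V))
print(best_V)  # tends to land where p and q disagree most
```

The paper's procedure additionally splits the sample, optimizing the locations on one half and testing on the other, so that the optimization does not invalidate the test level.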

[Figure: two example choices of test locations with objective scores 0.034 and 0.44; the higher-scoring locations better expose where p and q differ.]

Interpretable Features for Model Criticism

• 12K robbery events in Chicago in 2016.

• Model p = 10-component Gaussian mixture.

• ★ = v* = where the model does not fit well.

• Maximization objective: $\mathrm{FSSD}^2 / \sigma_{H_1}$.

• The optimized v* is highly interpretable.

Bahadur Slope and Bahadur Efficiency

• Bahadur slope ≈ the rate at which the p-value of a statistic $T_n$ goes to 0 under $H_1$. High = good. (A worked definition follows below.)

• Bahadur efficiency = the ratio $\text{slope}^{(1)} / \text{slope}^{(2)}$ of the slopes of two tests; a value > 1 means test (1) is better.
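For reference, the (approximate) Bahadur slope underlying these bullets is the exponential decay rate of the p-value $p_n$ under $H_1$ (a standard formulation, not spelled out on the poster):

$$\text{slope}(T) = -\lim_{n \to \infty} \frac{2}{n} \log p_n, \qquad \text{so that } p_n \approx e^{-n \cdot \text{slope}(T)/2}.$$

A larger slope means the p-value vanishes faster, so fewer samples are needed to reject $H_0$ at any fixed level.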

• Results: slopes of the FSSD and LKS tests when $p = \mathcal{N}(0, 1)$ and $q = \mathcal{N}(\mu_q, 1)$.

[Figure: p-values of two statistics $T_n^{(1)}$ and $T_n^{(2)}$ as the sample size n grows from 0 to 100.]

Proposition. Let $\sigma_k^2$ and $\kappa^2$ be the kernel bandwidths of FSSD and LKS, respectively. Fix $\sigma_k^2 = 1$. Then, for all $\mu_q \neq 0$, there exists $v \in \mathbb{R}$ such that for all $\kappa^2 > 0$, the Bahadur efficiency

$$\frac{\text{slope}^{(\mathrm{FSSD})}(\mu_q, v, \sigma_k^2)}{\text{slope}^{(\mathrm{LKS})}(\mu_q, \kappa^2)} > 2.$$

FSSD is statistically more efficient than LKS.

Experiment: Restricted Boltzmann Machine

• 40 binary hidden units, d = 50 visible units. Significance level α = 0.05.

[Figure: schematic of the RBM defining the model p.] Perturb one weight to get q.
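This is a natural fit for FSSD/KSD because the RBM density is unnormalized but its score has a closed form. A hedged sketch, assuming a Gaussian-Bernoulli RBM with hidden units h ∈ {−1, +1}^H marginalized out (the paper's exact parameterization may differ; rbm_score is an illustrative name):

```python
import numpy as np

def rbm_score(x, B, b, c):
    """Score grad_x log p(x) of a Gaussian-Bernoulli RBM with h in {-1, +1}^H
    summed out:  log p(x) = b^T x - ||x||^2 / 2 + sum_j log cosh((B^T x + c)_j)
    up to an additive constant. The normalizer never appears, which is exactly
    why KSD/FSSD-style tests can use this model."""
    return b - x + B @ np.tanh(B.T @ x + c)

rng = np.random.default_rng(0)
d, H = 50, 40
B = rng.normal(size=(d, H)) / np.sqrt(H)
b, c = rng.normal(size=d), rng.normal(size=H)
Bq = B.copy()
Bq[0, 0] += rng.normal()  # perturb one weight to define q, as in the experiment
print(rbm_score(rng.normal(size=d), B, b, c)[:3])
```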

[Figure: rejection rate and runtime (seconds) vs. sample size n (1000-4000) for FSSD-opt, FSSD-rand, KSD, LKS, MMD-opt, and ME-opt; a higher rejection rate is better.]

• FSSD-opt, (FSSD-rand) = proposed tests, with J = 5 optimized (random) locations.

• MMD-opt [Gretton et al., 2012] = state-of-the-art two-sample test (quadratic-time).

• ME-opt [Jitkrittum et al., 2016] = linear-time two-sample test with optimized locations.

• Key: FSSD (O(n)) and KSD (O(n²)) have comparable power. FSSD is much faster.

WJ, WX, and AG thank the Gatsby Charitable Foundation for its financial support. ZSz was financially supported by the Data Science Initiative. KF has been supported by KAKENHI Innovative Areas 25120012.

Contact: [email protected]

Code: github.com/wittawatj/kernel-gof