
Efficient Direct Density Ratio Estimation for

Non-stationarity Adaptation and Outlier Detection

Takafumi Kanamori

Shohei Hido

NIPS 2008

Outline

• Motivation

• Importance Estimation

• Direct Importance Estimation

• Approximation Algorithm

• Experiments

• Conclusions

Motivation

• Importance Sampling

• Covariate Shift

• Outlier Detection

Importance Sampling

• Rather than sampling from the distribution p, importance sampling reduces the variance of Ê[f(X)] by an appropriate choice of q; hence the name, as samples from q can be more "important" for the estimation of the integral.

• Other reasons include difficulty in drawing samples from the distribution p, or efficiency considerations.

Sampling x_1, x_2, ..., x_n according to the distribution p, the Monte Carlo estimate is

  E[f(X) | p] = ∫ f(x) p(x) dx ≈ Ê[f] = (1/n) Σ_{i=1}^n f(x_i)

Rewriting the expectation with the importance w(x) = p(x)/q(x),

  E[f(X) | p] = ∫ f(x) p(x) dx = ∫ f(x) w(x) q(x) dx

so sampling x_1, x_2, ..., x_n according to the distribution q instead gives

  Ê[f] = (1/n) Σ_{i=1}^n f(x_i) w(x_i)

[2] R. Srinivasan. Importance Sampling: Applications in Communications and Detection. Springer-Verlag, Berlin, 2002.
[3] P. J. Smith, M. Shafi, and H. Gao. "Quick simulation: A review of importance sampling techniques in communication systems." IEEE J. Select. Areas Commun., vol. 15, pp. 597-613, May 1997.
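The weighted estimator above can be sanity-checked numerically. A minimal sketch with illustrative choices (not from the slides): p = N(0, 1), q = N(0, 2), and f(x) = x², so the true value of E_p[f(X)] is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Estimate E_p[f(X)] for p = N(0, 1) and f(x) = x^2 (true value: 1).
f = lambda x: x ** 2

# Plain Monte Carlo: sample directly from p.
x_p = rng.normal(0.0, 1.0, n)
mc = f(x_p).mean()

# Importance sampling: sample from q = N(0, 2) and reweight by w(x) = p(x)/q(x).
sigma_q = 2.0
x_q = rng.normal(0.0, sigma_q, n)
# log w(x) = log p(x) - log q(x) for the two Gaussian densities
log_w = -0.5 * x_q ** 2 - (-0.5 * (x_q / sigma_q) ** 2 - np.log(sigma_q))
w = np.exp(log_w)
is_est = (f(x_q) * w).mean()
```

Both estimates converge to 1; the point is that the q-samples, once reweighted by w, estimate the same integral.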

Covariate Shift

Compensated by weighting the training samples according to the importance

  P_train(Y | X = x) = P_test(Y | X = x) for all x ∈ χ,  but  P_train(X) ≠ P_test(X)

  P_test(x, y) = P_test(y | x) P_test(x)
  P_train(x, y) = P_train(y | x) P_train(x)

The distribution of the inputs changes between the training and test sets, while the conditional distribution of the output given the input is unchanged. Standard learning techniques such as MLE or CV are then biased.

[4] J. Huang, A. J. Smola, A. Gretton, et al. Correcting Sample Selection Bias by Unlabeled Data. NIPS 2006.

Outlier Detection

• The importance values for regular samples are close to one, while those for outliers tend to deviate significantly from one.

• The values of the importance could be used as an index of the degree of outlyingness.

Related Works

• Kernel Density Estimation

• Kernel Mean Matching

  Φ : X → F  (a map into the feature space)
  μ : P → F  (the expectation operator)

  μ(Pr) := E_{x ~ Pr(x)}[Φ(x)]

[4] J. Huang, A. J. Smola, A. Gretton, et al. Correcting Sample Selection Bias by Unlabeled Data. NIPS 2006.

Direct Importance Estimation

Least-square Approach

• Model w(x) with the linear model

  ŵ(x) = α^T φ(x),  α = (α_1, α_2, ..., α_b)^T,  φ(x) = (φ_1(x), φ_2(x), ..., φ_b(x))^T

• Determine the parameter α so that the squared error J on the training samples is minimized:

  J(α) = (1/2) ∫ (ŵ(x) − w(x))^2 p_train(x) dx = (1/2) α^T H α − h^T α + C

  where

  H = ∫ φ(x) φ(x)^T p_train(x) dx,  h = ∫ φ(x) p_test(x) dx,  C = (1/2) ∫ w(x) p_test(x) dx

Least-Square Importance Fitting (LSIF)

Since C does not depend on α, minimizing J is equivalent to

  min_{α ≥ 0} [ J(α) − C ] = min_{α ≥ 0} [ (1/2) α^T H α − h^T α ]

Replacing H and h by empirical estimates and adding a regularization term to avoid over-fitting gives the LSIF optimization problem:

  min_{α ∈ R^b} [ (1/2) α^T Ĥ α − ĥ^T α + λ 1_b^T α ]  s.t.  α ≥ 0_b

Empirical estimation:

  Ĥ = (1/n_train) Σ_{i=1}^{n_train} φ(x_i^train) φ(x_i^train)^T,  ĥ = (1/n_test) Σ_{j=1}^{n_test} φ(x_j^test)
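The empirical estimates Ĥ and ĥ are simple sample averages; a minimal sketch with toy 1-D samples and Gaussian basis functions centered at test points (sample sizes, b, and the kernel width are arbitrary choices, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D samples standing in for p_train and p_test draws.
x_train = rng.normal(0.0, 1.0, 500)
x_test = rng.normal(1.0, 1.0, 300)

# b Gaussian basis functions centered at the first b test points.
centers = x_test[:20]
sigma = 1.0

def phi(x):
    # Design matrix of basis-function values, shape (len(x), b).
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

Phi_tr = phi(x_train)                       # (n_train, b)
H_hat = Phi_tr.T @ Phi_tr / len(x_train)    # empirical H, shape (b, b)
h_hat = phi(x_test).mean(axis=0)            # empirical h, shape (b,)
```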

Model Selection for LSIF

• Model

the regularization parameter λ and the basis functions φ

• Model selection

Cross Validation

Heuristics for Basis Function Design

• Gaussian kernel centered at the test samples

  ŵ(x) = Σ_{l=1}^{n_test} α_l K_σ(x, x_l^test),
  where K_σ(x, x') = exp( −‖x − x'‖^2 / (2σ^2) )

Unconstrained Least-squares Approach (uLSIF)

• Ignore the non-negativity constraints

• Learned parameters could be negative

• To compensate for the approximation error, modify the solution

  min_{β ∈ R^b} [ (1/2) β^T Ĥ β − ĥ^T β + (λ/2) β^T β ]

  β̃ = max(0_b, β̂)  (element-wise),  where  β̂ = (Ĥ + λ I_b)^{−1} ĥ
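Because uLSIF has a closed-form solution, it fits in a few lines. A minimal sketch (sample sizes, kernel width σ, and λ are arbitrary here, not tuned by cross-validation as the method prescribes):

```python
import numpy as np

rng = np.random.default_rng(2)
x_train = rng.normal(0.0, 1.0, 2000)   # draws from p_train
x_test = rng.normal(0.5, 1.0, 2000)    # draws from p_test

centers = x_test[:50]                  # Gaussian kernels at test points
sigma, lam = 1.0, 0.1

def phi(x):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

H_hat = phi(x_train).T @ phi(x_train) / len(x_train)
h_hat = phi(x_test).mean(axis=0)

# Closed-form uLSIF solution, then element-wise clipping of negatives.
beta_hat = np.linalg.solve(H_hat + lam * np.eye(len(centers)), h_hat)
beta_tilde = np.maximum(0.0, beta_hat)

def w_hat(x):
    # Estimated importance: w(x) = phi(x)^T beta_tilde
    return phi(x) @ beta_tilde
```

Here the true ratio p_test/p_train is increasing in x, and the estimate should reflect that: ŵ evaluated to the right of the origin exceeds ŵ to the left.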

Efficient Computation of LOOCV

• Samples {x_i^train}_{i=1}^{n_train} and {x_j^test}_{j=1}^{n_test}, with n_train < n_test

• β̂^(i): the parameters learned without the i-th samples x_i^train and x_i^test

• LOOCV score:

  (1/n_train) Σ_{i=1}^{n_train} [ (1/2) ( φ(x_i^train)^T β̂^(i) )^2 − φ(x_i^test)^T β̂^(i) ]

• By the Sherman-Morrison-Woodbury formula, the matrix inverse needs to be computed only once.
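The speed-up rests on the rank-one identity (A + uu^T)^{-1} = A^{-1} − A^{-1}u u^T A^{-1} / (1 + u^T A^{-1} u), which lets each held-out solution reuse one precomputed inverse. A small numerical check of the identity on a synthetic matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
b = 5
A = rng.normal(size=(b, b))
A = A @ A.T + b * np.eye(b)      # symmetric, well-conditioned
u = rng.normal(size=b)

A_inv = np.linalg.inv(A)

# Rank-one update of the inverse: no fresh O(b^3) inversion needed.
Au = A_inv @ u                   # A symmetric, so u^T A^{-1} = (A^{-1} u)^T
updated = A_inv - np.outer(Au, Au) / (1.0 + u @ Au)

direct = np.linalg.inv(A + np.outer(u, u))
```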

Experiments: Importance Estimation

• p_train is the d-dimensional normal distribution with mean zero and identity covariance.

• p_test is the d-dimensional normal distribution with mean (1, 0, ..., 0)^T and identity covariance.

• Normalized mean squared error
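For this particular pair of Gaussians, the true importance has the closed form w(x) = p_test(x)/p_train(x) = exp(x_1 − 1/2), which is easy to verify numerically (the dimension and evaluation point below are arbitrary):

```python
import numpy as np

d = 3
mu = np.zeros(d)
mu[0] = 1.0                       # mean (1, 0, ..., 0)^T

def gauss_pdf(x, mean):
    # Density of N(mean, I_d) at x.
    return np.exp(-0.5 * np.sum((x - mean) ** 2)) / (2 * np.pi) ** (d / 2)

x = np.array([0.7, -0.2, 1.3])    # arbitrary evaluation point
ratio = gauss_pdf(x, mu) / gauss_pdf(x, np.zeros(d))
closed_form = np.exp(x[0] - 0.5)  # w(x) = exp(x_1 - 1/2)
```

The normalizing constants cancel, so the ratio depends only on the first coordinate.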

Covariate Shift Adaptation in Classification and Regression

• Given the training samples, the test samples, and the outputs of the training samples

• The task is to predict the outputs for test samples

  Training samples {x_i^train}_{i=1}^{n_train}, test samples {x_j^test}_{j=1}^{n_test}, and training outputs {y_i^train}_{i=1}^{n_train}

• Kernel model:

  f(x; θ) = Σ_{l=1}^{t} θ_l K_h(x, m_l)

• Importance-weighted regularized least squares (IWRLS):

  min_θ [ Σ_{i=1}^{n_train} ŵ(x_i^train) ( f(x_i^train; θ) − y_i^train )^2 + γ θ^T θ ]
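IWRLS is weighted ridge regression, so it also has a closed form: θ = (K^T W K + γI)^{-1} K^T W y, where W is the diagonal matrix of importance weights. A minimal sketch on synthetic data, with placeholder weights standing in for the estimated ŵ:

```python
import numpy as np

rng = np.random.default_rng(4)
n, t = 200, 10

x = rng.uniform(-3.0, 3.0, n)
y = np.sin(x) + 0.1 * rng.normal(size=n)   # synthetic regression data

centers = np.linspace(-3.0, 3.0, t)        # kernel centers m_l

def kernel(xq):
    return np.exp(-(xq[:, None] - centers[None, :]) ** 2 / 2)

K = kernel(x)                  # design matrix, shape (n, t)
w = np.exp(0.5 * x)            # placeholder importance weights for w(x_i^train)
gamma = 1e-2

# Weighted ridge regression: theta = (K^T W K + gamma I)^{-1} K^T W y
KW = K.T * w                   # multiplies sample i's column by w_i
theta = np.linalg.solve(KW @ K + gamma * np.eye(t), KW @ y)

def f_hat(xq):
    return kernel(xq) @ theta
```

Multiplying K^T columnwise by the weight vector avoids materializing the n-by-n diagonal matrix W.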

Experimental Description

• Divide the training samples into R disjoint subsets {Z_r^tr}_{r=1}^R

• A function f̂_r(x) is learned from {Z_j^tr}_{j ≠ r} by IWRLS, and its mean test error over the remaining samples Z_r^tr is computed:

  (1/|Z_r^tr|) Σ_{(x,y) ∈ Z_r^tr} ŵ(x) loss( f̂_r(x), y )

• where loss(ŷ, y) is (ŷ − y)^2 in regression and (1/2)(1 − sign(ŷ y)) in classification.

Covariate shift adaptation

Experiment: Outlier Detection

Conclusions

• Applications
  – Covariate shift adaptation
  – Outlier detection
  – Feature selection
  – Conditional distribution estimation
  – ICA
  – ……

Reference

• [1] T. Kanamori and S. Hido. Efficient direct density ratio estimation for non-stationarity adaptation and outlier detection. NIPS 2008.

• [2] R. Srinivasan. Importance Sampling: Applications in Communications and Detection. Springer-Verlag, Berlin, 2002.

• [3] P. J. Smith, M. Shafi, and H. Gao. "Quick simulation: A review of importance sampling techniques in communication systems." IEEE J. Select. Areas Commun., vol. 15, pp. 597-613, May 1997.

• [4] J. Huang, A. J. Smola, A. Gretton, et al. Correcting sample selection bias by unlabeled data. NIPS 2006.

• [5] J. Jiang. A literature survey on domain adaptation of statistical classifiers.