TRANSCRIPT
Efficient Direct Density Ratio Estimation for
Non-stationarity Adaptation and Outlier Detection
Takafumi Kanamori
Shohei Hido
NIPS 2008
Outline
• Motivation
• Importance Estimation
• Direct Importance Estimation
• Approximation Algorithm
• Experiments
• Conclusions
Importance Sampling
• Rather than sampling from the distribution p directly, importance sampling reduces the variance of Ê[f(X)] by an appropriate choice of q; the name reflects the fact that samples from q can be more "important" for estimating the integral.
• Other motivations include cases where drawing samples from p is difficult, or efficiency considerations.
The expectation of f under p is

$$E[f(X) \mid p] = \int f(x)\,p(x)\,dx.$$

Sampling $x_1, x_2, \ldots, x_n$ according to the distribution $p$ gives the Monte Carlo estimate

$$\hat{E}[f] = \int f(x)\,dP_n(x) = \frac{1}{n}\sum_{i=1}^{n} f(x_i).$$

Rewriting the expectation with the importance weight $w(x) = p(x)/q(x)$,

$$E[f(X) \mid p] = \int f(x)\,w(x)\,q(x)\,dx,$$

so sampling $x_1, \ldots, x_n$ according to $q$ instead yields the importance sampling estimators

$$\hat{E}_q[f] = \frac{1}{n}\sum_{i=1}^{n} f(x_i)\,w(x_i) \qquad \text{or} \qquad \hat{E}_q[f] = \frac{\sum_{i=1}^{n} f(x_i)\,w(x_i)}{\sum_{i=1}^{n} w(x_i)}.$$
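The two estimators above can be checked numerically. A minimal sketch, assuming a target $p = N(2,1)$, a proposal $q = N(0,2^2)$, and $f(x) = x^2$ (all chosen here for illustration; $E_p[x^2] = \mu^2 + \sigma^2 = 5$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: E_p[f(X)] with p = N(2, 1) and f(x) = x^2 (true value 5).
# We only draw samples from the proposal q = N(0, 2^2).
f = lambda x: x ** 2
n = 200_000
x = rng.normal(0.0, 2.0, size=n)                  # samples from q

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

w = gauss_pdf(x, 2.0, 1.0) / gauss_pdf(x, 0.0, 2.0)   # importance weights p/q

est_plain = np.mean(f(x) * w)                     # (1/n) sum f(x_i) w(x_i)
est_norm = np.sum(f(x) * w) / np.sum(w)           # self-normalized variant
print(est_plain, est_norm)
```

Both estimates should land near 5; the self-normalized variant is biased but often has lower variance when the weights are noisy.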
[2] R. Srinivasan, Importance sampling: Applications in communications and detection, Springer-Verlag, Berlin, 2002. [3] P. J. Smith, M. Shafi, and H. Gao, "Quick simulation: A review of importance sampling techniques in communication systems," IEEE J. Select. Areas Commun., vol. 15, pp. 597-613, May 1997.
Covariate Shift
Compensated by weighting the training samples according to the importance $w(x) = p_{test}(x)/p_{train}(x)$
$$P_{train}(Y \mid X = x) = P_{test}(Y \mid X = x) \ \text{ for all } x \in \mathcal{X}, \quad \text{but} \quad P_{train}(X) \neq P_{test}(X)$$

$$P_{train}(x, y) = P_{train}(x)\,P(y \mid x), \qquad P_{test}(x, y) = P_{test}(x)\,P(y \mid x)$$
The distribution of inputs changes between the training and test sets, while the conditional distribution of outputs given inputs is unchanged. Under this shift, standard learning techniques such as MLE or CV are biased.
[4]Jiayuan Huang, Alexander J. Smola,Arthur Gretton,et al. Correcting Sample Selection Bias by Unlabeled Data, NIPS 2006.
Outlier Detection
• The importance values for regular samples are close to one, while those for outliers tend to deviate significantly from one.
• The value of the importance can therefore be used as an index of the degree of outlyingness.
Related Works
• Kernel Density Estimation
• Kernel Mean Matching
$$\Phi : \mathcal{X} \to \mathcal{F} \quad \text{a map into the feature space}$$
$$\mu : \mathcal{P} \to \mathcal{F} \quad \text{the expectation operator}, \qquad \mu(\Pr) := E_{x \sim \Pr(x)}[\Phi(x)]$$
Least-square Approach
• Model w(x) with a linear model:

$$\hat{w}(x) = \alpha^T \varphi(x), \qquad \alpha = (\alpha_1, \alpha_2, \ldots, \alpha_b)^T, \quad \varphi(x) = (\varphi_1(x), \varphi_2(x), \ldots, \varphi_b(x))^T$$

• Determine the parameter $\alpha$ so that the squared error

$$J(\alpha) = \frac{1}{2}\int \left(\hat{w}(x) - w(x)\right)^2 p_{train}(x)\,dx$$

is minimized; expanding the square produces the constant $C = \frac{1}{2}\int w(x)\,p_{test}(x)\,dx$, which does not depend on $\alpha$.
Least-Squares Importance Fitting (LSIF)

$$\min_{\alpha} J_0(\alpha), \qquad J_0(\alpha) = \frac{1}{2}\alpha^T H \alpha - h^T \alpha + C$$

$$H = \int \varphi(x)\varphi(x)^T p_{train}(x)\,dx, \qquad h = \int \varphi(x)\,p_{test}(x)\,dx$$

Dropping the constant and adding a regularization term to avoid over-fitting:

$$\Rightarrow \min_{\alpha \in \mathbb{R}^b} \left[ \frac{1}{2}\alpha^T \hat{H} \alpha - \hat{h}^T \alpha + \lambda\,\mathbf{1}_b^T \alpha \right] \quad \text{s.t.} \quad \alpha \geq \mathbf{0}_b$$

Empirical estimates:

$$\hat{H} = \frac{1}{n_{train}} \sum_{i=1}^{n_{train}} \varphi(x_i^{train})\,\varphi(x_i^{train})^T, \qquad \hat{h} = \frac{1}{n_{test}} \sum_{j=1}^{n_{test}} \varphi(x_j^{test})$$
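The LSIF problem is a quadratic program. A minimal sketch on a synthetic 1-D problem ($p_{train} = N(0,1)$, $p_{test} = N(0.5,1)$, Gaussian basis functions, and a projected-gradient solver are all illustrative choices; the paper solves the QP exactly):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D problem: p_train = N(0,1), p_test = N(0.5,1).
x_tr = rng.normal(0.0, 1.0, size=300)
x_te = rng.normal(0.5, 1.0, size=300)

# Gaussian basis functions centered at the first 20 test points.
centers = x_te[:20]
sigma = 1.0
def phi(x):
    x = np.asarray(x, dtype=float)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

H_hat = phi(x_tr).T @ phi(x_tr) / len(x_tr)
h_hat = phi(x_te).mean(axis=0)
lam = 0.01

# Solve min (1/2) a^T H a - h^T a + lam * 1^T a  s.t. a >= 0
# by projected gradient descent.
alpha = np.ones(len(centers))
eta = 0.05
for _ in range(10_000):
    alpha -= eta * (H_hat @ alpha - h_hat + lam)
    alpha = np.maximum(alpha, 0.0)      # project onto the non-negative orthant

w_hat = phi(x_tr) @ alpha               # estimated importance at training points
print(w_hat.mean())                     # mean over p_train approximates E[w] = 1
```

Since $\int w(x)\,p_{train}(x)\,dx = 1$, the mean of the fitted weights over the training sample is a quick sanity check on the estimate.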
Model Selection for LSIF
• Model
the regularization parameter $\lambda$ and the basis functions $\varphi$
• Model selection
Cross Validation
Heuristics for Basis Function Design
• Gaussian kernel centered at the test samples
$$w(x) = \sum_{l=1}^{n_{test}} \alpha_l K_\sigma(x, x_l^{test}), \qquad \text{where } K_\sigma(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$
Unconstrained Least-squares Approach (uLSIF)
• Ignore the non-negativity constraints
• The learned parameters could then be negative; to compensate for the approximation error, modify the solution:
$$\min_{\beta \in \mathbb{R}^b} \left[ \frac{1}{2}\beta^T \hat{H} \beta - \hat{h}^T \beta + \frac{\lambda}{2}\beta^T \beta \right]$$

$$\tilde{\beta} = (\hat{H} + \lambda I_b)^{-1}\,\hat{h}, \qquad \hat{\beta} = \max(\mathbf{0}_b, \tilde{\beta}) \ \text{(element-wise)}$$
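Because the solution is closed-form, uLSIF needs only a linear solve. A minimal sketch (the distributions, kernel width $\sigma$, and $\lambda$ are illustrative choices, not from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(2)

# Estimate w(x) = p_test(x) / p_train(x) for p_train = N(0,1), p_test = N(1,1);
# the true ratio exp(x - 0.5) is increasing in x.
x_tr = rng.normal(0.0, 1.0, size=500)
x_te = rng.normal(1.0, 1.0, size=500)

centers = x_te[:50]                     # Gaussian kernels at test samples
sigma = 1.0
def phi(x):
    x = np.asarray(x, dtype=float)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

H_hat = phi(x_tr).T @ phi(x_tr) / len(x_tr)
h_hat = phi(x_te).mean(axis=0)
lam = 0.1

# Closed-form solution, then clip negative coefficients as in uLSIF.
beta_tilde = np.linalg.solve(H_hat + lam * np.eye(len(centers)), h_hat)
beta = np.maximum(beta_tilde, 0.0)

w_hat = lambda x: phi(x) @ beta
print(w_hat([-1.0, 0.0, 2.0]))          # estimates should increase left to right
```

The clipping step keeps the estimated ratio non-negative, which matters when the weights are fed to downstream learners or used as outlyingness scores.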
Efficient Computation of LOOCV
• Samples $\{x_i^{train}\}_{i=1}^{n_{train}}$ and $\{x_j^{test}\}_{j=1}^{n_{test}}$, with $n_{train} < n_{test}$
• $\hat{\beta}^{(i)}$ is learned without $x_i^{train}$ and $x_i^{test}$
• LOOCV score:

$$\frac{1}{n_{train}} \sum_{i=1}^{n_{train}} \left[ \frac{1}{2}\left( \varphi(x_i^{train})^T \hat{\beta}^{(i)} \right)^2 - \varphi(x_i^{test})^T \hat{\beta}^{(i)} \right]$$

• According to the Sherman-Morrison-Woodbury formula, the matrix inverse needs to be computed only once.
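The single-inverse claim rests on the Sherman-Morrison-Woodbury identity. A quick numerical check of the rank-one case (an illustration of the identity itself, not the paper's LOOCV derivation):

```python
import numpy as np

rng = np.random.default_rng(4)

b = 5
B = rng.normal(size=(b, b))
A = B @ B.T + np.eye(b)                 # well-conditioned SPD matrix
u = rng.normal(size=(b, 1))
v = rng.normal(size=(b, 1))

A_inv = np.linalg.inv(A)
# Sherman-Morrison (rank-1 case of Woodbury):
# (A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u)
lhs = np.linalg.inv(A + u @ v.T)
rhs = A_inv - (A_inv @ u @ v.T @ A_inv) / (1.0 + (v.T @ A_inv @ u).item())
print(np.max(np.abs(lhs - rhs)))        # difference is numerical noise
```

In the LOOCV computation, each leave-one-out solution is a rank-one perturbation of the full $(\hat{H} + \lambda I_b)^{-1}$, so this identity lets all $n$ scores reuse one inverse.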
Experiments: Importance Estimation
• ptrain is the d-dimensional normal distribution with mean zero and covariance identity.
• ptest is the d-dimensional normal distribution with mean (1,0,…,0)T and covariance identity.
• Normalized mean squared error
Covariate Shift Adaptation in Classification and Regression
• Given training samples $\{x_i^{train}\}_{i=1}^{n_{train}}$ with outputs $\{y_i^{train}\}_{i=1}^{n_{train}}$, and test samples $\{x_j^{test}\}_{j=1}^{n_{test}}$
• The task is to predict the outputs for the test samples
• Model:

$$\hat{f}(x; \theta) = \sum_{l=1}^{t} \theta_l K_h(x, m_l)$$

• Importance-weighted regularized least squares (IWRLS):

$$\min_{\theta} \left[ \sum_{i=1}^{n_{tr}} \hat{w}(x_i^{tr}) \left( \hat{f}(x_i^{tr}; \theta) - y_i^{tr} \right)^2 + \gamma\, \|\theta\|_2^2 \right]$$
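The IWRLS objective has a closed-form minimizer, $\theta = (\Phi^T W \Phi + \gamma I)^{-1} \Phi^T W y$ with $W = \mathrm{diag}(\hat{w}(x_i^{tr}))$. A minimal sketch; the data, kernel centers, and the stand-in weights `w` are illustrative assumptions (in the paper the weights come from the density-ratio estimator):

```python
import numpy as np

rng = np.random.default_rng(3)

# Training inputs under-represent x > 0, so those samples get up-weighted.
x_tr = rng.uniform(-2.0, 0.5, size=100)
y_tr = np.sinc(x_tr) + 0.1 * rng.normal(size=100)   # noisy targets

centers = np.linspace(-2.0, 2.0, 10)                # kernel centers m_l (t = 10)
def K(x):
    x = np.asarray(x, dtype=float)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / 2.0)

# Stand-in importance weights w_hat(x); normally estimated by uLSIF.
w = np.exp(x_tr)
gamma = 0.1

# IWRLS closed form: theta = (Phi^T W Phi + gamma I)^{-1} Phi^T W y
Phi = K(x_tr)
W = np.diag(w)
theta = np.linalg.solve(Phi.T @ W @ Phi + gamma * np.eye(len(centers)),
                        Phi.T @ W @ y_tr)

f_hat = lambda x: K(x) @ theta                      # learned function
print(f_hat([0.0, 0.25]))
```

The weighting makes the squared loss an unbiased estimate of the test-distribution risk, which is the whole point of the covariate-shift correction above.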
Experimental Description
• Divide the training samples into R disjoint subsets $\{Z_r^{tr}\}_{r=1}^{R}$
• The function $\hat{f}_r(x)$ is learned from $\{Z_j^{tr}\}_{j \neq r}$ by IWRLS, and its mean test error on the remaining subset is computed:

$$\frac{1}{|Z_r^{tr}|} \sum_{(x, y) \in Z_r^{tr}} \hat{w}(x)\, loss\!\left( \hat{f}_r(x), y \right)$$

where $loss(\hat{y}, y)$ is $(\hat{y} - y)^2$ in regression and $\frac{1}{2}\left(1 - sign(\hat{y}\,y)\right)$ in classification.
Conclusions
• Applications:
– Covariate shift adaptation
– Outlier detection
– Feature selection
– Conditional distribution estimation
– ICA
– ...
Reference
• [1] Takafumi Kanamori, Shohei Hido. Efficient direct density ratio estimation for non-stationarity adaptation and outlier detection, NIPS 2008.
• [2] R. Srinivasan. Importance sampling: Applications in communications and detection, Springer-Verlag, Berlin, 2002.
• [3] P. J. Smith, M. Shafi, and H. Gao. "Quick simulation: A review of importance sampling techniques in communication systems," IEEE J. Select. Areas Commun., vol. 15, pp. 597-613, May 1997.
• [4] Jiayuan Huang, Alexander J. Smola, Arthur Gretton, et al. Correcting Sample Selection Bias by Unlabeled Data, NIPS 2006.
• [5] Jing Jiang. A Literature Survey on Domain Adaptation of Statistical Classifiers.