TRANSCRIPT
Universal and composite hypothesis testing via Mismatched Divergence
Jayakrishnan Unnikrishnan
LCAV, EPFL
Collaborators: Dayu Huang, Sean Meyn, Venu Veeravalli (University of Illinois)
Amit Surana (UTRC)
IPG seminar, 2 March 2011
Outline
• Universal hypothesis testing
  – Hoeffding test
• Problems with large alphabets
  – Mismatched test
    • Dimensionality reduction
    • Improved performance
• Extensions
  – Composite null hypotheses
  – Model-fitting with outliers
  – Rate-distortion test
  – Source coding with training
• Conclusions
Universal Hypothesis Testing
• Given a sequence of i.i.d. observations $X_1, X_2, \dots, X_n$, test the hypotheses

  $H_0$ (null): $X_i \sim p_0$
  $H_1$ (alternate): $X_i \sim p$, with $p \neq p_0$ and $p$ unknown

  – Focus on finite alphabets, i.e., PMFs
• Applications: anomaly detection, spam filtering, etc.
Sufficient statistic
• Empirical distribution:

  $p_n := \frac{1}{n}\,(n_{a_1}, n_{a_2}, \dots, n_{a_N})^T$

  – where $n_a$ denotes the number of times letter $a$ appears in $X_1, X_2, \dots, X_n$
  – $p_n$ is a random vector
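A minimal Python sketch (not part of the talk) of computing the empirical distribution, assuming the alphabet is coded as the integers $0, \dots, A-1$:

```python
import numpy as np

def empirical_distribution(x, A):
    """Type of the sample: p_n[a] = (# of occurrences of a in x) / n."""
    return np.bincount(x, minlength=A) / len(x)

rng = np.random.default_rng(0)
p0 = np.ones(5) / 5                    # uniform PMF on an alphabet of size A = 5
x = rng.choice(5, size=1000, p=p0)     # n = 1000 i.i.d. observations
print(empirical_distribution(x, A=5))  # close to p0 for large n
```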
Hoeffding’s Universal Test
• Hoeffding test [1965]:

  $\hat{H} = \mathbb{I}\{\, D(p_n \,\|\, p_0) \geq \eta \,\}$

  – Uses the KL divergence between $p_n$ and $p_0$ as the test statistic
  – Rejection region: $\{\, q : D(q \,\|\, p_0) \geq \eta \,\}$
• The Hoeffding test is optimal in the error-exponent sense:
  – Sanov’s theorem in large deviations implies

    $\mathbb{P}_{p_0}(\hat{H} = 1) \approx \exp(-n\,\eta)$  (false alarm)
    $\mathbb{P}_{p}(\hat{H} = 0) \approx \exp(-n\,\beta^*(p))$  (missed detection)

• Better approximation of the false-alarm probability via weak convergence under $p_0$:

  $n\, D(p_n \,\|\, p_0) \xrightarrow{d} \tfrac{1}{2}\,\chi^2_{A-1}$
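A hedged Python sketch of the test statistic and decision rule (the conventions for zero-probability letters are my own):

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) with the convention 0 log(0/q) = 0;
    +inf if p charges a letter that q does not."""
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def hoeffding_test(p_n, p0, eta):
    """Declare H1 (return 1) iff D(p_n || p0) >= eta."""
    return int(kl_divergence(p_n, p0) >= eta)
```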
Error exponents are inaccurate
[Figure: false-alarm probability vs. threshold, alphabet size A = 20; the error-exponent approximation deviates substantially from the exact probability.]
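A Monte Carlo check (my own illustration, not the talk’s figure) that the chi-squared weak limit tracks the false-alarm probability far better than the bare error-exponent approximation $\exp(-n\eta)$:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
A, n, eta = 20, 100, 0.2
p0 = np.ones(A) / A

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

stats = [kl(rng.multinomial(n, p0) / n, p0) for _ in range(20000)]
print("Monte Carlo :", np.mean(np.array(stats) >= eta))  # empirical P(false alarm)
print("exp(-n*eta) :", np.exp(-n * eta))                 # error-exponent approx.
print("chi-squared :", chi2.sf(2 * n * eta, df=A - 1))   # weak-convergence approx.
```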
Large Alphabet Regime
• The Hoeffding test performs poorly for large alphabet size $A$
  – suffers from high bias and variance:

    $\mathbb{E}_{p_0}[D(p_n \,\|\, p_0)] \approx \dfrac{A-1}{2n}, \qquad \mathrm{Var}_{p_0}[D(p_n \,\|\, p_0)] \approx \dfrac{A-1}{2n^2}$

• A popular fix: merging low-probability bins, sketched below
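A sketch of one simple merging rule (the cutoff criterion is an assumption for illustration, not the talk’s prescription): letters with small $p_0$-probability share a single bin, shrinking the effective alphabet before testing.

```python
import numpy as np

def merge_low_prob_bins(p0, cutoff):
    """Map letters to merged bins: letters with p0 below `cutoff` share one bin."""
    keep = np.where(p0 >= cutoff)[0]
    bin_of = np.full(len(p0), len(keep))     # low-probability letters -> last bin
    bin_of[keep] = np.arange(len(keep))
    return bin_of, len(keep) + 1

def project(p, bin_of, n_bins):
    """Push a PMF on the original alphabet through the binning map."""
    q = np.zeros(n_bins)
    np.add.at(q, bin_of, p)
    return q
```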
Binning
[Figure: illustration of binning.]
Quantization
[Figure: illustration of quantization.]
General principle
• Dimensionality reduction
• Essentially, we compromise on universality but improve performance against typical alternatives
• Generalization: a parametric family $\{p_\theta\}$ for typical alternatives
Hoeffding test
[Figure: probability simplex showing $p_0$, the empirical distribution $p_n$, and the divergence $D(p_n \,\|\, p_0)$.]
Mismatched test
[Figure: probability simplex showing $p_0$, $p_n$, the parametric family $\{p_\theta\}$, the projection $\hat{p}_n$ of $p_n$ onto the family, and the mismatched divergence $D^{\mathrm{MM}}(p_n \,\|\, p_0) \le D(p_n \,\|\, p_0)$.]
Mismatched test
• Use the mismatched divergence instead of the KL divergence:

  $\hat{H} = \mathbb{I}\{\, D^{\mathrm{MM}}(p_n \,\|\, p_0) \geq \eta \,\}, \qquad D^{\mathrm{MM}}(p_n \,\|\, p_0) = D(\hat{p}_n \,\|\, p_0)$

  – interpretable as a lower bound on the KL divergence: $D^{\mathrm{MM}}(p_n \,\|\, p_0) \leq D(p_n \,\|\, p_0)$
• Idea in short: replace $p_n$ with its ML estimate $\hat{p}_n$ from $\{p_\theta\}$; i.e., it is a GLRT
Exponential family example

  $p_\theta(x) = p_0(x)\,\exp\Big(\textstyle\sum_{i=1}^{d} \theta_i f_i(x) - \Lambda(\theta)\Big), \qquad \theta \in \mathbb{R}^d$

  where $\Lambda(\theta)$ is the normalizer of the family
• The mismatched divergence is the solution to a convex optimization problem:

  $D^{\mathrm{MM}}(p \,\|\, p_0) = \sup_{\theta}\Big\{ \textstyle\sum_i \theta_i\,\langle f_i, p \rangle - \Lambda(\theta) \Big\}$

• Binning is recovered when $f_i(x) = \mathbb{I}_{B_i}(x)$, the indicator functions of the bins
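A minimal numerical sketch of this optimization (the feature functions $f_i$ below are an arbitrary choice, purely for illustration):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def mismatched_divergence(p, p0, F):
    """D_MM(p || p0) = sup_theta { sum_i theta_i <f_i, p> - Lambda(theta) }.
    F has shape (d, A): row i is the function f_i on the alphabet."""
    def neg_objective(theta):
        Lambda = logsumexp(F.T @ theta + np.log(p0))  # log E_p0[exp(theta^T f)]
        return -(theta @ (F @ p) - Lambda)
    res = minimize(neg_objective, x0=np.zeros(F.shape[0]), method="BFGS")
    return -res.fun

A = 6
p0 = np.ones(A) / A
F = np.vstack([np.arange(A), np.arange(A) ** 2]).astype(float)  # d = 2 features
p = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.25])
print(mismatched_divergence(p, p0, F))  # <= D(p || p0)
```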
Mismatched Test properties
+ Addresses the high bias and variance issues:

  $\mathbb{E}_{p_0}[D^{\mathrm{MM}}(p_n \,\|\, p_0)] \approx \dfrac{d}{2n}, \qquad \mathrm{Var}_{p_0}[D^{\mathrm{MM}}(p_n \,\|\, p_0)] \approx \dfrac{d}{2n^2}$

  where $d$ is the dimension of the family, $d \ll A$
− However, not universally optimal in the error-exponent sense
+ Optimal when the alternate distribution lies in $\{p_\theta\}$
  • achieves the same error exponents as the Hoeffding test
  • implies optimality of the GLRT for composite hypotheses
Performance comparison
[Figure: performance comparison of the Hoeffding and mismatched tests; A = 19, n = 40.]
Weak convergence
• When observations $\sim p_0$:

  $n\, D^{\mathrm{MM}}(p_n \,\|\, p_0) \xrightarrow{d} \tfrac{1}{2}\,\chi^2_{d}$

  – Approximate thresholds for a target false-alarm rate
• When observations $\sim p \neq p_0$:

  $\sqrt{n}\,\big( D^{\mathrm{MM}}(p_n \,\|\, p_0) - D^{\mathrm{MM}}(p \,\|\, p_0) \big) \xrightarrow{d} \mathcal{N}(0, \sigma_p^2)$

  – Approximate power of the test
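A hedged sketch of threshold selection from the chi-squared limit (quantile inversion is the obvious route; the function name is mine):

```python
from scipy.stats import chi2

def mismatched_threshold(n, d, alpha):
    """Smallest eta with P(D_MM(p_n || p0) >= eta) ~ alpha under H0,
    using n * D_MM -> (1/2) chi^2_d."""
    return chi2.ppf(1.0 - alpha, df=d) / (2.0 * n)

print(mismatched_threshold(n=40, d=2, alpha=0.05))
```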
EXTENSIONS AND APPLICATIONS
Composite null hypotheses
• Composite null hypotheses / model fitting:

  $H_0$: $X_i \sim p$ for some $p \in \mathcal{P}$
  $H_1$: $X_i \sim q$ for some $q \notin \mathcal{P}$, $q$ unknown

• Test statistic $D(p_n \,\|\, \mathcal{P}) := \inf_{p \in \mathcal{P}} D(p_n \,\|\, p)$, with rejection region

  $\{\, q : \inf_{p \in \mathcal{P}} D(q \,\|\, p) \geq \eta \,\} = \{\, q : D(q \,\|\, \mathcal{P}) \geq \eta \,\}$
Weak convergence
• When observations $\sim p \in \mathcal{P}$:

  $n\, D(p_n \,\|\, \mathcal{P}) \xrightarrow{d} \tfrac{1}{2}\,\chi^2_{A-1-d}$

  where $d$ is the dimension of $\mathcal{P}$
• When observations $\sim p \notin \mathcal{P}$:

  $\sqrt{n}\,\big( D(p_n \,\|\, \mathcal{P}) - D(p \,\|\, \mathcal{P}) \big) \xrightarrow{d} \mathcal{N}(0, \sigma_p^2)$

  – Approximate thresholds for a target false-alarm rate
  – Approximate power of the test
  – Study outlier effects
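A sketch of computing the composite-null statistic numerically. The binomial family below is an assumption for illustration only; any map from parameters to PMFs would do.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / np.maximum(q[m], 1e-300))))

def divergence_to_family(p_n, pmf_of, theta0):
    """D(p_n || P) = inf over theta of D(p_n || p_theta)."""
    res = minimize(lambda th: kl(p_n, pmf_of(th)), x0=theta0,
                   method="Nelder-Mead")
    return res.fun

# Hypothetical example family: binomial(A-1, theta) PMFs on {0, ..., A-1}.
A = 6
pmf_of = lambda th: binom.pmf(np.arange(A), A - 1,
                              np.clip(th[0], 1e-6, 1 - 1e-6))
p_n = np.array([0.10, 0.15, 0.25, 0.25, 0.15, 0.10])
print(divergence_to_family(p_n, pmf_of, theta0=[0.5]))
```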
Outliers in model-fitting
• Data corrupted by outliers or model mismatch
  – Contamination mixture model: $\tilde{p} = (1-\epsilon)\,p + \epsilon\,q$
• Goodness-of-fit metric: $D(p_n \,\|\, \mathcal{P})$
  – Limiting behavior used to quantify the goodness of fit
• Under contamination, the limiting behavior of the goodness-of-fit metric changes:

  $n\, D(p_n \,\|\, \mathcal{P}) \xrightarrow{d} \tfrac{1}{2}\,\chi^2_{A-1-d}$  (no outliers)
  $\sqrt{n}\,\big( D(p_n \,\|\, \mathcal{P}) - D(\tilde{p} \,\|\, \mathcal{P}) \big) \xrightarrow{d} \mathcal{N}(0, \sigma_{\tilde{p}}^2)$  (with outliers)

• Sensitivity of the goodness-of-fit metric to outliers, for small $\epsilon$:

  $D(\tilde{p} \,\|\, \mathcal{P}) \approx \tfrac{1}{2}\,\epsilon^2\,(q-p)^T G\,(q-p), \qquad \sigma_{\tilde{p}}^2 \approx \epsilon^2\,(q-p)^T G\,(q-p)$

  for a positive semidefinite matrix $G$ determined by $p$ and $\mathcal{P}$
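A numeric sanity check (my own illustration, not from the talk) for the simplest case $\mathcal{P} = \{p\}$: there $D(\tilde{p} \,\|\, \mathcal{P}) = D(\tilde{p} \,\|\, p)$, and a second-order Taylor expansion in $\epsilon$ gives $G = \mathrm{diag}(1/p)$.

```python
import numpy as np

def kl(a, b):
    m = a > 0
    return float(np.sum(a[m] * np.log(a[m] / b[m])))

p = np.array([0.4, 0.3, 0.2, 0.1])   # model
q = np.array([0.1, 0.2, 0.3, 0.4])   # outlier distribution
for eps in (0.01, 0.05, 0.10):
    p_tilde = (1 - eps) * p + eps * q
    quad = 0.5 * eps**2 * np.sum((q - p) ** 2 / p)  # (eps^2/2)(q-p)^T G (q-p)
    print(eps, kl(p_tilde, p), quad)                # exact vs. quadratic approx.
```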
Rate-distortion test
• A different generalization of binning
  – Rate-distortion-optimal compression
• Test based on optimally compressed observations [P. Harremoës 09]:

  $\hat{H} = \mathbb{I}\{\, D\big( (p_n)^* \,\|\, (p_0)^* \big) \geq \eta \,\}$

  where $(\cdot)^*$ denotes the distribution after rate-distortion-optimal compression
  – Results on the limiting distribution of the test statistic
Source coding with training
• A wants to encode and transmit a source $X \sim p$ to B
  – Unknown distribution $p$ on a known alphabet
  – Given training samples $X_1, \dots, X_n$
• Choose codelengths based on empirical frequencies:

  $\ell(x) = -\log(p_n(x))$

• The expected excess codelength is asymptotically chi-squared:

  $\mathbb{E}[\ell(X) \mid X_1, \dots, X_n] - H(p) = D(p \,\|\, p_n) \approx \dfrac{1}{2n}\,\chi^2_{A-1}$
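A quick sketch of this excess codelength; the add-one smoothing is my addition, just to keep $p_n$ strictly positive so the idealized lengths are finite.

```python
import numpy as np

rng = np.random.default_rng(2)
A, n = 8, 500
p = rng.dirichlet(np.ones(A))                             # unknown true source
x_train = rng.choice(A, size=n, p=p)
p_n = (np.bincount(x_train, minlength=A) + 1) / (n + A)   # smoothed so p_n > 0

H = -np.sum(p * np.log(p))                 # source entropy
expected_len = -np.sum(p * np.log(p_n))    # E[-log p_n(X)] under the true source
print("excess codelength D(p || p_n):", expected_len - H)
print("predicted scale (A-1)/(2n):  ", (A - 1) / (2 * n))
```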
CLT vs LDP
• Empirical distribution (type) of $\{X_i\}_{i=1}^n$:

  $p_n(x) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}\{X_i = x\}$

• Obeys an LDP (Sanov’s theorem):

  $\mathbb{P}_p\{\, p_n \notin N_\epsilon(p) \,\} \approx \exp(-n\, I(\epsilon, p))$

• Obeys a CLT:

  $\sqrt{n}\,(p_n - p) \xrightarrow{d} \mathcal{N}(0, \Sigma_p)$

LDP
• Good for large deviations
• Approximates the asymptotic slope of the log-probability
  – the pre-exponential factor may be significant

CLT
• Good for moderate deviations
• Approximates the probability itself
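A hedged Monte Carlo check of the CLT for types; the explicit form $\Sigma_p = \mathrm{diag}(p) - p\,p^T$ is the standard multinomial covariance (the slide only names $\Sigma_p$).

```python
import numpy as np

rng = np.random.default_rng(3)
A, n, trials = 4, 2000, 5000
p = np.array([0.4, 0.3, 0.2, 0.1])
devs = np.sqrt(n) * (rng.multinomial(n, p, size=trials) / n - p)
print(np.cov(devs.T))                  # empirical covariance of sqrt(n)(p_n - p)
print(np.diag(p) - np.outer(p, p))     # Sigma_p = diag(p) - p p^T
```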
Conclusions
• Error exponents do not tell the whole story
  – Not a good indicator of the exact probability
  – Tests with identical error exponents can differ drastically over finite samples
• Weak convergence results give better approximations than error exponents (LDPs)
• Compromising universality yields performance improvements against typical alternatives
• Applications: threshold selection, outlier sensitivity, source coding with training
References
• J. Unnikrishnan, D. Huang, S. Meyn, A. Surana, and V. V. Veeravalli, “Universal and Composite Hypothesis Testing via Mismatched Divergence,” IEEE Trans. Inf. Theory, to appear.
• J. Unnikrishnan, S. Meyn, and V. Veeravalli, “On Thresholds for Robust Goodness-of-Fit Tests,” presented at the IEEE Information Theory Workshop, Dublin, Aug. 2010.
• J. Unnikrishnan, “Model-fitting in the presence of outliers,” submitted to ISIT 2011.
  – Available at http://lcavwww.epfl.ch/~unnikris/
Thank You!