ir2009f-lecture13-learning to rank using language models and...
TRANSCRIPT
Learning to Rank using Language Models and SVMs
Berlin Chen, Department of Computer Science & Information Engineering
National Taiwan Normal University
References:
1. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Chapter 15 & associated slides, Cambridge University Press
2. Raymond J. Mooney's teaching materials
3. Berlin Chen et al., "A discriminative HMM/N-gram-based retrieval approach for Mandarin spoken documents," ACM Transactions on Asian Language Information Processing 3(2), June 2004.
Discriminatively-Trained Language Models (1/9)
• A simple document-based language model (LM) for information retrieval can be represented by

  P(Q \mid D \text{ is } R) = \prod_{n=1}^{N} \left[ m_1 P(q_n \mid D) + m_2 P(q_n \mid Corpus) \right], \quad Q = q_1 q_2 \cdots q_N

  (a mixture of two probability distributions per query term)
– The general-corpus LM P(q_n | Corpus) is used for probability smoothing and better retrieval performance
– Conventionally, the mixture weights are empirically tuned or optimized by using the Expectation-Maximization (EM) algorithm, subject to m_1 + m_2 = 1 and 0 ≤ m_1, m_2 ≤ 1
– D. R. H. Miller et al., "A hidden Markov model information retrieval system," SIGIR 1999.
– Berlin Chen et al., "An HMM/N-gram-based Linguistic Processing Approach for Mandarin Spoken Document Retrieval," Interspeech 2001.
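As a concrete illustration (not part of the original slides), a minimal Python sketch of this smoothed query-likelihood score; the toy unigram probabilities and the weight values are assumptions:

```python
import math

def query_likelihood(query_terms, doc_lm, corpus_lm, m1=0.7, m2=0.3):
    """Smoothed unigram query likelihood:
    P(Q | D) = prod_n [ m1 * P(q_n | D) + m2 * P(q_n | Corpus) ].
    Summing logs avoids underflow for long queries."""
    score = 0.0
    for q in query_terms:
        p = m1 * doc_lm.get(q, 0.0) + m2 * corpus_lm.get(q, 0.0)
        score += math.log(p) if p > 0.0 else float("-inf")
    return score

# Toy unigram models standing in for maximum-likelihood estimates.
doc_lm = {"retrieval": 0.05, "language": 0.02, "model": 0.04}
corpus_lm = {"retrieval": 0.001, "language": 0.002, "model": 0.003}
print(query_likelihood(["language", "model"], doc_lm, corpus_lm))
```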
Discriminatively-Trained Language Models (2/9)
• For those documents with training queries, m_1 and m_2 can be estimated by using the Minimum Classification Error (MCE) training algorithm
– The ordering of relevant documents D* and irrelevant documents D' in the ranked list for a training query exemplar Q is adjusted to preserve the relationship; i.e., D* should precede D' on the ranked list
• A learning-to-rank algorithm
– Documents thus can have different weights
– Berlin Chen et al., "A discriminative HMM/N-gram-based retrieval approach for Mandarin spoken documents," ACM Transactions on Asian Language Information Processing 3(2), June 2004.
Discriminatively-Trained Language Models (3/9)
• Minimum Classification Error (MCE) Training
– Given a query Q and a desired relevant doc D*, define the classification error function as:

  E(Q, D^*) = \frac{1}{|Q|} \left[ -\log P(Q \mid D^* \text{ is } R) + \max_{D'} \log P(Q \mid D' \text{ is not } R) \right]

  • "> 0" means misclassified; "<= 0" means a correct decision
  • Alternatively, all irrelevant docs in the answer set can be taken into consideration
– Transform the error function to the loss function:

  L(Q, D^*) = \frac{1}{1 + \exp\left( -\alpha E(Q, D^*) + \beta \right)}

  • In the range between 0 and 1
  – α: controls the slope
  – β: controls the offset
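A minimal sketch of the error and loss computations, assuming the reconstructed formulas above; the log-probabilities and the α, β values are illustrative:

```python
import math

def classification_error(logp_rel, logp_irrels, query_len):
    """E(Q, D*) = (1/|Q|) [-log P(Q|D* is R) + max_D' log P(Q|D' is not R)].
    Positive -> misclassified; <= 0 -> a correct ranking decision."""
    return (-logp_rel + max(logp_irrels)) / query_len

def mce_loss(error, alpha=1.0, beta=0.0):
    """Sigmoid loss in (0, 1); alpha controls the slope, beta the offset."""
    return 1.0 / (1.0 + math.exp(-alpha * error + beta))

# D* scores slightly below its best irrelevant competitor: error > 0.
e = classification_error(-12.0, [-11.5, -13.0], query_len=3)
print(e, mce_loss(e))   # ~0.167, loss above 0.5
```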
Discriminatively-Trained Language Models (4/9)
• Minimum Classification Error (MCE) Training
– Apply the loss function to the MCE procedure for iteratively updating the weighting parameters
  • Constraints: m_1 + m_2 = 1, m_k ≥ 0
– Parameter transformation (e.g., Type I HMM):

  m_1 = \frac{e^{\tilde{m}_1}}{e^{\tilde{m}_1} + e^{\tilde{m}_2}} \quad \text{and} \quad m_2 = \frac{e^{\tilde{m}_2}}{e^{\tilde{m}_1} + e^{\tilde{m}_2}}

– Iteratively update \tilde{m}_1 (e.g., Type I HMM) by gradient descent:

  \tilde{m}_1(i+1) = \tilde{m}_1(i) - \varepsilon \left. \frac{\partial L(Q, D^*)}{\partial \tilde{m}_1} \right|_{\tilde{m}_1 = \tilde{m}_1(i)}

  • Where

  \frac{\partial L(Q, D^*)}{\partial \tilde{m}_1} = \frac{\partial L(Q, D^*)}{\partial E(Q, D^*)} \cdot \frac{\partial E(Q, D^*)}{\partial \tilde{m}_1}, \qquad \frac{\partial L(Q, D^*)}{\partial E(Q, D^*)} = \alpha L(Q, D^*) \left( 1 - L(Q, D^*) \right)
Discriminatively-Trained Language Models (5/9)
• Minimum Classification Error (MCE) Training
– Iteratively update \tilde{m}_1 (e.g., Type I HMM)

  Note: \frac{\partial \log f(x)}{\partial x} = \frac{f'(x)}{f(x)}, \qquad \left( \frac{f(x)}{g(x)} \right)' = \frac{f'(x) g(x) - f(x) g'(x)}{g(x)^2}

  • Substituting m_1 = e^{\tilde{m}_1} / (e^{\tilde{m}_1} + e^{\tilde{m}_2}) and m_2 = e^{\tilde{m}_2} / (e^{\tilde{m}_1} + e^{\tilde{m}_2}) into the D*-dependent term of E(Q, D*) and differentiating gives

  \frac{\partial E(Q, D^*)}{\partial \tilde{m}_1} = -\frac{1}{|Q|} \sum_{q_n \in Q} \frac{m_1 m_2 \left[ P(q_n \mid D^*) - P(q_n \mid Corpus) \right]}{m_1 P(q_n \mid D^*) + m_2 P(q_n \mid Corpus)}
Discriminatively-Trained Language Models (6/9)
• Minimum Classification Error (MCE) Training
– Iteratively update m_1. Writing the step size as

  \Delta_{D^*, \tilde{m}_1}(i) = \varepsilon \, \alpha L(Q, D^*) \left( 1 - L(Q, D^*) \right) \frac{1}{|Q|} \sum_{q_n \in Q} \frac{m_1(i) m_2(i) \left[ P(q_n \mid D^*) - P(q_n \mid Corpus) \right]}{m_1(i) P(q_n \mid D^*) + m_2(i) P(q_n \mid Corpus)}

  the new weight follows from the parameter transformation:

  m_1(i+1) = \frac{e^{\tilde{m}_1(i+1)}}{e^{\tilde{m}_1(i+1)} + e^{\tilde{m}_2(i+1)}} = \frac{e^{\tilde{m}_1(i) + \Delta_{D^*, \tilde{m}_1}(i)}}{e^{\tilde{m}_1(i) + \Delta_{D^*, \tilde{m}_1}(i)} + e^{\tilde{m}_2(i) + \Delta_{D^*, \tilde{m}_2}(i)}} = \frac{m_1(i) \, e^{\Delta_{D^*, \tilde{m}_1}(i)}}{m_1(i) \, e^{\Delta_{D^*, \tilde{m}_1}(i)} + m_2(i) \, e^{\Delta_{D^*, \tilde{m}_2}(i)}}

  i.e., the new weight is expressed in terms of the old weight
Discriminatively-Trained Language Models (7/9)
• Minimum Classification Error (MCE) Training
– Final Equations
  • Iteratively update m_1:

  \Delta_{D^*, \tilde{m}_1}(i) = \varepsilon \, \alpha L(Q, D^*) \left( 1 - L(Q, D^*) \right) \frac{1}{|Q|} \sum_{q_n \in Q} \frac{m_1(i) m_2(i) \left[ P(q_n \mid D^*) - P(q_n \mid Corpus) \right]}{m_1(i) P(q_n \mid D^*) + m_2(i) P(q_n \mid Corpus)}

  m_1(i+1) = \frac{m_1(i) \, e^{\Delta_{D^*, \tilde{m}_1}(i)}}{m_1(i) \, e^{\Delta_{D^*, \tilde{m}_1}(i)} + m_2(i) \, e^{\Delta_{D^*, \tilde{m}_2}(i)}}

  • m_2 can be updated in a similar way
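To make the update concrete, here is a sketch of one step under the reconstructed equations above; it uses toy probabilities and differentiates only the relevant-document term of E(Q, D*), so it is an illustrative simplification rather than the paper's exact procedure:

```python
import math

def mce_update(m1, query_terms, doc_lm, corpus_lm, loss, eps=0.5, alpha=1.0):
    """One MCE step for the mixture weights (Type I HMM), following the
    reconstructed final equations above (an assumption of this sketch)."""
    m2 = 1.0 - m1
    grad = 0.0
    for q in query_terms:
        pd, pc = doc_lm.get(q, 0.0), corpus_lm.get(q, 0.0)
        grad += m1 * m2 * (pd - pc) / (m1 * pd + m2 * pc)
    delta1 = eps * alpha * loss * (1.0 - loss) * grad / len(query_terms)
    delta2 = -delta1                    # m~2 moves opposite to m~1
    # Map back to the probability simplex via the softmax transformation.
    z1, z2 = m1 * math.exp(delta1), m2 * math.exp(delta2)
    return z1 / (z1 + z2)

m1 = 0.5
doc_lm = {"spoken": 0.04, "document": 0.03}
corpus_lm = {"spoken": 0.001, "document": 0.002}
m1 = mce_update(m1, ["spoken", "document"], doc_lm, corpus_lm, loss=0.6)
print(m1)   # weight shifts toward the document model when P(q|D*) > P(q|C)
```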
Discriminatively-Trained Language Models (8/9)
• Experimental results with MCE training (average precision on the TDT2 collection; MCE iterations = 100; values in parentheses are before MCE training)

  Indexing features   Word-level (Uni)   Syllable-level (Uni+Bi*)   Fusion
  TQ/TD               0.6459 (0.6327)    0.6858 (0.5718)            0.7329
  TQ/SD               0.5810 (0.5658)    0.6300 (0.5307)            0.6914

  [Figures: average precision vs. MCE iterations for TQ/TD and TQ/SD, with word-based and syllable-based indexing]

– The results for the syllable-level indexing features were significantly improved
Discriminatively-Trained Language Models (9/9)
• Similar treatments have recently been applied to Document Topic Models (e.g., PLSA) and Word Topic Models (WTM) with good success
• For example, the ranking formula for PLSA can be represented by

  P(Q \mid D) = \prod_{n=1}^{N} \left[ m_1 \sum_{T_k} P(q_n \mid T_k) P(T_k \mid D) + (1 - m_1) P(q_n \mid Corpus) \right]

– The weighting parameters and document topic distributions P(T_k | D) can be trained by the MCE algorithm
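A small sketch of the PLSA-smoothed term probability above, with hypothetical topic-word and document-topic distributions:

```python
def plsa_term_prob(q, topic_word, doc_topic, corpus_lm, m1=0.7):
    """P(q|D) = m1 * sum_k P(q|T_k) P(T_k|D) + (1 - m1) * P(q|Corpus)."""
    p_topic = sum(topic_word[k].get(q, 0.0) * doc_topic[k]
                  for k in range(len(doc_topic)))
    return m1 * p_topic + (1.0 - m1) * corpus_lm.get(q, 0.0)

# Two hypothetical topics and a toy document's topic mixture.
topic_word = [{"speech": 0.06, "audio": 0.04}, {"text": 0.05, "speech": 0.01}]
doc_topic = [0.8, 0.2]
print(plsa_term_prob("speech", topic_word, doc_topic, {"speech": 0.002}))
```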
Vector Representations
• Data points (e.g., documents) of different classes (e.g., relevant/non-relevant classes) are represented as vectors in an n-dimensional vector space
– Each dimension has to do with a specific feature, whose value usually is normalized

  Training set: D = \{ (x_i, y_i) \}_{i=1}^{N}, data points x_i with class labels y_i (e.g., +1, -1)
  Decision hyperplane of a two-class problem: w^T x + b = 0
  Decision function: f(x) = \operatorname{sign}(w^T x + b)

• Support vector machines (SVM)
– Look for a decision surface (or hyperplane) that is maximally far away from any data point
– Margin: the distance from the decision surface to the closest data points on either side (the support vectors)
– SVM is a kind of large-margin classifier
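A one-line decision function in Python makes the setup concrete (the weight vector and intercept here are arbitrary, not learned):

```python
import numpy as np

def decision(w, b, x):
    """f(x) = sign(w^T x + b): the two half-spaces are the two classes."""
    return int(np.sign(w @ x + b))

w, b = np.array([2.0, -1.0]), -0.5            # hypothetical separator
print(decision(w, b, np.array([1.0, 0.0])))   # +1 side of the hyperplane
print(decision(w, b, np.array([0.0, 2.0])))   # -1 side of the hyperplane
```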
Support Vectors
• SVM is fully specified by a small subset of the data (i.e., the support vectors) that defines the position of the separator (the decision hyperplane)

  [Figure: the support vectors are 5 points right up against the margin of the classifier; the hyperplane w^T x + b = 0 is characterized by its normal (weight) vector w and the intercept term b]

• Maximization of the margin
– If there are no points near the decision surface, then there are no very uncertain classification decisions
– Also, a slight error in measurement or a slight document variation will not cause a misclassification
Formulation of SVM with Algebra (1/3)
• Assume here that data points are linearly separable
• Euclidean distance of a point x to the decision boundary:
1. The shortest path from a point to a hyperplane is perpendicular to the plane, i.e., parallel to w
2. The point on the plane closest to x is x' = x - y r \frac{w}{\|w\|}; substituting x' into w^T x' + b = 0 gives

  w^T \left( x - y r \frac{w}{\|w\|} \right) + b = 0 \quad \Rightarrow \quad r = y \, \frac{w^T x + b}{\|w\|}

3. We can scale the so-called "functional margin" y (w^T x + b) as we please; for example, to 1
• Therefore, the margin defined by the support vectors is expressed by

  \rho = \frac{2}{\|w\|}

  (i.e., for support vectors y (w^T x + b) = 1, while for the others y (w^T x + b) > 1)
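A quick numeric check of the distance formula (the hyperplane parameters are made up so that ||w|| = 5):

```python
import numpy as np

def geometric_margin(w, b, x, y):
    """r = y (w^T x + b) / ||w||: Euclidean distance of x to the hyperplane,
    positive when x lies on its own class's side."""
    return y * (w @ x + b) / np.linalg.norm(w)

w, b = np.array([3.0, 4.0]), -5.0                          # ||w|| = 5
print(geometric_margin(w, b, np.array([3.0, 4.0]), +1))    # (25 - 5) / 5 = 4.0
```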
Formulation of SVM with Algebra (2/3)
• SVM is designed to find w and b that maximize the geometric margin \frac{2}{\|w\|}
– Maximization of \frac{2}{\|w\|} is equivalent to minimization of \frac{1}{2} w^T w
– For all (x_i, y_i) \in D, y_i (w^T x_i + b) \geq 1
• Mathematical formulation (assume linear separability)
– Primal problem: minimize \frac{1}{2} w^T w with respect to w and b, subject to y_i (w^T x_i + b) \geq 1 for all i

  L_p = \frac{1}{2} w^T w - \sum_{i=1}^{N} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right], \quad \alpha_i \geq 0 \qquad (1)

  (to be minimized with respect to w and b, to be maximized with respect to the Lagrange multipliers \alpha_i)

  \frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i \qquad (2)

  \frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0 \qquad (3)
Formulation of SVM with Algebra (3/3)
• Dual problem (plug (2) and (3) into (1))
– Maximize L_d with respect to the \alpha_i:

  L_d = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i^T x_j

  subject to the constraints \sum_{i=1}^{N} \alpha_i y_i = 0 and \alpha_i \geq 0

– A convex quadratic-optimization problem
• Most \alpha_i are 0 and only a small number have \alpha_i > 0 (the corresponding x_i are the support vectors)
• The problem size has to do with the number of training instances, but not the input dimension
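For intuition, a short scikit-learn sketch (assuming scikit-learn is available) fits a linear SVM on separable toy data and inspects which points end up with non-zero \alpha_i:

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data; a hard margin is approximated with a large C.
X = np.array([[0, 0], [1, 1], [3, 3], [4, 4.5]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Only a few alpha_i are non-zero: those points are the support vectors.
print(clf.support_vectors_)   # points with alpha_i > 0
print(clf.dual_coef_)         # alpha_i * y_i for the support vectors
```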
Dealing with Nonseparability (1/2)
• Datasets that are linearly separable (with some noise) work out great

  [Figure: 1-D data separable by a single threshold]

• But what are we going to do if the dataset is just too hard?

  [Figure: 1-D data with the classes interleaved; no threshold separates them]

• How about mapping data to a higher-dimensional space?

  [Figure: the same data mapped to (x, x^2) becomes linearly separable]
Dealing with Nonseparability (2/2)
• General idea: the original feature space can always be mapped by a function \Phi: x \mapsto \phi(x) to some higher-dimensional feature space where the training set is separable (the "kernel trick")
• Purposes:
– Make a non-separable problem separable
– Map data into a better representational space
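A tiny numeric illustration of the idea, mapping 1-D points x to (x, x^2) so that a horizontal line separates the classes (the data values are chosen for the example):

```python
import numpy as np

# 1-D data that no single threshold separates: negatives sit between positives.
x = np.array([-3.0, -1.0, 0.5, 1.5, 4.0])
y = np.array([+1, -1, -1, -1, +1])

# Map to 2-D with phi(x) = (x, x^2); in the new space the horizontal line
# x2 = 3 separates the classes, so they are now linearly separable.
phi = np.stack([x, x ** 2], axis=1)
pred = np.where(phi[:, 1] > 3, +1, -1)
print(np.array_equal(pred, y))   # True: a linear rule recovers the labels
```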
Kernel Trick (1/2)
• The SVM decision function for an input x at a high-dimensional (the transformed \phi(x)) space can be represented as

  f(x) = \operatorname{sign} \left( w^T \phi(x) + b \right) = \operatorname{sign} \left( \sum_{i=1}^{N} \alpha_i y_i \, \phi(x_i)^T \phi(x) + b \right) = \operatorname{sign} \left( \sum_{i=1}^{N} \alpha_i y_i \, K(x_i, x) + b \right)

– A kernel function K(x_i, x) is introduced, defined by the inner (dot) product of points (vectors) in the high-dimensional space
  • K(x_i, x) can be computed simply and efficiently in terms of the original data points
  • We wouldn't have to actually map x \mapsto \phi(x) (however, we still can directly compute K(x_i, x) = \phi(x_i)^T \phi(x))
Kernel Trick (2/2)
• Common kernel functions
– Polynomials of degree q: K(u, v) = (1 + u^T v)^q
  • Polynomial of degree two (quadratic kernel), where u = (u_1, u_2)^T and v = (v_1, v_2)^T are two-dimensional points:

  K(u, v) = (1 + u^T v)^2 = 1 + 2 u_1 v_1 + 2 u_2 v_2 + u_1^2 v_1^2 + 2 u_1 u_2 v_1 v_2 + u_2^2 v_2^2
          = (1, \sqrt{2} u_1, \sqrt{2} u_2, u_1^2, \sqrt{2} u_1 u_2, u_2^2)^T (1, \sqrt{2} v_1, \sqrt{2} v_2, v_1^2, \sqrt{2} v_1 v_2, v_2^2) = \phi(u)^T \phi(v)

– Radial-basis function (Gaussian distribution): K(u, v) = e^{-\|u - v\|^2 / 2\sigma^2}
– Sigmoidal function: K(u, v) = \tanh(2 u^T v + 1)

• The above kernels are not always very useful in text classification!
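A short numeric check that the quadratic kernel really equals an inner product in the 6-dimensional feature space derived above:

```python
import numpy as np

def quad_kernel(u, v):
    """K(u, v) = (1 + u^T v)^2, computed in the original 2-D space."""
    return (1.0 + u @ v) ** 2

def phi(u):
    """Explicit feature map whose inner product reproduces the kernel."""
    u1, u2 = u
    s = np.sqrt(2.0)
    return np.array([1.0, s * u1, s * u2, u1 ** 2, s * u1 * u2, u2 ** 2])

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(quad_kernel(u, v))   # 4.0
print(phi(u) @ phi(v))     # same value, via the 6-D feature space
```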
Soft-Margin Hyperplane (1/2)
• Even for very high-dimensional problems, data points could be linearly inseparable
• We can instead look for the hyperplane that incurs the least error
– Define slack variables \xi_i \geq 0 that store the variation from the margin for each data point
– Reformulate the optimization criterion with slack variables:
  Find w, b, and \xi_i \geq 0 such that \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i is minimized, and for all (x_i, y_i) \in D, y_i (w^T x_i + b) \geq 1 - \xi_i

  \hat{L}_p = \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i (w^T x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{N} \mu_i \xi_i
Soft-Margin Hyperplane (2/2)
– Dual problem:

  \hat{L}_d = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i^T x_j, \quad \text{subject to } \sum_{i=1}^{N} \alpha_i y_i = 0 \text{ and } 0 \leq \alpha_i \leq C

  • Neither the slack variables \xi_i nor their Lagrange multipliers appear in the dual problem!
  • Again, the x_i with non-zero \alpha_i will be support vectors
  • Solution to the dual problem is:

  w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad b = y_k (1 - \xi_k) - w^T x_k \;\text{ for } k = \arg\max_{k} \alpha_k

– Parameter C can be viewed as a way to control overfitting (a regularization term)
  • The larger the value of C, the more we should pay attention to each individual data point
  • The smaller the value of C, the more we can model the bulk of the data
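A brief scikit-learn sketch of the effect of C on overlapping data (assuming scikit-learn; the data are synthetic):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.2, (40, 2)), rng.normal(1, 1.2, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)     # overlapping classes -> slack needed

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C tolerates margin violations (many support vectors, wide margin);
    # large C penalizes each violation heavily (fits individual points).
    print(C, len(clf.support_))
```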
Using SVM for Ad-Hoc Retrieval (1/2)
• For example, documents are simply represented by two-dimensional feature vectors \psi(d_i, q) consisting of the cosine score and term proximity

  [Figure: training examples plotted in the (cosine score, term proximity) plane, with a linear decision boundary separating relevant from non-relevant pairs]
Using SVM for Ad-Hoc Retrieval (2/2)
• Example: R. Nallapati, "Discriminative Models for Information Retrieval," SIGIR 2004
– Basic features used in SVM

  [Table of basic features omitted in this transcript]

– Compared with LM and ME (maximum entropy) models, tested on 4 TREC collections
Ranking SVM (1/2)
• Construct an SVM that not only considers the relevance of documents to a training query but also the order of each document pair on the ideal ranked list
– First, construct a vector of features \psi(d, q) for each document-query pair
– Second, capture the relationship between each document pair d_i, d_j by introducing a new vector representation for the pair:

  \Phi(d_i, d_j, q) = \psi(d_i, q) - \psi(d_j, q)

– Third, if d_i is more relevant than d_j given q (denoted d_i \prec_q d_j, i.e., d_i should precede d_j on the ranked list), then associate the pair with the label y_{ijq} = +1; otherwise, y_{ijq} = -1
Cf. T. Joachims and F. Radlinski, "Search Engines that Learn from Implicit Feedback," IEEE Computer 40(8), pp. 34-40, 2007
Ranking SVM (2/2)
• Therefore, the above ranking task is formulated as:
– Find w, b, and \xi_{i,j,q} \geq 0 such that
  • \frac{1}{2} w^T w + C \sum_{i,j,q} \xi_{i,j,q} is minimized
  • For all \Phi(d_i, d_j, q) with d_i \prec_q d_j: \; w^T \Phi(d_i, d_j, q) \geq 1 - \xi_{i,j,q}
  (Note that the labels y_{ijq} are left out here. Why?)
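A compact sketch of the pairwise construction, training an ordinary linear SVM on the difference vectors \Phi(d_i, d_j, q); the feature values and relevance grades below are hypothetical:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical feature vectors psi(d, q) = [cosine score, term proximity]
# for four documents of one query; a higher grade means more relevant.
psi = np.array([[0.9, 0.8], [0.6, 0.7], [0.3, 0.2], [0.1, 0.4]])
grade = np.array([2, 1, 0, 0])

# Pairwise transform: Phi(d_i, d_j, q) = psi(d_i, q) - psi(d_j, q),
# labeled +1 when d_i should precede d_j; the mirrored pair gets -1.
pairs, labels = [], []
for i in range(len(psi)):
    for j in range(len(psi)):
        if grade[i] > grade[j]:
            pairs += [psi[i] - psi[j], psi[j] - psi[i]]
            labels += [+1, -1]

clf = LinearSVC(C=1.0).fit(np.array(pairs), np.array(labels))

# Rank the documents by the learned linear score w^T psi(d, q).
scores = psi @ clf.coef_.ravel()
print(np.argsort(-scores))   # ideally consistent with the relevance grades
```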