scoring, term weighting and the vector space
TRANSCRIPT
Is a document simply a sequence of words?
No: most documents also have structural components such as authors, title and date of publication
Metadata – data about documents
Fields – document features where possible values are finite. Example – dates, ISBN
Zones – document features whose content can be arbitrary free text. Example – title, abstract
A user may specify requirements on fields and zones
One parametric index for each zone/field
For a field, the dictionary comes from a fixed vocabulary
A separate inverted index is built for each zone of the document
For a zone, the dictionary must hold whatever vocabulary stems from the text of that zone
Advantages:
Reduced size of the dictionary
Efficient query answering using weighted zone scoring
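The idea of one inverted index per zone can be sketched in a few lines of Python. This is only an illustrative sketch: the function name build_zone_indexes, the whitespace tokenizer and the two toy documents are assumptions made for the example, not part of the lecture.

```python
from collections import defaultdict


def build_zone_indexes(docs):
    """Build one inverted index per zone.

    docs maps a document ID to a dict of zone name -> zone text.
    Returns zone name -> term -> sorted list of document IDs.
    """
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, zones in docs.items():
        for zone, text in zones.items():
            # Naive tokenization (lowercase + whitespace split), assumed for brevity.
            for term in text.lower().split():
                indexes[zone][term].add(doc_id)
    return {
        zone: {term: sorted(ids) for term, ids in postings.items()}
        for zone, postings in indexes.items()
    }


# Hypothetical toy collection with three zones per document.
docs = {
    1: {"author": "william shakespeare", "title": "hamlet",
        "body": "to be or not to be"},
    2: {"author": "jane doe", "title": "reading shakespeare",
        "body": "notes on shakespeare and hamlet"},
}

zone_indexes = build_zone_indexes(docs)
print(zone_indexes["author"]["shakespeare"])  # [1]
print(zone_indexes["title"]["shakespeare"])   # [2]
```

Each zone has its own dictionary here, so the term shakespeare appears once in the author index and once in the title index, each with its own postings list.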
Different fields/zones have different importance in evaluating how well a document matches a query
For a query q and a document d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0, 1] by computing a linear combination of zone scores
Let each document have l zones. Let $g_1, \dots, g_l \in [0, 1]$ such that $\sum_{i=1}^{l} g_i = 1$
Each field/zone contributes a Boolean value – let $s_i$ be the Boolean score denoting a match (or the absence of a match) between q and the ith zone
The weighted zone score is $\sum_{i=1}^{l} g_i s_i$
Consider the query Shakespeare in a collection in which each document has three zones: author, title and body
The Boolean score function takes the value 1 if the query term Shakespeare is present in the zone, otherwise 0
Weighted zone scoring requires three weights: gbody, gtitle and gauthor
Let gbody=0.5, gtitle=0.3 and gauthor=0.2
Here a match in the author zone is least important, the title zone matters somewhat more and the body contributes the most
The weights could be specified by an expert
They can also be learned from training examples that have been judged editorially
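The Shakespeare example can be worked through in code. The sketch below assumes a single-term query and a naive whitespace tokenizer; the function name weighted_zone_score and the sample document are hypothetical, while the three weights are the ones given above.

```python
def weighted_zone_score(query_term, doc_zones, weights):
    """Return sum_i g_i * s_i, where s_i is 1 if the term occurs in zone i."""
    score = 0.0
    for zone, g in weights.items():
        tokens = doc_zones.get(zone, "").lower().split()
        s = 1 if query_term.lower() in tokens else 0  # Boolean zone score
        score += g * s
    return score


# Weights from the example above; they sum to 1.
weights = {"author": 0.2, "title": 0.3, "body": 0.5}

# Hypothetical document: the term occurs in the author and body zones only.
doc = {
    "author": "William Shakespeare",
    "title": "Hamlet",
    "body": "a tragedy by shakespeare",
}

print(round(weighted_zone_score("shakespeare", doc, weights), 2))  # 0.7
```

A document matching in all three zones would score 1, and one matching in none would score 0, so the score always stays within [0, 1].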
Each training example is a tuple consisting of a query q and a document d and a relevance judgment of q on d
The judgment can be binary
A judgment score can also be used
Compute the weights such that the learned scores approximate the relevance judgments as closely as possible
An optimization problem
Consider two zones: title and body
Compute Boolean variables sT(d, q), sB(d, q) depending on the query matching
Compute a score between 0 and 1 by using the relation:
Score (d, q) = g sT(d, q) + (1-g)sB(d, q)
Constant g is determined from a set of training examples µj = (dj, qj, r(dj, qj))
In each training example, a training document and a training query are assessed by a human editor who delivers a relevance judgment r(dj, qj).
For each training example µj, we have Boolean values sT(dj, qj) and sB(dj, qj), which we use to compute a score
Error of the scoring function on example µj (with the judgment encoded as 1 for relevant and 0 for irrelevant):
$\varepsilon(g, \mu_j) = (r(d_j, q_j) - \mathrm{score}(d_j, q_j))^2$
Total error: $\sum_j \varepsilon(g, \mu_j)$
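The total error for a given g can be computed directly from the definitions. In this sketch each training example is reduced to a tuple (sT, sB, r) with r = 1 for relevant and 0 for irrelevant; the seven examples are invented for illustration, and the function names are assumptions.

```python
def score(g, s_title, s_body):
    """Two-zone weighted score: g * sT + (1 - g) * sB."""
    return g * s_title + (1 - g) * s_body


def total_error(g, examples):
    """Sum of squared differences between judgment and score."""
    return sum((r - score(g, s_t, s_b)) ** 2 for s_t, s_b, r in examples)


# Hypothetical training set: (sT, sB, r) per example.
examples = [
    (1, 1, 1),
    (0, 1, 0),
    (0, 1, 1),
    (0, 0, 0),
    (1, 0, 1),
    (1, 0, 1),
    (1, 0, 0),
]

for g in (0.0, 0.5, 1.0):
    print(g, total_error(g, examples))
```

Examples in which both zones match or neither matches contribute the same error for every g, so only the mixed cases influence the choice of g.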
Let $n_{01r}$ ($n_{01n}$) be the number of training examples for which sT = 0 and sB = 1 and the judgment is relevant (irrelevant). For these examples the score is 1 - g, so their contribution to the total error is
$[1 - (1 - g)]^2 n_{01r} + [0 - (1 - g)]^2 n_{01n}$
Defining the counts for the other combinations of sT and sB in the same way, the total error is $(n_{01r} + n_{10n}) g^2 + (n_{10r} + n_{01n})(1 - g)^2 + n_{00r} + n_{11n}$
By differentiating with respect to g and setting the result to 0, the optimal value of g is
$g = \dfrac{n_{10r} + n_{01n}}{n_{10r} + n_{10n} + n_{01r} + n_{01n}}$
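As a sanity check, the closed-form value of g can be compared with a brute-force search over a grid of candidate values. This is a sketch under assumptions: examples are (sT, sB, r) tuples with r = 1 for relevant and 0 for irrelevant, the data are invented, and the helper names are hypothetical.

```python
def total_error(g, examples):
    """Sum of squared errors of score = g * sT + (1 - g) * sB."""
    return sum((r - (g * s_t + (1 - g) * s_b)) ** 2 for s_t, s_b, r in examples)


def optimal_g(examples):
    """Closed-form minimizer: (n10r + n01n) / (n10r + n10n + n01r + n01n)."""
    n10r = sum(1 for e in examples if e == (1, 0, 1))
    n10n = sum(1 for e in examples if e == (1, 0, 0))
    n01r = sum(1 for e in examples if e == (0, 1, 1))
    n01n = sum(1 for e in examples if e == (0, 1, 0))
    denominator = n10r + n10n + n01r + n01n
    if denominator == 0:
        # No example distinguishes the zones, so every g gives the same error.
        raise ValueError("g is not determined by these examples")
    return (n10r + n01n) / denominator


# Hypothetical training set: (sT, sB, r) per example.
examples = [
    (1, 1, 1),
    (0, 1, 0),
    (0, 1, 1),
    (0, 0, 0),
    (1, 0, 1),
    (1, 0, 1),
    (1, 0, 0),
]

g_closed = optimal_g(examples)

# Brute-force check over g = 0.000, 0.001, ..., 1.000.
grid = [i / 1000 for i in range(1001)]
g_grid = min(grid, key=lambda g: total_error(g, examples))

print(round(g_closed, 3), round(g_grid, 3))
```

With these counts (n10r = 2, n10n = 1, n01r = 1, n01n = 1) the formula gives g = 3/5, and the grid search lands on the same value.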