Scoring, term weighting and the vector space

Is a document simply a sequence of words?

Documents have many structural components, such as authors, title, date of publication, ...

Metadata – data about documents

Fields – document features whose set of possible values is finite. Examples – dates, ISBN

Zones – document features whose content can be arbitrary free text. Examples – title, abstract

A user may specify requirements on fields and zones

One parametric index for each field

Its dictionary comes from a fixed vocabulary

A separate inverted index is built for each zone of the document

Its dictionary must structure whatever vocabulary stems from the text of that zone

Alternatively, the zone can be encoded in the postings instead of in the dictionary

Advantages:

Reduced size of the dictionary

Efficient query answering using weighted zone scoring
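One way to picture a zone index is a single inverted index whose postings record the zones in which each term occurs. The following is a minimal sketch under stated assumptions, not a definitive implementation: documents are assumed to be dicts mapping zone names to text, tokenization is plain whitespace splitting, and `build_zone_index` and the two toy documents are hypothetical names and data chosen for illustration.

```python
from collections import defaultdict


def build_zone_index(docs):
    """Map each term to {doc_id: set of zones in which the term occurs}."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, zones in docs.items():
        for zone, text in zones.items():
            # Assumption: lowercase + whitespace split stands in for real tokenization.
            for term in text.lower().split():
                index[term][doc_id].add(zone)
    return index


# Hypothetical toy collection with three zones per document.
docs = {
    1: {"author": "william shakespeare", "title": "hamlet",
        "body": "to be or not to be"},
    2: {"author": "jane doe", "title": "reading shakespeare",
        "body": "notes on shakespeare and his plays"},
}

index = build_zone_index(docs)
for doc_id, zones in sorted(index["shakespeare"].items()):
    print(doc_id, sorted(zones))
```

Each term appears in the dictionary only once, and a single postings traversal yields every zone in which the term matches, which is what weighted zone scoring needs.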

Different fields/zones have different importance in evaluating how well a document matches a query

For a query q and a document d, weighted zone scoring assigns to the pair (q, d) a score in the interval [0, 1] by computing a linear combination of zone scores

Let each document have $l$ zones. Let $g_1, \dots, g_l \in [0, 1]$ such that $\sum_{i=1}^{l} g_i = 1$

Each field/zone contributes a Boolean value: let $s_i$ be the Boolean score denoting a match (or absence thereof) between q and the i-th zone

The weighted zone score is $\sum_{i=1}^{l} g_i s_i$

Consider the query Shakespeare in a collection in which each document has three zones: author, title and body

The Boolean score function takes the value 1 if the query term Shakespeare is present in the zone, and 0 otherwise

Weighted zone scoring requires three weights: $g_{body}$, $g_{title}$ and $g_{author}$

Let $g_{body} = 0.5$, $g_{title} = 0.3$ and $g_{author} = 0.2$

Here a match in the author zone is least important, a match in the title zone somewhat more, and a match in the body contributes the most
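The Shakespeare example can be worked through in a few lines. This is a rough sketch, assuming a document is a dict from zone name to text and that a zone matches when the query term appears among its whitespace-separated tokens; `weighted_zone_score` and the sample documents are hypothetical.

```python
def weighted_zone_score(query_term, doc, weights):
    """Return the sum of g_i * s_i over all zones, with Boolean s_i."""
    score = 0.0
    for zone, g in weights.items():
        tokens = doc.get(zone, "").lower().split()
        s = 1 if query_term.lower() in tokens else 0  # Boolean zone score s_i
        score += g * s
    return score


# Weights from the example; they sum to 1.
weights = {"body": 0.5, "title": 0.3, "author": 0.2}

# Hypothetical documents.
doc_a = {"author": "william shakespeare", "title": "hamlet",
         "body": "to be or not to be"}
doc_b = {"author": "jane doe", "title": "reading shakespeare",
         "body": "notes on shakespeare and his plays"}

print(weighted_zone_score("Shakespeare", doc_a, weights))
print(weighted_zone_score("Shakespeare", doc_b, weights))
```

With these weights, `doc_a` matches only in the author zone and scores 0.2, while `doc_b` matches in title and body and scores 0.3 + 0.5 = 0.8.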

How are the weights chosen? They can be specified by an expert

They can also be learned from training examples that have been judged editorially

Each training example is a tuple consisting of a query q, a document d, and a relevance judgment for d on q

The judgment can be binary

A judgment score can also be used

Compute the weights such that the learned scores approximate the relevance judgments as much as possible

An optimization problem

Consider two zones: title and body

Compute Boolean variables $s_T(d, q)$ and $s_B(d, q)$, indicating whether the query q matches the title and the body of d, respectively

Compute a score between 0 and 1 by using the relation:

$\mathrm{score}(d, q) = g \cdot s_T(d, q) + (1 - g) \cdot s_B(d, q)$

The constant g is determined from a set of training examples $\mu_j = (d_j, q_j, r(d_j, q_j))$

In each training example, a training document $d_j$ and a training query $q_j$ are assessed by a human editor, who delivers a relevance judgment $r(d_j, q_j)$, taken here to be 1 for relevant and 0 for irrelevant

For each training example $\mu_j$, we have Boolean values $s_T(d_j, q_j)$ and $s_B(d_j, q_j)$ that we use to compute a score

Error of the scoring function with weight g on training example $\mu_j$:

$\varepsilon(g, \mu_j) = (r(d_j, q_j) - \mathrm{score}(d_j, q_j))^2$

Total error: $\sum_j \varepsilon(g, \mu_j)$
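The total squared error for a candidate g can be sketched as follows. The representation is an assumption made for illustration: each training example is reduced to a triple `(s_t, s_b, r)` with Boolean zone scores and a binary judgment (1 relevant, 0 irrelevant), and the example data is hypothetical.

```python
def score(g, s_t, s_b):
    """Two-zone weighted score: g * s_T + (1 - g) * s_B."""
    return g * s_t + (1 - g) * s_b


def total_error(g, examples):
    """Sum of squared differences between judgment r and score."""
    return sum((r - score(g, s_t, s_b)) ** 2 for s_t, s_b, r in examples)


# Hypothetical training examples: (s_T, s_B, r).
examples = [
    (1, 0, 1),  # title match only, judged relevant
    (0, 1, 0),  # body match only, judged irrelevant
    (1, 1, 1),  # both zones match, judged relevant
    (0, 0, 0),  # no match, judged irrelevant
    (0, 1, 1),  # body match only, judged relevant
]

for g in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(g, total_error(g, examples))
```

Scanning a grid of g values like this shows how the error varies with the weight; the closed-form minimizer makes the scan unnecessary.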

Let $n_{01r}$ ($n_{01n}$) be the number of training examples for which $s_T = 0$ and $s_B = 1$ and the judgment is relevant (irrelevant). The contribution of the examples with $s_T = 0$ and $s_B = 1$ to the total error is

$[1 - (1 - g)]^2 n_{01r} + [0 - (1 - g)]^2 n_{01n}$

With the other counts defined analogously, the total error is $(n_{01r} + n_{10n}) g^2 + (n_{10r} + n_{01n})(1 - g)^2 + n_{00r} + n_{11n}$
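The step from the single case to the total can be spelled out by taking the four combinations of title and body match in turn. The counts for the other combinations are assumed to be defined in the same way as for the case of a body match without a title match; examples whose score already equals the judgment contribute nothing.

```latex
\begin{align*}
s_T = 0,\ s_B = 1 &: & \mathrm{score} &= 1 - g, & \text{error} &= g^2\, n_{01r} + (1 - g)^2\, n_{01n} \\
s_T = 1,\ s_B = 0 &: & \mathrm{score} &= g,     & \text{error} &= (1 - g)^2\, n_{10r} + g^2\, n_{10n} \\
s_T = 0,\ s_B = 0 &: & \mathrm{score} &= 0,     & \text{error} &= n_{00r} \\
s_T = 1,\ s_B = 1 &: & \mathrm{score} &= 1,     & \text{error} &= n_{11n}
\end{align*}
```

Summing the four rows and collecting the terms in $g^2$ and $(1 - g)^2$ gives the total error.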

By differentiating with respect to g and setting the result to 0, the optimal value of g is

$g = \dfrac{n_{10r} + n_{01n}}{n_{10r} + n_{10n} + n_{01r} + n_{01n}}$
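Setting the derivative of the total error to zero gives 2(n01r + n10n)g − 2(n10r + n01n)(1 − g) = 0, which solves to the ratio above. A small sketch can check this numerically; the function names and the counts are hypothetical, and the zero-denominator case (where g has no effect on the error) is handled by returning `None` as an assumption of this sketch.

```python
def optimal_g(n10r, n10n, n01r, n01n):
    """Closed-form minimizer of the total error over g."""
    denom = n10r + n10n + n01r + n01n
    if denom == 0:
        return None  # no example depends on g, so every g is equally good
    return (n10r + n01n) / denom


def total_error(g, n10r, n10n, n01r, n01n, n00r=0, n11n=0):
    """Total error expressed through the example counts."""
    return ((n01r + n10n) * g ** 2
            + (n10r + n01n) * (1 - g) ** 2
            + n00r + n11n)


# Hypothetical counts from a judged training set.
counts = dict(n10r=4, n10n=1, n01r=2, n01n=3)

g_star = optimal_g(**counts)
print(g_star, total_error(g_star, **counts))

# Compare against a coarse grid of alternative weights.
grid = [i / 100 for i in range(101)]
best_on_grid = min(grid, key=lambda g: total_error(g, **counts))
print(best_on_grid)
```

With these counts the formula gives g = (4 + 3) / 10 = 0.7, and the grid search lands on the same value.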
