Scoring, term weighting and the vector space

Upload: ujjawal

Post on 06-Jul-2015


Page 1: Scoring, term weighting and the vector space

Is a document simply a sequence of words?

Many structural components, such as authors, title and date of publication

Metadata – data about documents

Fields – document features whose possible values are finite. Examples – dates, ISBN

Zones – document features whose content can be arbitrary text. Examples – title, abstract


A user may specify requirements on fields and zones


One parametric index for each zone/field

Dictionary comes from a fixed vocabulary

A separate inverted index is built for each zone of the document
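As a rough illustration of the "one index per zone" idea, the sketch below builds a separate inverted index for each zone of a toy collection. The function name `build_zone_indexes`, the sample documents and the whitespace tokenizer are all assumptions made for this example; they are not taken from the slides.

```python
from collections import defaultdict


def build_zone_indexes(docs):
    """Build one inverted index per zone.

    docs maps a document ID to a dict of zone name -> zone text.
    Returns zone name -> (term -> sorted list of document IDs).
    """
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, zones in docs.items():
        for zone, text in zones.items():
            # Assumption: lowercase + whitespace split stands in for real tokenization.
            for term in text.lower().split():
                indexes[zone][term].add(doc_id)
    return {
        zone: {term: sorted(ids) for term, ids in postings.items()}
        for zone, postings in indexes.items()
    }


# Hypothetical two-document collection with author, title and body zones.
docs = {
    1: {"author": "William Shakespeare", "title": "Hamlet", "body": "to be or not to be"},
    2: {"author": "Jane Doe", "title": "Reading Shakespeare", "body": "notes on shakespeare plays"},
}
zone_indexes = build_zone_indexes(docs)
print(zone_indexes["author"].get("shakespeare", []))  # postings from the author zone only
print(zone_indexes["title"].get("shakespeare", []))   # postings from the title zone only
```

Keeping the zones in separate indexes means a query such as "Shakespeare in the author zone" only touches the author index.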

The dictionary of a zone index must structure whatever vocabulary stems from the text of that zone

Advantages:

Reduced size of the dictionary

Efficient query answering using weighted zone scoring


Different fields/zones have different importance in evaluating how well a document matches a query

For a query q and a document d, weighted zone scoring assigns the pair (q, d) a score in the interval [0, 1] by computing a linear combination of zone scores

Let each document have l zones. Let $g_1, \dots, g_l \in [0, 1]$ such that $\sum_{i=1}^{l} g_i = 1$

Each field/zone contributes a Boolean value: let $s_i$ be the Boolean score denoting a match (1) or no match (0) between q and the i-th zone

The weighted zone score is $\sum_{i=1}^{l} g_i s_i$


Consider the query Shakespeare in a collection in which each document has three zones: author, title and body

The Boolean score function takes the value 1 if the query term Shakespeare is present in the zone, and 0 otherwise

Weighted zone scoring requires three weights: $g_{body}$, $g_{title}$ and $g_{author}$

Let $g_{body} = 0.5$, $g_{title} = 0.3$ and $g_{author} = 0.2$

Here the author zone is the least important, the title zone is somewhat more important, and the body contributes the most
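The Shakespeare example can be sketched in a few lines of Python. The weights are the ones given above; the function name `weighted_zone_score`, the sample document and the whitespace tokenizer are assumptions made for illustration only.

```python
# Zone weights from the example above; they sum to 1.
ZONE_WEIGHTS = {"body": 0.5, "title": 0.3, "author": 0.2}


def weighted_zone_score(term, doc, weights=ZONE_WEIGHTS):
    """Return the sum of g_i * s_i over all zones for a one-term query."""
    term = term.lower()
    score = 0.0
    for zone, g in weights.items():
        # s_i is 1 if the term occurs in the zone, 0 otherwise.
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        s = 1 if term in doc.get(zone, "").lower().split() else 0
        score += g * s
    return score


# Hypothetical document: the term occurs in the title and body, not in the author zone.
doc = {
    "author": "Jane Doe",
    "title": "Reading Shakespeare",
    "body": "Shakespeare wrote Hamlet",
}
print(round(weighted_zone_score("Shakespeare", doc), 2))
```

For this document the title and body zones match and the author zone does not, so the score is 0.3 + 0.5 = 0.8.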


The weights could be specified by an expert

Alternatively, they can be learned from training examples that have been judged editorially

Each training example is a tuple consisting of a query q, a document d and a relevance judgment for d on q

The judgment can be binary (relevant or non-relevant)

A graded judgment score can also be used

Compute the weights such that the learned scores approximate the relevance judgments as closely as possible

This is an optimization problem


Consider two zones: title and body

Compute Boolean variables $s_T(d, q)$ and $s_B(d, q)$, indicating whether the query matches the title and the body of d, respectively

Compute a score between 0 and 1 using the relation:

$\text{score}(d, q) = g \cdot s_T(d, q) + (1 - g) \cdot s_B(d, q)$

The constant g is determined from a set of training examples $\mu_j = (d_j, q_j, r(d_j, q_j))$

In each training example, a training document $d_j$ and a training query $q_j$ are assessed by a human editor, who delivers a relevance judgment $r(d_j, q_j)$ (1 for relevant, 0 for non-relevant)

For each training example $\mu_j$, we have Boolean values $s_T(d_j, q_j)$ and $s_B(d_j, q_j)$, which we use to compute a score

Error of the scoring function on example $\mu_j$:

$\varepsilon(g, \mu_j) = (r(d_j, q_j) - \text{score}(d_j, q_j))^2$

Total error: $\sum_j \varepsilon(g, \mu_j)$
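To make the error definition concrete, here is a minimal sketch that evaluates the total squared error for a few candidate values of g. Each training example is reduced to a triple (s_T, s_B, r); that encoding, the function names and the sample judgments are assumptions for this illustration.

```python
def score(g, s_t, s_b):
    """Two-zone weighted score: g * s_T + (1 - g) * s_B."""
    return g * s_t + (1 - g) * s_b


def total_error(g, examples):
    """Sum of squared differences between judgment r and score over all examples."""
    return sum((r - score(g, s_t, s_b)) ** 2 for s_t, s_b, r in examples)


# Hypothetical judged examples, each encoded as (s_T, s_B, r) with r = 1 for relevant.
examples = [
    (1, 0, 1),  # title match only, judged relevant
    (0, 1, 0),  # body match only, judged non-relevant
    (0, 1, 1),  # body match only, judged relevant
    (1, 1, 1),  # both zones match, judged relevant
    (0, 0, 0),  # no match, judged non-relevant
]
for g in (0.25, 0.5, 0.75):
    print(g, total_error(g, examples))
```

Learning g then amounts to finding the value in [0, 1] that makes this total as small as possible.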


Let $n_{01r}$ ($n_{01n}$) be the number of training examples for which $s_T = 0$ and $s_B = 1$ and the judgment is relevant (non-relevant). The contribution of these examples to the total error is

$[1 - (1 - g)]^2 n_{01r} + [0 - (1 - g)]^2 n_{01n}$

Defining the counts for the other three combinations of $s_T$ and $s_B$ in the same way, the total error is $(n_{01r} + n_{10n}) g^2 + (n_{10r} + n_{01n})(1 - g)^2 + n_{00r} + n_{11n}$

By differentiating with respect to g and setting the result to 0, the optimal value of g is

$g = \dfrac{n_{10r} + n_{01n}}{n_{10r} + n_{10n} + n_{01r} + n_{01n}}$
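The closed-form result can be checked numerically. The sketch below counts the four kinds of examples that influence g, applies the formula, and compares the result with a brute-force search over a grid of g values. The (s_T, s_B, r) encoding, the function names and the sample data are assumptions made for this illustration.

```python
from collections import Counter


def total_error(g, examples):
    """Sum of squared errors of score = g * s_T + (1 - g) * s_B against r."""
    return sum((r - (g * s_t + (1 - g) * s_b)) ** 2 for s_t, s_b, r in examples)


def optimal_g(examples):
    """Closed-form minimizer: (n10r + n01n) / (n10r + n10n + n01r + n01n)."""
    n = Counter(examples)  # keys are (s_T, s_B, r) triples
    numerator = n[(1, 0, 1)] + n[(0, 1, 0)]               # n10r + n01n
    denominator = numerator + n[(1, 0, 0)] + n[(0, 1, 1)]  # + n10n + n01r
    if denominator == 0:
        raise ValueError("no example has exactly one matching zone, so g is undetermined")
    return numerator / denominator


# Hypothetical judged examples: n10r = 2, n01n = 1, n10n = 1, n01r = 1.
examples = [
    (1, 0, 1), (1, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1), (0, 0, 0),
]
g_star = optimal_g(examples)
# Brute-force check on a grid of 101 values in [0, 1].
g_grid = min((k / 100 for k in range(101)), key=lambda g: total_error(g, examples))
print(g_star, g_grid)
```

Examples where both zones match or neither zone matches add a constant to the error and therefore do not appear in the formula for g.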