internet client-server systems · data mining, large -scale data analytics and big data) typical ir...

Information Retrieval


Post on 24-Mar-2020


Page 1: Internet Client-Server Systems · Data Mining, Large -scale Data Analytics and Big Data) Typical IR Task ... Usually supplemented with proximity operators . Boolean Model Output:

Information Retrieval


Information Retrieval

Information Retrieval constructs an index for a given corpus and responds to queries by retrieving all the relevant documents and as few non-relevant documents as possible.

• Index a collection of documents (access efficiency).
• Given a user’s query, rank documents by importance (accuracy).


Typical IR Problem

[Figure: a document collection is mapped to a document representation and a query to a query representation; the two representations are matched to produce an answer.]

• How exact is the representation of the document?
• How exact is the representation of the query?
• How well is the query matched to the data?
• How relevant is the result to the query?


History of IR Systems

• Role of documentalists
• Role of database researchers
• Role of researchers in information retrieval systems
• Role of researchers in information retrieval systems and knowledge management systems


Sources of Information on IR

Top Tier Journals:
• Journal of the American Society for Information Science and Technology (JASIST)
• Information Processing & Management (IPM)
• Information Retrieval (IR)
• Information Sciences (IS)
• Journal of Documentation (JDoc)
• IEEE Transactions on Knowledge and Data Engineering (TKDE)
• ACM Transactions on Information Systems (TOIS)

Top Tier Conferences:
• ACM SIGIR (Special Interest Group on Information Retrieval)
• ACM CIKM (Int. Conf. on Info. and Know. Management)
• AAAI Conference on Artificial Intelligence
• Annual Meeting of the Association for Computational Linguistics
• European Conference on Information Retrieval (ECIR)
• TREC (Text REtrieval Conference)
• ACM SIGKDD (Special Interest Group on Knowledge Discovery, Data Mining, Large-scale Data Analytics and Big Data)


Typical IR Task

Given:
• A corpus of textual natural-language documents.
• A user query in the form of a textual string.

Find:
• A ranked set of documents that are relevant to the query.


Traditional IR System

[Figure: a query string and a document corpus are the inputs to the IR system, which outputs a ranked list of documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]


Web Search System

[Figure: a web crawler builds the document corpus; the query string and the corpus are the inputs to the IR system, which outputs a ranked list of pages: 1. Page1, 2. Page2, 3. Page3, ...]


Retrieval Models

A retrieval model specifies the details of:
• Document representation
• Query representation
• Retrieval function


Information Retrieval Models

Three ‘classic’ models:
• Boolean Model
• Vector Space Model
• Probabilistic Model

Additional models:
• Extended Boolean
• Fuzzy matching
• Cluster-based retrieval
• Language models


“Classic” Retrieval Models

Boolean (‘set theoretic’):
• Documents and queries are sets of index terms.

Vector (‘algebraic’):
• Documents and queries are vectors in n-dimensional space.

Probabilistic:
• Based on probability theory.


Documents

A document is a stored data record in any form

Examples:
• Book, journal article, report, dissertation, encyclopedia
• Part of a text, e.g. paragraph, encyclopedia article
• Also: Web page, image, music, sound, video, video clip


Are Queries Documents?

• Similarities: text based, similar terminology.
• Differences: queries are usually shorter, linguistically less formed, and differ in the statistics of their text.
• It is simpler to think of queries as documents.
• Retrieval is then a “matching” process.


Sample TREC Topic (Query)

Paragraph

<top> <num> Number: 327 <title> Topic: Windows Longhorn <desc> Description: Microsoft is currently developing its newest incarnation of the Windows operating system: Longhorn. <narr> Narrative: As the competition against Microsoft increases, the company is also seeking out new battlefields with its new version of Windows, such as improved file-searching technology. Including this new searching technology, what improvements will be added to Windows, and how is the competition responding?

<related-text> Relevant Longhorn will include a database-like storage engine called Windows Future Storage (WinFS), which is based on technology from SQL Server 2003 (code-named Yukon). This storage engine builds on NTFS and will abstract physical file locations from the user and allow for the sorts of complex data searching that are impossible today. For example, today, your email messages, contacts, Word documents, and music files are all completely separate. That won't be the case in Longhorn. WinFS requires NTFS. </top>

SGML Markup Short Phrase

Sentence (fragment)


Retrieval Matching Process

Binary:
D = 1, 1, 1, 0, 1, 1, 0
Q = 1, 0, 1, 0, 0, 1, 1
sim(D, Q) = 3
• Size of vector = size of vocabulary = 7
• 0 means the corresponding term is not found in the document or query

Weighted:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2


Document Processing


Document Processing in IR Systems

• Assign identifier, store document
• Identify “words”
• Positional information
• Word stemming
• Term weighting


Relevance Feedback in IR

After initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents.

Use this feedback information to reformulate the query.

Produce new results based on the reformulated query.

This allows a more interactive, multi-pass process.


Relevance Feedback Architecture

[Figure: the query string and the document corpus feed the IR system, which returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...). The user marks results up or down (Doc1 ⇓, Doc2 ⇑, Doc3 ⇓); this feedback drives query reformulation, and the revised query produces re-ranked documents (1. Doc2, 2. Doc4, 3. Doc5, ...).]


Boolean Information Retrieval


Boolean Model

Based on set theory and Boolean algebra

Queries are specified as Boolean expressions

Widely used in commercial IR systems (Dialog, Lexis/Nexis)

Based on an inverted index file

Usually supplemented with proximity operators


Boolean Model

Output: a document is either relevant or not. The model requires an exact match, so there are no partial matches and no ranking.

A document is represented as a set of keywords.

Queries are Boolean expressions of keywords, connected by logical AND, OR, and NOT, with brackets to indicate scope, for example:

[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton


Logical AND (∧) (Set Intersection)

A ∧ B is the set of things in common, i.e., in both sets A and B.

Example: with A = Aged and B = Blind, A ∧ B is the set of aged, blind people.


Logical OR (∨) (Set Union)

A ∨ B is the set of things in either A, B or both.

Example: with A = Aged and B = Blind, A ∨ B is the set of people that are either aged or blind or both.


Logical NOT (¬) (Set Complement)

¬ B is the set of things outside the set B.

Example: with B = Blind, ¬ B is the set of people who aren’t blind.


Example Combination

A ∧ (¬ B)

Example: with A = Aged and B = Blind, A ∧ (¬ B) is the set of old people who aren’t blind.


More Examples

D1 = “computer information retrieval”
D2 = “computer retrieval”
D3 = “information”
D4 = “computer information”

Q1 = “information ∧ retrieval” retrieves D1
Q2 = “information ∧ ¬ computer” retrieves D3
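The two queries can be checked with a small inverted index. The sketch below is only an illustration under simple assumptions (whitespace tokenization, no stemming); the helper name `build_inverted_index` is hypothetical and not part of the slides.

```python
from collections import defaultdict

# The four toy documents from the example.
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


index = build_inverted_index(docs)
all_docs = set(docs)

# Q1 = information AND retrieval: set intersection of the two posting sets.
q1 = index["information"] & index["retrieval"]

# Q2 = information AND NOT computer: intersect with the complement set.
q2 = index["information"] & (all_docs - index["computer"])

print(sorted(q1))
print(sorted(q2))
```

AND, OR and NOT map directly onto set intersection, union and complement, which is why the result is an unranked set of exact matches.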


Boolean Retrieval Model

Popular retrieval model because:
• Easy to understand for simple queries.
• Clean formalism.
• Reasonably efficient implementations are possible for normal queries.


Drawbacks of the Boolean Model

• Very rigid: AND means all; OR means any.
• Difficult to express complex user requests.
• Difficult to control the number of documents retrieved: all matched documents will be returned.
• Difficult to rank output: all matched documents logically satisfy the query.
• Difficult to perform relevance feedback: if a document is identified by the user as relevant or irrelevant, how should the query be modified?


Drawbacks of the Boolean Model

Retrieval based on binary decision criteria with no notion of partial matching

No ranking of the documents is provided (absence of a grading scale)

Information need has to be translated into a Boolean expression which most users find awkward

The Boolean queries formulated by the users are most often too simplistic

As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query


Vector Space Information Retrieval


Vector Space Model

• Based on the idea of an n-dimensional document space
• The query is also located in document space
• Documents are ranked in order of their “closeness” to the query
• Many possible matching functions


Issues for Vector Space Model

How to determine important words in a document?

Word sense?

Word n-grams (and phrases, idioms,…) terms

How to determine the degree of importance of a term within a document and within the entire collection?

How to determine the degree of similarity between a document and the query?

In the case of the web, what is a collection and what are the effects of links, formatting information, etc.?


Vector-Space Model

Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.

These “orthogonal” terms form a vector space: dimension = t = |vocabulary|.

Each term i in a document or query j is given a real-valued weight $w_{ij}$.

Both documents and queries are expressed as t-dimensional vectors: $d_j = (w_{1j}, w_{2j}, \dots, w_{tj})$


Graphic Representation

Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2 and Q drawn as vectors in the three-dimensional space with axes T1, T2 and T3.]

• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?


Inner Product -- Examples

Binary:
D = 1, 1, 1, 0, 1, 1, 0
Q = 1, 0, 1, 0, 0, 1, 1
sim(D, Q) = 3
• Size of vector = size of vocabulary = 7
• 0 means the corresponding term is not found in the document or query

Weighted:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2


Document Collection

A collection of n documents can be represented in the vector space model by a term-document matrix.

An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document.

      T1    T2    …    Tt
D1    w11   w21   …    wt1
D2    w12   w22   …    wt2
:     :     :          :
Dn    w1n   w2n   …    wtn


Term Weights: Term Frequency

More frequent terms in a document are more important, i.e. more indicative of the topic.

$f_{ij}$ = frequency of term i in document j

May want to normalize term frequency (tf) by the frequency of the most common term in the document:

$tf_{ij} = f_{ij} / \max_i \{f_{ij}\}$


Term Weights: Inverse Document Frequency

Terms that appear in many different documents are less indicative of overall topic.

$df_i$ = document frequency of term i = number of documents containing term i

$idf_i$ = inverse document frequency of term i = $\log_2 (N / df_i)$ (N: total number of documents)

• An indication of a term’s discrimination power.
• Log used to dampen the effect relative to tf.


TF-IDF Weighting

A typical combined term importance indicator is tf-idf weighting:

$w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \log_2 (N / df_i)$

A term occurring frequently in the document but rarely in the rest of the collection is given high weight.

Many other ways of determining term weights have been proposed.

Experimentally, tf-idf has been found to work well.


Computing TF-IDF -- An Example

Given a document containing terms with given frequencies:

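A(3), B(2), C(1)

Assume the collection contains 10,000 documents and the document frequencies of these terms are:

A(50), B(1300), C(250)

Then, taking log as the natural logarithm (the values below match ln rather than log2; changing the base only rescales every weight by the same constant):

A: tf = 3/3; idf = ln(10000/50) = 5.3; tf-idf = 5.3
B: tf = 2/3; idf = ln(10000/1300) = 2.0; tf-idf = 1.4 (1.3 if the rounded idf of 2.0 is used)
C: tf = 1/3; idf = ln(10000/250) = 3.7; tf-idf = 1.2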
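The arithmetic of this example can be reproduced in a few lines. This is a minimal sketch, assuming tf is normalized by the largest term frequency in the document and that the logarithm is natural (which is what the printed idf values correspond to); the function name `tf_idf` and its `log` parameter are illustrative choices, not something the slides define.

```python
import math


def tf_idf(term_freqs, doc_freqs, n_docs, log=math.log):
    """Return {term: tf * idf} for one document.

    tf is the raw frequency divided by the maximum frequency in the document;
    idf is log(N / df). Pass log=math.log2 to use the base-2 definition.
    """
    max_f = max(term_freqs.values())
    weights = {}
    for term, f in term_freqs.items():
        tf = f / max_f
        idf = log(n_docs / doc_freqs[term])
        weights[term] = tf * idf
    return weights


weights = tf_idf(
    term_freqs={"A": 3, "B": 2, "C": 1},
    doc_freqs={"A": 50, "B": 1300, "C": 250},
    n_docs=10_000,
)
for term, w in weights.items():
    print(term, round(w, 2))
```

Switching to `math.log2` changes every weight by the same factor, so the ordering A > B > C is unaffected.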


Query Vector

Query vector is typically treated as a document and also tf-idf weighted.

Alternative is for the user to supply weights for the given query terms.


Similarity Measure

A similarity measure is a function that computes the degree of similarity between two vectors.

Using a similarity measure between the query and each document:
• It is possible to rank the retrieved documents in the order of presumed relevance.
• It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.


Similarity Measure - Inner Product

Similarity between vectors for the document $d_j$ and query $q$ can be computed as the vector inner product:

$$\mathrm{sim}(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij}\, w_{iq}$$

where $w_{ij}$ is the weight of term i in document j and $w_{iq}$ is the weight of term i in the query.

For binary vectors, the inner product is the number of matched query terms in the document (size of intersection).

For weighted term vectors, it is the sum of the products of the weights of the matched terms.
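As a quick check of the definition, the sketch below computes the inner product for the binary and weighted vectors used in this deck's examples. Plain Python lists stand in for term vectors; the function name `inner_product` is an illustrative choice.

```python
def inner_product(d, q):
    """Sum of w_ij * w_iq over all t index terms."""
    if len(d) != len(q):
        raise ValueError("document and query vectors must have the same length")
    return sum(w_d * w_q for w_d, w_q in zip(d, q))


# Binary vectors over a 7-term vocabulary.
binary_d = [1, 1, 1, 0, 1, 1, 0]
binary_q = [1, 0, 1, 0, 0, 1, 1]

# Weighted vectors over the terms T1, T2, T3.
d1 = [2, 3, 5]
d2 = [3, 7, 1]
q = [0, 0, 2]

print(inner_product(binary_d, binary_q))
print(inner_product(d1, q), inner_product(d2, q))
```

For the binary case the result counts the terms that the document and query share.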


Cosine Similarity Measure

Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths.

$$\mathrm{CosSim}(d_j, q) = \frac{d_j \cdot q}{|d_j|\,|q|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^2 \cdot \sum_{i=1}^{t} w_{iq}^2}}$$

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2 and Q drawn as vectors in the space with axes t1, t2 and t3, with the angles θ1 and θ2 between Q and the two documents.]

D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product.
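The two cosine values can be verified with a short sketch. It assumes plain lists as vectors and returns 0.0 for an all-zero vector, which is a convention chosen here for illustration rather than something the slides specify.

```python
import math


def cos_sim(d, q):
    """Inner product of d and q divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d, q))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q))
    # Assumption: treat similarity with a zero vector as 0 instead of failing.
    return dot / norm if norm else 0.0


d1 = [2, 3, 5]
d2 = [3, 7, 1]
q = [0, 0, 2]

print(round(cos_sim(d1, q), 2))
print(round(cos_sim(d2, q), 2))
```

Because of the normalization, scaling a document vector by a constant leaves its cosine score unchanged, unlike the raw inner product.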


Outline

Probabilistic Information Retrieval

System Evaluation

Web Mining


Probabilistic Information Retrieval


The Basics

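Bayesian probability formulas:

$$p(a \mid b)\, p(b) = p(a \cap b) = p(b \mid a)\, p(a)$$

$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$$

$$p(\bar{a} \mid b) = \frac{p(b \mid \bar{a})\, p(\bar{a})}{p(b)}$$

Odds:

$$O(y) = \frac{p(y)}{p(\bar{y})} = \frac{p(y)}{1 - p(y)}$$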


The Basics

• Document Relevance:

$$p(R \mid x) = \frac{p(x \mid R)\, p(R)}{p(x)}$$

$$p(NR \mid x) = \frac{p(x \mid NR)\, p(NR)}{p(x)}$$

• Note:

$$p(R \mid x) + p(NR \mid x) = 1$$


Binary Independence Model

“Binary” = Boolean: documents are represented as binary vectors of terms, $x = (x_1, \dots, x_n)$, with $x_i = 1$ iff term i is present in document x.

“Independence”: terms occur in documents independently.

Different documents can be modeled as the same vector.


Binary Independence Model

Queries: binary vectors of terms.

Given query q, for each document d we need to compute p(R|q,d). Replace this with computing p(R|q,x), where x is the vector representing d.

We are interested only in ranking, so we will use odds:

$$O(R \mid q, x) = \frac{p(R \mid q, x)}{p(NR \mid q, x)} = \frac{p(R \mid q)}{p(NR \mid q)} \cdot \frac{p(x \mid R, q)}{p(x \mid NR, q)}$$


Binary Independence Model

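• Using Independence Assumption:

$$\frac{p(x \mid R, q)}{p(x \mid NR, q)} = \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$

$$O(R \mid q, x) = \frac{p(R \mid q, x)}{p(NR \mid q, x)} = \frac{p(R \mid q)}{p(NR \mid q)} \cdot \frac{p(x \mid R, q)}{p(x \mid NR, q)}$$

The first factor is constant for each query; the second needs estimation.

• So:

$$O(R \mid q, x) = O(R \mid q) \cdot \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$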


Binary Independence Model

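$$O(R \mid q, x) = O(R \mid q) \cdot \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$

• Since $x_i$ is either 0 or 1:

$$O(R \mid q, x) = O(R \mid q) \cdot \prod_{x_i = 1} \frac{p(x_i = 1 \mid R, q)}{p(x_i = 1 \mid NR, q)} \cdot \prod_{x_i = 0} \frac{p(x_i = 0 \mid R, q)}{p(x_i = 0 \mid NR, q)}$$

• Let $p_i = p(x_i = 1 \mid R, q)$; $r_i = p(x_i = 1 \mid NR, q)$.

Then...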


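Binary Independence Model

Restricting the products to query terms ($q_i = 1$), i.e. assuming $p_i = r_i$ for terms that do not occur in the query:

$$O(R \mid q, x) = O(R \mid q) \cdot \underbrace{\prod_{x_i = q_i = 1} \frac{p_i}{r_i}}_{\text{all matching terms}} \cdot \underbrace{\prod_{x_i = 0,\; q_i = 1} \frac{1 - p_i}{1 - r_i}}_{\text{non-matching query terms}}$$

$$O(R \mid q, x) = O(R \mid q) \cdot \underbrace{\prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)}}_{\text{all matching terms}} \cdot \underbrace{\prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}}_{\text{all query terms}}$$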


Binary Independence Model

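$$O(R \mid q, x) = O(R \mid q) \cdot \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} \cdot \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}$$

• The last product (over all query terms) is constant for each query, as is $O(R \mid q)$.
• The product over matching terms is the only quantity to be estimated for rankings.

• Retrieval Status Value:

$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$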


Binary Independence Model

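• All boils down to computing RSV.

$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$

$$RSV = \sum_{x_i = q_i = 1} c_i; \qquad c_i = \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$

So, how do we compute the $c_i$’s from our data?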


Binary Independence Model

• Estimating RSV coefficients.
• For each term i look at the following table:

Documents    Relevant    Non-Relevant     Total
Xi = 1       r           n - r            n
Xi = 0       R - r       N - n - R + r    N - n
Total        R           N - R            N

$$p_i \approx \frac{r}{R} \qquad r_i \approx \frac{n - r}{N - R}$$

$$c_i \approx K(N, n, R, r) = \log \frac{r / (R - r)}{(n - r) / (N - n - R + r)}$$

• Estimates: Add 0.5 to every expression
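A small sketch of the coefficient estimate and the resulting RSV follows. It reads "add 0.5 to every expression" as adding 0.5 to each of the four counts inside K(N, n, R, r), which is one common interpretation and an assumption here; the function names and the counts in the usage lines are hypothetical.

```python
import math


def bim_term_weight(N, n, R, r, k=0.5):
    """Estimate c_i = log[(r / (R - r)) / ((n - r) / (N - n - R + r))].

    N: documents in the collection, n: documents containing the term,
    R: relevant documents, r: relevant documents containing the term.
    k is added to each count to avoid zeros (assumed smoothing; k=0 disables it).
    """
    relevant_odds = (r + k) / (R - r + k)
    non_relevant_odds = (n - r + k) / (N - n - R + r + k)
    return math.log(relevant_odds / non_relevant_odds)


def rsv(query_terms, doc_terms, weights):
    """Sum c_i over terms present in both the query and the document."""
    return sum(weights[t] for t in query_terms & doc_terms)


# Hypothetical counts for two query terms in a 1000-document collection.
weights = {
    "retrieval": bim_term_weight(N=1000, n=100, R=10, r=8),
    "the": bim_term_weight(N=1000, n=900, R=10, r=9),
}
score = rsv({"retrieval", "the"}, {"retrieval", "model"}, weights)
print(weights, score)
```

A term concentrated in the relevant documents gets a large positive weight, while a term spread evenly across the collection gets a weight near zero.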


System Evaluation


Why System Evaluation?

• There are many retrieval models/algorithms/systems; which one is the best?
• What is the best component for:
    • Ranking function (dot-product, cosine, …)
    • Term selection (stemming, …)
    • Term weighting (TF, TF-IDF, …)
• How far down the ranked list will a user need to look to find some/all relevant documents?


What Can We Measure?

Algorithm (Efficiency):
• Speed of algorithm
• Update potential of indexing scheme
• Size of storage required
• Potential for distribution & parallelism

User Experience (Effectiveness):
• How many of all relevant docs were found
• How many were missed
• How many errors in selection
• How many need to be scanned before getting good ones


Measures Based on Relevance

[Figure: the document set partitioned into four regions (labelled RR, NN, NR and RN on the slide): retrieved and relevant; retrieved, not relevant; not retrieved, relevant; not retrieved, not relevant.]


Precision and Recall

recall = (number of relevant documents retrieved) / (total number of relevant documents)

precision = (number of relevant documents retrieved) / (total number of documents retrieved)

[Figure: within the entire document collection, the set of relevant documents and the set of retrieved documents overlap; the overlap is the relevant and retrieved set.]

              retrieved                 not retrieved
relevant      retrieved & relevant      not retrieved but relevant
irrelevant    retrieved & irrelevant    not retrieved & irrelevant

Presentation notes: Precision: The ability to retrieve top-ranked documents that are mostly relevant. Recall: The ability of the search to find all of the relevant items in the corpus.

Trade-off between Recall and Precision

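[Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1). The ideal is the top-right corner. At high precision and low recall, the system returns relevant documents but misses many useful ones too. At high recall and low precision, it returns most relevant documents but includes lots of junk.]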


Computing Recall/Precision Points: An Example

n     doc #   relevant
1     588     x
2     589     x
3     576
4     590     x
5     986
6     592     x
7     984
8     988
9     578
10    985
11    103
12    591
13    772     x
14    990

Let total # of relevant docs = 6. Check each new recall point:

n = 1:   R=1/6=0.167; P=1/1=1
n = 2:   R=2/6=0.333; P=2/2=1
n = 4:   R=3/6=0.5;   P=3/4=0.75
n = 6:   R=4/6=0.667; P=4/6=0.667
n = 13:  R=5/6=0.833; P=5/13=0.38

One relevant document is missing from the ranking, so recall never reaches 100%.
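The recall points can be recomputed mechanically from the ranked list. The sketch below encodes the 14 results as (doc #, relevant?) pairs; the function name `recall_precision_points` is an illustrative choice.

```python
# Ranked results from the example: (doc #, judged relevant?).
ranking = [
    (588, True), (589, True), (576, False), (590, True), (986, False),
    (592, True), (984, False), (988, False), (578, False), (985, False),
    (103, False), (591, False), (772, True), (990, False),
]
total_relevant = 6  # one relevant document never appears in the ranking


def recall_precision_points(relevance_flags, total_relevant):
    """Return (rank, recall, precision) at each rank where a relevant doc appears."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            points.append((rank, hits / total_relevant, hits / rank))
    return points


points = recall_precision_points([rel for _, rel in ranking], total_relevant)
for rank, recall, precision in points:
    print(f"n={rank}: R={recall:.3f} P={precision:.3f}")
```

Only five points are produced because the sixth relevant document is not retrieved.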


R-Precision

Precision at the R-th position in the ranking of results for a query that has R relevant documents.

n     doc #   relevant
1     588     x
2     589     x
3     576
4     590     x
5     986
6     592     x
7     984
8     988
9     578
10    985
11    103
12    591
13    772     x
14    990

R = # of relevant docs = 6

R-Precision = 4/6 = 0.67
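R-precision is a one-line computation once the relevance judgments are in rank order. A minimal sketch, with the function name `r_precision` chosen for illustration:

```python
def r_precision(relevance_flags, total_relevant):
    """Fraction of the top-R ranked documents that are relevant, R = total_relevant."""
    top_r = relevance_flags[:total_relevant]
    return sum(top_r) / total_relevant


# Relevance judgments for the 14 ranked documents in the example.
flags = [True, True, False, True, False, True, False,
         False, False, False, False, False, True, False]

print(round(r_precision(flags, total_relevant=6), 2))
```

Dividing by R rather than by the number of documents actually returned keeps the measure well defined when fewer than R documents are retrieved.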


Compare Two or More Systems

The curve closest to the upper right-hand corner of the graph indicates the best performance.

[Figure: precision (y-axis, 0 to 1) against recall (x-axis, 0.1 to 1) for two systems, NoStem and Stem.]


An Example for Precision-Recall Curve


Famous Examples of System Evaluation

• The Cranfield Experiments, Cyril W. Cleverdon, Cranfield College of Aeronautics, 1957-1968 (hundreds of docs)
• Okapi System, Jimmy Huang and Stephen Robertson, York University & Microsoft
• SMART System, Gerald Salton, Cornell University
• TREC, Donna Harman, National Institute of Standards and Technology (NIST), 1992 - (millions of docs, 100k to 7.5M per set, training Q’s and test Q’s, 150 each)


Evaluating Retrieval Systems: Text REtrieval Conference

“TREC”
  An annual bake-off for text retrieval systems
  Sponsored by
  Roughly 2.5 gigabytes of text (428 gigabytes of Web data)
  50 “topics” (queries)
  Return top 1000 documents for each topic
  Results judged by retired CIA and NSA analysts
  No-gloat rule
  Numerous tracks, including text routing, very large corpus, cross-language retrieval


Web Mining


Contents

What is Web mining?
What can Web mining do?
What is the challenge for Web mining?
Web mining categories
  Web usage mining
  Web content mining
  Web structure mining
Applications of Web mining
Examples


What is Web Mining?

Web Mining is the use of data mining techniques to automatically discover and extract information from Web documents.


What is Web Mining?

With the development of computer technology, people have begun to “abuse” data!

More and more data are available on the Web. However, the fact is: some interesting things are buried.

So we need …


What is Web Mining?

Our objective is to find valuable knowledge hidden among the data …


Web Mining Techniques - Navigation Patterns

[Figure: Web page hierarchy of a Web site, with pages A, B, C, D and E]


Web Mining Techniques - Navigation Patterns

[Figure: Web page hierarchy with pages A, B, C, D and E]

A link could be provided from C to E


What can Data Mining do? An Example


What can Web Mining do?

[Figure: plot of sales by month]


What is the challenge for Web Mining?

The Web is a huge collection of documents

The Web is very dynamic

Challenge: Develop new Web mining algorithms and adapt traditional data mining algorithms

Presentation Notes
The Web is a huge collection of documents except for
  Hyperlink information
  Access and usage information
The Web is very dynamic
  New pages are constantly being generated
Challenge: Develop new Web mining algorithms and adapt traditional data mining algorithms to
  Exploit hyperlink and access patterns
  Be incremental

Categories of Web Mining

Web Usage Mining
Web Content Mining
  Text
  Multimedia
Web Structure Mining

Reference
R. Kosala and H. Blockeel, “Web Mining Research: A Survey”, SIGKDD Explorations, vol. 2, issue 1, 2000.
J. Srivastava et al., “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data”, SIGKDD Explorations, vol. 1, issue 2, 2000.


Web Usage Mining Process

[Figure: Raw Logs -> Preprocessing -> User Session File -> Mining Patterns -> Rules & Patterns -> Pattern Analysis -> Interesting rules & patterns, with Background Knowledge as an additional input]
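The preprocessing step, which turns raw logs into a user session file, can be sketched as follows. Everything here is an assumption made for illustration: the `(visitor, timestamp, url)` log layout, the `build_sessions` name, and the 30-minute inactivity timeout used to split sessions. The slides do not specify a log format or a session rule.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed cut-off, not taken from the slides

# Hypothetical raw log entries: (visitor id, timestamp in seconds, requested URL)
raw_log = [
    ("10.0.0.1", 0, "/A"),
    ("10.0.0.2", 10, "/A"),
    ("10.0.0.1", 60, "/B"),
    ("10.0.0.1", 120, "/C"),
    ("10.0.0.2", 200, "/D"),
    ("10.0.0.1", 4000, "/A"),
    ("10.0.0.1", 4100, "/E"),
]


def build_sessions(log, timeout=SESSION_TIMEOUT):
    """Group requests per visitor and split on gaps longer than `timeout`."""
    by_user = defaultdict(list)
    for user, ts, url in sorted(log, key=lambda entry: entry[1]):
        by_user[user].append((ts, url))

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0][1]]
        last_ts = hits[0][0]
        for ts, url in hits[1:]:
            if ts - last_ts > timeout:
                # Long pause: close the current session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(url)
            last_ts = ts
        sessions.append((user, current))
    return sessions


sessions = build_sessions(raw_log)
for user, pages in sessions:
    print(user, pages)
```

Each resulting session is an ordered list of page references, which is the input that pattern mining steps such as association rules or sequential patterns typically work on.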


Web Usage Mining

Discover information about how Web pages are being accessed:
  By whom
  For how long
  When
  In what order pages are referenced

Can be used to determine a better way to organize the Web site


Web Usage Mining - Pattern Discovery

Applies Web mining techniques to generate rules and patterns

Web Mining Techniques
  Statistical Analysis
  Association Rule Generation on Web
  Clustering
  Classification
  Sequential Pattern

Presentation Notes
The knowledge discovery phase uses existing data mining techniques to generate rules and patterns. Included in this phase is the generation of general usage statistics, such as number of “hits” per page, page most frequently accessed, most common starting page, and average time spent on each page. Association rule and sequential pattern generation are the only data mining algorithms currently implemented in the WEBMINER system, but the open architecture can easily accommodate any data mining or path analysis algorithm. The discovered information is then fed into various pattern analysis tools. The site filter is used to identify interesting rules and patterns by comparing the discovered knowledge with the Web site designer’s view of how the site should be used, as discussed in the next section. As shown in Fig. 2, the site filter can be applied to the data mining algorithms in order to reduce the computation time, or the discovered rules and patterns.
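Association rule generation over user sessions can be illustrated with a small sketch. This is not the WEBMINER implementation: the session data, the `association_rules` helper, and the support and confidence thresholds are all assumptions chosen for the example, and only rules between single pages are considered.

```python
from collections import Counter
from itertools import permutations

# Hypothetical user sessions, each reduced to the set of pages visited.
sessions = [
    {"A", "B", "C"},
    {"A", "C", "E"},
    {"A", "B", "D"},
    {"C", "E"},
]


def association_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return rules (x, y, support, confidence) meaning 'visitors of x also visit y'."""
    n = len(sessions)
    page_count = Counter(page for s in sessions for page in s)
    pair_count = Counter()
    for s in sessions:
        for x, y in permutations(sorted(s), 2):
            pair_count[(x, y)] += 1

    rules = []
    for (x, y), count in pair_count.items():
        support = count / n                # share of sessions containing both pages
        confidence = count / page_count[x]  # share of sessions with x that also have y
        if support >= min_support and confidence >= min_confidence:
            rules.append((x, y, support, confidence))
    return sorted(rules)


rules = association_rules(sessions)
for x, y, support, confidence in rules:
    print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data every session that contains E also contains C, which is the kind of pattern that could motivate a direct link between two pages.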

Web Usage Mining - Statistical Analysis

Generate simple statistical reports:
  A report of hits and bytes transferred
  A list of top requested URLs
  A list of top referrers
Learn:
  Who is visiting your site
  How much time visitors spend on each page
  The most common starting page

Presentation Notes
Usage information can be used to restructure a Web site in order to better serve the needs of users of a site.
Generate simple statistical reports:
  A summary report of hits and bytes transferred
  A list of top requested URLs
  A list of top referrers
  A list of most common browsers used
  Hits per hour/day/week/month reports
  Hits per domain reports
Learn:
  Who is visiting your site
  The path visitors take through your pages
  How much time visitors spend on each page
  The most common starting page
  Where visitors are leaving your site
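A few of the reports listed above can be produced with simple counting. This is a minimal sketch under stated assumptions: the log tuple layout `(visitor, url, bytes_sent, referrer)`, the sample entries, and the `usage_report` helper are all hypothetical, and entries are assumed to be in time order.

```python
from collections import Counter

# Hypothetical, time-ordered log entries: (visitor, url, bytes_sent, referrer)
log = [
    ("u1", "/index", 5000, "search.example"),
    ("u1", "/products", 12000, "/index"),
    ("u2", "/index", 5000, "news.example"),
    ("u2", "/contact", 2000, "/index"),
    ("u3", "/products", 12000, "search.example"),
]


def usage_report(entries, top_n=3):
    """Summarise hits, bytes, top URLs, top referrers and the commonest start page."""
    first_page = {}
    for visitor, url, _bytes, _referrer in entries:
        # Keep only the first URL seen for each visitor.
        first_page.setdefault(visitor, url)

    return {
        "hits": len(entries),
        "bytes": sum(entry[2] for entry in entries),
        "top_urls": Counter(entry[1] for entry in entries).most_common(top_n),
        "top_referrers": Counter(entry[3] for entry in entries).most_common(top_n),
        "top_start_page": Counter(first_page.values()).most_common(1)[0],
    }


report = usage_report(log)
for name, value in report.items():
    print(f"{name}: {value}")
```

Reports such as time spent per page or exit pages would need timestamps and session boundaries, which this sketch leaves out.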

Web Usage Mining - Statistical Analysis

Statistical Analysis is useful for

Improving the system performance

Enhancing the security of the system

Facilitating the site modification task

Providing support for marketing decisions