
Information Retrieval


Information Retrieval constructs an index for a given corpus and responds to queries by retrieving all the relevant documents and as few non-relevant documents as possible.

• Index a collection of documents (access efficiency)
• Given a user’s query, rank documents by importance (accuracy)

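Typical IR Problem

[Figure: the typical IR problem. A document collection is turned into a document representation and a query into a query representation; matching the two produces the answer to the query.]

Questions raised at each stage:

• How exact is the representation of the document?
• How exact is the representation of the query?
• How well is the query matched to the data?
• How relevant is the result to the query?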

History of IR Systems

• Role of documentalists
• Role of database researchers
• Role of researchers in information retrieval systems
• Role of researchers in information retrieval systems and knowledge management systems

Sources of Information on IR

Top Tier Journals:

• Journal of the American Society for Information Science and Technology (JASIST)
• Information Processing & Management (IPM)
• Information Retrieval (IR)
• Information Sciences (IS)
• Journal of Documentation (JDoc)
• IEEE Transactions on Knowledge and Data Engineering (TKDE)
• ACM Transactions on Information Systems (TOIS)

Top Tier Conferences:

• ACM SIGIR (Special Interest Group on Information Retrieval)
• ACM CIKM (International Conference on Information and Knowledge Management)
• AAAI Conference on Artificial Intelligence
• Annual Meeting of the Association for Computational Linguistics
• European Conference on Information Retrieval (ECIR)
• TREC (Text REtrieval Conference)
• ACM SIGKDD (Special Interest Group on Knowledge Discovery, Data Mining, Large-scale Data Analytics and Big Data)

Typical IR Task

Given:

• A corpus of textual natural-language documents.
• A user query in the form of a textual string.

Find:

• A ranked set of documents that are relevant to the query.

Traditional IR System

[Figure: traditional IR system. A query string and a document corpus are the inputs to the IR system, which outputs ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]

Web Search System

[Figure: Web search system. A Web crawler builds the document corpus; the IR system takes a query string and the corpus and outputs ranked pages: 1. Page1, 2. Page2, 3. Page3, ...]

Retrieval Models

A retrieval model specifies the details of:

• Document representation
• Query representation
• Retrieval function

Information Retrieval Models

Three ‘classic’ models:

• Boolean Model
• Vector Space Model
• Probabilistic Model

Additional models:

• Extended Boolean
• Fuzzy matching
• Cluster-based retrieval
• Language models

“Classic” Retrieval Models

• Boolean (‘set theoretic’): documents and queries are sets of index terms.
• Vector (‘algebraic’): documents and queries are vectors in n-dimensional space.
• Probabilistic: based on probability theory.

Documents

A document is a stored data record in any form.

Examples:

• Book, journal article, report, dissertation, encyclopedia
• Part of a text, e.g. paragraph, encyclopedia article
• Also: Web page, image, music, sound, video, video clip

Are Queries Documents?

• Similarities: text based, similar terminology.
• Differences: queries are usually shorter, linguistically less well formed, and differ in the statistics of their text.
• It is simpler to think of queries as documents.
• Retrieval can then be viewed as a “matching” process.

Sample TREC Topic (Query)

The topic is written in SGML markup. The title is a short phrase, the description is a sentence (or fragment), and the narrative is a paragraph.

<top>
<num> Number: 327
<title> Topic: Windows Longhorn
<desc> Description: Microsoft is currently developing its newest incarnation of the Windows operating system: Longhorn.
<narr> Narrative: As the competition against Microsoft increases, the company is also seeking out new battlefields with its new version of Windows, such as improved file-searching technology. Including this new searching technology, what improvements will be added to Windows, and how is the competition responding?
<related-text> Relevant: Longhorn will include a database-like storage engine called Windows Future Storage (WinFS), which is based on technology from SQL Server 2003 (code-named Yukon). This storage engine builds on NTFS and will abstract physical file locations from the user and allow for the sorts of complex data searching that are impossible today. For example, today, your email messages, contacts, Word documents, and music files are all completely separate. That won't be the case in Longhorn. WinFS requires NTFS.
</top>

Retrieval Matching Process

Binary:

D = 1, 1, 1, 0, 1, 1, 0
Q = 1, 0, 1, 0, 0, 1, 1
sim(D, Q) = 3

• Size of vector = size of vocabulary = 7
• 0 means the corresponding term is not found in the document or query

Weighted:

D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3

sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2

Document Processing

Document Processing in IR Systems

• Assign identifier, store document
• Identify “words”
• Positional information
• Word stemming
• Term weighting

Relevance Feedback in IR

After initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents.

Use this feedback information to reformulate the query.

Produce new results based on the reformulated query.

This allows a more interactive, multi-pass process.

Relevance Feedback Architecture

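[Figure: relevance feedback architecture. The IR system takes the query string and the document corpus and produces ranked documents (1. Doc1, 2. Doc2, 3. Doc3, ...). The user marks them (Doc1 ⇓, Doc2 ⇑, Doc3 ⇓). This feedback drives query reformulation, and the revised query is run again to produce re-ranked documents (1. Doc2, 2. Doc4, 3. Doc5, ...).]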

Boolean Information Retrieval

Boolean Model

Based on set theory and Boolean algebra

Queries are specified as Boolean expressions

Widely used in commercial IR systems (Dialog, Lexis/Nexis)

Based on an inverted index file

Usually supplemented with proximity operators

Boolean Model

Output: a document is either relevant or not. There are no partial matches and no ranking; an exact match is required.

A document is represented as a set of keywords.

Queries are Boolean expressions of keywords, connected by logical AND, OR, and NOT, including the use of brackets to indicate scope. Example: [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton

Logical AND (∧) (Set Intersection)

A ∧ B is the set of things in common, i.e., in both sets A and B.

Example: with A = Aged and B = Blind, A ∧ B is the set of aged, blind people.

Logical OR (∨) (Set Union)

A ∨ B is the set of things in either A, B, or both.

Example: with A = Aged and B = Blind, A ∨ B is the set of people who are aged, blind, or both.

Logical NOT (¬) (Set Complement)

¬ B is the set of things outside the set B.

Example: with B = Blind, ¬ B is the set of people who aren’t blind.

Example Combination

A ∧ (¬ B)

Example: with A = Aged and B = Blind, A ∧ (¬ B) is the set of old people who aren’t blind.

More Examples

D1 = “computer information retrieval”
D2 = “computer retrieval”
D3 = “information”
D4 = “computer information”

Q1 = “information ∧ retrieval” retrieves D1
Q2 = “information ∧ ¬ computer” retrieves D3
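The example above can be reproduced with a small inverted index and Python set operations. This is only a minimal sketch: the function name `build_inverted_index` and the whitespace tokenization are assumptions made for illustration, not part of the original slides.

```python
from collections import defaultdict

# The four example documents from the text.
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():  # naive whitespace tokenization (assumption)
            index[term].add(doc_id)
    return index

index = build_inverted_index(docs)
all_docs = set(docs)

# Q1 = information AND retrieval: set intersection.
q1 = index["information"] & index["retrieval"]

# Q2 = information AND NOT computer: intersect with the complement.
q2 = index["information"] & (all_docs - index["computer"])

print(sorted(q1))  # ['D1']
print(sorted(q2))  # ['D3']
```

AND maps to intersection, OR to union (`|`), and NOT to a set difference against the full collection, which mirrors the set-theoretic definitions given earlier.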

Boolean Retrieval Model

Popular retrieval model because:

• Easy to understand for simple queries.
• Clean formalism.
• Reasonably efficient implementations are possible for normal queries.

Drawbacks of the Boolean Model

• Very rigid: AND means all; OR means any.
• Difficult to express complex user requests.
• Difficult to control the number of documents retrieved: all matched documents will be returned.
• Difficult to rank output: all matched documents logically satisfy the query.
• Difficult to perform relevance feedback: if a document is identified by the user as relevant or irrelevant, how should the query be modified?

Retrieval based on binary decision criteria with no notion of partial matching

No ranking of the documents is provided (absence of a grading scale)

Information need has to be translated into a Boolean expression which most users find awkward

The Boolean queries formulated by the users are most often too simplistic

As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query

Vector Space Information Retrieval

Vector Space Model

• Based on the idea of an n-dimensional document space
• The query is also located in the document space
• Documents are ranked in order of their “closeness” to the query
• Many possible matching functions

Issues for Vector Space Model

• How to determine important words in a document?
    • Word sense?
    • Word n-grams (and phrases, idioms, …) as terms
• How to determine the degree of importance of a term within a document and within the entire collection?
• How to determine the degree of similarity between a document and the query?
• In the case of the web, what is a collection and what are the effects of links, formatting information, etc.?

Vector-Space Model

• Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
• These “orthogonal” terms form a vector space. Dimension = t = |vocabulary|
• Each term, i, in a document or query, j, is given a real-valued weight, wij.
• Both documents and queries are expressed as t-dimensional vectors: dj = (w1j, w2j, …, wtj)

Graphic Representation

Example:

D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q drawn as vectors in the three-dimensional space spanned by the term axes T1, T2, and T3.]

• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?

Inner Product -- Examples

Binary:

D = 1, 1, 1, 0, 1, 1, 0
Q = 1, 0, 1, 0, 0, 1, 1
sim(D, Q) = 3

• Size of vector = size of vocabulary = 7
• 0 means the corresponding term is not found in the document or query

Weighted:

D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3

sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2

Document Collection

• A collection of n documents can be represented in the vector space model by a term-document matrix.
• An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or it simply doesn’t exist in the document.

        T1     T2     …     Tt
D1      w11    w21    …     wt1
D2      w12    w22    …     wt2
:       :      :            :
Dn      w1n    w2n    …     wtn

Term Weights: Term Frequency

More frequent terms in a document are more important, i.e. more indicative of the topic.

fij = frequency of term i in document j

We may want to normalize term frequency (tf) by the frequency of the most common term in the document:

tfij = fij / maxi{fij}

Term Weights: Inverse Document Frequency

Terms that appear in many different documents are less indicative of overall topic.

dfi = document frequency of term i = number of documents containing term i

idfi = inverse document frequency of term i = log2 (N / dfi), where N is the total number of documents

• An indication of a term’s discrimination power.
• Log is used to dampen the effect relative to tf.

TF-IDF Weighting

A typical combined term importance indicator is tf-idf weighting:

wij = tfij · idfi = tfij · log2 (N / dfi)

• A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
• Many other ways of determining term weights have been proposed.
• Experimentally, tf-idf has been found to work well.

Computing TF-IDF -- An Example

Given a document containing terms with given frequencies:

A(3), B(2), C(1)

Assume the collection contains 10,000 documents and the document frequencies of these terms are:

A(50), B(1300), C(250)

Then (the idf values below use the natural logarithm; changing the base only rescales all weights by a constant factor):

A: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
B: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
C: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
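The worked example can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name `tf_idf` and its parameters are illustrative, tf is normalized by the largest term frequency in the document, and the logarithm is the natural log by default (pass `math.log2` to follow the log2 definition instead). Because the code does not round the intermediate idf values, the last digit can differ slightly from the rounded figures in the example.

```python
import math

def tf_idf(term_freqs, doc_freqs, n_docs, log=math.log):
    """Return {term: tf-idf weight} for a single document.

    term_freqs: raw frequency of each term in the document
    doc_freqs:  number of documents in the collection containing each term
    n_docs:     total number of documents in the collection
    log:        logarithm used for idf (natural log by default)
    """
    max_freq = max(term_freqs.values())
    weights = {}
    for term, freq in term_freqs.items():
        tf = freq / max_freq                  # normalized term frequency
        idf = log(n_docs / doc_freqs[term])   # inverse document frequency
        weights[term] = tf * idf
    return weights

weights = tf_idf(
    {"A": 3, "B": 2, "C": 1},
    {"A": 50, "B": 1300, "C": 250},
    10_000,
)
for term, weight in weights.items():
    print(term, round(weight, 2))
```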

Query Vector

Query vector is typically treated as a document and also tf-idf weighted.

Alternative is for the user to supply weights for the given query terms.

Similarity Measure

A similarity measure is a function that computes the degree of similarity between two vectors.

Using a similarity measure between the query and each document:

• It is possible to rank the retrieved documents in the order of presumed relevance.
• It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.

Similarity Measure - Inner Product

Similarity between the vectors for the document dj and the query q can be computed as the vector inner product:

$$\mathrm{sim}(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} \, w_{iq}$$

where wij is the weight of term i in document j and wiq is the weight of term i in the query.

• For binary vectors, the inner product is the number of matched query terms in the document (size of intersection).
• For weighted term vectors, it is the sum of the products of the weights of the matched terms.


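Cosine Similarity Measure

• Cosine similarity measures the cosine of the angle between two vectors.
• It is the inner product normalized by the vector lengths.

$$\mathrm{CosSim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij} \, w_{iq}}{\sqrt{\sum_{i=1}^{t} w_{ij}^{2} \cdot \sum_{i=1}^{t} w_{iq}^{2}}}$$

D1 = 2T1 + 3T2 + 5T3    CosSim(D1, Q) = 10 / √((4+9+25)(0+0+4)) = 0.81
D2 = 3T1 + 7T2 + 1T3    CosSim(D2, Q) = 2 / √((9+49+1)(0+0+4)) = 0.13
Q = 0T1 + 0T2 + 2T3

[Figure: D1, D2, and Q drawn as vectors over the term axes t1, t2, and t3; θ1 is the angle between D1 and Q, and θ2 is the angle between D2 and Q.]

D1 is 6 times better than D2 using cosine similarity but only 5 times better using inner product.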
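Both similarity measures are short to implement. The sketch below recomputes the example values for D1, D2, and Q; the function names are illustrative, and returning 0.0 for a zero-length vector is an assumption made here to avoid dividing by zero.

```python
import math

def inner_product(d, q):
    """Sum of the products of corresponding term weights."""
    return sum(wd * wq for wd, wq in zip(d, q))

def cosine_similarity(d, q):
    """Inner product normalized by the lengths of both vectors."""
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    if norm == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return inner_product(d, q) / norm

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q = [0, 0, 2]

print(inner_product(D1, Q), inner_product(D2, Q))    # 10 2
print(round(cosine_similarity(D1, Q), 2))            # 0.81
print(round(cosine_similarity(D2, Q), 2))            # 0.13
```

Normalizing by vector length keeps a long document with many large weights from outranking a shorter one purely because of its size.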

Outline

Probabilistic Information Retrieval

System Evaluation

Web Mining

Probabilistic Information Retrieval

The Basics

Bayesian probability formulas:

$$p(a, b) = p(a \cap b) = p(a \mid b)\, p(b) = p(b \mid a)\, p(a)$$

$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$$

Odds:

$$O(y) = \frac{p(y)}{p(\bar{y})} = \frac{p(y)}{1 - p(y)}$$

The Basics

• Document relevance:

$$p(R \mid x) = \frac{p(x \mid R)\, p(R)}{p(x)}$$

$$p(NR \mid x) = \frac{p(x \mid NR)\, p(NR)}{p(x)}$$

• Note:

$$p(R \mid x) + p(NR \mid x) = 1$$

Binary Independence Model

• “Binary” = Boolean: documents are represented as binary vectors of terms, $\vec{x} = (x_1, \ldots, x_n)$, where $x_i = 1$ iff term i is present in document x.
• “Independence”: terms occur in documents independently.
• Different documents can be modeled as the same vector.

Binary Independence Model

• Queries: binary vectors of terms.
• Given query q, for each document d we need to compute p(R|q,d).
• Replace this with computing p(R|q,x), where x is the vector representing d.
• We are interested only in ranking, so we use odds:

$$O(R \mid q, \vec{x}) = \frac{p(R \mid q, \vec{x})}{p(NR \mid q, \vec{x})} = \frac{p(R \mid q)}{p(NR \mid q)} \cdot \frac{p(\vec{x} \mid R, q)}{p(\vec{x} \mid NR, q)}$$

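Binary Independence Model

$$O(R \mid q, \vec{x}) = \frac{p(R \mid q, \vec{x})}{p(NR \mid q, \vec{x})} = \frac{p(R \mid q)}{p(NR \mid q)} \cdot \frac{p(\vec{x} \mid R, q)}{p(\vec{x} \mid NR, q)}$$

The first factor, $p(R \mid q) / p(NR \mid q) = O(R \mid q)$, is constant for each query. The second factor needs estimation.

• Using the independence assumption:

$$\frac{p(\vec{x} \mid R, q)}{p(\vec{x} \mid NR, q)} = \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$

• So:

$$O(R \mid q, d) = O(R \mid q) \cdot \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$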


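$$O(R \mid q, d) = O(R \mid q) \cdot \prod_{i=1}^{n} \frac{p(x_i \mid R, q)}{p(x_i \mid NR, q)}$$

• Since $x_i$ is either 0 or 1:

$$O(R \mid q, d) = O(R \mid q) \cdot \prod_{x_i = 1} \frac{p(x_i = 1 \mid R, q)}{p(x_i = 1 \mid NR, q)} \cdot \prod_{x_i = 0} \frac{p(x_i = 0 \mid R, q)}{p(x_i = 0 \mid NR, q)}$$

• Let $p_i = p(x_i = 1 \mid R, q)$ and $r_i = p(x_i = 1 \mid NR, q)$.

• Assume that $p_i = r_i$ for all terms not occurring in the query ($q_i = 0$), so that only query terms contribute to the products.

Then...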

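$$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \underbrace{\prod_{x_i = q_i = 1} \frac{p_i}{r_i}}_{\text{all matching terms}} \cdot \underbrace{\prod_{x_i = 0,\; q_i = 1} \frac{1 - p_i}{1 - r_i}}_{\text{non-matching query terms}}$$

$$= O(R \mid q) \cdot \underbrace{\prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)}}_{\text{all matching terms}} \cdot \underbrace{\prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}}_{\text{all query terms}}$$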

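$$O(R \mid q, \vec{x}) = O(R \mid q) \cdot \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} \cdot \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}$$

• $O(R \mid q)$ and the product over all query terms ($q_i = 1$) are constant for each query.
• The product over all matching terms ($x_i = q_i = 1$) is the only quantity to be estimated for rankings.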

• Retrieval Status Value:

$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$


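• It all boils down to computing RSV:

$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$

$$RSV = \sum_{x_i = q_i = 1} c_i; \qquad c_i = \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$

So, how do we compute the $c_i$’s from our data?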

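Binary Independence Model

• Estimating RSV coefficients.
• For each term i look at the following table:

Documents    Relevant    Non-Relevant    Total
xi = 1       r           n − r           n
xi = 0       R − r       N − n − R + r   N − n
Total        R           N − R           N

Here r (without a subscript) is a document count, not the probability $r_i$.

$$p_i \approx \frac{r}{R} \qquad r_i \approx \frac{n - r}{N - R}$$

$$c_i \approx K(N, n, R, r) = \log \frac{r / (R - r)}{(n - r) / (N - n - R + r)}$$

• Estimates: add 0.5 to every expression, which keeps the ratio finite when a count is zero.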
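The estimate of the coefficients and the resulting RSV can be sketched in Python as follows. The names `term_weight` and `rsv`, the `stats` dictionary layout, and the sample counts are hypothetical and chosen only for illustration; the 0.5 is added to each of the four counts in the table, which is one common reading of "add 0.5 to every expression".

```python
import math

def term_weight(N, n, R, r):
    """Estimate c_i from the contingency table, adding 0.5 to each count.

    N: total documents, n: documents containing the term,
    R: relevant documents, r: relevant documents containing the term.
    """
    relevant_odds = (r + 0.5) / (R - r + 0.5)
    non_relevant_odds = (n - r + 0.5) / (N - n - R + r + 0.5)
    return math.log(relevant_odds / non_relevant_odds)

def rsv(doc_terms, query_terms, stats, N, R):
    """Sum c_i over the terms present in both the document and the query.

    stats maps each query term to its (n, r) counts.
    """
    matching = set(doc_terms) & set(query_terms)
    return sum(term_weight(N, stats[t][0], R, stats[t][1]) for t in matching)

# Hypothetical counts: 1000 documents, 10 of them relevant.
N, R = 1000, 10
stats = {"retrieval": (50, 8), "web": (300, 3)}

score = rsv({"retrieval", "index"}, {"retrieval", "web"}, stats, N, R)
print(round(score, 3))
```

A term that appears in most relevant documents but in few others ("retrieval" in the sample counts) receives a much larger weight than a term spread evenly across the collection.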

System Evaluation

Why System Evaluation?

• There are many retrieval models, algorithms, and systems; which one is the best?
• What is the best component for:
    • Ranking function (dot-product, cosine, …)
    • Term selection (stemming, …)
    • Term weighting (TF, TF-IDF, …)
• How far down the ranked list will a user need to look to find some or all relevant documents?

What Can We Measure?

Algorithm (Efficiency):

• Speed of algorithm
• Update potential of indexing scheme
• Size of storage required
• Potential for distribution and parallelism

User Experience (Effectiveness):

• How many of all relevant docs were found
• How many were missed
• How many errors in selection
• How many need to be scanned before getting good ones

Measures Based on Relevance

The document set splits into four regions:

• RR: retrieved, relevant
• RN: retrieved, not relevant
• NR: not retrieved, relevant
• NN: not retrieved, not relevant

recall = (number of relevant documents retrieved) / (total number of relevant documents)

precision = (number of relevant documents retrieved) / (total number of documents retrieved)

Precision and Recall

[Figure: the entire document collection, with the set of relevant documents and the set of retrieved documents overlapping in the "relevant and retrieved" region.]

              retrieved                 not retrieved
relevant      retrieved & relevant      not retrieved but relevant
irrelevant    retrieved & irrelevant    not retrieved & irrelevant

• Precision: the ability to retrieve top-ranked documents that are mostly relevant.
• Recall: the ability of the search to find all of the relevant items in the corpus.

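Trade-off between Recall and Precision

[Figure: precision (y-axis, 0 to 1) plotted against recall (x-axis, 0 to 1). The ideal is high precision and high recall together. At high precision and low recall the system returns relevant documents but misses many useful ones too; at high recall and low precision it returns most relevant documents but includes lots of junk.]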

Computing Recall/Precision Points: An Example

n     doc #    relevant
1     588      x
2     589      x
3     576
4     590      x
5     986
6     592      x
7     984
8     988
9     578
10    985
11    103
12    591
13    772      x
14    990

Let total # of relevant docs = 6. Check each new recall point:

• n = 1: R = 1/6 = 0.167; P = 1/1 = 1
• n = 2: R = 2/6 = 0.333; P = 2/2 = 1
• n = 4: R = 3/6 = 0.5; P = 3/4 = 0.75
• n = 6: R = 4/6 = 0.667; P = 4/6 = 0.667
• n = 13: R = 5/6 = 0.833; P = 5/13 = 0.38

One relevant document is missing from the ranking, so the system never reaches 100% recall.

R-Precision

Precision at the R-th position in the ranking of results for a query that has R relevant documents.

For the same ranking as in the previous example (relevant documents at positions 1, 2, 4, 6, and 13):

R = # of relevant docs = 6

R-Precision = 4/6 = 0.67
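The recall/precision points and the R-precision for this ranking can be recomputed with a short script. It is a sketch only: the function names are illustrative, and the set `relevant` holds just the five relevant documents that appear in the ranking, since the sixth relevant document is never retrieved and its identifier is not given.

```python
ranking = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772}  # relevant docs that appear in the ranking
total_relevant = 6                    # one relevant doc is never retrieved

def recall_precision_points(ranking, relevant, total_relevant):
    """Return (rank, recall, precision) at each rank where a relevant doc appears."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((rank, hits / total_relevant, hits / rank))
    return points

def r_precision(ranking, relevant, total_relevant):
    """Precision within the top R results, where R is the number of relevant docs."""
    top = ranking[:total_relevant]
    return sum(doc in relevant for doc in top) / total_relevant

for rank, recall, precision in recall_precision_points(ranking, relevant, total_relevant):
    print(rank, round(recall, 3), round(precision, 3))

print(round(r_precision(ranking, relevant, total_relevant), 2))  # 0.67
```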

Compare Two or More Systems

The curve closest to the upper right-hand corner of the graph indicates the best performance.

[Figure: an example precision-recall curve comparing two systems, NoStem and Stem. x-axis: Recall (0.1 to 1); y-axis: Precision (0 to 1).]

Famous Examples of System Evaluation

• The Cranfield Experiments, Cyril W. Cleverdon, Cranfield College of Aeronautics, 1957 –1968 (hundreds of docs)

• Okapi System, Jimmy Huang and Stephen Robertson, York University & Microsoft

• SMART System, Gerald Salton, Cornell University

• TREC, Donna Harman, National Institute of Standards and Technology (NIST), 1992 - (millions of docs, 100k to 7.5M per set, training Q’s and test Q’s, 150 each)

Evaluating Retrieval Systems: Text REtrieval Conference

“TREC”

• An annual bake-off for text retrieval systems
• Sponsored by
• Roughly 2.5 gigabytes of text (428 gigabytes of Web data)
• 50 “topics” (queries)
• Return top 1000 documents for each topic
• Results judged by retired CIA and NSA analysts
• No-gloat rule
• Numerous tracks, including text routing, very large corpus, cross-language retrieval

Web Mining

Contents

• What is Web mining?
• What can Web mining do?
• What is the challenge for Web mining?
• Web mining categories
    • Web usage mining
    • Web content mining
    • Web structure mining
• Applications of Web mining
• Examples

What is Web Mining?

Web mining is the use of data mining techniques to automatically discover and extract information from Web documents.

With the development of computer technology, people have begun to “abuse” data!

More and more data are available on the Web. However, the fact is that some interesting things are buried.

So we need ...

Our objective is to find valuable knowledge hidden among the data ...

Web Mining Techniques - Navigation Patterns

[Figure: Web page hierarchy of a Web site, with pages A, B, C, D, and E.]

[Figure: the same hierarchy with a suggested shortcut. A link could be provided from C to E.]

What Can Data Mining Do? An Example

What Can Web Mining Do?

[Figure: a chart of sales by month.]

What is the Challenge for Web Mining?

• The Web is a huge collection of documents which, besides the documents themselves, carries:
    • Hyperlink information
    • Access and usage information
• The Web is very dynamic:
    • New pages are constantly being generated
• Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to:
    • Exploit hyperlink and access patterns
    • Be incremental

Categories of Web Mining

• Web Usage Mining
• Web Content Mining
    • Text
    • Multimedia
• Web Structure Mining

Reference

• R. Kosala and H. Blockeel, “Web Mining Research: A Survey”, SIGKDD Explorations, vol. 2, issue 1, 2000.
• J. Srivastava et al., “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data”, SIGKDD Explorations, vol. 2, issue 1, 1999.

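Web Usage Mining Process

[Figure: the Web usage mining process. Raw logs go through preprocessing to produce a user session file; mining patterns over the session file yields rules and patterns; pattern analysis then selects the interesting rules and patterns. Background knowledge feeds into the process.]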

Web Usage Mining

Discovers information about how the Web pages are being accessed:

• By whom
• For how long
• When
• In what order pages are referenced

Can be used to determine a better way to organize the Web site

Web Usage Mining - Pattern Discovery

Applies Web mining techniques to generate rules and patterns.

Web Mining Techniques:

• Statistical Analysis
• Association Rule Generation on Web
• Clustering
• Classification
• Sequential Pattern

The knowledge discovery phase uses existing data mining techniques to generate rules and patterns. Included in this phase is the generation of general usage statistics, such as number of “hits” per page, page most frequently accessed, most common starting page, and average time spent on each page. Association rule and sequential pattern generation are the only data mining algorithms currently implemented in the WEBMINER system, but the open architecture can easily accommodate any data mining or path analysis algorithm. The discovered information is then fed into various pattern analysis tools. The site filter is used to identify interesting rules and patterns by comparing the discovered knowledge with the Web site designer’s view of how the site should be used, as discussed in the next section. As shown in Fig. 2, the site filter can be applied to the data mining algorithms in order to reduce the computation time, or the discovered rules and patterns.

Web Usage Mining - Statistical Analysis

Usage information can be used to restructure a Web site in order to better serve the needs of users of a site.

Generate simple statistical reports:

• A summary report of hits and bytes transferred
• A list of top requested URLs
• A list of top referrers
• A list of most common browsers used
• Hits per hour/day/week/month reports
• Hits per domain reports

Learn:

• Who is visiting your site
• The path visitors take through your pages
• How much time visitors spend on each page
• The most common starting page
• Where visitors are leaving your site
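A few of these statistics can be sketched with Python's standard library. Everything about the input here is hypothetical: the `sessions` structure (one ordered list of page and seconds-spent pairs per visit) stands in for a preprocessed user session file, and the function name `usage_statistics` is illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical sessions: each is the ordered list of (page, seconds spent) for one visit.
sessions = [
    [("/home", 10), ("/products", 40), ("/cart", 20)],
    [("/home", 5), ("/about", 30)],
    [("/products", 60), ("/cart", 15)],
]

def usage_statistics(sessions):
    """Return hits per page, starting-page counts, and average time per page."""
    hits = Counter()
    starts = Counter()
    seconds = defaultdict(int)
    for session in sessions:
        if not session:
            continue  # skip empty visits
        starts[session[0][0]] += 1          # first page of the visit
        for page, spent in session:
            hits[page] += 1
            seconds[page] += spent
    avg_time = {page: seconds[page] / hits[page] for page in hits}
    return hits, starts, avg_time

hits, starts, avg_time = usage_statistics(sessions)
print(hits.most_common(3))    # top requested pages
print(starts.most_common(1))  # most common starting page
print(avg_time)               # average seconds spent per page
```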

Statistical analysis is useful for:

• Improving the system performance
• Enhancing the security of the system
• Facilitating the site modification task
• Providing support for marketing decisions