1
Chapter 2 Information Retrieval
Chapter 2 in the textbook
Sections: 2.1, 2.2 (2.2.1, 2.2.2), 2.3 (2.3.1, 2.3.2, 2.3.3), 2.4 (2.4.1, 2.4.2)
2
Modern Information Retrieval
• Document representation
  o Using keywords
  o Relative weight of keywords
• Query representation
  o Keywords
  o Relative importance of keywords
• Retrieval model
  o Similarity between document and query
  o Rank the documents
• Performance evaluation of the retrieval process
5
Sample Document
Data Mining has emerged as one of the most exciting and dynamic fields in computing science. The driving force for data mining is the presence of petabyte-scale online archives that potentially contain valuable bits of information hidden in them. Commercial enterprises have been quick to recognize the value of this concept; consequently, within the span of a few years, the software market itself for data mining is expected to be in excess of $10 billion. Data mining refers to a family of techniques used to detect interesting nuggets of relationships/knowledge in data. While the theoretical underpinnings of the field have been around for quite some time (in the form of pattern recognition, statistics, data analysis and machine learning), the practice and use of these techniques have been largely ad-hoc. With the availability of large databases to store, manage and assimilate data, the new thrust of data mining lies at the intersection of database systems, artificial intelligence and algorithms that efficiently analyze data. The distributed nature of several databases, their size and the high complexity of many techniques present interesting computational challenges.
7
Stemming
• A given word may occur in a variety of syntactic forms:
  o plurals
  o past tense
  o gerund forms (a noun derived from a verb)
• Example: the word connect may appear as connector, connection, connections, connected, connecting, connects, preconnection, and postconnection.
8
Stemming
• A stem is what is left of a word after its affixes (prefixes and suffixes) are removed
• Suffixes: connector, connection, connections, connected, connecting, connects
• Prefixes: preconnection and postconnection
• Stem: connect
9
Porter’s Algorithm
• Letters A, E, I, O, and U are vowels
• A consonant in a word is a letter other than A, E, I, O, or U, with the exception of Y
• The letter Y is a vowel if it is preceded by a consonant; otherwise it is a consonant
  o For example, Y in synopsis is a vowel, while in toy it is a consonant
• In the algorithm description, a consonant is denoted by c and a vowel by v
10
Porter’s Algorithm
• m is the measure of vc repetition: the number of vc blocks when the word is written as [c](vc)^m[v]
  o m = 0: TR, EE, TREE, Y, BY
  o m = 1: TROUBLE, OATS, TREES, IVY
  o m = 2: TROUBLES, PRIVATE, OATEN, ORRERY
• *S – the stem ends with S (similarly for other letters)
• *v* – the stem contains a vowel
• *d – the stem ends with a double consonant (e.g., -TT)
• *o – the stem ends cvc, where the second c is not W, X, or Y (e.g., -WIL)
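The vowel/consonant rules and the measure m can be checked with a few lines of Python. This is a minimal sketch, not a full stemmer; the function names (`is_consonant`, `measure`, `ends_double_consonant`, `ends_cvc`) are illustrative choices and do not come from the slides.

```python
def is_consonant(word: str, i: int) -> bool:
    """A, E, I, O, U are vowels; Y is a vowel only when preceded by a consonant."""
    ch = word[i].lower()
    if ch in "aeiou":
        return False
    if ch == "y":
        # With no preceding letter, or after a vowel, Y counts as a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Number of vc blocks (m) when the stem is written as [c](vc)^m[v]."""
    pattern = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    # Every vowel-to-consonant transition closes one vc block.
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "v" and b == "c")


def contains_vowel(stem: str) -> bool:
    """Condition *v*."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem: str) -> bool:
    """Condition *d, e.g. -TT."""
    return (len(stem) >= 2 and stem[-1].lower() == stem[-2].lower()
            and is_consonant(stem, len(stem) - 1))


def ends_cvc(stem: str) -> bool:
    """Condition *o: ends cvc and the final c is not W, X, or Y, e.g. -WIL."""
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1].lower() not in "wxy")


for word in ["TREE", "BY", "TROUBLE", "IVY", "PRIVATE", "ORRERY"]:
    print(word, measure(word))
```

The loop prints the measure for a few of the words listed on the slide, so the output can be compared against the m = 0, 1, 2 groups.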
16
Example
• GENERALIZATIONS
  o Step 1: GENERALIZATION
  o Step 2: GENERALIZE
  o Step 3: GENERAL
  o Step 4: GENER
• OSCILLATORS
  o Step 1: OSCILLATOR
  o Step 2: OSCILLATE
  o Step 4: OSCILL
  o Step 5: OSCIL
17
Porter’s Algorithm
Suffix stripping of a vocabulary of 10,000 words (http://www.tartarus.org/~martin/)
19
Term-Document Matrix
• A term-document matrix (TDM) is a two-dimensional representation of a document collection.
• Rows of the matrix represent the documents.
• Columns correspond to the index terms.
• Values in the matrix can be either the frequency or the weight of the index term (identified by the column) in the document (identified by the row).
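As a rough illustration of the layout described above, the sketch below builds a small TDM of raw frequencies. The three sample texts, the index terms, and the helper name `build_tdm` are made up for this example; tokenization is a plain whitespace split with no stemming.

```python
from collections import Counter


def build_tdm(documents, index_terms):
    """One row per document, one column per index term, holding raw frequencies."""
    matrix = []
    for text in documents:
        counts = Counter(text.lower().split())
        # Counter returns 0 for terms that do not occur in the document.
        matrix.append([counts[term] for term in index_terms])
    return matrix


# Hypothetical collection, used only to show the shape of the matrix.
docs = [
    "data mining finds patterns in data",
    "retrieval models rank documents",
    "mining text documents",
]
terms = ["data", "mining", "documents"]
tdm = build_tdm(docs, terms)
for row in tdm:
    print(row)
```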
23
Normalization
• Raw frequency values are not useful for a retrieval model
• Prefer normalized weights, usually between 0 and 1, for each term in a document
• A simple method of normalization is to divide all the keyword frequencies by the largest frequency in the document, so the most frequent term in each document gets weight 1
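The normalization described above can be sketched as follows. The function name `normalize_rows` and the sample matrix are assumptions for illustration; a document with no index terms is left as a row of zeros to avoid dividing by zero.

```python
def normalize_rows(tdm):
    """Divide every frequency in a row by the largest frequency in that row."""
    normalized = []
    for row in tdm:
        largest = max(row, default=0)
        normalized.append([f / largest if largest else 0.0 for f in row])
    return normalized


# Hypothetical raw frequencies for three documents and three terms.
raw = [[2, 1, 0], [0, 0, 0], [4, 2, 1]]
print(normalize_rows(raw))
```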
26
Retrieval models
• Retrieval models match a query with documents to:
  o separate documents into relevant and non-relevant classes
  o rank the documents according to their relevance
29
Boolean Retrieval Model
• One of the simplest and most efficient retrieval mechanisms
• Based on set theory and Boolean algebra
• Uses the conventional numeric representations of false as 0 and true as 1
• The Boolean model is interested only in the presence or absence of a term in a document
• In the term-document matrix, replace all the nonzero values with 1
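Converting a frequency or weight matrix into its Boolean form is a one-line transformation. The name `to_boolean` and the sample values are illustrative only.

```python
def to_boolean(tdm):
    """Replace every nonzero entry with 1 and keep zeros as 0."""
    return [[1 if value != 0 else 0 for value in row] for row in tdm]


# Hypothetical weights for two documents and four terms.
weights = [[0.5, 0, 1.0, 0], [0, 0.25, 0, 0]]
print(to_boolean(weights))
```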
33
Boolean Query
• User Boolean queries are usually simple Boolean expressions
• A Boolean query can be represented in a “disjunctive normal form” (DNF)
  o disjunction corresponds to or
  o conjunction refers to and
  o DNF consists of a disjunction of conjunctive Boolean expressions
34
DNF form
• K0 or (not K3 and K5) is in DNF
• DNF query processing can be very efficient
  o If any one of the conjunctive expressions is true, the entire DNF will be true
  o Short-circuit the expression evaluation
  o Stop matching the expression with a document as soon as a conjunctive expression matches the document; label the document as relevant to the query
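Short-circuit evaluation of a DNF query can be sketched as below. The encoding is an assumption made for this example: a query is a list of conjunctions, and each conjunction is a list of `(term_index, negated)` pairs, so K0 or (not K3 and K5) becomes `[[(0, False)], [(3, True), (5, False)]]`. The sample document rows are hypothetical.

```python
def matches_dnf(doc_row, dnf):
    """True if the Boolean document row satisfies at least one conjunction."""
    # any() stops at the first satisfied conjunction, and all() stops at the
    # first failed literal, which gives the short-circuit behaviour.
    return any(
        all(bool(doc_row[k]) != negated for k, negated in conjunction)
        for conjunction in dnf
    )


# K0 or (not K3 and K5)
query = [[(0, False)], [(3, True), (5, False)]]

# Hypothetical Boolean rows over terms K0..K5.
rows = {
    "Da": [1, 0, 0, 0, 0, 0],  # contains K0
    "Db": [0, 0, 0, 0, 0, 1],  # K5 without K3
    "Dc": [0, 0, 0, 1, 0, 1],  # K5 together with K3
}
relevant = [name for name, row in rows.items() if matches_dnf(row, query)]
print(relevant)
```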
35
Boolean Model
Advantages
• Simplicity and efficiency of implementation
• Binary values can be stored using bits
  o reduced storage requirements
  o retrieval using bitwise operations is efficient
• Boolean retrieval was adopted by many commercial bibliographic systems
• Boolean queries are akin to database queries
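One way to picture bit storage and bitwise retrieval is to keep one integer per term, with bit i set when document i contains the term. This is a sketch under that assumed layout; the helper names and the three-document matrix are hypothetical.

```python
def term_bitmaps(boolean_tdm):
    """Build one integer bitmap per term; bit i is set if document i has the term."""
    bitmaps = [0] * len(boolean_tdm[0])
    for doc_id, row in enumerate(boolean_tdm):
        for term_id, present in enumerate(row):
            if present:
                bitmaps[term_id] |= 1 << doc_id
    return bitmaps


def doc_ids(bitmap):
    """List the document numbers whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical Boolean matrix: three documents, two terms (K0, K1).
boolean_tdm = [[1, 0], [1, 1], [0, 1]]
k0, k1 = term_bitmaps(boolean_tdm)
all_docs = (1 << len(boolean_tdm)) - 1  # mask with one bit per document

print(doc_ids(k0 & k1))             # K0 and K1
print(doc_ids(k0 | k1))             # K0 or K1
print(doc_ids(~k1 & all_docs))      # not K1, restricted to existing documents
```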
36
Boolean Model
Disadvantages
• A document is either relevant or non-relevant to the query
• It is not possible to assign a degree of relevance
• Complicated Boolean queries are difficult for users
• Boolean queries retrieve too few or too many documents
  o K0 and K4 retrieved only 1 out of 6 documents
  o K0 or K4 retrieved 5 out of a possible 6 documents
38
Vector Space Model
• Treats both the documents and queries as vectors
• Each term is given a weight based on its frequency in the document
42
Relevance Values and Ranking
Ranking
1. D0 (0.7774)
2. D6 (0.4953)
3. D2 (0.3123)
4. D1 (0.2590)
5. D5 (0.2122)
6. D4 (0.1727)
7. D3 (0.1084)
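This excerpt does not show the similarity measure that produced the relevance values, so the sketch below assumes cosine similarity, a common choice for the vector space model. The document vectors and query are hypothetical and will not reproduce the scores listed above.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(documents, query):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = {name: cosine(vector, query) for name, vector in documents.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical two-term collection.
documents = {"Da": [1.0, 0.0], "Db": [1.0, 1.0], "Dc": [0.0, 1.0]}
for name, score in rank(documents, [1.0, 0.0]):
    print(f"{name} ({score:.4f})")
```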
43
Variations of VSM
• Variations of the normalized frequency
• Inverse document frequency (idf)
  o N = no. of documents
  o nj = no. of documents containing the jth term
  o idfj = log(N / nj)
• Modified weights
44
Inverse Document Frequencies for Collection (normalized)
$idf_0 = idf_1 = idf_2 = idf_3 = \log\frac{7}{3} = 0.368$
$idf_4 = idf_5 = idf_6 = \log\frac{7}{4} = 0.243$
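The two idf values above are consistent with a base-10 logarithm of N / nj with N = 7. The sketch below reproduces them; the helper names and the small matrix are assumptions for illustration, and a term that occurs in no document is given idf 0 to avoid dividing by zero.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency log10(N / n_j)."""
    return math.log10(n_docs / doc_freq)


def idf_vector(tdm):
    """Compute idf for every column (term) of a term-document matrix."""
    n_docs = len(tdm)
    result = []
    for j in range(len(tdm[0])):
        n_j = sum(1 for row in tdm if row[j] != 0)
        result.append(idf(n_docs, n_j) if n_j else 0.0)
    return result


print(round(idf(7, 3), 3))  # terms found in 3 of the 7 documents
print(round(idf(7, 4), 3))  # terms found in 4 of the 7 documents

# Hypothetical 7-document matrix: term 0 occurs in 3 documents, term 1 in 4.
sample = [[1, 0], [1, 1], [1, 1], [0, 1], [0, 1], [0, 0], [0, 0]]
print([round(x, 3) for x in idf_vector(sample)])
```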
46
q = (0, 0.2, 0.6, 0, 0.2, 0.3, 0)
Ranking
1. D0 (0.7867)
2. D6 (0.4953)
3. D2 (0.3361)
4. D1 (0.2590)
5. D5 (0.2215)
6. D4 (0.1208)
7. D3 (0.0969)
47
VSM vs. Boolean
• Queries are easier to express: users can attach relative weights to terms
• A descriptive query can be transformed to a query vector similar to documents
• Matching between a query and a document is not precise: each document is allocated a degree of similarity
• Documents are ranked based on their similarity scores instead of relevant/non-relevant classes
• Users can go through the ranked list until their information needs are met.
49
Evaluation of Retrieval Performance
• Evaluation should include:
  o Functionality
  o Response time
  o Storage requirement
  o Accuracy
50
Accuracy Testing
• Early days:
  o Batch testing
  o Document collection such as cacm.all
  o Query collection such as query.text
• Present day: interactive tests are used
  o Difficult to conduct and time consuming
  o Batch testing still important
51
Precision and Recall
• Precision: how many of the retrieved documents are relevant?
• Recall: how many of the relevant documents are retrieved?
52
Our earlier example illustrating the VSM
  o Documents from Fig. 2.15
  o Query q = (0, 0.2, 0.6, 0, 0.2, 0.3, 0)
Ranking
1. D0*
2. D6
3. D2*
4. D1
5. D5*
6. D4
7. D3*
• Semantic analysis: documents marked with an asterisk are relevant
• Retrieved the three top-ranked documents
• Relevant documents: R = {D0, D2, D5, D3}
• Retrieved documents: A = {D0, D6, D2}
• R ∩ A = {D0, D2}
$\text{precision} = \frac{|R \cap A|}{|A|} = \frac{|\{D0, D2\}|}{|\{D0, D6, D2\}|} = \frac{2}{3} = 0.67$
$\text{recall} = \frac{|R \cap A|}{|R|} = \frac{|\{D0, D2\}|}{|\{D0, D2, D5, D3\}|} = \frac{2}{4} = 0.5$
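The precision and recall figures for this example can be reproduced with set operations. The function name `precision_recall` is an illustrative choice; empty sets are mapped to 0.0 as an assumption to avoid dividing by zero.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for a relevant set R and a retrieved set A."""
    hits = relevant & retrieved  # R intersected with A
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


R = {"D0", "D2", "D5", "D3"}  # relevant documents
A = {"D0", "D6", "D2"}        # three top-ranked documents
precision, recall = precision_recall(R, A)
print(round(precision, 2), round(recall, 2))
```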
53
F-measure
$F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
$F = \frac{2 \times 0.67 \times 0.5}{0.67 + 0.5} = \frac{0.67}{1.17} = 0.57$
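A small sketch of the F-measure as the harmonic mean of precision and recall, checked against the numbers of the running example. The name `f_measure` is illustrative, and returning 0.0 when both inputs are zero is an assumption.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_measure(0.67, 0.5), 2))
```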
54
Average Precision
Retrieving three documents was an arbitrary choice
Rank retrieved   Precision   Recall
1                1.00        0.25
2                0.50        0.25
3                0.67        0.50
4                0.50        0.50
5                0.60        0.75
6                0.50        0.75
7                0.57        1.00