1
Chapter 2: Information Retrieval, Part 1
2
Modern Information Retrieval
• Document representation
    • Using keywords
    • Relative weight of keywords
• Query representation
    • Keywords
    • Relative importance of keywords
• Retrieval model
    • Similarity between document and query
    • Rank the documents
• Performance evaluation of the retrieval process
3
Document Representation
Transforming a text document to a weighted list of keywords
4
Stopwords
Figure 2.2 A partial list of stopwords
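The transformation from text to keyword frequencies can be sketched as follows. This is a minimal illustration, not the course's reference implementation: the `STOPWORDS` set is a small made-up sample (not the list in Figure 2.2), and the function name `keyword_frequencies` is hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword set, not the list from Figure 2.2.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on"}

def keyword_frequencies(text):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

freqs = keyword_frequencies(
    "The retrieval of a document is the goal of the retrieval model."
)
print(freqs.most_common())
```

Here "retrieval" is counted twice and the stopwords "the", "of", "a", and "is" are discarded before counting.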
5
Activity: Document Representation
Transform the text in the document given into a weighted list of keywords.
6
Stemming
A given word may occur in a variety of syntactic forms:
• plurals
• past tense
• gerund forms (a noun derived from a verb)
Example: The word connect may appear as connector, connection, connections, connected, connecting, connects, preconnection, and postconnection.
7
Stemming
A stem is what is left of a word after its affixes (prefixes and suffixes) are removed.
• Suffixes: connector, connection, connections, connected, connecting, connects
• Prefixes: preconnection and postconnection
• Stem: connect
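As a toy illustration of affix removal, the sketch below strips a hand-picked set of prefixes and suffixes that covers only the connect example. The affix lists and the name `toy_stem` are assumptions made for this example; a real stemmer such as Porter's applies conditions before removing a suffix.

```python
# Assumption: affix lists chosen by hand to cover the "connect" example only.
PREFIXES = ["post", "pre"]
SUFFIXES = ["ions", "ion", "ing", "ed", "or", "s"]  # longest suffixes first

def toy_stem(word):
    """Remove at most one listed prefix and one listed suffix from the word."""
    for prefix in PREFIXES:
        if word.startswith(prefix):
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            word = word[:-len(suffix)]
            break
    return word

forms = ["connector", "connection", "connections", "connected",
         "connecting", "connects", "preconnection", "postconnection"]
stems = {toy_stem(w) for w in forms}
print(stems)
```

Every form in the list reduces to the single stem connect. The same lists would mangle unrelated words (for example, "bus" would lose its final "s"), which is why practical stemmers attach conditions to each rule.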
8
Porter’s Algorithm
• Letters A, E, I, O, and U are vowels
• A consonant in a word is a letter other than A, E, I, O, or U, with the exception of Y
• The letter Y is a vowel if it is preceded by a consonant, otherwise it is a consonant
• For example, Y in synopsis is a vowel, while in toy it is a consonant
• A consonant in the algorithm description is denoted by c, and a vowel by v
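The vowel and consonant rules above can be sketched as a small function that maps a word to its c/v pattern. The name `cv_pattern` is hypothetical, and the sketch assumes the input contains only letters.

```python
def cv_pattern(word):
    """Return one 'c' or 'v' per letter, following the rules for A, E, I, O, U, and Y."""
    pattern = []
    for i, ch in enumerate(word.lower()):
        if ch in "aeiou":
            pattern.append("v")
        elif ch == "y":
            # Y is a vowel only when the previous letter is a consonant;
            # at the start of a word, or after a vowel, it is a consonant.
            prev_is_consonant = i > 0 and pattern[i - 1] == "c"
            pattern.append("v" if prev_is_consonant else "c")
        else:
            pattern.append("c")
    return "".join(pattern)

print(cv_pattern("synopsis"))  # cvcvccvc
print(cv_pattern("toy"))       # cvc
```

In synopsis the Y follows the consonant S, so it is marked v; in toy the Y follows the vowel O, so it is marked c.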
9
Porter’s algorithm: Step 1
Step 1: plurals and past participles
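A minimal sketch of the plural-handling part of Step 1 (step 1a in Porter's published description: SSES becomes SS, IES becomes I, SS stays SS, and a final S is dropped). The past-participle rules are left out because they depend on conditions not spelled out in this text; the function name `step_1a` is an assumption.

```python
def step_1a(word):
    """Apply the first matching plural rule; rules are tried longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", step_1a(w))
```

Only one rule fires per word, so "caress" is protected by the SS rule before the single-S rule can strip it.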
10
Porter’s algorithm: Step 2
Steps 2–4: straightforward stripping of suffixes
11
Porter’s algorithm: Step 3
Steps 2–4: straightforward stripping of suffixes
12
Porter’s algorithm: Step 4
Steps 2–4: straightforward stripping of suffixes
13
Porter’s algorithm: Step 5
Step 5: tidying up
14
Porter’s algorithm
Suffix stripping of a vocabulary of 10,000 words (http://www.tartarus.org/~martin/)
15
For the Tutorial
• Bring your laptop/lab
• Make sure you have Java installed
• Bring any English language text document; the extension must be .txt
• Number of words: no more than 1000
16
Document Representation
17
Term-Document Matrix
• A term-document matrix (TDM) is a two-dimensional representation of a document collection.
• Rows of the matrix represent the documents.
• Columns correspond to the index terms.
• Values in the matrix can be either the frequency or the weight of the index term (identified by the column) in the document (identified by the row).
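A term-document matrix with this layout (one row per document, one column per index term) might be built as in the sketch below. The function name `build_tdm`, the dictionary-of-lists layout, and the sample documents are assumptions made for illustration.

```python
from collections import Counter

def build_tdm(docs):
    """docs maps a document id to its list of index terms.

    Returns the sorted column terms and, per document, a row of raw frequencies.
    """
    terms = sorted({t for tokens in docs.values() for t in tokens})
    matrix = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        matrix[doc_id] = [counts[t] for t in terms]  # 0 when a term is absent
    return terms, matrix

# Hypothetical two-document collection, already tokenized.
docs = {
    "d1": ["retrieval", "model", "retrieval"],
    "d2": ["document", "model"],
}
terms, tdm = build_tdm(docs)
print(terms)
for doc_id, row in tdm.items():
    print(doc_id, row)
```

The columns come out in alphabetical order (document, model, retrieval), so row d1 is [0, 1, 2] and row d2 is [1, 1, 0].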
18
Term-Document matrix
19
Sparse Matrices: Triples
20
Sparse Matrices: Pairs
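Since most cells of a term-document matrix are zero, only the nonzero cells need to be stored. The sketch below shows one plausible reading of the two layouts named above: a flat list of (row, column, value) triples, and a per-row list of (column, value) pairs. The original figures are not reproduced here, so both layouts and the names `to_triples` and `to_pairs` are assumptions.

```python
def to_triples(rows):
    """One (row index, column index, value) triple per nonzero cell."""
    return [(i, j, v)
            for i, row in enumerate(rows)
            for j, v in enumerate(row)
            if v != 0]

def to_pairs(rows):
    """For each row, a list of (column index, value) pairs for its nonzero cells."""
    return [[(j, v) for j, v in enumerate(row) if v != 0] for row in rows]

# Hypothetical dense matrix: 2 documents (rows) by 3 terms (columns).
dense = [
    [0, 1, 2],
    [1, 1, 0],
]
triples = to_triples(dense)
pairs = to_pairs(dense)
print(triples)
print(pairs)
```

The pairs layout drops the row index because the position of each inner list already identifies the document.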
21
Normalization
• Raw frequency values are not useful for a retrieval model.
• Normalized weights, usually between 0 and 1, are preferred for each term in a document.
• Dividing all the keyword frequencies by the largest frequency in the document is a simple method of normalization: normalized weight = frequency of the term in the document / largest term frequency in the document.
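The division by the largest frequency can be sketched in a few lines. The function name `normalize` and the sample counts are assumptions made for illustration.

```python
def normalize(freqs):
    """Divide every term frequency by the largest frequency in the same document."""
    if not freqs:
        return {}
    largest = max(freqs.values())
    return {term: count / largest for term, count in freqs.items()}

# Hypothetical raw frequencies for one document.
raw = {"retrieval": 4, "model": 2, "document": 1}
weights = normalize(raw)
print(weights)
```

The most frequent term always receives weight 1.0, and every other term falls between 0 and 1; in this example model gets 0.5 and document gets 0.25.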
22
Normalized Term-Document Matrix
23
Vector Representation of document d1
(word, frequency, normalized frequency)
24
Mini project (Survey)
Arabic language stemmer design
• Survey and compare existing Arabic language stemmers and write a research paper.
• Design an Arabic language stemmer.
• Reading: Hints on writing technical reports and papers