Modern Information Retrieval
Chapter 1: Introduction
Ricardo Baeza-Yates
Berthier Ribeiro-Neto
Motivation
Example of a user information need
• Topic: NCAA college tennis team
• Description: Find all the pages (documents) containing information on college tennis teams which (1) are maintained by a university in the USA and (2) participate in the NCAA tennis tournament.
• Narrative: To be relevant, the page must include information on the national ranking of the team in the last three years and the email or phone number of the team coach.
IR Research
Information retrieval vs Data retrieval
Research
• information search
• information filtering (routing)
• document classification and categorization
• user interfaces and data visualization
• cross-language retrieval
IR History
1970
1990, WWW
The User Task
• Retrieval (Searching): the classic information search process, where clear objectives are defined
• Browsing: a process where one’s main objectives are not clearly defined and might change during the interaction with the system
Logical View of the Documents
Text operations reduce the complexity of the document representation, from the full text to a set of index terms.
Steps
1. Stopword removal
2. Stemming
3. Noun groups
4. ...
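The first two steps above can be sketched in a few lines of Python. This is only an illustration: the stopword list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemming algorithm, which the slides do not name.

```python
import re

# Assumption: a tiny illustrative stopword list, not a standard one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "which"}


def crude_stem(word):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Reduce a full text to a set of index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())                  # tokenize
    return {crude_stem(t) for t in tokens if t not in STOPWORDS}  # filter, then stem


terms = index_terms("The teams are playing in the tournaments")
```

Here `terms` is a set, so word order and repetition in the full text are discarded, which is exactly the reduction in complexity the slide refers to.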
Past, Present, and Future
• Early development: index
• Library: author name, title, subject headings, keywords
• The Web and digital libraries: hyperlinks
Conventional Text-Retrieval Systems
Automatic Text Processing, G. Salton, Addison-Wesley, 1989 (Chapter 9)
Data Retrieval
• A specified set of attributes is used to characterize each record.
  EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)
• Exact match between the attributes used in query formulations and those attached to the document.
  SELECT BDATE, ADDR
  FROM EMPLOYEE
  WHERE NAME = 'John Smith'
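To make the exact-match behaviour concrete, the query above can be run against an in-memory SQLite database using Python's standard `sqlite3` module. The two rows below are hypothetical sample data invented for this sketch; only the schema and the query come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE "
    "(NAME TEXT, SSN TEXT, BDATE TEXT, ADDR TEXT, SEX TEXT, SALARY REAL, DNO INTEGER)"
)

# Hypothetical rows, for illustration only.
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("John Smith", "000-00-0001", "1965-01-09", "1 Example Street", "M", 30000.0, 5),
        ("Jane Smith", "000-00-0002", "1970-03-21", "2 Example Street", "F", 32000.0, 4),
    ],
)


def lookup(name):
    """Return (BDATE, ADDR) for records whose NAME matches exactly."""
    cur = conn.execute("SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", (name,))
    return cur.fetchall()


exact = lookup("John Smith")   # one record
partial = lookup("John")       # no record: a partial match is not a match
```

A record is either returned or not; there is no notion of a record being "somewhat" relevant, which is the key contrast with text retrieval.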
Text-Retrieval Systems
• Content identifiers (keywords, index terms, descriptors) characterize the stored texts.
• Degrees of coincidence between the sets of identifiers attached to queries and documents
  • content analysis
  • query formulation
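As a rough illustration of "degree of coincidence", the sketch below counts how many identifiers a query and a document share. The slide does not say which measure is meant, so the raw overlap count is an assumption, and the document names and identifier sets are made up.

```python
def coincidence(query_ids, doc_ids):
    """Number of content identifiers shared by the query and the document."""
    return len(set(query_ids) & set(doc_ids))


def rank_by_coincidence(query_ids, docs):
    """Order documents by decreasing overlap with the query (ties by name)."""
    scored = [(name, coincidence(query_ids, ids)) for name, ids in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical identifier sets produced by content analysis.
docs = {
    "docA": {"tennis", "ncaa", "ranking"},
    "docB": {"tennis", "coach"},
    "docC": {"football"},
}
ranking = rank_by_coincidence({"tennis", "ncaa"}, docs)
```

Unlike the exact-match query of a data retrieval system, a document sharing only some identifiers with the query still receives a nonzero score.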
Possible Representation
• Document representation (Text operation)
  • unweighted index terms (term vectors)
  • weighted index terms
  • …
• Query (Query operation)
  • unweighted or weighted index terms
  • Boolean combinations (or, and, not)
  • …
• Search operation must be effective (Indexing)
File Structures
• Main requirements
  • fast access for various kinds of searches
  • large number of indices
• Alternatives
  • Inverted Files
  • Signature Files
  • PAT trees
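Inverted Files
The file is represented as an array of indexed documents.

|       | Term 1 | Term 2 | Term 3 | Term 4 |
|-------|--------|--------|--------|--------|
| Doc 1 | 1 | 1 | 0 | 1 |
| Doc 2 | 0 | 1 | 1 | 1 |
| Doc 3 | 1 | 0 | 1 | 1 |
| Doc 4 | 0 | 0 | 1 | 1 |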
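Inverted-file process
The document-term array is inverted (transposed).

|        | Doc 1 | Doc 2 | Doc 3 | Doc 4 |
|--------|-------|-------|-------|-------|
| Term 1 | 1 | 0 | 1 | 0 |
| Term 2 | 1 | 1 | 0 | 0 |
| Term 3 | 0 | 1 | 1 | 1 |
| Term 4 | 1 | 1 | 1 | 1 |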
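The inversion step is a plain matrix transpose, which can be sketched as follows. The array values are the ones from the slides; the function names `invert` and `postings` are chosen for this example only.

```python
# Document-term array from the slides: rows are Doc 1..4, columns are Term 1..4.
doc_term = [
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
]


def invert(matrix):
    """Transpose a document-term array into a term-document array."""
    return [list(column) for column in zip(*matrix)]


def postings(term_doc):
    """Turn each term row into a sorted list of 1-based document numbers."""
    return [[d + 1 for d, bit in enumerate(row) if bit] for row in term_doc]


term_doc = invert(doc_term)       # rows are now Term 1..4
term_postings = postings(term_doc)
```

In practice an inverted file usually stores the compact posting lists rather than the full 0/1 rows, since most entries of a real term-document array are zero.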
Inverted-file process (Continued)
Take two or more rows of an inverted term-document array, and produce a single combined list of document identifiers.
Ex: Query = (term2 and term3)

|                 | D1 | D2 | D3 | D4 |
|-----------------|----|----|----|----|
| term2           | 1 | 1 | 0 | 0 |
| term3           | 0 | 1 | 1 | 1 |
| term2 and term3 | 0 | 1 | 0 | 0 |

Result: D2
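The same query can be evaluated by combining the two term rows position by position. This is a minimal sketch; the helper names are illustrative.

```python
# Rows of the inverted term-document array for the two query terms.
term_doc = {
    "term2": [1, 1, 0, 0],
    "term3": [0, 1, 1, 1],
}


def boolean_and(row_a, row_b):
    """Bitwise AND of two term rows of equal length."""
    return [a & b for a, b in zip(row_a, row_b)]


def matching_docs(row):
    """Names of the documents whose bit is set in the combined row."""
    return [f"D{i + 1}" for i, bit in enumerate(row) if bit]


combined = boolean_and(term_doc["term2"], term_doc["term3"])
answer = matching_docs(combined)
```

Replacing `&` with `|` in `boolean_and` would give the corresponding `or` query.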
List-merging for two ordered lists
The inverted-index operations used to obtain answers are based on a list-merging process.
Example
T1: {D1, D3}
T2: {D1, D2}
Merged(T1, T2): {D1, D1, D2, D3}
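A two-pointer merge reproduces the example above. Document identifiers are written as integers here (1 for D1, and so on). Reading duplicates in the merged list as the answer to an `and` query, and the distinct entries as the answer to an `or` query, is an interpretation assumed for this sketch; it also assumes each input list is sorted and free of duplicates.

```python
def merge_postings(a, b):
    """Merge two ordered posting lists, keeping duplicates."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])   # at most one of these two tails is non-empty
    merged.extend(b[j:])
    return merged


def and_docs(merged):
    """Documents that appear twice in the merged list (present in both inputs)."""
    return [x for x, y in zip(merged, merged[1:]) if x == y]


def or_docs(merged):
    """Distinct documents in the merged list (present in either input)."""
    out = []
    for doc in merged:
        if not out or out[-1] != doc:
            out.append(doc)
    return out


merged = merge_postings([1, 3], [1, 2])
```

Because both inputs are ordered, the merge touches each entry once, so its cost is linear in the combined length of the two lists.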
Extensions of Inverted Index Operations (Distance Constraints)
Distance Constraints
• (A within sentence B): terms A and B must co-occur in a common sentence
• (A adjacent B): terms A and B must occur adjacently in the text
Extensions of Inverted Index Operations (Distance Constraints)
Implementation
• include term-location in the inverted indexes
  • information: {P345, P348, P350, …}
  • retrieval: {P123, P128, P345, …}
• include sentence-location in the indexes
  • information: {P345, 25; P345, 37; P348, 10; P350, 8; …}
  • retrieval: {P123, 5; P128, 25; P345, 37; P345, 40; …}
Extensions of Inverted Index Operations (Distance Constraints)
• Include paragraph numbers in the indexes, sentence numbers within paragraphs, and word numbers within sentences
  • information: {P345, 2, 3, 5; …}
  • retrieval: {P345, 2, 3, 6; …}
• Query examples
  • (information adjacent retrieval)
  • (information within five words retrieval)
• Cost: the size of indexes
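The two query examples can be evaluated directly on such postings. The sketch below uses only the two entries shown on the slide. It assumes that "adjacent" means a word-number difference of exactly one in either order, and that "within N words" is checked inside a single sentence because word numbers restart in each sentence; the slide states neither detail.

```python
from typing import NamedTuple


class Posting(NamedTuple):
    doc: str
    paragraph: int
    sentence: int   # sentence number within the paragraph
    word: int       # word number within the sentence


# The two postings given on the slide; the elided entries are left out.
index = {
    "information": [Posting("P345", 2, 3, 5)],
    "retrieval": [Posting("P345", 2, 3, 6)],
}


def within(index, a, b, max_gap):
    """Documents where a and b occur in one sentence, at most max_gap words apart."""
    hits = set()
    for pa in index.get(a, []):
        for pb in index.get(b, []):
            same_sentence = pa[:3] == pb[:3]   # same doc, paragraph, and sentence
            if same_sentence and 0 < abs(pa.word - pb.word) <= max_gap:
                hits.add(pa.doc)
    return hits


def adjacent(index, a, b):
    """Documents where a and b occur next to each other."""
    return within(index, a, b, 1)


adjacent_hits = adjacent(index, "information", "retrieval")
nearby_hits = within(index, "information", "retrieval", 5)
```

Every occurrence of every term now carries four fields instead of one, which is the index-size cost the slide mentions.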
Retrieval models
• Classic Models: Boolean, Vector, Probabilistic
• Set Theoretic: Fuzzy, Extended Boolean
• Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
• Probabilistic: Inference Network, Belief Network
Classic IR Model
Basic concepts
• Each document is described by a set of representative keywords called index terms.
• Numerical weights are assigned to index terms to distinguish how relevant each term is for describing the document.
Boolean model
• Binary decision criterion
• Data retrieval model
• Advantage
  • clean formalism, simplicity
• Disadvantage
  • It is not simple to translate an information need into a Boolean expression.
  • Exact matching may lead to retrieval of too few or too many documents.
Vector model
Assign non-binary weights to index terms in queries and in documents. => TFxIDF
Compute the similarity between documents and query. => Sim(Dj, Q)
More precise than the Boolean model.
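The slide names TFxIDF and Sim(Dj, Q) without giving formulas. The sketch below assumes one common variant, raw term frequency times log(N/df) for the weights and the cosine of the angle between vectors for the similarity. The three toy documents and the query are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Weight each term by tf * log(N / df); returns (vectors, idf table)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(docs, query):
    """Return (similarity, document index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    scores = [(cosine(d, q), i) for i, d in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Hypothetical documents, already reduced to index terms.
docs = [
    ["tennis", "team", "ranking"],
    ["tennis", "coach"],
    ["football", "team"],
]
ranked = rank(docs, ["tennis", "ranking"])
```

The second document matches only one query term and still gets a nonzero score, so partially matching documents are ranked rather than rejected as in the Boolean model.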
Term Weights
• Term weights
  • Di = {Ti1, 0.2; Ti2, 0.5; Ti3, 0.6}
• Issues
  • How to generate the term weights?
  • How to apply the term weights?
    • Sum the weights of all document terms that match the given query.
    • Rank the output documents in descending order of the summed weight.
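The "sum and rank" procedure can be sketched as below. The two weighted documents and the query are hypothetical values chosen for this example.

```python
def score(doc_weights, query_terms):
    """Sum the weights of the document terms that also appear in the query."""
    return sum(w for term, w in doc_weights.items() if term in query_terms)


def rank(docs, query_terms):
    """Order documents by descending summed weight."""
    scored = [(name, score(weights, query_terms)) for name, weights in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical weighted documents.
docs = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
}
ranking = rank(docs, {"T2", "T3"})
```

Terms outside the query contribute nothing, so a high weight on `T1` does not help `D2` for this query.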
Boolean Query with Term Weights
Transform a Boolean expression into disjunctive normal form.
T1 and (T2 or T3) = (T1 and T2) or (T1 and T3)
For each conjunct, compute the minimum term weight of any document term in that conjunct.
The document weight is the maximum of all the conjunct weights.
Boolean Query with Term Weights
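Example: Q = (T1 and T2) or T3

| Document vector | Conjunct weight: (T1 and T2) | Conjunct weight: (T3) | Query weight: (T1 and T2) or T3 |
|-----------------|------------------------------|-----------------------|---------------------------------|
| D1 = (T1,0.2; T2,0.5; T3,0.6) | 0.2 | 0.6 | 0.6 |
| D2 = (T1,0.7; T2,0.2; T3,0.1) | 0.2 | 0.1 | 0.2 |

D1 is preferred.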
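The min/max rule can be checked against this example with a short sketch. The query is assumed to be already in disjunctive normal form and is written as a list of conjuncts, each a list of terms; treating a term missing from a document as weight 0.0 is an assumption of this sketch.

```python
def document_weight(doc, dnf):
    """Maximum over conjuncts of the minimum term weight within each conjunct."""
    return max(min(doc.get(term, 0.0) for term in conjunct) for conjunct in dnf)


# Q = (T1 and T2) or T3, in disjunctive normal form.
query = [["T1", "T2"], ["T3"]]

d1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
d2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}

w1 = document_weight(d1, query)
w2 = document_weight(d2, query)
```

Taking the minimum makes a conjunct only as strong as its weakest term, while taking the maximum lets the best-satisfied conjunct decide the document weight.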
Summary
Conventional IR systems
• Evaluation
• Text operations (Term selection)
• Query operations (Pattern matching, Relevance feedback)
• Indexing (File structure)
• Modeling
Resources
• Journals
  • Journal of the American Society for Information Science
  • ACM Transactions on Information Systems
  • Information Processing and Management
  • Information Systems (Elsevier)
  • Knowledge and Information Systems (Springer)
• Conferences
  • ACM SIGIR, DL, CIKM, CHI, etc.
  • Text Retrieval Conference (TREC)