Gertjan van Noord – based on the sheets by Leonoor van der Beek 2013
Information Retrieval
Lecture 1: introduction

Agenda for today
• Who’s who
• Intro to the course
• Chapter 1 of Introduction to Information Retrieval
• Homework/lab assignment

Intro to the course
• What is IR?
• What will we study and how?
• Objectives of the course
• Exercises and lab sessions
• Final exam

What is IR?
Individuals, administrations, organizations have lots of digital information
• how to organize and store it?
• how to retrieve documents?
• how to retrieve info inside them?
An IR system is a tool to facilitate retrieval of such information

Book’s definition
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

... finding material (usually documents) ...
What else can you think of?
• parts of documents
• facts, like the date of birth of Rembrandt
• a book in the library
• a work of art in a museum

... from within large collections (usually stored on computers) ...
WWW? What else?
• Specific collections, like legal information or scientific medical papers (Medline)
• Information on your own computer
• Information within a company
• Subparts of the WWW, like one domain

... of an unstructured nature (usually text)
Can you explain this?
Unstructured: differences between text and databases
Is a text document really unstructured?
How about XML?
Beyond text: image, sound, video, ...

Database search vs. IR
Database search:
• structured semantic info: fields, datatypes, validation, relations
• search of fields
• exact search for data
• order of found records alphanumerical
IR:
• no semantic structure, no fixed format, but: text structure, metadata, XML
• full text search
• not-exact search for data or information
• order of found documents often by similarity with query

Book’s definition
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

that satisfies an information need ...
What information needs can we discern?
Try to formulate some different types of goals of a search
- facts and question answering
- definitions
- information on a subject
- retrieving a known document
and in web search?

User needs in web search
Navigational. The immediate intent is to reach a particular site.
Informational. The intent is to acquire some information assumed to be present on one or more web pages.
Transactional. The intent is to perform some web-mediated activity.
Broder, A. 2002. A taxonomy of Web search. SIGIR Forum 36, no. 2, 3-10.

Translation of info need
Each information need has to be translated into the "language" of the IR system
[Diagram with the labels: reality, document, info need, query]

Translation of info need
Query: Hilton, Paris

Translation of info need
Query: champagne

Translation of info need
Query: Rene Froger “Een eigen huis”

Translation of info need
Information need:
Query: ??

Translation of info need
Information need:
Query: ??

Are the results satisfying?
Search engines often produce a lot of results
When are you satisfied with the results?
How can we evaluate a system?
• the most relevant results are easy to find (on top of the list, and/or sorted by subject, ...)
• only few results are not relevant
• new information
• info corroborated (more sources)
• relevant documents that I know are presented

Precision and recall
Key statistics for evaluation with a test set (a fixed set of queries, a set of documents, and relevance judgments of the documents for each query)
Precision: what fraction of the results are relevant to the information need?
Recall: what fraction of the relevant documents in the collection were returned by the system?
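
As a small illustration of the two definitions, the following Python sketch computes precision and recall for a single query. The function name, the docIDs and the relevance judgments are made up for this example and are not taken from the course material.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: docIDs returned by the system
    relevant:  docIDs judged relevant in the test set
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the system returns 4 documents, 3 of them relevant;
# the collection contains 6 relevant documents in total.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 3, 4, 7, 8, 9])
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

With these made-up judgments, 3 of the 4 returned documents are relevant (precision 0.75) and 3 of the 6 relevant documents are returned (recall 0.50).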

Precision and recall
But how relevant are precision and recall if you search for e.g. the date of birth of Vincent van Gogh?

Overview of an IR system
(book: Baeza-Yates, Modern IR)

Web site
Overview and exercises:
* http://www.let.rug.nl/vannoord/College/Zoekmachines/
* Nestor

Course Book
Introduction to Information Retrieval
C. D. Manning, P. Raghavan and H. Schütze
Online version
NB: the book is also used for the Information Retrieval course
The book is written for CS students; we will skip sections and exercises that are a bit too technical

Schedule for this course
wk 1   ch 1    boolean retrieval, posting lists
wk 2   ch 2    decoding, tokenization and normalization, sublinear posting list intersection
wk 3   ch 3    dictionaries, wild cards, spell correction
wk 4   ch 6    scoring and term weighting, term and document frequency weighting, vector space models
wk 5   ch 8    evaluation
wk 6   ch 21   link analysis, page rank
wk 7   ch 9    relevance feedback and query analysis

HOW will we study the book?
Homework: read the chapter thoroughly
Lectures: overview of chapter
Labtime/homework: do exercises
Next lecture: remaining questions
A full slide presentation of the chapters by one of the authors is available as well (author's slides)

Labtime
1. Exercises (from the book + more)
2. Try out simple techniques in Python
3. More...
4. More...

Course objectives
• knowledge of IR terminology
• insight into IR models and IR processes
• knowledge of methods of indexing, querying, retrieving and ranking
• knowledge of methods of evaluation of IR systems
• practical experience with use, adaptation and testing of some of the basic IR algorithms and techniques

Chapter 1: Boolean retrieval
1. General introduction on IR
2. Boolean systems
3. Representation of information
4. Retrieving documents
5. Efficiency aspects

Boolean retrieval
The first IR systems were Boolean systems
Queries are formulated with the Boolean operators AND, OR and NOT:
• Brutus AND Caesar
• (Brutus OR Caesar) AND NOT Cleopatra
• Brutus OR (Caesar AND NOT Cleopatra)
• NOT Brutus
How about Google queries?

Information from documents
• Each document in the system needs a unique docID
• Tokenization is the process of splitting a text into separate tokens (not trivial!)
• For a simple boolean system we just need to know which terms are present in which doc
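
To see why tokenization is not trivial, here is a minimal Python sketch of a deliberately naive tokenizer. The function name `tokenize`, the regular expression and the two example texts are assumptions made for this illustration; a real system would treat apostrophes, hyphens and accented letters more carefully.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of ASCII letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Hypothetical mini collection: docID -> text
docs = {
    1: "I did enact Julius Caesar: I was killed in the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus has told you Caesar was ambitious.",
}

# For a simple Boolean system we only record which terms occur in which doc.
terms_per_doc = {doc_id: set(tokenize(text)) for doc_id, text in docs.items()}
print(sorted(terms_per_doc[1]))

# The naive rule splits names and contractions in questionable ways:
tricky = tokenize("Mr. O'Neill doesn't live in Rotterdam-Zuid.")
print(tricky)
```

The second print shows tokens such as `o`, `neill`, `doesn` and `t`, which is probably not what a user searching for "O'Neill" expects.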

Term document incidence matrix
Term        Doc 1  Doc 2  Doc 3  Doc 4
Antony        1      1      0      0
Brutus        1      1      1      0
Caesar        1      1      0      1
Cleopatra     1      0      0      0
Antony AND Brutus AND NOT Cleopatra?
in huge collections > 99% of entries are 0
not a good representation, no efficient processing
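
The query on this slide can be answered by combining the rows of the matrix position by position. The sketch below uses plain Python lists as bit vectors; the helper names `vec_and` and `vec_not` are chosen for this example only.

```python
# One row per term, one column per document (Doc 1 .. Doc 4), as in the matrix.
incidence = {
    "Antony":    [1, 1, 0, 0],
    "Brutus":    [1, 1, 1, 0],
    "Caesar":    [1, 1, 0, 1],
    "Cleopatra": [1, 0, 0, 0],
}


def vec_and(u, v):
    """Position-wise AND of two 0/1 vectors."""
    return [a & b for a, b in zip(u, v)]


def vec_not(u):
    """Position-wise complement of a 0/1 vector."""
    return [1 - a for a in u]


# Antony AND Brutus AND NOT Cleopatra
result = vec_and(
    vec_and(incidence["Antony"], incidence["Brutus"]),
    vec_not(incidence["Cleopatra"]),
)
matching_docs = [f"Doc {i}" for i, bit in enumerate(result, start=1) if bit]
print(result, matching_docs)
```

For this matrix the result vector is `[0, 1, 0, 0]`, so only Doc 2 matches.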

Building an inverted file
1. Give docIDs and tokenize the texts
2. Gather terms with their docID
3. Sort on terms and docID
4. Now list the unique terms with their document frequency and link to the postings list with docIDs
term docfreq postings list
[Caesar, 3] [1,2,4]
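
The four steps can be sketched in a few lines of Python. This is only an illustration: the function name `build_inverted_index` and the toy collection are assumptions, and the texts are taken to be tokenized already.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: dict docID -> list of tokens. Returns term -> (docfreq, postings list)."""
    # Step 2: gather (term, docID) pairs.
    pairs = [(term, doc_id) for doc_id, tokens in docs.items() for term in tokens]
    # Step 3: sort on term, then on docID.
    pairs.sort()
    # Step 4: collapse duplicates into one postings list per unique term.
    postings = defaultdict(list)
    for term, doc_id in pairs:
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)
    return {term: (len(plist), plist) for term, plist in postings.items()}


# Step 1: docIDs and tokenized texts (hypothetical toy collection).
docs = {
    1: ["antony", "brutus", "caesar", "cleopatra", "caesar"],
    2: ["antony", "brutus", "caesar"],
    3: ["brutus"],
    4: ["caesar"],
}
index = build_inverted_index(docs)
print(index["caesar"])
```

For this toy collection the entry for `caesar` has document frequency 3 and postings list `[1, 2, 4]`, even though the term occurs twice in document 1.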

Inverted file / index
Antony AND Brutus AND NOT Cleopatra?
efficient processing if sorted on docID
simple merging algorithms for AND / OR
(term)      (df)   (postings list)
Antony       2     1, 2
Brutus       3     1, 2, 3
Caesar       3     1, 2, 4
Cleopatra    1     1
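
A minimal sketch of such merging algorithms in Python, assuming every postings list is sorted on docID. The function names `intersect` and `difference` are chosen for this example, and the postings lists are the ones that follow from the term document incidence matrix shown earlier.

```python
def intersect(p1, p2):
    """AND: docIDs that occur in both sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def difference(p1, p2):
    """AND NOT: docIDs in sorted list p1 that do not occur in sorted list p2."""
    answer, j = [], 0
    for doc_id in p1:
        # Skip entries of p2 that are smaller than the current docID.
        while j < len(p2) and p2[j] < doc_id:
            j += 1
        if j == len(p2) or p2[j] != doc_id:
            answer.append(doc_id)
    return answer


# Postings lists derived from the incidence matrix.
antony, brutus, cleopatra = [1, 2], [1, 2, 3], [1]
answer = difference(intersect(antony, brutus), cleopatra)
print(answer)
```

Both functions walk through each list only once, which is what makes processing efficient when the lists are sorted. Here the query Antony AND Brutus AND NOT Cleopatra yields document 2.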

Distributive laws
a AND (b OR c) = (a AND b) OR (a AND c)
(a OR b) AND (c OR d) = ??
NOT(a OR b) = NOT(a) AND NOT(b)
NOT(a AND b) = ??
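
Laws like these can be checked mechanically by trying every combination of truth values. The Python sketch below verifies the two laws that are stated in full; the helper `equivalent` is a name made up for this example, and it can also be used to test your own answers for the lines marked ??.

```python
from itertools import product


def equivalent(f, g, nvars):
    """True if Boolean functions f and g agree on every truth assignment."""
    return all(
        f(*vals) == g(*vals) for vals in product([False, True], repeat=nvars)
    )


# The two laws stated in full on the slide:
distributive = equivalent(
    lambda a, b, c: a and (b or c),
    lambda a, b, c: (a and b) or (a and c),
    3,
)
negated_or = equivalent(
    lambda a, b: not (a or b),
    lambda a, b: (not a) and (not b),
    2,
)
print(distributive, negated_or)

# A wrong candidate is rejected: a AND (b OR c) is not (a AND b) OR c.
wrong = equivalent(
    lambda a, b, c: a and (b or c),
    lambda a, b, c: (a and b) or c,
    3,
)
print(wrong)
```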

Conjunctive and disjunctive queries
The outer level of processing can be either conjunctive (AND) or disjunctive (OR):
• Conjunctive normal form:
  • a conjunction of disjunctions
  • (a OR NOT b) AND (c OR d) AND e
• Disjunctive normal form:
  • a disjunction of conjunctions
  • (a AND NOT b) OR (c AND d) OR e
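
One possible way to process a query in conjunctive normal form is sketched below in Python: OR is applied within each clause and AND between the clauses. The function name `eval_cnf`, the query encoding and the toy index are assumptions made for this illustration only.

```python
def eval_cnf(clauses, index, all_docs):
    """Evaluate a query in conjunctive normal form.

    clauses:  list of clauses; each clause is a list of (term, negated) pairs
    index:    term -> set of docIDs containing the term
    all_docs: set of all docIDs (needed to evaluate NOT)
    """
    result = set(all_docs)
    for clause in clauses:
        matches = set()
        for term, negated in clause:
            docs = index.get(term, set())
            # OR within a clause: collect docs matching any literal.
            matches |= (all_docs - docs) if negated else docs
        # AND between clauses: keep only docs matching every clause so far.
        result &= matches
    return result


# Hypothetical toy index over four documents.
all_docs = {1, 2, 3, 4}
index = {"a": {1, 2}, "b": {2, 3}, "c": {1}, "d": {4}, "e": {1, 2, 4}}

# (a OR NOT b) AND (c OR d) AND e
query = [[("a", False), ("b", True)], [("c", False), ("d", False)], [("e", False)]]
cnf_result = eval_cnf(query, index, all_docs)
print(cnf_result)
```

For this toy index the first clause matches documents 1, 2 and 4, the second clause narrows this to 1 and 4, and the third clause keeps both.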

The order of the size
Example
$f(x) = 2x^3 + 5x^2 + x + 9$
This is a function of $O(x^3)$: if x grows to infinity, the term $x^3$ is what really determines the size of the outcome; the rest can be neglected
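
A quick numerical check in Python makes this concrete: dividing f(x) by x cubed gives a value that settles near the constant 2 as x grows, so the lower-order terms stop mattering. The chosen values of x are arbitrary.

```python
def f(x):
    return 2 * x**3 + 5 * x**2 + x + 9


# The ratio f(x) / x^3 approaches the leading coefficient 2 as x grows.
ratios = {x: f(x) / x**3 for x in [10, 100, 1000, 10000]}
for x, ratio in ratios.items():
    print(x, ratio)
```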

The order of time complexity
Example
To find the common elements in two ordered lists, the number of steps depends on the size of both lists: O(x + y) (linear)
If the lists are not ordered, we need to check all combinations: O(x * y) (quadratic)
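
The difference can be made visible by counting comparisons. The Python sketch below is an illustration with made-up data: `common_sorted` walks through two ordered lists in parallel, while `common_all_pairs` compares every element of one list with every element of the other.

```python
def common_sorted(xs, ys):
    """Parallel walk over two sorted lists; returns (common elements, steps)."""
    i = j = steps = 0
    common = []
    while i < len(xs) and j < len(ys):
        steps += 1
        if xs[i] == ys[j]:
            common.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return common, steps


def common_all_pairs(xs, ys):
    """Compare every element of xs with every element of ys."""
    steps = 0
    common = []
    for x in xs:
        for y in ys:
            steps += 1
            if x == y:
                common.append(x)
    return common, steps


xs = list(range(0, 2000, 2))  # 1000 even numbers
ys = list(range(0, 3000, 3))  # 1000 multiples of three

merged, merge_steps = common_sorted(xs, ys)
paired, pair_steps = common_all_pairs(xs, ys)
print(len(merged), merge_steps, pair_steps)
```

Both functions find the same common elements, but the parallel walk needs at most x + y comparisons, whereas checking all combinations needs exactly x * y of them (one million here).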

Big O notation
• Used to classify algorithms by how they respond (e.g. in their processing time or working space requirements) to changes in input
• Best case, worst case, average case?
• Big O represents the upper bound (worst case)
• Other symbols used for lower bound, tight bound, ...

Guidance/questions on the text
Write down and try to find explanations of terms you don’t know
p4 do you know the KB, MB, GB, etc. sizes?
p5 fig 1.3: look back to fig 1.1
p7 what types of linguistic preprocessing do you see in the examples in step 3?
p11/12 do you understand the algorithms?
Are you able to explain now what an inverted file is and how it is constructed?

Homework
…. is on the web site ….