Introduction to Information Retrieval
Christof Monz and Maarten de Rijke
Spring 2002
Introduction to Information Retrieval, Spring 2002, Week 1. Copyright © Christof Monz & Maarten de Rijke
Today’s Program
• What’s Information Retrieval?
• Some administrative stuff
  ▸ Overview of the course
  ▸ Grading, homework etc.
• How to represent information
• Our first retrieval model: boolean retrieval
What is Information Retrieval?
• Finding relevant information in large collections of data
• In such a collection you may want to find:
  ▸ ‘Give me information on the history of the Kennedys’
    An article about the Kennedys (text retrieval)
  ▸ ‘What does a brain tumor look like on a CT scan?’
    A picture of a brain tumor (image retrieval)
  ▸ ‘It goes like this: hmm hmm hahmmm ...’
    A certain song (music retrieval)
Text Retrieval
• Online library catalogs (OPAC)
• Internet search engines, such as AltaVista, Google, Ilse
• Specialized systems (aka vendors):
  ▸ MEDLINE (medical articles)
  ▸ Lexis-Nexis (legal, business, academic, ...)
  ▸ Westlaw (legal articles)
  ▸ Dialog (business information)
Retrieval vs. Browsing
• Popular Web Directories:
  ▸ Yahoo!, Open Directory Project (dmoz)
• The user has to ‘guess’ the ‘right’ directories to find the information
  ▸ The user has to adapt to the designers’ conceptualization of the directory
• The goal of information retrieval is to provide immediate random access to the data
  ▸ The user can specify his information need
IR vs. Database Querying
• IR is not the same thing as querying a database
• Database querying assumes that the data is in a standardized format
• Transforming all information (news articles, web sites) into a database format is difficult, and impossible for large data collections
• Text retrieval can work with plain, unformatted data
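The contrast can be made concrete with a small sketch. The records, texts, field names and the `keyword_search` helper below are all hypothetical and chosen only for illustration. The database-style query relies on every record having the same fields, while the keyword search runs directly on plain text.

```python
import re

# Hypothetical structured records: every record must follow the same schema.
records = [
    {"title": "The Kennedys", "year": 1998, "topic": "history"},
    {"title": "CT Imaging", "year": 2001, "topic": "medicine"},
]

# Database-style query: selects on fields that must exist in every record.
db_hits = [r["title"] for r in records
           if r["topic"] == "history" and r["year"] < 2000]

# Hypothetical plain, unformatted texts: no schema at all.
texts = [
    "A short history of the Kennedy family in American politics.",
    "What a brain tumor looks like on a CT scan.",
]

def keyword_search(query, texts):
    """Return the indices of the texts that contain every query word."""
    words = query.lower().split()
    hits = []
    for i, text in enumerate(texts):
        text_words = set(re.findall(r"[a-z0-9]+", text.lower()))
        if all(w in text_words for w in words):
            hits.append(i)
    return hits

print(db_hits)
print(keyword_search("kennedy history", texts))
```

The keyword search needs no agreement on fields in advance, which is the point of the last bullet. It is only a toy matcher, not a description of how any real system works.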
Relevance as Similarity
• A fundamental idea within IR is:
  ‘A document is relevant to a query if they are similar’
• Similarity can be defined as
  ▸ string matching/comparison
  ▸ similar vocabulary
  ▸ same meaning of text
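The first two notions of similarity can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the Jaccard coefficient is only one possible way to measure vocabulary overlap, not something the slides prescribe. The third notion, same meaning of text, is not attempted here.

```python
import re

def tokens(text):
    """Lower-case the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def string_match(query, document):
    """Similarity as string matching: does the query occur verbatim?"""
    return query.lower() in document.lower()

def vocabulary_overlap(query, document):
    """Similarity as shared vocabulary, measured here with the Jaccard coefficient."""
    q, d = tokens(query), tokens(document)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)

query = "history of the Kennedys"
document = "The Kennedys: a family history"

print(string_match(query, document))
print(vocabulary_overlap(query, document))
```

In this example the query does not occur verbatim in the document, so string matching fails. The two texts still share three of their six distinct words, so the vocabulary-based measure reports a similarity of 0.5.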
The Ubiquity of IR
• Information filtering
  ▸ E-mail routing
  ▸ Text categorization
• Detecting information structure
  ▸ Hyperlink generation
  ▸ Topic/Information detection/screening
  ▸ Portal development and maintenance
• Question Answering
Some Research Groups in IR
• Industrial IR research: AT&T, NEC, Sun Microsystems, Microsoft, G&E Research, Sabir Research, NTT, AltaVista, Xerox, Q-Go, GO.com (Infoseek), Lexiquest, Answers.com, AnswerLogics, Google, Ask Jeeves, Lucent Technologies, IBM ...
• Academic IR Groups: Cornell, Massachusetts, Twente, Glasgow, Sheffield, Dortmund, Dublin, Stanford, Syracuse, Virginia Tech, Pisa ...
• Other: CIA, DARPA, ERCIM, Mitre, NIST ...
History of IR
• 1950: Calvin N. Mooers coins the term ‘Information Retrieval’
• 1959: Luhn describes statistical retrieval
• 1960: Maron and Kuhns define a probabilistic model of IR
• 1966: Cranfield project defines evaluation measures
• 1968: Gerard Salton’s first book about the SMART retrieval system
• 1972: Lockheed introduces DIALOG as a commercial online service
• Late 1980s: First PC systems incorporate retrieval
History of IR (continued)
• Early 1990s: Cheap disks lead to the information storage revolution
• 1992: Westlaw is the first large-scale information service using probabilistic retrieval
• Mid 1990s: Multi-media databases
• 1994: The internet and web explosion
• 1995: IR techniques are incorporated in all kinds of information management applications
Overview of the Course
• Basic IR models (weeks 1 & 2)
• Evaluating the quality of IR methods (week 3)
• Text representation (week 4)
• Components of an IR system (weeks 5 & 6)
• Improving effectiveness and efficiency (weeks 6 & 7)
• Web-based IR (weeks 8 & 9)
• Current research themes (week 10)
Objectives of the Course
• At the end of the course you will be able to...
  ▸ Exploit web-specific information when searching
  ▸ Understand the core components of modern IR systems
  ▸ Understand the potential of IR techniques for today’s information society
  ▸ Build your own search engine (in principle)
  ▸ Make some serious dough
Grading etc.
• Prerequisites:
  ▸ Computer literacy (including an account on gene plus the ability to use the Unix command-line interface)
• Assessment:
  ▸ Weekly reading assignments (1 or 2 papers per week)
  ▸ 3-5 assignments
  ▸ Final exam
  ▸ Final mark is the weighted average of the final exam (60%), assignments (30%) and reading (10%)
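As a quick illustration of the weighting, here is a minimal sketch. The function name and the example marks are hypothetical. It also assumes that all three components are marked on the same scale, which the slides do not state.

```python
def final_mark(exam, assignments, reading):
    """Weighted average: final exam 60%, assignments 30%, reading 10%."""
    return 0.6 * exam + 0.3 * assignments + 0.1 * reading

# Hypothetical marks, all assumed to be on the same scale.
mark = final_mark(exam=7.0, assignments=8.0, reading=9.0)
print(round(mark, 2))
```

With these example marks the result is 0.6 × 7 + 0.3 × 8 + 0.1 × 9 = 4.2 + 2.4 + 0.9 = 7.5.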
Web Site of the Course
• URL: www.science.uva.nl/~christof/courses/ir/
• Features of the web site:
  ▸ Some of the reading material is available online
  ▸ Links to universities, companies and people relevant to IR
  ▸ Printer-friendly versions of the transparencies
• Fill out the online form to be added to the mailing list for this course (important!)
Retrieval Models
• A retrieval model is an idealization or abstraction of an
actual retrieval process
• Conclusions derived from a model depend on whether
the model is a good approximation of the retrieval
situation
• Note that a retrieval model is not the same thing as a
retrieval implementation
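The difference between a model and an implementation can be illustrated with a small sketch. The assumed model is deliberately simple and is chosen only for this example: a document is retrieved if it contains every query term. The two functions below are hypothetical implementations of that one model, a linear scan and an inverted index, and they return the same answers.

```python
from collections import defaultdict

# Hypothetical toy collection, for illustration only.
docs = {
    1: "information retrieval models",
    2: "database querying",
    3: "boolean retrieval of information",
}

def search_by_scan(query, docs):
    """Implementation A: scan every document for all query terms."""
    terms = query.lower().split()
    if not terms:
        return []
    return sorted(doc_id for doc_id, text in docs.items()
                  if all(t in text.lower().split() for t in terms))

def build_index(docs):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_by_index(query, index):
    """Implementation B: intersect the document sets of the query terms."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)

index = build_index(docs)
print(search_by_scan("information retrieval", docs))
print(search_by_index("information retrieval", index))
```

Both functions realize the same model, so any conclusion drawn from the model applies to either of them. They differ only in engineering properties such as speed and memory use.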
Retrieval Models
[Figure: diagram of the retrieval process, with the elements: User, query formulation, document representations, identify relevant information, display documents to the user.]
Components of a Retrieval Model
• The user:I Search expert (e.g., librarian) vs. non-expertI Backgound of the user (knowledge of the topic)I In-depth searching vs. ‘just-wanna-get-an-idea’
searching
• The documents:I Different languagesI Semi-structured (e.g. HTML or XML) vs. plain
Introduction to Information Retrieval, Spring 2002, Week 1 Copyright c© Christof Monz & Maarten de Rijke 18
Document Representation
• Meta-descriptions
  - Field information (author, title, date)
  - Key words
    - Predefined
    - Manually extracted (by author/editor)
• Content: automatically identifying what the document
is about

Document Representation
                        Manual                      Automatic
Controlled Vocabulary   Current indexing practice   Text categorization, ‘intelligent’ IR
Free Text               Current indexing practice   Text search engines, ‘statistical’ IR

Controlled Vocabularies
• Examples are:
  - ACM Computing Classification System
    An article on Web search engines would (probably)
    be classified as H.3.5 where:
    - H: Information Systems
    - H.3: Information Storage and Retrieval
    - H.3.5: Online Information Services
  - NLM Medical Subject Headings (MeSH)
  - Yahoo!

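The hierarchical code in the ACM example can be unpacked level by level. The following is a minimal Python sketch, assuming a tiny hand-written lookup table that holds only the three labels quoted on the slide; the names `ACM_LABELS` and `expand_code` are hypothetical and not part of any official tool.

```python
# Labels copied from the slide; a real classification scheme has many more entries.
ACM_LABELS = {
    "H": "Information Systems",
    "H.3": "Information Storage and Retrieval",
    "H.3.5": "Online Information Services",
}

def expand_code(code):
    """Return (prefix, label) pairs for every known level of a dotted class code."""
    parts = code.split(".")
    prefixes = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return [(p, ACM_LABELS[p]) for p in prefixes if p in ACM_LABELS]

for prefix, label in expand_code("H.3.5"):
    print(f"{prefix}: {label}")
```

A searcher who only knows the top-level class can still reach the article, because every prefix of the code is itself a valid category.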
Manual vs. Automatic Indexing
• Pros of manual indexing:
+ Human judgements are most reliable
+ Searching controlled vocabularies is more efficient
• Cons of manual indexing:
− Time consuming
− The person using the retrieval system has to be
familiar with the classification system
− Classification systems are sometimes incoherent

Automatic Content Representation
• Using natural language understanding?
  - Computationally too expensive in real-world settings
  - Coverage
  - Language dependence
  - The resulting representations may be too explicit to
    deal with the vagueness of a user’s information need
• Alternative: a document is simply an unstructured set
of words appearing in it: bag of words

Bag-of-Words Approach
• A document is an unordered list of words
  Grammatical information is lost
• Tokenization: What is a word?
  Is ‘White House’ one or two words?
• Case folding
  ‘President Bush’ becomes ‘president’, ‘bush’
• Stemming or lemmatization
  Morphological information is thrown away
  ‘agreements’ becomes ‘agreement’ (lemmatization)
  or even ‘agree’ (stemming)

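The three steps on this slide (tokenization, case folding, stemming) can be chained into a small pipeline. This is only a sketch under stated assumptions: the regular expression is one possible answer to "what is a word?", and `naive_stem` is a toy suffix stripper written for this illustration, not the Porter stemmer or any other published algorithm. All function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Treat runs of letters/digits as words; internal hyphens and apostrophes are kept."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", text)

def case_fold(tokens):
    """Lower-case every token, so 'President' and 'president' coincide."""
    return [t.lower() for t in tokens]

def naive_stem(token):
    """Toy stemmer (an assumption for illustration): strip a few English suffixes."""
    for suffix in ("ments", "ment", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def bag_of_words(text, stem=False):
    """Count each word; word order and grammar are discarded."""
    tokens = case_fold(tokenize(text))
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return Counter(tokens)

print(case_fold(tokenize("President Bush")))  # ['president', 'bush']
print(naive_stem("agreements"))               # agree
```

With this tokenizer ‘White House’ becomes two words; treating it as one would need a phrase list or similar, which the sketch deliberately leaves out.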
Example Bag of Words
Scientists have found compelling new evidence of possible
ancient microscopic life on Mars, derived from magnetic
crystals in a meteorite that fell to Earth from the red planet,
NASA announced on Monday.
a, ancient, announced, compelling, crystals, derived, earth,
evidence, fell, found, from (2×), have, in, life, magnetic,
mars, meteorite, microscopic, monday, nasa, new, of,
on (2×), planet, possible, red, scientists, that, the, to

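The bag of words above can be reproduced mechanically. A minimal sketch, assuming that a word is simply a run of letters after lower-casing (an assumption that happens to suffice for this sentence); the variable names are illustrative only.

```python
import re
from collections import Counter

sentence = (
    "Scientists have found compelling new evidence of possible "
    "ancient microscopic life on Mars, derived from magnetic "
    "crystals in a meteorite that fell to Earth from the red planet, "
    "NASA announced on Monday."
)

# Case-fold, then keep runs of letters as words; punctuation is dropped.
bag = Counter(re.findall(r"[a-z]+", sentence.lower()))

# Print the words alphabetically, marking those that occur more than once.
print(", ".join(w if c == 1 else f"{w} ({c}×)" for w, c in sorted(bag.items())))
```

The sentence has 32 word occurrences but only 30 distinct words, because ‘from’ and ‘on’ each occur twice.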
What is this about?
?
added, al, an, and, ballots, been, completed, count,
county (2×), even, former, gore, ground, had, hand,
have (2×), he, if, in (2×), independent, lost, many,
miami-dade, might, new, not, of, president, presidential, requested,
shows, study, that, the, vice, votes, would
=
An independent study shows former Vice President Al Gore
would not have added many new votes in Miami-Dade County
and might even have lost ground in that county, if the
hand count of presidential ballots he requested had been
completed.

Boolean Retrieval
• Boolean operators are: AND (NEAR), OR, NOT
• The semantics of the Boolean operators:
  - t1 AND t2 = {d | t1 ∈ r(d)} ∩ {d | t2 ∈ r(d)}
    Documents whose representation contains t1 and t2
  - t1 OR t2 = {d | t1 ∈ r(d)} ∪ {d | t2 ∈ r(d)}
    Documents whose representation contains t1 or t2
  - NOT t1 = {d | t1 ∉ r(d)}
    Documents whose representation doesn’t contain t1

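The set definitions above translate almost literally into set operations. The following Python sketch assumes that r(d) is given as a set of terms per document; the three-document collection `R` and all function names are made up for illustration.

```python
# Hypothetical representations r(d): document id -> set of terms.
R = {
    "d1": {"bill", "clinton"},
    "d2": {"president", "bush"},
    "d3": {"bill"},
}
ALL_DOCS = set(R)

def docs_with(term):
    """{d | term ∈ r(d)}"""
    return {d for d, terms in R.items() if term in terms}

def boolean_and(a, b):
    return a & b          # set intersection

def boolean_or(a, b):
    return a | b          # set union

def boolean_not(a):
    return ALL_DOCS - a   # complement with respect to the whole collection

print(sorted(boolean_and(docs_with("bill"), docs_with("clinton"))))  # ['d1']
print(sorted(boolean_or(docs_with("bill"), docs_with("clinton"))))   # ['d1', 'd3']
print(sorted(boolean_not(docs_with("bill"))))                        # ['d2']
```

Note that NOT needs the full document collection, since the complement is only defined relative to it.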
Boolean Retrieval
• Information need: President Bill Clinton
• Boolean query: clinton AND (bill OR president)
[Figure: diagram with the labels bill, clinton, president]

Boolean Retrieval in Action
1. President George W. Bush on Tuesday makes his first address to a joint session of Congress and has promised a “to the point” speech laying out his plans for tax cuts and spending priorities.
2. While he was still president, Bill Clinton telephoned the chief executive of television network CBS seeking to help two old friends in a million-dollar billing dispute, The Wall Street Journal reported in its online edition Tuesday.
3. The White House press office did return calls seeking President Bush’s position on the bill, but Bell, from the national partnership, said she is optimistic that paid family leave will become a reality under a Republican administration.

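The query clinton AND (bill OR president) can be run against these three documents. This is a sketch, assuming each document is represented by the set of its case-folded words (runs of letters); the helper names are hypothetical, and quotation marks in the texts are simplified to plain ASCII.

```python
import re

DOCS = {
    1: "President George W. Bush on Tuesday makes his first address to a joint "
       "session of Congress and has promised a 'to the point' speech laying out "
       "his plans for tax cuts and spending priorities.",
    2: "While he was still president, Bill Clinton telephoned the chief executive "
       "of television network CBS seeking to help two old friends in a "
       "million-dollar billing dispute, The Wall Street Journal reported in its "
       "online edition Tuesday.",
    3: "The White House press office did return calls seeking President Bush's "
       "position on the bill, but Bell, from the national partnership, said she "
       "is optimistic that paid family leave will become a reality under a "
       "Republican administration.",
}

def representation(text):
    """r(d): the set of case-folded words in a document."""
    return set(re.findall(r"[a-z]+", text.lower()))

R = {doc_id: representation(text) for doc_id, text in DOCS.items()}

def docs_with(term):
    return {doc_id for doc_id, words in R.items() if term in words}

# clinton AND (bill OR president)
result = docs_with("clinton") & (docs_with("bill") | docs_with("president"))
print(sorted(result))  # [2]

# Without the clinton constraint, document 3 also matches:
# after case folding, its legislative "bill" is indistinguishable from the name "Bill".
print(sorted(docs_with("bill") & docs_with("president")))  # [2, 3]
```

The second query shows what case folding costs: the representation can no longer tell a first name from a draft law.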
Zipf’s Law
[Figure: plot of no. occurrences (y-axis) against words sorted by freq. (x-axis)]
• only a few words occur many times
• a lot of words occur only once (hapax legomena)

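Both observations can be checked on any text by counting words and sorting by frequency. Zipf's law is usually stated as: the frequency of a word is roughly inversely proportional to its rank in that sorted list. The sketch below only computes the rank-frequency list and the hapax legomena; the sample text is a made-up toy and far too small to exhibit the law itself, and the function names are hypothetical.

```python
import re
from collections import Counter

def word_counts(text):
    """Case-fold and count runs of letters."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def rank_frequency(text):
    """Words sorted by decreasing number of occurrences."""
    return word_counts(text).most_common()

def hapax_legomena(text):
    """Words that occur exactly once, in alphabetical order."""
    return sorted(w for w, c in word_counts(text).items() if c == 1)

SAMPLE = "the cat sat on the mat and the dog sat on the rug"

print(rank_frequency(SAMPLE)[0])   # ('the', 4)
print(hapax_legomena(SAMPLE))      # ['and', 'cat', 'dog', 'mat', 'rug']
```

Even in this toy sample one word dominates while five of the eight distinct words occur only once.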
Searching the Collection
• Finding a word by linear search can be inefficient
• Solution: Construct a matrix which indexes documents
and words
  - Matrix can be extremely large
  - The matrix will be sparse (Zipf’s law)
• Better: Construct an inverted index
  - A word points to the documents in which it occurs
• This is an implementational (not a modeling) issue!

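An inverted index stores, for each word, only the documents in which it occurs, so the empty cells of the sparse matrix are never materialized. A minimal in-memory sketch, assuming case-folded runs of letters as words; the toy collection and the function names are hypothetical.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

def postings(index, term):
    """Documents containing term; empty set for unseen terms (no index entry is created)."""
    return index.get(term, set())

docs = {1: "President Bush", 2: "President Bill Clinton", 3: "the bill"}
index = build_inverted_index(docs)

print(sorted(postings(index, "president")))  # [1, 2]

# clinton AND (bill OR president), answered from the index alone
hits = postings(index, "clinton") & (postings(index, "bill") | postings(index, "president"))
print(sorted(hits))  # [2]
```

The Boolean model is unchanged by this choice: the index is simply a faster way to obtain the sets {d | t ∈ r(d)}.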
Pros and Cons of Boolean Retrieval
• Pros of Boolean Retrieval:
+ Clean and simple formalism
+ Firm grip on query formulation
• Cons of Boolean Retrieval:
− Most non-experts cannot handle boolean expressions,
and query formulation may be time consuming
− No ranking of retrieved documents
− Exact matching may lead to too few or too many
retrieved documents

Alternatives to Boolean Retrieval
• Vector Space Retrieval:
  - Users can enter free text
  - Documents are ranked
  - Best match instead of exact match
• Probabilistic Retrieval

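To give a first impression of "best match instead of exact match", here is a rough sketch of ranked retrieval. It assumes cosine similarity over raw term counts, which is one common choice; the slide does not specify any weighting or similarity measure. The toy documents and function names are hypothetical.

```python
import math
import re
from collections import Counter

def vector(text):
    """Represent a text as a term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return document ids ordered from best to worst match."""
    q = vector(query)
    scored = [(cosine(q, vector(text)), doc_id) for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True)]

docs = {
    "d1": "president bill clinton",
    "d2": "president bush tax bill",
    "d3": "tax cuts",
}
print(rank("president bill clinton", docs))  # ['d1', 'd2', 'd3']
```

Unlike the Boolean query, the free-text query still returns d2, which matches only partially, but places it below the full match d1.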
What We’ve Done Today
• What’s Information Retrieval
• What’s a retrieval model
• Documents and their representations
• Boolean retrieval
• Searching words
• Discussion of boolean retrieval

Homework
• Read Chapter 1 of Salton & McGill: Introduction to
Modern Information Retrieval (library)
• Read Jansen et al.: Real life, real users, and real needs:
A study and analysis of user queries on the web
(available online)
• Give an intuitive ranking for 7 documents
For details go to:
www.science.uva.nl/~christof/courses/ir/