Introduction to Information Retrieval

Christof Monz and Maarten de Rijke

Spring 2002

Introduction to Information Retrieval, Spring 2002, Week 1. Copyright © Christof Monz & Maarten de Rijke

Today’s Program

• What’s Information Retrieval?

• Some administrative stuff
  - Overview of the course
  - Grading, homework etc.

• How to represent information

• Our first retrieval model: boolean retrieval

What is Information Retrieval?

• Finding relevant information in large collections of data

• In such a collection you may want to find:
  - ‘Give me information on the history of the Kennedys’
    An article about the Kennedys (text retrieval)
  - ‘What does a brain tumor look like on a CT-scan’
    A picture of a brain tumor (image retrieval)
  - ‘It goes like this: hmm hmm hahmmm ...’
    A certain song (music retrieval)

Text Retrieval

• Online library catalogs (OPAC)

• Internet search engines, such as AltaVista, Google, Ilse

• Specialized systems (aka vendors):
  - MEDLINE (medical articles)
  - Lexis-Nexis (legal, business, academic, ...)
  - Westlaw (legal articles)
  - Dialog (business information)

Retrieval vs. Browsing

• Popular Web Directories:
  - Yahoo!, Open Directory Project (dmoz)

• The user has to ‘guess’ the ‘right’ directories to find the information
  - The user has to adapt to the designers’ conceptualization of the directory

• The goal of information retrieval is to provide immediate random access to the data
  - The user can specify his information need

IR vs. Database Querying

• IR is not the same thing as querying a database

• Database querying assumes that the data is in a standardized format

• Transforming all information, news articles, web sites into a database format is difficult, and impossible for large data collections

• Text retrieval can work with plain, unformatted data
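To make the contrast on this slide concrete, here is a minimal sketch in Python. The records, the documents, and the helper names `query_records` and `search_text` are all hypothetical and chosen only for illustration. The first function needs data with known fields; the second works directly on plain, unformatted text.

```python
import re

# Hypothetical toy data, invented for illustration only.
# Structured rows, as a database would hold them:
records = [
    {"title": "The Kennedys", "topic": "history"},
    {"title": "Brain imaging basics", "topic": "medicine"},
]

# Plain, unformatted text, as a text retrieval system would see it:
documents = [
    "A short history of the Kennedys and their role in American politics.",
    "CT scans can reveal a brain tumor as a region of abnormal density.",
]


def query_records(rows, field, value):
    """Database-style lookup: exact match on a field that must already exist."""
    return [row for row in rows if row.get(field) == value]


def search_text(docs, query):
    """Text-retrieval-style lookup: indices of documents containing every query word."""
    query_words = re.findall(r"\w+", query.lower())
    hits = []
    for index, doc in enumerate(docs):
        doc_words = set(re.findall(r"\w+", doc.lower()))
        if all(word in doc_words for word in query_words):
            hits.append(index)
    return hits
```

Calling `query_records(records, "topic", "history")` only works because someone has already filled in a `topic` field, whereas `search_text(documents, "brain tumor")` needs no preparation of the text beyond splitting it into words.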

Relevance as Similarity

• A fundamental idea within IR is:

‘A document is relevant to a query if they are similar’

• Similarity can be defined as
  - string matching/comparison
  - similar vocabulary
  - same meaning of text
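The first two notions of similarity can be sketched in a few lines of Python. This is an illustration under assumptions, not a definition from the course: the function names are hypothetical, and the Jaccard coefficient is just one possible way to measure "similar vocabulary". The third notion, same meaning of text, is not captured by anything this simple.

```python
import re


def tokens(text):
    """Lower-cased set of words in a text (a deliberately crude tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def string_match(query, document):
    """Similarity as string matching: does the query occur verbatim in the document?"""
    return query.lower() in document.lower()


def vocabulary_overlap(query, document):
    """Similarity as shared vocabulary: Jaccard coefficient of the two word sets."""
    query_words, doc_words = tokens(query), tokens(document)
    if not query_words or not doc_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words | doc_words)
```

For the query ‘history of the Kennedys’ and the document ‘An article about the Kennedys’, `string_match` fails because the exact string does not occur, while `vocabulary_overlap` still gives a non-zero score because the two texts share some words.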

The Ubiquity of IR

• Information filtering
  - E-mail routing
  - Text categorization

• Detecting information structure
  - Hyperlink generation
  - Topic/Information detection/screening
  - Portal development and maintenance

• Question Answering

Some Research Groups in IR

• Industrial IR research: AT&T, NEC, Sun Microsystems, Microsoft, G&E Research, Sabir Research, NTT, AltaVista, Xerox, Q-Go, GO.com (Infoseek), Lexiquest, Answers.com, AnswerLogics, Google, Ask-Jeeves, Lucent Technologies, IBM ...

• Academic IR Groups: Cornell, Massachusetts, Twente, Glasgow, Sheffield, Dortmund, Dublin, Stanford, Syracuse, Virginia Tech, Pisa ...

• Other: CIA, DARPA, ERCIM, Mitre, NIST ...

History of IR

• 1950: Calvin N. Moors coins the term ‘Information Retrieval’

• 1959: Luhn describes statistical retrieval

• 1960: Maron and Kuhns define a probabilistic model of IR

• 1966: Cranfield project defines evaluation measures

• 1968: Gerard Salton’s first book about the SMART retrieval

system

• 1972: Lockheed introduces DIALOG as commercial online service

Introduction to Information Retrieval, Spring 2002, Week 1 Copyright c© Christof Monz & Maarten de Rijke 10

Page 71: Introduction to Information Retrieval - Homepages of UvA ...staff.science.uva.nl/~christof/courses/ir/transparencies/w-01-prst.pdf · immediate random access to the data Introduction

History of IR

• 1950: Calvin N. Mooers coins the term ‘Information Retrieval’

• 1959: Luhn describes statistical retrieval

• 1960: Maron and Kuhns define a probabilistic model of IR

• 1966: Cranfield project defines evaluation measures

• 1968: Gerard Salton’s first book about the SMART retrieval system

• 1972: Lockheed introduces DIALOG as commercial online service

• Late 1980’s: First PC systems incorporate retrieval

• Early 1990’s: Cheap disks lead to the information storage revolution

• 1992: Westlaw is the first large-scale information service using probabilistic retrieval

• Mid 1990’s: Multi-media databases

• 1994: The internet and web explosion

• 1995: IR techniques are incorporated in all kinds of information management applications

Overview of the Course

• Basic IR models (weeks 1 & 2)

• Evaluating the quality of IR methods (week 3)

• Text representation (week 4)

• Components of an IR system (weeks 5 & 6)

• Improving effectiveness and efficiency (weeks 6 & 7)

• Web-based IR (weeks 8 & 9)

• Current research themes (week 10)

Objectives of the Course

• At the end of the course you will be able to . . .

  - Exploit web-specific information when searching

  - Understand the core components of modern IR systems

  - Understand the potential of IR techniques for today’s information society

  - Build your own search engine (in principle)

  - Make some serious dough

Grading etc.

• Prerequisites:

  - Computer literacy (including an account on gene plus the ability to use the Unix command line interface)

• Assessment:

  - Weekly reading assignments (1 or 2 papers per week)

  - (3-5) assignments

  - Final exam

  - Final mark is obtained as the weighted average of the final exam (60%), assignments (30%) and reading (10%)
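
The weighting above can be written out as a small calculation. This is only a sketch: the function name and the example marks are made up for illustration, and the slides do not say which marking scale is used.

```python
def final_mark(exam: float, assignments: float, reading: float) -> float:
    """Weighted average with the weights stated on the slide: 60% / 30% / 10%."""
    return 0.6 * exam + 0.3 * assignments + 0.1 * reading


# Hypothetical component marks; any scale works as long as all three share it.
mark = final_mark(exam=7.0, assignments=8.0, reading=9.0)
print(round(mark, 2))
```

Because the exam carries 60% of the weight, a one-point change in the exam mark moves the final mark six times as much as a one-point change in the reading mark.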

Web Site of the Course

• URL: www.science.uva.nl/~christof/courses/ir/

• Features of the web site:

  - Some of the reading material is available online

  - Links to universities, companies and people relevant to IR

  - Printer-friendly versions of the transparencies

• Fill out the online form to be added to the mailing list for this course (important!)

Retrieval Models

• A retrieval model is an idealization or abstraction of an actual retrieval process

• Conclusions derived from a model depend on whether the model is a good approximation of the retrieval situation

• Note that a retrieval model is not the same thing as a retrieval implementation
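
The difference between a model and an implementation can be illustrated with a minimal sketch. It assumes a toy model in which a document is retrieved exactly when it contains every query term; this model, the sample collection and all function names are chosen purely for illustration and are not taken from the slides. Both functions below realise the same model and differ only in how they compute the answer.

```python
from collections import defaultdict

# Toy collection; ids and texts are invented for this example.
DOCS = {
    1: "information retrieval is about finding documents",
    2: "databases offer structured data access",
    3: "retrieval of information from text documents",
}


def retrieve_by_scan(query: str, docs: dict[int, str]) -> set[int]:
    """Implementation A: check every document for all query terms."""
    terms = set(query.lower().split())
    return {doc_id for doc_id, text in docs.items()
            if terms <= set(text.lower().split())}


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of documents that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def retrieve_by_index(query: str, index: dict[str, set[int]],
                      all_ids: set[int]) -> set[int]:
    """Implementation B: intersect the document sets of the query terms."""
    result = set(all_ids)
    for term in query.lower().split():
        result &= index.get(term, set())
    return result


index = build_index(DOCS)
print(retrieve_by_scan("information retrieval", DOCS))
print(retrieve_by_index("information retrieval", index, set(DOCS)))
```

Both calls return the same set of documents, because the model fixes what counts as a match; the scan and the index are interchangeable ways of computing it with different costs.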

[Figure: the retrieval process, with the elements User, query formulation, document representations, identify relevant information, and display documents to the user]

Components of a Retrieval Model

• The user:

  - Search expert (e.g., librarian) vs. non-expert

  - Background of the user (knowledge of the topic)

  - In-depth searching vs. ‘just-wanna-get-an-idea’ searching

• The documents:

  - Different languages

  - Semi-structured (e.g. HTML or XML) vs. plain

Document Representation

• Meta-descriptions

  - Field information (author, title, date)

  - Key words

    - Predefined

    - Manually extracted (by author/editor)

• Content: automatically identifying what the document is about
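
The two kinds of representation listed above can be pictured as a single record per document. This is a rough sketch only: the class, its field names and the tokenising helper are hypothetical, and splitting on whitespace is a crude stand-in for real automatic content analysis.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentRepresentation:
    # Meta-description: field information
    author: str
    title: str
    date: str
    # Meta-description: key words (predefined or manually extracted)
    keywords: list[str] = field(default_factory=list)
    # Content: terms identified automatically from the text
    content_terms: list[str] = field(default_factory=list)


def extract_content_terms(text: str) -> list[str]:
    """Crude placeholder for content analysis: unique lowercase tokens, sorted."""
    return sorted(set(text.lower().split()))


# All values below are invented for illustration.
doc = DocumentRepresentation(
    author="A. Author",
    title="An Example Title",
    date="2002-02-05",
    keywords=["information retrieval"],
    content_terms=extract_content_terms("Search engines index text"),
)
print(doc.content_terms)
```

The meta-description fields are filled in by a person or taken from a fixed list, whereas `content_terms` is derived from the text itself without human input.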

                        Manual                       Automatic

Controlled Vocabulary   Current indexing practice    Text categorization, ‘intelligent’ IR

Free Text               Current indexing practice    Text search engines, ‘statistical’ IR

Controlled Vocabularies

• Examples are:I ACM Computing Classification System

An article on Web search engines would (probably)

be classified as H.3.5 where:

- H: Information Systems

- H.3: Information Storage and Retrieval

- H.3.5: Online Information Services

Introduction to Information Retrieval, Spring 2002, Week 1 Copyright c© Christof Monz & Maarten de Rijke 21

Page 141: Introduction to Information Retrieval - Homepages of UvA ...staff.science.uva.nl/~christof/courses/ir/transparencies/w-01-prst.pdf · immediate random access to the data Introduction

Controlled Vocabularies

• Examples are:I ACM Computing Classification System

An article on Web search engines would (probably)

be classified as H.3.5 where:

- H: Information Systems

- H.3: Information Storage and Retrieval

- H.3.5: Online Information ServicesI NLM Medical Subject Headings (MeSH)

Introduction to Information Retrieval, Spring 2002, Week 1 Copyright c© Christof Monz & Maarten de Rijke 21

Page 142: Introduction to Information Retrieval - Homepages of UvA ...staff.science.uva.nl/~christof/courses/ir/transparencies/w-01-prst.pdf · immediate random access to the data Introduction

Controlled Vocabularies

• Examples are:
  - ACM Computing Classification System
    An article on Web search engines would (probably) be classified as H.3.5 where:
      - H: Information Systems
      - H.3: Information Storage and Retrieval
      - H.3.5: Online Information Services
  - NLM Medical Subject Headings (MeSH)
  - Yahoo!
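
The hierarchical reading of a code such as H.3.5 can be sketched in a few lines of Python. This is only an illustration: the `LABELS` table holds just the three entries named on the slide, and the helper name `ancestors` is made up for this example.

```python
# Labels copied from the slide; the real ACM scheme has many more entries.
LABELS = {
    "H": "Information Systems",
    "H.3": "Information Storage and Retrieval",
    "H.3.5": "Online Information Services",
}


def ancestors(code):
    """Return the code and all of its prefixes, most general first (hypothetical helper)."""
    parts = code.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


# Walk from the top-level category down to the most specific one.
for c in ancestors("H.3.5"):
    print(f"{c}: {LABELS[c]}")
```

Because every prefix of a code is itself a category, a search for H.3 can also return documents indexed under H.3.5.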

Manual vs. Automatic Indexing

• Pros of manual indexing:
  + Human judgements are most reliable
  + Searching controlled vocabularies is more efficient

• Cons of manual indexing:
  − Time consuming
  − The person using the retrieval system has to be familiar with the classification system
  − Classification systems are sometimes incoherent

Automatic Content Representation

• Using natural language understanding?
  - Computationally too expensive in real-world settings
  - Coverage
  - Language dependence
  - The resulting representations may be too explicit to deal with the vagueness of a user’s information need

• Alternative: a document is simply an unstructured set of words appearing in it: bag of words

Bag-of-Words Approach

• A document is an unordered list of words
  Grammatical information is lost

• Tokenization: What is a word?
  Is ‘White House’ one or two words?

• Case folding
  ‘President Bush’ becomes ‘president’, ‘bush’

• Stemming or lemmatization
  Morphological information is thrown away
  ‘agreements’ becomes ‘agreement’ (lemmatization) or even ‘agree’ (stemming)
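
The normalization steps above can be sketched as follows. This is a minimal illustration under loud assumptions: the tokenizer simply splits on non-letters, and `toy_lemma` and `toy_stem` are made-up two-rule helpers, not a real lemmatizer or an established stemming algorithm.

```python
import re


def tokenize(text):
    """Keep runs of letters as tokens, so 'White House' yields two tokens."""
    return re.findall(r"[A-Za-z]+", text)


def case_fold(tokens):
    """Map every token to lower case."""
    return [t.lower() for t in tokens]


def toy_lemma(word):
    """Toy lemmatizer (assumption): only strips a plural 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def toy_stem(word):
    """Toy stemmer (assumption): strips the plural 's', then a trailing 'ment'."""
    word = toy_lemma(word)
    return word[:-4] if word.endswith("ment") and len(word) > 6 else word


print(tokenize("White House"))
print(case_fold(tokenize("President Bush")))
print(toy_lemma("agreements"), toy_stem("agreements"))
```

The two toy rules mishandle many English words (for example, `toy_lemma("glass")` loses its final letter), which is one reason practical systems rely on dictionaries or carefully designed rule sets.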

Example Bag of Words

Scientists have found compelling new evidence of possible ancient microscopic life on Mars, derived from magnetic crystals in a meteorite that fell to Earth from the red planet, NASA announced on Monday.

a, ancient, announced, compelling, crystals, derived, earth, evidence, fell, found, from (2×), have, in, life, magnetic, mars, meteorite, microscopic, monday, nasa, new, of, on (2×), planet, possible, red, scientists, that, the, to
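
The bag above can be reproduced with a short Python sketch. It assumes the simplest possible tokenization (lower-case the text, keep runs of letters) and no stemming; the function name `bag_of_words` is chosen for this example.

```python
import re
from collections import Counter

TEXT = (
    "Scientists have found compelling new evidence of possible "
    "ancient microscopic life on Mars, derived from magnetic "
    "crystals in a meteorite that fell to Earth from the red planet, "
    "NASA announced on Monday."
)


def bag_of_words(text):
    """Case-fold the text, keep alphabetic runs as tokens, and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


bag = bag_of_words(TEXT)
# Print the words alphabetically, marking those that occur more than once.
for word in sorted(bag):
    print(word if bag[word] == 1 else f"{word} ({bag[word]}x)")
```

Word order and punctuation are gone after this step; only the words and their counts remain.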

What is this about?

?

added, al, an, and, ballots, been, completed, count, county (2×), even, former, gore, ground, had, hand, have (2×), he, if, in (2×), independent, lost, many, miami-dade, might, new, not, of, president, presidential, requested, shows, study, that, the, vice, votes, would

=

An independent study shows former Vice President Al Gore would not have added many new votes in Miami-Dade County and might even have lost ground in that county, if the hand count of presidential ballots he requested had been completed.

Boolean Retrieval

• Boolean operators are: AND (NEAR), OR, NOT

• The semantics of the Boolean operators:
  - t1 AND t2 = {d | t1 ∈ r(d)} ∩ {d | t2 ∈ r(d)}
    Documents whose representation contains t1 and t2
  - t1 OR t2 = {d | t1 ∈ r(d)} ∪ {d | t2 ∈ r(d)}
    Documents whose representation contains t1 or t2
  - NOT t1 = {d | t1 ∉ r(d)}
    Documents whose representation doesn’t contain t1
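
These set definitions translate almost literally into Python. The sketch below assumes that r(d) is stored as a set of terms per document; the three documents and the helper names `docs`, `AND`, `OR`, and `NOT` are invented for illustration.

```python
# r maps each document id to its representation, a set of terms.
# The collection is hypothetical.
r = {
    "d1": {"president", "bush", "congress"},
    "d2": {"president", "bill", "clinton"},
    "d3": {"bill", "senate"},
}


def docs(term):
    """{d | term ∈ r(d)}: the documents whose representation contains the term."""
    return {d for d, terms in r.items() if term in terms}


def AND(a, b):
    """Intersection of two result sets."""
    return a & b


def OR(a, b):
    """Union of two result sets."""
    return a | b


def NOT(a):
    """Complement of a result set, taken relative to the whole collection."""
    return set(r) - a


print(AND(docs("president"), docs("bill")))
print(OR(docs("president"), docs("bill")))
print(NOT(docs("bill")))
```

The helpers operate on result sets instead of single terms, so queries nest naturally, as in `AND(docs("clinton"), OR(docs("bill"), docs("president")))`.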

Boolean Retrieval

• Information need: President Bill Clinton

• Boolean query: clinton AND (bill OR president)

[Figure: diagram of the document sets labeled bill, clinton, and president]

Boolean Retrieval in Action

1. President George W. Bush on Tuesday makes his first address to a joint session of Congress and has promised a “to the point” speech laying out his plans for tax cuts and spending priorities.

2. While he was still president, Bill Clinton telephoned the chief executive of television network CBS seeking to help two old friends in a million-dollar billing dispute, The Wall Street Journal reported in its online edition Tuesday.

3. The White House press office did return calls seeking President Bush’s position on the bill, but Bell, from the national partnership, said she is optimistic that paid family leave will become a reality under a Republican administration.
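
As a worked example, the query clinton AND (bill OR president) can be evaluated against these three documents. The sketch assumes a plain bag-of-words representation (lower-cased runs of letters, no stemming); the helper names are chosen for this example.

```python
import re

DOCS = {
    1: "President George W. Bush on Tuesday makes his first address to a joint "
       "session of Congress and has promised a 'to the point' speech laying out "
       "his plans for tax cuts and spending priorities.",
    2: "While he was still president, Bill Clinton telephoned the chief executive "
       "of television network CBS seeking to help two old friends in a "
       "million-dollar billing dispute, The Wall Street Journal reported in its "
       "online edition Tuesday.",
    3: "The White House press office did return calls seeking President Bush's "
       "position on the bill, but Bell, from the national partnership, said she "
       "is optimistic that paid family leave will become a reality under a "
       "Republican administration.",
}


def representation(text):
    """r(d): the set of case-folded words in the document."""
    return set(re.findall(r"[a-z]+", text.lower()))


r = {doc_id: representation(text) for doc_id, text in DOCS.items()}


def docs(term):
    """Documents whose representation contains the term."""
    return {d for d, terms in r.items() if term in terms}


result = docs("clinton") & (docs("bill") | docs("president"))
print(result)  # prints {2}
```

Two details are worth noticing. After case folding, document 3 matches `bill` even though it refers to a piece of legislation and not to a person, so the sub-query `bill OR president` alone matches all three documents. Without stemming, `billing` in document 2 is a different term from `bill`.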

Zipf’s Law

[Figure: number of occurrences (y-axis) plotted against words sorted by frequency (x-axis)]

• only a few words occur many times

• a lot of words occur only once (hapax legomena)
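
Both observations can be checked on any tokenized text with a rank-frequency table. The sketch below is a minimal illustration: the toy sentence is invented and far too small to show the law itself, which is commonly summarized as a word's frequency being roughly inversely proportional to its rank. The function names are chosen for this example.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    return [(rank, w, c) for rank, (w, c) in enumerate(counts.most_common(), start=1)]


def hapax_legomena(tokens):
    """Return the words that occur exactly once, in alphabetical order."""
    counts = Counter(tokens)
    return sorted(w for w, c in counts.items() if c == 1)


# Toy text (assumption); a real experiment needs a large collection.
tokens = "the cat sat on the mat and the dog sat on the log".split()

for rank, word, count in rank_frequency(tokens):
    print(rank, word, count)
print("hapax legomena:", hapax_legomena(tokens))
```

Even in this tiny sample, one word accounts for four of the thirteen tokens, while five of the eight distinct words occur only once.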

Searching the Collection

• Finding a word by linear search can be inefficient

• Solution: Construct a matrix which indexes documents

and wordsI Matrix can be extremely largeI The matrix will be sparse (Zipf’s law)

• Better: Construct an inverted index

Introduction to Information Retrieval, Spring 2002, Week 1 Copyright c© Christof Monz & Maarten de Rijke 31

Page 205: Introduction to Information Retrieval - Homepages of UvA ...staff.science.uva.nl/~christof/courses/ir/transparencies/w-01-prst.pdf · immediate random access to the data Introduction

Searching the Collection

• Finding a word by linear search can be inefficient

• Solution: Construct a matrix which indexes documents

and wordsI Matrix can be extremely largeI The matrix will be sparse (Zipf’s law)

• Better: Construct an inverted indexI A word points to the documents in which it occurs

Introduction to Information Retrieval, Spring 2002, Week 1 Copyright c© Christof Monz & Maarten de Rijke 31

Page 206: Introduction to Information Retrieval - Homepages of UvA ...staff.science.uva.nl/~christof/courses/ir/transparencies/w-01-prst.pdf · immediate random access to the data Introduction

Searching the Collection

• Finding a word by linear search can be inefficient

• Solution: Construct a matrix which indexes documents and words
  ▸ Matrix can be extremely large
  ▸ The matrix will be sparse (Zipf’s law)

• Better: Construct an inverted index
  ▸ A word points to the documents in which it occurs

• This is an implementational (not a modeling) issue!
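
The contrast between linear search and an inverted index can be sketched as follows. The toy collection, the whitespace tokenization, and the function names are assumptions for illustration only; a real system would also normalize and compress its index. The last lines compare the number of cells a full word-by-document matrix would need with the number of entries the inverted index actually stores.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> text.
DOCS = {
    1: "information retrieval finds documents",
    2: "boolean retrieval uses exact matching",
    3: "an inverted index maps words to documents",
}


def linear_search(docs, word):
    """Scan every document in turn; cost grows with the size of the collection."""
    return {doc_id for doc_id, text in docs.items() if word in text.split()}


def build_inverted_index(docs):
    """Map each word to the set of ids of the documents in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return dict(index)


index = build_inverted_index(DOCS)

# Lookup is a single dictionary access instead of a scan over all documents.
print(sorted(index["retrieval"]))
print(sorted(index["documents"]))

# Both approaches return the same set of documents.
print(linear_search(DOCS, "retrieval") == index["retrieval"])

# A full matrix needs one cell per (word, document) pair;
# the inverted index stores only the pairs that actually occur.
matrix_cells = len(index) * len(DOCS)
postings = sum(len(doc_ids) for doc_ids in index.values())
print(matrix_cells, postings)
```

The gap between the two counts grows quickly with collection size, because most words occur in only a few documents.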

Pros and Cons of Boolean Retrieval

• Pros of Boolean Retrieval:

+ Clean and simple formalism

+ Firm grip on query formulation

• Cons of Boolean Retrieval:

− Most non-experts cannot handle Boolean expressions, and query formulation may be time-consuming

− No ranking of retrieved documents

− Exact matching may lead to too few or too many retrieved documents
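
The last two drawbacks can be illustrated with a small sketch in which Boolean operators are evaluated as set operations. The postings below are invented for illustration, and the helper name `docs_with` is hypothetical.

```python
# Hypothetical postings: word -> set of ids of documents containing it.
POSTINGS = {
    "information": {1, 2, 3, 4, 5, 6},
    "retrieval": {2, 3, 5, 6},
    "zipf": {5},
    "boolean": {3, 7},
}
ALL_DOCS = {1, 2, 3, 4, 5, 6, 7}


def docs_with(word):
    """Documents containing the word (empty set for unknown words)."""
    return POSTINGS.get(word, set())


# AND is set intersection, OR is set union, NOT is set complement.
narrow = docs_with("retrieval") & docs_with("zipf") & docs_with("boolean")
broad = docs_with("information") | docs_with("retrieval") | docs_with("boolean")
negated = docs_with("retrieval") & (ALL_DOCS - docs_with("boolean"))

print(sorted(narrow))   # conjunction of several terms: may match too few documents
print(sorted(broad))    # disjunction: may match too many documents
print(sorted(negated))  # retrieval AND NOT boolean
```

In each case the result is a plain set: a document either satisfies the query or it does not, so there is no basis for ordering the matches.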

Alternatives to Boolean Retrieval

• Vector Space Retrieval:
  ▸ Users can enter free text
  ▸ Documents are ranked
  ▸ Best match instead of exact match

• Probabilistic Retrieval
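
As a rough preview of the vector space idea, the sketch below represents the query and each document as a vector of term counts and ranks the documents by cosine similarity. The slides do not specify a weighting scheme or similarity measure, so raw term counts and cosine are assumptions here, as are the toy documents and function names.

```python
import math
from collections import Counter

# Hypothetical toy collection (not from the lecture).
DOCS = {
    "d1": "boolean retrieval uses exact matching",
    "d2": "vector space retrieval ranks documents",
    "d3": "zipf law describes word frequencies",
}


def to_vector(text):
    """Represent a text as a sparse vector of raw term counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v[term] for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing similarity to the query."""
    q = to_vector(query)
    scores = [(doc_id, cosine(q, to_vector(text))) for doc_id, text in docs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Free-text query: no Boolean operators needed.
results = rank("ranked retrieval of documents", DOCS)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

A document that shares only some of the query terms still receives a non-zero score and a place in the ranking, which is the sense in which this is best match rather than exact match.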

What We’ve Done Today

• What’s Information Retrieval

• What’s a retrieval model

• Documents and their representations

• Boolean retrieval

• Searching words

• Discussion of boolean retrieval

Homework

• Read Chapter 1 of Salton & McGill: Introduction to Modern Information Retrieval (library)

• Read Jansen et al.: Real life, real users, and real needs: A study and analysis of user queries on the web (available online)

• Give an intuitive ranking for 7 documents

For details go to:

www.science.uva.nl/~christof/courses/ir/
