
Information Retrieval: Course Introduction

Saptarshi Ghosh, Pawan Goyal

CSE, IITKGP

January 8th, 2020

Saptarshi Ghosh, Pawan Goyal (IIT Kharagpur) Information Retrieval: Course Introduction January 8th, 2020 1 / 22


Course Info

Course Website: http://cse.iitkgp.ac.in/~pawang/courses/IR20.html

Shared with Prof. Saptarshi Ghosh

Meeting Times

Regular Hours:

Wednesday - 10:00 - 11:00 (NC - 132)

Thursday - 9:00 - 10:00 (NC - 132)

Friday - 11:00 - 12:00 (NC - 132)


Course Info

Teaching Assistants

Rajdeep Mukherjee

Shounak Paul

Avirup Mukherjee

Prajwal Singhania


Books and Materials

Reference Books

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.

Lecture Material

Additional Readings

Lecture Slides


Course Evaluation Plan: Tentative

Mid-Sem : 30%

End-Sem : 40%

Term Project: 30%


What is Information Retrieval?

Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need (usually specified using a user query) from within large collections.

What is a document?

Web pages, emails, books, news stories, scholarly papers, text messages, PowerPoint, PDF, forum postings, patents, tweets, question answer postings, etc.


Document vs. Database Records

Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes),

e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.

Easy to compare fields with well-defined semantics to queries in order to find matches


Document vs. Database Records

Example bank database query

Find records with balance > $50,000 in branches located in Amherst, MA.

Matches easily found by comparison with field values of records

Example search engine query

bank scandals in 2019 in India

This text must be compared to the text of entire news stories
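To make the contrast concrete, here is a minimal Python sketch. The record schema, the sample rows, the sample news stories, and both helper names are hypothetical, chosen only to mirror the two example queries above.

```python
from dataclasses import dataclass


@dataclass
class AccountRecord:
    # Hypothetical schema for illustration; not taken from the slides.
    account_no: str
    balance: float
    branch_city: str
    branch_state: str


def high_balance_in_amherst(records):
    """Structured query: every condition is an exact test on a named field."""
    return [
        r for r in records
        if r.balance > 50_000 and r.branch_city == "Amherst" and r.branch_state == "MA"
    ]


def naive_text_match(query, stories):
    """Unstructured query: there are no fields to test, so the query words are
    compared against the full text of each story (all words must occur)."""
    terms = query.lower().split()
    return [s for s in stories if all(t in s.lower().split() for t in terms)]


records = [
    AccountRecord("A1", 72_000.0, "Amherst", "MA"),
    AccountRecord("A2", 12_500.0, "Amherst", "MA"),
    AccountRecord("A3", 90_000.0, "Boston", "MA"),
]
stories = [
    "Several bank scandals surfaced in India in 2019",
    "A banking fraud was reported in India last year",
]

db_hits = high_balance_in_amherst(records)
text_hits = naive_text_match("bank scandals in 2019 in India", stories)
```

The field comparison is unambiguous, whereas the text match misses the second story, which a reader might well consider relevant, simply because it says "banking fraud" rather than "bank scandals".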


So, what do we do in IR?

The indexing and retrieval of textual documents.

Concerned first with retrieving documents relevant to a query.

Concerned secondly with retrieving from large sets of documents efficiently.

What is the “killer” app?

Searching for pages on the WWW
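One common way to support efficient retrieval over a large collection is an inverted index, which maps each term to the documents that contain it. The sketch below is illustrative only: the slides do not prescribe a data structure here, and the tokenizer (lowercasing plus whitespace splitting), the function names, and the sample documents are all assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def lookup_all(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


docs = [
    "web search engines index web pages",
    "email search is a retrieval task",
    "patents and scholarly papers",
]
index = build_inverted_index(docs)
```

With the index built once, a query only touches the posting lists of its own terms instead of scanning every document.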


Typical IR tasks

Given:

A corpus of textual natural-language documents.

A user query in the form of a textual string.

Find:

A ranked set of documents that are relevant to the query.
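The task statement can be read as a function from a corpus and a query string to a ranked list. The following Python sketch shows that shape under stated assumptions: the scoring function (count of distinct shared terms), the cut-off `k`, and the sample corpus are hypothetical placeholders, not the ranking model used in the course.

```python
import heapq


def overlap_score(query, doc):
    """Hypothetical scoring function: number of distinct query terms in doc."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def retrieve(corpus, query, k=3, score=overlap_score):
    """Return up to k (score, doc_id) pairs, best first, dropping zero scores."""
    scored = ((score(query, doc), doc_id) for doc_id, doc in enumerate(corpus))
    matching = [(s, doc_id) for s, doc_id in scored if s > 0]
    # Order by descending score; ties are broken by ascending document id.
    return heapq.nsmallest(k, matching, key=lambda pair: (-pair[0], pair[1]))


corpus = [
    "information retrieval course introduction",
    "database records and fields",
    "retrieval of relevant documents from large collections",
    "relevant documents for a user query",
]
ranking = retrieve(corpus, "relevant documents retrieval")
```

Passing the scoring function as a parameter keeps the ranking step separate from the notion of relevance, so a different retrieval model can be swapped in without changing `retrieve`.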


IR System

The system should be able to retrieve the relevant docs efficiently


So, what is relevance?

A relevant document contains the information that a person was looking for when they submitted the query. This may include:

Being on the proper subject.

Being timely (recent information).

Being authoritative (from a trusted source).

Satisfying the goals of the user and his/her intended use of the information (information need).


Simplest notion of Relevance from Retrieval Models’ Perspective

Keyword Search

Simplest notion of relevance is that the query string appears verbatim in the document.

Slightly less strict notion is that (most of) the words in the query appear frequently in the document, in any order (bag of words).
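The two notions above can be sketched in a few lines of Python. This is a minimal illustration, assuming lowercasing and whitespace tokenization; the function names and the sample document are hypothetical.

```python
from collections import Counter


def verbatim_match(query, doc):
    """Strict notion: the query string occurs verbatim (case-insensitive).
    Note that this is a raw substring test, so it can also match inside words."""
    return query.lower() in doc.lower()


def bag_of_words_score(query, doc):
    """Looser notion: total number of occurrences of the distinct query terms
    in the document, ignoring word order."""
    counts = Counter(doc.lower().split())
    return sum(counts[term] for term in set(query.lower().split()))


doc = "retrieval of information is the goal of information retrieval systems"

strict_in_order = verbatim_match("information retrieval", doc)
strict_reversed = verbatim_match("retrieval information", doc)
loose_in_order = bag_of_words_score("information retrieval", doc)
loose_reversed = bag_of_words_score("retrieval information", doc)
```

Reversing the word order of the query breaks the verbatim match but leaves the bag-of-words score unchanged, which is exactly the sense in which the second notion is less strict.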


Problems with Keyword Search

Term mismatch

May not retrieve relevant documents that include synonymous terms

PRC vs. China

car vs. automobile

Ambiguity

May retrieve irrelevant documents that include ambiguous terms (due to polysemy)

‘Apple’ (company vs. fruit)

‘Java’ (programming language vs. island)
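Both problems are easy to reproduce with a toy keyword matcher. In the Python sketch below, the synonym table, the sample documents, and the function names are hypothetical; the `expand` helper shows simple query expansion, a technique the slides do not name, purely to illustrate how the term mismatch could be bridged.

```python
# Hypothetical synonym table for illustration only; a real system would
# need a far richer resource than a hand-written dictionary.
SYNONYMS = {"car": {"automobile"}, "prc": {"china"}}


def keyword_hits(query, docs):
    """Return ids of documents sharing at least one term with the query."""
    terms = set(query.lower().split())
    return [i for i, d in enumerate(docs) if terms & set(d.lower().split())]


def expand(query):
    """Append the listed synonyms of each query term to the query."""
    terms = query.lower().split()
    extra = [s for t in terms for s in sorted(SYNONYMS.get(t, ()))]
    return " ".join(terms + extra)


docs = [
    "used automobile prices are rising",   # about cars, but never says "car"
    "apple releases a new phone",          # Apple the company
    "apple pie recipe with fresh fruit",   # apple the fruit
]

missed = keyword_hits("car", docs)           # term mismatch: nothing found
found = keyword_hits(expand("car"), docs)    # expanded query reaches doc 0
ambiguous = keyword_hits("apple", docs)      # polysemy: both senses returned
```

The query "car" retrieves nothing until it is expanded, while the query "apple" returns documents for both senses because keyword matching alone cannot tell them apart.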


An Intelligent IR system will

Take into account the meaning of the words used.

Adapt to the user based on direct or indirect feedback.

Take into account the importance of the page.

...


Where do we find the latest happenings in the field?

Top Conferences in the field

SIGIR

WWW

WSDM

Other Venues

ECIR

ACM Transactions on Information Systems

Springer IR Journal, JASIST, etc.


Active Areas of Research

Compiled from some recent papers at SIGIR and related conferences; indicative, not exhaustive.

What to retrieve

Leveraging User Reviews to Improve Accuracy for Mobile App Retrieval. SIGIR 2015.

On Application of Learning to Rank for E-Commerce Search. SIGIR 2017.

Concept Embedded Convolutional Semantic Model for Question Retrieval. WSDM 2017.

Multi-Stage Math Formula Search: Using Appearance-Based Similarity Metrics at Scale. SIGIR 2016.

Toward an Interactive Patent Retrieval Framework based on Distributed Representations. SIGIR 2018.

ANNE: Improving Source Code Search using Entity Retrieval Approach. WSDM 2017.

Exploiting Food Choice Biases for Healthier Recipe Recommendation. SIGIR 2017.

Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos. SIGIR 2019.

Search Experience

Engaged or Frustrated? Disambiguating Emotional State in Search. SIGIR 2017.

Between Clicks and Satisfaction: Study on Multi-Phase User Preferences and Satisfaction for Online News Reading. SIGIR 2018.

Understanding and Modeling Success in Email Search. SIGIR 2017.

Using Information Scent to Understand Mobile and Desktop Web Search Behavior. SIGIR 2017.

Personalization, Behavior, Conversation, Social, etc.

The Utility and Privacy Effects of a Click. SIGIR 2017.

Why People Search for Images using Web Search Engines. WSDM 2018.

Joint Learning of Response Ranking and Next Utterance Suggestion in Human-Computer Conversation System. SIGIR 2017.

Asking Clarifying Questions in Open-Domain Information-Seeking Conversations. SIGIR 2019.

Predicting Which Topics You Will Join in the Future on Social Media. SIGIR 2017.

What do we cover in this course

IR Basics

Boolean retrieval

The term vocabulary & postings lists

Dictionaries and tolerant retrieval

Index construction and compression

Scoring, term weighting & the vector space model

Computing scores in a complete search system

Evaluation in information retrieval

Relevance feedback & query expansion

Probabilistic information retrieval

Language models for information retrieval

Course Contents

Web Search, Applications, Recent Advances

Web Search and Applications such as Query Auto-completion

Link analysis

Summarization

Neural IR

Learning to Rank

Domain-specific IR
