![Page 1: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/1.jpg)
ADVANCED TOPICS
IN INFORMATION RETRIEVAL
AND WEB SEARCH
Lecture 1:
Introduction
S. M. Vahidipour
![Page 2: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/2.jpg)
Outline
2
□ Introduction to the Course
□ Overview of the Semester
![Page 3: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/3.jpg)
Text Books
Search Engines:
Information Retrieval in Practice
W. Bruce Croft, Donald Metzler, Trevor Strohman
Pearson Education, 2010
3
![Page 4: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/4.jpg)
Text Books
Modern Information Retrieval:
The Concepts and Technology behind Search
(2nd Edition)
Ricardo Baeza-Yates, Berthier Ribeiro-Neto
ACM Press Books, 2010
4
![Page 5: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/5.jpg)
Text Books
Introduction to Information Retrieval
C. Manning, P. Raghavan, and H. Schütze
Cambridge University Press, 2008
5
![Page 6: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/6.jpg)
Search and Information Retrieval
Search on the Web is a daily activity for many people throughout the world
□ Google: 40,000 searches per second (3.5 billion per
day; 1.2 trillion per year)
□ Yahoo: 3,200 searches per second (280 million per day;
8.4 billion per month)
□ Bing: 927 searches per second (80 million per day;
2.4 billion per month)
6
10^6: million, 10^9: billion, 10^12: trillion, 10^15: quadrillion, 10^18: quintillion, …
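The per-day and per-year figures on this slide follow from the per-second rates by multiplication. A quick check, as a sketch only: the function name `per_day` is illustrative, and a 365-day year is assumed. The slide's figures are rounded versions of what this computes.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def per_day(searches_per_second: float) -> float:
    """Convert a per-second search rate into a per-day volume."""
    return searches_per_second * SECONDS_PER_DAY

# Per-second rates as quoted on the slide.
rates = {"Google": 40_000, "Yahoo": 3_200, "Bing": 927}

for engine, rate in rates.items():
    daily = per_day(rate)
    yearly = daily * 365  # assumption: 365-day year
    print(f"{engine}: {daily / 1e6:,.0f} million per day, "
          f"{yearly / 1e9:,.1f} billion per year")
```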
![Page 7: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/7.jpg)
Search and Information Retrieval
7
□ Search and communication are the most popular uses of the computer.
□ Applications involving search are everywhere.
□ The field of computer science that is most involved with R&D for search
is information retrieval (IR).
![Page 8: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/8.jpg)
Information Retrieval
“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)
□ General definition that can be applied to many types of information and search applications
□ Still appropriate after 40 years.
□ Primary focus of IR since the 1950s has been on text and documents
8
![Page 9: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/9.jpg)
Data/Information
□ Storage
□ Search
9
![Page 10: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/10.jpg)
Data/Information
□ Structured
□ Unstructured
10
![Page 11: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/11.jpg)
Structured vs. Unstructured Data
11
![Page 12: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/12.jpg)
What is a Document?
12
Examples:
Web pages, email, books, news stories, scholarly papers, text messages, Word™, PowerPoint™, PDF, forum postings, patents, IM (Instant Messages) sessions, etc.
Common properties
Significant text content
Some structure (≈ attributes in DB)
□ Papers: title, author, date
□ Email: subject, sender, destination, date
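One way to picture "significant text content plus some structure" is a record with a free-text body and a small set of named attributes. The sketch below is only an illustration: the class name `Document`, its field names, and all example values are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """A document: free text plus a few database-like attributes."""
    doc_id: str
    text: str  # the significant text content
    attributes: dict[str, str] = field(default_factory=dict)  # the structure

# Hypothetical paper: title, author, date
paper = Document(
    doc_id="paper-1",
    text="Full text of the paper goes here.",
    attributes={"title": "An Example Paper", "author": "A. Author", "date": "2010-01-01"},
)

# Hypothetical email: subject, sender, destination, date
email = Document(
    doc_id="email-1",
    text="Hi, the meeting has moved to Tuesday.",
    attributes={
        "subject": "Meeting moved",
        "sender": "alice@example.com",
        "destination": "bob@example.com",
        "date": "2010-01-02",
    },
)

print(sorted(paper.attributes))
print(sorted(email.attributes))
```

Unlike a database row, the `text` field has no fixed schema, which is what makes comparing documents harder than comparing attribute values.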
![Page 13: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/13.jpg)
Comparing Text
13
Comparing the query text to the document text and determining what is
a good match is the core issue of information retrieval.
Exact matching of words is not enough
Many different ways to write the same thing in a “natural language” like
English
Does a news story containing the text “karl benz built the first automobile in 1886” match
the query “car inventor”?
Defining the meaning of a word, a sentence, a paragraph, or a story is
more difficult than defining the meaning of a database field.
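A small sketch of why exact word matching falls short, using the news story text from this slide. The whitespace tokenizer and the function names are simplifying assumptions, and the query strings are illustrative.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return set(text.lower().split())

def exact_match_overlap(query: str, document: str) -> set[str]:
    """Return the words that query and document share exactly."""
    return tokenize(query) & tokenize(document)

story = "karl benz built the first automobile in 1886"

# A query about the inventor of the car shares no words with the story,
# even though the story answers it.
print(exact_match_overlap("car inventor", story))

# A query that happens to reuse the story's wording does match.
print(exact_match_overlap("first automobile", story))
```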
![Page 14: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/14.jpg)
Dimensions of IR
14
IR is more than just text, and more than just web search
although these are central
People doing IR work with different media, different types of search
applications, and different tasks
Three dimensions of IR
□ Content
□ Applications
□ Tasks
![Page 15: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/15.jpg)
The Content Dimension
15
Textual data, but…
New applications increasingly involve new media
□ Video, photos, music, speech
□ Scanned documents (for legal purposes)
Like text, content is difficult to describe and compare
□ Text may be used to represent them (e.g., tags)
IR approaches to search and evaluation are appropriate
![Page 16: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/16.jpg)
The Application Dimension
16
Web search
□ Most common
Vertical search
□ Restricted domain/topic
□ Books, movies, suppliers
Enterprise search
□ Corporate intranet
□ Databases, emails, web pages, documentation, code, wikis, tags, directories, presentations, spreadsheets
Desktop search
□ Personal enterprise search
□ See above plus recent web pages
P2P search
□ No centralized control
□ File sharing, shared locality
Literature search
Forum search
…
![Page 17: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/17.jpg)
The Task Dimension
17
User queries / ad-hoc search
□ Range of query enormous, not pre-specified
Filtering
□ Given a profile (interests), notify about interesting news stories
□ Identify relevant user profiles for a new document
Classification / categorization
□ Automatically assign text to one or more classes of a given set
□ Identify relevant labels for documents
Question answering
□ Similar to search
□ Automatically answer a question posed in natural language
□ Provide concrete answer, not list of documents.
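Filtering differs from ad-hoc search in that the profile is fixed and the documents arrive over time. A minimal sketch of that idea, assuming a profile is just a set of interest terms and a story is "interesting" if it shares at least `min_hits` words with the profile (both are simplifying assumptions; the example stories are made up):

```python
def matches_profile(profile: set[str], story: str, min_hits: int = 1) -> bool:
    """Return True if the story shares at least min_hits words with the profile."""
    words = set(story.lower().split())
    return len(profile & words) >= min_hits

# Hypothetical standing profile and incoming stream of news stories.
profile = {"retrieval", "search", "ranking"}
stream = [
    "new ranking method improves web search",
    "local team wins the championship",
]

# Notify the user only about stories that match the profile.
notifications = [story for story in stream if matches_profile(profile, story)]
print(notifications)
```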
![Page 18: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/18.jpg)
Main Issues in IR
18
Relevance
□ A relevant document contains the information a user was looking for when
he/she submitted the query
Evaluation
□ How well does the ranking meet the expectation of the user?
Users and information needs
□ Users of a search engine are the ultimate judges of quality
![Page 19: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/19.jpg)
IR and Search Engines
A search engine is the practical application of information retrieval
techniques to large scale text collections
Big issues include main IR issues but also some others…
Information Retrieval
● Relevance: Effective ranking
● Evaluation: Testing and measuring
● Information needs: User interaction
Search Engines (additional)
● Performance: Efficient search and indexing
● Incorporating new data: Coverage and freshness
● Scalability: Growing with data and users
● Adaptability: Tuning for applications
● Specific problems: e.g., Spam
19
![Page 20: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/20.jpg)
Outline
20
□ Introduction to the Course
□ Overview of the Semester
![Page 21: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/21.jpg)
Search Engine
Basic architecture
Main issues
Indexing
□ Text acquisition
□ Text transformation
□ Index creation
Querying
□ User interaction
□ Ranking
□ Evaluation
21
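The indexing and querying stages named on this slide can be sketched with a toy inverted index. This is an illustration under simplifying assumptions, not a description of any particular engine: text acquisition is a hard-coded dictionary, text transformation is lowercasing and whitespace splitting, and querying returns documents that contain every query term without ranking them. All function names and example documents are hypothetical.

```python
from collections import defaultdict

def transform(text: str) -> list[str]:
    """Text transformation: lowercase and split on whitespace (no stemming or stopping)."""
    return text.lower().split()

def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Index creation: map each term to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in transform(text):
            index[term].add(doc_id)
    return index

def lookup(index: dict[str, set[int]], query: str) -> set[int]:
    """Querying: return ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in transform(query)]
    return set.intersection(*postings) if postings else set()

# Text acquisition: a tiny hard-coded collection stands in for a crawler or feed.
docs = {1: "web search engines", 2: "enterprise search", 3: "digital library"}
index = build_index(docs)

print(lookup(index, "search"))
print(lookup(index, "web search"))
```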
![Page 22: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/22.jpg)
Overview of Traditional Retrieval Models
Boolean retrieval
Vector space model
Probabilistic models
22
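As a preview of the vector space model, the sketch below represents the query and each document as tf-idf weighted term vectors and ranks documents by cosine similarity. The weighting `tf * log(N / df)` is one common variant chosen here as an assumption; the function names and the three example documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs: list[str]) -> tuple[list[dict[str, float]], dict[str, float]]:
    """Build one tf-idf vector per document, plus the idf table."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query: str, docs: list[str]) -> list[tuple[float, int]]:
    """Return (score, document index) pairs, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.lower().split()).items()}
    return sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)

docs = [
    "web search engines rank web pages",
    "enterprise search covers email and wikis",
    "digital library of scholarly papers",
]
for score, i in rank("web search", docs):
    print(f"{score:.3f}  {docs[i]}")
```

Unlike Boolean retrieval, which only says whether a document matches, this produces a graded score, so a document that matches only one query term is ranked lower rather than excluded.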
![Page 23: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/23.jpg)
Overview of Evaluation Metrics
Effectiveness metrics
Efficiency metrics
Training, testing, and statistics
23
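Two of the best-known effectiveness metrics are precision and recall, often computed at a rank cutoff k, together with average precision over a whole ranking. The sketch below uses hypothetical document ids and relevance judgments; the function names are illustrative.

```python
def precision_recall(ranked: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision and recall over the top-k results of a ranking."""
    top_k = ranked[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears."""
    hits = 0
    total = 0.0
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0

# Hypothetical ranking and relevance judgments.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}

print(precision_recall(ranked, relevant, k=2))
print(average_precision(ranked, relevant))
```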
![Page 24: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/24.jpg)
Advanced Retrieval Models
24
Language model-based retrieval
Learning to rank
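Language model-based retrieval is commonly introduced through query likelihood: rank documents by the probability that the document's language model would generate the query. The sketch below assumes unigram models with Jelinek-Mercer smoothing, where `lam` is the weight given to the collection model; the function name, the value of `lam`, and the two-document collection are illustrative assumptions.

```python
import math
from collections import Counter

def query_log_likelihood(query: list[str], doc: list[str],
                         collection: list[str], lam: float = 0.5) -> float:
    """log P(query | doc) under a unigram model with Jelinek-Mercer smoothing.

    Assumes a non-empty collection (all document tokens concatenated).
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc) if doc else 0.0
        p_coll = coll_tf[term] / len(collection)
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term never seen anywhere in the collection
        score += math.log(p)
    return score

docs = [["web", "search", "engines"], ["digital", "library", "search"]]
collection = [term for doc in docs for term in doc]
query = ["web", "search"]

scores = [query_log_likelihood(query, doc, collection) for doc in docs]
print(scores)
```

Smoothing is what keeps the second document from scoring negative infinity: it does not contain "web", but the collection model still assigns that term a small probability.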
![Page 25: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/25.jpg)
Word Mismatch Problem
25
Language model-based approaches
□ Translation model
□ Topic model
□ Word cluster model
□ WordNet
□ Dependency model
Query expansion approaches
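Query expansion attacks the mismatch from the query side by adding related terms before matching. A minimal sketch, assuming a small hand-made table of related terms (the table `RELATED_TERMS`, the function name, and the example text are hypothetical; real systems might derive such terms from a thesaurus, query logs, or feedback documents):

```python
# Hand-made, hypothetical table of related terms.
RELATED_TERMS = {
    "car": {"automobile"},
    "inventor": {"built", "invented"},
}

def expand_query(query: str, related: dict[str, set[str]]) -> set[str]:
    """Return the original query terms plus any related terms from the table."""
    terms = set(query.lower().split())
    expanded = set(terms)
    for term in terms:
        expanded |= related.get(term, set())
    return expanded

document = "karl benz built the first automobile in 1886"
doc_terms = set(document.split())

original = set("car inventor".split())
expanded = expand_query("car inventor", RELATED_TERMS)

print(original & doc_terms)   # overlap before expansion
print(expanded & doc_terms)   # overlap after expansion
```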
![Page 26: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/26.jpg)
Advanced/Specific IR Tasks
26
Query log and query suggestion
Personalized search
Information extraction
Cross-language IR
Question answering
Recommendation systems
Enterprise search
Digital library
Structured text retrieval
Multimedia retrieval
![Page 27: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/27.jpg)
Query Log and Query Suggestion
27
![Page 28: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/28.jpg)
Personalized Search
28
![Page 29: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/29.jpg)
Information Extraction
29
![Page 30: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/30.jpg)
Cross-language Retrieval
30
![Page 31: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/31.jpg)
Question Answering
31
![Page 32: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/32.jpg)
Recommendation Systems
32
![Page 33: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/33.jpg)
Enterprise Search
33
![Page 34: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/34.jpg)
Digital Library
34
![Page 35: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/35.jpg)
Structured Text Retrieval
35
![Page 36: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/36.jpg)
Multimedia Retrieval
36
![Page 37: ADVANCED TOPICS IN INFORMATION RETRIEVAL AND WEB … · Classification / categorization Automatically assign text to one or more classes of a given set Identify relevant labels for](https://reader033.vdocuments.us/reader033/viewer/2022050208/5f5afb2995033d6fa41cf634/html5/thumbnails/37.jpg)
Questions?
37