What are we searching for? {week 9}
Rensselaer Polytechnic Institute
CSCI-4220 – Network Programming
David Goldschmidt, Ph.D.
from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
What is search?
What is search?
What are we searching for?
How many searches are processed per day?
What is the average number of words in text-based searches?
Finding things
Applications and varieties of search:
  Web search
  Site search
  Vertical search
  Enterprise search
  Desktop search
  As-you-type search
  Proximity search
Acquisition and indexing
User interaction and querying
Measures of success (i)
Relevance
  Search results contain information the searcher was looking for
  Problems with vocabulary mismatch
    ▪ Homonyms (e.g. “Jersey shore”)
User relevance
  Search results relevant to one user may be completely irrelevant to another user
SNOOKI
Measures of success (ii)
Precision
  Proportion of retrieved documents that are relevant
  How precise were the results?
Recall (and coverage)
  Proportion of relevant documents that were actually retrieved
  Did we retrieve all of the relevant documents?
http://trec.nist.gov
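The two definitions above can be made concrete with a small calculation. This is a minimal sketch: the function name `precision_recall` and the document IDs are made up for illustration, and relevance is assumed to be a simple yes/no judgment per document.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: IDs of the documents the search engine returned
    relevant:  IDs of the documents judged relevant (assumed known)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 documents are actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Two of the four retrieved documents are relevant (precision 2/4), and two of the three relevant documents were found (recall 2/3).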
Measures of success (iii)
Timeliness and freshness
  Search results contain information that is current and up-to-date
Performance
  Users expect subsecond response times
Media
  User devices are constantly changing (cellphones, mobile devices, tablets, etc.)
Measures of success (iv)
Scalability
  Designs that perform equally well as the system grows and expands
    ▪ Increased number of documents, number of users, etc.
Flexibility (or adaptability)
  Tune search engine components to keep up with changing landscape
Spam-resistance
Information retrieval (IR)
Gerard Salton (1927-1995)
  Pioneer in information retrieval
  Defined information retrieval as: “a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information”
  This was 1968 (before the Internet and Web!)
(Un)structured information
Structured information:
  Often stored in a database
  Organized via predefined tables, columns, etc.
  Select all accounts with balances less than $200
Unstructured information:
  Document text (headings, words, phrases)
  Images, audio, video (often relies on textual tags)

account number   balance
7004533711       $498.19
7004533712       $781.05
7004533713       $147.15
7004533714       $195.75
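The structured query mentioned above ("select all accounts with balances less than $200") can be sketched against the four rows shown. The table name `accounts`, the column names, and the helper `accounts_below` are assumptions for illustration; an in-memory SQLite database stands in for whatever database would really hold the data.

```python
import sqlite3

# The four rows from the slide; balances are stored as plain numbers.
ROWS = [
    ("7004533711", 498.19),
    ("7004533712", 781.05),
    ("7004533713", 147.15),
    ("7004533714", 195.75),
]


def accounts_below(limit):
    """Return (account_number, balance) rows with balance < limit."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        "CREATE TABLE accounts (account_number TEXT PRIMARY KEY, balance REAL)"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", ROWS)
    rows = conn.execute(
        "SELECT account_number, balance FROM accounts "
        "WHERE balance < ? ORDER BY account_number",
        (limit,),
    ).fetchall()
    conn.close()
    return rows


for number, balance in accounts_below(200):
    print(number, balance)
# 7004533713 147.15
# 7004533714 195.75
```

The predefined columns are what make this query possible; there is no equally direct way to ask the same question of free-form document text.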
Processing text
Search and IR have largely focused on text processing and documents
Search typically uses the statistical properties of text
  Word counts
  Word frequencies
  But ignores linguistic features (noun, verb, etc.)
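As a rough illustration of the statistical view of text, the sketch below computes word counts and relative word frequencies. The tokenizer (lowercase, then runs of letters, digits, and apostrophes) and the function name `word_stats` are simplifying assumptions, not the method used by any particular search engine.

```python
import re
from collections import Counter


def word_stats(text):
    """Return (counts, freqs): raw word counts and relative frequencies."""
    # Naive tokenizer (an assumption): lowercase, keep letters/digits/apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    total = len(words)
    freqs = {word: count / total for word, count in counts.items()}
    return counts, freqs


counts, freqs = word_stats(
    "Search engines search text. Text search uses word counts."
)
print(counts.most_common(2))      # [('search', 3), ('text', 2)]
print(round(freqs["search"], 3))  # 0.333
```

Nothing here knows whether "search" is used as a noun or a verb; only how often each word occurs.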
Politeness and robots.txt
Web crawlers adhere to a politeness policy:
  GET requests sent every few seconds or minutes
A robots.txt file specifies what crawlers are allowed to crawl
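A crawler can check these rules before fetching a page. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt contents, the `example.com` URLs, and the user-agent name `ExampleBot` are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is kept out of /private/ and is
# asked to wait 5 seconds between requests.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rp = build_parser(ROBOTS_TXT)
print(rp.can_fetch("ExampleBot", "https://example.com/index.html"))      # True
print(rp.can_fetch("ExampleBot", "https://example.com/private/a.html"))  # False
print(rp.crawl_delay("ExampleBot"))                                      # 5
```

A polite crawler would skip the disallowed URL and sleep for at least the crawl delay between GET requests to the same host.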
Sitemaps
default priority is 0.5
some URLs might not be discovered by crawler
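To illustrate the two notes above, the sketch below reads a small sitemap and applies the default priority of 0.5 to any URL that does not set one. The sitemap contents and `example.com` URLs are hypothetical; the second URL stands for a page that no other page links to, so a crawler could learn about it only from the sitemap.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
DEFAULT_PRIORITY = 0.5

# Hypothetical sitemap, for illustration only.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/archive/unlinked.html</loc>
  </url>
</urlset>
"""


def read_sitemap(xml_text):
    """Return a list of (url, priority) pairs from sitemap XML text."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        priority = url.findtext("sm:priority", namespaces=NS)
        # A missing <priority> element falls back to the default.
        entries.append((loc, float(priority) if priority else DEFAULT_PRIORITY))
    return entries


for loc, priority in read_sitemap(SITEMAP_XML):
    print(priority, loc)
# 0.8 https://example.com/
# 0.5 https://example.com/archive/unlinked.html
```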
A day in the life of a crawler
what about checking for updated pages?
Freshness vs. age
Freshness is essentially a Boolean value
Age measures the degree to which a crawled page is out of date
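The difference between the two measures can be sketched as follows. This assumes one common formulation: a stored copy is fresh if the live page has not changed since it was crawled, and otherwise its age is the time elapsed since the page changed. The function names and the timestamps (plain seconds) are illustrative assumptions.

```python
def is_fresh(crawled_at, last_modified):
    """Boolean freshness: True if the page has not changed since the crawl."""
    return last_modified <= crawled_at


def age(crawled_at, last_modified, now):
    """How long the stored copy has been out of date (0 if still fresh)."""
    if is_fresh(crawled_at, last_modified):
        return 0.0
    return float(now - last_modified)


# Hypothetical timeline in seconds: crawled at t=100, page changed at t=160.
print(is_fresh(100, 160))  # False
print(age(100, 160, 400))  # 240.0
print(age(100, 160, 900))  # 740.0  (same freshness, but much older)
```

Freshness gives the same answer at t=400 and t=900, while age keeps growing until the page is crawled again.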