What are we searching for? {week 9}
Rensselaer Polytechnic Institute
CSCI-4220 – Network Programming
David Goldschmidt, Ph.D.
from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0
What is search?
  What is search?
  What are we searching for?
  How many searches are processed per day?
  What is the average number of words in text-based searches?
Finding things
  Applications and varieties of search:
    Web search
    Site search
    Vertical search
    Enterprise search
    Desktop search
    As-you-type search
    Proximity search
Acquisition and indexing
User interaction and querying
Measures of success (i)
  Relevance
    Search results contain information the searcher was looking for
    Problems with vocabulary mismatch
      ▪ Homonyms (e.g. “Jersey shore”)
  User relevance
    Search results relevant to one user may be completely irrelevant to another user
Measures of success (ii)
  Precision
    Proportion of retrieved documents that are relevant
    How precise were the results?
  Recall (and coverage)
    Proportion of relevant documents that were actually retrieved
    Did we retrieve all of the relevant documents?
  http://trec.nist.gov
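As a minimal sketch of the two definitions above, precision and recall can be computed from a set of retrieved documents and a set of relevant documents. The function name and the document IDs below are made up for illustration; they are not from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Guard against empty sets so we never divide by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 documents actually relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5", "d6", "d7"]

p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were retrieved (recall 0.4).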
Measures of success (iii)
  Timeliness and freshness
    Search results contain information that is current and up-to-date
  Performance
    Users expect subsecond response times
  Media
    User devices are constantly changing (cellphones, mobile devices, tablets, etc.)
Measures of success (iv)
  Scalability
    Designs that perform equally well as the system grows and expands
      ▪ Increased number of documents, number of users, etc.
  Flexibility (or adaptability)
    Tune search engine components to keep up with changing landscape
  Spam-resistance
Information retrieval (IR)
  Gerard Salton (1927-1995)
    Pioneer in information retrieval
    Defined information retrieval as: “a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information”
    This was 1968 (before the Internet and Web!)
(Un)structured information
  Structured information:
    Often stored in a database
    Organized via predefined tables, columns, etc.
    Select all accounts with balances less than $200

    account number   balance
    7004533711       $498.19
    7004533712       $781.05
    7004533713       $147.15
    7004533714       $195.75

  Unstructured information
    Document text (headings, words, phrases)
    Images, audio, video (often relies on textual tags)
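The structured query on this slide ("select all accounts with balances less than $200") can be sketched with Python's built-in sqlite3 module. The table and column names (`accounts`, `account_number`, `balance`) are assumptions chosen for this illustration; the four rows are the ones shown on the slide.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (account_number TEXT PRIMARY KEY, balance REAL)"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [
        ("7004533711", 498.19),
        ("7004533712", 781.05),
        ("7004533713", 147.15),
        ("7004533714", 195.75),
    ],
)

# Structured data lets us express the request exactly, with no ranking needed.
low_balances = conn.execute(
    "SELECT account_number, balance FROM accounts "
    "WHERE balance < ? ORDER BY account_number",
    (200,),
).fetchall()

for account_number, balance in low_balances:
    print(account_number, f"${balance:.2f}")
```

The query returns exactly the two accounts under $200. No comparably precise query exists for unstructured text, which is why search relies on relevance ranking instead.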
Processing text
  Search and IR have largely focused on text processing and documents
  Search typically uses the statistical properties of text
    Word counts
    Word frequencies
    But typically ignores linguistic features (noun, verb, etc.)
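A minimal sketch of the statistical view of text described above: count the words and derive relative frequencies, with no attention to parts of speech. The tokenizer here (lowercase, then runs of letters, digits, and apostrophes) is a simplifying assumption, not the method used by any particular search engine.

```python
import re
from collections import Counter


def word_statistics(text):
    """Return (counts, frequencies) for the words in text."""
    # Naive tokenization: lowercase, keep runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    total = len(words)
    # Relative frequency = occurrences of a word / total number of words.
    frequencies = {word: n / total for word, n in counts.items()}
    return counts, frequencies


sample = "Search engines search documents; documents contain words."
counts, frequencies = word_statistics(sample)

for word, n in counts.most_common(3):
    print(f"{word}: count={n}, frequency={frequencies[word]:.3f}")
```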
Politeness and robots.txt
  Web crawlers adhere to a politeness policy:
    GET requests sent every few seconds or minutes
  A robots.txt file specifies what crawlers are allowed to crawl
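The robots.txt example from the original slide did not survive in this transcript. As a stand-in, here is a small sketch that checks URLs against a hypothetical robots.txt using Python's standard urllib.robotparser module. The rules, the `example.com` host, and the `ExampleCrawler` user-agent name are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (not from the slides).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
# parse() accepts the file's lines directly, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

USER_AGENT = "ExampleCrawler"  # hypothetical crawler name

allowed_home = parser.can_fetch(USER_AGENT, "http://example.com/index.html")
allowed_private = parser.can_fetch(USER_AGENT, "http://example.com/private/a.html")
delay = parser.crawl_delay(USER_AGENT)  # seconds to wait between requests

print("index.html allowed:", allowed_home)
print("private/a.html allowed:", allowed_private)
print("crawl delay (s):", delay)
```

A polite crawler would skip any URL for which `can_fetch` returns False and would wait at least `delay` seconds between GET requests to the same host. Note that `Crawl-delay` is a widely used extension rather than part of the original robots.txt convention, and not every crawler honors it.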
Sitemaps
  Default priority is 0.5
  Some URLs might not be discovered by the crawler
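The sitemap shown on the original slide is not in this transcript. The sketch below parses a small, made-up sitemap with Python's xml.etree.ElementTree and applies the default priority of 0.5 when a URL entry does not specify one. The URLs and the `read_sitemap` helper are hypothetical; the namespace is the one defined by the Sitemaps protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
DEFAULT_PRIORITY = 0.5  # used when a <url> entry has no <priority>

# Hypothetical sitemap; the second URL omits <priority>.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/archive/page1.html</loc>
  </url>
</urlset>
"""


def read_sitemap(xml_text):
    """Return a list of (url, priority) pairs from sitemap XML."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        priority_text = url.findtext("sm:priority", namespaces=SITEMAP_NS)
        priority = float(priority_text) if priority_text else DEFAULT_PRIORITY
        entries.append((loc, priority))
    return entries


entries = read_sitemap(SAMPLE_SITEMAP)
for loc, priority in entries:
    print(priority, loc)
```

Listing a page such as the archive URL in a sitemap gives the crawler a way to find it even if no crawled page links to it.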
A day in the life of a crawler
  What about checking for updated pages?
Freshness vs. age
  Freshness is essentially a Boolean value: a crawled copy either matches the live page or it does not
  Age measures the degree to which a crawled page is out of date
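One common way to make this distinction concrete is sketched below. It assumes we know when the page was crawled and when the live page last changed, both as timestamps in seconds. The function names and this particular definition of age (time elapsed since the live page changed, or zero if the copy is still current) are assumptions for illustration, not taken from the slides.

```python
def is_fresh(crawl_time, last_modified):
    """Freshness is Boolean: the copy is fresh if the page
    has not changed since it was crawled."""
    return last_modified <= crawl_time


def age(now, crawl_time, last_modified):
    """Age is a degree: 0 while the copy is fresh, otherwise the
    time that has passed since the live page changed."""
    if is_fresh(crawl_time, last_modified):
        return 0.0
    return float(now - last_modified)


# Hypothetical timeline in seconds: crawled at t=100, page changed at t=150.
print(is_fresh(crawl_time=100, last_modified=150))   # stale copy
print(age(now=200, crawl_time=100, last_modified=150))
print(age(now=500, crawl_time=100, last_modified=150))
```

Two stale copies are equally "not fresh", but their ages differ, so age gives a crawler a way to decide which page to revisit first.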