basics of information retrieval lillian n. cassel some of these slides are taken or adapted from...

Basics of Information Retrieval

Lillian N. Cassel

Some of these slides are taken or adapted fromSource: http://www.stanford.edu/class/cs276/cs276-2006-syllabus.html

Basic ideas

Information overload The challenging byproduct of the information

age Huge amounts of information available -- how

to find what you need when you need it Think about addresses, e-mail messages, files

of interesting articles, etc. Information retrieval is the formal study of

efficient and effective ways to extract the right bit of information from a collection. The web is a special case, as we will discuss.

Some distinctions

Data, information, knowledge How do you distinguish among them?

http://www.systems-thinking.org/dikw/dikw.htm

Information sources Very well organized, indexed, controlled Totally unorganized, uncharacterized,

uncontrolled Something in between

http://www.systems-thinking.org/dikw/dikw.htm

Databases

Databases hold specific data items Organization is explicit Keys relate items to each other Queries are constrained, but effective in retrieving the

data that is there Databases generally respond to specific queries with

specific results Browsing is difficult Searching for items not anticipated by the designers

can be difficult

The Web

The Web contains many kinds of elements Organization? There are no keys to relate items to each other Queries are unconstrained; effectiveness depends on the

tools used. Web queries generally respond to general queries with

specific results Browsing is possible, though somewhat complicated There are no designers of the overall Web structure. Describe how you frequently use the web

What works easily? What has been difficult?

Digital Library

Something in between the very structured database and the unstructured Web.

Content is controlled. Someone makes the entries. (Maybe a lot of people make the entries, but there are rules for admission.)

Searching and browsing are somewhat open, not controlled by fixed keys and anticipated queries.

Nature of the collection regulates indexing somewhat.

In all cases Trying to connect an information user to the

specific information wanted. Concerned with efficiency and effectiveness

Effectiveness - how well did we do? Efficiency - how well did we use available

resources?

Effectiveness

Two measures: Precision

Of the results returned, what percentage are meaningful to the goal of the query?

Recall Of the materials available that match the query, what

percentage were returned? Ex. Search returns 590,000 responses and 195 are

relevant. How well did we do? Not enough information.

Did the 590,000 include all relevant responses? If so, recall is perfect.

195/590,000 is not good precision!

The process

Query entered

Query Interpreted

Items retrieved

Index searched

Results Ranked

The Collection Where does the collection come from? How is the index created? Those are important distinguishing

characteristics Inverted Index -- Ordered list of terms

related to the collected materials. Each term has an associated pointer to the related material(s). www.cs.cityu.edu.hk/~deng/5286/T51.doc

http://www.cs.cityu.edu.hk/~deng/5286/T51.doc

Crawling the web

Misnomer as the spider or robot does not actually move about the web

Program sends a normal request for the page, just as a browser would. Retrieve the page and parse it.

Look for anchors -- pointers to other pages.• Put them on a list of URLs to visit

Extract key words (possibly all words) to use as index terms related to that page

Take the next URL and do it again Actually, the crawling and processing are parallel

activities

Responding to search queries

Use the query string provided Form a boolean query

Join all words with AND? With OR? Find the related index terms Return the information available about the

pages that correspond to the query terms. Many variations on how to do this. Usually

proprietary to the company.

Making the connections

Stemming Making sure that simple variations in word form are

recognized as equivalent for the purpose of the search: exercise, exercises, exercised, for example.

Indexing A keyword or group of selected words Any word (more general) How to choose the most relevant terms to use as index

elements for a set of documents. Build an inverted file for the chosen index terms.

The Vector model

Let, N be the total number of documents in the collection ni be the number of documents which contain ki

freq(i,j) raw frequency of ki within dj

A normalized tf (term frequency) factor is given by tf(i,j) = freq(i,j) / max(freq(i,j)) where the maximum is computed over all terms which occur within

the document dj

The idf (index term frequency) factor is computed as idf(i) = log (N/ni) the log is used to make the values of tf and idf comparable. It

can also be interpreted as the amount of information associated with the term ki.

Anatomy of a web page

Metatags: Information about the page Primary source of indexing information for a search engine. Ex. Title. Never mind what has an H1 tag (though that may

be considered), what is in the <title> </title> brackets? Other tags provide information about the page. This is

easier for the search engine to use than determining the meaning of the text of the page.

Dealing with the cheaters False information provided in the web page to make the

search engine return this page False metatags, invisible words (repeated many times), etc

Standard Metatags

The Dublin Core (http://dublincore.org/)15 common items to use in labeling any web

document

Title Contributor SourceCreator Date LanguageSubject Resources type RelationDescription Format CoveragePublisher Identifier Rights

http://dublincore.org/



Hubs and authorities

Hub points to a lot of other places. CITIDEL is a hub for computing information NSDL is a hub for science, technology, engineering and

mathematics education. Authorities are pointed to by a lot of other places.

W3C.org is an authority for information about the web. When Hub or Authority status is captured, the search

can be more accurate. If several pages match a query, and one is an authority

page, it will be ranked higher. When a hub matches a query, the pages it points to are

likely to be relevant.

Some Digital Library examples

Between the chaos of the Web and the strict structure of a database, the digital library contains an organized collection.

We saw the digital collection at the Falvey library session.

See also: NSDL www.nsdl.org

And the computing component, CITIDEL: citidel.villanova.edu

American Memory http://memory.loc.gov/ammem/index.html

http://www.nsdl.org/

http://memory.loc.gov/ammem/index.html

Conclusions

The plan was to introduce the basic concepts of information retrieval in a form accessible to most students,before you have read anything about it.

We will look more deeply at these subjects in the coming weeks.

A word about the pattern for these slides …

basics of information retrieval lillian n. cassel some of these slides are taken or adapted from...

Documents

specific information

information user

specific queries

web queries

specific resultsbrowsing

anticipated queries

unstructured web

general queries