CSM06 Information Retrieval, Lecture 2: Text IR part 1
Dr Andrew Salway [email protected]
Recap from Lecture 1
• IR is increasingly important for people to access documents on the web, intranets and in personal media collections
• There are important differences between IR in traditional libraries and computer-based IR, however there are similar underlying ideas, e.g. indexing and classification
• When developing or evaluating an IR system it is important to consider the kinds of information to be retrieved, the information needs of the users, and the querying skills of the users
• The performance of IR systems can be compared with Precision and Recall evaluation metrics
Recap from Lecture 1
Generic information access process
(1) Start with an information need
(2) Select a system / collections to search
(3) Formulate a query
(4) Send query
(5) Receive results (i.e. information items)
(6) Scan, evaluate, interpret results
(7) Reformulate query and go to (4) OR stop
From Baeza-Yates and Ribeiro-Neto (1999), p. 263
[Diagram: Generic IR System Architecture, showing the components User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, INDEX and Text Database]
Another book to consider…
Mark Levene (2005), An Introduction to Search Engines and Web Navigation. Addison-Wesley / Pearson Education.
– Published in September 2005, this book gives good introductions to a broad range of topics related to web search. Chapters 1, 4 and 5 are particularly relevant to CSM06.
CSM06 Webpage
• Lecture notes will normally be available by 9am, Monday (day before lecture)
• Webpage also has coursework information and links to further reading
http://portal.surrey.ac.uk/computing/resources/lm/csm06
Lecture 2: OVERVIEW
• Basic text processing techniques: tokenization, stemming and stop lists
• Postings data structures: simple inverted index and STAIRS data structure
• The challenges of synonymy and polysemy
• Boolean Model of IR
• Vector Space Model of IR: cosine distance
Idea underlying all Text IR
• Queries and Documents are treated as ‘Bags of Words’
– Keywords are used to capture the ‘aboutness’, or topic, of a document
– Keywords are used in a query to express the user’s information needs
• In simple terms, Text IR is a matter of matching the keywords of a query with the keywords of documents
1. Read the text below and decide what you think it is about: imagine you have to store the document in a library of financial news stories.
2. Choose a set of words from the text that you think represent what it is about, i.e. its ‘aboutness’.
3. Write down some examples of users’ information needs that you think would be satisfied by this text. Would the queries these users formulate be matched by the keywords you chose for the text?
Tokyo stocks ended slightly higher on Tuesday, with shares supported by a small gain in the US Nasdaq market overnight, raising hopes of a lull in the slide in equities markets.
In Japan the Nikkei 225 stock average rose 97.26 or 0.9 per cent to 10,292.95 while the broader Topix index gained 2.14 or 0.2 per cent to 1,058.12.
Technology shares were generally higher after a small rebound in Nasdaq listed shares. Kyocera was up Y380 or 4.9 per cent to Y8,090 and NEC gained Y15 or 1.2 per cent to Y1,250.
Restaurant groups were hit by the discovery of Japan's first case of mad-cow disease. Yoshinoya, the fast-food beef-bowl restaurant chain, fell Y20,000 or 9.1 per cent to Y200,000. McDonald's fell Y90 or 2.5 per cent to Y3,490. Both restaurants said they imported their beef but company shares fell on expectations that consumers would still cut down on beef consumption.
Document Preprocessing
Basic text processing techniques applied before an index is created
• Tokenize document
• Remove stop words (also known as noise words)
• Perform stemming (saves indexing space and improves recall)
Create Index: data about remaining keywords is then stored in a postings data structure
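A minimal sketch of this preprocessing pipeline (the stop list and suffix rules below are illustrative stand-ins, far smaller and cruder than what a real IR system would use):

```python
import re

# Illustrative stop list -- a real system would use a few hundred words
STOP_WORDS = {"the", "if", "and", "but", "a", "of", "to", "in", "is", "are"}

def stem(word):
    """Very naive suffix-stripping stemmer, for illustration only."""
    for suffix in ("sses", "ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    """Tokenize, remove stop words, then stem the remaining keywords."""
    tokens = re.findall(r"[a-z]+", document.lower())       # tokenization
    keywords = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in keywords]                     # stemming

print(preprocess("The drugs are tested and the testing is logged"))
# → ['drug', 'test', 'test', 'logg']
```

Note how the naive stemmer conflates ‘tested’ and ‘testing’ (helping recall) but also produces non-words like ‘logg’; the Porter algorithm discussed below handles such cases with extra conditions on the stem.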
Tokenization
• Split a document into its separate words (tokens)
• For HTML documents there is a question about what to do with the tags; they could give useful structural information, which we may want to exploit for ranking…
• More difficult to work with some formats like .ps / .pdf
Stop Lists
• Stop lists contain words that will not help to discriminate texts so they can be ignored:
– closed-class / grammatical words like the, if, and, but…
– vocabulary that is general to the domain of the text database, e.g. the word computer in a database of computer texts
Stop Lists
• Want to eliminate stopwords which are not meaningful as index terms – (though they are crucial to understanding the full meaning of a document)
• May generate a stopword list by analysing a corpus of documents: stopwords will tend to be the most frequently occurring across the corpus
• Note, pre-compiled lists are available (typically include a few hundred words)
Stemming
• Stemming is applied to reduce grammatically (morphologically) related words to the same form so they will match:
– drug, drugs, drugged → drug
Stemming
• Assumption is that word-stem contains the important ‘semantics’
• Morphology modifies the stem, according to:
– Inflectional morphology, e.g. plurals, tense, gender
– Derivational morphology, e.g. produce, product, production
• Morphology adds prefixes and suffixes, and may also change letters in the stem
• In effect a ‘stemmer’ attempts to undo the morphology
Stemming
• May define patterns of character sequences and transformations with a context-sensitive transformation grammar, e.g. a rule rewriting a stem ending in SSES to end in SS:
(.*)SSES → \1SS
Stemming
• The Porter algorithm is a widely used affix removal stemming technique
• Comprises a sequence of rules (to be applied to a word in order): each rule has conditions and removal / replacement actions, e.g. sses → ss
• You can download an implementation of the algorithm from:
www.tartarus.org/~martin/PorterStemmer
[See Kowalski and Maybury (2000) pp. 75-77 for more detail; and Appendix of Baeza-Yates and Riberio-Neto 1999]
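The pattern-replacement idea behind rules such as sses → ss can be sketched with a regular expression. This shows a single rule only; the full Porter algorithm applies many such rules in order, with extra conditions on the stem:

```python
import re

def apply_sses_rule(word):
    """One Porter-style rule: a word ending in 'sses' is rewritten to end in 'ss'.

    The regex backreference \1 plays the role of the (.*) stem in the
    transformation-grammar notation (.*)SSES -> \1SS.
    """
    return re.sub(r"(.*)sses$", r"\1ss", word)

print(apply_sses_rule("caresses"))  # caress
print(apply_sses_rule("cares"))     # cares (rule does not fire)
```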
Postings Data Structures
• Data about keywords, documents and especially the occurrences of keywords in documents, needs to be stored in an appropriate postings data structure: this is the index in an IR system
• The details of what data needs to be stored depend on the required functionality of the application
• Aim (as with all data structures) is to facilitate efficient access to and processing of stored data
• Here we consider two data structures:
– Simple inverted index
– STAIRS postings data structure
SEE: Belew (2000), pages 50-58
So, what’s wrong with this?
Doc1: car, traffic, lorry, fruit, roads…
Doc2: boat, river, traffic, vegetables…
Doc3: train, bread, railways…
…
Doc1,000,000: car, roads, traffic, delays…
Inverted Index
• The previous diagram shows that the word ‘aardvarck’ occurs in 20 different documents, with a total of 65 occurrences
• Under ‘Posting’ we see details about its occurrence in Document 5 (twice) and Document 16 (7 times) – this is just an excerpt of the postings list which would contain details of 18 more documents
Inverted Index
• For each vocabulary item there is a list of pointers to occurrences
• Speeds up searching for occurrences of words – consider what a non-inverted index would be like
• Often used for ‘full-text’ indexing, i.e. when all words in a text are used as keywords
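A minimal inverted index over the mini collection from the earlier slide can be sketched as follows (mapping each word to a postings dictionary of document id → occurrence count):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: occurrence count}.

    A fuller index (cf. the 'totdoc' field below) would also record the
    number of documents each word appears in -- here that is simply
    len(postings[word]).
    """
    postings = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word][doc_id] += 1
    return postings

docs = {1: "car traffic lorry fruit roads",
        2: "boat river traffic vegetables",
        3: "train bread railways"}
index = build_inverted_index(docs)
print(dict(index["traffic"]))  # {1: 1, 2: 1}
print(len(index["traffic"]))   # 'totdoc' for traffic: 2
```

Answering a query now means looking up each query term’s (short) postings list, rather than scanning a million documents.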
EXERCISE
• Make an inverted index for this mini collection of mini documents. IGNORE words like ‘extra, are, to, the, after, a, on, were, by, in, that, there, too, many’ – WHY?
Doc1 – “Extra police are to guard the Commons chamber after protesters entered during a debate on banning hunting”
Doc2 – “Protesters against nuclear power were arrested by police in London”.
Doc3 – “London police protested that there were too many protesters these days”
• What about compound words?
• Why store the value ‘totdoc’, i.e. the number of documents a word appears in?
• What other data about the distribution of words in documents would it be useful to store?
STAIRS Data Structure
• Stores data about location of words in text:
– facilitates proximity queries – “find documents where word X is near word Y”
– allows weighting of words in important sections at query time, e.g. give precedence to documents in which the query term appears in the title / first section
– can show excerpts of document surrounding query term
• Stores synonym relations, e.g. for query expansion
• Stores document security information, and other ‘metadata’ attributes (formatted fields)
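One STAIRS-style feature, storing word positions so that proximity queries can be answered, can be sketched as follows (field weighting, excerpts, synonym relations and metadata are omitted, and the documents are illustrative):

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> doc_id -> [positions]: the key extra data a STAIRS-style
    structure stores beyond a simple inverted index."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def near(index, x, y, k):
    """Find documents where word x occurs within k positions of word y."""
    hits = set()
    for doc_id in set(index[x]) & set(index[y]):  # docs containing both words
        if any(abs(p - q) <= k
               for p in index[x][doc_id] for q in index[y][doc_id]):
            hits.add(doc_id)
    return hits

docs = {1: "police guard the commons chamber",
        2: "protesters arrested by police in london"}
idx = build_positional_index(docs)
print(near(idx, "police", "london", 3))  # {2}
```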
Polysemy and Synonymy
• Polysemy: the same word has more than one meaning
• Synonymy: people use different words to mean the same thing
Potential Problems
Polysemy
• If I’m interested in finding documents about X and make the query “ ”
• This will match documents about… BUT, it will also match documents about…
Potential Problems
Synonymy
• If a document contains the word ‘automobile’ then maybe it should be returned when a user queries ‘car’
Use a thesaurus for query expansion: take a word in the query and add its synonyms (words with similar meanings) from a thesaurus
Or, index the documents with a controlled vocabulary, i.e. a standard language that must be used to give keywords to documents and that must be used to make queries
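Query expansion with a thesaurus can be sketched as follows. The miniature thesaurus here is hypothetical; a real system might draw synonyms from WordNet or a domain thesaurus such as MeSH:

```python
# Hypothetical miniature thesaurus, for illustration only
THESAURUS = {"car": ["automobile", "motorcar"],
             "boat": ["ship", "vessel"]}

def expand_query(terms):
    """Add each query term's synonyms, so a query for 'car' also
    matches documents that only mention 'automobile'."""
    expanded = list(terms)
    for term in terms:
        expanded.extend(THESAURUS.get(term, []))
    return expanded

print(expand_query(["car", "river"]))
# → ['car', 'river', 'automobile', 'motorcar']
```

The trade-off is the usual one: expansion improves recall for synonymy but, because added synonyms may themselves be polysemous, it can hurt precision.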
Controlled Vocabularies and Thesauri
• Controlled vocabularies and thesauri for indexing-retrieval in specialist domains:
– MeSH – Medical Subject Headings
– A&AT – Art and Architecture Thesaurus
– NASA Thesaurus – space exploration
• These give preferred and alternative / equivalent terms for concepts in the specialist domains
• Also general language lexical resources, like WordNet
• Available in machine-executable form and can be explored on-line
WordNet
• “WordNet® is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept”
• WordNet has been used as a thesaurus for various information retrieval tasks – there are mixed opinions about its efficacy for these.
You can download WordNet or use it online:
http://www.cogsci.princeton.edu/~wn/
MeSH: Medical Subject Headings
• An example of a controlled vocabulary for the medical domain: controlled vocabularies are only feasible for limited specialist domains
• “designed to help quickly locate descriptors of possible interest and to show the hierarchy in which descriptors of interest appear”
• 18,000 primary headings; 80,000 terms
www.nlm.nih.gov/mesh/meshhome.html
MeSH Heading: Diabetes Mellitus
Tree Number: C18.452.297
Tree Number: C19.246
Annotation: caused by insufficient secretion of insulin; GEN or unspecified; prefer specifics; in pregnancy = PREGNANCY IN DIABETICS but do not confuse with DIABETES, GESTATIONAL: see note there; diabetes & obesity = OBESITY IN DIABETES; PREDIABETIC STATE is also available & includes subclinical diabetes; / diet ther: consider also DIABETIC DIET but see note there; alloxan- & streptozocin-induced diabetes: see note on DIABETES MELLITUS, EXPERIMENTAL
Scope Note: A heterogeneous group of disorders that share glucose intolerance in common.
Entry Term: Diabetes
See Also: Diabetic Diet; Gastroparesis
Models for IR
• An IR Model determines how documents are matched / ranked for a query, i.e. it determines which documents a system returns as relevant
• In simple terms, Text IR is a matter of matching the keywords of a query with the keywords of documents
Boolean Model
• Widely used in early IR systems
• Documents and queries represented as sets of index terms; either a document’s set of index terms contains a query term or not (i.e. no consideration of frequency)
• Queries made by joining keywords with Boolean operators (AND / OR / NOT)
• Only documents matching the query exactly are returned
• (For more, see BY&RN pp. 25-27)
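The set-based matching of the Boolean model can be sketched with Python set operations (the document term sets are illustrative):

```python
# Each document is just a set of index terms -- frequency is ignored
docs = {1: {"car", "traffic", "roads"},
        2: {"boat", "river", "traffic"},
        3: {"train", "railways"}}

def boolean_query(predicate):
    """Return ids of documents whose term set exactly satisfies the
    Boolean predicate; there is no ranking, only match / no match."""
    return {d for d, terms in docs.items() if predicate(terms)}

# traffic AND car
print(boolean_query(lambda t: "traffic" in t and "car" in t))      # {1}
# traffic OR train
print(boolean_query(lambda t: "traffic" in t or "train" in t))     # {1, 2, 3}
# traffic AND NOT car
print(boolean_query(lambda t: "traffic" in t and "car" not in t))  # {2}
```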
Boolean Model
Advantages:
– Simple and ‘intuitive’
– Precise query semantics
Disadvantages:
– Users may misunderstand AND/OR; difficulty in expressing information need
– Exact matching means often too many or too few documents returned
Vector Model
• Documents and queries represented as vectors in the same vector space
• Dimensionality of the vector space is the number of index terms used
• Vector comprises the (weighted) frequencies of index terms
• Ranking is according to the similarity between query vector and document vectors
Underlying principle is that ‘spatial proximity’ equates to ‘semantic similarity’
Vector Model
• The cosine distance between query vector and a document vector is often used as the similarity metric
(Belew pp.95-96)
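Cosine similarity between a document vector and a query vector can be sketched as follows (the example vectors are illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-frequency vectors:
    dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc = [2, 1, 0]    # illustrative term frequencies for three index terms
query = [1, 0, 1]  # the query mentions the first and third terms
print(round(cosine_similarity(doc, query), 3))  # 0.632
```

Because the measure is normalized by vector length, a long document is not favoured merely for repeating its terms more often.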
Exercise
Frequency table for documents D1–D3 and words W1–W4:

     D1  D2  D3
W1    3   1   2
W2    2   0   0
W3    0   0   2
W4    1   1   1
Exercise
Given a query, ‘W1, W3’, which document will be ranked most highly according to the vector space model using cosine distance as the similarity metric?
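After working through the exercise by hand, one way to check your answer is to compute the ranking directly (a sketch; the vectors are the columns of the frequency table above):

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# One term-frequency vector (W1..W4) per document, read off the table
vectors = {"D1": [3, 2, 0, 1], "D2": [1, 0, 0, 1], "D3": [2, 0, 2, 1]}
query = [1, 0, 1, 0]  # the query 'W1, W3'

ranking = sorted(vectors, key=lambda d: cosine(vectors[d], query), reverse=True)
print(ranking)  # documents from most to least similar to the query
```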
Vector Model
Advantages
– Term weighting shown to help clustering
– Allows partial matching
– Returns ranked documents
– ?Maybe less sensitive to synonymy
Disadvantages
– Size of frequency table may become prohibitive
– ?Maybe more sensitive to polysemy than Boolean model
– ?Assumes mutual independence of index terms
Vector Space Model: LAB EXERCISE
• See separate Word document for instructions about VSM exercise using System Quirk and Excel
Set Reading (LECTURE 2)
Available in Library Article Collection
• Baldi, Frasconi and Smyth (2003), Modeling the Internet and the Web, pp. 77-86
• Belew 2000, Finding Out About, pp. 50-58 and 95-97.
• Weiss et al (HANDOUT Lecture 1), sections 2.3, 2.4 and 2.5
Further Reading (LECTURE 2)
• For more background to the Vector Space Model, see Belew 2000, pp. 86-94
Lecture 2: LEARNING OUTCOMES
After this lecture you should be able to:
• Describe, apply and explain the use in IR systems of basic text processing techniques, i.e. tokenization, stemming and stop lists
• Describe and explain a postings data structure, and compare the applicability of a simple inverted index and the STAIRS data structure
• Compare and contrast the Boolean Model of IR and the Vector Space Model of IR in terms of the underlying theory and applicability
• Describe and explain how the Vector Space Model is implemented, i.e. representing texts and queries as vectors and using cosine distance to rank texts
• Explain and compare how synonymy and polysemy each can affect the precision and recall of IR systems