CSM06 Information Retrieval, Lecture 2: Text IR part 1
Dr Andrew Salway [email protected]
Recap from Lecture 1
• IR is increasingly important for people to access documents on the web, intranets and in personal media collections
• There are important differences between IR in traditional libraries and computer-based IR, however there are similar underlying ideas, e.g. indexing and classification
• When developing or evaluating an IR system it is important to consider the kinds of information to be retrieved, the information needs of the users, and the querying skills of the users
• The performance of IR systems can be compared with Precision and Recall evaluation metrics
Recap from Lecture 1
Generic information access process
(1) Start with an information need
(2) Select a system / collections to search
(3) Formulate a query
(4) Send query
(5) Receive results (i.e. information items)
(6) Scan, evaluate, interpret results
(7) Reformulate query and go to (4) OR stop
From Baeza-Yates and Ribeiro-Neto (1999), p. 263
[Diagram: Generic IR System Architecture, showing the components User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, INDEX and Text Database]
Another book to consider…
Mark Levene (2005), An Introduction to Search Engines and Web Navigation. Addison-Wesley / Pearson Education.
– Published in September 2005, this book gives good introductions to a broad range of topics related to web search. Chapters 1, 4 and 5 are particularly relevant to CSM06.
CSM06 Webpage
• Lecture notes will normally be available by 9am, Monday (day before lecture)
• Webpage also has coursework information and links to further reading
http://portal.surrey.ac.uk/computing/resources/lm/csm06
Lecture 2: OVERVIEW
• Basic text processing techniques: tokenization, stemming and stop lists
• Postings data structures: simple inverted index and STAIRS data structure
• The challenges of synonymy and polysemy
• Boolean Model of IR
• Vector Space Model of IR: cosine distance
Idea underlying all Text IR
• Queries and Documents are treated as ‘Bags of Words’
– Keywords are used to capture the ‘aboutness’, or topic, of a document
– Keywords are used in a query to express the user’s information needs
• In simple terms, Text IR is a matter of matching the keywords of a query with the keywords of documents
1. Read the text below and decide what you think it is about: imagine you have to store the document in a library of financial news stories.
2. Choose a set of words from the text that you think represent what it is about, i.e. its ‘aboutness’.
3. Write down some examples of users’ information needs that you think would be satisfied by this text. Would the queries these users formulate be matched by the keywords you chose for the text?
Tokyo stocks ended slightly higher on Tuesday, with shares supported by a small gain in the US Nasdaq market overnight, raising hopes of a lull in the slide in equities markets.
In Japan the Nikkei 225 stock average rose 97.26 or 0.9 per cent to 10,292.95 while the broader Topix index gained 2.14 or 0.2 per cent to 1,058.12.
Technology shares were generally higher after a small rebound in Nasdaq listed shares. Kyocera was up Y380 or 4.9 per cent to Y8,090 and NEC gained Y15 or 1.2 per cent to Y1,250.
Restaurant groups were hit by the discovery of Japan's first case of mad-cow disease. Yoshinoya, the fast-food beef-bowl restaurant chain, fell Y20,000 or 9.1 per cent to Y200,000. McDonald's fell Y90 or 2.5 per cent to Y3,490. Both restaurants said they imported their beef but company shares fell on expectations that consumers would still cut down on beef consumption.
Document Preprocessing
Basic text processing techniques applied before an index is created
• Tokenize document
• Remove stop words (also known as noise words)
• Perform stemming (saves indexing space and improves recall)
Create Index: data about remaining keywords is then stored in a postings data structure
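A minimal sketch of this preprocessing pipeline (the stop list and suffix rules below are illustrative stand-ins, far smaller and cruder than what a real IR system would use):

```python
import re

# Illustrative stop list -- a real system would use a few hundred words
STOP_WORDS = {"the", "if", "and", "but", "a", "of", "to", "in", "is", "are"}

def stem(word):
    """Very naive suffix-stripping stemmer, for illustration only."""
    for suffix in ("sses", "ies", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    """Tokenize, remove stop words, then stem the remaining keywords."""
    tokens = re.findall(r"[a-z]+", document.lower())       # tokenization
    keywords = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [stem(t) for t in keywords]                     # stemming

print(preprocess("The drugs are tested and the testing is logged"))
# → ['drug', 'test', 'test', 'logg']
```

Note how the naive stemmer conflates ‘tested’ and ‘testing’ (helping recall) but also produces non-words like ‘logg’; the Porter algorithm discussed below handles such cases with extra conditions on the stem.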
Tokenization
• Split a document into its separate words (tokens)
• For HTML documents there is a question about what to do with the tags; they could give useful structural information, which we may want to exploit for ranking…
• More difficult to work with some formats like .ps / .pdf
Stop Lists
• Stop lists contain words that will not help to discriminate texts so they can be ignored:
– closed-class / grammatical words like the, if, and, but…
– vocabulary that is general to the domain of the text database, e.g. the word computer in a database of computer texts
Stop Lists
• Want to eliminate stopwords which are not meaningful as index terms – (though they are crucial to understanding the full meaning of a document)
• May generate a stopword list by analysing a corpus of documents: stopwords will tend to be the most frequently occurring across the corpus
• Note, pre-compiled lists are available (typically include a few hundred words)
Stemming
• Stemming is applied to reduce grammatically (morphologically) related words to the same form so they will match:
– drug, drugs, drugged → drug
Stemming
• Assumption is that word-stem contains the important ‘semantics’
• Morphology modifies the stem, according to:
– Inflectional morphology, e.g. plurals, tense, gender
– Derivational morphology, e.g. produce, product, production
• Morphology adds prefixes and suffixes, and may also change letters in the stem
• In effect a ‘stemmer’ attempts to undo the morphology
Stemming
• May define patterns of character sequences and transformations with a context-sensitive transformation grammar, e.g. a rule rewriting a stem ending in SSES to end in SS:
(.*)SSES → \1SS
Stemming
• The Porter algorithm is a widely used affix removal stemming technique
• Comprises a sequence of rules (to be applied to a word in order): each rule has conditions and removal / replacement actions, e.g. sses → ss
• You can download an implementation of the algorithm from:
www.tartarus.org/~martin/PorterStemmer
[See Kowalski and Maybury (2000) pp. 75-77 for more detail; and Appendix of Baeza-Yates and Riberio-Neto 1999]
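The pattern-replacement idea behind rules such as sses → ss can be sketched with a regular expression. This shows a single rule only; the full Porter algorithm applies many such rules in order, with extra conditions on the stem:

```python
import re

def apply_sses_rule(word):
    """One Porter-style rule: a word ending in 'sses' is rewritten to end in 'ss'.

    The regex backreference \1 plays the role of the (.*) stem in the
    transformation-grammar notation (.*)SSES -> \1SS.
    """
    return re.sub(r"(.*)sses$", r"\1ss", word)

print(apply_sses_rule("caresses"))  # caress
print(apply_sses_rule("cares"))     # cares (rule does not fire)
```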
Postings Data Structures
• Data about keywords, documents and especially the occurrences of keywords in documents, needs to be stored in an appropriate postings data structure: this is the index in an IR system
• The details of what data needs to be stored depend on the required functionality of the application
• Aim (as with all data structures) is to facilitate efficient access to and processing of stored data
• Here we consider two data structures:
– Simple inverted index
– STAIRS postings data structure
SEE: Belew (2000), pages 50-58
So, what’s wrong with this?
Doc1: car, traffic, lorry, fruit, roads…
Doc2: boat, river, traffic, vegetables…
Doc3: train, bread, railways…
…
Doc1,000,000: car, roads, traffic, delays…
Inverted Index
• The previous diagram shows that the word ‘aardvarck’ occurs in 20 different documents, with a total of 65 occurrences
• Under ‘Posting’ we see details about its occurrence in Document 5 (twice) and Document 16 (7 times) – this is just an excerpt of the postings list which would contain details of 18 more documents
Inverted Index
• For each vocabulary item there is a list of pointers to occurrences
• Speeds up searching for occurrences of words – consider what a non-inverted index would be like
• Often used for ‘full-text’ indexing, i.e. when all words in a text are used as keywords
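A minimal inverted index over the mini collection from the earlier slide can be sketched as follows (mapping each word to a postings dictionary of document id → occurrence count):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: occurrence count}.

    A fuller index (cf. the 'totdoc' field below) would also record the
    number of documents each word appears in -- here that is simply
    len(postings[word]).
    """
    postings = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word][doc_id] += 1
    return postings

docs = {1: "car traffic lorry fruit roads",
        2: "boat river traffic vegetables",
        3: "train bread railways"}
index = build_inverted_index(docs)
print(dict(index["traffic"]))  # {1: 1, 2: 1}
print(len(index["traffic"]))   # 'totdoc' for traffic: 2
```

Answering a query now means looking up each query term’s (short) postings list, rather than scanning a million documents.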
EXERCISE
• Make an inverted index for this mini collection of mini documents. IGNORE words like ‘extra, are, to, the, after, a, on, were, by, in, that, there, too, many’ – WHY?
Doc1 – “Extra police are to guard the Commons chamber after protesters entered during a debate on banning hunting”
Doc2 – “Protesters against nuclear power were arrested by police in London”.
Doc3 – “London police protested that there were too many protesters these days”
• What about compound words?
• Why store the value ‘totdoc’, i.e. the number of documents a word appears in?
• What other data about the distribution of words in documents would it be useful to store?
STAIRS Data Structure
• Stores data about location of words in text:
– facilitates proximity queries – “find documents where word X is near word Y”
– allows weighting of words in important sections at query time, e.g. give precedence to documents in which the query term appears in the title / first section
– can show excerpts of document surrounding query term
• Stores synonym relations, e.g. for query expansion
• Stores document security information, and other ‘metadata’ attributes (formatted fields)
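One STAIRS-style feature, storing word positions so that proximity queries can be answered, can be sketched as follows (field weighting, excerpts, synonym relations and metadata are omitted, and the documents are illustrative):

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> doc_id -> [positions]: the key extra data a STAIRS-style
    structure stores beyond a simple inverted index."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def near(index, x, y, k):
    """Find documents where word x occurs within k positions of word y."""
    hits = set()
    for doc_id in set(index[x]) & set(index[y]):  # docs containing both words
        if any(abs(p - q) <= k
               for p in index[x][doc_id] for q in index[y][doc_id]):
            hits.add(doc_id)
    return hits

docs = {1: "police guard the commons chamber",
        2: "protesters arrested by police in london"}
idx = build_positional_index(docs)
print(near(idx, "police", "london", 3))  # {2}
```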
Polysemy and Synonymy
• Polysemy: the same word has more than one meaning
• Synonymy: people use different words to mean the same thing
Potential Problems
Polysemy
• If I’m interested in finding documents about X and make the query “ ”
• This will match documents about… BUT, it will also match documents about…
Potential Problems
Synonymy
• If a document contains the word ‘automobile’ then maybe it should be returned when a user queries ‘car’
Use a thesaurus for query expansion: take a word in the query and add its synonyms (words with similar meanings) from a thesaurus
Or, index the documents with a controlled vocabulary, i.e. a standard language that must be used to give keywords to documents and that must be used to make queries
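Query expansion with a thesaurus can be sketched as follows. The miniature thesaurus here is hypothetical; a real system might draw synonyms from WordNet or a domain thesaurus such as MeSH:

```python
# Hypothetical miniature thesaurus, for illustration only
THESAURUS = {"car": ["automobile", "motorcar"],
             "boat": ["ship", "vessel"]}

def expand_query(terms):
    """Add each query term's synonyms, so a query for 'car' also
    matches documents that only mention 'automobile'."""
    expanded = list(terms)
    for term in terms:
        expanded.extend(THESAURUS.get(term, []))
    return expanded

print(expand_query(["car", "river"]))
# → ['car', 'river', 'automobile', 'motorcar']
```

The trade-off is the usual one: expansion improves recall for synonymy but, because added synonyms may themselves be polysemous, it can hurt precision.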
Controlled Vocabularies and Thesauri
• Controlled vocabularies and thesauri for indexing-retrieval in specialist domains:
– MeSH – Medical Subject Headings
– A&AT – Art and Architecture Thesaurus
– NASA Thesaurus – space exploration
• These give preferred and alternative / equivalent terms for concepts in the specialist domains
• Also general language lexical resources, like WordNet
• Available in machine-executable form and can be explored on-line
WordNet
• “WordNet® is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept”
• WordNet has been used as a thesaurus for various information retrieval tasks – there are mixed opinions about its efficacy for these.
You can download WordNet or use it online:
http://www.cogsci.princeton.edu/~wn/
MeSH: Medical Subject Headings
• An example of a controlled vocabulary for the medical domain: controlled vocabularies are only feasible for limited specialist domains
• “designed to help quickly locate descriptors of possible interest and to show the hierarchy in which descriptors of interest appear”
• 18,000 primary headings; 80,000 terms
www.nlm.nih.gov/mesh/meshhome.html
MeSH Heading: Diabetes Mellitus
Tree Number: C18.452.297
Tree Number: C19.246
Annotation: caused by insufficient secretion of insulin; GEN or unspecified; prefer specifics; in pregnancy = PREGNANCY IN DIABETICS but do not confuse with DIABETES, GESTATIONAL: see note there; diabetes & obesity = OBESITY IN DIABETES; PREDIABETIC STATE is also available & includes subclinical diabetes; / diet ther: consider also DIABETIC DIET but see note there; alloxan- & streptozocin-induced diabetes: see note on DIABETES MELLITUS, EXPERIMENTAL
Scope Note: A heterogeneous group of disorders that share glucose intolerance in common.
Entry Term: Diabetes
See Also: Diabetic Diet; Gastroparesis
Models for IR
• An IR Model determines how documents are matched / ranked for a query, i.e. it determines which documents a system returns as relevant
• In simple terms, Text IR is a matter of matching the keywords of a query with the keywords of documents
Boolean Model
• Widely used in early IR systems
• Documents and queries represented as sets of index terms; either a document’s set of index terms contains a query term or not (i.e. no consideration of frequency)
• Queries made by joining keywords with Boolean operators (AND / OR / NOT)
• Only documents matching the query exactly are returned
• (For more, see BY&RN pp. 25-27)
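The set-based matching of the Boolean model can be sketched with Python set operations (the document term sets are illustrative):

```python
# Each document is just a set of index terms -- frequency is ignored
docs = {1: {"car", "traffic", "roads"},
        2: {"boat", "river", "traffic"},
        3: {"train", "railways"}}

def boolean_query(predicate):
    """Return ids of documents whose term set exactly satisfies the
    Boolean predicate; there is no ranking, only match / no match."""
    return {d for d, terms in docs.items() if predicate(terms)}

# traffic AND car
print(boolean_query(lambda t: "traffic" in t and "car" in t))      # {1}
# traffic OR train
print(boolean_query(lambda t: "traffic" in t or "train" in t))     # {1, 2, 3}
# traffic AND NOT car
print(boolean_query(lambda t: "traffic" in t and "car" not in t))  # {2}
```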
Boolean Model
Advantages:
– Simple and ‘intuitive’
– Precise query semantics
Disadvantages:
– Users may misunderstand AND/OR; difficulty in expressing information need
– Exact matching means often too many or too few documents returned
Vector Model
• Documents and queries represented as vectors in the same vector space
• Dimensionality of the vector space is the number of index terms used
• Vector comprises the (weighted) frequencies of index terms
• Ranking is according to the similarity between query vector and document vectors
Underlying principle is that ‘spatial proximity’ equates to ‘semantic similarity’
Vector Model
• The cosine distance between query vector and a document vector is often used as the similarity metric
(Belew pp.95-96)
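Cosine similarity between a document vector and a query vector can be sketched as follows (the example vectors are illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-frequency vectors:
    dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc = [2, 1, 0]    # illustrative term frequencies for three index terms
query = [1, 0, 1]  # the query mentions the first and third terms
print(round(cosine_similarity(doc, query), 3))  # 0.632
```

Because the measure is normalized by vector length, a long document is not favoured merely for repeating its terms more often.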
Exercise
Frequency table for documents D1–D3 and words W1–W4:

     D1  D2  D3
W1    3   1   2
W2    2   0   0
W3    0   0   2
W4    1   1   1
Exercise
Given a query, ‘W1, W3’, which document will be ranked most highly according to the vector space model using cosine distance as the similarity metric?
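After working through the exercise by hand, one way to check your answer is to compute the ranking directly (a sketch; the vectors are the columns of the frequency table above):

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# One term-frequency vector (W1..W4) per document, read off the table
vectors = {"D1": [3, 2, 0, 1], "D2": [1, 0, 0, 1], "D3": [2, 0, 2, 1]}
query = [1, 0, 1, 0]  # the query 'W1, W3'

ranking = sorted(vectors, key=lambda d: cosine(vectors[d], query), reverse=True)
print(ranking)  # documents from most to least similar to the query
```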
Vector Model
Advantages
– Term weighting shown to help clustering
– Allows partial matching
– Returns ranked documents
– ?Maybe less sensitive to synonymy
Disadvantages
– Size of frequency table may become prohibitive
– ?Maybe more sensitive to polysemy than Boolean model
– ?Assumes mutual independence of index terms
Vector Space Model: LAB EXERCISE
• See separate Word document for instructions about VSM exercise using System Quirk and Excel
Set Reading (LECTURE 2)
Available in Library Article Collection
• Baldi, Frasconi and Smyth (2003), Modeling the Internet and the Web, pp. 77-86
• Belew 2000, Finding Out About, pp. 50-58 and 95-97.
• Weiss et al (HANDOUT Lecture 1), sections 2.3, 2.4 and 2.5
Further Reading (LECTURE 2)
• For more background to the Vector Space Model, see Belew 2000, pp. 86-94
Lecture 2: LEARNING OUTCOMES
After this lecture you should be able to:
• Describe, apply and explain the use in IR systems of basic text processing techniques, i.e. tokenization, stemming and stop lists
• Describe and explain a postings data structure, and compare the applicability of a simple inverted index and the STAIRS data structure
• Compare and contrast the Boolean Model of IR and the Vector Space Model of IR in terms of the underlying theory and applicability
• Describe and explain how the Vector Space Model is implemented, i.e. representing texts and queries as vectors and using cosine distance to rank texts
• Explain and compare how synonymy and polysemy each can affect the precision and recall of IR systems