![Page 1: I.R. Intro1 Dave Inman South Bank University, London SE1 0AA, UK +44 (0)20 7815 7446 dave@sbu.ac.uk Introduction to Information Retrieval An overview of](https://reader036.vdocuments.us/reader036/viewer/2022062417/551b3036550346d41a8b4e01/html5/thumbnails/1.jpg)
I.R. Intro 1
Dave Inman
South Bank University,
London SE1 0AA, UK
+44 (0)20 7815 7446
Introduction to Information Retrieval
An overview of the topics to choose for your CW
Problems in I.R.
Searching for information in a vast unstructured digital world can be tricky. We have probably all found two problems:
Too many hits: You are looking at the first page of hundreds of hits. Many are duplicates. None seem relevant.
Too few hits: You enter your key terms and get back nothing relevant, or nothing at all. Rarer probably, but still a nuisance.
Scope
I.R. is about pointing the user to sources of information
Not about processing the sources for the user.
Not database retrieval
Sometimes the sources must be processed in order to see if they are a hit or not.
Job is done when user is shown relevant sources
Objectives
Retrieve ALL relevant documents
Retrieve NO irrelevant documents
Show the results sensibly
Basic questions
1. What problems have you had in I.R.?
2. What is a document?
3. What is a relevant document?
4. What is sensible output?
5. How might you measure the performance of an I.R. system?
What is a document?
Measures of effectiveness
[Diagram: within All Docs, the set of Relevant Docs (R) and the set of Hits (H); Relevant Hits (RH) are the documents in both]
Measures of effectiveness
Precision = RH / H
0 if no hits relevant
1 if all hits relevant
Recall = RH / R
0 if no relevant docs found
1 if all relevant docs found
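The two ratios above can be computed directly from sets of document identifiers. This is a minimal sketch: the function name, the document ids and the choice to return 0 when H or R is empty are assumptions made for illustration, not something the slides specify.

```python
def precision_recall(relevant, hits):
    """Return (precision, recall) for relevant docs R and retrieved hits H."""
    relevant, hits = set(relevant), set(hits)
    relevant_hits = relevant & hits  # RH: documents that are both relevant and retrieved
    # Assumption: report 0.0 rather than raise when a denominator is empty.
    precision = len(relevant_hits) / len(hits) if hits else 0.0
    recall = len(relevant_hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection: 4 relevant docs, 5 hits, 2 of the hits are relevant.
R = {"d1", "d2", "d3", "d4"}
H = {"d3", "d4", "d7", "d8", "d9"}
p, r = precision_recall(R, H)
print(p, r)
```

Here precision is 2 of 5 hits and recall is 2 of 4 relevant documents, which shows how the two measures can differ for the same result set.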
Techniques: Simple binary match
[Diagram: query terms q1, q2 matched against documents d1, d2]
Document d is a hit if it contains q1 or q2 ...
What is wrong with this?
How could you do better?
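To make the questions above concrete, here is a small sketch of the simple binary (OR) match. The function name and the whitespace tokenisation are assumptions for illustration; the three titles are taken from the example document space later in these slides.

```python
def binary_match(query_terms, documents):
    """Return ids of documents that contain at least one query term."""
    query = {term.lower() for term in query_terms}
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())  # naive tokenisation on whitespace
        if query & words:                  # hit if any query term is present
            hits.append(doc_id)
    return hits

docs = {
    "C1": "Human machine interface for ABC computer applications",
    "C3": "The EPS user interface management system",
    "M1": "The generation of random binary ordered trees",
}
hits = binary_match(["the", "interface"], docs)
print(hits)
```

With this query every title is returned, including the maths title M1, which matches only on "the". All hits are treated as equally good, so there is no way to rank them.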
Techniques: Better binary match
Remove stop words
Weighting by where match occurs
  Heading? Title? Sentence subject?
Weighting by number of matches
  More matches = more relevant?
Boolean queries
  Allow user to specify AND, OR etc.
  Maybe also the distance each operator applies over?
  (e.g. sentence, paragraph, document)
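The ideas on this slide can be sketched in a few lines. Everything named here (the stop-word list, the `title_weight` value, the function names and the body text) is an assumption for illustration; real systems use much larger stop lists and tuned weights.

```python
STOP_WORDS = {"a", "and", "for", "of", "the", "to"}  # tiny illustrative list

def tokens(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def score(query, title, body, title_weight=2.0):
    """Count matches, weighting a match in the title more than one in the body."""
    q = set(tokens(query))
    title_matches = sum(1 for w in tokens(title) if w in q)
    body_matches = sum(1 for w in tokens(body) if w in q)
    return title_weight * title_matches + body_matches

def matches_all(query, text):
    """Boolean AND: every non-stop query term occurs in the text."""
    return set(tokens(query)) <= set(tokens(text))

def matches_any(query, text):
    """Boolean OR: at least one non-stop query term occurs in the text."""
    return bool(set(tokens(query)) & set(tokens(text)))

title = "The EPS user interface management system"
body = "A survey of user opinion of the interface"  # hypothetical body text
print(score("the user interface", title, body))
print(matches_all("user trees", title), matches_any("user trees", title))
```

Restricting the text passed to `matches_all` to one sentence or one paragraph would be one way to express the distance idea from the last bullet.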
Better binary match: Significant words
Better binary match: Inverted index
Term = word
IDF = inverse document frequency
    = inverse of the freq. of term across all docs
DOC = document identifier
TF = term frequency (occurrences of the term in DOC)
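A minimal sketch of such an index follows, keyed by term and storing a TF per DOC. The function names are assumptions, and `log(N / df)` is only one common way to define IDF; the slide does not say which variant is intended.

```python
import math
from collections import Counter, defaultdict

def build_index(documents):
    """Map each term to {doc_id: TF}, i.e. the postings for that term."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index

def idf(index, term, n_docs):
    """Assumed IDF variant: log(N / df), where df = number of docs containing the term."""
    df = len(index.get(term, {}))
    return math.log(n_docs / df) if df else 0.0

docs = {
    "C3": "the eps user interface management system",
    "C4": "system and human system engineering testing of eps",
    "M1": "the generation of random binary ordered trees",
}
index = build_index(docs)
print(index["system"])           # postings: which docs contain the term, and how often
print(idf(index, "trees", len(docs)))
```

Looking up a query term is then a single dictionary access, rather than a scan of every document.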
Vector matching I.R.
Documents represented as a vector
Size of vector is usually the number of distinct terms in the document space
Queries represented as a pseudo document vector (usually very sparse)
Matching by dot product or cosine similarity
Vector matching I.R.
t = term
d = document
q = query
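With t, d and q as defined here, the dot product and cosine similarity are conventionally written as below. The symbols w_{t,d} and w_{t,q} (the weight of term t in the document and in the query) are assumed notation, not taken from the slide.

```latex
\mathrm{dot}(d, q) = \sum_{t} w_{t,d}\, w_{t,q}
\qquad
\cos(d, q) = \frac{\sum_{t} w_{t,d}\, w_{t,q}}
                  {\sqrt{\sum_{t} w_{t,d}^{2}}\;\sqrt{\sum_{t} w_{t,q}^{2}}}
```

The cosine form divides out vector length, so a long document does not score highly merely because it contains more terms.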
An example document space: Computing titles
C1: Human machine interface for ABC computer applications
C2: A survey of user opinion of computer system response times
C3: The EPS user interface management system
C4: System and human system engineering testing of EPS
C5: Relation of user perceived response time to error measurement
An example document space: Maths titles
M1: The generation of random binary ordered trees
M2: The intersection graph of paths in trees
M3: Graph minors IV: Widths of trees and well quasi ordering
M4: Graph minors: A survey
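Representing the document space
           C1  C2  C3  C4  C5  M1  M2  M3  M4
human       1   0   0   1   0   0   0   0   0
interface   1   0   1   0   0   0   0   0   0
computer    1   1   0   0   0   0   0   0   0
user        0   1   1   0   1   0   0   0   0
system      0   1   1   2   0   0   0   0   0
response    0   1   0   0   1   0   0   0   0
time        0   1   0   0   1   0   0   0   0
EPS         0   0   1   1   0   0   0   0   0
survey      0   1   0   0   0   0   0   0   1
trees       0   0   0   0   0   1   1   1   0
graph       0   0   0   0   0   0   1   1   1
minors      0   0   0   0   0   0   0   1   1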
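Vector matching
           C1  C2  C3  C4  C5  M1  M2  M3  M4  Query
human       1   0   0   1   0   0   0   0   0   0
interface   1   0   1   0   0   0   0   0   0   0
computer    1   1   0   0   0   0   0   0   0   0
user        0   1   1   0   1   0   0   0   0   1
system      0   1   1   2   0   0   0   0   0   1
response    0   1   0   0   1   0   0   0   0   0
time        0   1   0   0   1   0   0   0   0   0
EPS         0   0   1   1   0   0   0   0   0   0
survey      0   1   0   0   0   0   0   0   1   0
trees       0   0   0   0   0   1   1   1   0   0
graph       0   0   0   0   0   0   1   1   1   0
minors      0   0   0   0   0   0   0   1   1   0
Score       0   2   2   2   1   0   0   0   0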
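The Score row is the dot product of each document column with the Query column. The sketch below reproduces it, with term counts read from the nine example titles and the query taken as "user system". The variable and function names are assumptions, and the cosine scores are an addition for comparison rather than something shown on the slide.

```python
import math

TERMS = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]

# Term counts per title in the example document space (absent terms count as 0).
DOCS = {
    "C1": {"human": 1, "interface": 1, "computer": 1},
    "C2": {"computer": 1, "user": 1, "system": 1, "response": 1, "time": 1, "survey": 1},
    "C3": {"interface": 1, "user": 1, "system": 1, "EPS": 1},
    "C4": {"human": 1, "system": 2, "EPS": 1},
    "C5": {"user": 1, "response": 1, "time": 1},
    "M1": {"trees": 1},
    "M2": {"trees": 1, "graph": 1},
    "M3": {"trees": 1, "graph": 1, "minors": 1},
    "M4": {"survey": 1, "graph": 1, "minors": 1},
}
QUERY = {"user": 1, "system": 1}

def vector(counts):
    """Expand a sparse {term: count} mapping into a dense vector over TERMS."""
    return [counts.get(term, 0) for term in TERMS]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0

q = vector(QUERY)
dot_scores = {doc: dot(vector(counts), q) for doc, counts in DOCS.items()}
cos_scores = {doc: cosine(vector(counts), q) for doc, counts in DOCS.items()}
print(dot_scores)
print(max(cos_scores, key=cos_scores.get))
```

Under the dot product C2, C3 and C4 tie on 2. Cosine similarity breaks the tie in favour of C3, the shortest of the three vectors.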
Specification errors
Any words NOT in the query are assumed to have ZERO relevance
Fine for irrelevant terms but what about synonyms?
What about distance between terms?
• PC very close to “personal computer”
• PC close to computer
• PC far from dog
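One way to act on term distance is to expand the query with nearby terms at reduced weight. In this sketch the closeness values, the threshold and the function name are all hypothetical; a real system might derive closeness from a thesaurus or from co-occurrence statistics.

```python
# Hypothetical closeness values in [0, 1], invented for this example.
RELATED = {
    "pc": {"personal computer": 0.9, "computer": 0.6, "dog": 0.0},
}

def expand_query(query_weights, related, threshold=0.5):
    """Add related terms whose closeness meets the threshold, scaled by closeness."""
    expanded = dict(query_weights)
    for term, weight in query_weights.items():
        for other, closeness in related.get(term, {}).items():
            if closeness >= threshold:
                # Keep the larger weight if the term is already in the query.
                expanded[other] = max(expanded.get(other, 0.0), weight * closeness)
    return expanded

expanded = expand_query({"pc": 1.0}, RELATED)
print(expanded)
```

Terms outside the original query no longer have zero relevance, while distant terms such as "dog" are still excluded.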
Techniques: Neural networks
[Diagram: a network whose groups of units are labelled Names, People, Education, Job, Gang and Age, connected by links marked Inhibit and Excite]
Probabilistic models
Soft matching
Rank hits by relevance
Vector space allows this
Bayes theorem / Fuzzy Logic
Relevance feedback
Why let user interact?
Query can be refined based on user’s input
One good hit leads to another - best queries are long (e.g. a whole document!)
Learn about user for future searches (e.g. What hits did they follow up? Was there a pattern?)
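A common way to refine a query from user feedback is a Rocchio-style update, which the slides do not name. The sketch below shows only its positive part: the query vector is moved towards the documents the user marked as good hits. The `alpha` and `beta` values and the function name are assumptions.

```python
def refine_query(query, relevant_docs, alpha=1.0, beta=0.75):
    """Move the query towards the mean of the documents the user marked relevant."""
    refined = {term: alpha * weight for term, weight in query.items()}
    for doc in relevant_docs:
        for term, weight in doc.items():
            # Each relevant document contributes an equal share of beta.
            refined[term] = refined.get(term, 0.0) + beta * weight / len(relevant_docs)
    return refined

query = {"user": 1.0, "system": 1.0}
# The user follows up title C3, represented here as a term vector.
liked = [{"user": 1, "interface": 1, "system": 1, "EPS": 1}]
refined = refine_query(query, liked)
print(refined)
```

The refined query now carries weight on "interface" and "EPS" as well, which is the sense in which one good hit leads to another and queries grow longer.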
Classification
Manual: e.g. Yahoo!
Automatic: e.g. auto key word extractor with fixed classification tree (Dewey for example)
Automatic: e.g. use classification algorithm such as ID3 to get classification tree as well (tree can look bizarre)
Collaboration
Why not look at links followed by other users after a search event?
Why not look at links TO or FROM a page? (these might be relevant)
What types of collaboration between a connected community could you imagine?
Visualisation
Show hits in a way that makes sense
Is a list a good way to do this?
What other ways could you imagine?
User modelling
Why not know something about the user?
Their past search events might be informative.
What could you model?
Text analysis
How could you examine a page of text and find out what it is about?
Use HTML tags?
Use metadata?
Parse text to find subject of sentence?
How would you try and find key words from a text?
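As a starting point for the last question, one simple baseline is to count words after removing stop words and keep the most frequent. This is only a sketch under assumed names and an assumed stop list; it ignores HTML tags, metadata and sentence structure, all of which the questions above suggest could do better.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to"}  # illustrative only

def keywords(text, top_n=3):
    """Return the top_n most frequent non-stop words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

# Hypothetical text built from the vocabulary of the maths titles.
text = "Graph minors: a survey of graph minors and the widths of trees in a graph."
top = keywords(text, top_n=2)
print(top)
```

Raw frequency favours words that are common everywhere, so weighting the counts by IDF is a natural next step.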
NLP approaches
Attempt to understand content of a document
Syntax (structure of text) is easy, semantics (meaning) is ad hoc
Possible for very limited domains with little ambiguity
Essential problem is implicit context or common sense
Conclusions
I.R. is needed!
I.R. is hard!
A range of competing approaches
No winners yet
Find out about these, be creative and critical, contribute and win!
IR topics
B Author identification
C Classification
D Collaboration
E Commercial Systems
F Data Mining
G Distributed IR
IR topics
H Evaluation
I Latent Semantic Indexing
K Multimedia IR
L Probabilistic Retrieval
M Query Languages
N Relevance Feedback
O Search technologies
IR topics
P Text Analysis
Q Text level analysis
R Thesauri in IR
S User interfaces and visualisation
T User modelling
U Web Search
References
Chapter 1 of Information Retrieval by C. J. van Rijsbergen; see http://www.dcs.glasgow.ac.uk/Keith/Preface.html
Chapter 1 of Finding Out About by Richard Belew
Chapter 1 of Modern Information Retrieval by Ricardo Baeza-Yates & Berthier Ribeiro-Neto