© CvR SIGIR2002
Keith van Rijsbergen
Tampere, 12th August 2002
Landmarks in Information Retrieval: the message out of the bottle
Introductory Remarks
• Exclusions – IE, TM, ..
• Commercial successes and failures
• Caveats
• Why we have survived.
• Where we were, where we are, where we are going.
Pre-history
• Smee (1850)
• Wells (1936)
• Bush (1945)
• Bagley (1951), MIT
• Fairthorne (1945-52), RAE
• Luhn (1958)
• Mooers (1952)
Experimental Methodology
• Cleverdon: Cranfield
• Lancaster: Medlars
• Keen: Cranfield/Smart
• Saracevic: CWRU
• Salton: Smart
• Sparck Jones: Ideal Test Collection
• Blair & Maron: Stairs
• Harman: TREC
Evaluation
• ABNO/OBNA (Fairthorne)
• Precision, Recall -> trade-off (Cleverdon)
• Probabilistic versions (Swets)
• Measure-theoretic (Bollmann)
‘the world in 1980 according to Belver Griffith’
Who is missing?
Landmarks
• Luhn’s tf weighting
• Architecture
• Relevance Feedback
• Stemming
• Poisson Model -> BM25
• Statistical weighting tf*idf
• Various models
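As an illustration of the tf*idf landmark, here is a minimal sketch of one common textbook variant: raw term frequency multiplied by log(N/df). The function name, the toy documents and this particular idf formula are assumptions made for the example; the slides do not commit to a specific formulation.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Weight each term in each document by tf * log(N / df).

    `documents` maps a document id to its list of index terms.
    This is one common variant, chosen for illustration only.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for terms in documents.values():
        df.update(set(terms))
    weights = {}
    for doc_id, terms in documents.items():
        tf = Counter(terms)  # within-document term frequency
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


# Hypothetical two-document collection.
docs = {
    "d1": ["retrieval", "retrieval", "model"],
    "d2": ["retrieval", "logic"],
}
weights = tf_idf(docs)
```

With this variant a term that occurs in every document ("retrieval" here) receives weight zero however often it is repeated, while terms confined to fewer documents keep a positive weight.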
Luhn’s curve
What about evaluation?
[Diagram; box labels: Information Problem, Indexed Objects, Query, Fictive Objects, Representation (twice), Compare]
Architecture (Brenda Gerrie, 1983)
Time I (highlights for me)
• 1952 Mooers coins IR
• 1958 International Conference on Scientific Information
• 1960 Cranfield I
• 1960 Maron and Kuhns paper
• 1961 Towards IR, RAF
• 1961 (-1965) Smart built
• 1964 Washington conference on Association Methods
• 1966 Cranfield II
• 1968 Salton’s first book
• 197- Cranfield conferences
• 1975 CvR’s book
• 1975 Ideal test collection
• 1976 KSJ/SER JASIS paper
Time II
• 1978 1st SIGIR
• 1979 1st BCSIRSG
• 1980 1st joint ACM/BCS conference on IR
• 1981 KSJ book on IR Experiments
• 1982 Belkin et al. ASK hypothesis
• 1983 – Okapi started
• 1985 RIAO-1
• 1986 CvR logic model
• 1990 Deerwester et al., LSI paper
• 1991 CoLIS 1 (in Tampere!)
• 1991 – Inquery started
• 1992 Ingwersen’s book
• 1992 TREC-1
• 1998 Croft Ponte paper on language models
Dimensions
Matching          Exact Match       Partial (best) Match
Inference         Deduction         Induction
Model             Deterministic     Probabilistic
Classification    Monothetic        Polythetic
Query Language    Artificial        Natural
Query Definition  Complete          Incomplete
Query Dependence  Yes               No
Items wanted      Matching          Relevant
Error response    Sensitive         Insensitive
Logic             Classical         Non-classical
Representation    a priori          a posteriori
Language Models   Logical           Statistical
Probabilistic Retrieval
• Maron and Kuhns
• Miller (following Goffman)
• SER/KSJ
• Croft
Vector Space Model
• Salton
• Murray
• Rocchio
Logical Model
For
• Mooers/Fairthorne 1960+
• Hillman 1965
• Cooper/Maron 1970+
• CvR 1986
• Nie/Amati/Bruza/Huibers 1990+
Against
• Bar-Hillel 1950+
• Kasher 1966
Buried Treasure
• Dependence: e.g. C.T. Yu
• Unified Probabilistic Model: Maron/Cooper/SER
• Co-relevance: Ivie
• Stochastic Processes: Mandelbrot/Herdan
• Brouwerian Logics: Hillman
• Error Analysis: Hughes/Cover/Duda
Hypotheses/Principles
• P & R trade-off – ABNO/OBNA
• Exhaustivity/Specificity
• Cluster Hypothesis
• Association Hypothesis
• Probability Ranking Principle
• Logical Uncertainty Principle
• ASK
• Polyrepresentation
Items may be associated without apparent meaning but exploiting their association may help retrieval
Postulates of Impotence (according to Swanson, 1988)
• An information need cannot be expressed independent of context
• It is impossible to instruct a machine to translate a request into adequate search terms
• A document’s relevance depends on other seen documents
• It is never possible to verify whether all relevant documents have been found
• Machines cannot recognise meaning -> can’t beat human indexing etc
….more postulates
• Word-occurrence statistics can neither represent meaning nor substitute for it
• The ability of an IR system to support an iterative process cannot be evaluated in terms of single-iteration human relevance judgment
• You can have either subtle relevance judgments or highly effective mechanised procedures, but not both
• Thus, consistently effective fully automatic indexing and retrieval is not possible
?
Conclusions
Matching
Co-ordination is positively correlated with external relevance
Jackson, 1969 – Association Hypothesis
The larger the number of matching descriptive items, for a request and document, the more likely the document is to be relevant to the request
Sparck Jones, 1971 – Relevance Hypothesis
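The idea that more matching descriptive items make a document more likely to be relevant can be sketched as co-ordination level ranking. The function names and the toy collection below are hypothetical, and counting shared terms is only the simplest reading of the hypothesis, not a method taken from the slides.

```python
def coordination_level(request_terms, document_terms):
    """Count the descriptive items shared by a request and a document."""
    return len(set(request_terms) & set(document_terms))


def rank_by_coordination(request_terms, documents):
    """Return (doc_id, level) pairs, highest co-ordination level first."""
    scored = [
        (doc_id, coordination_level(request_terms, terms))
        for doc_id, terms in documents.items()
    ]
    # Sort by descending level; ties are broken by doc_id so the order is repeatable.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical three-document collection and request.
documents = {
    "d1": ["probabilistic", "retrieval", "model"],
    "d2": ["vector", "space", "model"],
    "d3": ["boolean", "logic"],
}
request = ["probabilistic", "model", "retrieval"]
ranking = rank_by_coordination(request, documents)
```

Here d1 shares three terms with the request, d2 one and d3 none, so the ranking lists them in that order. This is partial (best) match rather than exact match: d2 is still returned, only lower down.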
Inference
It is a common fallacy, underwritten at this date by the investment of several million dollars in a variety of retrieval hardware, that the algebra of Boole (1847) is the appropriate formalism for retrieval design … The ‘logic’ of Brouwer, as invoked by Fairthorne, is one such weakening of the postulate system, …
Mooers, 1961
Another one: Logical Uncertainty Principle
CvR, 1986
Classification
Co-occurrence [of terms] as a basis for grouping makes for good swops i.e. permits substitutions which retrieve relevant rather than irrelevant documents.
Sparck Jones, 1971 – Classification Hypothesis
If an index term is good at discriminating relevant from non-relevant documents then any closely associated index term is also likely to be good at this.
CvR, 1979 – Association Hypothesis
Closely associated documents tend to be relevant to the same requests
CvR, 1971 – Cluster Hypothesis
Models
• Vector Space/LSI
• Probabilistic
• Logical
Query Language
Artificial/Natural
Multilingual/cross-lingual
images
none at all
Query Definition
Complete/Incomplete
Independence/Dependence
Weighted/Unweighted
Query Expansion/one shot (feedback, web)
Sense disambiguation
Cross-lingual
Query Dependence
Relevance Feedback
Ostensive Retrieval
Context
Query Expansion
Items wanted
Relevance
ASK: Anomalous State of Knowledge
Situated Relevance
Error response
Precision and Recall
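For reference, precision and recall can be computed as below, using the standard set-based definitions. The ranking, the relevance judgments and the cutoff values are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked output and relevance judgments.
ranked_output = ["d1", "d4", "d2", "d5", "d3"]
relevant_docs = {"d1", "d2", "d3"}

# Evaluate the top-k of the ranking at a few cutoffs.
results = {k: precision_recall(ranked_output[:k], relevant_docs) for k in (1, 3, 5)}
```

Reading further down this ranking raises recall from 1/3 to 1 while precision falls from 1 to 0.6, a small instance of the trade-off attributed to Cleverdon earlier in the talk.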
Logic
standard/non-standard
probabilistic logic
information flow/logic
Representation
Discrimination/Representation
Specificity/Exhaustivity
NLP
Montague Semantics
Language Models
Stochastic