1 of [email protected] University of Malta
CSA1013:Information Search and Retrieval© 2003- Chris Staff
CSA1013
Historical Perspectives of Information Search and Retrieval
Dr. Christopher Staff
Department of Computer Science & AI
University of Malta
Aims and Objectives
• What is Information Search and Retrieval?
• What’s the “state-of-the-art”?
• How did we get here?
• What are the issues?
• Where are we likely to go next?
What’s Information Search and Retrieval?
• What’s information?
  – Structured vs. unstructured
• Where is it?
• Question answering vs. information lack or information need
What’s the “state-of-the-art”?
• Information Retrieval in the “real” world
  – Web-based search engines
    • Google, AllTheWeb, AltaVista, etc.
• Web directories
  – Yahoo, Excite, etc.
What’s the “state-of-the-art”?
• Google, and Google-like search engines
  – Index > 24 billion web pages (pdf, doc, html, …)
  – User expresses “Query”
    • terms, natural language query, etc.
  – System “compares” query to indexed documents
  – Returns “list” of “relevant” documents
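
The index / query / compare / return cycle above can be sketched in a few lines of Python. This is only a toy illustration, not how Google or any real engine works: the function names (`tokenize`, `build_index`, `search`), the three sample documents and the scoring rule (count of distinct query terms matched) are all assumptions made for the example.

```python
from collections import defaultdict

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    tokens = (w.strip(PUNCT).lower() for w in text.split())
    return [t for t in tokens if t]

def build_index(docs):
    """Inverted index: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """'Compare' the query to the index: score = number of distinct query terms matched."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Return a 'list' of 'relevant' documents, best score first (ties broken by id).
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical mini-collection standing in for the indexed web pages.
docs = {
    "d1": "Information retrieval on the web",
    "d2": "Web directories list web sites by topic",
    "d3": "A history of the printing press",
}
index = build_index(docs)
results = search(index, "web retrieval")
print(results)
```

Here "d1" matches both query terms and "d2" matches one, so they are returned in that order; "d3" matches nothing and is left out.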
What’s the “state-of-the-art”?
• Recent study by Jansen & Spink [Jansen] shows:
  – |Query| = 2.14 terms [Spink]
  – Queries with 1 term = 53%!
  – 54% of users are satisfied with first page of results (list of 10 documents)
  – 80% of users view not more than 10 - 20 results
  – 27.6% read only one document!
  – 66% read < 5 documents
Has life always been this good?
• It would seem that we’re living in information heaven
• Any info we seek is just a couple of query terms away
• Although the majority of queries appear to be “trivial”, the reality is quite different
Has life always been this good?
• What if we want to find all relevant information? (“The Invisible Web”)
• What if we want to find something that is difficult to describe?
• What if we don’t know what we’re looking for?
  – What tools do we use to find info in encyclopaedias, dictionaries, newspapers, reference manuals, novels and other books?
Here beginneth the history lesson…
• People have devised tools to find information again ever since we learnt to write things down…
• Think of information stored on your personal computers… how do you find something that you wrote last month, last year?
Prehistory!
• Well, nearly!
• Early writings
  – Papyrus scrolls
  – No paragraph, page numbers, etc.
  – Couldn’t “scroll to the end” to read an index
  – Instead, Greek/Roman libraries used “sillybus”/“index” of title
Greeks/Romans
• 3rd century BC: Greeks probably use alphabetization in the Library of Alexandria
• Around the 2nd century BC (Rome): evidence of hierarchies of information/classification systems
  – Greeks probably earlier
• Also, Tables of Contents date from around the 2nd century BC (Pliny the Elder reports this before 79 AD)
Printing Press
• Not much else was to happen until 1455, with the advent of the printing press
• Previously, it was still difficult to refer to information “within” a book, because copies were inaccurate
  – Info on one page in one book could be on a different page in other copies
Indices and the Printing Press
• Still, alphabetization was on initial letter, then on first four letters…
• Not until the 18th century did full alphabetization occur!
The Second World War and beyond
• In 1945, Vannevar Bush publishes “As We May Think” in the Atlantic Monthly
• In 1949, Warren Weaver writes that if Chinese is English + codification, then Machine Translation should be possible
• These give rise to “intelligent” and “statistical” (or surface-based) approaches to Information Search and Retrieval respectively (amongst other things :-))
Intelligent vs. Surface-based
Intelligent (“Concepts”): 1950’s
• Lay in waiting for years, because hardware/software not around
Surface-based (“Words”): 1950’s
• First approaches were “Key Words in Context” (KWIC)
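
To make the KWIC idea concrete, here is a minimal sketch: every occurrence of a keyword is listed together with a few words of context on either side. The function name `kwic`, the `width` parameter and the sample sentence are assumptions for illustration; historical KWIC indexes were produced from document titles and differed in many details.

```python
def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring trailing punctuation on the word.
        if word.strip(".,;:!?").lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append((left, word, right))
    return lines

# Hypothetical sample text.
sample = "The index of a book lists terms. A good index saves the reader time."
for left, word, right in kwic(sample, "index"):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} | {word} | {right}")
```

Aligning the keyword in a central column is what lets a reader scan quickly for the context they want.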
Intelligent vs. Surface-based
Intelligent: 1960’s
• Generality in AI (John McCarthy)
Surface-based: 1960’s
• Boolean Search
• Measures of performance effectiveness
• Thesaural Lookup
• Vector Space Model
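
The difference between Boolean Search and the Vector Space Model can be shown with a small sketch. Boolean AND returns an unranked set of documents containing every query term, while the vector space approach scores each document by the cosine of the angle between query and document term vectors. The sample documents and the names `boolean_and`, `cosine` and `rank` are assumptions, and raw term counts are used as weights purely for simplicity (real systems typically use weighting schemes such as tf-idf).

```python
import math
from collections import Counter

# Hypothetical mini-collection; each document becomes a bag of term counts.
docs = {
    "d1": "boolean search uses and or not",
    "d2": "vector space model ranks documents",
    "d3": "boolean model and vector model compared",
}
bags = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

def boolean_and(terms):
    """Boolean search: documents that contain ALL the terms (no ranking)."""
    return sorted(d for d, bag in bags.items() if all(t in bag for t in terms))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(query):
    """Vector space retrieval: order documents by similarity to the query."""
    q = Counter(query.split())
    scored = [(cosine(q, bag), d) for d, bag in bags.items()]
    return [d for score, d in sorted(scored, key=lambda p: (-p[0], p[1])) if score > 0]

print(boolean_and(["boolean", "model"]))
print(rank("vector model"))
```

With this data the Boolean query matches only "d3", whereas the ranked query returns "d3" ahead of "d2": both contain the two query terms, but "d3" mentions "model" twice and so lies closer to the query vector.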
Intelligent vs. Surface-based
Intelligent: 1970’s
• Expert Systems
• Still about “understanding” information and reasoning with and about it
Surface-based: 1970’s
• Explosion in availability of electronic text collections
• Library Retrieval Systems
• Full-text indexing
• Probabilistic IR
• Relevance Feedback
Intelligent vs. Surface-based
Intelligent: 1980’s
• Conceptual IR
• Knowledge Rep Langs
• Lenat’s CYC
• Contextual Reasoning
• 5th Generation Computing, Japan
• LSI feeds Statistical IR
Surface-based: 1980’s
• OPACs
• IR used by non-specialists
• Extended Boolean IR
• Word Sense Disambiguation
• Statistical IR (LSI, etc.)
• Internet
Intelligent vs. Surface-based
Intelligent: 1990’s
• Better language processing
  – information extraction
  – entity name recognition
• Advances in contextual reasoning, ontologies
Surface-based: 1990’s
• WWW (1995 c. 10M pages, 2003 c. 3B!)
• Multimedia Indexing & Retrieval
• Web-based search engines
Intelligent vs. Surface-based
Intelligent: 2000’s
• Semantic Web
Surface-based: 2000’s
• Faster processors
• More memory
• Cheaper storage space
• More superficial comparisons
Intelligent vs. Surface-based
Intelligent: The future
• Computers that can find precisely the information you seek
  – Even if the answer is non-obvious
  – Or the answer needs to be the result of reasoning
• MyLifeBits
Surface-based: The future
• Computers that can approximate the information you seek
  – At much less cost
  – At the expense of “correctness”
• MyLifeBits
Main Issues
• Architecture to handle ever increasing numbers of docs + efficient data structures
• Freshness, indexing and retrieval speed (Efficient algorithms)
• What is “relevance”? (Better, cheaper and more accurate algorithms to understand what the user really wants)
Main References
• Paijmans, J.J., last updated 2004, “The Retrieval of Information from historical perspective”, http://pi0959.kub.nl/Paai/Onderw/V-I/Content/history.html
• American Society of Indexers, last updated 2005, “How Information Retrieval Started”, http://www.asindexing.org/site/history.shtml
• [Jansen] Jansen, B.J., and Spink, A., 2003, “An Analysis of Web Documents Retrieved and Viewed”, in Proceedings of the 4th International Conference on Internet Computing, Las Vegas, Nevada, 23-26 June 2003. http://ist.psu.edu/faculty_pages/jjansen/academic/pubs/pages_viewed.pdf
• [Spink] Spink, A., et al., 2001, “Searching the Web: The Public and their Queries”, in JASIST 2001. http://jimjansen.tripod.com/academic/pubs/jasist2001/jasist2001.html