LIS618 lecture 2 history Thomas Krichel 2011-09-19

TRANSCRIPT

Page 1:

LIS618 lecture 2

history

Thomas Krichel, 2011-09-19

Page 2:

contents

• This lecture is based on a very fine paper by Michael Lesk, “The Seven Ages of Information Retrieval”.

• That paper was written in 1997, so it does not cover recent advances.

Page 3:

general problem

• The general approach here is a computer science approach: how to build systems that will allow people to obtain information.

• This can be distinguished from the more positivistic approach of observing how people actually deal with obtaining information.

Page 4:

general overview

• In the last decades, no industry has been subjected to more technological change than the information industry.

• Nevertheless, there are two other factors of note:
  – the political and economic environment
  – research and theoretical advances

Page 5:

the broad picture

• Through the life of information retrieval there has been a constant struggle about how to prepare the system by intellectual or statistical methods.

• Intellectual methods require substantial human input. Despite that, they have not been completely ruled out.

Page 6:

statistical methods

• When we use statistical methods, we interrogate a text for tokens of potential semantic significance, e.g. words. (details in next lecture).

• We look at how often they occur in documents. It’s a brute-force job.

• We store a representation of this count in a computer.
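The count-and-store step above can be sketched in a few lines of Python. The tokenization rule (split on whitespace, strip surrounding punctuation) is a deliberate simplification of what real systems do; details come in the next lecture.

```python
from collections import Counter

def term_counts(text):
    """Count occurrences of each lower-cased word token in a text.

    A toy rendering of the statistical approach: tokens are
    whitespace-separated words with surrounding punctuation stripped
    (an assumption; real tokenizers are more careful).
    """
    tokens = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return Counter(t for t in tokens if t)

counts = term_counts("To be, or not to be: that is the question.")
# counts["to"] == 2, counts["be"] == 2
```

A retrieval system would store such counts per document, typically in an inverted index.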

Page 7:

information analysis method

• In this approach we have humans analyze the contents of documents to make judgments about them.

• This analysis predates computers.

• Early computer-based systems had to use this because the computers were not capable.

• In modern systems, such as web information retrieval, this is making a comeback.

Page 8:

1945 As We May Think

• This is a very famous (but, IMHO, not a good) paper by Vannevar Bush.

• Bush envisages the memex. It stores human knowledge in a desktop device.

• It lets users manage commented associations between its stored items.

• The essay pays scant attention to information retrieval (bar condemning existing approaches).

Page 9:

1950s

• The USSR launches Sputnik.
• The US is worried:
  – need to improve the organization of science
  – need to understand what the Russians are doing
    • learn Russian
    • work on machine translation

• Suddenly it seems like a good idea to go back to something like a wartime science effort.

Page 10:

major problems

• There were about 100 computers in the United States.

• Computers operated in a batch-processing, rather than in an interactive mode.

• Supply of machine-readable text was very small.

Page 11:

special hardware

• There were some experiments with special hardware:
  – edge-notched cards by Calvin Mooers
  – the WRU Searching Selector of Allen Kent

• These were abandoned as the digital technology progressed.

Page 12:

Hans Peter Luhn (1896—1964)

• In 1957 he was the first to propose extracting information from text by automated means.
  – “It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements.”
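Luhn's two measurements can be turned into a small sentence-scoring sketch. The frequency threshold and the use of the whole sentence as the "span" are illustrative assumptions, not Luhn's exact procedure.

```python
from collections import Counter

def luhn_scores(sentences, min_freq=2):
    """Score sentences a la Luhn: words occurring at least min_freq
    times in the text count as significant, and a sentence scores by
    the squared number of significant words divided by its length
    (a density measure standing in for Luhn's positional clustering)."""
    tokens = [w.lower().strip(".,") for s in sentences for w in s.split()]
    freq = Counter(tokens)
    significant = {w for w, n in freq.items() if n >= min_freq}
    scores = []
    for s in sentences:
        toks = [w.lower().strip(".,") for w in s.split()]
        hits = sum(1 for t in toks if t in significant)
        scores.append(hits * hits / len(toks) if toks else 0.0)
    return scores
```

Sentences dense in frequently occurring words score highest, which is the basis of Luhn-style automatic abstracting.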

Page 13:

KWIC

• Luhn also worked on KWIC (key word in context) indexing.

• The idea was that information retrieval should not only be based on bare words but on words shown in their context. Example (index lines aligned on the keyword):

      the idea was that   Information Retrieval   should
              worked on   KWIC                    key word in
        Hans Peter Luhn   was                     the first
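A minimal KWIC indexer might look as follows; the column width and the crude tokenization are arbitrary choices for illustration.

```python
def kwic(lines, keyword, width=25):
    """Produce key-word-in-context entries: every occurrence of the
    keyword is shown with its left and right context, aligned so the
    keyword sits in a fixed column."""
    entries = []
    for line in lines:
        words = line.split()
        for i, w in enumerate(words):
            if w.lower().strip(".,") == keyword.lower():
                left = " ".join(words[:i])[-width:]
                right = " ".join(words[i + 1:])[:width]
                entries.append(f"{left:>{width}}  {w}  {right}")
    return entries

for entry in kwic(["Luhn also worked on KWIC indexing"], "kwic"):
    print(entry)
```

Running the indexer over every title in a collection, once per significant word, yields a permuted title index of the kind Luhn proposed.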

Page 14:

1960

• This is when online IR really started.
• The first systems were being built, with heavy subsidies.
• Some of today’s available scientific IR systems can be traced to precursors built at that time.

Page 15:

fuller descriptions

• As computers became more powerful, more and more descriptors could be taken from texts to make them findable in retrieval systems.

• Taking more descriptors, however, meant more work for the people involved in finding the descriptive terms.

Page 16:

full-text indexing

• During the 60s people started to look at the situation where all words would be used as descriptors for the document.

• Such a strategy is
  – less work for the human
  – more work for the computer

Page 17:

IR performance criteria

• In order to compare full-text and selective indexing, criteria had to be found and experiments conducted.

• Cyril Cleverdon (1914—1997) developed criteria of precision and recall.

• He and his disciples worked on a standard document set that could be used for testing: about 14k abstracts on aeronautics.
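Cleverdon's two criteria are easy to state in code. A sketch for a single query, with made-up document ids:

```python
def precision_recall(retrieved, relevant):
    """precision = share of retrieved documents that are relevant;
    recall    = share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 2 of the 3 relevant ones among them:
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
# p == 0.5, r == 2/3
```

The two measures pull against each other: retrieving more documents tends to raise recall while lowering precision.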

Page 18:

and they found

• They found that full-text indexing was superior to human-aided indexing.

• This was very controversial at the time. All catalogers got very upset!

Page 19:

new retrieval techniques

• Having an experimental dataset that was shared by researchers was key to developing experiments that were comparable.

• One idea was “relevance feedback”. The user would select relevant documents in a first step, and then terms from these documents would be added to the query in a second step.
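The two-step scheme described above can be sketched like this. The frequency-based cutoff is an assumption made for brevity; classic systems used weighted variants such as Rocchio's formula.

```python
from collections import Counter

def expand_query(query_terms, relevant_docs, top_k=3):
    """Relevance feedback, naively: add to the query the most frequent
    terms from the documents the user marked relevant."""
    freq = Counter(w.lower() for doc in relevant_docs for w in doc.split())
    new_terms = [w for w, _ in freq.most_common() if w not in query_terms]
    return list(query_terms) + new_terms[:top_k]

# hypothetical toy documents the user marked relevant:
expanded = expand_query(["jet"], ["jet engine noise", "engine noise tests"])
```

The expanded query is then run as the second retrieval step.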

Page 20:

NASA ReCON

• The name stands for “remote console”.

• It is thought to be the first multi-site bibliographic online information system for scientific literature.

• It was the first large-scale information system that was
  – online
  – interactive
  – distributed

Page 21:

NASA ReCon

• It dealt with 200,000 citation strings.
• It searched for keywords manually extracted from text.
• It allowed fielded searches.
• It allowed combinations of fields searched and combinations of results received.

Page 22:

evolution from testbeds to services

• ReCON was developed by Lockheed in 1965.
• It became a testbed at Lockheed using the ERIC database in 1969.
• It became the DIALOG online system in 1972.
• That system was, for a long time, the largest online general retrieval system, mainly for bibliographic data.

Page 23:

specialized engines

• COLEX was a system developed in 1965 for the Air Force by Systems Development Corporation.

• It became SDC’s ORBIT system.

• The National Library of Medicine used that system to build the MEDLARS system that would give online access to medical abstracts as early as 1971.

Page 24:

increase of available text

• The development of computer-aided typesetting, and later word processing, made a lot more text available online than before.

• Among the first were the larger abstracting databases such as Chemical Abstracts.

• Later, newspapers followed.

Page 25:

other services

• In 1973 the Lexis system of US court records was the first large-scale full-text database.

• OCLC got tapes from LoC and printed catalog cards. Later they started shared cataloging, because the LoC only cataloged about 2/3 of what the member libraries had.

Page 26:

research

• Research into IR prototype systems actually declined in the US.

• The NSF shifted research funds away from prototypes toward the development of actual systems.

• There was some work on AI in IR, but it got pretty much nowhere.

Page 27:

research on models

• Gerard Salton (1927—1995) and his group at Cornell refined the vector space model and introduced tf/idf term weighting in the early 70s. This will be covered later.

• Later in the 70s the probabilistic model of information retrieval emerged in work by Keith van Rijsbergen (1943—) and friends. This will not be covered in the course.
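Ahead of the fuller treatment later in the course, the tf/idf idea can be sketched as follows. This raw-count-times-log variant is just one of several weightings Salton's group experimented with, chosen here for brevity.

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term in each document by tf * log(N / df):
    tf is the term's count in the document, df the number of
    documents containing it, and N the collection size."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # each document counts a term once
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({t: n * math.log(N / df[t]) for t, n in tf.items()})
    return weights
```

Note that a term occurring in every document gets weight 0: it carries no discriminating power.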

Page 28:

80s

• This decade saw the introduction of inexpensive personal computers.

• In libraries
  – Card catalogs are replaced by OPACs, allowing non-specialists access to online information.
  – Online information started to use full text.

Page 29:

full-text retrieval

• In the scientific abstract databases full-text is not that important.

• But in legal databases, it is. Lexis was the first to provide full-text access.

• Later they were followed by the newspapers. They were among the first non-reference publishers to adopt computerized typesetting. These databases contained text only.

Page 30:

research effort

• The vector model was still the starting point, but researchers tried using dictionaries to disambiguate terms in documents and queries.

• Some work was done on bilingual retrieval.

• Some work was done on part-of-speech assignments.

• Research made no impact on real-world commercial systems.

Page 31:

CD ROM

• In the late 80s the CD-ROM appeared to make a dent in online information retrieval.

• The CD-ROM fits the standard print publishing model.

• Sharing CD-ROMs however, proved a nightmare in libraries.

Page 32:

90s

• The 90s were the decade of the Internet.
• Online information retrieval became common.
• The CD-ROM lost out.

Page 33:

integration of graphics

• The web became compelling as soon as a graphical user interface was introduced by the Mosaic browser.

• This brought the subject of multimedia information retrieval, previously largely confined to research experiments, out into the open.

Page 34:

start of search engines

• Early search engines looked at the web with the eyes of earlier information retrieval engines.

• They essentially applied the vector model to the text extracted from web pages.

• But web pages also contain pictures and, more importantly, links.

Page 35:

Stanford digital library project

• Run at Stanford between 1995 and 2004 by a group of researchers to “provide an infrastructure that affords interoperability among heterogeneous, autonomous digital library services.”

• In 1996 Sergey Brin and Larry Page started to work on a project to analyze the links between web pages.

Page 36:

web information retrieval

• I argue that web information retrieval is different from conventional retrieval because, on the web, pages have different intrinsic values.

• The PageRank algorithm is a simple algorithm to make such values apparent.
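A power-iteration sketch of PageRank, using the textbook damping factor of 0.85; the graph representation and parameters here are illustrative choices, not anything specific to the lecture.

```python
def pagerank(links, damping=0.85, iters=50):
    """Iteratively distribute rank over a graph given as a dict
    mapping each page to the list of pages it links to."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # every page keeps a base share, then receives from in-links
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in links.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for w in outs:
                    new[w] += share
            else:  # a dangling page spreads its rank evenly
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank

# two pages linking only to each other end up with equal rank:
ranks = pagerank({"a": ["b"], "b": ["a"]})
```

Pages with many (or highly ranked) in-links accumulate rank, which is what makes the "intrinsic value" of a page appear from the link structure alone.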

Page 37:

http://openlib.org/home/krichel

Please shut down the computers when you are done.

Thank you for your attention!