Information Access I: An Introduction to Information Retrieval
GSLT, Göteborg, September 2003
Barbara Gawronska, Högskolan i Skövde



Topics:

1st intensive week:
  Introduction
  Knowledge representations for IA
  Text categorization, indexing (theory + labs)
  Text summarization
  User aspects

2nd intensive week:
  Interactivity
  Multilingual systems and resources
  Evaluation

Schedule and content

Thursday 11/9

8-10 BG: An Introduction to Information Retrieval

Central notions:
  Information/knowledge/data/metadata
  Information Retrieval vs. Data Retrieval
  Information Extraction, abstracting, summarization

A survey of the history of Information Retrieval (Standard IR-models)

Schedule and content 2

10-12 BG: Representation of information and identification of significant text features (Standard IR-models)

Different types of information and knowledge representation:
  classical knowledge representation methods (top-down and bottom-up hierarchical classifications, thesauri)
  weighting techniques and co-occurrence-based techniques

The notion of Retrieval Status Value (RSV) and methods for RSV evaluation

Schedule and content 3

15-17 HD: User aspects in information retrieval and text categorization

  presentation of search results using KWIC (key-word-in-context), marking up words in documents, automatic dynamic spell checking of the search query, term expansion and synonym search
  categorization and clustering of texts

Schedule and content 4

Friday 12/9

8-10 HD: Information extraction and automatic text summarization

  information extraction techniques for text summarization (statistics, linguistics and heuristics)
  a demonstration of SweSum
  evaluation of automatic text summarization systems

10-12, 13-15 ED: Testing automatic indexing with predefined categories (labs)

Shannon’s and Weaver’s definition of information (1959)

[Figure: information (vertical axis, 0 to 1.6) plotted against the number of symbols in a code (1, 2, 4, 16, 32)]

Information = \sum_{i=1}^{n} p_i \log \frac{1}{p_i}
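The definition above can be tried out numerically. The following is a minimal sketch, not part of the original slides: it assumes that all n symbols of a code are equally likely and that the logarithm is taken to base 10 (a guess that fits the 0 to 1.6 range of the plot axis). The names `information` and `uniform_code` are illustrative only.

```python
import math


def information(probabilities, base=10):
    """Average information sum_i p_i * log(1/p_i) of a discrete code."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Symbols with p = 0 contribute nothing (limit of p*log(1/p) as p -> 0).
    return sum(p * math.log(1.0 / p, base) for p in probabilities if p > 0)


def uniform_code(n):
    """A code with n equally likely symbols (assumption for illustration)."""
    return [1.0 / n] * n


if __name__ == "__main__":
    # The same code sizes as on the horizontal axis of the figure.
    for n in (1, 2, 4, 16, 32):
        print(n, round(information(uniform_code(n)), 3))
```

With equally likely symbols the sum reduces to log n, so the value grows slowly as the code gets larger; a code with a single symbol carries no information at all.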

Data – Knowledge – Information (Fuhr 1995, modified)

Metadata

Data         Syntactic level: organized symbols, values of attributes
Knowledge    Semantic level: meaning representation
Information  Pragmatic level: “knowledge in action”

Information Access

Information Recovery
Information Retrieval
  (Information Recovery ≠ Information Retrieval; Information Retrieval includes a selection process)
Information Refinement:
  Multilingual Information Retrieval
  Information Extraction
  Abstracting, summarization

Data Retrieval vs. IR (van Rijsbergen 1979)

                     DR                  IR
Inference            Deductive           Inductive
Model                Deterministic       Probabilistic
Query language       Formal              Natural (as a goal)
Query specification  Full                Partial
Matching             Exact               Best match
Search object        Matching the query  Relevant
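The contrast between exact matching (DR) and best matching (IR) can be illustrated in a few lines. This is a minimal sketch under stated assumptions, not van Rijsbergen's formulation: the sample records and documents are invented, and the score is a plain count of shared terms, a deliberately crude stand-in for a real retrieval function.

```python
# Invented sample data, for illustration only.
RECORDS = [{"id": 1, "year": 2003}, {"id": 2, "year": 1995}]
DOCS = {
    "d1": "stock exchange and share market",
    "d2": "banking and finances",
    "d3": "share prices on the stock market",
}


def exact_match(records, field, value):
    """Data retrieval: return only records whose field equals the value."""
    return [r for r in records if r.get(field) == value]


def best_match(documents, query):
    """Information retrieval: rank documents by number of shared query terms."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties are broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


if __name__ == "__main__":
    print(exact_match(RECORDS, "year", 2003))
    print(best_match(DOCS, "stock exchange banking"))
```

The data retrieval call either finds the record or it does not; the information retrieval call returns every partially matching document in ranked order, and no document has to contain all query terms.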


Data Retrieval vs. IR (2) (the German IR Research Group)

IR systems have to handle “uncertain knowledge” (“unsicheres Wissen”):
  Vague queries; reformulation frequently required
  The problem of the user's own understanding of his or her information need
  Limitations of knowledge representations

IR – Main Issues

How to
  Represent
  Interpret
  Categorize
  Evaluate

The history of IR in brief

The early IR – “a history of how indexes were created and searched” (Meadow et al. 2000:20)

Index – a broad definition: a systematic scheme that places like material together

Thus, books arranged in alphabetical order can also be seen as an index

The history of IR in brief (2)

Pre-alphabetical systems:
  ca. 2700 B.C., the Sumerian culture: grouping by similarity among initial ideograms (birds, bowls, trees...)
  ca. 1500 B.C. – first phonetically based systems (syllable-based, later phoneme-based)

The history of IR in brief (3)

First attempts to utilize letter frequency in search: Arabs, 9th century

Categorization of documents in ancient libraries:
  Babylonian “libraries”, 12th-7th century B.C.: categories like astronomy, geography, history, mathematics, natural science, laws and... linguistics
  The Alexandrian library; Callimachus (310-240 B.C.): 8 categories; subject matter and genre as criteria (history, laws, medicine, philosophy, lyric poetry, oratory, tragedy); the Pinakes, a catalogue in 120 scrolls

The history of IR in brief (4)

CALLIMACHUS in Áitia:

”A big book is a big disaster...”

The history of IR in brief (5)

1876 – Melvil Dewey, USA – Dewey Decimal Classification (DDC)
Universal Decimal Classification (UDC, Otlet & La Fontaine)
  10 main classes
  hierarchical organization
  max 10 branches from 1 node
  today 130 000 classes

The history of IR in brief (6)

Universal Decimal Classification, an example:

3        Social science, laws, administration
33       National economics
336      Finances
336.7    Banking
336.76   Stock exchange
336.763  Share market
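Because every extra digit in a decimal notation narrows the class, the whole path from a main class down to a specific class can be read off the notation itself. The sketch below shows this for the example above; the function names are illustrative, and the rule of placing a point after every third digit is assumed from the notations shown on the slide.

```python
# Classes from the example on the slide.
UDC_EXAMPLE = {
    "3": "Social science, laws, administration",
    "33": "National economics",
    "336": "Finances",
    "336.7": "Banking",
    "336.76": "Stock exchange",
    "336.763": "Share market",
}


def format_notation(digits):
    """Insert a point after every third digit, as in 336.763."""
    return ".".join(digits[i:i + 3] for i in range(0, len(digits), 3))


def ancestors(notation):
    """Return the path from the main class down to the given class."""
    digits = notation.replace(".", "")
    return [format_notation(digits[:k]) for k in range(1, len(digits) + 1)]


if __name__ == "__main__":
    # Print the hierarchy, indenting one step per level.
    for depth, code in enumerate(ancestors("336.763")):
        print("  " * depth + code, UDC_EXAMPLE.get(code, "(not in the example)"))
```

A search for everything under "Finances" then amounts to a prefix test on the digits, which is what makes such notations convenient as index keys.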

The history of IR in brief (7)

Card catalogues – 18th/19th century

How to represent the content of a document in an index?
Precoordination vs. postcoordination of index terms

The history of IR in brief (8)

1950 –
  M. Taube – the Uniterm system
  W.E. Batten – the optical coincidence system
    (in both systems, the TERM serves as the starting point)
  C. Mooers – the Zatocoding system
    (cards represent DOCUMENTS and are provided with descriptors, coded as series of holes at the edge of the card)
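The two card organizations correspond to two ways of storing the same descriptor assignments. The sketch below is a loose modern analogy rather than a model of the historical equipment: punched holes and optical coincidence are not simulated, and the sample documents and function names are invented for illustration.

```python
from collections import defaultdict

# Invented sample: document number -> descriptors assigned by an indexer.
DOCUMENTS = {
    1: ["banking", "finance"],
    2: ["banking", "stock exchange"],
    3: ["stock exchange", "share market"],
}


def build_term_cards(documents):
    """Term as starting point: one 'card' per term listing document numbers."""
    cards = defaultdict(set)
    for doc_number, descriptors in documents.items():
        for term in descriptors:
            cards[term].add(doc_number)
    return dict(cards)


def coordinate(cards, *terms):
    """Postcoordination: combine terms at search time by intersecting cards."""
    if not terms:
        return set()
    result = set(cards.get(terms[0], set()))
    for term in terms[1:]:
        result &= cards.get(term, set())
    return result


def match_document_cards(documents, *terms):
    """Document as starting point: scan every card for all wanted descriptors."""
    wanted = set(terms)
    return {n for n, descriptors in documents.items() if wanted <= set(descriptors)}


if __name__ == "__main__":
    cards = build_term_cards(DOCUMENTS)
    print(coordinate(cards, "banking", "stock exchange"))
    print(match_document_cards(DOCUMENTS, "banking", "stock exchange"))
```

Both searches return the same documents; the difference lies in what has to be handled: a few term cards in the first case, the whole deck of document cards in the second.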

The history of IR in brief (9)

1950 – First computerized IR systems (special purpose computers):
  The Western Reserve Rapid Searching Selector (1957, Shera, Kent & Berry)
    Based on human-created telegraphic abstracts
    Aimed at technical texts
    Semantic categories like product, process, material...
  The Minicard Selector (1959, Kessel & DeLucia)

The history of IR in brief (10)

Late 1950s - early 1970s:
  first IR systems on general purpose computers (Bracken & Tillit 1957)
  computerized IR should become more than simple string matching: the idea of utilizing word frequencies and inverse document frequency (idf) – Luhn, Bar-Hillel
  first online IR services (MAC at MIT, MEDLINE, Lexis/Nexis)
  The Internet (4 hosts, 1969)
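The idea of weighting by word frequency and inverse document frequency can be sketched as follows. The slide names the idea but gives no formula, so the weighting used here, tf times log(N/df), is an assumption: it is one common later formulation, not necessarily what Luhn or Bar-Hillel proposed. The tiny corpus and the function names are invented.

```python
import math
from collections import Counter

# Invented three-document corpus, for illustration only.
CORPUS = [
    "the share market fell",
    "the stock exchange opened",
    "the bank raised rates",
]


def term_frequencies(text):
    """Count how often each word occurs in one document."""
    return Counter(text.lower().split())


def inverse_document_frequency(term, documents):
    """idf = log(N / df); 0.0 for a term that occurs in no document."""
    df = sum(1 for text in documents if term in text.lower().split())
    if df == 0:
        return 0.0
    return math.log(len(documents) / df)


def tf_idf(term, text, documents):
    """Weight of a term in one document relative to the whole collection."""
    return term_frequencies(text)[term] * inverse_document_frequency(term, documents)


if __name__ == "__main__":
    for word in ("the", "share"):
        print(word, tf_idf(word, CORPUS[0], CORPUS))
```

A word that occurs in every document, such as "the", gets weight zero, while a word confined to few documents is weighted up; this is what takes retrieval beyond plain string matching.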

IR today and in the future

From simple string matching towards NLP techniques (statistics/heuristics/morphology/semantics/pragmatics)

Natural language in queries
Integrating speech technology
Multilingual retrieval and extraction
Multimedia retrieval

A General Model of an IR system (Fuhr 1995:11)

[Diagram, labels only: Data; Analysis; Retrieved Information; Knowledge representation; Transformations; Information Retrieval; Internal Knowledge Structures]

A Basic Model of a Document Retrieval System (Fuhr 1995:11)

[Diagram, labels only: Document; Analysis; Retrieved Documents or Document Information; Indexing, Classification, Clustering; Retrieval operations (Boolean or stochastic); Document Retrieval; Data Bank Structures]

A document from different perspectives (Meghini et al. 91, modified)

[Sample document, translated from Swedish]

Article from NyttIT

The Compulsory School Project: a summary of the first year
2003-09-05, FU-kansliet, Johanna Österberg

For the past year, the University has been running the recruitment project 'Compulsory school pupils – our future students'.

By reaching compulsory school pupils in various ways with information about university studies, the goal is to demystify higher education and to spark interest in it in general, and in Högskolan i Skövde in particular. The aim is to open up the world of the university, increase diversity and reduce socially skewed recruitment.

Class visits
In autumn 2002 the University cooperated with Vasaskolan in Skövde and Centralskolan in Töreboda. At both schools, staff and students from the University met all final-year classes for about an hour to discuss the future and different choices in life. The differences between studying at lower or upper secondary school and at university were also discussed. In total, about 200 pupils took part in these meetings. The parents of these pupils also received brief information about university studies in connection with parents' meetings about the choice of upper secondary school.

Perspectives on the document:
  Layout
  “Logical” structure (head, title, author…)
  Semantics

Different aspects of a search

[Diagram, labels only]
  Real object
  DB object
  Object attributes
  Information request
  Formal query
  Logical view: Structure specification
  Layout view: Layout specification
  Semantic view: Content specification