
Information Access I: Interactive Information Search

GSLT, Göteborg, October 2003

Barbara Gawronska, Högskolan i Skövde

2nd intensive week:
Interactivity (Th 8-12 BG, 13-15 MM)
Multilingual systems and resources (Fr 8-10 MM, 10-12 BG)
Evaluation (Fr 13-15 BG)

Some repetition: Data Retrieval vs. IR (2) (the German IR Research Group)

IR systems have to handle "uncertain knowledge" ("unsicheres Wissen"):
Vague queries; reformulation frequently required
The problem of the user's own understanding of his/her information need
Limitations of knowledge representations

This implies a need for interaction.

A General Model of an IR system (Fuhr 1995:11)

[Figure: general model of an IR system. Labels: Data, Analysis, Knowledge representation, Internal Knowledge Structures, Transformations, Information Retrieval, Retrieved Information.]

A Basic Model of a Document Retrieval System (Fuhr 1995:11)

[Figure: basic model of a document retrieval system. Labels: Document, Analysis, Indexing/Classification/Clustering, Data Bank Structures, Retrieval operations (Boolean or stochastic), Document Retrieval, Retrieved Documents or Document Information.]

A document from different perspectives (Meghini et al. 91, modified)

Example document (an article from NyttIT, translated from Swedish):

The compulsory school project: summary of the first year
2003-09-05, FU-kansliet, Johanna Österberg

For the past year, the University has been running the recruitment project "Compulsory school pupils: our future students".

By reaching out in various ways to compulsory school pupils with information about university studies, the goal is to demystify higher education and to arouse interest in it in general, and in Högskolan i Skövde in particular. The aim is to open up the world of the university, increase diversity and reduce skewed recruitment.

Class visits
During autumn 2002 the University cooperated with Vasaskolan in Skövde and Centralskolan in Töreboda. At both schools, staff and students from the University met all final-year classes for about an hour to discuss the future and different choices in life. Differences between studying at lower or upper secondary school and studying at university were also discussed. In total about 200 pupils took part in these meetings. The parents of these pupils were also given brief information about university studies in connection with parents' meetings about the choice of upper secondary school.

Perspectives on the document:
Layout
"Logical" structure (head, title, author…)
Semantics

Different aspects of a search

[Figure: aspects of a search. Labels: Real object, DB object, Object attributes, Information request, Formal query, Logical view (Structure specification), Layout view (Layout specification), Semantic view (Content specification).]

But where and when is interactivity needed?

[Figure: the diagram of search aspects, repeated. Surviving labels: DB object, request, Formal query, Object attributes, Logical view, Layout view, Layout specification, specification.]

How to diagnose the need for interaction refinement?

User studies (still too sparse):
  Users in contact with existing systems:
    Free task choice
    Predefined tasks
  Wizard-of-Oz experiments
Relevance feedback ("real" and "pseudo")

Wizard-of-Oz experiments (Dahlbäck, Jönsson...)

Users tend to spontaneously produce a kind of "controlled" language:
Written-language syntax (complete sentences, ellipsis avoided)
"Repairs" are infrequent
Pronominal anaphora is less frequent than in human-human communication

Wizard-of-Oz experiments (3)

”Controlled” language in users (3)

A psycholinguistic reflection: it is not unlike "baby-talk" (i.e. the way of talking to young children or to unskilled/unidiomatic speakers of a language)

This can make human-computer NLP-dialogue a less complicated task than e.g. translating human-human dialogue

There seem to be age-related differences in the way people interact with computer systems

But:

If the system gives the impression of being too smart, the user normally becomes more natural in his/her linguistic behaviour, which causes problems for the system...

Should the system's responses remain a little "stupid"?

Now, back from wizards to existing systems.

Let's think about IR models again.

But where and when is interactivity needed?

[Figure: the diagram of search aspects, repeated. Surviving labels: DB object, request, Formal query, Object attributes, Logical view, Layout view, Layout specification, specification.]

Information request level:

Common problems:
Spelling errors (recall Hercules' lecture)
Connector interpretation: natural-language conjunctions vs. logical connectors; conjunction symbols in IR systems may be ambiguous:

"Food for cats and dogs"

Information request level (2)

Negation (examples inspired by Fuhr 1995):

”Drugs and sedatives without relation to aging”

”Drugs and sedatives, not related to aging”

”Drugs and sedatives, no aging”

”Drugs and sedatives, not age”
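One way to see why the wording of a negated request matters is to run a few possible Boolean readings over a toy collection. The sketch below is illustrative only: the four documents, their index terms and the helper `docs_with` are invented for this example, and the three readings are assumptions about what a user might mean, not readings taken from Fuhr.

```python
# Toy collection: document id -> set of index terms (invented for illustration).
DOCS = {
    1: {"drugs", "sedatives", "aging"},
    2: {"drugs", "sedatives"},
    3: {"drugs", "age"},
    4: {"sedatives", "sleep"},
}


def docs_with(term):
    """Return the ids of all documents indexed with `term`."""
    return {doc_id for doc_id, terms in DOCS.items() if term in terms}


# Reading A: both concepts are required, 'aging' is excluded.
reading_a = (docs_with("drugs") & docs_with("sedatives")) - docs_with("aging")

# Reading B: either concept is enough, 'aging' is excluded.
reading_b = (docs_with("drugs") | docs_with("sedatives")) - docs_with("aging")

# Reading C: the user wrote 'age' instead of 'aging'; without stemming or a
# thesaurus the exclusion removes a different document.
reading_c = (docs_with("drugs") | docs_with("sedatives")) - docs_with("age")

print(sorted(reading_a))  # [2]
print(sorted(reading_b))  # [2, 3, 4]
print(sorted(reading_c))  # [1, 2, 4]
```

The three result sets differ, which is the kind of divergence that feedback to the user would need to expose.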

Information request (3)

What kind of feedback would be useful on this level?

Feedback, definition (Meadow et al. 2000: 246; McGraw-Hill 1971):

Feedback = information derived from the output of a process and used to control the process in the future

Possible feedback format on the information request level (?)

Predicate logic?

for(food,cat) & for(food,dog)

Or

for(food,cat) or for(food,dog)

Or

for(food,cat) & dog

Generate NLP questions?

Leave everything to the user?

Or?

How to present the feedback? Menu choice?
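As one possible answer to the menu question, the sketch below pairs each predicate-logic reading of "food for cats and dogs" with a natural-language paraphrase and shows the paraphrases as a numbered menu. The function names (`candidate_readings`, `render_menu`, `choose`) and the paraphrase wording are hypothetical; this is a minimal illustration, not a description of any existing system.

```python
def candidate_readings(head, first, second):
    """Return (formal reading, paraphrase) pairs for '<head> for <first> and <second>'."""
    return [
        (f"for({head},{first}) & for({head},{second})",
         f"{head} that is suitable for both {first} and {second}"),
        (f"for({head},{first}) or for({head},{second})",
         f"{head} for {first}, or {head} for {second}"),
        (f"for({head},{first}) & {second}",
         f"{head} for {first}, and separately anything about {second}"),
    ]


def render_menu(readings):
    """Format the paraphrases as a numbered menu the user can choose from."""
    lines = ["Your request can be read in several ways. Did you mean:"]
    for number, (_, paraphrase) in enumerate(readings, start=1):
        lines.append(f"  {number}. {paraphrase}")
    return "\n".join(lines)


def choose(readings, choice):
    """Return the formal reading that belongs to a 1-based menu choice."""
    return readings[choice - 1][0]


readings = candidate_readings("food", "cat", "dog")
print(render_menu(readings))
print(choose(readings, 2))  # for(food,cat) or for(food,dog)
```

Showing paraphrases rather than the formulas themselves keeps the logic out of the user's sight; whether users actually prefer this is exactly the open question of the slide.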

Between information request level and formal query level

Meadow et al. 2000: 179ff: examples from Dialog:

SSELECT CAT
interpreted as: SS (= SELECT SETS) CAT

SELECTION (wrongly used instead of the standard command SELECT)
interpreted as: S (= SELECT) ION

What kind of feedback would be useful on this level?

Between the information request/formal query level and database objects

If the request/query is ambiguous:
Give some feedback and try to resolve the ambiguity before searching the database,
or after the search, before presenting the documents ("delayed disambiguation")?

What search stage is most suitable for feedback/dialog? What factors should be taken into account?

Search stages, or ”states” in searchers (Penniman & Dominick 1980, Chapman 1981)

Database selection
Exploration of individual terms (looking up terms in a thesaurus or an inverted file in order to decide which terms are to be used in the query)
Record search by term combinations
Record browsing and display
Record evaluation (for possible iteration)
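The term-exploration stage can be illustrated with a very small inverted file. The sketch below assumes whitespace tokenisation and three invented records; `build_inverted_file` and `explore_terms` are hypothetical helper names, and a real system would add normalisation, stemming and a thesaurus.

```python
from collections import defaultdict


def build_inverted_file(records):
    """Map each term to the sorted list of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for term in text.lower().split():
            index[term].add(record_id)
    return {term: sorted(ids) for term, ids in index.items()}


def explore_terms(index, prefix):
    """List (term, number of records) for every indexed term starting with `prefix`."""
    return sorted((term, len(ids)) for term, ids in index.items()
                  if term.startswith(prefix))


# Invented records, for illustration only.
records = {
    1: "cat food and dog food",
    2: "dog training",
    3: "cat behaviour catalogue",
}
index = build_inverted_file(records)

print(explore_terms(index, "cat"))  # [('cat', 2), ('catalogue', 1)]
print(index["dog"])                 # [1, 2]
```

Seeing the posting counts before searching lets the user decide which terms are worth combining in the next stage.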

Levels of search activities (Bates 1990, Fuhr 1995)

Strategy: a plan for an entire information search, e.g. find relevant literature for a course in IA
Stratagem: e.g. journal run, citation search...
Tactic: one or several moves made to further the search
Move: a single action

Levels of system involvement (Bates 1990)

1. No system involvement: all search activities are generated and executed by the human
2. Displays possible activities: the system lists search activities when asked. Some of the activities may be executable by the system, some may not.
3. Monitors the search and recommends search activities:
   1. Only when the searcher asks for suggestions
   2. Always when it identifies a need
4. Executes desired actions automatically

Relevance feedback and query reformulation

[Figure: query modification by relevance feedback, a loop with the labels query, answer, answer evaluation, query modification. Picture from M.A. Hearst, http://www.sims.berkeley.edu/courses/is202/f98/Lecture25/sld005.htm]

How to utilize terms extracted from relevant documents?

The extracted terms may be added to the query
They may be presented to the user, who makes the decision about modification
They can be used for re-weighting the terms in the query
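The first two options can be sketched in a few lines: extract the most frequent new terms from the documents judged relevant, then either add them to the query automatically or show them to the user. Ranking by raw frequency, the stopword list and the function name `expansion_candidates` are assumptions made for this illustration; real systems typically use more careful term-selection statistics.

```python
from collections import Counter


def expansion_candidates(query_terms, relevant_docs, k=3, stopwords=frozenset()):
    """Return the k most frequent terms in the relevant documents that are not
    already in the query (ties are broken alphabetically)."""
    counts = Counter()
    for text in relevant_docs:
        counts.update(term for term in text.lower().split()
                      if term not in query_terms and term not in stopwords)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Invented example data.
query = ["cat", "food"]
relevant = [
    "dry cat food for kitten",
    "kitten food with salmon",
    "salmon cat food for kitten",
]
stop = frozenset({"for", "with"})

suggested = expansion_candidates(query, relevant, k=2, stopwords=stop)
print(suggested)  # ['kitten', 'salmon']

# Option 1: add the terms automatically.
expanded_query = query + suggested
# Option 2 would instead display `suggested` and let the user pick.
```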

A standard method for re-weighting: Rocchio's Algorithm (Rocchio 1971)

Goal: to achieve an optimal query

An optimal query maximizes the difference between the average relevant vector and the average nonrelevant vector

A standard method for re-weighting: Rocchio's Algorithm (Rocchio 1971; many modifications, e.g. Salton & McGill 1983; picture from Srinivasan 2003, http://mingo.info-science.uiowa.edu:16080/courses/230/Lectures/Vector.html#1c)

Q_new = a · Q_old + b · (average relevant vector) - c · (average nonrelevant vector)

Rocchio's Algorithm (2) (Rocchio 1971; many modifications, e.g. Salton & McGill 1983; a more formal way of expressing the same thing: Meadow et al. 2000:258)

$$QW' = QW + \frac{\alpha}{R} \sum_{D_i \in D_R} DW_i - \frac{\beta}{N} \sum_{D_i \in D_N} DW_i$$

QW = the initial query vector

QW' = the vector of the modified query

R = the number of relevant retrieved documents (the set D_R)

N = the number of non-relevant retrieved documents (the set D_N)

DW_i = the vector of document D_i

α, β = coefficients that must be determined experimentally (α often about 0.75, β about 0.25)
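A minimal sketch of this re-weighting step, using plain Python lists as vectors. The parameter names `alpha` and `beta` stand for the two experimentally determined coefficients, with the defaults taken from the values quoted above. Clipping negative weights to zero is a common convention added here as an assumption; it is not part of the formula. The three-term vocabulary and the judged documents are invented.

```python
def rocchio(query, relevant, nonrelevant, alpha=0.75, beta=0.25):
    """Return the modified query vector.

    query       -- initial query vector (list of term weights)
    relevant    -- vectors of the retrieved documents judged relevant
    nonrelevant -- vectors of the retrieved documents judged not relevant
    """
    def centroid(vectors):
        # Average vector; all zeros if no documents were judged.
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel = centroid(relevant)
    non = centroid(nonrelevant)
    modified = [q + alpha * r - beta * n for q, r, n in zip(query, rel, non)]
    # Assumption: negative term weights are clipped to zero.
    return [max(0.0, weight) for weight in modified]


# Assumed term order: cat, dog, food
qw = [1.0, 0.0, 1.0]
relevant = [[1.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
nonrelevant = [[0.0, 1.0, 1.0]]

print(rocchio(qw, relevant, nonrelevant))  # [1.75, 0.0, 1.125]
```

In the example the weight of "cat" grows because both relevant documents contain it, while "dog", which occurs only in the non-relevant document, is pushed down to zero.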

Future?

According to several studies, Machine Learning methods perform better than different variants of Rocchio’s algorithm.

Your experience?

Future?

Future users: a preliminary case study (age: 12-13)

First observations: the most frequent search goals are to DO things, not to read documents: "Download movies", "Subscribe to X", "Translate X" etc.

Future? (young users)

Queries in English dominate (specific to Swedish kids? What does it mean for multilinguality?)
Narrow terms dominate; specific terms are more frequent than general ones
Quite aware of the danger of information overload
Short queries, 2-3 words per query
"No point in searching for subcategories" (!)

Future? (young users)

Consequences for system design and feedback planning?