
Post on 01-Apr-2015


Page 1: I.R. Intro1 Dave Inman South Bank University, London SE1 0AA, UK +44 (0)20 7815 7446 dave@sbu.ac.uk Introduction to Information Retrieval An overview of

I.R. Intro 1

Dave Inman

South Bank University,

London SE1 0AA, UK

+44 (0)20 7815 7446

dave@sbu.ac.uk

Introduction to Information Retrieval

An overview of the topics to choose for your CW

I.R. Intro 2

Problems in I.R.

Searching for information in a vast unstructured digital world can be tricky. We have probably all found two problems:

Too many hits

You are looking at the first page of hundreds of hits. Many are duplicates. None seem relevant.

Too few hits

You enter your key terms and get back nothing relevant, or nothing at all. Probably rarer, but still a nuisance.

I.R. Intro 3

Scope

I.R. is about pointing the user to sources of information

Not about processing the sources for the user.

Not database retrieval

Sometimes the sources must be processed in order to see if they are a hit or not.

Job is done when user is shown relevant sources

I.R. Intro 4

Objectives

Retrieve ALL relevant documents

Retrieve NO irrelevant documents

Show the results sensibly

I.R. Intro 5

Basic questions

1. What problems have you had in I.R.?

2. What is a document?

3. What is a relevant document?

4. What is sensible output?

5. How might you measure the performance of an I.R. system?

I.R. Intro 6

What is a document?

I.R. Intro 7

Measures of effectiveness

[Figure: diagram with labelled regions All Docs, Relevant Docs (R), Hits (H) and Relevant Hits (RH)]

I.R. Intro 8

Measures of effectiveness

Precision = RH / H

0 if no hits relevant

1 if all hits relevant

Recall = RH / R

0 if no relevant docs found

1 if all relevant docs found
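The two ratios above can be computed directly from sets of document identifiers. This is a minimal sketch: the function name, the example identifiers and the choice to return 0 when a denominator is empty are assumptions made for illustration, not part of the original slides.

```python
def precision_recall(relevant, hits):
    """Return (precision, recall) for two sets of document identifiers."""
    relevant_hits = relevant & hits  # RH: hits that are also relevant
    # Assumption: report 0.0 rather than fail when H or R is empty.
    precision = len(relevant_hits) / len(hits) if hits else 0.0
    recall = len(relevant_hits) / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}  # R (hypothetical identifiers)
hits = {"d3", "d4", "d5"}            # H
precision, recall = precision_recall(relevant, hits)
print(precision, recall)
```

Here two of the three hits are relevant and two of the four relevant documents were found, so precision is 2/3 and recall is 1/2.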

[Figure: diagram with labelled regions All Docs, Relevant Docs (R), Hits (H) and Relevant Hits (RH)]

I.R. Intro 9

Techniques: Simple Binary match

[Figure: query terms q1, q2 matched against documents d1, d2]

Document d is a hit if it contains q1 or q2 ...

What is wrong with this?

How could you do better?
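As a starting point for those two questions, the simple binary match might be sketched as follows. The whitespace tokenisation, the function name and the two toy documents are assumptions for illustration only.

```python
def binary_match(query_terms, document_text):
    """Return True if the document contains at least one query term."""
    words = set(document_text.lower().split())
    return any(term.lower() in words for term in query_terms)


# Hypothetical two-document collection.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "information retrieval is hard",
}
hits = [doc_id for doc_id, text in docs.items()
        if binary_match(["retrieval", "search"], text)]
print(hits)
```

Every hit is treated as equally good, whether it matches one term once or all terms many times, which is one of the weaknesses the slide asks about.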

I.R. Intro 10

Techniques: Better Binary match

Remove stop words

Weighting by where match occurs

Heading? Title? Sentence subject?

Weighting by number of matches

More matches = more relevant?

Boolean queries

Allow user to specify AND, OR etc

Maybe also specify the distance the operator applies over?

(e.g. sentence, paragraph, document)
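Some of the refinements listed above can be combined in a few lines. This is a sketch under stated assumptions: the stop-word list is a tiny illustrative one, the weights of 3 for a title match and 1 for a body match are arbitrary, and only the AND operator is shown.

```python
# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "to", "is"}


def tokens(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def score(query, title, body, title_weight=3, body_weight=1):
    """Weight each match by where it occurs and by how often it occurs."""
    query_terms = set(tokens(query))
    title_matches = sum(1 for w in tokens(title) if w in query_terms)
    body_matches = sum(1 for w in tokens(body) if w in query_terms)
    return title_weight * title_matches + body_weight * body_matches


def matches_all(query, text):
    """Boolean AND over the whole document: every query term must occur."""
    words = set(tokens(text))
    return all(term in words for term in tokens(query))


example_score = score("the retrieval of information",
                      "Information Retrieval",
                      "retrieval of documents is hard")
print(example_score)
```

In the example the two title matches contribute 3 each and the single body match contributes 1, giving a score of 7.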

I.R. Intro 11

Better Binary match: Significant words

I.R. Intro 12

Better Binary match: Inverse index

Term = word

IDF = inverse document frequency (the inverse of the freq. of term in all docs)

DOC = document identifier

TF = term frequency
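An index with these fields (more commonly called an inverted index) might be built as sketched below. The three toy documents are hypothetical, and the formula log(N / df) is only one common choice of IDF, assumed here because the slide does not give one.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {DOC: TF} over a dict of {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def idf(index, term, n_docs):
    """Assumed IDF: log(N / number of documents containing the term)."""
    df = len(index.get(term, {}))
    return math.log(n_docs / df) if df else 0.0


# Hypothetical three-document collection.
docs = {
    "d1": "graph minors survey",
    "d2": "graph of trees trees",
    "d3": "user survey",
}
index = build_index(docs)
print(index["trees"], idf(index, "graph", len(docs)))
```

Looking up a query term in the index gives the matching documents directly, without scanning every document.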

I.R. Intro 13

Vector matching I.R.

Documents represented as a vector

Size of vector usually the number of terms in the document space

Queries represented as a pseudo document vector (usually very sparse)

Matching by dot product or cosine similarity
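Both matching functions are short when vectors are stored sparsely. In this sketch each vector is a dict from term to weight, which is an assumption chosen because query vectors are mostly zeros; the example weights are made up.

```python
import math


def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight}."""
    return sum(weight * v.get(term, 0) for term, weight in u.items())


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


doc = {"user": 1, "system": 1, "interface": 1, "eps": 1}
query = {"user": 1, "system": 1}
print(dot(query, doc), cosine(query, doc))
```

The dot product rewards long documents that repeat terms, while the cosine divides by the vector lengths so that document size matters less.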

I.R. Intro 14

Vector matching I.R.

t = term

d = document

q = query
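With these symbols, the dot product and cosine similarity are usually written as below. This is the standard textbook form and not necessarily the notation of the original slide; the weights w_{t,d} and w_{t,q} (for example term counts) are an assumed notation.

```latex
\[
\mathrm{score}(q, d) = \sum_{t} w_{t,q}\, w_{t,d},
\qquad
\cos(q, d) =
\frac{\sum_{t} w_{t,q}\, w_{t,d}}
     {\sqrt{\sum_{t} w_{t,q}^{2}}\;\sqrt{\sum_{t} w_{t,d}^{2}}}
\]
```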

I.R. Intro 15

An example document space: Computing titles

C1: Human machine interface for ABC computer applications

C2: A survey of user opinion of computer system response times

C3: The EPS user interface management system

C4: System and human system engineering testing of EPS

C5: Relation of user perceived response time to error measurement

I.R. Intro 16

An example document space: Maths titles

M1: The generation of random binary ordered trees

M2: The intersection graph of paths in trees

M3: Graph minors IV: Widths of trees and well quasi ordering

M4: Graph minors: A survey

I.R. Intro 17

Representing the document space

           C1  C2  C3  C4  C5  M1  M2  M3  M4
human       1   0   0   1   0   0   0   0   0
interface   1   0   1   0   0   0   0   0   0
computer    1   1   0   0   0   0   0   0   0
user        0   1   1   0   1   0   0   0   0
system      0   1   1   2   0   0   0   0   0
response    0   1   0   0   1   0   0   0   0
time        0   1   0   0   1   0   0   0   0
EPS         0   0   1   1   0   0   0   0   0
survey      0   1   0   0   0   0   0   0   1
trees       0   0   0   0   0   1   1   1   0
graph       0   0   0   0   0   0   1   1   1
minors      0   0   0   0   0   0   0   1   1

I.R. Intro 18

Vector matching

           C1  C2  C3  C4  C5  M1  M2  M3  M4  Query
human       1   0   0   1   0   0   0   0   0   0
interface   1   0   1   0   0   0   0   0   0   0
computer    1   1   0   0   0   0   0   0   0   0
user        0   1   1   0   1   0   0   0   0   1
system      0   1   1   2   0   0   0   0   0   1
response    0   1   0   0   1   0   0   0   0   0
time        0   1   0   0   1   0   0   0   0   0
EPS         0   0   1   1   0   0   0   0   0   0
survey      0   1   0   0   0   0   0   0   1   0
trees       0   0   0   0   0   1   1   1   0   0
graph       0   0   0   0   0   0   1   1   1   0
minors      0   0   0   0   0   0   0   1   1   0
Score       0   2   2   2   1   0   0   0   0
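The Score row is the dot product of the Query column with each document column. The sketch below recomputes it; the sparse dict layout and variable names are assumptions, and the counts are taken from the nine example titles.

```python
DOCS = ["C1", "C2", "C3", "C4", "C5", "M1", "M2", "M3", "M4"]

# Term counts per document; absent entries mean the term does not occur.
TERM_COUNTS = {
    "human":     {"C1": 1, "C4": 1},
    "interface": {"C1": 1, "C3": 1},
    "computer":  {"C1": 1, "C2": 1},
    "user":      {"C2": 1, "C3": 1, "C5": 1},
    "system":    {"C2": 1, "C3": 1, "C4": 2},
    "response":  {"C2": 1, "C5": 1},
    "time":      {"C2": 1, "C5": 1},
    "EPS":       {"C3": 1, "C4": 1},
    "survey":    {"C2": 1, "M4": 1},
    "trees":     {"M1": 1, "M2": 1, "M3": 1},
    "graph":     {"M2": 1, "M3": 1, "M4": 1},
    "minors":    {"M3": 1, "M4": 1},
}

query = {"user": 1, "system": 1}

# Dot product of the query vector with each document column.
scores = [
    sum(weight * TERM_COUNTS.get(term, {}).get(doc, 0)
        for term, weight in query.items())
    for doc in DOCS
]
print(dict(zip(DOCS, scores)))
```

C4 scores 2 even though it never mentions "user", because "system" occurs twice in its title; this is the kind of effect that weighting by number of matches produces.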

I.R. Intro 19

Specification Errors

Any words NOT in the query are assumed to have ZERO relevance

Fine for irrelevant terms but what about synonyms?

What about distance between terms?

• PC very close to “personal computer”

• PC close to computer

• PC far from dog
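One way to act on term distance is to expand the query with close terms at reduced weight. Everything in this sketch is hypothetical: the closeness values, the threshold of 0.5 and the function name are invented for illustration, and a real system would take closeness from a thesaurus or from co-occurrence statistics.

```python
# Hypothetical closeness values in [0, 1], invented for this example.
CLOSENESS = {
    "pc": {"personal computer": 0.9, "computer": 0.6, "dog": 0.0},
}


def expand_query(query, closeness, threshold=0.5):
    """Add terms close to a query term, weighted by their closeness."""
    expanded = dict(query)
    for term, weight in query.items():
        for other, similarity in closeness.get(term, {}).items():
            if similarity >= threshold:
                expanded[other] = max(expanded.get(other, 0.0),
                                      weight * similarity)
    return expanded


expanded = expand_query({"pc": 1.0}, CLOSENESS)
print(expanded)
```

Terms missing from the original query then carry a small non-zero weight instead of being assumed to have zero relevance.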

I.R. Intro 20

Techniques: Neural Networks

[Figure: network diagram with node groups Names, People, Education, Job!, Gang and Age, and links labelled Inhibit and Excite]

I.R. Intro 21

Probabilistic models

Soft matching

Rank hits by relevance

Vector space allows this

Bayes' theorem / Fuzzy Logic

I.R. Intro 22

Relevance feedback

Why let user interact?

Query can be refined based on user’s input

One good hit leads to another - best queries are long (e.g. a whole document!)

Learn about user for future searches (e.g. What hits did they follow up? Was there a pattern?)
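Query refinement from feedback might look like the sketch below, which moves the query vector towards the documents the user followed up. It is a simplified Rocchio-style update using positive feedback only; the slides do not name a method, and the weights alpha and beta are assumed values.

```python
def refine_query(query, relevant_docs, alpha=1.0, beta=0.5):
    """Return alpha * query + beta * (mean of the relevant document vectors)."""
    refined = {term: alpha * weight for term, weight in query.items()}
    for doc in relevant_docs:
        for term, weight in doc.items():
            refined[term] = (refined.get(term, 0.0)
                             + beta * weight / len(relevant_docs))
    return refined


query = {"user": 1.0, "system": 1.0}
# Hypothetical document the user marked as a good hit.
followed = [{"user": 1, "system": 1, "interface": 1, "eps": 1}]
refined = refine_query(query, followed)
print(refined)
```

The refined query is longer than the original: it picks up "interface" and "eps" from the good hit, in line with the point that long queries tend to work best.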

I.R. Intro 23

Classification

Manual: e.g. Yahoo!

Automatic: e.g. auto key word extractor with fixed classification tree (Dewey for example)

Automatic: e.g. use classification algorithm such as ID3 to get classification tree as well (tree can look bizarre)

I.R. Intro 24

Collaboration

Why not look at links followed by other users after a search event?

Why not look at links TO or FROM a page? (these might be relevant)

What types of collaboration between a connected community could you imagine?

I.R. Intro 25

Visualisation

Show hits in a way that makes sense

Is a list a good way to do this?

What other ways could you imagine?

I.R. Intro 26

User modelling

Why not know something about the user?

Their past search events might be informative.

What could you model?

I.R. Intro 27

Text analysis

How could you examine a page of text and find out what it is about?

Use HTML tags?

Use metadata?

Parse text to find subject of sentence?

How would you try and find key words from a text?
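One simple answer to the last question is to count words after removing stop words and keep the most frequent. This is a baseline sketch only: the stop-word list is a tiny illustrative one, the regular expression keeps plain ASCII letters, and raw frequency is an assumed scoring rule.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "to", "is"}


def key_words(text, top_n=3):
    """Return the top_n most frequent non-stop words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]


text = ("Graph minors: a survey. The graph of paths in trees. "
        "Trees and graph widths.")
print(key_words(text, top_n=2))
```

Raw counts ignore where a word occurs, so HTML tags or metadata could be used to give more weight to words in titles and headings.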

I.R. Intro 28

NLP approaches

Attempt to understand content of a document

Syntax (structure of text) is easy, semantics (meaning) is ad hoc

Possible for very limited domains with little ambiguity

Essential problem is implicit context or common sense

I.R. Intro 29

Conclusions

I.R. is needed!

I.R. is hard!

A range of approaches competing

No winners yet

Find out about these, be creative and critical, contribute and win!

I.R. Intro 30

IR topics

B Author identification

C Classification

D Collaboration

E Commercial Systems

F Data Mining

G Distributed IR

I.R. Intro 31

IR topics

H Evaluation

I Latent Semantic Indexing

K Multimedia IR

L Probabilistic Retrieval

M Query Languages

N Relevance Feedback

O Search technologies

I.R. Intro 32

IR topics

P Text Analysis

Q Text level analysis

R Thesauri in IR

S User interfaces and visualisation

T User modelling

U Web Search

I.R. Intro 33

References

Chapter 1 of Information Retrieval by C. J. van Rijsbergen, see: http://www.dcs.glasgow.ac.uk/Keith/Preface.html

Chapter 1 of Finding Out About by Richard Belew

Chapter 1 of Modern Information Retrieval by Ricardo Baeza-Yates & Berthier Ribeiro-Neto