WIRED Week 2
• Syllabus Update (at least another week)
• Readings Overview
- Many ways to explain the same things
- Always think of the user
- Skip the math (mostly)
• Readings Review
- Most complicated of the entire semester
- Refer back to it often
• Readings Review - More Models
• Non-Linear
• Projects and/or Papers Overview
Opening Credits
• Material for these slides obtained from:
- Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/
- Introduction to Modern Information Retrieval by Gerard Salton and Michael J. McGill, McGraw-Hill, 1983.
- Ray Mooney, CSE 8335, Spring 2003
- Joydeep Ghosh
Why IR?
• IR originally mostly for systems, not people
• IR in the last 25 years:
1. classification and categorization
2. systems and languages
3. user interfaces and visualization
• A small world of concern
• The Web changed everything
• Huge amount of accessible information
• Varied information sources
• Relatively easy to look for information
• Improving IR means improving learning
• Digital technology changes everything (again)
WIRED Focus
• Information Retrieval: representation, storage, organization of, and access to information items
• Focus is on the user information need
• User information need:
- Find all docs containing information on Austin which:
• Are hosted by utexas.edu
• Discuss restaurants
• Emphasis is on the retrieval of information (not data, not just a keyword match)
So just what is Information then?
• Oh no, not more about information....
• “The difference that makes a difference” – Gregory Bateson
• Element in the communications process
- Information Theory
- Data
• Something that informs a user
- Not just Data
• “Orange”
• “1,741,405.339”
- Helps users learn
- Helps users make decisions
- Facts (in context)
Differences
• Data retrieval
- Keyword match for documents
- Well-defined semantics
- A single erroneous object implies failure
• Information retrieval
- Information about a subject or topic
- Loose semantics (language issues)
- Small errors are tolerated
• IR system:
- interprets contents of information items (documents)
- generates a ranking which reflects relevance for the query
- notion of relevance is most important
Let’s Talk about Finding
• How many ways are there to find something?
• Find a specific thing?
• Find a concept?
• Learn about a topic?
Ways of Finding are IR models
• Each way of finding is one type of Information Retrieval
• Different ways to search for a person, place, or thing
• Real-life information retrieval combines several of the methods
- All at once
- In succession
User Interaction with (IR) System
[Diagram: the user interacts with the “Database” through two tasks, Retrieval and Browsing]
• Users can do 2 distinct tasks
- Which one is more important?
- Can you do both at once?
- Which is more difficult?
• Leverage the strengths of each
Quick Overview of the IR Process
[Diagram: the user's information need is expressed as a query, documents are processed into an index, and a matching step produces a ranking of documents]
How do you get an index?
[Diagram: Docs pass through text operations (structure recognition, accents and spacing, stopword removal, noun groups, stemming); together with manual indexing, the logical view is condensed from structure and full text down to index terms]
• A “bag of words” with these logical parts that are processed
• The Web has (some) structure in markup languages
• Follow from left to right to see the document get condensed
• Structure + the right logical parts (+ things added) = Index
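To make that condensing concrete, here is a minimal Python sketch of the left-to-right pipeline; the stopword list and the tiny suffix-stripping "stemmer" are simplified stand-ins for what a real system (e.g., a Porter stemmer) would use.

```python
import re

# Minimal stand-ins for the text operations on the slide; a real system
# would use a fuller stopword list and a proper stemmer.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for"}

def simple_stem(word):
    """Crude suffix stripping: connecting/connection/connections -> connect."""
    for suffix in ("ions", "ing", "ion", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Condense raw text into a bag of index terms."""
    # Accents/spacing: lowercase and keep only letter runs.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stopwords: drop words that carry little meaning on their own.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Stemming: conflate morphological variants.
    return [simple_stem(t) for t in tokens]

print(index_terms("Connecting to the golf connections of Austin"))
# ['connect', 'golf', 'connect', 'austin']
```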
Our friend the Index
• IR systems usually rely on an index or indices to process queries
• Index term:
- a keyword or group of selected words
• “Golf” or “Golf Swing”
- any word (not always actually in the document!)
• “Waste of time”
• Stemming might be used:
- connect: connecting, connection, connections
• An inverted file is built for the chosen index terms (sketched below)
- A list of each keyword (or keyword group) and its location in the document(s)
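A minimal sketch of how such an inverted file could be built; the document contents are made up, and in practice the terms would come from a text-operations pipeline like the one sketched above.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each index term to the documents (and positions) where it occurs.

    `docs` is a dict of doc_id -> list of index terms.
    """
    inverted = defaultdict(list)
    for doc_id, terms in docs.items():
        for position, term in enumerate(terms):
            inverted[term].append((doc_id, position))
    return dict(inverted)

docs = {
    "d1": ["golf", "swing", "austin"],
    "d2": ["austin", "restaurant"],
}
print(build_inverted_index(docs))
# {'golf': [('d1', 0)], 'swing': [('d1', 1)],
#  'austin': [('d1', 2), ('d2', 0)], 'restaurant': [('d2', 1)]}
```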
Indexes, huh?
• Matching at the index term level is not always the best way to find
• Users get confused and frustrated (without an understanding of the IR Model)
• Search features are not easy
- Number of search terms is small
- Results not presented well
• Web searching actually not so easy either
- Junk pages
- Ranking games
- Duplicate information
- Bad page or site Information Architecture
More about indexes
• Indexes make sense of the order (to the system)
• Relevance is measured for the user’s query among the indices
• Ranking of results is how the index is exposed to the user.
Ranking
• A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query
• A ranking is based on fundamental premises regarding the notion of relevance, such as:
- common sets of index terms
- sharing of weighted terms
- likelihood of relevance
• Each set of premises leads to a distinct IR model
Relevance
• Relevance means the correct document for your situation
• Relevance feedback is doing the search again with changes to the search terms
- By the user
- By the system
• Also called Query Reformulation
• Relevance is based on the user
• A huge area for improvement in search interfaces
IR Models
• Classic IR Models
- Boolean Model
- Vector Model
- Probabilistic Model
• Set Theoretic Models
- Fuzzy Set Model
- Extended Boolean Model
• Generalized Vector Model
• Latent Semantic Indexing
• Neural Network Model
Types of IR Models
[Taxonomy diagram, organized by user task:]
• Retrieval (ad hoc, filtering)
- Classic Models: Boolean, Vector, Probabilistic
- Set Theoretic: Fuzzy, Extended Boolean
- Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
- Probabilistic: Inference Network, Belief Network
- Structured Models: Non-Overlapping Lists, Proximal Nodes
• Browsing: Flat, Structure Guided, Hypertext
The "Usual Suspects”
[Table: models by user task and logical view of documents]
• Retrieval
- Index Terms: Classic, Set Theoretic, Algebraic, Probabilistic
- Full Text: Classic, Set Theoretic, Algebraic, Probabilistic
- Full Text + Structure: Structured
• Browsing
- Index Terms: Flat
- Full Text: Flat, Hypertext
- Full Text + Structure: Structure Guided, Hypertext
Classic IR Models - Basics
• Each document is represented by a set of representative keywords or index terms
• An index term is a word from the document that describes the document
- Classically, index terms are nouns because nouns have meaning by themselves
- Most human indexers use nouns & verbs
- Now, search engines assume that all words are index terms (full text representation)
• A great index has words added to it for additional meaning and context
Classic IR Models – More Basics
• Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents
• The importance of the index terms is represented by weights associated with them
- ki - an index term
- dj - a document
- F - the framework for document representations
- R - a ranking function in relation to query & document
- wij - a weight associated with (ki,dj)
- The weight wij quantifies the importance of the index term for describing the document contents
Classic IR Models – Basic Variables
- t is the total number of index terms
- K = {k1, k2, …, kt} is the set of all index terms
- wij >= 0 is a weight associated with (ki,dj)
- wij = 0 indicates that term does not belong to doc
- dj= (w1j, w2j, …, wtj) is a weighted vector associated with the document dj
- gi(dj) = wij is a function which returns the weight associated with pair (ki,dj)
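As a toy illustration of this notation (the weight values below are made up), a document can be held as a weighted vector and gi(dj) is just a lookup:

```python
# t = 3 index terms and one document dj represented as a weighted vector.
K = ["golf", "swing", "austin"]   # k1, k2, k3
dj = (0.8, 0.5, 0.0)              # (w1j, w2j, w3j); w3j = 0 means k3 does not belong to dj

def g(i, d):
    """g_i(d_j): return the weight associated with the pair (k_i, d_j).

    Python indexing is 0-based, so i = 0 corresponds to k1.
    """
    return d[i]

print(g(1, dj))  # weight of k2 ("swing") in dj -> 0.5
```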
Wait a minute!
• Why are these variables so poorly labeled and organized?
• Tradition
• Security through obscurity?
• Difference from other mathematical symbolism?
• Sadly, not consistent even in the IR literature
Boolean Model
• Simple model based on set theory
• Queries specified as Boolean expressions
- precise semantics
- neat formalism
- q = ka ∧ (kb ∨ ¬kc)
• Terms are either present or absent: wij ∈ {0,1}
• Consider
- q = ka ∧ (kb ∨ ¬kc)
- qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
- qcc = (1,1,0) is a conjunctive component
[Venn diagram over ka, kb, kc showing the regions (1,1,1), (1,1,0), and (1,0,0) covered by the query]
Boolean Model
• q = ka ∧ (kb ∨ ¬kc)
• sim(q,dj) = 1 if ∃ vec(qcc) | (vec(qcc) ∈ vec(qdnf)) ∧ (∀ ki, gi(vec(dj)) = gi(vec(qcc))); 0 otherwise
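A small sketch of that definition for the example query; the weight patterns and documents are made up, and Python's and/or/not stand in for the Boolean connectives:

```python
from itertools import product

# Binary term weights for the example terms ka, kb, kc.
TERMS = ["ka", "kb", "kc"]

def query(weights):
    """The example query q = ka AND (kb OR NOT kc), over binary weights."""
    a, b, c = weights
    return a and (b or not c)

# Disjunctive normal form of q: every binary weight pattern that satisfies it.
q_dnf = [w for w in product((1, 0), repeat=3) if query(w)]
print(q_dnf)  # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

def sim(doc_weights):
    """sim(q, dj) = 1 if dj's weight pattern matches a conjunctive component."""
    return 1 if tuple(doc_weights) in q_dnf else 0

print(sim((1, 1, 0)))  # 1: dj contains ka and kb but not kc
print(sim((0, 1, 1)))  # 0: ka is missing, so the document is not retrieved
```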
Boolean Model Problems
• Retrieval is binary: no partial matching
• No ranking of the documents is provided (absence of a grading scale)
• Users aren't good at Boolean queries
- too simplistic
- wrong
• As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
Vector Model
• Use of binary weights is too limiting
• Non-binary weights provide consideration for partial matches
• These term weights are used to compute a degree of similarity between a query and each document
• Ranked set of documents provides for better matching
Vector Model
• wij > 0 whenever ki appears in dj
• wiq >= 0 is associated with the pair (ki,q)
• dj = (w1j, w2j, ..., wtj)
• q = (w1q, w2q, ..., wtq)
• To each term ki is associated a unitary vector vec(i)
• The unitary vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
• The t unitary vectors form an orthonormal basis for a t-dimensional space in which queries and documents are represented as weighted vectors
[Diagram: the angle θ between the document vector dj and the query vector q]
Vector Model
• sim(q,dj) = cos(θ)
- = (vec(dj) • vec(q)) / (|dj| × |q|)
- = (Σi wij × wiq) / (|dj| × |q|)
• Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1
• A document is retrieved even if it matches the query terms only partially
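A minimal cosine-similarity sketch over weighted term vectors (the weight values are made up):

```python
import math

def cosine_sim(d, q):
    """sim(q, dj) = (dj . q) / (|dj| * |q|) over weighted term vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

dj = [0.8, 0.5, 0.0]   # weights w1j..w3j
q  = [1.0, 0.0, 0.4]   # query weights w1q..w3q
print(round(cosine_sim(dj, q), 3))  # a partial match still yields a non-zero score
```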
Vector Model
• sim(q,dj) = (Σi wij × wiq) / (|dj| × |q|)
• How to compute the weights wij and wiq?
• A good weight must take into account two effects:
- quantification of intra-document contents (similarity)
• tf factor, the term frequency within a document
- quantification of inter-document separation (dissimilarity)
• idf factor, the inverse document frequency
- wij = tf(i,j) × idf(i)
Vector Model
• Let
- N be the total number of docs in the collection
- ni be the number of docs which contain ki
- freq(i,j) be the raw frequency of ki within dj
• A normalized tf factor is given by
- f(i,j) = freq(i,j) / maxl freq(l,j)
- where the maximum is computed over all terms which occur within the document dj
• The idf factor is computed as
- idf(i) = log(N/ni)
- the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
Vector Model
• The best term-weighting schemes take both into account:
- wij = f(i,j) × log(N/ni)
• This strategy is called a tf-idf weighting scheme (sketched below)
• Term frequency – inverse document frequency
- How frequent is the word in this document, and how rare is it in the rest of the collection?
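A short sketch of the tf-idf weighting described above, using base-10 logs and a tiny made-up collection:

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Compute wij = f(i,j) * log10(N / ni) for every term in every document.

    `docs` maps doc_id -> list of index terms. f(i,j) is the raw frequency
    normalized by the most frequent term in that document.
    """
    N = len(docs)
    # ni: number of documents containing term ki.
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))

    weights = {}
    for doc_id, terms in docs.items():
        freq = Counter(terms)
        max_freq = max(freq.values())
        weights[doc_id] = {
            term: (count / max_freq) * math.log10(N / df[term])
            for term, count in freq.items()
        }
    return weights

docs = {
    "d1": ["golf", "golf", "swing"],
    "d2": ["golf", "austin"],
    "d3": ["austin", "restaurant"],
}
print(tf_idf_weights(docs)["d1"])
# "golf" is frequent in d1 but also common across the collection,
# so its weight is damped by the idf factor.
```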
Vector Model
• For the query term weights, a suggestion is
- wiq = (0.5 + 0.5 × freq(i,q) / maxl freq(l,q)) × log(N/ni)
• The vector model with tf-idf weights is a good ranking strategy with general collections
• The vector model is usually as good as any known ranking alternative
• It is also simple and fast to compute
Weights wij and wiq?
• One approach is to examine the frequency of occurrence of a word in a document
• Absolute frequency:
- tf factor, the term frequency within a document
- freq(i,j): raw frequency of ki within dj
- Both high-frequency and low-frequency terms may not actually be significant
• Relative frequency: tf divided by the number of words in the document
• Normalized frequency:
- f(i,j) = freq(i,j) / maxl freq(l,j)
Inverse Document Frequency
• Importance of a term may depend more on how well it can distinguish between documents
• Quantification of inter-document separation
• Dissimilarity, not similarity
• idf factor, the inverse document frequency
IDF
• N = the total number of docs in the collection
• ni = the number of docs which contain ki
• The idf factor is computed as
- idfi = log(N/ni) (base-10 logs in the example below)
- the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
• For example:
- N = 1000, n1 = 100, n2 = 500, n3 = 800
- idf1 = 3 - 2 = 1
- idf2 = 3 - 2.7 = 0.3
- idf3 = 3 - 2.9 = 0.1
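The slide's arithmetic implies base-10 logs; a two-line check reproduces the numbers:

```python
import math

# Reproducing the slide's example: idf_i = log10(N / n_i)
N = 1000
for i, n in enumerate([100, 500, 800], start=1):
    print(f"idf{i} = {math.log10(N / n):.1f}")
# idf1 = 1.0, idf2 = 0.3, idf3 = 0.1
```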
Vector Model Considered
• Most common model
• Advantages:
- term weighting improves quality of the answer set
- partial matching allows retrieval of docs that approximate the query conditions
- cosine ranking formula sorts documents according to degree of similarity to the query
- computationally efficient
• Disadvantages:
- assumes independence of index terms (not clear that this is bad, though)
- not best if applied exclusively
Probabilistic Model
• Uses a probabilistic framework to get a result
- Given a user query, there is an ideal answer set
- Querying is the specification of the properties of this ideal answer set (clustering)
• But what are these properties?
• Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
• Improve the probability by iteration & additional data
Probabilistic Model Iterated
• An initial set of documents is retrieved
• User inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
• IR system uses this information to refine the description of the ideal answer set
• By repeating this process, it is expected that the description of the ideal answer set will improve
• Keep in mind the need to guess the description of the ideal answer set at the very beginning
• The description of the ideal answer set is modeled in probabilistic terms
Probabilistic Ranking PrincipleProbabilistic Ranking Principle
• Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant).
• The model assumes that this probability of relevance depends on the query and the document representations only. Ideal answer set is referred to as R and should maximize the probability of relevance.
• Documents in the set R are predicted to be relevant
• How to compute the probabilities?
• What is the sample space?
Probabilistic Ranking
• Probabilistic ranking is computed as:
- sim(q,dj) = P(dj relevant to q) / P(dj non-relevant to q)
- This is the odds of the document dj being relevant
- Taking the odds minimizes the probability of an erroneous judgement
• Definitions:
- wij ∈ {0,1}
- P(R | dj): probability that the given doc is relevant
- P(¬R | dj): probability that the doc is not relevant
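A rough sketch of ranking by odds; the per-term probability estimates are invented initial guesses, and accumulating log-odds contributions only for matching query terms is one common simplification (the slides do not commit to a particular estimation formula):

```python
import math

def odds_score(doc_terms, query_terms, p_rel, p_nonrel):
    """Rank dj by log-odds of relevance to q.

    p_rel[k] and p_nonrel[k] are (guessed, then iteratively refined) estimates
    of P(term k present | relevant) and P(term k present | non-relevant);
    query terms are treated independently.
    """
    score = 0.0
    for k in query_terms:
        if k in doc_terms:
            score += math.log(p_rel[k] / p_nonrel[k])
    return score

# Initial guesses, later refined from the top-ranked documents the user inspects.
p_rel = {"golf": 0.8, "austin": 0.5}
p_nonrel = {"golf": 0.3, "austin": 0.4}
print(odds_score({"golf", "swing"}, ["golf", "austin"], p_rel, p_nonrel))
```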
Prob Ranking Pros & Cons
• Advantages:
- Docs ranked in decreasing order of probability of relevance
• Disadvantages:
- need to guess initial estimates for P(ki | R)
- method does not take into account tf and idf factors
Models Compared
• Boolean model does not provide for partial matches and is considered to be the weakest classic model
• Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections
• This seems also to be the view of the research community
Break!
• Fuzzy Models
• Extended Models
• Generalized Models
• Advanced Models
Set Theoretic Models
• The Boolean model imposes a binary criterion for deciding relevance
• How about something a little more complex?
• Extending the Boolean model to support partial matching and ranking is a goal
• Two set theoretic models
- Fuzzy Set Model
- Extended Boolean Model
Fuzzy Set Model
• Queries and docs represented by sets of index terms: matching is approximate from the start
• This vagueness can be modeled using a fuzzy framework, as follows:
- with each term is associated a fuzzy set
- each doc has a degree of membership in this fuzzy set
• This interpretation provides the foundation for many IR models based on fuzzy theory
• Here, we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)
Fuzzy Set Theory
• Framework for representing classes whose boundaries are not well defined
• Key idea is to introduce the notion of a degree of membership associated with the elements of a set
• This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership
• Thus, membership is now a gradual notion, contrary to the crisp notion enforced by classic Boolean logic
Fuzzy Information Retrieval
• Fuzzy sets are modeled based on a thesaurus
• This thesaurus is built as follows:
- Let vec(c) be a term-term correlation matrix
- Let c(i,l) be a normalized correlation factor for (ki,kl):
  c(i,l) = n(i,l) / (ni + nl - n(i,l))
- ni: number of docs which contain ki
- nl: number of docs which contain kl
- n(i,l): number of docs which contain both ki and kl
• This allows for proximity among index terms (sketched below)
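A small sketch of the normalized correlation factor c(i,l) over a made-up collection:

```python
def correlation_factor(docs, ki, kl):
    """Normalized term-term correlation c(i,l) = n(i,l) / (ni + nl - n(i,l))."""
    containing_i = {d for d, terms in docs.items() if ki in terms}
    containing_l = {d for d, terms in docs.items() if kl in terms}
    n_i, n_l = len(containing_i), len(containing_l)
    n_il = len(containing_i & containing_l)
    if n_i + n_l - n_il == 0:
        return 0.0
    return n_il / (n_i + n_l - n_il)

docs = {
    "d1": {"golf", "swing"},
    "d2": {"golf", "austin"},
    "d3": {"austin", "restaurant"},
}
print(correlation_factor(docs, "golf", "austin"))  # 1 / (2 + 2 - 1) = 1/3
```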
Fuzzy Model Issues
• The correlation factor c(i,l) can be used to define fuzzy set membership for a document dj as follows.
• Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
• Experiments with standard test collections are not available
• Difficult to compare at this time
Extended Boolean Model
• The Boolean model is simple and elegant
• But, no provision for a ranking
• As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
• Extend the Boolean model with the notions of partial matching and term weighting
• Combine characteristics of the Vector model with properties of Boolean algebra
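The slides do not give the extended Boolean formulas; one common formulation is the p-norm model of Salton, Fox, and Wu, sketched here with p = 2 and made-up term weights:

```python
import math

def sim_or(weights, p=2):
    """Extended Boolean OR (p-norm): rewards documents where any term weight is high."""
    return (sum(w ** p for w in weights) / len(weights)) ** (1 / p)

def sim_and(weights, p=2):
    """Extended Boolean AND (p-norm): pulled down by missing or low-weight terms."""
    return 1 - (sum((1 - w) ** p for w in weights) / len(weights)) ** (1 / p)

# A document with weights 0.9 and 0.2 for the two query terms:
print(round(sim_or([0.9, 0.2]), 2))   # ~0.65: partial credit instead of a strict yes/no
print(round(sim_and([0.9, 0.2]), 2))  # ~0.43: low, but not the 0 a strict AND would give
```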
Latent Semantic Indexing (LSI)
• Classic IR might lead to poor retrieval because:
- unrelated documents might be included in the answer set
- relevant documents that do not contain at least one index term are not retrieved
- Reasoning: retrieval based on index terms is vague and noisy
• The user information need is more related to concepts and ideas than to index terms
• A document that shares concepts with another document known to be relevant might be of interest
• Mapping documents and queries into a lower-dimensional space (i.e., composed of higher-level concepts, which are fewer in number than the index terms) reduces complexity
• Retrieval in this reduced concept space might be superior to retrieval in the space of index terms
• It allows reducing the complexity of the representational framework, which might be explored, for instance, for interfacing with the user
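A minimal sketch of the mapping into a reduced concept space via a truncated SVD of a made-up term-document matrix (this is the usual mechanism behind LSI, though the slides do not spell it out):

```python
import numpy as np

# Toy term-document matrix A (rows = index terms, columns = documents);
# the values are made-up tf-idf-style weights.
A = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.9, 0.7, 0.1, 0.0],
    [0.0, 0.1, 0.9, 1.0],
])

# A truncated SVD keeps only the k strongest "concepts".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_in_concept_space = (np.diag(s[:k]) @ Vt[:k]).T   # one row per document

# A query is folded into the same reduced space before scoring;
# here the score is a plain dot product (cosine would also normalize).
q = np.array([1.0, 0.0, 0.0])
q_in_concept_space = q @ U[:, :k]
print(docs_in_concept_space @ q_in_concept_space)  # higher score = closer in concept space
```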
Latent Semantic Indexing Issues
Neural Network Model
• Neural Networks:
- The human brain is composed of billions of neurons
- Each neuron can be viewed as a small processing unit
- A neuron is stimulated by input signals and emits output signals in reaction
- A chain reaction of propagating signals is called a spread activation process
- As a result of spread activation, the brain might command the body to take physical reactions
Neural Network Model
• A neural network is an oversimplified representation of the neuron interconnections in the human brain:
- nodes are processing units
- edges are synaptic connections
- the strength of a propagating signal is modelled by a weight assigned to each edge
- the state of a node is defined by its activation level
- depending on its activation level, a node might issue an output signal
• Neural nets are good at recognizing patterns
Neural Network Issues
• Very difficult to test (and understand)
• Training and working document set differences
• Has not been tested extensively
• May only be good for selected cases
• Improvement over traditional models not consistently proven
Alternative Probabilistic Models
• Probability Theory
- Semantically clear
- Computationally clumsy
• Why Bayesian Networks?
- Clear formalism to combine evidence
- Modularize the world (dependencies)
- Bayesian Network Models for IR:
• Inference Network (Turtle & Croft, 1991)
• Belief Network (Ribeiro-Neto & Muntz, 1996)
Inference Network Model
• Epistemological view of the IR problem
• Random variables associated with documents, index terms, and queries
• A random variable associated with a document dj represents the event of observing that document
• The prior probability P(dj) reflects the probability associated with the event of observing a given document dj
Belief Network Model
• Similar to the Inference Network Model
- Epistemological view of the IR problem
- Random variables associated with documents, index terms, and queries
• Contrary to the Inference Network Model
- Clearly defined sample space
- Set-theoretic view
- Different network topology
Bayesian Inference Models
• Probability based
- Frequency
- Empirical
- Chain of conditions (parents)
• In a Bayesian network, each variable is conditionally independent of all its non-descendants, given its parents
• Dynamic data difficulties
• Broad collections of data applicable
Model Comparisons
• The Inference Network model is the first and the most well known
• The Belief Network model
- adopts a set-theoretic view
- adopts a clearly defined sample space
- provides a separation between query and document portions
- is able to reproduce any ranking produced by the Inference Network, while the converse is not true (for example: the ranking of the standard vector model)
Model Comparisons
• Computational costs
- The Inference Network Model considers one document node at a time, so it is linear in the number of documents
- In the Belief Network, only the states that activate each query term are considered
- The networks do not impose additional costs because they do not include cycles
• The major strength is the combination of distinct evidential sources to support the rank of a given document
Structured Models
• Traditional models are (mostly) keyword-based
• They consider the documents to be flat, i.e., a word in the title has the same weight as a word in the body of the document
• Document structure is one additional piece of information which can be used
- Words appearing in the title or in sub-titles within the document
- Structured markup
What Could Be Improved?
• Advanced interfaces that facilitate the specification of structure are also highly desirable
• Hybrid models
- Combining models
- Combining views of the data
- Complex, phased processing of text
• Structured text models should support ranking
• Metadata should have an impact
• How can (Web) IR be better?
- Better IR models
- Better user interfaces
• More to find vs. easier to find
• Scriptable applications
• New interfaces for applications
• New datasets for applications
Projects and/or Papers Overview
Project Idea #1 – simple HTML
• Graphical Google
• What kind of document?
• When was the document created?