WIRED Week 2
• Syllabus Update (at least another week)
• Readings Overview
- Many ways to explain the same things
- Always think of the user
- Skip the math (mostly)
• Readings Review
- Most complicated of the entire semester
- Refer back to it often
• Readings Review - More Models
• Non-Linear
• Projects and/or Papers Overview
Opening Credits
• Material for these slides obtained from:
- Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/~hearst/irbook/
- Introduction to Modern Information Retrieval by Gerard Salton and Michael J. McGill, McGraw-Hill, 1983.
- Ray Mooney, CSE 8335, Spring 2003
- Joydeep Ghosh
Why IR?
• IR originally mostly for systems, not people
• IR in the last 25 years:
1. classification and categorization
2. systems and languages
3. user interfaces and visualization
• A small world of concern
• The Web changed everything
• Huge amount of accessible information
• Varied information sources
• Relatively easy to look for information
• Improving IR means improving learning
• Digital technology changes everything (again)
WIRED Focus
• Information Retrieval: representation, storage, organization of, and access to information items
• Focus is on the user information need
• User information need:
- Find all docs containing information on Austin which:
• Are hosted by utexas.edu
• Discuss restaurants
• Emphasis is on the retrieval of information (not data, not just a keyword match)
So just what is Information then?
• Oh no, not more about information....
• “The difference that makes a difference” – Gregory Bateson
• Element in the communications process
- Information Theory
- Data
• Something that informs a user
- Not just Data
• “Orange”
• “1,741,405.339”
- Helps users learn
- Helps users make decisions
- Facts (in context)
Differences
• Data retrieval
- Keyword match for documents
- Well-defined semantics
- A single erroneous object implies failure
• Information retrieval
- Information about a subject or topic
- Loose semantics (language issues)
- Small errors are tolerated
• IR system:
- interprets contents of information items (documents)
- generates a ranking which reflects relevance for the query
- notion of relevance is most important
Let’s Talk about Finding
• How many ways are there to find something?
• Find a specific thing?
• Find a concept?
• Learn about a topic?
Ways of Finding are IR models
• Each way of finding is one type of Information Retrieval
• Different ways to search for a person, place, or thing
• Real-life information retrieval combines several of the methods
- All at once
- In succession
User Interaction with (IR) System
[Diagram: the user interacts with the “Database” through two tasks, Retrieval and Browsing]
• Users can do 2 distinct tasks
- Which one is more important?
- Can you do both at once?
- Which is more difficult?
• Leverage the strengths of each
Quick Overview of the IR Process
[Diagram: the user's information need is expressed as a query, documents are processed into an index, and a matching step produces a ranking of documents]
How do you get an index?
[Diagram: Docs pass through text operations (structure recognition, accents and spacing, stopword removal, noun groups, stemming); together with manual indexing, the logical view is condensed from structure and full text down to index terms]
• A “bag of words” with these logical parts that are processed
• The Web has (some) structure in markup languages
• Follow from left to right to see the document get condensed
• Structure + the right logical parts (+ things added) = Index
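To make that condensing concrete, here is a minimal Python sketch of the left-to-right pipeline; the stopword list and the tiny suffix-stripping "stemmer" are simplified stand-ins for what a real system (e.g., a Porter stemmer) would use.

```python
import re

# Minimal stand-ins for the text operations on the slide; a real system
# would use a fuller stopword list and a proper stemmer.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "for"}

def simple_stem(word):
    """Crude suffix stripping: connecting/connection/connections -> connect."""
    for suffix in ("ions", "ing", "ion", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Condense raw text into a bag of index terms."""
    # Accents/spacing: lowercase and keep only letter runs.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stopwords: drop words that carry little meaning on their own.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Stemming: conflate morphological variants.
    return [simple_stem(t) for t in tokens]

print(index_terms("Connecting to the golf connections of Austin"))
# ['connect', 'golf', 'connect', 'austin']
```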
Our friend the Index
• IR systems usually rely on an index or indices to process queries
• Index term:
- a keyword or group of selected words
• “Golf” or “Golf Swing”
- any word (not always actually in the document!)
• “Waste of time”
• Stemming might be used:
- connect: connecting, connection, connections
• An inverted file is built for the chosen index terms (sketched below)
- A list of each keyword (or keyword group) and its location in the document(s)
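A minimal sketch of how such an inverted file could be built; the document contents are made up, and in practice the terms would come from a text-operations pipeline like the one sketched above.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each index term to the documents (and positions) where it occurs.

    `docs` is a dict of doc_id -> list of index terms.
    """
    inverted = defaultdict(list)
    for doc_id, terms in docs.items():
        for position, term in enumerate(terms):
            inverted[term].append((doc_id, position))
    return dict(inverted)

docs = {
    "d1": ["golf", "swing", "austin"],
    "d2": ["austin", "restaurant"],
}
print(build_inverted_index(docs))
# {'golf': [('d1', 0)], 'swing': [('d1', 1)],
#  'austin': [('d1', 2), ('d2', 0)], 'restaurant': [('d2', 1)]}
```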
Indexes, huh?
• Matching at the index term level is not always the best way to find
• Users get confused and frustrated (without an understanding of the IR Model)
• Search features are not easy
- Number of search terms is small
- Results not presented well
• Web searching actually not so easy either
- Junk pages
- Ranking games
- Duplicate information
- Bad page or site Information Architecture
More about indexes
• Indexes make sense of the order (to the system)
• Relevance is measured for the user’s query among the indices
• Ranking of results is how the index is exposed to the user.
Ranking
• A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query
• A ranking is based on fundamental premises regarding the notion of relevance, such as:
- common sets of index terms
- sharing of weighted terms
- likelihood of relevance
• Each set of premises leads to a distinct IR model
Relevance
• Relevance means the correct document for your situation
• Relevance feedback is doing the search again with changes to the search terms
- By the user
- By the system
• Also called Query Reformulation
• Relevance is based on the user
• A huge area for improvement in search interfaces
IR Models
• Classic IR Models
- Boolean Model
- Vector Model
- Probabilistic Model
• Set Theoretic Models
- Fuzzy Set Model
- Extended Boolean Model
• Generalized Vector Model
• Latent Semantic Indexing
• Neural Network Model
Types of IR Models
[Taxonomy diagram, organized by user task:]
• Retrieval (ad hoc, filtering)
- Classic Models: Boolean, Vector, Probabilistic
- Set Theoretic: Fuzzy, Extended Boolean
- Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
- Probabilistic: Inference Network, Belief Network
- Structured Models: Non-Overlapping Lists, Proximal Nodes
• Browsing: Flat, Structure Guided, Hypertext
The "Usual Suspects”
[Table: models by user task and logical view of documents]
• Retrieval
- Index Terms: Classic, Set Theoretic, Algebraic, Probabilistic
- Full Text: Classic, Set Theoretic, Algebraic, Probabilistic
- Full Text + Structure: Structured
• Browsing
- Index Terms: Flat
- Full Text: Flat, Hypertext
- Full Text + Structure: Structure Guided, Hypertext
Classic IR Models - Basics
• Each document is represented by a set of representative keywords or index terms
• An index term is a word from the document that describes the document
- Classically, index terms are nouns because nouns have meaning by themselves
- Most human indexers use nouns & verbs
- Now, search engines assume that all words are index terms (full text representation)
• A great index has words added to it for additional meaning and context
Classic IR Models – More Basics
• Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents
• The importance of the index terms is represented by weights associated with them
- ki - an index term
- dj - a document
- F - the framework for document representations
- R - a ranking function in relation to query & document
- wij - a weight associated with (ki,dj)
- The weight wij quantifies the importance of the index term for describing the document contents
Classic IR Models – Basic Variables
- t is the total number of index terms
- K = {k1, k2, …, kt} is the set of all index terms
- wij >= 0 is a weight associated with (ki,dj)
- wij = 0 indicates that term does not belong to doc
- dj= (w1j, w2j, …, wtj) is a weighted vector associated with the document dj
- gi(dj) = wij is a function which returns the weight associated with pair (ki,dj)
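As a toy illustration of this notation (the weight values below are made up), a document can be held as a weighted vector and gi(dj) is just a lookup:

```python
# t = 3 index terms and one document dj represented as a weighted vector.
K = ["golf", "swing", "austin"]   # k1, k2, k3
dj = (0.8, 0.5, 0.0)              # (w1j, w2j, w3j); w3j = 0 means k3 does not belong to dj

def g(i, d):
    """g_i(d_j): return the weight associated with the pair (k_i, d_j).

    Python indexing is 0-based, so i = 0 corresponds to k1.
    """
    return d[i]

print(g(1, dj))  # weight of k2 ("swing") in dj -> 0.5
```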
Wait a minute!
• Why are these variables so poorly labeled and organized?
• Tradition
• Security through obscurity?
• Difference from other mathematical symbolism?
• Sadly, not consistent even in the IR literature
Boolean Model
• Simple model based on set theory
• Queries specified as Boolean expressions
- precise semantics
- neat formalism
- q = ka ∧ (kb ∨ ¬kc)
• Terms are either present or absent: wij ∈ {0,1}
• Consider
- q = ka ∧ (kb ∨ ¬kc)
- qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
- qcc = (1,1,0) is a conjunctive component
[Venn diagram over ka, kb, kc showing the regions (1,1,1), (1,1,0), and (1,0,0) covered by the query]
Boolean Model
• q = ka ∧ (kb ∨ ¬kc)
• sim(q,dj) = 1 if ∃ vec(qcc) | (vec(qcc) ∈ vec(qdnf)) ∧ (∀ ki, gi(vec(dj)) = gi(vec(qcc))); 0 otherwise
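A small sketch of that definition for the example query; the weight patterns and documents are made up, and Python's and/or/not stand in for the Boolean connectives:

```python
from itertools import product

# Binary term weights for the example terms ka, kb, kc.
TERMS = ["ka", "kb", "kc"]

def query(weights):
    """The example query q = ka AND (kb OR NOT kc), over binary weights."""
    a, b, c = weights
    return a and (b or not c)

# Disjunctive normal form of q: every binary weight pattern that satisfies it.
q_dnf = [w for w in product((1, 0), repeat=3) if query(w)]
print(q_dnf)  # [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

def sim(doc_weights):
    """sim(q, dj) = 1 if dj's weight pattern matches a conjunctive component."""
    return 1 if tuple(doc_weights) in q_dnf else 0

print(sim((1, 1, 0)))  # 1: dj contains ka and kb but not kc
print(sim((0, 1, 1)))  # 0: ka is missing, so the document is not retrieved
```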
Boolean Model Problems
• Retrieval is binary: no partial matching
• No ranking of the documents is provided (absence of a grading scale)
• Users aren't good at Boolean queries
- too simplistic
- wrong
• As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
Vector Model
• Use of binary weights is too limiting
• Non-binary weights provide consideration for partial matches
• These term weights are used to compute a degree of similarity between a query and each document
• Ranked set of documents provides for better matching
Vector Model
• wij > 0 whenever ki appears in dj
• wiq >= 0 is associated with the pair (ki,q)
• dj = (w1j, w2j, ..., wtj)
• q = (w1q, w2q, ..., wtq)
• To each term ki is associated a unitary vector vec(i)
• The unitary vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
• The t unitary vectors form an orthonormal basis for a t-dimensional space in which queries and documents are represented as weighted vectors
[Diagram: the angle θ between the document vector dj and the query vector q]
Vector Model
• sim(q,dj) = cos(θ)
- = (vec(dj) • vec(q)) / (|dj| × |q|)
- = (Σi wij × wiq) / (|dj| × |q|)
• Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1
• A document is retrieved even if it matches the query terms only partially
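A minimal cosine-similarity sketch over weighted term vectors (the weight values are made up):

```python
import math

def cosine_sim(d, q):
    """sim(q, dj) = (dj . q) / (|dj| * |q|) over weighted term vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

dj = [0.8, 0.5, 0.0]   # weights w1j..w3j
q  = [1.0, 0.0, 0.4]   # query weights w1q..w3q
print(round(cosine_sim(dj, q), 3))  # a partial match still yields a non-zero score
```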
Vector Model
• sim(q,dj) = (Σi wij × wiq) / (|dj| × |q|)
• How to compute the weights wij and wiq?
• A good weight must take into account two effects:
- quantification of intra-document contents (similarity)
• tf factor, the term frequency within a document
- quantification of inter-document separation (dissimilarity)
• idf factor, the inverse document frequency
- wij = tf(i,j) × idf(i)
Vector Model
• Let
- N be the total number of docs in the collection
- ni be the number of docs which contain ki
- freq(i,j) be the raw frequency of ki within dj
• A normalized tf factor is given by
- f(i,j) = freq(i,j) / maxl freq(l,j)
- where the maximum is computed over all terms which occur within the document dj
• The idf factor is computed as
- idf(i) = log(N/ni)
- the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
Vector Model
• The best term-weighting schemes take both into account:
- wij = f(i,j) × log(N/ni)
• This strategy is called a tf-idf weighting scheme (sketched below)
• Term frequency – inverse document frequency
- How frequent is the word in this document, and how rare is it in the rest of the collection?
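A short sketch of the tf-idf weighting described above, using base-10 logs and a tiny made-up collection:

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Compute wij = f(i,j) * log10(N / ni) for every term in every document.

    `docs` maps doc_id -> list of index terms. f(i,j) is the raw frequency
    normalized by the most frequent term in that document.
    """
    N = len(docs)
    # ni: number of documents containing term ki.
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))

    weights = {}
    for doc_id, terms in docs.items():
        freq = Counter(terms)
        max_freq = max(freq.values())
        weights[doc_id] = {
            term: (count / max_freq) * math.log10(N / df[term])
            for term, count in freq.items()
        }
    return weights

docs = {
    "d1": ["golf", "golf", "swing"],
    "d2": ["golf", "austin"],
    "d3": ["austin", "restaurant"],
}
print(tf_idf_weights(docs)["d1"])
# "golf" is frequent in d1 but also common across the collection,
# so its weight is damped by the idf factor.
```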
Vector Model
• For the query term weights, a suggestion is
- wiq = (0.5 + 0.5 × freq(i,q) / maxl freq(l,q)) × log(N/ni)
• The vector model with tf-idf weights is a good ranking strategy with general collections
• The vector model is usually as good as any known ranking alternative
• It is also simple and fast to compute
Weights wij and wiq?
• One approach is to examine the frequency of occurrence of a word in a document
• Absolute frequency:
- tf factor, the term frequency within a document
- freq(i,j): raw frequency of ki within dj
- Both high-frequency and low-frequency terms may not actually be significant
• Relative frequency: tf divided by the number of words in the document
• Normalized frequency:
- f(i,j) = freq(i,j) / maxl freq(l,j)
Inverse Document Frequency
• Importance of a term may depend more on how well it can distinguish between documents
• Quantification of inter-document separation
• Dissimilarity, not similarity
• idf factor, the inverse document frequency
IDF
• N = the total number of docs in the collection
• ni = the number of docs which contain ki
• The idf factor is computed as
- idfi = log(N/ni) (base-10 logs in the example below)
- the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
• For example:
- N = 1000, n1 = 100, n2 = 500, n3 = 800
- idf1 = 3 - 2 = 1
- idf2 = 3 - 2.7 = 0.3
- idf3 = 3 - 2.9 = 0.1
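The slide's arithmetic implies base-10 logs; a two-line check reproduces the numbers:

```python
import math

# Reproducing the slide's example: idf_i = log10(N / n_i)
N = 1000
for i, n in enumerate([100, 500, 800], start=1):
    print(f"idf{i} = {math.log10(N / n):.1f}")
# idf1 = 1.0, idf2 = 0.3, idf3 = 0.1
```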
Vector Model Considered
• Most common model
• Advantages:
- term weighting improves quality of the answer set
- partial matching allows retrieval of docs that approximate the query conditions
- cosine ranking formula sorts documents according to degree of similarity to the query
- computationally efficient
• Disadvantages:
- assumes independence of index terms (not clear that this is bad, though)
- not best if applied exclusively
Probabilistic Model
• Uses a probabilistic framework to get a result
- Given a user query, there is an ideal answer set
- Querying is the specification of the properties of this ideal answer set (clustering)
• But what are these properties?
• Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
• Improve the probability by iteration & additional data
Probabilistic Model Iterated
• An initial set of documents is retrieved
• User inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
• IR system uses this information to refine the description of the ideal answer set
• By repeating this process, it is expected that the description of the ideal answer set will improve
• Keep in mind the need to guess the description of the ideal answer set at the very beginning
• The description of the ideal answer set is modeled in probabilistic terms
Probabilistic Ranking PrincipleProbabilistic Ranking Principle
• Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant).
• The model assumes that this probability of relevance depends on the query and the document representations only. Ideal answer set is referred to as R and should maximize the probability of relevance.
• Documents in the set R are predicted to be relevant
• How to compute the probabilities?
• What is the sample space?
Probabilistic Ranking
• Probabilistic ranking is computed as:
- sim(q,dj) = P(dj relevant to q) / P(dj non-relevant to q)
- This is the odds of the document dj being relevant
- Taking the odds minimizes the probability of an erroneous judgement
• Definitions:
- wij ∈ {0,1}
- P(R | dj): probability that the given doc is relevant
- P(¬R | dj): probability that the doc is not relevant
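A rough sketch of ranking by odds; the per-term probability estimates are invented initial guesses, and accumulating log-odds contributions only for matching query terms is one common simplification (the slides do not commit to a particular estimation formula):

```python
import math

def odds_score(doc_terms, query_terms, p_rel, p_nonrel):
    """Rank dj by log-odds of relevance to q.

    p_rel[k] and p_nonrel[k] are (guessed, then iteratively refined) estimates
    of P(term k present | relevant) and P(term k present | non-relevant);
    query terms are treated independently.
    """
    score = 0.0
    for k in query_terms:
        if k in doc_terms:
            score += math.log(p_rel[k] / p_nonrel[k])
    return score

# Initial guesses, later refined from the top-ranked documents the user inspects.
p_rel = {"golf": 0.8, "austin": 0.5}
p_nonrel = {"golf": 0.3, "austin": 0.4}
print(odds_score({"golf", "swing"}, ["golf", "austin"], p_rel, p_nonrel))
```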
Prob Ranking Pros & Cons
• Advantages:
- Docs ranked in decreasing order of probability of relevance
• Disadvantages:
- need to guess initial estimates for P(ki | R)
- method does not take into account tf and idf factors
Models Compared
• Boolean model does not provide for partial matches and is considered to be the weakest classic model
• Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections
• This seems also to be the view of the research community
Break!
• Fuzzy Models
• Extended Models
• Generalized Models
• Advanced Models
Set Theoretic Models
• The Boolean model imposes a binary criterion for deciding relevance
• How about something a little more complex?
• Extending the Boolean model to support partial matching and ranking is a goal
• Two set theoretic models
- Fuzzy Set Model
- Extended Boolean Model
Fuzzy Set Model
• Queries and docs represented by sets of index terms: matching is approximate from the start
• This vagueness can be modeled using a fuzzy framework, as follows:
- with each term is associated a fuzzy set
- each doc has a degree of membership in this fuzzy set
• This interpretation provides the foundation for many IR models based on fuzzy theory
• Here, we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)
Fuzzy Set Theory
• Framework for representing classes whose boundaries are not well defined
• Key idea is to introduce the notion of a degree of membership associated with the elements of a set
• This degree of membership varies from 0 to 1 and allows modeling the notion of marginal membership
• Thus, membership is now a gradual notion, contrary to the crisp notion enforced by classic Boolean logic
Fuzzy Information Retrieval
• Fuzzy sets are modeled based on a thesaurus
• This thesaurus is built as follows:
- Let vec(c) be a term-term correlation matrix
- Let c(i,l) be a normalized correlation factor for (ki,kl):
  c(i,l) = n(i,l) / (ni + nl - n(i,l))
- ni: number of docs which contain ki
- nl: number of docs which contain kl
- n(i,l): number of docs which contain both ki and kl
• This allows for proximity among index terms (sketched below)
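A small sketch of the normalized correlation factor c(i,l) over a made-up collection:

```python
def correlation_factor(docs, ki, kl):
    """Normalized term-term correlation c(i,l) = n(i,l) / (ni + nl - n(i,l))."""
    containing_i = {d for d, terms in docs.items() if ki in terms}
    containing_l = {d for d, terms in docs.items() if kl in terms}
    n_i, n_l = len(containing_i), len(containing_l)
    n_il = len(containing_i & containing_l)
    if n_i + n_l - n_il == 0:
        return 0.0
    return n_il / (n_i + n_l - n_il)

docs = {
    "d1": {"golf", "swing"},
    "d2": {"golf", "austin"},
    "d3": {"austin", "restaurant"},
}
print(correlation_factor(docs, "golf", "austin"))  # 1 / (2 + 2 - 1) = 1/3
```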
Fuzzy Model Issues
• The correlation factor c(i,l) can be used to define fuzzy set membership for a document dj as follows.
• Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory
• Experiments with standard test collections are not available
• Difficult to compare at this time
Extended Boolean Model
• The Boolean model is simple and elegant
• But, no provision for a ranking
• As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership
• Extend the Boolean model with the notions of partial matching and term weighting
• Combine characteristics of the Vector model with properties of Boolean algebra
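The slides do not give the extended Boolean formulas; one common formulation is the p-norm model of Salton, Fox, and Wu, sketched here with p = 2 and made-up term weights:

```python
import math

def sim_or(weights, p=2):
    """Extended Boolean OR (p-norm): rewards documents where any term weight is high."""
    return (sum(w ** p for w in weights) / len(weights)) ** (1 / p)

def sim_and(weights, p=2):
    """Extended Boolean AND (p-norm): pulled down by missing or low-weight terms."""
    return 1 - (sum((1 - w) ** p for w in weights) / len(weights)) ** (1 / p)

# A document with weights 0.9 and 0.2 for the two query terms:
print(round(sim_or([0.9, 0.2]), 2))   # ~0.65: partial credit instead of a strict yes/no
print(round(sim_and([0.9, 0.2]), 2))  # ~0.43: low, but not the 0 a strict AND would give
```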
Latent Semantic Indexing (LSI)
• Classic IR might lead to poor retrieval because:
- unrelated documents might be included in the answer set
- relevant documents that do not contain at least one index term are not retrieved
- Reasoning: retrieval based on index terms is vague and noisy
• The user information need is more related to concepts and ideas than to index terms
• A document that shares concepts with another document known to be relevant might be of interest
• Mapping documents and queries into a lower-dimensional space (i.e., composed of higher-level concepts, which are fewer in number than the index terms) reduces complexity
• Retrieval in this reduced concept space might be superior to retrieval in the space of index terms
• It allows reducing the complexity of the representational framework, which might be explored, for instance, for interfacing with the user
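A minimal sketch of the mapping into a reduced concept space via a truncated SVD of a made-up term-document matrix (this is the usual mechanism behind LSI, though the slides do not spell it out):

```python
import numpy as np

# Toy term-document matrix A (rows = index terms, columns = documents);
# the values are made-up tf-idf-style weights.
A = np.array([
    [1.0, 0.8, 0.0, 0.0],
    [0.9, 0.7, 0.1, 0.0],
    [0.0, 0.1, 0.9, 1.0],
])

# A truncated SVD keeps only the k strongest "concepts".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_in_concept_space = (np.diag(s[:k]) @ Vt[:k]).T   # one row per document

# A query is folded into the same reduced space before scoring;
# here the score is a plain dot product (cosine would also normalize).
q = np.array([1.0, 0.0, 0.0])
q_in_concept_space = q @ U[:, :k]
print(docs_in_concept_space @ q_in_concept_space)  # higher score = closer in concept space
```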
Latent Semantic Indexing Issues
Neural Network Model
• Neural Networks:
- The human brain is composed of billions of neurons
- Each neuron can be viewed as a small processing unit
- A neuron is stimulated by input signals and emits output signals in reaction
- A chain reaction of propagating signals is called a spread activation process
- As a result of spread activation, the brain might command the body to take physical reactions
Neural Network Model
• A neural network is an oversimplified representation of the neuron interconnections in the human brain:
- nodes are processing units
- edges are synaptic connections
- the strength of a propagating signal is modelled by a weight assigned to each edge
- the state of a node is defined by its activation level
- depending on its activation level, a node might issue an output signal
• Neural nets are good at recognizing patterns
Neural Network Issues
• Very difficult to test (and understand)
• Training and working document set differences
• Has not been tested extensively
• May only be good for selected cases
• Improvement over traditional models not consistently proven
Alternative Probabilistic Models
• Probability Theory
- Semantically clear
- Computationally clumsy
• Why Bayesian Networks?
- Clear formalism to combine evidence
- Modularize the world (dependencies)
- Bayesian Network Models for IR:
• Inference Network (Turtle & Croft, 1991)
• Belief Network (Ribeiro-Neto & Muntz, 1996)
Inference Network Model
• Epistemological view of the IR problem
• Random variables associated with documents, index terms, and queries
• A random variable associated with a document dj represents the event of observing that document
• The prior probability P(dj) reflects the probability associated with the event of observing a given document dj
Belief Network Model
• Similar to the Inference Network Model
- Epistemological view of the IR problem
- Random variables associated with documents, index terms, and queries
• Contrary to the Inference Network Model
- Clearly defined sample space
- Set-theoretic view
- Different network topology
Bayesian Inference Models
• Probability based
- Frequency
- Empirical
- Chain of conditions (parents)
• In a Bayesian network, each variable is conditionally independent of all its non-descendants, given its parents
• Dynamic data difficulties
• Broad collections of data applicable
Model Comparisons
• The Inference Network model is the first and the most well known
• The Belief Network model
- adopts a set-theoretic view
- adopts a clearly defined sample space
- provides a separation between query and document portions
- is able to reproduce any ranking produced by the Inference Network, while the converse is not true (for example: the ranking of the standard vector model)
Model Comparisons
• Computational costs
- The Inference Network Model considers one document node at a time, so it is linear in the number of documents
- In the Belief Network, only the states that activate each query term are considered
- The networks do not impose additional costs because they do not include cycles
• The major strength is the combination of distinct evidential sources to support the rank of a given document
Structured Models
• Traditional models are (mostly) keyword-based
• They consider the documents to be flat, i.e., a word in the title has the same weight as a word in the body of the document
• Document structure is one additional piece of information which can be used
- Words appearing in the title or in sub-titles within the document
- Structured markup
What Could Be Improved?
• Advanced interfaces that facilitate the specification of structure are also highly desirable
• Hybrid models
- Combining models
- Combining views of the data
- Complex, phased processing of text
• Structured text models should support ranking
• Metadata should have an impact
• How can (Web) IR be better?
- Better IR models
- Better user interfaces
• More to find vs. easier to find
• Scriptable applications
• New interfaces for applications
• New datasets for applications
Projects and/or Papers Overview
Project Idea #1 – simple HTML
• Graphical Google
• What kind of document?
• When was the document created?