Special Topics in Computer Science: Advanced Topics in Information Retrieval. Chapter 2: Modeling
![Page 1: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/1.jpg)
Special Topics in Computer Science
Advanced Topics in Information Retrieval
Chapter 2: Modeling
Alexander Gelbukh
www.Gelbukh.com
![Page 2: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/2.jpg)
Previous chapter
User Information Need
  o Vague
  o Semantic, not formal
Document Relevance
  o Order, not retrieve
Huge amount of information
  o Efficiency concerns
  o Tradeoffs
Art more than science
![Page 3: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/3.jpg)
Modeling
Still science: computation is formal
No good methods to work with (vague) semantics
Thus, simplify to get a (formal) model
Develop (precise) math over this (simple) model
Why math if the model is not precise (simplified)?
  o With math: phenomenon ≈ model = step 1 = step 2 = ... = result
  o Without math: phenomenon ≈ model ≈ step 1 ≈ step 2 ≈ ... ≈ ?!
![Page 4: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/4.jpg)
Modeling
Substitute a complex real phenomenon with a simple model, which you can measure and manipulate formally
Keep only important properties (for this application)
Do this with text:
![Page 5: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/5.jpg)
Modeling in IR: idea
Tag documents with fields
  o As in a (relational) DB: customer = {name, age, address}
  o Unlike DB, very many fields: individual words!
  o E.g., bag of words: {word1, word2, ...}: {3, 5, 0, 0, 2, ...}
Define a similarity measure between query and such a record
  o (Unlike DB) Rank (order), not retrieve (yes/no)
  o Justify your model (optional, but nice)
Develop math and algorithms for fast access
  o as relational algebra in DB
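The bag-of-words idea above can be sketched in a few lines of Python. This is only an illustration: the vocabulary, the sample text, and the function name `bag_of_words` are made up here, and the tokenization (lowercase, split on whitespace) is a simplifying assumption.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return term counts for `text`, in the fixed order given by `vocabulary`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical toy vocabulary and document, for illustration only.
vocabulary = ["cat", "mouse", "dog", "bird", "cheese"]
doc = "cat mouse cat cheese mouse mouse cat mouse cheese mouse"

vector = bag_of_words(doc, vocabulary)
print(vector)
```

Each position of the vector plays the role of one "field" of the record, which is why such a record has as many fields as there are words in the vocabulary.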
![Page 6: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/6.jpg)
Taxonomy of IR systems
![Page 7: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/7.jpg)
Aspects of an IR system
IR model
  o Boolean, Vector, Probabilistic
Logical view of documents
  o Full text, bag of words, ...
User task
  o retrieval, browsing
These aspects are independent, though some combinations are more compatible than others
![Page 8: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/8.jpg)
Appropriate models
![Page 9: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/9.jpg)
Characterization of an IR model
D = {dj}, collection of formal representations of docs
  o e.g., keyword vectors
Q = {qi}, possible formal representations of user information need (queries)
F, framework for modeling these two: the basis for the ranking function
R(qi,dj): Q × D → ℝ, ranking function
  o defines ordering
![Page 10: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/10.jpg)
Specific IR models
![Page 11: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/11.jpg)
IR models
Classical
  o Boolean
  o Vector
  o Probabilistic
  (clear ideas, but some disadvantages)
Refined
  o Each one with refinements
  o Solve many of the problems of the “basic” models
  o Give good examples of possible developments in the area
  o Not investigated well. We can work on this
![Page 12: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/12.jpg)
Basic notions
Document: set of index terms
  o Mainly nouns
  o Maybe all, then full text logical view
Term weights
  o some terms are better than others
  o terms less frequent in this doc and more frequent in other docs are less useful
Document index term vector {w1j, w2j, ..., wtj}
  o weights of terms in the doc
  o t is the number of terms in all docs
  o weights of different terms are independent (simplification)
![Page 13: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/13.jpg)
Boolean model
Weights ∈ {0, 1}
  o Doc: set of words
Query: Boolean expression
  o R(qi,dj) ∈ {0, 1}
Good:
  o clear semantics, neat formalism, simple
Bad:
  o no ranking (in effect, data retrieval), retrieves too many or too few
  o difficult to translate User Information Need into query
No term weighting
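A minimal sketch of the Boolean model, assuming documents are plain Python sets of words and queries are nested tuples. The tuple encoding, the function names, and the toy documents are all hypothetical choices made for this illustration.

```python
def matches(query, doc_terms):
    """Evaluate a Boolean query against a document given as a set of words.

    A query is either a term (str) or a tuple:
    ("AND", q1, q2), ("OR", q1, q2) or ("NOT", q).
    """
    if isinstance(query, str):
        return query in doc_terms
    op = query[0]
    if op == "AND":
        return matches(query[1], doc_terms) and matches(query[2], doc_terms)
    if op == "OR":
        return matches(query[1], doc_terms) or matches(query[2], doc_terms)
    if op == "NOT":
        return not matches(query[1], doc_terms)
    raise ValueError(f"unknown operator: {op!r}")

def boolean_rank(query, doc_terms):
    """R(q, d) in {0, 1}: the document either satisfies the query or it does not."""
    return 1 if matches(query, doc_terms) else 0

# Hypothetical toy collection.
docs = {
    "d1": {"mouse", "cheese"},
    "d2": {"mouse", "cat"},
    "d3": {"dog"},
}
query = ("AND", "mouse", ("NOT", "cat"))
results = {name: boolean_rank(query, terms) for name, terms in docs.items()}
print(results)
```

The output has only two possible values per document, which illustrates the "no ranking" drawback: all matching documents are indistinguishable from one another.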
![Page 14: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/14.jpg)
Vector model
Weights (non-binary)
Ranking, much better results (for User Info Need)
R(qi,dj) = correlation between query vector and doc vector
E.g., cosine measure: (there is a typo in the book)
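The formula on this slide did not survive in the transcript. As a sketch, the standard cosine measure is the dot product of the two vectors divided by the product of their lengths, cos(q, d) = (q · d) / (|q| |d|). The code below assumes this standard definition; the vectors and names are illustrative only.

```python
import math

def cosine(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_q * norm_d)

# Hypothetical query and document vectors over a 5-term vocabulary.
query = [1, 1, 0, 0, 0]
docs = {
    "d1": [3, 5, 0, 0, 2],
    "d2": [0, 0, 4, 1, 0],
    "d3": [1, 0, 0, 0, 0],
}
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked)
```

Unlike the Boolean model, every document gets a real-valued score, so partial matches such as `d3` are ranked rather than rejected.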
![Page 15: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/15.jpg)
Projection
![Page 16: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/16.jpg)
Weights
How are the weights wij obtained? Many variants.
One way: TF-IDF balance
TF: Term frequency
  o How well is the term related to the doc?
  o If it appears many times, it is important
  o Proportional to the number of times it appears
IDF: Inverse document frequency
  o How important is the term for distinguishing documents?
  o If it appears in many docs, it is not important
  o Inversely proportional to the number of docs where it appears
Contradictory. How to balance?
![Page 17: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/17.jpg)
TF-IDF ranking
TF: Term frequency
IDF: Inverse document frequency
Balance: TF × IDF
  o Other formulas exist. Art.
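The exact formulas from this slide are not in the transcript. The sketch below assumes one common variant: TF is the term count normalized by the count of the most frequent term in the document, and IDF is the natural logarithm of N / n_i, where N is the number of documents and n_i is the number of documents containing the term. Other variants exist, as the slide notes; the function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        max_count = max(counts.values()) if counts else 1
        weights.append({
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy collection.
docs = [
    "mouse cheese mouse".split(),
    "mouse cat".split(),
    "dog cat".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document, "mouse" has the higher TF but "cheese" has the higher IDF, since it occurs in only one document. With this particular variant, a term that occurs in every document gets weight 0.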
![Page 18: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/18.jpg)
Advantages of vector model
One of the best known strategies
Improves quality (term weighting)
Allows approximate matching (partial matching)
Gives ranking by similarity (cosine formula)
Simple, fast
But:
Does not consider term dependencies
  o considering them in a bad way hurts quality
  o no known good way
No logical expressions (e.g., negation: “mouse & NOT cat”)
![Page 19: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/19.jpg)
Probabilistic model
Assumptions:
  o set of “relevant” docs,
  o probabilities of docs to be relevant
  o After Bayes calculation: probabilities of terms to be important for defining relevant docs
Initial idea: interact with the user.
  o Generate an initial set
  o Ask the user to mark some of them as relevant or not
  o Estimate the probabilities of keywords. Repeat
Can be done without user
  o Just re-calculate the probabilities assuming the user’s acceptance is the same as predicted ranking
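The user-free loop described above can be sketched as follows. The slides give no formulas, so everything numeric here is an assumption: binary term weights, a log-odds score per query term, smoothed initial estimates, and re-estimation from the top `top_v` documents of the previous ranking. The function name, parameters, and toy collection are hypothetical.

```python
import math

def probabilistic_rank(query_terms, docs, top_v=2, rounds=3):
    """Rank docs (dict: name -> set of terms) for a list of query terms.

    Assumed estimates (not stated in the slides):
      initial:  P(k|R) = 0.5,  P(k|not R) = (n_k + 0.5) / (N + 1)
      feedback: P(k|R) = (V_k + 0.5) / (V + 1)
                P(k|not R) = (n_k - V_k + 0.5) / (N - V + 1)
    where V is the number of top-ranked docs treated as relevant.
    """
    n = len(docs)
    doc_freq = {k: sum(k in terms for terms in docs.values()) for k in query_terms}
    p_rel = {k: 0.5 for k in query_terms}
    p_non = {k: (doc_freq[k] + 0.5) / (n + 1) for k in query_terms}

    def score(terms):
        # Sum of log-odds weights over query terms present in the document.
        return sum(
            math.log(p_rel[k] / (1 - p_rel[k]))
            + math.log((1 - p_non[k]) / p_non[k])
            for k in query_terms if k in terms
        )

    ranking = sorted(docs, key=lambda name: score(docs[name]), reverse=True)
    for _ in range(rounds):
        top = ranking[:top_v]  # pretend the user accepted the top docs
        v = len(top)
        for k in query_terms:
            v_k = sum(k in docs[name] for name in top)
            p_rel[k] = (v_k + 0.5) / (v + 1)
            p_non[k] = (doc_freq[k] - v_k + 0.5) / (n - v + 1)
        ranking = sorted(docs, key=lambda name: score(docs[name]), reverse=True)
    return ranking

# Hypothetical toy collection.
docs = {
    "d1": {"mouse", "cheese", "trap"},
    "d2": {"mouse", "cat"},
    "d3": {"cheese", "wine"},
    "d4": {"cheese", "bread"},
    "d5": {"dog", "cat"},
    "d6": {"dog", "bone"},
    "d7": {"bird", "seed"},
    "d8": {"wine", "bread"},
}
ranking = probabilistic_rank(["mouse", "cheese"], docs)
print(ranking)
```

The sketch also shows two of the weaknesses listed for this model: the first ranking depends entirely on guessed initial probabilities, and term frequencies inside a document play no role.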
![Page 20: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/20.jpg)
(Dis)advantages of Probabilistic model
Advantage:
Theoretical adequacy: ranks by probabilities
Disadvantages:
Need to guess the initial ranking
Binary weights, ignores frequencies
Independence assumption (not clear if bad)
Does not perform well (?)
![Page 21: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/21.jpg)
Alternative Set Theoretic models
Fuzzy set model
Takes into account term relationships (thesaurus)
  o Bible is related to Church
Fuzzy belonging of a term to a document
  o Document containing Bible also contains “a little bit of” Church, but not entirely
Fuzzy set logic applied to such fuzzy belonging
  o logical expressions with AND, OR, and NOT
Provides ranking, not just yes/no
Not investigated well.
  o Why not investigate it?
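A sketch of how the "a little bit of Church" idea could be computed. The slides do not give formulas, so the following are assumptions: the term-term correlation is the co-occurrence ratio c = n_ab / (n_a + n_b - n_ab), the membership of a term in a document is 1 minus the product of (1 - c) over the document's terms, and the fuzzy operators are the usual min, max, and complement. All names and the toy collection are hypothetical.

```python
def correlation(term_a, term_b, docs):
    """Co-occurrence based correlation between two terms over a collection."""
    n_a = sum(term_a in d for d in docs)
    n_b = sum(term_b in d for d in docs)
    n_ab = sum(term_a in d and term_b in d for d in docs)
    denom = n_a + n_b - n_ab
    return n_ab / denom if denom else 0.0

def membership(term, doc, docs):
    """Degree (0..1) to which `doc` belongs to the fuzzy set of `term`."""
    product = 1.0
    for other in doc:
        product *= 1.0 - correlation(term, other, docs)
    return 1.0 - product

# Standard fuzzy operators (an assumption; other choices exist).
def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_not(a):
    return 1.0 - a

# Hypothetical toy collection.
docs = [
    {"bible", "church"},
    {"bible", "church", "priest"},
    {"bible", "history"},
    {"church", "architecture"},
    {"football"},
]
doc = docs[2]  # contains "bible" but not "church"
church_degree = membership("church", doc, docs)
score = fuzzy_and(membership("bible", doc, docs), fuzzy_not(church_degree))
print(church_degree, score)
```

A document that literally contains a term gets membership 1 for it, because the term correlates perfectly with itself; documents with only related terms get intermediate values, which is what makes ranking possible.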
![Page 22: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/22.jpg)
Alternative Set Theoretic models
Extended Boolean model
Combination of Boolean and Vector
In comparison with Boolean model, adds “distance from query”
  o some documents satisfy the query better than others
In comparison with Vector model, adds the distinction between AND and OR combinations
There is a parameter (degree of norm) that adjusts the behavior between Boolean-like and Vector-like
This can even be different within one query
Not investigated well. Why not investigate it?
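The role of the norm parameter can be illustrated with the usual p-norm formulation of the extended Boolean model. The slides do not state the formulas, so the definitions below are assumed: term weights lie in [0, 1], an OR query measures distance from the all-zeros point, and an AND query measures distance from the all-ones point. Function names and sample weights are illustrative.

```python
def or_similarity(weights, p):
    """p-norm OR: how far the document is from having none of the terms."""
    m = len(weights)
    return (sum(w ** p for w in weights) / m) ** (1.0 / p)

def and_similarity(weights, p):
    """p-norm AND: how close the document is to having all of the terms."""
    m = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / m) ** (1.0 / p)

# Hypothetical weights of two query terms in one document.
weights = [0.8, 0.2]
for p in (1, 2, 50):
    print(p, round(or_similarity(weights, p), 3), round(and_similarity(weights, p), 3))
```

With p = 1 both operators reduce to the average of the weights, so AND and OR are indistinguishable (vector-like). As p grows, OR approaches the maximum weight and AND approaches the minimum, which is the fuzzy or Boolean-like behavior.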
![Page 23: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/23.jpg)
Alternative Algebraic models
Generalized Vector Space model
Classical independence assumptions:
  o All combinations of terms are possible, none are equivalent (= basis in the vector space)
  o Pair-wise orthogonal: cos ({ki}, {kj}) = 0
This model relaxes the pair-wise orthogonality: cos ({ki}, {kj}) ≠ 0
Operates by combinations (co-occurrences) of index terms, not individual terms
More complex, more expensive, not clear if better
Not investigated well. Why not investigate it?
![Page 24: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/24.jpg)
Alternative Algebraic models
Latent Semantic Indexing model
Index by larger units, “concepts”: sets of terms used together
Retrieve a document that shares concepts with a relevant one (even if it does not contain query terms)
Group index terms together (map into lower dimensional space). So some terms are equivalent.
  o Not exactly, but this is the idea
  o Eliminates unimportant details
  o Depends on a parameter (what details are unimportant?)
Not investigated well. Why not investigate it?
![Page 25: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/25.jpg)
Alternative Algebraic models
Neural Network model
NNs are good at matching
Iteratively uses the found documents as auxiliary queries
  o Spreading activation.
  o Terms → docs → terms → docs → terms → docs → ...
Like a built-in thesaurus
First round gives same result as Vector model
No evidence if it is good
Not investigated well. Why not investigate it?
![Page 26: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/26.jpg)
Models for browsing
Flat browsing: String
  o Just as a list of paper
  o No context cues provided
Structure guided: Tree
  o Hierarchy
  o Like directory tree in the computer
Hypertext (Internet!): Directed graph
  o No limitations of sequential writing
  o Modeled by a directed graph: links from unit A to unit B (units: docs, chapters, etc.)
  o A map (with traversed path) can be helpful
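The hypertext model can be sketched as an adjacency list plus a record of the traversed path. The unit names, the `links` dictionary, and the `follow` helper are hypothetical and serve only to illustrate the directed-graph view.

```python
# Directed graph: each unit maps to the units it links to.
links = {
    "intro": ["models", "glossary"],
    "models": ["boolean", "vector"],
    "boolean": ["models"],
    "vector": ["glossary"],
    "glossary": [],
}

def follow(path, target):
    """Follow a link from the last visited unit to `target`, extending the path."""
    current = path[-1]
    if target not in links.get(current, []):
        raise ValueError(f"no link from {current!r} to {target!r}")
    return path + [target]

# The traversed path is the information a navigation map would display.
path = ["intro"]
for target in ["models", "vector", "glossary"]:
    path = follow(path, target)
print(" -> ".join(path))
```

Because links are directed, the user can reach "glossary" from "vector" but cannot go back the same way, which is one reason a map with the traversed path helps.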
![Page 27: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/27.jpg)
Research issues
How do people judge relevance?
  o ranking strategies
How to combine different sources of evidence?
What interfaces can help users to understand and formulate their Information Need?
  o user interfaces: an open issue
Meta-search engines: combine results from different Web search engines
  o Their results almost do not intersect
  o How to combine rankings?
![Page 28: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/28.jpg)
Conclusions
Modeling is needed for formal operations
Boolean model is the simplest
Vector model is the best combination of quality and simplicity
  o TF-IDF term weighting
  o This (or similar) weighting is used in all further models
Many interesting and not well-investigated variations
  o possible future work
![Page 29: Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling](https://reader035.vdocuments.us/reader035/viewer/2022070400/56813527550346895d9c913e/html5/thumbnails/29.jpg)
Thank you!
Till March 22, 6 pm