Michael W. Berry, Xiaoyan (Kathy) Zhang, Padma Raghavan
Department of Computer Science, University of Tennessee
Level Search Filtering for
IR Model Reduction
IMA Hot Topics Workshop: Text Mining, Apr 17, 2000
Computational Models for IR
1. We need a framework for designing concept-based IR models.
2. Can we draw upon the backgrounds and experiences of computer scientists and mathematicians?
3. Effective indexing should address issues of scale and accuracy.
The Vector Space Model
Represent terms and documents as vectors in a k-dimensional space.
Similarity is computed by measures such as the cosine or Euclidean distance.
Early prototype: the SMART system developed by Salton et al. [1970s-80s].
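The cosine matching described above can be sketched with a toy term-by-document matrix; the matrix, terms, and query here are invented purely for illustration.

```python
import numpy as np

# Toy term-by-document matrix: rows are terms, columns are documents.
# Entry A[i, j] is the (weighted) frequency of term i in document j.
A = np.array([
    [1.0, 0.0, 1.0],   # "search"
    [0.0, 2.0, 1.0],   # "matrix"
    [1.0, 1.0, 0.0],   # "model"
])

def cosine_scores(A, q):
    """Cosine similarity between a query vector q and each document column."""
    doc_norms = np.linalg.norm(A, axis=0)
    return (A.T @ q) / (doc_norms * np.linalg.norm(q))

q = np.array([1.0, 0.0, 1.0])      # query mentions "search" and "model"
scores = cosine_scores(A, q)
ranking = np.argsort(-scores)      # documents ranked best-first
```

Document 1 matches the query exactly in direction, so it ranks first; length normalization is what lets cosine compare documents of different sizes.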
Motivation for LSI
Two fundamental query-matching problems:
  synonymy (image, likeness, portrait, facsimile, icon)
  polysemy (Adam's apple, patient's discharge, culture)
Motivation for LSI: Approach
Treat word-to-document association data as an unreliable estimate of a larger set of applicable words.
Goal: cluster similar documents, which may share no terms, in a low-dimensional subspace (to improve recall).
LSI Approach
Preprocessing: compute a low-rank approximation to the original (sparse) term-by-document matrix.
Vector space model: encode terms and documents using factors derived from the SVD (or ULV, SDD).
Postprocessing: rank the similarity of terms and documents to the query via Euclidean distances or cosines.
SVD Encoding
A_k is the best rank-k approximation to the term-by-document matrix A:
A_k = U_k Σ_k V_k^T
[Figure: the terms-by-docs matrix A factored into U_k (term vectors), Σ_k, and V_k^T (document vectors).]
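The rank-k encoding can be sketched with NumPy; the matrix here is a small random stand-in for a real sparse term-by-document matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))          # random stand-in for a term-by-document matrix

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

Ak = Uk @ Sk @ Vtk              # best rank-k approximation to A (Eckart-Young)

# Scaled term and document vectors used for query matching:
term_vectors = Uk @ Sk          # one row per term
doc_vectors = (Sk @ Vtk).T      # one row per document
```

For the collection sizes on the later slides one would compute only the leading k singular triplets of the sparse matrix (e.g. with a Lanczos-based solver) rather than a full dense SVD as above.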
Vector Space Dimension
We want the minimum number of factors (k) that discriminates most concepts.
In practice, k ranges between 100 and 300 but could be much larger.
Choosing the optimal k for different collections is challenging.
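One common heuristic, not prescribed by these slides, picks the smallest k whose leading singular values capture a given fraction of the squared spectral energy; the threshold and the example spectrum below are illustrative choices.

```python
import numpy as np

def choose_k(singular_values, energy=0.90):
    """Smallest k whose leading singular values capture the given
    fraction of the total squared spectral energy (a heuristic only;
    no single rule is optimal across collections)."""
    sq = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(sq) / sq.sum()
    return int(np.searchsorted(cumulative, energy) + 1)

k = choose_k([10.0, 5.0, 2.0, 1.0, 0.5])
```

Here the first two singular values already carry about 96% of the squared energy, so the heuristic returns k = 2.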
Strengths of LSI
Completely automatic: no stemming required, allows misspellings.
Multilanguage search capability: Landauer (Colorado), Littman (Duke).
Conceptual IR capability (recall): retrieves relevant documents that do not contain any search terms.
Changing the LSI Model
Updating:
  Folding-in new terms or documents [Deerwester et al. '90]
  SVD-updating [O'Brien '94], [Simon & Zha '97]
Downdating:
  Modify the SVD w.r.t. term or document deletions [Berry & Witter '98]
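A sketch of the folding-in update cited above, assuming the standard projection d_hat = d^T U_k Σ_k^{-1}; the matrix is a random stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 6))              # existing term-by-document matrix

k = 3
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk = U[:, :k], s[:k]

def fold_in_document(d, Uk, sk):
    """Project a new document's term vector into the existing k-dim
    LSI space without recomputing the SVD: d_hat = d^T U_k S_k^{-1}."""
    return (d @ Uk) / sk

d_new = rng.random(8)               # term vector of an unseen document
d_hat = fold_in_document(d_new, Uk, sk)
```

Folding in an existing column of A reproduces its original document vector (the corresponding column of V_k^T), which is a useful sanity check; the drawback, addressed by the SVD-updating work above, is that folded-in documents never influence the factors themselves.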
Recent LSI-based Research
Implementation of kd-trees to reduce query-matching complexity (Hughey & Berry '00, Information Retrieval).
Unsupervised learning model for data mining electronic-commerce data (J. Jiang et al. '99, IDA).
Recent LSI-based Research (cont.)
Nonlinear SVD approach for constraint-based feedback (E. Jiang & Berry '00, Linear Algebra and Its Applications).
Future incorporation of up- and downdating into LSI-based client/servers.
Information Filtering
Concept: reduce a large document collection to a reasonably sized set of potentially retrievable documents.
Goal: produce a relatively small subset containing a high proportion of relevant documents.
Approach: Level Search
Reduce the sparse SVD computation cost by selecting a small submatrix of the original term-by-document matrix.
Use an undirected graph model:
  terms and documents are vertices;
  term weights are edge weights;
  a term occurring in a document (or a document containing a term) defines an edge in the graph.
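The level-by-level expansion over this bipartite graph can be sketched as an alternating breadth-first search; the inverted index below is a made-up toy example.

```python
from collections import defaultdict

# Toy inverted index: which terms each document contains (and vice versa).
doc_terms = {
    "d1": {"matrix", "svd"},
    "d2": {"svd", "retrieval"},
    "d3": {"retrieval", "query"},
    "d4": {"graph", "pruning"},    # not reachable from the query below
}
term_docs = defaultdict(set)
for d, ts in doc_terms.items():
    for t in ts:
        term_docs[t].add(d)

def level_search(query_terms, max_levels=4):
    """Alternating BFS over the bipartite term-document graph:
    even levels add documents, odd levels add terms."""
    terms, docs = set(query_terms), set()
    frontier = set(query_terms)
    for level in range(2, max_levels + 1):
        if level % 2 == 0:          # expand terms -> documents
            frontier = {d for t in frontier for d in term_docs[t]} - docs
            docs |= frontier
        else:                       # expand documents -> terms
            frontier = {t for d in frontier for t in doc_terms[d]} - terms
            terms |= frontier
        if not frontier:
            break
    return terms, docs

terms, docs = level_search({"matrix"}, max_levels=4)
```

The selected terms and documents index the rows and columns of the submatrix passed on to LSI; components of the graph not reached from the query (d4 here) are filtered out entirely.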
Level Search
[Figure: level-search expansion from a query. Level 1: query terms (Terms 1-3); Level 2: documents containing those terms (Docs 1-4); Level 3: additional terms from those documents (Terms 5-8); Level 4: further documents (Docs 5-8).]
Evaluation Measures
Recall: ratio of the number of relevant documents retrieved to the total number of relevant documents.
Precision: ratio of the number of relevant documents retrieved to the total number of documents retrieved.
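The two ratios can be computed directly from retrieved and relevant document sets; the sets below are arbitrary examples.

```python
def recall_precision(retrieved, relevant):
    """Recall = |retrieved ∩ relevant| / |relevant|;
    precision = |retrieved ∩ relevant| / |retrieved|."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(relevant), hits / len(retrieved)

# 4 documents retrieved, 3 relevant overall, 2 of them retrieved:
r, p = recall_precision(retrieved={1, 2, 3, 4}, relevant={2, 4, 5})
```

Here recall is 2/3 and precision is 1/2; the precision-recall curves on the following slides plot precision at increasing levels of recall.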
Test Collections

Collection   Docs    Terms   Nonzeros
MEDLINE      1033     5831      52009
TIME          425    10804      68240
CISI         1469     5609      83602
FBIS         4974    42500    1573306
Avg Recall & Submatrix Sizes for Level Search

Collection   Avg R    %D     %T     %N
MEDLINE       85.7   24.8   63.2   27.8
TIME          69.4   15.3   61.9   22.7
CISI          55.1   21.4   64.1   25.2
FBIS          82.1   28.5   55.0   52.9
Mean          67.8   18.2   53.4   27.0
Results for MEDLINE
[Figure: precision vs. recall for MEDLINE (5,831 terms, 1,033 docs), comparing LSI only against level search plus LSI.]
Results for CISI
[Figure: precision vs. recall for CISI (5,609 terms, 1,469 docs), comparing LSI only against level search plus LSI.]
Results for TIME
[Figure: precision vs. recall for TIME (10,804 terms, 425 docs), comparing LSI only against level search plus LSI.]
Results for FBIS (TREC-5)
[Figure: precision vs. recall for FBIS (42,500 terms, 4,974 docs), comparing LSI only against level search plus LSI.]
Level Search with Pruning
[Figure: the level-search expansion graph as before (query terms at Level 1 through Docs 5-8 at Level 4), with singleton terms deleted.]
Prune terms to further reduce the submatrix and maintain recall; no effect on documents.
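The pruning step can be sketched as dropping term rows that occur in only one document of the level-search submatrix; the submatrix B below is hypothetical.

```python
import numpy as np

# Submatrix produced by level search: rows = terms, columns = documents.
B = np.array([
    [1, 0, 1],     # term in two docs: keep
    [0, 1, 0],     # singleton term: prune
    [1, 1, 1],     # keep
    [0, 0, 1],     # singleton term: prune
])

def prune_singleton_terms(B):
    """Drop term rows occurring in only one document; all document
    columns are kept, so pruning cannot remove retrievable documents."""
    doc_counts = (B != 0).sum(axis=1)
    return B[doc_counts > 1, :]

Bp = prune_singleton_terms(B)
```

Because only rows are removed, the document set (and hence recall over it) is untouched while the matrix handed to the sparse SVD shrinks.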
Effects of Pruning
[Figure: % nonzeros per collection (MEDLINE, CISI, TIME, FBIS, LATIMES) for LSI alone, LSI with level search (L), and LSI with level search plus pruning (LP).]
LSI input matrix density comparisons after level search filtering (L) and pruning (P).
LATIMES: 17,903 terms, 1,086 docs (TREC-5).
Effects of Pruning
[Figure: average precision (%) per collection (MEDLINE, CISI, TIME, FBIS, LATIMES) for LSI alone, LSI with level search (L), and LSI with level search plus pruning (LP).]
LSI average precision comparisons with/without level search (L) and/or pruning (P).
230 terms/doc, 29 terms/query.
Impact
Level Search is a simple and cost-effective filtering method for LSI that supports scalable IR.
It may reduce the effective term-by-document matrix size by 75% with no significant loss of LSI precision (less than 5%).
Some Future Challenges for LSI
Agent-based software for indexing remote/distributed collections
Effective updating with global weighting
Incorporating phrases and proximity
Expanding cosine matching to incorporate other similarity-based data (e.g., images)
Choosing the optimal number of dimensions
LSI Web Site
Investigators, Papers, Demos, Software
http://www.cs.utk.edu/~lsi
SIAM Book (June '99)
Document File Preparation
Vector Space Models
Matrix Decompositions
Query Management
Ranking & Relevance Feedback
User Interfaces
A Course Project
Further Reading
CIR00 Workshop
http://www.cs.utk.edu/cir00
10-22-00, Raleigh, NC
Invited Speakers: I. Dhillon (Texas), C. Ding (NERSC), K. Gallivan (FSU), D. Martin (UTK), H. Park (Minnesota), B. Pottenger (Lehigh), P. Raghavan (UTK), J. Wu (Boeing)