temple university – cis dept. cis616– principles of data management

111
Temple University – CIS Dept. CIS616– Principles of Data Management V. Megalooikonomou Text Databases (some slides are based on notes by C. Faloutsos)

Upload: emery-flores

Post on 30-Dec-2015

22 views

Category:

Documents


0 download

DESCRIPTION

Temple University – CIS Dept. CIS616– Principles of Data Management. V. Megalooikonomou Text Databases (some slides are based on notes by C. Faloutsos). Text - Detailed outline. text problem full text scanning inversion signature files clustering information filtering and LSI. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Temple University – CIS Dept. CIS616– Principles of Data Management

Temple University – CIS Dept.CIS616– Principles of Data Management

V. MegalooikonomouText Databases

(some slides are based on notes by C. Faloutsos)

Page 2: Temple University – CIS Dept. CIS616– Principles of Data Management

Text - Detailed outline

text problem full text scanning inversion signature files clustering information filtering and LSI

Page 3: Temple University – CIS Dept. CIS616– Principles of Data Management

Problem - Motivation

Eg., find documents containing “data”, “retrieval”

Applications:

Page 4: Temple University – CIS Dept. CIS616– Principles of Data Management

Problem - Motivation

Eg., find documents containing “data”, “retrieval”

Applications: Web law + patent offices digital libraries information filtering

Page 5: Temple University – CIS Dept. CIS616– Principles of Data Management

Problem - Motivation

Types of queries: boolean (‘data’ AND ‘retrieval’ AND

NOT ...)

Page 6: Temple University – CIS Dept. CIS616– Principles of Data Management

Problem - Motivation

Types of queries: boolean (‘data’ AND ‘retrieval’ AND

NOT ...) additional features (‘data’ ADJACENT

‘retrieval’) keyword queries (‘data’, ‘retrieval’)

How to search a large collection of documents?

Page 7: Temple University – CIS Dept. CIS616– Principles of Data Management

Full-text scanning

Build a FSA; scan

ca

t

Page 8: Temple University – CIS Dept. CIS616– Principles of Data Management

Full-text scanning

for single term: (naive: O(N*M))

ABRACADABRA text

CAB pattern

Page 9: Temple University – CIS Dept. CIS616– Principles of Data Management

Full-text scanning

for single term: (naive: O(N*M)) Knuth Morris and Pratt (‘77)

build a small FSA; visit every text letter once only, by carefully shifting more than one step

ABRACADABRA text

CAB pattern

Page 10: Temple University – CIS Dept. CIS616– Principles of Data Management

Full-text scanning

ABRACADABRA text

CAB pattern

CAB

CAB

CAB

...

Page 11: Temple University – CIS Dept. CIS616– Principles of Data Management

Full-text scanning

for single term: (naive: O(N*M)) Knuth Morris and Pratt (‘77) Boyer and Moore (‘77)

preprocess pattern; start from right to left & skip!

ABRACADABRA text

CAB pattern

Page 12: Temple University – CIS Dept. CIS616– Principles of Data Management

Full-text scanning

ABRACADABRA text

CAB pattern

CAB

CAB

CAB

Page 13: Temple University – CIS Dept. CIS616– Principles of Data Management

Full-text scanning

ABRACADABRA text

OMINOUS pattern

OMINOUS

Boyer+Moore: fastest, in practiceSunday (‘90): some improvements

Page 14: Temple University – CIS Dept. CIS616– Principles of Data Management

Full-text scanning

For multiple terms (w/o “don’t care” characters): Aho+Corasic (‘75) again, build a simplified FSA in O(M)

time Probabilistic algorithms:

‘fingerprints’ (Karp + Rabin ‘87) approximate match: ‘agrep’

[Wu+Manber, Baeza-Yates+, ‘92]

Page 15: Temple University – CIS Dept. CIS616– Principles of Data Management

Full-text scanning

Approximate matching - string editing distance:

d( ‘survey’, ‘surgery’) = 2 = min # of insertions, deletions,

substitutions to transform the first string into the second SURVEY SURGERY

Page 16: Temple University – CIS Dept. CIS616– Principles of Data Management

Full-text scanning

string editing distance - how to compute?

A:

Page 17: Temple University – CIS Dept. CIS616– Principles of Data Management

Full-text scanning

string editing distance - how to compute?

A: dynamic programming cost( i, j ) = cost to match prefix

of length i of first string s with prefix of length j of second string t

Page 18: Temple University – CIS Dept. CIS616– Principles of Data Management

Full-text scanning

if s[i] = t[j] then cost( i, j ) = cost(i-1, j-1)else cost(i, j ) = min ( 1 + cost(i, j-1) // deletion 1 + cost(i-1, j-1) // substitution 1 + cost(i-1, j) // insertion )

Page 19: Temple University – CIS Dept. CIS616– Principles of Data Management

Full-text scanning

Complexity: O(M*N) (when using a matrix to ‘memoize’ partial results)

Page 20: Temple University – CIS Dept. CIS616– Principles of Data Management

Full-text scanning

Conclusions: Full text scanning needs no space

overhead, but is slow for large datasets

Page 21: Temple University – CIS Dept. CIS616– Principles of Data Management

Text - Detailed outline

text problem full text scanning inversion signature files clustering information filtering and LSI

Page 22: Temple University – CIS Dept. CIS616– Principles of Data Management

Text - Inversion

Page 23: Temple University – CIS Dept. CIS616– Principles of Data Management

Text - Inversion

Q: space overhead?

Page 24: Temple University – CIS Dept. CIS616– Principles of Data Management

Text - Inversion

A: mainly, the postings lists

Page 25: Temple University – CIS Dept. CIS616– Principles of Data Management

Text - Inversion

how to organize dictionary?

stemming – Y/N? insertions?

Page 26: Temple University – CIS Dept. CIS616– Principles of Data Management

Text - Inversion

how to organize dictionary? B-tree, hashing, TRIEs, PATRICIA

trees, ... stemming – Y/N? insertions?

Page 27: Temple University – CIS Dept. CIS616– Principles of Data Management

Text – Inversion

newer topics: Parallelism [Tomasic+,93] Insertions [Tomasic+94], [Brown+]

‘zipf’ distributions Approximate searching (‘glimpse’

[Wu+])

Page 28: Temple University – CIS Dept. CIS616– Principles of Data Management

postings list – more Zipf distr.: eg., rank-frequency plot of ‘Bible’

log(rank)

log(freq)

Text - Inversion

freq ~ 1 / (rank * ln(1.78V))

Page 29: Temple University – CIS Dept. CIS616– Principles of Data Management

Text - Inversion

postings lists Cutting+Pedersen

(keep first 4 in B-tree leaves) how to allocate space: [Faloutsos+92]

geometric progression compression (Elias codes) [Zobel+] –

down to 2% overhead!

Page 30: Temple University – CIS Dept. CIS616– Principles of Data Management

Conclusions

Conclusions: needs space overhead (2%-300%), but it is the fastest

Page 31: Temple University – CIS Dept. CIS616– Principles of Data Management

Text - Detailed outline

text problem full text scanning inversion signature files clustering information filtering and LSI

Page 32: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

idea: ‘quick & dirty’ filter

Page 33: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

idea: ‘quick & dirty’ filter then, do seq. scan on sign. file and

discard ‘false alarms’ Adv.: easy insertions; faster than seq.

scan Disadv.: O(N) search (with small

constant) Q: how to extract signatures?

Page 34: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

A: superimposed coding!! [Mooers49], ...

m (=4 bits/word) ~ (=4 bits set to “1” and the rest left as “0”)F (=12 bits sign. size)the bit patterns are OR-ed to form the document signature

Page 35: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

A: superimposed coding!! [Mooers49], ...

data

actual match

Page 36: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

A: superimposed coding!! [Mooers49], ...

retrieval

actual dismissal

Page 37: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

A: superimposed coding!! [Mooers49], ...

nucleotic

false alarm (‘false drop’)

Page 38: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

A: superimposed coding!! [Mooers49], ...

‘YES’ is ‘MAYBE’ ‘NO’ is ‘NO’

Page 39: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?

Page 40: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

Q1: How to choose F and m ?

m (=4 bits/word)F (=12 bits sign. size)

Page 41: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

Q1: How to choose F and m ? A: so that doc. signature is 50%

full

m (=4 bits/word)F (=12 bits sign. size)

Page 42: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?

Page 43: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

Q2: Why is it called ‘false drop’? Old, but fascinating story [1949]

how to find qualifying books (by title word, and/or author, and/or keyword)

in O(1) time? without computers

Page 44: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

Solution: edge-notched cards

......

1 2 40

•each title word is mapped to m numbers(how?)•and the corresponding holes are cut out:

Page 45: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

Solution: edge-notched cards

......

1 2 40

data

‘data’ -> #1, #39

Page 46: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

Search, e.g., for ‘data’: activate needle #1, #39, and shake the stack of cards!

......

1 2 40

data

‘data’ -> #1, #39

Page 47: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

Also known as ‘zatocoding’, from ‘Zator’ company.

Page 48: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?

Page 49: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

Q3: other apps of signature files? A: anything that has to do with

‘membership testing’: does ‘data’ belong to the set of words of the document?

Page 50: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files

UNIX’s early ‘spell’ system [McIlroy]

Bloom-joins in System R* [Mackert+] and ‘active disks’ [Riedel99]

differential files [Severance+Lohman]

Page 51: Temple University – CIS Dept. CIS616– Principles of Data Management

Signature files - conclusions

easy insertions; slower than inversion

brilliant idea of ‘quick and dirty’ filter: quickly discard the vast majority of non-qualifying elements, and focus on the rest.

Page 52: Temple University – CIS Dept. CIS616– Principles of Data Management

References

Aho, A. V. and M. J. Corasick (June 1975). "Fast Pattern Matching: An Aid to Bibliographic Search." CACM 18(6): 333-340.

Boyer, R. S. and J. S. Moore (Oct. 1977). "A Fast String Searching Algorithm." CACM 20(10): 762-772.

Brown, E. W., J. P. Callan, et al. (March 1994). Supporting Full-Text Information Retrieval with a Persistent Object Store. Proc. of EDBT conference, Cambridge, U.K., Springer Verlag.

Page 53: Temple University – CIS Dept. CIS616– Principles of Data Management

References - cont’d

Faloutsos, C. and H. V. Jagadish (Aug. 23-27, 1992). On B-tree Indices for Skewed Distributions. 18th VLDB Conference, Vancouver, British Columbia.

Karp, R. M. and M. O. Rabin (March 1987). "Efficient Randomized Pattern-Matching Algorithms." IBM Journal of Research and Development 31(2): 249-260.

Knuth, D. E., J. H. Morris, et al. (June 1977). "Fast Pattern Matching in Strings." SIAM J. Comput 6(2): 323-350.

Page 54: Temple University – CIS Dept. CIS616– Principles of Data Management

References - cont’d

Mackert, L. M. and G. M. Lohman (August 1986). R* Optimizer Validation and Performance Evaluation for Distributed Queries. Proc. of 12th Int. Conf. on Very Large Data Bases (VLDB), Kyoto, Japan.

Manber, U. and S. Wu (1994). GLIMPSE: A Tool to Search Through Entire File Systems. Proc. of USENIX Techn. Conf.

McIlroy, M. D. (Jan. 1982). "Development of a Spelling List." IEEE Trans. on Communications COM-30(1): 91-99.

Page 55: Temple University – CIS Dept. CIS616– Principles of Data Management

References - cont’d

Mooers, C. (1949). Application of Random Codes to the Gathering of Statistical Information

Bulletin 31. Cambridge, Mass, Zator Co. Pedersen, D. C. a. J. (1990). Optimizations for

dynamic inverted index maintenance. ACM SIGIR.

Riedel, E. (1999). Active Disks: Remote Execution for Network Attached Storage. ECE, CMU. Pittsburgh, PA.

Page 56: Temple University – CIS Dept. CIS616– Principles of Data Management

References - cont’d

Severance, D. G. and G. M. Lohman (Sept. 1976). "Differential Files: Their Application to the Maintenance of Large Databases." ACM TODS 1(3): 256-267.

Tomasic, A. and H. Garcia-Molina (1993). Performance of Inverted Indices in Distributed Text Document Retrieval Systems. PDIS.

Tomasic, A., H. Garcia-Molina, et al. (May 24-27, 1994). Incremental Updates of Inverted Lists for Text Document Retrieval. ACM SIGMOD, Minneapolis, MN.

Page 57: Temple University – CIS Dept. CIS616– Principles of Data Management

References - cont’d

Wu, S. and U. Manber (1992). "AGREP- A Fast Approximate Pattern-Matching Tool." .

Zobel, J., A. Moffat, et al. (Aug. 23-27, 1992). An Efficient Indexing Technique for Full-Text Database Systems. VLDB, Vancouver, B.C., Canada.

Page 58: Temple University – CIS Dept. CIS616– Principles of Data Management

Text - Detailed outline

text problem full text scanning inversion signature files clustering information filtering and LSI

Page 59: Temple University – CIS Dept. CIS616– Principles of Data Management

Vector Space Model and Clustering

keyword queries (vs Boolean) each document: -> vector (HOW?) each query: -> vector search for ‘similar’ vectors

Page 60: Temple University – CIS Dept. CIS616– Principles of Data Management

Vector Space Model and Clustering

main idea:

document

...data...

aaron zoodata

V (= vocabulary size)

‘indexing’

Page 61: Temple University – CIS Dept. CIS616– Principles of Data Management

Vector Space Model and Clustering

Then, group nearby vectors together Q1: cluster search? Q2: cluster generation?

Two significant contributions ranked output relevance feedback

Page 62: Temple University – CIS Dept. CIS616– Principles of Data Management

Vector Space Model and Clustering

cluster search: visit the (k) closest superclusters; continue recursively

CS TRs

TU TRs

Page 63: Temple University – CIS Dept. CIS616– Principles of Data Management

Vector Space Model and Clustering

ranked output: easy!

CS TRs

TU TRs

Page 64: Temple University – CIS Dept. CIS616– Principles of Data Management

Vector Space Model and Clustering

relevance feedback (brilliant idea) [Roccio’73]

CS TRs

TU TRs

Page 65: Temple University – CIS Dept. CIS616– Principles of Data Management

Vector Space Model and Clustering

relevance feedback (brilliant idea) [Roccio’73]

How?

CS TRs

TU TRs

Page 66: Temple University – CIS Dept. CIS616– Principles of Data Management

Vector Space Model and Clustering

How? A: by adding the ‘good’ vectors and subtracting the ‘bad’ ones

CS TRs

TU TRs

Page 67: Temple University – CIS Dept. CIS616– Principles of Data Management

Outline - detailed

main idea cluster search cluster generation evaluation

Page 68: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation

Problem: given N points in V dimensions, group them

Page 69: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation

Problem: given N points in V dimensions, group them

Page 70: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation

We need Q1: document-to-document

similarity Q2: document-to-cluster similarity

Page 71: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation

Q1: document-to-document similarity

(recall: ‘bag of words’ representation)

D1: {‘data’, ‘retrieval’, ‘system’} D2: {‘lung’, ‘pulmonary’, ‘system’} distance/similarity functions?

Page 72: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation

A1: # of words in commonA2: ........ normalized by the vocabulary

sizesA3: .... etc

About the same performance - prevailing one:

cosine similarity

Page 73: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation

cosine similarity: similarity(D1, D2) = cos(θ) =

sum(v1,i * v2,i) / len(v1)/ len(v2)

θ

D1

D2

Page 74: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation

cosine similarity - observations: related to the Euclidean distance weights vi,j : according to tf/idf

θ

D1

D2

Page 75: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation

tf (‘term frequency’)high, if the term appears very often in

this document.

idf (‘inverse document frequency’)penalizes ‘common’ words, that appear

in almost every document

Page 76: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation

We need Q1: document-to-document

similarity Q2: document-to-cluster similarity

?

Page 77: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation

A1: min distance (‘single-link’) A2: max distance (‘all-link’) A3: avg distance A4: distance to centroid

?

Page 78: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation

A1: min distance (‘single-link’) leads to elongated clusters

A2: max distance (‘all-link’) many, small, tight clusters

A3: avg distance in between the above

A4: distance to centroid fast to compute

Page 79: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation

We have document-to-document similarity document-to-cluster similarity

Q: How to group documents into ‘natural’ clusters

Page 80: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation

A: *many-many* algorithms - in two groups [VanRijsbergen]:

theoretically sound (O(N^2)) independent of the insertion order

iterative (O(N), O(N log(N))

Page 81: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation - ‘sound’ methods

Approach#1: dendrograms - create a hierarchy (bottom up or top-down) - choose a cut-off (how?) and cut

cat tiger horse cow0.1

0.3

0.8

Page 82: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation - ‘sound’ methods

Approach#2: min. some statistical criterion (eg., sum of squares from cluster centers) like ‘k-means’ but how to decide ‘k’?

Page 83: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation - ‘sound’ methods

Approach#3: Graph theoretic [Zahn]: build MST; delete edges longer than 2.5* std of

the local average

Page 84: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation - ‘sound’ methods

Result:• variations

• Complexity?

Page 85: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation - ‘iterative’ methods

general outline: Choose ‘seeds’ (how?) assign each vector to its closest seed

(possibly adjusting cluster centroid) possibly, re-assign some vectors to

improve clustersFast and practical, but ‘unpredictable’

Page 86: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation - ‘iterative’ methods

general outline: Choose ‘seeds’ (how?) assign each vector to its closest seed

(possibly adjusting cluster centroid) possibly, re-assign some vectors to

improve clustersFast and practical, but ‘unpredictable’

Page 87: Temple University – CIS Dept. CIS616– Principles of Data Management

Cluster generation

one way to estimate # of clusters k: the ‘cover coefficient’ [Can+] ~ SVD

Page 88: Temple University – CIS Dept. CIS616– Principles of Data Management

Outline - detailed

main idea cluster search cluster generation evaluation

Page 89: Temple University – CIS Dept. CIS616– Principles of Data Management

Evaluation

Q: how to measure ‘goodness’ of one distance function vs another?

A: ground truth (by humans) and ‘precision’ and ‘recall’

Page 90: Temple University – CIS Dept. CIS616– Principles of Data Management

Evaluation

precision = (retrieved & relevant) / retrieved 100% precision -> no false alarms

recall = (retrieved & relevant)/ relevant 100% recall -> no false dismissals

Page 91: Temple University – CIS Dept. CIS616– Principles of Data Management

References

Can, F. and E. A. Ozkarahan (Dec. 1990). "Concepts and Effectiveness of the Cover-Coefficient-Based Clustering Methodology for Text Databases." ACM TODS 15(4): 483-517.

Noreault, T., M. McGill, et al. (1983). A Performance Evaluation of Similarity Measures, Document Term Weighting Schemes and Representation in a Boolean Environment. Information Retrieval Research, Butterworths.

Rocchio, J. J. (1971). Relevance Feedback in Information Retrieval. The SMART Retrieval System - Experiments in Automatic Document Processing. G. Salton. Englewood Cliffs, New Jersey, Prentice-Hall Inc.

Page 92: Temple University – CIS Dept. CIS616– Principles of Data Management

References - cont’d

Salton, G. (1971). The SMART Retrieval System - Experiments in Automatic Document Processing. Englewood Cliffs, New Jersey, Prentice-Hall Inc.

Salton, G. and M. J. McGill (1983). Introduction to Modern Information Retrieval, McGraw-Hill.

Van-Rijsbergen, C. J. (1979). Information Retrieval. London, England, Butterworths.

Zahn, C. T. (Jan. 1971). "Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters." IEEE Trans. on Computers C-20(1): 68-86.

Page 93: Temple University – CIS Dept. CIS616– Principles of Data Management

Text - Detailed outline

text problem full text scanning inversion signature files clustering information filtering and LSI

Page 94: Temple University – CIS Dept. CIS616– Principles of Data Management

LSI - Detailed outline

LSI problem definition main idea experiments

Page 95: Temple University – CIS Dept. CIS616– Principles of Data Management

Information Filtering + LSI

[Foltz+,’92] Goal: users specify interests (= keywords) system alerts them, on suitable news-

documents Major contribution: LSI = Latent

Semantic Indexing latent (‘hidden’) concepts

Page 96: Temple University – CIS Dept. CIS616– Principles of Data Management

Information Filtering + LSI

Main idea map each document into some

‘concepts’ map each term into some ‘concepts’

‘Concept’:~ a set of terms, with weights, e.g. “data” (0.8), “system” (0.5), “retrieval”

(0.6) -> DBMS_concept

Page 97: Temple University – CIS Dept. CIS616– Principles of Data Management

Information Filtering + LSI

Pictorially: term-document matrix (BEFORE)

'data' 'system' 'retrieval' 'lung' 'ear'

TR1 1 1 1

TR2 1 1 1

TR3 1 1

TR4 1 1

Page 98: Temple University – CIS Dept. CIS616– Principles of Data Management

Information Filtering + LSI

Pictorially: concept-document matrix and...

'DBMS-concept'

'medical-concept'

TR1 1

TR2 1

TR3 1

TR4 1

Page 99: Temple University – CIS Dept. CIS616– Principles of Data Management

Information Filtering + LSI

... and concept-term matrix

'DBMS-concept'

'medical-concept'

data 1

system 1

retrieval 1

lung 1

ear 1

Page 100: Temple University – CIS Dept. CIS616– Principles of Data Management

Information Filtering + LSI

Q: How to search, eg., for ‘system’?

Page 101: Temple University – CIS Dept. CIS616– Principles of Data Management

Information Filtering + LSI

A: find the corresponding concept(s); and the corresponding documents

'DBMS-concept'

'medical-concept'

data 1

system 1

retrieval 1

lung 1

ear 1

'DBMS-concept'

'medical-concept'

TR1 1

TR2 1

TR3 1

TR4 1

Page 102: Temple University – CIS Dept. CIS616– Principles of Data Management

Information Filtering + LSI

A: find the corresponding concept(s); and the corresponding documents

'DBMS-concept'

'medical-concept'

data 1

system 1

retrieval 1

lung 1

ear 1

'DBMS-concept'

'medical-concept'

TR1 1

TR2 1

TR3 1

TR4 1

Page 103: Temple University – CIS Dept. CIS616– Principles of Data Management

Information Filtering + LSI

Thus it works like an (automatically constructed) thesaurus:

we may retrieve documents that DON’T have the term ‘system’, but they contain almost everything else (‘data’, ‘retrieval’)

Page 104: Temple University – CIS Dept. CIS616– Principles of Data Management

LSI - Detailed outline

LSI problem definition main idea experiments

Page 105: Temple University – CIS Dept. CIS616– Principles of Data Management

LSI - Experiments

150 Tech Memos (TM) / month 34 users submitted ‘profiles’ (6-66

words per profile) 100-300 concepts

Page 106: Temple University – CIS Dept. CIS616– Principles of Data Management

LSI - Experiments

four methods, cross-product of: vector-space or LSI, for similarity

scoring keywords or document-sample, for

profile specification measured: precision/recall

Page 107: Temple University – CIS Dept. CIS616– Principles of Data Management

LSI - Experiments

LSI, with document-based profiles, were better

precision

recall

(0.25,0.65)

(0.50,0.45)

(0.75,0.30)

Page 108: Temple University – CIS Dept. CIS616– Principles of Data Management

LSI - Discussion - Conclusions

Great idea, to derive ‘concepts’ from documents to build a ‘statistical thesaurus’

automatically to reduce dimensionality

Often leads to better precision/recall but:

Needs ‘training’ set of documents ‘concept’ vectors are not sparse anymore

Page 109: Temple University – CIS Dept. CIS616– Principles of Data Management

LSI - Discussion - Conclusions

Observations Bellcore (-> Telcordia) has a

patent used for multi-lingual retrieval

How exactly SVD works?

Page 110: Temple University – CIS Dept. CIS616– Principles of Data Management

Indexing - Detailed outline

primary key indexing secondary key / multi-key indexing spatial access methods fractals text SVD: a powerful tool multimedia ...

Page 111: Temple University – CIS Dept. CIS616– Principles of Data Management

References

Foltz, P. W. and S. T. Dumais (Dec. 1992). "Personalized Information Delivery: An Analysis of Information Filtering Methods." Comm. of ACM (CACM) 35(12): 51-60.