Adding Semantics to Information Retrieval
By Kedar Bellare, 20th April 2003


Page 1: Adding Semantics to Information Retrieval

Adding Semantics to Information Retrieval

By Kedar Bellare
20th April 2003

Page 2: Adding Semantics to Information Retrieval

Motivation

- Current IR techniques are term-based
- The semantics of the document and the query are not considered
- Problems like polysemy and synonymy arise
- There have been many advances in NLP and in statistical modeling of semantics
- Is semantic IR really required?

Page 3: Adding Semantics to Information Retrieval

Organization

- Traditional IR
- Statistics for semantics: Latent Semantic Indexing
- Semantic resources for semantics: use of Semantic Nets, Conceptual Graphs, WordNet, etc. in IR
- Conclusion

Page 4: Adding Semantics to Information Retrieval

Information Retrieval

An information retrieval system does not inform the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.

Page 5: Adding Semantics to Information Retrieval

A Typical IR System

Page 6: Adding Semantics to Information Retrieval

Current IR

- Preprocessing of documents: inverted index, stopword removal, and stemming
- Representation of documents: vector space model (TF and IDF), document clustering
- Improvements to the above: better weighting of document vectors, link analysis (PageRank and anchor text)
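The vector space model above can be sketched in a few lines: each document becomes a vector of TF × IDF weights, compared by cosine similarity. The corpus and weighting scheme (raw term frequency, idf = log(N/df)) are illustrative choices, not taken from the slides:

```python
import math
from collections import Counter

# Tiny illustrative corpus; note "car" and "automobile" are distinct terms.
docs = [
    "the car is fast",
    "the automobile dealership sells the car",
    "a jaguar is a large animal",
]

tokenized = [d.split() for d in docs]
N = len(tokenized)
vocab = sorted(set(t for doc in tokenized for t in doc))
# Document frequency: number of documents containing each term.
df = {t: sum(1 for doc in tokenized if t in doc) for t in vocab}

def tfidf_vector(tokens):
    """TF-IDF weights: raw count times log(N / df)."""
    tf = Counter(tokens)
    return [tf[t] * math.log(N / df[t]) for t in vocab]

vectors = [tfidf_vector(doc) for doc in tokenized]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Docs 0 and 1 share "car" and "the", so they score higher than docs 0 and 2.
print(cosine(vectors[0], vectors[1]))
print(cosine(vectors[0], vectors[2]))
```

Note that a query for "automobile" would give document 0 no credit for containing "car", which is exactly the synonymy problem motivating the semantic methods that follow.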

Page 7: Adding Semantics to Information Retrieval

Latent Semantic Indexing

- Problems with traditional approaches: synonymy ("automobile" and "car"), polysemy ("jaguar" means both a car and an animal)
- LSI: linear algebra for capturing the "latent semantics" of documents
- A method of dimensionality reduction

Page 8: Adding Semantics to Information Retrieval

LSI

- Compares document vectors in the latent semantic space
- Two documents can have a high similarity value even if they share no terms
- Attempts to remove minor differences in terminology during indexing
- The truncated SVD is used to construct the latent semantic space
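The claim that two documents can be similar without sharing a single term can be demonstrated on a toy corpus (the matrix below is illustrative, not from the slides). Because "car" and "automobile" each co-occur with "engine", a car-only document and an automobile-only document land close together in the latent space even though they are orthogonal in the raw term space:

```python
import numpy as np

# Terms are rows (car, automobile, engine), documents are columns.
# Document 2 contains only "car"; document 3 contains only "automobile".
A = np.array([
    [1, 0, 1, 0],   # car
    [0, 1, 0, 1],   # automobile
    [1, 1, 0, 0],   # engine
], dtype=float)

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# In the raw vector space model the two documents are orthogonal.
assert cos(A[:, 2], A[:, 3]) == 0.0

# LSI: fold documents into a k-dimensional latent space.
T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 1   # one latent dimension suffices in this tiny corpus
docs_latent = (np.diag(s[:k]) @ Dt[:k, :]).T   # one row per document

# The shared context term "engine" pulls the two documents together.
assert cos(docs_latent[2], docs_latent[3]) > 0.9
```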

Page 9: Adding Semantics to Information Retrieval

Singular Value Decomposition

- Given a term-document matrix A of size t × d, SVD converts it into the product of three matrices T (t × r), S (r × r), and D (d × r) such that

  A = T S Dᵀ

- T and D have orthonormal columns, S is diagonal, and r is the rank of A
- The reduced space corresponds to the axes of greatest variation
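The factorization above can be computed directly with numpy (the term-document matrix is an illustrative toy, not from the slides):

```python
import numpy as np

# Toy 5-term x 4-document count matrix.
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

# numpy returns A = T @ diag(s) @ Dt with singular values in decreasing
# order; with full_matrices=False the shapes are T: t x m, s: m, Dt: m x d,
# where m = min(t, d).
T, s, Dt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)
D = Dt.T

# T and D have orthonormal columns, and the product reconstructs A.
assert np.allclose(T.T @ T, np.eye(T.shape[1]))
assert np.allclose(T @ S @ D.T, A)
```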

Page 10: Adding Semantics to Information Retrieval

What does LSI do?

- Uses the truncated SVD: instead of the full r-dimensional space, keeps only a factor k of the dimensions

  Â = T_k S_k D_kᵀ, where Â is t × d, T_k is t × k, S_k is k × k, and D_k is d × k

- The truncated SVD captures the underlying structure in the association of terms and documents
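Truncation amounts to keeping only the k largest singular values and their corresponding columns. A minimal sketch on an illustrative toy matrix:

```python
import numpy as np

A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and directions.
k = 2
Tk = T[:, :k]          # t x k
Sk = np.diag(s[:k])    # k x k
Dk = Dt[:k, :].T       # d x k

# The rank-k approximation of A.
A_hat = Tk @ Sk @ Dk.T
assert A_hat.shape == A.shape
assert np.linalg.matrix_rank(A_hat) <= k
```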

Page 11: Adding Semantics to Information Retrieval

Using the SVD model

- Comparison of terms: entries of the matrix T S² Tᵀ

- Comparison of documents: entries of the matrix D S² Dᵀ

- Comparison of a term and a document: entries of the matrix T S Dᵀ

- A query in the SVD model: q̂ = qᵀ T S⁻¹
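The four formulas above, computed in the truncated (rank-k) model, can be sketched as follows; the matrix and query vector are illustrative:

```python
import numpy as np

# Illustrative 5-term x 4-document matrix.
A = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 2
Tk, Sk, Dk = T[:, :k], np.diag(s[:k]), Dt[:k, :].T

# Term-term similarities: entries of T S^2 T^T.
term_sims = Tk @ Sk @ Sk @ Tk.T          # 5 x 5

# Document-document similarities: entries of D S^2 D^T.
doc_sims = Dk @ Sk @ Sk @ Dk.T           # 4 x 4

# Term-document comparison: entries of T S D^T (the rank-k matrix itself).
term_doc = Tk @ Sk @ Dk.T                # 5 x 4

# Fold a query (one weight per term) into the latent space:
# q_hat = q^T T_k S_k^{-1} gives a k-dimensional pseudo-document
# that can be compared against the rows of Dk.
q = np.array([1, 0, 0, 0, 1], dtype=float)
q_hat = q @ Tk @ np.linalg.inv(Sk)

assert term_sims.shape == (5, 5) and doc_sims.shape == (4, 4)
assert q_hat.shape == (k,)
```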

Page 12: Adding Semantics to Information Retrieval

Example of LSI

Page 13: Adding Semantics to Information Retrieval

Why does LSI work?

- Although there is a lot of empirical evidence, there is no concrete proof of why LSI works
- No major degradation: the theorem of Eckart and Young states that Â is the rank-k matrix closest to A (the distance between the two matrices is minimal)
- This still does not explain the improvements in recall and precision

Page 14: Adding Semantics to Information Retrieval

Why does LSI work? (contd.)

- Papadimitriou et al.: assumes documents are generated from a set of topics with disjoint vocabularies; if the term-document matrix A is perturbed, they prove that LSI recovers the topic information and removes the noise
- Kontostathis et al.: essentially claims that LSI's ability to trace term co-occurrences is what helps improve recall

Page 15: Adding Semantics to Information Retrieval

Advantages & Disadvantages

Advantages
- Synonymy
- Term dependence

Disadvantages
- Storage
- Efficiency

Page 16: Adding Semantics to Information Retrieval

Semantic Resources

- Semantic Nets, e.g., "John gave Mary the book"
- Applied in UNL, e.g., "Only a few farmers could use information technology in the early 1990s"

Page 17: Adding Semantics to Information Retrieval

Semantic Resources (contd.)

- Conceptual Graphs, e.g., "A bird is singing in a sycamore tree"
- Conceptual Dependency, e.g., "I gave the man a book"
- Lexical resources: WordNet

Page 18: Adding Semantics to Information Retrieval

Applications of Semantic Resources in IR

- UNL: used in improving document vectors
- Conceptual Graphs: graph matching of query and document
- CDs: FERRET, comparison of CD patterns
- WordNet: query expansion using WordNet
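Query expansion with WordNet adds synonyms of each query term so that term-based retrieval also matches documents using different wording. A minimal sketch, with a tiny hand-made synonym table standing in for WordNet synsets (in a real system each term would be looked up in WordNet instead):

```python
# Hypothetical stand-in for WordNet synonym lookup.
SYNONYMS = {
    "car": {"automobile", "auto"},
    "big": {"large"},
}

def expand_query(query):
    """Return the query terms plus the synonyms of each term."""
    expanded = []
    for term in query.split():
        expanded.append(term)
        expanded.extend(sorted(SYNONYMS.get(term, ())))
    return expanded

print(expand_query("big car"))
# → ['big', 'large', 'car', 'auto', 'automobile']
```

The expanded term list is then fed to the retrieval engine in place of the original query; note that expanding a polysemous term (e.g. "jaguar") without sense disambiguation can also hurt precision.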

Page 19: Adding Semantics to Information Retrieval

Conclusion

- Various things need to be considered before applying these methods to the Web: storage, efficiency, and the knowledge content of the query
- Clearly, a semantic method is needed to eliminate the problems of synonymy and polysemy
- Currently, traditional models with minor hacks serve the purpose
- In conclusion: a statistical or conceptual approach, or a combination of both, to model document semantics is definitely required

Page 20: Adding Semantics to Information Retrieval

References

[1] M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4), pages 573–595, 1995.

[2] S. Chakrabarti. Mining the Web - Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers, San Francisco, 2002.

[3] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by Latent Semantic Analysis. Journal of the Society for Information Science 41 (6), pages 391–407, 1990.

[4] A. Kontostathis and W. M. Pottenger. A mathematical view of Latent Semantic Indexing: Tracing Term Co-occurrences. Technical report, Lehigh University, 2002.

[5] R. Mandala, T. Takenobu, and T. Hozumi. The use of WordNet in Information Retrieval. In COLING/ACL Workshop on the Usage of WordNet in Natural Language Processing Systems, pages 31–37, 1998.

Page 21: Adding Semantics to Information Retrieval

References (contd.)

[6] M. L. Mauldin. Retrieval performance in FERRET: a conceptual information retrieval system. In Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, pages 347–355. ACM Press, 1991.

[7] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography 3 (4), pages 235–244, 1990.

[8] M. Montes-y-Gomez, A. Lopez, and A. F. Gelbukh. Information retrieval with Conceptual Graph matching. In Database and Expert Systems Applications, pages 312–321, 2000.

[9] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala. Latent Semantic Indexing: A probabilistic analysis. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS), pages 159–168, 1998.

[10] E. Rich and K. Knight. Artificial Intelligence. Tata McGraw-Hill Publishers, New Delhi, 2002.

Page 22: Adding Semantics to Information Retrieval

References (contd.)

[11] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.

[12] C. Shah, B. Chowdhary, and P. Bhattacharyya. Constructing better Document Vectors using Universal Networking Language (UNL). In Proceedings of the International Conference on Knowledge-Based Computer Systems (KBCS). NCST, Navi Mumbai, India, 2002.

[13] H. Uchida, M. Zhu, and T. Della Senta. UNL: A gift for a millennium. Technical report, The United Nations University, 2000.

[14] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.