
Intelligent Search by Dimension Reduction

Holger Bast, Max-Planck-Institut für Informatik

AG 1 - Algorithmen und Komplexität

Short History

I started getting involved in April 2002
I gave a series of lectures in Nov./Dec. 2002
We started a subgroup at the beginning of this year; current members are:
– Kurt Mehlhorn (director of AG 1)
– Hisao Tamaki (visiting professor from Japan)
– Kavitha Telikepalli, Venkatesh Srinivasan (postdocs)
– Irit Katriel, Debapriyo Majumdar (PhD students, IMPRS)

Dimension Reduction

Given a high-dimensional space of objects, recover the (assumed) underlying low-dimensional space
Formally: given an m×n matrix, possibly full rank, find the best low-rank approximation

car         1 1 0 1 0 1 0
automobile  1 0 1 1 0 0 0
search      0 0 0 0 1 1 0
engine      1 1 1 0 1 0 1
web         0 0 0 0 1 1 1

Dimension Reduction

car         1 1 1 1 0 0 0
automobile  1 1 1 1 0 0 0
search      0 0 0 0 1 1 1
engine      1 1 1 1 1 1 1
web         0 0 0 0 1 1 1

Given a high-dimensional space of objects, recover the (assumed) underlying low-dimensional space
Formally: given an m×n matrix, possibly full rank, find the best low-rank approximation (see the SVD sketch below)

Generic Method

Find concepts (vectors in term space) c1,…,ck
Replace each document by a linear combination of the c1,…,ck
That is, replace the term-document matrix by the product C·D’ (see the sketch below), where
– columns of C are c1,…,ck
– columns of D’ are the documents expressed in terms of the concepts

[numeric example on the slide: the term-document matrix above factored into C · D’]
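A minimal sketch of this generic method (my own illustration, not the slides' numeric example): pick concept vectors as columns of C, express every document as a linear combination of them, and form the approximation C·D’.

```python
# Generic method sketch (my illustration): concepts as columns of C, documents
# re-expressed in the concept basis, and the resulting low-rank approximation.
import numpy as np

# toy term-document matrix from the earlier slide
# (terms: car, automobile, search, engine, web)
D = np.array([
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 1],
], dtype=float)

# two hypothetical concepts, "vehicle" and "web search", as 0/1 term vectors
C = np.array([
    [1, 0],   # car
    [1, 0],   # automobile
    [0, 1],   # search
    [1, 1],   # engine
    [0, 1],   # web
], dtype=float)

# D' = least-squares coefficients of each document in the concept basis
D_prime = np.linalg.lstsq(C, D, rcond=None)[0]   # shape k x n
approx = C @ D_prime                             # rank-k approximation of D
```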


Specific Methods

Latent Semantic Indexing (LSI)
– Dumais et al. 1989
– orthogonal concepts c1,…,ck
– the span of c1,…,ck is the k-dimensional subspace that minimizes the squared distances
– choice of basis not specified (at least two sensible ways)
– computable in polynomial time via the singular value decomposition (SVD); see the sketch below
– surprisingly good in practice
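For concreteness, a small Python sketch of LSI as described above (my own illustration, not the talk's code): the concepts are the top-k left singular vectors, and documents and queries are mapped into the k-dimensional concept space. Which basis to use for the reduced documents is a choice, as the slide notes.

```python
# LSI sketch (my illustration of the method described above).
import numpy as np

def lsi(D, k):
    """D: m x n term-document matrix. Returns concepts C and reduced documents."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    C = U[:, :k]                                 # orthogonal concepts c1,...,ck
    docs_reduced = np.diag(s[:k]) @ Vt[:k, :]    # one of the two sensible bases
    # the other common choice is simply Vt[:k, :]
    return C, docs_reduced

def fold_in(query, C):
    """Project a term-space query vector into the concept space."""
    return C.T @ query

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Ranking: compute cosine(fold_in(q, C), docs_reduced[:, j]) for every document j.
```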

Specific Methods

Probabilistic Latent Semantic Indexing (PLSI)
– Hofmann 1999
– find the stochastic matrix of rank k that maximizes the probability that the given matrix is an instance of it
– connects the problem to statistical learning theory
– hard to compute; approximated by local search techniques (see the EM sketch below)
– very good results on some test collections
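A rough sketch of how such a local-search fit can look in practice. This is plain EM for the aspect model; the function name, parameters, and the untempered update scheme are my own hedged reading of PLSI, not Hofmann's tempered version.

```python
# Rough PLSI sketch (my own, hedged): fit P(w,d) = sum_z P(z) P(w|z) P(d|z)
# to an observed count matrix N by EM, i.e. by local search as noted above.
import numpy as np

def plsi(N, k, iters=50, seed=0):
    """N: m x n term-document count matrix, k: number of latent aspects."""
    rng = np.random.default_rng(seed)
    m, n = N.shape
    Pz = np.full(k, 1.0 / k)                        # P(z)
    Pwz = rng.random((m, k)); Pwz /= Pwz.sum(0)     # P(w|z), columns sum to 1
    Pdz = rng.random((n, k)); Pdz /= Pdz.sum(0)     # P(d|z), columns sum to 1
    for _ in range(iters):
        # E-step: responsibilities P(z|w,d), shape (m, n, k)
        post = Pz[None, None, :] * Pwz[:, None, :] * Pdz[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate the parameters from the expected counts
        w = N[:, :, None] * post                    # n(w,d) * P(z|w,d)
        Pwz = w.sum(axis=1); Pwz /= Pwz.sum(axis=0, keepdims=True)
        Pdz = w.sum(axis=0); Pdz /= Pdz.sum(axis=0, keepdims=True)
        Pz = w.sum(axis=(0, 1)); Pz /= Pz.sum()
    # the rank-k stochastic matrix approximating the normalized counts:
    return (Pwz * Pz) @ Pdz.T
```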

Specific Methods

Concept Indexing (CI)
– Karypis & Han 2000
– c1,…,ck = centroid vectors of a k-clustering
– documents = projections onto these centroids (see the sketch below)
– computationally easy (given the clustering)
– gave surprisingly good results in a recent DFKI project (Arbeitsamt)
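A minimal sketch of Concept Indexing as just described (my illustration; the clustering itself is assumed to be given, as on the slide).

```python
# Concept Indexing sketch (my illustration): the concepts are the centroids of
# a given k-clustering of the documents; each document is then represented by
# its projections onto these centroids. Assumes every cluster is non-empty.
import numpy as np

def concept_indexing(D, labels, k):
    """D: m x n term-document matrix; labels[j] in {0,...,k-1} is the cluster of doc j."""
    centroids = np.stack([D[:, labels == j].mean(axis=1) for j in range(k)], axis=1)
    centroids /= np.linalg.norm(centroids, axis=0, keepdims=True)   # m x k
    reduced = centroids.T @ D          # k x n: documents expressed in concepts
    return centroids, reduced
```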

Comparing Methods

Fundamental question: which method is how good under which circumstances?
Few theoretically founded answers to this question
– seminal paper: A Probabilistic Analysis of Latent Semantic Indexing, Papadimitriou, Raghavan, Tamaki, Vempala, PODS’98 (ten years after LSI was born!)
– follow-up paper: Spectral Analysis of Data, Azar, Fiat, Karlin, McSherry, Saia, STOC’01
– main statement: LSI is robust against the addition of (how much?) noise

Why does LSI work so well?

A good method should produce
– small angles between documents on similar topics
– large angles between documents on different topics
A formula for angles in the reduced space (see the sketch below):
– Let D = C·G, and let c1’,…,ck’ be the images of the concepts under LSI
– Then the k×k dot products ci’·cj’ are given by the matrix (G·G^T)^-1
– That is, pairwise angles are ≥ 90 degrees if and only if (G·G^T)^-1 has nonpositive off-diagonal entries (i.e. it is an M-matrix)
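A small sketch of how this criterion can be checked numerically (my own illustration of the formula above).

```python
# Sketch (my own): for D = C.G, the pairwise dot products of the images of the
# concepts under LSI are the entries of inv(G @ G.T); the criterion above is
# that this matrix has positive diagonal and nonpositive off-diagonal entries.
import numpy as np

def concept_angles_ok(G):
    """G: k x n matrix expressing the documents in terms of the concepts."""
    M = np.linalg.inv(G @ G.T)                     # k x k, entries ci'.cj'
    off = M - np.diag(np.diag(M))
    return bool(np.all(np.diag(M) > 0) and np.all(off <= 0)), M
```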

Polysemy and Simonymy

Let Tij be the dot product of the i-th and the j-th row of a term-document matrix (~ co-occurrence of terms i and j)
– Call term k a polysem if there exist terms i and j such that, for some t, Tik, Tjk ≥ t but Tij < t
– Two terms i and j are simonyms if Tij ≥ Tii or Tij ≥ Tjj
Without polysems and simonyms we have (see the checker sketch below)
1. Tij ≥ min(Tik, Tjk) for all i, j, k
2. Tii > Tij for all j ≠ i
A symmetric matrix (Tij) satisfying 1. and 2. is called strictly ultrametric
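The two conditions can be tested directly on a given term-document matrix; a small sketch of my own:

```python
# Sketch (my own): compute T from a term-document matrix D and check whether
# it satisfies conditions 1. and 2. above, i.e. whether T is strictly ultrametric.
import numpy as np

def is_strictly_ultrametric(D):
    T = D @ D.T                                    # Tij = dot product of rows i, j
    m = T.shape[0]
    cond2 = all(T[i, i] > T[i, j]                  # 2. Tii > Tij for all j != i
                for i in range(m) for j in range(m) if i != j)
    cond1 = all(T[i, j] >= min(T[i, k], T[j, k])   # 1. Tij >= min(Tik, Tjk)
                for i in range(m) for j in range(m) for k in range(m))
    return cond1 and cond2
```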

Help from Linear Algebra

Theorem [Martinez, Michon, San Martin 1994]: The inverse of a strictly ultrametric matrix is an M-matrix, i.e. its diagonal entries are positive and its off-diagonal entries are nonpositive

Example (see also the sketch below):

T =   5.3  1.7  1.5        T^-1 ≈   0.23  -0.05  -0.07
      1.7  4.6  2.1                -0.05   0.32  -0.19
      1.5  2.1  3.2                -0.07  -0.19   0.47
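The theorem is easy to verify numerically on small examples; below is my own toy matrix (not the slide's), which satisfies conditions 1. and 2., together with its inverse.

```python
# Numeric sketch (my own example, not the slide's): a 3x3 strictly ultrametric
# matrix and its inverse, whose off-diagonal entries are indeed nonpositive.
import numpy as np

T = np.array([[3.0, 1.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 2.0, 3.0]])   # symmetric, Tij >= min(Tik, Tjk), Tii > Tij

print(np.round(np.linalg.inv(T), 3))
# [[ 0.385 -0.077 -0.077]
#  [-0.077  0.615 -0.385]
#  [-0.077 -0.385  0.615]]   <- positive diagonal, nonpositive off-diagonal
```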

A new LSI theorem

Theorem: If D can be well approximated by a set of concepts free from polysemy and simonymy, then in the reduced LSI space these concepts form large pairwise angles.
Beware: this only holds for the original LSI, not for its widely used variant!
Question: How can we check whether such a set exists? This would yield a method for selecting the optimal (reduced) dimension!

Exploiting Link Structure

Achlioptas, Fiat, Karlin, McSherry (FOCS’01):
– documents have a topic (implicit in the distribution of terms)
– and a quality (implicit in the link structure)
– represent each document by a vector: the direction corresponds to the topic, the length corresponds to the quality
– Goal: for a given query, rank documents by their dot product with the topic of the query

Model details

Underlying parameters
– A = [A1 … An]  authority topics, one per document
– H = [H1 … Hn]  hub topics, one per document
– C = [C1 … Ck]  translates topics to terms
– q = [q1 … qk]  query topic
The input we see (see the sketch below)
– D ≈ A·C + H·C  term-document matrix
– L ≈ H^T·A  link matrix
– Q ≈ q·C  query terms
Goal: recover the ordering of A1·q, …, An·q
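To make the shapes concrete, here is a hedged sketch of the generative side of this model. The variable shapes are my own reading of the slide's notation, up to transposition; in the actual model the observables would be noisy realizations of these products.

```python
# Sketch (my own reading of the model above): sample hidden parameters and form
# the expected observables D, L, Q; real data would be noisy versions of these.
import numpy as np

rng = np.random.default_rng(0)
k, n, m = 3, 50, 200              # topics, documents, terms

A = rng.random((k, n))            # authority topics A1..An (length ~ quality)
H = rng.random((k, n))            # hub topics H1..Hn
C = rng.random((k, m))            # translates topics to terms
q = rng.random(k)                 # query topic

D = C.T @ (A + H)                 # term-document matrix,  D ~ A.C + H.C
L = H.T @ A                       # link matrix,            L ~ H^T.A
Q = C.T @ q                       # query terms,            Q ~ q.C

goal = np.argsort(-(A.T @ q))     # ranking to recover: order of A1.q, ..., An.q
```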

Model - Problems

Link matrix generation L ≈ H^T·A
– is OK, because the presence of a link is related to the hub/authority value
Term-document matrix generation D ≈ A·C + H·C
– very unrealistic: the term distribution gives information on the topic, but not on the quality!
– more realistic: D ≈ A0·C + H0·C, where A0 and H0 contain the normed columns of A and H
So far, we can solve only the special case where A differs from H by a diagonal matrix (i.e. hub topic = authority topic)

Perspective

Strong theoretical foundations
– unifying framework + comparative analysis for a large variety of dimension reduction methods
– realistic models + performance guarantees
Make proper use of human intelligence
– integrate explicit knowledge
– but only as much as required (automatic detection)
– combine dimension reduction methods with interactive schemes (e.g. phrase browsing)

The End!

Specific Methods

Latent Semantic Indexing (LSI) [Dumais et al. ’89]
– orthogonal concepts c1,…,ck
– the span of c1,…,ck is the k-dimensional subspace that minimizes the squared distances
Probabilistic Latent Semantic Indexing (PLSI) [Hofmann ’99]
– find the stochastic matrix of rank k that maximizes the probability that the given matrix is an instance of it
Concept Indexing (CI) [Karypis & Han ’00]
– c1,…,ck = centroid vectors of a k-clustering
– documents = projections onto these centroids

Dimension Reduction Methods

Main idea: the high-dimensional space of objects is a variant of an underlying low-dimensional space
Formally: given an m×n matrix, possibly full rank, find the best low-rank approximation

[the term-document matrix and its low-rank approximation, as shown on the earlier Dimension Reduction slides]

I will talk about …

Dimension reduction techniques
– some methods
– a new theorem
Exploiting link structure
– state of the art
– some new ideas
Perspective

Overview

Exploiting the link structure
– Google, HITS, SmartyPants
– Trawling
Semantic Web
– XML, XML-Schema
– RDF, DAML+OIL
Interactive browsing
– Scatter/Gather
– Phrase Browsing

Scatter/Gather

Cutting, Karger, Pedersen, Tukey, SIGIR’92
Motivation: zooming into a large document collection
Realisation: geometric clustering (see the sketch below)
Challenge: extremely fast algorithms required, in particular
– linear-time preprocessing
– constant-time query processing
Example: New York Times News Service, articles from August 1990 (~5000 articles, 30 MB of text)
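For intuition, a rough sketch of one scatter/gather round (my illustration: plain k-means plus centroid term digests; the paper's Buckshot and Fractionation routines are what make this fast enough in practice).

```python
# Rough Scatter/Gather sketch (my illustration, not the paper's algorithms):
# one "scatter" step clusters the documents geometrically and summarizes each
# cluster by the top terms of its centroid; "gather" keeps the chosen clusters.
import numpy as np

def scatter(doc_vectors, vocab, k, top=5, iters=10, seed=0):
    """doc_vectors: n x m term-frequency matrix (one row per document)."""
    X = np.asarray(doc_vectors, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):                               # plain k-means
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    digests = [[vocab[t] for t in np.argsort(-centers[j])[:top]] for j in range(k)]
    return labels, digests

def gather(doc_vectors, labels, chosen):
    """Keep only the documents from the clusters the user selected."""
    return np.asarray(doc_vectors)[np.isin(labels, chosen)]
```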

Scatter/Gather – Example

Taken from Cutting, Karger, Pedersen, Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, © 1992 ACM SIGIR

Phrase Browsing

Nevill-Manning, Witten, Moffat, 1997
Formulating a good query requires more or less knowledge of the document collection
– if less, fine
– if more, interaction is a must
Build a hierarchy of phrases
Example: http://www.nzdl.org/cgi-bin/library
Challenge: fast algorithms for finding a minimal grammar, e.g. for S → babaabaabaa (see the sketch below)
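As a toy illustration of the grammar idea, here is my own sketch of a greedy Re-Pair-style digram replacement; it builds a small grammar but does not guarantee a minimal one, and it is not necessarily the algorithm the slide refers to.

```python
# Sketch (my own): greedily replace the most frequent adjacent pair of symbols
# by a new nonterminal until no pair occurs twice; a heuristic for building a
# small (not necessarily minimal) grammar for a string such as "babaabaabaa".
from collections import Counter

def build_grammar(s):
    seq, rules, next_id = list(s), {}, 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = max(pairs.items(), key=lambda kv: kv[1])
        if count < 2:
            break
        name = f"R{next_id}"; next_id += 1
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
    return seq, rules

start, rules = build_grammar("babaabaabaa")
print(start, rules)
# start symbol S -> R0 R2 R1, with R0 -> b a, R1 -> R0 a, R2 -> R1 R1
```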

Teoma

More refined concept of authoritativeness, depending on the specific query (“subject-specific popularity”)
More sophisticated query refinement
But: coverage is only 10% of that of Google
Example: http://www.teoma.com