Parikshit Ram – Senior Machine Learning Scientist, Skytree, at MLconf ATL

Max-kernel search

How to search for just about anything?

Parikshit Ram

Similarity search

● Set of objects R

● Query q

● Similarity function

Finding similar images


Drug discovery

http://fineartamerica.com

Movie recommendations


Similarity search is ubiquitous

● Machine learning

● Computer vision

● Theory

● Databases

● Information retrieval

● Web application

● Collaborative filtering

● Scientific computing


Search-based classification


k-nearest-neighbor classification/regression


[Figure labels: “RomCom fan”, “Kids movie fanatic”]
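The idea of labeling a query by the labels of its most similar objects can be sketched in a few lines. This is only an illustration, not code from the talk: the function `knn_classify`, the toy rating vectors, and the choice of a dot product as the similarity function are all assumptions made for the example; only the two labels come from the slide.

```python
from collections import Counter

def knn_classify(query, objects, labels, similarity, k=3):
    """Label `query` by majority vote among its k most similar objects."""
    # Rank all objects by similarity to the query, most similar first.
    ranked = sorted(range(len(objects)),
                    key=lambda i: similarity(query, objects[i]),
                    reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

def dot(x, y):
    """Dot product, used here as a stand-in similarity function."""
    return sum(a * b for a, b in zip(x, y))

# Hypothetical users described by (romance, comedy, animation) ratings.
users = [(5, 4, 1), (4, 5, 0), (5, 5, 2), (1, 2, 5), (0, 1, 4), (1, 1, 5)]
labels = ["RomCom fan"] * 3 + ["Kids movie fanatic"] * 3

prediction = knn_classify((4, 4, 1), users, labels, dot, k=3)
print(prediction)
```

Any other similarity function can be passed in place of `dot`; the search step is the only part that touches the data.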

Search-based outlier detection


Search-based ML

Advantage
● nonparametric: lets the data speak
● no need to train complex models

Key ingredient
● notion of similarity (domain/data-specific)

Main challenge: efficiency
● Sheer size of the data
● Varied data types

Properties of similarity functions

● symmetry

The dissimilarity is the size of the set-theoretic difference
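A small sketch of why such a dissimilarity need not be symmetric. The two sets below are invented for illustration, chosen so that the two directions give different sizes; the function name `set_difference_dissimilarity` is likewise an assumption.

```python
def set_difference_dissimilarity(a, b):
    """Number of elements of `a` that are missing from `b` (size of a minus b)."""
    return len(set(a) - set(b))

a = {"w", "x", "y", "z"}
b = {"z", "v"}

d_ab = set_difference_dissimilarity(a, b)  # counts w, x, y
d_ba = set_difference_dissimilarity(b, a)  # counts v
print(d_ab, d_ba)
```

Swapping the arguments changes the value, so this function is a usable notion of dissimilarity that is not symmetric.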


● self-similarity

We do not really care about this.

Metrics: used everywhere

Bregman divergences: widely used for distributions

Mercer kernels: widely used in ML for variety of objects and problems

???: not quite explored in search or ML

Breadth of Kernel Functions

Objects            Kernel functions
Images             linear, polynomial, Gaussian, pyramid match
Documents          cosine
Sequences          p-spectrum kernel, alignment score
Trees              subtree, syntactic, partial tree
Graphs             random walk
Time series        cross-correlation, dynamic time-warping
Natural language   convolution, decomposition, lexical semantic

What is a Kernel Function?

In words: a pairwise symmetric function

● Correlation in a richer but hidden feature space
● Cannot access the hidden space

[Figure: a hidden mapping from the object space to the hidden space]
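To make the "hidden space" concrete, here is a sketch using one of the few kernels whose hidden mapping can be written down explicitly: the homogeneous degree-2 polynomial kernel on 2-D vectors. The function names are illustrative assumptions. For most kernels (the Gaussian kernel, for example, has an infinite-dimensional hidden space) only the kernel value itself can be computed, which is the situation the slide describes.

```python
import math

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel on 2-D vectors: (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def hidden_map(x):
    """Explicit hidden mapping for this particular kernel (normally never computed)."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
k_direct = poly2_kernel(x, y)                 # computed in the object space
k_hidden = dot(hidden_map(x), hidden_map(y))  # inner product in the hidden space
print(k_direct, k_hidden)
```

Both routes give the same number up to rounding; the kernel simply avoids ever constructing the hidden vectors.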

Max-kernel Search

Find the object in R most similar to q with respect to a kernel
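As a baseline, the problem statement can be written directly as a linear scan over R. This is a minimal sketch of the brute-force approach, not the fast algorithm the talk goes on to present; the names `max_kernel_search` and `gaussian_kernel`, the bandwidth, and the toy points are assumptions for the example.

```python
import math

def max_kernel_search(query, objects, kernel):
    """Return (index, value) of the object maximizing kernel(query, object)."""
    best_index, best_value = -1, -math.inf
    for i, obj in enumerate(objects):
        value = kernel(query, obj)
        if value > best_value:
            best_index, best_value = i, value
    return best_index, best_value

def gaussian_kernel(x, y, bandwidth=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

R = [(0.0, 0.0), (1.0, 1.0), (3.0, 0.5), (-2.0, 2.0)]
q = (0.8, 1.3)
best_index, best_value = max_kernel_search(q, R, gaussian_kernel)
print(best_index, R[best_index], best_value)
```

The scan costs one kernel evaluation per object in R for every query, which is the cost a search index tries to avoid.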


Existing methods

● Brute-force (parallel/distributed)
○ Domain-specific optimizations

● Coerce data to use metrics
○ Only approximate

No standard search tools!

Understanding kernels

If two objects are similar to each other,

then they are nearly equally similar to the query q.
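One standard way to make this statement precise for a Mercer kernel K follows from the Cauchy-Schwarz inequality in the hidden space: |K(q, p) − K(q, p')| ≤ sqrt(K(q, q)) · d_K(p, p'), where d_K(p, p') = sqrt(K(p, p) + K(p', p') − 2 K(p, p')) is the distance between p and p' in the hidden space. The formulas on the original slide did not survive extraction, so treat this as a plausible reading rather than the speaker's exact statement. The sketch below checks the inequality numerically on random points; the Gaussian kernel and all function names are assumptions.

```python
import math
import random

def gaussian_kernel(x, y):
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_distance(kernel, p1, p2):
    """Distance between p1 and p2 in the hidden space, using kernel values only."""
    sq = kernel(p1, p1) + kernel(p2, p2) - 2.0 * kernel(p1, p2)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors

def similarity_gap_bound(kernel, q, p1, p2):
    """Cauchy-Schwarz upper bound on |K(q, p1) - K(q, p2)|."""
    return math.sqrt(kernel(q, q)) * kernel_distance(kernel, p1, p2)

random.seed(0)

def random_point():
    return tuple(random.uniform(-2.0, 2.0) for _ in range(3))

violations = 0
for _ in range(1000):
    q, p1, p2 = random_point(), random_point(), random_point()
    gap = abs(gaussian_kernel(q, p1) - gaussian_kernel(q, p2))
    # Small slack absorbs floating-point rounding.
    if gap > similarity_gap_bound(gaussian_kernel, q, p1, p2) + 1e-12:
        violations += 1
print(violations)
```

The bound needs only kernel evaluations, so it can be used without ever accessing the hidden space.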

Indexing our collection

Multi-resolution index in O(n log n) time

Cover Tree (BKL 2006)

How to Search with this Index?

[Figure: query q and index points p, p', p'']

Safely ignore a large chunk (potentially millions)
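The pruning idea can be illustrated with a much cruder index than a cover tree: a single level of balls, each with a center p and a radius measured in the kernel-induced distance. If K(q, p) + sqrt(K(q, q)) · radius cannot exceed the best value found so far, every point in that ball can be skipped without evaluating it. This is a sketch under stated assumptions, not the algorithm from the talk: the one-level index, the every-k-th-point choice of centers, the Gaussian kernel, and the synthetic clustered data are all invented for the example.

```python
import math
import random

def kernel(x, y):
    """Gaussian kernel with unit bandwidth (an arbitrary choice for the demo)."""
    return math.exp(-0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_distance(p1, p2):
    sq = kernel(p1, p1) + kernel(p2, p2) - 2.0 * kernel(p1, p2)
    return math.sqrt(max(sq, 0.0))

def build_balls(points, step):
    """One-level index: every `step`-th point is a center; others join the closest."""
    balls = [{"center": c, "members": [], "radius": 0.0} for c in points[::step]]
    for p in points:
        ball = min(balls, key=lambda b: kernel_distance(b["center"], p))
        ball["members"].append(p)
        ball["radius"] = max(ball["radius"], kernel_distance(ball["center"], p))
    return balls

def search(query, balls):
    """Exact max-kernel search that skips balls which cannot beat the best so far."""
    best_point, best_value, evaluated = None, -math.inf, 0
    q_norm = math.sqrt(kernel(query, query))
    # Visit the most promising balls first so the best value rises quickly.
    scored = sorted(((kernel(query, b["center"]), b) for b in balls),
                    key=lambda pair: pair[0], reverse=True)
    for center_value, ball in scored:
        evaluated += 1  # one kernel evaluation for the center
        upper_bound = center_value + q_norm * ball["radius"] + 1e-12
        if upper_bound < best_value:
            continue  # no member of this ball can beat the current best
        for p in ball["members"]:
            evaluated += 1
            value = kernel(query, p)
            if value > best_value:
                best_point, best_value = p, value
    return best_point, best_value, evaluated

rng = random.Random(0)
points = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
          for cx, cy in [(0, 0), (6, 6), (-6, 6), (6, -6)] for _ in range(250)]
balls = build_balls(points, step=50)
query = (5.8, 6.1)
best_point, best_value, evaluated = search(query, balls)
brute_value = max(kernel(query, p) for p in points)
print(best_value == brute_value, evaluated, len(points))
```

The answer matches the brute-force scan while using fewer kernel evaluations, because balls far from the query are discarded after a single evaluation at their center. A multi-resolution index applies the same test recursively at every level.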

Results: Efficiency

● Widely applicable algorithm
● Performance data/kernel-dependent

[Figure: improvement, with 10x and 10000x marked]

Results: Sublinear Query Time

Bigger data implies bigger efficiency gains

[Figure: improvement vs. object set size]

Can We Prove it?

What Makes Search Hard?

Thm. For a set R of n objects, the query time is

● expansion constant

○ the distribution of the data

● directional concentration constant

○ the distribution of a kernel-induced transformation of the data

Code/tutorial for Fast Exact Max-Kernel Search

version 1.0.5

http://www.mlpack.org

Ryan R. Curtin

Endnote

● Search is an essential tool for ML

● Exploring different types of similarity functions increases the applicability and quality of search

● Kernels are widely applicable similarity functions

○ now we have provably fast max-kernel search

Email: [email protected]

http://mlpack.org
