Recall Systems: Efficient Learning and Use of Category Indices
DESCRIPTION
Recall Systems: Efficient Learning and Use of Category Indices. Omid Madani, with Wiley Greiner, David Kempe, and Mohammad Salavatipour. Overview: problems and motivation; proposal: recall systems; experiments; related work and conclusions.
TRANSCRIPT
Research
Recall Systems: Efficient Learning and Use of Category Indices
Omid Madani
With Wiley Greiner, David Kempe, and Mohammad Salavatipour
Overview
• Problems and motivation
• Proposal: recall systems
• Experiments
• Related work and conclusions
Massive Learning
• Lots of ...
  • Instances (millions, unbounded ...)
  • Dimensions (1000s and beyond)
  • Categories (1000s and beyond)
• Two questions:
  1. How to categorize quickly?
  2. How to efficiently learn to categorize efficiently?
Yahoo! Page Topics (Y! Directory)
[Diagram: a fragment of the Yahoo! directory hierarchy, with categories such as Arts & Humanities, Photography, Magazines, Contests, Education, History, Business & Economy, Recreation & Sports, Sports, Amateur, College, Basketball]
• Over 100,000 categories in the Yahoo! directory
• Given a page, quickly categorize...
• Even larger category sets for vision, text prediction, ... (millions and beyond)
Efficiency
1. Two phases (unless truly online):
   1. Learning
   2. Classification time / deployment
2. Resource requirements:
   1. Memory
   2. Time
   3. Sample efficiency
Idea
• Cues in the input may quickly narrow down the possibilities => "index" the categories
• Like a search engine, but learn a good index
• Goal here: the index reduces the possible classes; classifiers are then applied for precise classification
Summary Findings
• Very fast:
  • Train time: learned in minutes on thousands of instances/categories
  • 10s of online classifiers trained on each instance (not 1000s)
• The index doesn't hurt classifier accuracy!
Recognition System
[Diagram: instance x → recall system → reduced set of candidate categories → classifier application → categories for x]
The Problem: A Tripartite Graph
[Diagram: a tripartite graph connecting features f1–f5, instances x1–x7, and categories c1–c4]
Output: An Index
[Diagram: a bipartite graph from features f1–f5 to concepts c1–c5; the set of edges E is the "COVER"]
• Example entries: c(f4) = {c3, c4}, f(c1) = {f2}, f(c2) = {f2, f5}
Using the Index
• Given instance x, retrieve the following candidate set of concepts:
  ∪_{fi ∈ f(x)} c(fi), where f(x) = the set of active features of x
• A concept is retrieved when a disjunction of its features is satisfied
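This retrieval rule can be sketched as follows (the dict-of-sets index representation and all names here are our illustrative assumptions, not the talk's own code):

```python
# Candidate retrieval from a learned index: the candidate set is the union
# of c(f) over the instance's active features f(x).

def retrieve(index, active_features):
    """index: dict mapping feature -> set of concepts, i.e. c(f).
    active_features: f(x), the set of features active in instance x."""
    candidates = set()
    for f in active_features:
        candidates |= index.get(f, set())
    return candidates

# Example: with c(f1) = {c1} and c(f2) = {c1, c2}, an instance whose
# active features are {f2, f3} retrieves {c1, c2}.
toy_index = {"f1": {"c1"}, "f2": {"c1", "c2"}}
print(sorted(retrieve(toy_index, {"f2", "f3"})))  # -> ['c1', 'c2']
```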
Terminology
• False positive: the retrieved concept should not have been retrieved (irrelevant)
• False negative: the concept should have been retrieved, but was not (missed)
Learning to Index
• Let's learn the cover (the edges)
• Online and mistake-driven
• A mistake means:
  • A false-negative concept, or
  • Too many false positives
The Indexer Algorithm
• For each concept c, keep a sparse vector Vc, initially 0
• Begin with an empty cover
• On each instance x:
  • Retrieve candidate concepts
  • Update Vc for each false-negative c (promotion)
  • If fp-count > tolerance, update Vc for each false-positive c (demotion)
  • Update the index accordingly
  • Update the classifiers
Use Feature Weights
• For each concept c, keep a sparse vector Vc, initially 0
• An (i, j)-edge exists in the cover iff V_{cj}[fi] >= 0.1 (the inclusion threshold)
Updating the Vectors
• Increase/decrease the feature weights in Vc that appear in x by the learning rate
• In promotion, if a feature is not present in Vc: initialize it to 1 or 1/df
• In demotion: ignore 0-weight features
• Max-normalize the weights (optional)
• Update the index
• Takes O(|x| + |Vc|) time on every instance
The Indexer Algorithm
Algorithm Max_Norm(tolerance, learning rate r)
1. Begin with an empty index.
2. For each instance x in the training sample:
   2.1 Retrieve concepts: ∪_{fi ∈ f(x)} C(fi)
   2.2 Promote: for each false-negative concept c: Update(x, c, r)
   2.3 If fp-count is greater than tolerance, demote: for each false-positive concept c: Update(x, c, 1.0/r)
   2.4 Apply and update the corresponding classifiers
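A minimal sketch of this online, mistake-driven loop (the `retrieve` and `update` callables are stand-ins for the routines on the surrounding slides; all names are ours):

```python
# One pass of the (sketched) Indexer loop: promote missed concepts,
# and demote falsely retrieved concepts only when their count exceeds
# the tolerance.

def indexer_pass(instances, true_concepts, retrieve, update, r, tolerance):
    """instances: iterable of instances x
    true_concepts: function mapping x to its set of true concepts
    retrieve: function mapping x to the candidate concept set
    update: the Update(x, c, rate) subroutine
    r: learning rate (> 1); demotion uses 1.0 / r
    tolerance: max tolerated false-positive count before demotion
    """
    for x in instances:
        candidates = retrieve(x)
        truth = true_concepts(x)
        # Promote every false-negative concept.
        for c in truth - candidates:
            update(x, c, r)
        # Demote false positives only if there are too many of them.
        false_positives = candidates - truth
        if len(false_positives) > tolerance:
            for c in false_positives:
                update(x, c, 1.0 / r)
```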
The Update Subroutine
Subroutine Update(instance x, concept c, rate r)
0. The inclusion threshold g is a constant parameter.
1. For every feature fi ∈ f(x):
   If (v_c[i] is 0, and r > 1), then v_c[i] ← 1.0
   else v_c[i] ← v_c[i] * r
2. Max-normalize v_c: ∀ fi, v_c[i] ← v_c[i] / max_j v_c[j]
3. Update the index for c so the following condition holds again: c ∈ C(fi) iff v_c[i] ≥ g
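Under our reading of this slide, the subroutine could be sketched as below (we initialize absent promoted features to 1.0, following the "initialize to 1" option on the earlier slide, and use inclusion threshold g = 0.1 as in the experiments; the names and data structures are our assumptions):

```python
# Sketch of the multiplicative Update subroutine plus edge re-thresholding.
G = 0.1  # inclusion threshold g

def update(v_c, active_features, rate):
    """Update concept c's sparse weight vector v_c (dict: feature -> weight).
    rate > 1 promotes (false negative); rate < 1 demotes (false positive).
    Returns the set of features whose edge to c survives the threshold."""
    for f in active_features:
        if f not in v_c:
            if rate > 1:        # promotion initializes an absent feature
                v_c[f] = 1.0
            # demotion ignores absent / zero-weight features
        else:
            v_c[f] *= rate
    # Max-normalize so the largest weight becomes 1 (optional in the talk).
    m = max(v_c.values(), default=0.0)
    if m > 0:
        for f in v_c:
            v_c[f] /= m
    # An (f, c)-edge is in the cover iff v_c[f] >= G.
    return {f for f, w in v_c.items() if w >= G}
```

With rate r = 1.2, repeated demotions shrink an irrelevant feature's weight geometrically until it falls below g and its edge is dropped from the index.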
Analysis
• Under a distribution X on instances, a given cover E induces:
  • A false-positive rate: fp-rate(E) = Avg_{x~X}[fp-count on x]
  • A false-negative rate (fn-rate), defined analogously
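As a concrete reading of these definitions, the empirical rates over a sample could be computed as (a sketch; names are ours):

```python
# Empirical fp-rate of a cover: the average fp-count (number of falsely
# retrieved concepts) per instance; fn-rate is the average number of
# missed concepts per instance.

def fp_rate(sample, retrieve, true_concepts):
    total = sum(len(retrieve(x) - true_concepts(x)) for x in sample)
    return total / len(sample)

def fn_rate(sample, retrieve, true_concepts):
    total = sum(len(true_concepts(x) - retrieve(x)) for x in sample)
    return total / len(sample)
```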
Analysis
• If fp-rate(E) <= fp and fn-rate(E) <= fn, we say the cover is an (fp, fn)-cover
• Is there an algorithm that converges efficiently to an (fp, fn)-cover?
• We can show this for the max-norm algorithm, given the existence of a (0,0)-cover and with the tolerance set to 0
Convergence of Max-Norm
• The max-norm algorithm converges to a (0,0)-cover, given that one exists and the tolerance is set to 0
• The max-norm algorithm makes O(KL) mistakes for a concept with K pure features and average instance length L
Pure Features
• A pure feature f for c: if f occurs, the instance belongs to c
• A "pure" feature never gets "punished" for its concept
• It takes O(L) mistakes to get the other, irrelevant features out of the index
Complexity Results
• Deciding the existence of an (fp, fn)-cover is NP-hard (when fp > 0; fn can remain 0).
• Approximation is also NP-hard!
• So why is the algorithm successful in practice?!
Variations
• Some alternatives:
  • Use of weights for ranking
  • Other update policies
    • Additive updates
    • Use of other norms, or no norm
  • Batch versus online
  • ...
Recognition System
[Diagram: instance x → recall system → reduced set of candidate categories → classifier application → categories for x]
The Classifiers
• (Possibly) binary classifiers: one for each concept
• For learning the classifiers: online learning algorithms
Learners Used
• Need online algorithms
• Experimented with:
  • Perceptron
  • Winnow
  • Committees of these (voted perceptrons, etc.)
Experiments
Questions
• Is a small tolerance (10s, 100s) enough?
• Convergence? Overhead (speed & memory)?
• Overall performance (together with classifier training and testing)?
Size Statistics
• 3 large text-categorization corpora:
  • The big new Reuters corpus (Rose et al.)
  • An ads dataset (internal)
  • ODP = Open Directory Project (web pages and their categories)
Domain Statistics

          # of instances   # of features   |C|      L     Cavg
Reuters   800,000          47,000          453      76    3.9
Ads       2,600,000        660,000         13,000   27    4.2
ODP       330,000          3,400,000       70,000   331   4.9

(|C| = number of categories; L = average instance length; Cavg = average number of categories per instance)

[Figure slide: Domains]
Experimental Setup
• Split the data into 70% train and 30% test
• Same split used for all experiments
• Algorithm parameters:
  • Tolerance = 100
  • Learning rate = 1.2
  • Inclusion threshold = 0.1
• 2.4 GHz machine with 64 GB RAM
Performance (Indexer Alone)
[Plots]

With Classifiers
[Plots: Reuters; all three domains, but a subset of classes]
Indexer's Performance

                  W(1)   FP(1)   FP(2)   FN(1)   FN(2)   FN(10)
Reuters  Train    37     40      68      0.3     0.2     0.18
         Test     38     41      72      0.23    0.23    0.22
Ads      Train    46     44      89      0.1     0.016   0.003
         Test     47     46      89      0.15    0.136   0.11
ODP      Train    147    59      237     2.38    0.38    0.004
         Test     86     55      144     2.16    2.22    2.27

W(1) = "work": average number of concepts touched during pass 1
FP(i) = fp-rate at pass i; FN(i) = fn-rate at pass i
Indexer's Timings

          d(1)   d(10)   d(20)   T(20)
Reuters   1m     1.4m    1.8m    0.46h
Ads       0.8m   0.75m   0.8m    0.26h
ODP       74m    2.9m    0.4m    4.15h

d(i) = duration of pass i; T(20) = total time after pass 20
m = minutes, h = hours
Performance With Classifiers I (Reuters)

       FP(1)   FN(1)   FP(10)   FN(10)   T(1)    T(10)
No     0.908   0.982   0.982    0.853    2.6h    52h
Yes    0.903   1.06    0.878    0.94     0.74h   16h

No = index NOT used; Yes = index used
Performance With Classifiers II

F1(i) = F1 score (harmonic mean of precision and recall) at pass i

Reuters, 50 sample categories
       F1(1)   F1(2)   F1(10)   F1(20)   T(1)    T(20)
No     0.43    0.475   0.54     0.555    0.36h   14.3h
Yes    0.42    0.464   0.53     0.524    0.05h   1.75h

Ads, 76 sample categories
No     0.45    0.578   0.715    0.731    0.22h   18.6h
Yes    0.47    0.579   0.700    0.736    0.01h   0.5h

ODP, 108 sample categories
No     0.032   0.07    0.14     0.17     0.34h   19h
Yes    0.059   0.10    0.14     0.15     1.3h    4.5h
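For reference, the F1 measure used in these tables is the harmonic mean of precision and recall:

```python
# F1: the harmonic mean of precision and recall (0.0 when both are 0).
def f1_score(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```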
Error Plot
[Plot: false-positive, false-negative, and total error]

Convergence
[Plot: W and fp-rate vs. number of instances]

Fn-rate vs. Tolerance
[Plot: fn-rate vs. tolerance]

Fp-rate vs. Tolerance
[Plot: fp-rate vs. tolerance]
Index Size Statistics (after 20 passes)

          |E|         avg |f(c)|   |F|         max |C(f)|
Reuters   116,992     258          27,450      64
Ads       523,639     42           173,859     40
ODP       8,900,000   126          2,500,000   35

|E| = size of the cover (number of edges in the index)
avg |f(c)| = average out-degree of concepts
|F| = number of features indexing some concept
max |C(f)| = maximum feature out-degree
High Out-degree Features
• In Reuters:
  • "woodmark" (out-degree 10):
    • Wooden Furniture
    • Measuring & Precision Instruments
    • Electronic Active Components
    • ...
  • "prft" (64)
  • "shr" (59)
Related Work
• Fast classification candidates:
  • Hierarchical learning, trees (kd, metric, ball, vp, cover, ...)
  • Inverted indices (search engines!)
• Fast learning candidates:
  • Nearest neighbors
  • Naïve Bayes
  • Generative models
  • Hierarchical learning
  • Feature selection/reduction
Related
• Fast visual categorization in biological systems (e.g., Thorpe et al.)
• Psychology of concepts (e.g., Murphy '02)
• Associative memory, speed-up learning, blackboard systems, models of aspects of mind/brain
Summary
• Problem: efficiently learn and classify when categories abound
• Proposed the recall system: an index that serves as a filter
• Efficiently learned the filter: quickly learned a quick system!
Current/Future
• Evaluation on other domains:
  • Language modeling, prediction
  • Vision ...
• Extend techniques:
  • Ranking (easier than labeling: got very promising results)
  • Learn "staged" versions
  • Concept discovery
• Understand better:
  • Why do such efficient algorithms work?
  • Why should good covers exist? What tolerance?
  • Strengthen the convergence analysis
Acknowledgements
• Thanks to Thomas Pierce for helping us with the Nutch engine
• The Y!R ML group (DeCoste and Keerthi) for discussions