Recall Systems: Efficient Learning and Use of Category Indices


Post on 12-Jan-2016


Page 1: Recall Systems: Efficient Learning and Use of Category Indices

Recall Systems: Efficient Learning and Use of Category Indices

Omid Madani

With Wiley Greiner, David Kempe, and Mohammad Salavatipour

Page 2: Recall Systems: Efficient Learning and Use of Category Indices

Overview

• Problems and motivation
• Proposal: recall systems
• Experiments
• Related work and conclusions

Page 3: Recall Systems: Efficient Learning and Use of Category Indices

Massive Learning

• Lots of ...
  • Instances (millions, unbounded..)
  • Dimensions (1000s and beyond)
  • Categories (1000s and beyond)

• Two questions:
  1. How to quickly categorize?
  2. How to efficiently learn to categorize efficiently?

Page 4: Recall Systems: Efficient Learning and Use of Category Indices

Yahoo! Page Topics (Y! Directory)

[Figure: fragment of the Yahoo! directory topic tree, with nodes Arts & Humanities, Photography, Magazines, Contests, Education, History, Business & Economy, Recreation & Sports, Sports, Amateur, College, Basketball]

Over 100,000 categories in the Yahoo! directory

Given a page, quickly categorize…

Larger for vision, text prediction, ... (millions and beyond)

Page 5: Recall Systems: Efficient Learning and Use of Category Indices

Efficiency

1. Two phases (unless truly online):
   1. Learning
   2. Classification time/deployment
2. Resource requirements:
   1. Memory
   2. Time
   3. Sample efficiency

Page 6: Recall Systems: Efficient Learning and Use of Category Indices

Idea

• Cues in input may quickly narrow down possibilities => “index” categories
• Like a search engine, but learn a good index
• Goal here: the index reduces the possible classes; classifiers are then applied for precise classification
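The two-stage idea can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `categorize`, the dictionary-based index, and the toy lambda classifiers are all assumptions made for the example.

```python
def categorize(index, classifiers, features):
    """Two-stage categorization: index lookup first, then per-concept classifiers.

    index:       dict mapping feature -> set of candidate concepts (hypothetical layout)
    classifiers: dict mapping concept -> callable(features) -> bool (hypothetical)
    features:    set of active features of the instance
    """
    # Stage 1: the index narrows the candidates to concepts linked to an active feature.
    candidates = set()
    for f in features:
        candidates |= index.get(f, set())
    # Stage 2: only the candidates' classifiers are evaluated.
    return {c for c in candidates if classifiers[c](features)}


# Toy data, invented for illustration only.
index = {"goal": {"sports", "business"}, "stock": {"business"}}
classifiers = {
    "sports": lambda feats: "goal" in feats and "stock" not in feats,
    "business": lambda feats: "stock" in feats,
}
print(categorize(index, classifiers, {"goal"}))
```

Only the classifiers of retrieved concepts run, which is where the savings would come from when the number of categories is large.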

Page 7: Recall Systems: Efficient Learning and Use of Category Indices

Summary Findings

• Very fast:
  • Train time: learned in minutes on thousands of instances/categories
  • 10s of online classifiers trained on each instance (not 1000s)
• Index doesn’t hurt classifier accuracy!

Page 8: Recall Systems: Efficient Learning and Use of Category Indices

Recognition System

[Diagram: instance x → Recall System → reduced set of candidate categories → Classifier Application → categories for x]

Page 9: Recall Systems: Efficient Learning and Use of Category Indices

The Problem: Tripartite Graph

[Figure: tripartite graph linking features (f1 to f5), instances (x1 to x7), and categories (c1 to c4)]

Page 10: Recall Systems: Efficient Learning and Use of Category Indices

Output: An Index

[Figure: bipartite graph between features (f1 to f5) and concepts (c1 to c5); the set of edges (E) is the “COVER”]

$C(f_4) = \{c_3, c_4\}$

$f(c_1) = \{f_2\}$

$f(c_2) = \{f_2, f_5\}$

Page 11: Recall Systems: Efficient Learning and Use of Category Indices

Using the Index

• Given instance x, retrieve the following candidate set of concepts:

$\bigcup_{f_i \in f(x)} C(f_i)$, where $f(x)$ = set of active features of $x$

A concept is retrieved when a disjunction of features is satisfied
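The retrieval rule is a union of per-feature concept sets, which can be sketched as below. The dictionary layout and the toy index are assumptions for illustration; the feature and concept names are placeholders.

```python
def retrieve(index, active_features):
    """Return the union of C(f) over all active features f of an instance.

    index: dict mapping feature -> set of concepts (the cover, stored per feature).
    """
    candidates = set()
    for f in active_features:
        # Features with no outgoing edges contribute nothing.
        candidates |= index.get(f, set())
    return candidates


# Hypothetical toy cover, for illustration only.
index = {"f2": {"c1", "c2"}, "f4": {"c3", "c4"}, "f5": {"c2"}}
print(retrieve(index, {"f4", "f5"}))
```

A concept appears in the result as soon as any one of its indexing features is active, which is the disjunction the slide refers to.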

Page 12: Recall Systems: Efficient Learning and Use of Category Indices

Terminology

• False positive: The retrieved concept shouldn’t have been retrieved (irrelevant)

• False negative: The concept should have been retrieved, but was not (missed)
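In set terms, both error types fall out of comparing the retrieved set with the true set of concepts for an instance. A minimal sketch (the function name is hypothetical):

```python
def retrieval_errors(retrieved, true_concepts):
    """Split index mistakes on one instance into false positives and false negatives."""
    false_positives = retrieved - true_concepts   # retrieved but irrelevant
    false_negatives = true_concepts - retrieved   # relevant but missed
    return false_positives, false_negatives


fp, fn = retrieval_errors({"c1", "c2", "c3"}, {"c2", "c4"})
print(sorted(fp), sorted(fn))
```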

Page 13: Recall Systems: Efficient Learning and Use of Category Indices

Learning to Index

• Let’s learn the cover (the edges)

• Online and mistake driven

• Mistake means:
  • A false negative concept, or
  • Too many false positives

Page 14: Recall Systems: Efficient Learning and Use of Category Indices

The Indexer Algorithm

• For each concept c keep a sparse vector Vc, initially 0
• Begin with empty cover
• On each instance x,
  • Retrieve candidate concepts
  • Update Vc for each false negative c (promotion)
  • If fp-count > tolerance, update Vc for each false positive c (demotion)
  • Update index accordingly
  • Update classifiers

Page 15: Recall Systems: Efficient Learning and Use of Category Indices

Use Feature Weights

• For each concept c keep a sparse vector Vc, initially 0

• An (i,j)-edge exists in the cover iff $V_{c_j}[f_i] \ge 0.1$ (the inclusion threshold)

Page 16: Recall Systems: Efficient Learning and Use of Category Indices

Updating the Vectors

• Increase/decrease feature weights in Vc that appear in x by learning rate
• In promotion, if feature is not present in Vc: initialize to 1 or 1/df
• In demotion: ignore 0 features
• Max normalize weights (optional)
• Update the index
• Takes O(|x| + |Vc|) on every instance

(l)

Page 17: Recall Systems: Efficient Learning and Use of Category Indices

The Indexer Algorithm

Algorithm Max_Norm(tolerance, learning rate r)

1. Begin with an empty index
2. For each instance x in training sample:
   2.1 Retrieve concepts: $\bigcup_{f_i \in f(x)} C(f_i)$
   2.2 Promote for each false negative concept c: Update(x, c, r)
   2.3 If fp-count is greater than tolerance, demote for each false positive concept c: Update(x, c, 1.0/r)
   2.4 Apply and update corresponding classifiers

Page 18: Recall Systems: Efficient Learning and Use of Category Indices

The Update Subroutine

Subroutine Update(instance x, concept c, rate r)

0. Inclusion threshold g is a constant parameter
1. For every feature $f_i \in f(x)$:
   If ($v_c[i]$ is 0, and $r > 1$), then $v_c[i] \leftarrow 1.0$
   else $v_c[i] \leftarrow v_c[i] \cdot r$
2. Max normalize $v_c$: $\forall f_i,\ v_c[i] \leftarrow v_c[i] / \max_j v_c[j]$
3. Update index for c so the following condition holds again: $c \in C(f_i)$ iff $v_c[i] \ge g$
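The main loop and the update subroutine can be put together in a short sketch. This is an illustrative reading of the slides, not the authors' code: the class and method names are hypothetical, the comparison against the inclusion threshold is assumed to be "greater than or equal", unseen features are initialized to 1.0 rather than 1/df, and the classifier step (2.4) is left out.

```python
from collections import defaultdict


class Indexer:
    """Sketch of the max-norm indexer (names and data layout are assumptions)."""

    def __init__(self, tolerance=100, rate=1.2, threshold=0.1):
        self.tolerance = tolerance
        self.rate = rate
        self.threshold = threshold
        self.weights = defaultdict(dict)  # concept -> {feature: weight}, sparse V_c
        self.index = defaultdict(set)     # feature -> set of concepts, C(f)

    def retrieve(self, features):
        candidates = set()
        for f in features:
            candidates |= self.index.get(f, set())
        return candidates

    def update(self, features, concept, r):
        v = self.weights[concept]
        # Step 1: scale the weights of the active features.
        for f in features:
            if f not in v:
                if r > 1:
                    v[f] = 1.0  # promotion initializes a zero-weight feature
                # demotion ignores zero-weight features
            else:
                v[f] *= r
        # Step 2: max-normalize; step 3: restore "edge iff weight >= threshold".
        top = max(v.values(), default=0.0)
        for f in v:
            if top > 0:
                v[f] /= top
            if v[f] >= self.threshold:
                self.index[f].add(concept)
            elif f in self.index:
                self.index[f].discard(concept)

    def train_step(self, features, true_concepts):
        retrieved = self.retrieve(features)
        for c in true_concepts - retrieved:        # false negatives: promote
            self.update(features, c, self.rate)
        false_pos = retrieved - true_concepts
        if len(false_pos) > self.tolerance:        # too many false positives: demote
            for c in false_pos:
                self.update(features, c, 1.0 / self.rate)
        return retrieved


# Toy run with tolerance 0: "b" is shared, "a" and "d" are pure features.
indexer = Indexer(tolerance=0)
data = [({"a", "b"}, {"c1"}), ({"b", "d"}, {"c2"})]
for _ in range(20):
    for feats, labels in data:
        indexer.train_step(feats, labels)
print(indexer.retrieve({"a", "b"}), indexer.retrieve({"b", "d"}))
```

In the toy run the shared feature "b" is repeatedly demoted for both concepts until its weight drops below the threshold, while the pure features keep weight 1.0 and stay in the index.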

Page 19: Recall Systems: Efficient Learning and Use of Category Indices

Analysis

• Under a distribution X on instances

• A given cover E induces:
  • A false-positive rate (fp-rate): $\text{fp-rate}(E) = \mathrm{Avg}_{x \sim X}\,[\text{fp-count on } x]$
  • A false-negative rate (fn-rate)

Page 20: Recall Systems: Efficient Learning and Use of Category Indices

Analysis

• If fp-rate(E) <= fp, and fn-rate(E) <= fn, we say the cover is a (fp,fn)-cover

• Is there an algorithm that converges efficiently to a (fp, fn)-cover?

• We can show this for the max-norm algorithm, given that a (0,0)-cover exists and the tolerance is set to 0
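On a finite labeled sample, the two rates and the (fp,fn)-cover condition can be estimated empirically. The sketch below assumes that fn-rate, like fp-rate, is an average count per instance (the slides do not spell this out), and the function names are hypothetical.

```python
def cover_rates(index, sample):
    """Average fp-count and fn-count per instance for a cover stored as feature -> concepts."""
    fp_total = 0
    fn_total = 0
    for features, true_concepts in sample:
        retrieved = set()
        for f in features:
            retrieved |= index.get(f, set())
        fp_total += len(retrieved - true_concepts)
        fn_total += len(true_concepts - retrieved)
    n = len(sample)
    return fp_total / n, fn_total / n


def is_cover(index, sample, fp, fn):
    """True if the cover is an (fp, fn)-cover on this sample."""
    fp_rate, fn_rate = cover_rates(index, sample)
    return fp_rate <= fp and fn_rate <= fn


# Toy cover and sample, invented for illustration.
index = {"a": {"c1"}, "b": {"c1", "c2"}}
sample = [({"a", "b"}, {"c1"}), ({"b"}, {"c2"})]
print(cover_rates(index, sample))
```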

Page 21: Recall Systems: Efficient Learning and Use of Category Indices

Convergence of max-norm

• The max-norm algorithm converges to a (0,0)-cover, given that one exists and the tolerance is set to 0

• The max-norm algorithm makes O(KL) mistakes for a concept with K pure features and an average instance length of L

Page 22: Recall Systems: Efficient Learning and Use of Category Indices

Pure Features

• Pure feature f for c: if f occurs, the instance belongs to c

• A “pure” feature never gets “punished” for its concept

• Will take O(L) mistakes to get other irrelevant features out of index

Page 23: Recall Systems: Efficient Learning and Use of Category Indices

Complexity Results

• Deciding the existence of an (fp,fn)-cover is NP-hard (when fp > 0, fn can remain 0).

• Approximation is also NP-hard!

• Why successful in practice?!

Page 24: Recall Systems: Efficient Learning and Use of Category Indices

Variations

• Some alternatives:
  • Use of weights for ranking
  • Other update policies
    • Additive updates
    • Use of other norms, or no norm
  • Batch versus online
  • …

Page 25: Recall Systems: Efficient Learning and Use of Category Indices

Recognition System

[Diagram: instance x → Recall System → reduced set of candidate categories → Classifier Application → categories for x]

Page 26: Recall Systems: Efficient Learning and Use of Category Indices

The Classifiers

• (Possibly) Binary classifiers:
  • One for each concept

• For learning the classifiers:
  • Online learning algorithms

Page 27: Recall Systems: Efficient Learning and Use of Category Indices

Learners Used

• Need online algorithms

• Experimented with:
  • Perceptron
  • Winnow
  • Committees of these (voted perceptrons, etc.)
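As a reminder of what an online, mistake-driven learner looks like, here is a minimal sparse binary perceptron of the kind that could be kept per concept. It is a textbook sketch under the assumption of binary (present/absent) features; the class name and interface are hypothetical and not taken from the talk.

```python
from collections import defaultdict


class SparsePerceptron:
    """Textbook online perceptron over sets of active (binary) features."""

    def __init__(self):
        self.w = defaultdict(float)  # feature -> weight
        self.bias = 0.0

    def score(self, features):
        return self.bias + sum(self.w.get(f, 0.0) for f in features)

    def predict(self, features):
        return self.score(features) > 0

    def learn(self, features, label):
        """Update only on a mistake; label is True for members of the concept."""
        if self.predict(features) != label:
            delta = 1.0 if label else -1.0
            for f in features:
                self.w[f] += delta
            self.bias += delta


clf = SparsePerceptron()
for _ in range(5):
    clf.learn({"a"}, True)
    clf.learn({"b"}, False)
print(clf.predict({"a"}), clf.predict({"b"}))
```

In a recall system, `learn` would be called only for the concepts the index retrieves (plus any missed true concepts), so each instance touches tens of classifiers rather than thousands.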

Page 28: Recall Systems: Efficient Learning and Use of Category Indices

Experiments

Page 29: Recall Systems: Efficient Learning and Use of Category Indices

Questions

• Small tolerance (10s, 100s) enough?
• Convergence? Overhead (speed & memory)?
• Overall performance? (together with classifier training and testing)

Page 30: Recall Systems: Efficient Learning and Use of Category Indices

Size Statistics

• 3 large text categorization corpora:
  • The big new Reuters corpus (Rose et al)
  • An ads dataset (internal)
  • ODP = Open Directory Project (web pages and their categories)

Page 31: Recall Systems: Efficient Learning and Use of Category Indices

Domain statistics

           # of instances   # of features   |C|      L     Cavg
Reuters    800,000          47,000          453      76    3.9
Ads        2,600,000        660,000         13,000   27    4.2
ODP        330,000          3,400,000       70,000   331   4.9

Page 32: Recall Systems: Efficient Learning and Use of Category Indices

Domains

Page 33: Recall Systems: Efficient Learning and Use of Category Indices

Experimental set up

• Split data into 70% train and 30% test
• Same split used for all experiments
• Algorithm parameters:
  • Tolerance = 100
  • Learning rate = 1.2
  • Inclusion threshold = 0.1
• 2.4 GHz machine with 64 GB RAM

Page 34: Recall Systems: Efficient Learning and Use of Category Indices

Performance (Indexer Alone)

Page 35: Recall Systems: Efficient Learning and Use of Category Indices

With Classifiers

• Reuters
• All three domains, but subset of classes

Page 36: Recall Systems: Efficient Learning and Use of Category Indices

Indexer’s Performance

                 W(1)   FP(1)   FP(2)   FN(1)   FN(2)   FN(10)
Reuters  Train   37     40      68      0.3     0.2     0.18
         Test    38     41      72      0.23    0.23    0.22
Ads      Train   46     44      89      0.1     0.016   0.003
         Test    47     46      89      0.15    0.136   0.11
ODP      Train   147    59      237     2.38    0.38    0.004
         Test    86     55      144     2.16    2.22    2.27

W(1) = “Work”: avg number of concepts touched during pass 1
FP(i) = fp-rate at pass i
FN(i) = fn-rate at pass i

Page 37: Recall Systems: Efficient Learning and Use of Category Indices

Indexer’s Timings

           d(1)    d(10)   d(20)   T(20)
Reuters    1m      1.4m    1.8m    0.46h
Ads        0.8m    0.75m   0.8m    0.26h
ODP        74m     2.9m    0.4m    4.15h

d(i) = duration of pass i
T(20) = total time after pass 20
m = minutes, h = hours

Page 38: Recall Systems: Efficient Learning and Use of Category Indices

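Performance With Classifiers I (Reuters)

Index   (error-rate columns)               T(1)    T(10)
No      0.908   0.982   0.982   0.853      2.6h    52h
Yes     0.903   1.06    0.878   0.94       0.74h   16h

[The four error-rate columns are FP(1), FN(1), FP(10), and FN(10); their left-to-right order is not recoverable from the transcript]

No = index NOT used, Yes = index used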

Page 39: Recall Systems: Efficient Learning and Use of Category Indices

With Classifiers II

                                Index   F1(1)   F1(2)   F1(10)  F1(20)  T(1)    T(20)
Reuters, 50 sample categories   No      0.43    0.475   0.54    0.555   0.36h   14.3h
                                Yes     0.42    0.464   0.53    0.524   0.05h   1.75h
Ads, 76 sample categories       No      0.45    0.578   0.715   0.731   0.22h   18.6h
                                Yes     0.47    0.579   0.700   0.736   0.01h   0.5h
ODP, 108 sample categories      No      0.032   0.07    0.14    0.17    0.34h   19h
                                Yes     0.059   0.10    0.14    0.15    1.3h    4.5h

F1(i) = F1 score (harmonic mean of precision and recall) at pass i
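For reference, the harmonic mean mentioned above can be written out as follows. The symbols P (precision), R (recall), TP, FP, and FN (true positive, false positive, and false negative counts) are introduced here for illustration and do not appear on the slide.

```latex
F_1 = \frac{2 P R}{P + R}, \qquad
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}
```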

Page 40: Recall Systems: Efficient Learning and Use of Category Indices

Error Plot

[Plot: false positive, false negative, and total error curves]

Page 41: Recall Systems: Efficient Learning and Use of Category Indices

Convergence

[Plot: W and fp-rate versus number of instances]

Page 42: Recall Systems: Efficient Learning and Use of Category Indices

Fn-rate vs. Tolerance

[Plot: fn-rate versus tolerance]

Page 43: Recall Systems: Efficient Learning and Use of Category Indices

Fp-rate vs. Tolerance

[Plot: fp-rate versus tolerance]

Page 44: Recall Systems: Efficient Learning and Use of Category Indices

Index Size Statistics (After 20 Passes)

           |E|         avg |f(c)|   |F|         max |C(f)|
Reuters    116,992     258          27,450      64
Ads        523,639     42           173,859     40
ODP        8,900,000   126          2,500,000   35

|E| = size of the cover (number of edges in index)
avg |f(c)| = average outdegree for concepts
|F| = number of features indexing some concept
max |C(f)| = max feature outdegree

Page 45: Recall Systems: Efficient Learning and Use of Category Indices

High Out-degree Features

• In Reuters:
  • “woodmark” (outdegree 10)
    • Wooden Furniture Measuring
    • Precision Instruments
    • Electronic Active Components
    • …
  • “prft” (64)
  • “shr” (59)

Page 46: Recall Systems: Efficient Learning and Use of Category Indices

Related Work

• Fast classification candidates:
  • hierarchical learning, trees (kd, metric, ball, vp, cover, ..),
  • inverted indices (search engines!)

• Fast learning candidates:
  • Nearest neighbors
  • Naïve Bayes
  • Generative models
  • Hierarchical learning

• Feature selection/reduction

Page 47: Recall Systems: Efficient Learning and Use of Category Indices

Related

• Fast visual categorization in biological systems (e.g. Thorpe et al)

• Psychology of concepts (e.g. Murphy’02)

• Associative memory, speed up learning, blackboard systems, models of aspects of mind/brain

Page 48: Recall Systems: Efficient Learning and Use of Category Indices

Summary

• Problem: Efficiently learn and classify when categories abound

• Proposed the recall system: an index that serves as a filter

• Efficiently learned the filter

quickly learned a quick system!

Page 49: Recall Systems: Efficient Learning and Use of Category Indices

Current/Future

• Evaluation on other domains
  • Language modeling, prediction
  • Vision ..

• Extend techniques
  • Ranking (easier than labeling: got very promising results)
  • Learn “staged” versions
  • Concept discovery

• Understand better:
  • Why do such efficient algorithms work?
  • Why should good covers exist? What tolerance?
  • Strengthen convergence analysis

Page 50: Recall Systems: Efficient Learning and Use of Category Indices

Acknowledgements

• Thanks to Thomas Pierce for helping us with the Nutch engine

• The Y!R ML group (DeCoste and Keerthi) for discussions
