Recall Systems: Efficient Learning and Use of Category Indices
DESCRIPTION
Recall Systems: Efficient Learning and Use of Category Indices. Omid Madani, with Wiley Greiner, David Kempe, and Mohammad Salavatipour. Overview: problems and motivation; proposal: recall systems; experiments; related work and conclusions.
TRANSCRIPT
Research
Recall Systems: Efficient Learning and Use of Category Indices
Omid Madani
With Wiley Greiner, David Kempe, and Mohammad Salavatipour
Overview
• Problems and motivation
• Proposal: recall systems
• Experiments
• Related work and conclusions
Massive Learning
• Lots of ...
  • Instances (millions, unbounded ...)
  • Dimensions (1000s and beyond)
  • Categories (1000s and beyond)
• Two questions:
  1. How to categorize quickly?
  2. How to efficiently learn to categorize efficiently?
Yahoo! Page Topics (Y! Directory)
[Diagram: a fragment of the Yahoo! directory hierarchy, with categories such as Arts & Humanities, Photography, Magazines, Contests, Education, History, Business & Economy, Recreation & Sports, Sports, Amateur, College, Basketball]
• Over 100,000 categories in the Yahoo! directory
• Given a page, quickly categorize...
• Even larger category sets for vision, text prediction, ... (millions and beyond)
Efficiency
1. Two phases (unless truly online):
   1. Learning
   2. Classification time / deployment
2. Resource requirements:
   1. Memory
   2. Time
   3. Sample efficiency
Idea
• Cues in the input may quickly narrow down the possibilities => "index" the categories
• Like a search engine, but learn a good index
• Goal here: the index reduces the possible classes; classifiers are then applied for precise classification
Summary Findings
• Very fast:
  • Train time: learned in minutes on thousands of instances/categories
  • 10s of online classifiers trained on each instance (not 1000s)
• The index doesn't hurt classifier accuracy!
Recognition System
[Diagram: instance x → recall system → reduced set of candidate categories → classifier application → categories for x]
The Problem: A Tripartite Graph
[Diagram: a tripartite graph connecting features f1–f5, instances x1–x7, and categories c1–c4]
Output: An Index
[Diagram: a bipartite graph from features f1–f5 to concepts c1–c5; the set of edges E is the "COVER"]
• Example entries: c(f4) = {c3, c4}, f(c1) = {f2}, f(c2) = {f2, f5}
Using the Index
• Given instance x, retrieve the following candidate set of concepts:
  ∪_{fi ∈ f(x)} c(fi), where f(x) = the set of active features of x
• A concept is retrieved when a disjunction of its features is satisfied
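This retrieval rule can be sketched as follows (the dict-of-sets index representation and all names here are our illustrative assumptions, not the talk's own code):

```python
# Candidate retrieval from a learned index: the candidate set is the union
# of c(f) over the instance's active features f(x).

def retrieve(index, active_features):
    """index: dict mapping feature -> set of concepts, i.e. c(f).
    active_features: f(x), the set of features active in instance x."""
    candidates = set()
    for f in active_features:
        candidates |= index.get(f, set())
    return candidates

# Example: with c(f1) = {c1} and c(f2) = {c1, c2}, an instance whose
# active features are {f2, f3} retrieves {c1, c2}.
toy_index = {"f1": {"c1"}, "f2": {"c1", "c2"}}
print(sorted(retrieve(toy_index, {"f2", "f3"})))  # -> ['c1', 'c2']
```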
Terminology
• False positive: the retrieved concept should not have been retrieved (irrelevant)
• False negative: the concept should have been retrieved, but was not (missed)
Learning to Index
• Let's learn the cover (the edges)
• Online and mistake-driven
• A mistake means:
  • A false-negative concept, or
  • Too many false positives
The Indexer Algorithm
• For each concept c, keep a sparse vector Vc, initially 0
• Begin with an empty cover
• On each instance x:
  • Retrieve candidate concepts
  • Update Vc for each false-negative c (promotion)
  • If fp-count > tolerance, update Vc for each false-positive c (demotion)
  • Update the index accordingly
  • Update the classifiers
Use Feature Weights
• For each concept c, keep a sparse vector Vc, initially 0
• An (i, j)-edge exists in the cover iff V_{cj}[fi] >= 0.1 (the inclusion threshold)
Updating the Vectors
• Increase/decrease the feature weights in Vc that appear in x by the learning rate
• In promotion, if a feature is not present in Vc: initialize it to 1 or 1/df
• In demotion: ignore 0-weight features
• Max-normalize the weights (optional)
• Update the index
• Takes O(|x| + |Vc|) time on every instance
The Indexer Algorithm
Algorithm Max_Norm(tolerance, learning rate r)
1. Begin with an empty index.
2. For each instance x in the training sample:
   2.1 Retrieve concepts: ∪_{fi ∈ f(x)} C(fi)
   2.2 Promote: for each false-negative concept c: Update(x, c, r)
   2.3 If fp-count is greater than tolerance, demote: for each false-positive concept c: Update(x, c, 1.0/r)
   2.4 Apply and update the corresponding classifiers
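A minimal sketch of this online, mistake-driven loop (the `retrieve` and `update` callables are stand-ins for the routines on the surrounding slides; all names are ours):

```python
# One pass of the (sketched) Indexer loop: promote missed concepts,
# and demote falsely retrieved concepts only when their count exceeds
# the tolerance.

def indexer_pass(instances, true_concepts, retrieve, update, r, tolerance):
    """instances: iterable of instances x
    true_concepts: function mapping x to its set of true concepts
    retrieve: function mapping x to the candidate concept set
    update: the Update(x, c, rate) subroutine
    r: learning rate (> 1); demotion uses 1.0 / r
    tolerance: max tolerated false-positive count before demotion
    """
    for x in instances:
        candidates = retrieve(x)
        truth = true_concepts(x)
        # Promote every false-negative concept.
        for c in truth - candidates:
            update(x, c, r)
        # Demote false positives only if there are too many of them.
        false_positives = candidates - truth
        if len(false_positives) > tolerance:
            for c in false_positives:
                update(x, c, 1.0 / r)
```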
The Update Subroutine
Subroutine Update(instance x, concept c, rate r)
0. The inclusion threshold g is a constant parameter.
1. For every feature fi ∈ f(x):
   If (v_c[i] is 0, and r > 1), then v_c[i] ← 1.0
   else v_c[i] ← v_c[i] * r
2. Max-normalize v_c: ∀ fi, v_c[i] ← v_c[i] / max_j v_c[j]
3. Update the index for c so the following condition holds again: c ∈ C(fi) iff v_c[i] ≥ g
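Under our reading of this slide, the subroutine could be sketched as below (we initialize absent promoted features to 1.0, following the "initialize to 1" option on the earlier slide, and use inclusion threshold g = 0.1 as in the experiments; the names and data structures are our assumptions):

```python
# Sketch of the multiplicative Update subroutine plus edge re-thresholding.
G = 0.1  # inclusion threshold g

def update(v_c, active_features, rate):
    """Update concept c's sparse weight vector v_c (dict: feature -> weight).
    rate > 1 promotes (false negative); rate < 1 demotes (false positive).
    Returns the set of features whose edge to c survives the threshold."""
    for f in active_features:
        if f not in v_c:
            if rate > 1:        # promotion initializes an absent feature
                v_c[f] = 1.0
            # demotion ignores absent / zero-weight features
        else:
            v_c[f] *= rate
    # Max-normalize so the largest weight becomes 1 (optional in the talk).
    m = max(v_c.values(), default=0.0)
    if m > 0:
        for f in v_c:
            v_c[f] /= m
    # An (f, c)-edge is in the cover iff v_c[f] >= G.
    return {f for f, w in v_c.items() if w >= G}
```

With rate r = 1.2, repeated demotions shrink an irrelevant feature's weight geometrically until it falls below g and its edge is dropped from the index.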
Analysis
• Under a distribution X on instances, a given cover E induces:
  • A false-positive rate: fp-rate(E) = Avg_{x~X}[fp-count on x]
  • A false-negative rate (fn-rate), defined analogously
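As a concrete reading of these definitions, the empirical rates over a sample could be computed as (a sketch; names are ours):

```python
# Empirical fp-rate of a cover: the average fp-count (number of falsely
# retrieved concepts) per instance; fn-rate is the average number of
# missed concepts per instance.

def fp_rate(sample, retrieve, true_concepts):
    total = sum(len(retrieve(x) - true_concepts(x)) for x in sample)
    return total / len(sample)

def fn_rate(sample, retrieve, true_concepts):
    total = sum(len(true_concepts(x) - retrieve(x)) for x in sample)
    return total / len(sample)
```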
Analysis
• If fp-rate(E) <= fp and fn-rate(E) <= fn, we say the cover is an (fp, fn)-cover
• Is there an algorithm that converges efficiently to an (fp, fn)-cover?
• We can show this for the max-norm algorithm, given the existence of a (0,0)-cover and with the tolerance set to 0
Convergence of Max-Norm
• The max-norm algorithm converges to a (0,0)-cover, given that one exists and the tolerance is set to 0
• The max-norm algorithm makes O(KL) mistakes for a concept with K pure features and average instance length L
Pure Features
• A pure feature f for c: if f occurs, the instance belongs to c
• A "pure" feature never gets "punished" for its concept
• It takes O(L) mistakes to get the other, irrelevant features out of the index
Complexity Results
• Deciding the existence of an (fp, fn)-cover is NP-hard (when fp > 0; fn can remain 0).
• Approximation is also NP-hard!
• So why is the algorithm successful in practice?!
Variations
• Some alternatives:
  • Use of weights for ranking
  • Other update policies
    • Additive updates
    • Use of other norms, or no norm
  • Batch versus online
  • ...
Recognition System
[Diagram: instance x → recall system → reduced set of candidate categories → classifier application → categories for x]
The Classifiers
• (Possibly) binary classifiers: one for each concept
• For learning the classifiers: online learning algorithms
Learners Used
• Need online algorithms
• Experimented with:
  • Perceptron
  • Winnow
  • Committees of these (voted perceptrons, etc.)
Experiments
Questions
• Is a small tolerance (10s, 100s) enough?
• Convergence? Overhead (speed & memory)?
• Overall performance (together with classifier training and testing)?
Size Statistics
• 3 large text-categorization corpora:
  • The big new Reuters corpus (Rose et al.)
  • An ads dataset (internal)
  • ODP = Open Directory Project (web pages and their categories)
Domain Statistics

          # of instances   # of features   |C|      L     Cavg
Reuters   800,000          47,000          453      76    3.9
Ads       2,600,000        660,000         13,000   27    4.2
ODP       330,000          3,400,000       70,000   331   4.9

(|C| = number of categories; L = average instance length; Cavg = average number of categories per instance)

[Figure slide: Domains]
Experimental Setup
• Split the data into 70% train and 30% test
• Same split used for all experiments
• Algorithm parameters:
  • Tolerance = 100
  • Learning rate = 1.2
  • Inclusion threshold = 0.1
• 2.4 GHz machine with 64 GB RAM
Performance (Indexer Alone)
[Plots]

With Classifiers
[Plots: Reuters; all three domains, but a subset of classes]
Indexer's Performance

                  W(1)   FP(1)   FP(2)   FN(1)   FN(2)   FN(10)
Reuters  Train    37     40      68      0.3     0.2     0.18
         Test     38     41      72      0.23    0.23    0.22
Ads      Train    46     44      89      0.1     0.016   0.003
         Test     47     46      89      0.15    0.136   0.11
ODP      Train    147    59      237     2.38    0.38    0.004
         Test     86     55      144     2.16    2.22    2.27

W(1) = "work": average number of concepts touched during pass 1
FP(i) = fp-rate at pass i; FN(i) = fn-rate at pass i
Indexer's Timings

          d(1)   d(10)   d(20)   T(20)
Reuters   1m     1.4m    1.8m    0.46h
Ads       0.8m   0.75m   0.8m    0.26h
ODP       74m    2.9m    0.4m    4.15h

d(i) = duration of pass i; T(20) = total time after pass 20
m = minutes, h = hours
Performance With Classifiers I (Reuters)

       FP(1)   FN(1)   FP(10)   FN(10)   T(1)    T(10)
No     0.908   0.982   0.982    0.853    2.6h    52h
Yes    0.903   1.06    0.878    0.94     0.74h   16h

No = index NOT used; Yes = index used
Performance With Classifiers II

F1(i) = F1 score (harmonic mean of precision and recall) at pass i

Reuters, 50 sample categories
       F1(1)   F1(2)   F1(10)   F1(20)   T(1)    T(20)
No     0.43    0.475   0.54     0.555    0.36h   14.3h
Yes    0.42    0.464   0.53     0.524    0.05h   1.75h

Ads, 76 sample categories
No     0.45    0.578   0.715    0.731    0.22h   18.6h
Yes    0.47    0.579   0.700    0.736    0.01h   0.5h

ODP, 108 sample categories
No     0.032   0.07    0.14     0.17     0.34h   19h
Yes    0.059   0.10    0.14     0.15     1.3h    4.5h
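For reference, the F1 measure used in these tables is the harmonic mean of precision and recall:

```python
# F1: the harmonic mean of precision and recall (0.0 when both are 0).
def f1_score(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```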
Error Plot
[Plot: false-positive, false-negative, and total error]

Convergence
[Plot: W and fp-rate vs. number of instances]

Fn-rate vs. Tolerance
[Plot: fn-rate vs. tolerance]

Fp-rate vs. Tolerance
[Plot: fp-rate vs. tolerance]
Index Size Statistics (after 20 passes)

          |E|         avg |f(c)|   |F|         max |C(f)|
Reuters   116,992     258          27,450      64
Ads       523,639     42           173,859     40
ODP       8,900,000   126          2,500,000   35

|E| = size of the cover (number of edges in the index)
avg |f(c)| = average out-degree of concepts
|F| = number of features indexing some concept
max |C(f)| = maximum feature out-degree
High Out-degree Features
• In Reuters:
  • "woodmark" (out-degree 10):
    • Wooden Furniture
    • Measuring & Precision Instruments
    • Electronic Active Components
    • ...
  • "prft" (64)
  • "shr" (59)
Related Work
• Fast classification candidates:
  • Hierarchical learning, trees (kd, metric, ball, vp, cover, ...)
  • Inverted indices (search engines!)
• Fast learning candidates:
  • Nearest neighbors
  • Naïve Bayes
  • Generative models
  • Hierarchical learning
  • Feature selection/reduction
Related
• Fast visual categorization in biological systems (e.g., Thorpe et al.)
• Psychology of concepts (e.g., Murphy '02)
• Associative memory, speed-up learning, blackboard systems, models of aspects of mind/brain
Summary
• Problem: efficiently learn and classify when categories abound
• Proposed the recall system: an index that serves as a filter
• Efficiently learned the filter: quickly learned a quick system!
Current/Future
• Evaluation on other domains:
  • Language modeling, prediction
  • Vision ...
• Extend techniques:
  • Ranking (easier than labeling: got very promising results)
  • Learn "staged" versions
  • Concept discovery
• Understand better:
  • Why do such efficient algorithms work?
  • Why should good covers exist? What tolerance?
  • Strengthen the convergence analysis
Acknowledgements
• Thanks to Thomas Pierce for helping us with the Nutch engine
• The Y!R ML group (DeCoste and Keerthi) for discussions