![Page 1: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/1.jpg)
ResearchResearch
Ranked Recall: Efficient Classification by Learning
Indices That Rank
Omid Madani
with Michael Connor (UIUC)
![Page 2: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/2.jpg)
Many Category Learning (e.g. Y! Directory)
[Figure: fragment of the Yahoo! directory tree, with nodes including Arts & Humanities, Photography, Magazines, Contests, Education, History, Business & Economy, Recreation & Sports, Sports, Amateur, College, Basketball]
• Over 100,000 categories in the Yahoo! directory
• Given a page, quickly categorize…
• Larger for vision, text prediction, ... (millions and beyond)
![Page 3: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/3.jpg)
Supervised Learning
• Often two phases:
  • Training
  • Execution/Testing
• A learnt classifier f (categorizer): f(unseen instance) → class prediction(s), with $f(x) \in Y$

| Class | x1 | x2 | x3 |
|---|---|---|---|
| 1 | 1 | 0 | 3 |
| 5 | 0 | 0 | 1 |
| 2 | 1 | 1 | 0 |
| 2 | 0 | 0 | 0 |
| ? | 0 | 0 | 1 |

Often learn binary classifiers
![Page 4: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/4.jpg)
Massive Learning
• Lots of ...
  • Instances (millions, unbounded, ...)
  • Dimensions (1000s and beyond)
  • Categories (1000s and beyond)
• Two questions:
  1. How to quickly categorize?
  2. How to efficiently learn to categorize efficiently?
![Page 5: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/5.jpg)
Efficiency
1. Two phases (combined when online):
  1. Learning
  2. Classification time/deployment
2. Resource requirements:
  1. Memory
  2. Time
  3. Sample efficiency
![Page 6: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/6.jpg)
Idea
• Cues in input may quickly narrow down possibilities => “index” categories
• Like search engine, but learn a good index
• Goal: learn to strike a good balance between accuracy and efficiency
![Page 7: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/7.jpg)
Summary Findings
• Very fast:
  • Train time: minutes versus hours/days (compared against one-versus-rest and top-down)
  • Classification time: O(|x|)?
• Memory efficient
• Simple to use (runs on a laptop)
• Competitive accuracy!
![Page 8: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/8.jpg)
Problem Formulation
![Page 9: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/9.jpg)
Input-Output Summary
• Input: tripartite graph (features, instances, categories)
• Learn
• Output: an index = sparse weighted directed bipartite graph from features to categories (a sparse matrix)
• The edge from feature $f_i$ to category $c_j$ carries weight $w_{ij}$ (e.g. $f_2 \to c_1$ with weight $w_{21}$)
![Page 10: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/10.jpg)
Scheme
• Learn a weighted bipartite graph
• Rank categories retrieved
• For category assignment, could use rank, or define thresholds, or map scores to probabilities, etc.
![Page 11: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/11.jpg)
Three Parts to the Online Solution
• How to use the index?
• How to update (learn) it?
• When to update it?
![Page 12: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/12.jpg)
Retrieval (Ranked Recall)
Example instance: $x = \{f_2, f_3\}$
[Figure: bipartite graph from features f1 to f4 to categories c1 to c5; edge weights shown: 0.4, 0.3, 0.2, 0.1, 0.1]
1. Features are “activated”
2. Edges are activated
3. Receiving categories are activated
4. Categories sorted/ranked
Sorted list: $(c_4, 0.5), (c_3, 0.4), (c_5, 0.1), (c_1, 0.1)$
Notes:
1. Like use of inverted indices
2. Sparse dot products
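The four retrieval steps can be sketched in a few lines of Python. This is only an illustrative sketch: the dict-of-dicts index layout, the function name `retrieve`, and the assignment of weights to particular edges are assumptions (the slide gives the weights and the resulting ranked list, not which edge carries which weight). The `f1` edge is likewise hypothetical and is there to show that inactive features are never touched.

```python
from collections import defaultdict

def retrieve(index, active_features):
    """Score and rank categories for one instance.

    index: dict mapping feature -> {category: weight} (sparse bipartite graph).
    active_features: dict mapping feature -> value (1.0 for Boolean features).
    """
    scores = defaultdict(float)
    for feature, value in active_features.items():
        # Only edges leaving active features are followed (inverted-index style).
        for category, weight in index.get(feature, {}).items():
            scores[category] += value * weight  # sparse dot product, one term at a time
    # Highest score first; ties broken by category name for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical edge assignment, chosen to reproduce the scores in the slide's list.
index = {
    "f1": {"c2": 0.6},
    "f2": {"c3": 0.4, "c4": 0.3, "c1": 0.1},
    "f3": {"c4": 0.2, "c5": 0.1},
}
ranking = retrieve(index, {"f2": 1.0, "f3": 1.0})
```

The work done is proportional to the number of edges leaving the active features, which is why bounding each feature's out-degree keeps retrieval cheap.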
![Page 13: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/13.jpg)
Computing the Index
• Efficiency: Impose a constraint on every feature’s maximum out-degree
• Accuracy: Connect features to categories and compute weights so that some measure of accuracy is maximized.
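One simple way to enforce the out-degree constraint is to keep only the heaviest outgoing edges of each feature. The slides do not say how the constraint is enforced, so the pruning rule and the name `cap_out_degree` below are assumptions made for illustration.

```python
import heapq

def cap_out_degree(edges, max_out_degree):
    """Keep only the max_out_degree heaviest outgoing edges of one feature.

    edges: dict mapping category -> weight for a single feature.
    """
    kept = heapq.nlargest(max_out_degree, edges.items(), key=lambda item: item[1])
    return dict(kept)

# Hypothetical outgoing edges of one feature.
edges_f = {"c1": 0.05, "c2": 0.40, "c3": 0.30, "c4": 0.25}
pruned = cap_out_degree(edges_f, 2)
```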
![Page 14: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/14.jpg)
Measure of Accuracy: Recall
• Measure average performance per instance
• Recall at k: the proportion of instances for which the right category ended up in the top k
• Recall at k = 1 (R1), 5 (R5), 10, …
• R1 = “accuracy” when “multiclass”
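Recall at k can be computed directly from the ranked lists. The sketch below assumes the multiclass case (exactly one true category per instance); the function name and the toy rankings are hypothetical.

```python
def recall_at_k(rankings, true_categories, k):
    """Fraction of instances whose true category appears among the top k ranked."""
    hits = 0
    for ranked, truth in zip(rankings, true_categories):
        if truth in ranked[:k]:
            hits += 1
    return hits / len(rankings)

# Toy data: one ranked category list and one true category per instance.
rankings = [["c4", "c3", "c1"], ["c2", "c1"], ["c5", "c2", "c3"]]
truths = ["c4", "c1", "c3"]
r1 = recall_at_k(rankings, truths, 1)  # only the first instance is right at rank 1
r5 = recall_at_k(rankings, truths, 5)  # every true category is within the top 5
```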
![Page 15: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/15.jpg)
Computational Complexity
• NP-Hard!
• The problem: given a finite set of instances (Boolean features), exactly one category per instance, is there an index with max out-degree 1 such that R1 on the training set is greater than a threshold t?
• Reduction from set cover
• Approximation? (not known)
![Page 16: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/16.jpg)
How About Practice?
• Devised two main learning algorithms:
  • IND treats features independently.
  • Feature Normalize (FN) doesn’t make an independence assumption; it’s online.
• Only non-negative weights are learned.
![Page 17: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/17.jpg)
Feature Normalize (FN) Algorithm
• Begin with an empty index
• Repeat
  • Input instance (features + categories), and retrieve and rank candidate categories
  • If margin is not met, update index
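The control flow of this loop can be sketched as below. It shows only the outer structure: the ranking, margin, and update steps are passed in as callables, and the toy stand-ins used in the example are hypothetical placeholders, not the FN rules themselves. Treating "margin not met" as "margin below the threshold" is also an assumption.

```python
def train_online(instances, rank, margin_of, update, margin_threshold=0.1):
    """Single online pass: rank, check the margin, update only when it is not met."""
    index = {}  # begin with an empty index (its layout is up to the callables)
    updates = 0
    for features, true_category in instances:
        ranking = rank(index, features)
        if margin_of(ranking, true_category) < margin_threshold:
            update(index, features, true_category)
            updates += 1
    return index, updates

# Toy stand-ins (hypothetical): each feature points at a single category.
def toy_rank(index, features):
    return [index[f] for f in features if f in index]

def toy_margin(ranking, true_category):
    return 1.0 if ranking and ranking[0] == true_category else 0.0

def toy_update(index, features, true_category):
    for f in features:
        index[f] = true_category

stream = [(["a"], "c1"), (["a"], "c1"), (["b"], "c2"), (["a"], "c1")]
index, n_updates = train_online(stream, toy_rank, toy_margin, toy_update,
                                margin_threshold=0.5)
```

In this toy run only the first sighting of each feature triggers an update; later instances already meet the margin and leave the index untouched.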
![Page 18: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/18.jpg)
Three Parts (Online Setting)
• How to use the index?
• How to update it?
• When to update it?
![Page 19: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/19.jpg)
Index Updating
• For each active feature:
  • Strengthen weights between active feature and true category
  • Weaken the other connections to the feature
• Strengthening = Increase weight by addition or multiplication
![Page 20: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/20.jpg)
Updating
[Figure: bipartite graph from features f1 to f4 to categories c1 to c5, annotated with an active feature $f_2 \in x$ and a true category $c_3 \in C_x$]
1. Identify connection
2. Increase weight
3. Normalize/weaken other weights
4. Drop small weights
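The four steps can be sketched for a single active feature as follows. The slides do not give the exact boost or normalization rule, so the additive boost of 1.0 and the sum-to-one normalization are assumptions; the 0.01 default mirrors the minimum allowed weight quoted in the experiments. The function name `update_feature` is hypothetical.

```python
def update_feature(edges, true_category, boost=1.0, min_weight=0.01):
    """One update of a single active feature's outgoing edges (modified in place).

    edges: dict mapping category -> non-negative weight for this feature.
    """
    # Steps 1-2: identify the connection to the true category and increase its weight.
    edges[true_category] = edges.get(true_category, 0.0) + boost
    # Step 3: normalize so the weights sum to 1, which weakens every other edge.
    total = sum(edges.values())
    for category in edges:
        edges[category] /= total
    # Step 4: drop edges whose weight fell below the minimum.
    for category in [c for c, w in edges.items() if w < min_weight]:
        del edges[category]
    return edges

# Hypothetical state of one feature before seeing an instance labelled c3.
edges_f2 = {"c4": 0.6, "c3": 0.39, "c5": 0.01}
update_feature(edges_f2, "c3")
```

After this call `c3` overtakes `c4` without any explicit demotion of `c4`, and the tiny `c5` edge is dropped, which keeps the index sparse.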
![Page 21: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/21.jpg)
Three Parts
• How to use an index?
• How to update it?
• When to update it?
![Page 22: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/22.jpg)
A Tradeoff
1. To achieve stability (helps accuracy), we need to keep updating (think single feature scenario)
2. To “fit” more instances, we need to stop updates on instances that we get “right”
Use of margin threshold strikes a balance.
![Page 23: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/23.jpg)
Margin Definition
• Margin = score of the true positive category MINUS score of highest ranked negative category
• Choice of margin threshold:
  • Fixed, e.g. 0, 0.1, 0.5, …
  • Online average (e.g. average of the last 10000 margins + 0.1)
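The margin and the online-average threshold might be computed as in the sketch below. It assumes one true category per instance and gives a score of 0 to a category that was not retrieved; both are assumptions, as are the names `margin` and `OnlineAverageThreshold`.

```python
from collections import deque

def margin(ranking, true_category):
    """Score of the true category minus the score of the best-ranked other category."""
    scores = dict(ranking)
    true_score = scores.get(true_category, 0.0)  # not retrieved -> 0 (assumption)
    best_negative = max((s for c, s in ranking if c != true_category), default=0.0)
    return true_score - best_negative

class OnlineAverageThreshold:
    """Threshold = average of the last `window` margins + `offset`."""

    def __init__(self, window=10000, offset=0.1):
        self.recent = deque(maxlen=window)  # old margins fall out automatically
        self.offset = offset

    def observe(self, m):
        self.recent.append(m)

    def value(self):
        if not self.recent:
            return self.offset
        return sum(self.recent) / len(self.recent) + self.offset

ranking = [("c4", 0.5), ("c3", 0.4), ("c5", 0.1)]
m_right = margin(ranking, "c4")   # true category ranked first: positive margin
m_wrong = margin(ranking, "c3")   # true category ranked second: negative margin

threshold = OnlineAverageThreshold(window=2, offset=0.1)
for m in (0.1, 0.3, 0.5):
    threshold.observe(m)          # only the last two margins are kept
```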
![Page 24: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/24.jpg)
Salient Aspects of FN
• “Differentially” updates, attempts to improve retrieved ranking (in “context”)
• Normalizes, but from “feature’s side”
• No explicit weight demotion/punishment! (normalization/weakening achieves demotion/reordering)
• Memory/efficiency conscious design from the outset
• Very dynamic/adaptive:
  • Edges added and dropped
  • Weights adjusted, categories reordered
• Extensions/variations exist (e.g. each feature’s out-degree may dynamically adjust)
![Page 25: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/25.jpg)
Domain statistics

| Domain | # of instances | # of features | \|C\| | Avg vector length (L) | Avg labels per x (Cavg) |
|---|---|---|---|---|---|
| Reuters 21578 | 9.4k | 33k | 10 | 80.9 | 1 |
| 20 News grp | 20k | 60k | 20 | 80 | 1 |
| industry | 9.6k | 69k | 104 | 120 | 1 |
| Reuters RCV1 | 23k | 47k | 414 | 76 | 2.08 |
| Ads | 369k | 301k | 12.6k | 27 | 1.4 |
| Web | 70k | 685k | 14k | 210 | 1 |
| Jane Austen | 749k | 299k | 17.4k | 15.1 | 1 |

• Experiments are the average of 10 runs; each run is a single pass, with 90% for training and 10% held out
• |C| is the number of classes, L is the average vector length, Cavg is the average number of categories per instance
![Page 26: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/26.jpg)
Smaller Domains
10 categories, 10k instances
Keerthi and DeCoste, 06 (fast linear SVM)
• Max out-degree = 25, min allowed weight = 0.01, tested with margins 0, 0.1, and 0.5 and up to 10 passes
• 90-10 random splits
![Page 27: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/27.jpg)
Three Smaller Domains
20 categories, 20k instances
![Page 28: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/28.jpg)
Three Smaller Domains
104 categories, 10k instances
![Page 29: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/29.jpg)
3 Large Data Sets (top-down comparisons)
~500 categories, 20k instances
~12.6k categories, ~370k instances
~14k categories, ~70k instances
![Page 30: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/30.jpg)
Accuracy vs. Max Out-Degree
[Figure: accuracy (y-axis) vs. max out-degree allowed (x-axis), with curves for Web page categorization, Ads, and RCV1]
![Page 31: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/31.jpg)
Accuracy vs. Passes and Margin
[Figure: accuracy (y-axis) vs. # passes (x-axis)]
![Page 32: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/32.jpg)
Related Work and Discussion
• Multiclass learning/categorization algorithms (top-down, nearest neighbors, perceptron, Naïve Bayes, MaxEnt, SVMs, online methods, ...)
• Speed up methods (trees, indices, …)
• Feature selection/reduction
• Evaluation criteria
• Fast categorization in the natural world
• Prediction games! (see poster)
![Page 33: Ranked Recall: Efficient Classification by Learning Indices That Rank](https://reader035.vdocuments.us/reader035/viewer/2022062518/5681478e550346895db4be2c/html5/thumbnails/33.jpg)
Summary
• A scalable supervised learning method for huge class sets (and instances, ...)
• Idea: learn an index (a sparse weighted bipartite graph, mapping features to categories)
• Online time/memory efficient algorithms
• Current/future: more algorithms, theory, other domains/applications, ...