Feature Selection Algorithms for Learning to Rank
Andrea Gigli
Slides: http://www.slideshare.net/andrgig
March 2016
Outline
Machine Learning for Ranking
Proposed Feature Selection Algorithms (FSA) and Feature Selection Protocol
Application to Publicly Available Web Search Data
Information Retrieval and Ranking Systems
Information Retrieval is the activity of providing information offers relevant to an information need from a collection of information resources.
Ranking consists of sorting the information offers according to some criterion, so that the "best" results appear early in the provided list.
Information Retrieval and Ranking Systems
[Figure: a ranking system processes an information request (query) against the indexed documents of the information offer and returns the (top) ranked documents.]
- Compute numeric scores on query/document pairs: cosine similarity, BM25 score, LMIR probabilities, ...
- Use machine learning to build a ranking model: Learning to Rank (L2R)
How to Rank
[Figure: an information request (query) and the information offers (documents) feed a learning system during training; the learned ranking system then ranks the indexed documents for new requests at prediction time.]
How to Rank using Supervised Learning
Training data: each query $q_i$, $i = 1, \dots, M$, comes with documents $d_{i,1}, \dots, d_{i,n_i}$ and observed scores $\ell_{i,1}, \dots, \ell_{i,n_i}$. A scoring function $f(q, d)$ is learned from these labelled pairs; at prediction time the documents of a new query $q_{M+1}$ are ranked by $f(q_{M+1}, d_{M+1,1}), \dots, f(q_{M+1}, d_{M+1,n_{M+1}})$.
- $q_i$: i-th query
- $d_{i,j}$: j-th document associated with the i-th query
- $\ell_{i,j}$: observed score of the j-th document associated with the i-th query
- $f(q, d)$: scoring function
Outline
Machine Learning for Ranking
Proposed Feature Selection Algorithms (FSA) and Feature Selection Protocol
Application to Publicly Available Web Search Data
Query & Information Offer Features
Each pair (query $q_i$, document $d_{i,j}$) with label $\ell_{i,j}$ is encoded as a feature vector
$$x_{i,j} = \left( x_{i,j}^{(1)},\, x_{i,j}^{(2)},\, x_{i,j}^{(3)},\, \dots,\, x_{i,j}^{(F)} \right),$$
so the scoring problem moves from $f(q, d)$ to $f(x)$.
F is of the order of hundreds or thousands.
Which features?
- Web Search. Query-URL matching features: number of occurrences of query terms in the document, BM25, N-gram BM25, Tf-Idf, ... Importance of URL: PageRank, number of in-links, number of clicks, BrowseRank, spam score, page quality score, ...
- Online Advertisement. User features: last page visited, time from the last visit, last advertisement clicked, products queried, ... Product features: product description, product category, price, ... User-product matching features: tf-idf, expected rating, ... Page-product matching features: topic, category, tf-idf, ...
- Collaborative Filtering. User features: age, gender, consumption history, ... Product characteristics: category, price, description, ... Context-product matching: tag-matching, tf-idf, ...
How to select features in L2R
The main goal of any feature selection process is to select a subset of n elements from a set of N measurements, with n < N, without significantly degrading the performance of the system.
The search for the optimal subset requires examining $2^N$ possible subsets.
How to select features in L2R
[Plot: the number of possible feature subsets, $2^N$, against the number of features N; already at N = 23 there are more than 8,000,000 subsets.]
A suboptimal criterion is needed.
Proposed Protocol for Comparing Feature Selection Algorithms
1. Measure the relevance of each feature
2. Measure the similarity of each pair of features
3. Select a feature subset using a feature selector
4. Train the L2R model
5. Measure the L2R model performance on the test set
6. Compare the feature selection algorithms
Repeat steps 3-6 for different subset sizes, and repeat from step 3 for every feature selection algorithm.
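Expressed as code, the protocol is a double loop over algorithms and subset sizes. A minimal sketch, assuming the relevance vector and similarity matrix (steps 1-2) are precomputed and the trainer and evaluator are supplied by the caller; every name here is illustrative, not from the deck:

```python
def compare_fsas(relevance, similarity, fsas, subset_sizes, train_fn, eval_fn):
    """Steps 3-6 of the protocol; steps 1-2 (relevance, similarity) are inputs."""
    results = {}
    for name, selector in fsas.items():                    # repeat from 3 for every FSA
        for n in subset_sizes:                             # repeat for different sizes
            subset = selector(relevance, similarity, n)    # 3. select a feature subset
            model = train_fn(subset)                       # 4. train the L2R model
            results[name, n] = eval_fn(model, subset)      # 5. performance on test set
    return results                                         # 6. compare the FSAs
```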
Competing Algorithms for feature selection
We developed the following algorithms:
- Naïve Greedy search Algorithm for feature Selection (NGAS)
- Naïve Greedy search Algorithm for feature Selection - Extended (NGAS-E)
- Hierarchical Clustering search Algorithm for feature Selection (HCAS)
Competing Algorithm for feature selection #1: NGAS
1. The undirected graph is built and the set S of selected features is initialized.
2. Assuming Node 1 has the highest relevance, add it to S.
3. Select the node with the lowest similarity to Node 1, say Node 7, and the one with the highest similarity to Node 7, say Node 5.
4. Remove Node 1. Node 5 has the highest relevance between 5 and 7; add it to S.
5. Select the node with the lowest similarity to Node 5, say Node 2, and the one with the highest similarity to Node 2, say Node 3.
6. Remove Node 5. Assuming Node 2 has the highest relevance between 2 and 3, add it to S.
7. Select the node with the lowest similarity to Node 2, say Node 4, and the one with the highest similarity to Node 4, say Node 8.
8. Remove Node 2. Assuming Node 4 has the highest relevance between 4 and 8, add it to S.
9. Select the node with the lowest similarity to Node 4, say Node 6, and the one with the highest similarity to Node 6, say Node 7.
10. Remove Node 4. Assuming Node 6 has the highest relevance between 6 and 7, add it to S.
11. Select the node with the lowest similarity to Node 6, say Node 3, and the one with the highest similarity to Node 3, say Node 8.
12. Remove Node 6. Assuming Node 3 has the highest relevance between 3 and 8, add it to S.
13. Select the node with the lowest similarity to Node 3, say Node 8, and the one with the highest similarity to Node 8, say Node 7.
14. Remove Node 3. Assuming Node 8 has the highest relevance between 8 and 7, add it to S.
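A minimal NumPy sketch of the walk-through above; the function name and the data layout (a relevance vector and a symmetric similarity matrix) are assumptions, not from the deck:

```python
import numpy as np

def ngas(relevance, similarity, n_select):
    """NGAS sketch: follow a least-similar / most-similar pair of candidate
    features and keep the more relevant of the two at every step."""
    remaining = set(range(len(relevance)))
    current = int(np.argmax(relevance))      # start from the most relevant feature
    selected = [current]
    remaining.discard(current)
    while len(selected) < n_select and remaining:
        rem = sorted(remaining)
        # node least similar to the current feature...
        a = rem[int(np.argmin([similarity[current, j] for j in rem]))]
        others = [j for j in rem if j != a]
        if others:
            # ...and the node most similar to that one
            b = others[int(np.argmax([similarity[a, j] for j in others]))]
            candidates = (a, b)
        else:
            candidates = (a,)
        current = max(candidates, key=lambda j: relevance[j])  # keep the more relevant
        selected.append(current)
        remaining.discard(current)
    return selected
```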
Competing Algorithm for feature selection #2: NGAS-E (p = 50%)
1. The undirected graph is built and the set S of selected features is initialized.
2. Assuming Node 1 has the highest relevance, add it to S.
3. Select the $\lfloor 7 \times 50\% \rfloor$ nodes least similar to Node 1.
4. Remove Node 1 from the graph. Among the selected nodes, add the one with the highest relevance (say Node 5) to S.
5. Select the $\lfloor 6 \times 50\% \rfloor$ nodes least similar to Node 5.
6. Remove Node 5 from the graph. Among the selected nodes, add the one with the highest relevance (say Node 3) to S.
7. Select the $\lfloor 5 \times 50\% \rfloor$ nodes least similar to Node 3.
8. Remove Node 3 from the graph. Among the selected nodes, add the one with the highest relevance (say Node 4) to S.
9. Select the $\lfloor 4 \times 50\% \rfloor$ nodes least similar to Node 4.
10. Remove Node 4 from the graph. Among the selected nodes, add the one with the highest relevance (say Node 6) to S.
11. Select the $\lfloor 3 \times 50\% \rfloor$ nodes least similar to Node 6.
12. Remove Node 6 from the graph. Among the selected nodes, add the one with the highest relevance (say Node 2) to S.
13. Select the $\lfloor 2 \times 50\% \rfloor$ nodes least similar to Node 2.
14. Remove Node 2 from the graph and add Node 8 to S.
15. Remove Node 8 from the graph and add the last node, Node 7, to S.
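A minimal NumPy sketch of NGAS-E, under the same assumed data layout as the NGAS sketch; it reproduces the p = 50% walk-through above:

```python
import numpy as np

def ngas_e(relevance, similarity, n_select, p=0.5):
    """NGAS-E sketch: at each step consider the floor(|remaining| * p)
    features least similar to the last pick and keep the most relevant one."""
    remaining = set(range(len(relevance)))
    current = int(np.argmax(relevance))      # start from the most relevant feature
    selected = [current]
    remaining.discard(current)
    while len(selected) < n_select and remaining:
        rem = sorted(remaining, key=lambda j: similarity[current, j])
        k = max(1, int(len(rem) * p))        # floor, but keep at least one candidate
        current = max(rem[:k], key=lambda j: relevance[j])
        selected.append(current)
        remaining.discard(current)
    return selected
```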
Competing Algorithm for feature selection #3: HCAS
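The HCAS walk-through appears in the deck only as figures. A minimal sketch, assuming (from the "single" and "ward" variants in the result tables) that HCAS cuts a hierarchical clustering of the features, built from their pairwise similarity, into the desired number of clusters and keeps the most relevant feature of each cluster:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def hcas(relevance, similarity, n_select, method="single"):
    """HCAS sketch: hierarchical clustering on (1 - similarity) distances,
    one representative feature (the most relevant) per cluster."""
    distance = 1.0 - np.asarray(similarity, dtype=float)
    np.fill_diagonal(distance, 0.0)
    Z = linkage(squareform(distance, checks=False), method=method)  # "single" or "ward"
    labels = fcluster(Z, t=n_select, criterion="maxclust")
    relevance = np.asarray(relevance)
    return [int(members[np.argmax(relevance[members])])
            for members in (np.where(labels == c)[0] for c in np.unique(labels))]
```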
Outline
Machine Learning for Ranking
Proposed Feature Selection Algorithms (FSA) and Feature Selection Protocol
Application to Publicly Available Web Search Data
Application to Web Search Engine Data

Bing Data (http://research.microsoft.com/en-us/projects/mslr/)
            Train     Validation   Test
#queries    18,919    6,306        6,306
#urls       723,412   235,259      241,521
#features   136

Yahoo! Data (http://webscope.sandbox.yahoo.com)
            Train     Validation   Test
#queries    19,944    2,994        6,983
#urls       473,134   71,083       165,660
#features   519
Learning to Rank Algorithms: a timeline of major contributions
[Timeline figure, 2000-2010: RankSVM, Pranking, RankBoost, RankNet, IR-SVM, RankCosine, RankGP, RankRLS, GBRank, AdaRank, ListNet, McRank, QBRank, SVMmap, FRank, ListMLE, PermuRank, SoftRank, SSRankBoost, SortNet, MPBoost, BoltzRank, BayesRank, NDCG Boost, GBlend, RR, IntervalRank, CRR, LambdaRank, LambdaMART; each algorithm is classified as pointwise, pairwise, or listwise.]
LambdaMART Model for LtR
- Ensemble method: tree boosting (Multiple Additive Regression Trees)
- Lambda function: defines gradients even though the loss function is not differentiable
- Sorting characteristic
- Speed
Feature Relevance
The relevance of a document is measured with a categorical variable (0, 1, 2, 3, 4), so we need metrics that are good at measuring "dependence" between discrete/continuous feature variables and a categorical label variable.
In the following we use:
- Normalized Mutual Information (NMI)
- Spearman coefficient (S)
- Kendall's tau (K)
- Average Group Variance (AGV)
- One-Variable NDCG@10 (1VNDCG)
Feature Relevance via Normalized Mutual Information
Mutual Information (MI) measures how much, on average, the realization of a random variable X tells us about the realization of a random variable Y, i.e. how much the entropy of Y, H(Y), is reduced by knowing the realization of X:
$$MI(X, Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y)$$
The normalized version is
$$NMI(X, Y) = \frac{MI(X, Y)}{\sqrt{H(X)\, H(Y)}}$$
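A minimal sketch of this relevance measure, normalizing by $\sqrt{H(X) H(Y)}$ as reconstructed above; discretizing the feature into equal-frequency bins before computing MI is an assumption, not something the deck specifies:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(v):
    """Shannon entropy of a discrete sample (natural log)."""
    p = np.bincount(v) / len(v)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def nmi_feature_label(x, y, bins=32):
    """NMI between a (possibly continuous) feature x and the 0-4 labels y."""
    edges = np.unique(np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1]))
    xd = np.digitize(x, edges)               # equal-frequency discretization
    y = np.asarray(y, dtype=int)
    mi = mutual_info_score(xd, y)
    hx, hy = entropy(xd), entropy(y)
    return mi / np.sqrt(hx * hy) if hx > 0 and hy > 0 else 0.0
```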
Feature Relevance via Spearman's coefficient
Spearman's rank correlation coefficient is a non-parametric measure of statistical dependence between two random variables. It is given by
$$S = 1 - \frac{6 \sum_i d_i^2}{n \left( n^2 - 1 \right)}$$
where n is the sample size and $d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i)$.
Feature Relevance via Kendall's tau
Kendall's tau is a measure of association defined on two ranking lists of length n. It is defined as
$$\tau = \frac{n_c - n_d}{\sqrt{\left( \frac{n(n-1)}{2} - n_1 \right) \left( \frac{n(n-1)}{2} - n_2 \right)}}$$
where $n_c$ denotes the number of concordant pairs between the two lists, $n_d$ the number of discordant pairs, $n_1 = \sum_i t_i (t_i - 1)/2$, $n_2 = \sum_j u_j (u_j - 1)/2$, $t_i$ is the number of tied values in the i-th group of ties for the first list, and $u_j$ is the number of tied values in the j-th group of ties for the second list.
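Both coefficients are available in scipy (scipy's kendalltau computes the tie-corrected tau-b form above by default); a quick example on synthetic, purely illustrative data:

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

rng = np.random.default_rng(0)
feature = rng.normal(size=1000)                                    # synthetic feature
labels = np.clip(np.round(feature + rng.normal(size=1000)), 0, 4)  # noisy 0-4 labels

s, _ = spearmanr(feature, labels)    # Spearman's coefficient (S)
k, _ = kendalltau(feature, labels)   # Kendall's tau-b (K)
print(f"S = {s:.3f}, K = {k:.3f}")
```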
Feature Relevance via Average Group Variance
Average Group Variance measures the discrimination power of a feature. The intuitive justification is that a feature is useful if it is capable of discriminating a small portion of the ordered scale from the rest, and features with a small variance are those which satisfy this property.
$$AGV = 1 - \frac{\sum_{g=1}^{5} n_g \left( \bar{x}_g - \bar{x} \right)^2}{\sum_i \left( x_i - \bar{x} \right)^2}$$
where $n_g$ is the size of group g, $\bar{x}_g$ the sample mean of feature x in the g-th group, and $\bar{x}$ the whole sample mean.
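A minimal sketch of this measure; note that the formula above is itself a reconstruction of a garbled slide, so treat this as an assumed reading:

```python
import numpy as np

def agv(x, y):
    """AGV of feature x against the groups induced by the labels y (0..4)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    total = ((x - x.mean()) ** 2).sum()                 # whole-sample variation
    between = sum(x[y == g].size * (x[y == g].mean() - x.mean()) ** 2
                  for g in np.unique(y))                # between-group variation
    return 1.0 - between / total if total > 0 else 0.0
```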
Feature Relevance via single-feature LambdaMART scoring
For each feature i we run LambdaMART on that feature alone and compute $NDCG_{i,q}@10$ for each query q. The i-th feature relevance is measured by averaging $NDCG_{i,q}@10$ over the whole query set Q:
$$NDCG_i@10 = \frac{1}{|Q|} \sum_{q \in Q} NDCG_{i,q}@10$$

How to Measure Ranking Performance on query i
Precision at k:
$$P@k = \frac{\#\ \text{relevant documents in top } k \text{ results}}{k}$$
Average precision, summing over the D retrieved documents:
$$AP = \frac{1}{\#\ \text{relevant documents}} \sum_{k=1}^{D} P@k \cdot \mathbb{1}\{\text{document at rank } k \text{ is relevant}\}$$
Discounted Cumulative Gain:
$$DCG_i = \sum_{j=1}^{D} \frac{2^{rel_{i,j}} - 1}{\log_2\left( 1 + pos_j \right)}$$
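A minimal NumPy sketch of the three per-query measures, assuming inputs ordered best-first (binary relevance for P@k and AP, graded labels for DCG):

```python
import numpy as np

def precision_at_k(is_relevant, k):
    """Fraction of relevant documents among the top k of a ranked list."""
    return np.sum(is_relevant[:k]) / k

def average_precision(is_relevant):
    """Mean of P@k over the positions k that hold a relevant document."""
    is_relevant = np.asarray(is_relevant)
    hits = np.flatnonzero(is_relevant)
    if hits.size == 0:
        return 0.0
    return float(np.mean([precision_at_k(is_relevant, k + 1) for k in hits]))

def dcg(rels):
    """DCG with gain 2**rel - 1 and discount 1 / log2(1 + position)."""
    rels = np.asarray(rels, dtype=float)
    positions = np.arange(1, rels.size + 1)
    return float(np.sum((2 ** rels - 1) / np.log2(1 + positions)))
```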
How to Measure Ranking Performance: Normalized DCG

Actual ranking:
Document     Gain   Cumulative Gain   Discounted Cumulative Gain
Document 1   31     31                31 x 1 = 31
Document 2   3      34                31 + 3 x 0.63 = 32.9
Document 3   7      41                32.9 + 7 x 0.5 = 36.4
Document 4   31     72                36.4 + 31 x 0.4 = 48.8

Normalization: divide DCG by the ideal DCG.

Ideal ranking:
Document     Gain   Cumulative Gain   Discounted Cumulative Gain
Document 1   31     31                31 x 1 = 31
Document 4   31     62                31 + 31 x 0.63 = 50.53
Document 3   7      69                50.53 + 7 x 0.5 = 54.03
Document 2   3      72                54.03 + 3 x 0.4 = 55.23

Relevance Rating   Gain
Perfect            2^5 - 1 = 31
Excellent          2^4 - 1 = 15
Good               2^3 - 1 = 7
Fair               2^2 - 1 = 3
Bad                2^1 - 1 = 1
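As a check, a compact NDCG implementation reproduces the worked example above, except that the slide rounds the position discounts to 1, 0.63, 0.5, 0.4 while the code uses exact 1/log2(1 + pos) values:

```python
import numpy as np

def ndcg(rels, k=None):
    """DCG of the given ranking divided by the DCG of its ideal reordering."""
    rels = np.asarray(rels, dtype=float)[:k]
    positions = np.arange(1, rels.size + 1)
    discounts = 1 / np.log2(1 + positions)
    dcg = np.sum((2 ** rels - 1) * discounts)
    idcg = np.sum((2 ** np.sort(rels)[::-1] - 1) * discounts)
    return float(dcg / idcg) if idcg > 0 else 0.0

# Labels Perfect, Fair, Good, Perfect -> gains 31, 3, 7, 31 as in the table.
print(ndcg([5, 2, 3, 5]))   # ~ 0.899 with exact discounts
```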
Choosing the Relevance Measure
FSA performance is measured using the average NDCG@10 obtained from LambdaMART on the test set.

NDCG@10 on Yahoo Test Set
          Feature Subset Dimension
          5%       10%      20%      30%      40%      50%      75%      100%
NMI       0.73398  0.75952  0.76241  0.7678   0.76912  0.769    0.77015  0.76935
AGV       0.7524   0.7548   0.76168  0.76493  0.76498  0.76717  0.76971  0.76935
S         0.74963  0.75396  0.76099  0.76398  0.7649   0.76753  0.77002  0.76935
K         0.75225  0.75291  0.76145  0.76304  0.7648   0.76673  0.76972  0.76935
1VNDCG    0.75246  0.75768  0.76452  0.76672  0.76823  0.77008  0.77027  0.76935

NDCG@10 on Bing Test Set
          Feature Subset Dimension
          5%       10%      20%      30%      40%      50%      75%      100%
NMI       0.38927  0.3978   0.41347  0.41539  0.44785  0.44966  0.45083  0.46336
AGV       0.32682  0.33168  0.36043  0.36976  0.37383  0.37612  0.43444  0.46336
S         0.32969  0.3346   0.3428   0.34592  0.36711  0.42475  0.42809  0.46336
K         0.32917  0.3346   0.34356  0.42124  0.42071  0.4245   0.42706  0.46336
1VNDCG    0.41633  0.42571  0.42413  0.42601  0.42757  0.43795  0.46222  0.46336
Feature Similarity
We used Spearman's rank coefficient for measuring feature similarity.
Spearman's rank coefficient is faster to compute than NMI, Kendall's tau, and 1VNDCG.
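A minimal sketch of the similarity step with scipy, assuming an n x F feature matrix with F > 2; taking absolute values, i.e. treating strongly anti-correlated features as redundant too, is an assumption not stated in the deck:

```python
import numpy as np
from scipy.stats import spearmanr

def feature_similarity(X):
    """Pairwise |Spearman| similarity between the columns of an n x F matrix."""
    rho, _ = spearmanr(X)      # F x F matrix of rank correlations (F > 2)
    return np.abs(rho)
```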
The FSA benchmark: Greedy Algorithm for feature Selection (GAS)
1. Build a complete undirected graph $G_0$ in which (a) each node represents the i-th feature and carries weight $w_i$, and (b) each edge has weight $e_{i,j}$.
2. Let $S_0 = \emptyset$ be the set of selected features at step 0.
3. For i = 1, ..., n:
   a) Select the node with the largest weight from $G_{i-1}$; suppose it is the k-th node.
   b) Punish all the nodes connected with the k-th node: $w_j \leftarrow w_j - 2 c\, e_{k,j}$, $j \neq k$.
   c) Add the k-th node to $S_{i-1}$.
   d) Remove the k-th node from $G_{i-1}$.
4. Return $S_n$.
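A direct transcription of the pseudo-code above (function and argument names are illustrative):

```python
import numpy as np

def gas(relevance, similarity, n_select, c=0.01):
    """GAS benchmark: greedily pick the heaviest node, then punish its
    neighbours' weights in proportion to the shared edge weight."""
    w = np.asarray(relevance, dtype=float).copy()   # node weights w_i
    e = np.asarray(similarity, dtype=float)         # edge weights e_ij
    remaining = set(range(w.size))
    selected = []
    for _ in range(min(n_select, w.size)):
        k = max(remaining, key=lambda j: w[j])      # 3a: largest weight
        remaining.discard(k)
        for j in remaining:
            w[j] -= 2 * c * e[k, j]                 # 3b: punish neighbours
        selected.append(k)                          # 3c, 3d
    return selected
```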
Significance Test using Randomization Test
(▲ / ▼ mark results that the randomization test flags as significantly better / worse.)

NDCG@10 on Yahoo Test Set
                  Feature Subset Dimension
                  5%       10%     20%     30%     40%     50%     75%     100%
NGAS              0.7430▼  0.7601  0.7672  0.7717  0.7724  0.7759  0.7766  0.7753
NGAS-E, p = 0.8   0.7655   0.7666  0.7723  0.7742  0.7751  0.7759  0.776   0.7753
HCAS, "single"    0.7350▼  0.7635  0.7666  0.7738  0.7742  0.7754  0.7756  0.7753
HCAS, "ward"      0.7570▼  0.7626  0.7704  0.7743  0.7755  0.7763  0.7757  0.7753
GAS, c = 0.01     0.7628   0.7649  0.7671  0.773   0.7737  0.7737  0.7758  0.7753

NDCG@10 on Bing Test Set
                  Feature Subset Dimension
                  5%       10%      20%      30%      40%     50%     75%     100%
NGAS              0.4011▼  0.4459   0.471    0.4739▼  0.4813  0.4837  0.4831  0.4863
NGAS-E, p = 0.05  0.4376▲  0.4528   0.4577▼  0.4825   0.4834  0.4845  0.4867  0.4863
HCAS, "single"    0.4423▲  0.4643▲  0.4870▲  0.4854   0.4848  0.4847  0.4853  0.4863
HCAS, "ward"      0.4289   0.4434▼  0.4820   0.4879   0.4853  0.4837  0.4870  0.4863
GAS, c = 0.01     0.4294   0.4515   0.4758   0.4848   0.4863  0.4860  0.4868  0.4863
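The deck does not show the test itself; a minimal sketch of a paired two-sided randomization test over per-query NDCG@10 values, a standard choice in IR evaluation:

```python
import numpy as np

def randomization_test(ndcg_a, ndcg_b, n_perm=10_000, seed=0):
    """p-value for the difference in mean per-query NDCG between two systems
    evaluated on the same queries."""
    a, b = np.asarray(ndcg_a, float), np.asarray(ndcg_b, float)
    rng = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())
    hits = 0
    for _ in range(n_perm):
        swap = rng.random(a.size) < 0.5      # swap each query's pair with prob. 1/2
        diff = np.where(swap, b, a).mean() - np.where(swap, a, b).mean()
        hits += abs(diff) >= observed
    return hits / n_perm                     # two-sided p-value
```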
Conclusions
We designed 3 FSAs and applied them to the Web search page ranking problem.
NGAS-E and HCAS achieve performance equal to or greater than the benchmark model.
HCAS and NGAS are very
The proposed FSAs can be implemented independently of the L2R model.
The proposed FSAs can be applied to other ML contexts, to sorting problems, and to model ensembling.