the hedge algorithm and applications
slide 2: will it rain tomorrow?

I say it will rain. Nick says it will not rain. CNN says it will rain. FOX says it will not rain.
TOMORROW: sunny all day… what about Saturday?
slide 3: weighted majority algorithm

Give the "weather-men" weights, initially uniform or according to some prior belief.
Run the forecast for several days. Every day:
- make our prediction by weighted-majority vote
- get the real outcome
- update the weights: "penalize" wrong predictors, "reward" good predictors
slide 4: more problems

trading stocks, IR metasearch, disease classification (where we need to "train" doctors)
common underlying idea: a generalization of weighted-majority.
why not exactly WM:
- losses are real numbers instead of discrete 0-1
- our loss may be a weighted sum of the expert losses
slide 5: this talk

Hedge algorithm for online allocation [Schapire, Freund '96]
Applications:
- on classification: AdaBoost
- on IR: a metasearch algorithm
slide 6: "how to use expert advice?"

N experts (or strategies); maintain a set of weights over the experts.
Loop for T episodes t = 1, 2, …, T:
- allocate resources (beliefs) based on the weights
- receive losses
- reweight the experts
slide 7: online allocation - the Hedge algorithm

parameter: \beta \in [0, 1); initial weights over the N strategies (systems):
w^1 \in [0,1]^N  with  \sum_{i=1}^N w_i^1 = 1

at each episode t:
- allocate according to  p_i^t = w_i^t / \sum_{i=1}^N w_i^t
- receive the loss vector  l^t \in [0,1]^N ; Hedge suffers  L_t = p^t \cdot l^t
- update the weights:  w_i^{t+1} = w_i^t \cdot \beta^{l_i^t}

cumulative loss:  L_{Hedge} = \sum_{t=1}^T p^t \cdot l^t
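The loop on this slide is only a few lines of code. Below is a minimal sketch (the helper name `hedge` is my own) that follows the slide's notation:

```python
import math

def hedge(losses, beta=0.5):
    """Run Hedge on a T x N list of per-expert loss vectors in [0, 1].

    Follows the slide: p_i^t = w_i^t / sum_j w_j^t and
    w_i^{t+1} = w_i^t * beta ** l_i^t.
    Returns the cumulative Hedge loss  L_Hedge = sum_t p^t . l^t.
    """
    n = len(losses[0])
    w = [1.0 / n] * n                       # uniform initial weights
    total = 0.0
    for l in losses:                        # one episode per loss vector l^t
        s = sum(w)
        p = [wi / s for wi in w]            # allocation for this episode
        total += sum(pi * li for pi, li in zip(p, l))
        w = [wi * beta ** li for wi, li in zip(w, l)]
    return total
```

For example, with one perfect expert among N = 3, the cumulative loss stays below the bound proved later in the talk, (L_best ln(1/β) + ln N)/(1-β) = ln 3 / (1-β).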
slide 8: why hedge

goal: Hedge loss close to the loss of the best expert (a bound).
proof idea:
- relate the episodic Hedge loss to the sum of weights:
  \sum_{i=1}^N w_i^{t+1} \le \left( \sum_{i=1}^N w_i^t \right) \left( 1 - (1-\beta) L_{Hedge}^t \right)
- relate the cumulative Hedge loss to the sum of final weights
- relate the sum of final weights to the loss of the best expert:
  L_{Hedge}(\beta) \le \frac{L_{best} \ln(1/\beta) + \ln N}{1-\beta}
slide 9: episodic hedge loss

\sum_{i=1}^N w_i^{t+1} = \sum_{i=1}^N w_i^t \, \beta^{l_i^t}
  \le \sum_{i=1}^N w_i^t \left( 1 - (1-\beta)\, l_i^t \right)
  = \left( \sum_i w_i^t \right) \sum_i p_i^t \left( 1 - (1-\beta)\, l_i^t \right)   [using  p_i^t = w_i^t / \sum_i w_i^t]
  = \left( \sum_i w_i^t \right) \left( 1 - (1-\beta) \sum_i p_i^t l_i^t \right)
  = \left( \sum_i w_i^t \right) \left( 1 - (1-\beta)\, p^t \cdot l^t \right)

with  L_{Hedge}^t = p^t \cdot l^t  this is exactly
\sum_{i=1}^N w_i^{t+1} \le \left( \sum_{i=1}^N w_i^t \right) \left( 1 - (1-\beta) L_{Hedge}^t \right).
the inequality  \beta^l \le 1 - (1-\beta) l  for  l \in [0,1]  holds because  f(x) = \beta^x  is convex and agrees with the line  f(x) = 1 - (1-\beta)x  at  x = 0  and  x = 1.
slide 10: cumulative hedge loss

applying the episodic bound
\sum_{i=1}^N w_i^{t+1} \le \left( \sum_{i=1}^N w_i^t \right) \left( 1 - (1-\beta)\, p^t \cdot l^t \right)
for every t (and  \sum_i w_i^1 = 1):
\sum_{i=1}^N w_i^{T+1} \le \prod_{t=1}^T \left( 1 - (1-\beta)\, p^t \cdot l^t \right)
  \le \prod_{t=1}^T \exp\left( -(1-\beta)\, p^t \cdot l^t \right)   [using  1 - x \le \exp(-x)]
  = \exp\left( -(1-\beta)\, L_{Hedge} \right),   where  L_{Hedge} = \sum_{t=1}^T p^t \cdot l^t.
slide 11: [almost] as good as the best expert

for any single expert:
\sum_{i=1}^N w_i^{T+1} \ge w_{any}^{T+1} = w_{any}^1 \, \beta^{L_{any}}
combining with the previous bound and taking logarithms:
L_{Hedge}(\beta) \le \frac{-\ln w_{any}^1 + L_{any} \ln(1/\beta)}{1-\beta}
with uniform initial weights,  w_{any}^1 = 1/N,  so
L_{Hedge}(\beta) \le \frac{L_{any} \ln(1/\beta) + \ln N}{1-\beta}
slide 12: optimality

L_{Hedge}(\beta) \le \frac{L_{best} \ln(1/\beta) + \ln N}{1-\beta}
this is a bound of the general form  L_B \le c\, L_{best} + a \ln N,  with
c = \frac{\ln(1/\beta)}{1-\beta}  and  a = \frac{1}{1-\beta}.
slide 13: how to choose \beta

L_{Hedge}(\beta) \le \frac{L_{best} \ln(1/\beta) + \ln N}{1-\beta}
\beta = confidence parameter; think of it as a trade-off, and try to make the bound tight.
limiting case: with a perfect expert present, \beta = 0 zeroes out every expert that errs, a halving scheme that finds the perfect expert like a binary search.
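To see the trade-off numerically, one can tabulate the bound for a few values of \beta (a sketch; `hedge_bound` is my own name):

```python
import math

def hedge_bound(l_best, n, beta):
    """The slide's bound: (L_best * ln(1/beta) + ln N) / (1 - beta)."""
    return (l_best * math.log(1.0 / beta) + math.log(n)) / (1.0 - beta)

# With a near-perfect best expert the bound favors small beta;
# with a mediocre best expert it favors beta close to 1.
for beta in (0.1, 0.5, 0.9):
    print(beta, hedge_bound(0, 10, beta), hedge_bound(50, 10, beta))
```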
slide 14: that wasn't so bad

Hedge algorithm for online allocation
Applications:
- on classification: AdaBoost
- on IR: a metasearch algorithm
slide 15: boosting

disease classification: get past data; "train" a disease-predictor "doctor" (a "hypothesis", or "weak learner").
how good is it (on the training data)? where is it wrong?
train a new predictor to correct the mistakes of the first one.
slide 16: hedge application: AdaBoost [Schapire, Freund '96]

HEDGE: given experts; incoming losses; reweight the experts.
BOOSTING: given datapoints, which play the role of the experts (weighted by how the weak learners perform on them); incoming weak learners; compute the error (loss) and the "belief" (\beta); reweight the datapoints.

ADABOOST
weak hypotheses  h_t : X \to \{-1, +1\};  initial distribution  D_1(i) = 1/m
error:  \epsilon_t = \Pr_{D_t}[h_t(x_i) \ne y_i]
belief:  \beta_t = \frac{\epsilon_t}{1-\epsilon_t} < 1
update:  D_{t+1}(i) = \frac{D_t(i)}{Z_t} \cdot \begin{cases} 1/\beta_t & \text{if } y_i \ne h_t(x_i) \\ 1 & \text{if } y_i = h_t(x_i) \end{cases}
final hypothesis:  H_{final}(x) = \mathrm{sgn}\left( \sum_t \ln(1/\beta_t)\, h_t(x) \right)
slide 17: AdaBoost - example

start with a uniform distribution on the data; weak learners = halfplanes.
slide 22: AdaBoost - analysis

generalization error vs. training error:
\mathrm{training\ error}(H_{final}) \le \prod_t \sqrt{4\, \epsilon_t (1-\epsilon_t)}
the per-round factor is  f(x) = \sqrt{4x(1-x)},  x \in [0,1],  which is below 1 whenever  \epsilon_t \ne 1/2.
slide 23: 2003 Gödel Prize - Yoav Freund and Robert Schapire

"The prize was awarded to Yoav Freund and Robert Schapire for their paper 'A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,' Journal of Computer and System Sciences 55 (1997), pp. 119-139. This paper introduced AdaBoost, an adaptive algorithm to improve the accuracy of hypotheses in machine learning. The algorithm demonstrated novel possibilities in analysing data and is a permanent contribution to science even beyond computer science. Because of a combination of features, including its elegance, the simplicity of its implementation, its wide applicability, and its striking success in reducing errors in benchmark applications even while its theoretical assumptions are not known to hold, the algorithm set off an explosion of research in the fields of statistics, artificial intelligence, experimental machine learning, and data mining. The algorithm is now widely used in practice. The paper highlights the fact that theoretical computer science continues to be a fount of powerful and entirely novel ideas with significant and direct impact even in areas, such as data analysis, that have been studied extensively by other communities."
slide 24: what we did last year [Aslam, Pavlu, Savell]

Hedge algorithm for online allocation
Applications:
- on classification: AdaBoost
- on IR: a metasearch algorithm
slide 26: search and metasearch

Search engines: provide a ranked list of documents; may provide relevance scores.
Metasearch engines: query multiple search engines; may or may not combine the results.
slide 30: metasearch algorithms

Heuristics and hacks: interleave, average rank, sum scores, etc.
Principled models: Bayesian inference, election theory, etc.; on-line combination of expert advice.
slide 32: hedge application: metasearch

HEDGE: given experts; incoming losses; reweight the experts.
METASEARCH: given search engines; incoming documents; compute the losses; reweight the search engines.
slide 36: ranks importance

map ranks to values; labels:  RELEVANT = -1,  NONRELEVANT = +1
value(r) = \frac{1}{r} + \frac{1}{r+1} + \dots + \frac{1}{Z} \approx \ln(Z/r)
LOSS(d, s) = label(d) \cdot value(rank_{d,s}) \approx label(d) \cdot \ln(Z/r)
total loss vs. total precision vs. average precision
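A quick numeric check of the harmonic-tail value on this slide and its log approximation (Z = 1000 is an assumed list depth, a typical TREC run length):

```python
import math

def value(r, z=1000):
    """value(r) = 1/r + 1/(r+1) + ... + 1/Z, as defined on the slide."""
    return sum(1.0 / k for k in range(r, z + 1))

# The tail is close to ln(Z/r), so top ranks are worth much more.
for r in (1, 10, 100):
    print(r, round(value(r), 3), round(math.log(1000 / r), 3))
```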
3737
average_value at episode taverage_value at episode t“trust” in systems change with t“trust” in systems change with t
feedbackfeedbacklooploop)(
)(_
,1
1sd
N
s
ts
t
rankvaluew
dvalueaverage
⋅=
=
∑=
−
),(1 sdLOSSts
ts
ttww β⋅=+
)()(),( ,sdtt trankvaluedlabelsdLOSS ⋅=
)(_ dvalueaverage t
get feedback get feedback
metasearchmetasearch list : order docs bylist : order docs by
modify weightsmodify weightscompute lossescompute losses
)( tdlabel
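A toy sketch of this feedback loop (the names `hedge_metasearch` and `labels` are my own; labels follow the slide's convention, -1 for relevant and +1 for non-relevant, and a doc missing from a list is treated as ranked at depth Z, value 0):

```python
import math

def hedge_metasearch(ranked_lists, labels, beta=0.9, z=1000):
    """Toy version of the slide's feedback loop.

    ranked_lists: one ranked list of doc ids per system.
    labels: doc -> -1 (relevant) or +1 (non-relevant), used as feedback.
    Each episode: present the unjudged doc with the highest
    average_value, receive its label, and reweight the systems.
    Returns (presentation order, final system weights).
    """
    n = len(ranked_lists)
    w = [1.0 / n] * n
    rank = [{d: i + 1 for i, d in enumerate(lst)} for lst in ranked_lists]

    def value(r):
        return math.log(z / r)              # value(rank) ~ ln(Z/r)

    docs = sorted({d for lst in ranked_lists for d in lst})
    order = []
    while len(order) < len(docs):
        remaining = [d for d in docs if d not in order]
        # metasearch: next doc = highest weighted average value
        d = max(remaining,
                key=lambda d: sum(w[s] * value(rank[s].get(d, z))
                                  for s in range(n)))
        order.append(d)
        # feedback: LOSS(d, s) = label(d) * value(rank_{d,s}); reweight
        for s in range(n):
            w[s] *= beta ** (labels[d] * value(rank[s].get(d, z)))
    return order, w
```

A system that puts the relevant docs on top accumulates negative loss and therefore gains weight (trust), exactly the adaptive behavior the slide describes.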
slide 38: experiments

TREC 3, 5, 6, 7, 8; 41-129 systems; 50 queries per TREC.
Metasearch combines all systems; use the TREC judgments as user feedback.
slide 39: metasearch - no feedback (yet)

[results plot; baselines: MNZ = CombMNZ (Fox, Shaw, Lee et al.), COND = Condorcet (Aslam, Montague)]
slide 48: "a unified model for metasearch, pooling and system evaluation"
slide 49: conclusion

a theoretical explanation of "being adaptive": simple, elegant, intuitive; usually performs much better than the bound.
slide 51: AdaBoost - technical

weak hypotheses  h_t : X \to \{-1, +1\};  start with the uniform distribution  D_1(i) = 1/m.
at every round t = 1 to T, given D_t:
- find a weak hypothesis  h_t  with error  \epsilon_t = \Pr_{D_t}[h_t(x_i) \ne y_i]
- compute the "belief" in h_t:  \alpha_t = \frac{1}{2} \ln \frac{1-\epsilon_t}{\epsilon_t} > 0
- update the distribution:
  D_{t+1}(i) = \frac{D_t(i)}{Z_t} \cdot \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \ne h_t(x_i) \end{cases}
final hypothesis:  H_{final}(x) = \mathrm{sgn}\left( \sum_t \alpha_t\, h_t(x) \right)
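As a sketch of the rounds above, here is the \alpha-form of AdaBoost with 1-D threshold stumps as the weak learners (the one-dimensional analogue of the example's halfplanes; `adaboost` is my own name):

```python
import math

def adaboost(xs, ys, rounds=10):
    """AdaBoost as on the slide, with 1-D threshold stumps as weak
    learners. xs: floats; ys: labels in {-1, +1}."""
    m = len(xs)
    d = [1.0 / m] * m                            # D_1(i) = 1/m
    hyps = []                                    # (alpha_t, threshold, sign)
    for _ in range(rounds):
        # exhaustively pick the stump h_t with smallest weighted error eps_t
        best = None
        for thr in sorted(set(xs)):
            for sgn in (1, -1):
                h = [sgn if x > thr else -sgn for x in xs]
                eps = sum(di for di, hi, yi in zip(d, h, ys) if hi != yi)
                if best is None or eps < best[0]:
                    best = (eps, thr, sgn, h)
        eps, thr, sgn, h = best
        if eps == 0.0:                           # perfect weak learner
            hyps.append((1.0, thr, sgn))
            break
        if eps >= 0.5:                           # no edge left, stop
            break
        alpha = 0.5 * math.log((1 - eps) / eps)  # alpha_t > 0
        hyps.append((alpha, thr, sgn))
        # reweight: e^{+alpha} if wrong, e^{-alpha} if correct, normalize by Z_t
        d = [di * math.exp(alpha if hi != yi else -alpha)
             for di, hi, yi in zip(d, h, ys)]
        z = sum(d)
        d = [di / z for di in d]

    def predict(x):                              # H_final = sgn(sum alpha_t h_t)
        s = sum(a * (sg if x > t else -sg) for a, t, sg in hyps)
        return 1 if s >= 0 else -1
    return predict
```

On a small + + - - + + pattern that no single stump can fit, three rounds already drive the training error to zero, consistent with the \prod_t \sqrt{4\epsilon_t(1-\epsilon_t)} bound shrinking per round.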
slide 54: hedge application: metasearch

problem: metasearch - search engines, document judgments, a metasearch list.
[hedge] solution: the search engines are the "experts"; the judgments on documents are the "losses".
slide 56: generalization error - based on margins

generalization error [Schapire, Freund, Bartlett, Lee 1998]
training error [Schapire, Freund 1996]:
\mathrm{training\ error}(H_{final}) \le \prod_t \sqrt{4\, \epsilon_t (1-\epsilon_t)}
slide 58: loss function

total precision of s = average of the precision values at all ranks: TP(s).
FACT (as on the slide):  \sum_{\text{all docs}} LOSS(d, s) = Z \cdot (C - 2 \cdot TP(s))
- average the precision at ALL ranks
- normalize so an ideal system gets TP = 1 (the math is simpler)
map ranks to values:
value(r) = \frac{1}{r} + \frac{1}{r+1} + \dots + \frac{1}{Z} \approx \ln(Z/r)
LOSS(d, s) = label(d) \cdot value(rank_{d,s}) \approx label(d) \cdot \ln(Z/r)
slide 59: how does it work?

average_value at episode t; the "trust" in the systems changes with t:
average\_value_t(d) = \sum_{s=1}^N w_s^t \cdot value(rank_{d,s})
metasearch: include the already-judged docs
pooling:  d^t = \arg\max_d average\_value_t(d)
system evaluation
get feedback; compute the losses  LOSS(d^t, s) = label(d^t) \cdot value(rank_{d^t, s})
modify the weights:  w_s^{t+1} = w_s^t \cdot \beta^{LOSS(d^t, s)}
slide 61: experiments

TREC 3, 5, 6, 7, 8; 41-129 systems; 50 queries per TREC; metasearch uses all systems.
use the TREC judgments as user feedback; system evaluation: incomplete judgments.
slide 68: conclusion

a powerful machine learning approach: Hedge = the AdaBoost core.
works [usually] better than anything else we've seen.
true, it uses feedback; but without feedback there are provable limitations.
slide 69: why hedge [Schapire, Freund '96]

\sum_{i=1}^N w_i^{T+1} \le \dots \le \prod_{t=1}^T \left( 1 - (1-\beta)\, p^t \cdot l^t \right) \le \exp\left( -(1-\beta) \sum_t p^t \cdot l^t \right)
loss at episode t:  p^t \cdot l^t;  cumulative loss:  L_{HEDGE} = \sum_t p^t \cdot l^t
L_{HEDGE}(\beta) \le \frac{L_{SYSTEM} \ln(1/\beta) + \ln N}{1-\beta}
slide 70: AdaBoost - distribution update

h_t : X \to \{-1, +1\},  \epsilon_t = \Pr_{D_t}[h_t(x_i) \ne y_i],  \alpha_t = \frac{1}{2} \ln \frac{1-\epsilon_t}{\epsilon_t} > 0
"neutralize" the last weak hypothesis:
D_{t+1}(i) = \frac{D_t(i)}{Z_t} \cdot e^{\alpha_t}   if  y_i \ne h_t(x_i)
D_{t+1}(i) = \frac{D_t(i)}{Z_t} \cdot e^{-\alpha_t}   if  y_i = h_t(x_i)
slide 72: problem setup

on the same query: a set of underlying systems, and user feedback.
goal: find relevant documents; produce online metasearch lists; perform online system evaluation.
we are looking for an adaptive approach.
slide 75: loss function

total precision of s = average of the precision values at all ranks: TP(s).
FACT (as on the slide):  \sum_{\text{all docs}} LOSS(d, s) = Z \cdot (C - 2 \cdot TP(s))
- average the precision at ALL ranks
- normalize so an ideal system gets TP = 1 (the math is simpler)
slide 76: pooling - how to

naturally "want" the top ranks.
if NON RELEVANT, then an NR sits in the top ranks of the system lists.
if RELEVANT, bingo.
slide 77: system evaluation - how to

assume all docs not judged (so many?) to be NON RELEVANT.
compute the average precision for every system.
one (or a few) very good systems: use a small \beta.
slide 78: metasearch - how to

compute a "pooling value" for each doc.
instead of "select the top doc" for pooling, do "select the top 1000 docs" for metasearch.
almost 1000: docs already pooled are automatically at the top of the metasearch list.
slide 79: experiments

TREC; ~100 systems; 50 queries per competition.
use the TREC qrels as user feedback (incomplete feedback).
for comparison with depth-pooling we use the average number of pools (over queries).
slide 86: conclusion

a powerful machine learning approach: Hedge = the AdaBoost core.
works [usually] better than anything else we've seen.
true, it uses feedback; but without feedback there are provable limitations.
it is still missing a rigorous analysis; we are not far from one, but it needs a model assumption.
slide 94: metasearch - how to

before the next episode: compute a "pooling value" for each doc.
instead of "select the top doc" for pooling, do "select the top 1000 docs" for metasearch.
almost 1000: docs already pooled are automatically at the top of the metasearch list.
slide 95: "total" precision

average the precision at ALL ranks; normalize so an ideal system gets TP = 1; the math is simpler.