Post on 24-Feb-2016
Online Learning + Game Theory
Avrim Blum, Carnegie Mellon University (your guide)
Indo-US Lectures Week in Machine Learning, Game Theory, and Optimization
Theme: Simple but powerful algorithms for online decision-making when you don’t have a lot of knowledge.
Nice connections between these algorithms and game-theoretic notions of optimality and equilibrium.
Start with something called the problem of “combining expert advice”.
Using “expert” advice
• We solicit n “experts” for their advice. (Will the market go up or down?)
• We then want to use their advice somehow to make our prediction. E.g., say we want to predict the stock market.
Basic question: Is there a strategy that allows us to do nearly as well as the best of these in hindsight?
[“expert” = someone with an opinion. Not necessarily someone who knows anything.]
Simpler question
• We have n “experts”.
• One of these is perfect (never makes a mistake). We just don’t know which one.
• Can we find a strategy that makes no more than lg(n) mistakes?
Answer: sure. Just take majority vote over all experts that have been correct so far. Each of our mistakes cuts the # of available experts by a factor of 2.
Note: this means it is OK for n to be very large.
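As a concrete illustration, the halving strategy above might be sketched as follows (a minimal sketch; the `advice`/`truth` inputs and all names are hypothetical, not from the lecture):

```python
def halving(advice, truth):
    """Majority vote over experts that have made no mistakes so far.

    advice[t][i] = expert i's prediction at time t; truth[t] = the outcome.
    """
    n = len(advice[0])
    alive = set(range(n))                 # experts still consistent
    mistakes = 0
    for preds, outcome in zip(advice, truth):
        votes = [preds[i] for i in alive]
        guess = max(set(votes), key=votes.count)   # majority (ties broken arbitrarily)
        if guess != outcome:
            mistakes += 1
        # cross off every expert that was wrong this round
        alive = {i for i in alive if preds[i] == outcome}
    return mistakes
```

With one perfect expert among n, each mistake at least halves the surviving set, so the total is at most lg(n).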
What if no expert is perfect?
One idea: just run the above protocol until all experts are crossed off, then repeat. Makes at most log(n) mistakes per mistake of the best expert (plus an initial log(n)).
Seems wasteful. Constantly forgetting what we’ve “learned”. Can we do better?
What if no expert is perfect?
Intuition: Making a mistake doesn’t completely disqualify an expert. So, instead of crossing off, just lower its weight.
Weighted Majority Alg:
– Start with all experts having weight 1.
– Predict based on weighted majority vote.
– Penalize mistakes by cutting weight in half.
Example: Weights: 1 1 1 1. Predictions: U U U D. We predict: U. Truth: D. New weights: ½ ½ ½ 1.
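The example above can be reproduced with a minimal sketch of the algorithm (illustrative names; predictions are 'U'/'D' strings as in the slide):

```python
def weighted_majority(advice, truth):
    """Deterministic Weighted Majority with the halving penalty."""
    n = len(advice[0])
    w = [1.0] * n                        # start with all weights 1
    mistakes = 0
    for preds, outcome in zip(advice, truth):
        up = sum(wi for wi, p in zip(w, preds) if p == 'U')
        down = sum(wi for wi, p in zip(w, preds) if p == 'D')
        guess = 'U' if up >= down else 'D'   # weighted majority vote
        if guess != outcome:
            mistakes += 1
        # cut the weight of every wrong expert in half
        w = [wi * 0.5 if p != outcome else wi for wi, p in zip(w, preds)]
    return mistakes, w
```

Running one round with predictions U U U D and truth D gives the slide’s result: we predict U (a mistake), and the weights become ½ ½ ½ 1.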
Analysis: do nearly as well as best expert in hindsight
• M = # mistakes we’ve made so far.
• m = # mistakes the best expert has made so far.
• W = total weight (starts at n).
• After each of our mistakes, W drops by at least 25%. So, after M mistakes, W is at most n(3/4)^M.
• Weight of the best expert is (1/2)^m. So, (1/2)^m ≤ n(3/4)^M, which solves to M ≤ 2.4(m + lg n).
So, if m is small, then M is pretty small too: within a constant ratio (plus a log term).
Randomized Weighted Majority
2.4(m + lg n) is not so good if the best expert makes a mistake 20% of the time. Can we do better? Yes.
• Instead of taking a majority vote, use the weights as probabilities. (E.g., if 70% of the weight is on up and 30% on down, then pick up:down 70:30.) Idea: smooth out the worst case.
• Also, generalize the ½ penalty to 1−ε.
Result: M ≤ (1+ε/2)m + (1/ε)ln(n), where M = expected # mistakes. Unlike most worst-case bounds, the numbers here are pretty good.
Analysis
• Say at time t we have fraction F_t of the weight on experts that made a mistake.
• So, we have probability F_t of making a mistake, and we remove an εF_t fraction of the total weight:
– W_final = n(1−εF_1)(1−εF_2)…
– ln(W_final) = ln(n) + Σ_t ln(1−εF_t) ≤ ln(n) − ε Σ_t F_t (using ln(1−x) < −x)
  = ln(n) − εM. (Σ_t F_t = E[# mistakes] = M)
• If the best expert makes m mistakes, then ln(W_final) > ln((1−ε)^m).
• Now solve: ln(n) − εM > m ln(1−ε), giving M < (−ln(1−ε)/ε)·m + (1/ε)ln(n) ≈ (1+ε/2)m + (1/ε)ln(n).
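A minimal sketch of one randomized round, assuming the (1−ε) penalty described above (function and variable names are illustrative):

```python
import random

def rwm_round(w, preds, outcome, eps=0.5):
    """One round of Randomized Weighted Majority."""
    total = sum(w)
    r = random.random() * total          # sample an expert with prob ∝ weight
    guess = preds[-1]
    for wi, p in zip(w, preds):
        r -= wi
        if r <= 0:
            guess = p
            break
    # multiply the weight of every wrong expert by (1 - eps)
    w = [wi * (1 - eps) if p != outcome else wi for wi, p in zip(w, preds)]
    return guess, w
```

With eps = 0.5 this recovers the halving penalty; smaller eps trades a smaller multiplicative loss against a larger (1/ε)ln(n) additive term.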
Additive regret
• So, we have M ≤ OPT + εOPT + (1/ε)log(n), where OPT = m is the best expert’s total and T = # time steps.
• If we set ε = (log(n)/OPT)^{1/2} to balance the two terms out (or use guess-and-double), we get a bound of M ≤ OPT + 2(OPT·log n)^{1/2} ≤ OPT + 2(T·log n)^{1/2}.
• These are called “additive regret” bounds: M/T ≤ OPT/T + 2(log(n)/T)^{1/2} → “no regret” as T grows.
Extensions
• What if experts are actions? (Rows in a matrix game, ways to drive to work, …)
• At each time t, each action has a loss (cost) in {0,1}.
• Can still run the algorithm:
– Rather than viewing it as “pick a prediction with prob proportional to its weight”,
– View it as “pick an expert with probability proportional to its weight”.
– The alg pays the expected cost Σ_i p_i c_i.
• The same analysis applies. Do nearly as well as the best action in hindsight!
Extensions
• What if losses (costs) are in [0,1]?
• Just modify the alg update rule to w_i ← w_i(1−εc_i).
• The fraction of weight removed from the system is then ε Σ_i p_i c_i, i.e., ε times our expected cost.
• The analysis is very similar to the {0,1} case.
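A sketch of this [0,1]-cost version, under the assumption that the update rule is w_i ← w_i(1−εc_i) as above (all names illustrative):

```python
def mw_step(w, costs, eps):
    """One multiplicative-weights step with costs in [0,1]."""
    total = sum(w)
    p = [wi / total for wi in w]                       # play i with prob w_i/W
    expected = sum(pi * ci for pi, ci in zip(p, costs))  # our expected cost
    w = [wi * (1 - eps * ci) for wi, ci in zip(w, costs)]  # update rule
    return p, expected, w

def run_mw(cost_rounds, n, eps=0.1):
    """Run the algorithm over a sequence of cost vectors."""
    w = [1.0] * n
    total_cost = 0.0
    for costs in cost_rounds:
        _, c, w = mw_step(w, costs, eps)
        total_cost += c
    return total_cost, w
```

For instance, if one action always costs 0 and another always costs 1, the total expected cost stays close to the (1/ε)ln(n) additive term, as the analysis predicts.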
RWM (multiplicative weights alg) vs. the world (life, the opponent):
[Figure: a matrix with one row per action; all rows start at weight 1. After the world plays cost column c^1, row i’s weight becomes (1−εc_i^1); after c^2 it becomes (1−εc_i^1)(1−εc_i^2); and so on. Costs are scaled to lie in [0,1].]
Guarantee: do nearly as well as the best fixed row in hindsight. This implies doing nearly as well as (or better than) minimax optimal.
Connections to minimax optimality
[Same figure: RWM row weights (1−εc_i^1)(1−εc_i^2)… against the world’s cost columns, costs scaled to [0,1].]
If we play RWM against a best-response oracle, we will approach minimax optimality (most rounds’ plays will be close). (If we didn’t, we wouldn’t be getting the promised guarantee.)
Connections to minimax optimality
[Same figure as before.]
If we play two RWM algorithms against each other, then the empirical distributions of play must be near-minimax-optimal. (Else, one or the other could, and would, take advantage.)
A natural generalization
A natural generalization of our regret goal (thinking of driving) is: what if we also want that on rainy days, we do nearly as well as the best route for rainy days? And on Mondays, do nearly as well as the best route for Mondays?
More generally, we have N “rules” (e.g., “on Monday, use path P”). Goal: simultaneously, for each rule i, guarantee to do nearly as well as it on the time steps in which it fires.
For all i, want E[cost_i(alg)] ≤ (1+ε)cost_i(i) + O(ε⁻¹ log N).
(cost_i(X) = cost of X on the time steps where rule i fires.)
Can we get this?
A natural generalization
This generalization is especially natural in machine learning for combining multiple if-then rules. E.g., document classification. Rule: “if <word-X> appears then predict <Y>”. E.g., if the document contains “football” then classify it as sports.
So, if 90% of documents with “football” are about sports, we should have error ≤ 11% on them.
This is the “specialists” or “sleeping experts” problem. Assume we have N rules. For all i, want E[cost_i(alg)] ≤ (1+ε)cost_i(i) + O(ε⁻¹ log N).
(cost_i(X) = cost of X on the time steps where rule i fires.)
A simple algorithm and analysis (all on one slide)
Start with all rules at weight 1. At each time step, of the rules i that fire, select one with probability p_i ∝ w_i. Update weights:
– If a rule didn’t fire, leave its weight alone.
– If it did fire, raise or lower its weight depending on performance compared to the weighted average:
  r_i = [Σ_j p_j cost(j)]/(1+ε) − cost(i);  w_i ← w_i(1+ε)^{r_i}
So, if rule i does exactly as well as the weighted average, its weight drops a little. Its weight increases if it does better than the weighted average by more than a (1+ε) factor. This ensures the sum of weights doesn’t increase.
Final w_i = (1+ε)^{E[cost_i(alg)]/(1+ε) − cost_i(i)}. Since the total weight stays at most N, the exponent is ≤ ε⁻¹ log N.
So, E[cost_i(alg)] ≤ (1+ε)cost_i(i) + O(ε⁻¹ log N).
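The one-slide update might look like this in code (a sketch; `fired` and `cost` are hypothetical inputs describing which rules are awake this step and their costs in [0,1]):

```python
def sleeping_experts_step(w, fired, cost, eps=0.1):
    """One step of the sleeping-experts update from the slide."""
    total = sum(w[i] for i in fired)
    p = {i: w[i] / total for i in fired}            # p_i proportional to w_i
    avg = sum(p[j] * cost[j] for j in fired)        # weighted average cost
    for i in fired:
        r = avg / (1 + eps) - cost[i]               # r_i from the slide
        w[i] *= (1 + eps) ** r                      # sleeping rules untouched
    return w, avg
```

A rule that beats the weighted average gains weight, one that matches it loses a little, and the total weight over the fired rules never increases.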
Application: adapting to change
What if we want to adapt to change, i.e., do nearly as well as the best recent expert? For each expert, instantiate a copy that wakes up on day t, for each 0 ≤ t ≤ T−1. Then our cost over the previous t days is at most (1+ε)(cost of the best expert in the last t days) + O(ε⁻¹ log(NT)).
(Not the best possible bound, since there is an extra log(T) factor, but not bad.)
[ACFS02]: applying RWM to the bandit setting
What if you only get your own cost/benefit as feedback?
Use RWM as a subroutine to get an algorithm with cumulative regret O((TN log N)^{1/2}) [average regret O(((N log N)/T)^{1/2})].
We will do a somewhat weaker version of their analysis (same algorithm, but not as tight a bound).
For fun, we’ll talk about it in the context of online pricing…
Online pricing
• Say you are selling lemonade (or a cool new software tool, or bottles of water at the world cup).
• For t = 1, 2, …, T:
– Seller sets price p_t.
– Buyer arrives with valuation v_t.
– If v_t ≥ p_t, buyer purchases and pays p_t; else doesn’t.
• Assume all valuations ≤ h.
• Goal: do nearly as well as the best fixed price in hindsight.
View each possible price as a different row/expert.
• If v_t is revealed after each step, we can run RWM: E[gain] ≥ OPT(1−ε) − O(ε⁻¹ h log n).
Multi-armed bandit problem
Exponential Weights for Exploration and Exploitation (Exp3) [Auer, Cesa-Bianchi, Freund, Schapire]
Exp3 runs RWM (n = # experts) as a subroutine:
– RWM outputs a distribution p^t; Exp3 plays expert i sampled from q^t = (1−γ)p^t + γ·uniform.
– It observes only the chosen expert’s gain g_i^t (e.g., $1.25) and feeds RWM the estimated gain vector ĝ^t = (0, …, 0, g_i^t/q_i^t, 0, …, 0), each entry of which is ≤ nh/γ.
Analysis:
1. RWM believes the gain is: p^t · ĝ^t = p_i^t(g_i^t/q_i^t) ≡ g_RWM^t.
2. So Σ_t g_RWM^t ≥ ÔPT(1−ε) − O(ε⁻¹(nh/γ)log n), where ÔPT = max_j Σ_t ĝ_j^t.
3. The actual gain is: g_i^t = g_RWM^t(q_i^t/p_i^t) ≥ g_RWM^t(1−γ).
4. E[ÔPT] ≥ OPT, because E[ĝ_j^t] = (1−q_j^t)·0 + q_j^t(g_j^t/q_j^t) = g_j^t, so E[max_j Σ_t ĝ_j^t] ≥ max_j [E[Σ_t ĝ_j^t]] = OPT.
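A sketch of Exp3 along these lines, assuming gains in [0,1] and illustrative parameter defaults (not the exact tuning from [ACFS02]):

```python
import random

def exp3(gain_fn, n, T, gamma=0.1, eps=0.1):
    """Exp3 sketch: RWM run on importance-weighted gain estimates."""
    w = [1.0] * n
    total_gain = 0.0
    for t in range(T):
        W = sum(w)
        p = [wi / W for wi in w]                        # RWM's distribution
        q = [(1 - gamma) * pi + gamma / n for pi in p]  # mix in exploration
        i = random.choices(range(n), weights=q)[0]      # play arm i ~ q
        g = gain_fn(t, i)                               # observe only arm i's gain
        total_gain += g
        ghat = g / q[i]                                 # unbiased gain estimate
        w[i] *= (1 + eps) ** ghat                       # feed RWM the estimate
    return total_gain
```

The γ-uniform mixing keeps every q_i ≥ γ/n, so the estimate g_i/q_i stays bounded, at the price of the (1−γ) factor in the gain.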
Putting these together (and setting γ = ε): E[Exp3] ≥ OPT(1−ε)² − O(ε⁻² nh log(n)). [Auer, Cesa-Bianchi, Freund, Schapire]
Balancing the terms would give O((OPT·nh·log n)^{2/3}) in the bound, because of the ε⁻² dependence. But with more care in the analysis, the ε⁻² can be reduced to ε⁻¹, giving O((OPT·nh·log n)^{1/2}).
Summary (of results so far)
Algorithms for online decision-making with strong guarantees on performance compared to the best fixed choice.
• Application: play a repeated game against an adversary. Perform nearly as well as the best fixed strategy in hindsight.
These can be applied even with very limited feedback.
• Application: which way to drive to work, with feedback only about your own paths; online pricing, even if you only have buy/no-buy feedback.
Internal/Swap Regret and
Correlated Equilibria
What if all players minimize regret?
In zero-sum games, empirical frequencies quickly approach minimax optimality.
In general-sum games, does behavior quickly (or at all) approach a Nash equilibrium? After all, a Nash equilibrium is exactly a set of distributions that are no-regret with respect to each other. So if the distributions stabilize, they must converge to a Nash equilibrium.
Well, unfortunately, they might not stabilize.
An interesting bad example
• [Balcan-Constantin-Mehta12]:
– Failure to converge even in rank-1 games (games where R+C has rank 1).
– Interesting because one can find equilibria efficiently in such games.
What can we say?
If algorithms minimize “internal” or “swap” regret, then the empirical distribution of play approaches correlated equilibrium. [Foster & Vohra, Hart & Mas-Colell, …] Though this doesn’t imply play is stabilizing.
What are internal/swap regret and correlated equilibria?
More general forms of regret
1. “best expert” or “external” regret:
– Given n strategies. Compete with the best of them in hindsight.
2. “sleeping expert” or “regret with time-intervals”:
– Given n strategies and k properties. Let S_i be the set of days satisfying property i (the sets might overlap). Want to simultaneously achieve low regret over each S_i.
3. “internal” or “swap” regret: like (2), except that S_i = the set of days on which we chose strategy i.
Internal/swap regret
• E.g., each day we pick one stock to buy shares in.
– We don’t want to have regret of the form “every time I bought IBM, I should have bought Microsoft instead”.
• Formally, swap regret is with respect to the optimal function f: {1,…,n} → {1,…,n} such that every time you played action j, it plays f(j).
Weird… why care? “Correlated equilibrium”
• A distribution over entries in the matrix, such that if a trusted party chooses one at random and tells you your part, you have no incentive to deviate.
• E.g., the Shapley game (like rock-paper-scissors, but a tie counts as a loss for both players):

        R      P      S
  R   -1,-1  -1,1   1,-1
  P    1,-1  -1,-1  -1,1
  S   -1,1   1,-1   -1,-1

In general-sum games, if all players have low swap-regret, then the empirical distribution of play is an apx correlated equilibrium.
Connection
• If all parties run a low-swap-regret algorithm, then the empirical distribution of play is an apx correlated equilibrium.
– Correlator chooses a random time t ∈ {1, 2, …, T}. It tells each player to play the action j they played at time t (but does not reveal the value of t).
– Expected incentive to deviate: Σ_j Pr(j)·(Regret | j) = the swap-regret of the algorithm.
– So, this suggests correlated equilibria may be natural things to see in multi-agent systems where individuals are optimizing for themselves.
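As a concrete check of this connection, here is a small sketch that measures a joint empirical distribution’s total incentive to deviate (its swap gain) for the row player; the payoff matrix and distribution passed in are up to the caller, and all names are illustrative:

```python
def max_swap_gain(payoff, D, n):
    """Largest total gain from any swap function f applied to row plays.

    payoff[a][b] = row player's payoff; D[(a, b)] = empirical frequency
    of the joint play (a, b). Returns 0 for an exact correlated eq.
    """
    gain = 0.0
    for j in range(n):                    # for each recommended action j
        base = sum(D.get((j, b), 0.0) * payoff[j][b] for b in range(n))
        best = max(sum(D.get((j, b), 0.0) * payoff[i][b] for b in range(n))
                   for i in range(n))     # best deviation when told "play j"
        gain += best - base
    return gain
```

For the row payoffs of the Shapley-style matrix above, the uniform distribution over all nine cells gives a swap gain of 0 (every row averages the same), while a point mass on a single cell can be improved by deviating.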
Correlated vs coarse-correlated equilibria
“Correlated equilibrium”
• You have no incentive to deviate, even after seeing what the advice is.
“Coarse-correlated equilibrium”
• If your only choice is to see and follow the advice, or not to see it at all, you would prefer the former.
In both cases: a distribution over entries in the matrix. Think of a third party choosing from this distribution and telling you your part as “advice”.
Low external regret ⇒ apx coarse-correlated equilibrium.
Internal/swap regret, contd.
Algorithms for achieving low regret of this form:
– Foster & Vohra, Hart & Mas-Colell, Fudenberg & Levine.
– We will present the method of [BM05], showing how to convert any “best expert” algorithm into one achieving low swap regret.
– Unfortunately, the # of steps to achieve low swap regret is O(n log n) rather than O(log n).
Can convert any “best expert” algorithm A into one achieving low swap regret. Idea:
– Instantiate one copy A_j responsible for expected regret over the times we play j.
[Figure: the master Alg contains copies A_1, A_2, …, A_n. Copy A_j outputs a distribution q_j; these rows form the matrix Q, and we play p = pQ. When the cost vector c arrives, copy A_j is given feedback p_j c.]
– This allows us to view p_j as the prob we play action j, or as the prob we play alg A_j.
– Give A_j the feedback p_j c.
– A_j guarantees: Σ_t (p_j^t c^t) · q_j^t ≤ min_i Σ_t p_j^t c_i^t + [regret term]
– Write this as: Σ_t p_j^t (q_j^t · c^t) ≤ min_i Σ_t p_j^t c_i^t + [regret term]
– Sum over j to get: Σ_t p^t Q^t c^t ≤ Σ_j min_i Σ_t p_j^t c_i^t + n·[regret term]
– The left-hand side is our total cost, since p^t = p^t Q^t. On the right-hand side, for each j we can move our probability to its own best response i = f(j).
– So we get swap regret at most n times the original external regret.
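The reduction might be sketched as follows, under simplifying assumptions: each copy A_j is a plain multiplicative-weights learner, and the fixed point p = pQ is found by power iteration (none of these implementation choices come from [BM05] itself):

```python
def stationary(Q, iters=200):
    """Power iteration for p = pQ (Q is row-stochastic, so sum(p) is preserved)."""
    n = len(Q)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(p[j] * Q[j][i] for j in range(n)) for i in range(n)]
    return p

def swap_regret_step(weights, cost, eps=0.1):
    """One round: weights[j] = weight vector of copy A_j; cost = today's costs."""
    n = len(weights)
    # each copy A_j proposes a distribution q_j; stack them into matrix Q
    Q = [[wji / sum(wj) for wji in wj] for wj in weights]
    p = stationary(Q)                    # play p, chosen so that p = pQ
    # charge copy A_j the scaled cost vector p_j * cost
    for j in range(n):
        for i in range(n):
            weights[j][i] *= (1 - eps * p[j] * cost[i])
    return p, weights
```

The key design choice is playing the stationary distribution of Q, which is what lets p_j be read either as “the prob we play action j” or as “the prob we follow copy A_j”.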
Summary (game theory connection)
Zero-sum game: two players minimizing external regret ⇒ minimax optimality.
General-sum game: players minimizing external regret ⇒ coarse-correlated equilibrium.
General-sum game: players minimizing internal regret ⇒ correlated equilibrium.
These algorithms can also be useful for fast approximately optimal solutions to optimization problems.