
Page 1: Online Learning + Game Theory

Online Learning + Game Theory

Your guide: Avrim Blum, Carnegie Mellon University

Indo-US Lectures Week in Machine Learning, Game Theory, and Optimization

Page 2: Online Learning + Game Theory

Theme: Simple but powerful algorithms for online decision-making when you don't have a lot of knowledge.

Nice connections between these algorithms and game-theoretic notions of optimality and equilibrium.

Start with something called the problem of “combining expert advice”.

Page 3: Online Learning + Game Theory

Using "expert" advice

Say we want to predict the stock market.
• We solicit n "experts" for their advice. (Will the market go up or down?)
• We then want to use their advice somehow to make our prediction.

Basic question: Is there a strategy that allows us to do nearly as well as the best of these in hindsight?

["expert" = someone with an opinion. Not necessarily someone who knows anything.]

Page 4: Online Learning + Game Theory

Simpler question
• We have n "experts".
• One of these is perfect (never makes a mistake). We just don't know which one.
• Can we find a strategy that makes no more than lg(n) mistakes?

Answer: sure. Just take majority vote over all experts that have been correct so far. Each of our mistakes cuts the # of available experts by a factor of 2.
Note: this means it is ok for n to be very large.
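A minimal Python sketch of this "halving" strategy (the data layout and variable names are my own, not from the slides): keep the set of experts that have never been wrong, and predict with the majority vote of that set.

```python
def run_halving(expert_predictions, truths):
    """Halving algorithm: predict with the majority vote of the experts that have
    never been wrong so far. If some expert is perfect, the number of mistakes is
    at most log2(n), since each of our mistakes halves the consistent set.

    expert_predictions[t][i] = expert i's prediction at time t
    truths[t]                = the true outcome at time t
    """
    n = len(expert_predictions[0])
    alive = set(range(n))          # experts with no mistakes so far
    mistakes = 0
    for preds, truth in zip(expert_predictions, truths):
        votes = [preds[i] for i in alive]
        guess = max(set(votes), key=votes.count)     # majority vote over alive experts
        if guess != truth:
            mistakes += 1
        alive = {i for i in alive if preds[i] == truth}   # cross off wrong experts
    return mistakes
```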

Page 5: Online Learning + Game Theory

What if no expert is perfect?

One idea: just run the above protocol until all experts are crossed off, then repeat.
Makes at most log(n) mistakes per mistake of the best expert (plus an initial log(n)).

Seems wasteful. Constantly forgetting what we've “learned”. Can we do better?

Page 6: Online Learning + Game Theory

What if no expert is perfect?

Intuition: Making a mistake doesn't completely disqualify an expert. So, instead of crossing off, just lower its weight.

Weighted Majority Alg:
– Start with all experts having weight 1.
– Predict based on weighted majority vote.
– Penalize mistakes by cutting weight in half.

Example: Weights: 1 1 1 1. Predictions: U U U D. We predict: U. Truth: D. New weights: ½ ½ ½ 1.
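A short Python sketch of the deterministic Weighted Majority algorithm as just described (binary-style predictions; the interface and names are mine).

```python
def weighted_majority(expert_predictions, truths):
    """Weighted Majority: predict with the weighted majority vote, then halve the
    weight of every expert that was wrong this round."""
    n = len(expert_predictions[0])
    w = [1.0] * n
    mistakes = 0
    for preds, truth in zip(expert_predictions, truths):
        # weighted vote over the possible predictions (e.g. 'U' / 'D')
        totals = {}
        for wi, p in zip(w, preds):
            totals[p] = totals.get(p, 0.0) + wi
        guess = max(totals, key=totals.get)
        if guess != truth:
            mistakes += 1
        # penalize mistaken experts by cutting their weight in half
        w = [wi / 2 if p != truth else wi for wi, p in zip(w, preds)]
    return mistakes
```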

Page 7: Online Learning + Game Theory

Analysis: do nearly as well as best expert in hindsight

• M = # mistakes we've made so far.
• m = # mistakes the best expert has made so far.
• W = total weight (starts at n).
• After each of our mistakes, W drops by at least 25%. So, after M mistakes, W is at most n(3/4)^M.
• Weight of the best expert is (1/2)^m. So, (1/2)^m ≤ W ≤ n(3/4)^M, which gives M ≤ 2.4(m + lg n).

So, if m is small, then M is pretty small too: M is within a constant ratio of m, plus a log n term.

Page 8: Online Learning + Game Theory

Randomized Weighted Majority

2.4(m + lg n) is not so good if the best expert makes a mistake 20% of the time. Can we do better? Yes.

• Instead of taking a majority vote, use the weights as probabilities. (E.g., if 70% of the weight is on up and 30% on down, then predict up with probability 0.7 and down with probability 0.3.) Idea: smooth out the worst case.
• Also, generalize the ½ to (1-ε).

Result (M = expected # mistakes): M ≤ (1+ε)m + ε^{-1} log(n), and unlike most worst-case bounds, the numbers are pretty good.
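A sketch of Randomized Weighted Majority as described above: the weights are used as probabilities and the halving factor is generalized to (1-ε). The function signature, data layout, and the decision to return both expected and realized mistake counts are my own choices.

```python
import random

def randomized_weighted_majority(expert_predictions, truths, eps=0.1, rng=random):
    """Randomized Weighted Majority: pick an expert with probability proportional
    to its weight and predict what it predicts; multiply the weight of every wrong
    expert by (1 - eps). Returns (expected # mistakes, # mistakes actually made)."""
    n = len(expert_predictions[0])
    w = [1.0] * n
    expected_mistakes, mistakes = 0.0, 0
    for preds, truth in zip(expert_predictions, truths):
        W = sum(w)
        # F_t = fraction of weight on wrong experts = our probability of erring
        expected_mistakes += sum(wi for wi, p in zip(w, preds) if p != truth) / W
        i = rng.choices(range(n), weights=w)[0]     # sample an expert by weight
        if preds[i] != truth:
            mistakes += 1
        # multiplicative update on the experts that were wrong
        w = [wi * (1 - eps) if p != truth else wi for wi, p in zip(w, preds)]
    return expected_mistakes, mistakes
```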

Page 9: Online Learning + Game Theory

Analysis
• Say at time t we have a fraction F_t of the weight on experts that made a mistake.
• So, we have probability F_t of making a mistake, and we remove an εF_t fraction of the total weight.
  – W_final = n(1-εF_1)(1-εF_2)...
  – ln(W_final) = ln(n) + Σ_t ln(1-εF_t) ≤ ln(n) - εΣ_t F_t (using ln(1-x) < -x) = ln(n) - εM.   (Σ_t F_t = E[# mistakes] = M)
• If the best expert makes m mistakes, then ln(W_final) > ln((1-ε)^m).
• Now solve: ln(n) - εM > m ln(1-ε).
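One standard way to finish the solve above (a sketch, not from the slides), using the elementary bound -ln(1-ε) ≤ ε + ε² for ε ≤ 1/2; it recovers the bound quoted on the next slide.

```latex
% From ln(n) - eps*M >= m*ln(1-eps), divide by eps:
\ln(n) - \varepsilon M \;\ge\; m\ln(1-\varepsilon)
\quad\Longrightarrow\quad
M \;\le\; \frac{\ln n}{\varepsilon} \;+\; m\cdot\frac{-\ln(1-\varepsilon)}{\varepsilon}
\;\le\; \frac{\ln n}{\varepsilon} \;+\; (1+\varepsilon)\,m
\qquad\text{(using } -\ln(1-\varepsilon)\le\varepsilon+\varepsilon^{2}\text{ for }\varepsilon\le\tfrac12\text{)}.
```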

Page 10: Online Learning + Game Theory

Additive regret
• So, we have M ≤ OPT + εOPT + ε^{-1} log(n).   (OPT = m = mistakes of the best expert; T = # time steps)
• If we set ε = (log(n)/OPT)^{1/2} to balance the two terms (or use guess-and-double), we get a bound of M ≤ OPT + 2(OPT·log n)^{1/2} ≤ OPT + 2(T·log n)^{1/2}.
• These are called "additive regret" bounds: M/T ≤ OPT/T + 2(log(n)/T)^{1/2}, so the per-step regret goes to 0 as T grows ("no regret").

Page 11: Online Learning + Game Theory

Extensions
• What if the experts are actions? (rows in a matrix game, ways to drive to work, …)
• At each time t, each has a loss (cost) in {0,1}.
• Can still run the algorithm:
  – Rather than viewing it as "pick a prediction with probability proportional to its weight",
  – view it as "pick an expert with probability proportional to its weight".
  – The algorithm pays expected cost Σ_i p_i c_i^t (the probability-weighted average of the experts' costs).
• The same analysis applies.

Do nearly as well as the best action in hindsight!

Page 12: Online Learning + Game Theory

Extensions
• What if losses (costs) are in [0,1]?
• Just modify the update rule: w_i ← w_i(1 - εc_i^t).
• The fraction of weight removed from the system each step is ε·Σ_i p_i c_i^t, i.e. ε times our expected cost.
• The analysis is very similar to the {0,1} case.
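A sketch of this cost-based multiplicative-weights variant (the interface and names are mine): pick an action with probability proportional to its weight, then multiply each weight by (1 - ε·cost).

```python
import random

def multiplicative_weights(cost_rounds, n, eps=0.1, rng=random):
    """RWM over n actions with per-round costs in [0,1].

    cost_rounds: iterable of length-n cost vectors c^t, with c^t[i] in [0,1].
    Each round: play action i with probability w_i / sum(w), pay expected cost
    sum_i p_i c^t[i], then update w_i <- w_i * (1 - eps * c^t[i])."""
    w = [1.0] * n
    total_expected_cost = 0.0
    for c in cost_rounds:
        W = sum(w)
        p = [wi / W for wi in w]
        total_expected_cost += sum(pi * ci for pi, ci in zip(p, c))
        action = rng.choices(range(n), weights=p)[0]   # the action we actually play
        w = [wi * (1 - eps * ci) for wi, ci in zip(w, c)]
    return total_expected_cost
```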

Page 13: Online Learning + Game Theory

World – life – opponent

[Figure: RWM (the multiplicative weights algorithm) plays rows of a matrix game against the world/life/opponent, which chooses columns. All row weights start at 1; after the opponent plays cost column c^1, each row i's weight is multiplied by (1-εc_i^1); after c^2, by (1-εc_i^2); and so on. Costs are scaled so they lie in [0,1].]

Guarantee: do nearly as well as the best fixed row in hindsight. This implies doing nearly as well as (or better than) minimax optimal.

Page 14: Online Learning + Game Theory

Connections to minimax optimality

[Same weight-update figure as Page 13.]

If we play RWM against a best-response oracle, it will approach minimax optimality (most of its distributions will be close). (If it didn't, it wouldn't be getting the promised guarantee.)

Page 15: Online Learning + Game Theory

World – life - opponent

111111

(1-ec11)

(1-ec21)

(1-ec31)

.

.(1-ecn

1)

scaling so costs in [0,1]

c2

(1-ec12)

(1-ec22)

(1-ec32)

.

.(1-ecn

2)

If play two RWM against each other, then empirical distributions must be near-minimax-

optimal.

(Else, one or the other could & would take advantage)

Connections to minimax optimality
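A small self-play sketch illustrating this claim (the payoff scaling, parameters, and helper names are my choices, not from the slides): two copies of the cost-based RWM update play a zero-sum game in [0,1] costs; their time-averaged mixed strategies should approach minimax play.

```python
def rwm_self_play(A, T=5000, eps=0.05):
    """A[i][j] in [0,1] = cost to the row player (so gain to the column player)
    when row plays i and column plays j. Returns the empirical (time-averaged)
    distributions of both players, which should be near-minimax-optimal."""
    n, m = len(A), len(A[0])
    wr, wc = [1.0] * n, [1.0] * m
    avg_r, avg_c = [0.0] * n, [0.0] * m
    for _ in range(T):
        pr = [w / sum(wr) for w in wr]
        pc = [w / sum(wc) for w in wc]
        avg_r = [a + p / T for a, p in zip(avg_r, pr)]
        avg_c = [a + p / T for a, p in zip(avg_c, pc)]
        # expected cost of each row against the column player's current mix,
        # and (since the game is zero-sum) cost 1 - expected row cost for each column
        row_cost = [sum(pc[j] * A[i][j] for j in range(m)) for i in range(n)]
        col_cost = [1 - sum(pr[i] * A[i][j] for i in range(n)) for j in range(m)]
        wr = [w * (1 - eps * c) for w, c in zip(wr, row_cost)]
        wc = [w * (1 - eps * c) for w, c in zip(wc, col_cost)]
    return avg_r, avg_c

# e.g., on a rock-paper-scissors-style matrix rescaled into [0,1], both returned
# empirical distributions should be close to (1/3, 1/3, 1/3).
```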

Page 16: Online Learning + Game Theory

A natural generalization

A natural generalization of our regret goal (thinking of driving) is: what if we also want that on rainy days, we do nearly as well as the best route for rainy days. And on Mondays, do nearly as well as the best route for Mondays.

More generally, we have N "rules" ("on Monday, use path P"). Goal: simultaneously, for each rule i, guarantee to do nearly as well as it on the time steps in which it fires.

For all i, want E[cost_i(alg)] ≤ (1+ε)cost_i(i) + O(ε^{-1} log N).
(cost_i(X) = cost of X on the time steps where rule i fires.)

Can we get this?

Page 17: Online Learning + Game Theory

A natural generalization

This generalization is especially natural in machine learning, for combining multiple if-then rules. E.g., document classification. Rule: "if <word-X> appears then predict <Y>". E.g., if the document has "football" then classify it as sports.

So, if 90% of documents with "football" are about sports, we should have error ≤ 11% on them.

This is the "specialists" or "sleeping experts" problem. Assume we have N rules. For all i, want E[cost_i(alg)] ≤ (1+ε)cost_i(i) + O(ε^{-1} log N).
(cost_i(X) = cost of X on the time steps where rule i fires.)

Page 18: Online Learning + Game Theory

A simple algorithm and analysis (all on one slide)

Start with all rules at weight 1. At each time step, among the rules i that fire, select one with probability p_i ∝ w_i. Update weights:
– If a rule didn't fire, leave its weight alone.
– If it did fire, raise or lower its weight depending on its performance compared to the weighted average:
    r_i = [Σ_j p_j cost(j)]/(1+ε) – cost(i)
    w_i ← w_i (1+ε)^{r_i}

So, if rule i does exactly as well as the weighted average, its weight drops a little. Its weight increases if it does better than the weighted average by more than a (1+ε) factor. This ensures the sum of weights doesn't increase.

Final w_i = (1+ε)^{E[cost_i(alg)]/(1+ε) - cost_i(i)}. Since the total weight stays at most N, the exponent is ≤ ε^{-1} log N.

So, E[cost_i(alg)] ≤ (1+ε)cost_i(i) + O(ε^{-1} log N).
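A sketch of one step of the specialists / sleeping-experts update above (variable names are mine; costs assumed in [0,1]): among the rules that fire, select one with probability proportional to weight; non-firing rules keep their weights; firing rules are re-weighted by (1+ε)^{r_i}.

```python
import random

def sleeping_experts_step(w, awake, costs, eps, rng=random):
    """One step of the specialists / sleeping-experts update.

    w:     list of weights, one per rule
    awake: list of indices of rules that fire this step
    costs: costs[i] in [0,1] for each awake rule i
    Returns the rule chosen (sampled with p_i proportional to w_i over awake rules)."""
    total = sum(w[i] for i in awake)
    p = {i: w[i] / total for i in awake}
    chosen = rng.choices(awake, weights=[p[i] for i in awake])[0]
    # weighted-average cost of the awake rules
    avg = sum(p[i] * costs[i] for i in awake)
    # update only awake rules: r_i = avg/(1+eps) - cost_i, then w_i <- w_i*(1+eps)^{r_i}
    for i in awake:
        r_i = avg / (1 + eps) - costs[i]
        w[i] *= (1 + eps) ** r_i
    return chosen
```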

Page 19: Online Learning + Game Theory

Application: adapting to change

What if we want to adapt to change, doing nearly as well as the best recent expert?
For each expert, instantiate a copy that wakes up on day t, for each 0 ≤ t ≤ T-1.
Then our cost over the previous t days is at most (1+ε)·(cost of the best expert over the last t days) + O(ε^{-1} log(NT)).
(Not the best possible bound, since there is an extra log(T), but not bad.)

Page 20: Online Learning + Game Theory

[ACFS02]: applying RWM to bandit setting

What if only get your own cost/benefit as feedback?

Use RWM as a subroutine to get an algorithm with cumulative regret O((TN log N)^{1/2})
[average regret O(((N log N)/T)^{1/2})].

Will do a somewhat weaker version of their analysis (same algorithm but not as tight a bound).

For fun, talk about it in the context of online pricing…

Page 21: Online Learning + Game Theory

Online pricing
• Say you are selling lemonade (or a cool new software tool, or bottles of water at the World Cup).
• For t = 1, 2, …, T:
  – Seller sets price p_t.
  – Buyer arrives with valuation v_t.
  – If v_t ≥ p_t, the buyer purchases and pays p_t; else doesn't.
  – Repeat.
• Assume all valuations ≤ h.
• Goal: do nearly as well as the best fixed price in hindsight.
• View each possible price as a different row/expert.
• If v_t is revealed, run RWM: E[gain] ≥ OPT(1-ε) - O(ε^{-1} h log n).
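A sketch of this full-information pricing case (the price discretization, grid size, and names are my choices): treat each candidate price as an expert; since v_t is revealed, the gain every price would have earned is known, so run the multiplicative-weights update on the corresponding costs.

```python
import random

def rwm_pricing(valuations, h, num_prices=20, eps=0.1, rng=random):
    """Full-information online pricing: each candidate price is an expert.
    Since v_t is revealed, the gain of every price p is known each round:
    gain(p) = p if v_t >= p else 0. Run multiplicative weights on the
    scaled costs (h - gain)/h, which lie in [0,1]."""
    prices = [h * (k + 1) / num_prices for k in range(num_prices)]
    w = [1.0] * num_prices
    revenue = 0.0
    for v in valuations:
        W = sum(w)
        p = [wi / W for wi in w]
        k = rng.choices(range(num_prices), weights=p)[0]   # price we post today
        if v >= prices[k]:
            revenue += prices[k]
        gains = [price if v >= price else 0.0 for price in prices]
        w = [wi * (1 - eps * (h - g) / h) for wi, g in zip(w, gains)]
    return revenue
```

Note that discretizing the price grid is an extra approximation on top of the slide's guarantee; a finer grid trades a larger n for a smaller discretization loss.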

Page 22: Online Learning + Game Theory

Multi-armed bandit problem: Exponential Weights for Exploration and Exploitation (Exp3) [Auer, Cesa-Bianchi, Freund, Schapire]

[Figure: Exp3 runs RWM (n = # experts) as a subroutine. Each round, RWM outputs a distribution p^t; Exp3 plays q^t = (1-γ)p^t + γ·(uniform), draws expert i ~ q^t, receives gain g_i^t (e.g., $1.25), and feeds RWM the estimated gain vector ĝ^t = (0, …, 0, g_i^t/q_i^t, 0, …, 0). Every entry of ĝ^t is at most nh/γ.]

1. RWM believes its gain is: p^t · ĝ^t = p_i^t (g_i^t/q_i^t) ≡ g^t_RWM.
2. So Σ_t g^t_RWM ≥ ÔPT(1-ε) - O(ε^{-1} (nh/γ) log n), where ÔPT = max_j Σ_t ĝ_j^t.
3. Our actual gain is: g_i^t = g^t_RWM (q_i^t/p_i^t) ≥ g^t_RWM (1-γ).
4. E[ÔPT] ≥ OPT, because E[ĝ_j^t] = (1-q_j^t)·0 + q_j^t(g_j^t/q_j^t) = g_j^t, so E[max_j Σ_t ĝ_j^t] ≥ max_j E[Σ_t ĝ_j^t] = OPT.
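A sketch of Exp3 as diagrammed above (the parameter names, the user-supplied `pull` function, and the particular gain-scaled RWM update are my choices): RWM runs on importance-weighted gain estimates, and Exp3 mixes RWM's distribution with the uniform distribution before playing.

```python
import random

def exp3(n, pull, T, gamma=0.1, eps=0.1, h=1.0, rng=random):
    """Exp3 sketch: wrap a gains-version of RWM.

    pull(i) -> observed gain g_i^t in [0, h] for the arm we actually play.
    Each round: q = (1-gamma)*p + gamma*uniform, draw i ~ q, observe g,
    feed RWM the estimate g_hat = g / q_i on coordinate i (0 elsewhere)."""
    w = [1.0] * n
    g_max = n * h / gamma            # estimated gains are at most nh/gamma
    total_gain = 0.0
    for _ in range(T):
        W = sum(w)
        p = [wi / W for wi in w]
        q = [(1 - gamma) * pi + gamma / n for pi in p]
        i = rng.choices(range(n), weights=q)[0]
        g = pull(i)
        total_gain += g
        g_hat = g / q[i]             # importance-weighted estimate of arm i's gain
        # RWM update on gains, with the exponent scaled into [0,1]
        w[i] *= (1 + eps) ** (g_hat / g_max)
    return total_gain
```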

Page 23: Online Learning + Game Theory

Multi-armed bandit problem: Exp3, continued [Auer, Cesa-Bianchi, Freund, Schapire]

[Same Exp3 diagram as Page 22.]

Conclusion (setting γ = ε): E[Exp3] ≥ OPT(1-ε)² - O(ε^{-2} n h log(n)).

Balancing ε would give O((OPT·nh·log n)^{2/3}) in the bound because of the ε^{-2} term. But with more care in the analysis, this can be reduced to ε^{-1}, giving O((OPT·nh·log n)^{1/2}).

Page 24: Online Learning + Game Theory

Summary (of results so far)

Algorithms for online decision-making with strong guarantees on performance compared to the best fixed choice.
• Application: play a repeated game against an adversary. Perform nearly as well as the best fixed strategy in hindsight.

Can apply even with very limited feedback.
• Application: which way to drive to work, with feedback only about your own paths; online pricing, even with only buy/no-buy feedback.

Page 25: Online Learning + Game Theory

Internal/Swap Regret and Correlated Equilibria

Page 26: Online Learning + Game Theory

What if all players minimize regret?
• In zero-sum games, empirical frequencies quickly approach minimax optimality.
• In general-sum games, does behavior quickly (or at all) approach a Nash equilibrium?
• After all, a Nash equilibrium is exactly a set of distributions that are no-regret with respect to each other. So if the distributions stabilize, they must converge to a Nash equilibrium.
• Well, unfortunately, they might not stabilize.

Page 27: Online Learning + Game Theory

An interesting bad example
• [Balcan-Constantin-Mehta12]:
  – Failure to converge even in rank-1 games (games where R+C has rank 1).
  – Interesting because one can find equilibria efficiently in such games.

Page 28: Online Learning + Game Theory

What can we say?

If algorithms minimize "internal" or "swap" regret, then the empirical distribution of play approaches correlated equilibrium.
• [Foster & Vohra, Hart & Mas-Colell, …]
• Though this doesn't imply the play itself is stabilizing.

What are internal/swap regret and correlated equilibria?

Page 29: Online Learning + Game Theory

More general forms of regret
1. "best expert" or "external" regret:
   – Given n strategies. Compete with the best of them in hindsight.
2. "sleeping expert" or "regret with time-intervals":
   – Given n strategies and k properties. Let S_i be the set of days satisfying property i (the sets may overlap). Want to simultaneously achieve low regret over each S_i.
3. "internal" or "swap" regret: like (2), except that S_i = the set of days on which we chose strategy i.

Page 30: Online Learning + Game Theory

Internal/swap-regret
• E.g., each day we pick one stock to buy shares in.
  – Don't want to have regret of the form "every time I bought IBM, I should have bought Microsoft instead".
• Formally, swap regret is with respect to the optimal function f: {1,…,n} → {1,…,n} such that every time you played action j, it plays f(j).

Page 31: Online Learning + Game Theory

Weird… why care?

"Correlated equilibrium"
• A distribution over entries in the matrix, such that if a trusted party chooses one at random and tells you your part, you have no incentive to deviate.
• E.g., the Shapley game (row player's payoff listed first):

          R       P       S
    R   -1,-1   -1,1    1,-1
    P    1,-1   -1,-1   -1,1
    S   -1,1    1,-1    -1,-1

In general-sum games, if all players have low swap-regret, then the empirical distribution of play is an approximate correlated equilibrium.

Page 32: Online Learning + Game Theory

Connection
• If all parties run a low swap-regret algorithm, then the empirical distribution of play is an approximate correlated equilibrium.
  – Correlator chooses a random time t ∈ {1, 2, …, T}. Tells each player to play the action j they played at time t (but does not reveal the value of t).
  – Expected incentive to deviate: Σ_j Pr(j)·(Regret | j) = swap-regret of the algorithm.
  – So, this suggests correlated equilibria may be natural things to see in multi-agent systems where individuals are optimizing for themselves.

Page 33: Online Learning + Game Theory

Correlated vs Coarse-Correlated Equilibria

In both cases: a distribution over entries in the matrix. Think of a third party choosing from this distribution and telling you your part as "advice".

"Correlated equilibrium"
• You have no incentive to deviate, even after seeing what the advice is.

"Coarse-correlated equilibrium"
• If your only choice is to see and follow the advice, or not to see it at all, you would prefer the former.

Low external regret ⇒ approximate coarse-correlated equilibrium.

Page 34: Online Learning + Game Theory

Internal/swap-regret, contd

Algorithms for achieving low regret of this form:
– Foster & Vohra, Hart & Mas-Colell, Fudenberg & Levine.
– Will present the method of [BM05], showing how to convert any "best expert" algorithm into one achieving low swap regret.
– Unfortunately, the #steps to achieve low swap regret is O(n log n) rather than O(log n).

Page 35: Online Learning + Game Theory

Can convert any "best expert" algorithm A into one achieving low swap regret. Idea:
– Instantiate one copy A_j responsible for the expected regret over the times we play j.

[Figure: the master Alg receives cost vector c each round. Each copy A_j outputs a distribution q_j; stacking these as rows gives a matrix Q. The master plays p = pQ and gives copy A_j the scaled cost vector p_j·c.]

– This allows us to view p_j as the probability we play action j, or as the probability we play algorithm A_j.
– Give A_j feedback of p_j·c.
– A_j guarantees Σ_t (p_j^t c^t)·q_j^t ≤ min_i Σ_t p_j^t c_i^t + [regret term].
– Write as: Σ_t p_j^t (q_j^t · c^t) ≤ min_i Σ_t p_j^t c_i^t + [regret term].

Page 36: Online Learning + Game Theory

Can convert any "best expert" algorithm A into one achieving low swap regret. Idea:
– Instantiate one copy A_j responsible for the expected regret over the times we play j.

[Same block diagram as Page 35: each A_j outputs q_j, the master plays p = pQ and gives A_j the scaled cost vector p_j·c.]

– Write as: Σ_t p_j^t (q_j^t · c^t) ≤ min_i Σ_t p_j^t c_i^t + [regret term].
– Sum over j to get: Σ_t p^t Q^t c^t ≤ Σ_j min_i Σ_t p_j^t c_i^t + n·[regret term].
  (The left side is our total cost; on the right, for each j we can move our probability to its own best i = f(j).)

Page 37: Online Learning + Game Theory

Can convert any "best expert" algorithm A into one achieving low swap regret. Idea:
– Instantiate one copy A_j responsible for the expected regret over the times we play j.

[Same block diagram as Page 35.]

– Sum over j to get: Σ_t p^t Q^t c^t ≤ Σ_j min_i Σ_t p_j^t c_i^t + n·[regret term].
  (The left side is our total cost; on the right, for each j we can move our probability to its own best i = f(j).)
– So we get swap-regret at most n times the original external regret.
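A sketch of this reduction (Pages 35–37), using the cost-based RWM from earlier as the "best expert" subroutine. Solving p = pQ by power iteration is one natural implementation choice; all class and function names here are mine.

```python
class RWMCopy:
    """One copy A_j of a best-expert algorithm (multiplicative weights on costs)."""
    def __init__(self, n, eps=0.05):
        self.w = [1.0] * n
        self.eps = eps
    def distribution(self):
        W = sum(self.w)
        return [wi / W for wi in self.w]
    def update(self, scaled_costs):            # receives the scaled vector p_j * c
        self.w = [wi * (1 - self.eps * c) for wi, c in zip(self.w, scaled_costs)]

def stationary(Q, iters=100):
    """Approximately solve p = pQ (p a row vector) by power iteration."""
    n = len(Q)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(p[j] * Q[j][i] for j in range(n)) for i in range(n)]
        s = sum(p)
        p = [x / s for x in p]
    return p

def swap_regret_play(cost_rounds, n, eps=0.05):
    """Low swap-regret master: one copy A_j per action. Each round, build Q from
    the copies' distributions, play p with p = pQ, and give copy A_j the cost
    vector p_j * c."""
    copies = [RWMCopy(n, eps) for _ in range(n)]
    total_expected_cost = 0.0
    for c in cost_rounds:
        Q = [copies[j].distribution() for j in range(n)]   # row j = q_j
        p = stationary(Q)
        total_expected_cost += sum(p[i] * c[i] for i in range(n))
        for j in range(n):
            copies[j].update([p[j] * ci for ci in c])
    return total_expected_cost
```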

Page 38: Online Learning + Game Theory

Summary (game theory connection)

Zero-sum game: two players minimizing external regret ⇒ minimax optimality.

General-sum game: players minimizing external regret ⇒ coarse-correlated equilibrium.

General-sum game: players minimizing internal regret ⇒ correlated equilibrium.

These algorithms can also be useful for fast approximately optimal solutions to optimization problems.