TRANSCRIPT
CS-E4800 Artificial Intelligence
Jussi Rintanen
Department of Computer Science, Aalto University
March 9, 2017
Difficulties in Rational Collective Behavior
Individual utility in conflict with collective utility
Examples:
greenhouse gases
over-population
de-forestation
arms race, military build-up
No general solution to resolve this conflict
Issue: How to align agents’ and collectives’ utilities
Law/agreements to constrain individual actions (hard to enforce when utilities are high)
Tragedy of the Commons
Using jointly-owned resource; cost evenly shared
(rows: own spending; columns: the other's spending; row player's payoff listed first)

      0     2     4     6
0   0,0  -1,1  -2,2  -3,3
2  1,-1   0,0  -1,1  -2,2
4  2,-2  1,-1   0,0  -1,1
6  3,-3  2,-2  1,-1   0,0
Using your own resource

      0     2     4     6
0   0,0   0,0   0,0   0,0
2   0,0   0,0   0,0   0,0
4   0,0   0,0   0,0   0,0
6   0,0   0,0   0,0   0,0
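The entries above can be reconstructed as follows (my reading of the example, not stated on the slide): each unit spent benefits the spender in full, but the cost of the total spend is split evenly, so spending s against an opponent spending o nets s − (s + o)/2 = (s − o)/2. A minimal sketch under that assumption:

```python
# Reconstruct the payoff matrix of the joint-resource example:
# each player keeps the full benefit of what they spend, but the cost of
# the total spend is shared evenly, so the net payoff is (s - o) / 2.
levels = [0, 2, 4, 6]

print("     " + "".join(f"{o:>6}" for o in levels))
for s in levels:  # row player's spending
    cells = [f"{(s - o) // 2},{(o - s) // 2}" for o in levels]
    print(f"{s:>5}" + "".join(f"{c:>6}" for c in cells))
```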
Best action: Spend the joint resource as much as you can.
(Or, under diminishing marginal utility, at least: spend more of it than you would if you had to pay for it in full.)
(Flatmates agree to fill up the fridge every day, divide the cost evenly, and let everybody eat as much as they like. Good idea?)
Games with State
Strategies in normal form games are single-shot
Real-world games typically involve multiple stages
Formalizations:
Games in extensive form (game theory)
Multi-agent Markov decision processes: MDPs with actions replaced by normal form games, and payoffs obtained from values of successor states
Game-tree search for zero-sum games (later today)
Can be abstractly viewed as normal form games (reduction to normal form is of exponential size)
Time-dependent aspects cannot be investigated in normal form
Challenges in multi-agent systems
1 players' utilities opposite (coordination impossible, mixed strategies)
2 conflicting individual and collective utility (coordination difficult, suboptimal collective outcomes)
3 making a decision collectively (measuring utility)
Preference aggregation
Decision between alternatives A, B and C
Agents express their preferences
option 1: some ranking/ordering of A, B, C
option 2: numeric values of A, B, C
Preferences need to be aggregated to obtain a joint ordering/valuation of A, B, C
This is difficult!
Agents' utilities generally not publicly known
Optimal strategy (often): lie about utilities/preferences
Suboptimal outcomes
Aggregation of rankings
Set of candidates (outcomes, alternatives)
Set of agents
Objective: Produce
an aggregate ordering of all candidates, or
a winning candidate.
Aggregation of rankings
A scoring rule assigns a numeric score based on the position in each individual ordering.
Aggregate ordering formed by summing the scores from each individual.
Possible rules (for ranking 4 candidates):
plurality: x > y > z > u mapped to 1, 0, 0, 0. Only 1st preference counts.
veto: x > y > z > u mapped to 1, 1, 1, 0 (or 0, 0, 0, −1). Only last preference counts.
Borda count: x > y > z > u mapped to 3, 2, 1, 0
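As a concrete illustration (the ballots are made up; nothing here is from the slides), a small Python sketch aggregating rankings under the three scoring rules:

```python
# Aggregate rankings with positional scoring rules (plurality, veto, Borda).
# Each ballot lists candidates from most to least preferred.
from collections import defaultdict

def aggregate(ballots, scores):
    """Sum the positional scores each candidate receives over all ballots."""
    totals = defaultdict(int)
    for ballot in ballots:
        for position, candidate in enumerate(ballot):
            totals[candidate] += scores[position]
    return dict(totals)

ballots = [("x", "y", "z", "u"), ("y", "x", "u", "z"), ("y", "z", "x", "u")]
print(aggregate(ballots, [1, 0, 0, 0]))  # plurality: only 1st place counts
print(aggregate(ballots, [1, 1, 1, 0]))  # veto: only last place counts
print(aggregate(ballots, [3, 2, 1, 0]))  # Borda count
```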
Aggregation of rankings
Scoring rules can be combined with runoff procedures:
2-candidate runoff with plurality rule
1 Eliminate all but the top two based on scores
2 Recalculate scores; the winner is the one scoring higher

Single transferable vote with plurality rule
1 Eliminate the candidate with the lowest plurality score
2 Continue eliminations until one candidate is left
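A minimal sketch of single transferable vote with the plurality rule (illustrative; ties between lowest-scoring candidates are broken alphabetically here):

```python
# Single transferable vote with the plurality rule: repeatedly eliminate
# the candidate with the fewest first preferences among those remaining.
def stv_winner(ballots):
    remaining = {c for ballot in ballots for c in ballot}
    while len(remaining) > 1:
        firsts = {c: 0 for c in remaining}
        for ballot in ballots:
            # The ballot's vote transfers to its top remaining candidate.
            top = next(c for c in ballot if c in remaining)
            firsts[top] += 1
        remaining.remove(min(sorted(remaining), key=lambda c: firsts[c]))
    return remaining.pop()

ballots = [("x", "y", "z"), ("x", "z", "y"), ("y", "x", "z"),
           ("y", "z", "x"), ("z", "y", "x")]
print(stv_winner(ballots))  # z eliminated first, its vote transfers: y wins
```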
Aggregation of preferences/ranks: Ordering by pairwise plurality
Order x > y if a plurality of agents prefers x to y.
Can lead to cycles:
agent 1: A > B > C
agent 2: B > C > A
agent 3: C > A > B
(the candidates appear as pictures on the original slide; they are labeled A, B, C here)
Aggregation of preferences/ranks: Ordering by pairwise plurality
How are these cycles possible?
Candidates have different property vectors:
(1,1,0) (1,0,1) (0,1,1)
Even if the agents value all properties positively, uneven weights lead to cycles.
Example: weight vectors (3,2,1), (1,3,2), (2,1,3) for the three agents
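A short check that these weights produce a pairwise-majority cycle over the candidates above (the pairing of weights to agents is my reading of the example):

```python
# Verify that positively-weighted agents can produce a pairwise-majority cycle.
candidates = {"A": (1, 1, 0), "B": (1, 0, 1), "C": (0, 1, 1)}
weights = [(3, 2, 1), (1, 3, 2), (2, 1, 3)]  # one weight vector per agent

def value(w, props):
    return sum(wi * pi for wi, pi in zip(w, props))

for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    prefer_x = sum(value(w, candidates[x]) > value(w, candidates[y])
                   for w in weights)
    print(f"{x} beats {y}: {prefer_x} of {len(weights)} agents")  # 2 of 3 each
```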
Strategic voting
Expressing preferences incorrectly can be beneficial:
Assume plurality voting
Agent's actual preferences are A > B > C
Agent knows that the other agents' preferences are
B > C > A
B > C > A
C > A > B
C > A > B
B and C will be tied if the agent votes A > B > C.
B wins if the agent votes B > A > C (better result!)
Other scoring rules (and voting systems in general) aremanipulable similarly.
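The example above can be checked mechanically; a tiny sketch using the ballots from this slide:

```python
# Plurality outcomes for the strategic-voting example above.
from collections import Counter

others = [("B", "C", "A"), ("B", "C", "A"), ("C", "A", "B"), ("C", "A", "B")]

for my_ballot in [("A", "B", "C"), ("B", "A", "C")]:
    firsts = Counter(ballot[0] for ballot in others + [my_ballot])
    print(my_ballot, "->", firsts.most_common())
# Truthful vote: B and C tie at 2; strategic vote B > A > C: B wins with 3.
```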
Vickrey-Clarke-Groves mechanism
With the Clarke pivot rule
Choice between alternatives in set X:
1 Agents report their value functions $v_i(x)$, $x \in X$
2 Best outcome is $x_{\mathrm{opt}} = \arg\max_{x \in X} \sum_{i=1}^{n} v_i(x)$
3 Agent i is paid $\sum_{j \neq i} v_j(x_{\mathrm{opt}}) - \max_{x \in X} \sum_{j \neq i} v_j(x)$
(the value of $x_{\mathrm{opt}}$ minus the value of the best alternative without agent i)
Agent's payment + utility is maximized by truthful reporting!
(Can be viewed as a generalization of second-price sealed-bid auctions.)
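A compact sketch of the mechanism with the Clarke pivot rule (the reported values are made up for illustration):

```python
# VCG mechanism with the Clarke pivot rule.
# values[i][x] is agent i's reported value for alternative x.
def vcg(values, alternatives):
    def social_welfare(x, exclude=None):
        return sum(v[x] for i, v in enumerate(values) if i != exclude)

    x_opt = max(alternatives, key=social_welfare)
    payments = []
    for i in range(len(values)):
        # Others' welfare at x_opt, minus the best the others could have
        # achieved without agent i (negative iff agent i is pivotal).
        best_without_i = max(social_welfare(x, exclude=i) for x in alternatives)
        payments.append(social_welfare(x_opt, exclude=i) - best_without_i)
    return x_opt, payments

values = [{"A": 10, "B": 2}, {"A": 0, "B": 6}, {"A": 1, "B": 4}]
print(vcg(values, ["A", "B"]))  # B wins; agents 1 and 2 pay for being pivotal
```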
Game tree search
Two-person multi-stage zero-sum games
player wins, opponent loses, or vice versa (or it's a draw)
Board games: checkers, chess, backgammon, go
Other applications? (Military operations?)
Issue: very large search trees
Issue: focusing search difficult
Basic game tree search by Minimax
Depth-first search of bounded depth AND-OR tree
Leaf nodes evaluated with a heuristic value function
Chess: value of pieces, relative positions (mobility, safety of king, ...)
Values of non-leaf nodes by min or max of children
AND-nodes (opponent) by minimization
OR-nodes (player) by maximization
(Special case: whole game tree covered, winning leaves 1, losing leaves -1, and draws 0)
Minimax Tree Search
(Figure: a minimax tree. The leaves have values 0, 2, -1, 0, 3, -2, 1, 1; the MAX nodes above them get values 2, 0, 3, 1; the MIN nodes above those get values 0 and 1; the MAX root gets value 1.)
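A minimal depth-first minimax sketch (a tree is represented as nested lists, matching the figure; this encoding is mine, not the course's):

```python
# Depth-first minimax over a tree given as nested lists; a leaf is a number.
def minimax(node, maximizing):
    if isinstance(node, (int, float)):  # leaf: heuristic/terminal value
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# The tree from the figure: MAX root, MIN level, MAX level, leaves.
tree = [[[0, 2], [-1, 0]], [[3, -2], [1, 1]]]
print(minimax(tree, True))  # 1
```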
Alpha-Beta Pruning
Idea behind Alpha-Beta Pruning:
min(x, max(y, z)) = x if x ≤ y (α cuts)
max(x, min(y, z)) = x if x ≥ y (β cuts)
In both cases, z is irrelevant.
Alpha-Beta pruning example
(Figure sequence omitted: a MAX root with three MIN children. The first MIN child's leaves 3, 12, 8 give it value 3. At the second MIN child, the first leaf 2 already bounds its value to ≤ 2 < 3, so its two remaining leaves are pruned. The third MIN child's leaves 14, 5, 2 give it value 2. The root's value is max(3, 2, 2) = 3.)
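A sketch of minimax with alpha-beta pruning on the same nested-list representation; the two pruned leaves (shown as X on the slide) are filled with arbitrary values here, since their contents never matter:

```python
# Minimax with alpha-beta pruning over nested-list trees.
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    if isinstance(node, (int, float)):  # leaf value
        return node
    for child in node:
        value = alphabeta(child, not maximizing, alpha, beta)
        if maximizing:
            alpha = max(alpha, value)
        else:
            beta = min(beta, value)
        if alpha >= beta:  # remaining children cannot change the result
            break
    return alpha if maximizing else beta

# The example tree: three MIN children under a MAX root.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, True))  # 3; the leaves 4 and 6 are never evaluated
```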
Heuristics to support Alpha-Beta Pruning
Alpha-Beta prunes more if best actions tried first
Determine promising actions through iterative deepening: use the score for an action/child from the previous iterative-deepening round
Issue with depth-bounds: Horizon effect
Black bishop is trapped, but its capture could be delayed to search depth d + 1 (board diagram omitted)
Transposition tables
Depth-first search used in games like chess because of astronomic state spaces: algorithms that require storing all visited states are not feasible.
Need to utilize memory for pruning, without exhausting it
DFS can reach a state in multiple ways
=⇒ multiple copies of the same subtree
Transposition tables: cache states encountered during DFS; retrieve the value of already-encountered states rather than repeating the search
When table full, delete low-importance states
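A minimal sketch of a transposition table wrapped around depth-bounded search; the position interface (key, moves, play, evaluate, is_terminal) is hypothetical, and the eviction policy is deliberately crude:

```python
# Transposition table: cache values of positions reached via different
# move orders, so repeated subtrees are searched only once.
table = {}
MAX_ENTRIES = 10**6  # bound memory use; stop caching rather than exhaust it

def search(position, depth, maximizing):
    key = (position.key(), depth, maximizing)  # assumes a hashable position key
    if key in table:
        return table[key]                      # subtree already searched
    if depth == 0 or position.is_terminal():
        value = position.evaluate()
    else:
        values = [search(position.play(m), depth - 1, not maximizing)
                  for m in position.moves()]
        value = max(values) if maximizing else min(values)
    if len(table) < MAX_ENTRIES:               # crude "table full" policy
        table[key] = value
    return value
```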
Endgame databases
In games with a limited number of simple (late) states/configurations, compute their value by exhaustive game-tree search and store it for later use.
Another form of caching, constructed once, before game-playing
Endgame databases
All ≤ 7 piece states solved in 2012
7-piece DB is 140 TB; 6-piece DB is 1.2 TB
Black to checkmate in 545 moves (board diagram omitted)
Checkers is solved
Checkers (5 · 10^20 states) was shown to be a draw (Schaeffer et al., 2007)
The solution consists of:
an AND-OR tree from the initial state (∼ 10^7 nodes)
leaf nodes evaluated from an endgame database with all ≤ 10 piece positions: 3.9 · 10^13 states; computations took 2001-2005
Monte Carlo methods
DFS not working well for some types of games:
Too many states
Heuristics don't guide search well
Information gained during search not utilized

Monte Carlo methods:
Sample randomly full game-plays
Focus search according to promising game-plays
Works even without heuristics, e.g. for Go

Similar methods used also for very large MDPs, POMDPs (e.g. in robotics)
Go (or Baduk or Weiqi)
Two-player fully-observable deterministic zero-sum board game
Has been a big challenge for computers
Rules of Go
Go is played on a 19×19 square grid of points, by players called Black and White.
Each point on the grid may be colored black, white or empty.
A point P, not colored C, is said to reach C, if there is a path of (vertically or horizontally) adjacent points of P's color from P to a point of color C.
Clearing a color means emptying all points of that color that don't reach empty.
Starting with an empty grid, the players alternate turns, starting with Black.
A turn is either a pass; or a move that doesn't repeat an earlier grid coloring.
A move consists of coloring an empty point one's own color; then clearing the opponent's color, and then clearing one's own color.
The game ends after two consecutive passes.
A player's score is the number of points of her color, plus the number of empty points that reach only her color. White gets 6.5 points extra.
The player with the higher score at the end of the game is the winner.
Example game of 9×9 Go
(Sequence of board diagrams omitted.)
Why is Go difficult for computers?
Go is visual and thus easy for people. One could not show 10 Chess moves in one image.
Branching factor far larger than in Chess
Evaluation of board configurations difficult
Horizon effect is strong (easy to delay capture)
Paradigm shift in 2006
Computer Go was progressing slowly (weak amateur level)
In 2006, Monte Carlo methods surpassed traditional tree search
In 2015:
All competitive programs use Monte Carlo
19×19 is strong amateur level
9×9 is professional level
5×6 is solved, solving 6×6 feasible
In 2016, board evaluation by neural networks → AlphaGo beats human champions
Monte Carlo Search
Try out every possible action
Several randomized plays:
Choose actions randomly
Stop only after the game ends
Score each game-play according to who wins
Best action is the one with the most wins
Notice: No search tree here, only evaluation of the current action alternatives
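A sketch of this sampling-only evaluation; the game-state interface (actions, play, is_over, winner) is a made-up stand-in for a real game:

```python
# Pure Monte Carlo action evaluation: for each available action, play many
# random games to completion and pick the action that wins most often.
import random

def monte_carlo_action(state, player, n_playouts=1000):
    def random_playout(s):
        while not s.is_over():
            s = s.play(random.choice(s.actions()))
        return s.winner()

    wins = {a: 0 for a in state.actions()}
    for action in wins:
        for _ in range(n_playouts):
            if random_playout(state.play(action)) == player:
                wins[action] += 1
    return max(wins, key=lambda a: wins[a])
```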
Monte Carlo Tree Search (MCTS)
Extension of simulation/sampling-only Monte Carlo search
Generate a search tree, with leaves evaluated by randomized simulation
Example (Single Agent)
Show the number of wins/trials for each node.
(Sequence of diagrams omitted: starting from a root labeled 0/0, one node is added per simulated game, and the win/trial counts along the path to it are updated after each win or loss; over six simulations the root counter evolves 0/0 → 1/1 → 1/2 → 2/3 → 3/4 → 3/5 → 4/6.)
Monte Carlo Tree Search
Which tree node to choose for the next expansion or trial?
Incomplete information: results of previous trials
Choose one with few trials but high rewards (low confidence), or
choose one with many trials but lower rewards (high confidence)?
(Exploration-exploitation trade-off as in Reinforcement Learning)
Approach: Multi-Armed Bandits
Multi-Armed Bandits
Consider three “One-Armed Bandits” (slot machines) with different win distributions, and with the following wins so far:
arm 1: 0, 1, 0, 0, 1
arm 2: 5
arm 3: 2, 2, 1
Which arm would you pull next?
Multi-Armed Bandits
$\mu_i$ = (initially unknown) expected pay-off of arm i
$T_i(t)$ = how many times arm i was played in steps 1..t
$\mu^* = \max_{i=1}^{K} \mu_i$ is the optimum pay-off
The optimal way of choosing arms minimizes the regret (how far below the optimum?) after n steps:
$n\mu^* - \sum_{i=1}^{K} \mu_i \, E[T_i(n)]$
Multi-Armed Bandits
$\bar{x}_i$ = average reward from arm i in the first n steps
UCB1 formula (Auer et al. 2002):
First, every arm is played once.
Optimal arm after n steps: choose i to maximize
$\bar{x}_i + \sqrt{\frac{2 \ln n}{T_i(n)}}$
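A direct transcription of the rule, applied to the three-armed example from the earlier slide:

```python
# UCB1: pick the arm maximizing average reward plus an exploration bonus.
import math

def ucb1_choice(rewards_per_arm):
    """rewards_per_arm[i] is the list of rewards observed from arm i."""
    n = sum(len(r) for r in rewards_per_arm)  # total plays so far
    def ucb(r):
        return sum(r) / len(r) + math.sqrt(2 * math.log(n) / len(r))
    return max(range(len(rewards_per_arm)),
               key=lambda i: ucb(rewards_per_arm[i]))

# The three bandits from the earlier slide.
arms = [[0, 1, 0, 0, 1], [5], [2, 2, 1]]
print(ucb1_choice(arms))  # index of the arm to pull next
```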
UCT algorithm
Create a root of the tree with the initial state
while within computational budget do
    leaf ← Selection(root)
    terminal ← Simulation(leaf)
    Backpropagation(leaf, Utility(terminal))
end
return arg max_{child ∈ Children(root)} N(child)
UCT algorithm

function Selection(node)
    while NonTerminal(State(node)) do
        action ← arg max_{action ∈ Actions(node)} UCB1(node, action)
        if Child(node, action) then
            node ← Child(node, action)
        else
            return Expand(node, action)
        end
    end
    return node

function UCB1(node, action)
    child ← Child(node, action)
    if child then
        return SumUtil(child)/N(child) + √(2 ln N(node) / N(child))
    else
        return ∞
    end
UCT algorithm
function Expand(node, action)
    child ← create a new child of node
    N(child) ← 0
    SumUtil(child) ← 0
    return child

function Backpropagation(node, utility)
    while node do
        N(node) ← N(node) + 1
        SumUtil(node) ← SumUtil(node) + utility
        node ← Parent(node)
    end
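Putting the pieces together, a runnable single-agent sketch of UCT; the State interface (actions, play, is_terminal, utility) is again a hypothetical stand-in for a real game:

```python
# UCT: Monte Carlo tree search with UCB1 selection (single-agent version).
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}           # action -> Node
        self.n, self.sum_util = 0, 0.0

def ucb1(parent, child):
    if child.n == 0:
        return math.inf              # unvisited children are tried first
    return (child.sum_util / child.n
            + math.sqrt(2 * math.log(parent.n) / child.n))

def selection(node):
    while not node.state.is_terminal():
        def score(a):
            return ucb1(node, node.children[a]) if a in node.children else math.inf
        action = max(node.state.actions(), key=score)
        if action in node.children:
            node = node.children[action]
        else:                        # Expand: create the child and stop descending
            child = Node(node.state.play(action), parent=node)
            node.children[action] = child
            return child
    return node

def simulation(state):
    while not state.is_terminal():   # random play-out to the end of the game
        state = state.play(random.choice(state.actions()))
    return state.utility()

def backpropagation(node, utility):
    while node is not None:          # update counts on the path to the root
        node.n += 1
        node.sum_util += utility
        node = node.parent

def uct(initial_state, budget=1000):
    root = Node(initial_state)
    for _ in range(budget):
        leaf = selection(root)
        backpropagation(leaf, simulation(leaf.state))
    return max(root.children, key=lambda a: root.children[a].n)
```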
Properties of UCT algorithm
Best action chosen exponentially more often
Grows an asymmetric tree
Utility estimates converge to true values
Applicable to
one or more agents
deterministic or stochastic systems