TRANSCRIPT
CS-E4800 Artificial Intelligence
Jussi Rintanen
Department of Computer Science, Aalto University
March 9, 2017
Difficulties in Rational Collective Behavior
Individual utility in conflict with collective utility
Examples:
greenhouse gases
over-population
de-forestation
arms race, military build-up
No general solution to resolve this conflict
Issue: How to align agents’ and collectives’ utilities
Law/agreements to constrain individual actions (hard to enforce when utilities are high)
Tragedy of the Commons
Using jointly-owned resource; cost evenly shared
(rows: own spending; columns: the other's spending; row player's payoff listed first)

      0     2     4     6
0   0,0  -1,1  -2,2  -3,3
2  1,-1   0,0  -1,1  -2,2
4  2,-2  1,-1   0,0  -1,1
6  3,-3  2,-2  1,-1   0,0
Using your own resource

      0     2     4     6
0   0,0   0,0   0,0   0,0
2   0,0   0,0   0,0   0,0
4   0,0   0,0   0,0   0,0
6   0,0   0,0   0,0   0,0
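The entries above can be reconstructed as follows (my reading of the example, not stated on the slide): each unit spent benefits the spender in full, but the cost of the total spend is split evenly, so spending s against an opponent spending o nets s − (s + o)/2 = (s − o)/2. A minimal sketch under that assumption:

```python
# Reconstruct the payoff matrix of the joint-resource example:
# each player keeps the full benefit of what they spend, but the cost of
# the total spend is shared evenly, so the net payoff is (s - o) / 2.
levels = [0, 2, 4, 6]

print("     " + "".join(f"{o:>6}" for o in levels))
for s in levels:  # row player's spending
    cells = [f"{(s - o) // 2},{(o - s) // 2}" for o in levels]
    print(f"{s:>5}" + "".join(f"{c:>6}" for c in cells))
```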
Best action: Spend the joint resource as much as you can.
(Or, under diminishing marginal utility, at least: spend more of it than you would if you had to pay for it in full.)
(Flatmates agree to fill up the fridge every day, divide the cost evenly, and let everybody eat as much as they like. Good idea?)
Games with State
Strategies in normal form games are single-shot
Real-world games typically involve multiple stages
Formalizations:
Games in extensive form (game theory)
Multi-agent Markov decision processes: MDPs with actions replaced by normal form games, and payoffs obtained from values of successor states
Game-tree search for zero-sum games (later today)
Can be abstractly viewed as normal form games (reduction to normal form is of exponential size)
Time-dependent aspects cannot be investigated in normal form
Challenges in multi-agent systems
1 players' utilities opposite (coordination impossible, mixed strategies)
2 conflicting individual and collective utility (coordination difficult, suboptimal collective outcomes)
3 making a decision collectively (measuring utility)
Preference aggregation
Decision between alternatives A, B and C
Agents express their preferences
option 1: some ranking/ordering of A, B, C
option 2: numeric values of A, B, C
Preferences need to be aggregated to obtain a joint ordering/valuation of A, B, C
This is difficult!
Agents' utilities generally not publicly known
Optimal strategy (often): lie about utilities/preferences
Suboptimal outcomes
Aggregation of rankings
Set of candidates (outcomes, alternatives)
Set of agents
Objective: Produce
an aggregate ordering of all candidates, or
a winning candidate.
Aggregation of rankings
A scoring rule assigns a numeric score based on the position in each individual ordering.
Aggregate ordering formed by summing the scores from each individual.
Possible rules (for ranking 4 candidates):
plurality: x > y > z > u mapped to 1, 0, 0, 0. Only 1st preference counts.
veto: x > y > z > u mapped to 1, 1, 1, 0 (or 0, 0, 0, −1). Only last preference counts.
Borda count: x > y > z > u mapped to 3, 2, 1, 0
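As a concrete illustration (the ballots are made up; nothing here is from the slides), a small Python sketch aggregating rankings under the three scoring rules:

```python
# Aggregate rankings with positional scoring rules (plurality, veto, Borda).
# Each ballot lists candidates from most to least preferred.
from collections import defaultdict

def aggregate(ballots, scores):
    """Sum the positional scores each candidate receives over all ballots."""
    totals = defaultdict(int)
    for ballot in ballots:
        for position, candidate in enumerate(ballot):
            totals[candidate] += scores[position]
    return dict(totals)

ballots = [("x", "y", "z", "u"), ("y", "x", "u", "z"), ("y", "z", "x", "u")]
print(aggregate(ballots, [1, 0, 0, 0]))  # plurality: only 1st place counts
print(aggregate(ballots, [1, 1, 1, 0]))  # veto: only last place counts
print(aggregate(ballots, [3, 2, 1, 0]))  # Borda count
```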
Aggregation of rankings
Scoring rules can be combined with runoff procedures:
2-candidate runoff with plurality rule
1 Eliminate all but the top two based on scores
2 Recalculate scores; the winner is the one scoring higher

Single transferable vote with plurality rule
1 Eliminate the candidate with the lowest plurality score
2 Continue eliminations until one candidate is left
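A minimal sketch of single transferable vote with the plurality rule (illustrative; ties between lowest-scoring candidates are broken alphabetically here):

```python
# Single transferable vote with the plurality rule: repeatedly eliminate
# the candidate with the fewest first preferences among those remaining.
def stv_winner(ballots):
    remaining = {c for ballot in ballots for c in ballot}
    while len(remaining) > 1:
        firsts = {c: 0 for c in remaining}
        for ballot in ballots:
            # The ballot's vote transfers to its top remaining candidate.
            top = next(c for c in ballot if c in remaining)
            firsts[top] += 1
        remaining.remove(min(sorted(remaining), key=lambda c: firsts[c]))
    return remaining.pop()

ballots = [("x", "y", "z"), ("x", "z", "y"), ("y", "x", "z"),
           ("y", "z", "x"), ("z", "y", "x")]
print(stv_winner(ballots))  # z eliminated first, its vote transfers: y wins
```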
Aggregation of preferences/ranks: Ordering by pairwise plurality
Order x > y if a plurality of agents prefers x to y.
Can lead to cycles:
agent 1: A > B > C
agent 2: B > C > A
agent 3: C > A > B
(the candidates appear as pictures on the original slide; they are labeled A, B, C here)
Aggregation of preferences/ranks: Ordering by pairwise plurality
How are these cycles possible?
Candidates have different property vectors:
(1,1,0) (1,0,1) (0,1,1)
Even if the agents value all properties positively, uneven weights lead to cycles.
Example: weight vectors (3,2,1), (1,3,2), (2,1,3) for the three agents
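A short check that these weights produce a pairwise-majority cycle over the candidates above (the pairing of weights to agents is my reading of the example):

```python
# Verify that positively-weighted agents can produce a pairwise-majority cycle.
candidates = {"A": (1, 1, 0), "B": (1, 0, 1), "C": (0, 1, 1)}
weights = [(3, 2, 1), (1, 3, 2), (2, 1, 3)]  # one weight vector per agent

def value(w, props):
    return sum(wi * pi for wi, pi in zip(w, props))

for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    prefer_x = sum(value(w, candidates[x]) > value(w, candidates[y])
                   for w in weights)
    print(f"{x} beats {y}: {prefer_x} of {len(weights)} agents")  # 2 of 3 each
```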
Strategic voting
Expressing preferences incorrectly can be beneficial:
Assume plurality voting
Agent's actual preferences are A > B > C
Agent knows that the other agents' preferences are
B > C > A
B > C > A
C > A > B
C > A > B
B and C will be tied if the agent votes A > B > C.
B wins if the agent votes B > A > C (better result!)
Other scoring rules (and voting systems in general) aremanipulable similarly.
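The example above can be checked mechanically; a tiny sketch using the ballots from this slide:

```python
# Plurality outcomes for the strategic-voting example above.
from collections import Counter

others = [("B", "C", "A"), ("B", "C", "A"), ("C", "A", "B"), ("C", "A", "B")]

for my_ballot in [("A", "B", "C"), ("B", "A", "C")]:
    firsts = Counter(ballot[0] for ballot in others + [my_ballot])
    print(my_ballot, "->", firsts.most_common())
# Truthful vote: B and C tie at 2; strategic vote B > A > C: B wins with 3.
```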
Vickrey-Clarke-Groves mechanism
With the Clarke pivot rule
Choice between alternatives in set X:
1 Agents report their value functions $v_i(x)$, $x \in X$
2 Best outcome is $x_{\mathrm{opt}} = \arg\max_{x \in X} \sum_{i=1}^{n} v_i(x)$
3 Agent i is paid $\sum_{j \neq i} v_j(x_{\mathrm{opt}}) - \max_{x \in X} \sum_{j \neq i} v_j(x)$
(the value of $x_{\mathrm{opt}}$ minus the value of the best alternative without agent i)
Agent's payment + utility is maximized by truthful reporting!
(Can be viewed as a generalization of second-price sealed-bid auctions.)
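A compact sketch of the mechanism with the Clarke pivot rule (the reported values are made up for illustration):

```python
# VCG mechanism with the Clarke pivot rule.
# values[i][x] is agent i's reported value for alternative x.
def vcg(values, alternatives):
    def social_welfare(x, exclude=None):
        return sum(v[x] for i, v in enumerate(values) if i != exclude)

    x_opt = max(alternatives, key=social_welfare)
    payments = []
    for i in range(len(values)):
        # Others' welfare at x_opt, minus the best the others could have
        # achieved without agent i (negative iff agent i is pivotal).
        best_without_i = max(social_welfare(x, exclude=i) for x in alternatives)
        payments.append(social_welfare(x_opt, exclude=i) - best_without_i)
    return x_opt, payments

values = [{"A": 10, "B": 2}, {"A": 0, "B": 6}, {"A": 1, "B": 4}]
print(vcg(values, ["A", "B"]))  # B wins; agents 1 and 2 pay for being pivotal
```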
Game tree search
Two-person multi-stage zero-sum games
player wins, opponent loses, or vice versa (or it's a draw)
Board games: checkers, chess, backgammon, go
Other applications? (Military operations?)
Issue: very large search trees
Issue: focusing search difficult
Basic game tree search by Minimax
Depth-first search of bounded depth AND-OR tree
Leaf nodes evaluated with a heuristic value function
Chess: value of pieces, relative positions (mobility, safety of king, ...)
Values of non-leaf nodes by min or max of children
AND-nodes (opponent) by minimization
OR-nodes (player) by maximization
(Special case: whole game tree covered, winning leaves 1, losing leaves -1, and draws 0)
Minimax Tree Search
(Figure: a minimax tree. The leaves have values 0, 2, -1, 0, 3, -2, 1, 1; the MAX nodes above them get values 2, 0, 3, 1; the MIN nodes above those get values 0 and 1; the MAX root gets value 1.)
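A minimal depth-first minimax sketch (a tree is represented as nested lists, matching the figure; this encoding is mine, not the course's):

```python
# Depth-first minimax over a tree given as nested lists; a leaf is a number.
def minimax(node, maximizing):
    if isinstance(node, (int, float)):  # leaf: heuristic/terminal value
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# The tree from the figure: MAX root, MIN level, MAX level, leaves.
tree = [[[0, 2], [-1, 0]], [[3, -2], [1, 1]]]
print(minimax(tree, True))  # 1
```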
Alpha-Beta Pruning
Idea behind Alpha-Beta Pruning:
min(x, max(y, z)) = x if x ≤ y (α cuts)
max(x, min(y, z)) = x if x ≥ y (β cuts)
In both cases, z is irrelevant.
Alpha-Beta pruning example
(Figure sequence omitted: a MAX root with three MIN children. The first MIN child's leaves 3, 12, 8 give it value 3. At the second MIN child, the first leaf 2 already bounds its value to ≤ 2 < 3, so its two remaining leaves are pruned. The third MIN child's leaves 14, 5, 2 give it value 2. The root's value is max(3, 2, 2) = 3.)
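A sketch of minimax with alpha-beta pruning on the same nested-list representation; the two pruned leaves (shown as X on the slide) are filled with arbitrary values here, since their contents never matter:

```python
# Minimax with alpha-beta pruning over nested-list trees.
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    if isinstance(node, (int, float)):  # leaf value
        return node
    for child in node:
        value = alphabeta(child, not maximizing, alpha, beta)
        if maximizing:
            alpha = max(alpha, value)
        else:
            beta = min(beta, value)
        if alpha >= beta:  # remaining children cannot change the result
            break
    return alpha if maximizing else beta

# The example tree: three MIN children under a MAX root.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, True))  # 3; the leaves 4 and 6 are never evaluated
```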
Heuristics to support Alpha-Beta Pruning
Alpha-Beta prunes more if best actions tried first
Determine promising actions through iterative deepening: use the score for an action/child from the previous iterative-deepening round
Issue with depth-bounds: Horizon effect
Black bishop is trapped, but its capture could be delayed to search depth d + 1 (board diagram omitted)
Transposition tables
Depth-first search used in games like chess because of astronomic state spaces: algorithms that require storing all visited states are not feasible.
Need to utilize memory for pruning, without exhausting it
DFS can reach a state in multiple ways
=⇒ multiple copies of the same subtree
Transposition tables: cache states encountered during DFS; retrieve the value of already-encountered states rather than repeating the search
When table full, delete low-importance states
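A minimal sketch of a transposition table wrapped around depth-bounded search; the position interface (key, moves, play, evaluate, is_terminal) is hypothetical, and the eviction policy is deliberately crude:

```python
# Transposition table: cache values of positions reached via different
# move orders, so repeated subtrees are searched only once.
table = {}
MAX_ENTRIES = 10**6  # bound memory use; stop caching rather than exhaust it

def search(position, depth, maximizing):
    key = (position.key(), depth, maximizing)  # assumes a hashable position key
    if key in table:
        return table[key]                      # subtree already searched
    if depth == 0 or position.is_terminal():
        value = position.evaluate()
    else:
        values = [search(position.play(m), depth - 1, not maximizing)
                  for m in position.moves()]
        value = max(values) if maximizing else min(values)
    if len(table) < MAX_ENTRIES:               # crude "table full" policy
        table[key] = value
    return value
```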
Endgame databases
In games with a limited number of simple (late) states/configurations, compute their value by exhaustive game-tree search and store it for later use.
Another form of caching, constructed once, before game-playing
Endgame databases
All ≤ 7 piece states solved in 2012
7-piece DB is 140 TB; 6-piece DB is 1.2 TB
Black to checkmate in 545 moves (board diagram omitted)
Checkers is solved
Checkers (5 · 10^20 states) was shown to be a draw (Schaeffer et al., 2007)
The solution consists of:
an AND-OR tree from the initial state (∼ 10^7 nodes)
leaf nodes evaluated from an endgame database with all ≤ 10 piece positions: 3.9 · 10^13 states; computations took 2001-2005
Monte Carlo methods
DFS not working well for some types of games:
Too many states
Heuristics don't guide search well
Information gained during search not utilized

Monte Carlo methods:
Sample randomly full game-plays
Focus search according to promising game-plays
Works even without heuristics, e.g. for Go

Similar methods used also for very large MDPs, POMDPs (e.g. in robotics)
Go (or Baduk or Weiqi)
Two-player fully-observable deterministic zero-sum board game
Has been a big challenge for computers
Rules of Go
Go is played on a 19×19 square grid of points, by players called Black and White.
Each point on the grid may be colored black, white or empty.
A point P, not colored C, is said to reach C, if there is a path of (vertically or horizontally) adjacent points of P's color from P to a point of color C.
Clearing a color means emptying all points of that color that don't reach empty.
Starting with an empty grid, the players alternate turns, starting with Black.
A turn is either a pass; or a move that doesn't repeat an earlier grid coloring.
A move consists of coloring an empty point one's own color; then clearing the opponent's color, and then clearing one's own color.
The game ends after two consecutive passes.
A player's score is the number of points of her color, plus the number of empty points that reach only her color. White gets 6.5 points extra.
The player with the higher score at the end of the game is the winner.
Example game of 9×9 Go
(Sequence of board diagrams omitted.)
Why is Go difficult for computers?
Go is visual and thus easy for people. One could not show 10 Chess moves in one image.
Branching factor far larger than in Chess
Evaluation of board configurations difficult
Horizon effect is strong (easy to delay capture)
Paradigm shift in 2006
Computer Go was progressing slowly (weak amateur level)
In 2006, Monte Carlo methods surpassed traditional tree search
In 2015:
All competitive programs use Monte Carlo
19×19 is strong amateur level
9×9 is professional level
5×6 is solved, solving 6×6 feasible
In 2016, board evaluation by neural networks → AlphaGo beats human champions
Monte Carlo Search
Try out every possible action
Several randomized plays:
Choose actions randomly
Stop only after the game ends
Score each game-play according to who wins
Best action is the one with the most wins
Notice: No search tree here, only evaluation of the current action alternatives
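A sketch of this sampling-only evaluation; the game-state interface (actions, play, is_over, winner) is a made-up stand-in for a real game:

```python
# Pure Monte Carlo action evaluation: for each available action, play many
# random games to completion and pick the action that wins most often.
import random

def monte_carlo_action(state, player, n_playouts=1000):
    def random_playout(s):
        while not s.is_over():
            s = s.play(random.choice(s.actions()))
        return s.winner()

    wins = {a: 0 for a in state.actions()}
    for action in wins:
        for _ in range(n_playouts):
            if random_playout(state.play(action)) == player:
                wins[action] += 1
    return max(wins, key=lambda a: wins[a])
```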
Monte Carlo Tree Search (MCTS)
Extension of simulation/sampling-only Monte Carlo search
Generate a search tree, with leaves evaluated by randomized simulation
Example (Single Agent)
Show the number of wins/trials for each node.
(Sequence of diagrams omitted: starting from a root labeled 0/0, one node is added per simulated game, and the win/trial counts along the path to it are updated after each win or loss; over six simulations the root counter evolves 0/0 → 1/1 → 1/2 → 2/3 → 3/4 → 3/5 → 4/6.)
Monte Carlo Tree Search
Which tree node to choose for the next expansion or trial?
Incomplete information: results of previous trials
Choose one with few trials but high rewards (low confidence), or
choose one with many trials but lower rewards (high confidence)?
(Exploration-exploitation trade-off as in Reinforcement Learning)
Approach: Multi-Armed Bandits
Multi-Armed Bandits
Consider three “One-Armed Bandits” (slot machines) with different win distributions, and with the following wins so far:
arm 1: 0, 1, 0, 0, 1
arm 2: 5
arm 3: 2, 2, 1
Which arm would you pull next?
Multi-Armed Bandits
$\mu_i$ = (initially unknown) expected pay-off of arm i
$T_i(t)$ = how many times arm i was played in steps 1..t
$\mu^* = \max_{i=1}^{K} \mu_i$ is the optimum pay-off
The optimal way of choosing arms minimizes the regret (how far below the optimum?) after n steps:
$n\mu^* - \sum_{i=1}^{K} \mu_i \, E[T_i(n)]$
Multi-Armed Bandits
$\bar{x}_i$ = average reward from arm i in the first n steps
UCB1 formula (Auer et al. 2002):
First, every arm is played once.
Optimal arm after n steps: choose i to maximize
$\bar{x}_i + \sqrt{\frac{2 \ln n}{T_i(n)}}$
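A direct transcription of the rule, applied to the three-armed example from the earlier slide:

```python
# UCB1: pick the arm maximizing average reward plus an exploration bonus.
import math

def ucb1_choice(rewards_per_arm):
    """rewards_per_arm[i] is the list of rewards observed from arm i."""
    n = sum(len(r) for r in rewards_per_arm)  # total plays so far
    def ucb(r):
        return sum(r) / len(r) + math.sqrt(2 * math.log(n) / len(r))
    return max(range(len(rewards_per_arm)),
               key=lambda i: ucb(rewards_per_arm[i]))

# The three bandits from the earlier slide.
arms = [[0, 1, 0, 0, 1], [5], [2, 2, 1]]
print(ucb1_choice(arms))  # index of the arm to pull next
```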
UCT algorithm
Create a root of the tree with the initial state
while within computational budget do
    leaf ← Selection(root)
    terminal ← Simulation(leaf)
    Backpropagation(leaf, Utility(terminal))
end
return arg max_{child ∈ Children(root)} N(child)
UCT algorithm

function Selection(node)
    while NonTerminal(State(node)) do
        action ← arg max_{action ∈ Actions(node)} UCB1(node, action)
        if Child(node, action) then
            node ← Child(node, action)
        else
            return Expand(node, action)
        end
    end
    return node

function UCB1(node, action)
    child ← Child(node, action)
    if child then
        return SumUtil(child)/N(child) + √(2 ln N(node) / N(child))
    else
        return ∞
    end
UCT algorithm
function Expand(node, action)
    child ← create a new child of node
    N(child) ← 0
    SumUtil(child) ← 0
    return child

function Backpropagation(node, utility)
    while node do
        N(node) ← N(node) + 1
        SumUtil(node) ← SumUtil(node) + utility
        node ← Parent(node)
    end
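Putting the pieces together, a runnable single-agent sketch of UCT; the State interface (actions, play, is_terminal, utility) is again a hypothetical stand-in for a real game:

```python
# UCT: Monte Carlo tree search with UCB1 selection (single-agent version).
import math, random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}           # action -> Node
        self.n, self.sum_util = 0, 0.0

def ucb1(parent, child):
    if child.n == 0:
        return math.inf              # unvisited children are tried first
    return (child.sum_util / child.n
            + math.sqrt(2 * math.log(parent.n) / child.n))

def selection(node):
    while not node.state.is_terminal():
        def score(a):
            return ucb1(node, node.children[a]) if a in node.children else math.inf
        action = max(node.state.actions(), key=score)
        if action in node.children:
            node = node.children[action]
        else:                        # Expand: create the child and stop descending
            child = Node(node.state.play(action), parent=node)
            node.children[action] = child
            return child
    return node

def simulation(state):
    while not state.is_terminal():   # random play-out to the end of the game
        state = state.play(random.choice(state.actions()))
    return state.utility()

def backpropagation(node, utility):
    while node is not None:          # update counts on the path to the root
        node.n += 1
        node.sum_util += utility
        node = node.parent

def uct(initial_state, budget=1000):
    root = Node(initial_state)
    for _ in range(budget):
        leaf = selection(root)
        backpropagation(leaf, simulation(leaf.state))
    return max(root.children, key=lambda a: root.children[a].n)
```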
Properties of UCT algorithm
Best action chosen exponentially more often
Grows an asymmetric tree
Utility estimates converge to true values
Applicable to
one or more agents
deterministic or stochastic systems