
TRANSCRIPT

Page 1: RLChina 2021 Game Theory and Machine Learning in

Reinforcement Learning China Summer School

RLChina 2021

Game Theory and Machine Learning

in Multiagent Communication and

Coordination

Prof. FANG Fei

Leonardo Assistant Professor

School of Computer Science

Carnegie Mellon University

August 20, 2021

Page 2: RLChina 2021 Game Theory and Machine Learning in

Machine Learning + Game Theory

for Societal Challenges

[Figure: Artificial Intelligence = Machine Learning + Computational Game Theory, applied to societal challenges: security & safety, environmental sustainability, transportation, zero hunger]

2

Page 3: RLChina 2021 Game Theory and Machine Learning in

Protect Ferry Line from Potential Attacks

[Figure: attacker's maximum expected utility, max E[U], under the previous USCG strategy vs. the game-theoretic strategy]

Defender-attacker security game

Randomized patrol strategy

Minimize attacker’s maximum

expected utility

Reduce potential risk by 50%

Deployed by US Coast Guard

Optimal Patrol Strategy for Protecting Moving Targets with Multiple Mobile

Resources. Fei Fang, Albert Xin Jiang, Milind Tambe. In AAMAS-13

In collaboration with US Coast Guard

Page 4: RLChina 2021 Game Theory and Machine Learning in

Protect Wildlife from Poaching

Data from past patrols & satellite imagery → Predicted poaching threat

Machine Learning Methods

Ensemble Learning, Decision Trees,

Neural Networks, Gaussian

Process, Markov Random Field, …

Learn poacher behavior from data

Ranger-poacher game to plan patrols

Deployed in Uganda, China, Malaysia

Increased detection of poaching

Available to more than 600 sites worldwide

In collaboration with Uganda Wildlife Authority, Wildlife Conservation Society, World Wide Fund for Nature, Panthera, Rimba

IJCAI-15, IAAI-16, AAMAS-17, ECML-PKDD 2017, COMPASS 2019, IAAI 2021

Page 5: RLChina 2021 Game Theory and Machine Learning in

Outline

• Game Theory and Machine Learning for Multiagent Communication and Coordination
• Role of informants in security games
• Strategic signaling in security games
• Maintaining/breaking information advantage in security games
• Coordination through correlated signals
• Coordination through notification in platform-user settings
• Discussion and Summary

5

Page 6: RLChina 2021 Game Theory and Machine Learning in

Motivation: Community Engagement in Anti-Poaching

6

• Lack of patrol resources, e.g., 1 patroller per 167 km²
• Recruit informants to provide tips about poachers
• Other domains: community watchers for urban safety

Green Security Game with Community Engagement Taoan Huang, Weiran

Shen, David Zeng, Tianyu Gu, Rohit Singh, Fei Fang In AAMAS-20

Page 7: RLChina 2021 Game Theory and Machine Learning in

Motivation: Community Engagement in Anti-Poaching

• How should the rangers plan patrols with/without tips?
• The informant's goal may not always be aligned with the defender's
• Strategic informants can choose what to tell

[Figure: interaction among the Defender, the Attacker, and the Informant]

When to Follow the Tip: Security Games with Strategic Informants Weiran

Shen, Weizhe Chen, Taoan Huang, Rohit Singh, Fei Fang In IJCAI-PCAI-20

Page 8: RLChina 2021 Game Theory and Machine Learning in

• A set T of n targets
• Defender: choose a patrol strategy
  – A (randomized) allocation of r resources to the n targets
• Attacker: attack a target
• Informant: has type θ ∈ Θ with prior distribution p(θ); sends a message to the defender
• Defender: determine a defense plan

Defender-Attacker-Informant Game

When to Follow the Tip: Security Games with Strategic Informants Weiran

Shen, Weizhe Chen, Taoan Huang, Rohit Singh, Fei Fang In IJCAI-PCAI-20

[Table: defender and attacker utilities at each target, depending on whether the target is covered or uncovered]

8

Page 9: RLChina 2021 Game Theory and Machine Learning in

Defender-Attacker-Informant Game

[Figure: the four steps of the game]

9

• A defense plan d = (M, x, x0)
  – M: a set of possible messages
  – x0 ∈ [0,1]^n: a routine patrol strategy (used when no message is received)
  – x: M → [0,1]^n: a mapping from messages to patrol strategies

When to Follow the Tip: Security Games with Strategic Informants Weiran

Shen, Weizhe Chen, Taoan Huang, Rohit Singh, Fei Fang In IJCAI-PCAI-20

Page 10: RLChina 2021 Game Theory and Machine Learning in

Direct Defense Plan

10

In a direct defense plan d = (M, x, x0), M = T × Θ. A direct defense plan is truthful if reporting the actual target and the informant's true type is the informant's best strategy.

Definition

When to Follow the Tip: Security Games with Strategic Informants Weiran

Shen, Weizhe Chen, Taoan Huang, Rohit Singh, Fei Fang In IJCAI-PCAI-20

Page 11: RLChina 2021 Game Theory and Machine Learning in

Revelation Principle

11

For any defense plan (M, x, x0), there exists a truthful direct defense plan under which all players obtain the same utility, for any target and any informant type.

Theorem (Revelation Principle)

When to Follow the Tip: Security Games with Strategic Informants Weiran

Shen, Weizhe Chen, Taoan Huang, Rohit Singh, Fei Fang In IJCAI-PCAI-20

Page 12: RLChina 2021 Game Theory and Machine Learning in

How many messages are enough?

There exists a defense plan (M, x, x0) with |M| = n + 1 that achieves the optimal defender utility.

Theorem

12

Upper bound from the direct defense plan: |M| = n|Θ|

Interpretation
Messages 1 to n: for pro-defender informants
Message n + 1: for pro-attacker informants
For target t (c: covered, u: uncovered):
  The informant reports message t if U_t^c(θ) > U_t^u(θ)
  The informant reports message n + 1 if U_t^u(θ) > U_t^c(θ)

When to Follow the Tip: Security Games with Strategic Informants Weiran

Shen, Weizhe Chen, Taoan Huang, Rohit Singh, Fei Fang In IJCAI-PCAI-20

Page 13: RLChina 2021 Game Theory and Machine Learning in

Computation

• The optimal defense plan can be computed in polynomial time
• Solve a linear program (LP) for each target
  – Each LP ensures:
    • The attacker's best strategy is to attack target t
    • The informant's best strategy is to report m_t if the informant is defender-aligned on target t, i.e., U_t^c(θ) > U_t^u(θ)
    • The informant's best strategy is to report m_{n+1} if the informant is attacker-aligned on target t, i.e., U_t^c(θ) < U_t^u(θ)

When to Follow the Tip: Security Games with Strategic Informants Weiran

Shen, Weizhe Chen, Taoan Huang, Rohit Singh, Fei Fang In IJCAI-PCAI-20

Page 14: RLChina 2021 Game Theory and Machine Learning in

Computation

14

LP for the case where target t is the attacker's best choice:
• The attacker's best strategy is to attack target t
• The informant's best strategy is to report m_{t'} if the informant is defender-aligned on target t', and to report m_{n+1} otherwise
• Maximize the defender's expected utility
• The defender has r resources in total
(A simplified sketch of such a per-target LP is given below.)

When to Follow the Tip: Security Games with Strategic Informants Weiran

Shen, Weizhe Chen, Taoan Huang, Rohit Singh, Fei Fang In IJCAI-PCAI-20
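
To make the per-target LP concrete, here is a minimal Python sketch of the multiple-LPs computation for a plain security game without the informant: one LP per candidate attacked target, attacker best-response constraints, and a resource constraint. The function name and utility arrays (Ud_cov, Ud_unc, Ua_cov, Ua_unc) are illustrative only; the paper's LPs additionally encode the informant's reporting incentives and the message-conditioned strategies x(m).

```python
import numpy as np
from scipy.optimize import linprog

def best_patrol_without_informant(Ud_cov, Ud_unc, Ua_cov, Ua_unc, r):
    """One LP per candidate attacked target t: find coverage probabilities
    x in [0,1]^n with sum(x) <= r such that attacking t is an attacker best
    response, maximizing the defender's utility at t."""
    n = len(Ud_cov)
    best_val, best_x = -np.inf, None
    for t in range(n):
        # Maximize x_t*Ud_cov[t] + (1-x_t)*Ud_unc[t]; linprog minimizes, so negate.
        c = np.zeros(n)
        c[t] = -(Ud_cov[t] - Ud_unc[t])
        A_ub, b_ub = [], []
        for j in range(n):
            # Attacker best response: EU_a(j) <= EU_a(t) for every target j.
            row = np.zeros(n)
            row[j] += Ua_cov[j] - Ua_unc[j]
            row[t] -= Ua_cov[t] - Ua_unc[t]
            A_ub.append(row)
            b_ub.append(Ua_unc[t] - Ua_unc[j])
        A_ub.append(np.ones(n))        # resource constraint: sum(x) <= r
        b_ub.append(r)
        res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=[(0, 1)] * n)
        if res.success:
            val = res.x[t] * Ud_cov[t] + (1 - res.x[t]) * Ud_unc[t]
            if val > best_val:
                best_val, best_x = val, res.x
    return best_val, best_x
```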

Page 15: RLChina 2021 Game Theory and Machine Learning in

• Utility vs. informant type
  – Type 1: fully defender-aligned: U_t^c(θ) > U_t^u(θ), ∀t
  – Type 2: fully attacker-aligned: U_t^c(θ) < U_t^u(θ), ∀t
  – Type 3: random
• The informant can significantly affect the game

Experiments

When to Follow the Tip: Security Games with Strategic Informants Weiran

Shen, Weizhe Chen, Taoan Huang, Rohit Singh, Fei Fang In IJCAI-PCAI-20

Page 16: RLChina 2021 Game Theory and Machine Learning in

Experiments

• If the informant is not fully aligned with the defender, more defender resources are needed to achieve the same expected utility
• Giving the informant an additional reward helps a lot

When to Follow the Tip: Security Games with Strategic Informants Weiran

Shen, Weizhe Chen, Taoan Huang, Rohit Singh, Fei Fang In IJCAI-PCAI-20

Page 17: RLChina 2021 Game Theory and Machine Learning in

Outline

• Game Theory and Machine Learning for Multiagent Communication and Coordination
• Role of informants in security games
• Strategic signaling in security games
• Maintaining/breaking information advantage in security games
• Coordination through correlated signals
• Coordination through notification in platform-user settings
• Summary

17

Page 18: RLChina 2021 Game Theory and Machine Learning in

Motivation: UAV & Human Patrols in Anti-Poaching

SPOT Poachers in Action: Augmenting Conservation Drones with Automatic Detection in Near Real Time. Elizabeth

Bondi, Fei Fang, Mark Hamilton, Debarun Kar, Donnabell Dmello, Jongmoo Choi, Robert Hannaford, Arvind Iyer,

Lucas Joppa, Milind Tambe, Ram Nevatia. In IAAI-18

Page 19: RLChina 2021 Game Theory and Machine Learning in

Motivation: UAV & Human Patrols in Anti-Poaching

19

Not enough rangers

Flash a light to deter poachers

[Video: actual footage of a poacher running away]

Page 20: RLChina 2021 Game Theory and Machine Learning in

Signaling

• Flashing a light is a signal indicating that a ranger is arriving
• The signal can be deceptive
• If Prob(ranger arrives | signal) = 0.1, the poacher may not be stopped
• Must be strategic about deceptive signaling

20

Page 21: RLChina 2021 Game Theory and Machine Learning in

Signaling with Perfect Detection

21

[Figure: example signaling scheme assuming perfect detection, with branches for detection vs. no detection and example signaling probabilities 0.3, 0.7, 0.3, 0.4, 0.3]

How to incorporate uncertainty?

Strategic Coordination of Human Patrollers and Mobile Sensors with Signaling for Security

Games Haifeng Xu, Kai Wang, Phebe Vayanos and Milind Tambe AAAI 2018
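
Before adding detection uncertainty, the deterrence-by-bluffing idea can be sketched for a single target under perfect detection. The sketch assumes the signal is always sent when the target is actually covered and that the attacker's payoff for running away is 0; the function name and these assumptions are illustrative, not the exact model of the cited paper.

```python
def max_bluff_probability(x, ua_cov, ua_unc):
    """Largest probability of bluffing (signaling at an uncovered target) such
    that an attacker who sees the signal still prefers to run away (payoff 0).
    x: coverage probability at the target; ua_cov < 0 < ua_unc: attacker utilities."""
    # The attacker's expected utility of attacking, conditioned on seeing the signal,
    # is proportional to x*ua_cov + (1 - x)*p*ua_unc; keep it <= 0.
    if x >= 1.0 or ua_unc <= 0:
        return 1.0                       # bluffing is always safe here
    p = -x * ua_cov / ((1.0 - x) * ua_unc)
    return min(1.0, max(0.0, p))
```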

Page 22: RLChina 2021 Game Theory and Machine Learning in

Signaling with Detection Uncertainty

• Key insight: with uncertainty, the adversary is also uncertain about our uncertainty; exploit this information advantage

22

[Figure: signaling scheme under detection uncertainty, with detection and no-detection branches and the signaling decision marked '?']

To Signal or Not To Signal: Exploiting Uncertain Real-Time Information in Signaling Games for Security and

Sustainability Elizabeth Bondi, Hoon Oh, Haifeng Xu, Fei Fang, Bistra Dilkina, Milind Tambe In AAAI-20

Page 23: RLChina 2021 Game Theory and Machine Learning in

Stackelberg Security Game Model

23

Defender utilities at each target:
• Positive if the attacked target is covered
• Negative if the attacked target is uncovered
• 0 if the attacker runs away

To Signal or Not To Signal: Exploiting Uncertain Real-Time Information in Signaling Games for Security and

Sustainability Elizabeth Bondi, Hoon Oh, Haifeng Xu, Fei Fang, Bistra Dilkina, Milind Tambe In AAAI-20

Page 24: RLChina 2021 Game Theory and Machine Learning in

• Enumerate all possible "states" of a target
• Defender's pure strategy: assign a state to each target
• Goal: find the optimal mixed strategy for the defender

Solution

24

[Figure: enumeration of target states, depending on whether the patroller is near or far, whether a signal is sent, and whether the sensor's detection is matched or unmatched; states are labeled p, n+, n-, s+, s-, s]

To Signal or Not To Signal: Exploiting Uncertain Real-Time Information in Signaling Games for Security and

Sustainability Elizabeth Bondi, Hoon Oh, Haifeng Xu, Fei Fang, Bistra Dilkina, Milind Tambe In AAAI-20

Page 25: RLChina 2021 Game Theory and Machine Learning in

Solution

• Linear programming + branch and price

25

x_{iθ}: prob. of allocating resources to ensure state θ at target i
ψ_{iθ}: prob. of sending the signal in state θ with detection
φ_{iθ}: prob. of sending the signal in state θ without detection
q_e: prob. of the defender choosing pure strategy e

Objective: the defender's expected utility
LP constraint annotations: joint probabilities (not conditional probabilities); feasibility of q and x; marginal probabilities; feasibility of ψ and φ; the attacker attacks if the signal is 0 and runs away if it is 1

To Signal or Not To Signal: Exploiting Uncertain Real-Time Information in Signaling Games for Security and
Sustainability Elizabeth Bondi, Hoon Oh, Haifeng Xu, Fei Fang, Bistra Dilkina, Milind Tambe In AAAI-20

Page 26: RLChina 2021 Game Theory and Machine Learning in

Experimental Results

• Ignoring the uncertainty leads to worse performance than expected

26

Case Study

To Signal or Not To Signal: Exploiting Uncertain Real-Time Information in Signaling Games for Security and

Sustainability Elizabeth Bondi, Hoon Oh, Haifeng Xu, Fei Fang, Bistra Dilkina, Milind Tambe In AAAI-20

Page 27: RLChina 2021 Game Theory and Machine Learning in

Outline

• Game Theory and Machine Learning for Multiagent Communication and Coordination
• Role of informants in security games
• Strategic signaling in security games
• Maintaining/breaking information advantage in security games
• Coordination through correlated signals
• Coordination through notification in platform-user settings
• Summary

27

Page 28: RLChina 2021 Game Theory and Machine Learning in

Information Advantage

• Consider a finitely repeated Bayesian security game
• T rounds
• In each round, the defender and the attacker choose actions a_t^1, a_t^2 (targets to protect/attack) simultaneously
• The defender has no commitment power
• The attacker's type (utility) is unknown to the defender
• The defender needs to infer the attacker's type λ ∈ Λ from the attacker's actions
• Prior type distribution p = {p_λ}
• The attacker balances playing myopically against maintaining its information advantage, so as to maximize accumulated payoff
• An attack can be viewed as a (deceptive) signal of the attacker's type
• Task: find the optimal defender strategy

Page 29: RLChina 2021 Game Theory and Machine Learning in

Bayesian Equilibrium

• Rationality
• Belief consistency: beliefs are updated following Bayes' rule (see the sketch below)

29
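
A minimal sketch of the belief-consistency step, assuming the defender keeps a distribution over attacker types and can evaluate each type's probability of playing the observed action:

```python
import numpy as np

def update_belief(belief, likelihoods):
    """Bayes-consistent belief update over attacker types.
    belief[i]      : current probability of attacker type i
    likelihoods[i] : probability that type i plays the observed action"""
    posterior = belief * likelihoods
    total = posterior.sum()
    if total == 0:
        return belief      # off-equilibrium observation: Bayes' rule puts no constraint here
    return posterior / total
```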

Page 30: RLChina 2021 Game Theory and Machine Learning in

Optimality from any decision point onward

30

[Figure: a small extensive-form game, shown twice: P1 chooses L or R, then P2 chooses K or U, with payoffs (3;1), (1;3), (2;1), (0;0). The left copy marks a Nash equilibrium; the right copy marks a Nash equilibrium that is also perfect, i.e., optimal from any decision point onward]

Page 31: RLChina 2021 Game Theory and Machine Learning in

Perfect Bayesian Equilibrium

• An equilibrium refinement of Bayesian equilibrium
• Sequential rationality starting from any information set
• Most existing work solves for PBE using mathematical programming-based methods (Nguyen et al. 2019 [1]; Guo et al. 2017 [2])
  – Very precise
  – Lacks scalability: long solve times and large memory requirements

Thanh H. Nguyen, Yongzhao Wang, Arunesh Sinha, and Michael P. Wellman. Deception

in finitely repeated security games. In AAAI-19

31

Page 32: RLChina 2021 Game Theory and Machine Learning in

Our Algorithm for Computing PBE

• Temporal Induced Self-Play (TISP)
  – A framework that can be combined with different learning algorithms

32

Components: belief-based representation, backward induction, belief-space approximation, policy learning

Page 33: RLChina 2021 Game Theory and Machine Learning in

Belief-based representation

• Use belief instead of history: π(s, b) instead of π(h)
  – π(attacked target 1 in round l-1, target 2 in round l-2, ...) becomes π(s, belief: 0.2 prob. of the attacker being type a)
• Helps when the history is long

33

Page 34: RLChina 2021 Game Theory and Machine Learning in

Backward Induction

• Reverse the training process (see the sketch below)
  – Train round L-1 first, then round L-2, ..., down to round 0
  – Use the trained value network V and policy network π of round l+1 when training round l
• Do not sample whole trajectories from round 0 to round L-1; sample one-step trajectories from round l to round l+1
  – A special reset function places the game at round l
• Different networks for different rounds
• Improves performance without adding training cost

34
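
The training loop described above can be sketched as follows; make_policy, make_value, reset_to_round, and train_one_round are hypothetical placeholders for whichever learning algorithm is plugged into TISP.

```python
def tisp_backward_training(L, make_policy, make_value, reset_to_round, train_one_round):
    """Backward-induction training: round L-1 first, then L-2, ..., down to 0,
    reusing the already-trained networks of round l+1 when training round l."""
    policies, values = [None] * L, [None] * L
    for l in reversed(range(L)):
        pi_l, v_l = make_policy(l), make_value(l)          # separate networks per round
        next_v = values[l + 1] if l + 1 < L else None
        # One-step trajectories from round l to l+1 only; the reset function
        # places the game directly at round l (with a sampled belief/state).
        train_one_round(pi_l, v_l, next_value=next_v,
                        sample_state=lambda l=l: reset_to_round(l))
        policies[l], values[l] = pi_l, v_l
    return policies, values
```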

Page 35: RLChina 2021 Game Theory and Machine Learning in

Belief Space Approximation

• Sample K belief vectors and train strategies conditioned specifically on the sampled belief and the round
• Query time (interpolate over the K trained policies; see the sketch below):

  π(a | b, s) = Σ_{k=1}^{K} π_{θ_k}(a | s; b_k) · w(b, b_k) / Σ_{k=1}^{K} w(b, b_k)

35
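
A small sketch of the query-time interpolation, assuming policies[k] is the policy trained at sampled belief anchor b_k and using a simple distance-based kernel as a stand-in for the weighting function w:

```python
import numpy as np

def interpolate_policy(policies, anchors, s, b,
                       weight=lambda b, bk: np.exp(-np.linalg.norm(b - bk))):
    """pi(a | b, s) as a weighted average of the K policies trained at the
    sampled belief anchors b_k (the exact kernel w is an assumption here)."""
    ws = np.array([weight(b, bk) for bk in anchors])
    dists = np.stack([pi(s, bk) for pi, bk in zip(policies, anchors)])  # K x |A| action distributions
    return ws @ dists / ws.sum()
```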

Page 36: RLChina 2021 Game Theory and Machine Learning in

Policy Learning

• Policy gradient, with the update rule changed to account for the belief:

  ∇_θ V^λ(π, b, s) = Σ_{a ∈ A} ∇_θ π_θ(a | b, s) Q^λ(π, b, s, a)
                   = E[ Q^λ(π, b, s, a) ∇_θ ln π_θ(a | b, s) + γ (∇_θ b') (∇_{b'} V^λ(π, b', s')) ]

• Regret matching:

  π^{t+1}(a | s, b) = [R^{t+1}(s, b, a)]_+ / Σ_{a'} [R^{t+1}(s, b, a')]_+

  where R^{t+1}(s, b, a) = Σ_{τ=1}^{t} ( Q^τ(π^τ, s, b, a) - V_{φ^τ}(π^τ, s, b) )

36
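
A minimal sketch of the regret-matching update at one (state, belief) point, following the two formulas above: accumulate Q minus the value baseline, then normalize the positive part of the cumulative regrets.

```python
import numpy as np

def regret_matching_step(cum_regret, q_values, baseline):
    """cum_regret[a] accumulates Q(s, b, a) - V(s, b) across training iterations;
    the next policy is proportional to the positive part of the regrets."""
    cum_regret = cum_regret + (q_values - baseline)
    positive = np.maximum(cum_regret, 0.0)
    if positive.sum() == 0.0:
        policy = np.full_like(positive, 1.0 / len(positive))   # fall back to uniform
    else:
        policy = positive / positive.sum()
    return cum_regret, policy
```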

Page 37: RLChina 2021 Game Theory and Machine Learning in

Temporal Induced Self-play Training

37

Page 38: RLChina 2021 Game Theory and Machine Learning in

Test-time Policy Transformation

38

Page 39: RLChina 2021 Game Theory and Machine Learning in

Experiment: Security Game

• Better scalability than the MP-based method
• Much higher solution quality than other learning-based methods

TISP can be used for more complex stochastic Bayesian games

39

Page 40: RLChina 2021 Game Theory and Machine Learning in

Outline

• Game Theory and Machine Learning for Multiagent Communication and Coordination
• Role of informants in security games
• Strategic signaling in security games
• Maintaining/breaking information advantage in security games
• Coordination through correlated signals
• Coordination through notification in platform-user settings
• Summary

40

Page 41: RLChina 2021 Game Theory and Machine Learning in

Coordination in Games

• Correlated equilibrium (CE)
• Correlation device: sends private signals to players
  – Signals are sampled from a joint probability distribution over the players' actions and represent recommended behavior
  – Equivalent to having a mediator that privately recommends behavior to the players but does not enforce it

Example:
Nash equilibrium: total utility = 7 + 2 = 2 + 7 = 9
Correlated equilibrium with signal probabilities 0.25, 0.25, 0.5: total utility = 0.25·(7+2) + 0.25·(2+7) + 0.5·(6+6) = 10.5

41
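
A small numerical check of the example above. The slide only lists the outcomes (7,2), (2,7), and (6,6); the fourth outcome is assumed here to be (0,0), as in the classic game-of-chicken illustration, so the payoff matrices below are illustrative.

```python
import numpy as np

# Row/column action 0 = "yield", action 1 = "dare" (the (0,0) entry is an assumption).
U1 = np.array([[6, 2],
               [7, 0]])
U2 = np.array([[6, 7],
               [2, 0]])

# Correlation device: joint distribution over recommended action pairs.
mu = np.array([[0.50, 0.25],
               [0.25, 0.00]])

def is_correlated_eq(mu, U1, U2, tol=1e-9):
    """No player should gain by deviating from a recommendation, conditioned on receiving it."""
    for rec in range(2):                                  # player 1 (rows)
        for dev in range(2):
            if mu[rec].sum() > 0 and mu[rec] @ (U1[dev] - U1[rec]) > tol:
                return False
    for rec in range(2):                                  # player 2 (columns)
        for dev in range(2):
            if mu[:, rec].sum() > 0 and mu[:, rec] @ (U2[:, dev] - U2[:, rec]) > tol:
                return False
    return True

print(is_correlated_eq(mu, U1, U2))   # True
print((mu * (U1 + U2)).sum())         # total utility = 10.5
```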

Page 42: RLChina 2021 Game Theory and Machine Learning in

Understand and Compute EFCE

• Extensive-form correlated equilibrium (EFCE):
  – The correlation device selects private signals for the players before the game starts
  – Recommendations are revealed incrementally as the players progress down the game tree
• Computing an EFCE is computationally challenging

42

Page 43: RLChina 2021 Game Theory and Machine Learning in

Theoretical Results

• Theorem (informal): finding an EFCE in a two-player game can be cast as a bilinear saddle-point problem

  min_{x ∈ X} max_{y ∈ Y} x^T A y

• Conceptual implication: a zero-sum game between the mediator and the deviator
• Computational implication: the bilinear saddle-point formulation opens the way to the plethora of optimization algorithms developed specifically for saddle-point problems

43

Correlation in Extensive-Form Games: Saddle-Point Formulation and Benchmarks Gabriele Farina,

Chun Kai Ling, Fei Fang, Tuomas Sandholm. In NeurIPS-19:

Page 44: RLChina 2021 Game Theory and Machine Learning in

Algorithms to Compute EFCE

• Algorithm 1: a simple subgradient descent method (a toy sketch is given below)
  – Exploits the bilinear saddle-point problem formulation
  – Uses structural properties of EFCEs
  – Can lead to better scalability than the prior approach based on linear programming
• Algorithm 2: a regret minimization-based algorithm
  – Adapts self-play methods based on regret minimization
  – Much more scalable than Algorithm 1

Correlation in Extensive-Form Games: Saddle-Point Formulation and Benchmarks Gabriele Farina, Chun Kai Ling,

Fei Fang, Tuomas Sandholm. In NeurIPS-19. Efficient Regret Minimization Algorithm for Extensive-Form

Correlated Equilibrium Gabriele Farina, Chun Kai Ling, Fei Fang, Tuomas Sandholm In NeurIPS-19
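
For intuition about Algorithm 1, here is a toy projected gradient descent-ascent sketch for a bilinear saddle point min_x max_y x^T A y, with x and y restricted to probability simplices. The actual EFCE formulation optimizes over more structured polytopes and exploits EFCE-specific structure, so this only illustrates the saddle-point viewpoint.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ind = np.arange(1, len(v) + 1)
    cond = u - css / ind > 0
    theta = css[cond][-1] / ind[cond][-1]
    return np.maximum(v - theta, 0.0)

def saddle_point_gda(A, steps=5000, eta=0.05):
    """Projected (sub)gradient descent-ascent on x^T A y; averaged iterates
    approximate the saddle point of this bilinear objective."""
    x = np.full(A.shape[0], 1.0 / A.shape[0])
    y = np.full(A.shape[1], 1.0 / A.shape[1])
    x_avg, y_avg = np.zeros_like(x), np.zeros_like(y)
    for _ in range(steps):
        gx, gy = A @ y, A.T @ x                 # gradients of x^T A y in x and y
        x = project_simplex(x - eta * gx)       # descent step for the min player
        y = project_simplex(y + eta * gy)       # ascent step for the max player
        x_avg += x
        y_avg += y
    return x_avg / steps, y_avg / steps
```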

Page 45: RLChina 2021 Game Theory and Machine Learning in

Copula Learning for Agent Coordination

[Figure: a correlation device sends signals to the players in Team 1 and Team 2]

Our goal is to design the copula, from which we can derive the distribution of the signals, so as to achieve good enough coordination among the players.

Design the distribution of the signals, represented by a copula (e.g., a neural network). Parameterize the copula with a neural network and learn the parameters so that the players have an incentive to follow the recommended actions.

Deep Archimedean Copulas Chun Kai Ling, Fei Fang, Zico Kolter, NeurIPS-20

Page 46: RLChina 2021 Game Theory and Machine Learning in

Archimedean Copulas

• Copulas: multivariate CDFs with marginals uniform on [0, 1]
  C(x_1, ..., x_d) = P(X_1 ≤ x_1, ..., X_d ≤ x_d)
• Archimedean copulas: specified by a generator φ: [0, ∞) → [0, 1]
  C(x_1, ..., x_d) = φ(φ^{-1}(x_1) + ... + φ^{-1}(x_d))
• Commonly used Archimedean copulas are parameterized by a single scalar θ, e.g.,
  Frank:   φ_θ(t) = -(1/θ) log(e^{-t}(e^{-θ} - 1) + 1)
  Clayton: φ_θ(t) = (1 + t)^{-1/θ}

Image from Scherer, Matthias, and Jan-Frederik Mai. Simulating Copulas: Stochastic Models, Sampling Algorithms, and Applications.

46
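
A minimal sketch that evaluates an Archimedean copula directly from the Clayton-style generator above (generator and its inverse written out by hand), just to make the φ / φ^{-1} composition concrete:

```python
import numpy as np

def clayton_copula(x, theta):
    """C(x_1, ..., x_d) built from phi(t) = (1 + t)^(-1/theta),
    whose inverse is phi^{-1}(u) = u^(-theta) - 1.
    x: array of marginal values in (0, 1]; theta > 0."""
    phi_inv = np.power(x, -theta) - 1.0                  # map each marginal through phi^{-1}
    return np.power(1.0 + phi_inv.sum(), -1.0 / theta)   # apply phi to the sum

# Example: clayton_copula(np.array([0.3, 0.7]), theta=2.0)
```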

Page 47: RLChina 2021 Game Theory and Machine Learning in

Our approach: ACNet

• ACNet learns a generator φ as a convex combination of negative exponentials
• Other probabilistic quantities are obtained by differentiation w.r.t. the inputs
  – Joint density: ∂^d C(x_1, ..., x_d) / (∂x_1 ... ∂x_d)
  – Conditional densities, conditional distributions, etc.
• Evaluating these quantities requires computing φ^{-1}
  – A wrapper in PyTorch computes the inverses using Newton's method (see the sketch below)
  – Fully differentiable: derivatives w.r.t. the weights are computed using auto-differentiation

47
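
A hedged sketch of the differentiable inversion idea in PyTorch. It unrolls a fixed number of Newton steps and assumes the generator phi is strictly decreasing and applied elementwise; the actual ACNet wrapper may differ in its details.

```python
import torch

def newton_inverse(phi, y, iters=20):
    """Solve phi(t) = y for t >= 0 with Newton's method.  The unrolled iterations
    keep the result differentiable w.r.t. any parameters inside phi via autograd."""
    t = torch.ones_like(y).requires_grad_(True)               # initial guess
    for _ in range(iters):
        f = phi(t) - y                                        # residual to drive to zero
        (dphi,) = torch.autograd.grad(f.sum(), t, create_graph=True)
        t = (t - f / dphi).clamp(min=1e-8)                    # Newton step, stay in the domain
    return t

# Usage with a Clayton-style generator whose parameter requires gradients:
theta = torch.tensor(2.0, requires_grad=True)
phi = lambda t: (1.0 + t) ** (-1.0 / theta)
t = newton_inverse(phi, torch.tensor([0.3, 0.7]))             # phi(t) is approximately [0.3, 0.7]
```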

Page 48: RLChina 2021 Game Theory and Machine Learning in

Fitting real world data

[Figure: fits to real-world data (Boston, INTC-MSFT, GOOG-FB): ground truth vs. the best parametric Archimedean copula vs. ACNet]

48

Page 49: RLChina 2021 Game Theory and Machine Learning in

We find that ACNet…

• Is able to fit synthetic data generated from common Archimedean copulas
• Outperforms common Archimedean copulas on real-world datasets
• Can give conditional densities/CDFs within a single model
• Can be sampled from efficiently, even in high dimensions
• Can be the basis for computing correlated equilibria in complex games (e.g., with continuous action spaces)
• Future direction: leverage ACNet to compute correlated equilibria for complex games

49

Page 50: RLChina 2021 Game Theory and Machine Learning in

Outline

• Game Theory and Machine Learning for Multiagent Communication and Coordination
• Role of informants in security games
• Strategic signaling in security games
• Maintaining/breaking information advantage in security games
• Coordination through correlated signals
• Coordination through notification in platform-user settings
• Summary

50

Page 51: RLChina 2021 Game Theory and Machine Learning in

Motivation: Volunteer-Based Food Rescue Platform

Food waste and food insecurity coexist: up to 40% of food is wasted globally (>1.3 billion tons annually), while 1 in 8 people go hungry every day. Rescue good food!

[Figure: food rescue workflow: post rescue requests; a volunteer claims the rescue, picks up from the donor, and delivers to the recipient; success!]

In collaboration with 412 Food Rescue (412FR)

Page 52: RLChina 2021 Game Theory and Machine Learning in

Motivation: Volunteer-Based Food Rescue Platform

• Challenges
  – Uncertainty about whether a rescue will be claimed and completed
  – Sending notifications to volunteers (1-to-many communication) causes notification fatigue
  – Human dispatcher intervention (1-to-1 communication) leaves dispatchers overstretched

52

Page 53: RLChina 2021 Game Theory and Machine Learning in

Motivation: Volunteer-Based Food Rescue Platform

• How can AI help?
  – Predictive model of rescue claim status
    • Determine which rescues need special attention from a human dispatcher
  – Data-driven optimization of intervention and notification
    • Avoid excessive notifications to help retain volunteers

Improving Efficiency of Volunteer-Based Food Rescue Operations. Zheyuan

Ryan Shiβˆ—, Yiwen Yuanβˆ—, Kimberly Lo, Leah Lizarondo, Fei Fang. In IAAI-20.

53

Page 54: RLChina 2021 Game Theory and Machine Learning in

Predictive Model of Rescue Claim Status

Features used for the predictive model: timing, weather, location

[Figure: percentage of unclaimed rescues by zip code]

Operational dataset of 412FR from March 2018 to May 2019

54

Page 55: RLChina 2021 Game Theory and Machine Learning in

Predictive Model of Rescue Claim Status

Predict whether a rescue will be claimed by volunteers

A stacking model
+: claimed (3,825)
-: not claimed (749)

[Figure: stacking model architecture (NN)]

Training data: May 2018 to Dec 2018
Test data: Jan 2019 to May 2019

55

Page 56: RLChina 2021 Game Theory and Machine Learning in

Predictive Model of Rescue Claim Status

Predict whether a rescue will be claimed by volunteers

Model              Accuracy  Precision  Recall  F1    AUC
Gradient boosting  0.73      0.86       0.82    0.84  0.51
Random forest      0.71      0.87       0.78    0.82  0.54
Gaussian process   0.56      0.88       0.54    0.67  0.60
Stacking model     0.69      1.00*      0.64    0.78  0.81

56

Page 57: RLChina 2021 Game Theory and Machine Learning in

Optimize Intervention and Notification Scheme

Current practice (default INS): after a rescue is published, the 1st-wave notification goes out 15 minutes later to volunteers within 5 miles; all volunteers are notified afterwards; the human dispatcher intervenes at 60 minutes.

964 volunteers receive the 1st-wave notification on average
44.6% of rescues are claimed by volunteers receiving the 1st-wave notification

Page 58: RLChina 2021 Game Theory and Machine Learning in

Optimize Intervention and Notification Scheme

58

Improve the INS with minor changes: the 1st-wave notification goes out x minutes after the rescue is published to volunteers within y miles; all volunteers are notified in a 2nd wave; the dispatcher intervenes at z minutes.

Task: find the best values of x, y, z to reduce notifications and human intervention while ensuring at least the same claim rate

Page 59: RLChina 2021 Game Theory and Machine Learning in

Optimize Intervention and Notification Scheme

• Optimization problem:

59

  min_{x,y,z}  v(y) + q(x, y, z) + λ · s(x, y, z)
  s.t.  p(x, y, z) ≥ b        (claim rate ≥ threshold)
        (x, y, z) ∈ S

  v(y): expected # of volunteers receiving the 1st-wave notification
  q(x, y, z): expected # of volunteers receiving the 2nd-wave notification
  s(x, y, z): expected # of rescues requiring human intervention

Page 60: RLChina 2021 Game Theory and Machine Learning in

Counterfactual Estimation

• How do we estimate v(y), q(x, y, z), s(x, y, z), and p(x, y, z)?
  – Assume the rescue and volunteer distributions remain the same
  – Estimate the quantities from historical rescues
  – For each rescue, calculate the counterfactual claim time (CCT) under an INS, using a number of assumptions
    • Assumption 1: upon receiving the notification, a volunteer takes the same amount of time to respond
    • Assumption 2: the success of an intervention is not affected by the INS
    • ...

60

Page 61: RLChina 2021 Game Theory and Machine Learning in

Branch-and-Bound Algorithm

• Can we avoid enumerating all possible INSs?
• Estimate a lower bound on the objective value of an INS (x, y, z) when only a subset of the parameters is specified
• Use the lower bound to prioritize and prune INSs through branch and bound

61

(x, y, z) ∈ S
Example: x = unspecified, y = 5 miles, z = 45 min
Lower bound = v(y) + q(x_max, y, z) + λ · s(x_min, y, z)
(the three terms bound the 1st-wave, 2nd-wave, and human-intervention costs, respectively)

Page 62: RLChina 2021 Game Theory and Machine Learning in

Branch-and-Bound Algorithm

62

[Figure: branch-and-bound search tree: branch first on z (45, 50, 55, 60), then on y (4.5, 5, 5.5, 6), then on x (14, 15, 16); lower bounds are calculated at internal nodes and objective values at leaves; e.g., a node with LB = 2600 can be pruned once an incumbent with Obj = 2500 is found. A sketch of the search loop follows.]
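
A minimal sketch of the best-first search loop illustrated above. The objective and lower_bound callables are hypothetical stand-ins for the counterfactual estimates of v, q, s and their bounds.

```python
import heapq
from itertools import count

def branch_and_bound(xs, ys, zs, objective, lower_bound):
    """Best-first branch and bound over a grid of INS parameters, branching on
    z, then y, then x.  lower_bound(partial) bounds the cost of any completion
    of a partial assignment such as {'y': 5, 'z': 45}."""
    best_val, best_ins = float('inf'), None
    tie = count()                                    # tie-breaker so the heap never compares dicts
    heap = [(lower_bound({}), next(tie), {})]
    while heap:
        lb, _, partial = heapq.heappop(heap)
        if lb >= best_val:
            continue                                 # prune: cannot beat the incumbent
        if len(partial) == 3:
            val = objective(partial['x'], partial['y'], partial['z'])
            if val < best_val:
                best_val, best_ins = val, dict(partial)
            continue
        var, domain = [('z', zs), ('y', ys), ('x', xs)][len(partial)]
        for v in domain:
            child = dict(partial, **{var: v})
            heapq.heappush(heap, (lower_bound(child), next(tie), child))
    return best_ins, best_val
```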

Page 63: RLChina 2021 Game Theory and Machine Learning in

Recommended INS

• Optimize on data from May 2018 to Dec 2018
• Test on data from Jan 2019 to May 2019
• Recommend an INS by checking the Pareto frontier

Deployed since Jan 2020

Page 64: RLChina 2021 Game Theory and Machine Learning in

Rescue-Specific Notification Scheme

• Can we do better than simply changing the parameters of the default INS?
  – Send 1st-wave notifications to the volunteers who are more likely to claim the rescue
• Task: given a rescue, provide a list of k volunteers
• Similar to recommender systems
  – Users: rescue trips
  – Items: volunteers

A Recommender System for Crowdsourcing Food Rescue Platforms. Zheyuan

Ryan Shi, Leah Lizarondo, Fei Fang In WWW-21

64

Page 65: RLChina 2021 Game Theory and Machine Learning in

Distribution of Donor and Recipient Organizations

[Figure: distribution of donor organizations and recipient organizations across the 16 regions (0-15) into which the Pittsburgh area is divided]

65

Page 66: RLChina 2021 Game Theory and Machine Learning in

Feature Extraction

Feature extraction from each (rescue, volunteer) pair:
• Volunteer's # completed rescues in the donor's region
• Volunteer's # completed rescues in the recipient's region
• Volunteer's total # completed rescues
• Time between the rescue and the volunteer's onboarding
• Distance between the donor and the volunteer

66

Page 67: RLChina 2021 Game Theory and Machine Learning in

Predictive Model of Rescue-Volunteer Compatibility

Predict whether a rescue will be claimed by a specific volunteer

+: claimed
-: not claimed

Training data: Mar 2018 to Oct 2019
Test data: Nov 2019 to Mar 2020

Model: features fed into a neural network with 4 hidden layers

6,757 rescues; 9,212 volunteers

67

Page 68: RLChina 2021 Game Theory and Machine Learning in

Rescue-Specific Notification

[Figure: each volunteer (Volunteer 1, 2, ..., N) is scored with a predicted claim probability (e.g., 0.341, 0.105, 0.422, 0.663, 0.635, 0.002), and the top-ranked volunteers (e.g., Volunteer 8346, Volunteer 333, Volunteer 1835, ...) receive the rescue-specific notification]

68

Page 69: RLChina 2021 Game Theory and Machine Learning in

Evaluation

• Metric: hit ratio at top k (HR@k): the % of rescues that are claimed by a volunteer ranked in the top k
• k = 964 to match the default INS

69
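
For reference, the metric can be computed with a sketch like the following (the data layout is illustrative):

```python
def hr_at_k(rescues, k=964):
    """Hit ratio at top k: fraction of rescues whose actual claimer appears among
    the top-k volunteers ranked by the model.  `rescues` is a list of
    (ranked_volunteer_ids, claimer_id) pairs; k = 964 matches the default INS."""
    hits = sum(claimer in ranking[:k] for ranking, claimer in rescues)
    return hits / len(rescues)
```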

Page 70: RLChina 2021 Game Theory and Machine Learning in

Caveat with Rescue-Specific Notification

• The ML model discovers some frequent volunteers and sends them notifications almost all the time

70

Page 71: RLChina 2021 Game Theory and Machine Learning in

ML + Online Planning

• Rather than greedily taking the top k volunteers, we enforce a constraint that each volunteer receives at most L notifications per day
• For the current rescue i, determine who to send notifications to by planning against a projected set of future rescues R (a simplified sketch of the budgeted step follows)

71

x_ij ∈ {0, 1}: whether to send the notification for rescue i to volunteer j
p_ij ∈ [0, 1]: output of the ML model, indicating the probability that volunteer j will claim rescue i
b_j ∈ {0, ..., L}: number of notifications volunteer j can still receive for the rest of the day
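
A simplified greedy sketch of the budget-constrained notification step for a single rescue (the paper instead plans against the projected future rescues R); names and arguments are illustrative.

```python
def notify_with_budget(p_i, budgets, k):
    """Pick up to k volunteers for the current rescue, respecting the remaining
    per-volunteer daily notification budgets (at most L per day overall).
    p_i[j]: predicted probability that volunteer j claims the current rescue
    budgets[j]: notifications volunteer j can still receive today"""
    eligible = [j for j, b in enumerate(budgets) if b > 0]
    chosen = sorted(eligible, key=lambda j: p_i[j], reverse=True)[:k]
    for j in chosen:
        budgets[j] -= 1                  # consume one notification from the daily budget
    return chosen
```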

Page 72: RLChina 2021 Game Theory and Machine Learning in

Online Planning-Based Rescue-Specific Notification

• Avoids the over-concentration, with L = 5
• HR@k = 0.645, much better than current practice

72

Page 73: RLChina 2021 Game Theory and Machine Learning in

Outline

• Game Theory and Machine Learning for Multiagent Communication and Coordination
• Role of informants in security games
• Strategic signaling in security games
• Maintaining/breaking information advantage in security games
• Coordination through correlated signals
• Coordination through notification in platform-user settings
• Summary

73

Page 74: RLChina 2021 Game Theory and Machine Learning in

Summary

• Communication and Coordination in Multi-Agent Interaction
  – Informants, Signals, Notifications
  – Historical actions
• ML + GT for Communication and Coordination
  – Mathematical programming-based algorithms
  – Learn human behavior
  – Learn equilibrium / optimal strategy
• ML + GT for Societal Challenges
  – Security, Sustainability, Food security

74

Page 75: RLChina 2021 Game Theory and Machine Learning in

Acknowledgment

• Advisors, postdocs, students, and all co-authors!
• Collaborators and partners
• Funding support

75