
Social Learning and Bayesian Games in Multiagent Signal Processing

[Vikram Krishnamurthy and H. Vincent Poor]

[How do local and global decision makers interact?]

Digital Object Identifier 10.1109/MSP.2012.2232356
Date of publication: 5 April 2013

How do local agents and global decision makers interact in statistical signal processing problems where autonomous decisions need to be made? When individual agents possess limited sensing, computation, and communication capabilities, can a network of agents achieve sophisticated global behavior? Social learning and Bayesian games are natural settings for addressing these questions. This article presents an overview, novel insights, and a discussion of social learning and Bayesian games in adaptive sensing problems when agents communicate over a network. Two highly stylized examples that demonstrate to the reader the ubiquitous nature of the models, algorithms, and analysis in statistical signal processing are discussed in tutorial fashion.

INTRODUCTION AND MOTIVATION
This article discusses two examples involving multiagent sensing and decision making based on Bayesian signal processing. The first example considers a Bayesian global game, and the second example deals with multiagent social learning. Bayesian games and social learning can be used either as descriptive tools, to predict the outcome of complex interactions among agents, or as prescriptive tools, to design systems around given interaction rules. In recent years there have been significant advances in these areas motivated by applications in automated decision making in economics [1], [2], sensor networks, cognitive radio [3], and autonomous systems [4].

EXAMPLE 1: MULTIAGENT ADAPTIVE SENSING AS A GLOBAL GAME
Given a network of sensors, how can individual sensors activate themselves autonomously to achieve event-driven detection? That is, based on noisy observations, if an event of interest occurs, sufficient numbers of sensors must switch from a battery-saving sleep mode to a high-resolution mode to obtain accurate measurements. This schematic setup is illustrated in Figure 1(a), where multiple sensors with individual controllers sense a target. Too many sensors switching to the high-resolution mode is overkill, resulting in wasted battery energy. On the other hand, too few sensors switching to the high-resolution mode results in insufficient measurements and hence an inaccurate (high variance) estimate at the data fusion center. This tradeoff between the cost of acquiring information (battery power) versus the utility of information (variance of estimate) is inherent in adaptive sensing. Moreover, since environments evolve over time, and centralized organization is costly in terms of communication and energy, there is strong motivation to develop and analyze autonomous activation schemes.

To answer the above question, we formulate a highly stylized Bayesian game-theoretic analysis of sensor activation algorithms. Game theory is a natural tool for analyzing multiagent adaptive sensing systems since it models each sensor as a self-driven decision maker that makes decisions based on Bayesian state estimates. (The Bayesian games we consider here are motivated by adaptive sensing and state estimation. They are quite different from static games with full information considered extensively in wireless communications.)


[FIG1] An illustration of the two examples considered in this article, which deal with autonomous decision making based on Bayesian signal processing. (a) The first example (multiagent active sensing) deals with a Bayesian global game. Each sensor is equipped with a controller that chooses the mode of the sensor, namely low resolution or high resolution. Using a Bayesian global games formulation, we will show that if each sensor deploys a threshold policy with respect to its measurement, the global behavior is a Bayesian Nash equilibrium. (b) The second example (quickest detection social learning) deals with social learning. Agents monitor an underlying state $x_k$ that changes at a geometrically distributed random time $\tau^0$. Each agent obtains a noisy private measurement of the state, makes a local decision (green or red) to optimize its expected utility, and broadcasts this decision. Subsequent agents combine their private observations and decisions from previous agents using Bayes' formula and then make their local decisions. Given these local decisions over time, how can a global decision maker decide when the state $x_k$ has changed?


In dense sensor networks (i.e., where a large number of sensors are present), it is natural to respond to the decentralized awareness of a sensor network with decentralized information processing. If sensors can adapt their behavior to locally observed conditions, then they can self-organize into a functioning network, eliminating the need for centralized control. For sensor mode selection, self-configuration allows sensor networks to efficiently extract information by adjusting individual sensor behavior to "form to" their environment according to local conditions. This leads to robust, scalable, and efficient operation since a central authority is not required. We refer the reader to [5] and [6] for applications in sensor networks and spectrum sensing in cognitive radio.

EXAMPLE 2: MULTIAGENT SOCIAL LEARNING AND CHANGE DETECTION
In a multiagent network, how can agents use their noisy observations and decisions made by previous agents to estimate an underlying randomly evolving state? How do decisions made by previous agents affect decisions made by subsequent agents? In this article, these questions will be formulated as a multiagent sequential detection problem involving social learning. In social learning, each agent optimizes its local utility selfishly and then broadcasts its action. Subsequent agents then use their private observations together with the actions of previous agents to estimate (learn) an underlying state. The setup is quite different from classical signal processing in which sensors use only noisy observations to compute estimates—now agents use noisy observations together with decisions made by previous agents. In the last decade, social learning has been studied widely in economics to model the behavior of financial markets, crowds, and social networks; see [7]–[11] and numerous references therein. The social learning framework is similar to Hellman and Cover's seminal paper [12], which analyzes learning with limited memory.

Social learning can result in unusual behavior. Indeed, a key result in social learning of an underlying random variable is that rational agents eventually herd [8], that is, they eventually end up choosing the same action irrespective of their private observations. As a result, the actions contain no information about the private observations and so the Bayesian estimate of the underlying random variable freezes. For a sensor network, this can be undesirable, particularly if individual sensors herd and make incorrect decisions. To enhance social learning, we will describe how herding can be delayed when agents act benevolently to optimize a social welfare function.

To illustrate social learning to a signal processing audience, the article will then describe the multiagent change detection problem illustrated in Figure 1(b). Individual agents record noisy observations of an underlying state process and perform social learning to estimate the underlying state. They make decisions about whether a change has occurred that optimize their local utilities (which are functions of their posterior estimates of the underlying state). Agents then broadcast their local decisions (red for change, green for no change) to subsequent agents. As these local decisions accumulate over time, a global decision maker needs to decide (based on these local decisions) whether or not to declare a change has occurred. How can the global decision maker achieve such change detection to minimize a cost function composed of false alarm rate and delay penalty? The local and global decision makers interact since the local decisions determine the posterior distribution of subsequent agents, which determines the global decision (stop or continue), which determines subsequent local decisions. We show that this social learning-based change detection problem leads to unusual behavior. The optimal decision policy of the stopping time problem has multiple thresholds. This is unusual—if it is optimal to declare that a change has occurred based on the posterior probability of change, it may not be optimal to declare a change when the posterior probability of change is higher!

PERSPECTIVE
The above examples motivate the two main themes of this article:

■ When individual agents possess limited sensing, computation, and communication capabilities (such agents with limited cognition are termed "boundedly rational agents"), can a network of agents achieve sophisticated global behavior?
In Example 1, if each agent deploys a simple threshold activation policy (switch to high-resolution mode if the received observation is sufficiently large; otherwise remain in sleep mode), can the network of sensors achieve an operating point that reflects sophisticated behavior? That is, an operating point such that there is no unilateral benefit for individual agents to deviate from this operating point. Such a Bayesian Nash equilibrium (BNE) (formally defined in the section "Bayesian Nash Equilibrium") takes into account utilities of individual sensors together with their interaction. In [2], a nice description is given of how, if individual agents deploy simple heuristics, the global system behavior can achieve "rational" behavior. The related problem of achieving coherence (i.e., agents eventually choosing the same action or the same decision policy) among disparate sensors or decision agents without cooperation has also witnessed intense research; see [4] and [13].

■ How do local agents and global decision makers interact in multiagent statistical signal processing problems in which autonomous decisions need to be made?
As described in Example 2, local and global decision makers interact in social learning and multiagent change detection problems in which agents perform social learning.

The literature in these areas is extensive; see also [14]. Due to page restrictions, we refer the reader to [5], [6], and [15]–[17] for detailed literature surveys. In fact, as is often the case, many of the techniques presented here are developed from "old problems" that have become fashionable again. Game-theoretic analyses of decentralized problems in sensor networks are becoming increasingly common; see [18] and [19]. There has also been much recent research in energy-saving mechanisms for sensor networks [20], [21].


EXAMPLE 1: MULTIAGENT ADAPTIVE SENSING: A GLOBAL GAMES APPROACH
This section deals with our first example for autonomous sensor activation in large-scale sensor networks such as unattended ground sensor networks and body area networks. In these applications, battery life is a key issue [22]. For example, in body area networks, where devices are implanted in the human body, batteries are not easily replaceable. Sensor units require an energy source for data collection, processing, and transmission. Sensors typically operate in a battery-saving low-resolution mode until they sense an event of interest. How can sensors decide autonomously when to switch to a high-resolution mode? Given the tight energy budget, self-organization and self-configuration are important for efficient operation [23] and have been applied to routing, topology control, power control, and sensor scheduling.

Our goal in this section is to develop a global games approach [15], [24] to sensor activation in a dense network. To give the reader a clear yet rapid treatment of the ideas, our description of the assumptions and theory below is idealized; generalizations are presented in [15] and [25].

WHY GLOBAL GAMES?
Global games [24], [26], [27] are Bayesian games of incomplete information. They are ideally suited for analyzing decentralized coordination among agents. The theory of global games was introduced in [26] as a tool for refining equilibria in economic game theory [27]. We refer the reader to [10, Ch. 11] for a nice description, including economic market examples such as speculative attacks against a currency with a fixed exchange rate that are modeled as global games. Global games study the interaction of a continuum of players who choose actions based on noisy observations of a common signal. The term global refers to the fact that players at each time can play any game selected from a subclass of all games, which adds an extra dimension to standard game-play (wherein players act to maximize their own utility in a fixed interactive environment). Global games model the incentive of sensors (players) to act together or not. The incentive of a sensor to act is either dampened or stimulated by the average level of activity of other sensors. This is typical in a sensor network in which one seeks a tradeoff between the cost of measurement (reusability of a sensor, battery life) and accuracy of measurement.

Sensors, imperfectly aware of their environment, choose their best actions, assuming that other sensors have information similar to their own. Each sensor is aware that others are predicting its own behavior, that they are aware that it is aware, and so on. The eductive reasoning process [10], by which sensors iteratively hypothesize strategies and predict reactions to arrive at an optimum, is the justification for the Nash equilibrium conditions considered in this article. The Nash equilibrium has interesting artifacts, such as herding; see [10].

From a signal processing point of view, Bayesian games such as global games invoke Bayesian state estimation. The reader familiar with elementary nonlinear filtering for state estimation will find Theorem 1 of interest, as it develops elementary stochastic dominance properties of such filters.

FORMULATION OF MULTIAGENT ADAPTIVE SENSING AS A GLOBAL GAME

SHOULD I GO TO A NIGHTCLUB?
We first illustrate a typical global game using the following analogy described in [24]. Consider a nightclub with a large number (actually a continuum) of patrons. Suppose the quality of music $X$ playing in the nightclub is a random variable with prior distribution $\pi_0$. Each patron $i$ receives noisy information $Y^{(i)} = X + W^{(i)}$ about the quality of music $X$ playing at the nightclub. Based on this noisy information, the patron can choose either to go or not go to the nightclub. If a patron chooses not to go to the nightclub, he receives no reward. If the patron goes to the nightclub, he receives a reward $X + f(\alpha)$, where $\alpha \in [0, 1]$ is the fraction of patrons that decided to go to the nightclub. (Obviously $\alpha$ is a function of the strategy the patrons use to choose whether to stay or leave, but for notational convenience we omit this dependence.) Thus, the better the music quality $X$, the higher the reward to the patron if he goes to the nightclub. On the other hand, $f(\alpha)$ is typically a quasiconcave function with $f(0) = 0$. The reasoning is this: If too few patrons decide to go, i.e., $\alpha$ is small, then $f(\alpha)$ is small due to lack of social interaction. If too many patrons go to the nightclub, i.e., $\alpha$ is large, then $f(\alpha)$ is also small due to the crowded nature (congestion) of the nightclub. Each patron is "rational" and knows that other patrons who choose to go to the nightclub will also receive the reward $X + f(\alpha)$. Each patron also knows that other patrons know that he knows this, and so on, ad infinitum. So each patron $i$ can predict rationally (via Bayes' rule), given his measurement $Y^{(i)}$, what proportion $\alpha$ of patrons will choose to go to the nightclub. How should the agent decide rationally whether to go or not to go to the nightclub to maximize his reward?
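Before formalizing the sensor version of this game, the following short Monte Carlo sketch (an illustrative addition; the noise level, threshold, and congestion reward are assumptions, not values from this article) shows how a common threshold rule applied to noisy observations $Y^{(i)} = X + W^{(i)}$ translates into an attendance fraction $\alpha$ and a realized reward $X + f(\alpha)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def attendance_fraction(x, threshold, sigma, n_patrons=100_000):
    """Fraction of patrons who go when each goes iff its observation x + W exceeds the threshold."""
    y = x + sigma * rng.standard_normal(n_patrons)
    return float(np.mean(y >= threshold))

def f(alpha):
    """Assumed quasiconcave congestion reward with f(0) = 0: small when empty or when crowded."""
    return 4.0 * alpha * (1.0 - alpha)

for x in (-1.0, 0.0, 1.0):  # a few realizations of the music quality X
    a = attendance_fraction(x, threshold=0.5, sigma=1.0)
    print(f"X = {x:+.1f}: fraction going = {a:.3f}, reward if going = {x + f(a):+.3f}")
```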

EVENT-DRIVEN SENSING IN SENSOR NETWORK
The above nightclub problem is an example of a global game [24], [27]. It is analogous to the following decentralized sensor activation problem in large-scale sensor networks. Let $X$ denote the underlying state the network is monitoring, modeled as a random variable with prior $\pi_0$. Each sensor chooses an action
$$a \in \{1\ (\text{Low\_Res}),\ 2\ (\text{High\_Res})\}. \quad (1)$$

To conserve battery life, assume that each sensor is initially in the "Low_Res Mode" (low-resolution mode) and obtains coarse measurements $Y^{(i)} = X + W^{(i)}$.



Each sensor then decides whether or not to switch to the "High_Res Mode" (high-resolution mode) to obtain more accurate measurements. Assume that the activated sensors transmit their inferences to a base station. Let $\alpha \in (0, 1)$ denote the proportion of sensors who respond to their measurement of $X$ by switching to the high-resolution mode. If too few sensors decide to activate themselves (i.e., $\alpha$ is close to zero), then the combined information from the sensors at the base station is not sufficiently accurate. (Assume that the base station averages the measurements of the sensors—so the more sensors that transmit, the lower the variance and the more accurate the inference.) If too many sensors activate themselves (i.e., $\alpha$ is close to one), then network congestion (assuming a multiaccess communication scheme) results in wasted battery energy. How should individual sensors decide when to activate themselves to achieve a tradeoff between conserving battery life and accuracy of estimating the underlying state $X$? This tradeoff is captured by the term $f(\alpha)$ in the utility.

Examples
Here are some examples of the above formulation. Consider the case in which sensors register the magnitude of an event such as a footstep or the concentration of a chemical. The average intensity of footstep energy or concentration is denoted by $X$, while the estimate is denoted by $Y^{(i)}$. If such magnitudes are measured on a logarithmic scale (decibels for sound or pH scale for concentration), then it is reasonable to model $W^{(i)}$ as being zero-mean Gaussian. The term $X$ in the utility $X + f(\alpha)$ implies that as the intensity or concentration gets larger (which implies that an event of importance is occurring), the sensors are given a higher reward for switching to the high-resolution mode. As another example, consider thermal sensors monitoring a possible forest fire. Let $Y^{(i)}$ denote the measured rate of the number of times the measured temperature exceeds a prespecified threshold over some time window $\Delta$. (A higher rate indicates a higher probability of fire.) Then $Y^{(i)} = X + W^{(i)}$, where $X$ is the true rate and $W^{(i)}$ is the measurement error. By the central limit theorem, $W^{(i)}$ is approximately Gaussian for large window length $\Delta$. The larger $X$ is, the more important the information is, as it indicates the possibility of a fire. Other examples include measuring the number of times the concentrations of chemicals exceed a particular amount, and measuring the rate of occurrence of a particular signature profile indicating footsteps.

The model outlined above is event driven, since it relies on environmental signals $X$ for mode selection. This is common for unattended sensor networks for target detection and tracking [5], [28], allowing sensors to act in a decentralized fashion based on the occurrence of a target event.

BAYESIAN NASH EQUILIBRIUM
Assume for simplicity of exposition that all sensors have the same prior, noise distribution, and utility. This assumption is relaxed in [15] and [25], where multiple clusters of heterogeneous sensors are considered. (This can be interpreted as extending the formulation here to multiple nightclubs and multivariate observations, as discussed in the section "Closing Remarks.") In a sensor network, sensors are typically mass produced with the same activation strategy coded into their controllers. So we assume that all sensors use the same activation strategy $\mu$. Of course, the actions $\mu(Y^{(i)})$ chosen by individual agents $i$ depend on their random observations $Y^{(i)}$ and are not necessarily identical even though the strategies are identical.

Given its observation $Y^{(i)}$, the goal of each sensor $i$ is to execute a (possibly randomized) strategy to optimize its utility. That is, sensor $i$ seeks to compute a (possibly randomized) strategy
$$\mu: Y^{(i)} \to \{1\ (\text{Low\_Res}),\ 2\ (\text{High\_Res})\} \ \text{ to maximize } \ E[C(X, \alpha(X), \mu(Y^{(i)})) \mid Y^{(i)}], \quad (2)$$
where
$$C(X, \alpha(X), a) = \begin{cases} X + f(\alpha(X)), & \text{if } a = 2\ (\text{High\_Res}) \\ 0, & \text{if } a = 1\ (\text{Low\_Res}), \end{cases} \qquad \alpha(X) = P(\mu(Y) = 2 \mid X).$$
Here, $C(X, \alpha(X), a)$ denotes the reward a sensor receives when it chooses action $a$. Also, $\alpha(X)$ is the fraction of sensors that choose Action 2 given that the underlying state is $X$. Its presence in the utility function above transforms the problem into a two-player game—each individual agent playing versus the mass of agents $\alpha(X)$.

DEFINITION 2.1: BAYESIAN NASH EQUILIBRIUM
A strategy $\mu^*$ is a symmetric BNE if it is optimal in the sense of (2) for each sensor $i$ given activity parameter $\alpha$. (In other words, if a sensor unilaterally departs from a Nash equilibrium, it is worse off.)

Since we are dealing with an incomplete information game, players use distributional strategies as defined by [29]. If a BNE exists, then a pure (nonrandomized) version exists straightforwardly (see Proposition 8E.1 in [30, p. 225]). Indeed, with $y$ denoting a realization of the random variable $Y$,
$$E[C(X, \alpha(X), \mu(Y)) \mid Y = y] = \sum_{a=1}^{2} E[C(X, \alpha(X), a) \mid Y = y]\, P(a \mid Y = y).$$
Thus obviously, the optimal (BNE) strategy is to choose $P(a^*(y) \mid Y = y) = 1$, where
$$a^*(y) = \mu^*(y) = \arg\max_{a \in \{1, 2\}} E[C(X, \alpha(X), a) \mid Y = y]. \quad (3)$$
As explained in [24], because we focus on symmetric BNEs, technicalities that arise with an uncountable number of players are straightforwardly avoided.

OPPORTUNISTIC ADAPTIVE SENSING: THRESHOLD STRUCTURE OF BNE
We are interested in characterizing conditions under which the BNE is a simple threshold strategy of the form
$$\mu^*(y^{(i)}) = \begin{cases} 2, & \text{if } y^{(i)} \geq y^* \\ 1, & \text{if } y^{(i)} < y^*. \end{cases} \quad (4)$$

Page 6: Vikram Krishnamurthy and H. Vincent Poor Social Learning ...vikramk/KP13.pdf · We refer the reader to [5] and [6] for applications in sensor networks and spectrum sensing ... from

IEEE SIGNAL PROCESSING MAGAZINE [48] MAY 2013

Here $y^*$ is a threshold value that is either specified by design or computed by each sensor. Sufficient conditions that yield threshold symmetric Nash strategies are of great interest since they are readily implementable at each agent and can be estimated and adapted in real time. Each agent simply needs to implement/learn its single threshold value $y^*$. Also, if the BNE is a threshold strategy, it is equivalent to each agent opportunistically adapting its sensing mode: If the measurement is larger than some threshold, then switch to a high-resolution mode; otherwise remain in the low-resolution mode.

There are strong parallels between the structure of a threshold BNE and opportunistic scheduling. In opportunistic transmission scheduling of multiple users (see [31]), the user chosen at each time slot is the one with the best channel, where the channel for each user is a state of nature that evolves randomly (just like the measurement $Y^{(i)}$ for each agent $i$ considered here). The random variation is exploited to choose users at each time slot with the best channels (just like the threshold Nash equilibrium strategy in which the random observations are used to choose which sensors switch to the high-resolution mode). Of course, our setting is somewhat more complex since each rational agent is interested in choosing between sensing modes in a Bayesian game-theoretic setting.

Clearly, if the BNE strategy has a threshold structure (4), then each sensor can compute $\alpha(X)$ defined in (2) (that is, predict the number of sensors in the high-resolution mode) since
$$\alpha(x) = P(\mu^*(Y) = 2 \mid X = x) = P(Y \geq y^* \mid X = x) = P(W \geq y^* - x) = 1 - \Phi_W(y^* - x), \quad (5)$$
where $\Phi_W$ denotes the cumulative distribution function of the noise $W$.
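For instance, with zero-mean Gaussian noise the prediction (5) can be evaluated in a couple of lines; the threshold y_star and noise scale sigma below are illustrative assumptions, not values from the article.

```python
import numpy as np
from scipy.stats import norm

def alpha(x, y_star, sigma):
    """Equation (5): fraction of sensors in High_Res mode when the state is x,
    alpha(x) = P(W >= y_star - x) = 1 - Phi_W(y_star - x), for Gaussian noise W."""
    return 1.0 - norm.cdf(y_star - x, scale=sigma)

y_star, sigma = 0.5, 1.0                       # hypothetical threshold and noise level
for x in (-1.0, 0.0, 1.0, 2.0):
    print(f"x = {x:+.1f}  ->  alpha(x) = {alpha(x, y_star, sigma):.3f}")
```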

What are sufficient conditions that ensure the BNE has a threshold structure given by (4)? Examining (2) and (3), it is clear that $E[C(X, \alpha(X), 2) \mid Y = y]$ increasing in $y$ is sufficient. So the question we wish to answer is:
■ Under what conditions is the conditional expectation of a nonlinear function $C(X, \alpha(X), 2)$ of the underlying state $X$, given a noisy observation $y$, increasing in $y$?
This is a question of independent interest in signal processing where filtered estimates (conditional expectations using an optimal Bayesian filter) are used to compute reward or cost functions.

Naturally, to order conditional expectations with respect to observations, we need to impose some sort of ordering on the underlying probability distributions. In this article, we use the monotone likelihood ratio (MLR) stochastic order. A probability density function (pdf) or probability mass function (pmf) $p$ dominates another pdf (or pmf) $q$ with respect to the MLR stochastic order if $p(x)/q(x)$ is increasing in $x$. This condition is denoted as $p \geq_r q$ or, equivalently, $q \leq_r p$. The MLR stochastic order is ideal for dealing with Bayesian games since it is preserved after conditioning. It is important to note that the MLR order is a partial order: two pdfs or pmfs $p$ and $q$ are not necessarily MLR orderable.

The following is the main result. It gives sufficient conditions for the BNE to possess a threshold structure (4).

THEOREM 1
$E\{C(X, \alpha(X), 2) \mid Y = y\}$ is increasing in $y$ if both of the following conditions hold:
1) $C(x, \alpha(x), 2)$ is increasing in $x$. A sufficient condition for this is
$$\frac{df}{d\alpha} \geq -\left(\sup_{w} p_W(w)\right)^{-1}. \quad (6)$$
2) The noise distribution satisfies $p_W(y - x) \leq_r p_W(\bar{y} - x)$ for $y \leq \bar{y}$.

Let us prove Theorem 1. It is well known [32] that $p \geq_r q$ implies first-order stochastic dominance, that is, for any increasing function $g(x)$, $\int_{\mathbb{R}} g(x)\,p(x)\,dx \geq \int_{\mathbb{R}} g(x)\,q(x)\,dx$. In light of this property, to show that $\int_{\mathbb{R}} g(x)\,p(x \mid y)\,dx \geq \int_{\mathbb{R}} g(x)\,p(x \mid \bar{y})\,dx$ for $y \geq \bar{y}$, it suffices that 1) $g(x)$ is increasing and 2) the conditional pdf $p(x \mid y)$ MLR dominates $p(x \mid \bar{y})$ for $y \geq \bar{y}$. Let us examine this second condition more carefully. From the definition of MLR dominance, the condition is equivalent to $p(x \mid y)/p(x \mid \bar{y})$ increasing in $x$. Clearly,
$$\frac{p(x \mid y)}{p(x \mid \bar{y})} = \frac{p(y \mid x)\,p(x)/p(y)}{p(\bar{y} \mid x)\,p(x)/p(\bar{y})} = \frac{p(\bar{y})}{p(y)}\,\frac{p(y \mid x)}{p(\bar{y} \mid x)},$$
and the factor $p(\bar{y})/p(y)$ does not depend on $x$. So for the second condition to hold, it suffices for $p(y \mid x)/p(\bar{y} \mid x)$ to be increasing in $x$. In summary, 1) $g(x)$ increasing in $x$ and 2) $p(y \mid x)/p(\bar{y} \mid x)$ increasing in $x$ are sufficient conditions for $E\{g(X) \mid Y = y\}$ to increase in $y$. These are Conditions 1 and 2 of the theorem.

Condition 2 of the above theorem places restrictions on the noise distribution. Numerous noise distributions used in classical detection theory satisfy this monotone likelihood ratio dominance condition, including Gaussian, exponential, uniform, etc.
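As a quick sanity check (a worked example added here, not part of the original text), for zero-mean Gaussian noise $W \sim N(0, \sigma^2)$ the likelihood ratio appearing in Condition 2 is
$$\frac{p_W(y - x)}{p_W(\bar{y} - x)} = \exp\!\left(\frac{(y - \bar{y})\,x}{\sigma^2} + \frac{\bar{y}^2 - y^2}{2\sigma^2}\right),$$
which is increasing in $x$ whenever $y \geq \bar{y}$, so Gaussian noise indeed satisfies the MLR dominance requirement.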

Let us now consider Condition 1. Equation (6) is sufficient for $C(x, \alpha(x), 2)$ to be increasing in $x$. This follows since, from (2) and (5),
$$\frac{dC(x, \alpha(x), 2)}{dx} = 1 + \frac{df}{d\alpha}\,\frac{d\alpha}{dx} = 1 + \frac{df}{d\alpha}\,p_W(y^* - x).$$
Then (6) follows directly from requiring $dC(x, \alpha(x), 2)/dx \geq 0$.

Equation (6) requires that the term $f(\alpha)$ in the utility function does not decrease too rapidly with $\alpha$. Recall in the nightclub problem that large $\alpha$ denoted a congested (crowded) bar. In a sensor network, large $\alpha$ denotes too many sensors in the high-resolution mode, leading to congestion in transmission. So Condition 1 essentially says that the utility function $f(\alpha)$ should not decrease too fast with congestion. Checking (6) is straightforward for any choice of utility $f(\alpha)$. For Gaussian noise with variance $\sigma^2$, $\sup_w p_W(w) = 1/(\sqrt{2\pi}\,\sigma)$, so (6) requires $df/d\alpha \geq -\sqrt{2\pi}\,\sigma$. For uniform noise with support of length $L$, $\sup_w p_W(w) = 1/L$, so (6) requires $df/d\alpha \geq -L$.
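As an illustration (a hedged sketch with assumed numbers, not parameters from the article), checking (6) amounts to comparing the steepest downward slope of $f$ with $-1/\sup_w p_W(w)$:

```python
import numpy as np

def satisfies_condition_6(steepest_df_dalpha, sup_noise_density):
    """Sufficient condition (6): df/dalpha >= -1 / sup_w p_W(w) everywhere."""
    return steepest_df_dalpha >= -1.0 / sup_noise_density

c = 2.0                     # hypothetical congestion penalty f(alpha) = -c * alpha, so df/dalpha = -c
sigma, L = 1.0, 4.0         # assumed Gaussian standard deviation and uniform support length
print("Gaussian:", satisfies_condition_6(-c, 1.0 / (np.sqrt(2 * np.pi) * sigma)))  # needs -c >= -sqrt(2*pi)*sigma
print("Uniform :", satisfies_condition_6(-c, 1.0 / L))                             # needs -c >= -L
```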

The reader might wonder what happens if $f(\alpha)$ decreases sufficiently fast with congestion $\alpha$ so that sufficient condition 1) of Theorem 1 does not hold. It turns out that in many cases, the BNE is no longer a threshold.


For the uniform noise, single-class case, [24] gives a beautiful explanation of what happens when the BNE is no longer a threshold, which we paraphrase as follows. When there is high congestion, i.e., $f(\alpha)$ decays rapidly, and measurements are sufficiently precise, then if the BNE were a threshold, more patrons would go to the bar when the measurement $y^{(i)}$ is high. However, on receiving a high signal, the rational patron knows that many others have received a high signal (since the measurements are precise); therefore the bar will be crowded. Hence, with high congestion, a patron with a large measurement $y^{(i)}$ would prefer not to go to the bar, meaning that a threshold policy is not a Nash equilibrium. This leads to the following inconsistency [24], as once said by Yogi Berra: "Nobody goes there anymore. It's too crowded."

From a practical point of view, to ensure that the optimal sensing strategy has a threshold (opportunistic) structure, the designer needs to ensure that the system does not operate under high congestion.

CLOSING REMARKS FOR EXAMPLE 1
To summarize, we have formulated event-driven adaptive sensing in a sensor network as a Bayesian game. Based on its private measurement $y^{(i)}$, each agent $i$ predicts the fraction $\alpha(X)$ of agents that will switch to the high-resolution mode and then chooses its action so that an appropriate number of sensors are in the high-resolution mode. The conditional mean (Bayesian) predictor for this is given by $E\{\alpha(X) \mid Y^{(i)} = y\}$, where $\alpha(X)$ is given by (5). The main result (Theorem 1) says that if each sensor uses a threshold policy (4), then the global system is in a BNE.

To give further perspective, we offer several comments.

THRESHOLD POLICY
In light of Theorem 1, the value of the threshold $y^*$ in the threshold policy (4) is given by
$$E\{C(X, \alpha(X), 2) \mid Y = y^*\} = 0.$$
For the case of zero-mean uniform noise and zero-mean uniform prior, $y^*$ can be computed explicitly [15], [24] as
$$y^* = -\int_0^1 f(\alpha)\, d\alpha.$$
For other types of noise distributions, $y^*$ can be estimated via a simulation-based stochastic approximation algorithm as detailed in [15]. Individual agents can compute/estimate the threshold $y^*$ without communicating with other agents. Naturally, the value of the threshold $y^*$ depends on the prior distribution $\pi_0$ of $X$, the distribution of the noise, and the reward functions.
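For noise/prior pairs without a closed form, the fixed-point equation $E\{C(X, \alpha(X), 2) \mid Y = y^*\} = 0$ can also be solved numerically. The sketch below (all model choices, namely a Gaussian prior and noise and the congestion term f(alpha) = -c*alpha, are assumptions for illustration, and it uses numerical integration rather than the stochastic approximation scheme of [15]) locates $y^*$ by root finding.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

sigma, c = 1.0, 1.0        # assumed noise standard deviation and congestion slope, f(alpha) = -c * alpha

def expected_reward(y_star):
    """E{ X + f(alpha(X)) | Y = y_star } with X ~ N(0,1), Y = X + W, W ~ N(0, sigma^2),
    and alpha(x) = 1 - Phi_W(y_star - x) from (5) induced by the same threshold y_star."""
    post_var = 1.0 / (1.0 + 1.0 / sigma**2)                 # Gaussian posterior of X given Y = y_star
    post_mean = post_var * y_star / sigma**2
    x = np.linspace(post_mean - 6 * np.sqrt(post_var), post_mean + 6 * np.sqrt(post_var), 4001)
    post = norm.pdf(x, loc=post_mean, scale=np.sqrt(post_var))
    reward = x - c * (1.0 - norm.cdf(y_star - x, scale=sigma))
    return np.sum(reward * post) * (x[1] - x[0])            # Riemann sum over the posterior

y_star = brentq(expected_reward, -10.0, 10.0)               # the BNE threshold solves E{...} = 0
print(f"estimated threshold y* = {y_star:.3f}")
```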

EXISTENCE OF BAYESIAN NASH EQUILIBRIUM
We have omitted technical arguments regarding the existence of a BNE. The proof of existence of such a threshold BNE strategy is in two steps: The first step involves showing that a BNE exists among the class of randomized strategies. This involves using a suitable fixed-point theorem—Glicksberg's fixed-point theorem [29] in our case. (Glicksberg's theorem is a function-space version of Kakutani's fixed-point theorem.) The second step is to prove that the Nash equilibrium comprises threshold strategies. It is this second aspect that we have focused on above.

MULTIVARIATE BAYESIAN GAMES
The highly stylized model here can be generalized to multivariate observations, multiple actions, and multiple classes of sensors. In the nightclub analogy, multivariate observations model patrons who receive rewards for different musicians in the band (e.g., pianist, guitarist, etc.); the multiple-action problem involves the choice of multiple nightclubs; and multiple classes of sensors are equivalent to patrons who place different rewards on the various musicians who constitute the band. We refer the reader to [25] for an application in spectrum allocation in cognitive radio. There the totally positive (TP2) stochastic order [33] is used, which is a multivariate generalization of the MLR order used in this section.

DYNAMIC GAMES
Finally, the global game considered above is a one-shot Bayesian game. If the posterior computed at each stage is used as the prior for the next stage, a sequence of one-shot games can be solved over a time horizon. Actually, one can generalize the setting above to dynamic Bayesian games as in [34], which model regime change. In such dynamic games, agents optimize a reward over a time horizon and can take actions over several time periods. Such dynamic Bayesian games can also model multiagent change detection problems.

EXAMPLE 2: MULTIAGENT SEQUENTIAL CHANGE DETECTION WITH SOCIAL LEARNING
This section addresses the second question posed in the introduction, concerned with how local and global decision makers interact. Suppose individual agents make local decisions based on estimating the underlying state of a system and forward these local decisions to subsequent agents. This situation is depicted in Figure 2. It is well known [10] in the context of social learning that such a sequential procedure leads to information cascades in which eventually all sensors make the same local decision. How can a global decision for a sequential stopping time problem such as quickest change detection be made based on these local decisions?

[FIG2] Information exchange structure in social learning: the underlying state $x \sim \pi_0$; each sensor $k$ receives a private observation $y_k$ and broadcasts its action $a_k$ to subsequent sensors.


MOTIVATION: WHAT IS SOCIAL LEARNING?
In social learning [10], agents estimate the underlying state of nature not only from their local measurements, but also from the actions of previous agents. (These previous actions were taken by agents in response to their local measurements; therefore these actions convey information about the underlying state.) As we will describe below, the state estimation update in social learning has a drastically different structure compared to the standard optimal filtering recursion and can result in unusual behavior.

Consider a countable number of agents performing social learning to estimate the state of an underlying finite-state Markov chain $x$. Let $\mathcal{X} = \{1, 2, \ldots, X\}$ denote a finite state space, $P$ the transition matrix, and $\pi_0$ the initial distribution of the Markov chain. Each agent acts once in a predetermined sequential order indexed by $k = 1, 2, \ldots$. The index $k$ can also be viewed as the discrete time instant when agent $k$ acts. A multiagent system seeks to estimate $x_0$. This is done according to the following social learning protocol [8], [10].

Step 1) Private Observation: At time $k$, agent $k$ records a private observation $y_k \in \mathcal{Y}$ from the observation distribution $B_{iy} = P(y \mid x = i)$, $i \in \mathcal{X}$. Throughout this section we assume that $\mathcal{Y} = \{1, 2, \ldots, Y\}$ is finite.

Step 2) Private Belief: Using the public belief $\pi_{k-1}$ available at time $k-1$ [defined in Step 4) below], agent $k$ updates its private posterior belief $\eta_k(i) = P(x_k = i \mid a_1, \ldots, a_{k-1}, y_k)$ as the following Bayesian update (this is the classical hidden Markov model filtering update):
$$\eta_k = \frac{B_{y_k} P' \pi_{k-1}}{\mathbf{1}_X' B_{y_k} P' \pi_{k-1}}, \quad \text{where } B_y = \mathrm{diag}(P(y \mid x = i),\ i \in \mathcal{X}).$$
Here $\mathbf{1}_X$ denotes the $X$-dimensional vector of ones, $\eta_k$ is an $X$-dimensional pmf, and $P'$ denotes the transpose of the matrix $P$.

Step 3) Myopic Action: Agent $k$ takes action $a_k \in \mathcal{A} = \{1, 2, \ldots, A\}$ to minimize its expected cost
$$a_k = a(\pi_{k-1}, y_k) = \arg\min_{a \in \mathcal{A}} E\{c(x_k, a) \mid a_1, \ldots, a_{k-1}, y_k\} = \arg\min_{a \in \mathcal{A}} \{c_a' \eta_k\}. \quad (7)$$
Here $c_a = (c(i, a),\ i \in \mathcal{X})$ denotes an $X$-dimensional cost vector, and $c(i, a)$ denotes the cost incurred when the underlying state is $i$ and the agent chooses action $a$.

Step 4) Social Learning Filter: Agent $k$ then broadcasts its action $a_k$ to subsequent agents. Based on the action $a_k$, all agents (apart from $k$) perform social learning to update their public belief according to the following "social learning filter":
$$\pi_k = T(\pi_{k-1}, a_k), \quad \text{where } T(\pi, a) = \frac{R_a^\pi P' \pi}{\sigma(\pi, a)}, \quad \sigma(\pi, a) = \mathbf{1}_X' R_a^\pi P' \pi. \quad (8)$$
In (8), the public belief $\pi_k(i) = P(x_k = i \mid a_1, \ldots, a_k)$ and $R_a^\pi = \mathrm{diag}(P(a \mid x = i, \pi),\ i \in \mathcal{X})$ with elements
$$P(a_k = a \mid x_k = i, \pi_{k-1} = \pi) = \sum_{y \in \mathcal{Y}} P(a \mid y, \pi)\, P(y \mid x_k = i), \quad (9)$$
where, using (7),
$$P(a_k = a \mid y, \pi) = \begin{cases} 1 & \text{if } c_a' B_y P' \pi \leq c_{\tilde{a}}' B_y P' \pi \text{ for all } \tilde{a} \in \mathcal{A}, \\ 0 & \text{otherwise.} \end{cases}$$
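To make Steps 1)-4) concrete, here is a minimal Python sketch of one iteration of the protocol (an illustration, not code from the article); the matrices P (transition), B (observation likelihoods), and c (costs) are placeholders to be supplied by the modeler, and ties in the argmin are broken by NumPy's convention.

```python
import numpy as np

def private_belief(pi_prev, y, P, B):
    """Step 2: HMM update eta_k = B_y P' pi_{k-1} / (1' B_y P' pi_{k-1})."""
    unnorm = B[:, y] * (P.T @ pi_prev)
    return unnorm / unnorm.sum()

def myopic_action(eta, c):
    """Step 3: a_k = argmin_a c_a' eta_k, with cost matrix c[i, a] = c(i, a)."""
    return int(np.argmin(c.T @ eta))

def social_learning_filter(pi_prev, a, P, B, c):
    """Step 4: public belief update (8), with the action likelihood R_a^pi(i) from (9)."""
    pred = P.T @ pi_prev
    R = np.array([sum(B[i, y] for y in range(B.shape[1])
                      if myopic_action(private_belief(pi_prev, y, P, B), c) == a)
                  for i in range(B.shape[0])])
    unnorm = R * pred
    return unnorm / unnorm.sum()

# One agent's turn, given public belief pi and private observation y:
#   eta = private_belief(pi, y, P, B); a = myopic_action(eta, c); pi = social_learning_filter(pi, a, P, B, c)
```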

DISCUSSION
Actually, in typical formulations of social learning, the underlying state is assumed to be a random variable and not a Markov chain. Our description is given in terms of a Markov chain since we wish to highlight the unusual structure of the social learning filter below to a signal processing audience familiar with basic ideas in Bayesian filtering. Also, we are interested in change detection problems in which the change time distribution can be modeled as the absorption time of a Markov chain.

The derivation of the social learning filter is as follows: Define the posterior as $\pi_k(j) = P(x_k = j \mid a_1, \ldots, a_k)$. Then
$$\begin{aligned}
\pi_k(j) &= \frac{1}{\sigma(\pi_{k-1}, a_k)}\, P(a_k \mid x_k = j, a_1, \ldots, a_{k-1}) \sum_i P(x_k = j \mid x_{k-1} = i)\, P(x_{k-1} = i \mid a_1, \ldots, a_{k-1}) \\
&= \frac{1}{\sigma(\pi_{k-1}, a_k)} \sum_y P(a_k \mid y_k = y, a_1, \ldots, a_{k-1})\, P(y_k = y \mid x_k = j) \sum_i P(x_k = j \mid x_{k-1} = i)\, \pi_{k-1}(i) \\
&= \frac{1}{\sigma(\pi_{k-1}, a_k)} \sum_y P(a_k \mid y_k = y, \pi_{k-1})\, P(y_k = y \mid x_k = j) \sum_i P_{ij}\, \pi_{k-1}(i),
\end{aligned}$$
where the normalization term is
$$\sigma(\pi_{k-1}, a_k) = \sum_j \sum_y P(a_k \mid y_k = y, \pi_{k-1})\, P(y_k = y \mid x_k = j) \sum_i P_{ij}\, \pi_{k-1}(i).$$

INFORMATION EXCHANGE STRUCTURE
Figure 2 illustrates the above social learning protocol in which the information exchange is sequential. Agents send their hard decisions (actions) to subsequent agents. In the social learning protocol, we have assumed that each agent acts once. Another way of viewing the social learning protocol is that there are finitely many agents that act repeatedly in some predefined order. If each agent chooses its local decision using the current public belief, then the setting is identical to the social learning setup. We also refer the reader to [9] for several recent results in social learning over several types of network adjacency matrices.



FILTERING WITH HARD DECISIONS
Social learning can be viewed as agents making hard decision estimates at each time and sending these estimates to subsequent agents. In conventional Bayesian state estimation, a soft decision is made, namely, the posterior distribution (or equivalently, the observation) is sent to subsequent agents. For example, if $\mathcal{A} = \mathcal{X}$ and the costs are chosen as $c_a = -e_a$, where $e_a$ denotes the unit indicator vector with 1 in the $a$th position, then $\arg\min_a c_a' \pi = \arg\max_a \pi(a)$, i.e., the maximum a posteriori probability (MAP) state estimate. For this example, social learning is equivalent to agents sending the hard MAP estimates to subsequent agents.

Note that rather than sending a hard decision estimate, if each agent chooses its action $a_k = y_k$ (that is, agents send their private observations), then the right-hand side of (9) becomes $\sum_{y \in \mathcal{Y}} I(y_k = y)\, P(y \mid x_k = i) = P(y_k \mid x_k = i)$, and so the problem becomes a standard Bayesian filtering problem.

DEPENDENCE OF OBSERVATION LIKELIHOOD ON THE PRIOR
The most unusual feature of the above protocol (to a signal processing audience) is the social learning filter (8). In standard state estimation via a Bayesian filter, the observation likelihood given the state is completely parametrized by the observation noise distribution and is functionally independent of the current prior distribution. In the social learning filter, the likelihood of the action given the state (which is denoted by $R_a^\pi$) is an explicit function of the prior $\pi$! Not only does the action likelihood depend on the prior, but it is also a discontinuous function, due to the presence of the argmin in (7).

The above social learning protocol and social learning filter (8) result in interesting dynamics in state estimation and decision making. We will illustrate two interesting consequences that are highly unusual in a signal processing setting in the next two subsections, namely,

■ Rational agents form herds and information cascades and blindly follow previous agents.

■ Making global decisions on change detection in a multiagent system performing social learning results in multithreshold behavior.

RATIONAL AGENTS FORM INFORMATION CASCADES
The first consequence of the unusual nature of the social learning filter (8) is that social learning can result in multiple rational agents taking the same action independently of their observations. To illustrate this behavior, throughout this subsection, we assume that $x$ is a finite-state random variable (instead of a Markov chain) with prior distribution $\pi_0$.

HERDS AND INFORMATION CASCADES
We start with the following definitions; see also [10]:
■ An individual agent $k$ herds on the public belief $\pi_{k-1}$ if it chooses its action $a_k = a(\pi_{k-1}, y_k)$ in (7) independently of its observation $y_k$.
■ A herd of agents takes place at time $\bar{k}$ if the actions of all agents after time $\bar{k}$ are identical, i.e., $a_k = a_{\bar{k}}$ for all time $k > \bar{k}$.
■ An information cascade occurs at time $\bar{k}$ if the public beliefs of all agents after time $\bar{k}$ are identical, i.e., $\pi_k = \pi_{\bar{k}}$ for all $k > \bar{k}$.

Note that if an information cascade occurs, then since the public belief freezes, social learning ceases. Also from the above definitions it is clear that an information cascade implies a herd of agents, but the reverse is not true.

The following result, which is well known in the economics literature [8], [10], states that if agents follow the above social learning protocol, then after some finite time $\bar{k}$, an information cascade occurs. A nice analogy is provided in [10]. If I see someone walking down the street with an umbrella, I assume (based on rationality) that he has checked the weather forecast and is carrying an umbrella since it might rain. Therefore, I also take an umbrella. So now there are two people walking down the street carrying umbrellas. A third person sees two people with umbrellas and, based on the same inferential logic, also takes an umbrella. Even though individuals are rational, such herding behavior might be irrational since the first person who took the umbrella may not have checked the weather forecast. The proof follows via an elementary application of the martingale convergence theorem.

THEOREM 2 [8]
The social learning protocol described above leads to an information cascade in finite time with probability 1. That is, there exists a finite time $\bar{k}$ after which social learning ceases, i.e., the public belief $\pi_{k+1} = \pi_k$, $k \geq \bar{k}$, and all agents choose the same action, i.e., $a_{k+1} = a_k$, $k \geq \bar{k}$.

Instead of reproducing the proof, let us give some insight as to why Theorem 2 holds. It can be shown using martingale methods that at some finite time $k = k^*$, the agent's probability $P(a_k \mid y_k, \pi_{k-1})$ becomes independent of the private observation $y_k$. Then clearly from (9), $P(a_k = a \mid x_k = i, \pi_{k-1} = \pi) = P(a_k = a \mid \pi)$. Substituting this into the social learning filter (8), we see that $\pi_k = \pi_{k-1}$. Thus, after some finite time $k^*$, the social learning filter hits a fixed point and social learning stops. As a result, all subsequent agents $k > k^*$ completely disregard their private observations and take the same action $a_{k^*}$, thereby forming an information cascade (and therefore a herd).
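The following self-contained simulation sketch (a toy illustration with assumed two-state parameters, not an experiment from the article) reproduces this effect: once the public belief is strong enough that every possible observation maps to the same action, the action likelihood no longer depends on the state, the public belief freezes, and a cascade forms.

```python
import numpy as np

rng = np.random.default_rng(1)
B = np.array([[0.8, 0.2],          # B[i, y] = P(y | x = i); assumed observation likelihoods
              [0.3, 0.7]])
c = np.array([[0.0, 1.0],          # c[i, a]: cost 0 when the action "names" the true state
              [1.0, 0.0]])
x = 0                              # true (unknown) state; a random variable, so no transitions
pi = np.array([0.5, 0.5])          # public belief pi_0

def myopic_action(pi, y):
    eta = B[:, y] * pi             # private belief (transition matrix is the identity here)
    return int(np.argmin(c.T @ (eta / eta.sum())))

for k in range(1, 13):
    y = int(rng.choice(2, p=B[x]))                         # Step 1: private observation
    a = myopic_action(pi, y)                               # Steps 2-3: myopic action
    # Step 4: action likelihood P(a | x = i, pi) from (9); constant in i once agents herd
    lik = np.array([sum(B[i, yy] for yy in range(2) if myopic_action(pi, yy) == a)
                    for i in range(2)])
    pi = lik * pi / np.sum(lik * pi)
    print(f"k={k:2d}  y={y}  a={a}  public belief = {np.round(pi, 3)}")
```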

SOCIALIST (BENEVOLENT) AGENTS DELAY HERDING
In the above social learning protocol, agents are capitalistic. Agents choose their actions by optimizing their expected utilities according to (7). This leads to an information cascade and social learning stops. In other words, agents are interested in optimizing their own costs and ignore the information benefits their actions provide to others.

We now describe an optimized social learning procedure to delay herding. In the restaurant problem, an obvious approach to prevent herding is as follows. If a restaurant knew that patrons choose the restaurant with the most customers, then the restaurant could deliberately pay actors to sit in the restaurant so that it appears popular, thereby attracting customers.


The methodology in this section, where herding is delayed by benevolent agents, is a different approach. This approach (see [10]) is motivated by the following question: How can agents aid social learning by acting benevolently and choosing their actions to sacrifice their local costs but optimize a social welfare cost? For example, to estimate an underlying random variable given noisy observations in a sensor network, agents could initially transmit soft decisions, i.e., raw observations or the posterior distribution (which incurs a larger communication/battery cost for each sensor). As the state estimate becomes more accurate, sensors could minimize their communication costs and transmit hard decisions, i.e., their local actions. Continuing with this line of thought, in optimized social learning, agents choose their actions according to a policy $\mu: \pi_k \to a_k$, where $\pi_k$ denotes the current posterior (belief state) computed via the social learning filter (8). The policy $\mu$ is chosen to minimize the social welfare cost (which takes into account the costs of all agents)
$$J_\mu(\pi_0) = E_{\pi_0}^{\mu}\left\{ \sum_{k=0}^{\infty} \rho^k\, c(x_k, a_k) \right\}, \quad (10)$$
where $\rho \in [0, 1)$ is an economic discount factor and $\pi_0$ denotes the initial probability (prior) of the state $x$. $P_{\pi_0}^{\mu}$ and $E_{\pi_0}^{\mu}$ denote the probability measure and expectation of the evolution of the observations and underlying state, which are strategy dependent. The reader should compare the social welfare cost (10) with the capitalistic cost (7)—the social welfare cost (10) takes into account the costs of all the agents, whereas the capitalistic cost (7) considers only the individual agent.

Determining the policy $\mu^*$ that minimizes the social welfare cost (10) is equivalent to solving a partially observed Markov decision process (POMDP) problem [16], [35]. A POMDP comprises a noisily observed Markov chain, and the dynamics of the posterior distribution (belief state) are controlled by a policy ($\mu$ in our case). In general, POMDP problems are computationally intractable to solve. However, the above problem has a lot of structure that can be exploited. In [16], it is shown that for certain special cases of the above optimized social learning, the optimal policy has a threshold structure on the space of posterior distributions. This structure can be exploited by individual agents, and herding can be delayed. From a practical point of view, determining sufficient conditions for the optimal policy of a POMDP problem to have a special structure (such as a threshold) is very important, since the structure can be exploited to solve the POMDP problem efficiently. Since the optimal policy is the solution of a stochastic dynamic programming problem, one needs to give sufficient conditions on the transition matrix, observation distribution, and costs so that the optimal policy has a special structure. One particularly useful characterization of the structure is based on supermodularity of the dynamic programming recursion. We will comment on this briefly in the "Closing Remarks" section and refer the reader to [36]–[39] for details.

MULTIAGENT QUICKEST CHANGE DETECTION WITH SOCIAL LEARNING
The previous subsection dealt with social learning for estimating a random variable $x$ and can therefore be viewed as a localization problem. We now consider social learning in a tracking context where we wish to estimate the state of a finite-state Markov chain $x_k$. Suppose a multiagent system performs social learning and makes local decisions. Given the public beliefs from the social learning protocol, how can quickest change detection be achieved? In other words, how can a global decision maker use the local decisions from individual agents to decide when a change has occurred? It is shown below that making a global decision (change or no change) based on local decisions of individual agents has an unusual structure.

CLASSICAL QUICKEST DETECTION
The classical Bayesian quickest detection problem [40] is as follows: An underlying discrete-time state process $x$ jump changes at a geometrically distributed random time $\tau^0$. Consider a sequence of discrete-time random measurements $\{y_k, k \geq 1\}$, such that conditioned on the event $\{\tau^0 = t\}$, $y_k$, $k \leq t$, are independent and identically distributed (i.i.d.) random variables with distribution $B_1$, and $y_k$, $k > t$, are i.i.d. random variables with distribution $B_2$. The quickest detection problem involves detecting the change time $\tau^0$ with minimal cost. That is, at each time $k = 1, 2, \ldots$, a decision $u_k \in \{1\ (\text{stop and announce change}),\ 2\ (\text{continue})\}$ needs to be made to optimize a tradeoff between false alarm frequency and linear delay penalty.

To formalize this setup, let
$$P = \begin{bmatrix} 1 & 0 \\ 1 - P_{22} & P_{22} \end{bmatrix}$$
denote the transition matrix of a two-state Markov chain $x$ in which State 1 is absorbing. Then it is easily seen that the geometrically distributed change time $\tau^0$ is equivalent to the time at which the Markov chain enters State 1. That is, $\tau^0 = \min\{k : x_k = 1\}$ and $E\{\tau^0\} = 1/(1 - P_{22})$. Let $\tau$ be the time at which the decision $u_k = 1$ (announce change) is taken. The goal of quickest time detection is to minimize the Kolmogorov–Shiryaev criterion for detection of disorder [41]
$$J_\mu(\pi_0) = d\, E_{\pi_0}^{\mu}\{(\tau - \tau^0)^+\} + f\, P_{\pi_0}^{\mu}(\tau < \tau^0). \quad (11)$$
Here $x^+ = \max(x, 0)$. The nonnegative constants $d$ and $f$ denote the delay and false alarm penalties, respectively. So waiting too long to announce a change incurs a delay penalty $d$ at each time instant after the system has changed, while declaring a change before it happens incurs a false alarm penalty $f$. In (11), $\mu$ denotes the strategy of the decision maker. $P_{\pi_0}^{\mu}$ and $E_{\pi_0}^{\mu}$ are the probability measure and expectation of the evolution of the observations and Markov state, which are strategy dependent. $\pi_0$ denotes the initial distribution of the Markov chain $x$.

In classical quickest detection, the decision policy $\mu$ is a function of the two-dimensional belief state (posterior pmf) $\pi_k(i) = P(x_k = i \mid y_1, \ldots, y_k, u_1, \ldots, u_{k-1})$, $i = 1, 2$, with $\pi_k(1) + \pi_k(2) = 1$. So it suffices to consider one element, say $\pi_k(2)$, of this pmf.


Classical quickest change detection (see, for example, [40]) says that the policy $\mu^*(\pi)$ that optimizes (11) has the following threshold structure: There exists a threshold point $\pi^* \in [0, 1]$ such that
$$\mu^*(\pi_k) = \begin{cases} 2\ (\text{continue}) & \text{if } \pi_k(2) \geq \pi^* \\ 1\ (\text{stop and announce change}) & \text{if } \pi_k(2) < \pi^*. \end{cases} \quad (12)$$
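For comparison with the multiagent version that follows, here is a minimal sketch of the classical detector: an HMM filter driven by raw observations plus the single-threshold rule (12). The stopping threshold and the model matrices are illustrative choices for this sketch (the matrices happen to match the numerical example given later).

```python
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[1.0, 0.0],          # state 1 (change) is absorbing
              [0.05, 0.95]])
B = np.array([[0.9, 0.1],          # B[i, y] = P(y | x = i)
              [0.1, 0.9]])
pi_star = 0.3                      # illustrative stopping threshold on pi_k(2)

def hmm_filter(pi_prev, y):
    unnorm = B[:, y] * (P.T @ pi_prev)
    return unnorm / unnorm.sum()

x, pi, k = 1, np.array([0.0, 1.0]), 0          # start in state 2 (index 1): no change yet
while True:
    k += 1
    x = int(rng.choice(2, p=P[x]))             # Markov state transition
    y = int(rng.choice(2, p=B[x]))             # noisy observation of the state
    pi = hmm_filter(pi, y)
    if pi[1] < pi_star:                        # rule (12): stop when pi_k(2) drops below pi*
        print(f"declare change at k = {k}; posterior probability of change = {pi[0]:.3f}")
        break
```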

MULTIAGENT QUICKEST DETECTION PROBLEM
With the above classical formulation in mind, consider now the following multiagent quickest change detection problem. Suppose that a multiagent system performs social learning to estimate an underlying state according to the social learning protocol. That is, each agent acts once in a predetermined sequential order indexed by $k = 1, 2, \ldots$. (Equivalently, a finite number of agents act repeatedly in some predefined order and each agent chooses its local decision using the current public belief.) Given these local decisions (or equivalently the public belief), the goal of the global decision maker is to minimize the quickest detection objective (11). The problem now is a nontrivial generalization of classical quickest detection. The posterior $\pi$ is now the public belief given by the social learning filter (8) instead of a standard Bayesian filter. There is now interaction between the local and global decision makers. The local decision $a_k$ from the social learning protocol determines the public belief state $\pi_k$ via the social learning filter (8), which determines the global decision (stop or continue), which determines the local decision at the next time instant, and so on.

The global decision maker's policy $\mu^*: \pi \to \{1, 2\}$ that optimizes the quickest detection objective (11) and the cost $J_{\mu^*}(\pi_0)$ of this optimal policy are the solution of Bellman's dynamic programming equation
$$\mu^*(\pi) = \arg\min\Big\{ f\,\pi(2),\; d\,(1 - \pi(2)) + \sum_{a \in A} V\big(T(\pi, a)\big)\,\sigma(\pi, a) \Big\}, \qquad J_{\mu^*}(\pi_0) = V(\pi_0) \qquad (13)$$
$$V(\pi) = \min\Big\{ f\,\pi(2),\; d\,(1 - \pi(2)) + \sum_{a \in A} V\big(T(\pi, a)\big)\,\sigma(\pi, a) \Big\}.$$
Here $T(\pi, a)$ and $\sigma(\pi, a)$ are given by the social learning filter (8); recall $a$ denotes the local decision. $V(\pi)$ is called the "value function": it is the cost incurred by the optimal policy when the initial belief state (prior) is $\pi$. As will be seen in the numerical example below, the optimal policy $\mu^*(\pi)$ has a very different structure compared to classical quickest detection.
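Equation (13) requires the public belief update $T(\pi, a)$ and the action likelihood $\sigma(\pi, a)$ produced by the social learning filter (8). Since (8) appears earlier in the article, the following Python sketch shows one possible implementation under the social learning protocol described above (each agent combines the predicted public belief with its private observation, myopically minimizes its expected local cost, and the public belief is Bayes-updated using only the observed action); the function name and signature are our own choices.

```python
import numpy as np

def social_learning_filter(pi, a, P, B, c):
    """One step of the public belief update: returns T(pi, a) and sigma(pi, a).

    pi : current public belief over states, shape (X,)
    a  : observed local action of the current agent (0-based)
    P  : state transition matrix, shape (X, X)
    B  : observation likelihoods B[i, y] = P(y | x = i), shape (X, Y)
    c  : local cost matrix c[i, a], shape (X, A)
    """
    X = B.shape[0]
    pred = P.T @ pi                                   # predicted state distribution
    # Myopic local decision rule: for each possible observation y, the agent picks
    # the action minimizing its expected cost under its (unnormalized) private belief.
    private = B * pred[:, None]                       # private beliefs, one column per y
    local_action = np.argmin(c.T @ private, axis=0)   # best local action for each y
    # Action likelihoods P(a | x = i, pi): sum B[i, y] over observations y mapped to a.
    R = np.array([B[i, local_action == a].sum() for i in range(X)])
    unnorm = R * pred
    sigma = float(unnorm.sum())                       # sigma(pi, a): probability of action a
    T = unnorm / sigma if sigma > 0 else pred         # T(pi, a); fallback if a has zero probability
    return T, sigma
```

These are exactly the two quantities summed over $a$ in (13): $\sigma(\pi, a)$ weights the continuation value $V(T(\pi, a))$.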

Apart from applications in automated decision-making systems and sensor networks, the above formulation can be used in agent-based models for the microstructure of asset prices in high-frequency trading in financial systems [42]. The state $x$ denotes the underlying asset value that changes at time $\tau^0$. Agents observe local individual decisions of previous agents via an order book, combine these observed decisions with their noisy private signals about the asset, selfishly optimize their expected local utilities, and then make their own individual decisions (whether to buy, sell, or do nothing). The market evolves through the orders of trading agents. Given this order book information, the goal of the market maker (global decision maker) is to achieve quickest change point detection when a shock occurs to the value of the asset [43].

NUMERICAL EXAMPLE
We now illustrate the unusual multithreshold property of the global decision maker's optimal policy $\mu^*(\pi)$ in multiagent quickest detection with social learning. Consider the social learning model with the following parameters: the underlying state is a two-state Markov chain $x$ with state space $X = \{1, 2\}$ and transition probability matrix
$$P = \begin{bmatrix} 1 & 0 \\ 0.05 & 0.95 \end{bmatrix}.$$
So the change time $\tau^0$ (i.e., the time the Markov chain jumps from State 2 into absorbing State 1) is geometrically distributed with $E\{\tau^0\} = 1/0.05 = 20$.

SOCIAL LEARNING PARAMETERS
Individual agents observe the Markov chain $x$ in noise with the observation symbol set $Y = \{1, 2\}$. Suppose the observation likelihood matrix with elements $B_{iy} = P(y_k = y \mid x_k = i)$ is
$$B = \begin{bmatrix} 0.9 & 0.1 \\ 0.1 & 0.9 \end{bmatrix}.$$
Agents can choose their local actions $a$ from the action set $A = \{1, 2\}$. The state dependent cost matrix of these actions is
$$c = \big(c(i, a),\; i \in X,\; a \in A\big) = \begin{bmatrix} 4.57 & 5.57 \\ 2.57 & 0 \end{bmatrix}.$$
Agents perform social learning with the above parameters. The intervals $[0, \pi_1^*]$ and $[\pi_2^*, 1]$ in Figure 3(a) are regions where the optimal local actions taken by agents are independent of their observations. For $\pi(2) \in [\pi_2^*, 1]$, the optimal local action is 2, and for $\pi(2) \in [0, \pi_1^*]$, the optimal local action is 1. So individual agents herd for belief states in these intervals and the local actions do not yield any information about the underlying state. Moreover, the interval $[0, \pi_1^*]$ depicts a region where all agents herd, meaning that once the belief state is in this region, it remains so indefinitely and all agents choose the same local Action 1. Note that even if agent $k$ herds, so that its action $a_k$ provides no information about its private observation $y_k$, the public belief still evolves according to the predictor $\pi_{k+1} = P'\pi_k$. So an information cascade does not occur in this example.

GLOBAL DECISION MAKING
Based on the local actions of the agents performing social learning, the global decision maker needs to perform quickest change detection. The global decision maker uses the delay penalty $d = 1.05$ and false alarm penalty $f = 3$ in the objective function (11). The optimal policy $\mu^*(\pi)$ of the global decision maker, where $\pi = [1 - \pi(2), \pi(2)]'$, is plotted versus $\pi(2)$ in Figure 3(a). Note that $\pi(2) = 1$ means that with certainty no change has occurred, while $\pi(2) = 0$ means with certainty a change has occurred. The policy $\mu^*(\pi)$ was computed by constructing a uniform grid of 1,000 points for $\pi(2) \in [0, 1]$ and then implementing the dynamic programming equation (13) via a fixed point value iteration algorithm for 200 iterations. The horizontal axis $\pi(2)$ is the posterior probability of no change. The vertical axis denotes the optimal decision: $u = 1$ denotes stop and declare a change, while $u = 2$ denotes continue.
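The value iteration just described can be sketched in a few lines of Python; this is our illustration, not the authors' code. It uses the parameters listed in this section, an inlined compact copy of the social learning filter sketched after (13) (our assumed form of (8)), and linear interpolation of the value function between grid points, an implementation detail the article does not specify.

```python
import numpy as np

P = np.array([[1.0, 0.0], [0.05, 0.95]])     # transition matrix (State 1 absorbing)
B = np.array([[0.9, 0.1], [0.1, 0.9]])       # observation likelihoods B[i, y]
c = np.array([[4.57, 5.57], [2.57, 0.0]])    # local cost matrix c[i, a]
d, f = 1.05, 3.0                              # delay and false alarm penalties

def slf(pi, a):
    """Compact copy of the social learning filter sketch: returns T(pi, a), sigma(pi, a)."""
    pred = P.T @ pi
    local_action = np.argmin(c.T @ (B * pred[:, None]), axis=0)   # myopic action per y
    R = np.array([B[i, local_action == a].sum() for i in range(2)])
    sigma = float((R * pred).sum())
    return ((R * pred) / sigma if sigma > 0 else pred), sigma

grid = np.linspace(0.0, 1.0, 1000)            # uniform grid over pi(2)
V = np.zeros_like(grid)
policy = np.ones(len(grid), dtype=int)

for _ in range(200):                           # fixed point value iteration for (13)
    V_new = np.empty_like(V)
    for n, p2 in enumerate(grid):
        pi = np.array([1.0 - p2, p2])
        stop = f * p2                          # cost of u = 1 (announce change now)
        cont = d * (1.0 - p2)                  # one-stage delay cost of u = 2 (continue)
        for a in (0, 1):
            T, sigma = slf(pi, a)
            cont += sigma * np.interp(T[1], grid, V)   # interpolated V(T(pi, a))
        V_new[n] = min(stop, cont)
        policy[n] = 1 if stop <= cont else 2
    V = V_new
# Plotting 'policy' and 'V' against 'grid' reproduces (qualitatively) Figure 3(a) and (b).
```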

The most remarkable feature of Figure 3(a) is the multithreshold behavior of the global decision maker's optimal policy $\mu^*(\pi)$.


Recall that $\pi(2)$ denotes the posterior probability of no change. So consider a region where $\mu^*(\pi) = 2$ that is sandwiched between two regions where $\mu^*(\pi) = 1$. Then as $\pi(2)$ increases, the optimal policy switches from $\mu^*(\pi) = 2$ to $\mu^*(\pi) = 1$. In other words, the optimal global decision policy "changes its mind": it switches from no change to change as the posterior probability of a change decreases! Thus, the global decision (stop or continue) is a nonmonotone function of the posterior probability obtained from local decisions.

Figure 3(b) shows the associated value function obtained via stochastic dynamic programming (13). Recall that $V(\pi)$ is the cost incurred by the optimal policy with initial belief state $\pi$. Unlike standard sequential detection problems in which the value function is concave, the figure shows that the value function is nonconcave and discontinuous. To summarize, Figure 3 shows that social learning based quickest detection results in fundamentally different decision policies compared to classical quickest time detection (which has a single threshold). Thus making global decisions (stop or continue) based on local decisions (from social learning) is nontrivial. In [17], a detailed analysis of the problem is given together with a characterization of this multithreshold behavior. Also, more general phase-distributed change times are considered in [17].

CLOSING REMARKS FOR EXAMPLE 2
We summarize here several intriguing extensions of the multiagent social learning problem specified above.

WISDOM OF CROWDS
Surowiecki's book [44] is an excellent popular piece that explains the wisdom-of-crowds hypothesis. The wisdom-of-crowds hypothesis predicts that the independent judgments of a crowd of individuals (as measured by any form of central tendency) will be relatively accurate, even when most of the individuals in the crowd are ignorant and error prone. The book also studies situations (such as rational bubbles) in which crowds are not wiser than individuals. "Collect enough people on a street corner staring at the sky, and everyone who walks past will look up" [53]. Such herding behavior is typical in social learning.

IN WHICH ORDER SHOULD AGENTS BE POLLED?
In the social learning protocol, we have assumed that the agents act sequentially in a predefined order. Suppose we wish to optimize the order in which agents are polled. For example, in a sensor network, we may wish to optimize the order in which to poll sensors to minimize the communication energy required to estimate a state to a prescribed degree of accuracy. If the sensors do not interact with each other, then it would make sense to probe the sensors in decreasing order of signal to noise ratio (reputation). However, if sensors perform social learning, then the optimal polling order is not straightforward to determine. When each sensor is polled, it broadcasts its decision to other sensors. Other sensors use the broadcast decision to update their estimates of the state and then make decisions. If the most senior agent "speaks" first it would unduly affect the decisions of more junior agents. This could lead to an increase in bias. To quote a recent paper [45]: "In 94% of cases, groups (of people) used the first answer provided as their final answer… Groups tended to commit to the first answer provided by any group member." People with dominant personalities tend to speak first and most forcefully "even when they actually lack competence." On the other hand, if the most junior agent is polled first, then since its variance is large, several agents would need to be polled to reduce the variance. We refer the reader to [46] for a fascinating description of

[FIG3] Optimal global decision policy for social learning based quickest change detection with geometrically distributed change time. The parameters are specified in the section "Numerical Example." The optimal policy $\mu^*(\pi) \in \{1\ \text{(announce change)},\ 2\ \text{(continue)}\}$ is characterized by a triple threshold; that is, it switches from 1 to 2 three times as the posterior $\pi(2)$ increases. The value function $V(\pi)$ is nonconcave and discontinuous in $\pi$. As explained in the text, all agents herd for $\pi(2) \in [0, \pi_1^*]$, while individual agents herd for $\pi(2) \in [\pi_2^*, 1]$. (a) Optimal global decision policy $\mu^*(\pi)$. (b) Value function $V(\pi)$ for the global decision policy.



who should speak first in a public debate. As described in [46], seniority is considered in the rules of debate and voting in the U.S. Supreme Court. "In the past, a vote was taken after the newest justice to the Court spoke, with the justices voting in order of ascending seniority largely, it was said, to avoid the pressure from long-term members of the Court on their junior colleagues." It turns out that for two agents, the seniority rule is always optimal for any prior, that is, the senior agent speaks first followed by the junior agent; see [46] for the proof. However, for more than two agents, the optimal order depends on the prior, and on the observations in general.

STRUCTURAL RESULTS AND SUB/SUPERMODULARITY
In both examples discussed in this article, we have alluded to threshold policies as being of great interest in adaptive sensing and dynamic decision making. If one can prove that the optimal policy has a threshold structure, then computing/estimating the optimal policy becomes simplified (compared to solving a stochastic dynamic programming problem that suffers from the curse of dimensionality). For example, suppose the space of actions is $\{1, 2\}$ and the state space is the set of reals or a closed interval. This was the case in both the global game and the social learning problem. Then determining the optimal policy $\mu^*: x \to a$ is a function space optimization problem. But if the optimal policy is a threshold, i.e., for some threshold value $x^*$,
$$\mu^*(x) = \begin{cases} 1 & \text{for } x < x^* \\ 2 & \text{for } x \ge x^*, \end{cases}$$
then determining the threshold is a finite dimensional optimization problem. We only need to search for the single threshold state $x^*$.

In general, the optimal policy can be expressed as $\mu^*(x) = \arg\min_u Q(x, u)$ for some cost function $Q(x, u)$. The reader might wonder what general sufficient conditions guarantee the existence of threshold policies. The main idea is that if $Q(x, u)$ is a submodular function of $(x, u)$, i.e., $Q(x, 2) - Q(x, 1)$ is decreasing in $x$, then the optimal policy $\mu^*(x) = \arg\min_u Q(x, u)$ is increasing in $x$ and therefore is a threshold. Equivalently, if $Q(x, u)$ is a supermodular function of $(x, u)$, i.e., $Q(x, 2) - Q(x, 1)$ is increasing in $x$, then the optimal policy $\mu^*(x) = \arg\max_u Q(x, u)$ is increasing in $x$ and therefore is a threshold. For the global games problem considered in the section "Bayesian Nash Equilibrium," the reward for action $u = 1$ is zero, so in the notation of this section $Q(x, 1) = 0$, where $x$ denotes the observation $y$ in the section "Bayesian Nash Equilibrium." Therefore a sufficient condition for the policy that maximizes the reward to be increasing in the observation is that $Q(x, u)$ is supermodular, i.e., $Q(x, 2)$ is increasing in $x$ (since $Q(x, 1) = 0$ in this case). This is precisely what we proved in Theorem 1; it required Conditions 1 and 2 of the theorem to hold. The idea of using sub/supermodularity was championed by [47] and provides a general set of sufficient conditions for the existence of monotone strategies in stochastic control and game-theoretic problems. This area falls under the general umbrella of monotone comparative statics, which has witnessed significant interest in economics [48]. More generally, $x$ only needs to be partially orderable for $\mu^*(x)$ to be decreasing in $x$ with respect to this partial order; this is important in Bayesian problems for which the state is a posterior distribution that is partially orderable using a stochastic order [16]. Actually, for games, more general sufficient conditions such as the zero-crossing condition [48] can be given.
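A small numerical check may help fix the idea; the cost $Q(x, u)$ below is made up purely for illustration. It verifies that when $Q(x, 2) - Q(x, 1)$ is decreasing in $x$ (submodularity), the argmin policy is monotone in $x$ and hence a single threshold.

```python
import numpy as np

# Made-up cost Q(x, u) on a grid of states x, with actions u in {1, 2}.
# Q is submodular here: Q(x, 2) - Q(x, 1) = 0.6 - 1.5 x is decreasing in x.
x = np.linspace(0.0, 1.0, 501)
Q1 = 0.6 * np.ones_like(x)                 # cost of action u = 1
Q2 = 1.2 - 1.5 * x                         # cost of action u = 2

assert np.all(np.diff(Q2 - Q1) <= 1e-12)   # submodularity on this grid

policy = np.where(Q1 <= Q2, 1, 2)          # mu*(x) = argmin_u Q(x, u), ties to u = 1
assert np.all(np.diff(policy) >= 0)        # monotone in x, hence a single threshold
x_star = x[np.argmax(policy == 2)]         # first state where the action switches to 2
print("threshold x* ≈", x_star)            # ≈ 0.4 for this made-up Q
```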

SUMMARY AND EXTENSIONS
This article has considered multiagent sensing systems that use statistical signal processing algorithms for computing state estimates. We have presented two highly stylized examples with different graphical structures of information flow. The first example dealt with Bayesian global games for sensor mode selection in which all agents obtain noisy measurements of a common state and act simultaneously. The second example dealt with social learning in which agents act and exchange information sequentially. Despite the apparent simplicity in these information flows, the systems exhibit unusual behavior. Bayesian games and social learning are powerful analysis and synthesis tools for understanding how agents interact and influence decision making. Both areas have witnessed significant advances in the last decade in economics, wireless communications, sensor networks, and automated decision systems.

In multiagent state estimation formulated as a social learning problem, one issue we have not addressed in this article is the inadvertent multiple reuse of data, also known as misinformation propagation or the data incest problem; see [49] and [50] for details.

EXTENSION: NON-BAYESIAN APPROACH FOR COORDINATION IN DECISION MAKING
The two examples we have detailed in this article are Bayesian in the sense that they use optimal filtering and prediction for state estimation to make decisions. We conclude this article with a short discussion of a non-Bayesian game-theoretic learning approach for adaptive decision making.

Consider the following generalization of the sensor activation problem of the section "Multiagent Adaptive Sensing: A Global Games Approach." Suppose there are $L$ sensors. Each sensor $l$ has a utility function $U^l(a^1, \ldots, a^L)$, where $a^l$ denotes the action (mode) chosen by sensor $l$. The utility function can be quite general and takes into account the cost of sensing, communication, and actions of other sensors.

REGRET-BASED DECISION MAKING
Suppose each sensor $l$ chooses its actions according to the following adaptive algorithm running over time $k = 1, 2, \ldots$:
1) At time $k + 1$, choose action $a_{k+1}^l$ from the pmf $\psi_{k+1}^l$, where
$$\psi_{k+1}^l(i) = P\big(a_{k+1}^l = i \mid a_k^l\big) = \begin{cases} \frac{1}{C}\,\big(r_k^l(a_k^l, i)\big)^+, & i \ne a_k^l \\ 1 - \sum_{j \ne a_k^l} \frac{1}{C}\,\big(r_k^l(a_k^l, j)\big)^+, & i = a_k^l. \end{cases} \qquad (14)$$
Here $C$ is a sufficiently large positive constant so that $\psi_{k+1}^l$ is a valid pmf.


2) The regret matrix $r_k^l$ that determines the pmf $\psi_{k+1}^l$ is updated via the adaptive filtering algorithm
$$r_k^l(i, j) = r_{k-1}^l(i, j) + \frac{1}{k}\Big[\big(U^l(j, a_k^{-l}) - U^l(a_k^l, a_k^{-l})\big)\, I\{a_k^l = i\} - r_{k-1}^l(i, j)\Big], \qquad (15)$$

where $a_k^{-l}$ denotes the actions chosen by all sensors excluding sensor $l$. Step 1 corresponds to each sensor choosing its action randomly from a Markov chain with transition probability $\psi_{k+1}^l$. These transition probabilities are computed in Step 2 in terms of the regret matrix $r_k^l$, which is the time-averaged regret sensor $l$ experiences for choosing action $i$ instead of action $j$, for each possible action $j \ne i$ (i.e., how much better off it would be if it had chosen action $j$ instead of $i$):
$$r_n^l(i, j) = \frac{1}{n} \sum_{k=1}^{n} \big[ U^l(j, a_k^{-l}) - U^l(a_k^l, a_k^{-l}) \big]\, I\{a_k^l = i\}. \qquad (16)$$

If every sensor chooses its action according to the above regret-based algorithm, what can one say about the global behavior? By emergent global behavior, we mean the empirical frequency of actions taken over time by all sensors. For each $L$-tuple of actions $(a^l, a^{-l})$, define the empirical frequency of actions taken up to time $n$ as
$$z_n(a^l, a^{-l}) = \frac{1}{n} \sum_{k=1}^{n} I\big(a_k^l = a^l,\; a_k^{-l} = a^{-l}\big).$$
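As an illustration of steps (14)–(16), the following Python sketch runs the regret-based procedure for two sensors with a made-up two-action utility table (a chicken-like coordination game chosen only for this example); the constant $C$ and the positive-part operation follow (14). By the results of [1] and [2] quoted next, the empirical frequency $z_n$ printed at the end should approach the set of correlated equilibria of this game.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 2-sensor, 2-action example; the utilities are made up for this sketch.
# U[l][a0, a1] = utility of sensor l when sensor 0 plays a0 and sensor 1 plays a1.
U = [np.array([[6.0, 2.0], [7.0, 0.0]]),
     np.array([[6.0, 7.0], [2.0, 0.0]])]
L, A = 2, 2
C = 20.0                                        # large constant so (14) gives a valid pmf

regret = [np.zeros((A, A)) for _ in range(L)]   # r_k^l(i, j)
actions = [int(rng.integers(A)) for _ in range(L)]
counts = np.zeros((A, A))                       # joint action counts for z_n

n_iter = 20_000
for k in range(1, n_iter + 1):
    new_actions = []
    for l in range(L):
        i = actions[l]                           # previous action of sensor l
        psi = np.maximum(regret[l][i], 0.0) / C  # Step 1: probabilities (14) for j != i
        psi[i] = 0.0
        psi[i] = 1.0 - psi.sum()                 # remaining mass on the previous action
        new_actions.append(int(rng.choice(A, p=psi)))
    actions = new_actions
    joint = (actions[0], actions[1])
    counts[joint] += 1.0

    for l in range(L):                           # Step 2: regret update (15)
        a_l, a_other = actions[l], actions[1 - l]
        played = U[l][joint]
        inst = np.zeros((A, A))
        for j in range(A):
            alt = U[l][(j, a_other)] if l == 0 else U[l][(a_other, j)]
            inst[a_l, j] = alt - played          # indicator I{a_k^l = i} keeps other rows at 0
        regret[l] += (inst - regret[l]) / k      # time average, consistent with (16)

print("empirical joint frequency z_n:\n", counts / n_iter)
```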

The seminal papers [1] and [2] show that the empirical frequency of actions $z_n$ converges as $n \to \infty$ to the set of correlated equilibria of a noncooperative game. Correlated equilibria are a generalization of Nash equilibria and were introduced by Aumann [51]. Aumann's 2005 Nobel Prize in Economics press release reads

Aumann also introduced a new equilibrium concept, correlated equilibrium, which is weaker than Nash equilibrium, the solution concept developed by John Nash, an economics laureate in 1994. Correlated equilibrium can explain why it may be advantageous for negotiating parties to allow an impartial mediator to speak to the parties either jointly or separately...

The set of correlated equilibria $C_e$ is the set of probability distributions on the joint action profile $(a^l, a^{-l})$ that satisfy
$$C_e = \Big\{ \mu :\; \sum_{a^{-l}} \mu(j, a^{-l})\, \big[ U^l\big((i, a^{-l})\big) - U^l\big((j, a^{-l})\big) \big] \le 0, \;\; \forall\, l, j, i \Big\}. \qquad (17)$$
Here $\mu(j, a^{-l}) = P(a^l = j, a^{-l})$ denotes the randomized strategy (joint probability) of player $l$ choosing action $j$ and the rest of the players choosing actions $a^{-l}$. The correlated equilibrium condition (17) states that instead of taking action $j$ (which is prescribed by the equilibrium strategy $\mu(j, a^{-l})$), if player $l$ cheats and takes action $i$, it is worse off. So there is no unilateral incentive for any player to cheat.
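For concreteness (again with the made-up utilities from the regret-matching sketch above), the following Python function checks condition (17) directly for a candidate joint distribution: the classic distribution that randomizes uniformly over the three outcomes in which at least one player yields passes the test, while the uniform product (independent) distribution does not.

```python
import numpy as np

# Same illustrative utilities as in the regret-matching sketch above (made up).
U = [np.array([[6.0, 2.0], [7.0, 0.0]]),
     np.array([[6.0, 7.0], [2.0, 0.0]])]
A = 2

def is_correlated_equilibrium(mu, U, tol=1e-9):
    """Check condition (17): no player gains by deviating from the recommended action."""
    for l, Ul in enumerate(U):
        for j in range(A):          # recommended action for player l
            for i in range(A):      # candidate deviation
                gain = 0.0
                for a_other in range(A):
                    joint = (j, a_other) if l == 0 else (a_other, j)
                    dev = (i, a_other) if l == 0 else (a_other, i)
                    gain += mu[joint] * (Ul[dev] - Ul[joint])
                if gain > tol:
                    return False
    return True

# A "traffic light" correlated equilibrium for these chicken-like payoffs:
mu_ce = np.array([[1/3, 1/3], [1/3, 0.0]])
print(is_correlated_equilibrium(mu_ce, U))     # expected: True

# A product (independent) distribution that is not an equilibrium of this game:
mu_bad = np.outer([0.5, 0.5], [0.5, 0.5])
print(is_correlated_equilibrium(mu_bad, U))    # expected: False
```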

To summarize, the above algorithm ensures that all agents eventually achieve coordination (consensus) in decision making: the randomized strategies of all agents converge to a common convex polytope $C_e$. Step 2 of the algorithm requires that each agent knows its own utility and the actions of other agents, but agents do not need to know the utility functions of other agents. In [52], a "blind" version of this regret-based algorithm is presented where agents do not need to know the actions of other agents. These algorithms can be viewed as simple heuristic behavior by individual agents (choosing actions according to the measured regret) resulting in sophisticated global outcomes [2], specifically, convergence to $C_e$, thereby coordinating decisions. We refer to [5] and [6] for applications in sensor networks and cognitive radio and also generalizations to tracking algorithms where the step size for the regret matrix update is a constant. Such algorithms can track the correlated equilibria of games with time-varying parameters.

WHY CORRELATED EQUILIBRIA?
The set of correlated equilibria is more natural in decentralized adaptive learning environments than Nash equilibria since it allows for individual players to coordinate their actions. Nash equilibria are a special case of correlated equilibria where the joint probability $\mu(a^l, a^{-l})$ is chosen as the product distribution for all players $l$, i.e., all the agents choose their actions independently. The coordination inherent in correlated equilibria can lead to higher performance [51] than if each player chooses actions independently as required by a Nash equilibrium. As described in [1], it is unreasonable to expect in a learning environment that players act independently (as required by a Nash equilibrium) since the common history observed by all players acts as a natural coordination device. Hart and Mas-Colell observe in [52] that for most simple adaptive procedures, "...there is a natural coordination device: the common history, observed by all players. It is thus reasonable to expect that, at the end, independence among players will not obtain." The set of correlated equilibria is also structurally simpler than the set of Nash equilibria; the set $C_e$ in (17) is a convex polytope in the strategies $\mu$, whereas Nash equilibria are isolated points at the extrema of this set.

ACKNOWLEDGMENTS
This work was supported in part by the Canada Research Chairs program, NSERC Canada, in part by the U.S. Army Research Office under MURI grant W911NF-11-1-0036, and in part by the U.S. National Science Foundation under grant CNS-09-05086. The authors gratefully acknowledge discussions with Dr. C. Chamley, who authored the book [10].

AUTHORS
Vikram Krishnamurthy ([email protected]) is a professor and Canada Research Chair in the Department of Electrical Engineering, University of British Columbia, Vancouver, Canada. His current research interests include computational game theory and stochastic control in sensor networks, and stochastic dynamical systems for modeling of biological ion channels and biosensors. He was a distinguished lecturer for the IEEE Signal Processing Society and editor-in-chief of


IEEE Journal of Selected Topics in Signal Processing. He is a Fellow of the IEEE.

H. Vincent Poor ([email protected]) is the dean of engineering and applied science at Princeton University, where he is also the Michael Henry Strater University Professor. His interests include statistical signal processing and information theory, with applications in several fields. He is a member of the National Academy of Engineering, the National Academy of Sciences, and the Royal Academy of Engineering (United Kingdom). Recent recognition includes the 2010 IET Fleming Medal, the 2011 IEEE Sumner Award, the 2011 Society Award of the IEEE Signal Processing Society, and honorary doctorates from Aalborg University, the Hong Kong University of Science and Technology, and the University of Edinburgh. He is a Fellow of the IEEE.

REFERENCES
[1] S. Hart and A. Mas-Colell, “A simple adaptive procedure leading to correlated equilibrium,” Econometrica, vol. 68, no. 5, pp. 1127–1150, 2000.

[2] S. Hart, “Adaptive heuristics,” Econometrica, vol. 73, no. 5, pp. 1401–1430, 2005.

[3] E. Biglieri, A. Goldsmith, L. Greenstein, N. Mandayam, and H. V. Poor, Principles of Cognitive Radio. Cambridge, U.K.: Cambridge Univ. Press, 2013.

[4] J. Predd, S. R. Kulkarni, and H. V. Poor, “A collaborative training algorithm for distributed learning,” IEEE Trans. Inform. Theory, vol. 55, no. 4, pp. 1856–1871, 2009.

[5] V. Krishnamurthy, M. Maskery, and G. Yin, “Decentralized activation in a ZigBee-enabled unattended ground sensor network: A correlated equilibrium game theoretic analysis,” IEEE Trans. Signal Processing, vol. 56, no. 12, pp. 6086–6101, Dec. 2008.

[6] M. Maskery, V. Krishnamurthy, and Q. Zhao, “Decentralized dynamic spectrum access for cognitive radios: Cooperative design of a non-cooperative game,” IEEE Trans. Commun., vol. 57, no. 2, pp. 459–469, 2008.

[7] A. Banerjee, “A simple model of herd behavior,” Quarterly J. Econ., vol. 107, no. 3, pp. 797–817, 1992.

[8] S. Bikchandani, D. Hirshleifer, and I. Welch, “A theory of fads, fashion, custom, and cultural change as information cascades,” J. Political Econ., vol. 100, no. 4, pp. 992–1026, 1992.

[9] D. Acemoglu and A. Ozdaglar, “Opinion dynamics and learning in social networks,” Dyn. Games Appl., vol. 1, no. 1, pp. 3–49, 2011.

[10] C. Chamley, Rational Herds: Economic Models of Social Learning. Cambridge, U.K.: Cambridge Univ. Press, 2004.

[11] I. Lobel, D. Acemoglu, M. Dahleh, and A. Ozdaglar, “Preliminary results on social learning with partial observations,” in Proc. 2nd Int. Conf. Performance Evaluation Methodologies and Tools, Nantes, France, 2007.

[12] M. Hellman and T. Cover, “Learning with finite memory,” Ann. Math. Stat., vol. 41, no. 3, pp. 765–782, 1970.

[13] G. Wang, S. R. Kulkarni, H. V. Poor, and D. Osherson, “Aggregating large sets of probabilistic forecasts by weighted coherent adjustment,” Decis. Anal., vol. 8, no. 2, pp. 128–144, June 2011.

[14] A. Zoubir, V. Krishnamurthy, and A. Sayed, “Signal processing theory and methods [In the Spotlight],” IEEE Signal Processing Mag., vol. 28, no. 5, pp. 152–156, 2011.

[15] V. Krishnamurthy, “Decentralized activation in dense sensor networks via global games,” IEEE Trans. Signal Processing, vol. 56, no. 10, pp. 4936–4950, 2008.

[16] V. Krishnamurthy, “Bayesian sequential detection with phase-distributed change time and nonlinear penalty: A lattice programming POMDP approach,” IEEE Trans. Inform. Theory, vol. 57, no. 3, pp. 7096–7124, Oct. 2011.

[17] V. Krishnamurthy, “Quickest detection POMDPs with social learning: Interaction of local and global decision makers,” IEEE Trans. Inform. Theory, vol. 58, no. 8, pp. 5563–5587, 2012.

[18] A. MacKenzie and S. Wicker, “Game theory and the design of self-configuring, adaptive wireless networks,” IEEE Commun. Mag., vol. 39, no. 11, pp. 126–131, Nov. 2001.

[19] N. Li and J. Hou, “Localized fault-tolerant topology control in wireless ad hoc networks,” IEEE Trans. Parallel Distrib. Comput., vol. 17, no. 4, pp. 307–320, 2006.

[20] L. Wang and Y. Xiao, “A survey of energy-efficient scheduling mechanisms in sensor networks,” Mobile Networks Appl., vol. 11, no. 5, pp. 723–740, 2006.

[21] M. Chen, S. Gonzalez, A. Vasilakos, H. Cao, and V. Leung, “Body area networks: A survey,” Mobile Networks Appl., vol. 16, pp. 171–193, 2011.

[22] I. F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, “Wireless sensor networks: A survey,” Comput. Networks, vol. 38, no. 4, pp. 393–422, 2002.

[23] P. Biswas and S. Phoha, “Self-organizing sensor networks for integrated target surveillance,” IEEE Trans. Comput., vol. 55, no. 8, pp. 1033–1047, 2006.

[24] L. Karp, I. Lee, and R. Mason, “A global game with strategic substitutes and complements,” Games Econ. Behav., vol. 60, no. 1, pp. 155–175, 2007.

[25] V. Krishnamurthy, “Decentralized spectrum access amongst cognitive radios: An interacting multivariate global game-theoretic approach,” IEEE Trans. Signal Processing, vol. 57, no. 10, pp. 3999–4013, Oct. 2009.

[26] H. Carlsson and E. van Damme, “Global games and equilibrium selection,” Econometrica, vol. 61, no. 5, pp. 989–1018, Sept. 1993.

[27] S. Morris and H. Shin, “Global games: Theory and applications,” in Advances in Economic Theory and Econometrics: Proceedings of the Eighth World Congress of the Econometric Society. Cambridge, U.K.: Cambridge Univ. Press, 2000.

[28] F. Zhao, J. Liu, J. Liu, L. Guibas, and J. Reich, “Collaborative signal and information processing: An information-directed approach,” Proc. IEEE, vol. 91, no. 8, Aug. 2003.

[29] P. Milgrom and R. Weber, “Distributional strategies for games with incomplete information,” Math. Oper. Res., vol. 10, no. 4, pp. 619–632, 1985.

[30] A. Mas-Colell, M. Whinston, and J. Green, Microeconomic Theory. New York: Oxford Univ. Press, 1995.

[31] X. Liu, E. Chong, and N. Shroff, “Opportunistic transmission scheduling with resource sharing constraints in wireless networks,” IEEE J. Select. Areas Commun., vol. 19, no. 10, pp. 2053–2064, Oct. 2001.

[32] A. Muller and D. Stoyan, Comparison Methods for Stochastic Models and Risk. Hoboken, NJ: Wiley, 2002.

[33] S. Karlin and Y. Rinott, “Classes of orderings of measures and related correlation inequalities. I. Multivariate totally positive distributions,” J. Multivariate Anal., vol. 10, no. 4, pp. 467–498, 1980.

[34] G. Angeletos, C. Hellwig, and A. Pavan, “Dynamic global games of regime change: Learning, multiplicity, and the timing of attacks,” Econometrica, vol. 75, no. 3, pp. 711–756, 2007.

[35] A. R. Cassandra, “Exact and approximate algorithms for partially observed Markov decision process,” Ph.D. dissertation, Dept. Comput. Sci., Brown Univ., Providence, RI, 1998.

[36] W. Lovejoy, “Some monotonicity results for partially observed Markov decision processes,” Oper. Res., vol. 35, no. 5, pp. 736–743, Sept.–Oct. 1987.

[37] U. Rieder, “Structural results for partially observed control models,” Methods Models Oper. Res., vol. 35, no. 6, pp. 473–490, 1991.

[38] L. Johnston and V. Krishnamurthy, “Opportunistic file transfer over a fading channel: A POMDP search theory formulation with optimal threshold policies,” IEEE Trans. Wireless Commun., vol. 5, no. 2, pp. 394–405, Feb. 2006.

[39] V. Krishnamurthy and D. Djonin, “Structured threshold policies for dynamic sensor scheduling: A partially observed Markov decision process approach,” IEEE Trans. Signal Processing, vol. 55, no. 10, pp. 4938–4957, Oct. 2007.

[40] H. V. Poor and O. Hadjiliadis, Quickest Detection. Cambridge, U.K.: Cambridge Univ. Press, 2008.

[41] A. Shiryaev, “On optimum methods in quickest detection problems,” Theory Probability Appl., vol. 8, no. 1, pp. 22–46, 1963.

[42] M. Avellaneda and S. Stoikov, “High-frequency trading in a limit order book,” Quantitative Finance, vol. 8, no. 3, pp. 217–224, Apr. 2008.

[43] V. Krishnamurthy and A. Aryan, “Quickest detection of market shocks in agent based models of the order book,” in Proc. 51st IEEE Conf. Decision and Control, Maui, Hawaii, Dec. 2012.

[44] J. Surowiecki, The Wisdom of Crowds. New York: Anchor, 2005.

[45] C. Anderson and G. J. Kilduff, “Why do dominant personalities attain influence in face-to-face groups? The competence-signaling effects of trait dominance,” J. Personality Social Psychol., vol. 96, no. 2, pp. 491–503, 2009.

[46] M. Ottaviani and P. Sørensen, “Information aggregation in debate: Who should speak first?” J. Public Econ., vol. 81, no. 3, pp. 393–421, 2001.

[47] D. Topkis, Supermodularity and Complementarity. Princeton, NJ: Princeton Univ. Press, 1998.

[48] S. Athey, “Monotone comparative statics under uncertainty,” Quarterly J. Econ., vol. 117, no. 1, pp. 187–223, 2002.

[49] V. Krishnamurthy and M. Hamdi, “Mis-information removal in social networks: Dynamic constrained estimation on directed acyclic graphs,” IEEE J. Select. Topics Signal Processing, preprint, May 2013.

[50] A. Dimakis, S. Kar, J. Moura, M. Rabbat, and A. Scaglione, “Gossip algorithms for distributed signal processing,” Proc. IEEE, vol. 98, no. 11, pp. 1847–1864, 2010.

[51] R. J. Aumann, “Correlated equilibrium as an expression of Bayesian rationality,” Econometrica, vol. 55, no. 1, pp. 1–18, 1987.

[52] S. Hart and A. Mas-Colell, “A reinforcement procedure leading to correlated equilibrium,” in Economic Essays: A Festschrift for Werner Hildenbrand, G. Debreu, W. Neuefeind, and W. Trockel, Eds. New York: Springer, 2001, pp. 181–200.

[53] R. Adams. (2004, Aug. 7). Book review of The Wisdom of Crowds, The Guardian. [SP]