models of discrete choice - stanford university · 2006-04-16 · logic of the discriminal process...

Mathematical Theory Likelihood References

Models of Discrete Choice

Jonathan Wand

Polisci 350CStanford University

April 10, 2006 (revised)


Theoretical foundations of Choice models

Two main approaches to deriving models of discrete choice:

1. Discriminal process (Thurstone, 1927)Most often associated with models of choice based onnormal distributions (probit)

2. Axiomatic derivation (Luce, 1959)Most often associated with models of choice based onlogistic distribution (logit)


Thurstone: discriminal process

Logic of the discriminal process (abstract)

1. Assume a metric for making comparisons across items.Thurstone deals with a single attribute, posits apsychological continuum

2. Assume that attributes of items have a random component.Thurstone used Normal distribution, but acknowledgedarbitrariness.

3. Assume that positive difference between two items leadsto the selection of the item with the higher value.A relative comparison, requiring only simple inequalitystatements.



Logic of the discriminal process (less abstract)

1. The utility for item i is defined as ui .2. The utility can be decomposed into two parts,

• a systematic (observed) component µi• a random (unobserved) component εi ∼ F (0, σ2)

Assume that the overall utility of an item is the sum of theobserved and unobserved components ui = µi + εi .

3. For a pair of items, choose item j over k if uj > uk .



General binary choice

P(j , k) = P(uj > uk )

= P(µj + εj > µk + εk )

= P(µj − µk + εj > εk )

= P(µj − µk > εk − εj)


Notational aside

Define the standard Normal(0,1) pdf and cdf as, respectively,

φ(ε) =1√2π

exp{−ε2

2

}

Φ(ε) =

∫ ε

−∞φ(z)∂z

Define the Type I extreme value pdf and cdf as, respectively,

λ(ε) = e−ε exp{−e−ε}

Λ(ε) = exp{−e−ε}


Thurstone: discriminal processCase 1. Assume F (0, σ2

i ) is a Normal distribution

Recall, for independent normal distributions

N (µi , σ2i )−N (µj , σ

2j ) = N (µi − µj , σ

2i + σ2

j )

So,

P(j , k) = P(µj − µk > εk − εj)

=

∫ µj−µk

−∞

1√2π(σ2

j + σ2k )

exp

− z2

2√

σ2j + σ2

k

∂z

= Φ

µj − µk√σ2

j + σ2k



Given,

P(j , k) = Φ

µj − µk√σ2

j + σ2k

and µi = xiγi , consider what can be identified?

If xj = xk = x , then

P(j , k) = Φ

xγj − γk√σ2

j + σ2k

= Φ(xβ)

where β =γj−γkqσ2

j +σ2k

.



Given,

P(j , k) = Φ

µj − µk√σ2

j + σ2k


If γj = γk = γ, then

P(j , k) = Φ

(xj − xk )γ√

σ2j + σ2

k

= Φ((xj − xk )β

)where β = γq

σ2j +σ2

k

.


Thurstone: discriminal processCase 2. Assume F is a Type I extreme value distribution

P(j , k) = P(uj > uk )

= P(µj − µk + εj > εk )

=

∫ ∞

−∞λ(εj)

∫ µj−µk+εj

−∞λ(εk )

=

∫ ∞

−∞λ(εj)Λ(µj − µk + εj)

=

∫ ∞

−∞λ(εj)Λ(µj − µk + εj)

=1w

∫ ∞

−∞−e−εj w exp{−e−εj w}

=1w

=1

1 + exp{−(µ1 − µ2)}


Thurstone: discriminal processGiven,

P(j , k) =1

1 + exp{−(µ1 − µ2)}


If xj = xk = x , then

P(j , k) =1

1 + exp{−x(γ1 − γ2)}=

11 + exp{−xβ}

where β = γ1 − γ2.

If γj = γk = γ, then

P(j , k) =1

1 + exp{−β(x1 − x2)}

where β = γ.


Axiomatic Foundations of Choice Models

Assumptions

D1 Let R ⊂ S ⊂ T ⊂ U.

D2 Let x , y , z ∈ T , arbitrary elements of choice set.

D3 Let P(x , y) be the probability of choosing x instead of y ,0 < P(x , y) < 1.

D4 PS(R) is the probability of choosing R given choice fromamong alternatives in S.

Choice Axiom

(i) PT (R) = PS(R)PT (S)

(ii) If P(x , y) = 0 for some x , y ∈ T , PT (S) = PT−{x}(S − {x})



Axiom of Choice

• Defines relationship which defines how choices withinsubsets are related in the context of an individual makingprobabilistic choices.

• Can be rewritten as PT (R | S)PT (S) = PT (R)

• Two core implications,Lemma 3: Independence of Irrelevant Alternatives (IIA)Theorem 3: Probability must satisfy a ratio scale


Axiomatic Foundations of Choice ModelsLemma 3 (Independence from irrelevant alternatives):For x , y ∈ S,

P(x , y)

P(y , x)=

PS(x)

PS(y)

Proof:By Axiom we have

PS(x) = P(x , y)[PS(x) + PS(y)]

So

PS(x) = P(x , y)[PS(x) + PS(y)]

PS(x) = P(x , y)PS(x) + P(x , y)PS(y)

(1− P(x , y))PS(x) = P(x , y)PS(y)

P(y , x)PS(x) = P(x , y)PS(y)

P(x , y)

P(y , x)=

PS(x)

PS(y)



Lemma 3 (Independence from irrelevant alternatives):

• Most famous implication: relative probability of choosingtwo alternatives is invariant to the composition of the largerset of alternatives.

• Again, only ratio that is invariant not probabilitiesthemselves

• Might also hear that log-odds of two choices are constant:log(PS(x))− log(PS(y)) = c.

• Can estimate parameters defining utility of choices evenwith only a subset. Not possible if IIA does not hold (e.g.,correlation between utilities of choices)—then to estimateany choice must model all choices.


Axiomatic Foundations of Choice ModelsTheorem 3: choice probability is ratio scale∃v : T → <+, unique up to multiplication by k > 0, such that

PS(x) =v(x)∑

y∈S v(y)=

11 +

∑y∈S−{x} v(y)/v(x)

McFadden and Yellot have each pointed out the connection tologit models by setting v(x) = ex .

E.g.,

P(x , y) =v(x)

v(x) + v(y)=

eµx

eµx + eµy=

11 + eµy /eµx

=1

1 + e−(µx−µy )

Yellot (1977) shows that discriminal process based on Type Idiscrete value distribution is uniquely equivalent to ChoiceAxiom.


Likelihood for a dichotomous choiceConsider a Bernoulli process for observing y ∈ {0, 1}, whereP(y = 1) = p, and therefore P(y = 0) = 1− p.For an individual i , the likelihood of observing yi is

Li = pyi (1− p)1−yi (1)

The log-likelihood,

Li = yi log(p) + (1− yi) log(1− p) (2)

The score of the likelihood

Li

p=

yi

p+

1− yi

1− p(−1) =

yi − pp(1− p)

=yi − E(yi)

V (yi)(3)

Here, p is a single parameter for all people, but same setup formore general parameterization.


Likelihood for a dichotomous choice

Normally, would like p to be a function of covariates xi .

Therefore we seek a mapping: xi → P(yi | xi).

In general, will need to specify two things:1. Index function: a(x , β) that maps k-vector to scalar

1.1 e.g., additive-linear: a(xi , β) = xiβ

1.2 e.g., non-linear a(xi , β) = xβ11 + x2β2

2. Transformation function F() with properties,2.1 F (−∞) = 02.2 F (∞) = 12.3 ∂F (z)/∂z = f (z) > 0

i.e., monotonic function that maps real line to unit interval



Standard parameterizations for k-vector of covariates xi ,

1. Probit:1.1 a(x , β) = xβ

1.2 F (xβ) =∫ xβ

−∞1√2π

exp{− z2

2 } =∫ xβ

−∞ φ(z) = Φ(xβ)

1.3 Li = yi log(Φ(xiβ)) + (1− yi) log(1− Φ(xiβ))

2. Logit2.1 a(x , γ) = xγ2.2 F (xγ) = 1

1+exp{−xγ} = Λ(xγ)

2.3 Li = yi log(Λ(xiγ)) + (1− yi) log(1− Λ(xiγ))


Logit and Probit, CDF

−10 −5 0 5 10

0.0

0.2

0.4

0.6

0.8

1.0

beta = (0,1)

x

P( y

=1 |

x )


Logit and Cauchit (t), CDF

−10 −5 0 5 10

0.0

0.2

0.4

0.6

0.8

1.0

beta = (0,1)

x

P( y

=1 |

x )



Let pi = P(yi | xi) = F (a(xi , β)).

Recall the log-likelihood,

Li = yi log(p) + (1− yi) log(1− p)

and re-write the score of the log-likelihood,

Si(β) =∂Li

∂β=

∂Li

∂pi

∂pi

∂β=

yi − pi

pi(1− pi)

∂F (a(xiβ)

∂β

∂a(xiβ)

∂β

Si(β) =∂Li

∂β= (yi − Fi)

fiFi(1− Fi)

xi

FOC to maximize L defines ML β̂, such that∑n

i=1 Si(β̂) = 0


A quick asideLet’s remind ourselves of logic of ML using linear model.

Li = −12

log(2π)− 12(yi − xiθ)

2

Score function for an individual

Si(θ) =∂Li

∂θ= (yi − xiθ)xi

FOC, in matrix notation,

n∑i=1

Si(θ̂) = X>(Y − Xθ) = X>Y − X>X θ̂ = 0

so we have

θ̂ = (X>X )−1X>Y


Particularly important concepts from this lecture

1. How to derive binary choice from models using normal andextreme value distributions

2. How to interpret the meaning of the coefficients (what isidentified by reduced form parameters?)

3. The (log)likelihood and score of a Bernoulli process

4. How to parametrize a logit and probit (log) likelihood, andderivation of score

5. How the logit and probit differ (i) in motivation (2) in tailbehavior


Luce, R. Duncan. 1959. Individual choice behavior: atheoretical analysis. NY: Wiley.

Thurstone, L. L. 1927. A Law of Comparative Judgement.Pychological Review , 34:272–86.URL http://www.psycinfo.com/library/display.cfm?uid=1994-28135-001\&format=pdf

Yellot, John I. Jr. 1977. The Relationship between Luce’sAxiom, Thurstone’s Theory of Comparative Judgment, andthe Double Exponential Distribution. Journal of MathematicalPsychology , 15:109–144.

http://www.psycinfo.com/library/display.cfm?uid=1994-28135-001&format=pdf

http://www.psycinfo.com/library/display.cfm?uid=1994-28135-001&format=pdf

models of discrete choice - stanford university · 2006-04-16 · logic of the discriminal process...

Documents