models of discrete choice - stanford university · 2006-04-16 · logic of the discriminal process...
TRANSCRIPT
Mathematical Theory Likelihood References
Models of Discrete Choice
Jonathan Wand
Polisci 350CStanford University
April 10, 2006 (revised)
Mathematical Theory Likelihood References
Theoretical foundations of Choice models
Two main approaches to deriving models of discrete choice:
1. Discriminal process (Thurstone, 1927)Most often associated with models of choice based onnormal distributions (probit)
2. Axiomatic derivation (Luce, 1959)Most often associated with models of choice based onlogistic distribution (logit)
Mathematical Theory Likelihood References
Thurstone: discriminal process
Logic of the discriminal process (abstract)
1. Assume a metric for making comparisons across items.Thurstone deals with a single attribute, posits apsychological continuum
2. Assume that attributes of items have a random component.Thurstone used Normal distribution, but acknowledgedarbitrariness.
3. Assume that positive difference between two items leadsto the selection of the item with the higher value.A relative comparison, requiring only simple inequalitystatements.
Mathematical Theory Likelihood References
Thurstone: discriminal process
Logic of the discriminal process (less abstract)
1. The utility for item i is defined as ui .2. The utility can be decomposed into two parts,
• a systematic (observed) component µi• a random (unobserved) component εi ∼ F (0, σ2)
Assume that the overall utility of an item is the sum of theobserved and unobserved components ui = µi + εi .
3. For a pair of items, choose item j over k if uj > uk .
Mathematical Theory Likelihood References
Thurstone: discriminal process
General binary choice
P(j , k) = P(uj > uk )
= P(µj + εj > µk + εk )
= P(µj − µk + εj > εk )
= P(µj − µk > εk − εj)
Mathematical Theory Likelihood References
Notational aside
Define the standard Normal(0,1) pdf and cdf as, respectively,
φ(ε) =1√2π
exp{−ε2
2
}
Φ(ε) =
∫ ε
−∞φ(z)∂z
Define the Type I extreme value pdf and cdf as, respectively,
λ(ε) = e−ε exp{−e−ε}
Λ(ε) = exp{−e−ε}
Mathematical Theory Likelihood References
Thurstone: discriminal processCase 1. Assume F (0, σ2
i ) is a Normal distribution
Recall, for independent normal distributions
N (µi , σ2i )−N (µj , σ
2j ) = N (µi − µj , σ
2i + σ2
j )
So,
P(j , k) = P(µj − µk > εk − εj)
=
∫ µj−µk
−∞
1√2π(σ2
j + σ2k )
exp
− z2
2√
σ2j + σ2
k
∂z
= Φ
µj − µk√σ2
j + σ2k
Mathematical Theory Likelihood References
Thurstone: discriminal process
Given,
P(j , k) = Φ
µj − µk√σ2
j + σ2k
and µi = xiγi , consider what can be identified?
If xj = xk = x , then
P(j , k) = Φ
xγj − γk√σ2
j + σ2k
= Φ(xβ)
where β =γj−γkqσ2
j +σ2k
.
Mathematical Theory Likelihood References
Thurstone: discriminal process
Given,
P(j , k) = Φ
µj − µk√σ2
j + σ2k
and µi = xiγi , consider what can be identified?
If γj = γk = γ, then
P(j , k) = Φ
(xj − xk )γ√
σ2j + σ2
k
= Φ((xj − xk )β
)where β = γq
σ2j +σ2
k
.
Mathematical Theory Likelihood References
Thurstone: discriminal processCase 2. Assume F is a Type I extreme value distribution
P(j , k) = P(uj > uk )
= P(µj − µk + εj > εk )
=
∫ ∞
−∞λ(εj)
∫ µj−µk+εj
−∞λ(εk )
=
∫ ∞
−∞λ(εj)Λ(µj − µk + εj)
=
∫ ∞
−∞λ(εj)Λ(µj − µk + εj)
=1w
∫ ∞
−∞−e−εj w exp{−e−εj w}
=1w
=1
1 + exp{−(µ1 − µ2)}
Mathematical Theory Likelihood References
Thurstone: discriminal processGiven,
P(j , k) =1
1 + exp{−(µ1 − µ2)}
and µi = xiγi , consider what can be identified?
If xj = xk = x , then
P(j , k) =1
1 + exp{−x(γ1 − γ2)}=
11 + exp{−xβ}
where β = γ1 − γ2.
If γj = γk = γ, then
P(j , k) =1
1 + exp{−β(x1 − x2)}
where β = γ.
Mathematical Theory Likelihood References
Axiomatic Foundations of Choice Models
Assumptions
D1 Let R ⊂ S ⊂ T ⊂ U.
D2 Let x , y , z ∈ T , arbitrary elements of choice set.
D3 Let P(x , y) be the probability of choosing x instead of y ,0 < P(x , y) < 1.
D4 PS(R) is the probability of choosing R given choice fromamong alternatives in S.
Choice Axiom
(i) PT (R) = PS(R)PT (S)
(ii) If P(x , y) = 0 for some x , y ∈ T , PT (S) = PT−{x}(S − {x})
Mathematical Theory Likelihood References
Axiomatic Foundations of Choice Models
Axiom of Choice
• Defines relationship which defines how choices withinsubsets are related in the context of an individual makingprobabilistic choices.
• Can be rewritten as PT (R | S)PT (S) = PT (R)
• Two core implications,Lemma 3: Independence of Irrelevant Alternatives (IIA)Theorem 3: Probability must satisfy a ratio scale
Mathematical Theory Likelihood References
Axiomatic Foundations of Choice ModelsLemma 3 (Independence from irrelevant alternatives):For x , y ∈ S,
P(x , y)
P(y , x)=
PS(x)
PS(y)
Proof:By Axiom we have
PS(x) = P(x , y)[PS(x) + PS(y)]
So
PS(x) = P(x , y)[PS(x) + PS(y)]
PS(x) = P(x , y)PS(x) + P(x , y)PS(y)
(1− P(x , y))PS(x) = P(x , y)PS(y)
P(y , x)PS(x) = P(x , y)PS(y)
P(x , y)
P(y , x)=
PS(x)
PS(y)
Mathematical Theory Likelihood References
Axiomatic Foundations of Choice Models
Lemma 3 (Independence from irrelevant alternatives):
• Most famous implication: relative probability of choosingtwo alternatives is invariant to the composition of the largerset of alternatives.
• Again, only ratio that is invariant not probabilitiesthemselves
• Might also hear that log-odds of two choices are constant:log(PS(x))− log(PS(y)) = c.
• Can estimate parameters defining utility of choices evenwith only a subset. Not possible if IIA does not hold (e.g.,correlation between utilities of choices)—then to estimateany choice must model all choices.
Mathematical Theory Likelihood References
Axiomatic Foundations of Choice ModelsTheorem 3: choice probability is ratio scale∃v : T → <+, unique up to multiplication by k > 0, such that
PS(x) =v(x)∑
y∈S v(y)=
11 +
∑y∈S−{x} v(y)/v(x)
McFadden and Yellot have each pointed out the connection tologit models by setting v(x) = ex .
E.g.,
P(x , y) =v(x)
v(x) + v(y)=
eµx
eµx + eµy=
11 + eµy /eµx
=1
1 + e−(µx−µy )
Yellot (1977) shows that discriminal process based on Type Idiscrete value distribution is uniquely equivalent to ChoiceAxiom.
Mathematical Theory Likelihood References
Likelihood for a dichotomous choiceConsider a Bernoulli process for observing y ∈ {0, 1}, whereP(y = 1) = p, and therefore P(y = 0) = 1− p.For an individual i , the likelihood of observing yi is
Li = pyi (1− p)1−yi (1)
The log-likelihood,
Li = yi log(p) + (1− yi) log(1− p) (2)
The score of the likelihood
Li
p=
yi
p+
1− yi
1− p(−1) =
yi − pp(1− p)
=yi − E(yi)
V (yi)(3)
Here, p is a single parameter for all people, but same setup formore general parameterization.
Mathematical Theory Likelihood References
Likelihood for a dichotomous choice
Normally, would like p to be a function of covariates xi .
Therefore we seek a mapping: xi → P(yi | xi).
In general, will need to specify two things:1. Index function: a(x , β) that maps k-vector to scalar
1.1 e.g., additive-linear: a(xi , β) = xiβ
1.2 e.g., non-linear a(xi , β) = xβ11 + x2β2
2. Transformation function F() with properties,2.1 F (−∞) = 02.2 F (∞) = 12.3 ∂F (z)/∂z = f (z) > 0
i.e., monotonic function that maps real line to unit interval
Mathematical Theory Likelihood References
Likelihood for a dichotomous choice
Standard parameterizations for k-vector of covariates xi ,
1. Probit:1.1 a(x , β) = xβ
1.2 F (xβ) =∫ xβ
−∞1√2π
exp{− z2
2 } =∫ xβ
−∞ φ(z) = Φ(xβ)
1.3 Li = yi log(Φ(xiβ)) + (1− yi) log(1− Φ(xiβ))
2. Logit2.1 a(x , γ) = xγ2.2 F (xγ) = 1
1+exp{−xγ} = Λ(xγ)
2.3 Li = yi log(Λ(xiγ)) + (1− yi) log(1− Λ(xiγ))
Mathematical Theory Likelihood References
Logit and Probit, CDF
−10 −5 0 5 10
0.0
0.2
0.4
0.6
0.8
1.0
beta = (0,1)
x
P( y
=1 |
x )
Mathematical Theory Likelihood References
Logit and Cauchit (t), CDF
−10 −5 0 5 10
0.0
0.2
0.4
0.6
0.8
1.0
beta = (0,1)
x
P( y
=1 |
x )
Mathematical Theory Likelihood References
Likelihood for a dichotomous choice
Let pi = P(yi | xi) = F (a(xi , β)).
Recall the log-likelihood,
Li = yi log(p) + (1− yi) log(1− p)
and re-write the score of the log-likelihood,
Si(β) =∂Li
∂β=
∂Li
∂pi
∂pi
∂β=
yi − pi
pi(1− pi)
∂F (a(xiβ)
∂β
∂a(xiβ)
∂β
Si(β) =∂Li
∂β= (yi − Fi)
fiFi(1− Fi)
xi
FOC to maximize L defines ML β̂, such that∑n
i=1 Si(β̂) = 0
Mathematical Theory Likelihood References
A quick asideLet’s remind ourselves of logic of ML using linear model.
Li = −12
log(2π)− 12(yi − xiθ)
2
Score function for an individual
Si(θ) =∂Li
∂θ= (yi − xiθ)xi
FOC, in matrix notation,
n∑i=1
Si(θ̂) = X>(Y − Xθ) = X>Y − X>X θ̂ = 0
so we have
θ̂ = (X>X )−1X>Y
Mathematical Theory Likelihood References
Particularly important concepts from this lecture
1. How to derive binary choice from models using normal andextreme value distributions
2. How to interpret the meaning of the coefficients (what isidentified by reduced form parameters?)
3. The (log)likelihood and score of a Bernoulli process
4. How to parametrize a logit and probit (log) likelihood, andderivation of score
5. How the logit and probit differ (i) in motivation (2) in tailbehavior
Mathematical Theory Likelihood References
Luce, R. Duncan. 1959. Individual choice behavior: atheoretical analysis. NY: Wiley.
Thurstone, L. L. 1927. A Law of Comparative Judgement.Pychological Review , 34:272–86.URL http://www.psycinfo.com/library/display.cfm?uid=1994-28135-001\&format=pdf
Yellot, John I. Jr. 1977. The Relationship between Luce’sAxiom, Thurstone’s Theory of Comparative Judgment, andthe Double Exponential Distribution. Journal of MathematicalPsychology , 15:109–144.