TRANSCRIPT
Applying Signal Detection Theory to Multi-Level Modeling: When "Accuracy" Isn't Always Accurate
December 5th, 2011
Scott Fraundorf & Jason Finley
Outline
• Why do we need SDT?
• Sensitivity vs. Response Bias
• Example SDT Analyses
• Terminology & Theory
• Logit & probit
• Extensions
Categorical Decisions
• Lots of paradigms involve asking participants to make a categorical decision
– Could be explicit or implicit
– Here, we're focusing on cases where there are just two categories … but this can generalize to >2
Some Categorical Decisions
Did Anna dress the baby? (D) Yes (K) No
The cop saw the spy with the binoculars.
"The coach knew (that) you missed practice."
Comprehension questions
Assigning a meaning to a novel word like “bouba”
Choosing to include optional words like "that"
Baby looking at 1 screen or another
Interpreting an ambiguous sentence
Choosing referent in perspective-taking task
Some Categorical Decisions
VIKING (1) Seen (4) New
VIKING (1) Male talker (4) Female talker
Recognition memory – did you see this word or face in the study list?
Source memory – was this word said by a male or female talker?
Detecting whether or not a faint signal is present
Did something change between the two displays?
What is Signal Detection Theory?
For experiments with categorical judgments
– Part method for analyzing judgments
– Part theory about how people make judgments
Originally developed for psychophysics
– Operators trying to detect radar signals amidst noise
Purpose:
– Better metric properties than ANOVA on proportions (logistic regression has already taken care of this for us)
– Distinguish sensitivity from response bias
Outline
• Why do we need SDT?
• Sensitivity vs. Response Bias
• Example SDT Analyses
• Terminology & Theory
• Logit & probit
• Extensions
Study: POTATO, SLEEP, RACCOON, WITCH, NAPKIN, BINDER
Test: SLEEP, POTATO, BINDER, WITCH, RACCOON, NAPKIN
Early recognition memory experiments: Study a list of words, then see the same list again. Circle the ones you remember.
Problem: People might realize that all of these are words they studied! Could circle them all even if they don't really remember them.
Study: POTATO, SLEEP, RACCOON, WITCH, NAPKIN, BINDER
Test: POTATO, HEDGE, WITCH, BINDER, SHELL, SLEEP, MONKEY, OATH
Later experiments: Add foils or lures that aren't words you studied.
Here we see someone circled half of the studied words … but they circled half of the lures, too. No real ability to tell apart new & old items. They're just circling 50% of everything.
Study: POTATO, SLEEP, RACCOON, WITCH, NAPKIN, BINDER
Test: POTATO, HEDGE, WITCH, BINDER, SHELL, SLEEP, MONKEY, OATH
What we want is a measure that examines correct endorsements relative to incorrect endorsements … and is not influenced by an overall bias to circle things
Sensitivity vs. Response Bias
"C is the most common answer in multiple choice exams" – response bias
Knowing which answers are C and which aren't – sensitivity (or discrimination)
Sensitivity vs. Response Bias
Imagine asking L2 learners of English to judge grammaticality...
It appears that our participants are more accurate at accepting grammatical items than rejecting ungrammatical ones...

Group A                  ACCURACY    SAID "GRAMMATICAL"
Grammatical condition      70%             70%
Ungrammatical cond.        30%             70%
Sensitivity vs. Response Bias
But, really, they are just judging all sentences as “grammatical” 70% of the time – a response bias
No evidence they're showing sensitivity to grammaticality here
“Accuracy” can confound these 2 influences
Group A                  ACCURACY    SAID "GRAMMATICAL"
Grammatical condition      70%             70%
Ungrammatical cond.        30%             70%
Sensitivity vs. Response Bias
Now imagine we have speakers of two different first languages...

Group A                  ACCURACY    SAID "GRAMMATICAL"
Grammatical condition      70%             70%
Ungrammatical cond.        30%             70%

Group B                  ACCURACY
Grammatical condition      60%
Ungrammatical cond.        40%
Sensitivity vs. Response Bias
It looks like Group B is better at rejecting ungrammatical sentences... But the groups just have different biases

Group A                  ACCURACY    SAID "GRAMMATICAL"
Grammatical condition      70%             70%
Ungrammatical cond.        30%             70%

Group B                  ACCURACY    SAID "GRAMMATICAL"
Grammatical condition      60%             60%
Ungrammatical cond.        40%             60%
Sensitivity vs. Response Bias
This would be particularly misleading if we only looked at ungrammatical items. No way to distinguish response bias vs. sensitivity in that case!

                                 ACCURACY    SAID "GRAMMATICAL"
Group A – Ungrammatical cond.      30%             70%
Group B – Ungrammatical cond.      40%             60%
Sensitivity vs. Response Bias
We see participants can give the "right" answer without really knowing it.
Comparisons to "chance" attempt to deal with this.
But "chance" = 50% assumes both responses are equally likely
– Probably not true for, e.g., attachment ambiguities
– People have an overall bias to answer questions with "yes"
Sensitivity vs. Response Bias
Common to balance the frequency of intended responses
– e.g. 50% true statements, 50% false
But bias may still exist for other reasons
– Prior frequency
• e.g. low attachments are more common in English than high attachments … might create a bias even if they're equally common in your experiment
– Motivational factors (e.g., one error is "less bad" than another)
• Better to suggest a healthy patient undergo additional screening for a disease than to miss someone with the disease
Outline
• Why do we need SDT?
• Sensitivity vs. Response Bias
• Example SDT Analyses
• Terminology & Theory
• Logit & probit
• Extensions
Fraundorf, Watson, & Benjamin (2010)
Hear recorded discourse:
"Both the British and the French biologists had been searching Malaysia and Indonesia for the endangered monkeys. Finally, the British spotted one of the monkeys in Malaysia and planted a radio tag on it."
Presentational or contrastive pitch accent?
Then, later, get a true/false memory test:
"The British scientists spotted the endangered monkey and tagged it." (D) TRUE (K) FALSE
"The French scientists spotted the endangered monkey and tagged it." (D) TRUE (K) FALSE
N.B. Actual experiment had multiple types of false probes … an important part of the actual experiment, but not needed for this demonstration
SDT & Multi-Level Models
Traditional logistic regression model:
Accuracy of Response = Probe Type x Pitch Accent
(DV: CORRECT MEMORY or INCORRECT MEMORY)
Accuracy confounds sensitivity and response bias
– Manipulation might just make you say "true" to everything more
SDT & Multi-Level Models
Traditional logistic regression model:
Accuracy of Response = Probe Type x Pitch Accent
(DV: CORRECT MEMORY or INCORRECT MEMORY)
SDT model:
Response Made = Probe Type x Pitch Accent
(DV: JUDGED TRUE vs. JUDGED FALSE, JUDGED GRAMMATICAL vs. JUDGED UNGRAMMATICAL, etc.)
SDT model involves changing the way your DV is parameterized:
"Respond correctly or respond incorrectly?" becomes "True statement or false statement?"
This better reflects the actual judgment we are asking participants to make.
They are deciding whether to say "this is true" or "this is false" … not whether to respond accurately or respond inaccurately.
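A minimal sketch in R of what this reparameterization looks like at the data level, using simulated data and hypothetical variable names (dat, probe_true, said_true are illustrative, not the original analysis):

    # Hypothetical trial-level data: one row per trial
    set.seed(1)
    dat <- expand.grid(subject = factor(1:20), item = factor(1:16))
    dat$probe_true <- rep(c(0, 1), length.out = nrow(dat))    # is the probe actually true?
    dat$accent     <- rep(c("presentational", "contrastive"), each = nrow(dat) / 2)
    dat$said_true  <- rbinom(nrow(dat), 1, ifelse(dat$probe_true == 1, 0.7, 0.4))

    # Traditional DV: accuracy (did the response match the truth?)
    dat$accuracy <- as.numeric(dat$said_true == dat$probe_true)

    # SDT DV: the response itself; probe type becomes a predictor instead
    head(dat[, c("probe_true", "said_true", "accuracy")])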
SDT & Multi-Level Models
SDT model (w/ centered predictors):
Said "TRUE" = Intercept + Actually is TRUE
– Intercept: baseline rate of responding TRUE (response bias)
– Actually is TRUE: does the item being true make you more likely to say TRUE? (sensitivity)
At this point, we haven't looked at any differences between conditions (e.g. contrastive vs. presentational accent, or L1 vs. L2). We are just analyzing overall performance.
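A sketch of this baseline model in lme4, continuing the hypothetical data above (glmer() here; older lme4 versions used lmer() with a family argument):

    library(lme4)

    # Center the probe-type predictor so the intercept is the overall bias
    dat$is_true_c <- ifelse(dat$probe_true == 1, 0.5, -0.5)

    # Intercept  = response bias (overall tendency to say TRUE)
    # is_true_c  = sensitivity (more TRUE responses for true probes?)
    m_baseline <- glmer(said_true ~ is_true_c + (1 | subject) + (1 | item),
                        data = dat, family = binomial)
    summary(m_baseline)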
SDT & Multi-Level Models
SDT model (w/ centered predictors):
Said "TRUE" = Intercept + Actually is TRUE + Contrastive Accent + Accent x TRUE
– Intercept: baseline rate of responding TRUE (response bias)
– Actually is TRUE: does the item being true make you more likely to say TRUE? (sensitivity)
– Contrastive Accent: does contrastive accent change the overall rate of saying TRUE? (effect on bias)
– Accent x TRUE: does accent especially increase TRUE responses to true items? (effect on sensitivity)
SDT & Multi-Level Models
SDT model (* marks a reliable effect):
Said "TRUE" = Intercept + Actually is TRUE* + Contrastive Accent + Accent x TRUE*
– Intercept: baseline rate of responding TRUE (response bias)
– Actually is TRUE*: does the item being true make you more likely to say TRUE? (sensitivity)
– Contrastive Accent: does contrastive accent change the overall rate of saying TRUE? (effect on bias)
– Accent x TRUE*: does accent especially increase TRUE responses to true items? (effect on sensitivity)
Contrastive accent improves actual sensitivity. No effect on response bias.
SDT & Multi-Level Models
SDT model:
Said "TRUE" = Intercept + Actually is TRUE* + Contrastive Accent + Accent x TRUE*
(response bias + sensitivity + effect on bias + effect on sensitivity)
General heuristic:
– Effects that don't interact with item type = effects on bias
– Effects that do involve item type = effects on sensitivity
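The full condition model, sketched with the same hypothetical variables (an illustration of the heuristic, not the authors' actual code):

    # Center the accent predictor as well
    dat$accent_c <- ifelse(dat$accent == "contrastive", 0.5, -0.5)

    # is_true_c           -> sensitivity
    # accent_c            -> effect of accent on bias (no interaction with item type)
    # is_true_c:accent_c  -> effect of accent on sensitivity (interacts with item type)
    m_full <- glmer(said_true ~ is_true_c * accent_c + (1 | subject) + (1 | item),
                    data = dat, family = binomial)
    summary(m_full)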
Ferreira & Dell (2000)
• Are people sensitive to ambiguity in language production?
• Ambiguous: "The coach knew you...."
– "The coach knew you for a long time."
• Here, you is the direct object of knew
– "The coach knew you missed practice."
• Here, you is actually the subject of an embedded sentence (you missed practice). Confusing if you were expecting a direct object!
• "The coach knew that you missed practice."
– Including that avoids this ambiguity
Ferreira & Dell (2000)
• Task: Read & recall sentences
• Ambiguous: "The coach knew (that) you...."
• Unambiguous: "The coach knew (that) I..."
– This has to be a sentential complement (it would be "The coach knew me" if it were a direct object)
• Will people produce "that" in the ambiguous conditions?
– Especially if task instructions emphasize being clear?
SDT & Multi-Level Models
SDT model:
Said "that" = Intercept + Ambiguity + Instructions + Instructions x Ambiguity
– Intercept: baseline rate of producing "that" (response bias)
– Ambiguity: do people produce "that" more for you (ambig.) than I (unambig.)? (sensitivity)
– Instructions: are people told to avoid ambiguity? (effect on bias)
– Instructions x Ambiguity: do instructions especially increase use of "that" for ambiguous items? (effect on sensitivity)
SDT & Multi-Level Models
SDT model:
Said "that" = Intercept + Ambiguity
– Intercept: baseline rate of producing "that" (response bias)
– Ambiguity: do people produce "that" more for you (ambig.) than I (unambig.)? (sensitivity)
People show little sensitivity to ambiguity!
SDT & Multi-Level Models
Said "that" = Intercept + Ambiguity + Instructions + Instructions x Ambiguity
– Intercept: baseline rate of producing "that" (response bias)
– Ambiguity: do people produce "that" more for you (ambig.) than I (unambig.)? (sensitivity)
– Instructions: are people told to avoid ambiguity? (effect on bias)
– Instructions x Ambiguity: do instructions especially increase use of "that" for ambiguous items? (effect on sensitivity)
Instructions to be clear don't increase sensitivity to ambiguity. They just increase the overall rate of "that," in all conditions. An interesting response bias effect that tells us about participants' strategy! People try to increase clarity by inserting the complementizer everywhere.
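The same structure sketched for this design, with hypothetical simulated data and variable names (lme4 loaded as in the earlier sketches):

    # Hypothetical data: did the speaker produce "that", by ambiguity & instructions
    prod <- expand.grid(subject = factor(1:24), item = factor(1:12))
    prod$ambiguity_c    <- rep(c(-0.5, 0.5), length.out = nrow(prod))   # I vs. you
    prod$instructions_c <- ifelse(as.numeric(prod$subject) <= 12, -0.5, 0.5)
    prod$said_that      <- rbinom(nrow(prod), 1, 0.5)

    # Same SDT logic: ambiguity = sensitivity, instructions = bias,
    # ambiguity x instructions = effect of instructions on sensitivity
    m_that <- glmer(said_that ~ ambiguity_c * instructions_c +
                      (1 | subject) + (1 | item),
                    data = prod, family = binomial)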
Other Designs
Imagine if critical comprehension questions should all be answered "yes"
Not possible to tease apart sensitivity & response bias
– Could say "yes" 85% of the time because you knew the correct answer 85% of the time (sensitivity)
– But maybe you would just say "yes" to 85% of everything (response bias)
– Like the memory test that only has studied words
Would need "no" probes to do an SDT analysis
A common design in psycholinguistics … but limited in the conclusions we can draw from it
Other Designs
This concern holds even if we manipulate some other variable...
The priming manipulation...
– Might improve sensitivity at realizing these questions should be answered "yes"
– Might just increase bias to say "yes"
Again, we would need some "no" questions to distinguish these hypotheses

                                              RATE OF "YES" RESPONSES
"Yes" comprehension questions – NOT PRIMED              81%
"Yes" comprehension questions – PRIMED                  93%
Outline
• Why do we need SDT?
• Sensitivity vs. Response Bias
• Example SDT Analyses
• Terminology & Theory
• Logit & probit
• Extensions
(These slides courtesy of Jason Finley)
Signal Detection Performance
For each trial:
                    Respond "yes"                 Respond "no"
Signal trial        Hit                           Miss (Type II error in stats)
Noise trial         False Alarm (Type I error)    Correct Rejection

Summary statistics:
Hit Rate (HR) = # Hits / # Signal Trials
False Alarm Rate (FAR) = # False Alarms / # Noise Trials
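A sketch of these summary statistics computed from trial-level data, reusing the hypothetical memory data from the earlier sketches (true probes play the role of signal trials; responding TRUE is a "yes"):

    # Hit rate: proportion of "yes" responses on signal trials
    # False alarm rate: proportion of "yes" responses on noise trials
    hit_rate <- mean(dat$said_true[dat$probe_true == 1])
    fa_rate  <- mean(dat$said_true[dat$probe_true == 0])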
Theory
1. Trials = events
2. Strength of evidence: continuous dimension
3. Conditional probability distributions for noise, signal
4. Decision/response criterion
5. Evidence has an arbitrary scale. By convention, the noise distribution has mean 0, variance 1
Respond "yes" if evidence is above the criterion; respond "no" if evidence is below the criterion.
(Figure: NOISE-trial and SIGNAL-trial evidence distributions with the decision criterion)
Correct Rejection: noise trial, and evidence is below the criterion
False Alarm: noise trial, but evidence is above the criterion
Miss: signal trial, but evidence is below the criterion
Hit: signal trial, and evidence is above the criterion
Response Bias
The optimal criterion depends on:
– signal probability
– payoff structure
Response Bias
A high criterion will increase correct rejections, but also increase misses
(The kind of criterion intended in null hypothesis significance testing)
Response Bias
A low criterion will increase hits, but also false alarms
Good if misses are especially bad (medical screening example)
Response Bias
The c parameter describes the location of the criterion:
c = -0.5 [z(HR) + z(FAR)]
Sensitivity
The traditional SDT measure of sensitivity is d' … measuring the distance between the peaks of the noise and signal distributions:
d' = z(HR) – z(FAR)
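Continuing the earlier sketch, both statistics follow directly from the hit and false alarm rates, with qnorm() as the z-transform:

    # qnorm() is the z-transform (inverse of the normal CDF)
    d_prime   <- qnorm(hit_rate) - qnorm(fa_rate)            # sensitivity
    criterion <- -0.5 * (qnorm(hit_rate) + qnorm(fa_rate))   # response bias (c)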
Lower Sensitivity vs. Higher Sensitivity
Sensitivity is a result of:
– external factors
– internal factors
Outline
• Why do we need SDT?
• Sensitivity vs. Response Bias
• Example SDT Analyses
• Terminology & Theory
• Logit & probit
• Extensions
Probit vs. Logit
How to make a binomial response continuous?
Logit = log odds
– ln OR = log([p(hit)/p(miss)] / [p(FA)/p(CR)])
Probit = cumulative distribution function (CDF) of the normal distribution
– d' = z[p(hit)] – z[p(FA)]
– The CDF at x is the area under the curve from -Inf to x
Very similar!
– Probit changes more quickly in the middle of the distribution, more slowly at the tails
– Logit has a somewhat easier interpretation (can convert to odds / odds ratios)
– Probably, you will get qualitatively similar results with both
– Could try both & see which fits your dataset better
– Literatures differ in which is used more commonly
(Figures: probit and logit curves, from http://www.indiana.edu/~statmath/stat/all/cdvm/cdvm1.html)
Can pick one or the other (the link is specified inside the family argument):
– lmer(Y ~ X, family = binomial(link = "logit"))
  • The logit link is the default, used if you don't specify one
– lmer(Y ~ X, family = binomial(link = "probit"))
(In current versions of lme4, binomial models are fit with glmer() rather than lmer().)
Both the logit and probit are undefined if you have a probability of 0 or 1 in a cell
– Can apply some type of adjustment (e.g. the empirical logit)
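One common form of that adjustment, sketched here (the 0.5 constant is a conventional choice, not something specified in these slides):

    # Empirical logit: add 0.5 to each cell count so the log odds stay defined
    # even when a participant has 0 hits or 0 misses in a condition
    emp_logit <- function(successes, failures) {
      log((successes + 0.5) / (failures + 0.5))
    }

    emp_logit(successes = 10, failures = 0)  # finite, unlike log(10/0)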
Outline
• Why do we need SDT?
• Sensitivity vs. Response Bias
• Example SDT Analyses
• Terminology & Theory
• Logit & probit
• Extensions
Extensions
Generalizes to > 2 ordered categories:
– Traditionally, collapse over participants or items & use d_a
– MLM needs a multinomial model, currently available in SAS but not R
Extensions
Variance in parameters over participants, items
– e.g. different sensitivity, different response bias
– Captured by random slopes (see the sketch below)
– Likely that such variance exists!
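A sketch of such a model, again with the hypothetical variables from the earlier sketches (random slopes for the item-type predictor let sensitivity, not just bias, vary by participant and item):

    # Random intercepts: participants and items differ in overall bias
    # Random slopes for is_true_c: participants and items differ in sensitivity
    m_slopes <- glmer(said_true ~ is_true_c * accent_c +
                        (1 + is_true_c | subject) +
                        (1 + is_true_c | item),
                      data = dat, family = binomial)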
Extensions
Unequal variance
– So far, the variability of the response is a constant: d' = B0 + B1X1 + (1|Subject) + ε
– Definitely not true for recognition memory (although there is lots of debate about why this is)
Noisy criterion (Benjamin, Diaz, & Wee, 2009)
– Typically, all error is in the response (evidence)
– But the criterion could vary from trial to trial
(Figure from Wixted, 2002)
N.B. This is definitely not the only account of this difference! (see Yonelinas, 2002)