field reliability of static-99r and diagnosis in wic 6600 ...individual biases and preferences....

Field Reliability of Static-99R and diagnosis in WIC 6600 Evaluations

Joseph Lockhart, PhD, ABPP
Melinda DiCiro, PsyD, ABPP

FMHAC 2020

Thanks to all who contributed to this study over nearly two years!

• Anna Brennan
• Jim Rokop
• Administration
• …and others

Questions the Study was designed to answer

• How consistent are raters in scoring the risk instrument (Static-99R)?
• (The only other SVP field study, in Texas, found poor interrater reliability: Boccaccini et al., 2009)
• What are the most common diagnoses?
• Do the raters show adequate diagnostic agreement? (Other studies found varied kappa values.)
• Are there differences between employees and independent evaluators (IEs) in their ratings?

Why do we care about field reliability?

• “Field studies we would argue provide evidence concerning an instrument’s psychometric properties that is more generalizable to real-world cases specifically because the data were collected under similar circumstances.” (Edens and Boccaccini, 2017)

RELIABILITY & VALIDITY

• Reliability is the degree to which an assessment tool produces stable and consistent results.

• Validity refers to how well a test measures what it is purported to measure

• Why is reliability necessary?
• While reliability is necessary, it alone is not sufficient. For a test to be useful, it also needs to be valid. For example, if your scale is off by 5 lbs, it reads your weight every day with a deficit of 5 lbs. The scale is reliable because it consistently reports the same weight every day, but it is not valid because it subtracts 5 lbs from your true weight.
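The scale example can be made concrete with a few numbers. The sketch below is illustrative only: the true weight, the offset, and the daily noise values are hypothetical, chosen to show a small spread (reliable) alongside a constant error (not valid).

```python
from statistics import mean, pstdev

TRUE_WEIGHT = 150.0   # hypothetical true weight in lbs
SCALE_OFFSET = -5.0   # the scale reads 5 lbs low, as in the example

# Hypothetical daily readings: tiny day-to-day noise around the biased value.
daily_noise = [0.1, -0.1, 0.0, 0.1, -0.1]
readings = [TRUE_WEIGHT + SCALE_OFFSET + n for n in daily_noise]

spread = pstdev(readings)            # small spread: readings are consistent (reliable)
bias = mean(readings) - TRUE_WEIGHT  # systematic error: readings are not valid

print(f"spread = {spread:.2f} lbs, bias = {bias:.2f} lbs")
```

A small spread with a large bias is the pattern the slide describes: consistency says nothing about whether the number is right.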

How to weigh yourself

This method might be reliable, but is it valid?

Why do we care about bias and error?

• Threaten reliability
• Violate ethical principles
• Violate scientific principles
• Demonstrated forensic evaluator vulnerability
• Demonstrated human vulnerability
• Effective ways to mitigate bias and error exist

Why test field reliability?

• Reliability of outcomes is valued
• Field conditions introduce bias and error
• Do chosen methods mitigate potential bias and error?
  • Instruments
  • Diagnostic schemes
  • Training
  • QA

Bias threatens reliability
Especially Under Field Conditions

Agencies & Adversaries
• Training
• Methods
• Perspectives
• Incentives

Evaluators
• Thinking too fast
• Human vulnerability
• Heuristic cognitive bias
• Individual biases

Ethics Codes and Guidelines, Bias & Error

Key Ethical Principles
• Justice
• Respect for persons
• Integrity

Forensic Guidelines
• Impartiality
• Avoiding conflicts of interest
• Mitigating impact of personal bias
• Reliable sources and methods

Scientific Principles Decry Bias

• Objectivity

• Neutrality

• Reproducibility

National Research Council, National Academy of Sciences (2009), Strengthening Forensic Science: rigorous protocols control for bias and error.

Similarities between forensic science and Forensic Mental Health Evaluation
• Cognition
• Understanding, analysis, and interpretation of data
• Perception
• Decision making

Mitigate bias with scientific principles and research-based methods
• Work like a scientist, not a clinician
• Rival hypothesis testing
• Use standardized methods
• Use certification programs

Zapf & Dror (2017) Understanding and Mitigating Bias in Forensic Evaluation. International Journal of Forensic Mental Health

Frye and Daubert Admissibility Standards Requirements

Frye
The method is generally accepted.
Standard for admissibility of an expert's scientific testimony, established in Frye v. United States, 293 F. 1013 (D.C. Cir. 1923).

Daubert
The theory or technique in question:

1. Can be and has been tested

2. Has been subjected to peer review and publication

3. Has known or potential error rate

4. Has standards controlling its operation

5. Has widespread acceptance within a relevant scientific community.

US Supreme Court case, Daubert v. Merrell Dow Pharmaceuticals Inc., 509 U.S. 579 (1993).

http://www.law.ufl.edu/faculty/little/topic8.pdf

https://supreme.justia.com/cases/federal/us/509/579/

California admissibility is based on People v. Kelly (1976), similar to Frye, and Sargon v USC (2012), mentioning Daubert.

(1) the reliability of the method must be established, usually by expert testimony, and

(2) the witness furnishing such testimony must be properly qualified as an expert to give an opinion on the subject.

CA Supreme Court, People v. Kelly (1976).


Adversarial Bias & 3rd Party Allegiance
Demonstrated Vulnerability of Forensic Evaluators to Bias

Results and scores favor retaining party or align with agency perspective

• Murrie, Boccaccini, Guarnera, & Rufino (2013)
• Murrie, Boccaccini, et al. (2009; 2013)
• Levenson (2004)
• Murrie, Boccaccini, Johnson, and Janke (2008)
• Murrie, et al. (2008)
• Murrie & Boccaccini (2015)
• Chevalier, Boccaccini, Murrie, and Varela (2015)


Inherent subjectivity
Demonstrated Vulnerability of Forensic Evaluators to Bias

More subjective indicators are more subject to bias
• Murrie, Boccaccini, Guarnera, & Rufino (2013)
• Murrie, Boccaccini, et al. (2009)
• Murrie, et al. (2008)
• Guarnera & Murrie (2017)

Less bias with more objective evidence
• Murrie, Boccaccini, Guarnera, & Rufino (2013)
• Murrie, Boccaccini, et al. (2009)

Boundaries of cognition
Demonstrated Vulnerability of Forensic Evaluators to Bias

Evaluators subject to irrelevant information
• Murrie, Boccaccini, Guarnera, & Rufino (2013)
• Zapf & Dror (2017)

More bias/unreliability with adjustments to actuarials
• Hanson, Helmus, & Harris (2015)
• Storey, Watt, Jackson, & Hart (2012)
• Wormith, Hogg, & Guzzo (2012)

“Bounded rationality” of the brain for complex configural analysis tasks

• Faust and Faust (2012)

Individual biases and preferences
Demonstrated Vulnerability of Forensic Evaluators to Bias

Bias and results vary by evaluator
• Murrie, Boccaccini, Guarnera, & Rufino (2013)
• Boccaccini, Turner, and Murrie (2008)
• Murrie (2008)

Expert status and experience
Demonstrated Vulnerability of Forensic Evaluators to Bias

Experience and number of evaluations completed are not protective

• More divergence with evaluations (n=2)
  • Boccaccini, Turner, & Murrie (2008)
• Expertise builds bias traps
  • Zapf, P. & Dror, I. (2017)
• More experience, more “bias blind spots” and ineffective mitigation
  • Zapf, P. A., Kukucka, J., Kassin, S. M., & Dror, I. E. (2018)

Misguided bias mitigation strategies
Demonstrated Vulnerability of Forensic Evaluators to Bias

• Myth that willpower and introspection reduce bias
  • Zapf, P. A., Kukucka, J., Kassin, S. M., & Dror, I. E. (2018)
  • Kukucka, Kassin, Zapf, & Dror (2017)
• Bias blind spots
  • Pronin, Lin, & Ross (2002)
  • Neal & Brodsky (2014)

Sources of Evaluator Bias

• Cognitive architecture of the brain
• Training and motivation
• Social interaction
• Base rate expectations
• Irrelevant case information
• Reference materials
• Case evidence

Zapf, P. & Dror, I. (2017). Understanding and mitigating bias in forensic evaluation. International Journal of Forensic Mental Health.

More well-established sources of human bias and error (a partial list)

• Confirmation bias
• Earlier findings
• Diagnostic momentum
• Motivated reasoning
• Overconfidence
• Dunning Kruger Effect
• Limits on cognition (Faust 2012)
• Memory fallibility
• Nonlinearity
• Too Much Information

Mitigating biases and error

Allegiance
• Standardized procedures
• Reduce incentives
• Uniform requirements
• Neutral organization
• Avoid in house solutions
• Diverse training

Base rate neglect
• Use standardized instruments
• Know base rates
• Understand effects of low base rates
• Avoid anecdotes

Confirmation
• Force review of data that disconfirms
• Identify alternative hypotheses
• Be ready to revise
• Minimize contamination

Expert Overconfidence
• Mitigate Dunning Kruger effect
• QA
• Don’t over-rely on unique data demonstrate expertise
• Monitor drift

Limits on Cognition
• Algorithms
• Streamline data
• Mask irrelevant data

Thinking too fast
• Use algorithms
• Put your thinking on ice
• Think slow

Individual prejudice and preferences
• and methods
• Verify research-base
• Subject to court of public opinion

Subjectivity
• Use external metrics
• Use checklists
• Use Instruments with firm rules

But, as we have seen, even the structured are fallible…


https://www.visualcapitalist.com/every-single-cognitive-bias/

Static-99R

Standardization
• ICC .80 to .90

Field
• ICC > .80 (Boccaccini et al., 2012; Hanson et al., 2014; Hanson and Morton-Bourgon, 2009)
• Overall ICC = .78 [.64, .90] for the Static-99R total score in a sample of 55 California parole and probation officers (Hanson, Thornton, Helmus, & Babchishin, 2016)
• .88 vs. .73 (pre-2003 coding) (Rice et al., 2014)

DSM-5 Reliability of Diagnoses

ASPD
• kappa .21 (DSM-5 field trials)
• kappa .51, Levenson 2004 (SVP)
• Pa .76, Packard & Levenson 2006

Pedophilic Disorder
• .41, Seto et al. 2016
• kappa .65, Levenson 2004 (SVP)
• Pa .85, Packard & Levenson 2006

Substance Use Disorders
• kappa .40 (DSM-5 field trials)
• kappa .43, Levenson 2004 (SVP)
• Pa .71, Packard & Levenson 2006

Paraphilic Disorders
• kappa .30 to .47, Levenson 2004 (SVP)
• Pa .68 to .97, Packard & Levenson 2006

Static-99R, DSM-5 & ICD-11: reliability and bias protections

Use congruent with ethical guidelines

Use congruent with scientific principles

Generally accepted

• Standardized definitions

• Consensus-based

• Base rates described

• Mitigate memory fallibility

• Mitigate over-reliance on expert status, overconfidence, gut and intuition

But, DSM 5 has poor or unknown reliability for SVP applicable disorders

DSM-5 Field trials
• ASPD Kappa .21
• EtOH Use Disorder Kappa .40
• Not funded for paraphilias

Key Strategies; Organizations
Improving reliability (thus validity) in sex offender evaluations

Kill Idols! Slay Egos!

USE
• Actuarial risk assessments
• Consensus-based diagnostic schemes
• External review & guardrails
• Training

TO
• Mitigate adversarial and agency bias
• Increase accuracy
• Mitigate individual biases and error

Sample Description

• Inmates with qualifying offenses are screened prior to their release (screening was formerly done by DSH and, for the past two years, by BPH).

• Those who are screened as potential SVPs are referred for a full evaluation.

• Sample thus consists of pre-screened “potential” SVPs (none low-level)

• Two separate evaluators per case.

Sample Description

• DSH maintains a large SVP database of the full evals covering several years.

• Variables include the Static-99R and diagnosis
• Database prior to 2012 could contain Static-99 rather than Static-99R scores (and may be less complete).
• Rather than analyzing (and cleaning) all data, we decided to analyze a random subset.
• 200 “negative” cases chosen randomly
• 50 “positive” cases chosen randomly
• Two evaluators per case

Sample Description

• The Department of State Hospitals (DSH) Forensic Services Division (FSD) uses a database to capture and store all referral and evaluation data for the Sexually Violent Predator (SVP) program. The database is called the Sex Offender Commitment Program Support System (SOCPSS)

• The SOCPSS holds over 30,000 records relating to Sexually Violent Predator referrals. During the timeframe queried, 2012 – 2017, there were 14,089 referrals and 10,912 initial evaluations conducted.

• To gather the provided sample, the SOCPSS was queried to collect all evaluations completed between the years 2012 and 2017. Evaluations were then separated into categories of positive or negative outcomes and a random sample was taken using the RAND function in Excel of 200 negative evaluations and 50 positive evaluations. Each evaluation selected included the following information: Evaluator ID, inmate name, DOB, CDCR#, Static 99r scores, diagnoses, evaluation received date, and the evaluation decision.
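The sampling step described above (a random draw within each outcome category) can be sketched in a few lines. This is not the authors' procedure, which used the RAND function in Excel; the record layout, field names, and seed below are hypothetical stand-ins.

```python
import random

def stratified_sample(records, n_negative=200, n_positive=50, seed=0):
    """Draw a simple random sample, without replacement, within each outcome group."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    negatives = [r for r in records if r["decision"] == "negative"]
    positives = [r for r in records if r["decision"] == "positive"]
    return rng.sample(negatives, n_negative) + rng.sample(positives, n_positive)

# Hypothetical stand-in for the queried evaluations (not real SOCPSS data).
records = [
    {"eval_id": i, "decision": "positive" if i % 10 == 0 else "negative"}
    for i in range(5000)
]
sample = stratified_sample(records)
print(len(sample))
```

Sampling the two groups separately is what lets the positive cases, which are much rarer in the referral stream, be represented by a fixed number of evaluations.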

Inter-rater Reliability on the Static-99R

Inter-rater reliability on continuous scales is typically measured by the ICC statistic

Characterizing Static-99R Inter-rater reliability using the ICC

> 0.9 Excellent
0.75 – 0.89 Good
0.5 – 0.75 Moderate
< 0.5 Poor agreement

- Koo & Li (2016), A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research
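For readers unfamiliar with the ICC, the sketch below computes one common form, the one-way random-effects ICC(1,1), from paired scores and applies the bands listed above. The slides do not say which ICC form was used, so the choice of ICC(1,1) is an assumption, as are the example scores and the handling of values that fall exactly on a band boundary.

```python
def icc_oneway(ratings):
    """ICC(1,1): one-way random effects, single rater.

    ratings: list of per-case score lists, each with the same number of raters.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    case_means = [sum(row) / k for row in ratings]
    # Between-case and within-case mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in case_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, case_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def icc_label(icc):
    """Map an ICC to the bands above (boundary handling is an assumption)."""
    if icc >= 0.9:
        return "excellent"
    if icc >= 0.75:
        return "good"
    if icc >= 0.5:
        return "moderate"
    return "poor"

# Hypothetical paired Static-99R totals (evaluator 1, evaluator 2), not study data.
paired_scores = [[3, 3], [5, 4], [1, 1], [6, 6], [2, 3], [4, 4], [7, 6], [0, 0]]
icc = icc_oneway(paired_scores)
print(round(icc, 2), icc_label(icc))
```

A one-way form fits designs where the pair of raters differs from case to case; a two-way form would be the usual choice if the same two raters scored every case.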

Comparing Static-99R ICCs for final outcomes

Final outcome   ICC                  Lower bound   Upper bound
“Negative”      0.90 (“excellent”)   0.87          0.92
“Positive”      0.81 (“good”)        0.69          0.89*

*may be lowered due to restricted range

Static-99R Inter-rater score differences

Score difference (points)   Cases (%)
0                           130 (52%)
1                           88 (35%)
2                           21 (8%)
3                           6 (3%)
4                           5 (2%)
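The counts in this table can be turned into a single agreement figure, the share of cases where the two evaluators were within one point of each other. The sketch below uses the counts as reported; treating the differences as absolute (unsigned) values is an assumption.

```python
# Cases by difference between the two evaluators' Static-99R totals,
# taken from the score-difference table in the slides.
difference_counts = {0: 130, 1: 88, 2: 21, 3: 6, 4: 5}

total_cases = sum(difference_counts.values())
percentages = {d: 100 * c / total_cases for d, c in difference_counts.items()}

# Share of cases with identical scores or a one-point difference.
within_one_point = percentages[0] + percentages[1]

print(total_cases, round(within_one_point, 1))
```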

Summary: Static-99R Inter-rater Reliability Results

All results were in the “good” or “excellent” range. The ICC for the “positive” final outcome group is likely smaller due to a restricted range of Static-99R scores (i.e., few low scores in the “positive” group).

Static-99R scores by final outcome

Frequency of diagnoses by outcome

Diagnostic frequency – Negative cases (400 raters). *May be more than 100%

Pedophilic Disorder 161 (40%)

Antisocial Personality Disorder 98 (25%)

Alcohol Use Disorder 77 (19%)

Stimulant Use Disorder 36 (9%)

Cannabis Use Disorder 9 (2%)

Schizoaffective Disorder 11 (3%)

Psychotic Disorder/Schizophrenia 8 (2%)

Diagnostic frequency – Positive cases (100 raters)

Pedophilic Disorder 63

Antisocial Personality Disorder 19

Alcohol Use Disorder 18

Stimulant Use Disorder 4

Cannabis Use Disorder 9

Psychotic Disorder/Schizophrenia 6

Other Specified Paraphilic Disorder (OSPD) 10

Exhibitionistic Disorder 10

Fetishistic Disorder 2

Frotteuristic Disorder 2

Inter-rater diagnostic agreement for Pedophilic/non-Pedophilic Disorder (Cohen’s kappa)

Characterizing Inter-rater diagnostic agreement using Cohen’s kappa

> 0.9 Almost perfect
0.81 – 0.89 Excellent
0.61 – 0.80 Substantial
0.41 – 0.60 Moderate
0.21 – 0.40 Fair

- Landis and Koch
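Cohen's kappa corrects raw agreement for the agreement expected by chance. The sketch below shows the calculation for a two-category table (Pedophilic Disorder diagnosed or not, by each of two evaluators). The counts are hypothetical; the study's own agreement table is not given in the slides.

```python
def cohens_kappa(table):
    """Cohen's kappa from a square agreement table (rows: rater A, columns: rater B)."""
    n = sum(sum(row) for row in table)
    size = len(table)
    observed = sum(table[i][i] for i in range(size)) / n
    row_share = [sum(table[i]) / n for i in range(size)]
    col_share = [sum(table[i][j] for i in range(size)) / n for j in range(size)]
    # Chance agreement: probability both raters pick the same category independently.
    expected = sum(r * c for r, c in zip(row_share, col_share))
    return (observed - expected) / (1 - expected)

# Hypothetical counts, not study data:
#                 rater B: yes   rater B: no
agreement_table = [[40, 10],   # rater A: yes
                   [5, 45]]    # rater A: no
kappa = cohens_kappa(agreement_table)
print(round(kappa, 2))
```

With these made-up counts the raters agree on 85% of cases, but half of that would be expected by chance, which is why kappa is lower than raw agreement.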

Comparing Cohen’s kappa for Pedophilic/non-Pedophilic Disorder diagnostic agreement – results show “substantial” agreement

Final outcome   kappa   Lower bound   Upper bound
“Negative”      0.69    0.59          0.79
“Positive”      0.74    0.55          0.93

Static-99 scores by Pedophilic/non-Pedophilic Disorder & Final Outcome

Comparing DSH employees vs Independent Evaluators (IEs)

• Substantial differences in either Static-99R scores or final outcome opinions could indicate bias, as found in some studies (e.g., Chevalier et al., 2015).

• Differences in categorical outcome are typically measured by the chi-square statistic.

Final Outcome opinion by DSH Employee vs IEs

• N=363, due to some independent evaluators becoming employees.
• RESULTS: The chi-square results indicate no systematic difference in how often employees and IEs reached positive or negative outcome decisions.

Final Outcome   IEs   Employees
Negative        125   166
Positive        32    40

Chi-square p-value: 0.924 (not significant)
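The chi-square test on this 2x2 table can be reproduced with the standard library alone. The slides do not say whether a continuity correction was applied; the sketch below assumes Yates' correction, and under that assumption the p-value comes out close to the reported 0.924.

```python
import math

def chi_square_2x2(table, yates=True):
    """Chi-square test of independence for a 2x2 table; returns (statistic, p-value)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                # Yates' continuity correction (assumed, not stated in the slides).
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    # With 1 degree of freedom, the upper tail equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: Negative, Positive. Columns: IEs, Employees (counts from the slide).
outcome_table = [[125, 166], [32, 40]]
stat, p = chi_square_2x2(outcome_table)
print(round(stat, 4), round(p, 3))
```

If SciPy is available, `scipy.stats.chi2_contingency(outcome_table, correction=True)` performs the same test and also returns the expected counts.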

Evaluator distributions

• Next, we look at distributions of the Evaluator’s Static-99 ratings depending on whether they are DSH employees or not.

• Major differences in Static-99 scores by DSH vs IEs could suggest bias or training issues.

• As there are many more DSH employees in our sample than IEs, we will put them on the same scale using a density function.
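Putting two groups of different sizes "on the same scale" means comparing proportions rather than raw counts. The slides refer to a density function, which is probably a smoothed estimate; the sketch below swaps in the simpler normalized histogram to show the idea. The scores are hypothetical.

```python
from collections import Counter

def score_proportions(scores):
    """Share of ratings at each score, so groups of different sizes are comparable."""
    counts = Counter(scores)
    total = len(scores)
    return {score: counts[score] / total for score in sorted(counts)}

# Hypothetical Static-99R totals, not the study data.
employee_scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7]
ie_scores = [3, 4, 4, 5, 6]

employee_density = score_proportions(employee_scores)
ie_density = score_proportions(ie_scores)

for score in sorted(set(employee_density) | set(ie_density)):
    print(score, employee_density.get(score, 0.0), ie_density.get(score, 0.0))
```

Each group's proportions sum to 1, so the two distributions can be overlaid even though one group contributed twice as many ratings.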

Summary of study results

• Raters showed good or excellent consistency (ICC) in scoring the risk instrument (Static-99R)

• The most common diagnoses are Pedophilic Disorder, ASPD, and Substance Use Disorders

• Static-99R scores and Pedophilic Disorder diagnoses are related to final outcome

• Raters show substantial diagnostic agreement (Cohen’s kappa) for Pedophilic/non-Pedophilic Disorder diagnoses

• There are no significant differences between employees and IEs in their outcome opinions (per chi-square) or in their Static-99R ratings

Potential Factors underlying results

• Allegiance
  • Fluidity of allegiance
  • Many evaluators work for both PD and DA; variety of evaluations
  • Absence of incentives
  • Absence of pressure
  • Same side

• Confirmation
  • Use of Static-99: least vulnerable to “pull”
  • Force consideration of blind spot data
  • Lack of diagnostic momentum
  • Diagnoses justified by DSM criteria: “diagnosed” disorder
  • CDCR qualifying diagnosis is rare

• Base Rates
  • Knowledge of base rate
  • Force base rate consideration

• Limits on Cognition
  • Structured methods
  • Structured tools
  • Effective documentation and use of memory aid

Potential Factors Mitigating Bias in SVP Evaluations

• Overconfidence
  • Training
  • Legal review for all
  • “Grandma” reasoning and pseudo-expert not tolerated
• Thinking too fast & subjectivity
  • Standardized assessment protocol
  • Selection of well trained and high integrity evaluators!

Recommendations
Pre and Post

• Training
  • Standardized and rigorous
  • Regular refreshers for scoring tests
• QA
  • Robust
  • Review all DOPs and Positives
  • Review key indicators
• Hiring
  • Hire the best
  • Value integrity
