TRANSCRIPT
Field Reliability of Static-99R and diagnosis in WIC 6600 Evaluations
Joseph Lockhart, PhD, ABPP
Melinda DiCiro, PsyD, ABPP
FMHAC 2020
Thanks to all who contributed to this study over nearly two years!
• Anna Brennan
• Jim Rokop
• Administration
• …and others
Questions the Study was designed to answer
• How consistent are raters in scoring the risk instrument (Static-99R)?
• (The only other SVP field study, conducted in Texas, found poor interrater reliability – Boccaccini et al., 2009)
• What are the most common diagnoses?
• Do the raters show adequate diagnostic agreement? (Other studies found varied kappa values.)
• Are there differences between employees and independent evaluators (IEs) in their ratings?
Why do we care about field reliability?
• “Field studies we would argue provide evidence concerning an instrument’s psychometric properties that is more generalizable to real-world cases specifically because the data were collected under similar circumstances.” (Edens and Boccaccini, 2017)
RELIABILITY & VALIDITY
• Reliability is the degree to which an assessment tool produces stable and consistent results.
• Validity refers to how well a test measures what it is purported to measure
• Why is reliability necessary?
• While reliability is necessary, it alone is not sufficient. For a test to be useful, it also needs to be valid. For example, if your scale is off by 5 lbs, it reads your weight every day with a deficit of 5 lbs. The scale is reliable because it consistently reports the same weight every day, but it is not valid because it subtracts 5 lbs from your true weight.
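The scale example can be put in numbers. The sketch below is only an illustration: the true weight and the daily readings are hypothetical values chosen to mimic a scale that reads about 5 lbs low. A small spread across readings reflects reliability; a mean that sits away from the true value reflects a validity problem.

```python
from statistics import mean, pstdev

TRUE_WEIGHT_LBS = 150.0  # hypothetical true weight
# Hypothetical daily readings from a scale that reads about 5 lbs low.
readings = [145.0, 145.1, 144.9, 145.0, 145.1, 144.9, 145.0]

spread = pstdev(readings)                # small spread: consistent, i.e. reliable
bias = mean(readings) - TRUE_WEIGHT_LBS  # systematic error: not valid

print(f"spread = {spread:.2f} lbs, bias = {bias:.2f} lbs")
```

The readings agree with one another to within a fraction of a pound, yet every one of them misses the true weight by roughly the same amount.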
How to weigh yourself
This method might be reliable, but is it valid?
Why do we care about bias and error?
• Threaten reliability
• Violate ethical principles
• Violate scientific principles
• Demonstrated forensic evaluator vulnerability
• Demonstrated human vulnerability
• Effective ways to mitigate bias and error exist
Why test field reliability?
• Reliability of outcomes is valued
• Field conditions introduce bias and error
• Do chosen methods mitigate potential bias and error?
  • Instruments
  • Diagnostic schemes
  • Training
  • QA
Bias threatens reliability
Especially Under Field Conditions
Agencies & Adversaries
• Training
• Methods
• Perspectives
• Incentives
Evaluators
• Thinking too fast
• Human vulnerability
• Heuristic cognitive bias
• Individual biases
Ethics Codes and Guidelines, Bias & Error
Key Ethical Principles
• Justice
• Respect for persons
• Integrity
Forensic Guidelines
• Impartiality
• Avoiding conflicts of interest
• Mitigating impact of personal bias
• Reliable sources and methods
Scientific Principles Decry Bias
• Objectivity
• Neutrality
• Reproducibility
National Research Council, National Academy of Sciences, Strengthening forensic science (2009): Rigorous protocol control for bias and error.
Similarities between forensic science and Forensic Mental Health Evaluation
• Cognition
• Understanding, analysis, and interpretation of data
• Perception
• Decision making
Mitigate bias with scientific principles and research-based methods
• Work like a scientist, not a clinician
• Rival hypothesis testing
• Use standardized methods
• Use certification programs
Zapf & Dror (2017) Understanding and Mitigating Bias in Forensic Evaluation. International Journal of Forensic Mental Health
Frye and Daubert Admissibility Standards Requirements
Frye
The method is generally accepted
Admissibility of expert's scientific testimony, established in Frye v. United States, 293 F. 1013 (D.C. Cir. 1923).
Daubert
The theory or technique in question
1. Can be and has been tested
2. Has been subjected to peer review and publication
3. Has known or potential error rate
4. Has standards controlling its operation
5. Has widespread acceptance within a relevant scientific community.
US Supreme Court case, Daubert v. Merrell Dow Pharmaceuticals Inc., 509 U.S. 579 (1993).
California admissibility is based on People v. Kelly (1976), similar to Frye, and Sargon v. USC (2012), mentioning Daubert.
(1) the reliability of the method must be established, usually by expert testimony, and
(2) the witness furnishing such testimony must be properly qualified as an expert to give an opinion on the subject.
CA Supreme Court, People v. Kelly, (1976).
Adversarial Bias & 3rd Party Allegiance
Demonstrated Vulnerability of Forensic Evaluators to Bias
Results and scores favor retaining party or align with agency perspective
• Murrie, Boccaccini, Guarnera, & Rufino (2013)
• Murrie, Boccaccini, et al. (2009; 2013)
• Levenson (2004)
• Murrie, Boccaccini, Johnson, & Janke (2008)
• Murrie et al. (2008)
• Murrie & Boccaccini (2015)
• Chevalier, Boccaccini, Murrie, & Varela (2015)
Inherent subjectivity
Demonstrated Vulnerability of Forensic Evaluators to Bias
More subjective indicators are more subject to bias
• Murrie, Boccaccini, Guarnera, & Rufino (2013)
• Murrie, Boccaccini, et al. (2009)
• Murrie et al. (2008)
• Guarnera & Murrie (2017)
Less bias with more objective evidence
• Murrie, Boccaccini, Guarnera, & Rufino (2013)
• Murrie, Boccaccini, et al. (2009)
Boundaries of cognition
Demonstrated Vulnerability of Forensic Evaluators to Bias
Evaluators subject to irrelevant information
• Murrie, Boccaccini, Guarnera, & Rufino (2013)
• Zapf & Dror (2017)
More bias/unreliability with adjustments to actuarials
• Hanson, Helmus, & Harris (2015)
• Storey, Watt, Jackson, & Hart (2012)
• Wormith, Hogg, & Guzzo (2012)
“Bounded rationality” of the brain for complex configural analysis tasks
• Faust and Faust (2012)
Individual biases and preferences
Demonstrated Vulnerability of Forensic Evaluators to Bias
Bias and results vary by evaluator
• Murrie, Boccaccini, Guarnera, & Rufino (2013)
• Boccaccini, Turner, & Murrie (2008)
• Murrie (2008)
Expert status and experience
Demonstrated Vulnerability of Forensic Evaluators to Bias
Experience and number of evaluations completed are not protective
• More divergence with evaluations (n=2)
  • Boccaccini, Turner, & Murrie (2008)
• Expertise builds bias traps
  • Zapf, P. A., & Dror, I. E. (2017)
• More experience, more “bias blind spots” and ineffective mitigation
  • Zapf, P. A., Kukucka, J., Kassin, S. M., & Dror, I. E. (2018)
Misguided bias mitigation strategies
Demonstrated Vulnerability of Forensic Evaluators to Bias
• Myth that willpower and introspection reduce bias
  • Zapf, P. A., Kukucka, J., Kassin, S. M., & Dror, I. E. (2018)
  • Kukucka, Kassin, Zapf, & Dror (2017)
• Bias blind spots
  • Pronin, Lin, & Ross (2002)
  • Neal & Brodsky (2014)
Sources of Evaluator Bias
• Cognitive architecture of the brain
• Training and motivation
• Social interaction
• Base rate expectations
• Irrelevant case information
• Reference materials
• Case evidence
Zapf, P. A., & Dror, I. E. (2017). Understanding and mitigating bias in forensic evaluation. International Journal of Forensic Mental Health.
More well-established sources of human bias and error (a partial list)
• Confirmation bias
• Earlier findings
• Diagnostic momentum
• Motivated reasoning
• Overconfidence
• Dunning-Kruger effect
• Limits on cognition (Faust 2012)
• Memory fallibility
• Nonlinearity
• Too much information
Mitigating biases and error
Allegiance
• Standardized procedures
• Reduce incentives
• Uniform requirements
• Neutral organization
• Avoid in-house solutions
• Diverse training
Base rate neglect
• Use standardized instruments
• Know base rates
• Understand effects of low base rates
• Avoid anecdotes
Confirmation
• Use standardized instruments
• Force review of data that disconfirms
• Identify alternative hypotheses
• Be ready to revise
• Minimize contamination
Expert Overconfidence
• Use standardized instruments
• Mitigate Dunning-Kruger effect
• QA
• Don’t over-rely on unique data to demonstrate expertise
• Monitor drift
Limits on Cognition
• Use standardized instruments
• Algorithms
• Streamline data
• Mask irrelevant data
Thinking too fast
• Use standardized instruments
• Use algorithms
• Put your thinking on ice
• Think slow
Individual prejudice and preferences
• Use standardized instruments and methods
• Verify research-base
• Subject to court of public opinion
Subjectivity
• Use standardized instruments
• Use external metrics
• Use checklists
• Use instruments with firm rules
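One of the mitigation points, understanding the effects of low base rates, lends itself to a short worked example. The sketch below applies Bayes' rule to a hypothetical indicator; the sensitivity, specificity, and base rates are illustrative assumptions and are not taken from the study or from any instrument.

```python
def positive_predictive_value(base_rate, sensitivity, specificity):
    """Share of flagged cases that are true positives (Bayes' rule)."""
    true_positives = sensitivity * base_rate
    false_positives = (1 - specificity) * (1 - base_rate)
    return true_positives / (true_positives + false_positives)

# Hypothetical indicator: 80% sensitivity, 80% specificity.
for base_rate in (0.50, 0.20, 0.05):
    ppv = positive_predictive_value(base_rate, sensitivity=0.8, specificity=0.8)
    print(f"base rate {base_rate:.0%}: PPV = {ppv:.0%}")
```

With the same indicator, the share of flagged cases that are true positives falls sharply as the base rate drops, which is why anecdotes and intuition tend to mislead when the outcome is rare.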
But, as we have seen, even the structured are fallible…
https://www.visualcapitalist.com/every-single-cognitive-bias/
Static-99R
Standardization
• .8 to .9 ICC
Field
• > .80 ICC (Boccaccini et al., 2012; Hansen et al., 2014; Hanson and Morton-Bourgon, 2009)
• Overall ICC = .78 [.64, .90] for Static-99R total score in a sample of 55 California parole and probation officers (Hanson, R. K., Thornton, D., Helmus, L., & Babchishin, K. M., 2016)
• .88 vs. .73 (pre-2003 coding) (Rice et al., 2014)
DSM-5 Reliability of Diagnoses
ASPD
• K .21 (trials)
• K .51, Levenson 2004 (SVP)
• Pa .76, Packard & Levenson 2006
Pedophilic Disorder
• .41, Seto et al. 2016
• K .65, Levenson 2004 (SVP)
• Pa .85, Packard & Levenson 2006
Substance Use Disorders
• K .40 (trials)
• K .43, Levenson 2004 (SVP)
• Pa .71, Packard & Levenson 2006
Paraphilic Disorders
• K .30 to .47, Levenson 2004 (SVP)
• Pa .68 to .97, Packard & Levenson 2006
Static-99R, DSM-5 & ICD-11, reliability, and bias protections
Use congruent with ethical guidelines
Use congruent with scientific principles
Generally accepted
• Standardized definitions
• Consensus-based
• Base rates described
• Mitigate memory fallibility
• Mitigate over-reliance on expert status, overconfidence, gut and intuition
But DSM-5 has poor or unknown reliability for SVP-applicable disorders
DSM-5 Field trials
• ASPD Kappa .21
• Alcohol Use Disorder Kappa .40
• Not funded for paraphilias
Key Strategies; Organizations
Improving reliability (thus validity) in sex offender evaluations
Kill Idols! Slay Egos!
USE
• Actuarial risk assessments
• Consensus-based diagnostic schemes
• External review & guardrails
• Training
TO
• Mitigate adversarial and agency bias
• Increase accuracy
• Mitigate individual biases and error
Sample Description
• Inmates with qualifying offenses are screened prior to their release (this was previously done by DSH and, for the past two years, by BPH).
• Those who are screened as potential SVPs are referred for a full evaluation.
• Sample thus consists of pre-screened “potential” SVPs (none low-level)
• Two separate evaluators per case.
Sample Description
• DSH maintains a large SVP database of the full evals covering several years.
• Variables include the Static-99R and diagnosis
• Database prior to 2012 could contain Static-99 rather than Static-99R scores (and may be less complete).
• Rather than analyzing (and cleaning) all data, we decided to analyze a random subset.
• 200 “negative” cases chosen randomly
• 50 “positive” cases chosen randomly
• Two evaluators per case
Sample Description
• The Department of State Hospitals (DSH) Forensic Services Division (FSD) uses a database to capture and store all referral and evaluation data for the Sexually Violent Predator (SVP) program. The database is called the Sex Offender Commitment Program Support System (SOCPSS)
• The SOCPSS holds over 30,000 records relating to Sexually Violent Predator referrals. During the timeframe queried, 2012 – 2017, there were 14,089 referrals and 10,912 initial evaluations conducted.
• To gather the sample, the SOCPSS was queried for all evaluations completed between 2012 and 2017. Evaluations were then separated into positive and negative outcomes, and a random sample of 200 negative evaluations and 50 positive evaluations was drawn using the RAND function in Excel. Each evaluation selected included the following information: Evaluator ID, inmate name, DOB, CDCR#, Static-99R scores, diagnoses, evaluation received date, and the evaluation decision.
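The study drew its sample with Excel's RAND function. For readers who want to reproduce that kind of draw in code, the following is a minimal sketch of the same idea (a fixed-size random sample from each outcome group) in Python. The record layout, the field names `case_id` and `decision`, and the toy data are all hypothetical; they are not the SOCPSS schema.

```python
import random

def stratified_sample(records, n_negative=200, n_positive=50, seed=0):
    """Draw a fixed number of records at random from each outcome group."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced
    negatives = [r for r in records if r["decision"] == "negative"]
    positives = [r for r in records if r["decision"] == "positive"]
    return rng.sample(negatives, n_negative) + rng.sample(positives, n_positive)

# Hypothetical stand-in for the queried evaluations (not the study data).
records = [
    {"case_id": i, "decision": "positive" if i % 10 == 0 else "negative"}
    for i in range(5000)
]

sample = stratified_sample(records)
print(len(sample))
```

Sampling each outcome group separately guarantees enough "positive" cases for analysis even though they are much rarer than "negative" ones.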
Inter-rater Reliability on the Static-99R
Inter-rater reliability on continuous scales is typically measured by the intraclass correlation coefficient (ICC).
Characterizing Static-99R Inter-rater reliability using the ICC
ICC            Interpretation
> 0.9          Excellent
0.75 – 0.89    Good
0.5 – 0.75     Moderate
< 0.5          Poor agreement
Source: Koo & Li (2016), A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research
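The slides do not say which form of the ICC was computed. As a rough illustration only, the sketch below computes the one-way random-effects, single-rater form, often written ICC(1,1), which is one plausible choice when each case is scored by a different pair of evaluators. The scores are hypothetical, and the band cut-offs follow the guideline listed above, with 0.90 itself treated as "excellent" to match how the slides label it.

```python
def icc_one_way(ratings):
    """ICC(1,1): one-way random effects, single rater.

    `ratings` is a list of rows, one row per case, each holding the
    scores given to that case by k raters.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-case and within-case mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def icc_band(icc):
    """Map an ICC to the descriptive bands listed in the guideline."""
    if icc >= 0.90:
        return "excellent"
    if icc >= 0.75:
        return "good"
    if icc >= 0.50:
        return "moderate"
    return "poor"

# Hypothetical Static-99R totals from two evaluators on four cases.
example_scores = [(2, 3), (4, 4), (6, 5), (1, 1)]
example_icc = icc_one_way(example_scores)
print(round(example_icc, 3), icc_band(example_icc))
```

Other ICC forms (two-way models, consistency versus absolute agreement) can give different values on the same data, so the form should be reported alongside the coefficient.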
Comparing Static-99R ICCs for final outcomes
Final outcome    ICC                   Lower bound    Upper bound
“Negative”       0.90 (“excellent”)    0.87           0.92
“Positive”       0.81 (“good”)*        0.69           0.89
*may be lowered due to restricted range
Static-99R Inter-rater score differences
Score difference    Cases (%)
0                   130 (52%)
1                   88 (35%)
2                   21 (8%)
3                   6 (3%)
4                   5 (2%)
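The counts above can be summarized as the share of cases in which the two evaluators scored within one point of each other. The sketch below works only from the listed counts. The exact shares it prints may differ by a point from the rounded percentages on the slide.

```python
# Number of cases at each absolute Static-99R score difference, as listed above.
difference_counts = {0: 130, 1: 88, 2: 21, 3: 6, 4: 5}

total_cases = sum(difference_counts.values())
within_one = difference_counts[0] + difference_counts[1]

for diff, count in difference_counts.items():
    print(f"difference {diff}: {count} cases ({100 * count / total_cases:.1f}%)")
print(f"within one point: {within_one} of {total_cases} "
      f"({100 * within_one / total_cases:.1f}%)")
```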
Summary: Static-99R Inter-rater Reliability Results
All results were in the “good” or “excellent” range. The ICC for the “positive” final outcome group is likely lower because of a restricted range of Static-99R scores (i.e., few low scores in the “positive” group).
Static-99R scores by final outcome
Frequency of diagnoses by outcome
Diagnostic frequency – Negative cases (400 ratings: 200 cases × 2 evaluators). *Percentages may total more than 100%
Pedophilic Disorder 161 (40%)
Antisocial Personality Disorder 98 (25%)
Alcohol Use Disorder 77 (19%)
Stimulant Use Disorder 36 (9%)
Cannabis Use Disorder 9 (2%)
Schizoaffective Disorder 11 (3%)
Psychotic Disorder/Schizophrenia 8 (2%)
Diagnostic frequency – Positive cases (100 ratings: 50 cases × 2 evaluators)
Pedophilic Disorder 63
Antisocial Personality Disorder 19
Alcohol Use Disorder 18
Stimulant Use Disorder 4
Cannabis Use Disorder 9
Psychotic Disorder/Schizophrenia 6
Other Specified Paraphilic Disorder (OSPD) 10
Exhibitionistic Disorder 10
Fetishistic Disorder 2
Frotteuristic Disorder 2
Inter-rater diagnostic agreement for Pedophilic/non-Pedophilic Disorder (Cohen’s kappa)
Characterizing Inter-rater diagnostic agreement using Cohen’s kappa
Kappa          Interpretation
> 0.9          Almost perfect
0.81 – 0.89    Excellent
0.61 – 0.80    Substantial
0.41 – 0.60    Moderate
0.21 – 0.40    Fair
Source: Landis and Koch
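For readers less familiar with the statistic, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance from each rater's own diagnosis rates. The sketch below computes it for a two-category diagnosis; the ratings are hypothetical, and the band labels simply follow the table on the slide.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categories to the same cases."""
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category rates.
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

def kappa_band(kappa):
    """Map kappa to the descriptive bands used on the slide."""
    if kappa > 0.90:
        return "almost perfect"
    if kappa >= 0.81:
        return "excellent"
    if kappa >= 0.61:
        return "substantial"
    if kappa >= 0.41:
        return "moderate"
    if kappa >= 0.21:
        return "fair"
    return "below the listed bands"

# Hypothetical diagnoses: 1 = Pedophilic Disorder, 0 = not diagnosed.
evaluator_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
evaluator_2 = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
example_kappa = cohens_kappa(evaluator_1, evaluator_2)
print(round(example_kappa, 2))
```

In this toy example the raters agree on 8 of 10 cases, but because half of that agreement would be expected by chance, kappa is noticeably lower than the raw agreement rate.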
Comparing Cohen’s kappa for Pedophilic/non-Pedophilic Disorder diagnostic agreement – results show “substantial” agreement
Final outcome    Kappa    Lower bound    Upper bound
“Negative”       0.69     0.59           0.79
“Positive”       0.74     0.55           0.93
Static-99R scores by Pedophilic/non-Pedophilic Disorder & Final Outcome
Comparing DSH employees vs Independent Evaluators (IEs)
• Substantial differences in either Static-99R scores or final outcome opinions could indicate bias, as found in some studies (e.g., Chevalier et al., 2015).
• Differences in categorical outcome are typically measured by the chi-square statistic.
Final Outcome opinion by DSH Employee vs IEs
• N=363, due to some independent evaluators becoming employees. RESULTS: The chi-square results indicate no systematic differences between how often employees/IEs came to positive/negative outcome decisions.
Final Outcome    IEs    Employees
Negative         125    166
Positive         32     40
Chi-square p-value: 0.924 (not significant)
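The test can be re-run from the four counts in the table. The slides do not state how the p-value was computed; the sketch below assumes a Pearson chi-square test with Yates' continuity correction, which several statistics packages apply by default to 2×2 tables. Under that assumption the counts give a p-value that rounds to the reported 0.924.

```python
import math

def chi_square_2x2(table, yates=True):
    """Chi-square test of independence for a 2x2 table of counts.

    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            deviation = abs(observed - expected)
            if yates:
                # Continuity correction (assumed here, not stated in the slides).
                deviation = max(deviation - 0.5, 0.0)
            chi2 += deviation ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: Negative, Positive. Columns: IEs, Employees (counts from the table).
counts = [(125, 166), (32, 40)]
chi2, p_value = chi_square_2x2(counts)
print(f"chi-square = {chi2:.4f}, p = {p_value:.3f}")
```

Calling `chi_square_2x2(counts, yates=False)` gives the uncorrected test, which yields a smaller p-value that is still far from significance.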
Evaluator distributions
• Next, we look at distributions of evaluators’ Static-99R ratings depending on whether they are DSH employees or not.
• Major differences in Static-99R scores between DSH employees and IEs could suggest bias or training issues.
• As there are many more DSH employees in our sample than IEs, we put them on the same scale using a density function.
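The slides do not specify which density function was used. As a simple stand-in, the sketch below converts each group's score counts to relative frequencies, a discrete analogue of a density estimate, so that groups of different sizes can be compared on one scale. The scores are hypothetical and are not the study data.

```python
from collections import Counter

def relative_frequencies(scores):
    """Share of ratings at each score, so the shares sum to 1 per group."""
    counts = Counter(scores)
    total = len(scores)
    return {score: counts[score] / total for score in sorted(counts)}

# Hypothetical Static-99R scores for illustration only.
employee_scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 3, 4]
ie_scores = [2, 3, 3, 4, 5, 4]

employee_density = relative_frequencies(employee_scores)
ie_density = relative_frequencies(ie_scores)

for score in sorted(set(employee_density) | set(ie_density)):
    print(score,
          round(employee_density.get(score, 0.0), 2),
          round(ie_density.get(score, 0.0), 2))
```

Because each group's shares sum to 1, the larger group no longer dominates the comparison the way it would with raw counts.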
Summary of study results
• Raters showed good or excellent consistency (ICC) in scoring the risk instrument (Static-99R)
• The most common diagnoses are Pedophilic Disorder, ASPD, and substance use disorders
• Static-99R scores and Pedophilic Disorder diagnoses are related to final outcome
• Raters show substantial diagnostic agreement (Cohen’s kappa) for Pedophilic/non-Pedophilic diagnoses
• There are no significant differences between employees and IEs in their outcome opinions (per chi-square) or in their Static-99R ratings
Potential Factors underlying results
• Allegiance
  • Fluidity of allegiance
  • Many evaluators work for both PD and DA; variety of evaluations
  • Absence of incentives
  • Absence of pressure
  • Same side
• Confirmation
  • Use of Static-99R: least vulnerable to “pull”
  • Force consideration of blind-spot data
  • Lack of diagnostic momentum
• Diagnoses justified by DSM criteria – “diagnosed” disorder
  • CDCR qualifying diagnosis is rare
• Base Rates
  • Knowledge of base rate
  • Force base rate consideration
• Limits on Cognition
  • Structured methods
  • Structured tools
  • Effective documentation and use of memory aids
Potential Factors Mitigating Bias in SVP Evaluations
• Overconfidence
  • Training
  • Legal review for all
  • “Grandma” reasoning and pseudo-expert not tolerated
• Thinking too fast & subjectivity
  • Standardized assessment protocol
  • Selection of well trained and high integrity evaluators!
Recommendations
Pre and Post
• Training
  • Standardized and rigorous
  • Regular refreshers for scoring tests
• QA
  • Robust
  • Review all DOPs and Positives
  • Review key indicators
• Hiring
  • Hire the best
  • Value integrity