
Measurement Issues in Health Disparities Research

Anita L. Stewart, Ph.D.
University of California, San Francisco

Clinical Research with Diverse Communities
EPI 222, Spring
April 17, 2008

Background

U.S. population becoming more diverse

More minority groups are being included in research due to:
– NIH mandate
– Recent health disparities initiatives

Types of Diverse Groups

Health disparities research focuses on differences in health between the following groups:
– Minority vs. non-minority
– Low income vs. others
– Low education vs. others
– Limited English skills vs. others
– … and others

Health Disparities Research

Describe health disparities
– Health differences across various diverse groups

Identify mechanisms by which health disparities occur
– Individual level
– Environmental level

Intervene to reduce health disparities

Health Care Disparities

Differential access to and quality of health care is well known
– Thus, health care disparities become a plausible mechanism for health disparities

Understanding determinants of health care disparities is also of interest

Types of Self-Report Measures Needed

Measures of health, and of various mechanisms for disparities
– Class 4 will present numerous mechanisms

Examples from this class: sense of control, self-efficacy for managing disease, health-related quality of life for various health conditions

Measurement Implications of Research in Diverse Groups

Most self-reported measures were developed and tested in mainstream, well-educated groups
– Subgroup analysis of measures has been rare

Thus, little information is available on appropriateness, reliability, validity, and responsiveness in minority and other diverse groups

The Measurement Goal: Identify Measures That Can Be Used…

To compare diverse groups

To study mechanisms within any particular “diverse” group

Group Comparisons are the Most Problematic

Disparities research involves comparing mean levels of health or its determinants

Requires “equivalent” concepts and measures
– Potential true differences may be obscured
– Observed group differences may be inaccurate

Alternative Explanations for Observed Group Differences

Observed group mean differences in a measure can be due to

– culturally- or group-mediated differences in true score (true differences) -- OR --

– bias - systematic differences between group observed scores not attributable to true scores

Bias - A Special Concern

Measurement bias in any one group may make group comparisons invalid

Bias can be due to group differences in:
– the meaning of concepts or items
– the extent to which a measure represents a concept
– cognitive processes of responding
– use of response scales
– appropriateness of data collection methods

Example of Effect of Biased Items

5 CES-D items administered to black and white men
– 1 item subject to differential item functioning (bias)

5-item scale including item suggested that black men had higher levels of somatic symptoms than white men (p < .01)

4-item scale excluding biased item showed no differences between black and white men

S Gregorich, Med Care, 2006;44:S78-S94.
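The mechanism behind this example can be sketched with simulated data. The snippet below is not a reanalysis of the cited study: the sample sizes, noise levels, and the size of the item shift are all hypothetical values chosen for illustration. Both groups draw true scores from the same distribution, and one item is shifted upward in one group only, so any difference on the 5-item scale comes from the biased item.

```python
import random
from statistics import mean


def simulate_group(n, item_shift, rng):
    """Return n five-item response vectors (hypothetical data).

    Each respondent's true score comes from the same distribution in
    every group; item_shift is added to the fifth item only, mimicking
    a single item with differential item functioning.
    """
    rows = []
    for _ in range(n):
        true_score = rng.gauss(0.0, 1.0)
        items = [true_score + rng.gauss(0.0, 0.5) for _ in range(5)]
        items[4] += item_shift
        rows.append(items)
    return rows


def scale_mean(rows, item_indexes):
    """Mean of the summed score over the chosen items."""
    return mean(sum(row[i] for i in item_indexes) for row in rows)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
group_a = simulate_group(2000, 0.0, rng)  # no biased item
group_b = simulate_group(2000, 1.0, rng)  # fifth item shifted upward

# Group difference with and without the biased item.
diff_5_items = scale_mean(group_b, range(5)) - scale_mean(group_a, range(5))
diff_4_items = scale_mean(group_b, range(4)) - scale_mean(group_a, range(4))

print(f"5-item scale difference: {diff_5_items:.2f}")
print(f"4-item scale difference: {diff_4_items:.2f}")
```

Under these assumptions the 5-item difference is driven by the shifted item, while the 4-item difference reflects sampling noise only.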

Bias or “Systematic Difference”?

Bias refers to “deviation from true score”

Cannot speak of a measure being “biased” in one group compared to another w/o knowing true score

Preferred term: differential “item” functioning
– Item (or measure) that has a different meaning in one group than another

Typical Sequence of Developing New Self-Report Measures

Develop concept

Create item pool

Pretest/revise

Field survey

Psychometric analyses

Final measures


Extra Steps in Sequence of Developing New Self-Report Measures for Diverse Groups

Develop concept: obtain perspectives of diverse groups

Create item pool: .. to reflect these perspectives

Pretest/revise: .. in all diverse groups

Field survey: .. in all diverse groups

Measurement studies across groups

Final measures

If results are non-equivalent

Measurement Adequacy vs. Measurement Equivalence

Making group comparisons requires conceptual and psychometric adequacy and equivalence

Adequacy - within a “diverse” group
– concepts are appropriate
– psychometric properties meet minimal criteria

Equivalence - between “diverse” groups
– conceptual and psychometric properties are comparable

Why Not Use Culture-Specific Measures?

Measurement goal - identify measures that can be used across all groups, yet maintain sensitivity to diversity and have minimal bias

Most health disparities studies require comparing mean scores across diverse groups
– need comparable measures

Conceptual and Psychometric Adequacy and Equivalence

               Adequacy in 1 Group                     Equivalence Across Groups

Conceptual     Concept meaningful                      Concept equivalent
               within one group                        across groups

Psychometric   Psychometric properties meet            Psychometric properties invariant
               minimal standards within one group      (equivalent) across groups

Left Side of Matrix: Issues in a Single Group


Right Side of Matrix: Issues in More Than One Group

Conceptual Adequacy in One Group

Is concept relevant, meaningful, and acceptable to a “diverse” group?

Traditional research
– Conceptual adequacy = simply defining a concept
– Mainstream population “assumed”

Minority and health disparities research
– Mainstream concepts may be inadequate
– Concept should correspond to how a particular group thinks about it

Qualitative Approaches to Explore Conceptual Adequacy in Diverse Groups

Literature reviews
– ethnographic and anthropological

In-depth interviews and focus groups
– discuss concepts, obtain their views

Expert consultation from diverse groups
– review concept definitions
– rate relevance of items

Example of Inadequate Concept

Patient satisfaction typically conceptualized in mainstream populations in terms of, e.g.,
– access, technical care, communication, continuity, interpersonal style

In minority and low income groups, additional relevant domains include, e.g.,
– discrimination by health professionals
– sensitivity to language barriers

Method for Examining Conceptual Relevance

Compiled set of 33 HRQL items spanning many concepts

Assessed relevance to older African Americans

After answering each question, asked “how relevant is this question to the way you think about your health?”
– Response scale: 0-10 scale with endpoints labeled
– Labels: 0=not at all relevant, 10=extremely relevant

Cunningham WE et al., Qual Life Res, 1999;8:749-768.

Results: Conceptual Relevance

Most relevant items:
– Spirituality (3 items)
– Weight-related health (2 items)
– Hopefulness (1 item)

Spirituality items
– importance of spirituality to well-being, level of spirituality, being sick affected spirituality

Results: Conceptual Relevance

Least relevant items:
– Physical functioning
– Role limitations due to emotional problems

All standard MOS measures ranked in the lower 2/3, including all SF-12 items

Conceptual Relevance of Spanish FACT-G

Bilingual/bicultural expert panel reviewed all 28 items for relevance
– One item had low cultural relevance to quality of life
– One concept was missing – spirituality

Developed new spirituality scale (FACIT-Sp) with input from cancer patients, psychotherapists, and religious experts
– Sample item “I worry about dying”

Cella D et al., Med Care, 1998;36:1407.

Psychometric Adequacy in One Group


Psychometric Adequacy in any Group

Minimal standards:
– Sufficient variability
– Minimal missing data
– Adequate reliability/reproducibility
– Evidence of construct validity
– Evidence of responsiveness to change

Basic classical test theory approach

Evidence of Psychometric Inadequacy of Measure in Various Diverse Groups

SF-36 social functioning scale - internal consistency reliability < .70 in three different samples:
– Chinese language, adults aged 55-96 years
– Japanese language, Japanese elders
– English, Pima Indians

Stewart AL & Nápoles-Springer A, Med Care, 2000;38(9 Suppl):II-102
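Internal consistency reliability is usually reported as Cronbach's alpha. A minimal sketch of the calculation follows, using the standard formula and checking the result against the .70 threshold mentioned above. The response data are hypothetical, and the function and variable names are illustrative only.

```python
from statistics import variance

MIN_ALPHA = 0.70  # minimal criterion referred to on the slide


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents' item scores.

    rows: list of equal-length lists, one per respondent, with no
    missing values. alpha = k/(k-1) * (1 - sum(item variances) / variance(total)).
    """
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses from one group: 5 respondents, 4 items.
responses = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [4, 4, 4, 3],
]

alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}, meets minimal criterion: {alpha >= MIN_ALPHA}")
```

Running the same function separately within each group gives the within-group adequacy check; it does not by itself say anything about equivalence across groups.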

Conceptual Equivalence Across Groups


Conceptual Equivalence

Is the concept relevant, familiar, acceptable to all diverse groups being studied?

Is the concept defined the same way in all groups?
– all relevant “domains” included (none missing)
– interpreted similarly

Is the concept appropriate for all diverse groups?

Generic/Universal vs Group-Specific (Etic versus Emic)

Concepts unlikely to be defined exactly the same way across diverse ethnic groups

Generic/universal (etic)
– features of a concept that are appropriate across groups

Group-specific (emic)
– idiosyncratic or culture-specific portions of a concept

Etic versus Emic (cont.)

Goal in health disparities research on more than one group:
– identify generic/universal portion of a concept (could be entire concept) that can be applied across all groups

For within-group analyses or studies
– the culture-specific portion is also relevant
– Same as examining “conceptual adequacy” within one group

Approaches Similar to Those for Conceptual Adequacy

Main difference: Need to assure concept is equivalent across groups– Additional criterion

What do we mean by equivalent conceptually?

Methods are poorly developed

Obtain Perspective of All Diverse Groups on Concept

Develop concept

Create item pool

Pretest/revise

Field survey

Final measures

Example: Develop Concept of Interpersonal Processes of Care

Began with conceptual framework from literature and psychometric studies of preliminary survey

IPC Version I Conceptual Framework

Three major multi-dimensional categories:
– Communication
– Decision-making
– Interpersonal Style

IPC Version I: Subdomains

Communication
– Elicitation of concerns, explanations, general clarity

Decision-making
– Involving patients in decisions

Interpersonal Style
– Respectfulness, emotional support, non-discrimination, cultural sensitivity

Stewart et al., Milbank Quarterly, 1999;77:305.

Limitations of First IPC Framework

Tested on small sample of 600 patients from San Francisco General Hospital

Several hypothesized concepts were not confirmed– e.g., cultural sensitivity

Needed further development and validation on a larger sample

Developed Revised IPC Concept

Inputs to the draft IPC II conceptual framework:
– IPC Version I framework in Milbank Quarterly
– 19 focus groups: African American, Latino, and White adults
– Literature review of quality of care in diverse groups

IPC-II Conceptual Framework

I. COMMUNICATION
– General clarity
– Elicitation/responsiveness
– Explanations of processes, condition, self-care, meds
– Empowerment

II. DECISION MAKING
– Responsive to patient preferences
– Consider ability to comply

III. INTERPERSONAL STYLE
– Respectfulness
– Courteousness
– Perceived discrimination
– Emotional support
– Cultural sensitivity

IPC-II Conceptual Framework (cont)

IV. OFFICE STAFF
– Respectfulness
– Discrimination

V. FOR LIMITED ENGLISH PROFICIENCY PATIENTS
– MD’s and office staff’s sensitivity to language

Psychometric Equivalence


Psychometric Equivalence

Measures have similar measurement properties in all diverse groups of interest in your study
– e.g., English and Spanish language, African Americans and Caucasians

Measures have similar measurement properties in one diverse group as in original (mainstream) groups on which the measures were developed

Equivalence of Reliability?? No!

Difficult to compare reliability because it depends on the distribution of the construct in a sample
– Thus lower reliability in one group may simply reflect lower variability

More important is the adequacy of the reliability in both groups
– Reliability meets minimal criteria within each group

Equivalence of Criterion Validity

Determine if hypothesized patterns of associations with specified criteria are confirmed in both groups, e.g.
– a measure predicts utilization in both groups
– a cutpoint on a screening measure has the same specificity and sensitivity in both groups
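The cutpoint check can be sketched as follows: compute sensitivity and specificity at the same cutpoint separately in each group and compare them. The scores, criterion statuses, group labels, and the cutpoint of 16 below are all hypothetical; in practice the criterion would come from an independent gold-standard assessment.

```python
def sensitivity_specificity(scores, has_condition, cutpoint):
    """Sensitivity and specificity of 'score >= cutpoint' against a criterion.

    scores: screening scores; has_condition: matching list of booleans
    from the criterion measure. Assumes each group contains at least one
    respondent with and one without the condition.
    """
    tp = fn = tn = fp = 0
    for score, truth in zip(scores, has_condition):
        screened_positive = score >= cutpoint
        if truth and screened_positive:
            tp += 1
        elif truth:
            fn += 1
        elif screened_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


CUTPOINT = 16  # hypothetical cutpoint

# Hypothetical data: (scores, criterion status) per group.
groups = {
    "group_1": ([10, 18, 20, 5, 16, 3], [False, True, True, False, False, False]),
    "group_2": ([17, 12, 19, 8, 21, 14], [False, True, True, False, True, False]),
}

results = {
    name: sensitivity_specificity(scores, truth, CUTPOINT)
    for name, (scores, truth) in groups.items()
}
for name, (sens, spec) in results.items():
    print(f"{name}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

A marked gap between the groups at the same cutpoint would suggest the cutpoint is not equivalent, although with real data the comparison needs confidence intervals rather than point estimates alone.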

Equivalence of Construct Validity

Are hypothesized patterns of associations confirmed in both groups?
– Example: Scores on the Spanish version of the FACT had similar relationships with other health measures as scores on the English version

Primarily tested through subjectively examining pattern of correlations
– Can test differences using confirmatory factor analysis (e.g., through Structural Equation Modeling)

Item Equivalence

Differential Item Functioning (DIF)
– Items are non-equivalent if they are differentially related to the underlying trait
– Equivalence indicated by no DIF

Meaning of response categories is similar across groups

Distance between response categories is similar across groups

Equivalence of Response Choices: Spanish and English Self-rated Health

Excellent Very good Good Fair Poor

Excelente Muy buena Buena Regular Mala

“Regular” in Spanish may be closer to “good” in English, thus is not comparable to the meaning of “fair”

Spanish and English Self-rated Health Responses

Excellent Very good Good Fair Poor

Excelente Muy buena Buena Regular (Pasable?) Mala

Another choice, “pasable,” may be closer in meaning to “fair”

Methods for Identifying Differential Item Functioning (DIF)

Item Response Theory (IRT)

Examines each item in relation to underlying latent trait

Tests if responses to one item predict the underlying latent “score” similarly in two groups
– if not, items have “differential item functioning”
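Fitting an IRT model is beyond a short example, so the sketch below swaps in a different and simpler screening technique: the Mantel-Haenszel common odds ratio for a yes/no item. It is not the IRT method named above. Respondents are matched on a score standing in for the latent trait (for example, the total on the remaining items), and the odds of endorsing the item are compared between groups within each matched level. The group labels, record layout, and data are hypothetical.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Mantel-Haenszel common odds ratio for one dichotomous item.

    records: iterable of (group, matching_score, endorsed), where group
    is "ref" or "focal". A value near 1 suggests no DIF; values far from
    1 suggest the item behaves differently at the same matching score.
    Raises ZeroDivisionError if no stratum has both a reference
    non-endorser and a focal endorser.
    """
    # Per stratum: [ref endorsed, ref not, focal endorsed, focal not]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, endorsed in records:
        cell = (0 if endorsed else 1) + (0 if group == "ref" else 2)
        tables[score][cell] += 1

    numerator = denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical data: two matching-score strata (0 and 1).
records = (
    [("ref", 0, True)] * 2 + [("ref", 0, False)] * 6
    + [("focal", 0, True)] * 4 + [("focal", 0, False)] * 4
    + [("ref", 1, True)] * 5 + [("ref", 1, False)] * 3
    + [("focal", 1, True)] * 7 + [("focal", 1, False)] * 1
)

odds_ratio = mantel_haenszel_odds_ratio(records)
print(f"MH odds ratio: {odds_ratio:.2f}")
```

Here an odds ratio below 1 means the focal group is more likely than the reference group to endorse the item at the same matching score. An IRT-based analysis would instead compare item parameters estimated in each group.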

Equivalence of Factor Structure

Factor structure is similar in new group to structure in original groups in which measure was tested
– In other words, the measurement model is the same across groups

Methods
– Specify the number of factors you are looking for
– Determine if the hypothesized model fits the data

Confirmatory Factor Analysis (CFA)

Can specify a hypothesized structure a priori

Can test mean and covariance structures
– to estimate bias

Equivalence of Factor Structure: Testing Psychometric Invariance

Psychometric invariance (equivalence)

Important properties of theoretically-based factor structure (measurement model) do not vary across groups (are invariant)
– measurement model is the same across groups

Empirical comparison across groups
– Not simply by examination

Criteria for Psychometric Invariance

Across all groups – a sequential process:
– Same number of factors or dimensions
– Same items on same factors
– Same factor loadings
– No bias on any item across groups
– Same residuals on items
– No item or scale bias AND same residuals

Criteria for Evaluating Invariance Across Groups: Technical Terms

Dimensional invariance: same number of factors

Configural invariance: same items load on same factors

Metric or factor pattern invariance: items have same loadings on same factors

Scalar or strong factorial invariance: observed scores are unbiased

Residual invariance: observed item and factor variances are unbiased

Strict factorial invariance: both scalar and residual criteria are met
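The sequential logic of these criteria can be sketched as a small helper that takes the pass/fail outcome of each test and reports the highest level reached. It does not fit any models; the outcomes would come from multi-group confirmatory factor analyses run elsewhere. The function name and the dictionary layout are illustrative assumptions.

```python
def highest_invariance(passed):
    """Return the highest invariance level supported by the test outcomes.

    passed: dict mapping level name to True/False. Dimensional,
    configural, and metric invariance are checked in order, and the
    process stops at the first failure. Scalar and residual invariance
    both build on metric invariance; strict invariance needs both.
    Returns None if even dimensional invariance fails.
    """
    achieved = None
    for level in ("dimensional", "configural", "metric"):
        if not passed.get(level, False):
            return achieved
        achieved = level

    scalar = passed.get("scalar", False)
    residual = passed.get("residual", False)
    if scalar and residual:
        return "strict"
    if scalar:
        return "scalar"
    if residual:
        return "residual"
    return achieved


# Hypothetical outcomes for one set of items.
outcomes = {
    "dimensional": True,
    "configural": True,
    "metric": True,
    "scalar": False,
    "residual": False,
}
level_reached = highest_invariance(outcomes)
print(level_reached)
```

Stopping at the first failure mirrors the sequential process described above: later criteria are only meaningful once the earlier ones hold.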

Interpersonal Processes of Care (IPC)

Social-psychological aspects of the patient-physician interaction
– communication, respectfulness, patient-centered decision-making, and being sensitive to patients’ needs

Developed survey of 92 items based on principles outlined above

Conducted Survey

From over 16,000 primary care patients, randomly sampled those who:
– Made at least one visit in prior 12 months
– Were recorded as African American, Latino, or White (Caucasian)

Sampled within race/ethnic group

Sample Size (N=1,664)

383 Spanish speaking Latino

435 African American

428 English speaking Latino

418 White

Results

Of the 92 items, 29 had similar factor structure across all 4 groups
– achieved “metric invariance”

Strong factorial or scalar invariance

Residual invariance: observed item and factor variances can be compared across groups

Strict factorial invariance: both scalar invariance and residual invariance criteria are met

Results: Metric Invariance Across 4 Groups for 29 Items

Seven “Metric Invariant” Scales (29 items)

I. COMMUNICATION
– Hurried communication
– Elicited concerns, responded
– Explained results, medications

II. DECISION MAKING
– Patient-centered decision-making

III. INTERPERSONAL STYLE
– Compassionate, respectful
– Discriminated
– Disrespectful office staff

Continued Exploration of Invariance: Item Bias

Tested invariance of model parameter estimates across groups for “scalar invariance”– Bias in items


Obtained Partial Scalar Invariance Across 4 Groups for 18 Items

Seven “Scalar Invariant” (Unbiased) Scales (18 items)

I. COMMUNICATION
Hurried communication – lack of clarity
Elicited concerns, responded
Explained results, medications – explained results

II. DECISION MAKING
Patient-centered decision-making – decided together

III. INTERPERSONAL STYLE
Compassionate, respectful – (subset) compassionate, respectful
Discriminated – discriminated due to race/ethnicity
Disrespectful office staff

What to do if Measures Are Not Equivalent in a Specific Study Comparing Groups

Need guidelines for how to handle data when substantial non-comparability is found in a study
– Drop bad or “biased” items from scores
  » Compare results with and without biased items
– Analyze study by stratifying diverse groups

The current challenge for measurement in minority health studies

Example: 20-item Spanish CES-D in Older Latinos

2 items had very low item-scale correlations, high rates of missing data in two studies
– I felt hopeful about the future
– I felt I was just as good as other people

                              Study 1        Study 2
20-item version
– Item-scale correlations     -.20 to .73    .05 to .78
– Cronbach’s alpha
18-item version
– Item-scale correlations     .45 to .76     .33 to .79
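Item-scale correlations like those in the table can be computed as sketched below. This version assumes the "corrected" form, in which each item is correlated with the sum of the other items so that the item does not inflate its own correlation; the slide does not say which form was used. The responses are hypothetical and are not the CES-D data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def item_rest_correlations(rows):
    """Correlate each item with the sum of the remaining items."""
    n_items = len(rows[0])
    correlations = []
    for i in range(n_items):
        item = [row[i] for row in rows]
        rest = [sum(row) - row[i] for row in rows]
        correlations.append(pearson(item, rest))
    return correlations


# Hypothetical responses: 4 respondents, 3 items. The third item runs
# against the other two, as a misunderstood positively worded item might.
responses = [
    [1, 1, 4],
    [2, 2, 2],
    [3, 3, 3],
    [4, 4, 1],
]

item_rest = item_rest_correlations(responses)
for index, r in enumerate(item_rest, start=1):
    print(f"item {index}: r = {r:.2f}")
```

Items with very low or negative values are candidates for review or removal, after which the correlations and alpha would be recomputed on the shortened scale.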

Example: Measure Can be Modified

GHAA Consumer Satisfaction Survey

Adapted to be appropriate for African American patients
– Focus groups conducted to obtain perspectives of African Americans
– New domains added (e.g., discrimination/stereotyping)
– New items added to existing domains

Fongwa M et al., Ethnicity and Disease, 2006;16:948-955.

Approaches to Conducting Studies When You Are Not Sure

Use a combination of “universal” and group-specific items
– use universal items to compare across groups
– use specific items (added onto universal items) when conducting analyses within one group
  » To find a variable that correlates with a health measure within one group

Conclusions

Measurement in health disparities and minority health research is a relatively new field - few guidelines

Encourage first steps - test and report adequacy and equivalence

As evidence grows, concepts and measures that work better across diverse groups will be identified

Two Special Journal Issues on Measurement in Diverse Populations

Measurement in older ethnically diverse populations– J Mental Health Aging, Vol 7, Spring 2001

Measurement in a multi-ethnic society– Med Care, Vol 44, November 2006

Homework for Class 3

Using the same template and measure you reviewed for class 2, complete sections 14-21
– Use same file and submit entire document by email to anita.stewart@ucsf.edu
