
Validation of Healthcare Databases

ALDANA ROSSO, PH.D.

LUND UNIVERSITY AND SKÅNE UNIVERSITY HOSPITAL. 28 APRIL 2017

Example: Patients with Heart Failure

• We want to study all-cause mortality in patients who have suffered heart failure.

• How do we do it?

Definition of Population

All patients that have HF

ICD10 code for Heart Failure


Patients with ICD10 I50*





Patients fulfilling the HF

register definition

Definition of Population: Completeness



Patients fulfilling the HF register definition

Registered patients


• Be aware that your study population is different from your population of interest. Are the results generalizable?

Example: Swedish HF Register

Difference in All-Cause Mortality

Problem: Selection Bias

What you Need to Know about Your

Study Population

• How the population in the register is defined.

• How many patients are registered (completeness). This is

determined by comparison with other registers.

• Which clinics are reporting to the register and why?

• This is needed to ”estimate” selection bias in your study.
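
Completeness by comparison with another register can be sketched as follows. This is a minimal illustration, assuming both registers can be reduced to sets of patient identifiers and that the reference register is treated as the truth; the function name and the identifiers are hypothetical.

```python
def completeness(register_ids, reference_ids):
    """Share of patients in the reference register that also appear in the register."""
    register_ids, reference_ids = set(register_ids), set(reference_ids)
    if not reference_ids:
        raise ValueError("reference register is empty")
    return len(register_ids & reference_ids) / len(reference_ids)


# Hypothetical patient identifiers, for illustration only.
quality_register = {"p01", "p02", "p03", "p05"}
patient_register = {"p01", "p02", "p03", "p04", "p05"}
coverage = completeness(quality_register, patient_register)
```

A single number like this says nothing about why patients are missing, which is why the reporting clinics also need to be known.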

Example: Spanish National Acute Coronary

Syndrome register

• Audit of the Spanish national acute coronary syndrome register [Ferreira-Gonzalez et al, Circulation: Cardiovascular Quality and Outcomes,2009; 2: 540-547].

• The authors compared enrolled patients with those who were not enrolled at a subset of the participating hospitals (17 of 50).

• Missed patients were at higher risk and received fewer recommended therapies than the included patients. In-hospital mortality was almost 3 times higher in the missed population.

http://www.thelovesensei.com/perfect-for-you-doesnt-necessarily-translate-to-perfect-human-being/

Research using Healthcare Databases

Implies…

1. Wrong data: misclassification

2. Missing data

• REMEMBER: you are using data for research that was

created for other purposes!

Example: Misclassification

Intensive Care Register

http://portal.icuregswe.org/Rapport.aspx

Code   Diagnosis                                                          Count
Z04.9  Examination and observation for unspecified reason                 3200
J96.9  Respiratory failure, unspecified                                   2000
I46.9  Cardiac arrest, unspecified                                        1500
R57.2  Septic shock                                                       1500
R65.1  Systemic inflammatory response syndrome [SIRS] of infectious
       origin with organ failure                                          1500
T07.9  Unspecified multiple injuries                                      1400
R56.8  Other and unspecified convulsions                                  1400
K92.2  Gastrointestinal haemorrhage, unspecified                          1200
J15.9  Bacterial pneumonia, unspecified                                   1200


Implications of Misclassification

• It affects the statistical analyses, just accept it!

• How much depends on how bad the misclassification is and on the reason for the misclassification => Sensitivity analysis.
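
One simple form of sensitivity analysis is to recompute an observed proportion under several assumed misclassification scenarios. The sketch below uses the standard correction for a proportion measured with imperfect sensitivity and specificity (often called the Rogan-Gladen estimator). It is an illustration chosen here, not necessarily the method used in the talk, and the observed proportion and the sensitivity/specificity pairs are made-up assumptions.

```python
def corrected_proportion(observed, sensitivity, specificity):
    """Correct an observed proportion for misclassification.

    Uses observed = true * sensitivity + (1 - true) * (1 - specificity),
    solved for the true proportion and clipped to [0, 1].
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    estimate = (observed + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))


# Hypothetical scenarios: how would an observed 10 % change under
# different assumptions about the coding quality?
scenarios = [(1.00, 1.00), (0.90, 0.99), (0.80, 0.98)]
results = {s: corrected_proportion(0.10, *s) for s in scenarios}
```

If the conclusions of the study survive all plausible scenarios, the misclassification is less of a concern.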

Example: ICU diagnoses

Diagnosis  Count
Z04.9      3200
J96.9      2000
I46.9      1500
R57.2      1500
R65.1      1500
T07.9      1400
R56.8      1400
K92.2      1200
J15.9      1200

Proportion of incorrect diagnoses    Proportion of correct
recorded as "Z04.9"                  rankings
2 %                                  4 % (2 % - 10 %)
5 %                                  5 % (2 % - 12 %)

Implications of Missingness

• It depends on why the data is missing:

– missing at random: problematic from a power perspective

but it doesn’t bias the results.

– missing not at random: bias + power problems. That’s

why you need to know which clinics are reporting and

why!

Consequences of Statistical Analysis with

Missing Data

• Some calculations are more sensitive than others.

• Ranking is especially bad!

Example: Calculations with Missing

Data

• Monte Carlo simulation: a register with 20 000 primary

operations and 3 hospitals.

• 5% of those patients have a reoperation.

• How does the percentage of missing values affect the

percentage of reoperation?

Example: Calculations with Missing

data

• 20000 patients, 5 % have a reoperation.

• 5 % reoperations missing at random.

Hospital  Proportion    Proportion with  95 % CI
          reoperation   missing data
1         0.048         0.046            (0.045, 0.047)
2         0.050         0.048            (0.047, 0.049)
3         0.052         0.049            (0.048, 0.050)
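
The observed proportions in the table are consistent with a simple expectation, sketched here under the assumption that each reoperation is lost independently with the same probability. The symbols $p$ (true proportion), $m$ (proportion of reoperations missing) and $\hat{p}_{\mathrm{obs}}$ (observed proportion) are introduced here for illustration.

```latex
E[\hat{p}_{\mathrm{obs}}] = p\,(1 - m), \qquad
0.048 \times 0.95 = 0.0456, \quad
0.050 \times 0.95 = 0.0475, \quad
0.052 \times 0.95 = 0.0494
```

Every hospital's estimate is shifted downward by the same factor, and the confidence intervals around the observed values no longer cover the true proportions.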

Example: Calculations with Missing Data:

Which clinic is reoperating more patients?

• 20000 patients, 5 % have a reoperation.

Percentage     Proportion
missing data   correct ranking
5 %            95 %
7 %            87 %
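
A simulation of this kind can be sketched with the Python standard library. This is a minimal sketch, not the author's code: it assumes three equally sized hospitals, takes the true reoperation proportions from the earlier table (0.048, 0.050, 0.052), and drops each reoperation independently with the same probability. It is not expected to reproduce the percentages above, because the settings used in the talk are not given.

```python
import random


def simulate(true_rates=(0.048, 0.050, 0.052), n_patients=20_000,
             missing_frac=0.05, n_sim=200, seed=1):
    """Return (mean observed proportion per hospital, share of correct rankings)."""
    rng = random.Random(seed)
    n_per = n_patients // len(true_rates)        # patients per hospital (assumed equal)
    true_order = sorted(range(len(true_rates)), key=lambda h: true_rates[h])
    sums = [0.0] * len(true_rates)
    correct = 0
    for _ in range(n_sim):
        observed = []
        for rate in true_rates:
            # Draw the true reoperations, then drop each one with probability missing_frac.
            reops = sum(rng.random() < rate for _ in range(n_per))
            kept = sum(rng.random() >= missing_frac for _ in range(reops))
            observed.append(kept / n_per)
        order = sorted(range(len(observed)), key=lambda h: observed[h])
        correct += order == true_order           # was the hospital ranking recovered?
        sums = [s + o for s, o in zip(sums, observed)]
    return [s / n_sim for s in sums], correct / n_sim


mean_observed, share_correct = simulate()
```

Calling `simulate(missing_frac=0.07)` or making the missingness differ between hospitals shows how quickly the ranking degrades compared with the proportions themselves.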

Example: Swedish Knee Arthroplasty

register

Example: Neovascular Age Related

Macular Degeneration (AMD)

Leading cause of vision loss among

people age 50 and older.

Dry AMD: gradual breakdown of the

light-sensitive cells in the macula.

Neovascular (wet) AMD: abnormal

blood vessels grow underneath the

retina, which can leak fluid and

blood, which may lead to swelling

and damage of the macula.

https://nei.nih.gov/health/maculardegen/armd_facts

[Images: nAMD fundus and normal fundus]

Treatment: Anti-VEGF Injections

VEGF: Vascular endothelial growth factor

Aflibercept: Eylea, Bayer

Ranibizumab: Lucentis, Genentech

Bevacizumab: Avastin, Genentech

Visual Acuity

”Lowest acceptable” :

20/70: 60 ETDRS letters

The Swedish Macula Register

• A national register for treatment of neovascular AMD.

• 80 % coverage.

• National results for AMD treatment concerning age, sex, type of lesion,

treatment frequencies, and follow-up visits

• Medical outcome: distance visual acuity, near visual acuity and adverse

events.

• Analyze and compare different treatments and their outcome.

• Validation: a limited study provided information about errors in the database.

Treatment Compliance

• Known important factors to succeed with treatment:

– Age

– Good VA at baseline

• Patients are active in the register for about 1.5 years.

• Why do patients not continue with treatment and/or

control visits?

Definition of Treatment Termination During Year 1

• Several possible definitions:

A. Patients who did not have a control visit at year 1 and have had no registered visits for at least 4 months since their last visit.

B. Patients who terminated the treatment for known reasons + patients without a control visit.

C. Patients who do not have a 1-year control visit and whose latest registered VA was good.

Statistical Methods

• Objective: calculate risk of early treatment termination for all causes (outcome B).

• Model: Poisson regression to calculate risk. Note: it is preferable to use logistic regression and then calculate the risk with a Stata macro, but not when multiple imputation is used.

• Model Validation:

– Comparison predicted/observed probabilities

– Deviance residuals

– Comparison with logistic regression

– Other ideas that work with clustered data?

Missing Data

• Missing values for VA. I used multiple imputation (MI) to impute VA at baseline for the treated eyes (4 eyes) and fellow eyes (88 of 989 eyes, 9 %).

• It is difficult to know whether there are missing visits; around 20 % of patients are missing compared with the National Patient Register.

• Errors in the database: approximately 5 % of the registered VA values are wrong.

• MC simulations with 20 % missing visits and 5 %

changed VA values.

Take Home Message

• Be aware that your study population is different from your population of interest. Check:

– How the population in the register is defined.

– How many patients are registered (completeness).

This is determined by comparison with other registers.

– Which clinics are reporting to the register and why?

– Missing values and misclassification: How much? Is it

possible to compare with other registers?

Error Sources in National Quality Registries

1. Before registration

2. During registration

3. After registration

Error Sources in Healthcare Databases

1. Before registration:

– wrong data is registered in the medical record.

– the patient is not enrolled in the register.

– It is very difficult to estimate the error proportion of this

type of error.

Error Sources in Healthcare Databases

2. During registration:

– misinterpretation and inaccurate typing: wrong value,

wrong calculation, wrong alternative.

– incomplete data.

• Misinterpretation and inaccurate typing: 5 %.*

• Missing data: 3 %.*

*Arts D.G.T. et al., Defining and improving data quality in medical registries: a literature review, case study, and generic framework. J Am Med Inform Assoc, 2002 (9) 600.

Error Sources in Healthcare Databases

3. After registration:

– programming errors.

– Communication problems between different databases.

• They are very difficult to find and have severe consequences.



Is automatic registration the solution?

Dutch National Intensive Care Evaluation

Register (NICE)



NICE contains data from

patients who have been

admitted to Dutch

intensive care units and

provides insight into the

effectiveness and

efficiency of Dutch

intensive care.

Validation of Healthcare Databases

• Contact a statistician before starting the project!

Validation Using Sampling Theory

Principle: we take a sample of patients and compare their medical records with the registered data. We then extrapolate the result to all the patients in the register.

Validation of Healthcare Databases

References:

http://www.scb.se/Upload/NSM2016/theme4/C_3_Aldana_Rosso.pdf

Simple Random Sampling

1. Randomly select some patients.

2. Estimate the proportion of incorrect data.

3. Extrapolate to the register.

Register

Simple Random Sampling

• Usually without replacement.

• It is easy to program.

• It may give a sample that doesn’t represent the register

very well.

• “It may require more patients than other sampling

techniques”.

• Expensive (transportation cost).
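
The three steps of simple random sampling can be sketched as follows. This is a minimal sketch under stated assumptions: the register is a list of records, `is_incorrect` stands in for the manual comparison with the medical record, and the interval uses a normal approximation with a finite population correction. All names and the example data are hypothetical.

```python
import math
import random


def srs_error_estimate(register, is_incorrect, n, seed=0):
    """Estimate the proportion of incorrect records from a simple random sample.

    register      -- list of records (the sampling frame)
    is_incorrect  -- function returning True if a record disagrees with the medical record
    n             -- sample size, drawn without replacement
    """
    rng = random.Random(seed)
    sample = rng.sample(register, n)
    p_hat = sum(is_incorrect(r) for r in sample) / n
    # Normal-approximation standard error with finite population correction.
    fpc = (len(register) - n) / (len(register) - 1)
    se = math.sqrt(p_hat * (1 - p_hat) / n * fpc)
    return p_hat, (max(0.0, p_hat - 1.96 * se), min(1.0, p_hat + 1.96 * se))


# Hypothetical register: 10 000 records, 5 % of them flagged as wrong.
register = [{"id": i, "wrong": i % 20 == 0} for i in range(10_000)]
p_hat, ci = srs_error_estimate(register, lambda r: r["wrong"], n=500)
estimated_wrong_records = p_hat * len(register)   # extrapolation to the register
```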

Stratified Random Sampling

1. Divide the register into strata (e.g. hospitals, regions, etc.).

2. Within each stratum, select some patients.

3. Estimate the proportion of incorrect data.

4. Extrapolate to the register.

Register


• It is possible to estimate the proportion of error for each

stratum (hospitals).

• It requires the participation in the validation of all the strata.

• The patients within the stratum can be selected in several

ways:

– a fixed number of patients is selected randomly within each stratum.

– the sample from each stratum is proportional to the stratum's share of the sampling frame.


• At least as efficient as simple random sampling (SRS) for the same sample size.

• If information about the error distribution is known, the

design can be improved.

• It gives a more representative sample.

• Expensive (transportation cost).
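
Stratified sampling with either of the two allocation rules can be sketched like this. It is a minimal sketch, assuming the strata are given as a mapping from hospital to its records and that the overall estimate weights each stratum by its size; the function, parameter names and example data are hypothetical.

```python
import random


def stratified_error_estimate(strata, is_incorrect, fraction=None, fixed_n=None, seed=0):
    """Estimate error proportions per stratum and for the whole register.

    strata   -- dict mapping stratum name (e.g. hospital) to its list of records
    fraction -- proportional allocation: sample this share of every stratum
    fixed_n  -- alternative: sample the same number of records in every stratum
    """
    if (fraction is None) == (fixed_n is None):
        raise ValueError("give exactly one of fraction or fixed_n")
    rng = random.Random(seed)
    total = sum(len(records) for records in strata.values())
    per_stratum, overall = {}, 0.0
    for name, records in strata.items():
        n = fixed_n if fixed_n is not None else max(1, round(fraction * len(records)))
        n = min(n, len(records))
        sample = rng.sample(records, n)
        p_h = sum(is_incorrect(r) for r in sample) / n
        per_stratum[name] = p_h
        overall += len(records) / total * p_h   # weight by stratum size
    return per_stratum, overall


# Hypothetical strata with different error rates.
strata = {
    "hospital_A": [{"wrong": i % 50 == 0} for i in range(4000)],   # 2 % wrong
    "hospital_B": [{"wrong": i % 10 == 0} for i in range(1000)],   # 10 % wrong
}
per_stratum, overall = stratified_error_estimate(strata, lambda r: r["wrong"], fraction=0.1)
```

Unlike simple random sampling, this returns an error estimate for every hospital as well as for the register as a whole.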


[Figure: two registers, each divided into strata 1-4, contrasting efficient and non-efficient stratification]

Cluster Sampling

1. Divide the register into clusters (e.g. hospitals, regions, etc.).

2. Select some clusters.

3. Within each cluster, select some patients.

4. Estimate the proportion of incorrect data.

5. Extrapolate to the register.

Register

Cluster Sampling

• Multistage with different sampling weights. For example:

– Level 1: cluster region.

– Level 2: cluster hospital.

– Level 3: sampling units patients.

• It can be used in combination with stratification.

Cluster Sampling

• Lower transportation cost.

• Only some hospitals are represented in the validation.

• It requires more patients than SRS to achieve the same

precision.
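
A two-stage cluster sample can be sketched as follows. This is a minimal sketch, assuming hospitals are the clusters, clusters are drawn with equal probability, and the error proportion is estimated with a ratio estimator that weights each sampled hospital by its size; all names and data are hypothetical.

```python
import random


def cluster_error_estimate(clusters, is_incorrect, n_clusters, n_per_cluster, seed=0):
    """Two-stage sample: pick clusters at random, then patients within each.

    clusters -- dict mapping cluster name (e.g. hospital) to its list of records
    Returns the sampled cluster names and a ratio estimate of the error proportion.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    est_errors = est_size = 0.0
    for name in chosen:
        records = clusters[name]
        n = min(n_per_cluster, len(records))
        sample = rng.sample(records, n)
        p_c = sum(is_incorrect(r) for r in sample) / n
        est_errors += len(records) * p_c    # estimated wrong records in this cluster
        est_size += len(records)
    return chosen, est_errors / est_size


# Hypothetical register: 10 hospitals with 500 records each and varying error rates.
clusters = {
    f"hospital_{h}": [{"wrong": i % (10 + h) == 0} for i in range(500)]
    for h in range(10)
}
chosen, p_hat = cluster_error_estimate(
    clusters, lambda r: r["wrong"], n_clusters=4, n_per_cluster=100
)
```

Only the hospitals in `chosen` need to be visited, which is where the saving in transportation cost comes from.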

Cluster Sampling

[Figure: two registers, each divided into clusters 1-4, contrasting non-efficient and efficient clustering]

Another Issue…

• The patient actually had the diagnosis, right?

How to Estimate the Proportion of Correctly Diagnosed Patients

• It is done in a similar manner but with an expert group. It

is called ”Adjudication”.

• Usually reported as "positive predictive value" (PPV). For example, in the HF article it was around 95 %.
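
The positive predictive value from an adjudication can be computed as in the sketch below. The Wilson score interval is one common choice for a proportion close to 1 and is an assumption here, not something stated in the talk; the counts are hypothetical.

```python
import math


def ppv_with_wilson_ci(confirmed, reviewed, z=1.96):
    """Positive predictive value and Wilson score interval.

    confirmed -- registered diagnoses confirmed by the expert group
    reviewed  -- registered diagnoses that were reviewed
    """
    if reviewed <= 0 or not 0 <= confirmed <= reviewed:
        raise ValueError("need 0 <= confirmed <= reviewed and reviewed > 0")
    p = confirmed / reviewed
    denom = 1 + z**2 / reviewed
    centre = (p + z**2 / (2 * reviewed)) / denom
    half = z * math.sqrt(p * (1 - p) / reviewed + z**2 / (4 * reviewed**2)) / denom
    return p, (centre - half, centre + half)


# Hypothetical adjudication result: 190 of 200 reviewed diagnoses confirmed.
ppv, (low, high) = ppv_with_wilson_ci(190, 200)
```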

What happens after Validation?

• Ethical and legal viewpoints: We should correct the data

(Sweden).

• Bias concerns?

• Information about the data quality and the missing data in

the publications?
