Health care data - survival analysis, draft

DATA MINING AND STATISTICAL ANALYSIS SOLUTIONS Readmission probability prediction using open health care data David Budaghyan Erik Hambardzumyan Habet Madoyan Lilit Simonyan Lusine Sargsyan Vahe Movsisyan

Upload: habet-madoyan

Post on 13-Apr-2017


Page 1: Health care data - survival analysis, draft

DATA MINING AND STATISTICAL ANALYSIS SOLUTIONS

Readmission probability prediction using open health care data

David Budaghyan, Erik Hambardzumyan

Habet Madoyan, Lilit Simonyan

Lusine Sargsyan, Vahe Movsisyan

Page 2

The project and the team

David Budaghyan

Erik Hambardzumyan

Habet Madoyan

Lilit Simonyan

Lusine Sargsyan

Vahe Movsisyan

Our team

• The team was formed to participate in the OpenData hackathon organized by Kolba Labs on December 3-4.

• We continued working after the hackathon; this is the result of those later efforts.

• The findings here are not final and can be revised with better data and a more thorough approach.

• This is just a prototype; we are open to any criticism and suggestions.

Page 3

The context

• The initial data was collected by the Ministry of Health

• Around 13 million records

• Each record is an encounter of a citizen with a health care institution

• The data is messy, with a lot of missing values and inconsistencies

The following variables were present in the data:

• Marz (province)

• Gender

• Age Group

• Payment Type

• Eligibility Type

• Encounter Purpose

• Phc Diagnose Group

• Encounter Outcome Treatment

Page 4

The context: Data Transformation

• The data was transformed to identify individual patients

• Patient identification uses 3 variables (gender, birthday, marz). Thanks to the Fimetech team.
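A minimal sketch of how such a composite key could turn encounter records into patient identifiers. The field names (`gender`, `birthday`, `marz`, `patient_id`) and the sample rows are hypothetical; the actual Fimetech procedure is not described on the slides. Note that two different people who share gender, birthday and marz would be merged into one "patient" under this scheme.

```python
def assign_patient_ids(records):
    """Give every record a patient_id derived from (gender, birthday, marz).

    `records` is a list of dicts; the key names are illustrative assumptions.
    """
    ids = {}       # composite key -> running patient number
    labelled = []
    for rec in records:
        key = (rec["gender"], rec["birthday"], rec["marz"])
        # First time we see a key, hand out the next free number.
        patient_id = ids.setdefault(key, len(ids) + 1)
        labelled.append({**rec, "patient_id": patient_id})
    return labelled


# Hypothetical example rows, not taken from the real data set.
encounters = [
    {"gender": "female", "birthday": "1980-02-11", "marz": "Lori"},
    {"gender": "male", "birthday": "1975-07-30", "marz": "Shirak"},
    {"gender": "female", "birthday": "1980-02-11", "marz": "Lori"},
]
labelled = assign_patient_ids(encounters)
print([row["patient_id"] for row in labelled])
```

The first and third rows share all three key fields, so they receive the same identifier.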

Page 5

The goal of analysis

• Can we predict patient readmission over time?

• What is the probability that a patient will not need a new encounter within 30, 180 or 365 days after the first encounter?

• Why?

• Healthcare costs: if a mandatory health insurance program is introduced, the model will allow predicting overall costs for the economy

• Fraud detection: fraud goes hand in hand with insurance; survival analysis helps to identify deviant behavior (too many repeat visits for a given disease?)

• Deviant behavior at the clinic/doctor level: will help to understand skills gaps, ineffective management, etc. at the level of local clinics

• Modeling insurance premiums: if we understand the cost at the marz/disease/… level, we can offer tailored premiums

Page 6

Cox Proportional-Hazards Model

- Marz
- Gender
- Age Group
- Payment Type
- Eligibility Type
- Encounter Purpose
- Phc Diagnose Group
- Encounter Outcome Treatment Type

Term                                                coef  exp(coef)  robust se        z  Pr(>|z|)
Marz_ID_Aragatsotn                                (reference level)
Marz_ID_Ararat                                    -0.122      0.885      0.042   -2.921    0.0035  **
Marz_ID_Armavir                                    0.094      1.098      0.036    2.607    0.0091  **
Marz_ID_Gegharqunik                               -0.051      0.951      0.035   -1.431    0.1525
Marz_ID_Kotayk                                     0.063      1.065      0.041    1.545    0.1224
Marz_ID_Lori                                      -0.059      0.943      0.037   -1.594    0.1109
Marz_ID_Shirak                                     0.076      1.079      0.035    2.161    0.0307  *
Marz_ID_Syunik                                    -0.199      0.819      0.268   -0.743    0.4575
Marz_ID_Tavush                                     0.049      1.051      0.049    1.009    0.3131
Marz_ID_Vayots_Dzor                               -0.035      0.966      1.094   -0.032    0.9746
Marz_ID_Yerevan                                   -0.086      0.918      0.041   -2.087    0.0369  *
Gender_ID_female                                  (reference level)
Gender_ID_male                                    -0.042      0.959      0.017   -2.547    0.0109  *
Age_ID_0-5                                        (reference level)
Age_ID_5-10                                       -0.187      0.829      0.063   -2.974    0.0029  **
Age_ID_10-18                                      -0.310      0.734      0.145   -2.140    0.0324  *
Age_ID_18-30                                      -0.133      0.875      0.142   -0.942    0.3460
Age_ID_30-60                                      -0.049      0.952      0.122   -0.406    0.6850
Age_ID_60+                                         0.083      1.086      0.122    0.680    0.4965
Payment_Type_ID_paid                              (reference level)
Payment_Type_ID_state_ordered                      0.153      1.166      0.166    0.923    0.3558
Eligibility_ID_ArmedForces                        (reference level)
Eligibility_ID_children_vulnerable                 0.240      1.271      0.244    0.984    0.3251
Eligibility_ID_disabled_people                     0.379      1.461      0.212    1.785    0.0743  .
Eligibility_ID_elderly_people                      0.277      1.320      0.218    1.270    0.2040
Eligibility_ID_family_vulnerable                   0.432      1.540      0.230    1.880    0.0601  .
Eligibility_ID_other                               0.018      1.019      0.221    0.083    0.9337
Eligibility_ID_poverty_beneficiary                -0.084      0.919      0.296   -0.284    0.7765
Eligibility_ID_pregnancy                          -0.154      0.857      0.360   -0.428    0.6688
Eligibility_ID_social_package_beneficiary         -0.350      0.704      0.231   -1.520    0.1285
Eligibility_ID_young_men                          -1.052      0.349      0.502   -2.094    0.0362  *
Encounter_Purpose_ID_disease                      (reference level)
Encounter_Purpose_ID_control                       0.120      1.127      0.027    4.487    0.0000  ***
Encounter_Purpose_ID_administrative                0.195      1.215      0.192    1.014    0.3108
Encounter_Purpose_ID_preventive                   -0.022      0.978      0.054   -0.404    0.6861
Encounter_Purpose_ID_reproductive                  0.459      1.583      0.278    1.654    0.0982  .
Encounter_Purpose_ID_other                        -1.332      0.264      0.050  -26.386    0.0000  ***
Phc_Diagnose_ID_A                                 (reference level)
Phc_Diagnose_ID_I                                 -0.107      0.899      0.024   -4.430    0.0000  ***
Phc_Diagnose_ID_J                                 -0.028      0.973      0.039   -0.715    0.4744
Phc_Diagnose_ID_K                                  0.145      1.156      0.143    1.011    0.3119
Encounter_Outcome_Treatment_ID_chronic_condition  (reference level)
Encounter_Outcome_Treatment_ID_death              -0.142      0.868      0.441   -0.322    0.7475
Encounter_Outcome_Treatment_ID_improvement         0.091      1.096      0.059    1.546    0.1222
Encounter_Outcome_Treatment_ID_recovery            0.049      1.050      0.057    0.855    0.3927
Encounter_Outcome_Treatment_ID_stabilisation       0.091      1.095      0.042    2.164    0.0305  *
Encounter_Outcome_Treatment_ID_treatment_stop      0.420      1.522      0.159    2.651    0.0080  **
Encounter_Outcome_Treatment_ID_unchanged           0.045      1.046      0.043    1.055    0.2915
Encounter_Outcome_Treatment_ID_worsening           0.132      1.141      0.084    1.567    0.1170

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1

Concordance= 0.537 (se = 0.003)
Rsquare= 0.038 (max possible= 1)
Likelihood ratio test= 660 on 41 df, p=0
Wald test = 7450 on 41 df, p=0
Score (logrank) test = 607.2 on 41 df, p=0, Robust = 472.8 p=0
n= 17067, number of events= 15504 (90933 observations deleted due to missingness)

\[
S_t = \left(1 - \frac{\mathrm{n.event}_t}{\mathrm{n.risk}_t - \mathrm{n.censored}_t}\right) S_{t-1}
\]

\[
S_0 = 1 - \frac{\mathrm{n.event}_0}{\mathrm{n.risk}_0}
\]

S_t: proportion of patients who didn't return to hospital in t days after first visit
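The recurrence for S_t can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `subtract_censored` flag and the toy durations are hypothetical and not the authors' code. With `subtract_censored=True` the denominator follows the formula as written on the slide; the textbook Kaplan-Meier estimator divides by the number at risk alone, which corresponds to `subtract_censored=False`.

```python
from collections import Counter


def survival_curve(durations, events, subtract_censored=True):
    """Return {t: S_t} for every observed time t.

    durations: days from first visit to return visit (or to censoring)
    events:    1 if the patient returned at that time, 0 if censored
    """
    n_event = Counter(t for t, e in zip(durations, events) if e)
    n_censored = Counter(t for t, e in zip(durations, events) if not e)
    n_risk = len(durations)  # patients still under observation
    s = 1.0                  # S before the first observed time
    curve = {}
    for t in sorted(set(durations)):
        d, c = n_event.get(t, 0), n_censored.get(t, 0)
        denom = n_risk - c if subtract_censored else n_risk
        if denom > 0:
            s *= 1.0 - d / denom
        curve[t] = s
        n_risk -= d + c      # these patients leave the risk set
    return curve


def survival_at(curve, horizon):
    """S at the last observed time not exceeding `horizon` days."""
    s = 1.0
    for t in sorted(curve):
        if t > horizon:
            break
        s = curve[t]
    return s


# Toy data (invented): five patients, one censored at day 20.
curve = survival_curve([10, 20, 20, 40, 200], [1, 1, 0, 1, 1])
for horizon in (30, 180, 365):
    print(horizon, round(survival_at(curve, horizon), 3))
```

Reading the curve at 30, 180 and 365 days gives the kind of three-column summary used in the tables on the following slides.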

Page 7

Findings

• Marzes: Ararat, Armavir, Shirak and Yerevan have significant coefficients. Based on the signs of the coefficients, Armavir and Shirak have a higher probability of readmission than Aragatsotn, while Ararat and Yerevan have a lower probability of readmission than Aragatsotn.

• Gender: the significant negative coefficient of Gender_ID_male indicates that females visit the hospital more frequently than males.

• Age: people aged 5-18 have a lower readmission probability than children aged 0-5, as shown by the significant negative coefficients of Age_ID_5-10 and Age_ID_10-18.

Page 8

Findings

• Eligibility: the only eligibility coefficient significant at the 0.05 level is that of young men; its negative sign means that armed forces have a higher attendance probability than young men.

• Purpose: hospital visits for control have a higher readmission probability than visits for disease, which can be explained by the fact that control visits imply regular attendance.

• Diagnosis: certain infectious and parasitic diseases have a higher readmission probability than some heart diseases.
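The sign-based reading of the coefficients can be made concrete with hazard ratios. In a Cox model, exp(coef) is the hazard ratio of a level relative to its reference level, so values above 1 mean earlier or more frequent return visits. The sketch below recomputes the ratio and an approximate 95% interval from a coefficient and its robust standard error; the function name is hypothetical, and the normal-approximation interval is an assumption of this example, not something reported on the slides.

```python
import math


def hazard_ratio(coef, se, z_crit=1.96):
    """Return (hazard ratio, lower bound, upper bound) for one coefficient.

    Uses a normal approximation on the log scale: exp(coef +/- z_crit * se).
    """
    return (
        math.exp(coef),
        math.exp(coef - z_crit * se),
        math.exp(coef + z_crit * se),
    )


# Coefficients and robust standard errors copied from the Cox output.
ararat = hazard_ratio(-0.122, 0.042)    # Marz_ID_Ararat vs Aragatsotn
armavir = hazard_ratio(0.094, 0.036)    # Marz_ID_Armavir vs Aragatsotn
young_men = hazard_ratio(-1.052, 0.502) # Eligibility_ID_young_men vs ArmedForces

for name, (hr, low, high) in [("Ararat", ararat), ("Armavir", armavir),
                              ("young men", young_men)]:
    print(f"{name}: HR={hr:.3f}, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

An interval that stays entirely below 1 (Ararat) or entirely above 1 (Armavir) matches the significance stars in the output. Small differences from the printed exp(coef) column can arise because the inputs here are already rounded to three decimals.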

Page 9

Marz

Marz         30 days  180 days  365 days
Aragatsotn     0.522     0.106     0.046
Ararat         0.521     0.106     0.045
Armavir        0.504     0.103     0.041
Gegharqunik    0.533     0.103     0.042
Kotayk         0.466     0.095     0.039
Lori           0.554     0.132     0.063
Shirak         0.429     0.068     0.022
Syunik         0.485     0.138     0.069
Tavush         0.558     0.079     0.034
Vayots Dzor    0.635     0.163     0.070
Yerevan        0.486     0.119     0.047

At 180 and 365 days, Shirak and Tavush have the lowest S_t (proportion of patients who didn't return to hospital in t days after first visit).

At 180 and 365 days, Vayots Dzor, Syunik and Lori have the highest S_t.
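Since S_t is the proportion of patients who did not return within t days, 1 - S_t is the proportion who did. A small sketch, using the marz figures from this slide, shows how the table can be ranked and converted; the helper names are hypothetical.

```python
HORIZONS = (30, 180, 365)

# S_t values copied from the marz table (30, 180, 365 days).
survival_by_marz = {
    "Aragatsotn": (0.522, 0.106, 0.046),
    "Ararat": (0.521, 0.106, 0.045),
    "Armavir": (0.504, 0.103, 0.041),
    "Gegharqunik": (0.533, 0.103, 0.042),
    "Kotayk": (0.466, 0.095, 0.039),
    "Lori": (0.554, 0.132, 0.063),
    "Shirak": (0.429, 0.068, 0.022),
    "Syunik": (0.485, 0.138, 0.069),
    "Tavush": (0.558, 0.079, 0.034),
    "Vayots Dzor": (0.635, 0.163, 0.070),
    "Yerevan": (0.486, 0.119, 0.047),
}


def returned_share(s_t):
    """Proportion of patients who came back within t days."""
    return round(1.0 - s_t, 3)


def rank_by_survival(table, horizon):
    """Group names ordered from lowest to highest S_t at one horizon."""
    col = HORIZONS.index(horizon)
    return sorted(table, key=lambda name: table[name][col])


ranking_365 = rank_by_survival(survival_by_marz, 365)
print(ranking_365[:2], ranking_365[-3:])
print(returned_share(survival_by_marz["Shirak"][0]))
```

The same two helpers apply unchanged to the gender, age, payment, eligibility, purpose, diagnosis and outcome tables.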

Page 10

Gender

Gender   30 days  180 days  365 days
female     0.491     0.102     0.042
male       0.493     0.097     0.038

The survival proportions of females and males are nearly identical at all three horizons, so gender and hospital visit intensity appear unrelated.

Page 11

Age

Age group  30 days  180 days  365 days
0-5          0.469     0.066     0.021
5-10         0.571     0.201     0.086
10-18        0.595     0.272     0.126
18-30        0.562     0.206     0.100
30-60        0.512     0.113     0.048
60+          0.455     0.056     0.019

"0-5" and "60+" are the most risky age groups (lowest S_t).

Patients aged "10-18" have the lowest probability of visiting the hospital more than once a year.

Page 12

Payment type

Payment type   30 days  180 days  365 days
paid             0.513     0.138     0.032
state ordered    0.492     0.099     0.041

Patients with the state ordered payment type have a lower survival probability at 30 and 180 days, meaning their visits are more frequent over those horizons.

Page 13

Eligibility

Eligibility                 30 days  180 days  365 days
children vulnerable           0.479     0.069     0.021
disabled people               0.470     0.031     0.006
elderly people                0.466     0.041     0.019
pregnancy                     0.514     0.105     0.045
social package beneficiary    0.593     0.341     0.249

The most frequent visitors are disabled and elderly people, in contrast with social package beneficiaries, who have the highest survival rate.

Page 14

Encounter Purpose

Purpose         30 days  180 days  365 days
disease           0.508     0.109     0.029
control           0.478     0.045     0.012
administrative    0.543     0.183     0.079
preventive        0.499     0.156     0.079
reproductive      0.532     0.126     0.041
other             0.418     0.044     0.009

Visits for administrative purposes are mostly occasional, so they have a higher survival rate, whereas control visits have one of the lowest survival rates.

Page 15

Diagnosis

Diagnosis group  30 days  180 days  365 days
A                  0.487     0.072     0.022
I                  0.503     0.069     0.016
J                  0.518     0.101     0.026
K                  0.514     0.098     0.025

Certain infectious diseases (A) and heart diseases (I) have a lower survival probability than upper respiratory infections (J) and sclerosis and other mental disorders (K).

Page 16

Encounter Outcome Treatment

Outcome            30 days  180 days  365 days
chronic condition    0.497     0.051     0.017
improvement          0.525     0.102     0.024
recovery             0.525     0.109     0.030
stabilisation        0.475     0.055     0.016
treatment stop       0.458     0.122     0.087
unchanged            0.501     0.088     0.030
worsening            0.407     0.054     0.012

Worsening, chronic condition and stabilisation have the lowest survival rates (clearest at 180 days) since they require more attendances, whereas improvement, recovery and treatment stop have the highest survival rates since they require fewer visits.

Page 17

Predicting single readmission (machine learning case)

• The data is transformed so that each row is a person

• There is an indicator variable showing whether the patient was "readmitted", that is, has more than 2 records in the database

• The goal of the modeling is to predict the no-readmission rate based on 7 variables (age, gender, payment, treatment, etc.)
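The person-level transformation described above could look roughly like the following sketch. All field names (`patient_id`, `date`, `age_group`, `gender`) and the sample encounters are hypothetical, and taking the features from the earliest encounter is an assumption of this example; the slides do not say how the 7 variables were aggregated. The record-count threshold is kept as a parameter and defaults to the "more than 2 records" wording of the slide.

```python
from collections import defaultdict


def to_person_level(encounters, threshold=2):
    """Collapse encounter rows into one row per patient.

    A patient is flagged as readmitted when they have more than
    `threshold` records. Features come from the earliest encounter.
    """
    by_patient = defaultdict(list)
    for enc in encounters:
        by_patient[enc["patient_id"]].append(enc)

    rows = []
    for patient_id, recs in by_patient.items():
        first = min(recs, key=lambda r: r["date"])  # ISO dates sort as text
        rows.append({
            "patient_id": patient_id,
            "age_group": first["age_group"],
            "gender": first["gender"],
            "n_records": len(recs),
            "readmitted": int(len(recs) > threshold),
        })
    return rows


# Invented example encounters.
encounters = [
    {"patient_id": 1, "date": "2016-01-10", "age_group": "60+", "gender": "female"},
    {"patient_id": 1, "date": "2016-02-01", "age_group": "60+", "gender": "female"},
    {"patient_id": 1, "date": "2016-05-20", "age_group": "60+", "gender": "female"},
    {"patient_id": 2, "date": "2016-03-15", "age_group": "18-30", "gender": "male"},
]
people = to_person_level(encounters)
print(people)
```

Each resulting row can then be fed to a classifier with `readmitted` as the target.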

The Business goal

• Low readmission is a sign of good patient care

• Low readmission means low insurance and healthcare costs

Page 18

Tested models

• Random Forests

• Naïve Bayes

• Decision trees

• Gradient boosting

Page 19

[Figure: ROC curve of the blender model on the testing set, AUC ≈ 0.7035657]
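For readers unfamiliar with the metric, the sketch below shows one way an AUC for a blended model can be computed. It is not the authors' pipeline: the slides do not say how the blender combines the models, so simple averaging of predicted probabilities is an assumption here, and the labels and scores are invented. The AUC is computed as the share of (positive, negative) pairs that the score orders correctly, with ties counted as one half.

```python
def blend(*model_scores):
    """Average the predicted probabilities of several models, row by row."""
    return [sum(col) / len(col) for col in zip(*model_scores)]


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented test-set labels (1 = readmitted) and two models' scores.
labels = [1, 0, 1, 0]
model_a = [0.90, 0.40, 0.35, 0.20]
model_b = [0.60, 0.70, 0.80, 0.10]
blended = blend(model_a, model_b)

print(auc(labels, model_a), auc(labels, model_b), auc(labels, blended))
```

In this toy case each model misorders one pair on its own, while the averaged scores order every pair correctly, which is the usual motivation for blending. An AUC of 0.5 corresponds to random ordering and 1.0 to perfect ordering.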

Page 20

What next?

• What other data can we obtain?

• From the Ministry of Health?

• From clinics?

• From insurance companies?

• How can we make the data cleaner and more reliable?

• What are the real needs of stakeholders?

Page 21

Datamotus LLC