Health care data - survival analysis, draft
DATA MINING AND STATISTICAL ANALYSIS SOLUTIONS
Readmission probability prediction using open health care data
David Budaghyan, Erik Hambardzumyan
Habet Madoyan, Lilit Simonyan
Lusine Sargsyan, Vahe Movsisyan
The project and the team
David Budaghyan
Erik Hambardzumyan
Habet Madoyan
Lilit Simonyan
Lusine Sargsyan
Vahe Movsisyan
Our team
• The team was formed to participate in the OpenData hackathon organized by Kolba Labs on December 3-4.
• We continued working after the hackathon; this is the result of those later efforts.
• The findings here are not final and can be revised with better data and a more thorough approach.
• This is just a prototype; we are open to any criticism and suggestions.
The context
• The initial data is collected by the Ministry of Health
• Around 13 million records
• Each record is an encounter of a citizen with a health care institution
• The data is messy, with a lot of missing values and inconsistencies
The following variables were present in the data:
• Marz (administrative region)
• Gender
• Age Group
• Payment Type
• Eligibility Type
• Encounter Purpose
• Phc Diagnose Group
• Encounter Outcome Treatment
The context: Data Transformation
• The data is transformed to identify patients
• Patient identification uses 3 variables (gender, birthday, marz), thanks to the Fimetech team.
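A minimal sketch of the identification step described above, assuming each encounter is a dictionary and that the three key fields are named `gender`, `birthday` and `marz` (hypothetical field names; the slides do not show the actual schema or the Fimetech implementation). Records that share the same key get the same `patient_id`.

```python
def assign_patient_ids(records, key_fields=("gender", "birthday", "marz")):
    """Attach a patient_id to every encounter record.

    Records with identical values in key_fields are treated as one patient.
    The field names are assumptions made for this illustration.
    """
    ids = {}      # maps a key tuple to the patient_id assigned to it
    labelled = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        # A new key receives the next free integer id.
        pid = ids.setdefault(key, len(ids))
        labelled.append({**rec, "patient_id": pid})
    return labelled


encounters = [
    {"gender": "f", "birthday": "1980-01-02", "marz": "Lori"},
    {"gender": "m", "birthday": "1975-06-30", "marz": "Shirak"},
    {"gender": "f", "birthday": "1980-01-02", "marz": "Lori"},
]
patient_ids = [r["patient_id"] for r in assign_patient_ids(encounters)]
```

Note that a key of this kind merges different people who happen to share gender, birthday and marz, so the resulting patients are an approximation.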
The goal of analysis
• Can we predict patient readmission over time?
• What is the probability that a patient will not need a new encounter within 30, 180, or 365 days after the first encounter?
• Why?
  • Healthcare costs: if a mandatory health insurance program is introduced, the model will allow predicting overall costs for the economy
  • Fraud detection: fraud goes hand in hand with insurance; survival analysis helps to identify deviant behavior (too many repeat visits for a given disease?)
  • Deviant behavior at the clinic/doctor level: helps to understand skills gaps, ineffective management, etc. at the level of local clinics
  • Modeling insurance premiums: if we understand the cost at the marz/disease/… level, we can offer tailored premiums
Cox Proportional-Hazards Model
- Marz
- Gender
- Age Group
- Payment Type
- Eligibility Type
- Encounter Purpose
- Phc Diagnose Group
- Encounter Outcome Treatment Type
coef exp(coef) robust se z Pr(>|z|)
Marz_ID_Aragatsotn (reference level)
Marz_ID_Ararat -0.122 0.885 0.042 -2.921 0.0035 **
Marz_ID_Armavir 0.094 1.098 0.036 2.607 0.0091 **
Marz_ID_Gegharqunik -0.051 0.951 0.035 -1.431 0.1525
Marz_ID_Kotayk 0.063 1.065 0.041 1.545 0.1224
Marz_ID_Lori -0.059 0.943 0.037 -1.594 0.1109
Marz_ID_Shirak 0.076 1.079 0.035 2.161 0.0307 *
Marz_ID_Syunik -0.199 0.819 0.268 -0.743 0.4575
Marz_ID_Tavush 0.049 1.051 0.049 1.009 0.3131
Marz_ID_Vayots_Dzor -0.035 0.966 1.094 -0.032 0.9746
Marz_ID_Yerevan -0.086 0.918 0.041 -2.087 0.0369 *
Gender_ID_female (reference level)
Gender_ID_male -0.042 0.959 0.017 -2.547 0.0109 *
Age_ID_0-5 (reference level)
Age_ID_5-10 -0.187 0.829 0.063 -2.974 0.0029 **
Age_ID_10-18 -0.310 0.734 0.145 -2.140 0.0324 *
Age_ID_18-30 -0.133 0.875 0.142 -0.942 0.3460
Age_ID_30-60 -0.049 0.952 0.122 -0.406 0.6850
Age_ID_60+ 0.083 1.086 0.122 0.680 0.4965
Payment_Type_ID_paid (reference level)
Payment_Type_ID_state_ordered 0.153 1.166 0.166 0.923 0.3558
Eligibility_ID_ArmedForces (reference level)
Eligibility_ID_children_vulnerable 0.240 1.271 0.244 0.984 0.3251
Eligibility_ID_disabled_people 0.379 1.461 0.212 1.785 0.0743 .
Eligibility_ID_elderly_people 0.277 1.320 0.218 1.270 0.2040
Eligibility_ID_family_vulnerable 0.432 1.540 0.230 1.880 0.0601 .
Eligibility_ID_other 0.018 1.019 0.221 0.083 0.9337
Eligibility_ID_poverty_beneficiary -0.084 0.919 0.296 -0.284 0.7765
Eligibility_ID_pregnancy -0.154 0.857 0.360 -0.428 0.6688
Eligibility_ID_social_package_beneficiary -0.350 0.704 0.231 -1.520 0.1285
Eligibility_ID_young_men -1.052 0.349 0.502 -2.094 0.0362 *
Encounter_Purpose_ID_disease (reference level)
Encounter_Purpose_ID_control 0.120 1.127 0.027 4.487 0.0000 ***
Encounter_Purpose_ID_administrative 0.195 1.215 0.192 1.014 0.3108
Encounter_Purpose_ID_preventive -0.022 0.978 0.054 -0.404 0.6861
Encounter_Purpose_ID_reproductive 0.459 1.583 0.278 1.654 0.0982 .
Encounter_Purpose_ID_other -1.332 0.264 0.050 -26.386 0.0000 ***
Phc_Diagnose_ID_A (reference level)
Phc_Diagnose_ID_I -0.107 0.899 0.024 -4.430 0.0000 ***
Phc_Diagnose_ID_J -0.028 0.973 0.039 -0.715 0.4744
Phc_Diagnose_ID_K 0.145 1.156 0.143 1.011 0.3119
Encounter_Outcome_Treatment_ID_chronic_condition (reference level)
Encounter_Outcome_Treatment_ID_death -0.142 0.868 0.441 -0.322 0.7475
Encounter_Outcome_Treatment_ID_improvement 0.091 1.096 0.059 1.546 0.1222
Encounter_Outcome_Treatment_ID_recovery 0.049 1.050 0.057 0.855 0.3927
Encounter_Outcome_Treatment_ID_stabilisation 0.091 1.095 0.042 2.164 0.0305 *
Encounter_Outcome_Treatment_ID_treatment_stop 0.420 1.522 0.159 2.651 0.0080 **
Encounter_Outcome_Treatment_ID_unchanged 0.045 1.046 0.043 1.055 0.2915
Encounter_Outcome_Treatment_ID_worsening 0.132 1.141 0.084 1.567 0.1170
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1
Concordance= 0.537 (se = 0.003)
Rsquare= 0.038 (max possible= 1)
Likelihood ratio test= 660 on 41 df, p=0
Wald test = 7450 on 41 df, p=0
Score (logrank) test = 607.2 on 41 df, p=0, Robust = 472.8 p=0
n= 17067, number of events= 15504 (90933 observations deleted due to missingness)
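As a reading aid for the output above, the sketch below shows how the columns of one row relate to each other: exp(coef) is the hazard ratio against the reference level, z is coef divided by the robust standard error, and Pr(>|z|) is the two-sided normal p-value. The function name `cox_row_summary` is hypothetical, and because the inputs are the rounded values printed in the table, the results only approximately match the printed z and p.

```python
import math


def cox_row_summary(coef, robust_se):
    """Return (hazard_ratio, z, two_sided_p) for one coefficient row."""
    hazard_ratio = math.exp(coef)      # below 1: lower readmission hazard than the reference
    z = coef / robust_se               # Wald statistic
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return hazard_ratio, z, p_value


# Rounded values from the Marz_ID_Ararat row.
hr, z, p = cox_row_summary(-0.122, 0.042)
```

A hazard ratio of about 0.885 reads as roughly an 11.5% lower readmission hazard in Ararat than in the reference marz, holding the other covariates fixed.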
$$S_t = \left(1 - \frac{\text{n.event}_t}{\text{n.risk}_t - \text{n.censored}_t}\right) S_{t-1}, \qquad S_0 = 1 - \frac{\text{n.event}_0}{\text{n.risk}_0}$$

$S_t$: proportion of patients who didn't return to hospital in $t$ days after the first visit
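The recursion for S_t can be sketched as follows. This is an illustration under assumptions, not the code used for the slides: the three inputs are lists indexed by time step (hypothetical argument names borrowed from the formula), the first step uses the S_0 expression, and every later step multiplies the previous value by the step-wise survival fraction.

```python
def survival_curve(n_event, n_risk, n_censored):
    """Return [S_0, S_1, ...] from per-step counts.

    n_event[t]    : patients who returned at step t
    n_risk[t]     : patients still under observation at step t
    n_censored[t] : patients censored at step t
    """
    if not n_event:
        return []
    # S_0 = 1 - n.event_0 / n.risk_0
    curve = [1.0 - n_event[0] / n_risk[0]]
    for e, r, c in zip(n_event[1:], n_risk[1:], n_censored[1:]):
        # S_t = (1 - n.event_t / (n.risk_t - n.censored_t)) * S_{t-1}
        curve.append((1.0 - e / (r - c)) * curve[-1])
    return curve


# Toy counts, not taken from the data set.
curve = survival_curve(n_event=[10, 5], n_risk=[100, 90], n_censored=[0, 10])
```

Reading the curve at t = 30, 180 and 365 days gives the kind of values reported in the tables by marz, gender, age and so on.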
Findings
• Marzes: Ararat, Armavir, Shirak and Yerevan have significant coefficients. Based on the signs, Armavir and Shirak have a higher probability of readmission than Aragatsotn (the reference marz), while Ararat and Yerevan have a lower one.
• Gender: the significant negative coefficient of Gender_ID_male indicates that females visit the hospital more frequently than males.
• Age: people aged 5-18 have a lower readmission probability than children aged 0-5, as shown by the significant negative coefficients of Age_ID_5-10 and Age_ID_10-18.
• Eligibility: the only significant eligibility coefficient is that of young men; its negative sign means that the armed forces group (the reference) has a higher readmission probability than young men.
• Purpose: hospital visits for control have a higher readmission probability than visits for disease, which can be explained by the fact that control implies regular attendance.
• Diagnosis: certain infectious and parasitic diseases (A) have a higher readmission probability than some heart diseases (I).
Findings
Marz           30 days   180 days   365 days
Aragatsotn     0.522     0.106      0.046
Ararat         0.521     0.106      0.045
Armavir        0.504     0.103      0.041
Gegharqunik    0.533     0.103      0.042
Kotayk         0.466     0.095      0.039
Lori           0.554     0.132      0.063
Shirak         0.429     0.068      0.022
Syunik         0.485     0.138      0.069
Tavush         0.558     0.079      0.034
Vayots Dzor    0.635     0.163      0.070
Yerevan        0.486     0.119      0.047
Shirak and Tavush have the lowest S_t at 180 and 365 days (S_t is the proportion of patients who didn't return to hospital within t days after the first visit).
Vayots Dzor, Syunik, and Lori have the highest S_t at 180 and 365 days.
Gender
Gender    30 days   180 days   365 days
female    0.491     0.102      0.042
male      0.493     0.097      0.038
The survival estimates for females and males are nearly identical at every horizon, so gender and hospital visit intensity appear to be unrelated.
Age
Age group   30 days   180 days   365 days
0-5         0.469     0.066      0.021
5-10        0.571     0.201      0.086
10-18       0.595     0.272      0.126
18-30       0.562     0.206      0.100
30-60       0.512     0.113      0.048
60+         0.455     0.056      0.019
“0-5” and “60+” are the most risky age groups (lowest S_t).
Patients aged “10-18” have the lowest probability of visiting the hospital more than once a year.
Payment type
Payment type    30 days   180 days   365 days
paid            0.513     0.138      0.032
state ordered   0.492     0.099      0.041
Patients with the state ordered payment type have a lower survival probability at 30 and 180 days, meaning their visits are more frequent over those horizons; at 365 days the ordering reverses.
Eligibility
Eligibility type             30 days   180 days   365 days
children vulnerable          0.479     0.069      0.021
disabled people              0.470     0.031      0.006
elderly people               0.466     0.041      0.019
pregnancy                    0.514     0.105      0.045
social package beneficiary   0.593     0.341      0.249
The most frequent visitors are disabled and elderly people, in contrast with social package beneficiaries, who have the highest survival rate.
Encounter Purpose
Purpose          30 days   180 days   365 days
disease          0.508     0.109      0.029
control          0.478     0.045      0.012
administrative   0.543     0.183      0.079
preventive       0.499     0.156      0.079
reproductive     0.532     0.126      0.041
other            0.418     0.044      0.009
Visits for administrative purposes are mostly occasional, so they have a higher survival rate, whereas control visits have one of the lowest survival rates (only the “other” category is lower), meaning patients return most often.
Diagnosis
Diagnosis group   30 days   180 days   365 days
A                 0.487     0.072      0.022
I                 0.503     0.069      0.016
J                 0.518     0.101      0.026
K                 0.514     0.098      0.025
Certain infectious diseases (A) and heart diseases (I) have a lower survival probability than upper respiratory infections (J) and sclerosis and other mental disorders (K).
Encounter Outcome Treatment
Outcome             30 days   180 days   365 days
chronic condition   0.497     0.051      0.017
improvement         0.525     0.102      0.024
recovery            0.525     0.109      0.030
stabilisation       0.475     0.055      0.016
treatment stop      0.458     0.122      0.087
unchanged           0.501     0.088      0.030
worsening           0.407     0.054      0.012
Worsening, chronic condition and stabilisation have the lowest survival rates at 180 and 365 days, since they require more attendances, whereas improvement, recovery and treatment stop have the highest survival rates, since they require fewer visits.
Predicting single readmission (machine learning case)
• The data is transformed so that each row is a person
• An indicator variable shows whether the patient was “readmitted”, i.e. has more than 2 records in the database
• The goal of the modeling is to predict the no-readmission rate based on 7 variables (age, gender, payment, treatment, etc.)
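The transformation to one row per person with a readmission indicator could look like the sketch below. It assumes a list of patient identifiers, one per encounter, and exposes the record-count threshold as a parameter: the default of 3 mirrors the "more than 2 records" wording above, and both the function name and the parameter are hypothetical.

```python
from collections import Counter


def readmission_flags(patient_ids, min_records=3):
    """Map each patient to True when they have at least min_records encounters.

    min_records=3 corresponds to "more than 2 records"; adjust it if a
    different definition of readmission is intended.
    """
    counts = Counter(patient_ids)  # number of encounters per patient
    return {pid: n >= min_records for pid, n in counts.items()}


# Toy identifiers, one entry per encounter.
flags = readmission_flags(["a", "a", "a", "b", "c", "c"])
```

Each key of the result is one person, so the dictionary can be joined with the per-person covariates to form the modeling table.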
The Business goal
• Low readmission is a sign of good patient care
• Low readmission means low insurance and healthcare costs
Tested models
• Random Forests
• Naïve Bayes
• Decision trees
• Gradient boosting
AUC ≈ 0.7035657
ROC curve of the blender model for the testing set.
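For readers unfamiliar with the metric, here is a small sketch of how an AUC value is obtained from test-set scores. The slides do not say how the blender combines the tested models, so the unweighted average in `blend` is purely an assumption for illustration; `auc` computes the probability that a randomly chosen positive case is scored above a randomly chosen negative one, counting ties as one half.

```python
def blend(prediction_sets):
    """Average the per-row scores of several models (assumed blending rule)."""
    return [sum(scores) / len(scores) for scores in zip(*prediction_sets)]


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores, not taken from the project data.
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
blended = blend([[0.2, 0.8], [0.4, 0.6]])
```

The pairwise loop is quadratic in the number of rows, which is fine for a demonstration but too slow for a data set of this size; a rank-based formula gives the same value more efficiently.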
What next?
• What other data can we obtain?
  • From the Ministry of Health?
  • From clinics?
  • From insurance companies?
• How can we make the data cleaner and more reliable?
• What are the real needs of the stakeholders?
Datamotus LLC