
Understanding Behavior of Clinical Models under Domain Shifts

Jayaraman J. Thiagarajan
Lawrence Livermore National Labs
[email protected]

Deepta Rajan
IBM Research Almaden
[email protected]

Prasanna Sattigeri
MIT-IBM Watson AI Lab
[email protected]

ABSTRACT
The hypothesis that computational models can be reliable enough to be adopted in prognosis and patient care is revolutionizing healthcare. Deep learning, in particular, has been a game changer in building predictive models, thus leading to community-wide data curation efforts. However, due to inherent variabilities in population characteristics and biological systems, these models are often biased to the training datasets. This can be limiting when models are deployed in new environments, where there are systematic domain shifts not known a priori. In this paper, we propose to emulate a large class of domain shifts that can occur in clinical settings with a given dataset, and argue that evaluating the behavior of predictive models in light of those shifts is an effective way to quantify their reliability. More specifically, we develop an approach for building realistic scenarios, based on analysis of disease landscapes in multi-label classification. Using the openly available MIMIC-III EHR dataset for phenotyping, for the first time, our work sheds light on data regimes where deep clinical models can fail to generalize. This work emphasizes the need for novel validation mechanisms driven by real-world domain shifts in AI for healthcare.

CCS CONCEPTS
• Computing methodologies → Neural networks; • Applied computing → Health informatics.

KEYWORDS
EHR diagnosis, unsupervised learning, label shifts, clinical models

ACM Reference Format:
Jayaraman J. Thiagarajan, Deepta Rajan, and Prasanna Sattigeri. 2019. Understanding Behavior of Clinical Models under Domain Shifts. In Proceedings of ACM Conference (Conference'19). ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
The role of automation in healthcare and medicine is steered by both the ever-growing need to leverage knowledge from large-scale, heterogeneous information systems, and the hypothesis that computational models can actually be reliable enough to be adopted in diagnostics. Deep learning, in particular, has been a game changer in this context, given recent successes with both electronic health


records (EHR) and raw measurements (e.g., ECG, EEG) [4], [3]. While designing computational models that can account for complex biological phenomena and all interactions during disease evolution is not yet possible, building predictive models is a significant step towards improving patient care. There is a fundamental trade-off in predictive modeling with healthcare data. While the complexity of disease conditions naturally demands expanding the number of predictor variables, this often makes it extremely challenging to identify reliable patterns in the data, thus rendering the models heavily biased. In other words, despite the availability of large-sized datasets, we are still operating in the small-data regime, wherein the curated data is not fully representative of what the model might encounter when deployed.

In this paper, we consider a variety of population biases, label distribution shifts, and measurement discrepancies that can occur in clinical settings, and argue that evaluating the behavior of predictive models in light of those shifts is crucial to understanding their strengths and weaknesses. To this end, we consider the problem of phenotyping using EHR data, characterize different forms of discrepancies that can occur between train and test environments, and study how the prediction capability varies across these discrepancies. Our study is carried out using the openly available MIMIC-III EHR dataset [2] and the state-of-the-art deep ResNet model [1]. The proposed analysis provides interesting insights about which discrepancies are challenging to handle, which in turn can provide guidelines for deploying data-driven models.

2 DATASET DESCRIPTION
All experiments are carried out using the MIMIC-III dataset [2], the largest publicly available database of de-identified EHR from ICU patients. It includes a variety of data types such as diagnostic codes, survival rates, and length of stay, yielding a total of 76 measurements. For our study, we focus on the task of acute care phenotyping, which involves retrospectively predicting the likely disease conditions for each patient given their ICU measurements. Preparation of such large heterogeneous records involves organizing each patient's data into episodes containing both time-series events as well as episode-level outcomes like diagnosis and mortality. Phenotyping is typically a multi-label classification problem, where each patient can be associated with several disease conditions. The dataset contains 25 disease categories: 12 are critical, such as respiratory/renal failure; 8 are chronic conditions, such as diabetes and atherosclerosis; and the remaining 5 are 'mixed' conditions, such as liver infections. Note that, in our study, we reformulate the phenotyping problem as a binary classification task of detecting the presence or absence of a selected subset of diseases.

3 ANALYSIS OF DEEP CLINICAL MODELS
Clinical time-series modeling broadly focuses on problems such as anomaly detection, tracking the progression of specific diseases, or detecting groups of diseases.


Figure 1: Constructing disease landscapes: using information decomposition on the MIMIC-III EHR dataset reveals the grouping of disease outcomes.

Deep learning provides a data-driven approach to exploiting relationships between a large number of input measurements while producing robust diagnostic predictions. In particular, Recurrent Neural Networks (RNNs) were among the earlier solutions for dealing with sequential data in tasks such as discovering phenotypes and detecting cardiac diseases [3]. The work in [4] extensively benchmarked the effectiveness of deep learning techniques in analyzing EHR data. More recently, convolutional networks [1] and attention models [5] have become highly effective alternatives to sequential models. In particular, deep Residual Networks with 1-D convolutions (ResNet-1D) have produced state-of-the-art performance in many clinical tasks [1]. Hence, in this work, all studies are carried out using this architecture.

Architecture: We use a ResNet-1D model with 7 residual blocks which transform the raw time-series input into feature representations. Each residual block is made up of two 1-D convolution and batch normalization layers with a kernel size of 3 and 64 or 256 filters, a dropout layer, and a leaky-ReLU activation layer. Further, a 1-D max pooling layer is used after the third and seventh residual blocks to aggregate the features. Additional parameters include: an Adam optimizer with a learning rate of 0.001, and a batch size of 128 with an equal balance of positive and negative samples.
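To make the architecture description concrete, here is a minimal PyTorch sketch of a ResNet-1D backbone consistent with the above. The 1x1 skip projection, dropout rate, the exact 64/256 channel schedule, and the pooling/classification head are assumptions not fully specified in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock1D(nn.Module):
    """Two 1-D conv + batch-norm layers (kernel size 3) with dropout and leaky-ReLU."""
    def __init__(self, in_ch, out_ch, p_drop=0.2):  # p_drop is an assumption
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_ch),
            nn.LeakyReLU(),
            nn.Dropout(p_drop),
            nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_ch),
        )
        # 1x1 projection when channel counts differ (an assumption)
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        return F.leaky_relu(self.body(x) + self.skip(x))

blocks, ch = [], 76  # 76 input measurement channels (Section 2)
for i in range(7):
    out_ch = 64 if i < 3 else 256  # which blocks use 64 vs. 256 filters is an assumption
    blocks.append(ResBlock1D(ch, out_ch))
    ch = out_ch
    if i in (2, 6):  # 1-D max pooling after the third and seventh residual blocks
        blocks.append(nn.MaxPool1d(2))
model = nn.Sequential(*blocks, nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # batch size 128 during training
```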

Proposed Work: An important challenge with the design of deep clinical models is the lack of a holistic understanding of their behavior under the wide range of discrepancies that can occur between train and test environments. In particular, we consider the following domain shifts: (i) population biases such as age, gender, and race; (ii) label distribution shifts such as novel disease conditions at test time and the presence of combinations of observed diseases; and (iii) measurement discrepancies such as noisy labels, missing measurements, and variations in sampling rate. Understanding how much these discrepancies impact a model's generalization performance is critical to qualitatively understanding its strengths and expected regimes of failure, and thus provides guidelines for effective deployment. Conceptually, this amounts to quantifying the prediction uncertainties arising from the lack of representative data and the inherent randomness in clinical data.

4 CHARACTERIZING DOMAIN SHIFTS
In this section, we describe our approach for emulating a variety of domain shifts that commonly occur in a clinical setting, which will then be used for evaluating the deep ResNet-1D model in the next section. For each of the cases, we construct a source and a target dataset, where the data samples are chosen based on a discrepancy-specific constraint (e.g., gender distribution, with males used in source and females in target). The positive class corresponds to the presence of a selected subset of diseases (e.g., organ-specific conditions), while the negative class indicates their absence in the patients. Note, we ensured that the resulting datasets are balanced, with a positive-negative split of ∼60%-40%. Before we describe the different discrepancies considered in our study, we briefly discuss the information decomposition technique [6] that will be used to explore the landscape of diseases in order to pick a subset of related conditions for defining the positive class.

Exploring Disease Landscapes: We utilize the information sieve [6], a recent information-theoretic technique for identifying latent factors in data that maximally describe the total correlation between variables. Denoting a set of multivariate random variables by $\tilde{X} = \{\tilde{X}_i\}_{i=1}^{d}$, dependencies between the variables can be quantified using the multivariate mutual information, also known as total correlation (TC), which is defined as

$$\mathrm{TC}(\tilde{X}) = \sum_{i=1}^{d} H(\tilde{X}_i) - H(\tilde{X}), \qquad (1)$$

where $H(\tilde{X}_i)$ denotes the marginal entropy. This quantity is non-negative and zero only when all the $\tilde{X}_i$'s are independent. Further, denoting the latent source of dependence in $\tilde{X}$ by $\tilde{Z}$, we can define the conditional $\mathrm{TC}(\tilde{X}|\tilde{Z})$, i.e., the residual total correlation after observing $\tilde{Z}$. Hence, we solve the problem of searching for the $\tilde{Z}$ that minimizes $\mathrm{TC}(\tilde{X}|\tilde{Z})$. Equivalently, we can define the reduction in TC after conditioning on $\tilde{Z}$ as $\mathrm{TC}(\tilde{X};\tilde{Z}) = \mathrm{TC}(\tilde{X}) - \mathrm{TC}(\tilde{X}|\tilde{Z})$.
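For intuition, Eq. (1) can be estimated directly for discrete (e.g., binary diagnosis) variables by plugging empirical marginal and joint entropies into the definition. The brute-force sketch below is only illustrative and is not the sieve's optimization procedure:

```python
import numpy as np
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) from a Counter of observed outcomes."""
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p))

def total_correlation(Y):
    """Empirical TC of Eq. (1): sum of marginal entropies minus the joint entropy."""
    n, d = Y.shape
    marginal = sum(entropy(Counter(Y[:, i])) for i in range(d))
    joint = entropy(Counter(map(tuple, Y)))
    return marginal - joint

# Toy example: the first two label columns are strongly dependent, the third is independent.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, 1000)
Y = np.stack([a, a ^ (rng.random(1000) < 0.1), rng.integers(0, 2, 1000)], axis=1)
print(total_correlation(Y))  # positive TC, driven by the dependent first two columns
```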


The optimization begins with $\tilde{X}$ and constructs $\tilde{Z}_0$ to maximize $\mathrm{TC}(\tilde{X};\tilde{Z}_0)$. Subsequently, it computes the remainder information not explained by $\tilde{Z}_0$, and learns another factor, $\tilde{Z}_1$, that infers additional dependence structure. This procedure is repeated $k$ times until the entirety of the multivariate mutual information in $\tilde{X}$ is explained.

Denoting a labeled dataset by the tuple $(X, Y)$, where $X \in \mathbb{R}^{N \times D}$ is the input data with $D$ variables and $Y \in \mathbb{R}^{N \times d}$ is the label matrix with $d$ different categories, we use the information sieve to analyze the outcome variables. In other words, we identify latent factors that describe the total correlation in $Y$, and the resulting hierarchical decomposition is referred to as the disease landscape. The force-based layout in Figure 1 provides a holistic view of the landscape for the entire MIMIC-III dataset. Here, each circle corresponds to a latent factor, its size indicates the amount of total correlation explained by that factor, and the thickness of the edges reveals the contribution of each outcome variable to that factor.
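The full sieve optimization is beyond a few lines, but a simplified stand-in (not the method of [6]) conveys the idea of a disease landscape: cluster the outcome variables by their pairwise mutual information so that strongly co-occurring diagnoses group together.

```python
import numpy as np
from sklearn.metrics import mutual_info_score
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def label_clusters(Y, n_clusters=3):
    """Group the d outcome variables (columns of Y) by pairwise mutual information."""
    d = Y.shape[1]
    M = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            M[i, j] = M[j, i] = mutual_info_score(Y[:, i], Y[:, j])
    # Convert similarity to distance: high MI -> small distance.
    dist = squareform(M.max() - M, checks=False)
    return fcluster(linkage(dist, method="average"), n_clusters, criterion="maxclust")
```

Applied to the MIMIC-III label matrix, such a grouping plays the role of the clusters in Figure 1, although the exact composition depends on the decomposition used.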

4.1 Population Biases
Clinical datasets can be imbalanced with respect to population characteristics.
(i) Age: (a) Older-to-Younger: The source comprises patients who are 60 years and above, and the target includes patients who are below 60 years of age (see the sketch after this list); (b) Younger-to-Older: This represents the scenario with the source containing patients younger than 60. Once the population is divided based on the chosen age criterion, we pick a cluster of diseases in the source's disease landscape as the positive class. However, due to the domain shift, the landscape can change significantly in the target. Hence, in the target, we recompute the landscape and consider the source conditions, as well as diseases strongly correlated to any of those conditions, as positive.
(ii) Gender: Similar to the previous case, we emulate two scenarios: Male-to-Female, i.e., the source data consists only of patients who identify as male and the target has only patients who are female, and Female-to-Male, comprised of female-only source data and male-only target data.
(iii) Race: We emulate White-to-Minority, where the source contains prevalent racial groups such as white American, Russian, and European, while the target domain comprises patients belonging to minority racial groups such as Hispanic, South American, African, Asian, Portuguese, and others marked as unknown.
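A minimal sketch of carving out such a source-target pair from a patient-level table; the dataframe layout and the 'age' column name are illustrative assumptions.

```python
import pandas as pd

def split_by_age(df: pd.DataFrame, cutoff: int = 60, older_to_younger: bool = True):
    """Emulate the age-based population bias: train on one age group, test on the other."""
    older = df[df["age"] >= cutoff]
    younger = df[df["age"] < cutoff]
    return (older, younger) if older_to_younger else (younger, older)

# Usage (hypothetical dataframe): source, target = split_by_age(episodes, cutoff=60)
```

The gender and race splits follow the same pattern with the corresponding demographic column.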

4.2 Label Distribution Shifts
When models are trained to detect the presence of certain diseases, they can be ineffective when novel disease conditions or variants of existing conditions, previously unseen by the model, appear.
(i) Novel Diseases at Test Time: (a) Resp-to-CardiacRenal: In the source, we detect the presence or absence of diseases from cluster 1 (Figure 1), while the target requires detection of the presence of at least one of the conditions from cluster 1 or the unobserved conditions in cluster 0; (b) Cerebro-to-CardiacRenal: This is formulated as detecting diseases from cluster 2 in the source dataset, while expanding the disease set in the target by including cluster 0.
(ii) Dual-to-Single: As a common practice, lab tests for two diseases could be conducted together based on the likelihood of them co-occurring, giving rise to a dataset containing patients with both diseases. However, one would expect to detect the presence of even one of those diseases. Here, the source includes patients associated with two disease conditions simultaneously, while the target includes patients who present only one of the two. Note, we do not construct the landscapes for this case; instead, we divide patients into four groups, namely: those that have only cardiac diseases, only renal diseases, neither cardiac nor renal diseases, and both cardiac and renal diseases, and build the source-target pair (see the sketch below).
(iii) Single-to-Dual: Similar to the previous case, an alternate situation could occur with models having to adapt to predicting for patients that have two diseases, while having been trained on patients who were diagnosed with only one of them. Such a scenario is explored using a source dataset comprising patients diagnosed as having either cardiac-only or renal-only disease, and a target dataset with patients that have both diseases.
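The four patient groups described in (ii) can be formed with boolean logic over per-patient disease flags; the 'cardiac'/'renal' column names below are illustrative.

```python
def dual_single_groups(df):
    """Form the groups behind the Dual-to-Single and Single-to-Dual pairs."""
    both = df[df["cardiac"] & df["renal"]]       # both diseases (Dual-to-Single source)
    only_one = df[df["cardiac"] ^ df["renal"]]   # exactly one disease (Dual-to-Single target)
    neither = df[~df["cardiac"] & ~df["renal"]]  # neither disease (negative-class pool)
    return both, only_one, neither
```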

4.3 Measurement Discrepancies
(i) Noisy Labels: Variability in diagnoses between different experts is common in clinical settings [7]. Consequently, it is important to understand the impact of such variability on the behavior of the model when adopted in a new environment. We extend the Resp-to-CardiacRenal case by adding uncertainties to the diagnostic labels in the source data and study the impact on the target. In particular, we emulate two such scenarios by randomly flipping 10% and 20% of the labels, respectively.
(ii) Sampling Rate Change: Another typical issue with time-series data is variability in sampling rates. We create a source-target pair based on the Resp-to-CardiacRenal scenario with measurements collected at different sampling rates: the source is sampled every 96 hours while the target is sampled every 48 hours.
(iii) Missing Measurements: Clinical time series are plagued by missing measurements during deployment, often due to resource or time limitations. We emulate this discrepancy based on the Dual-to-Single scenario, wherein we assume that a set of measurements, namely pH, Temperature, Height, Weight, and all Verbal Response GCS parameters, is not available in the target dataset. We impute the missing measurements with zero, which essentially reduces the dimensionality of valid measurements from 76 to 41 in the target (see the sketch below).
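The label-flip and missing-measurement discrepancies can be emulated in a few lines; the 10%/20% fractions and zero imputation follow the text, while the array layout is an assumption.

```python
import numpy as np

def flip_labels(y, frac=0.1, seed=0):
    """Emulate noisy source labels by flipping a random fraction of binary labels."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    y[idx] = 1 - y[idx]
    return y

def zero_impute(X, missing_channels):
    """Emulate missing measurements by zeroing out selected channels in the target."""
    X = X.copy()
    X[:, missing_channels, :] = 0.0  # assumes a (batch, channels, time) layout
    return X
```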

5 EMPIRICAL RESULTS
In this section, we present empirical results from our study: the performance of the ResNet-1D model trained on the source data when subjected to a variety of discrepancies. Following common practice, we consider the weighted average AUPRC metric. Figure 2 illustrates the performance of the ResNet-1D model for the three categories of discrepancies. We report the weighted AUPRC score for each of the cases by training the model on the source and testing on the target. Broadly, these results characterize the amount of uncertainty faced by even a sophisticated ML model when commonly occurring discrepancies are present.
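For reference, the weighted average AUPRC can be computed with scikit-learn; the per-label scores are averaged with weights proportional to label support.

```python
from sklearn.metrics import average_precision_score

def weighted_auprc(y_true, y_score):
    """y_true: (n_samples, n_labels) binary indicators; y_score: predicted probabilities."""
    return average_precision_score(y_true, y_score, average="weighted")
```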

The first striking observation is that population bias leads to significant performance variability, wherein the AUPRC score varies in a wide range between 0.65 and 0.9. Note that, by identifying disparate sets of disease conditions, the reported results correspond to the worst-case performance for each scenario. In particular, we observe that the racial bias White-to-Minority, the gender bias Male-to-Female, and the age bias Older-to-Younger demonstrate challenges in generalization. This sheds light on the fact that the manifestation of different disease conditions in the younger population shows higher variability compared to older patient groups, and hence a model overfit to the older case does not generalize to the former.


[Figure 2: three bar charts of target-domain weighted AUPRC. (a) Population Biases: Older-to-Younger, Younger-to-Older, White-to-Minority, Female-to-Male, Male-to-Female (AUPRC range 0.5-0.9). (b) Label Distribution Shifts: Resp-to-CardiacRenal, Cerebro-to-CardiacRenal, Dual-to-Single, Single-to-Dual (AUPRC range 0.70-0.85). (c) Measurement Discrepancies: 10% Label Flips, 20% Label Flips, Sampling Rate Change, Missing-Measurements (AUPRC range 0.750-0.800).]

Figure 2: Results from our empirical study on generalizability of the ResNet-1D model with complex domain shifts.

Table 1: Disease conditions considered in source-target pairs for scenarios where the model demonstrated poor generalization.

Shift | Source Diseases | Target Diseases
Older-to-Younger | Coronary atherosclerosis, Disorders of lipid metabolism, Essential hypertension | Coronary atherosclerosis, Disorders of lipid metabolism, Essential hypertension, Chronic kidney disease, Secondary hypertension
Male-to-Female | Dysrhythmia, Congestive heart failure | Dysrhythmia, Congestive heart failure, Coronary atherosclerosis
White-to-Minority | Dysrhythmia, Conduction disorder, Congestive heart failure | Dysrhythmia, Conduction disorder, Congestive heart failure, Chronic kidney disease, Secondary hypertension, Diabetes
Single-to-Dual | Cardiac-only or Renal-only | Both cardiac and renal diseases

Table 1 lists the set of source and target diseases for the cases with poor generalization. In the Older-to-Younger scenario, it is observed that the disease landscape of older patients reveals strong co-occurrence between Coronary atherosclerosis, Disorders of lipid metabolism, and Essential hypertension. In contrast, when one of these diseases occurs in younger patients, it is often accompanied by other conditions such as Secondary hypertension and Chronic kidney disease. This systematic shift challenges pre-trained models to generalize well to the target. Similarly, in the case of White-to-Minority, while Congestive heart failure commonly manifests with Dysrhythmia in a predominantly white population, additional conditions such as Secondary hypertension and Diabetes co-occur among minorities.

In the case of label distribution shifts, when there is no population bias, the model is surprisingly able to generalize well even when unseen disease conditions are present in the target. This is evident from the high AUPRC scores for both Resp-to-CardiacRenal and Cerebro-to-CardiacRenal. However, the EHR signatures of patients who present subsets or supersets of the diseases observed in the source are challenging to handle; in particular, detecting the presence of both cardiac and renal conditions using source data from patients who had only one of the diseases, as in Single-to-Dual, proved difficult. This clearly shows the inherent uncertainties in biological systems (often referred to as aleatoric uncertainties) that cannot be arbitrarily reduced by building sophisticated ML models.

Finally, with respect to measurement discrepancies, it is widely believed that the quality of labels in the source data is highly critical. Surprisingly, however, we observe that with limited noise in the labels (10% Label Flips), the performance degradation is minimal; as the amount of noise increases (20% Label Flips), there is further degradation. Another important observation is that a sampling rate change has a negative effect on generalization performance, and simple imputation does not suffice (we adopt a strategy where the value from the last time-step is repeated). In comparison, the model is fairly robust to missing measurements.

REFERENCES
[1] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. 2018. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv:1803.01271 (2018).
[2] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data 3 (2016).
[3] Deepta Rajan and Jayaraman J. Thiagarajan. 2018. A Generative Modeling Approach to Limited Channel ECG Classification. arXiv:1802.06458 (2018).
[4] Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M. Dai, Nissan Hajaj, Michaela Hardt, Peter J. Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, et al. 2018. Scalable and accurate deep learning with electronic health records. npj Digital Medicine 1, 1 (2018), 18.
[5] Huan Song, Deepta Rajan, Jayaraman J. Thiagarajan, and Andreas Spanias. 2018. Attend and Diagnose: Clinical Time Series Analysis using Attention Models. In Proceedings of AAAI 2018.
[6] Greg Ver Steeg and Aram Galstyan. 2016. The information sieve. In Proceedings of the 33rd International Conference on Machine Learning (ICML), Volume 48. JMLR.org, 164-172.
[7] Hamed Valizadegan, Quang Nguyen, and Milos Hauskrecht. 2013. Learning classification models from multiple experts. Journal of Biomedical Informatics 46, 6 (2013), 1125-1135.