
Paper SP07

SAS® Markov Chain Monte Carlo (MCMC) Simulation in Practice

Scott D Patterson, GlaxoSmithKline, King of Prussia, PA
Shi-Tao Yeh, GlaxoSmithKline, King of Prussia, PA

ABSTRACT

Markov Chain Monte Carlo (MCMC) is a random sampling method that performs Monte Carlo integration using Markov chains. MCMC has gained popularity in many applications owing to advances in computational algorithms and computing power. The SAS® MI procedure provides an MCMC method for filling in arbitrary missing data and for simulating random samples based on complete data information. Extensions of this procedure are currently available in experimental form to perform Bayesian statistical analysis. The purpose of this paper is to use a simulated hypothetical clinical trial efficacy data set and the Challenger O-ring failure data as input to perform MCMC missing data imputation, model parameter simulation, and model diagnostics, and to use SAS to perform a Bayesian analysis of data commonly encountered in clinical trials. The SAS V9 products used in this paper are SAS BASE®, SAS/STAT®, and SAS/GRAPH® on a PC Windows® platform.

INTRODUCTION

Monte Carlo methods are sampling techniques that draw pseudo-random samples from specified probability distributions. In other words, they are numerical methods that use sequences of random numbers to perform statistical simulations. A Monte Carlo algorithm involves the following components:

1) probability distribution functions (pdf’s) – the target distribution must be specified by a set of pdf’s,
2) random number generator – a source of random numbers uniformly distributed on the unit interval,
3) sampling rule – a prescription for sampling from the specified pdf’s,
4) scoring – the outcomes must be summarized into overall scores,
5) error estimation – an estimate of the statistical error (variance) as a function of the number of trials,
6) variance reduction techniques – methods for reducing the variance in the estimated solution to reduce the computational time,
7) parallelization and vectorization – algorithms that allow Monte Carlo methods to be implemented efficiently on computers.
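As a rough illustration of components 2 to 5, the following Python sketch (not the paper's SAS code) estimates the mean of U² for U uniform on the unit interval, whose exact value is 1/3. The function name `monte_carlo_mean`, the scoring function, and the number of trials are assumptions chosen for this example only.

```python
import math
import random

def monte_carlo_mean(score, n_trials, seed=54321):
    """Estimate E[score(U)] for U ~ Uniform(0, 1), with its standard error."""
    rng = random.Random(seed)          # component 2: random number generator
    total = 0.0
    total_sq = 0.0
    for _ in range(n_trials):
        u = rng.random()               # component 3: sampling rule (one uniform draw)
        s = score(u)                   # component 4: scoring
        total += s
        total_sq += s * s
    mean = total / n_trials
    # Unbiased sample variance of the scores.
    variance = (total_sq / n_trials - mean * mean) * n_trials / (n_trials - 1)
    std_error = math.sqrt(variance / n_trials)   # component 5: error estimation
    return mean, std_error

# Hypothetical target: E[U^2] = 1/3.
estimate, std_error = monte_carlo_mean(lambda u: u * u, n_trials=100_000)
print(f"estimate = {estimate:.4f}, standard error = {std_error:.4f}")
```

The standard error shrinks in proportion to one over the square root of the number of trials, which is why variance reduction (component 6) can save substantial computing time.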

When the samples are independent, the Law of Large Numbers applies to the simulation outcomes. Drawing independent samples directly from the target distribution may be difficult, however. This difficulty can be addressed by using a Markov chain.

A Markov chain is a sequence of random values in which the probability of each value depends on the value at the previous time point. It turns the sampling scheme into a time-series sequence. The controlling factor in a Markov chain is the transition probability, which is the conditional probability that the system moves to a particular new state given its current state. Because the Markov chain has a time-series format, we can check sample independence by examining the sample autocorrelation. As the number of time steps tends to infinity, the Markov chain converges to its stationary distribution. Assuming a stationary distribution exists, it is unique if the chain is irreducible. Irreducible means that any set of states can be reached from any other state in a finite number of moves.

A Markov chain taking only a finite number of values is aperiodic if the greatest common divisor of the return times to any particular state is 1.

An ergodic theorem holds if we assume that the Markov chain has the following properties:

1) It has a stationary distribution, and
2) It is aperiodic and irreducible.


The ergodic theorem proves that:

1) the central limit theorem holds, and
2) convergence occurs geometrically.
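To make the ideas of a transition probability, a stationary distribution, and geometric convergence concrete, here is a small Python sketch (not taken from the paper). The two-state transition matrix `P` is a made-up example; it is irreducible and aperiodic, and its stationary distribution (5/6, 1/6) can be verified by hand from π = πP.

```python
def step(dist, P):
    """Advance a probability distribution over states by one transition."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical transition matrix: P[i][j] = Pr(next state = j | current state = i).
P = [[0.9, 0.1],
     [0.5, 0.5]]
stationary = [5 / 6, 1 / 6]    # solves pi = pi P for this particular P

dist = [1.0, 0.0]              # start with all mass on state 0
errors = []
for _ in range(50):
    dist = step(dist, P)
    errors.append(abs(dist[0] - stationary[0]))

print("distribution after 50 steps:", dist)
print("error ratio between successive steps:", errors[1] / errors[0])
```

For this chain the distance to the stationary distribution shrinks by the same factor at every step, which is what geometric convergence means in this simple setting.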

The Markov chain Monte Carlo (MCMC) method consists of a class of algorithms for sampling from probability distributions, based on constructing a Markov chain that has the desired distribution as its stationary distribution. It combines the Monte Carlo method, which supplies random sampling, with the Markov chain method, which supplies a sequence that converges to the stationary distribution. MCMC methods have gained popularity in a wide range of fields and are useful in both Bayesian and frequentist statistical inference. MCMC has been applied as a method for exploring posterior distributions in Bayesian inference. In other words, you can simulate the entire joint posterior distribution of the unknown quantities and obtain simulation-based estimates of the posterior parameters of interest.

The SAS/STAT® system provides the MI procedure for performing multiple imputation of missing data. “Missing values are an issue in a substantial number of statistical analysis. Most SAS statistical procedures exclude observations with any missing variable values from the analysis. While analyzing only complete cases has its simplicity, the information contained in the incomplete cases is lost. This approach also ignores possible systematic differences between the complete cases and the incomplete cases, and the resulting inference may be not applicable to the population of all cases, especially with a smaller number of complete cases.” [14]

MCMC imputation is one of the features provided in the MI procedure. You can use the SAS MCMC method to impute arbitrary missing data or to simulate random sample data sets, using the complete input data set as prior information.

Graphical display is an important component of the MCMC process. It provides visual displays of the MCMC output for checking the behavior of the random sampling process, including convergence of the Markov chains and independence of the samples.

The data used in this paper are for illustration purposes only. They are:

1) A simulated hypothetical longitudinal clinical trial data set. This data set contains a continuous response variable and variables of age, sex, treatment, race, baseline value, subject ID, and visit, as covariate variables.

2) A NASA Challenger O-ring failure data set. It actually contains only 2 variables: failure as binary response variable and temperature as continuous covariate variable.

3) Event data from a recent clinical trial.

The SAS V9 products used in this paper are SAS BASE®, SAS/STAT®, and SAS/GRAPH® on a PC Windows® platform.

GETTING STARTED

The MI procedure makes MCMC imputation a simple yet powerful process. The following sample code is an example of running MCMC sampling.

/* Create 1000 imputed data sets with a monotone missing pattern */
proc mi data=eff seed=54321 nimpute=1000 out=outmono;
   mcmc impute=monotone chain=multiple;
   var subjid y1 y2 y3 y4 y5;
run;


The sample code uses only three SAS statements: the PROC MI statement, the MCMC statement, and the VAR statement. The options available in the PROC MI and MCMC statements are listed in the appendices.

AN EXTENSION

MCMC is also useful for Bayesian inference. SAS recently released an experimental version of Bayesian software in SAS/STAT at http://support.sas.com/rnd/app/da/bayesproc.html for users to assess and to suggest improvements to the procedures.

To apply Bayesian statistics, one applies inductive reasoning. One begins with a rough idea of Θ, where Θ represents the parameters of interest. In statistical terms, this rough idea is the prior distribution. One then collects data (designated X in this paper). Traditionally, one then integrates this information using Bayes' rule to derive a refined understanding of Θ|X, the posterior distribution of Θ given X (see, for example, Patterson et al., 1999). Often, however, mathematical derivation of this posterior distribution is not possible. In such circumstances, MCMC may be used to draw samples from the posterior density of Θ|X at any time as data are collected. This type of analysis is very useful for independent data monitoring in clinical trials and for sequential and adaptive designs. We discuss such an analysis in Example III.

EXAMPLE I – LONGITUDINAL CLINICAL EFFICACY DATA

Input Data

A simulated hypothetical clinical efficacy data set is used for this example. The data set contains 252 subjects with 5 treatments. The variables are listed as follows:

Variable Name  Description                                  Valid Value
response       a derived ‘change from baseline’ variable    numeric value
visit          5 levels of clinical visit                   4, 5, 7, 9, 11
trt            5 levels of treatment including a placebo    1=Dose A, 2=Dose B, 3=Dose C, 4=Dose D, 5=Dose E
sex            subject’s gender                             1=male, 2=female
race           5 levels of racial group                     1=white, 2=black, 3=American hispanic, 4=Asian, 5=other
baseval        standardized baseline value                  numeric value
age            standardized age                             numeric value
subjid         subject ID                                   numeric value

Table 1. Description of Sample Clinical Data

Efficacy endpoints are measured at selected on-therapy visits. The MIXED procedure is selected for repeated measures analysis using restricted maximum likelihood.


proc mixed data=eff method=REML;
   /* trt is listed as a CLASS variable so that it matches the trt*visit effect in the model */
   class trt visit subjid sex race;
   model response = trt*visit baseval*visit sex race age
         / solution ddfm=kr influence residual;
   repeated visit / type=un subject=subjid;
   /* Contrasts at visit 11: coefficient 5 is Dose A; coefficients 10, 15, 20, 25 are Doses B to E */
   estimate "Dose B – Dose A at V 11" trt*visit 0 0 0 0 -1 0 0 0 0 1;
   estimate "Dose C – Dose A at V 11" trt*visit 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 1;
   estimate "Dose D – Dose A at V 11" trt*visit 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1;
   estimate "Dose E – Dose A at V 11" trt*visit 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1;
   lsmeans trt*visit;
run;

Table 2. Sample Efficacy Analysis Code

The data set contains some missing values due to subjects who withdrew early, as well as some unexpected missing values.

Incomplete Data Output

The SAS MIXED procedure excludes observations with any missing values from the analysis. The output of estimates is shown in Table 3 below.

Table 3. Output of Estimates from Incomplete Data

LOCF Output

Last Observation Carried Forward (LOCF) is a method specific to longitudinal missing data problems. It replaces missing data at later visits with the last available observed value. The method is illustrated in Tables 4 and 5. Table 4 shows the missing data pattern.

SUBJID BASELINE VISIT 4 (y1) VISIT 5 (y2) VISIT 7 (y3) VISIT 9 (y4) VISIT 11 (y5)
001 0.5 0.3 -1.2 -2.8
002 0.4 -0.6 -0.6 -0.9
003 0.7 -1.2 -1.0 -1.1 -1.4 -1.6
004 0.6 1.1
005 -0.2 -0.9 -1.0

Table 4. Missing Data Patterns


SUBJID BASELINE VISIT 4 (y1) VISIT 5 (y2) VISIT 7 (y3) VISIT 9 (y4) VISIT 11 (y5)
001 0.5 0.3 -1.2 -2.8 == -2.8 == -2.8
002 0.4 -0.6 -0.6 -0.9 == -0.9 == -0.9
003 0.7 -1.2 -1.0 -1.1 -1.4 -1.6
004 0.6 1.1 == 1.1 == 1.1 == 1.1 == 1.1
005 -0.2 -0.9 == -0.9 -1.0 == -1.0

Table 5. LOCF to Fill Missing Data

Once the data set has been completed using LOCF, it is analyzed as if it were fully observed. Although LOCF is still widely used in New Drug Applications (NDAs), many statisticians consider that it produces biased point estimates, biased variances, and incorrect inference (see, for example, Mallinckrodt et al., 2004).
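The carry-forward rule can be sketched in a few lines of Python (an illustration only, not the paper's SAS code). The function name `locf` is hypothetical, missing values are represented as `None`, and the example row holds the visit values y1 to y5 for subject 001 as they appear in Table 5.

```python
def locf(values):
    """Replace each missing value (None) with the last observed value before it."""
    filled = []
    last = None
    for v in values:
        if v is not None:
            last = v           # remember the most recent observed value
        filled.append(last)    # stays None if nothing has been observed yet
    return filled

# Visit values y1..y5 for subject 001; y4 and y5 are missing.
subject_001 = [0.3, -1.2, -2.8, None, None]
filled_001 = locf(subject_001)
print(filled_001)
```

Note that a value missing before any observation exists has nothing to carry forward and is left missing in this sketch.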

Table 6. Output of Estimates from LOCF Data

MCMC Output – Monotone Imputation Method

There are several possible patterns of missing data in a clinical study. The sources of missing data can be categorized as:

1) Some subjects drop out of the study, resulting in a monotone pattern of missing data.
2) Some data are missing intermittently owing, for example, to an illness or death, an invalid measurement, or forgetfulness. This type of missing data is missing at random (MAR) with a non-monotone pattern.

A longitudinal clinical study generally suffers from both types of missingness, and the collected data are often incomplete with a mixed monotone and non-monotone structure. A monotone missing pattern is a particular type of missing data pattern. A data set with variables Y1, Y2, …, Yp, in this specific order, is defined as having a monotone missing pattern when the event that a variable Yj is observed for a particular subject implies that all previous variables Yk, k < j, are also observed for that subject. Table 7 shows the missing data patterns: an “X” means that the variable is observed in the corresponding group, a “.” means that the variable is missing and will be imputed to achieve monotone missingness in the imputed data set, and an “O” means that the variable is missing and will not be imputed.
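The definition of a monotone missing pattern translates directly into a check on each subject's record. The Python sketch below is an illustration only; the function name `is_monotone` and the two example patterns are assumptions, with `True` standing for an observed variable and `False` for a missing one, in the order Y1, Y2, …, Yp.

```python
def is_monotone(observed):
    """True when every observed Yj implies that all earlier Yk (k < j) are observed."""
    seen_missing = False
    for is_obs in observed:
        if is_obs and seen_missing:
            return False       # an observed value follows a missing one
        if not is_obs:
            seen_missing = True
    return True

# Hypothetical patterns for Y1..Y5.
dropout = [True, True, True, False, False]        # subject drops out after Y3
intermittent = [True, False, True, False, False]  # Y2 missing but Y3 observed

print(is_monotone(dropout), is_monotone(intermittent))
```

A data set has a monotone missing pattern, for a given variable order, when the check passes for every subject.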


Table 7. Missing Data Patterns

The missing data patterns in Table 8 are rearranged to show a triangular monotone missing data pattern. They provide a clear overview of the quantity, positioning, and type of missing values in the dataset. The variable order specified in the VAR statement determines the monotone missing pattern in the imputed data set. With a different order in the VAR list, the results will be different because the monotone missing pattern to be constructed will be different.

Table 8. Rearranged Missing Data Patterns

Model Parameter Simulation

The number of imputations is set to 1000 in the MI procedure to create 1000 imputed monotone data sets. Each imputed data set is used to run the PROC MIXED model to generate a set of model parameter estimates. Table 9 shows the 1000 sets of simulated model parameters.

Table 9. 1000 Sets of Simulated Model Parameters

Rubin (1987) and others proposed simple rules for combining the results: the overall estimate is the average of the simulated parameter estimates, and the overall standard error is obtained from the within-imputation and between-imputation variances. Table 10 shows the overall results.
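For reference, Rubin's combining rules as they are usually stated can be sketched in Python as follows. This is an illustration of the standard rules rather than the code used to produce Table 10; the function name `pool_rubin` and the three input estimates and standard errors are made-up values.

```python
import math

def pool_rubin(estimates, std_errors):
    """Combine m completed-data estimates and standard errors with Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                  # pooled point estimate
    within = sum(se ** 2 for se in std_errors) / m              # within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total = within + (1 + 1 / m) * between                      # total variance
    return q_bar, math.sqrt(total)

# Hypothetical estimates and standard errors from m = 3 imputed data sets.
pooled_estimate, pooled_se = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(pooled_estimate, pooled_se)
```

The between-imputation term is what carries the extra uncertainty due to the missing data; averaging the standard errors alone would omit it.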

Label                     Estimate  Std Err   DF  t Value  Pr > |t|
Dose B – Dose A at V 11    -0.7597   0.4910  229    -1.55    0.1232
Dose C – Dose A at V 11     0.0077   0.4910  229     0.02    0.9876
Dose D – Dose A at V 11    -1.2151   0.4979  232    -2.44    0.0154
Dose E – Dose A at V 11     0.7028   0.5060  228     1.39    0.1665


Table 10. Average of 1000 Sets of Simulated Model Parameters

If we combine the 1000 imputed data sets and average each data cell, we can construct a single averaged imputed data set. Table 11 shows the model estimation results.

Table 11. Model Estimation Results from Averaged Imputed Data Set

In the case of a monotone missing pattern where the fraction of missing information is low, methods that average the imputed values of the missing data can be more efficient than methods that average simulated parameter values in parameter simulation (Schafer, 1997). A comparison of Tables 10 and 11 shows that the parameter estimates are very similar. The relative efficiency (RE) of MI is a measure of imputation efficiency, proposed by Rubin (1987) as follows:

$$RE = \left(1 + \frac{\lambda}{m}\right)^{-1}$$

where m is the number of imputations and λ is the fraction of missing information for the quantity being estimated. Table 12 shows relative efficiencies for different values of m and λ. For cases with little missing information, only a small number of imputations is necessary.
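Rubin's relative efficiency is commonly written as RE = (1 + λ/m)^(-1), where m is the number of imputations. Under that assumption, a table of the kind described here can be tabulated with a short Python sketch; the function name `relative_efficiency` and the grid of m and λ values are illustrative choices, not necessarily those of Table 12.

```python
def relative_efficiency(m, lam):
    """Relative efficiency of m imputations when the fraction of missing information is lam."""
    return 1.0 / (1.0 + lam / m)

m_values = (3, 5, 10, 20)          # hypothetical numbers of imputations
lambda_values = (0.1, 0.3, 0.5)    # hypothetical fractions of missing information

print("lambda  " + "  ".join(f"m={m:<4d}" for m in m_values))
for lam in lambda_values:
    row = "  ".join(f"{relative_efficiency(m, lam):.4f}" for m in m_values)
    print(f"{lam:<6}  {row}")
```

With λ = 0.1, even a handful of imputations already gives a relative efficiency close to 1, which is consistent with the remark that few imputations are needed when little information is missing.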

Table 12. RE with different values of m and λ

The estimation results from the 10 imputations are shown in Table 13.

Table 13. Output of Estimates from 10 Imputations


A comparison of Tables 11 and 13 shows that the differences in the estimates are very small because the fraction of missing information is low.

MCMC Output – Full Imputation Method

The full imputation method replaces all missing values with imputed values. Table 14 shows the missing data patterns: an “X” means that the variable is observed in the corresponding group, and a “.” means that the variable is missing and will be replaced by an imputed value in the imputed data set.

Table 14. Missing Data Patterns – Full Imputation Method

The parameter simulation results from 1000 imputations are shown in Table 15.

Table 15. Model Parameter Simulation of 1000 Imputations

The average overall estimates and standard errors are shown in Table 16.

Label                     Estimate  Std Err   DF  t Value  Pr > |t|
Dose B – Dose A at V 11    -0.6733   0.4784  245    -1.41    0.1736
Dose C – Dose A at V 11     0.0247   0.4795  246     0.05    0.8500
Dose D – Dose A at V 11    -1.0531   0.4752  246    -2.22    0.0362
Dose E – Dose A at V 11     0.5927   0.4869  244     1.22    0.2454

Table 16. Average of 1000 Sets of Simulated Model Parameters


The parameter simulation process with estimates can be depicted in Figure 1 below. Figure 1 consists of 3 panels: descriptive statistics panel, simulation time series panel, and simulation density plot panel.

Figure 1. MCMC Parameter Simulation Plot with Average Parameter Values of 1000 Imputations

Table 17 is the output from the method that averages the imputed values of the missing data.

Table 17. Model Estimates from Average of 1000 Imputed Data Sets

The joint density plots can be constructed from the pairs y1 vs y2, y2 vs y3, y3 vs y4, and y4 vs y5. Figure 2 shows the joint density estimates of the MCMC simulated responses.


Figure 2. Joint Density Estimates of MCMC Simulated Responses

Imputation Diagnostics

Convergence diagnostics are critical when MCMC-based simulations are used. MCMC diagnostics focus on convergence of the iteration series and independence of the samples. These diagnostics can be carried out through graphical presentation of the iteration process. Two plots are suggested.

1) Plot the time-series for each variable of interest.
2) Plot the auto-correlation functions.
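As background for the second plot, the sample autocorrelation function of an iteration series can be computed as in the Python sketch below. This is an illustration only and not how PROC MI computes its plots; the function name `autocorrelation`, the simulated AR(1) chain, and its coefficient 0.8 are assumptions made for the example.

```python
import random

def autocorrelation(series, max_lag):
    """Sample autocorrelations of a series at lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    c0 = sum(d * d for d in centered) / n     # lag-0 autocovariance
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / n / c0
        for lag in range(max_lag + 1)
    ]

# Hypothetical chain: x[t] = 0.8 * x[t-1] + noise, so successive draws are correlated.
rng = random.Random(54321)
chain = [0.0]
for _ in range(1999):
    chain.append(0.8 * chain[-1] + rng.gauss(0.0, 1.0))

acf_chain = autocorrelation(chain, max_lag=5)
print([round(r, 2) for r in acf_chain])
```

Autocorrelations that decay quickly toward zero suggest that draws a few iterations apart behave approximately like independent samples.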

The MCMC statement provides two options for plotting the time-series for each variable and the autocorrelation functions. The sample code below produces an auto-correlation function for each parameter. Figure 3 shows the combination of autocorrelation functions from variables y1, y2, y3, y4, y5.

goptions lfactor=5 ftext=swissb htext=2;
proc mi data=test seed=54321 nimpute=100 out=outmono;
   /* Request autocorrelation function plots for the means of y1-y5 */
   mcmc impute=monotone chain=multiple
        acfplot(mean(y1 y2 y3 y4 y5) / symbol=dot csymbol=red hsymbol=0.01
                cneedles=red wneedles=3 cref=blue cconf=blue);
   var y1 y2 y3 y4 y5;
run;
quit;


Figure 3. Autocorrelations with 95% Confidence Limits

You can use the outiter option in the MCMC statement to capture the iteration history data. Figure 4 shows the iteration time-series plot by response variable at each visit.

Figure 4. MCMC Iteration Time Series Plot by Visit

Model Diagnostics – Residual Analysis

Residual analysis is one of the regression diagnostic tools for examining, graphically and numerically, the adequacy of the model specification. Model misspecification can affect the validity and efficiency of a regression analysis. One type of residual analysis is based on plots of raw residuals. In this illustration, age is the only numerical covariate. A boxplot is selected to display the raw residuals. Figure 5 shows that the means of the residuals are close to zero, suggesting that the functional form for age is not misspecified.


Figure 5. Residual Boxplot for Checking Functional Form for Variable Age

The other type of residual analysis is based on aggregates of residuals, such as moving sums, moving averages, or cumulative residuals, whose distributions are checked against those of certain zero-mean Gaussian stochastic processes [5]. Figure 6 shows the moving average residuals for checking the functional form specification graphically.

Figure 6. Checking Functional Form for Variable Age

EXAMPLE II – CHALLENGER O-RING FAILURE DATA

Input Data

On January 28, 1986, the space shuttle Challenger exploded 73 seconds after launch. The Challenger disaster investigation determined that the cold air temperature caused the rubber O-rings to stiffen and fail to seal the joint adequately. The data relating O-ring failure to temperature are used in this section for illustration purposes. The data are listed in Table 17 below.

Flight No.  14   9  23  10   1   5  13  15   4   3   8  17   2  11   6   7  16  21  19  22  12  20  18
Failure      1   1   1   0   0   0   0   0   0   0   0   0   1   1   0   0   0   1   0   0   0   0   0
Temp (°F)   53  57  58  63  66  67  67  67  68  69  70  70  70  70  72  73  75  75  76  76  78  78  79

Table 17. O-ring Failure data


MCMC is used to fill in some missing data, and the completed data are then fitted with a logistic regression model. The sample code is shown below.

/* Prior information data set: covariance matrix, sample size, and means */
data inprior(type=cov);
   input _type_ $1-4 _name_ $7-13 @16 failure 6.2 @25 temp 6.2;
   datalines;
COV   failure  0.15     -1.5
COV   temp     -1.5     35.1
N              25.      25.
MEAN           0.3      67.7
;

proc mi data=test seed=54321 nimpute=300 out=outmono;
   mcmc prior=input=inprior;
   var failure temp;
run;

/* Dichotomize the imputed failure values */
data outmono;
   set outmono;
   if failure > 0.5 then failure=1;
   else failure=0;
run;

ods output CLParmWald=wparm association=asso;
proc logistic data=outmono;
   by _imputation_;
   model failure(event='1')=temp / outroc=roc1 clparm=wald;
run;

The results are depicted in Figure 7. Figure 7 consists of 5 panels:

1) The left panel shows the average logistic function and its variation.
2) The middle upper panel shows the simulated predicted failure probability at 65 degrees.
3) The middle lower panel shows the simulated predicted failure probability at 49 degrees.
4) The right upper panel shows the density of the predicted failure probability at 65 degrees.
5) The right lower panel shows the density of the predicted failure probability at 49 degrees.

Figure 7. MCMC Simulated predicted Probability of O-ring Failure


The O-ring response is a binary variable with failure and non-failure outcomes. The risk factor, temperature, is a continuous variable. For a dichotomous risk factor, the odds ratio (OR) Ψ is defined as the ratio of the odds for those with the risk factor to the odds for those without it. For a continuous explanatory variable, the odds ratio corresponds to a unit increase in the risk factor.

Three types of prior information are used to construct the imputed data sets: a non-informative prior (the default), a ridge prior, and a prior information data set. These data sets are passed to the LOGISTIC procedure to generate the odds ratios. In the displayed output of PROC LOGISTIC, the “Odds Ratio Estimates” table contains the odds ratio estimates. Figure 8 shows the OR density estimates by prior information.
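The link between the odds ratio for a unit increase and the logistic regression slope can be written out as follows. The symbols β0 and β1 (intercept and temperature coefficient), T (temperature), and π(T) (failure probability at temperature T) are notation introduced here for illustration and do not appear in the paper.

```latex
\begin{align*}
\log\frac{\pi(T)}{1-\pi(T)} &= \beta_0 + \beta_1 T, \\
\text{odds}(T) &= \frac{\pi(T)}{1-\pi(T)} = e^{\beta_0 + \beta_1 T}, \\
\Psi &= \frac{\text{odds}(T+1)}{\text{odds}(T)}
      = \frac{e^{\beta_0 + \beta_1 (T+1)}}{e^{\beta_0 + \beta_1 T}}
      = e^{\beta_1}.
\end{align*}
```

Under this model the odds ratio for a one-degree increase does not depend on T, so a negative slope gives Ψ below 1, meaning the odds of failure fall as temperature rises.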

Figure 8. OR Density Estimates from MCMC Simulation by Prior Info.

The Receiver Operating Characteristic (ROC) curve is plotted on a probability scale and is used to judge the discrimination ability of various statistical methods for predictive purposes. The area under the ROC curve can be measured and converted to a single quantitative index of diagnostic accuracy. ROC curves are popular tools for the detection of events or conditions such as asymptomatic dysfunction or disease.

For binary response data, the response is either an event or a nonevent. The accuracy of the classification is measured by its sensitivity and specificity. Sensitivity is the ability to predict an event correctly. Specificity is the ability to predict a nonevent correctly. “PROC LOGISTIC also computes three other conditional probabilities: false positive rate, false negative rate, and rate of correct classification. The false positive rate is the proportion of predicted event responses that were observed as nonevents. The false negative rate is the proportion of predicted nonevent responses that were observed as events. Given prior probabilities specified with the PEVENT=option, these conditional probabilities can be computed as posterior probabilities using Bayes’ theorem.” [14]

Figure 9 shows the ROC curves, obtained by plotting ‘sensitivity’ versus ‘1 – specificity’.


Figure 9. Simulated ROC Curves

The area under a ROC curve is a summary quantitative index. This index varies from 0.5 (no discrimination power) to 1.0 (perfect accuracy) as the ROC curve moves toward the left and top boundaries of the graph. The meaning of the area under a ROC curve is the "probability of correctly ranking a (normal, abnormal) pair" [14]. In other words, the index is the probability of a correct pairwise ranking. The area under the ROC curve, as determined by the trapezoidal rule, is estimated by the statistic ‘c’ in the “Association of Predicted Probabilities and Observed Responses” table [14]. Table 18 shows the AUC estimates from the first 10 imputations.
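The pairwise-ranking interpretation can be illustrated directly on the O-ring data listed in Table 17. The Python sketch below is not the paper's analysis: instead of fitting a logistic model it simply uses lower temperature as the risk score, counts tied pairs as one half, and the function name `c_statistic` is hypothetical.

```python
# O-ring data as listed in the paper (23 flights, ordered by temperature).
failure = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
temp = [53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70,
        70, 70, 72, 73, 75, 75, 76, 76, 78, 78, 79]

def c_statistic(scores, events):
    """Proportion of (event, nonevent) pairs ranked correctly; ties count one half."""
    event_scores = [s for s, e in zip(scores, events) if e == 1]
    nonevent_scores = [s for s, e in zip(scores, events) if e == 0]
    concordant = 0.0
    for es in event_scores:
        for ns in nonevent_scores:
            if es > ns:
                concordant += 1.0
            elif es == ns:
                concordant += 0.5
    return concordant / (len(event_scores) * len(nonevent_scores))

# Assumption: colder launches are riskier, so the score is the negated temperature.
risk_score = [-t for t in temp]
c = c_statistic(risk_score, failure)
print(f"c = {c:.3f}")
```

Because any model whose predicted probability decreases monotonically with temperature ranks the flights in the same order, it yields the same pairwise ranking as this raw score, apart from how ties are handled.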

Table 18. AUC Estimation from first 10 Imputations.

The functional form specification technique [] is used to smooth the simulated ROC curves. Figure 10 shows the smoothed ROC curves.


Figure 10. Smoothed Simulated ROC Curves

EXAMPLE III – CLINICAL EVENT DATA

For the purposes of this example, we consider a clinical event that occurs with some frequency following a clinical procedure. Our interest is in understanding the frequency of this event when patients are receiving one of four treatments (labeled Groups 1-4). We assume that little is known about the frequency of this type of event prior to the conduct of the clinical trial. This is known as the assumption of a uniform or non-informative prior. Data from the clinical trial are incorporated in our model with the intent of updating this knowledge.

The clinical events themselves are assumed to follow a binomial distribution with unknown parameter 0≤p≤1. Our intent is to use clinical data as they are collected to refine our knowledge of p. Data were observed as follows, where x is the number of observed clinical events out of the n patients randomized to receive each of treatments 1 through 4:

data test;
   input group x n;
   datalines;
1 74 164
2 75 169
3 66 160
4 49 157
;
run;

Such data are easily modeled using logistic regression in PROC GENMOD (see the model statement below); however, the new experimental PROC BGENMOD allows one to conduct a Bayesian analysis of the data at any point in the trial. The additional code involved is found below in the ‘bayes’ statement, which specifies that the prior distribution should be uniform and that one million MCMC simulations of the posterior density of p should be generated with a burn-in of 500 iterations. Parameter estimates are output through the ODS statement and may be summarized, for example with PROC UNIVARIATE, to compare the posterior distributions of p.


proc bgenmod data=test;
  class group;
  model x/n = group / dist=bin link=logit p cl;
  bayes seed=2010 NBI=500 NMC=1000000 coeffprior=uniform;
  ods output ParameterEstimates=params PosteriorSample=post;
run;
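When reading the posterior sample, it may help to spell out the model implied by the MODEL statement. The sketch below assumes the default CLASS coding of PROC GENMOD, in which the last level (Group 4) is the reference level; the symbols \beta_0 and \beta_g are our notation, not names taken from the output data sets.

```latex
x_g \sim \mathrm{Binomial}(n_g,\, p_g), \qquad
\operatorname{logit}(p_g) = \log\frac{p_g}{1 - p_g} = \beta_0 + \beta_g, \quad \beta_4 = 0,
\qquad
p_g = \frac{1}{1 + \exp\{-(\beta_0 + \beta_g)\}} .
```

Under this reading, each posterior draw of (\beta_0, \beta_g) maps to a draw of p_g, and a posterior summary for p is a summary of these transformed draws.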

The posterior median value for p is lowest (0.31) in Group 4, while the probability of an event in Groups 1-3 is higher, with posterior medians ranging from 0.41 to 0.45. The confidence to be placed in these findings is reflected in the width of the posterior distributions (see Figure 11): the posterior density for Group 4 is centered well below those for Groups 1-3, indicating that treatment Groups 1-3 have a higher probability of an event than Group 4.
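As a rough cross-check on these posterior medians, one can place the uniform prior directly on p rather than on the regression coefficients. This is an assumption that differs from the BGENMOD run above, but it has a closed form: with x events in n patients, the posterior for p is Beta(x + 1, n - x + 1). A minimal sketch using the TEST data set defined earlier (the output data set and the new variable names are ours):

```sas
/* Closed-form posterior for p under a Uniform(0,1) prior on p itself.    */
/* Assumption: this is NOT the BGENMOD model, only an approximate check.  */
data beta_check;
  set test;
  a = x + 1;                                    /* posterior Beta shape parameters */
  b = n - x + 1;
  post_median = quantile('BETA', 0.5,   a, b);  /* posterior median of p */
  cri_lower   = quantile('BETA', 0.025, a, b);  /* 95% equal-tailed credible interval */
  cri_upper   = quantile('BETA', 0.975, a, b);
  p_hat       = x / n;                          /* observed proportion, for comparison */
run;

proc print data=beta_check noobs;
  var group x n p_hat post_median cri_lower cri_upper;
run;
```

The observed proportions are 74/164 ≈ 0.45, 75/169 ≈ 0.44, 66/160 ≈ 0.41, and 49/157 ≈ 0.31, so the medians from this check should sit close to the values reported above.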

Figure 11. Summary of Estimated Posterior Densities for the Probability of an Event in Treatment Groups 1-4

It is easy to confirm that the Bayesian credible region for each group, based on the posterior distribution described above, is essentially equivalent to the corresponding frequentist confidence interval when a uniform prior is used. We defer further discussion of the utility of Bayesian methods to future work.

CONCLUSION

The SAS MI procedure is a powerful tool for missing data imputation and parameter simulation. The key features are summarized as follows:

• It is very easy to use. Only three SAS statements are needed.
• The procedure provides options for the users to modify the initial value settings and to select the method for imputation.
• It provides two optional output datasets, out and outiter, for further analysis.


• The procedure can be used in the data preparation steps before calling the analysis model to simplify the clinical efficacy data analysis process.
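Reading the "three SAS statements" of the first bullet as PROC MI, MCMC, and VAR, a minimal call might look like the sketch below. It assumes a hypothetical input data set TRIAL with continuous analysis variables Y1-Y3 that contain missing values; all data set and variable names are placeholders. Apart from the seed, the option values shown are the defaults listed in Appendices 1 and 2.

```sas
/* Hypothetical data set and variable names; options as listed in the appendices. */
proc mi data=trial seed=2010 nimpute=5 out=trial_mi;
  mcmc chain=single nbiter=200 niter=100   /* burn-in, then spacing between imputations */
       outiter=mcmc_iter;                  /* per-iteration parameters, for diagnostics */
  var y1 y2 y3;
run;
```

TRIAL_MI stacks the completed data sets, indexed by the variable _Imputation_. On the choice of NIMPUTE=, Rubin (1987) gives the efficiency of an estimate based on m imputations, relative to infinitely many, as approximately (1 + λ/m)^-1, where λ is the fraction of missing information; with λ = 0.3 and m = 5 this is about 0.94.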

MI is attractive because it can be highly efficient even for a small number of imputations. In many applications, just 3-5 imputations are sufficient to obtain excellent results. Extensions of this approach to Bayesian analysis are easily accommodated using the experimental procedure recently provided by SAS; in our example only one additional line of code is needed. A general MCMC Gibbs sampler in future versions of SAS would allow for even greater utility in clinical statistics and programming.

APPENDICES

Appendix 1. Summary of PROC MI Options

| Option | Description | Default Value |
|---|---|---|
| alpha= | specifies that confidence limits be constructed for the mean estimates with confidence level 100(1 - α)%, where 0 < α < 1. | α=0.05 |
| data= | input data set name | most recently created data set name |
| maximum= | specifies maximum values for imputed variables. | a missing value |
| minimum= | specifies the minimum values for imputed variables. | a missing value |
| minmaxiter= | specifies the maximum number of iterations for imputed values to be in the specified range when the option MINIMUM or MAXIMUM is also specified. | 100 |
| mu0= theta0= | specifies the parameter values µ0 under the null hypothesis µ=µ0 for the population means corresponding to the analysis variables. | 0 |
| nimpute= | specifies the number of imputations. | 5 |
| noprint | suppresses the display of all output. | |
| out= | creates an output SAS data set containing imputation results. | SAS data set name |
| round= | specifies the units to round variables in the imputation. | a missing value |
| seed= | specifies a positive integer to start the pseudo-random number generator. | the time of day from the computer’s clock |
| simple | displays simple descriptive univariate statistics and pairwise correlations from available cases. | |
| singular= | specifies the criterion for determining the singularity of a covariance matrix based on standardized variables, where 0 < p < 1. | 1E-8 |

Appendix 2. Summary of MCMC Statement Options

| Type | Option | Description | Default |
|---|---|---|---|
| input | inest= | the input INEST= data set is a TYPE=EST data set and contains a variable _imputation_ to identify the imputation number. | |
| input | initial=input= | the input INITIAL=INPUT= data set is a TYPE=COV or CORR data set and provides initial parameter estimates for the MCMC process. | |
| input | prior=input= | the PRIOR=INPUT= data set is a TYPE=COV data set that provides the prior information. | |
| output | outest= | the OUTEST= data set is a TYPE=EST data set and contains parameter estimates used in each imputation in the MCMC process. | |
| output | outiter<(options)>= | the OUTITER= data set in an MCMC statement is a TYPE=COV data set and contains parameters used in the imputation step for each iteration. | |
| imputation | impute= | specifies whether a full-data imputation is used for all missing values or a monotone data imputation is used for a subset of missing values to make the imputed data sets have a monotone missing pattern. | full |
| imputation | chain= | specifies whether a single or a separate chain is used for all imputations. | single |
| imputation | nbiter= | specifies the number of burn-in iterations before the first imputation in each chain. | 200 |
| imputation | niter= | specifies the number of iterations between imputations in a single chain. | 100 |
| imputation | initial= | specifies the initial mean and covariance estimates for the MCMC process. | |
| imputation | prior= | specifies the prior information for the means and covariances. Valid values for name are as follows: JEFFREYS, RIDGE, and INPUT=dataset | JEFFREYS |
| imputation | start= | specifies that the initial parameter estimates are used as either the starting value or as the starting distribution in the first imputation step of each chain. | value |
| graphics | timeplot= | displays the time-series plots of parameters from iterations. | |
| graphics | acfplot= | displays the autocorrelation function plots of parameters from iterations. | |
| graphics | gout= | specifies the graphics catalog for saving graphics output from PROC MI. | |
| printed output | WLF | displays worst linear function. | |
| printed output | displayinit | displays initial parameter values in the MCMC process for each imputation. | |

Appendix 3. Summary of ODS Table from MCMC Statement

| ODS Table Name | Description | Option |
|---|---|---|
| EMPostEstimates | EM (posterior mode) estimates | INITIAL=EM |
| EMPostIterHistory | EM (posterior mode) iteration history | INITIAL=EM |
| EMWLF | coefficients of the worst linear function | WLF |
| MCMCInitEstimates | initial parameter estimates for MCMC | DISPLAYINIT |

REFERENCES

[1] Harrell, F.E. (2000): Practical Bayesian Data Analysis from a Former Frequentist, Mastering Statistical Issues in Drug Development, Henry Stewart Conference Studies, 15-16 May 2000.
[2] Harrell, F.E. (2005): A Good P-value is Hard to Find: Why I Became a Bayesian, Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, TN, March 2, 2005.


[3] Gilks, W.R., S. Richardson, and D.J. Spiegelhalter (eds.) (1996): Markov Chain Monte Carlo in Practice, Chapman & Hall, London, UK.
[4] Fan, X., A. Felsovalyi, S.A. Sivo, and S.C. Keenan (2001): SAS for Monte Carlo Studies – A Guide for Quantitative Researchers, SAS Institute Inc., Cary, NC, USA.
[5] Lin, D.Y., L.J. Wei, and Z. Ying (2002): Model-Checking Techniques Based on Cumulative Residuals, Biometrics, 58, 1-12, March 2002.
[6] Little, R.J.A. and D.B. Rubin (1987): Statistical Analysis with Missing Data, New York: John Wiley & Sons, Inc.
[7] Mallinckrodt, C., C. Kaiser, J. Watkin, G. Molenberghs, and R. Carroll (2004): The effect of correlation structure on treatment contrasts estimated from incomplete clinical trial data with likelihood-based repeated measures compared with last observation carried forward ANOVA, Clinical Trials, 1, 477-489.
[8] Patterson, S., S. Francis, M. Ireson, D. Webber, and J. Whitehead (1999): A Novel Bayesian Decision Procedure for Early-Phase Dose Finding Studies, Journal of Biopharmaceutical Statistics, 9(4), 583-597.
[9] Rubin, D.B. (1987): Multiple Imputation for Nonresponse in Surveys, New York: John Wiley & Sons, Inc.
[10] SAS Institute Inc. (2004): SAS/STAT 9.1 User’s Guide, SAS Institute Inc., Cary, NC, USA.
[11] SAS Institute Inc. (2003): SAS® OnlineDoc 9, Cary, NC. http://support.sas.com/91doc/docMainpage.jsp
[12] SAS Institute Inc. (2004): Base SAS® 9.1 Procedure Guide, SAS Institute Inc., Cary, NC, USA.
[13] SAS Institute Inc. (2004): SAS/GRAPH® 9.1 Reference, Volumes 1 and 2, SAS Institute Inc., Cary, NC.
[14] SAS Institute Inc. (2004): SAS/STAT® 9.1 User’s Guide, SAS Institute Inc., Cary, NC.
[15] SAS Institute Inc. (2007): Preliminary Capabilities for Bayesian Analysis in SAS/STAT® Software, SAS Institute Inc., Cary, NC.
[16] Schafer, J.L. (1997): Analysis of Incomplete Multivariate Data, New York: Chapman and Hall.
[17] Yeh, S.T. (2004): Graphical Display of Clinical Data – A Nonparametric Approach, NESUG 2004 Proceedings, Paper po10, Nov. 2004.

TRADEMARKS

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

AUTHOR CONTACT INFORMATION

Scott D Patterson, Ph. D. (610) 787-3296 (W) E-mail: [email protected]
Shi-Tao Yeh, Ph. D. (610) 787-3856 (W) E-mail: [email protected]
