

DIABETES TECHNOLOGY & THERAPEUTICS
Volume 10, Number 2, 2008
© Mary Ann Liebert, Inc.
DOI: 10.1089/dia.2007.0216

Review

How to Assess and Compare the Accuracy of Continuous Glucose Monitors?

I.M.E. WENTHOLT, M.D.,1 A.A.M. HART, Ph.D.,2 J.B.L. HOEKSTRA, M.D., Ph.D.,1 and J.H. DEVRIES, M.D., Ph.D.1

ABSTRACT

Continuous glucose monitors may be valuable tools for improving glycemic control and avoiding hypoglycemia in patients with diabetes. To this goal, sensor readings must adequately reflect the actual blood glucose, emphasizing the need for solid accuracy assessment methods for continuous glucose sensor readings. Analysis of continuous glucose data is challenging, and despite many efforts there still is no all-embracing method to overcome the obstacles in the assessment of continuous data. In this review we disclose the weaknesses of currently available methods and propose a guideline for sensor accuracy assessment and comparison. For accuracy assessment it is best to first plot the sensor readings against the reference values and draw a line of identity, visualizing the degree of agreement. Thereafter data pairs should be given in a Bland-Altman plot to detect a possible relationship between the difference and the mean. The next step is to calculate the absolute relative difference over all paired readings together and per glucose range. A possible lag time between the measurements of both methods can be detected by combined curve fitting. Finally, sensitivity and positive predictive value for detecting hypoglycemia are important indicators of the sensors’ performance. For comparing the accuracy between different glucose sensors it is best to use indirect comparison against blood glucose, rather than direct comparison methods, since none of the current glucose sensors is accurate enough to be considered the reference value.


Departments of 1Internal Medicine and 2Clinical Epidemiology and Biostatistics, Academic Medical Center, Amsterdam, the Netherlands.

INTRODUCTION

BOTH RANDOMIZED1–3 and nonrandomized4–6 clinical trials have proven continuous glucose monitoring to be beneficial for glycemic control in patients with diabetes. Well-functioning continuous glucose monitors may also be valuable tools for avoiding impending hypoglycemia, the limiting factor of intensive insulin treatment.7 The introduction of sensors allowing for prolonged or even long-term continuous use may go hand in hand with insulin dosing solely based on sensor glucose readings, despite the Conformité Européenne (CE)/Food and Drug Administration (FDA) approval as an adjunctive device and not as a replacement of self-monitoring of blood glucose (SMBG), which makes the need for a well-founded accuracy assessment of continuous glucose sensor readings even more pressing. Previous reports already emphasized this need8,9 and presented a critical evaluation of some of the available accuracy assessment methods.10

To date, eight devices are on or entering the market. The continuous glucose monitoring system CGMS® System Gold™ (Medtronic MiniMed, Northridge, CA), GlucoWatch® G2™ Biographer (Cygnus Inc., Redwood City, CA), Guardian® Telemetered Glucose Monitoring System (Medtronic MiniMed), Guardian RT (Medtronic MiniMed), and Paradigm® REAL-Time (Medtronic MiniMed) are both CE and FDA approved, while GlucoDay® (Menarini Diagnostics, Florence, Italy) is only CE marked, and STS™ System (DexCom™, San Diego, CA) is only FDA approved. Furthermore, FDA approval is pending for the Freestyle® Navigator® from Abbott Laboratories (Alameda, CA). With this ongoing development, interest has risen in comparative studies to detect the most accurate of two or more sensors. In the setting of head-to-head studies, sensors are not necessarily compared to the “gold standard” of blood glucose, and this requires a careful consideration somewhat different from the accuracy assessment of a single sensor. Although biostatisticians agree the goal of these comparisons is to detect disagreement or bias rather than agreement, so far no consensus exists on how to best achieve this goal.11

This report aims to provide an overview of the currently available accuracy assessment methods and to discuss the pros and cons of each method. Advantages and disadvantages relate to both the practical feasibility (e.g., ease or complexity of interpretation, the amount of data required for the assessment) and the statistical correctness of the principles underlying the assessment. Finally, we propose a clear and well-founded way to assess accuracy of one continuous glucose sensor as well as for head-to-head comparisons of more than one sensor.

MATERIALS AND METHODS

An English language literature search was done using the Medline database of the National Library of Medicine, starting from 1983, the year Bland and Altman12 first published on the topic of analyzing method comparison studies. The various statistical accuracy assessment tools were entered as search terms, including regression analysis, correlation (coefficient), mean and median absolute (relative) difference or relative absolute difference, International Organization for Standardization (ISO) criteria, Clarke, consensus and continuous glucose error grid analyses, sensitivity and specificity, positive and negative predictive value (PPV and NPV, respectively), and various synonyms. Animal and in vitro studies were excluded, and if needed, the number of hits was restricted by including the terms “diabetes mellitus,” “accuracy,” and “continuous glucose monitoring.” In addition, articles relevant to this topic were hand-searched by the authors and checked for references.

Assessment tools

Regression analysis and correlation coefficient. Two types of regression analysis are in use: least squares regression (LSR) and least products regression (LPR).13 With LSR, a straight line is fitted in a sensor–reference glucose plot so that the sum of the squared vertical deviations from the line is minimized. Only vertical deviations are included, assuming that the x variables, representing the reference method, are fixed or measured without error.14 With LPR, the sum of the products of the absolute deviations of both x and y values from the regression line is minimized, thus taking into account a measurement error for both the sensor and the reference method.15–17 In actual practice LSR is applied most of the time. In contrast to LSR, LPR accounts for variation of both the sensor and the reference method, which has been described as an advantage and a drawback at the same time, the latter because the variation of the reference and that of the sensor are equally weighted. Which method is most appropriate (a reference method assumed to have either zero variation [LSR] or a variation similar to the sensor method [LPR]) thus varies per applied reference method and depends on how the investigator estimates the accuracy of the reference method.13,18 Certain assumptions must therefore be made regarding the variation of the reference method, which is an important drawback of regression analysis. In the absence of highly accurate reference data, results from regression analysis are hard to interpret. To a lesser extent, the fact that regression analysis gives insight into the relationship rather than the (dis)agreement between two methods has been reported as another drawback.12 In summary, when laboratory blood glucose is used as the reference, LSR seems appropriate, whereas LPR seems preferable when SMBG is used as a reference.
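The difference between the two regression types can be illustrated with a minimal sketch. It assumes that LPR is computed as geometric mean regression (slope equal to the ratio of the SDs, carrying the sign of the covariance), which is a common formulation but is not spelled out in the text above. The function names and the paired readings are hypothetical.

```python
import math
from statistics import mean

def lsr_fit(x, y):
    """Least squares regression of y on x: minimizes squared vertical deviations."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def lpr_fit(x, y):
    """Least products regression, assumed here to be geometric mean regression."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    # Slope magnitude is SD(y)/SD(x); its sign follows the covariance.
    slope = math.copysign(math.sqrt(syy / sxx), sxy)
    return slope, my - slope * mx

# Hypothetical paired readings in mmol/L (illustrative values only).
reference = [3.5, 4.8, 6.1, 7.9, 9.4, 11.2, 13.0]
sensor = [4.1, 4.6, 6.8, 7.2, 10.1, 10.6, 13.9]

lsr_slope, lsr_intercept = lsr_fit(reference, sensor)
lpr_slope, lpr_intercept = lpr_fit(reference, sensor)
print(f"LSR: slope={lsr_slope:.3f}, intercept={lsr_intercept:.3f}")
print(f"LPR: slope={lpr_slope:.3f}, intercept={lpr_intercept:.3f}")
```

Under this formulation the LSR slope equals the LPR slope multiplied by the correlation coefficient, so the LSR line is never steeper than the LPR line; the gap widens as the scatter grows.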

The correlation coefficient has widely been used but also widely criticized as a tool for accuracy assessment. For instance, correlation ignores any systematic bias between the measurements of two methods.19 Also, a change in measurement scale does not affect the correlation, which looks at the degree of association of all the samples, while it does affect the agreement compared to the line of identity.20 When two methods measure the same substance, the a priori chance of finding a correlation is high, but this does not exclude the possibility that both methods have a (large) measurement error. Thus, a strong correlation does not guarantee that both methods can be used interchangeably.21 In a recent study concomitant sensor–reference glucose pairs were simulated 10,000 times using fixed accuracy and standard deviation (SD) of the sensor error, to investigate how this affected correlation. Correlation analysis, with values ranging from 0.5 to 0.96, indicated inconsistent accuracy for sensors that had identical accuracies by definition.10 Another report confirmed the incongruence between correlation and accuracy, with the Pearson correlation coefficient between sensor and blood glucose values being 0.85, while only 64% of the paired values were within 20% of each other.22

The major limitation of the correlation coefficient, and to some extent of regression analysis as well, is that it is highly influenced by the range and distribution of measured values18 and hence by the way in which the sample of subjects was chosen. A well-known advantage of regression analysis is that it quantifies the deviation from the line of identity, whereas a maximal correlation of 1 can exist where two methods show an intercept other than 0 and a slope other than 1 (a line at an angle other than 45°). In accordance with earlier reports10,19,20 we think that the correlation coefficient is inappropriate for measuring the agreement between two glucose measuring methods; this method should be abandoned in accuracy studies of continuous glucose monitoring.
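That a perfect correlation can coexist with poor agreement is easy to verify numerically. The sketch below uses a hypothetical sensor that reads 1.3 times the reference plus 1.0 mmol/L; all values are invented for illustration.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical reference values (mmol/L) and a systematically biased sensor.
reference = [3.0, 5.0, 7.0, 9.0, 11.0]
sensor = [1.3 * g + 1.0 for g in reference]

r = pearson_r(reference, sensor)
# Fraction of pairs in which the sensor is within 20% of the reference.
within_20_percent = sum(
    abs(s - g) / g <= 0.20 for s, g in zip(sensor, reference)
) / len(reference)

print(f"r = {r:.3f}, fraction within 20% = {within_20_percent:.2f}")
```

Here the correlation is 1 although none of the pairs agrees within 20%, because correlation is blind to both the slope and the intercept of the relationship.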

Accuracy assessment using paired data in separate glucose ranges. Methods using paired values include the mean difference (MD), the mean relative difference (MRD), the mean absolute relative difference (ARD), the median ARD, the Bland-Altman analysis with the limits of agreement (when the requirements explained below are met), and, up to a certain extent, the percentage of glucose measurements meeting the ISO requirements. The MD (average of the sensor values minus reference values) and the MRD (average of the differences divided by the reference value, multiplied by 100 to convert the proportion into a percentage) convey a systematic under- or overestimation of one method compared to another, but with the negative and positive deviations neutralizing each other, overestimated and underestimated glucose values are flattened out. These methods are therefore directed at estimating an assumed constant absolute or relative bias of one method relative to the other. The mean and median ARD are the average and median absolute differences, respectively, between sensor and reference values divided by the reference value and multiplied by 100, and indicate by what percentage a method deviates from the reference method, whether by under- or overestimation. Calculation of both the mean and median ARD is straightforward, and the results are easy to interpret. The median and mean ARDs give an impression of both the bias and the variation: the bigger the bias and/or variation between two measuring systems, the larger the mean and median ARD. The measures do not, however, make it possible to separate and quantify the contributions of bias and variation to the final mean and median ARD. In general, the median ARD for continuous glucose sensors turns out lower than the mean ARD. Although the median ARD seems statistically more correct than the mean ARD, in view of the distribution of the data, this measure is currently less well accepted and used than the mean ARD.
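The paired-data measures described above can be sketched in a few lines. The range limits used here (3.9 and 10.0 mmol/L) and all readings are assumptions chosen for illustration, as are the function names.

```python
from statistics import mean, median

def paired_accuracy(sensor, reference):
    """MD, MRD (%), mean ARD (%) and median ARD (%) for paired readings."""
    diffs = [s - r for s, r in zip(sensor, reference)]
    rel = [100.0 * d / r for d, r in zip(diffs, reference)]
    abs_rel = [abs(v) for v in rel]
    return {
        "MD": mean(diffs),
        "MRD": mean(rel),
        "mean_ARD": mean(abs_rel),
        "median_ARD": median(abs_rel),
    }

def glucose_range(reference_value, hypo_limit=3.9, hyper_limit=10.0):
    """Classify a reference value; both limits are illustrative assumptions."""
    if reference_value <= hypo_limit:
        return "hypoglycemic"
    if reference_value > hyper_limit:
        return "hyperglycemic"
    return "normoglycemic"

def accuracy_per_range(sensor, reference):
    """Apply paired_accuracy separately within each glucose range."""
    groups = {}
    for s, r in zip(sensor, reference):
        sensors, references = groups.setdefault(glucose_range(r), ([], []))
        sensors.append(s)
        references.append(r)
    return {name: paired_accuracy(ss, rs) for name, (ss, rs) in groups.items()}

# Hypothetical paired readings in mmol/L.
reference = [3.0, 3.6, 5.0, 6.0, 8.0, 12.0, 15.0]
sensor = [3.6, 3.0, 5.5, 5.4, 8.8, 10.8, 16.5]

overall = paired_accuracy(sensor, reference)
per_range = accuracy_per_range(sensor, reference)
print(overall)
print(per_range)
```

In this invented data set the positive and negative deviations largely cancel in the MD and MRD, while the mean and median ARD retain the size of the deviations, which is the contrast drawn in the text.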

The ISO formulates requirements for blood glucose monitoring systems as follows: for reference values <75 mg/dL (4.2 mmol/L) the ISO criterion is defined as sensor values within ±15 mg/dL (0.8 mmol/L), and for reference values ≥75 mg/dL (4.2 mmol/L) within ±20%. So, the ISO criteria combine absolute and relative differences between sensor and reference values, and the results are dichotomized as to either satisfying the criteria or not.10
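As a minimal sketch, the dichotomized ISO check can be written as follows. It assumes that a reference value of exactly 75 mg/dL falls under the relative (20%) criterion; the function names and readings are hypothetical.

```python
def meets_iso(sensor_mgdl, reference_mgdl):
    """True if one sensor-reference pair satisfies the ISO criterion (mg/dL)."""
    deviation = abs(sensor_mgdl - reference_mgdl)
    if reference_mgdl < 75:
        return deviation <= 15  # absolute criterion in the low range
    return deviation <= 0.20 * reference_mgdl  # relative criterion elsewhere

def iso_fraction(pairs):
    """Fraction of (sensor, reference) pairs that satisfy the criterion."""
    return sum(meets_iso(s, r) for s, r in pairs) / len(pairs)

# Hypothetical (sensor, reference) pairs in mg/dL.
pairs = [(60, 70), (50, 70), (119, 100), (121, 100), (180, 200)]
print(f"{100 * iso_fraction(pairs):.0f}% of pairs meet the ISO criterion")
```

The pairs (119, 100) and (121, 100) differ by only 2 mg/dL yet fall on opposite sides of the cutoff, which is the weakness of dichotomized outcomes discussed later in this section.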

The Bland-Altman analysis plots the differences between two methods (y-axis) against the mean of both methods (x-axis), enabling (1) detection of a relation between bias and blood glucose level, (2) detection of a tendency for the variation to change with the level of the (glucose) measurements, and (3) calculation of the limits of agreement. A plot of the difference against the reference measurement per se is sometimes suggested, but this will show a relation between difference and magnitude even when there is none, due to the so-called regression towards the mean effect. When (sensor – reference) is plotted against the reference value, the difference will seemingly diminish as the reference value rises, suggesting a negative correlation. A plot of the difference against the mean is less likely to be misleading in that way. Some investigators argue that with the use of an averaged value on the x-axis, the Bland-Altman method would be inappropriate for accuracy assessment, where the sensor value is compared to a highly accurate reference method, regarded as having no or negligible measurement error.14,16 As discussed earlier regarding LSR and LPR, the choice of method depends on the investigator’s appraisal of the accuracy of the available reference method. One way to resolve this issue might be the following. If one knows the ratio (sensor SD)/(reference SD), denoted by, say, λ, then use of the weighted mean with weights 1/(1 + λ²) for the sensor and λ²/(1 + λ²) for the reference method would remove the correlation between difference and mean caused by the regression towards the mean effect. For equal SD values this would lead to use of the unweighted mean on the horizontal axis as in the standard Bland-Altman method, while for a reference without error this method would lead to the use of the reference value instead. The SDs, and thereby λ, might be estimated by fitting the glucose–time curves using fixed-interval natural splines23 (see next section) and taking the square root of the residual mean squares. Alternatively, sensor and reference SD values can be taken from the literature. Based on this we recommend the use of the reference value alone in case sensor readings are compared with laboratory blood glucose measurements, and the use of the (weighted) averaged values of the sensor and reference method when sensor readings are compared with SMBG values obtained with a home blood glucose meter or other methods known to have a considerable variation.
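The weights for the sensor and the reference can be motivated by a short calculation. The sketch below assumes that the sensor reading S and the reference reading R equal the true glucose G plus independent errors with SDs \sigma_S and \sigma_R, that these errors are also independent of G, and it writes \lambda for the ratio \sigma_S/\sigma_R; these modelling assumptions and the symbols are introduced here for illustration.

```latex
\begin{aligned}
S &= G + e_S, \qquad R = G + e_R, \qquad
  \operatorname{Var}(e_S) = \sigma_S^2, \quad
  \operatorname{Var}(e_R) = \sigma_R^2, \quad
  \lambda = \sigma_S / \sigma_R, \\
D &= S - R = e_S - e_R, \qquad M_w = w\,S + (1 - w)\,R, \\
\operatorname{Cov}(D, M_w) &= w\,\sigma_S^2 - (1 - w)\,\sigma_R^2 = 0
  \iff w = \frac{\sigma_R^2}{\sigma_S^2 + \sigma_R^2} = \frac{1}{1 + \lambda^2},
  \qquad 1 - w = \frac{\lambda^2}{1 + \lambda^2}.
\end{aligned}
```

For \lambda = 1 both weights equal 1/2, which is the unweighted mean of the standard Bland-Altman plot; as \sigma_R approaches 0, \lambda grows without bound and the full weight shifts to the reference value.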

Provided that the mean and SD of the differences between both methods are constant throughout the range of measurements and that the differences are normally distributed, the Bland-Altman plot can be quantified by the limits of agreement, defined as the mean difference ± 1.96 SD. Whether these two criteria are met can be evaluated by means of a scatter diagram of the difference against the average of the two measurements and a histogram of the differences.19 Often the differences increase proportionally to the measurement. This might be resolved by analyzing the logarithm of the measurement rather than the measurement itself. The limits of agreement indicate how much one method may be below or above the other. Which limits cause clinical difficulties will be a matter of judgment, and decisions regarding the acceptable size of limits should be made in advance of the analysis.24
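A minimal sketch of the limits of agreement follows. It assumes that the two requirements stated above (constant mean and SD of the differences, normally distributed differences) have already been checked; the function names and readings are hypothetical.

```python
from statistics import mean, stdev

def bland_altman_points(sensor, reference):
    """(mean of both methods, sensor minus reference) for each pair."""
    return [((s + r) / 2, s - r) for s, r in zip(sensor, reference)]

def limits_of_agreement(sensor, reference):
    """Lower limit, mean difference and upper limit (mean ± 1.96 SD)."""
    diffs = [s - r for s, r in zip(sensor, reference)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD of the differences
    return bias - 1.96 * sd, bias, bias + 1.96 * sd

# Hypothetical paired readings in mmol/L.
reference = [4.0, 5.5, 6.0, 7.5, 9.0, 11.0]
sensor = [4.4, 5.1, 6.5, 7.2, 9.8, 10.6]

lower, bias, upper = limits_of_agreement(sensor, reference)
print(f"bias = {bias:.2f}, limits of agreement = {lower:.2f} to {upper:.2f}")
```

If the differences grow with the glucose level, the same functions can be applied to the logarithms of the readings, in line with the transformation suggested above.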

As briefly mentioned earlier, the accuracy of several sensors, e.g., the CGMS Gold,25,26 GlucoWatch,22 Navigator,27 and to a lesser extent GlucoDay,26 becomes worse in the hypoglycemic range, and the absolute deviation has been shown to increase with increasing glucose levels. Therefore, accuracy assessment tools should be reported with separate evaluation of values in the hyper-, hypo-, and normoglycemic range.

We prefer mean and median ARD or Bland-Altman analysis over the ISO criteria. First, mean and median ARDs are the simplest method to gain insight into the sensor’s performance per glycemic range, and the Bland-Altman analysis quickly discloses the accuracy in proportion to the glucose levels displayed as continuous data, the distribution of error (which may or may not allow calculation of the limits of agreement), and the occurrence of outliers.28 Second, in contrast to the ISO criteria, neither method uses arbitrary cutoff values to create binary outcome values, and both thus offer more objective results. Another drawback of dichotomization of the data is that glucose measurements close to but on opposite sides of the cutoff value are characterized as being very different rather than very similar.19 Third, both methods consistently use a single type of difference, instead of combining absolute and relative differences as the ISO criteria do. A minor drawback of all above-mentioned methods is that they are calculated over paired data, so that exclusion of sensor data is often inevitable.

Combined curve fitting. With the method of combined curve fitting, the sensor glucose readings over time in the same patient are fitted by a curve using least squares regression based on natural splines with knots at equal intervals.23 Natural splines are very flexible, allowing almost every shape to be fitted with relatively few parameters. By assuming the sensor curve to have the same shape as the concomitant reference blood glucose curve it is possible to shift the sensor curve along the x- and y-axes until it overlies the blood glucose curve as much as possible. This enables one to calculate the horizontal and vertical shift, indicating delay and systematic deviation, respectively. A recent study calculated the horizontal shift for two sensors, indicating a 7-min delay for the GlucoDay sensor and no delay for the CGMS Gold sensor26 relative to direct blood glucose measurement. Like the earlier mentioned mean difference, the vertical shift assesses a systematic under- or overestimation, but it has two advantages compared to the mean difference. First, curve fitting does not necessitate data collection to occur simultaneously for both methods, as it does not require paired data but includes all measurements in the analysis. Second, curve fitting accounts for a possible horizontal shift. Both the mean difference and the vertical shift rest on the assumption that the glucose curves of the different measuring methods have the same shape; if they do not, the systematic bias would alter over time. With curve fitting reflecting the systematic under- or overestimation by the sensor, but at the cost of flattening out overestimated and underestimated glucose values, and mean and median ARDs reflecting the absolute deviation, both methods are complementary. The unique value of combined curve fitting lies in the possibility of in vivo assessment of delay via calculation of the horizontal shift.23,26
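The idea of shifting one curve over the other can be sketched in simplified form. Note that the sketch below is not the spline-based method described above: it replaces the natural spline fit with linear interpolation of the reference curve and finds the horizontal shift by a grid search over candidate delays. The function names, the sampling schedule and the simulated data are all hypothetical.

```python
from bisect import bisect_right

def interpolate(times, values, t):
    """Linear interpolation; returns None outside the sampled time span."""
    if t < times[0] or t > times[-1]:
        return None
    i = bisect_right(times, t)
    if i == len(times):  # t equals the last sampling time
        return values[-1]
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

def estimate_shifts(ref_times, ref_values, sensor_times, sensor_values,
                    candidate_delays):
    """Return (delay, vertical shift) that best overlays sensor on reference."""
    best = None
    for delay in candidate_delays:
        residuals = []
        for t, s in zip(sensor_times, sensor_values):
            r = interpolate(ref_times, ref_values, t - delay)
            if r is not None:
                residuals.append(s - r)
        if len(residuals) < 3:
            continue  # too little overlap to judge this delay
        vertical = sum(residuals) / len(residuals)
        spread = sum((e - vertical) ** 2 for e in residuals) / len(residuals)
        if best is None or spread < best[0]:
            best = (spread, delay, vertical)
    if best is None:
        raise ValueError("no candidate delay gave enough overlapping data")
    return best[1], best[2]

# Hypothetical reference blood glucose (mmol/L), sampled every 15 min.
ref_times = list(range(0, 181, 15))
ref_values = [5.0, 5.5, 6.5, 8.0, 9.5, 10.5, 11.0, 10.0, 8.5, 7.0, 6.0, 5.5, 5.2]

# Simulated sensor: same shape, 10 min late and 0.5 mmol/L too high,
# sampled every 5 min, so the two series are not paired in time.
sensor_times = list(range(20, 171, 5))
sensor_values = [interpolate(ref_times, ref_values, t - 10) + 0.5
                 for t in sensor_times]

delay, offset = estimate_shifts(ref_times, ref_values, sensor_times,
                                sensor_values, range(0, 21))
print(f"estimated delay = {delay} min, vertical shift = {offset:.2f} mmol/L")
```

Because the simulated sensor is built from the same piecewise-linear curve, the grid search recovers the 10-min delay and the 0.5 mmol/L offset. With real data the measurement noise in both series is the reason for fitting smooth curves first, as the spline approach does.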

Accuracy assessment with clinical focus: error grid analyses. Paired blood glucose–sensor readings can be plotted in a Clarke error grid, which is divided into five zones. A clinical consequence, varying from no consequence to potentially dangerous, opposite treatment, is attached to each zone. This analysis, developed in 1987, was the first method to relate accuracy of glucose measurement to clinical consequences.29

In 2000, a novel version of the Clarke error grid was developed, the consensus error grid.30 The consensus error grid is more up-to-date, created by a large group of endocrinologists, and does not contain risk boundaries that skip categories, all in contrast to the former Clarke error grid. It still contains five risk zones, like the Clarke error grid, but these are slightly differently defined (Fig. 1).

With Zone E in the higher glycemic reference ranges being omitted and Zones A and B covering a larger part of the grid, the consensus error grid is obviously more forgiving than the Clarke error grid. Again, these boundaries are not free of arbitrariness. With strict glycemic control being associated with a reduced risk of developing long-term complications of diabetes, missing extremely high glucose values should be considered as a sensor reading with potentially dangerous consequences. In this respect the consensus error grid seems too forgiving. In addition, we prefer the Clarke error grid since it has been much more widely used.

In 2004, a novel continuous glucose-error grid analysis (CG-EGA) was specifically designed to evaluate continuous glucose monitors. (Syntaxes for both the Clarke and continuous glucose error grids are available by e-mailing the corresponding author.) The CG-EGA takes into account the interdependency of successive data by combining point with rate accuracy. With rate defined as the difference between glucose at points 1 and 2 expressed per minute, the rate accuracy indicates how well the sensor is capable of following the direction and tempo of changes in blood glucose. The point error grid is similar to the Clarke error grid except that it allows for a possible shift of the upper and lower boundaries of Zones A, B, and D in proportion to the rate to account for the delay between sensor and blood glucose values, assumed by Clarke’s group to be a constant 7 min. For instance, if the sensor reports a glucose value of 4.2 mmol/L, but the concurrent reference glucose value is 3.8 mmol/L, this sensor reading would traditionally end up in Zone D, suggesting poor accuracy because of a missed hypoglycemia. However, if the glucose is decreasing at a rate of 0.06 mmol/L/min, then in 7 min, the sensor value of 4.2 mmol/L would become 3.8 mmol/L, corresponding with Zone A accuracy. With this in mind, the expansion of the boundaries of Zones A, B, and D is determined by multiplying the mean rate by 7 min. Major pitfalls of the continuous error grid are that (1) it is time consuming because of the required frequent blood sampling and the laborious construction of both rate and point error grid plots and a final 9 × 11 matrix, (2) differences in rate accuracy barely influence the final CG-EGA outcome, (3) it is sensitive to interobserver variability with arbitrarily chosen time intervals affecting the final CG-EGA results, and (4) the formula to shift the point error grid lines in proportion to rate differences is based on a questionable assumption of a constant delay between interstitial and blood glucose.31 A previous study calculated the mean ARD values belonging to each of the five CG-EGA zones. Zone A appeared to include mean ARDs of 9.4 ± 6.1% and 11.3 ± 7.4% for two different sensors, respectively. However, mean ARDs for the readings in the so-called clinically acceptable Zone B were 29.6 ± 7.9% and 28.6 ± 7.4% with a maximum mean ARD of 46% in that zone. In the earlier mentioned simulation study sensors were made artificially inaccurate by randomly shuffling the data pairs.10 The results clearly demonstrated that the error grid analyses are too forgiving: after shuffling the data pairs, 78%, 79%, and 80% of the (artificially inaccurate) pairs still fell in Zone A or B of the Clarke, consensus, and continuous (rate) error grids, respectively.
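The delay adjustment in the worked example above comes down to a single multiplication. The sketch below covers only that arithmetic, not the zone boundaries or the final matrix of the CG-EGA; the constant and function names are hypothetical.

```python
ASSUMED_DELAY_MIN = 7.0  # constant delay assumed by the CG-EGA, in minutes

def glucose_rate(glucose_1, glucose_2, minutes_apart):
    """Rate of change between two successive readings, in mmol/L/min."""
    return (glucose_2 - glucose_1) / minutes_apart

def boundary_shift(rate):
    """Shift applied to the zone boundaries: rate times the assumed delay."""
    return rate * ASSUMED_DELAY_MIN

# Worked example from the text: glucose falling at 0.06 mmol/L/min,
# sensor reading 4.2 mmol/L.
rate = -0.06
shift = boundary_shift(rate)
projected_sensor = 4.2 + shift
print(f"shift = {shift:.2f} mmol/L, projected value = {projected_sensor:.1f} mmol/L")
```

The shift of about 0.42 mmol/L brings the sensor value of 4.2 mmol/L to roughly 3.8 mmol/L, matching the reference value in the example. The outcome depends entirely on the assumed 7-min delay, which is the fourth pitfall listed above.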

The developers of the CG-EGA have responded to this criticism in a comment letter.32 In this letter they rightfully argued that when calculating an ARD, as was done in the head-to-head comparison study evaluating the GlucoDay and CGMS System Gold sensor,25 it is incorrect to use degrees of freedom derived from all the pooled data, since these data may originate from one person and are thus interdependent. However, the approach of using degrees of freedom per person (so, one average ARD value per person) may in turn be too rigid, because not all sensor readings are interdependent. For example, postprandial glucose sensor readings at lunch and at night depend little on each other, if at all. It is common practice to derive degrees of freedom from pooled data in the sensor field.33

FIG. 1. Clarke error grid (left) versus consensus error grid (right).

Diagnostic epidemiological accuracy measures. Sensitivity and specificity for detecting hypo- and hyperglycemia are calculated based on paired sensor–blood glucose readings. Sensitivity is defined as the percentage of all true hypo- or hyperglycemic values that the sensor reports as hypo- or hyperglycemic, and specificity as the percentage of all true normoglycemic values that the sensor reports as normoglycemic.34 With the method of receiver operating characteristics,10 one can assess the ability to detect hypo- and hyperglycemia by plotting the sensitivity, i.e., the percentage of true events correctly classified, against (1 – specificity), i.e., the percentage of nonevents incorrectly classified, while varying the threshold for sensor values indicating hypo- and hyperglycemia. This enables one to select the threshold based on balancing sensitivity against specificity.

If sensors display on-line glucose data and are equipped with an alarm function, the equation for sensitivity and specificity may also include the alarm setting. This enables calculation of sensitivity and specificity with various alarm settings, until the optimal setting is achieved, i.e., adequately alerting for hypoglycemia while the number of false alarms remains limited. However, unless sensor technology and resulting accuracy are significantly improved, desirable sensor sensitivity will keep on going hand in hand with unacceptable specificity. Bearing this in mind, together with the American Diabetes Association consensus guidelines defining hypoglycemia as glucose values ≤3.9 mmol/L,35 calculation of the sensitivity and specificity for glucose values ≤3.9 mmol/L with the hypoglycemic alarm set accordingly seems clinically more useful than searching for alarm cutoff levels with high sensitivity. These two measures evaluate an important sensor task, properly detecting and alerting for hyper- or hypoglycemia, but some describe them as irrelevant to clinicians unless they can be converted into predictive values.36
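How sensitivity and specificity for hypoglycemia change with the alarm setting can be sketched as follows. The sketch assumes that values at or below 3.9 mmol/L count as hypoglycemic and that the alarm fires when the sensor value is at or below the alarm threshold; the function names and readings are hypothetical.

```python
def hypo_detection(sensor, reference, alarm_threshold=3.9, true_threshold=3.9):
    """Sensitivity and specificity of the sensor for detecting hypoglycemia."""
    tp = fn = fp = tn = 0
    for s, r in zip(sensor, reference):
        truly_hypo = r <= true_threshold
        flagged = s <= alarm_threshold
        if truly_hypo and flagged:
            tp += 1
        elif truly_hypo:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

def roc_points(sensor, reference, alarm_thresholds):
    """(1 - specificity, sensitivity) for each alarm threshold."""
    points = []
    for threshold in alarm_thresholds:
        sens, spec = hypo_detection(sensor, reference, threshold)
        points.append((1 - spec, sens))
    return points

# Hypothetical paired readings in mmol/L.
reference = [3.0, 3.5, 3.8, 4.2, 5.0, 6.5, 8.0, 3.6]
sensor = [3.4, 4.1, 3.7, 3.8, 4.4, 6.0, 8.4, 4.5]

for threshold in (3.9, 4.5):
    sens, spec = hypo_detection(sensor, reference, threshold)
    print(f"alarm at {threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this invented data set, raising the alarm threshold from 3.9 to 4.5 mmol/L lifts the sensitivity from 0.50 to 1.00 while the specificity drops from 0.75 to 0.50, the trade-off described in the text.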

The PPV indicates how many patients with hypo- or hyperglycemia according to the sensor are truly hypo- or hyperglycemic, and the NPV how many patients without hypo- or hyperglycemia according to the sensor are truly free of it. They provide information comparable with sensitivity and specificity but depend on the prevalence of hypo- and hyperglycemias in the patients being tested. These values can be calculated as follows:

\[
\mathrm{PPV} = \frac{\text{sensitivity} \times \text{prevalence}}{\text{sensitivity} \times \text{prevalence} + (1 - \text{specificity}) \times (1 - \text{prevalence})}
\]

\[
\mathrm{NPV} = \frac{\text{specificity} \times (1 - \text{prevalence})}{(1 - \text{sensitivity}) \times \text{prevalence} + \text{specificity} \times (1 - \text{prevalence})}
\]

If the prevalence of, for instance, hypoglycemia in the study population is very low, the PPV will not be close to 1 even if both the sensitivity and specificity are high. The prevalence of hypoglycemia (hyperglycemia) can be interpreted as the probability, before the sensor measurement is carried out, that the patient has hypoglycemic (hyperglycemic) glucose values and is known as the prior probability of hypoglycemia (hyperglycemia). The PPVs and NPVs are the revised estimates of the same probability for those patients who are positive and negative for hypo- or hyperglycemia according to the sensor measurement and are known as posterior probabilities. The difference between the prior and posterior probabilities indicates the usefulness of the sensor measurement. PPVs and NPVs are useful tools because they supply information on the chance of actually having hypoglycemia if the sensor alerts for one. Yet, with the occurrence of hypo- or hyperglycemia depending on multiple factors and being patient-related, determination of the prevalence is not easy. Therefore, calculation of both sensitivity and specificity and PPVs and NPVs seems advisable.
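The dependence of the predictive values on prevalence can be checked numerically. The sketch below uses the standard Bayes-rule expressions for PPV and NPV in terms of sensitivity, specificity and prevalence; the function name and the input values are hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV from sensitivity, specificity and prevalence (all 0..1)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Same hypothetical sensor (sensitivity and specificity both 0.90)
# at two different prevalences of hypoglycemia.
for prevalence in (0.02, 0.50):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence={prevalence:.2f}: PPV={ppv:.3f}, NPV={npv:.3f}")
```

With a prevalence of 2%, the PPV is only about 0.16 even though sensitivity and specificity are both 0.90, whereas at a prevalence of 50% it equals 0.90.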

The advantage of all above-mentioned methods is that they evaluate the most important task of continuous glucose sensors, detecting hyperglycemia and hypoglycemia. In particular, the ability to detect hypoglycemia, indicated by the sensitivity, is crucial since several studies found that the relative error of all methods of continuous glucose monitoring increases to some extent at the lower end of the detection range.25,37 Of course, the added value of continuous glucose monitors lies predominantly in providing information on what would normally be missed with glucose spot measurements. If lack of accuracy hinders the sensor in this function, this will cast doubt on whether continuous glucose monitoring devices have added value. Therefore, descriptive statistics of how many hyper- and hypoglycemic episodes were missed with the sensor give good insight into the sensor’s performance.38

RECOMMENDATIONS ON ACCURACY ASSESSMENT AND HEAD-TO-HEAD COMPARISON OF CONTINUOUS GLUCOSE SENSORS

The purpose of assessing sensor accuracy is to uncover and quantify discrepancies between measurements reported by the sensor and a reference method, applied to the same subjects. Accuracy studies have been carried out with either ambulatory blood glucose meters or hospital/laboratory blood glucose meters, like the Beckman® glucose analyzer (Beckman, Fullerton, CA) or a hexokinase (HK)/glucose-6-phosphate dehydrogenase (G6PDH) method, as reference method. The advantage of the latter is a better reported reproducibility, with coefficients of variation (CVs) for the Beckman glucose analyzer and the HK/G6PDH method being less than 3% [39] and between 0.63% and 1.6% [40], respectively. In contrast, a previous study calculated the CVs for five home blood glucose meters at four different glucose levels and found that four meters had acceptable CVs ranging from 1% to 6%, but one meter had a CV above 10%, regardless of the blood glucose level [21]. Another study evaluating the performance of four home blood glucose meters found that the reproducibility deteriorated in the hypoglycemic range, with intra-assay CV values up to 8.7% in that range [40]. Another potential cause of error in SMBG values obtained with home blood glucose meters may be the technical ability of the patient, particularly in the case of hypoglycemia, which can alter psychomotor ability [21]. In turn, the advantage of SMBG values as reference [41,42] is that patients can continue their daily life activities, allowing for a glucose profile that reflects real-life circumstances, while the use of the HK/G6PDH method or Beckman glucose analyzer as reference method inevitably requires a ward study. Which reference method should be used is up to the investigators to decide, as long as SMBG values used for comparison with the concomitant sensor readings are not simultaneously entered for calibration [41,43], as this will always pull the sensor curve towards the reference value to some extent, with inflated accuracy as a consequence. Furthermore, one should take the estimated accuracy of the reference method into account when deciding to use either the reference value alone or the averaged value of the sensor and the reference value, in the case of accuracy assessment methods like the Bland-Altman analysis, as discussed previously.
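For readers less familiar with the coefficient of variation used above to describe reproducibility, a minimal sketch follows. It assumes the usual definition (sample SD of repeated measurements of the same sample divided by their mean); the function name and the replicate values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(replicates):
    """CV in percent: sample SD of repeated measurements divided by their mean."""
    return 100 * stdev(replicates) / mean(replicates)


# Hypothetical replicate readings (mg/dL) of one blood sample on one meter.
replicates = [98, 102, 100, 97, 103]
print(f"CV = {coefficient_of_variation(replicates):.1f}%")
```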

Resuming the accuracy assessment, after data have been collected it is best to first plot the glucose sensor readings against the reference values. A schematic overview of the recommendations is given in Figure 2. Drawing a line of identity (x = y) and, to a lesser extent, a regression line is essential to visualize the degree of agreement between the sensor readings and the reference values. Hereafter, leaving aside the above-mentioned discussion about whether an averaged value should be used in a sensor–reference comparison, we recommend that the data pairs be presented in a Bland-Altman plot [24] to detect trends of bias and variation against the glucose level. If there is no obvious relation between the difference and the mean, the limits of agreement can be calculated as the MD between the sensor and the reference method, which immediately reveals any systematic under- or overestimation, ±1.96 times the SD of the differences. The resulting estimated agreement may or may not be considered clinically acceptable [12,24] according to cutoff values that have been determined prior to the analysis. The next step is to calculate the mean or median ARD, depending on the distribution of the data, over all paired readings together and per glucose range. Mean and median ARDs per glucose range specify the accuracy compared to the reference value in the normo-, hyper-, and hypoglycemic range, respectively, and have the advantage that, unlike curve fitting with vertical shift and the MD, they are not sensitive to underestimations flattening out (canceling) overestimations, and vice versa. The limits of agreement are also not sensitive to this flattening out effect. However, they are more sensitive to outliers than the mean and particularly the median ARD, because they are strongly based on the SD. Thereafter, combined curve fitting can be done to gain insight into a possible lag time (horizontal shift) between the measurements of both methods. The sensitivity and PPV for detecting hypoglycemia are valuable for evaluating the sensors’ ability to detect hypoglycemia in intensively treated patients with diabetes. The influence of the sensor threshold for hypoglycemia on sensitivity and specificity can be visualized using receiver operating characteristic curves.

FIG. 2. Schematic overview of the recommendations on accuracy assessment of one sensor and head-to-head comparison of two or more continuous glucose sensors. ANOVA, analysis of variance; Df, degrees of freedom.
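The two numerical steps recommended here, limits of agreement and ARD per glucose range, can be sketched as follows. This is an illustrative outline under stated assumptions rather than a validated implementation: the range cutoffs of 70 and 180 mg/dL, the assignment of ranges by reference value, the function names, and the data are all assumptions.

```python
from statistics import mean, median, stdev


def limits_of_agreement(sensor, reference):
    """Mean difference (MD) and MD -/+ 1.96 * SD of sensor-minus-reference."""
    diffs = [s - r for s, r in zip(sensor, reference)]
    md, sd = mean(diffs), stdev(diffs)
    return md, md - 1.96 * sd, md + 1.96 * sd


def ard_by_range(sensor, reference, hypo=70, hyper=180):
    """Mean and median absolute relative difference (%) overall and per range.

    Cutoffs (mg/dL) are illustrative assumptions; each pair is assigned to a
    range according to its reference value.
    """
    groups = {"all": [], "hypo": [], "normo": [], "hyper": []}
    for s, r in zip(sensor, reference):
        ard = 100 * abs(s - r) / r
        label = "hypo" if r < hypo else "hyper" if r > hyper else "normo"
        groups["all"].append(ard)
        groups[label].append(ard)
    return {k: (mean(v), median(v)) for k, v in groups.items() if v}


# Hypothetical paired readings in mg/dL.
sensor = [60, 95, 150, 210, 66]
reference = [50, 100, 140, 200, 60]
md, lower, upper = limits_of_agreement(sensor, reference)
print(f"MD={md:.1f}  limits of agreement: {lower:.1f} to {upper:.1f}")
for name, (mean_ard, median_ard) in ard_by_range(sensor, reference).items():
    print(f"{name}: mean ARD={mean_ard:.1f}%  median ARD={median_ard:.1f}%")
```

Because the ARD uses absolute values, positive and negative deviations cannot cancel, whereas the MD in the first function can be near zero even when individual errors are large.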

Comparison of continuous glucose monitors can be classified into direct and indirect comparison. Direct comparison compares both measuring methods with each other without use of a reference method. Indirect comparison compares the differences between each method and a reference method with one another: (sensor A − reference value) versus (sensor B − reference value). In the case of a direct comparison neither method is regarded as the reference; both methods are expected to have measurement errors. Thus, with the use of averaged values, Bland-Altman and LPR analyses are best suited to direct comparison. However, as none of the currently available glucose sensors is accurate enough to be considered the reference value, not much can be learned from evaluating how closely two sensor methods correspond with each other. Direct comparison is therefore not recommended. The goal of indirect comparison is to learn which one of the two methods reports the “best” glucose readings, i.e., those closest to the reference values. So besides the glucose measurements of two sensors, concomitant reference values, determined at the laboratory, must be collected. Calculating the differences between the glucose values of both sensors and the corresponding reference values enables final comparison of the deviations between both sensors. This assessment requires careful consideration and differs somewhat from the accuracy assessment of only one sensor. As in the accuracy assessment of one sensor, data plotting followed by drawing a line of identity, Bland-Altman analysis, and combined curve fitting with calculation of the horizontal shift are valuable tools for indirect comparison of two methods against a reference method. Comparing the mean and median ARDs for sensor A with those for sensor B using a number of degrees of freedom based on the total number of pooled paired values minus 1 is only allowed if the interpatient variability within one sensor group does not exceed the between-sensor variability [26]. This can be verified using either an analysis of variance or a Kruskal-Wallis test, depending on the distribution of the data. If the interpatient variability exceeds the between-sensor variability, the data pairs of all patients in one sensor group should not be pooled. Instead, one mean or median ARD between sensor and reference values must be calculated per patient and per sensor, using a number of degrees of freedom derived from the number of included patients (n − 1). The mean or median ARDs of the two methods can then be compared via a simple t test or Mann-Whitney U test, depending on the distribution of the data. Provided that good accuracy is guaranteed for the reference method, indirect comparison allows for firmer conclusions about which one of the two methods is best and is therefore preferred to direct comparison.
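The per-patient route can be sketched as follows: one mean ARD per patient and per sensor, followed by a t statistic with n − 1 degrees of freedom. The sketch assumes a paired design in which every patient wore both sensors, which the text does not prescribe; the function names, record layout, and data are hypothetical, and the p value would still have to be looked up from the t distribution.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def per_patient_mean_ard(records):
    """records: iterable of (patient_id, sensor_value, reference_value).

    Returns {patient_id: mean ARD in percent}.
    """
    ards = defaultdict(list)
    for patient, sensor, reference in records:
        ards[patient].append(100 * abs(sensor - reference) / reference)
    return {patient: mean(values) for patient, values in ards.items()}


def paired_t(ard_a, ard_b):
    """Paired t statistic and degrees of freedom (n - 1) on per-patient ARDs."""
    patients = sorted(set(ard_a) & set(ard_b))
    diffs = [ard_a[p] - ard_b[p] for p in patients]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical data (mg/dL): three patients, two paired readings per sensor.
records_a = [(1, 110, 100), (1, 180, 200), (2, 95, 100),
             (2, 160, 150), (3, 120, 100), (3, 90, 100)]
records_b = [(1, 105, 100), (1, 190, 200), (2, 98, 100),
             (2, 153, 150), (3, 110, 100), (3, 95, 100)]
ard_a = per_patient_mean_ard(records_a)
ard_b = per_patient_mean_ard(records_b)
t, df = paired_t(ard_a, ard_b)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

Summarizing per patient first keeps the degrees of freedom tied to the number of patients rather than to the much larger number of correlated paired readings.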

Obviously, the analysis of continuous glucose data over time is challenging, and although many attempts have been made (for instance, the CG-EGA by Kovatchev’s study group, which takes time interdependency into account), there is still no all-embracing method that overcomes the obstacles in the accuracy assessment of continuous data.

With this review we hope we have not only disclosed the weaknesses of currently available methods but also provided a useful guideline for sensor accuracy assessment and comparison. The clinician, who is after all not a statistician, may benefit from the straightforward approach suggested above in deciding which monitoring device deserves his or her preference. The recommendations may also facilitate the FDA and CE approval processes, making them more transparent. Nevertheless, it must be emphasized that although continuous glucose monitoring has been shown to improve glycemic control [2] and although companies invest large sums in developing and launching new monitoring devices, the accuracy of current sensors, with reported mean ARDs ranging from 12% to 21%, is in need of substantial improvement.

REFERENCES

1. Chase HP, Beck R, Tamborlane W, Buckingham B, Mauras N, Tsalikian E, Wysocki T, Weinzimer S, Kollman C, Ruedy K, Xing D: A randomized multicenter trial comparing the GlucoWatch Biographer with standard glucose monitoring in children with type 1 diabetes. Diabetes Care 2005;28:1101–1106.

2. Deiss D, Bolinder J, Riveline JP, Battelino T, Bosi E, Tubiana-Rufi N, Kerr D, Philip M: Improved glycemic control in poorly controlled patients with type 1 diabetes using real-time continuous glucose monitoring. Diabetes Care 2006;29:2730–2732.

3. Ludvigsson J, Hanas R: Continuous subcutaneous glucose monitoring improved metabolic control in pediatric patients with type 1 diabetes: a controlled crossover study. Pediatrics 2003;111:933–938.

4. Bode BW, Gross TM, Thornton KR, Mastrototaro JJ: Continuous glucose monitoring used to adjust diabetes therapy improves glycosylated hemoglobin: a pilot study. Diabetes Res Clin Pract 1999;46:183–190.

5. Kaufman FR, Gibson LC, Halvorson M, Carpenter S, Fisher LK, Pitukcheewanont P: A pilot study of the continuous glucose monitoring system: clinical decisions and glycemic control after its use in pediatric type 1 diabetic subjects. Diabetes Care 2001;24:2030–2034.

6. Salardi S, Zucchini S, Santoni R, Ragni L, Gualandi S, Cicognani A, Cacciari E: The glucose area under the profiles obtained with continuous glucose monitoring system relationships with HbA1c in pediatric type 1 diabetic patients. Diabetes Care 2002;25:1840–1844.

7. The effect of intensive treatment of diabetes on the development and progression of long-term complications in insulin-dependent diabetes mellitus. The Diabetes Control and Complications Trial Research Group. N Engl J Med 1993;329:977–986.

8. Cameron FJ, Widdison J, Boyce D, Gebert R: A comparison between optimal and actuarial health care costs of adolescents with diabetes. J Paediatr Child Health 2004;40:56–59.

9. Klonoff DC: A review of continuous glucose monitoring technology. Diabetes Technol Ther 2005;7:770–775.

10. Kollman C, Wilson DM, Wysocki T, Tamborlane WV, Beck RW: Limitations of statistical measures of error in assessing the accuracy of continuous glucose sensors. Diabetes Technol Ther 2005;7:665–672.

11. Ludbrook J: Statistical techniques for comparing measurers and methods of measurement: a critical review. Clin Exp Pharmacol Physiol 2002;29:527–536.

12. Altman DG, Bland JM: Measurement in medicine: the analysis of method comparison studies. Statistician 1983;32:307–317.

13. Ludbrook J: Comparing methods of measurements. Clin Exp Pharmacol Physiol 1997;24:193–203.

14. Draper N, Smith H: Applied Regression Analysis. New York: Wiley, 1997.

15. Batterham A: Commentary on bias in Bland-Altman but not regression validity analyses. Sportscience 2004;8:47–49.

16. Hopkins WG: Measures of reliability in sports medicine and science. Sports Med 2000;30:1–15.

17. Hopkins W: Bias in Bland-Altman but not regression validity analyses. Sportscience 2004;8:42–46.

18. Atkinson G, Nevill AM: Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Med 1998;26:217–238.

19. Bland JM, Altman DG: Applying the right statistics: analyses of measurement studies. Ultrasound Obstet Gynecol 2003;22:85–93.

20. Bland JM, Altman DG: Measuring agreement in method comparison studies. Stat Methods Med Res 1999;8:135–160.

21. Poirier JY, Le Prieur N, Campion L, Guilhem I, Allannic H, Maugendre D: Clinical and statistical evaluation of self-monitoring blood glucose meters. Diabetes Care 1998;21:1919–1924.

22. The accuracy of the GlucoWatch G2 Biographer in children with type 1 diabetes: results of the Diabetes Research in Children Network (DirecNet) accuracy study. Diabetes Technol Ther 2003;5:791–800.

23. Harrell F: Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer, 2001.

24. Bland JM, Altman DG: Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307–310.

25. McGowan K, Thomas W, Moran A: Spurious reporting of nocturnal hypoglycemia by CGMS in patients with tightly controlled type 1 diabetes. Diabetes Care 2002;25:1499–1503.

26. Wentholt IM, Vollebregt MA, Hart AA, Hoekstra JB, DeVries JH: Comparison of a needle-type and a microdialysis continuous glucose monitor in type 1 diabetic patients. Diabetes Care 2005;28:2871–2876.

27. Feldman B, Brazg R, Schwartz S, Weinstein R: A continuous glucose sensor based on wired enzyme technology - results from a 3-day trial in patients with type 1 diabetes. Diabetes Technol Ther 2003;5:769–779.

28. Hayter PG, Sharma M, Dunka L, Stout P, Price DA, Horwitz DL, Marhoul J, Vaez-Zadeh S: Performance standards for continuous glucose monitors. Diabetes Technol Ther 2005;7:721–726.

29. Clarke WL, Cox D, Gonder-Frederick LA, Carter W, Pohl SL: Evaluating clinical accuracy of systems for self-monitoring of blood glucose. Diabetes Care 1987;10:622–628.


30. Parkes JL, Slatin SL, Pardo S, Ginsberg BH: A new consensus error grid to evaluate the clinical significance of inaccuracies in the measurement of blood glucose. Diabetes Care 2000;23:1143–1148.

31. Wentholt IM, Hoekstra JB, DeVries JH: A critical appraisal of the continuous glucose-error grid analysis. Diabetes Care 2006;29:1805–1811.

32. Clarke WL, Gonder-Frederick L, Cox D, Kovatchev B: A critical appraisal of the continuous glucose-error grid analysis: response to Wentholt et al. Diabetes Care 2007;30:449–450.

33. Wentholt IM, Hoekstra JB, DeVries JH: A critical appraisal of the continuous glucose-error grid analysis: response to Clarke et al. Diabetes Care 2007;30:450–451.

34. Altman DG, Bland JM: Diagnostic tests. 1: Sensitivity and specificity. BMJ 1994;308:1552.

35. Defining and reporting hypoglycemia in diabetes: a report from the American Diabetes Association Workgroup on Hypoglycemia. Diabetes Care 2005;28:1245–1249.

36. Rud B, Matzen P, Hilden J: [Measures for the performance of diagnostic tests]. Ugeskr Laeger 2005;167:3018–3022.

37. Boland E, Monsod T, Delucia M, Brandt CA, Fernando S, Tamborlane WV: Limitations of conventional methods of self-monitoring of blood glucose: lessons learned from 3 days of continuous glucose sensing in pediatric patients with type 1 diabetes. Diabetes Care 2001;24:1858–1862.

38. Weinstein RL, Schwartz SL, Brazg RL, Bugler JR, Peyser TA, McGarraugh GV: Accuracy of the 5-day FreeStyle Navigator Continuous Glucose Monitoring System: comparison with frequent laboratory reference measurements. Diabetes Care 2007;30:1125–1130.

39. Guerci B, Floriot M, Bohme P, Durain D, Benichou M, Jellimann S, Drouin P: Clinical performance of CGMS in type 1 diabetic patients treated by continuous subcutaneous insulin infusion using insulin analogs. Diabetes Care 2003;26:582–589.

40. Chen ET, Nichols JH, Duh SH, Hortin G: Performance evaluation of blood glucose monitoring devices. Diabetes Technol Ther 2003;5:749–768.

41. Gross TM, Bode BW, Einhorn D, Kayne DM, Reed JH, White NH, Mastrototaro JJ: Performance evaluation of the MiniMed Continuous Glucose Monitoring System during patient home use. Diabetes Technol Ther 2000;2:49–56.

42. Sachedina N, Pickup JC: Performance assessment of the Medtronic-MiniMed Continuous Glucose Monitoring System and its use for measurement of glycaemic control in Type 1 diabetic subjects. Diabet Med 2003;20:1012–1015.

43. Gross TM, Ter Veer A: Continuous glucose monitoring in previously unstudied population subgroups. Diabetes Technol Ther 2000;2(Suppl 1):S27–S34.

Address reprint requests to:
I.M.E. Wentholt, M.D.
Department of Internal Medicine
Academic Medical Center
Meibergdreef 9
Amsterdam, the Netherlands 1100 DD

E-mail: [email protected]
