• Kristian Linnet, MD, [email protected]
• Per Hyltoft Petersen, [email protected]
• Sverre Sandberg, MD, [email protected]
Statistics & graphics for the
laboratory
Linda Thienpont, [email protected]
Dietmar Stöckl, [email protected]
In cooperation with AQML: D Stöckl, L Thienpont &
Applications: Reference interval & Biological variation
Prof Dr Linda M Thienpont
University of Gent
Institute for Pharmaceutical Sciences
Laboratory for Analytical Chemistry
Harelbekestraat 72, B-9000 Gent, Belgium
e-mail: [email protected]
STT Consulting
Dietmar Stöckl, PhD
Abraham Hansstraat 11
B-9667 Horebeke, Belgium
e-mail: [email protected] + FAX: +32/5549 8671
Copyright: STT Consulting 2007
Content overview
Reference interval
Introduction
Data presentation
• Histogram
• Normal probability plot & rankit-transformation
• Graphical interpretation of rankit-plots
Partitioning
Statistical estimation
• Parametric and non-parametric
Biological variation
• Introduction
• Estimation (ANOVA application)
• Index-of-individuality
• Comparison of a result with a reference interval ("Grey-zone")
• Reference change value (RCV)
Estimation of reference intervals – Overview
REFERENCE INDIVIDUALS
comprise a
REFERENCE POPULATION
from which is selected a
REFERENCE SAMPLE GROUP
on which are determined
REFERENCE VALUES
on which is observed a
REFERENCE DISTRIBUTION
from which are calculated
REFERENCE LIMITS
that may define
REFERENCE INTERVALS
that help with the interpretation of an
OBSERVED VALUE
Flowchart
[Figure: flowchart for estimating a reference interval. Boxes: Start; Collect samples; Inspect distribution; Detect & handle outliers; Partition?; Select statistics (intuitive assessment); Gaussian? (Yes/No); Parametric method; Transform data; Gaussian? (Yes/No); Backtransform estimates; Non-parametric method; Stop.]
Inclusion criteria (NORIP, Malmø 27/4-2004)
The reference individual should
• be feeling subjectively well
• have reached the age of 18
• not be pregnant or breast-feeding
• not have been an in-patient in a hospital nor been subjectively dangerously ill during the last month
• not have had more than 2 measures of alcohol (24 g) in the last 24 hours
• not have given blood as a donor in the last five months
• not have taken prescribed drugs other than the P-pill or estrogens (female sex hormone) during the last two weeks
• not have smoked in the last hour prior to blood sampling
Preanalytical conditions (NORIP, Malmø 27/4-2004)
Reference individual
• Sitting at least 15 min before sampling
Sample collection
• Li-heparin plasma or serum, EDTA-blood for haematology
• Standard procedure
• Minimal stasis
Sample handling (plasma and serum)
• Stored in the dark
• Storage at room temperature before centrifugation: serum 0.5-1.5 h, plasma max 15 min
• Centrifugation: 10 min at min 1500 g
• Distributed to secondary tubes within 2 h
• Stored at -80 °C within 4 h
Outliers
• Gross or slight deviation
• Check records
• Check results
• Re-analyse
• ? Include
• ? Omit
Data presentation
Tools for presentation and inspection of distributions
• Histogram
• Normal probability plot and "Rankit-transformation"
Rankit-transformation
Reference population: Gauss-distribution
Hyltoft Petersen P, Hørder M. Influence of analytical quality on test results. Scand J Clin Lab Invest 1992;52 Suppl 208:65-87.
The frequency distribution is transformed to the cumulated frequency distribution and then transformed to the Rankit- or Normal Probability Plot.
Data presentation & inspection
In the Normal Probability Plot, the values are plotted on the x-axis and their normalized deviation from the mean (z-value, or Rankit) on the y-axis. In the figure, a second axis has been introduced where the probabilities corresponding to the z-values can be read. Note that this second axis is non-linear and needs to be inserted as a picture; it cannot be created with EXCEL. The tick-marks, however, can be programmed into an EXCEL chart (see: NormalRankitPlot.xls).
Use: Visual test for Normal distribution: data should fit a line.
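The transformation described above can be sketched in a few lines of Python. This is only an illustration: the plotting position (rank - 0.5)/n is an assumption (other conventions exist), and the function name `rankits` and the example values are made up.

```python
from statistics import NormalDist

def rankits(values):
    """Return (value, cumulative probability, rankit z) for each sorted value."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # standard Normal distribution: mean 0, SD 1
    points = []
    for rank, x in enumerate(ordered, start=1):
        # Assumed plotting position; keeps p strictly between 0 and 1.
        p = (rank - 0.5) / n
        points.append((x, p, std_normal.inv_cdf(p)))
    return points

example = [0.92, 0.30, 1.45, 0.74, 2.10, 1.08, 0.55]  # made-up values
for x, p, z in rankits(example):
    print(f"{x:5.2f}  p = {p:.3f}  z = {z:+.2f}")
```

Plotting the values (x) against z gives the rankit plot; Normal data should then lie close to a straight line.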
The rankit plot
Triacylglyceride example
Effect of imprecision (left Fig) and bias (right Fig) on the Normal Probability Plot
An increase in imprecision (here 1.5 x) rotates the line clockwise and changes the probability at z = 1.65 from 95% to 84%.
The introduction of a bias (here = 1) moves the line to the right and changes the probability at z = 1.65 from 95% to 74%.
NormalRankitPlot
The rankit plot
Bimodal situation: left population healthy, right population diseased
When we apply the plot in the bimodal situation, we can directly read the false negatives (FN) and the false positives (FP).
Note that the healthy are cumulated from right to left.
Under the conditions chosen (diseased at a distance of +2 SD and cutoff = 1.28 SD), FN = 24% and FP = 10%.
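The FN and FP percentages quoted above can be checked numerically. The sketch below assumes that both populations are Normal with the same SD (taken as 1), that the healthy population is centred at 0, and that values above the cutoff are called positive; the variable names are illustrative.

```python
from statistics import NormalDist

# Assumed model: healthy ~ N(0, 1), diseased ~ N(2, 1), all in SD units.
healthy = NormalDist(mu=0.0, sigma=1.0)
diseased = NormalDist(mu=2.0, sigma=1.0)
cutoff = 1.28  # in SD units of the healthy population

# Healthy above the cutoff are false positives.
false_positive = 1.0 - healthy.cdf(cutoff)
# Diseased below the cutoff are false negatives.
false_negative = diseased.cdf(cutoff)

print(f"FP = {false_positive:.0%}, FN = {false_negative:.0%}")
```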
Data inspection – Examples
Uric acid (µmol/l) – Simulation (distributions moved!)
        Female   Male
Mean    250      370
SD      40       40
n       1000     1000
Depending on the bin-size, bimodal distributions may be hidden in histograms!
Uric acid ~reality, but Normal distributed
        Female   Male
Mean    250      330
SD      55       65
n       1000     1000
Graphical techniques are too weak to uncover bimodal situations where the population means are close together!
Test for normal distribution    P
Chi-square                      0.836
Kolmogorov-Smirnov              0.249
Anderson-Darling                0.02
D'Agostino-Pearson              0.016
Statistical techniques may uncover that "something is wrong" (not Normal) with the distribution. From that, one may consider looking for subgroups!
However, different tests may have enormously different power!
[Figure: frequency distribution plot; x-axis: Analyte Conc. (100 to 500); y-axis ticks from 0 to 0.3]
Calculations with logarithms
Data transformation: Logarithms
When the data are not normally distributed, one can try a transformation. Because data in nature are often log-normally distributed, a logarithmic transformation can make them normally distributed.
Test for normality: Triglycerides (See: Datasets.xls)
n = 282; Lowest value: 0.3 mmol/L; Highest value: 3.2 mmol/L; Median: 0.92 mmol/L.
CBstat
• Anderson-Darling test (original data): P < 0.01; data not normally distributed
• Anderson-Darling test after logarithmic (natural) transformation: P = 0.13; data log-normally distributed
Normal Probability Plot (ln-transformed data)
Data are "on a line": data are ln-Normal distributed
Testing normality
Working with logarithms
Calculate the reference interval of a logarithmic distribution
Triglycerides
1. Transform the original data to ln
2. Calculate the mean of the ln(x_i) values
3. Take the anti-ln of the mean of the ln(x_i) values
This equals the geometric mean of the original population, which is close to its median.
The anti-ln of the mean of the logged values, $e^{-0.0689} = 0.933$, is equal to the geometric mean of the original distribution, where the latter is given by $(x_1 \cdot x_2 \cdots x_n)^{1/n}$.
The anti-ln of the SD is meaningless.
Calculation of 2.5 and 97.5 percentiles
Mean (ln-transformed): -0.0689
SD (ln-transformed): 0.395
2.5 percentile: -0.0689 – 1.96*0.395 = -0.843
97.5 percentile: -0.0689 + 1.96*0.395 = 0.7053
Anti-ln of 2.5 & 97.5 percentiles: 0.43 – 2.02
Number   mmol/l   ln
1        0.3      -1.204
2        0.32     -1.139
3        0.34     -1.079
4        0.38     -0.968
5        0.4      -0.916
6        0.4      -0.916
…        …        …
282      3.2      1.163
Median: 0.92
Mean, ln: -0.069
Anti-ln (e^x): 0.933 (EXCEL: EXP(x))
Geometric mean: 0.933 (EXCEL: GEOMEAN)
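The calculation with logarithms can be retraced with a short Python sketch. The summary statistics (-0.0689 and 0.395) are those given for the triglyceride data; the function `log_reference_interval`, the use of the sample SD (n - 1), and the `demo` values are assumptions for illustration only.

```python
import math
from statistics import geometric_mean, mean, stdev

Z = 1.96  # multiplier for the central 95% of a Normal distribution

def log_reference_interval(values):
    """Parametric 2.5 and 97.5 percentiles after ln-transformation."""
    logs = [math.log(x) for x in values]
    m, s = mean(logs), stdev(logs)  # sample SD (n - 1) is an assumption
    # Limits are computed on the ln scale, then back-transformed.
    return math.exp(m - Z * s), math.exp(m + Z * s)

# Retrace the arithmetic from the summary statistics of the example.
ln_mean, ln_sd = -0.0689, 0.395
lower = math.exp(ln_mean - Z * ln_sd)
upper = math.exp(ln_mean + Z * ln_sd)
print(f"geometric mean     = {math.exp(ln_mean):.3f}")
print(f"reference interval = {lower:.2f} - {upper:.2f} mmol/L")

# The anti-ln of the mean of the logs equals the geometric mean.
demo = [0.5, 1.0, 2.0]  # made-up values, not the triglyceride data
anti_ln_of_mean = math.exp(mean([math.log(x) for x in demo]))
print(math.isclose(anti_ln_of_mean, geometric_mean(demo)))
```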
Partitioning of reference intervals
Visual, on the basis of suspected differences (sex, race, age, …)
The reference interval
Frequency polygon
Rankit-plot
Example: Partitioning – Visual
Comparison of orosomucoid values: Caucasians and Indians in Leeds (Johnson et al. CCLM 2004;42:792-9).
Statistical criteria for partitioning (Lahti et al. Clin Chem 2002;48:338-52)
Difference (D) between two upper or two lower limits
• D < 0.25*s: No partitioning
• D = 0.25 – 0.75*s: Variable
• D > 0.75*s: Partitioning
• or percentage: Pb 0.9 and Pa 4.1 %
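The distance criterion above can be written as a small decision helper. This is a hedged sketch: the function name `partitioning_decision` is hypothetical, and which standard deviation s to use is not specified here, so it is simply passed in by the caller.

```python
def partitioning_decision(limit_a, limit_b, s):
    """Classify the difference D between two subgroup limits relative to s.

    limit_a, limit_b: the two upper (or the two lower) reference limits.
    s: the standard deviation the criterion refers to (supplied by the caller).
    """
    ratio = abs(limit_a - limit_b) / s
    if ratio < 0.25:
        return "no partitioning"
    if ratio > 0.75:
        return "partitioning"
    return "variable"  # between 0.25*s and 0.75*s: judgement needed

# Made-up limits, for illustration only.
print(partitioning_decision(2.0, 2.1, s=1.0))
print(partitioning_decision(2.0, 2.5, s=1.0))
print(partitioning_decision(2.0, 3.0, s=1.0))
```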
Statistical model for estimating a reference interval
The statistical procedures assume random sampling in the target population.
Traditionally: the 2.5- and 97.5-percentiles are estimated, so that on average 95% of the population is included.
In some contexts, one-sided limits are used: the 95-, 97.5- or 99-percentile.
Statistical estimation procedures
Parametric
• Assumes normal distribution or a distribution that can be transformed to the normal distribution
Nonparametric
• Model-free estimation of percentiles
Partitioning
• Subdivision according to gender, age, race, etc. should be considered where relevant
Reference interval & type of distribution
Normal distributions can be expected for analytes with a relatively narrow biological distribution, e.g. electrolytes.
The reference interval for Normal distributions ranges from the 2.5th to the 97.5th percentile (= mean +/- 1.96 SD).
[Figure: distribution with the central 95% marked as the reference interval, bounded by the lower reference limit and the upper reference limit]
Skewed distributions
• Biological variation is very often skewed to the right, i.e. there is a tailing with high values.
• The theoretical background is that many factors have a multiplicative impact (an additive impact of many independent factors yields a normal distribution).
Skewed distributions can often be modeled by the log-normal distribution.
The log-normal type of distribution actually consists of a family of distributions with a spectrum of degrees of skewness determined by the parameter values (ratio between standard deviation and mean).
Coefficient of skewness: $C_{skew} = \left[\sum (x_i - x_m)^3 / N\right] / SD^3$
Zero: symmetric distribution; Positive: skewed to the right; Negative: skewed to the left
Nonparametric procedure
Applicable to all types of distributions
Simple procedures
• Based on ordering (ranking) of values according to size
Refined procedures
• Weighted percentile estimation, smoothing techniques, resampling principle (bootstrap).
Coefficient of kurtosis: $C_{kurt} = \left[\sum (x_i - x_m)^4 / N\right] / SD^4 - 3$
Zero: Normal distribution; Positive: Peaked distribution; Negative: Flat distribution
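The coefficients of skewness and kurtosis can be computed directly from their definitions. In the sketch below, x_m is taken as the arithmetic mean and SD as the population SD (divisor N); the slides do not say which SD is meant, so that choice is an assumption, as are the function names.

```python
from statistics import fmean, pstdev

def skewness(values):
    """Third central moment divided by SD^3 (SD with divisor N, an assumption)."""
    m, sd, n = fmean(values), pstdev(values), len(values)
    return sum((x - m) ** 3 for x in values) / n / sd ** 3

def excess_kurtosis(values):
    """Fourth central moment divided by SD^4, minus 3 (zero for a Normal distribution)."""
    m, sd, n = fmean(values), pstdev(values), len(values)
    return sum((x - m) ** 4 for x in values) / n / sd ** 4 - 3

symmetric = [1, 2, 3, 4, 5]      # made-up values
right_skewed = [1, 1, 1, 2, 10]  # made-up values with a high tail
print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```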
Simple nonparametric procedure(s)
Approach
• Sort N reference values in increasing numerical order
• Assign rank numbers; lowest = 1; highest = N
• Rank number of 2.5-Percentile = 0.025 x (N+1) or 0.025 x (N) + 0.5
• Rank number of 97.5-Percentile = 0.975 x (N+1) or 0.975 x (N) + 0.5
• Lower reference limit = reference value corresponding to rank number of 2.5-Percentile
• Upper reference limit = reference value corresponding to rank number of 97.5-Percentile
Remark – Estimation of 2.5 & 97.5 percentiles
Procedure recommended by the IFCC and CLSI:
• 2.5-Percentile = Value of number: (0.025) x (N+1)
• 97.5-Percentile = Value of number: (0.975) x (N+1)
Optimal procedure (slightly different from above):
• 2.5-Percentile = Value of number: (0.025) x (N) + 0.5
• 97.5-Percentile = Value of number: (0.975) x (N) + 0.5
(Linnet K. Clin Chem 2000;46:867-9)
Triglycerides: n = 282
0.025 x (282 + 1) = 7.1; rank 7: 0.42 mmol/L
0.975 x (282 + 1) = 275.9; rank 276: 2.12 mmol/L
Reference interval = 0.42 – 2.12 mmol/L
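A minimal Python sketch of the simple nonparametric procedure follows. It supports both rank formulas; for a non-integer rank it interpolates linearly between the two neighbouring values, which is an assumption (the triglyceride example simply rounds to the nearest rank). The function names and the `rule` labels are illustrative.

```python
def percentile_rank(p, n, rule="ifcc"):
    """Rank number of the p-percentile among n sorted values."""
    if rule == "ifcc":
        return p * (n + 1)   # p x (N + 1)
    if rule == "linnet":
        return p * n + 0.5   # p x N + 0.5
    raise ValueError(f"unknown rule: {rule}")

def value_at_rank(sorted_values, rank):
    """Value at a possibly fractional 1-based rank (linear interpolation, an assumption)."""
    n = len(sorted_values)
    rank = min(max(rank, 1.0), float(n))  # keep the rank inside 1..N
    lower = int(rank)
    if lower >= n:
        return sorted_values[-1]
    frac = rank - lower
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + frac * (above - below)

def nonparametric_reference_interval(values, rule="ifcc"):
    """Lower (2.5%) and upper (97.5%) reference limits from ranked values."""
    ordered = sorted(values)
    n = len(ordered)
    return (value_at_rank(ordered, percentile_rank(0.025, n, rule)),
            value_at_rank(ordered, percentile_rank(0.975, n, rule)))

# Rank numbers for n = 282, as in the triglyceride example.
print(percentile_rank(0.025, 282), percentile_rank(0.975, 282))
```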
Sample size and precision of estimates
Precision of percentiles of Normal distribution
Can be expressed as the ratio between the width of the 90%-confidence interval (90%-CI) of a percentile and the width of the 95%-reference interval (e.g. ratios 0.3, 0.2 or 0.1 as outlined below).
The necessary sample sizes are indicated:
Ratio (90% CI/95% RI)   Parametric N
0.3                     23
0.2                     50
0.1                     205
Precision of percentiles of Normal distribution
Comparison between parametric and non-parametric procedures.
Ratio (90% CI/95% RI)   Parametric N   Non-parametric N
0.3                     23             56
0.2                     50             125
0.1                     205            500
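For the parametric case, the order of magnitude of these sample sizes can be retraced with a common large-sample approximation for the standard error of a Normal percentile, SE ≈ SD·sqrt((1 + z²/2)/N). This formula is not stated in the slides and is used here as an assumption; the function names are hypothetical. Under that assumption the sketch gives roughly 23, 51 and 206, close to the tabulated parametric N.

```python
from statistics import NormalDist

def parametric_ratio(n, ci_level=0.90, ri_level=0.95):
    """Approximate ratio of the CI width of a limit to the reference interval width."""
    std_normal = NormalDist()
    z_ci = std_normal.inv_cdf(0.5 + ci_level / 2)  # about 1.645 for a 90% CI
    z_ri = std_normal.inv_cdf(0.5 + ri_level / 2)  # about 1.96 for a 95% interval
    # Assumed large-sample SE of (mean + z*SD), expressed in units of SD.
    se_in_sd_units = ((1 + z_ri ** 2 / 2) / n) ** 0.5
    return (2 * z_ci * se_in_sd_units) / (2 * z_ri)

def parametric_sample_size(ratio):
    """Sample size at which the approximate ratio equals the requested value."""
    return (parametric_ratio(1) / ratio) ** 2

for target in (0.3, 0.2, 0.1):
    print(target, round(parametric_sample_size(target)))
```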
Sample size and precision of estimates
Coefficient of skewness: 0.75
Ratio (90% CI/95% RI)   Parametric N   Non-parametric N
0.3                     90             140
0.2                     200            315
0.1                     800            1250
Coefficient of skewness: 1.5
Ratio (90% CI/95% RI)   Parametric N   Non-parametric N
0.3                     200            315
0.2                     440            695
0.1                     1750           2740
Bootstrap principle
Repeated random re-sampling with replacement of observations.
• For a set of N observations: Each observation has the probability of 1/N of being re-sampled.
• A re-sampled set of N observations (a so-called pseudo-set of observations) may contain several copies of one observation and lack others.
Origin of the name
The bootstrap term refers to the phrase "to pull oneself up by one's bootstraps", originating from the tale The Adventures of Baron Munchausen (by Rudolph Erich Raspe (1737-94)) in which "The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps".
Calculation of estimates
• For each pseudo-set of N observations, the percentiles are computed by the simple nonparametric procedure.
• By repetition on a computer, e.g. 100 or more times, a distribution of estimated percentiles is obtained that mimics the real sampling variation.
• The bootstrap estimates are the means of the pseudo-estimates.
• The bootstrap procedure is slightly (5-15%) more efficient than simple nonparametric estimation.
• Standard errors of estimates are provided.
Limitations
• Too low coverage* at small sample sizes (N < 40)
• Modified versions with smoothing might improve coverage at small sample sizes
• Some bias problems with the bootstrap estimates at low sample sizes (N < 40)
*Coverage: Expected percentage of times an estimated CI includes the true value, i.e. ideally 90% for a supposed 90%-CI
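The bootstrap principle can be sketched as follows. This is an illustration under stated assumptions, not the CBstat implementation: percentiles of each pseudo-set use the rank p x N + 0.5 with linear interpolation, the data are simulated log-normal values, and the function names, seeds and number of re-samples are arbitrary choices.

```python
import random
from statistics import fmean, stdev

def percentile(sorted_values, p):
    """Simple nonparametric percentile: rank p*N + 0.5, linear interpolation (assumption)."""
    n = len(sorted_values)
    rank = min(max(p * n + 0.5, 1.0), float(n))
    lower = int(rank)
    if lower >= n:
        return sorted_values[-1]
    frac = rank - lower
    return sorted_values[lower - 1] + frac * (sorted_values[lower] - sorted_values[lower - 1])

def bootstrap_percentile(values, p, n_resamples=500, seed=1):
    """Bootstrap estimate (mean of pseudo-estimates) and its standard error."""
    rng = random.Random(seed)
    n = len(values)
    pseudo_estimates = []
    for _ in range(n_resamples):
        # Re-sample N observations with replacement: one pseudo-set.
        pseudo_set = sorted(rng.choices(values, k=n))
        pseudo_estimates.append(percentile(pseudo_set, p))
    return fmean(pseudo_estimates), stdev(pseudo_estimates)

# Simulated right-skewed data, for illustration only.
data_rng = random.Random(42)
data = [data_rng.lognormvariate(0.0, 0.48) for _ in range(200)]
estimate, standard_error = bootstrap_percentile(data, 0.975)
print(f"97.5 percentile: {estimate:.2f} (SE {standard_error:.2f})")
```

The spread of the pseudo-estimates is what provides the standard error of the percentile estimate.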
Comparison of statistical procedures
Can be studied theoretically and/or by simulation on the basis of specified model distributions, e.g. normal and log-normal types. In simulation, the procedure is repeated a large number of times in order to study bias (= difference between average of percentile estimates and true value) and precision (standard error: SE) of the estimation procedure (small SE: efficient procedure). SEs should reflect the real uncertainty so that estimated confidence intervals become correct.
Tool: Root mean squared error (RMSE)
RMSE = $\left[\sum (x_{obs} - x_{True})^2 / N_{run}\right]^{0.5} = \left[Bias^2 + SE^2\right]^{0.5}$
($N_{run}$: no. of simulation runs)
A combined error measure taking both systematic deviation and random error into account.
Often used in statistics as an overall error measure allowing ranking of various statistical estimation procedures studied theoretically or by simulations.
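The relation between RMSE, bias and SE can be checked numerically. In the sketch below, the "estimates" are simulated numbers standing in for percentile estimates from repeated simulation runs; the SE is taken as the SD of the estimates with divisor N_run, for which the decomposition holds exactly. All names and values are illustrative.

```python
import random
from statistics import fmean, pstdev

def rmse_components(estimates, true_value):
    """Return (RMSE, bias, SE) for a list of estimates of a known true value."""
    n_run = len(estimates)
    rmse = (sum((x - true_value) ** 2 for x in estimates) / n_run) ** 0.5
    bias = fmean(estimates) - true_value  # systematic deviation
    se = pstdev(estimates)                # random error (divisor N_run)
    return rmse, bias, se

# Made-up simulation: estimates scatter around 2.1 while the true value is 2.0.
rng = random.Random(3)
estimates = [rng.gauss(2.1, 0.2) for _ in range(1000)]
rmse, bias, se = rmse_components(estimates, true_value=2.0)
print(f"RMSE = {rmse:.3f}, sqrt(bias^2 + SE^2) = {(bias**2 + se**2) ** 0.5:.3f}")
```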
Model example
Using a theoretical model distribution, e.g. a chi-square distribution, the true percentile values are known.
By simulation, the performance of parametric and nonparametric procedures can be compared and the RMSE of the percentile estimates can be related to the sample size.
Outcome
The higher the sample size, the higher is the likelihood that the nonparametric procedure is the optimal approach (lowest RMSE at given sample size).
The relationship rests on the general fact that a bias associated with parametric estimation is independent of sample size and will tend to dominate the RMSE at high sample sizes, where the random error vanishes.
Statistical procedures – Summary
Ranking of procedures according to efficiency
1. Parametric procedure
2. Bootstrap non-parametric
3. Simple non-parametric
Non-parametric vs parametric
About half as effective, i.e. about twice the sample size is required to attain the same SE of the percentiles.
The difference in effectiveness is larger the more extreme the percentiles are (e.g. 99 vs 97.5 percentile).
Simple non-parametric procedure
p x N + 0.5 is slightly better than p x (N + 1) for both normal and skewed distributions.
Bootstrap non-parametric vs simple non-parametric
Slightly more efficient (5-15% savings of sample size).
Confidence intervals can be estimated for smaller sample sizes (for simple non-parametric, N ≥ 120 is needed for a 90%-CI).
Example
Example: Triglycerides with CBstat
Procedure                        CI Lower limit   CI Upper limit
Parametric direct                0.08 – 0.23      1.79 – 1.94
Non-parametric                   0.34 – 0.52      1.92 – 2.60
Non-parametric bootstrap         0.37 – 0.52      1.88 – 2.33
Parametric after log-transform   0.40 – 0.46      1.90 – 2.16
Note: Direct parametric is not correct!
Simulation of triacylglyceride data
We simulate data that correspond to the triacylglyceride data: skew ~1.64.
• We do that with Worksheet LnNormal 3 (mean = 0; SD = 0.48; n = 1000).
• Copy the data into the file RefInt.xls.
• Adapt the digits to 2 after the point (precision as displayed).
• Sample 20 values from these data (Tools>Data Analysis>Sampling).
• Compare the 90% confidence intervals for n = 20 with the respective ones for n = 1000.
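The same exercise can be sketched outside EXCEL. The Python below is an assumption-laden stand-in for the worksheets: it draws 1000 log-normal values (ln-mean 0, ln-SD 0.48), rounds them to 2 decimals, samples 20 of them without replacement, and computes parametric limits after ln-transformation. The 90% confidence intervals use a large-sample approximation for the standard error of a percentile, SE ≈ SD·sqrt((1 + z²/2)/n), which is not taken from the slides; seeds and names are arbitrary.

```python
import math
import random
from statistics import NormalDist, mean, stdev

Z975 = NormalDist().inv_cdf(0.975)  # about 1.96
Z95 = NormalDist().inv_cdf(0.95)    # about 1.645, for a 90% CI

def log_parametric_limits(values):
    """Reference limits after ln-transformation as (CI low, estimate, CI high)."""
    logs = [math.log(x) for x in values]
    n, m, s = len(logs), mean(logs), stdev(logs)
    se = s * math.sqrt((1 + Z975 ** 2 / 2) / n)  # assumed large-sample SE on the ln scale
    limits = {}
    for name, sign in (("lower", -1), ("upper", 1)):
        centre = m + sign * Z975 * s
        # Back-transform the estimate and its CI bounds to the original scale.
        limits[name] = tuple(math.exp(centre + k * Z95 * se) for k in (-1, 0, 1))
    return limits

rng = random.Random(7)
population = [round(rng.lognormvariate(0.0, 0.48), 2) for _ in range(1000)]
small_sample = rng.sample(population, 20)

for label, values in (("n = 1000", population), ("n = 20", small_sample)):
    limits = log_parametric_limits(values)
    print(label, {k: tuple(round(v, 2) for v in ci) for k, ci in limits.items()})
```

With only 20 values, the confidence intervals around both limits become much wider than with 1000 values.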
DataGeneration
Software & references
CBstat
A Windows program distributed by K. Linnet (via aaccdirect.org).
Offers general statistical methods and procedures dedicated to clinical biochemistry
Estimation of reference intervals:
• Simple nonparametric and bootstrap procedure
• Parametric direct
• Parametric after transformations
  – One-stage: log-, 3-parameter-log-, Box & Cox- and Manly-
  – Two-stage: 1) Correction of skewness; 2) Correction of kurtosis
• Normality testing with appropriate corrections after transformations
• Appropriate confidence intervals of percentiles after transformation
Further information:
• www.cbstat.com
References
Linnet K. Nonparametric estimation of reference intervals by simple and bootstrap-based procedures. Clin Chem 2000;46:867-9.
Linnet K. Two-stage transformation systems for normalization of reference distributions evaluated. Clin Chem 1987;33:381-6.
IFCC. J Clin Chem Clin Biochem 1987;25:645-56.
Linnet K. Testing normality of transformed distributions. Appl Statist 1988;37:180-6.