
Statistics & graphics for the laboratory

Applications: Reference interval & Biological variation

Linda Thienpont, [email protected]

Dietmar Stöckl, [email protected]

In cooperation with AQML: D Stöckl, L Thienpont &

• Kristian Linnet, MD, [email protected]

• Per Hyltoft Petersen, [email protected]

• Sverre Sandberg, MD, [email protected]

Prof Dr Linda M Thienpont
University of Gent
Institute for Pharmaceutical Sciences
Laboratory for Analytical Chemistry
Harelbekestraat 72, B-9000 Gent, Belgium
e-mail: [email protected]

STT Consulting
Dietmar Stöckl, PhD
Abraham Hansstraat 11
B-9667 Horebeke, Belgium
e-mail: [email protected] + FAX: +32/5549 8671

Copyright: STT Consulting 2007



Content overview

Reference interval

Introduction

Data presentation
• Histogram
• Normal probability plot & rankit-transformation
• Graphical interpretation of rankit-plots

Partitioning

Statistical estimation
• Parametric and non-parametric

Biological variation
• Introduction
• Estimation (ANOVA application)
• Index-of-individuality
• Comparison of a result with a reference interval ("Grey-zone")
• Reference change value (RCV)

Content


Estimation of reference intervals – Overview

REFERENCE INDIVIDUALS

comprise a

REFERENCE POPULATION

from which is selected a

REFERENCE SAMPLE GROUP

on which are determined

REFERENCE VALUES

on which is observed a

REFERENCE DISTRIBUTION

from which are calculated

REFERENCE LIMITS

that may define

REFERENCE INTERVALS

that help with the interpretation of an

OBSERVED VALUE

Flowchart

[Figure: flowchart for estimating a reference interval. Boxes: Start; Collect samples; Inspect distribution; Detect & handle outliers; Partition?; Select statistics; Intuitive assessment; Non-parametric method; Gaussian? (Yes/No); Parametric method; Transform data; Backtransform estimates; Stop.]


Inclusion criteria (NORIP, Malmø 27/4-2004)

The reference individual should
• be feeling subjectively well
• have reached the age of 18
• not be pregnant or breast-feeding
• not have been an in-patient in a hospital or subjectively dangerously ill during the last month
• not have had more than 2 measures of alcohol (24 g) in the last 24 hours
• not have given blood as a donor in the last five months
• not have taken prescribed drugs other than the P-pill or estrogens (female sex hormone) during the last two weeks
• not have smoked in the last hour prior to blood sampling

Preanalytical conditions (NORIP, Malmø 27/4-2004)

Reference individual
• Sitting at least 15 min before sampling

Sample collection
• Li-heparin plasma or serum, EDTA-blood for haematology
• Standard procedure
• Minimal stasis

Sample handling (plasma and serum)
• Stored in the dark
• Storage at room temperature before centrifugation: serum 0.5-1.5 h, plasma max 15 min
• Centrifugation: 10 min at min 1500 g
• Distributed to secondary tubes within 2 h
• Stored at -80 °C within 4 h

Outliers
• Gross or slight deviation
• Check records
• Check results
• Re-analyse
• ? Include
• ? Omit

Introduction


Data presentation

Tools for presentation and inspection of distributions
• Histogram
• Normal probability plot and "Rankit-transformation"

Rankit-transformation

Reference population: Gauss-distribution
Hyltoft Petersen P, Hørder M. Influence of analytical quality on test results. Scand J Clin Lab Invest 1992;52 Suppl 208:65-87.

The frequency distribution is transformed to the cumulated frequency distribution and then to the Rankit- or Normal Probability Plot.

Data presentation & inspection

In the Normal Probability Plot, the values are plotted on the x-axis and their normalized deviation from the mean (z-value, or Rankit) on the y-axis. In the figure, a second axis has been introduced on which the probabilities corresponding to the z-values can be read. Note that this second axis is non-linear: it cannot be created with EXCEL and has to be inserted as a picture. The tick-marks, however, can be programmed into an EXCEL chart (see: NormalRankitPlot.xls).

Use: Visual test for Normal distribution: the data should fit a straight line.
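The rankit-transformation described above can be sketched in a few lines of Python. This is only an illustration, not the NormalRankitPlot.xls implementation: the plotting position (rank - 0.5)/n and the seven data values are assumptions made for the example.

```python
from statistics import NormalDist


def rankits(values):
    """Return (sorted value, z-value) pairs for a normal probability plot."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, SD 1
    # Assumed plotting position: cumulated frequency = (rank - 0.5) / n
    return [
        (x, std_normal.inv_cdf((rank - 0.5) / n))
        for rank, x in enumerate(ordered, start=1)
    ]


# Hypothetical values, for illustration only
points = rankits([4.1, 3.8, 4.4, 4.0, 3.9, 4.2, 4.3])
for x, z in points:
    print(f"{x:5.2f}  z = {z:+.2f}")
```

Plotting x against z gives the rankit plot; for Normal data the points should lie close to a straight line.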


The rankit plot

Triacylglyceride example

Effect of imprecision (left Fig) and bias (right Fig) on the Normal Probability Plot

An increase in imprecision (here 1.5 x) rotates the line clockwise and changes the probability at z = 1.65 from 95% to 84%.

The introduction of a bias (here = 1) moves the line to the right and changes the probability at z = 1.65 from 95% to 74%.


NormalRankitPlot


The rankit plot

Bimodal situation: left population healthy, right population diseased


When we apply the plot in the bimodal situation, we can directly read the false negatives (FN) and the false positives (FP).
Note that the healthy are cumulated from right to left.

Under the conditions chosen (diseased at a distance of +2 SD and cutoff = 1.28 SD), FN = 24% and FP = 10%.
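The FN and FP fractions quoted above can be checked numerically. The sketch below assumes that both populations are Gaussian with the same SD (taken as 1), which the slide does not state explicitly.

```python
from statistics import NormalDist

# Assumption: healthy and diseased populations are both Gaussian with SD = 1
healthy = NormalDist(mu=0.0, sigma=1.0)
diseased = NormalDist(mu=2.0, sigma=1.0)  # mean at +2 SD of the healthy
cutoff = 1.28                             # in SD units of the healthy

false_positive = 1.0 - healthy.cdf(cutoff)  # healthy above the cutoff
false_negative = diseased.cdf(cutoff)       # diseased below the cutoff

print(f"FP = {false_positive:.1%}, FN = {false_negative:.1%}")
```

Under these assumptions the result is close to the 10% and 24% read from the plot.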


Data inspection – Examples

Uric acid (µmol/l) – Simulation (distributions moved!)

        Female   Male
Mean    250      370
SD      40       40
n       1000     1000

Depending on the bin-size, bimodal distributions may be hidden in histograms!

Uric acid ~reality, but Normal distributed

        Female   Male
Mean    250      330
SD      55       65
n       1000     1000

Graphical techniques are too weak to uncover bimodal situations where the population means are close together!

Test for normal distribution   P
Chi-square                     0.836
Kolmogorov-Smirnov             0.249
Anderson-Darling               0.02
D'Agostino-Pearson             0.016

Statistical techniques may uncover that "something is wrong" (not Normal) with the distribution. From that, one may consider looking for subgroups!
However, different tests may have enormously different power!


[Figure: distribution plot; x-axis: Analyte Conc. (100 to 500); y-axis ticks from 0 to 0.3]


Calculations with logarithms

Data transformation: Logarithms

When the data are not normally distributed, one can try a transformation. Because data in nature are often log-normally distributed, a logarithmic transformation can make them normally distributed.

Test for normality: Triglycerides (See: Datasets.xls)
n = 282; Lowest value: 0.3 mmol/L; Highest value: 3.2 mmol/L; Median: 0.92 mmol/L.

CBstat, Anderson-Darling test

                                             P        Conclusion
Original data                                < 0.01   data not normally distributed
After logarithmic (natural) transformation   0.13     data log-normally distributed

Normal Probability Plot (ln-transformed data)
Data are "on a line": data are ln-Normal distributed.

Testing normality


Working with logarithms

Calculate the reference interval of a logarithmic distribution

Triglycerides

1. Transform the original data to ln
2. Calculate the mean of the ln(xi) values
3. Take the anti-ln of the mean of ln(xi)

This equals the geometric mean of the original population, which is close to its median.

The anti-ln of the mean of the logged values, $e^{-0.0689}$, is equal to the geometric mean of the original distribution, which is given by $(x_1 \cdot x_2 \cdots x_n)^{1/n}$.

The anti-ln of the SD is meaningless.

Calculation of 2.5 and 97.5 percentile

Mean (ln transformed)        -0.0689
SD (ln transformed)          0.395
2.5 percentile (ln)          -0.0689 – 1.96*0.395 = -0.843
97.5 percentile (ln)         -0.0689 + 1.96*0.395 = 0.7053
Anti-ln of 2.5 & 97.5 perc   0.43 – 2.02
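The arithmetic of this calculation can be reproduced as a short Python sketch. The mean and SD of the ln-transformed values are taken from the slide; the variable names are chosen for this example only.

```python
import math

mean_ln = -0.0689  # mean of the ln-transformed values (from the slide)
sd_ln = 0.395      # SD of the ln-transformed values (from the slide)

# Limits on the ln scale: mean -/+ 1.96 SD
lower_ln = mean_ln - 1.96 * sd_ln
upper_ln = mean_ln + 1.96 * sd_ln

# Back-transform (anti-ln) to the original scale in mmol/L
lower = math.exp(lower_ln)
upper = math.exp(upper_ln)
geometric_mean = math.exp(mean_ln)

print(f"ln limits: {lower_ln:.3f} to {upper_ln:.3f}")
print(f"reference interval: {lower:.2f} to {upper:.2f} mmol/L")
print(f"geometric mean: {geometric_mean:.3f} mmol/L")
```

Note that the back-transformed interval is not symmetric around the geometric mean, as expected for a right-skewed distribution.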

Calculations with logarithms

Number   mmol/l   ln
1        0.3      -1.204
2        0.32     -1.139
3        0.34     -1.079
4        0.38     -0.968
5        0.4      -0.916
6        0.4      -0.916
…        …        …
282      3.2      1.163

Median: 0.92

Mean, ln: -0.069
Anti-ln (e^x) of the mean: 0.933 (EXCEL: EXP(x))
Geometric mean: 0.933 (EXCEL: GEOMEAN)
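That the anti-ln of the mean of the logs equals the geometric mean can be verified on any set of positive values. The sketch below uses only the first six values of the table (the full data set of 282 values is not reproduced here), so its result differs from the 0.933 quoted for the complete data.

```python
import math
from statistics import fmean, geometric_mean

# First six triglyceride values from the table (mmol/L); illustration only
values = [0.30, 0.32, 0.34, 0.38, 0.40, 0.40]

via_logs = math.exp(fmean([math.log(x) for x in values]))  # anti-ln of mean ln
via_product = math.prod(values) ** (1.0 / len(values))     # (x1*x2*...*xn)^(1/n)
direct = geometric_mean(values)                            # like EXCEL GEOMEAN

print(f"{via_logs:.4f} {via_product:.4f} {direct:.4f}")
```

All three routes give the same number, apart from floating-point rounding.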


Partitioning of reference intervals

Visual, on the basis of suspected differences (sex, race, age, …)

The reference interval

Frequency polygon

Rankit-plot


Example: Partitioning – Visual

Comparison of orosomucoid values: Caucasians and Indians in Leeds (Johnson et al. CCLM 2004;42:792-9).

Statistical criteria for partitioning (Lahti et al. Clin Chem 2002;48:338-52)

Difference (D) between two upper or two lower limits
• D < 0.25*s: No partitioning
• D = 0.25 – 0.75*s: Variable
• D > 0.75*s: Partitioning
• or percentage: Pb 0.9 and Pa 4.1 %
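The three distance thresholds above can be written as a small decision helper. This is a sketch only: the function name and the example limits are hypothetical, and s is simply taken as the standard deviation used in the criteria (its exact definition is not given on the slide). The percentage-based criterion is not covered.

```python
def partitioning_decision(limit_a, limit_b, s):
    """Classify the distance between two subgroup limits relative to s."""
    d = abs(limit_a - limit_b) / s
    if d < 0.25:
        return "no partitioning"
    if d <= 0.75:
        return "variable"
    return "partitioning"


# Hypothetical upper limits of two subgroups and a hypothetical s
print(partitioning_decision(100.0, 102.0, 10.0))  # D = 0.2 s
print(partitioning_decision(100.0, 105.0, 10.0))  # D = 0.5 s
print(partitioning_decision(100.0, 110.0, 10.0))  # D = 1.0 s
```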



Statistical model for estimating a reference interval

The statistical procedures assume random sampling in the target population.
Traditionally, the 2.5- and 97.5-percentiles are estimated, so that on average 95% of the population is included.
In some contexts, one-sided limits are used: 95-, 97.5- or 99-percentiles.

Statistical estimation procedures

Parametric
• Assumes a normal distribution or a distribution that can be transformed to the normal distribution

Nonparametric
• Model-free estimation of percentiles

Partitioning
• Subdivision according to gender, age, race, etc. should be considered where relevant

Reference interval & type of distribution

Normal distributions can be expected for analytes with relatively narrow biological distribution, e.g. electrolytes.

The reference interval for Normal distributions ranges from the 2.5th to the 97.5th percentile (= mean +/- 1.96 SD).
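A minimal sketch of the parametric estimate for Normal data follows. The eight values are hypothetical and far too few for a real reference interval; they are chosen only so that the arithmetic is easy to follow (mean 140, SD 2).

```python
from statistics import fmean, stdev


def parametric_reference_interval(values):
    """Return (lower, upper) = mean -/+ 1.96 SD, assuming Normal data."""
    mean = fmean(values)
    sd = stdev(values)  # sample SD (divides by n - 1)
    return mean - 1.96 * sd, mean + 1.96 * sd


# Hypothetical electrolyte-like values, for illustration only
values = [138, 140, 142, 139, 141, 140, 137, 143]
lower, upper = parametric_reference_interval(values)
print(f"reference interval: {lower:.2f} to {upper:.2f}")
```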


[Figure: Normal distribution with the central 95% marked as the reference interval, between the lower reference limit and the upper reference limit]


Skewed distributions

• Biological variation is very often skewed to the right, i.e. there is a tail of high values.

• The theoretical background is that many factors have a multiplicative impact (an additive impact of many independent factors yields a normal distribution).

Skewed distributions can often be modeled by the log-normal distribution.
The log-normal type of distribution is actually a family of distributions with a spectrum of degrees of skewness determined by the parameter values (ratio between standard deviation and mean).

Coefficient of skewness: $C_{skew} = \left[\sum (x_i - x_m)^3 / N\right] / SD^3$

Zero: symmetric distribution; Positive: skewed to the right; Negative: skewed to the left

Nonparametric procedure
Applicable to all types of distributions

Simple procedures
• Based on ordering (ranking) of values according to size

Refined procedures
• Weighted percentile estimation, smoothing techniques, resampling principle (bootstrap).


Coefficient of kurtosis: $C_{kurt} = \left[\sum (x_i - x_m)^4 / N\right] / SD^4 - 3$

Zero: Normal distribution; Positive: Peaked distribution; Negative: Flat distribution
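The two coefficients can be computed directly from their definitions. In this sketch the SD is taken as the population SD (divisor N), which is an assumption; the slides do not say which SD is meant, and statistical packages differ in their small-sample corrections.

```python
from statistics import fmean, pstdev


def skewness(values):
    """Coefficient of skewness: [sum((x - mean)^3) / N] / SD^3."""
    mean = fmean(values)
    sd = pstdev(values)  # assumption: population SD (divides by N)
    return fmean([(x - mean) ** 3 for x in values]) / sd ** 3


def kurtosis(values):
    """Coefficient of kurtosis: [sum((x - mean)^4) / N] / SD^4 - 3."""
    mean = fmean(values)
    sd = pstdev(values)
    return fmean([(x - mean) ** 4 for x in values]) / sd ** 4 - 3.0


symmetric = [1, 2, 3, 4, 5]      # hypothetical symmetric data
right_tail = [1, 1, 1, 1, 10]    # hypothetical data with one high value
print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_tail))
```

The symmetric data give a skewness of zero and a negative (flat) kurtosis, while the data with a high tail give a positive skewness.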


Simple nonparametric procedure(s)

Approach
• Sort N reference values in increasing numerical order
• Assign rank numbers; lowest = 1; highest = N
• Rank number of 2.5-Percentile = 0.025 x (N+1) or 0.025 x (N) + 0.5
• Rank number of 97.5-Percentile = 0.975 x (N+1) or 0.975 x (N) + 0.5
• Lower reference limit = reference value corresponding to rank number of 2.5-Percentile
• Upper reference limit = reference value corresponding to rank number of 97.5-Percentile

Remark – Estimation of 2.5 & 97.5 percentiles

Procedure recommended by the IFCC and CLSI:
• 2.5-Percentile = Value of number: (0.025) x (N+1)
• 97.5-Percentile = Value of number: (0.975) x (N+1)

Optimal procedure (slightly different from above):
• 2.5-Percentile = Value of number: (0.025) x (N) + 0.5
• 97.5-Percentile = Value of number: (0.975) x (N) + 0.5
(Linnet K. Clin Chem 2000;46:867-9)

Triglycerides: n = 282
• 0.025 x (282 + 1) = 7.1; Rank: 7 = 0.42 mmol/L
• 0.975 x (282 + 1) = 276; Rank: 276 = 2.12 mmol/L
• Reference interval = 0.42 – 2.12 mmol/L
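The two rank formulas can be compared with a short sketch. The function names are made up for this example, and the linear interpolation between neighbouring ranks is an assumption: the worked example above simply takes the value at the nearest rank.

```python
def percentile_rank(p, n, method="ifcc"):
    """Rank number of the p-percentile for n sorted values."""
    if method == "ifcc":      # p x (N + 1)
        return p * (n + 1)
    if method == "linnet":    # p x N + 0.5
        return p * n + 0.5
    raise ValueError(f"unknown method: {method}")


def value_at_rank(sorted_values, rank):
    """Value at a (possibly fractional) 1-based rank, by linear interpolation."""
    n = len(sorted_values)
    rank = min(max(rank, 1.0), float(n))  # keep the rank inside 1..N
    lower = int(rank)
    fraction = rank - lower
    upper = min(lower + 1, n)
    low_value = sorted_values[lower - 1]
    return low_value + fraction * (sorted_values[upper - 1] - low_value)


n = 282  # triglyceride example
for method in ("ifcc", "linnet"):
    low_rank = percentile_rank(0.025, n, method)
    high_rank = percentile_rank(0.975, n, method)
    print(f"{method}: ranks {low_rank:.3f} and {high_rank:.3f}")
```

With the sorted reference values in a list, `value_at_rank(sorted_values, percentile_rank(0.025, len(sorted_values)))` would give the lower reference limit.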



Sample size and precision of estimates

Precision of percentiles of Normal distribution
Can be expressed as a ratio between the 90%-confidence interval (90%-CI) and the width of the 95%-reference interval (e.g. ratios 0.3, 0.2 or 0.1 as outlined below).
The necessary sample sizes are indicated:

Ratio (90% CI/95% RI)   Parametric N
0.3                     23
0.2                     50
0.1                     205

Precision of percentiles of normal distribution
Comparison between parametric and non-parametric procedures.

Ratio (90% CI/95% RI)   Parametric N   Non-parametric N
0.3                     23             56
0.2                     50             125
0.1                     205            500



Sample size and precision of estimates

Coefficient of skewness: 0.75

Ratio (90% CI/95% RI)   Parametric N   Non-parametric N
0.3                     90             140
0.2                     200            315
0.1                     800            1250

Coefficient of skewness: 1.5

Ratio (90% CI/95% RI)   Parametric N   Non-parametric N
0.3                     200            315
0.2                     440            695
0.1                     1750           2740



Bootstrap principle

Repeated random re-sampling with replacement of observations.
• For a set of N observations: Each observation has the probability of 1/N of being re-sampled.
• A re-sampled set of N observations (a so-called pseudo-set of observations) may contain several copies of one observation and lack others.

Origin of the name
The term bootstrap refers to the phrase "to pull oneself up by one's bootstraps", originating from the tale The Adventures of Baron Munchausen (by Rudolph Erich Raspe (1737-94)), in which "The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps".

Calculation of estimates
• For each pseudo-set of N observations, the percentiles are computed by the simple nonparametric procedure.
• By repetition on a computer, e.g. 100 or more times, a distribution of estimated percentiles is obtained that mimics the real sampling variation.
• The bootstrap estimates are the means of the pseudo-estimates.
• The bootstrap procedure is slightly (5-15%) more efficient than simple nonparametric estimation.
• Standard errors of estimates are provided.

Limitations
• Too low coverage* at small sample sizes (N < 40)
• Modified versions with smoothing might improve coverage at small sample sizes
• Some bias problems with the bootstrap estimates at low sample sizes (N < 40)

*Coverage: Expected percentage of times an estimated CI includes the true value, i.e. ideally 90% for a supposed 90%-CI
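The bootstrap principle described above can be sketched as follows. This is a plain illustration, not the CBstat algorithm: the simulated log-normal data, the seeds, the number of re-samples (500) and the interpolated p x (N+1) percentile are all assumptions made for the example.

```python
import random
from statistics import fmean, stdev


def simple_percentile(values, p):
    """Simple nonparametric percentile at rank p x (N + 1), interpolated."""
    ordered = sorted(values)
    n = len(ordered)
    rank = min(max(p * (n + 1), 1.0), float(n))
    lower = int(rank)
    fraction = rank - lower
    upper = min(lower + 1, n)
    return ordered[lower - 1] + fraction * (ordered[upper - 1] - ordered[lower - 1])


def bootstrap_percentile(values, p, n_resamples=500, seed=1):
    """Return (bootstrap estimate, standard error) of the p-percentile."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(n_resamples):
        # Pseudo-set: N draws with replacement, each observation has prob. 1/N
        pseudo_set = rng.choices(values, k=n)
        estimates.append(simple_percentile(pseudo_set, p))
    # Estimate = mean of the pseudo-estimates; SE = their standard deviation
    return fmean(estimates), stdev(estimates)


# Hypothetical right-skewed data, for illustration only
data_rng = random.Random(42)
data = [data_rng.lognormvariate(0.0, 0.4) for _ in range(200)]

estimate, standard_error = bootstrap_percentile(data, 0.975)
print(f"97.5-percentile: {estimate:.2f} (SE {standard_error:.2f})")
```

The spread of the pseudo-estimates is what provides the standard error mentioned above.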



Comparison of statistical procedures

Can be studied theoretically and/or by simulation on the basis of specified model distributions, e.g. normal and log-normal types. In simulation, the procedure is repeated a large number of times in order to study bias (= difference between average of percentile estimates and true value) and precision (standard error: SE) of the estimation procedure (small SE: efficient procedure). SEs should reflect the real uncertainty so that estimated confidence intervals become correct.

Tool: Root mean squared error (RMSE)

$\mathrm{RMSE} = \left[\sum (x_{obs} - x_{True})^2 / N_{run}\right]^{0.5} = \left[\mathrm{Bias}^2 + \mathrm{SE}^2\right]^{0.5}$

($N_{run}$: no. of simulation runs)

A combined error measure taking both systematic deviation and random error into account.
Often used in statistics as an overall error measure that allows ranking of various statistical estimation procedures studied theoretically or by simulation.
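The decomposition of the RMSE into bias and SE can be checked numerically. In the sketch below the five "simulation results" and the true value are hypothetical, and the SE is computed with divisor Nrun, the form for which the identity holds exactly.

```python
import math
from statistics import fmean, pstdev


def rmse_components(estimates, true_value):
    """Return (RMSE, bias, SE) for a set of simulated estimates."""
    bias = fmean(estimates) - true_value
    se = pstdev(estimates)  # SD of the estimates, divisor N_run
    rmse = math.sqrt(fmean([(x - true_value) ** 2 for x in estimates]))
    return rmse, bias, se


# Hypothetical percentile estimates from 5 simulation runs; true value 2.00
estimates = [2.05, 2.20, 1.95, 2.30, 2.10]
rmse, bias, se = rmse_components(estimates, 2.00)

print(f"RMSE = {rmse:.4f}")
print(f"sqrt(Bias^2 + SE^2) = {math.sqrt(bias ** 2 + se ** 2):.4f}")
```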

Model example
Using a theoretical model distribution, e.g. a CHI-square-distribution, the true percentile values are known.

By simulation, the performance of parametric and nonparametric procedures can be compared and the RMSE of the percentile estimates can be related to the sample size.

Outcome
The higher the sample size, the higher the likelihood that the nonparametric procedure is the optimal approach (lowest RMSE at a given sample size).
The reason is that a bias associated with parametric estimation is independent of sample size and will therefore tend to dominate the RMSE at high sample sizes, where the random error vanishes.



Statistical procedures – Summary

Ranking of procedures according to efficiency
1. Parametric procedure
2. Bootstrap non-parametric
3. Simple non-parametric

Non-parametric vs parametric
• About half as effective, i.e. about twice the sample size is required to attain the same SE of the percentiles
• The difference in effectiveness is larger the more extreme the percentiles are (e.g. 99 vs 97.5 percentile)

Simple non-parametric procedure
• The rank formula p x N + 0.5 is slightly better than p x (N + 1) for both normal and skewed distributions

Bootstrap non-parametric vs simple non-parametric
• Slightly more efficient (5-15% savings of sample size)
• Confidence intervals can be estimated for smaller sample sizes (for simple non-parametric, N ≥ 120 for 90%-CI)



Example

Example: Triglycerides with CBstat

Procedure                        CI Lower limit   CI Upper limit
Parametric direct                0.08 – 0.23      1.79 – 1.94
Non-parametric                   0.34 – 0.52      1.92 – 2.60
Non-parametric bootstrap         0.37 – 0.52      1.88 – 2.33
Parametric after log-transform   0.40 – 0.46      1.90 – 2.16

Note: Direct parametric is not correct!

Simulation of triacylglyceride data

We simulate data that correspond to the triacylglyceride data: skew ~1.64. We do that with Worksheet LnNormal 3 (mean = 0; SD = 0.48; n = 1000).
• Copy the data into the file RefInt.xls.
• Set the number of decimals to 2 (precision as displayed).
• Sample 20 values from these data (Tools>Data Analysis>Sampling).
• Compare the 90% confidence intervals for n = 20 with the respective ones for n = 1000.
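For readers without the worksheets, the data generation and sub-sampling steps can be imitated in Python. This is a sketch under assumptions: the seed is arbitrary, the sub-sample is drawn without replacement, and only point estimates of the limits (value at the nearest rank of p x (N+1)) are computed; the confidence intervals are left to the tools named in the text.

```python
import random


def nonparametric_limits(values):
    """2.5- and 97.5-percentiles as the values at the nearest rank p x (N+1)."""
    ordered = sorted(values)
    n = len(ordered)

    def value_at(p):
        rank = round(p * (n + 1))
        rank = min(max(rank, 1), n)  # keep the rank inside 1..N
        return ordered[rank - 1]

    return value_at(0.025), value_at(0.975)


rng = random.Random(2007)  # arbitrary seed, for reproducibility only

# ln-Normal data with mean = 0 and SD = 0.48 on the ln scale, 2 decimals
population = [round(rng.lognormvariate(0.0, 0.48), 2) for _ in range(1000)]
subsample = rng.sample(population, 20)  # assumption: without replacement

print("n = 1000:", nonparametric_limits(population))
print("n = 20:  ", nonparametric_limits(subsample))
```

With n = 20 the limits are simply the smallest and largest value of the sub-sample, which illustrates why the estimates are so uncertain at small sample sizes.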


DataGeneration


Software & references

CBstat

A Windows program distributed by K. Linnet (via aaccdirect.org).

Offers general statistical methods and procedures dedicated for clinical biochemistry

Estimation of reference intervals:
• Simple nonparametric and bootstrap procedure
• Parametric direct
• Parametric after transformations
  – One-stage: log-, 3-parameter-log-, Box & Cox- and Manly-
  – Two-stage: 1) Correction of skewness; 2) Correction of kurtosis
• Normality testing with appropriate corrections after transformations
• Appropriate confidence intervals of percentiles after transformation

Further information:
• www.cbstat.com

References

Linnet K. Nonparametric estimation of reference intervals by simple and bootstrap-based procedures. Clin Chem 2000;46:867-9.

Linnet K. Two-stage transformation systems for normalization of reference distributions evaluated. Clin Chem 1987;33:381-6.

IFCC. J Clin Chem Clin Biochem 1987;25:645-56.

Linnet K. Testing normality of transformed distributions. Appl Statist 1988;37:180-6.


