overview courses



Page 1: Overview  Courses

• Kristian Linnet, MD, PhD [email protected]

• Per Hyltoft Petersen, MSc [email protected]

• Sverre Sandberg, MD, PhD [email protected]

Statistics & graphics for the laboratory

Linda Thienpont [email protected]

Dietmar Stöckl [email protected]

In cooperation with AQML: D Stöckl, L Thienpont &

Applications: Reference interval & Biological variation

Page 2: Overview  Courses

Prof Dr Linda M Thienpont
University of Gent
Institute for Pharmaceutical Sciences
Laboratory for Analytical Chemistry
Harelbekestraat 72, B-9000 Gent, Belgium
e-mail: [email protected]

STT Consulting
Dietmar Stöckl, PhD
Abraham Hansstraat 11
B-9667 Horebeke, Belgium
e-mail: [email protected] + FAX: +32/5549 8671

Copyright: STT Consulting 2007

Page 3: Overview  Courses

Content overview

Reference interval

Introduction

Data presentation
• Histogram
• Normal probability plot & rankit-transformation
• Graphical interpretation of rankit-plots

Partitioning

Statistical estimation
• Parametric and non-parametric

Biological variation
• Introduction
• Estimation (ANOVA application)
• Index-of-individuality
• Comparison of a result with a reference interval ("Grey-zone")
• Reference change value (RCV)

Page 4: Overview  Courses

Estimation of reference intervals – Overview

REFERENCE INDIVIDUALS

comprise a

REFERENCE POPULATION

from which is selected a

REFERENCE SAMPLE GROUP

on which are determined

REFERENCE VALUES

on which is observed a

REFERENCE DISTRIBUTION

from which are calculated

REFERENCE LIMITS

that may define

REFERENCE INTERVALS

that help with the interpretation of an

OBSERVED VALUE

[Figure: Flowchart of reference interval estimation. Boxes: Start; Collect samples; Inspect distribution; Detect & handle outliers; Partition?; Select statistics; Intuitive assessment; Gaussian? (Yes/No); Parametric method; Transform data; Backtransform estimates; Non-parametric method; Stop.]

Page 5: Overview  Courses

Inclusion criteria (NORIP, Malmø 27/4-2004)

The reference individual should
• be feeling subjectively well
• have reached the age of 18
• not be pregnant or breast-feeding
• not have been an in-patient in a hospital nor subjectively dangerously ill during the last month
• not have had more than 2 measures of alcohol (24 g) in the last 24 hours
• not have given blood as a donor in the last five months
• not have taken prescribed drugs other than the P-pill or estrogens (female sex hormone) during the last two weeks
• not have smoked in the last hour prior to blood sampling

Preanalytical conditions (NORIP, Malmø 27/4-2004)

Reference individual
• Sitting at least 15 min before sampling

Sample collection
• Li-heparin plasma or serum, EDTA-blood for haematology
• Standard procedure
• Minimal stasis

Sample handling (plasma and serum)
• Stored in the dark
• Storage at room temperature before centrifugation; serum: 0.5-1.5 h, plasma: max 15 min
• Centrifugation: 10 min at min 1500 g
• Distributed to secondary tubes within 2 h
• Stored at -80 °C within 4 h

Outliers
• Gross or slight deviation
• Check records
• Check results
• Re-analyse
• ? Include
• ? Omit

Page 6: Overview  Courses

Data presentation

Tools for presentation and inspection of distributions
• Histogram
• Normal probability plot and "Rankit-transformation"

Rankit-transformation
Reference population: Gauss-distribution
Hyltoft Petersen P, Hørder M. Influence of analytical quality on test results. Scand J Clin Lab Invest 1992;52 Suppl 208:65-87.

The frequency distribution is transformed to the cumulated frequency distribution and then transformed to the Rankit- or Normal Probability Plot.

Data presentation & inspection

In the Normal Probability Plot, the values are plotted on the x-axis and their normalized deviation from the mean (z-value, or Rankit) on the y-axis. In the figure below, a second axis has been introduced where the probabilities corresponding to the z-values can be read. Note that this second axis is non-linear and needs to be introduced as a picture; it cannot be created with EXCEL. The tick-marks, however, can be programmed into an EXCEL chart (see: NormalRankitPlot.xls).

Use: Visual test for Normal distribution: data should fit a line.
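As a rough illustration of the rankit-transformation, the sketch below computes the z-value for each sorted result with Python's standard library. The plotting position (i - 0.5)/N and the function name `rankits` are assumptions made here; NormalRankitPlot.xls may use a different convention.

```python
from statistics import NormalDist

def rankits(values):
    """Return (sorted value, z-value) pairs for a normal probability plot.

    Assumption: the cumulated frequency of the i-th sorted value is
    (i - 0.5) / N; other plotting positions are also in use.
    """
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, SD 1
    return [(x, std_normal.inv_cdf((i - 0.5) / n))
            for i, x in enumerate(ordered, start=1)]

# Hypothetical results; plot x against z and check whether they fit a line.
for x, z in rankits([0.92, 0.40, 1.35, 0.75, 2.10]):
    print(f"{x:5.2f}  {z:6.2f}")
```

Plotting the pairs with the values on the x-axis and z on the y-axis gives the visual test described above.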

Page 7: Overview  Courses

The rankit plot

Triacylglyceride example

Effect of imprecision (left Fig) and bias (right Fig) on the Normal Probability Plot

An increase in imprecision (here 1.5 x) rotates the line clockwise and changes the probability at z = 1.65 from 95% to 84%.

The introduction of a bias (here = 1) moves the line to the right and changes the probability at z = 1.65 from 95% to 74%.

NormalRankitPlot

Page 8: Overview  Courses

The rankit plot

Bimodal situation: left population healthy, right population diseased

When we apply the plot in the bimodal situation, we can directly read the false negatives (FN) and the false positives (FP). Note that the healthy are cumulated from right to left.

Under the conditions chosen (diseased at a distance of +2 SD and cutoff = 1.28 SD), FN = 24% and FP = 10%.
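The FN and FP fractions quoted above can be checked numerically. The sketch below assumes that both populations are Gaussian with SD = 1 and that the cutoff is expressed in SD units of the healthy population; the function name `error_rates` is chosen here for illustration.

```python
from statistics import NormalDist

def error_rates(distance_sd=2.0, cutoff_sd=1.28):
    """False negative and false positive fractions for two unit-SD Gaussians.

    Assumption: healthy ~ N(0, 1), diseased ~ N(distance_sd, 1).
    """
    healthy = NormalDist(0.0, 1.0)
    diseased = NormalDist(distance_sd, 1.0)
    fp = 1.0 - healthy.cdf(cutoff_sd)  # healthy above the cutoff
    fn = diseased.cdf(cutoff_sd)       # diseased below the cutoff
    return fn, fp

fn, fp = error_rates()
print(f"FN = {fn:.0%}, FP = {fp:.0%}")  # FN = 24%, FP = 10%
```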

Page 9: Overview  Courses

Data inspection – Examples

Uric acid (µmol/l) – Simulation (distributions moved!)

        Female   Male
Mean    250      370
SD      40       40
n       1000     1000

Depending on the bin-size, bimodal distributions may be hidden in histograms!

Uric acid ~reality, but Normal distributed

        Female   Male
Mean    250      330
SD      55       65
n       1000     1000

Graphical techniques are too weak to uncover bimodal situations where the population means are close together!
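Why the second example hides its two subgroups can be seen from the theoretical mixture density. The sketch below counts the local maxima of an equal-weight mixture of two Gaussians on a grid; it works on the density, not on a simulated histogram, and the function name `mixture_modes` and the grid settings are assumptions made here.

```python
from statistics import NormalDist

def mixture_modes(mean_f, sd_f, mean_m, sd_m, lo=0.0, hi=700.0, step=1.0):
    """Locations of local maxima of a 50/50 mixture of two Gaussian densities."""
    female, male = NormalDist(mean_f, sd_f), NormalDist(mean_m, sd_m)
    xs = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    dens = [0.5 * female.pdf(x) + 0.5 * male.pdf(x) for x in xs]
    return [xs[i] for i in range(1, len(xs) - 1)
            if dens[i - 1] < dens[i] > dens[i + 1]]

# Means far apart (250 vs 370, SD 40): two peaks.
print(mixture_modes(250, 40, 370, 40))
# Means close together (250 vs 330, SD 55 and 65): a single peak.
print(mixture_modes(250, 55, 330, 65))
```

With only one peak in the density, no choice of bin-size can reveal the two subgroups graphically.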

Test for normal distribution    P
Chi-square                      0.836
Kolmogorov-Smirnov              0.249
Anderson-Darling                0.02
D'Agostino-Pearson              0.016

Statistical techniques may uncover that "something is wrong" (not Normal) with the distribution. From that, one may consider looking for subgroups! However, different tests may have enormously different power!

[Figure: frequency distribution; x-axis: Analyte Conc. (100 to 500); y-axis: 0 to 0.3]

Page 10: Overview  Courses

Calculations with logarithms

Data transformation: Logarithms
When the data are not normally distributed, one can try a transformation. Because data in nature are often log-normally distributed, logarithmic transformation can make them normally distributed.

Test for normality: Triglycerides (See: Datasets.xls)
n = 282; Lowest value: 0.3 mmol/L; Highest value: 3.2 mmol/L; Median: 0.92 mmol/L.

CBstat
• Anderson-Darling test: P < 0.01; data not normally distributed
• Anderson-Darling test after logarithmic (natural) transformation: P = 0.13; data log-normally distributed

Normal Probability Plot (ln-transformed data)
Data are "on a line": data are ln-Normal distributed

Testing normality

Page 11: Overview  Courses

Working with logarithms

Calculate the reference interval of a logarithmic distribution

Triglycerides

1. Transform the original data to ln
2. Calculate the mean of the $\ln(x_i)$ values
3. Take the anti-ln of the mean of $\ln(x_i)$

This equals the geometric mean of the original population, which is close to its median.

The anti-ln of the mean of the logged values, $e^{-0.0689} = 0.933$, is equal to the geometric mean of the original distribution, where the latter is given by $(x_1 \cdot x_2 \cdots x_n)^{1/n}$.

The anti-ln of the SD is meaningless.

Calculation of 2.5 and 97.5 percentile
Mean (ln transformed): -0.0689
SD (ln transformed): 0.395
2.5 percentile: -0.0689 – 1.96*0.395 = -0.843
97.5 percentile: -0.0689 + 1.96*0.395 = 0.7053
Anti-ln of 2.5 & 97.5 percentile: 0.43 – 2.02
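The calculation above can be reproduced in a few lines. This is a minimal sketch: the function names are chosen here for illustration, and the version starting from raw data assumes the sample SD (divisor N - 1), which the slide does not specify.

```python
import math
from statistics import mean, stdev

def log_normal_limits(mean_ln, sd_ln, z=1.96):
    """Back-transformed 2.5 and 97.5 percentiles from the ln-scale mean and SD."""
    return math.exp(mean_ln - z * sd_ln), math.exp(mean_ln + z * sd_ln)

def log_normal_limits_from_data(values, z=1.96):
    """Same calculation starting from raw (positive) results.

    Assumption: SD of the ln-values is the sample SD (divisor N - 1).
    """
    logs = [math.log(v) for v in values]
    return log_normal_limits(mean(logs), stdev(logs), z)

lower, upper = log_normal_limits(-0.0689, 0.395)
print(f"{lower:.2f} - {upper:.2f} mmol/L")          # 0.43 - 2.02 mmol/L
print(f"geometric mean: {math.exp(-0.0689):.3f}")   # 0.933
```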

Number   mmol/l   ln
1        0.3      -1.204
2        0.32     -1.139
3        0.34     -1.079
4        0.38     -0.968
5        0.4      -0.916
6        0.4      -0.916
…        …        …
282      3.2      1.163

Median: 0.92
Mean, ln: -0.069
Anti-ln (e^x): 0.933 (EXCEL: EXP(x))
Geometric mean: 0.933 (EXCEL: GEOMEAN)

Page 12: Overview  Courses

Partitioning of reference intervals

Visual, on the basis of suspected differences (sex, race, age, …)

The reference interval

Frequency polygon

Rankit-plot

Page 13: Overview  Courses

Example: Partitioning – Visual

Comparison of orosomucoid values: Caucasians and Indians in Leeds (Johnson et al. CCLM 2004;42:792-9).

Statistical criteria for partitioning (Lahti et al. Clin Chem 2002;48:338-52)

Difference between two upper or lower limits
• D < 0.25*s: No partitioning
• D = 0.25 – 0.75*s: Variable
• D > 0.75*s: Partitioning
• or percentage: Pb 0.9 and Pa 4.1 %
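The distance criterion can be written as a small helper. This is only a sketch of the D thresholds listed above: the function name is hypothetical, `s` is assumed to be the relevant standard deviation supplied by the caller, and the percentage criterion is not covered.

```python
def partition_decision(limit_a, limit_b, s):
    """Classify the difference D between two subgroup reference limits.

    Thresholds 0.25*s and 0.75*s are taken from the slide; which SD to use
    for `s` is an assumption left to the caller.
    """
    d = abs(limit_a - limit_b)
    if d < 0.25 * s:
        return "no partitioning"
    if d <= 0.75 * s:
        return "variable"
    return "partitioning"

# Hypothetical upper limits of two subgroups, with s = 40.
print(partition_decision(400, 405, 40))  # no partitioning
print(partition_decision(400, 440, 40))  # partitioning
```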

Page 14: Overview  Courses

Statistical model for estimating a reference interval

The statistical procedures assume random sampling in the target population.
Traditionally, the 2.5- and 97.5-percentiles are estimated, so that on average 95% of the population is included.
In some contexts, one-sided limits are used: 95-, 97.5- or 99-percentiles.

Statistical estimation procedures

Parametric
• Assumes normal distribution or distribution that can be transformed to the normal distribution

Nonparametric
• Model-free estimation of percentiles

Partitioning
• Subdivision according to gender, age, race, etc. should be considered where relevant

Reference interval & type of distribution

Normal distributions can be expected for analytes with relatively narrow biological distribution, e.g. electrolytes.

The reference interval for Normal distributions ranges from the 2.5th to the 97.5th percentile (= mean ± 1.96 SD).
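A minimal sketch of the direct parametric estimate, assuming the reference values are already Normal distributed and using the sample SD (divisor N - 1). The function name and the example values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def parametric_reference_interval(values, coverage=0.95):
    """Mean +/- z*SD, with z taken from the standard Normal distribution."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)  # about 1.96 for 95%
    m, sd = mean(values), stdev(values)
    return m - z * sd, m + z * sd

# Hypothetical sodium results (mmol/L), for illustration only.
sodium = [138, 140, 141, 139, 142, 137, 140, 143, 139, 141]
low, high = parametric_reference_interval(sodium)
print(f"{low:.1f} - {high:.1f} mmol/L")
```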

[Figure: Normal distribution curve; the central 95% is the reference interval, bounded by the lower reference limit and the upper reference limit.]

Page 15: Overview  Courses

Skewed distributions

• Biological variation is very often skewed to the right, i.e. there is a tailing with high values.

• The theoretical background is that many factors have a multiplicative impact (an additive impact of many independent factors yields a normal distribution).

Skewed distributions can often be modeled by the log-normal distribution.
The log-normal type of distribution actually constitutes a family of distributions with a spectrum of degrees of skewness determined by the parameter values (ratio between standard deviation and mean).

Coefficient of skewness: $C_{skew} = \left[\sum (x_i - x_m)^3 / N\right] / SD^3$

Zero: symmetric distribution; Positive: skewed to the right; Negative: skewed to the left

Nonparametric procedure
Applicable to all types of distributions

Simple procedures
• Based on ordering (ranking) of values according to size

Refined procedures
• Weighted percentile estimation, smoothing techniques, resampling principle (bootstrap).

Coefficient of kurtosis: $C_{kurt} = \left[\sum (x_i - x_m)^4 / N\right] / SD^4 - 3$

Zero: Normal distribution; Positive: Peaked distribution; Negative: Flat distribution
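The coefficients of skewness and kurtosis on this slide can be computed as sketched below. The slide does not say whether SD uses divisor N or N - 1; the population SD (divisor N) is assumed here, and the function name is chosen for illustration.

```python
from statistics import fmean, pstdev

def skewness_and_kurtosis(values):
    """Return (C_skew, C_kurt) following the slide's formulas.

    Assumption: SD is the population SD (divisor N).
    """
    n = len(values)
    xm = fmean(values)
    sd = pstdev(values)
    c_skew = sum((x - xm) ** 3 for x in values) / n / sd ** 3
    c_kurt = sum((x - xm) ** 4 for x in values) / n / sd ** 4 - 3.0
    return c_skew, c_kurt

# A symmetric set gives zero skewness; a high tail gives positive skewness.
print(skewness_and_kurtosis([1, 2, 3, 4, 5]))
print(skewness_and_kurtosis([1, 1, 1, 1, 10]))
```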

Page 16: Overview  Courses

Simple nonparametric procedure(s)

Approach
• Sort N reference values in increasing numerical order
• Assign rank numbers; lowest = 1; highest = N
• Rank number of 2.5-Percentile = 0.025 x (N+1) or 0.025 x (N) + 0.5
• Rank number of 97.5-Percentile = 0.975 x (N+1) or 0.975 x (N) + 0.5
• Lower reference limit = reference value corresponding to rank number of 2.5-Percentile
• Upper reference limit = reference value corresponding to rank number of 97.5-Percentile

Remark – Estimation of 2.5 & 97.5 percentiles

Procedure recommended by the IFCC and CLSI:
• 2.5-Percentile = Value of number: (0.025) x (N+1)
• 97.5-Percentile = Value of number: (0.975) x (N+1)

Optimal procedure (slightly different from above):
• 2.5-Percentile = Value of number: (0.025) x (N) + 0.5
• 97.5-Percentile = Value of number: (0.975) x (N) + 0.5
(Linnet K. Clin Chem 2000;46:867-9)

Triglycerides: n = 282
0.025 x (282 + 1) = 7.1; rank 7 = 0.42 mmol/L
0.975 x (282 + 1) = 276; rank 276 = 2.12 mmol/L
Reference interval = 0.42 – 2.12 mmol/L
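The rank-based procedure can be sketched as follows. The two rank formulas are those given above; the linear interpolation between neighbouring ranks for fractional rank numbers is an assumption made here (the worked example simply rounds to the nearest rank), and the function names are hypothetical.

```python
def percentile_rank(p, n, method="ifcc"):
    """Rank number of the p-percentile: p*(N+1) ("ifcc") or p*N + 0.5 ("linnet")."""
    if method == "ifcc":
        return p * (n + 1)
    if method == "linnet":
        return p * n + 0.5
    raise ValueError(f"unknown method: {method}")

def nonparametric_percentile(values, p, method="ifcc"):
    """Value at the computed rank, interpolating between neighbouring ranks."""
    ordered = sorted(values)
    n = len(ordered)
    rank = percentile_rank(p, n, method)
    if rank < 1 or rank > n:
        raise ValueError("sample too small for this percentile")
    lower = int(rank)      # whole part of the 1-based rank
    frac = rank - lower    # fractional part used for interpolation
    if frac == 0:
        return ordered[lower - 1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + frac * (above - below)

# Rank numbers for the triglyceride example (N = 282).
print(round(percentile_rank(0.025, 282), 1), round(percentile_rank(0.975, 282), 1))
```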

Page 17: Overview  Courses

Sample size and precision of estimates

Precision of percentiles of Normal distribution
Can be expressed as a ratio between 90%-confidence intervals (90%-CI) and the width of the 95%-reference interval (e.g. ratios 0.3, 0.2 or 0.1 as outlined below).
The necessary sample sizes are indicated:

Ratio (90% CI/95% RI)   Parametric N
0.3                     23
0.2                     50
0.1                     205

Precision of percentiles of normal distribution
Comparison between parametric and non-parametric procedures.

Ratio (90% CI/95% RI)   Parametric N   Non-parametric N
0.3                     23             56
0.2                     50             125
0.1                     205            500

Page 18: Overview  Courses

Sample size and precision of estimates

Coefficient of skewness: 0.75

Ratio (90% CI/95% RI)   Parametric N   Non-parametric N
0.3                     90             140
0.2                     200            315
0.1                     800            1250

Coefficient of skewness: 1.5

Ratio (90% CI/95% RI)   Parametric N   Non-parametric N
0.3                     200            315
0.2                     440            695
0.1                     1750           2740

Page 19: Overview  Courses

Bootstrap principle

Repeated random re-sampling with replacement of observations.
• For a set of N observations: each observation has the probability of 1/N of being re-sampled.
• A re-sampled set of N observations (a so-called pseudo-set of observations) may contain several copies of one observation and lack others.

Origin of the name
The term bootstrap refers to the phrase "to pull oneself up by one's bootstraps", originating from the tale The Adventures of Baron Munchausen (by Rudolph Erich Raspe (1737-94)), in which "The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps".

Calculation of estimates
• For each pseudo-set of N observations, the percentiles are computed by the simple nonparametric procedure.
• By repetition on a computer, e.g. 100 or more times, a distribution of estimated percentiles is obtained that mimics the real sampling variation.
• The bootstrap estimates are the means of the pseudo-estimates.
• The bootstrap procedure is slightly (5-15%) more efficient than simple nonparametric estimation.
• Standard errors of estimates are provided.
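The bootstrap steps above can be sketched with the standard library. This is an illustration under stated assumptions, not the CBstat implementation: the rank formula p x N + 0.5 with linear interpolation is used for each pseudo-set, the standard error is taken as the SD of the pseudo-estimates, and all names and the example data are hypothetical.

```python
import random
from statistics import fmean, stdev

def simple_percentile(values, p):
    """Simple nonparametric percentile, rank = p*N + 0.5 with interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    rank = min(max(p * n + 0.5, 1.0), float(n))  # clamp to the observed range
    lower = int(rank)
    frac = rank - lower
    if frac == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

def bootstrap_percentile(values, p, n_boot=500, seed=1):
    """Return (bootstrap estimate, standard error) of the p-percentile."""
    rng = random.Random(seed)
    n = len(values)
    pseudo_estimates = []
    for _ in range(n_boot):
        # Re-sample N observations with replacement (probability 1/N each).
        pseudo_set = rng.choices(values, k=n)
        pseudo_estimates.append(simple_percentile(pseudo_set, p))
    return fmean(pseudo_estimates), stdev(pseudo_estimates)

# Hypothetical skewed data set of 200 values.
data_rng = random.Random(0)
data = [data_rng.lognormvariate(0.0, 0.48) for _ in range(200)]
estimate, se = bootstrap_percentile(data, 0.975)
print(f"97.5-percentile: {estimate:.2f} (SE {se:.2f})")
```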

Limitations
• Too low coverage* at small sample sizes (N < 40)
• Modified versions with smoothing might improve coverage at small sample sizes
• Some bias problems with the bootstrap estimates at low sample sizes (N < 40)

*Coverage: Expected percentage of times an estimated CI includes the true value, i.e. ideally 90% for a supposed 90%-CI

Page 20: Overview  Courses

Comparison of statistical procedures

Can be studied theoretically and/or by simulation on the basis of specified model distributions, e.g. normal and log-normal types. In simulation, the procedure is repeated a large number of times in order to study bias (= difference between average of percentile estimates and true value) and precision (standard error: SE) of the estimation procedure (small SE: efficient procedure). SEs should reflect the real uncertainty so that estimated confidence intervals become correct.

Tool: Root mean squared error (RMSE)

$RMSE = \left[\sum (x_{obs} - x_{True})^2 / N_{run}\right]^{0.5} = \left[Bias^2 + SE^2\right]^{0.5}$

($N_{run}$: no. of simulation runs)

A combined error measure taking both systematic deviation and random error into account.
Often used in statistics as an overall error measure allowing ranking of various statistical estimation procedures studied theoretically or by simulations.
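The decomposition of the RMSE into bias and SE can be verified numerically. In the sketch below the SE is the SD of the estimates with divisor N_run, which is the assumption under which the identity holds exactly; the function name and the example estimates are hypothetical.

```python
import math
from statistics import fmean, pstdev

def rmse_components(estimates, true_value):
    """Return (RMSE, bias, SE) for a list of simulated estimates."""
    n_run = len(estimates)
    rmse = math.sqrt(sum((x - true_value) ** 2 for x in estimates) / n_run)
    bias = fmean(estimates) - true_value  # average estimate minus true value
    se = pstdev(estimates)                # SD of estimates, divisor N_run
    return rmse, bias, se

rmse, bias, se = rmse_components([1.0, 2.0, 3.0, 6.0], true_value=2.0)
print(rmse, math.sqrt(bias ** 2 + se ** 2))  # both are sqrt(4.5)
```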

Model example
Using a theoretical model distribution, e.g. a chi-square distribution, the true percentile values are known.

By simulation, the performance of parametric and nonparametric procedures can be compared and the RMSE of the percentile estimates can be related to the sample size.

Outcome
The higher the sample size, the higher is the likelihood that the nonparametric procedure is the optimal approach (lowest RMSE at given sample size).
The relationship rests on the general fact that a bias associated with parametric estimation is independent of sample size and will tend to dominate the RMSE at high sample sizes, where the random error vanishes.
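A small simulation illustrates this outcome. The sketch assumes a chi-square distribution with 4 degrees of freedom (tabulated 97.5-percentile 11.143), a direct parametric estimate (mean + 1.96 SD, i.e. without transformation) and the simple nonparametric estimate with rank p x N + 0.5; the sample sizes, number of runs and names are choices made here.

```python
import math
import random
from statistics import fmean, stdev

DF = 4
TRUE_P975 = 11.143  # tabulated 97.5-percentile of chi-square with 4 df

def parametric_p975(sample):
    """Direct parametric estimate; wrongly assumes a Normal distribution."""
    return fmean(sample) + 1.96 * stdev(sample)

def nonparametric_p975(sample):
    """Simple nonparametric estimate, rank = 0.975*N + 0.5 with interpolation."""
    ordered = sorted(sample)
    rank = 0.975 * len(ordered) + 0.5
    lower = int(rank)
    if lower >= len(ordered):
        return ordered[-1]
    frac = rank - lower
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

def rmse(estimator, n, n_run=200, seed=1):
    """RMSE of an estimator over n_run simulated samples of size n."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(n_run):
        # Gamma(shape = df/2, scale = 2) is the chi-square distribution.
        sample = [rng.gammavariate(DF / 2, 2.0) for _ in range(n)]
        squared_errors.append((estimator(sample) - TRUE_P975) ** 2)
    return math.sqrt(fmean(squared_errors))

for n in (40, 500):
    print(n, round(rmse(parametric_p975, n), 2), round(rmse(nonparametric_p975, n), 2))
```

At the larger sample size the bias of the parametric estimate dominates its RMSE, while the RMSE of the nonparametric estimate keeps shrinking.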

Page 21: Overview  Courses

Statistical procedures – Summary

Ranking of procedures according to efficiency
1. Parametric procedure
2. Bootstrap non-parametric
3. Simple non-parametric

Non-parametric vs parametric
About half as effective, i.e. about twice the sample size required to attain the same SE of the percentiles
The difference in effectiveness is larger the more extreme the percentiles are (e.g. 99 vs 97.5 percentile)

Simple non-parametric procedure
Rank N x p + 0.5 slightly better than (N + 1) x p for both normal and skewed distributions

Bootstrap non-parametric vs simple non-parametric
Slightly more efficient (5-15% savings of sample size)
Confidence intervals can be estimated for smaller sample sizes (for simple non-parametric N ≥ 120 for 90%-CI)

Page 22: Overview  Courses

Example

Example: Triglycerides with CBstat

Procedure                        CI Lower limit   CI Upper limit
Parametric direct                0.08 – 0.23      1.79 – 1.94
Non-parametric                   0.34 – 0.52      1.92 – 2.60
Non-parametric bootstrap         0.37 – 0.52      1.88 – 2.33
Parametric after log-transform   0.40 – 0.46      1.90 – 2.16

Note: Direct parametric is not correct!

Simulation of triacylglyceride data
We simulate data that correspond to the triacylglyceride data: skew ~1.64. We do that with Worksheet LnNormal 3 (mean = 0; SD = 0.48; n = 1000). Copy the data into the file RefInt.xls. Set the number of decimals to 2 (precision as displayed). Sample 20 values from these data (Tools>Data Analysis>Sampling). Compare the 90% confidence intervals for n = 20 with the respective ones for n = 1000.
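For readers without the worksheets, the data generation and sampling steps can be approximated in Python. This sketch stands in for Worksheet LnNormal 3 and the EXCEL Sampling tool and is not a reproduction of them; it only compares the reference limits estimated after log-transformation for n = 1000 and n = 20, and leaves the confidence intervals to CBstat. Seeds and function names are arbitrary choices.

```python
import math
import random
from statistics import fmean, stdev

def simulate_ln_normal(n=1000, mean_ln=0.0, sd_ln=0.48, seed=1):
    """Log-normal data rounded to 2 decimals (precision as displayed)."""
    rng = random.Random(seed)
    return [round(rng.lognormvariate(mean_ln, sd_ln), 2) for _ in range(n)]

def log_parametric_limits(values):
    """2.5 and 97.5 percentiles estimated after ln-transformation."""
    logs = [math.log(v) for v in values]
    m, sd = fmean(logs), stdev(logs)
    return math.exp(m - 1.96 * sd), math.exp(m + 1.96 * sd)

population = simulate_ln_normal()
subsample = random.Random(2).sample(population, 20)  # 20 values, no replacement
print("n = 1000:", log_parametric_limits(population))
print("n = 20:  ", log_parametric_limits(subsample))
```

Repeating the sampling step with different seeds shows how much more the limits for n = 20 scatter than those for n = 1000.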

DataGeneration

Page 23: Overview  Courses

Software & references

CBstat

A Windows program distributed by K. Linnet (via aaccdirect.org).

Offers general statistical methods and procedures dedicated to clinical biochemistry.

Estimation of reference intervals:
• Simple nonparametric and bootstrap procedure
• Parametric direct
• Parametric after transformations
  – One-stage: log-, 3-parameter-log-, Box & Cox- and Manly-
  – Two-stage: 1) Correction of skewness; 2) Correction of kurtosis
• Normality testing with appropriate corrections after transformations
• Appropriate confidence intervals of percentiles after transformation

Further information:
• www.cbstat.com

References

Linnet K. Nonparametric estimation of reference intervals by simple and bootstrap-based procedures. Clin Chem 2000;46:867-9.

Linnet K. Two-stage transformation systems for normalization of reference distributions evaluated. Clin Chem 1987;33:381-6.

IFCC. J Clin Chem Clin Biochem 1987;25:645-56.

Linnet K. Testing normality of transformed distributions. Appl Statist 1988;37:180-6.
