summarizing data - people.musc.edupeople.musc.edu/~elg26/hcclectureseries/hem_onc_lecture1a.pdf ·...
TRANSCRIPT
Summarizing Data
Hematology/Oncology Fellows Lecture Series in Statistics
Elizabeth G. Hill, PhD
Associate Professor of Biostatistics
08 May 2014
Summarizing Data – p. 1/31
Objectives
At the end of the lecture, students will be able to
• Understand the difference between a sample and apopulation
• Define continuous and categorical variables
• Describe the distribution of a continuous variable usingcommon measures of central tendency and dispersion
• Describe the shape of the distribution of a continuousvariable using a histogram
• Describe properties of and interpret graphical summariesof time-to-event variables
• Describe the distribution of a categorical variable
Summarizing Data – p. 2/31
Some basics
TARGET (STUDY) POPULATION: The patients of clinical orscientific interest. If a study is well-designed, anyconclusions drawn from the study generalize to the targetpopulation.
SAMPLE: The patients of clinical or scientific interest thatparticipate in a study. In a well-designed study, thesesubjects are carefully selected from the target populationand the sample is said to be representative.
VARIABLE: A quantity or trait of interest in the studypopulation. Variables of particular interest in a study aresometimes called primary endpoints or primary outcomevariables.
DATA: Quantitative measurements of variables obtained fromsubjects in the sample population.
Summarizing Data – p. 3/31
Variable types
Depending on how its values are measured, a variable is(broadly) classified as either continuous or categorical.
• Continuous variables:◦ measured on a continuum◦ often have units of measure (e.g. months; kg/m2)◦ are quantitative - variable’s value reflects (on some
scale) that which is being measured
• Categorical variables:◦ values represent discrete ‘levels’ or ‘classes’ (e.g.
Male, Female)◦ value associated with class labels are not inherently
meaningful (e.g. 1 = Male, 2 = Female)◦ can be ordinal (i.e. ordered) or nominal
Summarizing Data – p. 4/31
Variable types (cont.)
In the article by Ohtsu et al. (JCO 2011), there are a numberof continuous and categorical variables.
Variable Continuous (units) CategoricalNominal Ordinal
Sex Nominal
Age Continuous (years)
ECOG PS Ordinal
Geographic region Nominal
Primary tumor site Nominal
Overall survival Continuous (months)
Response Ordinal
Summarizing Data – p. 5/31
Describing a continuous variable’s distribution
When we collect data on study subjects, one objective is todescribe how different variables are distributed in the targetpopulation.
For example, we might wish to summarize age in ourpopulation. We would therefore measure the ages of subjectsin our sample, and infer to the target population.
Q How do we generalize from sample data to targetpopulation?
A We use statistics.
A statistic is a summary measure based on sample data.Statistics can be displayed numerically or graphically. Whatstatistics should we construct to adequately summarize avariable’s distribution?
Summarizing Data – p. 6/31
Describing shape
Consider a plot of ages of 100 subjects in a study. The pointsare jittered to enhance visibility. If we group the data infive-year age bands (35-40, 40-45, ... , 75-80, 80-85) webegin to see the underlying structure of the distribution of agein our population.
40 50 60 70 80Age (years)
Summarizing Data – p. 7/31
Describing shape (cont.)
x
40 50 60 70 80
05
1015
2025
Age (years)
Fre
quen
cy
xxxx
xxxxxxxx
xxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxx
xxxxxxxxxx
xxxxxxx
x
Summarizing Data – p. 8/31
Describing shape with a histogram
The figure shown on slide 8 illustrates the steps inconstructing a histogram. The histogram is a useful graphicaltool that succinctly depicts the distribution of a continuousvariable. The individual groupings of values are called bins.We complete the histogram by plotting contiguous bars withheights showing the frequency (or proportion) of themeasures occurring within each bin. For the figure shown onslide 8, the graph on slide 10 shows the completed histogram.
(N.b. histogram 6= bar chart)
Summarizing Data – p. 9/31
Describing shape with a histogram (cont.)
Age (years)
Fre
quen
cy
40 50 60 70 80
05
1015
20
Summarizing Data – p. 10/31
Shape terminology
A B C D
A: Approximately symmetric and unimodal
B: Positively skewed and unimodal
C: Negatively skewed and unimodal
D: Asymmetric and bimodal
Summarizing Data – p. 11/31
The normal distribution
Symmetric unimodal distributions describe many variables weencounter in biologic and medical research. This distributionshape is so common, we call it the normal (or Gaussian)distribution. Normally distributed variables have manyimportant properties in addition to symmetry and uni-modality.By examining the shape of a continuous variable’s distributionwith a histogram, we can gauge whether or not that variableis approximately normally distributed.
Summarizing Data – p. 12/31
Describing central tendency
Beyond shape, an obvious summary measure for acontinuous variable is one describing central tendency. Ameasure of central tendency answers the question:
“Around what value do measures tend to aggregate?”
The two most common measures of the center of acontinuous variable’s distribution are
• sample mean (= arithmetic average)
• sample median (= 50th percentile)
If we know a patient is a member of the target population, ourbest guess of their value of the variable of interest - in theabsence of any additional information - is the measure ofcentral tendency.
Summarizing Data – p. 13/31
Example - Age central tendency
In the figures below, a sample of 100 ages are plotted. Thepoints are jittered to enhance viewing. The blue lines showthe locations of the sample mean, and the red lines show thelocations of the sample median.
30 40 50 60 70 80Age (years)
0 10 20 30 40Age (years)
Summarizing Data – p. 14/31
Example - Age central tendency (cont.)
In the first example, both the mean and median adequatelydescribe the distribution’s center.
30 40 50 60 70 80Age (years)
In the second example, the median better describes thedistribution’s center. The mean over-estimates the center.
0 10 20 30 40Age (years)
Summarizing Data – p. 15/31
Shape and central tendency
The difference between the distributions of age in the twofigures can be seen better using histograms. Both the meanand median are adequate measures of central tendency forsymmetric unimodal distributions. But only the median is anappropriate measure of central tendency if the data areskewed.
Age (years)
Freq
uenc
y
30 50 70
05
1015
20
Age (years)
Freq
uenc
y
0 20 40
010
2030
Summarizing Data – p. 16/31
Describing dispersion
In the figures below, a sample of 100 ages are plotted for twodifferent populations. Both distributions are centered atroughly the same value - the mean is approximately 61 inboth figures - but the ‘spread’ of the values differs.
40 50 60 70 80Age (years)
40 50 60 70 80Age (years)
Summarizing Data – p. 17/31
Standard deviation
Statistics that measure the spread of a distribution are calledmeasures of dispersion. The most familiar measure ofdispersion is called the standard deviation (abbreviated ‘SD’).Here is how we calculate SD.
1. Calculate the difference between each data point and thesample mean.
2. Sum the squared differences.
3. Standardize the sum by dividing by n− 1, where n is thenumber of data points. The result of these steps is calledthe variance.
4. The square root of the variance is the standard deviation.
Summarizing Data – p. 18/31
Standard deviation (cont.)
Step 1 on Slide 18 is depicted below for two data points,Age
1= 45 and Age
2= 70.
40 50 60 70 80Age (years)
d1 = 45−61 = −16 d2 = 70−61
= 9
Variance = d2
1+d
2
2+...+d
2
100
n−1and SD =
√Variance
For this example, SD = 9.5 years.
Summarizing Data – p. 19/31
Standard deviation (cont.)
40 50 60 70 80Age (years)
mean +/− 1 SD
mean +/− 2 SDs
The vertical lines in this figure show the locations of the mean± 1 SD and the mean ± 2 SDs. (The mean ± 3 SDs extendsbeyond the plotting region.) Of the 100 data points shown68/100 = 68% fall within 1 SD of the mean, 96/100 = 96% fallwithin 2 SDs of the mean, and 100/100 = 100% of the datapoints are within 3 SDs of the mean.
Summarizing Data – p. 20/31
SD and the normal distribution
This observation is not just a coincidence. In fact, all normallydistributed variables have the following properties:
• The mean is the center of the distribution
• 68% of all the variable’s values lie within 1 SD of themean
• 95% of all the variable’s values lie within 2 SDs of themean
• >99% of all the variable’s values lie within 3 SDs of themean
The percents reported on slide 20 don’t match thesetheoretical results exactly, because the figure on slide 20 isconstructed from sample data.
Summarizing Data – p. 21/31
Additional comments about SD
• The SD for a variable is often reported as “mean ± SD”
• Results regarding proportions of values within 1, 2 and 3SDs of the mean are true only for normally distributedvariables
• The SD is an appropriate measure of dispersion to reportonly for approximately normally distributed variables.
Summarizing Data – p. 22/31
What is wrong with this sentence?
“Mean treatment duration was 6.8 (±5.1) months in thebevacizumab plus fluoropyrimidine-cisplatin group ...” (Ohtsuet al., JCO 2011, page 4)
• 68% of subjects had treatment durations between 1.7months and 11.9 months.
• 95% of subjects had treatment durations between -3.4months and 17 months.
• >99% of subjects had treatment durations between -8.5months and 22.1 months.
Summarizing Data – p. 23/31
Other measures of dispersion
If a variable is not approximately normally distributed, then abetter measure of dispersion is one of the following:
• Interquartile range (IQR)
IQR = 75th percentile - 25th percentile
• Range = max - min
Typically, what is reported in the literature is not the IQR orrange, but the values used to construct that measure. Forexample, the range for age in the placebo arm of the trial isreported as 22 - 82 (Ohtsu et al., JCO 2011, page 3, Table 1).Using the exact definition, however, the range is 60.
Summarizing Data – p. 24/31
Continuous variable’s distribution - redux
For a unimodal continuous variable, the table belowsummarizes measures of central tendency and dispersionthat are appropriate to report depending on the distribution’sshape.
Center DispersionDistribution Mean Median SD IQR Range
Normal√ √ √ √ √
Non-normal√ √ √
Summarizing Data – p. 25/31
Time-to-event variables
A time-to-event variable is a special type of continuousvariable common in clinical research. Some examples are:
• Time to death◦ Measured from date of diagnosis or possibly date of
randomization in a trial◦ Called overall survival
• Time to disease progression◦ Measured from baseline measure of disease burden◦ Called progression free survival
• Time to disease recurrence◦ Measured from time point when patient is considered
disease free (e.g. post resection)◦ Called recurrence free survival
Summarizing Data – p. 26/31
Properties of time-to-event variables
• Event times are subject to censoring, defined as follows:◦ The event of interest is not observed for all subjects in
the study because ...EITHER
⊲ the subject has not yet had the event at the end ofthe studyOR
⊲ the subject is lost to follow-up (drops out of thestudy)
Summarizing Data – p. 27/31
Properties of time-to-event variables (cont.)
• Event time distributions tend to be positively skewed.◦ Some subjects have much longer event times than
others◦ Distribution is non-normal
Summarizing Data – p. 28/31
Kaplan-Meier estimates of overall survival (A) and progression-free survival (B) in the intention-to-treat population.
Ohtsu A et al. JCO 2011;29:3968-3976
©2011 by American Society of Clinical Oncology
Summarizing Data – p. 29/31
Interpreting the Kaplan-Meier curve
In Ohtsu et al., Figure 2A,
• median overall survival (OS) for subjects in the placeboarm is 10.1 months
• median OS for subjects in the intervention arm is 12.1months
In Ohtsu et al., Figure 2B,
• median progression free (PFS) for subjects in the placeboarm is 5.3 months
• median PFS for subjects in the intervention arm is 6.7months
Summarizing Data – p. 30/31
Describing a categorical variable’s distribution
When describing a categorical variable’s distribution, wegenerally report the frequency and percent of observationsoccurring in each level. For example, in Table 1 of the Ohtsuarticle, the number of males in the bevacizumab arm is 257,which represents 66% of those randomized to that arm.
Summarizing Data – p. 31/31