summarizing data - people.musc.edupeople.musc.edu/~elg26/hcclectureseries/hem_onc_lecture1a.pdf ·...

Summarizing Data

Hematology/Oncology Fellows Lecture Series in Statistics

Elizabeth G. Hill, PhD

Associate Professor of Biostatistics

08 May 2014

Summarizing Data – p. 1/31

Objectives

At the end of the lecture, students will be able to

• Understand the difference between a sample and apopulation

• Define continuous and categorical variables

• Describe the distribution of a continuous variable usingcommon measures of central tendency and dispersion

• Describe the shape of the distribution of a continuousvariable using a histogram

• Describe properties of and interpret graphical summariesof time-to-event variables

• Describe the distribution of a categorical variable


Some basics

TARGET (STUDY) POPULATION: The patients of clinical orscientific interest. If a study is well-designed, anyconclusions drawn from the study generalize to the targetpopulation.

SAMPLE: The patients of clinical or scientific interest thatparticipate in a study. In a well-designed study, thesesubjects are carefully selected from the target populationand the sample is said to be representative.

VARIABLE: A quantity or trait of interest in the studypopulation. Variables of particular interest in a study aresometimes called primary endpoints or primary outcomevariables.

DATA: Quantitative measurements of variables obtained fromsubjects in the sample population.


Variable types

Depending on how its values are measured, a variable is(broadly) classified as either continuous or categorical.

• Continuous variables:◦ measured on a continuum◦ often have units of measure (e.g. months; kg/m2)◦ are quantitative - variable’s value reflects (on some

scale) that which is being measured

• Categorical variables:◦ values represent discrete ‘levels’ or ‘classes’ (e.g.

Male, Female)◦ value associated with class labels are not inherently

meaningful (e.g. 1 = Male, 2 = Female)◦ can be ordinal (i.e. ordered) or nominal


Variable types (cont.)

In the article by Ohtsu et al. (JCO 2011), there are a numberof continuous and categorical variables.

Variable Continuous (units) CategoricalNominal Ordinal

Sex Nominal

Age Continuous (years)

ECOG PS Ordinal

Geographic region Nominal

Primary tumor site Nominal

Overall survival Continuous (months)

Response Ordinal


Describing a continuous variable’s distribution

When we collect data on study subjects, one objective is todescribe how different variables are distributed in the targetpopulation.

For example, we might wish to summarize age in ourpopulation. We would therefore measure the ages of subjectsin our sample, and infer to the target population.

Q How do we generalize from sample data to targetpopulation?

A We use statistics.

A statistic is a summary measure based on sample data.Statistics can be displayed numerically or graphically. Whatstatistics should we construct to adequately summarize avariable’s distribution?


Describing shape

Consider a plot of ages of 100 subjects in a study. The pointsare jittered to enhance visibility. If we group the data infive-year age bands (35-40, 40-45, ... , 75-80, 80-85) webegin to see the underlying structure of the distribution of agein our population.

40 50 60 70 80Age (years)


Describing shape (cont.)

x

40 50 60 70 80

05

1015

2025

Age (years)

Fre

quen

cy

xxxx

xxxxxxxx

xxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxx

xxxxxxxxxx

xxxxxxx

x


Describing shape with a histogram

The figure shown on slide 8 illustrates the steps inconstructing a histogram. The histogram is a useful graphicaltool that succinctly depicts the distribution of a continuousvariable. The individual groupings of values are called bins.We complete the histogram by plotting contiguous bars withheights showing the frequency (or proportion) of themeasures occurring within each bin. For the figure shown onslide 8, the graph on slide 10 shows the completed histogram.

(N.b. histogram 6= bar chart)


Describing shape with a histogram (cont.)

Age (years)

Fre

quen

cy

40 50 60 70 80

05

1015

20


Shape terminology

A B C D

A: Approximately symmetric and unimodal

B: Positively skewed and unimodal

C: Negatively skewed and unimodal

D: Asymmetric and bimodal


The normal distribution

Symmetric unimodal distributions describe many variables weencounter in biologic and medical research. This distributionshape is so common, we call it the normal (or Gaussian)distribution. Normally distributed variables have manyimportant properties in addition to symmetry and uni-modality.By examining the shape of a continuous variable’s distributionwith a histogram, we can gauge whether or not that variableis approximately normally distributed.


Describing central tendency

Beyond shape, an obvious summary measure for acontinuous variable is one describing central tendency. Ameasure of central tendency answers the question:

“Around what value do measures tend to aggregate?”

The two most common measures of the center of acontinuous variable’s distribution are

• sample mean (= arithmetic average)

• sample median (= 50th percentile)

If we know a patient is a member of the target population, ourbest guess of their value of the variable of interest - in theabsence of any additional information - is the measure ofcentral tendency.


Example - Age central tendency

In the figures below, a sample of 100 ages are plotted. Thepoints are jittered to enhance viewing. The blue lines showthe locations of the sample mean, and the red lines show thelocations of the sample median.

30 40 50 60 70 80Age (years)

0 10 20 30 40Age (years)


Example - Age central tendency (cont.)

In the first example, both the mean and median adequatelydescribe the distribution’s center.

30 40 50 60 70 80Age (years)

In the second example, the median better describes thedistribution’s center. The mean over-estimates the center.

0 10 20 30 40Age (years)


Shape and central tendency

The difference between the distributions of age in the twofigures can be seen better using histograms. Both the meanand median are adequate measures of central tendency forsymmetric unimodal distributions. But only the median is anappropriate measure of central tendency if the data areskewed.

Age (years)

Freq

uenc

y

30 50 70

05

1015

20

Age (years)

Freq

uenc

y

0 20 40

010

2030


Describing dispersion

In the figures below, a sample of 100 ages are plotted for twodifferent populations. Both distributions are centered atroughly the same value - the mean is approximately 61 inboth figures - but the ‘spread’ of the values differs.

40 50 60 70 80Age (years)

40 50 60 70 80Age (years)


Standard deviation

Statistics that measure the spread of a distribution are calledmeasures of dispersion. The most familiar measure ofdispersion is called the standard deviation (abbreviated ‘SD’).Here is how we calculate SD.

1. Calculate the difference between each data point and thesample mean.

2. Sum the squared differences.

3. Standardize the sum by dividing by n− 1, where n is thenumber of data points. The result of these steps is calledthe variance.

4. The square root of the variance is the standard deviation.


Standard deviation (cont.)

Step 1 on Slide 18 is depicted below for two data points,Age

1= 45 and Age

2= 70.

40 50 60 70 80Age (years)

d1 = 45−61 = −16 d2 = 70−61

= 9

Variance = d2

1+d

2

2+...+d

2

100

n−1and SD =

√Variance

For this example, SD = 9.5 years.


Standard deviation (cont.)

40 50 60 70 80Age (years)

mean +/− 1 SD

mean +/− 2 SDs

The vertical lines in this figure show the locations of the mean± 1 SD and the mean ± 2 SDs. (The mean ± 3 SDs extendsbeyond the plotting region.) Of the 100 data points shown68/100 = 68% fall within 1 SD of the mean, 96/100 = 96% fallwithin 2 SDs of the mean, and 100/100 = 100% of the datapoints are within 3 SDs of the mean.


SD and the normal distribution

This observation is not just a coincidence. In fact, all normallydistributed variables have the following properties:

• The mean is the center of the distribution

• 68% of all the variable’s values lie within 1 SD of themean

• 95% of all the variable’s values lie within 2 SDs of themean

• >99% of all the variable’s values lie within 3 SDs of themean

The percents reported on slide 20 don’t match thesetheoretical results exactly, because the figure on slide 20 isconstructed from sample data.


Additional comments about SD

• The SD for a variable is often reported as “mean ± SD”

• Results regarding proportions of values within 1, 2 and 3SDs of the mean are true only for normally distributedvariables

• The SD is an appropriate measure of dispersion to reportonly for approximately normally distributed variables.


What is wrong with this sentence?

“Mean treatment duration was 6.8 (±5.1) months in thebevacizumab plus fluoropyrimidine-cisplatin group ...” (Ohtsuet al., JCO 2011, page 4)

• 68% of subjects had treatment durations between 1.7months and 11.9 months.

• 95% of subjects had treatment durations between -3.4months and 17 months.

• >99% of subjects had treatment durations between -8.5months and 22.1 months.


Other measures of dispersion

If a variable is not approximately normally distributed, then abetter measure of dispersion is one of the following:

• Interquartile range (IQR)

IQR = 75th percentile - 25th percentile

• Range = max - min

Typically, what is reported in the literature is not the IQR orrange, but the values used to construct that measure. Forexample, the range for age in the placebo arm of the trial isreported as 22 - 82 (Ohtsu et al., JCO 2011, page 3, Table 1).Using the exact definition, however, the range is 60.


Continuous variable’s distribution - redux

For a unimodal continuous variable, the table belowsummarizes measures of central tendency and dispersionthat are appropriate to report depending on the distribution’sshape.

Center DispersionDistribution Mean Median SD IQR Range

Normal√ √ √ √ √

Non-normal√ √ √


Time-to-event variables

A time-to-event variable is a special type of continuousvariable common in clinical research. Some examples are:

• Time to death◦ Measured from date of diagnosis or possibly date of

randomization in a trial◦ Called overall survival

• Time to disease progression◦ Measured from baseline measure of disease burden◦ Called progression free survival

• Time to disease recurrence◦ Measured from time point when patient is considered

disease free (e.g. post resection)◦ Called recurrence free survival


Properties of time-to-event variables

• Event times are subject to censoring, defined as follows:◦ The event of interest is not observed for all subjects in

the study because ...EITHER

⊲ the subject has not yet had the event at the end ofthe studyOR

⊲ the subject is lost to follow-up (drops out of thestudy)


Properties of time-to-event variables (cont.)

• Event time distributions tend to be positively skewed.◦ Some subjects have much longer event times than

others◦ Distribution is non-normal


Kaplan-Meier estimates of overall survival (A) and progression-free survival (B) in the intention-to-treat population.

Ohtsu A et al. JCO 2011;29:3968-3976

©2011 by American Society of Clinical Oncology


Interpreting the Kaplan-Meier curve

In Ohtsu et al., Figure 2A,

• median overall survival (OS) for subjects in the placeboarm is 10.1 months

• median OS for subjects in the intervention arm is 12.1months

In Ohtsu et al., Figure 2B,

• median progression free (PFS) for subjects in the placeboarm is 5.3 months

• median PFS for subjects in the intervention arm is 6.7months


Describing a categorical variable’s distribution

When describing a categorical variable’s distribution, wegenerally report the frequency and percent of observationsoccurring in each level. For example, in Table 1 of the Ohtsuarticle, the number of males in the bevacizumab arm is 257,which represents 66% of those randomized to that arm.


summarizing data - people.musc.edupeople.musc.edu/~elg26/hcclectureseries/hem_onc_lecture1a.pdf ·...

Documents