Biostatistics, part 1, Descriptive statistics: Key concepts

• Population, sample, and individual

• What kinds of data? Continuous vs. categorical

• How do we summarize data? Statistics (numerical summaries) and graphics.

• Measures of central tendency and dispersion

• Standard error and 95% confidence intervals

“Aristotle maintained that women have fewer teeth than men; although he was twice married, it never occurred to him to verify this statement by examining his wives’ mouths.” -- Sir Bertrand Russell, The Impact of Science on Society, 1952.

“It is a capital mistake to theorize before you have data.” -- Sir Arthur Conan Doyle, A Scandal in Bohemia.

And, for another viewpoint:

“If your experiment needs statistics, you ought to have done a better experiment.” -- Ernest Rutherford.

The bench science perspective: you can control all the variables! Clinicians, however, know better … human variation is large, and often inexplicable. Statistics help us describe it and generalize at least enough to improve our ability to practice medicine.

Populations, Samples, and Individuals

Aristotle speculated about the population of all women (compared to the population of men). He had immediately available to him a sample of two women, and he could have counted the number of teeth for two individuals.

The population is the collection of all people about whom you would like to ask a research question. This might be a fairly clear-cut, easily defined set of people:

“What proportion of people 65 or older in the US today have Alzheimer’s disease?”

Or it might be a more hypothetical group:

“How much of a reduction in symptomatic days could a person expect if treated with a new antiviral for flu?”

Typically, you can’t study everyone in the population.

You can’t afford to have everyone 65 or older in the US seen by a neurologist, even if you could find all the old people!

You can’t test everyone with the flu because the cases haven’t even occurred yet!

So you study a sample, and you try to generalize to the population. The sample size is the number of individuals in the sample (not the number of measurements you make on each person!).

A good study design will help make your sample representative of the population you are concerned about. Good statistical analysis will help tell you the best answer to your question about the population, and also how far off you might be.

All biostatistics begins with description. Before you do anything else, you look at the data and summarize the data. Our goal in this hour is to show you how to get a first look at the data and get ready to do more elaborate procedures. A statistic is just a numerical summary of the data, like the largest number in the data set.

Descriptive statistics should be clear and easily interpreted. They should not mislead you about the data they are summarizing.

“A habit of basing convictions upon evidence, and of giving to them only that degree of certainty which the evidence warrants, would, if it became general, cure most of the ills from which the world suffers.” -- Bertrand Russell

Looking at data: categorical or continuous?

Most data fall into two broad classes.

Continuous data are used to report a measurement of the individual that can take on any value within an acceptable range. For example, age, systolic BP, [K+], change in weight over 6 months.

Categorical data are used to report a characteristic of the individual that has a finite, usually small number of possibilities. The categories should be clear cut, not overlapping, and cover all the possibilities. For example, sex (male or female), vital status (alive or dead), disease stage (depends on disease), ever smoked (yes or no).

Make sure you are very clear about the definitions. Does “one cigarette and I didn’t inhale” count as smoking?

When designing a study, allow for missing values and refusals.

An example to work with:

A hypothetical clinical trial in small cell lung cancer (SCLC).

Often advanced when diagnosed, poor prognosis.

Many SCLC tumors express receptor tyrosine kinase, KIT.

Blocking KIT might help; previous trials show little benefit.

A novel drug: binds selectively to activated c-KIT receptor.

Preliminary results: may help reverse KIT action.

So we will examine a randomized, double-blind clinical trial of this new drug: BST-TIK.

Features of this study:

1. Not enough data to know whether we should restrict enrollment to patients whose tumors express KIT.

2. Design: double-blind, randomized, two-arm study. One arm is standard chemo (cisplatin, irinotecan), other is standard chemo plus BST-TIK. Total n=500.

3. Primary endpoint: overall survival.

4. Secondary endpoints: toxicity - major (neutropenia, thrombocytopenia); minor (diarrhea).

5. Possible markers: KIT expression before, after.

6. Demographics: age, sex.

Summarizing categorical data:

Frequency, proportion, percentages in categories.

Male: 321 (64.2%) (overall)

By arm of study:

Standard therapy: KIT expression: 67.6%

New therapy: KIT expression: 69.3%

Note: don’t carry every decimal place imaginable; one decimal place is plenty for percentages like these.

Note: categorizing continuous data loses information.
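A minimal sketch of the frequency-and-percentage summary, in Python. The list of individual patients is made up here to match the overall figure above (321 males out of n=500); the function name summarize_categories is ours, not part of the trial.

```python
from collections import Counter

def summarize_categories(values):
    """Return {category: (count, percent)}, percent rounded to one decimal."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 1)) for cat, n in counts.items()}

# Hypothetical patient-level data consistent with "Male: 321 (64.2%)", n=500.
sex = ["male"] * 321 + ["female"] * 179

summary = summarize_categories(sex)
for category, (n, pct) in summary.items():
    print(f"{category}: {n} ({pct}%)")
```

Rounding to one decimal place follows the note above about not carrying every decimal place imaginable.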

A second way to summarize categorical data: graphics

• Bar graphs for categories that are separate

• Histograms if you got categories by dividing up continuous data.

• Bars do not touch, histogram rectangles do touch.

Summarizing continuous data: Measures of central tendency

Measures of central tendency tell you in some sense where you might expect a “typical” person to be, in the middle of the data.

The mean is the arithmetic average. For example, if 3 people were in hospital 8, 10 and 30 days respectively, the mean time is 48/3 = 16 days. But if they were 8, 10 and 12, the mean is 30/3 = 10 days. Note: mean is sensitive to outliers!

The median is the value at which half the numbers are higher and half are lower. If number of individuals is odd, it is the middle value (rank (n+1)/2) and if number is even, it is average of two middle values. Note that median in both examples above is 10. Not sensitive to outliers!

A patient might want to know median; an insurer the mean.

The mode is the most common value; rarely used.
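The hospital-stay example can be checked with Python's standard statistics module. This is only a sketch; the variable names and the small list used for the mode are ours.

```python
import statistics

# The two hospital-stay examples from the text, in days.
with_outlier = [8, 10, 30]
without_outlier = [8, 10, 12]

mean_with = statistics.mean(with_outlier)
mean_without = statistics.mean(without_outlier)
median_with = statistics.median(with_outlier)
median_without = statistics.median(without_outlier)

print("means:", mean_with, mean_without)        # the outlier pulls the mean up
print("medians:", median_with, median_without)  # the median does not move

# With an even number of values, the median averages the two middle values.
median_even = statistics.median([8, 10, 12, 30])

# The mode is the most common value (hypothetical data).
most_common = statistics.mode([0, 0, 13, 15])
```

One long stay moves the mean from 10 to 16 days, while the median stays at 10.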

Measures of central tendency for baseline KIT expression:

Overall: Mean =10.3, median = 11.0, mode = 0.0.

Patients expressing KIT: Mean = 15.1, median 15.0, mode 13.0

If data are long tailed to right, mean will be > median, influenced by those high-valued outliers. Here mean roughly = median, a good sign that data are fairly symmetric. The picture looks more or less bell-shaped, after we take out those who do not express KIT.

Features to look for in pictures:

Symmetry vs. skewness

Short tails vs. outliers

Bell-shaped vs. very peaked, very flat, or multiple peaks.

Baseline KIT expression (patients expressing KIT)

Stem Leaf #
  34 000 3
  32 0 1
  30 000 3
  28 0000 4
  26 0000000000 10
  24 000000000 9
  22 000000000000000000000000 24
  20 000000000000000000000000000000000000 36
  18 00000000000000000000000000000000 32
  16 000000000000000000000000000000000 33
  14 000000000000000000000000000000000000000000000 45
  12 00000000000000000000000000000000000000 38
  10 00000000000000000000000000000000 32
   8 00000000000000000000000000 26
   6 0000000000000000 16
   4 000000000000000000000000 24
   2 0000 4
   0 00 2
     ----+----+----+----+----+----+----+----+----+

Another piece of the picture: measures of spread.

The simplest is the range, largest - smallest. Very sensitive to outliers. Almost worthless for doing any real statistics.

More useful: measures based on percentiles; the median is also known as the 50th percentile, because half the data are less than that value. The 25th and 75th percentiles are called the quartiles, because one-quarter and three-quarters, respectively, of the data fall below them. The difference between the quartiles is the inter-quartile range. Some epidemiologists also work with tertiles, quintiles, or deciles.
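A small sketch of the range, quartiles, and inter-quartile range in Python. The nine values are made up for illustration. Statistical packages use slightly different rules for interpolating percentiles, so the "inclusive" method chosen here is one reasonable option, not the only one.

```python
import statistics

# Hypothetical measurements, already sorted for readability.
data = [2, 4, 6, 8, 10, 12, 14, 16, 18]

# Range: largest minus smallest (very sensitive to outliers).
data_range = max(data) - min(data)

# n=4 cuts the data into quarters and returns the three cut points:
# 25th percentile, 50th percentile (the median), 75th percentile.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Inter-quartile range: distance between the quartiles.
iqr = q3 - q1

print("range:", data_range)
print("quartiles:", q1, q2, q3)
print("inter-quartile range:", iqr)
```

Asking for n=3, n=5, or n=10 instead gives the cut points for tertiles, quintiles, or deciles.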

The most useful measure for biostatistics work is the standard deviation. It is based on the average of the squared distances from the mean. (Then the square root is taken to make the units come out right - that is, same units as the original measurement.)
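The recipe can be written out step by step. One assumption to flag: the sketch divides the sum of squared distances by n - 1 rather than n, which is the usual convention for a sample (and what statistics.stdev does); the text only says "average". The hospital-stay numbers are reused from the earlier example.

```python
import math
import statistics

def sample_sd(values):
    """Standard deviation of a sample, using the n - 1 convention."""
    n = len(values)
    mean = sum(values) / n
    # Squared distance of each individual from the mean.
    squared_distances = [(x - mean) ** 2 for x in values]
    # "Average" squared distance (dividing by n - 1 for a sample).
    variance = sum(squared_distances) / (n - 1)
    # The square root brings the units back to those of the measurement.
    return math.sqrt(variance)

sd_tight = sample_sd([8, 10, 12])    # stays clustered around the mean
sd_spread = sample_sd([8, 10, 30])   # one long stay inflates the spread

print(round(sd_tight, 1), round(sd_spread, 1))
```

Like the mean, the standard deviation is sensitive to outliers: a single 30-day stay raises it sharply.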

Interpreting the standard deviation: related to bell-shaped curve

If your data are nicely behaved and follow a bell-shaped distribution curve (also known as the normal or Gaussian distribution), the standard deviation tells you a lot about how far any one individual might stray from the mean.

For a bell-shaped distribution:

About two-thirds (roughly 68%) of the individuals will lie between one standard deviation below the mean and one standard deviation above the mean.

About 95% of the individuals will lie between two standard deviations below the mean and two standard deviations above the mean.

Hardly anyone will ever fall outside three standard deviations above or below the mean.

Mean KIT expression was 15.1, SD 6.5.

We should find that two-thirds of the data fall between 8.6 and 21.6

We would expect to find around 5% out below 2.1 or above 28.1, and in fact this is just about right.

We have one outlier in this data set (more than 3 SD out).
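The arithmetic behind these ranges is easy to reproduce. The sketch below uses the reported mean and SD, and it assumes an exactly bell-shaped (normal) distribution to get the expected share inside each band; real data will only approximate these shares.

```python
from statistics import NormalDist

mean, sd = 15.1, 6.5            # reported for patients expressing KIT
standard_normal = NormalDist()  # mean 0, SD 1

bands = []
for k in (1, 2, 3):
    low, high = mean - k * sd, mean + k * sd
    # Share of a normal population within k SDs of the mean.
    share = standard_normal.cdf(k) - standard_normal.cdf(-k)
    bands.append((k, low, high, share))
    print(f"within {k} SD: {low:.1f} to {high:.1f}, about {share:.1%}")
```

The "two-thirds" and "95%" rules of thumb are rounded versions of the shares this prints.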

How accurate is our guess at the mean?

Suppose we’d like to say that mean baseline KIT expression is 15.1.

We haven’t seen ALL people with SCLC who express KIT, just 342 of them.

How sure can we be about that estimate of 15.1? Could we be off by 5? 1? How can you guess without studying all patients?

Answer: We can’t be completely sure about this group of 342 patients, but we know a lot about how the scientific process of taking a random sample and finding its average will behave. And we hope our sample reflects a somewhat “random” process!

Two key facts about our scientific process:

1. The means from random samples like ours are centered around the true population mean. That is, our process is unbiased.

2. The means from random samples like ours have approximately a bell-shaped distribution, and that distribution clusters more and more tightly around the true population mean as the sample size gets bigger. The more data you get, the more precise your guess at the population mean.
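Both facts can be seen in a small simulation. This is a sketch under a loud assumption: we pretend the population is exactly bell-shaped with mean 15.1 and SD 6.5, which lets us draw as many random samples as we like. The sample sizes (25 and 100) and the number of repetitions are arbitrary choices.

```python
import random
import statistics

# Assumption for illustration only: a normal population with these values.
POP_MEAN, POP_SD = 15.1, 6.5

def simulate_sample_means(sample_size, n_samples=1000, seed=0):
    """Draw n_samples random samples and return the mean of each one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean(rng.gauss(POP_MEAN, POP_SD) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

results = {}
for size in (25, 100):
    means = simulate_sample_means(size)
    # Center and spread of the collection of sample means.
    results[size] = (statistics.fmean(means), statistics.stdev(means))
    center, spread = results[size]
    print(f"n={size}: sample means center at {center:.2f}, spread {spread:.2f}")
```

The sample means center near the population mean at both sample sizes (fact 1), and they scatter less around it when each sample is larger (fact 2).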

The yardstick for how close the sample mean is to the truth is called the standard error. It is the standard deviation (how much a single individual might differ from the mean) divided by the square root of n. So the more data we have, the closer our sample mean should be to the truth, since almost all random samples will be very close to each other and to the true mean.

In our data set, the standard error is 6.5/√342 = 0.35.
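As a quick check of that arithmetic, here is the standard error computed as the standard deviation divided by the square root of the sample size. The function name standard_error is ours.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

# Values from the text: SD 6.5 among 342 patients expressing KIT.
se = standard_error(6.5, 342)
print(round(se, 2))
```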

In fact, since the means from all the random samples we might have gotten follow a bell-shaped distribution, we know that 95% of them should be within two standard errors of the truth.

So we guess that the truth is somewhere within two standard errors above our mean or two standard errors below it. We call this a 95% confidence interval.

For example, our estimate was 15.1 and our standard error turned out to be 0.35, so two standard errors would be 0.7. A 95% confidence interval for the mean would run from about 14.4 to 15.8. This is usually written (14.4, 15.8). We have 95% confidence that the population mean lies in this interval, because we are using a scientific procedure that works like that! We know that 95% of our studies will give us an interval that covers the truth (but we will be off in 5% of our studies).
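The interval is just the estimate plus or minus two standard errors. In the sketch below the function name and the multiplier parameter are ours; the multiplier of 2 used in the text is a convenient rounding of 1.96, the value from the normal distribution.

```python
def confidence_interval_95(mean, se, multiplier=2.0):
    """Approximate 95% confidence interval: mean +/- multiplier * SE."""
    margin = multiplier * se
    return mean - margin, mean + margin

# Values from the text.
low, high = confidence_interval_95(15.1, 0.35)
print(f"({low:.1f}, {high:.1f})")

# Using 1.96 instead of 2 gives the same interval to one decimal place.
low_exact, high_exact = confidence_interval_95(15.1, 0.35, multiplier=1.96)
print(f"({low_exact:.1f}, {high_exact:.1f})")
```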

Suppose we wanted to get a confidence interval that was narrower. We can improve our precision by increasing the sample size.

The more data you have, the more you know about the population, and the better your guesses about the population mean, or any other population characteristic of interest.

If we wanted to cut the width in half (make the study twice as precise), we would have to sample 4 times as many people.

The precision of our study only increases like the square root of n, not like n. So quadrupling the sample size only cuts the standard error in half.
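A short planning sketch makes the square-root rule concrete. It assumes the standard deviation stays at 6.5 whatever the sample size; the target standard error of 0.25 and the function names are made up for illustration.

```python
import math

SD = 6.5  # assumed to stay the same as the sample grows

def standard_error(sd, n):
    """Standard error of the mean for a sample of size n."""
    return sd / math.sqrt(n)

def required_n(sd, target_se):
    """Smallest sample size whose standard error is at most target_se."""
    return math.ceil((sd / target_se) ** 2)

# Quadrupling the sample size halves the standard error.
ratio = standard_error(SD, 342) / standard_error(SD, 4 * 342)
print("SE shrinks by a factor of", round(ratio, 2))

# Hypothetical goal: a standard error of 0.25.
n_needed = required_n(SD, 0.25)
print("patients needed:", n_needed)
```

Solving for n before the study starts is the kind of advance planning the next line refers to.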

So it takes some planning in advance to design a study that will meet your goals, for a reasonable cost.

Now a follow-up question. How likely is it that an individual in such a study would have a KIT expression as high as 21?

We estimate that typical patients in such a study average a KIT expression of 15.1 (95% CI, 14.4-15.8). How much higher or lower would a person have to be to seem “unusual”?

No, 21 is not unusual! No matter how well we know the MEAN, individuals don’t have to sit right on top of it. The standard error refers to how well we can estimate the center. The standard deviation refers to how well we can guess any individual.

So we wouldn’t find 21 especially surprising. We probably wouldn’t be surprised by any value between two standard deviations above and below the mean.
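To see the difference in numbers, the sketch below measures how far 21 is from the mean in standard deviations (the right yardstick for an individual) and in standard errors (the wrong one). The share of patients at or above 21 assumes a bell-shaped distribution with the reported mean and SD.

```python
from statistics import NormalDist

mean, sd, se = 15.1, 6.5, 0.35  # values from the text
value = 21

# Distance from the mean in standard deviations: less than one SD away.
z_individual = (value - mean) / sd

# Expected share of individuals at or above 21, under a normal model.
share_at_least = 1 - NormalDist(mean, sd).cdf(value)

# The same distance in standard errors looks enormous, but the standard
# error describes the uncertainty in the MEAN, not the spread of individuals.
distance_in_se = (value - mean) / se

print(f"{z_individual:.2f} SDs above the mean")
print(f"about {share_at_least:.0%} of patients expected at or above {value}")
print(f"{distance_in_se:.1f} standard errors above the mean")
```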

What will you find reported in the medical literature?

Most studies will summarize central tendency by the mean if the data look normal, and by the median otherwise.

Some papers will report the standard deviation, some the standard error, and some both. They are not always labeled! Be careful! People often show a graph with an “error bar”; could be either.

If the data are oddly behaved (skewed, multiple peaks, very long or short tails), people often report the median and percentiles instead of mean and standard deviation.

Keep your basic scientific question in mind:

Do you want to ask about the average or typical person? Or do you want to figure out how unusual your patient might be?
