SAA 2023 COMPUTATIONAL TECHNIQUE FOR BIOSTATISTICS: Introduction & Descriptive Statistics

Upload: clifford-hawkins

Post on 26-Dec-2015


Page 1: SAA 2023 COMPUTATIONAL TECHNIQUE FOR BIOSTATISTICS

Introduction & Descriptive Statistics

Page 2: Introduction

Statistics - the technology used to describe and measure aspects of nature from samples.

Statistics lets us quantify the uncertainty of these measures.

Page 3: Introduction

Statistics is also about good scientific practice.

The history of statistics has its roots in biology.

Page 4: Sir Francis Galton

- Pioneer of fingerprint identification; studied the heredity of quantitative traits
- Regression & correlation
- Also: efficacy of prayer, attractiveness as a function of distance from London

Page 5: Karl Pearson

- Polymath
- Studied genetics
- Correlation coefficient
- χ² test
- Standard deviation

Page 6: Sir Ronald Fisher

- The Genetical Theory of Natural Selection
- Founder of population genetics
- Analysis of variance
- Likelihood
- P-value
- Randomized experiments
- Multiple regression
- etc., etc., etc.

Page 7: Statistical quotations

"There are three kinds of lies: lies, damn lies, and statistics." Benjamin Disraeli / Mark Twain

"It is easy to lie with statistics, but easier to lie without them." Frederick Mosteller

Page 8: Goals of statistics

- Estimation: infer an unknown quantity of a population using sample data
- Hypothesis testing: differences among groups; relationships among variables

Page 9: Introduction

Introduction to the basic concepts of statistics as applied to problems in biological science.

Goals of the course:

- Understand statistical concepts (population, sample, slope, significance, etc.);
- Identify appropriate methods for your data (e.g., one-sample, two-sample, paired t-test or independent t-test, one-way or two-way ANOVA);
- Select the correct MINITAB procedures to analyze data;
- Scientific reading and interpretation.

Page 10: Biostatistics

Why study Biostatistics?

- Statistical methods are widely used in the biological field;
- Examples are from the biological field, practical and useful;
- Focus is on application instead of mathematical derivation;
- Helps you evaluate a paper in an intelligent manner.

Statistics - the science and art of obtaining reliable results and conclusions from data that are subject to variation.

Biostatistics (Biometry) - the application of statistics to the biological sciences.

Page 11: Biostatistics

Why Computer Applications?

- Statistical methods are mostly difficult and complicated (ANOVA, regression, etc.);
- Advances in computer technology and statistical software development make the application of statistical methods much easier today than before;
- Software such as MINITAB takes time to learn.

Page 12: Is Biostatistics hard to study?

Factors that make it hard for some students to learn statistics:

The terminology is deceptive. To understand statistics, you have to understand that the statistical meanings of terms such as significant, error and hypothesis are distinct from the ordinary uses of these words.

Page 13: Is Biostatistics hard to study?

Statistics requires mastering abstract concepts. It is not easy to think about theoretical concepts such as populations, probability distributions, and null hypotheses.

Statistics is at the interface of mathematics and science. To really grasp the concepts of statistics, you need to be able to think about them from both angles.

Page 14: Is Biostatistics hard to study?

The derivation of many statistical tests involves difficult math. However, you can learn to use statistical tests and interpret the results even if you do not fully understand how they work. You only need to know enough about how the tools work to avoid using them in inappropriate situations.

Page 15: Is Biostatistics hard to study?

Basically, you can calculate statistical tests and interpret results even if you don't understand how the equations were derived, as long as you know enough to use the statistical tests appropriately.

Page 16: Questions about this course

- Is this course going to be hard? No. The concepts are easy and the procedures are clear.
- Why do we spend time on theoretical material? It helps in understanding the applications.
- Do we need to know all of it? You may not need all of it, but be prepared.

Page 17: Role of statistics in Biological Science

Science

1. Idea or question
2. Collect data / make observations
3. Describe data / observations
4. Assess the strength of evidence for / against the hypothesis

Statistics

1. Mathematical model / hypothesis
2. Study design
3. Descriptive statistics
4. Inferential statistics

Page 18: Contents of the course

Descriptive statistics
- Graph, table, mean and standard deviation

Inferential statistics
- Probability and distribution
- Hypothesis test
- Analysis of variance
- Correlation and regression analysis
- Other special topics

Page 19: Basic Concept - Data

Data: numerical facts, measurements, or observations obtained from an investigation or experiment aimed at answering a question.

Statistical analyses deal with numbers.

Page 20: Basic Concept - Quantitative

The usual type of measurement, such as height or weight. Measurements of quantitative variables carry information about 'amount'; we can calculate means, etc., and use them in calculations.

Page 21: Basic Concept - Qualitative

Carry information about category or classification, such as medical diagnosis, ethnic group, gender. We cannot calculate means as such, but can tabulate counts or frequencies and analyze frequencies.

Page 22: Basic Concept - Variable

Variable: a characteristic that can take on different values for different persons, places or things.

Statistical analyses need variability; otherwise there is nothing to study.

Examples: concentration of a substance, pH values obtained from atmospheric precipitation, birth weight of babies whose mothers are smokers, etc.

Page 23: Basic Concept

A variable is a characteristic measured on individuals drawn from a population under study.

Data are measurements of one or more variables made on a collection of individuals.

Page 24: Basic Concept - Type of Variable

Continuous variable
- Between any two values of the variable, there is another possible value
- Examples: height, weight, concentration

Discrete variable
- Values can only be integers
- Examples: number of people, plants, etc.

Page 25: Basic Concept

Continuous variables: can take any value to any degree of precision in a certain range - height, weight, temperature (?)

Page 26: Basic Concept

Discrete variables: can take only certain values or can only be measured to a certain degree of accuracy - e.g., number of children that a woman has delivered, number of teeth with fillings, blood pressure (?) - may be handled differently in analysis.

Page 27: Basic Concept

Independent Variable, Dependent Variable

We try to predict or explain a response (dependent) variable from an explanatory (independent) variable.

Page 28: Populations and samples

Page 29: Basic Concept

Populations <-> Parameters; Samples <-> Estimates

Page 30: Basic Concept - Nomenclature

                      Population (parameters)   Sample (statistics)
Mean                  $\mu$                     $\bar{x}$
Variance              $\sigma^2$                $s^2$
Standard deviation    $\sigma$                  $s$

Page 31: Basic Concept - Population

Population parameters are constants, whereas estimates are random variables, changing from one random sample to the next from the same population.

Page 32: Basic Concept - Population and Sample

Sample -> Population, Statistic -> Parameter

[Diagram: a sample is drawn from a population. The parameter predicts properties of the sample; the statistic is generalized to the population.]

Page 33: Basic Concept - Population

Population: a set or collection of objects we are interested in (finite or infinite).

Parameter: a descriptive measure associated with a variable of an entire population, usually unknown because the whole population cannot be enumerated.

For example: plant height under warming conditions; graduates of USIM; smokers in the world.

Example: number of people, plants, etc.

Page 34: Basic Concept - Population and Sample

- Population - the largest collection of values of a random variable in which we have an interest at a particular time, e.g. school children in Negeri Sembilan.

- Sample - a selected part of a population, e.g. Form Three girls, Form Five boys, etc.

Page 35: Basic Concept

A sample of convenience is a collection of individuals that happen to be available at the time.

Page 36: Basic Concept - Sampling

Sampling is the essence of statistical inference. Why?

Why sample? We cannot afford the time or money to record measurements on the entire population, and new members of the population may be entering all the time. We use statistical analysis of a sample to answer questions about a population: cancer patients, teenage boys, women after childbirth, etc.

Page 37: Basic Concept - Sampling

[Figure: sampling outcomes arranged as precise vs. imprecise and biased vs. unbiased.]

Page 38: Basic Concept

Bias is a systematic discrepancy between estimates and the true population characteristic.

Page 39: Basic Concept

Sampling error - the difference between the estimate and the average value of the estimate.

Page 40: Basic Concept

Larger samples on average will have smaller sampling error.
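The claim that larger samples have smaller sampling error can be illustrated with a small simulation. This is only a sketch: the population is simulated (hypothetical heights, not course data), and the function name `spread_of_sample_means` is made up for the illustration.

```python
import random
import statistics

def spread_of_sample_means(population, n, repeats=2000, seed=1):
    """Standard deviation of the sample mean across many random samples of size n."""
    rng = random.Random(seed)
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

# Hypothetical population: 10,000 simulated heights (cm), not data from the course.
pop_rng = random.Random(0)
population = [pop_rng.gauss(170, 10) for _ in range(10_000)]

small = spread_of_sample_means(population, n=10)
large = spread_of_sample_means(population, n=100)
print(f"n = 10:  spread of sample means = {small:.2f}")
print(f"n = 100: spread of sample means = {large:.2f}")
```

The estimates from samples of 100 scatter much less around the population mean than the estimates from samples of 10.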

Page 41: Basic Concept - Properties of a good sample

- Independent selection of individuals
- Random selection of individuals
- Sufficiently large

Page 42: Basic Concept - Sampling

So how do 'intervention studies' fit into this? Studies select a sample of the population (e.g., cancer patients) to study the effects of a new therapy and then make inferences about how the rest of the cancer patient population would react to the new therapy.

Page 43: Basic Concept - Sample

Sample: a small number of subjects from a population, used to make inferences about the population.

Random sample: a sample of size n drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected.

Statistic: a descriptive measure associated with a random variable of a sample.
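As a minimal sketch of drawing a random sample of size n from a population of size N, Python's standard `random.sample` selects without replacement so that every subset of size n is equally likely. The population below is hypothetical (individuals labelled 1 to 20), and the seed is fixed only to make the draw repeatable.

```python
import random

# Hypothetical population of N = 20 individuals, labelled 1..20.
population = list(range(1, 21))

rng = random.Random(42)               # fixed seed so the draw is repeatable
sample = rng.sample(population, 5)    # n = 5, drawn without replacement
print(sample)
```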

Page 44: Basic Concept - Random

Random variables: variables whose values arise from chance factors that cannot be predicted in advance, such as height or weight.

Race or age are 'fixed' variables, i.e., not random.

Page 45: Basic Concept - Random

In a random sample, each member of a population has an equal and independent chance of being selected.

Page 46: Descriptive Statistics

Graphical summaries
- Frequency distribution
- Histogram
- Stem and leaf plot
- Boxplot

Numerical summaries
- Location: mean, median, mode
- Spread: range, variance, standard deviation
- Shape: skewness, kurtosis

Page 48: Frequency Distribution - Discrete variables

Example: Number of grass plants, Mytilus edulis, found in 800 sample quadrats (1 m²) in an ecological study of grasses:

1, 4, 1, 0, 0, 1, 0, 0, 2, 3, 1, 2, 3, 1, 0, 2, 0, 1, 2,

…

1, 2, 3, 2, 1, 1, 0, 5, 0, 0, 1, 0, 1, 0, 2, 4, 7, 2, 1, 0

How is the plant number in a quadrat distributed?

Page 49: Frequency Distribution - Discrete variables

Table 1. The frequency, relative frequency and cumulative relative frequency of plant sedge in a quadrat.

Plants/quadrat (Xi)   Frequency (fi)   Relative frequency (fi/n*100)   Cumulative relative frequency
0                     268              33.500                          33.500
1                     316              39.500                          73.000
2                     135              16.875                          89.875
3                     61               7.625                           97.500
4                     15               1.875                           99.375
5                     3                0.375                           99.750
6                     1                0.125                           99.875
7                     1                0.125                           100.000
Total                 800              100.000

• frequency - the number of times a value occurs in the data (probability for a population).

• relative frequency - the % of the time that the value occurs (frequency/n × 100).

• cumulative relative frequency - the % of the sample that is equal to or smaller than the value (cumulative frequency/n × 100).
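The columns of Table 1 can be reproduced from the frequencies alone. The sketch below assumes only the frequencies listed in the table; the function name `frequency_table` is illustrative.

```python
# Frequencies f_i from Table 1: plants per quadrat -> number of quadrats.
freq = {0: 268, 1: 316, 2: 135, 3: 61, 4: 15, 5: 3, 6: 1, 7: 1}

n = sum(freq.values())  # total number of quadrats

def frequency_table(freq):
    """Return rows of (value, frequency, relative %, cumulative relative %)."""
    total = sum(freq.values())
    rows, cumulative = [], 0
    for value in sorted(freq):
        cumulative += freq[value]
        rows.append((value, freq[value],
                     100 * freq[value] / total,
                     100 * cumulative / total))
    return rows

table = frequency_table(freq)
for value, f, rel, cum in table:
    print(f"{value:>2} {f:>4} {rel:8.3f} {cum:8.3f}")
```

With raw counts per quadrat instead of a ready-made frequency list, `collections.Counter` from the standard library would give the `freq` mapping directly.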

Page 50: Histogram (Bar graph) and polygon

Histogram - graph of frequencies
- Can be used to visually compare frequencies
- Easier to assess the magnitude of differences than trying to judge numbers

Frequency polygon - similar to histogram

Fig. 1. Frequency distribution of plants in a quadrat.

Page 51: Frequency Distribution - Continuous variables

Grouping of continuous outcomes
- Examples: weight, height.
- Gives a better understanding of what the data show than individual values do

Example: Fiber length of a cotton (n=106)

Data:

27.5, 28.6, 29.4, 30.5, 31.4, 29.8, 27.6, 28.7, 27.6, …, 31.8, 32.0, 27.8

Page 52: Frequency Distribution - Continuous variables

Table 2. Frequency and relative frequency distribution of fiber length (mm) of a cotton variety (n=106)

Length (Xi, mm)   Frequency (fi)   Relative frequency (%)   Cumulative relative frequency (%)
27.0~27.5         1                0.943                    0.943
27.5~28.0         3                2.830                    3.774
28.0~28.5         6                5.660                    9.434
28.5~29.0         13               12.264                   21.698
29.0~29.5         18               16.981                   38.679
29.5~30.0         19               17.925                   56.604
30.0~30.5         17               16.038                   72.642
30.5~31.0         16               15.094                   87.736
31.0~31.5         6                5.660                    93.396
31.5~32.0         5                4.717                    98.113
32.0~32.5         2                1.887                    100.000
Total             106              100.000

Page 53: Frequency Distribution - Continuous variables

1. Calculate the range: R = max(X) - min(X) = 5.13
2. Set the number of intervals g and the interval range i. Some "rules" exist, but generally create 8-15 equal-sized intervals. Here g = 11 and i = R/(g-1) ≈ 0.5.
3. Set the intervals: L1 = min(X) - i/2 = 27.0, L2 = L1 + i = 27.5, …
4. Count the number of observations in each interval.
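The grouping steps above can be sketched in code. This is an illustration under stated assumptions: the full n=106 data set is not reproduced in the slides, so the example uses only the twelve values visible in the data excerpt, with g = 10 chosen so that the interval width comes out at 0.5. The function names are hypothetical.

```python
def class_limits(data, g):
    """Class limits from the slide's rule: i = R/(g-1), L1 = min(X) - i/2."""
    r = max(data) - min(data)           # range R
    i = r / (g - 1)                     # interval width
    first = min(data) - i / 2           # lower limit of the first interval, L1
    return i, [first + k * i for k in range(g + 1)]  # g intervals need g+1 limits

def count_per_interval(data, limits):
    """Count observations with limits[k] <= x < limits[k+1]."""
    counts = [0] * (len(limits) - 1)
    width = limits[1] - limits[0]
    for x in data:
        k = min(int((x - limits[0]) / width), len(counts) - 1)
        counts[k] += 1
    return counts

# Only the values visible in the slide excerpt, not the full data set.
lengths = [27.5, 28.6, 29.4, 30.5, 31.4, 29.8, 27.6, 28.7, 27.6, 31.8, 32.0, 27.8]

i, limits = class_limits(lengths, g=10)
counts = count_per_interval(lengths, limits)
for low, high, c in zip(limits, limits[1:], counts):
    print(f"{low:.2f}~{high:.2f}: {c}")
```

Starting the first interval half a width below the minimum keeps the smallest and largest observations away from the outer class limits.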

Page 54: Histogram (Bar graph) and polygon

Fig. 2. Frequency distribution in fiber length of a cotton.

[Figure: three panels. A histogram of frequency (0-20) by length class (27.0~27.5 to 32.0~32.5 mm); a frequency polygon of frequency (0-20) against length (27-33 mm); and a cumulative relative frequency curve (0-100%) against length (27-33 mm).]

Page 55: Histogram

A histogram is a way of summarising data that are measured on an interval scale (either discrete or continuous). It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It divides up the range of possible values in a data set into classes or groups. For each group, a rectangle is constructed with a base length equal to the range of values in that specific group, and an area proportional to the number of observations falling into that group. This means that the rectangles might be drawn of non-uniform height.

Page 56: Histogram

The histogram is only appropriate for variables whose values are numerical and measured on an interval scale. It is generally used when dealing with large data sets (>100 observations), when stem and leaf plots become tedious to construct. A histogram can also help detect any unusual observations (outliers), or any gaps in the data set.

Page 57: Histogram

Page 58: Stem and Leaf Displays

Another way to assess frequencies
- Does preserve individual measurement information, so not useful for large data sets
- Stem is the first digit(s) of the measurements; leaves are the last digit of the measurements
- Most useful for two-digit numbers, more cumbersome for three or more digits

Frequency by class:

20: X
30: XXX
40: XXXX
50: XX
60: X

Stem | leaf
2* | 1
3* | 244
4* | 2468
5* | 26
6* | 4
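A stem and leaf display for two-digit numbers is straightforward to build by splitting each value into its tens digit (stem) and units digit (leaf). The sketch below uses the eleven values that can be read back from the display on this slide; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Two-digit values read back from the display on this slide.
data = [21, 32, 34, 34, 42, 44, 46, 48, 52, 56, 64]

plot = stem_and_leaf(data)
for stem in sorted(plot):
    print(f"{stem}* | {''.join(str(leaf) for leaf in plot[stem])}")
```

Because the leaves keep the last digit of every observation, the original data can be recovered in order of magnitude from the display.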

Page 59: Stem and Leaf Plot

A stem and leaf plot is a way of summarising a set of data measured on an interval scale. It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient and easily drawn form.

Page 60: Stem and Leaf Plot

A stem and leaf plot is similar to a histogram but is usually a more informative display for relatively small data sets (<100 data points). It provides a table as well as a picture of the data and from it we can readily write down the data in order of magnitude, which is useful for many statistical procedures, e.g. in the skinfold thickness example below:

Page 61: Stem and Leaf Plot

We can compare more than one data set by the use of multiple stem and leaf plots. By using a back-to-back stem and leaf plot, we are able to compare the same characteristic in two different groups, for example, pulse rate after exercise of smokers and non-smokers.

Page 62: Summary

In practice, descriptive statistics play a major role
- Always the first 1-2 tables/figures in a paper
- The statistician needs to know about each variable before deciding how to analyze it to answer research questions

In any analysis, 90% of the effort goes into setting up the data
- Descriptive statistics are part of that 90%

Page 63: Descriptive Statistics - Measures of Location

Descriptive measure computed from population data - parameter

Descriptive measure computed from sample data - statistic

Most common measures of location
- Mean
- Median
- Mode
- Geometric mean, harmonic mean

Page 64: Arithmetic mean (population)

Suppose we have N measurements of a particular variable in a population. We denote these N measurements as:

X1, X2, X3, …, XN

where X1 is the first measurement, X2 is the second, etc.

Definition

More accurately called the arithmetic mean, it is defined as the sum of the measurements observed divided by the number of observations.

$$\mu = \frac{X_1 + X_2 + \dots + X_N}{N} = \frac{1}{N}\sum_{i=1}^{N} X_i$$

Page 65: Arithmetic mean (sample)

Sample: Suppose we have n measurements of a particular variable, taken from a population with N measurements. The n measurements are:

X1, X2, X3, …, Xn

where X1 is the first measurement, X2 is the second, etc.

Definition

$$\bar{x} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

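Page 66: Arithmetic mean (sample)

Some Properties of the Arithmetic Mean

Write the deviations from the mean as $x_i = X_i - \bar{x}$.

1. $\sum x_i = \sum (X_i - \bar{x}) = 0$

2. $\sum x_i^2 = \sum (X_i - \bar{x})^2 = \min$

Proof:

1. $$\sum x_i = \sum (X_i - \bar{x}) = \sum X_i - n\bar{x} = 0$$

2. Let $x' = \bar{x} + e$ be any other value. Then

$$\sum (X_i - x')^2 = \sum [(X_i - \bar{x}) - e]^2 = \sum [(X_i - \bar{x})^2 - 2e(X_i - \bar{x}) + e^2]$$

$$= \sum (X_i - \bar{x})^2 - 2e\sum (X_i - \bar{x}) + ne^2 = \sum (X_i - \bar{x})^2 + ne^2 \ge \sum (X_i - \bar{x})^2$$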
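Both properties can be checked numerically. The sketch below uses the ten tree ages from the worked example later in the slides; the offset `e = 2.5` is an arbitrary choice for the illustration.

```python
import statistics

ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]  # tree ages from the later example
x_bar = statistics.mean(ages)

def sum_sq(center):
    """Sum of squared deviations of the ages from an arbitrary centre."""
    return sum((x - center) ** 2 for x in ages)

# Property 1: deviations from the mean sum to zero (up to rounding error).
sum_dev = sum(x - x_bar for x in ages)

# Property 2: moving the centre away from the mean by e adds n * e**2.
e = 2.5
excess = sum_sq(x_bar + e) - sum_sq(x_bar)

print(sum_dev, excess, len(ages) * e ** 2)
```

The excess equals n·e², which is positive for any non-zero e, so the sum of squares is smallest when the centre is the mean.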

Page 67: Median

Frequently used if there are extreme values in a distribution or if the distribution is non-normal.

Definition: the value that divides the 'ordered array' into two equal parts.

- If there is an odd number of observations, the median Md is the (n+1)/2-th observation. Ex.: the median of 11 observations is the 6th observation.
- If there is an even number of observations, the median Md is the midpoint between the middle two observations. Ex.: the median of 12 observations is the midpoint between the 6th and 7th.

Page 68: Mode

Definition: the value that occurs most frequently in the data set.

Example: 2 3 4 5 3 4 5 6 7 5 3 2 5, mode Mo=5

- If all values are different, there is no mode
- There may be more than one mode (bimodal or multimodal)
- Not used very frequently in practice

Page 69: Example: Central Location

Suppose the ages of the 10 trees you are studying are: 34, 24, 56, 52, 21, 44, 64, 34, 42, 46

Then the mean age of this group is:

$$\bar{x} = \frac{1}{n}\sum X_i = (34+24+56+52+21+44+64+34+42+46)/10 = 417/10 = 41.7 \text{ years}$$

To find the median, first order the data:

21, 24, 34, 34, 42, 44, 46, 52, 56, 64

$$Md = \frac{1}{2}\left(X_{(10/2)} + X_{(10/2+1)}\right) = \frac{1}{2}(42 + 44) = 43 \text{ years}$$

The mode is 34 years, Mo=34 (occurred twice).

Means are commonly used.
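The same three measures can be obtained with Python's standard `statistics` module. This is a sketch for checking the hand calculation, not the MINITAB procedure used in the course.

```python
import statistics

ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]  # tree ages (years)

mean_age = statistics.mean(ages)      # sum of the ages divided by n
median_age = statistics.median(ages)  # midpoint of the 5th and 6th ordered values
mode_age = statistics.mode(ages)      # most frequent value

print(mean_age, median_age, mode_age)
```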

Page 70: Geometric mean

Used to calculate mean growth rate

Definition: the antilog of the mean of the log Xi

$$G = (X_1 X_2 \cdots X_n)^{1/n}$$

$$\log G = \frac{\log X_1 + \log X_2 + \dots + \log X_n}{n}$$

Page 71: Geometric mean

Example: Root growth at 25 °C, calculate mean growth rate (mm/d).

Day     Root length (mm)   Growth rate (Xi, mm/d)   log(Xi)
0       17
1       23                 1.352941176              0.131279
2       30                 1.304347826              0.115393
3       38                 1.266666667              0.102662
4       51                 1.342105263              0.127787
5       72                 1.411764706              0.149762
6       86                 1.194444444              0.077166
Total                      7.872270083              0.70405

$$\log G = \frac{0.7040}{6} = 0.1173, \quad G = \log^{-1} 0.1173 = 1.31 \text{ (mm/d)}$$
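The table can be reproduced in a few lines. As in the table, the growth rate is taken here as the ratio of each day's root length to the previous day's, and logarithms are base 10; this is a sketch for checking the arithmetic.

```python
import math

root_length = [17, 23, 30, 38, 51, 72, 86]  # mm on days 0..6, from the table

# Daily growth rate as the ratio of successive lengths (as in the table).
rates = [b / a for a, b in zip(root_length, root_length[1:])]

log_mean = sum(math.log10(r) for r in rates) / len(rates)  # mean of log10(Xi)
g = 10 ** log_mean                                         # antilog gives G

print(f"log G = {log_mean:.4f}, G = {g:.2f}")
```

Because the ratios multiply back to 86/17, G is also the sixth root of the overall growth ratio. From Python 3.8 onward, `statistics.geometric_mean(rates)` gives the same value.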

Page 72: Descriptive Statistics - Measures of Dispersion

Look at these two data sets:

Set 1: 100, 30, 20, 7, -20, -30, -100

Set 2: 10, 3, 2, 7, -2, -3, -10

If we calculate the mean:

Set 1: $n = 7$, $\bar{x} = 1$

Set 2: $n = 7$, $\bar{x} = 1$

How to measure dispersion (spread, variability)?

Page 73: Descriptive Statistics - Measures of Dispersion

Common measures
- Range
- Variance and standard deviation
- Coefficient of variation

Many distributions are well described by a measure of location and a measure of dispersion.

Page 74: Range

Range is the difference between the largest and smallest values in the data set:

R = Max(Xi) - Min(Xi)

It is heavily influenced by the two most extreme values and ignores the rest of the distribution.

Set 1: 100, 30, 20, 7, -20, -30, -100; R1 = 200

Set 2: 10, 3, 2, 7, -2, -3, -10; R2 = 20

Page 75: Variance and Standard Deviation - Population

Suppose we have N measurements of a particular variable in a population: X1, X2, X3, …, XN.

The mean is $\mu$. As $\sum (X_i - \mu) = 0$, we define:

$$\sigma^2 = \frac{1}{N}\left[(X_1 - \mu)^2 + (X_2 - \mu)^2 + \dots + (X_N - \mu)^2\right] = \frac{\sum (X_i - \mu)^2}{N}$$

as the variance (its unit is the unit of X squared), and

$$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$$

as the standard deviation.

Page 76: Variance and Standard Deviation - Sample

Suppose we have n measurements of a particular variable in a sample: X1, X2, X3, …, Xn.

The mean is $\bar{x}$. We define:

$$s^2 = \frac{\sum (X_i - \bar{x})^2}{n - 1}$$

as the mean square, or sample variance, and

$$s = \sqrt{\frac{\sum (X_i - \bar{x})^2}{n - 1}}$$

as the standard deviation.

Page 77: Variance and Standard Deviation

Corrected Sum of Squares (CSS)

$$SS = \sum (X_i - \bar{x})^2 = \sum X_i^2 - \frac{(\sum X_i)^2}{n}$$

$$s^2 = \frac{\sum (X_i - \bar{x})^2}{n - 1}$$

Degrees of freedom: $df = n - 1$

n-1 is used because if we know n-1 deviations, the nth deviation is known: the deviations have to sum to zero.

Page 78: Example

Suppose the ages of the 10 trees you are studying are: 34, 24, 56, 52, 21, 44, 64, 34, 42, 46. We calculated $\bar{x} = 41.7$.

Calculate the range, variance, standard deviation and CV.

No.     Xi    x_bar   Xi-x_bar   (Xi-x_bar)^2   Xi^2
1       34    41.7    -7.7       59.29          1156
2       24    41.7    -17.7      313.29         576
3       56    41.7    14.3       204.49         3136
4       52    41.7    10.3       106.09         2704
5       21    41.7    -20.7      428.49         441
6       44    41.7    2.3        5.29           1936
7       64    41.7    22.3       497.29         4096
8       34    41.7    -7.7       59.29          1156
9       42    41.7    0.3        0.09           1764
10      46    41.7    4.3        18.49          2116
Total   417           0          1692.1         19081

R = 64 - 21 = 43 y, s² = 1692.1/9 = 188.01 y², s = 13.71 y.
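The worked example can be checked with a short script. This sketch uses the corrected sum of squares shortcut, SS = ΣXi² - (ΣXi)²/n, and computes the CV as s divided by the mean, times 100; variable names are illustrative.

```python
import statistics

ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]  # tree ages (years)
n = len(ages)

value_range = max(ages) - min(ages)                    # R
ss = sum(x * x for x in ages) - sum(ages) ** 2 / n     # corrected sum of squares
variance = ss / (n - 1)                                # s^2, with n - 1 degrees of freedom
sd = variance ** 0.5                                   # s
cv = 100 * sd / statistics.mean(ages)                  # coefficient of variation, %

print(f"R = {value_range}, SS = {ss:.1f}, s2 = {variance:.2f}, "
      f"s = {sd:.2f}, CV = {cv:.1f}%")
```

The library function `statistics.variance(ages)` uses the same n - 1 divisor and agrees with the shortcut formula.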

Page 79: Coefficient of Variation

Measures relative variation rather than absolute variation such as the standard deviation.

Definition of C.V.:

$$CV = \frac{s}{\bar{x}} \times 100$$

Useful in comparing variation between two distributions
- Used particularly in comparing laboratory measures to identify those determinations with more variation

Page 80: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Set 1: 100, 30, 20, 7, -20, -30, -100

Set 2: 10, 3, 2, 7, -2, -3, -10

Calculate $\bar{x}$, $s^2$, s and CV.

| Set | x_bar | s^2 | s | CV (%) |
|---|---|---|---|---|
| 1 | 1 | 3773.7 | 61.4 | 6143 |
| 2 | 1 | 44.7 | 6.7 | 668 |

Example
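The two sets share the same mean but differ tenfold in spread, which the CV makes explicit. A short sketch, assuming the CV is reported in percent (100 × s / mean); the helper name `coefficient_of_variation` is hypothetical:

```python
import statistics

def coefficient_of_variation(values):
    """Return the sample CV in percent: 100 * s / mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * statistics.stdev(values) / mean

set_1 = [100, 30, 20, 7, -20, -30, -100]
set_2 = [10, 3, 2, 7, -2, -3, -10]

cv_1 = coefficient_of_variation(set_1)
cv_2 = coefficient_of_variation(set_2)

for name, data, cv in (("Set 1", set_1, cv_1), ("Set 2", set_2, cv_2)):
    print(name,
          statistics.mean(data),
          round(statistics.variance(data), 1),
          round(statistics.stdev(data), 1),
          round(cv))
```

Because both means equal 1, the CV in percent is simply 100 times s here. When the mean is close to zero, as in data that mix positive and negative values, the CV becomes very large and is hard to interpret.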

Page 81: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Descriptive method to convey information about measures of location and dispersion
- Box-and-Whisker plots

Construction of boxplot
- Box is IQR
- Line at median
- Whiskers at smallest and largest observations
- Other conventions can be used, especially to represent extreme values

Box Plots

Page 82: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

[Figure: box plots of increment in systolic B.P. (vertical axis, ticks from -20 to 40) for drugs 1 to 4 (horizontal axis)]

Box Plots

Page 83: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Box and Whisker Plot (or Boxplot)

A box and whisker plot is a way of summarising a set of data measured on an interval scale. It is often used in exploratory data analysis. It is a type of graph which is used to show the shape of the distribution, its central value, and variability. The picture produced consists of the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median.

Page 84: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Box and Whisker Plot (or Boxplot)

A box plot (as it is often called) is especially helpful for indicating whether a distribution is skewed and whether there are any unusual observations (outliers) in the data set.

Box and whisker plots are also very useful when large numbers of observations are involved and when two or more data sets are being compared.

Page 85: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Box and Whisker Plot (or Boxplot)

Page 86: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Box and Whisker Plot (or Boxplot)

Page 87: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Box and Whisker Plot (or Boxplot)

Page 88: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

5-Number Summary

A 5-number summary is especially useful when we have so many data that it is sufficient to present a summary of the data rather than the whole data set. It consists of 5 values: the most extreme values in the data set (maximum and minimum values), the lower and upper quartiles, and the median.

A 5-number summary can be represented in a diagram known as a box and whisker plot. In cases where we have more than one data set to analyse, a 5-number summary is constructed for each, with corresponding multiple box and whisker plots.
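A 5-number summary is straightforward to compute. The sketch below assumes Python 3.8+ and the standard-library `statistics.quantiles` with `method="inclusive"`; several quartile conventions exist, so other software may return slightly different quartiles for the same data. The function name is hypothetical, and the data are the tree ages from the earlier example.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # n=4 asks for the three cut points that split the data into quarters.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

ages = [34, 24, 56, 52, 21, 44, 64, 34, 42, 46]
print(five_number_summary(ages))
# prints: (21, 34.0, 43.0, 50.5, 64)
```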

Page 89: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Outlier

An outlier is an observation in a data set which is far removed in value from the others in the data set. It is an unusually large or an unusually small value compared to the others.

An outlier might be the result of an error in measurement, in which case it will distort the interpretation of the data, having undue influence on many summary statistics, for example, the mean.

Page 90: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Outlier

If an outlier is a genuine result, it is important because it might indicate an extreme of behaviour of the process under study. For this reason, all outliers must be examined carefully before embarking on any formal analysis. Outliers should not routinely be removed without further justification.
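The undue influence of an outlier on the mean is easy to demonstrate. In this sketch the measurements are made up for illustration, with 250 standing in for a recording error; the median is shown alongside as a summary that barely moves.

```python
import statistics

# Made-up measurements for illustration; 250 stands in for a recording error.
clean = [21, 24, 34, 34, 42]
with_outlier = clean + [250]

# How far each summary moves when the single outlier is added.
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)

print(f"mean shifts by {mean_shift}, median shifts by {median_shift}")
```

One aberrant value more than doubles the mean here, while the median stays where it was.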

Page 91: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Interpreting a Boxplot

Page 92: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Interpreting a Boxplot

The boxplot is interpreted as follows:
- The box itself contains the middle 50% of the data. The upper edge (hinge) of the box indicates the 75th percentile of the data set, and the lower hinge indicates the 25th percentile. The range of the middle two quartiles is known as the inter-quartile range.
- The line in the box indicates the median value of the data.

Page 93: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Interpreting a Boxplot

The boxplot is interpreted as follows:
- If the median line within the box is not equidistant from the hinges, then the data is skewed.
- The ends of the vertical lines or "whiskers" indicate the minimum and maximum data values, unless outliers are present, in which case each whisker extends at most 1.5 times the inter-quartile range beyond its hinge.
- The points outside the ends of the whiskers are outliers or suspected outliers.
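The whisker rule can be sketched in code. The example assumes that quartiles from `statistics.quantiles(..., method="inclusive")` stand in for the hinges and that the multiplier is 1.5; the function name and the readings are made up for illustration.

```python
import statistics

def whiskers_and_outliers(values, k=1.5):
    """Return (lower whisker end, upper whisker end, sorted outliers)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - k * iqr
    upper_limit = q3 + k * iqr
    # Whiskers stop at the most extreme observations inside the limits.
    inside = [x for x in values if lower_limit <= x <= upper_limit]
    outliers = sorted(x for x in values if x < lower_limit or x > upper_limit)
    return min(inside), max(inside), outliers

# Made-up readings; 120 lies far above the rest.
readings = [21, 24, 34, 34, 42, 44, 46, 52, 56, 120]
print(whiskers_and_outliers(readings))
# prints: (21, 56, [120])
```

Note that the whisker ends are actual observations, not the limits themselves: here the upper limit is 75.25 but the whisker stops at 56.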

Page 94: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Boxplot Enhancements

Beyond the basic information, boxplots sometimes are enhanced to convey additional information:
- The mean and its confidence interval can be shown using a diamond shape in the box.
- The expected range of the median can be shown using notches in the box.
- The width of the box can be varied in proportion to the log of the sample size.

Page 95: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Advantages of Boxplots

Boxplots have the following strengths:
- Graphically display a variable's location and spread at a glance.
- Provide some indication of the data's symmetry and skewness.
- Unlike many other methods of data display, boxplots show outliers.
- By placing a boxplot for each level of a categorical variable side-by-side on the same graph, one can quickly compare data sets.

Page 96: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Disadvantage of Boxplots

One drawback of boxplots is that they tend to emphasize the tails of a distribution, which are the least certain points in the data set. They also hide many of the details of the distribution. Displaying a histogram in conjunction with the boxplot helps in this regard, and both are important tools for exploratory data analysis.

Page 97: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Boxplot Example 1

Check location and variation shifts

Box plots are an excellent tool for conveying location and variation information in data sets, particularly for detecting and illustrating location and variation changes between different groups of data.

Sample Plot: This box plot reveals that machine has a significant effect on energy with respect to location and possibly variation.

Page 98: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Boxplot Example 1

Page 99: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Boxplot Example 1

This box plot, comparing four machines for energy output, shows that machine has a significant effect on energy with respect to both location and variation. Machine 3 has the highest energy response (about 72.5); machine 4 has the least variable energy response with about 50% of its readings being within 1 energy unit.

Page 100: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Boxplot Example 1

These MINITAB boxplots represent lottery payoffs for winning numbers for three time periods (May 1975-March 1976, November 1976-September 1977, and December 1980-September 1981).

Page 101: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Boxplot Example 1

The median for each dataset is indicated by the black center line, and the first and third quartiles are the edges of the red area, which is known as the inter-quartile range (IQR).

Page 102: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Boxplot Example 1

The extreme values (within 1.5 times the inter-quartile range from the upper or lower quartile) are the ends of the lines extending from the IQR. Points at a greater distance than 1.5 times the IQR from the nearer quartile are plotted individually as asterisks. These points represent potential outliers.

Page 103: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Boxplot Example 1

In this example, the three boxplots have nearly identical median values. The IQR is decreasing from one time period to the next, indicating reduced variability of payoffs in the second and third periods. In addition, the extreme values are closer to the median in the later time periods.

Page 104: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Boxplot Example 2

As shown in the figure, a line is drawn from the upper hinge to the upper adjacent value and from the lower hinge to the lower adjacent value. Every score between the inner and outer fences is indicated by an "o" whereas a score beyond the outer fences is indicated by a "*".
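The slide does not state where the fences lie. A common convention, assumed here, places the inner fences 1.5 × IQR and the outer fences 3 × IQR beyond the hinges. The sketch below also approximates the hinges by the quartiles from `statistics.quantiles(..., method="inclusive")`; the function name and the scores are made up for illustration.

```python
import statistics

def adjacent_values_and_marks(values, inner=1.5, outer=3.0):
    """Return the adjacent values and a plotting mark for each outside score."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    inner_lo, inner_hi = q1 - inner * iqr, q3 + inner * iqr
    outer_lo, outer_hi = q1 - outer * iqr, q3 + outer * iqr

    marks = []
    for x in sorted(values):
        if x < outer_lo or x > outer_hi:
            marks.append((x, "*"))   # beyond the outer fences
        elif x < inner_lo or x > inner_hi:
            marks.append((x, "o"))   # between inner and outer fences

    # Adjacent values: most extreme scores still inside the inner fences.
    inside = [x for x in values if inner_lo <= x <= inner_hi]
    return min(inside), max(inside), marks

# Made-up scores for illustration.
scores = [21, 24, 34, 34, 42, 44, 46, 52, 56, 90, 120]
print(adjacent_values_and_marks(scores))
# prints: (21, 56, [(90, 'o'), (120, '*')])
```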

Page 105: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Boxplot Example 2

It is often useful to compare data from two or more groups by viewing box plots from the groups side by side. The data from 2b are higher, more spread out, and have a positive skew. That the skew is positive can be determined by the fact that the mean is higher than the median and the upper whisker is longer than the lower whisker.

Page 106: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Boxplot Example 3

Although the medians are all roughly the same, you can see at a glance that the spread of each data set is different. The boxplot on the left shows data that appears to be distributed evenly. The median is in the middle of the rectangle, and the whiskers are about the same length. In addition, the plot contains no outside values. The median of the second plot from the left appears to be slightly off-center. The number of extreme values in this plot is a point of concern because it suggests that the data vary widely.

Page 107: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Boxplot Example 3

The third boxplot shows data that has less variation and spread than the other plots. The fourth boxplot shows data that is significantly upwardly-skewed. The median of this plot is closer to the top of the rectangle than to the bottom, and the upper whisker is longer than the bottom one. All the boxplots have approximately the same median, and the two boxplots on the left have approximately the same variation in the data.

Page 108: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Descriptive Statistics (Summary)

Graphical Summaries
- Frequency distribution
- Histogram
- Stem and Leaf plot
- Boxplot

Numerical Summaries
- Location - mean, median, mode
- Dispersion - range, variance, standard deviation
- Shape

Page 109: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Statistical software
- SAS
- SPSS
- Stata
- BMDP
- MINITAB

Graphical software
- Sigmaplot
- Harvard Graphics
- PowerPoint
- Excel

Software

Page 110: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Biostatistics

Page 111: SAA 2023 COMPUTATIONALTECHNIQUE FOR BIOSTATISTICS Introduction & Descriptive Statistics

Biostatistics