1 statistics (biostatistics, biometrics) is ‘the science of learning from sample data’ to make...

1

Statistics (Biostatistics, Biometrics) is ‘the science of learning from sample data’ to make inferences about the populations that were sampled

Incorrect definition of statistics?

Data (singular: datum): numerical values of a variable recorded on observations; they constitute the building blocks of statistics

Variable: A characteristic or an attribute that takes on different values.

It can be:

Response (dependent) or Explanatory (independent) variable

It can be:

Quantitative or Qualitative variable

2

Recognition of the kinds of variables is crucial in choosing appropriate statistical analysis

Examples of Response and Explanatory variables (when appropriate)?

Examples of quantitative and qualitative variables?

3

Quantitative variables:

Convey amount

1. Discrete (counts, frequencies): number of patients, number of species
- there are gaps in the values (zero or positive integers)

2. Continuous (measurements):
a. ratio (with a natural zero origin: mass, height, age, income, etc.)
b. interval (no natural zero origin: calendar dates, ºF or ºC, etc.)
- no gaps in the values
- measuring precision is limited by the precision of the measuring device, which results in recording the values as discrete

4

Qualitative (categorical) variables:

Convey attributes; cannot be measured in the usual sense but can be categorized

1. Nominal: categories are mutually exclusive and collectively exhaustive (gender, race, religion, etc.)
- can be assigned numbers but cannot be ordered

2. Ordinal: categorical differences exist, which can be:
- numbered and ordered, but the distances between values are not equal (cold, cool, warm, hot; low, medium, high; sick, normal, healthy; depressed, normal, happy; etc.)

Note: both nominal and ordinal variables are discrete

5

Source of Data and Type of Research:

1. Observational studies

Examples?

2. Experimental studies

Examples?

Field, mesocosm, greenhouse, laboratory

Recognition of the kind of study, and of the way the study units are selected and treated, is crucial in:

- making statistical inference

- establishing correlational or causal relationships

6

Data can be used to perform:

1. Descriptive Statistics

2. Statistical Modeling

3. Inferential Statistics--Hypothesis Testing

7

Inferential Statistics

• Inference from one observation to a population

e.g. 1, comparing the body temperature of one bird with the mean body temperature of a bird species

- test statistics?

a) population parameters are known

b) population parameters not known

8


• Inference from several observations to a population

e.g. 2, comparing the body temperature of three birds with the mean body temperature of a bird species

- test statistics?

a) population parameters are known

b) population parameters not known

9


• Comparing two or more sets of observations to test whether or not they belong to different populations

e.g. 3, comparing starting salaries of females and males in several organizations to see if their starting salaries differ

- do we know the populations’ parameters?

- can they be estimated?

- what would be the test statistics?

- what would be the scope of inference?

- why do we need Inferential Statistics to do so?

10

What do we mean by ‘inference’?

An inference is a conclusion that patterns in the data are present in some broader context

A statistical inference is one justified by a probability model linking the data to the broader context

11

• Inference from a sample to its parent population, or from samples to compare their parent populations, can be drawn from observational studies

- such inference is optimally valid if the sampling is random

- results from observational studies cannot be used to establish causal relationships, but are still valuable in suggesting hypotheses and the direction of controlled experiments

• Inference to draw causal relationships can be drawn from randomized, controlled experiments, and not from observational studies

12

[Figure: Statistical inference permitted by study designs (adapted from Ramsey and Schafer, 2002); labels: “Inferences to populations can be drawn” and “Causal inferences can be drawn”]

13

Population (probability) Distribution:

•Discrete

1. Binomial (Bernoulli)

2. Multinomial: a generalized form of binomial where there are > 2 outcomes

3. Uniform: similar to multinomial except that the probability of occurrence is equal for all outcomes

4. Hypergeometric: similar to binomial except that the probability of occurrence is not constant from trial to trial (sampling without replacement; dependent trials)

5. Poisson: similar to binomial, but p is very small and the number of trials (s) is very large, such that s·p approaches a constant
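
The link between the binomial and the Poisson can be checked numerically. The sketch below is only an illustration: the values s = 1000 and p = 0.002 are assumptions chosen so that s·p = 2, and the helper names are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial variable with n trials and probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Assumed illustration values: many trials (s), very small p, s*p = 2
s, p = 1000, 0.002
lam = s * p

# Print k, the binomial probability, and the Poisson probability side by side
for k in range(6):
    print(k, round(binomial_pmf(k, s, p), 4), round(poisson_pmf(k, lam), 4))
```

With these values the two columns should agree closely, which is the sense in which the Poisson is a limiting case of the binomial.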

14


•Discrete (previous slide)

•Continuous

1. Uniform: similar to the discrete uniform but the number of possible outcomes is infinite

2. Normal (Gaussian):

15


•Discrete

1. Binomial (Bernoulli):

a. in a series of trials a variable (x) takes on only 2 discrete outcomes (probability of occurrence: p)

b. trials are independent (sampling with replacement) and p is constant from trial to trial (e.g., probability distribution of having 1, 2, or 3 girls in 10 families each with 3 kids)

c. probability of occurrence can be equal or unequal

16

1. Discrete Binomial Distribution

Probability (p) distribution for the number of smokers in a group of 5 people (n = 5; p of smoking = 0.2, 0.5, or 0.8)

• Tabular presentation

# of smokers   p = 0.2   p = 0.5   p = 0.8
0              0.328     0.031     0.000
1              0.410     0.156     0.006
2              0.205     0.312     0.051
3              0.051     0.312     0.205
4              0.006     0.156     0.410
5              0.000     0.031     0.328

• Graphical presentation

[Figure: three bar charts, one each for p = 0.2, p = 0.5, and p = 0.8; x-axis: # of smokers (0 to 5)]
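
A minimal sketch of how such a binomial table can be computed, assuming the usual binomial probability formula C(n, k) p^k (1 - p)^(n - k). The function and variable names are illustrative only.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k occurrences in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

n = 5
probs = (0.2, 0.5, 0.8)

# One column of probabilities per value of p, for 0..n smokers
table = {p: [binomial_pmf(k, n, p) for k in range(n + 1)] for p in probs}

print("# of smokers   " + "   ".join(f"p = {p}" for p in probs))
for k in range(n + 1):
    print(f"{k:<14} " + "   ".join(f"{table[p][k]:7.3f}" for p in probs))
```

Each column sums to 1, and the p = 0.8 column is the p = 0.2 column read in reverse, because k smokers at p = 0.2 is as likely as n - k smokers at p = 0.8.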

17

3. Discrete Uniform Distribution

Probability (p) distribution for one toss of a die

• Tabular presentation

toss p

1 1/6

2 1/6

3 1/6

4 1/6

5 1/6

6 1/6

• Graphical presentation

[Figure: bar chart of p (1/6 for every outcome) against toss (1 to 6)]

18

• Continuous

1. Uniform

2. Normal (Gaussian): A bell-shaped symmetric distribution

19

The Normal distribution is central to the theory and practice of parametric inferential statistics because

a. Distributions of many biological and environmental variables are approximately normal

b. When the sample size (number of independent trials) is large, or p and q (q = 1 - p) are similar, other distributions such as the Binomial and Poisson can be approximated by the Normal distribution

c. The distribution of the means of samples taken from a population

i. is normal when samples are taken from a normal population

ii. approaches normality as the size of samples (n) taken from non-normal populations increases (Central Limit Theorem, CLT)

- implication of CLT?

20

• The normal distribution is a mathematical function (may you observe it in real life??) defined by the following equation:

$$Y = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(X_i - \mu)^2 / (2\sigma^2)}$$

where:

Y : height of the curve for a given $X_i$

e : 2.718

$\pi$ : 3.142

$\mu$ : arithmetic mean, measure of central tendency

$\sigma$ : measure of the dispersion (variability) of the observations around the mean

The last two characteristics are the two unknown parameters that shape the distribution
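
The equation can be evaluated directly. The sketch below assumes a mean of 0 and a standard deviation of 1 purely for illustration; `normal_pdf` is a hypothetical helper name.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height Y of the normal curve at x, for mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Assumed parameters, chosen only for illustration
mu, sigma = 0.0, 1.0

peak = normal_pdf(mu, mu, sigma)          # the curve is highest at the mean
left = normal_pdf(mu - 1.5, mu, sigma)    # equal heights at equal distances
right = normal_pdf(mu + 1.5, mu, sigma)   # on either side of the mean
print(peak, left, right)
```

Changing `mu` shifts the curve along the axis; changing `sigma` makes it narrower and taller or wider and flatter.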

21

$\mu$: the mean, a measure of central tendency

- calculated as the arithmetic average

- is a parameter

- is constant

- does not indicate the variability within a population

- when not known, estimated by $\bar{y}$ (a statistic) from unbiased sample(s)

22

$\sigma^2$: Variance, a measure of variability (dispersion)

calculated as: $\sigma^2 = \sum (y_i - \mu)^2 / N$ (definition formula)

or $\sigma^2 = [\sum y_i^2 - (\sum y_i)^2 / N] / N$ (calculation formula)

- its unit is the squared unit of the variable of interest

- also called mean square of error (why?)

- estimated best by $S^2$ (a statistic) from unbiased sample(s), calculated as: $S^2 = \sum (y_i - \bar{y})^2 / (n-1)$

or $S^2 = [\sum y_i^2 - (\sum y_i)^2 / n] / (n-1)$

-- what do we call $[\sum y_i^2 - (\sum y_i)^2 / n]$?

-- is it a good measure of variability?

-- what do we call ‘n-1’?

So, $S^2 = SS / df$

(reason to call it ………………….. – MSe)

23

- to get a measure of variability in the unit of the variable of interest, we take the square root of the variance and call it the Standard Deviation

-- it is denoted by $\sigma$ (a parameter) for a population, and estimated best by S or SD (a statistic) from unbiased sample(s)

-- it is a rough (not exact) measure of the average absolute deviation from the mean

24

• Important properties of a normal distribution:

Because the normal distribution is symmetric and characterized by a mean $\mu$ and a standard deviation $\sigma$, the following are true:

a. the total area under the normal curve is 1 or 100%

b. half of the population (50%) is greater and half (50%) is smaller than $\mu$

c. 68.27% of the observations are within $\mu \pm 1\sigma$

d. 95.45% of the observations are within $\mu \pm 2\sigma$

e. 99.73% of the observations are within $\mu \pm 3\sigma$

- what do the above statements mean in terms of probability (e.g., probabilistic status of one observation falling on the mean, on any boundaries, or anywhere within or outside a boundary)?
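
The areas within a given number of standard deviations can be computed from the error function in Python's standard library. This is a sketch; `area_within` is a hypothetical helper name.

```python
import math

def area_within(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {100 * area_within(k):.2f}%")
```

The result does not depend on the particular mean or standard deviation, because the distance from the mean is expressed in standard deviation units.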

25

Standard Normal Distribution: the distribution of the Z values, where:

$Z = (X_i - \mu) / \sigma$

How many Normal Distributions may you find?

How many Standard Normal Distributions may you find?

What are the properties of the Standard Normal Distribution?

A Z table in a stat book shows the proportion of the population beyond the calculated Z value
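
What a Z table reports can be reproduced in a few lines. The numbers below (an observation of 41.0, a population mean of 40.0, and a standard deviation of 0.5) are hypothetical values chosen for illustration, not data from these slides.

```python
import math

def z_score(x, mu, sigma):
    """Standardize an observation: distance from the mean in SD units."""
    return (x - mu) / sigma

def upper_tail(z):
    """Proportion of the standard normal population beyond (above) z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical observation and population parameters
z = z_score(41.0, 40.0, 0.5)
tail = upper_tail(z)
print(z, round(tail, 4))
```

For a negative Z the lower tail follows by symmetry: the proportion below -z equals the proportion above z.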

26

To study populations, it is usually not feasible to measure the entire population of N members.

Why not sample the entire population?

Real (finite and infinite) and Imaginary populations?

27

Therefore, we draw sample(s) to represent the parent populations

A sample is a subset of a population, drawn and analyzed to infer conclusions regarding the population. Its size is usually denoted by n.

Sampling can be done:

1. With replacement

2. Without replacement (the norm in practice)

28

• To infer valid (unbiased) conclusions regarding a population from a sample, the sample must represent the entire population.

• For a sample to represent the entire population, it is best to be drawn RANDOMLY.

• A sample is random when each and every member of the population has an equal and independent chance of being sampled (exceptions?).

• Random sampling, on average:

- represents the parent population

- prevents known and unknown biases from affecting the selection of observations, and thus,

- allows the application of the laws of probability in drawing statistical inferences

29

Descriptive Statistics

• Data organization, summarization and presentation

1. Tabular (tally, simple, relative, relative-cumulative, and cumulative frequencies)

2. Graphical (histograms, polygons)

Suppose we have the following random sample of creativity scores:

Case   Score   Case   Score
1      26.7    13     29.7
2      12.0    14     20.5
3      24.0    15     12.0
4      13.6    16     20.6
5      16.6    17     17.2
6      24.3    18     21.3
7      17.5    19     12.9
8      18.2    20     21.6
9      19.1    21     19.8
10     19.3    22     22.1
11     23.1    23     22.6
12     20.3    24     19.8

30

Steps in data organization, summarization and presentation (Geng and Hills, 1985)

1. Determine the range (R):

largest - smallest, R = 29.7 - 12.0 = 17.7

2. Determine the number of classes (k) into which data are to be grouped

a. 8 - 20 classes is often recommended
- too few: information loss
- too many: too expensive (time, etc.)

b. can be calculated based on Sturges’ rule as:

$k = 1 + 3.3 \log_{10} n$, where n = number of cases; in our case $1 + (3.3)(\log_{10} 24) = 5.55$, and thus k should be at least 6, and this is what we use
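
The two calculations above can be sketched as follows, assuming the logarithm in Sturges' rule is base 10 (which reproduces the 5.55 on the slide). The function name is hypothetical.

```python
import math

def sturges_classes(n):
    """Return the raw Sturges value and the smallest whole number of classes covering it."""
    raw = 1 + 3.3 * math.log10(n)
    return raw, math.ceil(raw)

largest, smallest, n_cases = 29.7, 12.0, 24

data_range = largest - smallest          # R
raw_k, k = sturges_classes(n_cases)      # number of classes
print(round(data_range, 1), round(raw_k, 2), k)
```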

31

Steps in data organization, summarization and presentation (Geng and Hills, 1985)

3. Select a class interval (difference between upper and lower class boundaries; R/k may be used, we use 3)

4. Select the lower boundary of the lowest class and add the interval successively to it until all data are classified

- to avoid data falling on the boundaries, the boundaries are usually expressed to a half unit beyond the measurement accuracy

- in our case the measurement accuracy is 0.1, and so the class boundaries would, for example, be expressed as 11.05-14.05.

5. Arrange the table as follows:

32

Class   Boundaries    Mid-point   Tallied freq.   Simple freq.   Relative freq.   Rel. cum. freq.   Cum. freq.
1       11.55-14.55   13.05       IIII            4              0.167            0.167             4
2       14.55-17.55   16.05       III             3              0.125            0.292             7
3       17.55-20.55   19.05       IIII IIII       8              0.333            0.625             15
4       20.55-23.55   22.05       IIII III        7              0.292            0.917             22
5       23.55-26.55   25.05       I               1              0.042            0.958             23
6       26.55-29.55   28.05       I               1              0.042            1.000             24
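
Once the simple frequencies are tallied, the remaining columns follow mechanically. The sketch below starts from the simple frequencies and the lowest boundary shown in the table (4, 3, 8, 7, 1, 1 and 11.55, with an interval of 3) and derives the other columns. The variable names are illustrative.

```python
from itertools import accumulate

lower_boundary = 11.55           # lower boundary of the lowest class
interval = 3                     # class interval
simple = [4, 3, 8, 7, 1, 1]      # simple frequencies, as tallied in the table

n = sum(simple)
cumulative = list(accumulate(simple))   # running total of simple frequencies

rows = []
for i, freq in enumerate(simple):
    lo = lower_boundary + i * interval
    hi = lo + interval
    rows.append({
        "class": i + 1,
        "boundaries": (round(lo, 2), round(hi, 2)),
        "midpoint": round((lo + hi) / 2, 2),
        "simple": freq,
        "relative": round(freq / n, 3),
        "rel_cum": round(cumulative[i] / n, 3),
        "cum": cumulative[i],
    })

for row in rows:
    print(row)
```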

33

[Figure: Histogram of Class Scores; x-axis: Score Classes (1 to 6); y-axis: Frequency]

34

[Figure: Polygon of Class Scores; x-axis: Score Classes (1 to 6); y-axis: Frequency]

35


[Figure: x-axis: Score Classes (1 to 6); y-axis: Relative Frequency]

36

[Figure: Histogram of Class Scores; x-axis: Score Classes (1 to 6); y-axis: Relative Frequency]

37


[Figure: x-axis: Score Classes (1 to 6); y-axis: Rel. Cum. Frequency]

38


[Figure: x-axis: Score Classes (1 to 6); y-axis: Cumulative Frequency]

39

Stem-and-Leaf Diagram

A cross between a table and a graph

12 | 0 0 9
13 | 6
14 |
15 |
16 | 6
17 | 2 5
18 | 2
19 | 1 3 8
20 | 3 5 6
21 | 3 6
22 | 1 2 6
23 | 1
24 | 0 3
25 |
26 | 7
27 |
28 |
29 | 7

40

Steps in Creating Stem-and-Leaf Diagram

1. Arrange the data in increasing order

2. First, write the whole numbers as the stem

3. Then, write the numbers after decimals, in increasing order, as leaves

Advantages:

1. Ease of construction

2. Depiction of individual numbers, min., max., range, median, and mode

3. Depiction of center, spread, and shape of distribution

Disadvantages:

1. Difficulty in comparing distributions when they have very different ranges

2. Difficulty in comprehension and construction when the sample size is very large

41

Measures of a Dataset that are Important in Descriptive / Inferential Statistics

1. Measure of Central Location (Tendency)

1.1. Mode: the value with highest frequency
- there may be no mode, one mode, or several modes
- not influenced by extremes
- cannot be involved in algebraic manipulation
- not very informative

1.2. Median: the middle value when data are arranged in order of magnitude
- not influenced by extremes (e.g., useful in economics when extremes should be disregarded)
- not involved in algebraic manipulation
- if n is odd, is the middle value when data are ordered
- if n is even, is the average of the two middle values when data are ordered

42

1.3. Mean: arithmetic average, denoted by:

a. $\mu = \sum X_i / N$: a parameter, for a population, which is best estimated by:

b. $\bar{x}$: a statistic, from a sample or samples

- most frequently used in statistics and subject of algebraic manipulation

- is the best estimate of $\mu$ if the sample is unbiased (representative of its population); the sample is unbiased if it is drawn randomly, not otherwise

- the mode, the median, and the mean are the same when the distribution is perfectly symmetrical

- the units for the mode, median, and the mean are the same as the unit of the variable of interest

- the mean does not indicate the variability of a dataset; e.g., consider the following three sets of data with a common mean:

22, 24, 26
20, 24, 28
16, 24, 32
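
A short sketch using Python's standard `statistics` module shows that the three datasets share the same mean and median while differing in spread. The sample standard deviation is used here as the measure of variability.

```python
import statistics

datasets = [[22, 24, 26], [20, 24, 28], [16, 24, 32]]

# (mean, median, sample standard deviation) for each dataset
summary = [
    (statistics.mean(d), statistics.median(d), statistics.stdev(d))
    for d in datasets
]

for d, (mean, median, sd) in zip(datasets, summary):
    print(d, "mean =", mean, "median =", median, "SD =", sd)
```

The mean and median are 24 in every case, while the standard deviation doubles from one dataset to the next.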

43

2. Measure of variability--dispersion

2.1. Range: the largest value - the smallest value (R)
- not very informative, often affected by extremes

2.2. Variance (mean square*), represented by two symbols [$\sigma^2$, sigma square(d), and $S^2$]:

a. $\sigma^2$: represents the variance of a population; it is a parameter, constant for a given population, quantified as the sum of the squared deviations of individual members from their mean divided by the population size (finite):

$\sigma^2 = [(X_1 - \mu)^2 + (X_2 - \mu)^2 + … + (X_N - \mu)^2] / N = \sum (X_i - \mu)^2 / N$

$\sigma^2$ is estimated best by:

44

2.2.b. $S^2$: the variance of a sample; it is a statistic, which varies from sample to sample taken from a given population, and is calculated as the sum of the squared differences between the sampled individuals and their mean divided by the sample size minus one:

$S^2 = \sum (x_i - \bar{x})^2 / (n-1)$,

which is the definition formula and can be reduced to a computation formula as:

$S^2 = [\sum x_i^2 - (\sum x_i)^2 / n] / (n-1)$

- easier to calculate
- scientific calculators provide the components
- more accurate because no rounding of numbers is involved

45

Notes:

- $S^2$ is the best estimate of $\sigma^2$ if the sample is unbiased (representative of its population); the sample is unbiased if it is drawn randomly, not otherwise

- the unit of the variance is the square of the unit of the variable of interest

- the quantity “$\sum x_i^2 - (\sum x_i)^2 / n$” is called sum of squares or SS; it is a minimum (the sum of squared deviations is smaller about the mean than about any other value)

- the quantity “$(\sum x_i)^2 / n$” is called the correction factor or C

- the quantity “n-1” is called degrees of freedom or df

- Thus, Variance = sum of squares / degrees of freedom, or $S^2 = SS / df$

* this is why the variance also is called “mean square”
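
The definition and computation formulas for the sample variance give the same result. The sketch below applies both to the 24 creativity scores listed earlier and compares them with `statistics.variance` from the standard library. The function names are hypothetical.

```python
import statistics

def variance_definition(xs):
    """S^2 from the definition formula: squared deviations from the mean."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)      # sum of squares (SS)
    return ss / (n - 1)                        # SS / df

def variance_computation(xs):
    """S^2 from the computation formula: raw sums and the correction factor."""
    n = len(xs)
    correction = sum(xs) ** 2 / n              # correction factor (C)
    ss = sum(x * x for x in xs) - correction   # sum of squares (SS)
    return ss / (n - 1)                        # SS / df

scores = [26.7, 12.0, 24.0, 13.6, 16.6, 24.3, 17.5, 18.2, 19.1, 19.3, 23.1, 20.3,
          29.7, 20.5, 12.0, 20.6, 17.2, 21.3, 12.9, 21.6, 19.8, 22.1, 22.6, 19.8]

s2_def = variance_definition(scores)
s2_comp = variance_computation(scores)
s2_lib = statistics.variance(scores)
print(s2_def, s2_comp, s2_lib)
```

In floating-point arithmetic the computation formula subtracts two large, nearly equal numbers, so for data with a very large mean relative to its spread the definition formula is usually the safer choice in software.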

46

2.3. Standard Deviation: square root of the Variance

a. $\sigma$: a parameter, for a population

b. S or SD: a statistic, for a sample
- the unit of the standard deviation is the same as that of the variable of interest

2.4. Coefficient of Variation (CV): is a relative term (%)
- calculated as $(SD / \bar{x}) \times 100$

- used to compare the results of several studies done differently (different experimenters, procedures, etc.) on the same variable

47

2.5. Skewness: measure of deviations from symmetry

a. symmetrical (skewness = 0)

b. skewed to the left (skewness < 0)

c. skewed to the right (skewness > 0)

2.6. Kurtosis: measure of peakedness or tailedness

a. mesokurtic (kurtosis = 0)

b. leptokurtic (kurtosis > 0)

c. platykurtic (kurtosis < 0)
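
A minimal sketch of how skewness and kurtosis can be computed, assuming the simple moment-based definitions (third and fourth central moments scaled by powers of the second, with 3 subtracted so that a normal distribution has kurtosis 0). Statistical packages often apply small-sample corrections, so their values may differ slightly. The datasets are hypothetical.

```python
def shape_measures(xs):
    """Return (skewness, excess kurtosis) from central moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis

# Hypothetical data: a long tail on the right, its mirror image, and a symmetric set
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]
left_tailed = [-x for x in right_tailed]
symmetric = [1, 2, 3, 4, 5]

for name, data in (("right", right_tailed), ("left", left_tailed), ("symmetric", symmetric)):
    print(name, shape_measures(data))
```

A long right tail gives positive skewness, a long left tail gives negative skewness, and a symmetric dataset gives zero.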

48

Sampling Means Distribution

Is the distribution of the means of all possible samples of size n taken from a population.

• If sampled from a normal distribution, it will be normal

• If sampled from a non-normal distribution, it becomes more normal than the parent distribution (Central Limit Theorem, CLT*)

- as sample size is increased, the sampling mean distribution approaches normality

*Fuzzy CLT: Data influenced by many small, unrelated, random effects are approximately normally distributed.
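
The behaviour of the sampling mean distribution can be explored by simulation. The sketch below assumes a strongly right-skewed parent population (exponential, with mean 1 and variance 1), a fixed random seed, and 5000 samples per sample size. These are all illustration choices, not values from the slides.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the sketch is repeatable

def sample_means(n, reps=5000):
    """Means of `reps` random samples of size n from an exponential population."""
    return [statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(reps)]

results = {n: sample_means(n) for n in (2, 10, 50)}

for n, means in results.items():
    # The mean of the sample means should be near the population mean (1),
    # and their variance near the population variance divided by n (1 / n).
    print(n, round(statistics.fmean(means), 3), round(statistics.variance(means), 4))
```

Plotting a histogram of each list in `results` would show the skew of the parent population fading as n grows.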

49

• The sampling mean distribution has a:

- mean, equal to $\mu$ (mean of population)

- variance (measuring the average squared deviation of the sampled means from $\mu$),

-- calculated as variance of the means, or $\sigma^2 / n$ (based on LLN)

-- estimated best by $S^2 / n$

-- denoted as $\sigma^2_{\bar{y}}$ or $s^2_{\bar{y}}$, respectively

-- square root of the above variance

is called …………………………

or …………..

……………..

or …………..

……………..

50

• Standard Error: a rough measure of the average absolute deviation of sampling means from $\mu$ (the typical error made when estimating $\mu$ from the mean of a sample of size n)

can be calculated as: $\sigma / \sqrt{n}$; best estimated by $S / \sqrt{n}$

denoted as $\sigma_{\bar{y}}$ or $s_{\bar{y}}$ or SE
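
As a small worked sketch, the standard error can be estimated from a single sample as the sample standard deviation divided by the square root of n. The sample below reuses one of the three-value datasets shown earlier, and the function name is hypothetical.

```python
import math
import statistics

def standard_error(xs):
    """Estimated standard error of the mean: S / sqrt(n)."""
    return statistics.stdev(xs) / math.sqrt(len(xs))

sample = [22, 24, 26]          # S = 2 for this sample
se = standard_error(sample)
print(round(se, 4))
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.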

51

Adapted from Ramsey and Schafer, 2002

52

Adapted from Ramsey and Schafer, 2002
