bgy5901

DATA ANALYSIS

Outline

types of data/ variables

measures of central tendency

measures of dispersion, SE

assumptions of parametric tests

tests of normality

homogeneity of variance

hypothesis testing

T-test

ANOVA

Definition of Statistics

STATISTICS: The field of study relating to the collection, classification, summarization, analysis and interpretation of numerical information.

Definition of Biostatistics

BIOSTATISTICS: The application of statistics to the analysis of biological and medical data.

Definitions

Experimental unit: the object, person or anything else upon which a treatment is applied.

A factor of an experiment is a controlled independent variable; the experimenter determines the levels of the factor.

Level: one of the different values of a factor. Level implies amount or magnitude.

Response variable: the dependent variable, which depends on the factor.

Population

a complete set of units of interest

can be general or specific

usually determined by research question

parameter: any measure that tells us something about the entire population; denoted by lower-case Greek letters, e.g. μ

Sample

a fraction of the population

cannot collect data from everyone in the population: why?

statistic: any measure that tells us something about the sample; denoted by upper-case Latin letters, e.g. X

use sample to estimate parameters of population

Samples as good estimators of the population

must be representative and random

representative: the sample should reflect the composition of the population of interest

random: every person (or unit) in the population from which the sample is drawn has an equal probability of being chosen

Descriptive vs. Inferential statistics

descriptive

the use of graphical or numerical methods to summarize and identify patterns in a data set

only provides information on the data being analyzed

inferential

the use of sample data to make generalizations about a larger set of data

provides estimates about the population of interest based on selected parts of the population

What is a Variable?

A variable is any measured characteristic or attribute that differs for different subjects.

For example, if the lengths of 45 leaves were measured, then length would be a variable.

Types of data/ variables

Levels of Precision in Measurement

Nominal: names assigned to categories but no relation between the categories can be inferred

Ordinal: values are ranked (put in order)

Interval: distance between any two adjacent values is the same but the zero point is arbitrary

Ratio: similar to interval level but contains an absolute zero point

Example of ordinal scale

Matric. No.   Marks   Position
123456        98      1
123457        95      2
123458        72      3
123459        71      4
123460        60      5

Example of interval scale

1 2 3 4 5 6 7 8 9 10

(each interval is the same length)

Measures of central tendency

Central tendency is the point at which the distribution of scores is centred.

Three measures of central tendency:

1. Mode

2. Median

3. Mean


Mode

the most frequent value

for nominal data the mode is the only measure of central tendency

easy to calculate and understand

possible to have several modes in a data set

may not always represent the data well and can change if a new value is added


Median

the middle value of a distribution when the values are arranged in numerical order; if even number of values, take the average of the two middle values

stable: relatively unaffected by extreme values & skewed distributions

can be used with ordinal, interval or ratio data

sampling fluctuations: likely to differ in samples from same population

can be misleading when comparing samples therefore less useful than the mean


Mean (average)

sum of all values of a variable divided by the number of values

uses every value (no loss of information)

most accurate summary of the data

resistant to sampling variation (if several samples are taken from the same population, their means are likely to be similar)

can be influenced by extreme values (outliers)

can only be used with continuous data
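The three measures can be compared on a small data set. The sketch below uses Python's standard `statistics` module; the marks are hypothetical values chosen only for illustration, and the added value of 500 is an artificial outlier.

```python
import statistics

# Hypothetical marks, chosen only for illustration.
marks = [60, 71, 72, 72, 95]

mode = statistics.mode(marks)      # most frequent value
median = statistics.median(marks)  # middle value of the sorted data
mean = statistics.mean(marks)      # sum of values / number of values

# Adding one extreme value (an artificial outlier) shifts the mean
# far more than the median.
with_outlier = marks + [500]
median_out = statistics.median(with_outlier)
mean_out = statistics.mean(with_outlier)

print("mode:", mode, "median:", median, "mean:", mean)
print("with outlier -> median:", median_out, "mean:", mean_out)
```

In this example the median stays at 72 after the outlier is added, while the mean moves from 74 to 145, which illustrates why the median is described as stable and the mean as sensitive to extreme values.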

Measures of dispersion

Dispersion refers to the variability of values in a data set i.e. the extent to which a set of values differ


Range

difference between the highest and the lowest value

easy to compute

outliers: easily influenced by extreme values

based on only two of the observations and gives no idea of how the other observations are arranged between these two numbers

tends to increase as the size of the sample increases


Interquartile Range

range of the middle 50% of values

less susceptible to outliers

uses only half of the data

Standard deviation

roughly the average distance between each value and the mean (the square root of the average squared deviation from the mean)

measures the variability within the data set

how well the mean represents the data

uses every value

can be influenced by extreme values
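As a rough sketch, the three measures of dispersion can be computed with Python's standard `statistics` module (Python 3.8 or later for `quantiles`). The data are hypothetical. Note that quartiles can be defined in several ways, so the interquartile range from this sketch may differ slightly from the value reported by other software.

```python
import statistics

# Hypothetical values, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: difference between the highest and the lowest value.
value_range = max(values) - min(values)

# Interquartile range: spread of the middle 50% of values.
# statistics.quantiles uses its default ("exclusive") method here.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Standard deviation: pstdev divides by n (whole population),
# stdev divides by n - 1 (sample used to estimate a population).
sd_population = statistics.pstdev(values)
sd_sample = statistics.stdev(values)

print("range:", value_range)
print("IQR:", iqr)
print("SD (population):", sd_population)
print("SD (sample):", round(sd_sample, 3))
```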


Sampling Distribution

[Figure: nine samples (numbered 1 to 9) drawn from a population with μ = 10]

Sample   Mean
1        8
2        10
3        9
4        10
5        10
6        11
7        12
8        9
9        11

[Figure: Sampling distribution of the sample means; x-axis: Sample mean (8 to 12); y-axis: Frequency (1 to 3)]
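The nine sample means listed above can be tallied to rebuild the frequency distribution. The following is a minimal sketch using only the Python standard library; the variable names are illustrative.

```python
from collections import Counter
import statistics

# Sample number -> sample mean, as listed in the slide.
sample_means = {1: 8, 2: 10, 3: 9, 4: 10, 5: 10, 6: 11, 7: 12, 8: 9, 9: 11}

means = list(sample_means.values())

# Frequency of each sample mean (the heights of the histogram bars).
frequency = Counter(means)

# The mean of the sample means is centred on the population mean.
mean_of_means = statistics.mean(means)

# Spread of the sample means around their centre.
sd_of_means = statistics.stdev(means)

for value in sorted(frequency):
    print(value, "*" * frequency[value])
print("mean of sample means:", mean_of_means)
print("SD of sample means:", round(sd_of_means, 3))
```

The mean of the nine sample means is 10, the same as the population mean μ, and the spread of the sample means is what the standard error describes.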

How well does the sample represent the population?

If we want to know how well the mean represents the data we calculate the standard deviation.

Similarly, to estimate how accurately the sample represents the population we calculate the standard deviation of the distribution of the sample means, i.e. the Standard Error (SE).

SE = standard deviation of the population / √n

Since the SD of the population is usually not known, the SD of the sample is used instead.
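A minimal sketch of the SE calculation, using the sample SD in place of the unknown population SD. The heights are hypothetical and the helper name `standard_error` is illustrative.

```python
import math
import statistics

def standard_error(sample):
    """Estimate the SE of the mean as (sample SD) / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Hypothetical heights in cm, chosen only for illustration.
heights = [158.0, 162.5, 149.0, 171.0, 165.5, 160.0, 155.5, 168.0]

sd = statistics.stdev(heights)
se = standard_error(heights)

print("SD:", round(sd, 3))
print("SE:", round(se, 3))
```

Because the SD is divided by √n, the SE is always smaller than the SD for n > 1 and shrinks as the sample size grows.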

Standard Deviation vs. Standard Error of the Mean (SEM)

When to use which?

Assumptions of Parametric Tests

1. Independent values: the value from one subject does not influence the value of another

2. Interval data: data should be measured at least at the interval level

3. Normally distributed: bell-shaped; tests of normality should be conducted

4. Homogeneity of variance: variances should be the same throughout the data

Tests of normality

1. Skewness and kurtosis

2. Histogram and stem and leaf plot

3. Kolmogorov-Smirnov & Shapiro-Wilk tests

4. Normal probability plot (Q-Q plot)

5. Box-plot

Skewness and Kurtosis

values of skewness and kurtosis should be zero in a normal distribution

to test this, divide the values of skewness and kurtosis by their respective standard errors to obtain z-scores

look for absolute values greater than 1.96; if |z| > 1.96 (significant at p < 0.05) then the data are NOT normally distributed

Skewness and Kurtosis

Descriptives: height (cm)

                                                Statistic   Std. Error
Mean                                            161.52      .765
95% Confidence Interval for Mean, Lower Bound   160.01
95% Confidence Interval for Mean, Upper Bound   163.03
5% Trimmed Mean                                 161.57
Median                                          161.70
Variance                                        94.212
Std. Deviation                                  9.706
Minimum                                         127
Maximum                                         190
Range                                           63
Interquartile Range                             13
Skewness                                        -.170       .191
Kurtosis                                        .804        .380
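The 1.96 rule can be applied to the skewness and kurtosis values reported for height (cm) in the Descriptives output. This is a small arithmetic sketch; the function name `z_score` is illustrative and the 1.96 cut-off is the one given in the slides.

```python
def z_score(statistic, std_error):
    """Divide a statistic by its standard error."""
    return statistic / std_error

# Values taken from the Descriptives output for height (cm).
z_skew = z_score(-0.170, 0.191)
z_kurt = z_score(0.804, 0.380)

CUTOFF = 1.96  # two-tailed critical value at the 0.05 level

for name, z in [("skewness", z_skew), ("kurtosis", z_kurt)]:
    verdict = "exceeds" if abs(z) > CUTOFF else "is within"
    print(f"{name}: z = {z:.2f}, {verdict} +/-{CUTOFF}")
```

For these data the skewness z-score is about -0.89, which is within the cut-off, while the kurtosis z-score is about 2.12, which exceeds it. This is one reason to look at several normality checks together rather than relying on a single one.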

[Figure: Histogram of length between internodes (cm); x-axis: length between internodes (cm), 0.50 to 3.50; y-axis: Frequency, 0 to 60; Mean = 2.175, Std. Dev. = 0.47512, N = 745]

Kolmogorov Smirnov & Shapiro-Wilk

compare the scores in the sample to a normally distributed set of scores with the same mean and standard deviation.

if the test is non-significant (p > 0.05) then the distribution is not significantly different from a normal distribution, so it can be treated as normally distributed

if p < 0.05 then the distribution is significantly different from a normal distribution, so it is NOT normally distributed

Kolmogorov Smirnov & Shapiro-Wilk

Tests of Normality

              Kolmogorov-Smirnov(a)          Shapiro-Wilk
              Statistic   df    Sig.         Statistic   df    Sig.
height (cm)   .036        161   .200*        .991        161   .393

*. This is a lower bound of the true significance.

a. Lilliefors Significance Correction

[Figure: Normal Q-Q Plot of length between internodes (cm); x-axis: Observed Value, 0.5 to 4.0; y-axis: Expected Normal, -2.5 to 2.5]

Box-Plot

Median should be in the middle of the box.

Outliers

values that are widely separated from the rest

Possible reasons for outliers:

measurement invalid (device not functioning, misrecorded value)

misclassified measurement: belongs to a population different from the one from which the rest of the sample was drawn

represents a rare or chance event

Homogeneity of Variance

Levene’s Test

Tests if variances in different groups are the same.

If significant (p < 0.05), variances are NOT equal.

If non-significant (p > 0.05), variances can be assumed to be equal.

Variance Ratio (VR)

Compare two or more groups.

Variance ratio = largest variance/ smallest variance

If VR < 2, homogeneity can be assumed.
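The variance ratio is simple enough to compute directly. The sketch below uses Python's `statistics.variance` (sample variance); the function name `variance_ratio` and the groups are hypothetical, and the threshold of 2 is the rule of thumb stated above.

```python
import statistics

def variance_ratio(*groups):
    """Largest sample variance divided by smallest sample variance."""
    variances = [statistics.variance(group) for group in groups]
    return max(variances) / min(variances)

# Hypothetical measurements for two groups.
group_a = [1, 2, 3]   # sample variance = 1
group_b = [2, 4, 6]   # sample variance = 4

vr = variance_ratio(group_a, group_b)
homogeneous = vr < 2  # rule of thumb from the slides

print("VR:", vr)
print("homogeneity can be assumed:", homogeneous)
```

Here VR = 4, so homogeneity of variance would not be assumed for these two hypothetical groups.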

Components of a hypothesis test

1. null hypothesis (Ho)

2. alternative hypothesis (Ha)

3. test statistic

4. reject or fail to reject Ho? p-value vs. significance level

Hypothesis Testing

Hypothesis: A tentative explanation for an observation, phenomenon, or scientific problem that can be tested by further investigation.

Hypothesis testing: You have some claim about the parameter and you want to see whether the data support the claim or not.

Null and alternative hypothesis

Null hypothesis (Ho)

statement being tested in a statistical test

usually the null hypothesis is a statement of no effect or no difference

Alternative hypothesis (Ha)

experimental hypothesis: a hypothesis to be considered as an alternative to the null hypothesis

Test Statistics

Definition

Value used to decide whether or not the null hypothesis should be rejected in hypothesis testing

Sources of variation

In any experiment there are two basic sources of variation

1. systematic: variation due to experimental manipulation

2. unsystematic: variation due to random factors

Test Statistics

Need to calculate a test statistic to find differences between samples

test statistic = systematic variation / unsystematic variation

Then need to calculate the probability of obtaining a value that large

Compare the amount of variance created by an experimental effect against the amount of variance due to random factors: WHY?

Test Statistics

if the experiment has had an effect we'd expect it to create more variance than random factors alone

the bigger the test statistic, the less likely it is to occur by chance

when the probability falls below a certain pre-determined value, we accept that the test statistic is as large as it is because of the experimental manipulation and not because of random factors

p-value and significance level

significance level, α

probability that the test rejects the null hypothesis on the assumption that the null hypothesis is true

pre-determined value

p-value

probability of obtaining, by chance alone, a test statistic as large as or larger than that actually observed if Ho is true

the smaller the p-value, the stronger the evidence against Ho, i.e. the observed result is unlikely to occur just by chance
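The comparison between the p-value and the significance level can be written as a one-line decision rule. This is only a sketch: the function name `decide` and the p-values are hypothetical, and α = 0.05 is assumed as the pre-determined level.

```python
def decide(p_value, alpha=0.05):
    """Compare a p-value with the pre-determined significance level."""
    if p_value < alpha:
        return "reject Ho"
    return "fail to reject Ho"

# Hypothetical p-values, chosen only for illustration.
for p in (0.001, 0.03, 0.20):
    print(f"p = {p}: {decide(p)}")
```

The wording "fail to reject Ho" is used rather than "accept Ho" because a non-significant result does not show that the null hypothesis is true.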

Statistical Significance

– In statistics, a result is called significant if it is unlikely to have occurred by chance.

– "A statistically significant difference" simply means there is statistical evidence that there is a difference.

– However it does not mean the difference is necessarily large, important or meaningful.

– It means that the observed effects are unlikely to be due to chance, which suggests that the results are likely to be repeatable.

Type I and II errors

Two kinds of errors can be made in significance testing

1. a true null hypothesis can be incorrectly rejected (Type I error)

the conclusion is drawn that the null hypothesis is false when in fact it is true

probability of Type I error (α) is the significance level

2. a false null hypothesis can fail to be rejected (Type II error)

considered an error because the null hypothesis should have been rejected, e.g. assuming no effect of treatment when there was one

probability of Type II error is β

Power of a statistical test

1 – β = power of a statistical test

Power of a statistical test is the ability of a study to find a significant difference if indeed one exists.

It is the probability that you will reject the null hypothesis when it is false.
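Type I error rate and power can be estimated by simulation. The sketch below is an illustration under stated assumptions, not a method from the slides: it repeatedly draws samples from a normal population with known SD, tests Ho: μ = 0 with a two-tailed z-test, and counts how often Ho is rejected. The function name, sample size, effect size and seed are all hypothetical choices.

```python
import random
from statistics import NormalDist

def rejection_rate(true_mean, n=30, sigma=1.0, alpha=0.05,
                   trials=2000, seed=1):
    """Fraction of simulated experiments in which Ho: mu = 0 is rejected."""
    rng = random.Random(seed)
    # Two-tailed critical value (about 1.96 for alpha = 0.05).
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = sample_mean / (sigma / n ** 0.5)  # z-test with known sigma
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

# Ho is true: every rejection is a Type I error, so the rate estimates alpha.
type_i_rate = rejection_rate(true_mean=0.0)

# Ho is false (true mean 0.5): the rejection rate estimates power, 1 - beta.
power = rejection_rate(true_mean=0.5)

print("estimated Type I error rate:", type_i_rate)
print("estimated power:", power)
```

With these settings the estimated Type I error rate should be close to 0.05, and the estimated power should be well above it, since a real difference exists in the second case.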

T-test

To look for differences in means between groups of subjects from two different experimental conditions

experimental condition- the procedure that is varied in order to estimate a variable's effect

If the mean difference between groups is large, it could mean two things:

1. The two groups were taken from the same population but differ simply due to chance.

2. The two groups come from different populations. (If for example we have manipulated one of the groups then this is evidence that the experimental manipulation has caused the large difference between the groups).

Independent & dependent t-test

Independent t-test

two experimental conditions

different subjects were assigned to each condition i.e. each subject is tested under only one condition.

Dependent t-test (paired t-test)

two experimental conditions

same subjects took part in both conditions of the experiment

Independent t-test (SPSS output)

Group Statistics: length between internodes (cm)

type of fertilizer   N     Mean     Std. Deviation   Std. Error Mean
inorganic            292   2.0766   .46262           .02707
organic              453   2.2384   .47277           .02221

Independent Samples Test: length between internodes (cm), equal variances assumed

Levene's Test for Equality of Variances
F      Sig.
.421   .517

t-test for Equality of Means
t        df    Sig. (2-tailed)   Mean Difference   Std. Error Difference
-4.597   743   .000              -.16174           .03518
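The t statistic in the SPSS output can be approximately reproduced from the Group Statistics alone. The sketch below assumes equal variances (as in the "Equal variances assumed" row) and uses the standard pooled-variance formula; the function name `pooled_t` is illustrative. The Python standard library has no t distribution, so the p-value here is a normal approximation, which is reasonable only because df is large.

```python
import math
from statistics import NormalDist

def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Independent t-test from summary statistics, equal variances assumed."""
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    # Standard error of the difference between the two means.
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / se_diff
    return t, df, se_diff

# Summary values from the Group Statistics table (inorganic vs. organic).
t, df, se_diff = pooled_t(292, 2.0766, 0.46262, 453, 2.2384, 0.47277)

# Two-tailed p-value using a normal approximation to the t distribution.
p_approx = 2 * NormalDist().cdf(-abs(t))

print("t =", round(t, 3), "df =", df, "SE difference =", round(se_diff, 5))
print("approximate two-tailed p =", p_approx)
```

The result is close to the reported t = -4.597 with df = 743. The small discrepancy arises because the means in the table are rounded. The p-value is far below 0.05, consistent with the reported Sig. (2-tailed) of .000.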

ANOVA

compares means from three or more groups

need to first test the Ho that all group means are equal; Ha is that the group means differ

if the null hypothesis is rejected this means that the means of these groups are not all equal

need to conduct a post-hoc test to determine which means differ significantly
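For a one-way independent ANOVA the test statistic is the F ratio, which follows the same pattern as the earlier "systematic variation / unsystematic variation": the variance between group means divided by the variance within groups. The following is a minimal sketch with hypothetical data; the function name `one_way_anova_f` is illustrative. Obtaining the p-value requires the F distribution, which is not in the Python standard library, so it is omitted here.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way independent ANOVA."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k = len(groups)            # number of groups
    n_total = len(all_values)  # total number of observations

    # Systematic variation: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Unsystematic variation: values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical measurements for three groups.
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
f, df_between, df_within = one_way_anova_f(groups)

print("F =", round(f, 3), "df =", (df_between, df_within))
```

For these hypothetical groups F is 19 with (2, 6) degrees of freedom. A large F indicates that the group means differ by more than would be expected from the variation within groups.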

Types of ANOVA

One-way independent ANOVA: only one independent variable, e.g. age group (independent) and height (dependent); different participants are used in each condition

Two-way independent ANOVA: two independent variables, e.g. age group and gender (independent) and height (dependent); different participants are used in each condition

One-way repeated measures ANOVA: only one independent variable, e.g. exercise type (independent) and blood pressure level (dependent); the same participants are used in all conditions

Identify the experimental unit, factor(s), response variable, level(s) and most appropriate statistical test.

1. 240 chickens of four different breeds were randomly assigned to three different farms. After five weeks the weights of the chickens were measured.

2. A researcher wanted to investigate whether different types of fertilizer mixtures affect the growth of plants differently. 36 seeds were randomly assigned to two different types of fertilizer treatment. The height of each plant was measured after 3 weeks.
