STAT03 - Descriptive statistics (cont.) - variability
Descriptive statistics (cont.) - variability
Lecturer: Smilen Dimitrov
Applied statistics for testing and evaluation – MED4
Introduction
• We previously discussed measures of central tendency (location) of a data sample (collection) in descriptive statistics – arithmetic mean, median and mode; and also the range as a measure of statistical dispersion (variability)
• Here we continue with other important measures of variability – namely the variance and the standard deviation
• We will also get acquainted with the quantities leading to their definitions: deviations, the sum of squares and degrees of freedom
• We will look at how we perform these operations in R, and a bit more about plotting as well
Variability and deviations
• A measure of variability is perhaps the most important quantity in statistical analysis.
  – The greater the variability in the data, the greater will be our uncertainty in the values of the parameters estimated from the data, and
  – the lower will be our ability to distinguish between competing hypotheses about the data.
• Measures of variability: each is a single number describing the variability of the data. The ones we are ultimately after are the variance and the standard deviation.
• Deviations – distances of the individual values in the data sample from the mean value: $d_i = x_i - \bar{x}$
• Plotting – using lines in a for loop
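The plotting idea above can be sketched in R roughly as follows. This is only an illustration: the vector `x` reuses the five numbers from the degrees-of-freedom example later in these slides, and the names `x` and `x_bar` are assumptions, not part of the original material.

```r
# Five numbers borrowed from the degrees-of-freedom example (mean is 4)
x <- c(2, 7, 4, 0, 7)
x_bar <- mean(x)

# Plot the values against their index and mark the mean as a horizontal line
plot(seq_along(x), x, pch = 19, xlab = "index i", ylab = "value x[i]")
abline(h = x_bar, lty = 2)

# Draw each deviation as a vertical line from the mean to the data point
for (i in seq_along(x)) {
  lines(c(i, i), c(x_bar, x[i]), col = "red")
}
```

Each red segment is one deviation; segments above the dashed line are positive deviations, segments below are negative.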
• The longer the lines, the more variable the data
• Could we use the sum of the deviations as a measure of variability?
• No – because of the definition of the arithmetic mean, it is the line positioned such that the sum of the deviations cancels out.
• Quick proof:
$$\sum_{i=1}^{N} d_i = \sum_{i=1}^{N} (x_i - \bar{x}) = \sum_{i=1}^{N} x_i - N\bar{x} = \sum_{i=1}^{N} x_i - N \cdot \frac{1}{N}\sum_{i=1}^{N} x_i = 0$$
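The cancellation can also be checked numerically. The sketch below assumes the same five illustrative numbers (2, 7, 4, 0, 7) used in the degrees-of-freedom slides; the variable names are made up for this example.

```r
x <- c(2, 7, 4, 0, 7)

# Deviations d_i = x_i - mean(x)
d <- x - mean(x)
d        # -2  3  0 -4  3

# The positive and negative deviations cancel out
sum_d <- sum(d)
sum_d    # 0
```

With non-integer data the sum may come out as a tiny non-zero number because of floating-point rounding, so comparing with `all.equal(sum(d), 0)` is safer than `==`.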
Absolute deviations
• The minus signs of the deviations could be seen as the reason for cancellation of the sum
• We could try using the absolute deviations: $D_i = |d_i| = |x_i - \bar{x}|$
• Their sum is different from 0 whenever the values are not all equal.
• However, they are hard to compute with – we need an easier way
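As a small illustration of absolute deviations, here is a sketch in R on the five numbers from the degrees-of-freedom example; `D` and `sum_abs` are names chosen for this example only.

```r
x <- c(2, 7, 4, 0, 7)

# Absolute deviations D_i = |x_i - mean(x)|
D <- abs(x - mean(x))
D         # 2 3 0 4 3

# Unlike the plain deviations, these do not cancel out
sum_abs <- sum(D)
sum_abs   # 12
```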
Squared deviations and sum of squares
• Squaring the deviations is computationally less intensive:
$$d_i^2 = (x_i - \bar{x})^2$$
• Their sum will, again, be different from 0 (unless all values are equal).
• It is the well-known sum of squares:
$$SS = \sum_{i=1}^{N} (x_i - \bar{x})^2$$
• More properly, it is the sum of squared deviations
• It is an unscaled, or unadjusted, measure of dispersion
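A minimal sketch of the sum of squares in R, again assuming the five illustrative numbers from the degrees-of-freedom slides (the name `SS` mirrors the symbol in the formula):

```r
x <- c(2, 7, 4, 0, 7)

# Squared deviations d_i^2 = (x_i - mean(x))^2
d2 <- (x - mean(x))^2
d2        # 4  9  0 16  9

# Sum of squares (sum of squared deviations)
SS <- sum(d2)
SS        # 38
```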
Scaling the sum of squares – Mean Squared Deviation
• Now, what would happen to the sum of squares if we added an additional data point? – It would get bigger, of course (unless the new point falls exactly on the mean).
• So usually, the sum of squares will grow with the size of the data collection.
  – That is a manifestation of the fact that it is unscaled.
  – Scaling (also known as normalizing) means adjusting the sum of squares so that it does not grow as the size of the data collection grows.
• We don't want our measure of variability to depend on sample size in this way, so the obvious solution is to divide by the number of data points N, to get the mean squared deviation (MSD):
$$MSD = s_N^2 = \frac{SS}{N} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2$$
• The MSD can be taken to be the wanted variance parameter, but…
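The effect of scaling can be illustrated with a short R sketch. The helper functions `ss()` and `msd()` are hypothetical names defined here, not built-in R functions, and the data are the five numbers from the degrees-of-freedom example, once as they are and once duplicated.

```r
# Hypothetical helpers: sum of squares and mean squared deviation
ss  <- function(v) sum((v - mean(v))^2)
msd <- function(v) ss(v) / length(v)

x  <- c(2, 7, 4, 0, 7)
x2 <- c(x, x)   # the same values twice: same spread, twice the size

ss(x)     # 38
ss(x2)    # 76   (grows with the size of the data collection)

msd(x)    # 7.6
msd(x2)   # 7.6  (does not grow with the size)
```

Doubling the data doubles the sum of squares, while the MSD stays the same, which is the point of dividing by N.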
Degrees of freedom
• Suppose we had a sample of five numbers and their average was 4. What was the sum of the five numbers? It must have been 20, otherwise the mean would not have been 4. So now let us think about each of the five numbers in turn.
• We are going to put a number in each of the five boxes.
• If we allow that the numbers could be positive or negative real numbers, we ask how many values the first number could take.
• You will realize it could take any value. Suppose it was a 2.
Boxes so far: 2
• How many values could the next number take? It could be anything.
• Say it was a 7.
Boxes so far: 2, 7
• And the third number could be anything.
• Suppose it was a 4.
Boxes so far: 2, 7, 4
• The fourth number could be anything at all.
• Say it was 0.
Boxes so far: 2, 7, 4, 0
• Now, how many values could the last number take?
• Just one: it has to be another 7, because the numbers must add up to 20 for the mean of the five numbers to be 4 (20 - (2 + 7 + 4 + 0) = 7).
Boxes: 2, 7, 4, 0, 7
• We have total freedom in selecting the first number, and the second, third and fourth numbers.
• But we have no choice at all in selecting the fifth number.
• We have four degrees of freedom when we have five numbers (and their mean).
• In general we have (N-1) degrees of freedom if we estimated the mean from a sample of size N.
• More generally still, we can propose a formal definition of degrees of freedom: degrees of freedom is the sample size, N, minus the number of parameters, p, estimated from the data.
Boxes: 2, 7, 4, 0, 7
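The box-filling argument can be replayed in R as a small sketch. The variable names (`target_mean`, `free_values`, `last_value`) are assumptions introduced for this illustration.

```r
N <- 5             # sample size
target_mean <- 4   # the mean we already know

# The first four numbers can be chosen freely
free_values <- c(2, 7, 4, 0)

# The fifth number is forced: all five must add up to N * mean = 20
last_value <- N * target_mean - sum(free_values)
last_value   # 7

x <- c(free_values, last_value)
mean(x)      # 4

# One parameter (the mean) was estimated, so p = 1
p <- 1
df <- N - p
df           # 4
```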
Scaling the sum of squares – variance
• The mean is a parameter estimated from the data itself – hence we lose one degree of freedom
• Thus we finally arrive at a definition for variance – sum of squares divided by the degrees of freedom:
$$\text{variance} = \frac{\text{sum of squares (SS)}}{\text{degrees of freedom } (N - p)}$$
$$s_{N-1}^2 = \frac{SS}{N-1} = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$$
• The only difference between the MSD and the variance is the division by N or N-1, respectively
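The difference between the two divisors can be seen in a short R sketch on the five numbers from the degrees-of-freedom example. R's built-in `var()` divides by N-1, so it should agree with the manual calculation; the other names are chosen for this illustration.

```r
x <- c(2, 7, 4, 0, 7)
N <- length(x)

SS <- sum((x - mean(x))^2)   # 38

msd <- SS / N         # 7.6  (divide by N)
s2  <- SS / (N - 1)   # 9.5  (divide by the degrees of freedom)

var(x)                # 9.5, matches s2
```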
Standard deviation
• Variance has a unit of measure which is squared (cm²) in relation to the original units (cm)
• Therefore, another measure is used – the standard deviation – measured in the same units as the data:
$$s = \sqrt{s_{N-1}^2} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2}$$
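In R the standard deviation is available as `sd()`, which is the square root of `var()`. A minimal sketch on the same five illustrative numbers:

```r
x <- c(2, 7, 4, 0, 7)

# Manual calculation: square root of the variance
s_manual <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))

# Built-in function
s <- sd(x)

s_manual   # about 3.08
s          # the same value
```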
Sample and population parameters
• Usually you are interested in drawing conclusions about the population from which your (random) sample of data is drawn.
• It is very important to keep in mind the difference between the descriptive statistics that characterise your sample, and the corresponding parameters that characterise the population from which your sample is drawn.
                     Population (finite, infinite)       Sample (finite)
                     "true" parameters                   estimates of population parameters
mean                 μ                                   x̄
variance             σ²                                  s²
standard deviation   σ                                   s
example              All raisin boxes ever produced      The particular data collection for
                     by the company/factory              only 17 particular raisin boxes
Needs (probability) distributions
Geometric interpretations - quantity graph
• Standard deviation – same units as the quantity
• Variance – an area (the units of the quantity, squared)
Geometric interpretation - histogram (frequency count)
• More commonly – geometric interpretation on a histogram.
• Makes it easier to see the spread
• If no deviations – standard deviation is 0 – the whole histogram collapses to a single peak
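One possible way to see this on a histogram in R is sketched below. The vector `y` is hypothetical data made up for illustration only, and the choice of line styles is arbitrary.

```r
# Hypothetical measurements, made up for illustration only
y <- c(3, 4, 4, 5, 5, 5, 6, 6, 7, 4, 5, 6, 5, 3, 7, 5, 4)
y_bar <- mean(y)
s <- sd(y)

# Histogram with the mean (solid) and mean +/- one standard deviation (dashed)
hist(y, main = "Histogram with mean and one standard deviation", xlab = "value")
abline(v = y_bar, lwd = 2)
abline(v = c(y_bar - s, y_bar + s), lty = 2)

# With no deviations at all, the standard deviation is exactly 0
constant <- rep(5, 10)
sd(constant)   # 0
```

The wider the dashed lines sit around the solid one, the more spread out the histogram is.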
Review
Descriptive statistics
• Measures of central tendency (location)
  – Arithmetic mean
  – Median
  – Mode
• Measures of statistical variability (dispersion, spread)
  – Range
  – Variance
  – Standard deviation
Exercise for mini-module 3 – STAT03
Exercise
Use the following data: The data in the following table come from three market gardens. The data show the ozone concentrations in parts per hundred million (pphm) on ten consecutive summer days.
• 1. Import the data into R, and for each garden, find the central tendency parameters of the ozone concentrations.
• 2. Using R, for each garden, find dispersion parameters - the sample variance and sample standard deviation.
• 3. Using R, plot the relative frequency histogram for each of the gardens. Mark graphically the arithmetic mean on each graph and the one standard deviation range.
Delivery: Deliver the collected data (in tabular format), the found statistics and the requested graphs for the assigned years in an electronic document. You are welcome to include R code as well.