stats lecture 02 descriptive stats

Upload: katherine-sauer

Post on 06-Apr-2018

  • 8/3/2019 Stats Lecture 02 Descriptive Stats

    1/51

    Descriptive Statistics

    Chapter 2

    Quantitative Methods for Economics

    Dr. Katherine Sauer

    Metropolitan State College of Denver

    Chapter Overview:

    I. Working With Raw Data

    II. Working With Grouped Data

    III. Measures of Dispersion for Raw Data

    IV. Measures of Dispersion for Grouped Data

    V. Other Measures of Dispersion

    I. Working with Raw Data (mean, median and mode)

    Suppose you are a manager preparing a report on hours worked

    by your 49 staff members.

    You might like to know the average number of hours worked.

    $\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{\sum_{i=1}^{49} x_i}{49} = \frac{1592.5}{49} = 32.5$

    20.0 37.3 54.2 25.3 59.6 24.5 29.7

    18.0 38.8 42.1 39.5 56.8 16.9 28.5

    45.5 42.0 39.5 42.6 40.0 44.2 40.1

    44.0 56.4 30.2 20.0 22.7 37.8 23.4

    26.0 20.2 36.1 18.3 19.7 36.8 26.5

    24.0 23.4 15.4 20.0 38.9 42.1 24.1

    41.0 18.5 21.3 22.6 37.2 42.9 17.9

    Hours worked in a given week by 49 staff members
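The mean calculation above can be reproduced in a few lines of Python. This is a minimal sketch; the variable names (`hours`, `mean_hours`) are illustrative and not part of the lecture.

```python
# Hours worked by the 49 staff members, in the order listed in the table above.
hours = [
    20.0, 37.3, 54.2, 25.3, 59.6, 24.5, 29.7,
    18.0, 38.8, 42.1, 39.5, 56.8, 16.9, 28.5,
    45.5, 42.0, 39.5, 42.6, 40.0, 44.2, 40.1,
    44.0, 56.4, 30.2, 20.0, 22.7, 37.8, 23.4,
    26.0, 20.2, 36.1, 18.3, 19.7, 36.8, 26.5,
    24.0, 23.4, 15.4, 20.0, 38.9, 42.1, 24.1,
    41.0, 18.5, 21.3, 22.6, 37.2, 42.9, 17.9,
]

n = len(hours)            # number of observations, N
total = sum(hours)        # sum of all x_i
mean_hours = total / n    # arithmetic mean

print(f"N = {n}, sum = {total:.1f}, mean = {mean_hours:.1f}")
# prints: N = 49, sum = 1592.5, mean = 32.5
```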

    You might also like to know the median hours worked.

    - sort the data in ascending order

    15.4 20.0 23.4 26.5 37.3 40.1 44.0

    16.9 20.0 23.4 28.5 37.8 41.0 44.2

    17.9 20.0 24.0 29.7 38.8 42.0 45.5

    18.0 20.2 24.1 30.2 38.9 42.1 54.2

    18.3 21.3 24.5 36.1 39.5 42.1 56.4

    18.5 22.6 25.3 36.8 39.5 42.6 56.8

    19.7 22.7 26.0 37.2 40.0 42.9 59.6

    Hours worked in a given week by 49 staff members

    The mode and median can be determined from the sorted data.

    Are there any outliers we should make note of?

    mean: 32.5 hours

    median: 30.2 hours

    mode: 20 hours
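As a quick check of the median and mode, here is a sketch using Python's standard `statistics` module on the same hours data. The variable names are illustrative assumptions.

```python
import statistics

# Hours worked by the 49 staff members (unsorted; the functions sort internally).
hours = [
    20.0, 37.3, 54.2, 25.3, 59.6, 24.5, 29.7,
    18.0, 38.8, 42.1, 39.5, 56.8, 16.9, 28.5,
    45.5, 42.0, 39.5, 42.6, 40.0, 44.2, 40.1,
    44.0, 56.4, 30.2, 20.0, 22.7, 37.8, 23.4,
    26.0, 20.2, 36.1, 18.3, 19.7, 36.8, 26.5,
    24.0, 23.4, 15.4, 20.0, 38.9, 42.1, 24.1,
    41.0, 18.5, 21.3, 22.6, 37.2, 42.9, 17.9,
]

median_hours = statistics.median(hours)   # middle (25th) value of the sorted data
mode_hours = statistics.mode(hours)       # most frequent value
all_modes = statistics.multimode(hours)   # lists every value tied for most frequent

print(median_hours, mode_hours, all_modes)
# prints: 30.2 20.0 [20.0]
```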

    One final calculation we might like to make is arranging the data into quartiles.

    The lower quartile (Q1) is the value of the item at, or closest to, position 0.25(n + 1) in the sorted data.

    Q1: 0.25(49 + 1) = 12.5

    There is no 12.5th position, so we'll average the values in the 12th and 13th positions.

    So, Q1 = (21.3 + 22.6) / 2 = 21.95

    We've already found Q2 (the median): 30.2

    To find the upper quartile (Q3), use the value of the item at, or closest to, position 0.75(n + 1).

    Q3: 0.75(50) = 37.5, so we'll average the values in the 37th and 38th positions.

    So, Q3 has a value of (41.0 + 42.0) / 2 = 41.5

    Q1 = 21.95    Q2 = 30.2    Q3 = 41.5
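The quartile positions 0.25(n + 1), 0.5(n + 1) and 0.75(n + 1) can be checked in Python. This sketch assumes that `statistics.quantiles` with `method="exclusive"` is an acceptable stand-in for the hand method: it uses the same (n + 1) positions and interpolates linearly, which for a position ending in .5 is the same as averaging the two neighbouring values.

```python
import statistics

hours = [
    20.0, 37.3, 54.2, 25.3, 59.6, 24.5, 29.7,
    18.0, 38.8, 42.1, 39.5, 56.8, 16.9, 28.5,
    45.5, 42.0, 39.5, 42.6, 40.0, 44.2, 40.1,
    44.0, 56.4, 30.2, 20.0, 22.7, 37.8, 23.4,
    26.0, 20.2, 36.1, 18.3, 19.7, 36.8, 26.5,
    24.0, 23.4, 15.4, 20.0, 38.9, 42.1, 24.1,
    41.0, 18.5, 21.3, 22.6, 37.2, 42.9, 17.9,
]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(hours, n=4, method="exclusive")

# Hand check of Q1: average of the 12th and 13th sorted values.
ordered = sorted(hours)
q1_by_hand = (ordered[11] + ordered[12]) / 2   # (21.3 + 22.6) / 2

print(round(q1, 2), round(q2, 2), round(q3, 2))
# prints: 21.95 30.2 41.5
```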

    Sometimes the mean is not a good representation of the data.

    - a representative statistic is fairly typical of most of the

    data

    Outliers can skew the mean.

    Ex: Suppose we have the following data on the ages of students taking piano lessons:

    5, 6, 7, 7, 7, 8, 9, 9, 32

    Calculate the mean, median and mode:

    mean = 10, median = 7, mode = 7

    Drop the outlier (32) and re-calculate the mean, median and mode:

    mean = 7.25, median = (7 + 7)/2 = 7, mode = 7
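The effect of the outlier can be seen by computing the three statistics with and without it. A minimal sketch with the standard `statistics` module; the helper name `summarize` is an illustrative assumption.

```python
import statistics

ages = [5, 6, 7, 7, 7, 8, 9, 9, 32]

def summarize(data):
    """Return (mean, median, mode) for a list of numbers."""
    return statistics.mean(data), statistics.median(data), statistics.mode(data)

with_outlier = summarize(ages)
# Drop the single high value (32) and recompute.
without_outlier = summarize([a for a in ages if a != 32])

print(with_outlier)     # mean 10, median 7, mode 7
print(without_outlier)  # mean 7.25, median 7.0, mode 7
```

Only the mean moves noticeably; the median and mode stay at 7.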

    Graphically, skewed data has a long tail extending to the outlier.

    - low outliers produce skewed to the left graphs

    - high outliers produce skewed to the right graphs

    For low outliers, the value of the mean will be less than the

    value of the median.

    For high outliers, the value of the mean will be more than the

    value of the median.

    II. Working with Grouped Data (mean, median and mode)

    Many times it would be impractical to list all of the raw data.

    Often data is first put into groups.

    Example: employment data in the farming, fishing and forestry

    industry

    Age Group         1991       1996
    15-19            4,585      2,826
    20-24           11,872      9,319
    25-34           27,171     24,492
    35-44           31,299     28,210
    45-54           31,626     30,902
    55-64           33,477     25,846
    65 and over     23,519     19,030
    Total          163,549    140,625

    Employment in the Farming, Fishing and Forestry Industry

    Note: We are assuming that the values within each interval vary

    uniformly between the lowest and highest values for the interval.

    The mid-interval value is the average value of the data in

    any interval.

    - used to represent the group numerically

    Mid-Interval Value for 15-19: (15 + 19) / 2 = 17

    The age of each person in the interval is assumed to be 17.

    Back to our hours worked example

    15.4 20.0 23.4 26.5 37.3 40.1 44.0

    16.9 20.0 23.4 28.5 37.8 41.0 44.2

    17.9 20.0 24.0 29.7 38.8 42.0 45.5

    18.0 20.2 24.1 30.2 38.9 42.1 54.2

    18.3 21.3 24.5 36.1 39.5 42.1 56.4

    18.5 22.6 25.3 36.8 39.5 42.6 56.8

    19.7 22.7 26.0 37.2 40.0 42.9 59.6

    Hours worked in a given week by 49 staff members

    Let's group this data into a frequency distribution table.

    - choose between 5 and 20 intervals

    Data starts at 15.4 and goes to 59.6.

    Grouping hours by 5s or 10s makes sense.

    For our data, by 5s will be more revealing.
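Grouping by 5s can be sketched in Python as below. The function name `frequency_table` and its parameters are illustrative assumptions, and the sketch assumes every value falls inside the range covered by the bins (here 15 up to, but not including, 60).

```python
hours = [
    20.0, 37.3, 54.2, 25.3, 59.6, 24.5, 29.7,
    18.0, 38.8, 42.1, 39.5, 56.8, 16.9, 28.5,
    45.5, 42.0, 39.5, 42.6, 40.0, 44.2, 40.1,
    44.0, 56.4, 30.2, 20.0, 22.7, 37.8, 23.4,
    26.0, 20.2, 36.1, 18.3, 19.7, 36.8, 26.5,
    24.0, 23.4, 15.4, 20.0, 38.9, 42.1, 24.1,
    41.0, 18.5, 21.3, 22.6, 37.2, 42.9, 17.9,
]

def frequency_table(data, start, width, n_bins):
    """Count values in intervals [start, start+width), [start+width, ...), etc."""
    counts = [0] * n_bins
    for x in data:
        k = int((x - start) // width)   # index of the interval containing x
        counts[k] += 1
    return [(start + k * width, start + (k + 1) * width, c)
            for k, c in enumerate(counts)]

table = frequency_table(hours, start=15, width=5, n_bins=9)
frequencies = [count for _, _, count in table]

for low, high, count in table:
    print(f"{low} < {high}: {count}")
```

With 5-hour intervals this gives nine groups, which falls inside the suggested range of 5 to 20 intervals.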

    Hours Worked    Frequency
    15 < 20              7
    20 < 25             12
    25 < 30              5
    30 < 35              1
    35 < 40              9
    40 < 45             10
    45 < 50              1
    50 < 55              1
    55 < 60              3
    Total               49

    Let's calculate the mid-interval values and add them to our table.

    Hours Worked    Frequency    Mid-Interval Value
    15 < 20              7            17.5
    20 < 25             12            22.5
    25 < 30              5            27.5
    30 < 35              1            32.5
    35 < 40              9            37.5
    40 < 45             10            42.5
    45 < 50              1            47.5
    50 < 55              1            52.5
    55 < 60              3            57.5
    Total               49

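    Let's calculate the total hours worked for each interval and add it to the table.

    sub-group total = frequency x mid-interval value

    Hours Worked    Frequency    Mid-Interval Value    Sub-Group Total Hours Worked
    15 < 20              7            17.5                   122.5
    20 < 25             12            22.5                   270.0
    25 < 30              5            27.5                   137.5
    30 < 35              1            32.5                    32.5
    35 < 40              9            37.5                   337.5
    40 < 45             10            42.5                   425.0
    45 < 50              1            47.5                    47.5
    50 < 55              1            52.5                    52.5
    55 < 60              3            57.5                   172.5
    Total               49                                  1597.5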

    The mean for the grouped data is the sum of the sub-group totals divided by the total frequency: 1597.5 / 49 = 32.602
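The grouped mean can be sketched in Python as follows. The intervals and frequencies are those obtained by counting the 49 raw values in 5-hour groups starting at 15; the variable names are illustrative assumptions.

```python
# (lower bound, upper bound) of each 5-hour interval and its frequency.
intervals = [(15, 20), (20, 25), (25, 30), (30, 35), (35, 40),
             (40, 45), (45, 50), (50, 55), (55, 60)]
frequencies = [7, 12, 5, 1, 9, 10, 1, 1, 3]

# Mid-interval value: every item in a group is treated as if it had this value.
mid_values = [(low + high) / 2 for low, high in intervals]

# Sub-group total hours = frequency x mid-interval value.
sub_totals = [f * m for f, m in zip(frequencies, mid_values)]

grouped_mean = sum(sub_totals) / sum(frequencies)

print(sum(sub_totals), round(grouped_mean, 3))
# prints: 1597.5 32.602
```

The result is close to, but not the same as, the raw-data mean of 32.5, because the mid-interval values only approximate the actual observations.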

    To find the mode, we simply need our frequencies and intervals.
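The transcript does not preserve how the grouped mode was calculated. The sketch below assumes the common interpolation formula for the mode of grouped data, L + (f1 - f0) / ((f1 - f0) + (f1 - f2)) x w, where L is the lower bound of the modal interval, f1 its frequency, f0 and f2 the frequencies of the neighbouring intervals, and w the interval width. The function name `grouped_mode` is hypothetical.

```python
def grouped_mode(lower_bounds, width, frequencies):
    """Interpolated mode for equal-width grouped data (assumed formula)."""
    k = frequencies.index(max(frequencies))          # modal interval
    f1 = frequencies[k]
    f0 = frequencies[k - 1] if k > 0 else 0          # interval before
    f2 = frequencies[k + 1] if k + 1 < len(frequencies) else 0  # interval after
    return lower_bounds[k] + (f1 - f0) / ((f1 - f0) + (f1 - f2)) * width

lower_bounds = [15, 20, 25, 30, 35, 40, 45, 50, 55]
frequencies = [7, 12, 5, 1, 9, 10, 1, 1, 3]

mode_hours = grouped_mode(lower_bounds, 5, frequencies)
print(round(mode_hours, 2))
# prints: 22.08
```

The modal interval is 20 < 25 (12 items), and the interpolated value is 20 + (5 / 12) x 5.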

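    Now let's calculate the median and quartiles.

    We'll first need to compute the cumulative frequency and add it to our table.

    Hours Worked    Frequency    "Less Than" Cumulative Frequency
    15 < 20              7                  7
    20 < 25             12                 19
    25 < 30              5                 24
    30 < 35              1                 25
    35 < 40              9                 34
    40 < 45             10                 44
    45 < 50              1                 45
    50 < 55              1                 46
    55 < 60              3                 49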

    To determine the value of Q1:

    - From the 7 items in the preceding interval, 5.5 more are needed to reach the 12.5th position.

    - There are 12 items in the interval that contains Q1.

    From this we get: 5.5 / 12 = 0.46

    Take this times the size of the interval to get: 0.46 x 5 = 2.3

    Add this to the beginning of the interval to get: 2.3 + 20 = 22.3 = Q1
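The interpolation steps for Q1 can be written as a small function. This is a sketch under the same assumption the slides make (values spread uniformly within each interval); the function name `grouped_quantile` is hypothetical.

```python
def grouped_quantile(lower_bounds, width, frequencies, position):
    """Value at a given (possibly fractional) position in equal-width grouped data."""
    cumulative = 0   # number of items in all preceding intervals
    for low, f in zip(lower_bounds, frequencies):
        if cumulative + f >= position:
            needed = position - cumulative        # items still needed in this interval
            return low + needed / f * width       # move that fraction into the interval
        cumulative += f
    raise ValueError("position lies beyond the last interval")

lower_bounds = [15, 20, 25, 30, 35, 40, 45, 50, 55]
frequencies = [7, 12, 5, 1, 9, 10, 1, 1, 3]
n = sum(frequencies)

q1 = grouped_quantile(lower_bounds, 5, frequencies, 0.25 * (n + 1))  # position 12.5
q3 = grouped_quantile(lower_bounds, 5, frequencies, 0.75 * (n + 1))  # position 37.5

print(round(q1, 1), round(q3, 2))
# prints: 22.3 41.75
```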

    Hours Worked

                Raw Data    Grouped Data
    mean         32.5        32.602
    median       30.2        32.5
    mode         20          22.08
    Q1           21.95       22.3
    Q2           30.2        32.5
    Q3           41.5        41.75

    Let's calculate the average price per bottle for each option.

    Bundle 1:

    (8 + 10 + 12 + 55 + 150) / 5 = $47 per bottle

    Bundle 2:

    [8(8) + 10(8) + 12(8) + 55(8) + 150(8)] / 5(8) = $47 per bottle

    Bundle 2 is a weighted average, but all the weights are the same.

    For Bundle 3, the weights will be different.

    Bundle 3:

    [8(123) + 10(62) + 12(32) + 55(2) + 150(1)] / (123 + 62 + 32 + 2 + 1)

    = 2248 / 220

    = $10.22 per bottle
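The three bundle calculations are all instances of a weighted average. A minimal sketch, with the function name `weighted_mean` as an illustrative assumption:

```python
def weighted_mean(values, weights):
    """Sum of value x weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

prices = [8, 10, 12, 55, 150]          # price per bottle, in dollars

bundle1 = weighted_mean(prices, [1, 1, 1, 1, 1])      # one bottle of each
bundle2 = weighted_mean(prices, [8, 8, 8, 8, 8])      # eight bottles of each
bundle3 = weighted_mean(prices, [123, 62, 32, 2, 1])  # unequal quantities

print(bundle1, bundle2, round(bundle3, 2))
# prints: 47.0 47.0 10.22
```

Equal weights cancel out, which is why Bundles 1 and 2 give the same average.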

    Quick Summary:

    A summary statistic is used to represent a typical value of our data.

    - mean

    - median

    - mode

    - quartiles

    We can calculate summary statistics for raw data and grouped data.

    III. Measures of Dispersion for Raw Data

    A summary statistic gives no indication about the dispersion of

    values within a set of data.

    Ex: You are a tour operator planning activities for two different

    tour groups. You are told the average age for each group is 50

    years old.

    When the tourists arrive you discover the ages of the individuals in

    each group are as follows:

    group 1: 48, 50, 52, 51, 49

    group 2: 22, 85, 72, 27, 64, 39, 41

    The range is the difference between the highest and lowest

    value in the data set.

    group 1 range = 52 - 48 = 4

    group 2 range = 85 - 22 = 63

    A smaller number indicates all data values are closer together.

    A larger number could indicate:

    1. the data are dispersed

    2. there are outliers

    Variance is a way of measuring how much each data point

    varies from the mean value.

    Let's calculate the difference between each data point and the mean: $(x_i - \mu)$ or $(x_i - \bar{x})$

    Then, calculate the sum of the differences for each group: $\sum (x_i - \mu)$ or $\sum (x_i - \bar{x})$

    Group 1                Group 2
    xi      xi - 50        xi      xi - 50
    48        -2           22       -28
    50         0           85        35
    52         2           72        22
    51         1           27       -23
    49        -1           64        14
                           39       -11
                           41        -9
    Total      0           Total      0

    To overcome the problem of the differences from the mean summing to zero: square each difference and then sum.

    $\sum (x_i - \mu)^2$ or $\sum (x_i - \bar{x})^2$

    Group 1                               Group 2
    xi      xi - 50    (xi - 50)^2        xi      xi - 50    (xi - 50)^2
    48        -2            4             22       -28           784
    50         0            0             85        35          1225
    52         2            4             72        22           484
    51         1            1             27       -23           529
    49        -1            1             64        14           196
                                          39       -11           121
                                          41        -9            81
    Total      0           10             Total      0          3420

    We can see that there is much larger variation from the mean in

    group 2 data.

    However, because our data sets are of unequal size, we should

    adjust for that.

    Divide the sum of squared differences by the number of observations.

    group 1: 10 / 5 = 2

    group 2: 3420 / 7 = 488.57

    This statistic is called the variance.

    Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$

    Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

    If n - 1 is used in the defining formula for the sample variance, then it is possible to prove that the average value of the sample variance equals the true variance.
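The two variance formulas differ only in the divisor. The sketch below computes both by hand for the tour groups; the function names are illustrative assumptions, and the comparison against the standard library's `statistics.pvariance` and `statistics.variance` is included purely as a cross-check.

```python
import statistics

group1 = [48, 50, 52, 51, 49]
group2 = [22, 85, 72, 27, 64, 39, 41]

def population_variance(data):
    """Sum of squared deviations from the mean, divided by N."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / (len(data) - 1)

pop_var1 = population_variance(group1)   # 10 / 5
pop_var2 = population_variance(group2)   # 3420 / 7
samp_var1 = sample_variance(group1)      # 10 / 4
samp_var2 = sample_variance(group2)      # 3420 / 6

# Cross-check against the standard library.
lib_pop_var2 = statistics.pvariance(group2)
lib_samp_var1 = statistics.variance(group1)

print(round(pop_var1, 2), round(pop_var2, 2))    # 2.0 488.57
print(round(samp_var1, 2), round(samp_var2, 2))  # 2.5 570.0
```

The worked example above divides by the number of observations, so it corresponds to the population formula.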

    The square root of the variance is called the standard deviation.

    - it is another way to measure the dispersion around the mean

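    - it is measured in the same units as the data

    - unless the data is a percent, in which case the standard deviation is in percentage points

    Population standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$

    Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$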

    For our example:

    group 1: 1.41

    group 2: 22.1
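These two values can be checked with `statistics.pstdev`, which takes the square root of the population variance (divisor N), matching the calculation used in this example. The variable names are illustrative.

```python
import statistics

group1 = [48, 50, 52, 51, 49]
group2 = [22, 85, 72, 27, 64, 39, 41]

sd1 = statistics.pstdev(group1)   # square root of 2
sd2 = statistics.pstdev(group2)   # square root of 488.57...

print(round(sd1, 2), round(sd2, 1))
# prints: 1.41 22.1
```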

    In the same way that a mean can be skewed by outliers, so can the variance and standard deviation.

    Looking at the median and quartiles may be informative.

    The interquartile range (IQR) is the difference between the upper and lower quartiles.

    The quartile deviation (QD), also called the semi-interquartile range, is the interquartile range divided by 2.

    Let's arrange our raw data into quartiles:

    First, order the data:

    group 1: 48,50,52,51,49 becomes 48, 49, 50, 51, 52

    Then, find Q1, Q2, Q3:

    Q2 = median = 50

    Q1: 0.25(5+1) = 1.5

    so average the 1st and 2nd values Q1 = 48.5

    Q3: 0.75(5+1) = 4.5

    so average the 4th and 5th values Q3 = 51.5

    Now, find the IQR and QD:

    IQR = 51.5 - 48.5 = 3

    QD = 3/2 = 1.5

    First, order the data:

    group 2: 22,85,72,27,64,39,41 becomes 22, 27, 39, 41, 64, 72, 85

    Then, find Q1, Q2, Q3:

    Q2 = median = 41

    Q1: 0.25(7+1) = 2, so Q1 is the 2nd value

    Q1 = 27

    Q3: 0.75(7+1) = 6, so Q3 is the 6th value

    Q3 = 72

    Now, find the IQR and QD:

    IQR = Q3 - Q1 = 72 - 27 = 45

    QD = 45/2 = 22.5

    Group 1 has a much lower IQR and QD than Group 2.
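The quartiles, IQR (Q3 - Q1) and QD (IQR / 2) for both tour groups can be computed as below. This sketch assumes `statistics.quantiles` with `method="exclusive"` as a stand-in for the (n + 1) position rule used by hand; the function name `quartile_summary` is hypothetical.

```python
import statistics

group1 = [48, 50, 52, 51, 49]
group2 = [22, 85, 72, 27, 64, 39, 41]

def quartile_summary(data):
    """Return Q1, Q2, Q3, the interquartile range and the quartile deviation."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
    iqr = q3 - q1          # interquartile range
    qd = iqr / 2           # quartile deviation
    return {"Q1": q1, "Q2": q2, "Q3": q3, "IQR": iqr, "QD": qd}

summary1 = quartile_summary(group1)
summary2 = quartile_summary(group2)

print(summary1)
print(summary2)
```

For group 1 this gives Q1 = 48.5, Q3 = 51.5, IQR = 3 and QD = 1.5; for group 2 it gives Q1 = 27, Q3 = 72, IQR = 45 and QD = 22.5.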

                   Group 1    Group 2
    Mean           50         50
    Median         50         41
    Range          4          63
    Variance       2          488.57
    Stand. Dev.    1.41       22.1
    Q1             48.5       27
    Q2             50         41
    Q3             51.5       72
    IQR            3          45
    QD             1.5        22.5

    IV. Measures of Dispersion for Grouped Data

    Suppose we have the following frequency distribution table for

    swimmers and their ages.

    Ages       Frequency (fi)
    17 < 19    14
    19 < 21    19
    21 < 23    11
    23 < 25     4
    25 < 27     1
    27 < 29     1
    Total      50

    To calculate the mean, we'll need the mid-interval values, so let's calculate them.

    Ages       Frequency (fi)    Mid-Interval Value (xi)
    17 < 19    14                18
    19 < 21    19                20
    21 < 23    11                22
    23 < 25     4                24
    25 < 27     1                26
    27 < 29     1                28
    Total      50                na

    The mean is given by $\frac{\sum f_i x_i}{\sum f_i}$

    We know the sum of the frequencies. We need to calculate the

    product of the frequencies and mid-interval value and then sum.

    Ages fi xi (fi)(xi)

    17 < 19 14 18 252

    19 < 21 19 20 380

    21 < 23 11 22 242

    23 < 25 4 24 96

    25 < 27 1 26 26

    27 < 29 1 28 28

    Total 50 na 1024

    So the mean for this grouped data is:

    1024 / 50 = 20.48

    Now that we have the mean, we can calculate the dispersion

    around the mean for each mid-interval value. Then square.

    - instead of taking each data point minus the mean, we

    are using the mid-interval value

    Multiply the squared terms by the frequency. Then sum.

    Ages fi xi (fi)(xi) (xi - mean) (xi - mean)^2

    17 < 19 14 18 252 -2.48 6.1504

    19 < 21 19 20 380 -0.48 0.2304

    21 < 23 11 22 242 1.52 2.3104

    23 < 25 4 24 96 3.52 12.3904

    25 < 27 1 26 26 5.52 30.4704

    27 < 29 1 28 28 7.52 56.5504

    Total 50 na 1024 na na

    Ages fi xi (fi)(xi) (xi - mean) (xi - mean)^2 fi(xi - mean)^2

    17 < 19 14 18 252 -2.48 6.1504 86.1056

    19 < 21 19 20 380 -0.48 0.2304 4.3776

    21 < 23 11 22 242 1.52 2.3104 25.4144

    23 < 25 4 24 96 3.52 12.3904 49.5616

    25 < 27 1 26 26 5.52 30.4704 30.4704

    27 < 29 1 28 28 7.52 56.5504 56.5504

    Total 50 na 1024 na na 252.48

    We can now use our grouped data variance formula.

    Variance = $\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{\sum f_i}$ = 252.48 / 50 = 5.0496

    The standard deviation is 2.247
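The table columns translate directly into code. A minimal sketch of the grouped mean, variance and standard deviation for the swimmers data; variable names are illustrative assumptions.

```python
import math

mid_values = [18, 20, 22, 24, 26, 28]    # x_i, mid-interval ages
frequencies = [14, 19, 11, 4, 1, 1]      # f_i

total_f = sum(frequencies)                                        # 50
mean = sum(f * x for f, x in zip(frequencies, mid_values)) / total_f

# Frequency-weighted sum of squared deviations from the grouped mean.
weighted_sq_dev = sum(f * (x - mean) ** 2
                      for f, x in zip(frequencies, mid_values))

variance = weighted_sq_dev / total_f
std_dev = math.sqrt(variance)

print(round(mean, 2), round(weighted_sq_dev, 2),
      round(variance, 4), round(std_dev, 3))
# prints: 20.48 252.48 5.0496 2.247
```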

    There is an alternative formula for calculating the variance for

    grouped data:

    $\sigma^2 = \frac{\sum f_i x_i^2}{\sum f_i} - \left( \frac{\sum f_i x_i}{\sum f_i} \right)^2$

    Let's calculate the mid-interval value squared and then multiply it by the frequency. Then sum.

    Ages fi xi (fi)(xi) (xi - mean) (xi - mean)^2 fi(xi - mean)^2 fi(xi)^2

    17 < 19 14 18 252 -2.48 6.1504 86.1056 4536

    19 < 21 19 20 380 -0.48 0.2304 4.3776 7600

    21 < 23 11 22 242 1.52 2.3104 25.4144 5324

    23 < 25 4 24 96 3.52 12.3904 49.5616 2304

    25 < 27 1 26 26 5.52 30.4704 30.4704 676

    27 < 29 1 28 28 7.52 56.5504 56.5504 784

    Total 50 na 1024 na na 252.48 21224

    $\sigma^2 = \frac{\sum f_i x_i^2}{\sum f_i} - \left( \frac{\sum f_i x_i}{\sum f_i} \right)^2$

    Variance = 21224/50 - (1024/50)^2

    = 424.48 - 419.4304

    = 5.0496

    Same answer as other formula!
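The agreement between the two formulas can be confirmed numerically. This sketch computes the variance both ways for the swimmers data; the names `direct_variance` and `shortcut_variance` are illustrative.

```python
mid_values = [18, 20, 22, 24, 26, 28]    # x_i
frequencies = [14, 19, 11, 4, 1, 1]      # f_i

total_f = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, mid_values))        # 1024
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, mid_values))  # 21224
mean = sum_fx / total_f

# Defining formula: weighted squared deviations from the mean.
direct_variance = sum(f * (x - mean) ** 2
                      for f, x in zip(frequencies, mid_values)) / total_f

# Alternative formula: mean of the squares minus the square of the mean.
shortcut_variance = sum_fx2 / total_f - (sum_fx / total_f) ** 2

print(sum_fx2, round(direct_variance, 4), round(shortcut_variance, 4))
# prints: 21224 5.0496 5.0496
```

The alternative form avoids computing each deviation from the mean, which makes it convenient for hand calculation.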

    Finally, let's calculate the inter-quartile range and the quartile deviation. We'll need the cumulative frequency to do this.

    Ages       fi    cumulative
    17 < 19    14    14
    19 < 21    19    33
    21 < 23    11    44
    23 < 25     4    48
    25 < 27     1    49
    27 < 29     1    50

    Q1: 0.25(50 + 1) = 12.75th position

    From the 0 items in the preceding interval, 12.75 more are needed to reach the 12.75th position.

    There are 14 items in the interval that contains Q1.

    From this we get: 12.75 / 14 = 0.91

    Take this times the size of the interval to get: 0.91 x 2 = 1.82

    Add this to the beginning of the interval to get: 1.82 + 17 = 18.82 = Q1

    Q2: 0.5(50 + 1) = 25.5th position

    From the 14 items in the preceding interval, 11.5 more are needed to reach the 25.5th position.

    There are 19 items in the interval that contains Q2.

    From this we get: 11.5 / 19 = 0.605

    Take this times the size of the interval to get: 0.605 x 2 = 1.21

    Add this to the beginning of the interval to get: 1.21 + 19 = 20.21 = Q2

    Q3: 0.75(50 + 1) = 38.25th position

    From the 33 items in the preceding interval, 5.25 more are needed to reach the 38.25th position.

    There are 11 items in the interval that contains Q3.

    From this we get: 5.25 / 11 = 0.4772

    Take this times the size of the interval to get: 0.4772 x 2 = 0.954

    Add this to the beginning of the interval to get: 0.954 + 21 = 21.95 = Q3

    The IQR = Q3 - Q1 = 21.95 - 18.82 = 3.13

    The QD = 3.13 / 2 = 1.565

    Summary of our Grouped data:

    mean 20.48

    variance 5.0496

    st. dev. 2.247

    median 20.21

    Q1 18.82

    Q2 20.21

    Q3 21.95

    IQR 3.13

    QD 1.57
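The quartile figures in this summary can be reproduced with a short interpolation routine. This is a sketch assuming values are spread uniformly within each interval; the function name `grouped_quantile` is hypothetical.

```python
def grouped_quantile(lower_bounds, width, frequencies, position):
    """Value at a (possibly fractional) position in equal-width grouped data."""
    cumulative = 0   # items in all preceding intervals
    for low, f in zip(lower_bounds, frequencies):
        if cumulative + f >= position:
            return low + (position - cumulative) / f * width
        cumulative += f
    raise ValueError("position lies beyond the last interval")

lower_bounds = [17, 19, 21, 23, 25, 27]   # start of each age interval
frequencies = [14, 19, 11, 4, 1, 1]
n = sum(frequencies)                      # 50 swimmers

q1 = grouped_quantile(lower_bounds, 2, frequencies, 0.25 * (n + 1))  # 12.75th
q2 = grouped_quantile(lower_bounds, 2, frequencies, 0.50 * (n + 1))  # 25.5th
q3 = grouped_quantile(lower_bounds, 2, frequencies, 0.75 * (n + 1))  # 38.25th

iqr = q3 - q1
qd = iqr / 2

print(round(q1, 2), round(q2, 2), round(q3, 2), round(iqr, 2), round(qd, 2))
# prints: 18.82 20.21 21.95 3.13 1.57
```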

    V. Other Descriptive Statistics

    The coefficient of variation (CV) is useful for comparing two

    sets of data when

    - the means are close but the variances are different

    - the means are different but the variances are close

    CV is independent of the units of measurement.

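    $CV = \frac{\sigma}{\mu} \times 100$

    (the standard deviation divided by the mean, expressed as a percentage)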
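As an illustration, the coefficient of variation can be computed for the two tour groups from earlier in the chapter, which share a mean of 50 but have very different spreads. This sketch assumes the population standard deviation is the one intended; the function name is illustrative.

```python
import statistics

group1 = [48, 50, 52, 51, 49]
group2 = [22, 85, 72, 27, 64, 39, 41]

def coefficient_of_variation(data):
    """Population standard deviation as a percentage of the mean."""
    return statistics.pstdev(data) / statistics.mean(data) * 100

cv1 = coefficient_of_variation(group1)
cv2 = coefficient_of_variation(group2)

print(round(cv1, 2), round(cv2, 1))
# prints: 2.83 44.2
```

Because the units cancel in the ratio, the two percentages can be compared directly.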

    Pearson's coefficient of skewness (sk) gives a measure of the degree of skewness in a dataset.

    - independent of units of measure

    sk = 3(mean - median) / standard deviation

    A negative value means the data is skewed to the left.

    A positive value means the data is skewed to the right.
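A short sketch of the skewness coefficient, applied to the two tour groups from earlier in the chapter. It assumes the population standard deviation; the function name `pearson_skewness` is illustrative.

```python
import statistics

group1 = [48, 50, 52, 51, 49]
group2 = [22, 85, 72, 27, 64, 39, 41]

def pearson_skewness(data):
    """3 x (mean - median) / standard deviation."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    return 3 * (mean - median) / statistics.pstdev(data)

sk1 = pearson_skewness(group1)   # mean equals median, so no skew
sk2 = pearson_skewness(group2)   # mean 50 is above median 41

print(round(sk1, 2), round(sk2, 2))
# prints: 0.0 1.22
```

Group 2 gives a positive value, consistent with its mean lying above its median.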

    A box plot is a graphical display of the symmetry or skewness

    of a dataset.

    The middle bar in the box represents the median.

    Each end of the box is Q1 and Q3.

    The whiskers extend to the minimum and maximum data values.

    - as long as the value is within (1.5)(IQR) of the nearest end of the box

    - otherwise the value is marked with an *
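The numbers behind a box plot can be computed without drawing it. The sketch below uses the piano-lesson ages from earlier in the chapter and assumes the common convention that the 1.5 x IQR limit is measured from Q1 and Q3; quartiles come from `statistics.quantiles` with `method="exclusive"`, and the variable names are illustrative.

```python
import statistics

ages = [5, 6, 7, 7, 7, 8, 9, 9, 32]

q1, median, q3 = statistics.quantiles(ages, n=4, method="exclusive")
iqr = q3 - q1

# Values beyond these limits are drawn individually (marked with *).
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

inside = [a for a in ages if lower_limit <= a <= upper_limit]
outliers = [a for a in ages if a < lower_limit or a > upper_limit]

# Whiskers stop at the most extreme values that are still inside the limits.
whisker_low, whisker_high = min(inside), max(inside)

print(q1, median, q3, whisker_low, whisker_high, outliers)
# prints: 6.5 7.0 9.0 5 9 [32]
```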

    Chapter Skills:

    Given raw data you should be able to calculate:

    mean                        median
    mode                        quartiles
    variance                    standard deviation
    coefficient of variation    Pearson's coefficient
    box plot

    Given raw data you should be able to construct a frequency distribution table and cumulative frequency.

    From grouped data you should be able to calculate:

    mean                        median
    mode                        quartiles
    variance                    standard deviation
    coefficient of variation    Pearson's coefficient
    box plot