stats powerpoint from the worst prof in the world

Upload: erikatoda

Post on 02-Jun-2018


  • 8/10/2019 stats powerpoint from the worst prof in the world


    Preliminaries Empirical Data Distributions Summary Measures Miscellany

    Preliminaries

    Topics:

    What is Statistics?

    Typical Descriptive Statistics Problems

    Note for the Student

    It is recommended that students read this section in its entirety before coming to class for the lecture to ensure that they have the required background information.[1]

    During the lecture I will mainly focus on sections which have a direct bearing on the lecture topic under discussion.

    Material in the last section serves to complement what we cover during the lecture.

    [1] This also applies to the Preliminaries section of subsequent lecture slides.


    Statistics Overview

    Topics:

    What is Statistics?

    Applications of Statistics

    Learning Objectives:

    Learn the nature of Statistics and study its relevance to Business Research Analysis and Decision Making.

    Learn about the different subdisciplines of Statistics concerned with extracting descriptive information from data, assessing uncertainty and making statistical inferences & predictions.


    What is Statistics?

    Statistics is the discipline which makes use of mathematical and computational techniques to, among other things,

    collect data using surveys, observational studies or designed experiments;

    describe, summarize and present the collected data;

    assess and quantify uncertainty;

    draw inferences about population characteristics based on sample information;

    assess the statistical significance of observed differences or presence of associations;

    construct empirical models to obtain estimates, test hypotheses or for predictive purposes;

    make projections using cross-sectional or time series data.


    Applications of Statistics

    Some Applications:

    Marketing Research

    E.g., Assessing Brand Preferences for a Given Product

    Finance

    E.g., Measuring the Credit Risk of a Counterparty

    Insurance

    E.g., Measuring Risk of an Insurance Portfolio

    Reliability Engineering

    E.g., Assessing the Reliability of an Aircraft Engine

    Medical Research

    E.g., Determining the Efficacy of a New Drug

    Q: Do you think Statistics is worthwhile learning? If so, why?


    Typical Descriptive Statistics Problems

    Organizing Data

    Forty students in an Introductory Statistics course were asked to state their political affiliations (i.e., whether they favoured the Democratic (D), Republican (R) or Other (O) party). The following results were obtained.

    D R O R R R R R

    D O R D O O R D

    D R O D R R O R

    D O D D D R O D

    O R D R R R R D

    What type of data are we dealing with?

    What can we say about the distribution of political affiliations?

    Source: Adapted from Weiss (2012, p. 40).


    Summarizing Data

    Arterial blood pressures (in mm of mercury) for a sample of 16 children of diabetic mothers are given below.

    81.6 84.1 87.6 82.8

    82.0 88.9 86.7 96.4

    84.6 104.9 90.8 94.0

    69.4 78.9 75.2 91.0

    What does the data tell you about the average blood pressure of a child whose mother is diabetic?

    What can we conclude about the variability of the blood pressure measurements?

    Source: Adapted from Weiss (2012, p. 95)


    Empirical Data Distributions

    Topics:

    Tabulating Data Distributions

    Graphing Data Distributions

    Learning Objectives:

    Learn tabular and graphical techniques for organizing and presenting data.

    Learn how to choose among the available techniques for a given problem in descriptive statistical analysis.

    Note:

    Much of the material in this and the next section is of a review nature. We'll quickly review such material but spend more time on material students are less familiar with.


    Tabulating Data Distributions

    Tabulating Categorical Data

    The first column of the table contains the possible categories and the second column the corresponding absolute frequencies (optionally, relative frequencies may also be given in another column).

    Example

    Consider the political affiliation data given in the first illustrative problem. Following is the frequency table for the data.

    Affiliation | Abs Freq   Rel Freq
    ------------+--------------------
    Democratic  |    13       0.325
    Republican  |    18       0.450
    Other       |     9       0.225
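The tabulation above can be reproduced with a short script. This is a minimal sketch in Python (not the software used for the slides); the variable names are illustrative only, and the responses are typed in from the first illustrative problem.

```python
from collections import Counter

# Responses copied from the political affiliation example (D, R, O).
responses = (
    "D R O R R R R R "
    "D O R D O O R D "
    "D R O D R R O R "
    "D O D D D R O D "
    "O R D R R R R D"
).split()

n = len(responses)               # total number of students
counts = Counter(responses)      # absolute frequencies per category
rel_freq = {cat: counts[cat] / n for cat in counts}   # relative frequencies

for cat in ("D", "R", "O"):
    print(cat, counts[cat], rel_freq[cat])
```

The relative frequency of a category is simply its absolute frequency divided by the number of observations, so the relative frequencies always sum to 1.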


    Tabulating Numerical Data

    In an absolute frequency table, the number of observations in each class (i.e., pre-defined sub-interval) is presented.

    Class     | Frequency
    ----------+----------
    (l1, u1]  | n1
    (l2, u2]  | n2
    (l3, u3]  | n3
    ...       | ...
    (lk, uk]  | nk

    Abs Frequency Table

    Class     | Frequency
    ----------+----------
    (10, 20]  | 3
    (20, 30]  | 7
    (30, 40]  | 4
    (40, 50]  | 4
    (50, 60]  | 2

    Note: (10, 20] refers to values between 10 (exclusive) and 20 (inclusive) etc.


    Example [Frequency Tables]

    The absolute frequency table in the previous slide was obtained from the following raw data

    12 13 17 21 24 24 26 27 27 30

    32 35 37 38 41 43 44 46 53 58

    The corresponding relative and cumulative frequency tables are:

    Class     | Rel Freq
    ----------+---------
    (10, 20]  | 0.15
    (20, 30]  | 0.35
    (30, 40]  | 0.20
    (40, 50]  | 0.20
    (50, 60]  | 0.10

    Class     | Cum Freq
    ----------+---------
    (10, 20]  | 0.15
    (20, 30]  | 0.50
    (30, 40]  | 0.70
    (40, 50]  | 0.90
    (50, 60]  | 1.00
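As a rough illustration of how such tables can be built from the raw data, the following Python sketch assigns each value to a left-open, right-closed class of width 10 starting at 10. The names `lower`, `width` and `k` are assumptions introduced for the example.

```python
import math

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
lower, width, k = 10, 10, 5   # classes (10, 20], (20, 30], ..., (50, 60]

counts = [0] * k
for x in data:
    # ceil sends a value in (lower + j*width, lower + (j+1)*width] to class j,
    # so a value equal to an upper limit (e.g. 30) stays in the lower class
    j = math.ceil((x - lower) / width) - 1
    counts[j] += 1

n = len(data)
rel = [c / n for c in counts]                        # relative frequencies
cum = [sum(counts[: j + 1]) / n for j in range(k)]   # cumulative frequencies

for j in range(k):
    lo = lower + j * width
    print(f"({lo}, {lo + width}]", counts[j], rel[j], cum[j])
```

Each cumulative frequency is the running total of the relative frequencies up to and including that class, which is why the last entry equals 1.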

    Q: What can we deduce from each table?


    Graphing Data Distributions

    Graphing Distributions for Categorical Data

    Pie Chart

    A circle is divided into pie slices. The area of each slice is proportional to the relative frequency of each category.

    Example

    For the political affiliation data, we have the following pie chart.

    Pie Slice   | Angle
    ------------+--------
    Democratic  | 117 deg
    Republican  | 162 deg
    Other       |  81 deg
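The slice angles follow from multiplying each relative frequency by 360 degrees. A small Python sketch, using the absolute frequencies from the frequency table (the dictionary name is illustrative):

```python
# Absolute frequencies from the political affiliation frequency table.
counts = {"Democratic": 13, "Republican": 18, "Other": 9}
total = sum(counts.values())

# angle = relative frequency x 360 degrees
angles = {cat: 360 * c / total for cat, c in counts.items()}
print(angles)
```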

    Q: How can we improve on this graphical display?


    Bar Chart

    Each category is represented by a vertical (or horizontal) bar. The height (or width) of each bar is equal or proportional to the absolute or relative frequency of a category.

    Example

    For the political affiliation data, we have the following bar chart.

    Q: Which is preferred? A pie chart or bar chart?


    Side-by-Side Bar Chart

    This chart may be used to present bivariate categorical data.

    Example [Side-by-Side Bar Chart]

    Consider the following distribution of student grades by gender.

           |   A      B      C      D      E
    -------+----------------------------------
    Female |   3      9      7      1      1
    Male   |   4      6      5      3      1

    In relative terms, we have the following table.

           |   A      B      C      D      E
    -------+----------------------------------
    Female | 0.14   0.43   0.33   0.05   0.05
    Male   | 0.21   0.32   0.26   0.16   0.05
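The relative table is obtained by dividing each count by its row total (21 females, 19 males), so each row sums to roughly 1 after rounding. A brief Python sketch of that step, with illustrative names:

```python
grades = ["A", "B", "C", "D", "E"]
table = {
    "Female": [3, 9, 7, 1, 1],
    "Male":   [4, 6, 5, 3, 1],
}

# Divide every count by its own row total and round to two decimals.
relative = {
    group: [round(c / sum(row), 2) for c in row]
    for group, row in table.items()
}

for group, row in relative.items():
    print(group, dict(zip(grades, row)))
```

Working with row-wise relative frequencies removes the effect of the two groups having different sizes, which matters when the distributions are compared side by side.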


    Example [Side-by-Side Bar Chart] (contd)

    Information in the first (second) table may be displayed by the chart in the left (right) panel of the following figure.

    Q: What conclusion(s) can be drawn from the above figure?

    Q: Does it matter which chart you base your conclusions on?

    Source: Adapted from Chow et al (2007, p. 7).


    Graphing Distributions for Numerical Data

    Absolute Frequency Histogram

    Displays information contained in an absolute frequency table using vertical bars with no gaps between bars.

    The height of each bar gives the number of observations that lie in the interval determined by the base of the bar.

    Example

    Class     | Frequency
    ----------+----------
    (10, 20]  | 3
    (20, 30]  | 7
    (30, 40]  | 4
    (40, 50]  | 4
    (50, 60]  | 2


    Relative Frequency Histogram

    Displays information in a relative frequency table by vertical bars with no gaps between bars.

    The area of each bar gives the fraction of observations that lie in the interval determined by the base of the bar.

    Example

    Class     | Rel Freq
    ----------+---------
    (10, 20]  | 0.15
    (20, 30]  | 0.35
    (30, 40]  | 0.20
    (40, 50]  | 0.20
    (50, 60]  | 0.10

    Q: What can you conclude from the above figure?


    Cumulative Frequency Polygon

    Displays a plot of cumulative frequency against upper class limit in an expanded cumulative frequency table (as illustrated below).

    Example

    Class     | Cum Freq (%)
    ----------+-------------
    (0, 10]   |   0
    (10, 20]  |  15
    (20, 30]  |  50
    (30, 40]  |  70
    (40, 50]  |  90
    (50, 60]  | 100

    Q: What useful statistic(s) can we deduce from such plots?


    Digression: Quartiles

    Let $x_1, x_2, \ldots, x_n$ denote a set of $n$ observations for our study.

    Usually, the $x_i$'s are unordered.

    For some applications, we need to work with ordered values in the dataset, i.e., with $x_{(i)}$'s such that

    \[ x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}. \]

    Define

    \[
    Q_2 = \text{second quartile of the } x_i\text{'s}
        = \begin{cases}
            \tfrac{1}{2}\left( x_{(k)} + x_{(k+1)} \right), & \text{if } n = 2k, \\
            x_{(k+1)}, & \text{if } n = 2k + 1.
          \end{cases}
    \]

    Note that $Q_2$ is also referred to as the median of the $x_i$'s.


    The first quartile, denoted Q1, may be defined as the median of the $x_i$ values less than or equal to Q2.

    The third quartile, denoted Q3, may be defined as the median of the $x_i$ values greater than or equal to Q2.

    Example

    For the following set of 5 observations

    101.96 109.76 99.63 99.76 100.22

    the corresponding ordered sample is

    99.63 99.76 100.22 101.96 109.76.

    Here, Q1 = 99.76, Q2 = 100.22 and Q3 = 101.96.
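The quartile definitions used in this digression translate directly into code. The sketch below is one possible Python rendering of those definitions (other textbooks and software packages use different quartile conventions, so results may differ elsewhere); the function names are illustrative.

```python
def median(values):
    """Median as defined for Q2: middle value, or mean of the two middle values."""
    s = sorted(values)
    n = len(s)
    k = n // 2
    # n = 2k + 1: x_(k+1) is s[k] (0-based); n = 2k: average x_(k) and x_(k+1)
    return s[k] if n % 2 == 1 else (s[k - 1] + s[k]) / 2


def quartiles(values):
    """Q1, Q2, Q3 with Q1/Q3 taken as medians of the values <= Q2 and >= Q2."""
    q2 = median(values)
    q1 = median([v for v in values if v <= q2])
    q3 = median([v for v in values if v >= q2])
    return q1, q2, q3


sample = [101.96, 109.76, 99.63, 99.76, 100.22]
print(quartiles(sample))
```

Note that when n is odd the median itself belongs to both the lower and the upper group, as in the five-observation example.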


    Stem and Leaf Diagram

    A stem and leaf diagram (like the one shown below) is a graphical display that shows the distribution of a set of numerical values. From it, one can

    sometimes recover the original data;

    easily infer empirical percentiles;

    obtain measures of central tendency and dispersion.

    Example

    1 | 67788899

    2 | 0012257

    3 | 28

    4 | 2

    Ordered data: 16, 17, . . . , 38, 42.

    Distribution is right-skewed.

    Q1 = 18, Q2 = 20 and Q3 = 23.5

    Min = 16 and Max = 42.


    Example [Stem and Leaf Display]

    For the Cord Strength dataset

    25 25 36 31 26 36 29 37 37 20
    34 27 21 35 30 41 33 21 26 26
    19 25 14 32 30 29 31 26 22 24
    34 33 28 26 43 30 40 32 32 31
    25 26 27 34 33 27 33 29 30 31

    we obtain

    1 | 4

    1 | 9

    2 | 01124

    2 | 55556666667778999

    3 | 000011112223333444

    3 | 56677

    4 | 013


    Boxplots

    We introduce the boxplot via a couple of examples.

    Example [Boxplot]

    Weekly television viewing times (in hours) of a sample of 20 people are given below.

    25 41 27 32 43

    66 35 31 15 5

    34 26 32 38 16

    30 38 30 20 21

    To obtain a boxplot, begin by finding the quartiles.

    5 15 16 20 21

    25 26 27 30 30

    31 32 32 34 35

    38 38 41 43 66

    Q1 = 23

    Q2 = 30.5

    Q3 = 36.5


    Example [Boxplot] (contd)

    Then, determine the following limits

    \[ \text{Lower Limit} = Q_1 - 1.5 \times \mathrm{IQR} = 2.75, \]

    \[ \text{Upper Limit} = Q_3 + 1.5 \times \mathrm{IQR} = 56.75, \]

    where IQR = 36.5 - 23 = 13.5. Finally, obtain 5 and 43 as the adjacent values[a] and note that 66 is a potential outlier since it falls outside the interval (2.75, 56.75).

    [a] Adjacent values are the most extreme values that lie within the lower and upper limits; they are the most extreme observations that are not potential outliers (Weiss, 2012, p. 120).
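The boxplot ingredients for the television viewing data can be checked numerically. The following Python sketch assumes the quartile convention of the earlier digression (Q1 and Q3 as medians of the values at or below and at or above Q2); variable names are illustrative.

```python
from statistics import median

# Weekly television viewing times (hours) for the sample of 20 people.
hours = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
         34, 26, 32, 38, 16, 30, 38, 30, 20, 21]

q2 = median(hours)
q1 = median([h for h in hours if h <= q2])   # median of the lower group
q3 = median([h for h in hours if h >= q2])   # median of the upper group
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # lower limit
upper = q3 + 1.5 * iqr   # upper limit

# Adjacent values: most extreme observations still inside the limits.
inside = [h for h in hours if lower <= h <= upper]
adjacent = (min(inside), max(inside))

# Potential outliers: observations outside the limits.
outliers = sorted(h for h in hours if h < lower or h > upper)

print(q1, q2, q3, iqr, lower, upper, adjacent, outliers)
```

The whiskers of the boxplot extend to the adjacent values, while potential outliers are plotted individually.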


    Example [Parallel Boxplots]

    Measurements on skinfold thickness (in mm) for samples of runners and nonrunners in the same age group are given below.

        Runners      |      Nonrunners
    -----------------+-----------------------
     7.3  6.7  8.7   | 24.0 19.9  7.5 18.4
     3.0  5.1  8.8   | 28.0 29.4 20.3 19.0
     7.8  3.8  6.2   |  9.3 18.1 22.8 24.2
     5.4  6.4  6.3   |  9.6 19.4 16.3 16.3
     3.7  7.5  4.6   | 12.4  5.2 12.2 15.6

                       |                        Group
    Statistics         | Runners                   Nonrunners
    -------------------+----------------------------------------------------------
    5 Num Summary      | 3.0, 4.85, 6.3, 7.4, 8.8  5.2, 12.3, 18.25, 21.55, 29.4
    Limits             | 1.025, 11.225             -1.575, 35.425
    Adjacent Values    | 3.0, 8.8                  5.2, 29.4
    Potential Outliers | None                      None


    Example [Parallel Boxplots] (contd)

    Q: What conclusions can you draw from the above figure?

    Source: Adapted from Weiss (2012, pp. 121-122)


    Summary Measures

    Topics:

    Location & Spread of a Distribution

    Measures of Central Tendency

    Measures of Dispersion

    Summary Measures for Grouped Data

    Learning Objectives:

    Learn how to measure the location and spread of the distribution of raw data for a single numerical variable.

    Learn how to obtain summary measures from grouped data.

    Learn how to interpret and choose between the various summary measures.

    Learn the role played by robustness in the selection of a summary measure.


    Location & Spread of a Distribution


    Measures of Central Tendency

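    Let $x_1, x_2, \ldots, x_n$ denote a set of $n$ observations with corresponding ordered values $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$.

    Some measures of central tendency are given below.

    Mean

    \[ \text{mean} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}, \text{ say.} \]

    Median

    \[
    \text{median} = \begin{cases}
        \tfrac{1}{2}\left( x_{(k)} + x_{(k+1)} \right), & \text{if } n = 2k, \\
        x_{(k+1)}, & \text{if } n = 2k + 1.
      \end{cases}
    \]

    Mode

    mode = data value with highest frequency.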


    Example

    Consider dataset

    101.96, 109.76, 99.63, 99.76, 100.22

    with corresponding ordered values

    99.63, 99.76, 100.22, 101.96, 109.76.

    Here, the mean is

    \[ \bar{x} = \frac{101.96 + 109.76 + 99.63 + 99.76 + 100.22}{5} \approx 102.27 \]

    and median = $x_{(3)}$ = 100.22.
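These two figures can be verified with Python's standard `statistics` module; this is only a quick check of the arithmetic, with illustrative variable names.

```python
from statistics import mean, median

x = [101.96, 109.76, 99.63, 99.76, 100.22]

x_bar = mean(x)    # arithmetic mean: sum of the values divided by n
med = median(x)    # n = 5 is odd, so this is the 3rd ordered value

print(round(x_bar, 2), med)
```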

    Q: What about the mode?


    Advantages & Disadvantages

    Feature                     | Mean  Median  Mode
    ----------------------------+-------------------
    Always Exists?              |  Y      Y      N
    Always Unique?              |  Y      N      N
    Not Affected by Outliers?   |  N      Y      Y
    Further Analysis Potential? |  Y      N      N

    Note

    Use a robust (i.e., resistant) measure of central tendency when outlying values (assuming these are valid) are present.

    The trimmed mean is an example of a robust measure of location - see Exercise 3.54 on p. 101 of Weiss (2012) for a specific illustration.

    Q: What about the mean and median?


    Example [Robustness]

    The mean is not robust since it is affected by outlying (extreme) observations.

    > set.seed(2012)

    > x <- ...

    > mean(x)

    [1] 10.03585

    > median(x)

    [1] 10.09504

    Note that I've decided to stop using R for this course. You may ignore the R code that you see in this and the next three examples.


    Example [Robustness] (contd)

    > x <- ...

    > x[50] <- ...

    > mean(x)

    [1] 10.37307

    > median(x)

    [1] 10.09504

    The median is not affected by extreme observations and hence it is a robust measure of central tendency.
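The same point can be illustrated on a small dataset. The Python sketch below does not reproduce the simulated data above; instead it takes the five-observation sample from the earlier example and replaces its largest value with a hypothetical extreme value of 1000 (an assumption made purely for illustration).

```python
from statistics import mean, median

x = [99.63, 99.76, 100.22, 101.96, 109.76]

# Replace the largest observation with a hypothetical extreme value.
contaminated = x[:-1] + [1000.0]

print(mean(x), median(x))                        # original sample
print(mean(contaminated), median(contaminated))  # after contamination
```

The mean moves a long way towards the extreme value, while the median stays at the middle ordered observation.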


    Relative Magnitude of Location Measures

    Example

    > table(x)

    x

    1 2 3 4 5 6 7

    4 7 23 32 23 7 4

    > mean(x)

    [1] 4

    > median(x)

    [1] 4

    The above example illustrates the case when

    mean = median = mode.


    In the next example, we have

    mean < median = mode.

    Example

    > table(x)

    x

    1 2 3 4 5 6 7

    2 4 7 12 15 33 27

    > mean(x)

    [1] 5.41

    > median(x)

    [1] 6


    It is also possible that

    mean > median = mode.

    Example

    > table(x)

    x

    1 2 3 4 5 6 7

    27 33 15 12 7 4 2

    > mean(x)

    [1] 2.59

    > median(x)

    [1] 2

    Q: What is the practical significance of these examples?


    Example [Mean vs Median]

    The ordered sample and stem and leaf display for some data on arterial blood pressure are given below.

    69.4 75.2 78.9 81.6

    82.0 82.8 84.1 84.6

    86.7 87.6 88.9 90.8

    91.0 94.0 96.4 104.9

     6 | 9

     7 | 59

     8 | 22345789

     9 | 1146

    10 | 5

    Here,

    \[ \bar{x} = 86.18 \quad \text{and} \quad \text{median} = 85.65. \]

    Q: Which measure do you recommend for the data at hand?


    Measures of Dispersion

    Some measures of dispersion are given below.

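    Range

    \[ \text{range} = x_{(n)} - x_{(1)} \]

    Interquartile Range

    \[ \text{IQR} = \text{Third Quartile} - \text{First Quartile} \]

    Variance

    \[ \text{variance} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

    Standard Deviation

    \[ \text{standard deviation} = \sqrt{ \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right) } \]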


    Example

    Consider the (ordered) dataset

    99.63, 99.76, 100.22, 101.96, 109.76.

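    Here,

    \[ \text{range} = 109.76 - 99.63 = 10.13 \]

    and

    \[ \text{IQR} = 101.96 - 99.76 = 2.2. \]

    Furthermore,

    \[ \text{variance} = \frac{99.63^2 + \cdots + 109.76^2 - 5 \times 102.27^2}{5 - 1} \approx 18.42 \]

    and

    \[ \text{standard deviation} \approx \sqrt{18.42} = 4.29. \]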
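The dispersion measures for this sample can be recomputed as a check. The Python sketch below uses the sample (n - 1) divisor from the definitions and takes Q1 and Q3 to be the 2nd and 4th ordered values, as found for this dataset in the quartiles digression; names are illustrative.

```python
from math import sqrt

x = sorted([101.96, 109.76, 99.63, 99.76, 100.22])
n = len(x)
x_bar = sum(x) / n

data_range = x[-1] - x[0]        # x_(n) - x_(1)
iqr = x[3] - x[1]                # Q3 - Q1 for this five-value sample

# Definitional form and the equivalent computational form of the variance.
variance = sum((v - x_bar) ** 2 for v in x) / (n - 1)
variance_alt = (sum(v * v for v in x) - n * x_bar ** 2) / (n - 1)
sd = sqrt(variance)

print(round(data_range, 2), round(iqr, 2), round(variance, 2), round(sd, 2))
```

Both variance expressions agree up to floating-point rounding; the computational form avoids a second pass over the data but is more sensitive to rounding of the mean when done by hand.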


    A relative measure of dispersion is

    \[ \text{coefficient of variation} = \frac{\text{standard deviation}}{\text{mean}}. \]

    Example

    For data in the previous example,

    \[ \text{coefficient of variation} = \frac{4.29}{102.27} \approx 0.04. \]

    Advantages & Disadvantages

    Feature                   | R   V   SD  IQR  CV
    --------------------------+---------------------
    Always Exists?            | Y   Y   Y    Y   Y
    Always Unique?            | Y   N   N    N   N
    Not Affected by Outliers? | N   N   N    Y   N
    Absolute Measure?         | Y   Y   Y    Y   N
    Same Units?               | Y   N   Y    Y   N


    Example [Comparing Stock Performance]

    Following are annual logarithmic returns of Microsoft (MSFT) and Hewlett-Packard (HWP) for the period spanning 1995-1999.

    | 1995 1996 1997 1998 1999

    -----+------------------------------------

    MSFT | 0.3644 0.6622 0.5026 0.7648 0.5290

    HWP | 0.5014 0.1836 0.2156 0.1864 0.4921

    Some summary statistics for the returns are as follows:

    | MSFT HWP

    -------------+----------------

    Mean | 0.5646 0.3158

    Std Dev | 0.1539 0.1657

    Median       | 0.5290 0.2156

    IQR          | 0.1596 0.3057

    Coef of Var | 0.2727 0.5246
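The mean, standard deviation and coefficient of variation in the summary can be recomputed from the returns. A short Python sketch using the standard `statistics` module (`stdev` uses the n - 1 divisor); the dictionary layout is an illustrative choice.

```python
from statistics import mean, stdev

returns = {
    "MSFT": [0.3644, 0.6622, 0.5026, 0.7648, 0.5290],
    "HWP":  [0.5014, 0.1836, 0.2156, 0.1864, 0.4921],
}

summary = {}
for name, r in returns.items():
    m, s = mean(r), stdev(r)
    # Coefficient of variation: dispersion per unit of mean return.
    summary[name] = {"mean": m, "sd": s, "cv": s / m}

for name, stats in summary.items():
    print(name, {key: round(val, 4) for key, val in stats.items()})
```

Because the coefficient of variation is unit-free, it allows the spread of the two return series to be compared relative to their different average levels.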

    Q: Which of the two stocks performed better over 1995-1999?


    Mean & Variance for Grouped Data

    Grouped data refers to data in a frequency distribution.

    Example

    Class | Freq. Percent Cum.

    ------------+-----------------------------------

    (10,15] | 1 2.00 2.00

    (15,20] | 2 4.00 6.00

    (20,25] | 8 16.00 22.00

    (25,30] | 17 34.00 56.00

    (30,35] | 15 30.00 86.00

    (35,40] | 5 10.00 96.00

    (40,45] | 2 4.00 100.00

    ------------+-----------------------------------

    Information in the first and any one of the remaining three columns of the above table constitutes grouped data.


    Example

    For the grouped data given earlier, we have

    Class      |  ni     mi     mi*ni    mi^2*ni
    -----------+------------------------------------------
    (10,15]    |   1    12.5     12.5      156.25
    (15,20]    |   2    17.5     35.0      612.50
    (20,25]    |   8    22.5    180.0     4050.00
    (25,30]    |  17    27.5    467.5    12856.25
    (30,35]    |  15    32.5    487.5    15843.75
    (35,40]    |   5    37.5    187.5     7031.25
    (40,45]    |   2    42.5     85.0     3612.50
    -----------+------------------------------------------
    Total      |  50           1455.0    44162.50

    Hence,

    \[ \bar{x}_g = \frac{1455.0}{50} = 29.1 \quad \text{and} \quad s_g^2 = \frac{44162.50}{50} - 29.1^2 = 36.44. \]
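The grouped-data calculation can be sketched in Python as follows. The formulas are taken from the worked figures above: each class is represented by its midpoint mi, and the variance divides by n (not n - 1), which is what the arithmetic 44162.50/50 - 29.1^2 implies. Variable names are illustrative.

```python
# (lower limit, upper limit, frequency) for each class of the grouped data.
classes = [
    (10, 15, 1), (15, 20, 2), (20, 25, 8), (25, 30, 17),
    (30, 35, 15), (35, 40, 5), (40, 45, 2),
]

n = sum(f for _, _, f in classes)
mids = [(lo + hi) / 2 for lo, hi, _ in classes]     # class midpoints mi
freqs = [f for _, _, f in classes]                  # class frequencies ni

sum_mf = sum(m * f for m, f in zip(mids, freqs))        # sum of mi * ni
sum_m2f = sum(m * m * f for m, f in zip(mids, freqs))   # sum of mi^2 * ni

mean_g = sum_mf / n
var_g = sum_m2f / n - mean_g ** 2

print(n, sum_mf, sum_m2f, mean_g, round(var_g, 2))
```

Since every observation in a class is replaced by the class midpoint, these are approximations to the raw-data mean and variance.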

    Miscellany

    Topics:

    Summation Notation

    Classification of Statistical Studies

    Questions for Class Discussion

    Learning Objectives:

    Review the notation used for summation.

    Learn about different types of statistical studies.


    Summation Notation

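    Given numerical values $x_1, \ldots, x_n$, we have:

    \[ \sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n \]

    \[ \sum_{i=1}^{n} (a x_i + b) = (a x_1 + b) + \cdots + (a x_n + b) = a \sum_{i=1}^{n} x_i + nb \]

    Example

    If the $x_i$'s are given by 1.75, 2.25, 2.25, 2.25, 1.75, 2.00, 1.50, we have

    \[ \sum_{i=1}^{7} x_i = 13.75 \quad \text{and} \quad \sum_{i=1}^{7} x_i^2 = 1.75^2 + \cdots + 1.50^2 = 27.5625. \]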
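The sums in this example, and the linearity rule for sums of a*xi + b, can be checked numerically. In the Python sketch below the constants `a` and `b` are arbitrary values chosen only for illustration.

```python
x = [1.75, 2.25, 2.25, 2.25, 1.75, 2.00, 1.50]
n = len(x)

sum_x = sum(x)                     # sum of x_i
sum_x2 = sum(v * v for v in x)     # sum of x_i squared

a, b = 2.0, 3.0                    # arbitrary constants (illustrative only)
lhs = sum(a * v + b for v in x)    # sum of (a*x_i + b)
rhs = a * sum_x + n * b            # a * (sum of x_i) + n*b

print(sum_x, sum_x2, lhs, rhs)
```

Note that the sum of squares (27.5625) is not the square of the sum (13.75 squared), a common slip when working with summation notation.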

    Classification of Statistical Studies

    Observational Study

    Observed relationships and other inferences apply only to the study subjects (or objects) under investigation.

    No control of extraneous sources of variation.

    Example [Vasectomies & Prostate Cancer]

    A study found an association between vasectomy and prostate cancer - elevated risk after vasectomy.

    No information that the study was based on a properly chosen sample or a properly designed experiment.

    We cannot infer causation nor generalize the observed association.

    Source: Adapted from Weiss (2012, p. 7).


    Inferential Study

    The study is based on a properly chosen sample (e.g., random sample).

    Inferences made from sample information may be generalized to a larger population.

    Example [Testing Baseballs]

    An independent testing company investigated the liveliness of 85 randomly selected Rawlings baseballs from the 1977 supplies of major league teams.

    The Rawlings baseball was found to be more lively than the 1976 Spalding baseball.

    Source: Adapted from Weiss (2012, p. 6).


    Designed Experiments

    A proper randomization technique is used to allocate subjects (or objects) to treatment and control groups.

    Relevant sources of extraneous variation are controlled.

    Example [Folic Acid & Birth Defects]

    4753 women prior to conception were divided randomly into two groups. One group took daily doses of folic acid while the other took only trace elements.

    Incidence of major birth defects was much reduced for the group taking folic acid.

    Here, we can infer presence of a causal relationship.

    Source: Adapted from Weiss (2012, p. 7).

    Questions for Class Discussion

    Question 1

    A stem-and-leaf display of daily protein intake (in grams) for a sample of 51 female vegetarians is shown below.

    The decimal point is 1 digit(s) to the right of the |

    0 | 1259

    1 | 34558

    2 | 01889

    3 | 013566688899

    4 | 001235567

    5 | 002234467899

    6 | 88

    7 |

    8 | 05


    Question 1 (contd)

    A similar display for a sample of 53 female nonvegetarians is given below.

    The decimal point is 1 digit(s) to the right of the |

    0 | 5

    1 | 14

    2 | 34557

    3 | 4567779

    4 | 0112444569

    5 | 0003345577

    6 | 0113334799

    7 | 1157

    8 | 1444


    Question 1 (contd)

    (a) The quartiles for both groups of females are partially given in the following table. Fill in the missing entries in the table.

    Group         | 1st Quartile   2nd Quartile   3rd Quartile
    --------------+--------------------------------------------
    Vegetarian    |                     39
    Nonvegetarian |      38                            63

    Table: Quartiles of Vegetarian and Nonvegetarian Females

    (b) Based on information in (the completed) table, compare the location and spread of the two sets of data.

    (c) Identify potential outliers, if any, for each dataset. Do you obtain results that are consistent with what you observe in the stem-and-leaf displays?


    Question 2

    (a) Which of the following is not a property of the coefficient of variation?

    (i) It is not always unique.

    (ii) It is resistant to outliers.

    (iii) It is a relative measure.

    (iv) It is not in the same units as the original data.

    (b) The (arithmetic) mean computed from raw data is always unique. The same is true of the mean computed from grouped data. True or False?

    (c) The sample mid-range is a robust measure of location. True or False?


    Question 3

    Suppose you obtain the following five number summaries from data on annual (percentage) returns for common stock and government bonds over a fifteen year period.

    Investment: Bonds

    [1] -10.460 1.035 4.600 14.080 42.980

    Investment: Stocks

    [1] -25.930 -0.495 10.710 23.760 44.770

    (a) What types of statistics do the numbers in each summary represent?


    Question 3 (contd)

    (b) One of the values given in the five number summary for the bond returns looks unusual. Is it a potential outlier?

    (c) Of the two financial instruments, which is preferred if your primary investment objective is to choose the one that gives you the greater level of return on average?

    (d) Which is preferred if risk aversion is the key factor influencing your choice of investment to make?

    (e) Is there anything wrong with the following statement?

    Under appropriate conditions, the coefficient of variation is a useful measure to consider when making risk-reward trade-offs amongst several investment alternatives.


    Question 4

    Consider the following absolute frequency distribution obtained from data on distance (in miles) travelled to work for a random sample of 50 workers.

    Classes   | (10,20] (20,30] (30,40] (40,50]
    ----------+------------------------------------
    Frequency |    3      19      23       5

    (a) Determine the grouped data variance using information provided by the above empirical distribution.

    (b) Determine one other grouped data measure of dispersion.

    Acknowledgements

    The current slides are based in part on material from:

    Introductory Statistics (9th Edition) by Neil A. Weiss.

    Introductory Statistics (2nd Edition) by H. K. Chow, A. Ghosh, D. H. Y. Leung and Y. K. Tse.

    The slides were produced using The Beamer Class package and MiKTeX (a public domain document preparation system).

    Customized computations and graphics were produced using R (a public domain statistical software package).

    I am grateful to the developers of the above resources for making them available.