st102 mt section 2

Upload: tylertangtengyang

Post on 02-Jun-2018

    ST102 Elementary Statistical Theory

    Descriptive statistics

    Dr James Abdey

    Department of Statistics, London School of Economics and Political Science

    ST102 Elementary Statistical Theory Dr James Abdey MT 2014 Part I: 2. Descriptive statistics 31

    Part I: 2. Descriptive statistics

    Part I:

    1. Introduction.

    2. Descriptive statistics.

    3. Introduction to probability theory.

    4. Random variables.

    5. Some common distributions of random variables.

    6. Multivariate random variables.

    7. Sampling distributions of statistics.

    2. Descriptive statistics

    2.1: Introduction.

    2.2: The sample distribution.

    2.3: Measures of central tendency.

    2.4: Measures of dispersion.

    2.5: Associations between two variables.

    2.1: Introduction

    Starting point: A collection of numerical data (a sample) has been collected in order to answer some questions.

    Statistical analysis may have two broad aims:

    1. Descriptive statistics: Summarise the data that were collected, in order to make the data more understandable.

    2. Statistical inference: Use the observed data to draw conclusions about some broader population.

    Sometimes 1. is the only aim.

    Even when 2. is the main aim, 1. is still an essential first step.

    Need for descriptive statistics

    Data do not just speak for themselves: There are usually simply too many numbers to make sense of just by staring at them.

    Descriptive statistics attempt to summarise some key features of the data to make them understandable and easy to communicate.

    These summaries may be numerical (tables or individual summary statistics) or graphical.

    Example

    Consider data for 155 countries on three things, from around 2002:

    Region of the country.

    Coded as 1 = Africa, 2 = Asia, 3 = Europe, 4 = Latin America, 5 = North America and 6 = Oceania.

    Level of democracy in the country.

    An 11-point scale from 0 (lowest level of democracy) to 10 (highest).

    Gross domestic product (GDP) per capita (in $000s).

    A dataset

    The statistical data in a sample are typically stored in a data matrix:

    [Figure: the data matrix for the country example.]

    Units and variables

    Rows of the data matrix correspond to different units (subjects).

    Here each unit is a country.

    The number of units in a dataset is the sample size, typically denoted by the letter n.

    Here n = 155 countries.

    Columns of the data matrix correspond to variables, i.e. different characteristics of the units.

    Here region, level of democracy and GDP per capita are the variables.

    Continuous and discrete variables

    Different variables may have different properties. These determine what kinds of statistical methods are suitable for the variables.

    A continuous variable can, in principle, take any real value within some (continuous) interval.

    For example GDP per capita, which can take any value $\geq 0$.

    A variable is discrete if it is not continuous, i.e. if it can only take certain (usually integer) values, but not any others.

    For example region, with possible values 1, 2, 3, 4, 5 and 6, and the level of democracy, with possible values 0, 1, 2, ..., 10.

    Discrete variables: number of possible values

    Many discrete variables have only a finite number of possible values. In our example, region has 6, and level of democracy has 11 possible values.

    The simplest possibility is a binary (dichotomous) variable, with just two values. For example, a person's sex recorded as 1 = female and 2 = male.

    A discrete variable can also have an unlimited number of possible values.

    For example, the number of visitors to a website in a day could be

    0, 1, 2, 3, 4, . . . .

    Discrete variables: ordering of the values

    In the example, the levels of democracy have a meaningful ordering, from less to more democratic countries.

    The numbers assigned to the different levels must also be in this order: a larger number = more democratic.

    In contrast, different regions (Africa, Asia, Europe, Latin America, North America and Oceania) do not have such a natural ordering.

    The numbers used for the variable Region are just labels for different regions. A different numbering (for example 6 = Africa, 5 = Asia, 1 = Europe, 3 = Latin America, 2 = North America and 4 = Oceania) would be just as acceptable as the one we used.

    Some statistical methods are appropriate for variables with both ordered and unordered values, some only in the ordered case.

    Unordered categories are nominal data; ordered categories are ordinal data.

    Using computers for statistical analysis

    For understanding and practice, we make you calculate some descriptive statistics by hand.

    However, most real statistical analysis is done with computers, using statistical software packages.

    To give you an idea of how they work, in Exercise 1 we ask you to do some descriptive statistics with a package called Minitab.

    See a note on the ST102 Moodle site for instructions on how to use Minitab for the exercise.

    There are many other statistical packages which do more or less the same thing and which you may encounter in later courses: Stata, SPSS, R, SAS and others.

    2.2 The sample distribution

    The sample distribution of a variable consists of:

    a list of the values of the variable that are observed in the sample

    the number of times each value occurs (the counts or frequencies of the observed values).

    When the number of different observed values is small, we can show the whole sample distribution as a frequency table of all the values and their frequencies.

    Example: observations of region in the sample

    3 1 1 4 2 6 3 2 2 2 3 3 1 2 4

    1 4 3 1 2 1 1 2 1 5 1 4 2 4 1

    1 4 1 3 4 2 3 3 1 4 2 4 1 4 1

    1 3 1 6 3 3 1 1 2 3 1 3 4 1 1

    4 4 4 3 2 2 2 2 3 2 3 4 2 2 2

    1 2 2 2 3 1 1 1 3 3 1 1 2 1 1

    1 4 3 2 1 1 2 1 2 3 4 1 1 3 6

    2 2 4 4 4 2 6 3 3 2 3 3 1 1 2

    2 1 3 1 2 3 3 3 2 1 1 3 3 2 2

    2 1 2 1 4 1 2 2 2 1 3 3 4 5 2

    4 2 2 1 1

    Frequency table of region

    Region               Frequency    Relative frequency
                         (count)      (%)
    (1) Africa              48          31.0   [= 100 x (48/155)]
    (2) Asia                45          29.0
    (3) Europe              34          21.9
    (4) Latin America       23          14.8
    (5) North America        2           1.3
    (6) Oceania              3           1.9
    Total                  155         100

    Here % is the percentage of countries in a region, out of the sample of 155 countries. This is a measure of proportion (relative frequency).
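    The relative frequencies in the table can be reproduced from the counts alone. The following is a minimal sketch in Python (the course itself uses Minitab; the variable names are illustrative assumptions):

```python
# Counts copied from the frequency table of region.
region_counts = {
    "Africa": 48,
    "Asia": 45,
    "Europe": 34,
    "Latin America": 23,
    "North America": 2,
    "Oceania": 3,
}

n = sum(region_counts.values())  # sample size

# Relative frequency (%) = 100 * count / n, rounded to one decimal place.
relative_freq = {
    region: round(100 * count / n, 1)
    for region, count in region_counts.items()
}

for region, pct in relative_freq.items():
    print(f"{region:<14} {region_counts[region]:>4} {pct:>6.1f}")
print(f"{'Total':<14} {n:>4}")
```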

    Frequency table of the level of democracy

    Democracy                           Cumulative
    index        Frequency      %           %
     0              35         22.6        22.6
     1              12          7.7        30.3
     2               4          2.6        32.9
     3               6          3.9        36.8
     4               5          3.2        40.0
     5               5          3.2        43.2
     6              12          7.7        50.9
     7              13          8.4        59.3
     8              16         10.3        69.6
     9              15          9.7        79.3
    10              32         20.6       100
    Total          155        100

    Cumulative % for a value of the variable is the sum of the percentages for that value and all lower-numbered values.
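    As an illustration, cumulative percentages can be obtained by accumulating the counts. This is a sketch only, with assumed variable names; note that accumulating the exact counts and rounding at the end can differ in the last decimal place from adding up already-rounded percentages, which is what the table appears to do (for example 51.0 rather than 50.9 for the value 6).

```python
from itertools import accumulate

# Frequencies of democracy levels 0, 1, ..., 10 from the table.
frequencies = [35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32]
n = sum(frequencies)

# Running totals of the counts, then converted to percentages of n.
cumulative_counts = list(accumulate(frequencies))
cumulative_pct = [100 * c / n for c in cumulative_counts]

for value, (f, cum) in enumerate(zip(frequencies, cumulative_pct)):
    print(f"{value:>2} {f:>4} {100 * f / n:>6.1f} {cum:>6.1f}")
```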

    Bar charts

    A bar chart is the graphical equivalent of the table of frequencies.

    [Figure: bar chart of count (vertical axis, 0 to 50) by region (horizontal axis: Africa, Asia, Europe, Latin America, Northern America, Oceania).]

    Bar chart of the level of democracy

    [Figure: bar chart of percentage (vertical axis, 0.0% to 25.0%) by democracy index (horizontal axis, 0 to 10).]

    Sample distributions of variables with many values

    If a variable has many distinct values, listing frequencies of all of them does not work well.

    Solution: Group the values into non-overlapping intervals, and do a table or graph of the frequencies within the intervals.

    The most common graph used for this is a histogram.

    A histogram is like a bar chart, but without gaps between the bars.

    A histogram often uses more bars (intervals of values) than is sensible in a table.

    Histograms are usually drawn using statistical software; you can let the software choose the intervals and the number of bars.

    Table of frequencies for GDP per capita

    GDP per capita
    ($000s)            Frequency      %
    Less than 2.0         49         31.6
    2.0 to 4.9            32         20.6
    5.0 to 9.9            29         18.7
    10.0 to 19.9          21         13.5
    20.0 to 29.9          19         12.3
    30.0 or more           5          3.2
    Total                155        100

    2.3 Measures of central tendency

    Frequency tables, bar charts and histograms aim to summarise the whole sample distribution of a variable.

    Next we consider descriptive statistics which summarise one feature of the sample distribution in a single number: summary statistics.

    We begin with measures of central tendency. These answer the question: "Where is the centre or average of the distribution?" We consider:

    the mean (arithmetic mean or average)

    the median

    the mode.

    Preliminaries: notation for variables

    In formulae, a generic variable is denoted by a single letter.

    In these course notes, usually $X$.

    Any other letter ($Y$, $W$, etc.) can also be used, as long as it is used consistently.

    A letter with a subscript denotes a single observation of a variable.

    For example, we use $X_i$ to denote the value of $X$ for unit $i$, where $i$ can take values $1, 2, 3, \ldots, n$, and $n$ is the sample size.

    Therefore, the $n$ observations of $X$ in the dataset (the sample) are $X_1, X_2, X_3, \ldots, X_n$. These can also be written as $X_i$, $i = 1, \ldots, n$.

    Preliminaries: summation notation

    Let $X_1, X_2, \ldots, X_n$ (i.e. $X_i$, $i = 1, \ldots, n$) be a set of $n$ numbers. The sum of the numbers is written as:

    \[ \sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n. \]

    This may also be written as $\sum_i X_i$ or just $\sum X_i$.

    Other versions of the same idea:

    Infinite sums: $\sum_{i=1}^{\infty} X_i = X_1 + X_2 + \cdots$.

    Sums of sets of observations other than 1 to $n$, for example:

    \[ \sum_{i=2}^{n/2} X_i = X_2 + X_3 + \cdots + X_{n/2}. \]

    Properties of the summation operator

    Here $X_i$ and $Y_i$ ($i = 1, \ldots, n$) are sets of $n$ numbers.

    Here $a$ denotes a constant, i.e. a number with the same value for all $i$.

    All of the following results follow simply from the properties of addition (if you are still not convinced, try them with $n = 3$).

    (1) $\sum_{i=1}^{n} a = n \, a$.

    Proof:

    \[ \sum_{i=1}^{n} a = \underbrace{(a + \cdots + a)}_{n \text{ times}} = n \, a. \]

    (2) $\sum_i a X_i = a \sum_i X_i$.

    Proof:

    \[ \sum_i a X_i = (a X_1 + \cdots + a X_n) = a (X_1 + \cdots + X_n) = a \sum_i X_i. \]

    Properties of the summation operator

    (3) $\sum_i (X_i + Y_i) = \sum_i X_i + \sum_i Y_i$.

    Proof: Rearranging the elements of the summation, we get:

    \[ \sum_i (X_i + Y_i) = [(X_1 + Y_1) + (X_2 + Y_2) + \cdots + (X_n + Y_n)] \]

    \[ = [(X_1 + X_2 + \cdots + X_n) + (Y_1 + Y_2 + \cdots + Y_n)] \]

    \[ = (X_1 + X_2 + \cdots + X_n) + (Y_1 + Y_2 + \cdots + Y_n) \]

    \[ = \sum_i X_i + \sum_i Y_i. \]

    Extension: double (triple etc.) summation

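    Sometimes sets of numbers may be indexed with two (or even more) subscripts, for example as $X_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, m$.

    Summation over both indices is written as:

    \[ \sum_{i=1}^{n} \sum_{j=1}^{m} X_{ij} = \sum_{i=1}^{n} (X_{i1} + \cdots + X_{im}) = (X_{11} + \cdots + X_{1m}) + (X_{21} + \cdots + X_{2m}) + \cdots + (X_{n1} + \cdots + X_{nm}). \]

    The order of summation can be changed, that is:

    \[ \sum_{i=1}^{n} \sum_{j=1}^{m} X_{ij} = \sum_{j=1}^{m} \sum_{i=1}^{n} X_{ij}. \]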
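    A quick numerical check of the change of summation order may help. The array below is a hypothetical example (it does not come from the course data), with n = 2 rows and m = 3 columns:

```python
# Hypothetical doubly indexed numbers X[i][j], with n = 2 and m = 3.
X = [
    [1, 2, 3],
    [4, 5, 6],
]
n, m = len(X), len(X[0])

# Sum over j inside, then over i (row totals first).
rows_first = sum(sum(X[i][j] for j in range(m)) for i in range(n))

# Sum over i inside, then over j (column totals first).
cols_first = sum(sum(X[i][j] for i in range(n)) for j in range(m))

print(rows_first, cols_first)  # both equal 21
```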

    Product notation

    The analogous notation for the product of a set of numbers is:

    \[ \prod_{i=1}^{n} X_i = X_1 \cdot X_2 \cdots X_n. \]

    It follows from the properties of multiplication that, for example:

    1. $\prod_{i=1}^{n} a X_i = a^n \left( \prod_{i=1}^{n} X_i \right)$.

    2. $\prod_{i=1}^{n} a = a^n$.

    3. $\prod_{i=1}^{n} X_i Y_i = \left( \prod_{i=1}^{n} X_i \right) \left( \prod_{i=1}^{n} Y_i \right)$.

    The sample mean

    The sample mean (arithmetic mean, mean or average) is the most common measure of central tendency.

    The sample mean of a variable $X$ is denoted as $\bar{X}$.

    It is the sum of the observations divided by the number of observations (sample size):

    \[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}. \]

    The sample mean

    For example, the mean $\bar{X} = \sum_i X_i / n$ of the numbers 1, 4 and 7 is:

    \[ \bar{X} = \frac{1 + 4 + 7}{3} = \frac{12}{3} = 4. \]

    For the variables in the country example:

    The level of democracy: $\bar{X} = 5.3$.

    GDP per capita: $\bar{X} = 8.6$ (in $000s).

    Region: the mean is not meaningful, because the values of the variable do not have a meaningful ordering.

    Frequency table of the level of democracy

    Value of the level
    of democracy         Frequency               Cumulative
    ($X_j$)              ($f_j$)         %           %
     0                      35          22.6        22.6
     1                      12           7.7        30.3
     2                       4           2.6        32.9
     3                       6           3.9        36.8
     4                       5           3.2        40.0
     5                       5           3.2        43.2
     6                      12           7.7        50.9
     7                      13           8.4        59.3
     8                      16          10.3        69.6
     9                      15           9.7        79.3
    10                      32          20.6       100
    Total                  155         100

    Mean from a frequency table

    If a variable has a small number of distinct values, $\bar{X}$ is easy to calculate from the frequency table.

    For example, the level of democracy has just 11 different values, which occur in the sample 35, 12, ..., 32 times each, respectively.

    Suppose $X$ has $K$ different values $X_1, X_2, \ldots, X_K$, with corresponding frequencies $f_1, f_2, \ldots, f_K$. Then $\sum_{j=1}^{K} f_j = n$ and:

    \[ \bar{X} = \frac{\sum_{j=1}^{K} f_j X_j}{\sum_{j=1}^{K} f_j} = \frac{f_1 X_1 + \cdots + f_K X_K}{f_1 + \cdots + f_K} = \frac{f_1 X_1 + \cdots + f_K X_K}{n}. \]

    In our example, the mean level of democracy (where $K = 11$) is:

    \[ \bar{X} = \frac{35 \times 0 + 12 \times 1 + \cdots + 32 \times 10}{35 + 12 + 4 + \cdots + 32} = \frac{0 + 12 + 8 + \cdots + 320}{155} \approx 5.3. \]
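    The frequency-table formula can be checked with a few lines of Python. This is a sketch with assumed variable names, using the values and frequencies from the democracy table:

```python
# Distinct values X_j (0 to 10) and their frequencies f_j from the table.
values = list(range(11))
frequencies = [35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32]

n = sum(frequencies)                                            # sum of f_j
weighted_sum = sum(f * x for f, x in zip(frequencies, values))  # sum of f_j * X_j
mean = weighted_sum / n

print(n, weighted_sum, round(mean, 1))  # 155 829 5.3
```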

    Why is the mean a good summary of central tendency?

    Consider the following small dataset:

                     Deviations from $\bar{X}$ (= 4)           Deviations from median (= 3)
    i      $X_i$     $X_i - \bar{X}$    $(X_i - \bar{X})^2$    $X_i - 3$    $(X_i - 3)^2$
    1        1            -3                  9                   -2             4
    2        2            -2                  4                   -1             1
    3        3            -1                  1                    0             0
    4        5            +1                  1                   +2             4
    5        9            +5                 25                   +6            36
    Sum     20             0                 40                   +5            45

    $\bar{X} = 4$

    The sum of deviations from the mean is 0

    The mean is in the middle of the observations $X_1, \ldots, X_n$, in the sense that positive and negative values of the deviations $X_i - \bar{X}$ cancel out, when summed over all the observations, that is:

    \[ \sum_{i=1}^{n} (X_i - \bar{X}) = 0. \]

    Proof: [The proof uses the definition of $\bar{X}$ and properties of summation introduced earlier. Note that $\bar{X}$ is a constant in the summation, because it has the same value for all $i$.]

    \[ \sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} \bar{X} = \sum_{i=1}^{n} X_i - n \bar{X} \]

    \[ = \sum_{i=1}^{n} X_i - n \, \frac{\sum_{i=1}^{n} X_i}{n} = \sum_{i=1}^{n} X_i - \sum_{i=1}^{n} X_i = 0. \]

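    Mean minimises the sum of squared deviations

    The smallest possible value of the sum of squared deviations

    \[ \sum_{i=1}^{n} (X_i - C)^2 \]

    for any constant $C$ is obtained when $C = \bar{X}$.

    Proof:

    \[ \sum (X_i - C)^2 = \sum (X_i \underbrace{- \bar{X} + \bar{X}}_{=0} - C)^2 = \sum [(X_i - \bar{X}) + (\bar{X} - C)]^2 \]

    \[ = \sum [(X_i - \bar{X})^2 + 2 (X_i - \bar{X})(\bar{X} - C) + (\bar{X} - C)^2] \]

    \[ = \sum (X_i - \bar{X})^2 + \sum 2 (X_i - \bar{X})(\bar{X} - C) + \sum (\bar{X} - C)^2 \]

    \[ = \sum (X_i - \bar{X})^2 + 2 (\bar{X} - C) \underbrace{\sum (X_i - \bar{X})}_{=0} + n (\bar{X} - C)^2 \]

    \[ = \sum (X_i - \bar{X})^2 + n (\bar{X} - C)^2 \]

    \[ \geq \sum (X_i - \bar{X})^2 \]

    since $n (\bar{X} - C)^2 \geq 0$ for any choice of $C$. Equality is obtained only when $C = \bar{X}$, so that $n (\bar{X} - C)^2 = 0$.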
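    Both properties of the mean can be verified numerically on the small dataset 1, 2, 3, 5, 9 used above. The sketch below uses assumed helper names; the grid of candidate constants is an illustrative choice only, not part of the proof.

```python
x = [1, 2, 3, 5, 9]
n = len(x)
mean = sum(x) / n  # 4.0

def sum_sq_dev(data, c):
    """Sum of squared deviations of the data from the constant c."""
    return sum((xi - c) ** 2 for xi in data)

# Deviations from the mean cancel out.
sum_dev = sum(xi - mean for xi in x)

# Squared deviations about the mean versus about the median (3).
ss_mean = sum_sq_dev(x, mean)
ss_median = sum_sq_dev(x, 3)

# Identity from the proof: sum (X_i - C)^2 = sum (X_i - mean)^2 + n (mean - C)^2.
identity_rhs = ss_mean + n * (mean - 3) ** 2

# Search a coarse grid of candidate constants C = 0, 0.5, ..., 10.
candidates = [c / 2 for c in range(21)]
best_c = min(candidates, key=lambda c: sum_sq_dev(x, c))

print(sum_dev, ss_mean, ss_median, identity_rhs, best_c)
```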

    The (sample) median

    Let $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ denote the sample values of $X$ ordered from the smallest to the largest, such that:

    $X_{(1)}$ is the smallest observed value (the minimum) of $X$

    $X_{(n)}$ is the largest observed value (the maximum) of $X$.

    The (sample) median

    The (sample) median, $q_{50}$, of a variable $X$ is the value that is in the middle of the ordered sample.

    If $n$ is odd, $q_{50} = X_{((n+1)/2)}$.

    For example, if $n = 3$, $q_{50} = X_{(2)}$, the middle of the ordered positions (1) (2) (3).

    If $n$ is even, $q_{50} = [X_{(n/2)} + X_{(n/2+1)}]/2$.

    For example, if $n = 4$, $q_{50} = [X_{(2)} + X_{(3)}]/2$, the average of the two middle positions in (1) (2) (3) (4).

    In the country example, $n = 155$, so $q_{50} = X_{(78)}$. For the level of democracy, the median is 6.

    From a table of frequencies, the median is the value for which the cumulative percentage first reaches 50% (or, if a cumulative % is exactly 50%, the average of the corresponding value of $X$ and the next-higher value).
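    The odd and even cases of the definition translate directly into code. The following sketch uses an assumed function name, `sample_median`, and rebuilds the 155 democracy values from their frequencies:

```python
def sample_median(data):
    """Median of a sample, following the odd/even definition above."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: X_((n+1)/2), which is zero-based index n // 2.
        return ordered[mid]
    # n even: average of X_(n/2) and X_(n/2 + 1).
    return (ordered[mid - 1] + ordered[mid]) / 2

# Rebuild the democracy sample: value j repeated f_j times.
frequencies = [35, 12, 4, 6, 5, 5, 12, 13, 16, 15, 32]
democracy = [value for value, f in enumerate(frequencies) for _ in range(f)]

print(len(democracy), sample_median(democracy))  # 155 6
print(sample_median([1, 2, 3, 4]))               # 2.5
```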

    Example: ordered values of level of democracy

           (.0)  (.1)  (.2)  (.3)  (.4)  (.5)  (.6)  (.7)  (.8)  (.9)
    (0.)         0     0     0     0     0     0     0     0     0
    (1.)   0     0     0     0     0     0     0     0     0     0
    (2.)   0     0     0     0     0     0     0     0     0     0
    (3.)   0     0     0     0     0     0     1     1     1     1
    (4.)   1     1     1     1     1     1     1     1     2     2
    (5.)   2     2     3     3     3     3     3     3     4     4
    (6.)   4     4     4     5     5     5     5     5     6     6
    (7.)   6     6     6     6     6     6     6     6     6     6
    (8.)   7     7     7     7     7     7     7     7     7     7
    (9.)   7     7     7     8     8     8     8     8     8     8
    (10.)  8     8     8     8     8     8     8     8     8     9
    (11.)  9     9     9     9     9     9     9     9     9     9
    (12.)  9     9     9     9     10    10    10    10    10    10
    (13.)  10    10    10    10    10    10    10    10    10    10
    (14.)  10    10    10    10    10    10    10    10    10    10
    (15.)  10    10    10    10    10    10

    The row and column labels together give the position in the ordered sample; for example, row (7.) and column (.8) give $X_{(78)} = 6$, the median.

    Median from the frequency table of level of democracy

    Value of level
    of democracy         Frequency               Cumulative
    ($X_j$)              ($f_j$)         %           %
     0                      35          22.6        22.6
     1                      12           7.7        30.3
     2                       4           2.6        32.9
     3                       6           3.9        36.8
     4                       5           3.2        40.0
     5                       5           3.2        43.2
     6                      12           7.7        50.9    <- cumulative % first reaches 50%
     7                      13           8.4        59.3
     8                      16          10.3        69.6
     9                      15           9.7        79.3
    10                      32          20.6       100
    Total                  155         100

    The mean is sensitive to outliers

    For the following sample, the mean and median are both 4:

    1 2 4 5 8.

    If we add one observation to get the sample:

    1 2 4 5 8 100

    then the median is now 4.5 and the mean is now 20.

    In general the mean is affected much more than the median by outliers, i.e. unusually large or small observations.
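    The effect of the single outlier can be reproduced with Python's standard `statistics` module (a minimal sketch; variable names are assumptions):

```python
from statistics import mean, median

sample = [1, 2, 4, 5, 8]
with_outlier = sample + [100]  # add one unusually large observation

print(mean(sample), median(sample))              # 4 4
print(mean(with_outlier), median(with_outlier))  # 20 4.5
```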

    Skewness, means and medians

    The mean, more than the median, is pulled toward the longer tail of the sample distribution.

    For a positively skewed distribution, the mean is larger than the median.

    For a negatively skewed distribution, the mean is smaller than the median.

    For an exactly symmetric distribution, the mean and median are equal.

    When summarising variables with skewed distributions, it is useful to report both the mean and the median.

    Mean and median: examples

                                   Median    Mean
    Level of democracy (p. 46)       6        5.3
    GDP per capita (p. 50)           4.7      8.6
    Blood pressures (p. 53)         73.5     74.2
    Examination marks (p. 54)       60.5     59.7

    Other measures of central tendency: the mode

    The (sample) mode of a variable is the value which has the highest frequency (i.e. appears most often) in the data.

    For example, in the country example the mode of region is 1 (Africa) and the mode of the level of democracy is 0.

    The mode is not very useful for continuous variables which have many different values, such as GDP per capita in the country example.

    A variable can have several modes (i.e. be multimodal). For example, GDP per capita in the example has modes 0.8 and 1.9, both with 5 countries out of the total sample of 155 countries.

    The mode is the only measure of central tendency which can be used even when the values of a variable have no ordering, such as for the region variable in the example.

    Geometric and harmonic means

    The geometric mean $G$ is defined as:

    \[ G = \left( \prod_{i=1}^{n} X_i \right)^{1/n} \]

    and the harmonic mean $H$ as:

    \[ H = \left( \frac{\sum_{i=1}^{n} X_i^{-1}}{n} \right)^{-1} = \frac{n}{\sum_{i=1}^{n} (1/X_i)}. \]

    Neither is used very often. Both are examples of the general formula:

    \[ g^{-1} \left( \frac{\sum_i g(X_i)}{n} \right) \]

    where $g$ is an invertible function and $g^{-1}$ its inverse function. We obtain $\bar{X}$ with $g(x) = x$, $G$ with $g(x) = \log(x)$ and $H$ with $g(x) = 1/x$.
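    The general formula can be sketched as a single function that takes $g$ and its inverse as arguments. The function name `generalised_mean` is an assumption for illustration, the data are the numbers 1, 4 and 7 from the sample mean example, and strictly positive observations are assumed so that the logarithm and the reciprocal are defined.

```python
import math

def generalised_mean(data, g, g_inverse):
    """Apply g to each observation, average, then transform back with g_inverse."""
    return g_inverse(sum(g(x) for x in data) / len(data))

data = [1, 4, 7]  # positive observations assumed

arithmetic = generalised_mean(data, lambda x: x, lambda y: y)        # g(x) = x
geometric = generalised_mean(data, math.log, math.exp)               # g(x) = log(x)
harmonic = generalised_mean(data, lambda x: 1 / x, lambda y: 1 / y)  # g(x) = 1/x

print(arithmetic, round(geometric, 3), round(harmonic, 3))
```

    For positive data the three results are ordered as harmonic, geometric, arithmetic, from smallest to largest.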

    2.4 Measures of dispersion (variation)

    Central tendency is not the whole story. The following two sample distributions have the same mean:

    [Figure: two sample distributions with the same mean, one of them drawn in red.]

    ...but they are clearly not the same. In one (red) the values have more dispersion (variation) than in the other.

    A small example again

                                 Deviations from $\bar{X}$
    i      $X_i$    $X_i^2$    $X_i - \bar{X}$    $(X_i - \bar{X})^2$
    1        1         1            -3                   9
    2        2         4            -2                   4
    3        3         9            -1                   1
    4        5        25            +1                   1
    5        9        81            +5                  25
    Sum     20       120             0                  40

    $\bar{X} = 4$, $\sum X_i^2 = 120$ and $\sum (X_i - \bar{X})^2 = 40$.

    The first measures of dispersion, the sample variance and its square root, the sample standard deviation, are based on $(X_i - \bar{X})^2$, the squared deviations from the mean.

    Sample variance

    The sample variance of a variable $X$, denoted $S^2$ (or $S_X^2$), is defined as:

    \[ S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}. \]

    Sample standard deviation

    The sample standard deviation (s.d. for short) of $X$, denoted $S$ (or $S_X$), is the square root of the sample variance, i.e. we have:

    \[ S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}. \]

    This is the most commonly used measure of dispersion. The standard deviation is more understandable than the variance because it is expressed in the same units as $X$ (rather than $X^2$).

    A rule of thumb for interpretation is that, for a symmetric distribution, often:

    about 2/3 of the observations are between $\bar{X} - S$ and $\bar{X} + S$

    about 95% of the observations are between $\bar{X} - 2S$ and $\bar{X} + 2S$.

    Remember that standard deviations (and variances) are never negative and they are zero only if all the observations $X_i$ are the same.

    An alternative formula for the variance

    The sum of squares in $S^2$ can also be expressed as:

    \[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n \bar{X}^2. \]

    Proof:

    \[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i^2 - 2 X_i \bar{X} + \bar{X}^2) \]

    \[ = \sum_{i=1}^{n} X_i^2 - 2 \bar{X} \underbrace{\sum_{i=1}^{n} X_i}_{= n \bar{X}} + \underbrace{\sum_{i=1}^{n} \bar{X}^2}_{= n \bar{X}^2} \]

    \[ = \sum_{i=1}^{n} X_i^2 - n \bar{X}^2. \]


    An alternative formula for the variance

    The sample variance can therefore also be calculated as:

    \[ S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n \bar{X}^2}{n - 1} \]

    (and the standard deviation $S = \sqrt{S^2}$ again).

    This formula is most convenient for calculations by hand.

    If using a frequency table, we can also calculate:

    \[ S^2 = \frac{\sum_{j=1}^{K} f_j X_j^2 - n \bar{X}^2}{n - 1} \]

    (see p. 66 for the analogous formula for the mean).

    Sample variance: example of calculations

                                 Deviations from $\bar{X}$
    i      $X_i$    $X_i^2$    $X_i - \bar{X}$    $(X_i - \bar{X})^2$
    1        1         1            -3                   9
    2        2         4            -2                   4
    3        3         9            -1                   1
    4        5        25            +1                   1
    5        9        81            +5                  25
    Sum     20       120             0                  40

    $\bar{X} = 4$, $\sum X_i^2 = 120$ and $\sum (X_i - \bar{X})^2 = 40$.

    We have:

    \[ S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{40}{4} = 10 = \frac{\sum X_i^2 - n \bar{X}^2}{n - 1} = \frac{120 - 5 \times 4^2}{4} \]

    and $S = \sqrt{S^2} = \sqrt{10} = 3.16$.
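    The same calculation can be sketched in Python, computing the sum of squares both ways and cross-checking against `statistics.variance`, which also divides by n - 1 (variable names are assumptions):

```python
import math
import statistics

x = [1, 2, 3, 5, 9]
n = len(x)
mean = sum(x) / n  # 4.0

# Sum of squares from the definition and from the alternative formula.
ss_definition = sum((xi - mean) ** 2 for xi in x)      # sum (X_i - mean)^2
ss_shortcut = sum(xi ** 2 for xi in x) - n * mean ** 2  # sum X_i^2 - n * mean^2

variance = ss_definition / (n - 1)
sd = math.sqrt(variance)

print(ss_definition, ss_shortcut, variance, round(sd, 2))  # 40.0 40.0 10.0 3.16
print(statistics.variance(x))                              # sample variance, n - 1 denominator
```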

    Sample quantiles

    The median, $q_{50}$, is basically the value which divides the sample into the smallest 50% of observations and the largest 50% of observations.

    If we consider other percentage splits, we get other (sample) quantiles (percentiles) $q_c$, for example:

    the first quartile, $q_{25}$, is the value which divides the sample into the smallest 25% of observations and the largest 75% of observations

    the third quartile, $q_{75}$, for the 75%-25% split

    the extremes in this spirit are the minimum $X_{(1)}$ (the 0% quantile, so to speak) and maximum $X_{(n)}$ (the 100% quantile).

    These are no longer in the middle of the sample, but they are more general measures of location of the sample distribution.

    Calculation of sample quantiles

    This is how computer software calculates general sample quantiles (or how you can do so by hand, if you ever needed to).

    Suppose we need to calculate the $c$th sample quantile, $q_c$, where $0 < c$ [the rest of this slide is missing from the transcript]

    Quantile-based measures of dispersion

    Two measures based on quantile-type statistics are:

    Range: $X_{(n)} - X_{(1)}$ = maximum - minimum.

    Interquartile range (IQR): $q_{75} - q_{25}$ = third quartile - first quartile.

    The range is clearly extremely sensitive to outliers, since it depends on nothing but the extremes of the distribution.

    The IQR focuses on the middle 50% of the distribution, so it is completely insensitive to outliers.

    Boxplots

    A boxplot (box-and-whiskers plot) summarises some key features of a sample distribution using quantiles.

    The plot shows:

    the line inside the box (the median)

    the box: first to third quartiles ($q_{25}$ to $q_{75}$), i.e. the middle 50% of the observations

    the whiskers: either to the minimum and maximum, or up to a length of 1.5 times the width of the box, whichever is nearer (the rest of the data, except for outliers)

    shown as individual points: observations beyond the ends of the whiskers (regarded as outliers).

    A much longer whisker (and/or outliers) in one direction relative to the other indicates a skewed distribution.

    Boxplot of GDP per capita for 155 countries

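    [Figure: boxplot of GDP per capita (vertical axis, 0 to 40), annotated as follows.]

    Minimum = 0.5

    1st Quartile = 1.7

    Median = 4.7

    3rd Quartile = 11.4

    (IQR = 11.4 - 1.7 = 9.7)

    23.7 = Largest observation at most 1.5 x IQR = 14.6 above 3rd Quartile

    Maximum = 37.8

    Outliers: shown as individual points beyond the upper whisker.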
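    The whisker rule can be illustrated with the quartiles annotated on the figure. The function name `whisker_limits` is an assumption for this sketch; the individual GDP values are not reproduced in this transcript, so only the annotated summary figures are used.

```python
def whisker_limits(q1, q3, k=1.5):
    """Furthest points the whiskers may reach: k box-widths beyond each quartile."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Summary values annotated on the boxplot of GDP per capita.
q1, q3 = 1.7, 11.4
minimum, maximum = 0.5, 37.8

lower_limit, upper_limit = whisker_limits(q1, q3)
print(round(q3 - q1, 1), round(lower_limit, 2), round(upper_limit, 2))

# The minimum lies inside the lower limit, so the lower whisker stops at the minimum.
lower_whisker_at_minimum = minimum >= lower_limit

# The maximum lies beyond the upper limit, so it is plotted as an outlier and the
# upper whisker stops at the largest observation within the limit (23.7 in the figure).
maximum_is_outlier = maximum > upper_limit
```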

    Summary statistics: examples

                                   Median    Mean    s.d.    IQR     Range
    Level of democracy (p. 46)       6        5.3     3.9     8       10
    GDP per capita (p. 50)           4.7      8.6     9.5     9.7     37.3
    Blood pressures (p. 53)         73.5     74.2    11.3    14.5     88
    Examination marks (p. 54)       60.5     59.7    17.5    21.3     94

    Sample moments

    Note: This page is skipped now, but is not marked with [symbol missing]. This is because sample moments will be used again, early in Part II of the course.

    Let us define, for a variable $X$ and for each $r = 1, 2, \ldots$:

    the $r$th sample moment about zero: $m'_r = \sum_{i=1}^{n} X_i^r / n$

    the $r$th central sample moment: $m_r = \sum_{i=1}^{n} (X_i - \bar{X})^r / n$.

    In other words, these are sample averages of the powers $X_i^r$ and $(X_i - \bar{X})^r$.

    Clearly, $\bar{X} = m'_1$ and $S^2 = [n/(n-1)] \, m_2 = [n/(n-1)] [m'_2 - (m'_1)^2]$.

    Moments of powers 3 and 4 are used in two more summary statistics that are described below (as [symbol missing] material). These are used much less often than measures of central tendency and dispersion.

    Sample skewness

    A measure of the skewness of the distribution of a variable $X$ is:

    \[ g_1 = \frac{m_3}{m_2^{3/2}} = \frac{\sum_i (X_i - \bar{X})^3 / n}{\left[ \sum_i (X_i - \bar{X})^2 / n \right]^{3/2}}. \]

    For this measure, $g_1 = 0$ for a symmetric distribution, $g_1 < 0$ for a negatively skewed distribution and $g_1 > 0$ for a positively skewed distribution.

    For example, $g_1 = 0.006$ for the (fairly symmetric) blood pressure distribution shown on p. 53, and $g_1 = 1.24$ for the (positively skewed) GDP per capita distribution shown on p. 51.

    Sample kurtosis

    Kurtosis refers to yet another characteristic of a sample distribution. This has to do with the relative sizes of the peak and tails of the distribution (think about shapes of histograms).

    A distribution with high kurtosis (leptokurtic) has a sharp peak and a high proportion of observations in the tails far from the peak.

    A distribution with low kurtosis (platykurtic) is flat, with no pronounced peak, with most of the observations spread evenly around the middle and weak tails.

    A sample measure of kurtosis is:

    \[ g_2 = \frac{m_4}{m_2^2} - 3 = \frac{\sum_i (X_i - \bar{X})^4 / n}{\left[ \sum_i (X_i - \bar{X})^2 / n \right]^2} - 3. \]

    This is $g_2 > 0$ for leptokurtic and $g_2 < 0$ for platykurtic distributions, and $g_2 = 0$ for the normal distribution (introduced later). Some software packages define a measure of kurtosis without the $-3$; the version with the $-3$, as here, is known as excess kurtosis.
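    Both measures are ratios of central sample moments, so they can be sketched with one helper. The function names are assumptions, and the data are the small dataset 1, 2, 3, 5, 9 used earlier (mean 4, median 3), chosen only for illustration.

```python
def central_moment(data, r):
    """r-th central sample moment: average of (X_i - mean)^r, with divisor n."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** r for x in data) / n

def sample_skewness(data):
    """g1 = m3 / m2^(3/2)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def sample_kurtosis(data):
    """g2 = m4 / m2^2 - 3 (the version with the -3)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

x = [1, 2, 3, 5, 9]
g1 = sample_skewness(x)
g2 = sample_kurtosis(x)
print(round(g1, 3), round(g2, 4))
```

    Here $g_1$ is positive, in line with the mean (4) being larger than the median (3) for this sample.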

    2.5 Associations between two variables

    So far we have tried to summarise (some aspect of) the sample distribution of one variable at a time.

    But we can also look at two (or more) variables together. The key question is then whether some values of one variable tend to occur frequently together with particular values of another, for example high values with high values. This would be an example of an association between the variables. Such associations are central to most interesting research questions, so you will hear much more about them in the future.

    Some common methods of descriptive statistics for two-variable associations are introduced here, but only very briefly and mainly through examples.


    Different types of two-variable plots and tables

    The best way to summarise two variables together depends on whether the variables have few or many possible values.

    We illustrate one method for each combination:

    Many vs. many: scatterplots (including line plots).

    Few vs. many: side-by-side boxplots.

    Few vs. few: two-way cross-tabulations.


    Scatterplots

    A scatterplot shows the values of two continuous variables against each other, plotted as points in a two-dimensional coordinate system.

    Example: A plot of data for 164 countries, with:

    on the horizontal axis (x-axis): a World Bank measure of control of corruption, where high values indicate low levels of corruption

    on the vertical axis (y-axis): GDP per capita.

    Interpretation: It appears that virtually all countries with high levels of corruption have relatively low GDP per capita. At lower levels of corruption there is a positive association, where countries with very low levels of corruption also tend to have high GDP per capita.


    An example of a scatterplot


    Line plots (time series plots)

    A common special case of a scatterplot is a line plot (time series plot), where the variable on the x-axis is time. The points are connected in time order by lines, to show how the variable on the y-axis changes over time.

    Example: Time series of an index of prices of consumer goods and services in the UK, 1800–2009 (Office for National Statistics; scaled so that the price level in 1974 = 100). This shows the price inflation over that period.
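    Scaling an index so that a chosen base year equals 100 is simple arithmetic: each value is divided by the base-year value and multiplied by 100. A minimal sketch follows, using hypothetical price levels (not the ONS figures) and an illustrative function name.

```python
def rebase_index(series, base_year):
    """Rescale a {year: value} series so that the base year equals 100."""
    base_value = series[base_year]
    return {year: 100.0 * value / base_value for year, value in series.items()}


# Hypothetical price levels, for illustration only.
prices = {1972: 40.0, 1973: 44.0, 1974: 50.0, 1975: 62.0}
index = rebase_index(prices, 1974)

# List the points in time order, as they would be joined up in a line plot.
for year in sorted(index):
    print(year, index[year])
```

    Here the 1975 value of 124 means that prices were 24% higher than in the base year.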


    Example of a time series plot: inflation


    Side-by-side boxplots for comparisons

    Boxplots are useful for comparisons of how the distribution of a continuous variable varies across different groups, i.e. across different levels of a discrete variable.

    Example: Boxplots of GDP per capita in different regions.

    GDP per capita in African countries tends to be very low. There is a handful of countries with somewhat higher GDPs per capita (designated as outliers in the plot).

    The median for Asia is not much higher than for Africa. However, the distribution in Asia is heavily skewed to the right, with a tail of countries with very high GDPs per capita.

    The median in Europe is high, and the distribution is fairly symmetric.

    The boxplots for North America and Oceania are not very useful, because they are based on very few countries (2 and 3, respectively).
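    The numbers behind side-by-side boxplots are group-wise five-number summaries. The sketch below computes them for two hypothetical groups (not the GDP data in the plot). Quartile conventions differ between textbooks and packages; the `"inclusive"` method of Python's `statistics.quantiles` is used here purely as an assumption.

```python
import statistics


def five_number_summary(xs):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return min(xs), q1, q2, q3, max(xs)


# Hypothetical values by group, for illustration only.
values_by_group = {
    "Group A": [1.0, 2.0, 3.0, 4.0, 5.0],     # roughly symmetric
    "Group B": [1.0, 2.0, 3.0, 10.0, 30.0],   # long right tail
}

summaries = {group: five_number_summary(xs)
             for group, xs in values_by_group.items()}

for group, summary in summaries.items():
    print(group, summary)
```

    Both groups have the same median, but in Group B the upper quartile and maximum lie much further from the median than the lower quartile and minimum, which is how right skewness shows up in a boxplot.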


    Example of side-by-side boxplots

    [Figure: Boxplot of GDP per capita by region. Horizontal axis: Region (Africa, Asia, Europe, Latin Am., North Am., Oceania). Vertical axis: GDP per capita (scale 0 to 40).]


    Two-way contingency tables

    A (two-way) contingency table (or cross-tabulation) shows the frequencies in the sample of each possible combination of the values of two discrete variables.

    Often it also shows percentages within each row or column of the table.

    Example: From a survey of 972 private investors¹:

    row variable: age as a discrete, grouped variable (four categories)

    column variable: how much importance the person places on short-term gains from his/her investments (four levels).

    Interpretation: Look at the row percentages. For example, 17.8% of those aged under 45, but only 5.2% of those 65 and over, think that short-term gains are very important. Among these respondents, the older group seems to be less concerned with quick profits than the younger group.

    ¹ Lewellen et al. (1977) Patterns of investment strategy and behavior among individual investors. The Journal of Business.


    Example of a two-way contingency table

                                Importance of short-term gains
                                Slightly                Very
    Age group      Irrelevant   important   Important   important   Total
    Under 45               37          45          38          26     146
                       (25.3)      (30.8)      (26.0)      (17.8)   (100)
    45–54                 111          77          57          37     282
                       (39.4)      (27.3)      (20.2)      (13.1)   (100)
    55–64                 153          49          31          20     253
                       (60.5)      (19.4)      (12.3)       (7.9)   (100)
    65 and over           193          64          19          15     291
                       (66.3)      (22.0)       (6.5)       (5.2)   (100)
    Total                 494         235         145          98     972
                       (50.8)      (24.2)      (14.9)      (10.1)   (100)

    (Numbers in parentheses are percentages within the rows. For example, 25.3 = (37/146) × 100.)
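    The row percentages can be reproduced directly from the counts in the table. The following is a minimal sketch; the counts are those reported in the table, while the variable and function names are illustrative.

```python
# Counts from the table; columns are, in order:
# Irrelevant, Slightly important, Important, Very important.
counts = {
    "Under 45":    [37, 45, 38, 26],
    "45-54":       [111, 77, 57, 37],
    "55-64":       [153, 49, 31, 20],
    "65 and over": [193, 64, 19, 15],
}


def row_percentages(row):
    """Each count as a percentage of its row total, to one decimal place."""
    total = sum(row)
    return [round(100 * count / total, 1) for count in row]


row_pct = {age: row_percentages(row) for age, row in counts.items()}

# Summing each column over the age groups gives the "Total" row.
column_totals = [sum(column) for column in zip(*counts.values())]
total_pct = row_percentages(column_totals)

for age, pct in row_pct.items():
    print(age, pct)
print("Total", column_totals, total_pct)
```

    Comparing the last entry of each row (17.8, 13.1, 7.9 and 5.2) shows the share rating short-term gains as very important falling steadily with age.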