
MEASURES OF LOCATION AND SCALE IN R

P. L. Davies

Eindhoven, August 2007

Reading List

• Tukey, J. W. (1977) Exploratory Data Analysis, Addison-Wesley.
• Mosteller, F. and Tukey, J. W. (1977) Data Analysis and Regression, Wiley.
• Huber, P. J. (1981) Robust Statistics, Wiley.
• Hoaglin, D. C., Mosteller, F. and Tukey, J. W. (1983) Understanding Robust and Exploratory Data Analysis, Wiley.

Contents

1 Location and Scale in R
  1.1 Measures of location
    1.1.1 The mean
    1.1.2 The median
    1.1.3 The shorth
    1.1.4 What to use?
  1.2 Measures of scale or dispersion
    1.2.1 The standard deviation
    1.2.2 The MAD
    1.2.3 The length of the shorth
    1.2.4 The interquartile range IQR
    1.2.5 What to use?
  1.3 Boxplots
  1.4 Empirical distributions
    1.4.1 Definition
    1.4.2 Empirical distribution function
  1.5 Equivariance considerations
    1.5.1 Transforming data
    1.5.2 Transforming empirical distributions
    1.5.3 The group of affine transformations
    1.5.4 Measures of location
    1.5.5 Measures of scale
  1.6 Outliers and breakdown points
    1.6.1 Outliers
    1.6.2 Breakdown points and equivariance
    1.6.3 Identifying outliers
  1.7 M-measures of location and scale
    1.7.1 M-measures of location
    1.7.2 M-measures of scale
    1.7.3 Simultaneous M-measures of location and scale

1 Location and Scale in R

1.1 Measures of location

1.1.1 The mean

We start with measures of location for simple data sets x_n = (x_1, ..., x_n) consisting of n real numbers. By far the most common measure of location is the arithmetic mean

\[ \mathrm{mean}(x_n) = \bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i. \tag{1.1} \]

Example 1.1. 27 measurements of the amount of copper (milligrams per litre) in a sample of drinking water.

2.16 2.21 2.15 2.05 2.06 2.04 1.90 2.03 2.06
2.02 2.06 1.92 2.08 2.05 1.88 1.99 2.01 1.86
1.70 1.88 1.99 1.93 2.20 2.02 1.92 2.13 2.13

The mean is given by

\[ \bar{x}_{27} = \frac{1}{27}(2.16 + \dots + 2.13) = \frac{54.43}{27} = 2.015926. \]

In R the command is

> mean(copper)
[1] 2.015926

Example 1.2. Length of a therapy in days of 86 control patients after a suicide attempt.

  1   1   1   5   7   8   8  13  14  14  17  18  21
 21  22  25  27  27  30  30  31  31  32  34  35  36
 37  38  39  39  40  49  49  54  56  56  62  63  65
 65  67  75  76  79  82  83  84  84  84  90  91  92
 93  93 103 103 111 112 119 122 123 126 129 134 144
147 153 163 167 175 228 231 235 242 256 256 257 311
314 322 369 415 573 609 640 737

The mean is given by

\[ \bar{x}_{86} = \frac{1}{86}(1 + \dots + 737) = \frac{10520}{86} = 122.3256. \]

Example 1.3. Measurements taken by Charles Darwin on the differences in growth of cross- and self-fertilized plants. Heights of individual plants (in inches):

Pot I    Cross  23.5  12.0  21.0
         Self   17.4  20.4  20.0
Pot II   Cross  22.0  19.2  21.5
         Self   20.0  18.4  18.6
Pot III  Cross  22.2  20.4  18.3  21.6  23.2
         Self   18.6  15.2  16.5  18.0  16.2
Pot IV   Cross  21.0  22.1  23.0  12.0
         Self   18.0  12.8  15.5  18.0

The differences in heights are 23.5 − 17.4, ..., 12.0 − 18.0:

6.1, −8.4, 1.0, 2.0, 0.8, 2.9, 3.6, 1.8, 5.2, 3.6, 7.0, 3.0, 9.3, 7.5, −6.0

The mean of the differences is 2.62667.

1.1.2 The median

The second most common measure of location is the median. To define the median we order the observations (x_1, ..., x_n) according to their size,

\[ x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}, \tag{1.2} \]

and define the median to be the central observation of the ordered sample. If the size of the sample n is an odd number n = 2m + 1, then this is unambiguous as the central observation is x_{(m+1)}. If the number of observations is even, n = 2m, then we take the mean of the two innermost observations of the ordered sample. This gives

\[ \mathrm{med}(x_n) = \begin{cases} x_{(m+1)} & \text{if } n = 2m + 1,\\ (x_{(m)} + x_{(m+1)})/2 & \text{if } n = 2m. \end{cases} \tag{1.3} \]

    Example 1.4. We take the first 5 measurements of Example 1.1. These are

    2.16, 2.21, 2.15, 2.05, 2.06

    so the ordered sample is

    2.05, 2.06, 2.15, 2.16, 2.21.


The number of observations is odd, 5 = n = 2m + 1 with m = 2. The median is therefore

med(2.16, 2.21, 2.15, 2.05, 2.06) = x_{(m+1)} = x_{(3)} = 2.15.

We take the first 6 measurements of Example 1.1. These are

    2.16, 2.21, 2.15, 2.05, 2.06, 2.04

    so the ordered sample is

    2.04, 2.05, 2.06, 2.15, 2.16, 2.21.

The number of observations is even, 6 = n = 2m with m = 3. The median is therefore

med(2.16, 2.21, 2.15, 2.05, 2.06, 2.04) = (x_{(m)} + x_{(m+1)})/2
                                        = (x_{(3)} + x_{(4)})/2 = (2.06 + 2.15)/2
                                        = 4.21/2 = 2.105

In R the command is

> median(copper)
[1] 2.03

1.1.3 The shorth

A less common measure of location is the so-called shorth (shortest half) introduced by Tukey. We require some notation.

(a) ⌊x⌋, spoken "floor x", is the largest integer n such that n ≤ x. For example ⌊2.001⌋ = 2, ⌊5⌋ = 5, ⌊−2.3⌋ = −3.

(b) ⌈x⌉, spoken "ceiling x", is the smallest integer n such that x ≤ n.

(c) For any number x, ⌊x/2⌋ + 1 is the smallest integer n which is strictly greater than x/2.

(d) The R command is floor:
> floor(2.39)
[1] 2


For a sample (x_1, ..., x_n) of size n we consider all intervals I = [a, b] that contain at least ⌊n/2⌋ + 1 of the observations, that is, at least half of the observations. Let I_0 denote the interval which has the smallest length amongst all intervals containing at least ⌊n/2⌋ + 1 observations. The shorth is defined to be the mean of the observations in this shortest interval. There is a degree of indeterminacy in this definition. Firstly, there may be more than one such interval. Secondly, there may be only one such interval but the number of observations in it may not be clear; this is the case if at least one of the endpoints is a repeated observation. We cannot remove the first indeterminacy, but the fact should be reported. The second can be removed by including all observations which lie in the interval according to multiplicity.

Example 1.5. We consider the data of Example 1.1. The first step is to order the data.

1.70 1.86 1.88 1.88 1.90 1.92 1.92 1.93 1.99
1.99 2.01 2.02 2.02 2.03 2.04 2.05 2.05 2.06
2.06 2.06 2.08 2.13 2.13 2.15 2.16 2.20 2.21

The sample size is n = 27, so ⌊n/2⌋ + 1 = 13 + 1 = 14. We look for the shortest interval which contains at least 14 observations. The intervals [1.70, 2.03], [1.86, 2.04], [1.88, 2.05], ..., [2.03, 2.21] all contain at least 14 observations. Their lengths are 0.33, 0.18, 0.17, 0.17, ..., 0.18. The shortest interval, [1.92, 2.06], has length 0.14 and contains the 15 observations

1.92 1.92 1.93 1.99 1.99 2.01 2.02 2.02
2.03 2.04 2.05 2.05 2.06 2.06 2.06

The mean of these observations is 2.048.
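The search in Example 1.5 is easily programmed. The following is a minimal R sketch (the function name is mine; it reports only the first shortest interval, so the first indeterminacy mentioned above is silently ignored, while endpoints are counted according to multiplicity):

shorth <- function(x) {
  x <- sort(x)
  n <- length(x)
  h <- floor(n / 2) + 1                      # at least half the observations
  len <- x[h:n] - x[1:(n - h + 1)]           # lengths of the candidate intervals
  i <- which.min(len)                        # first shortest interval
  inside <- x >= x[i] & x <= x[i + h - 1]    # endpoints counted by multiplicity
  list(shorth = mean(x[inside]),             # the shorth itself
       interval = c(x[i], x[i + h - 1]),
       lshorth = len[i])                     # length of the shortest interval (Section 1.2.3)
}

For the copper data, shorth(copper) returns the interval [1.92, 2.06], the value 2.048 of Example 1.5 and the length 0.14 used later in Example 1.10.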

    Exercise 1.6. Write an R–programme to calculate the shortest half.

    1.1.4 What to use?

We now have three measures of location and there are more to come. A natural question is which one to use. The natural answer is that it depends on the data and the subject matter. What we can do is analyse the behaviour of the different measures, both analytically and practically, when applied to different data sets. As an example we point out that the mean and median are always uniquely defined but the shorth is not. This is a disadvantage of the shorth, but it may have advantages which, for particular data sets, outweigh this. We shall examine the three measures in greater depth below; in the meantime the student can try them out on different data sets and form their own judgment.

1.2 Measures of scale or dispersion

1.2.1 The standard deviation

The most common measure of dispersion is the standard deviation. Unfortunately there are two competing definitions with no universal agreement. They are

\[ \mathrm{sd}(x_n) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x}_n)^2} \tag{1.4} \]

and

\[ \mathrm{sd}(x_n) = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}. \tag{1.5} \]

WARNING: In these notes I shall use definition (1.4). In R the definition (1.5) is used.

Example 1.7. We consider the data of Example 1.1. The standard deviation according to (1.4) is 0.1140260. If you use R you get

> sd(copper)
[1] 0.1161981
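Since R's sd implements (1.5), the value 0.1140260 of Example 1.7 has to be computed by hand; a one-line helper (the name sdn is mine) suffices:

# standard deviation in the sense of (1.4), i.e. with divisor n
sdn <- function(x) sqrt(mean((x - mean(x))^2))

Then sdn(copper) gives 0.1140260; equivalently one can rescale, sd(x) * sqrt((n − 1)/n).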

1.2.2 The MAD

The second most common measure of dispersion, but one which is nevertheless not widely known, is the MAD (Median Absolute Deviation), defined as follows. We form the absolute deviations from the median,

|x_1 − med(x_n)|, ..., |x_n − med(x_n)|.

The MAD is now simply the median of these absolute deviations:

\[ \mathrm{MAD}(x_n) = \mathrm{med}(|x_1 - \mathrm{med}(x_n)|, \dots, |x_n - \mathrm{med}(x_n)|). \tag{1.6} \]

Example 1.8. We consider the data of Example 1.1. The median of the data is 2.03, which gives the following absolute deviations

    6

  • 0.13 0.18 0.12 0.02 0.03 0.01 0.13 0.00 0.030.01 0.03 0.11 0.05 0.02 0.15 0.04 0.02 0.170.33 0.15 0.04 0.10 0.17 0.01 0.11 0.10 0.10

    The median of these values is 0.1 so MAD(xn) = 0.1. If use R you get

    > mad(copper)[1] 0.14826

    which is not the same. If instead you use

    > mad(copper,const=1)[1] 0.1

    which is the same.

    I shall use mad(· ,const=1) in the lecture notes apart from the R command.The factor 1.4826 in the R version of the MAD is chosen so that the MAD of aN (µ, σ2)–random variable is precisely σ. To see this we transfer the definitionof the MAD to a random variable X as the median of |X −median(X)|.Exercise 1.9. Let X be aN (µ, σ2)–random variable. Show that MAD(X) =0.6744898σ so that 1.482602 MAD(X) = σ.

1.2.3 The length of the shorth

The third measure of dispersion is the length of the shortest interval in the definition of the shorth. We denote this by lshorth.

Example 1.10. We consider the data of Example 1.1 and use the results of Example 1.5. The length of the shortest interval is 0.14 and hence lshorth(x_n) = 0.14.

1.2.4 The interquartile range IQR

The fourth measure of dispersion is the so-called interquartile range. For this we require the definitions of the first quartile Q1 and the third quartile Q3 of a data set. Again, unfortunately, there is no agreed definition, and the one I give here differs from them all. An argument in its favour will be given below. We order the data and define Q1 and Q3 as follows:

\[ Q_1 = (x_{(\lceil n/4 \rceil)} + x_{(\lfloor n/4 \rfloor + 1)})/2, \qquad Q_3 = (x_{(\lceil 3n/4 \rceil)} + x_{(\lfloor 3n/4 \rfloor + 1)})/2. \tag{1.7} \]

The interquartile range is then simply IQR = Q3 − Q1.

    7

  • Example 1.11. We consider the data of Example 1.1. We have 27 observa-tions and so

    dn/4e = d27/4e = d6.75e = 7, bn/4c+ 1 = b6.75c+ 1 = 6 + 1 = 7.This gives Q1 = x(7) = 1.82. Similarly Q3 = x(21) = 2.08 so that the in-terquartile range IQR = 2.08− 1.82 = 0.16. If you use R you get

    > IQR(copper)[1] 0.145
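Definition (1.7) can be coded directly (the function name is mine):

# first and third quartiles in the sense of (1.7)
quartiles <- function(x) {
  x <- sort(x)
  n <- length(x)
  Q1 <- (x[ceiling(n / 4)] + x[floor(n / 4) + 1]) / 2
  Q3 <- (x[ceiling(3 * n / 4)] + x[floor(3 * n / 4) + 1]) / 2
  c(Q1 = Q1, Q3 = Q3, IQR = Q3 - Q1)
}

For the copper data this returns Q1 = 1.92, Q3 = 2.08 and IQR = 0.16 as in Example 1.11, whereas R's own IQR(copper) gives 0.145.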

1.2.5 What to use?

We now have four measures of scale or dispersion and there are more to come. The comments made in Section 1.1.4 apply here as well. One particular problem with the MAD is that it is zero if more than 50% of the observations are the same. This can happen for small data sets because of rounding, but it can also occur in counting data where there is, for example, a large atom at zero.

Exercise 1.12. Within a production line the deviations from target of the lengths of newly produced nails are measured. A small sample consists of the following observed values:

1.3 2.4 −0.5 0.2 −1.3 −1.8 0.6 1.3 2.1 −1.6

(i) Estimate the standard deviation by the following statistics:

\[ \hat{\sigma} = \sqrt{\frac{1}{10}\sum_{i=1}^{10} (x_i - \bar{x})^2}, \qquad \tilde{\sigma} = 1.483\,\mathrm{MAD}, \qquad \sigma^* = \frac{x_{(10)} - x_{(1)}}{d_{10}}, \]

where x_{(1)}, ..., x_{(10)} are the ordered (from small to large) values of the sample and 1/d_{10} = 0.325.

(ii) Assume that instead the following values had been observed:

1.3 2.4 −0.5 0.2 −1.3 −1.8 6.0 1.3 2.1 −1.6

Repeat the calculations from (i) and discuss the results.

[Figure 1: The panels from left to right are the boxplots of the copper data (Example 1.1), the suicide data (Example 1.2) and the Darwin data (Example 1.3), respectively.]

1.3 Boxplots

Even for small samples it is difficult to obtain a general impression of a data set, for example its location, scale, outliers and skewness, from the raw numbers. A boxplot is a graphical representation of the data which in most cases can provide such information at a glance. Figure 1 shows from left to right the boxplots of the copper data (Example 1.1), the suicide data (Example 1.2) and the Darwin data (Example 1.3). We now describe the construction of a boxplot, with the proviso that there are different definitions in the literature.

The line through the box is the median. The lower edge of the box is the first quartile Q1 and the upper edge is the third quartile Q3. The "whiskers" extending from the bottom and top edges of the box start at the edge of the box and are extended to that observation which is furthest from the edge but at a distance of at most 1.5 IQR from the edge. Observations outside the whiskers are displayed individually.

Although the boxplot can be used for individual data sets it is most useful for comparing data sets.
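In R the panels of Figure 1 can be reproduced along the following lines (assuming vectors copper, suicide and darwin holding the data of Examples 1.1-1.3; the variable names are mine):

par(mfrow = c(1, 3))   # three panels side by side, as in Figure 1
boxplot(copper)        # copper data, Example 1.1
boxplot(suicide)       # therapy lengths, Example 1.2
boxplot(darwin)        # Darwin differences, Example 1.3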

Example 1.13. The following data are taken from an interlaboratory test and show the measurements returned by 14 laboratories of the amount of mercury (micrograms per litre) in a water sample. Each laboratory was required to give four measurements. The boxplot gives an immediate impression of the outlying laboratories and also of the precision of the measurements they returned. The fact that the measurements have different numbers of significant figures is due to the returns of the individual laboratories.

          1      2      3      4
Lab. 1    2.2    2.215  2.2    2.1
Lab. 2    2.53   2.505  2.505  2.47
Lab. 3    2.3    3.2    2.8    2.0
Lab. 4    2.33   2.35   2.3    2.325
Lab. 5    2.1    2.68   2.46   1.92
Lab. 6    2.692  2.54   2.692  2.462
Lab. 7    2.292  2.049  2.1    2.049
Lab. 8    2.120  2.2    1.79   2.12
Lab. 9    1.967  1.899  2.001  1.956
Lab. 10   3.8    3.751  3.751  3.797
Lab. 11   2.696  2.22   2.01   2.57
Lab. 12   2.191  2.213  2.21   2.228
Lab. 13   2.329  2.252  2.252  2.26
Lab. 14   2.37   2.42   2.33   2.36

[Figure 2: The boxplots of the interlaboratory test data of Example 1.13.]

Exercise 1.14. Interpret the boxplot of Figure 2.

1.4 Empirical distributions

1.4.1 Definition

Given a real number x we denote by δ_x the measure with unit mass at the point x, the so-called Dirac measure. For any subset B of R we have

\[ \delta_x(B) = 1_{\{x \in B\}}. \]

For any function f : R → R we have

\[ \int_{\mathbb{R}} f(u)\, d\delta_x(u) = f(x). \]

Given a sample (x_1, ..., x_n) we define the corresponding empirical measure P_n by

\[ P_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{x_i}. \tag{1.8} \]

The empirical distribution P_n is a probability measure and for any subset B of R we have

\[ P_n(B) = \frac{1}{n} \times (\text{number of points } x_i \text{ in } B). \]

Furthermore, for any function f : R → R we have

\[ \int_{\mathbb{R}} f(u)\, dP_n(u) = \frac{1}{n}\sum_{i=1}^{n} f(x_i). \tag{1.9} \]

1.4.2 Empirical distribution function

Given data x_n = (x_1, ..., x_n), the empirical distribution function F_n is defined by

\[ F_n(x) = P_n((-\infty, x]) = \frac{1}{n}\sum_{i=1}^{n} 1_{\{x_i \le x\}}. \tag{1.10} \]

Figure 3 shows the empirical distribution functions of the data sets of Examples 1.1 and 1.2.
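In R the empirical distribution function is provided by ecdf, so the two panels of Figure 3 can be reproduced as follows:

par(mfrow = c(2, 1))
plot(ecdf(copper))    # upper panel of Figure 3
plot(ecdf(suicide))   # lower panel of Figure 3

Note that ecdf(copper) is itself a function, so F_n(x) can be evaluated directly, for example ecdf(copper)(2.03).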

[Figure 3: The upper panel shows the empirical distribution function of the data of Example 1.1. The lower panel shows the empirical distribution function of the data of Example 1.2.]

Exercise 1.15. Let g : R → R be defined by g(x) := x² 1_{[0,10]}(x), where

\[ 1_{[0,10]}(x) = \begin{cases} 1 & \text{for } x \in [0, 10],\\ 0 & \text{elsewhere.} \end{cases} \]

Calculate ∫ g dP if P is

(i) the uniform distribution on [8, 12],

(ii) the Dirac measure at x = 5,

(iii) the binomial distribution with n = 2 and p = 0.25,

(iv) a probability measure on M = {1, 2, ..., 15} defined by P(B) = |B ∩ M|/|M|, where |·| denotes cardinality.

1.5 Equivariance considerations

1.5.1 Transforming data

First of all we recall some standard notation from measure and integration theory. It has the advantage of unifying definitions and concepts, just as measure and integration theory allows in many situations a unified approach to discrete (say Poisson) and continuous (say Gaussian) probability models.

Definition 1.16. Given a probability distribution P on (the Borel sets of) R and a (measurable) transformation f : R → R we define the transformed distribution P^f by

\[ P^f(B) := P(\{x : f(x) \in B\}) = P(f^{-1}(B)). \tag{1.11} \]

Exercise 1.17. Show that P^f is a probability distribution.

In many situations we have a natural family F of transformations of the sample space (here R) which forms a group. An example is the affine group on R, which corresponds to changes of units and origin and will be investigated in the following section. In the case of directional data in two and three dimensions one is led to the group of rotations or orthogonal transformations. We refer to Davies and Gather (2005) for further examples.

1.5.2 Transforming empirical distributions

If we have a data set x_n = (x_1, ..., x_n) and transform it using a function f : R → R, then we obtain the data set x'_n = (x'_1, ..., x'_n) with x'_i = f(x_i), i = 1, ..., n. If the empirical distribution of x_n is P_n, then the empirical distribution of x'_n is given by

\[ P'_n(B) = \frac{1}{n}\sum_{i=1}^{n} 1_{\{x'_i \in B\}} = \frac{1}{n}\sum_{i=1}^{n} 1_{\{f(x_i) \in B\}} = \frac{1}{n}\sum_{i=1}^{n} 1_{\{x_i \in f^{-1}(B)\}} = P_n(f^{-1}(B)) = P_n^f(B), \]

so that P'_n = P_n^f.

1.5.3 The group of affine transformations

A real data set x_n = (x_1, ..., x_n) often consists of readings made in certain units, such as heights measured in centimetres. If the same heights are measured in inches, x'_n = (x'_1, ..., x'_n), then to a first approximation we have

x_i = 2.54 x'_i, i = 1, ..., n.

The basis from which the heights are measured may also change. Instead of being measured from, for example, sea level they may be measured with respect to the floor of a house. If the units of measurement are the same, the two samples will be related by

x_i = x'_i + b, i = 1, ..., n

for some real number b. In general we can consider a change of units and a change of basis, which leads to transformations A : R → R of the form

A(x) = ax + b with a, b ∈ R, a ≠ 0.

Such a transformation is called an affine transformation, and the set A of affine transformations A : R → R forms a group, the affine group. An affine transformation may be regarded as a change of units and origin.

Exercise 1.18. Show that A is indeed a group.

1.5.4 Measures of location

If we apply the affine transformation A(x) = ax + b to the data set x_n = (x_1, ..., x_n) to give x'_n = (x'_1, ..., x'_n) with x'_i = A(x_i), then this relationship transfers to the means:

\[ \mathrm{mean}(x'_n) = \bar{x}'_n = A(\bar{x}_n) = a\bar{x}_n + b = a\,\mathrm{mean}(x_n) + b. \tag{1.12} \]

This relationship holds also for the median and the shorth. Indeed we may take it to be the defining property of a location measure. In other words, a location measure T_L is one for which

\[ T_L(A(x_n)) = A(T_L(x_n)). \tag{1.13} \]

Such measures are called location equivariant.

1.5.5 Measures of scale

Similar considerations apply to scale measures, although the definition of equivariance is different. Indeed, if we transform the data set x_n = (x_1, ..., x_n) using the affine transformation A(x) = ax + b to give the data set x'_n = (x'_1, ..., x'_n), then the relationship between the standard deviations is

sd(x'_n) = |a| sd(x_n).

This applies also to the MAD, the length of the shorth and the interquartile range IQR. We call a measure T_S scale equivariant if

\[ T_S(A(x_n)) = |a|\, T_S(x_n), \qquad A(x) = ax + b. \tag{1.14} \]

1.6 Outliers and breakdown points

1.6.1 Outliers

Outliers are important for two reasons. Firstly, an outlier is an observation which differs markedly from the majority of the other observations. This may be simply due to a mistake, an incorrectly placed decimal point, or it may be because of particular circumstances which are themselves of interest. Secondly, an undetected outlier can completely invalidate a statistical analysis. In one-dimensional data outliers can be detected by visual inspection. Nevertheless there is an interest in automatic detection. There are national and international standards for analysing the results of interlaboratory tests (Example 1.13). The main problem when devising the statistical part of such a standard is how to deal with outliers, which are very common in such data. Another reason for the interest in automatic methods is the large amount of data available, which makes it impossible to have every data set visually inspected. The effect of outliers can be illustrated as follows.

Example 1.19. We consider the copper data of Example 1.1. The mean, standard deviation, median and MAD of the data are respectively

2.015926, 0.1140260, 2.03, 0.1.

Suppose we replace the last observation 2.13 by 21.3. The corresponding results are

2.725926, 3.644391, 2.03, 0.1.

The standard deviation has been most affected, followed by the mean. The values of the median and MAD have not changed. Even if we also change the value of the second last observation from 2.13 to 21.3, the median and the MAD do not change. The values for this altered sample are

3.435926, 5.053907, 2.03, 0.1.

We look at this phenomenon more closely in the next section.
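The calculations of Example 1.19 can be checked in R (sdn is the divisor-n standard deviation defined after Example 1.7; the index 27 assumes copper is stored in the order listed in Example 1.1):

copper.mod <- copper
copper.mod[27] <- 21.3   # replace the last observation 2.13 by 21.3
c(mean(copper.mod), sdn(copper.mod),
  median(copper.mod), mad(copper.mod, const = 1))
# 2.725926 3.644391 2.030000 0.100000, as in Example 1.19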

1.6.2 Breakdown points and equivariance

We wish to measure the resilience of a statistical measure T with respect to outliers. A simple and useful concept is that of the breakdown point introduced in Hampel (1968). The version we use is the so-called finite sample breakdown point due to Donoho and Huber (1983). To define this we consider a sample x_n = (x_1, ..., x_n) and, to allow for outliers, we consider samples which can be obtained from x_n by replacing at most k of the x_i by other values. We denote a generic replacement sample by x_n^k = (x_1^k, ..., x_n^k) and its empirical distribution by P_n^k. We have

\[ \sum_{i=1}^{n} 1_{\{x_i^k \ne x_i\}} \le k. \]

We define the finite sample breakdown point of T at the sample x_n by

\[ \varepsilon^*(T, x_n) = \min_k \{ k/n : \sup_{x_n^k} |T(x_n) - T(x_n^k)| = \infty \}. \tag{1.15} \]

Example 1.20. The mean has a breakdown point ε*(mean, x_n) = 1/n for any data x_n. We need only move one observation, say x_1, to ∞ and it is clear that the mean of this sample will also tend to infinity.

Example 1.21. The median has a breakdown point ε*(med, x_n) = ⌊(n + 1)/2⌋/n for any data x_n. We consider the case that n = 2m is even. If we move k = m − 1 observations all to ∞ or all to −∞ (the worst case), then the two central values of the ordered replacement sample will come from the original sample x_n. This implies

\[ \sup_{x_n^k} |\mathrm{med}(x_n^k) - \mathrm{med}(x_n)| < \infty, \qquad k \le m - 1. \]

If we now move m points all to ∞, then only one of the two central points will come from the original sample. This implies

\[ \sup_{x_n^k} |\mathrm{med}(x_n^k) - \mathrm{med}(x_n)| = \infty \]

and hence

\[ \varepsilon^*(\mathrm{med}, x_n) = m/2m = 1/2 = \lfloor (n + 1)/2 \rfloor / n. \tag{1.16} \]

Exercise 1.22. Show that (1.16) also holds if the sample size n = 2m + 1 is an odd number.

This is the highest possible finite sample breakdown point for measures of location, as we shall now show.

Theorem 1.23. Let T_L be a measure of location, that is, affinely equivariant as defined by (1.13). Then for any sample x_n we have

\[ \varepsilon^*(T_L, x_n) \le \lfloor (n + 1)/2 \rfloor / n. \]

Proof. We restrict ourselves to the case where n = 2m is even. We consider two replacement samples x'_n and x''_n with k = m defined by

x'_i = x_i, i = 1, ..., m,   x'_{m+i} = x_i + λ, i = 1, ..., m,
x''_i = x_i − λ, i = 1, ..., m,   x''_{m+i} = x_i, i = 1, ..., m.

The samples x'_n and x''_n are related by

x''_i = x'_i − λ, i = 1, ..., n

and hence, as T_L is affinely equivariant, |T_L(x''_n) − T_L(x'_n)| = λ. Both are replacement samples with k = m and, on letting λ → ∞, we see that not both |T_L(x'_n) − T_L(x_n)| and |T_L(x''_n) − T_L(x_n)| can remain bounded. Therefore

\[ \sup_{x_n^m} |T_L(x_n^m) - T_L(x_n)| = \infty \]

and hence

\[ \varepsilon^*(T_L, x_n) \le m/2m = 1/2 = \lfloor (n + 1)/2 \rfloor / n. \qquad \Box \]

Remark 1.24. In Theorem 1.23 we did not make full use of the affine equivariance. All we required was translation equivariance.

Remark 1.25. Theorem 1.23 connects the ideas of breakdown point and equivariance, which at first sight are unrelated. However, if no conditions are placed upon a functional T then the breakdown point can be 1. A trivial way of attaining a breakdown point of 1 is simply to put T(x_n) = 0 for all x_n. An apparently less trivial example is to restrict T to an interval [a, b] for which it is a priori known that the correct value must lie in this interval. It was argued in Davies and Gather (2005) that equivariance considerations are required in order that a non-trivial bound for the breakdown point exists.

We now apply the above ideas to measures of scale. One problem is the definition of breakdown. Clearly breakdown will be said to occur if arbitrarily large values are possible. As measures of scale are non-negative it is not possible to have the values tending to −∞. The question is whether values tending to zero should be regarded as breakdown. This is often the case, and zero values of scale due to rounding of the measurements are one of the problems when evaluating interlaboratory tests. To take this into account we use the metric

\[ d(s_1, s_2) = |\log(s_1) - \log(s_2)| \tag{1.17} \]

on R⁺. The finite sample breakdown point for measures of scale is defined by

\[ \varepsilon^*(T_S, x_n) = \min_k \{ k/n : \sup_{x_n^k} |\log(T_S(x_n^k)) - \log(T_S(x_n))| = \infty \}, \tag{1.18} \]

with the convention that ε*(T_S, x_n) = 0 if T_S(x_n) = 0.

Exercise 1.26. The following two samples are given:

(i) 1.0 1.8 1.3 1.4 1.9 1.1 2.0 1.6 1.7 1.2 0.9

(ii) 1.0 1.8 1.3 1.3 1.9 1.1 1.3 1.6 1.7 1.3 1.3

Calculate the MAD for both samples. Then replace individual observations by different values and repeat the calculation. How many observations must be replaced to get arbitrarily close to the boundary of the parameter space [0, ∞)?

In order to state the theorem for measures of scale corresponding to Theorem 1.23 we require

\[ \Delta(x_n) := \max_x P_n(\{x\}) = \max_x |\{j : x_j = x\}|/n. \tag{1.19} \]

Theorem 1.27. Let T_S be a measure of scale, that is, T_S(A(x_n)) = |a| T_S(x_n) for A(x) = ax + b. Then for any sample x_n we have

\[ \varepsilon^*(T_S, x_n) \le \lfloor (n(1 - \Delta(x_n)) + 1)/2 \rfloor / n. \]

Proof. Let Δ(x_n) = l/n. We restrict ourselves to the case where n − l = 2m is even. At least one value in the sample x_n is present l times and we take this to be x_1, ..., x_l. Let A be an affine transformation with A(x_1) = ax_1 + b = x_1 and a ≠ 1. We consider two replacement samples x'_n and x''_n with k = m defined by

x'_i = x_i, 1 ≤ i ≤ l,   x'_{l+i} = x_{l+i}, 1 ≤ i ≤ m,   x'_{l+m+i} = A^{−1}(x_{l+i}), 1 ≤ i ≤ m,

and

x''_i = A(x_i) = x_i, 1 ≤ i ≤ l,   x''_{l+i} = A(x_{l+i}), 1 ≤ i ≤ m,   x''_{l+m+i} = x_{l+i}, 1 ≤ i ≤ m.

The samples x'_n and x''_n are related by

x''_i = A(x'_i), i = 1, ..., n

and hence, as T_S is affinely equivariant, T_S(x''_n) = |a| T_S(x'_n). Both are replacement samples with k = m and, on letting a → ∞ and a → 0, we see that not both |log(T_S(x'_n)) − log(T_S(x_n))| and |log(T_S(x''_n)) − log(T_S(x_n))| can remain bounded. Therefore

\[ \sup_{x_n^m} |\log(T_S(x_n^m)) - \log(T_S(x_n))| = \infty \]

and hence

\[ \varepsilon^*(T_S, x_n) \le m/n = \lfloor (n(1 - \Delta(x_n)) + 1)/2 \rfloor / n. \qquad \Box \]

Exercise 1.28. Prove Theorem 1.27 for the case n − l = 2m + 1.

Example 1.29. The standard deviation functional sd has a breakdown point ε*(sd, x_n) = 1/n for any data x_n. We need only move one observation, say x_1, to ∞ and it is clear that the standard deviation of this sample will also tend to infinity.

Example 1.30. We calculate the finite sample breakdown point of the MAD. We note firstly that if Δ(x_n) > 1/2 then MAD(x_n) = 0 and hence ε*(MAD, x_n) = 0. This shows that the MAD does not attain the upper bound of Theorem 1.27. Suppose now that Δ(x_n) = l/n ≤ 1/2 and that n = 2m is even. If the value x_1 is taken on l times and we now alter m + 1 − l of the remaining sample values to x_1, we see that the MAD of this replacement sample is zero. This gives

\[ \varepsilon^*(\mathrm{MAD}, x_n) \le (m + 1 - l)/n = \lfloor n/2 + 1 \rfloor / n - \Delta(x_n). \]

This upper bound also holds for n odd. So far we have altered the sample so that the MAD of the replacement sample is zero. We can also try to make the MAD break down by becoming arbitrarily large. A short calculation shows that in order to do this we must send off ⌊(n + 1)/2⌋ observations to ∞ and hence

\[ \varepsilon^*(\mathrm{MAD}, x_n) \le \min\{\lfloor n/2 + 1 \rfloor / n - \Delta(x_n),\ \lfloor (n + 1)/2 \rfloor / n\} = \lfloor n/2 + 1 \rfloor / n - \Delta(x_n) \]

as Δ(x_n) ≥ 1/n. An examination of the arguments shows that the MAD can only break down under the conditions we used for the upper bound and hence

\[ \varepsilon^*(\mathrm{MAD}, x_n) = \lfloor n/2 + 1 \rfloor / n - \Delta(x_n). \]

As already mentioned, this is less than the upper bound of Theorem 1.27. It is not a simple matter to define a scale functional which attains the upper bound, but it can be done.

1.6.3 Identifying outliers

This section is based on Davies and Gather (1993). We take the copper data of Example 1.1 and wish to identify any possible outliers. There are many ways of doing this, and much misguided effort was put into devising statistical tests for the presence of outliers in one-dimensional data sets (see Barnett and Lewis (1994)). The problem is that identifying outliers does not fit well into standard statistical methodology. An optimal test for, say, one outlier may be devised but, as we shall see, it performs badly if more than one outlier is present. The Bayesian approach is no better, requiring as it does the parameterization of the number of outliers and their distribution together with an appropriate prior. We describe the so-called "two-sided discordancy test" (Barnett and Lewis (1994), pages 223-224). For a data set x_n the test statistic is

\[ TS(x_n) = \max_i |x_i - \mathrm{mean}(x_n)|/\mathrm{sd}(x_n) \tag{1.20} \]

with critical value c(TS, n, α) for a test of size α. The distribution of the test statistic TS is calculated under the normal distribution. The affine invariance of TS means that it is sufficient to consider the standard normal distribution. Simulations show that for n = 27 and α = 0.05 the critical value is c(TS, 27, 0.05) = 2.84, where we have used the R version of the standard deviation. The attained value for the copper data is 2.719, so the test accepts the null hypothesis of no outliers. Suppose we now replace the largest observation 2.21 by 22.1. The value of the test statistic is now 5.002 and the observation 22.1, and only this observation, is identified as an outlier. We now replace the second largest observation 2.20 by 22.0. The value of the test statistic is now 3.478 and the observations 22.0 and 22.1, and only these observations, are identified as outliers. We continue and replace the third largest observation 2.16 by 21.6. The value of the test statistic is now 2.806 and we come to the rather remarkable conclusion that there are no outliers present. The source of the failure is easily identified. The values of the mean as we move from no outliers to three are 2.016, 2.752, 3.486 and 4.206. The corresponding values of the standard deviation are 0.116, 3.868, 5.352 and 6.376. We conclude that the mean and standard deviation are so influenced by the very outliers they are meant to detect that the test statistic drops below its critical value. The following exercises explain this behaviour.

Exercise 1.31. Let X_i, i = 1, ..., n be i.i.d. N(0, 1) random variables. Show that

\[ \lim_{n\to\infty} P\left( \max_{1\le i\le n} |X_i| \le \sqrt{\tau \log(n)} \right) = \begin{cases} 0, & \tau < 2,\\ 1, & \tau \ge 2. \end{cases} \]

Exercise 1.32. Show that the critical values c(TS, n, α) of the two-sided discordancy test satisfy

\[ \lim_{n\to\infty} \frac{c(TS, n, \alpha)}{\sqrt{2\log(n)}} = 1 \]

for any α, 0 < α < 1.

Exercise 1.33. Show that for large n the two-sided discordancy test can fail to detect n/(τ log(n)) arbitrarily large outliers for any τ < 2.

The failure of the two-sided discordancy test to identify large outliers is related to the low finite sample breakdown points of 1/n of the mean and standard deviation. The remedy is simply to replace the mean and standard deviation by other measures with a high finite sample breakdown point. The obvious candidates are the median and the MAD. We replace the TS of (1.20) by

\[ TH(x_n) = \max_i |x_i - \mathrm{med}(x_n)|/\mathrm{MAD}(x_n). \tag{1.21} \]

The 'H' in (1.21) stands for Hampel, who proposed cleaning data by removing all observations x_i for which |x_i − med(x_n)| ≥ 5.2 MAD(x_n) and then calculating the mean of the remainder (see Hampel (1985)). Again we calculate the cut-off value for TH using the normal distribution. Repeating the simulations as for TS we obtain a value of c(TH, 27, 0.05) = 5.78, where we have used the const = 1 version of the mad in R. The values of TH for none, one, two and three outliers as before are 3.3, 200.7, 200.7 and 200.7, so that all outliers are easily identified. How far can we go? If we replace the largest 13 observations by ten times their actual value, the value of TH is 60.82 and all outliers are correctly identified. However, if we replace the largest 14 observations by ten times their actual value, the value of TH is 10.94 and the smallest 13 observations are now identified as outliers. The sample is now

1.70 1.86 1.88 1.88 1.90 1.92 1.92 1.93 1.99
1.99 2.01 2.02 2.02 2.03 20.4 20.5 20.5 20.6
20.6 20.6 20.8 21.3 21.3 21.5 21.6 22.0 22.1

In the absence of further information this is perfectly reasonable, as the large observations are now in the majority. The fact that the switch-over takes place when about half the observations are outliers is related to the finite sample breakdown point of the median and the MAD, namely 1/2.
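An outlier identifier based on (1.21) takes only a few lines of R (the function name is mine; the cut-off 5.78 is the simulated value c(TH, 27, 0.05) quoted above):

# indices of the observations flagged as outliers by (1.21)
outliers.TH <- function(x, cutoff) {
  TH <- abs(x - median(x)) / mad(x, const = 1)
  which(TH > cutoff)
}
outliers.TH(copper, 5.78)   # integer(0): no outliers in the copper data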

The cut-off value c(TH, 27, 0.05) = 5.78 was obtained by simulations. Simulations are a very powerful technique for evaluating integrals in high dimensions, which is what we are essentially doing, but they are computationally expensive. Furthermore, it is not possible to simulate the value of c(TH, n, α) for every n and α, and thought must be given to how best to calculate the c(TH, n, α), at least approximately. The following method is often appropriate. We restrict ourselves to say 3 values of α, say 0.1, 0.05 and 0.01. For small values of n, say 3:19, we calculate the simulated values and store them. The values are given in Table 1 and Figure 4 shows a plot of the values of c(TH, n, 0.1) for n = 3(1)19.
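Such a simulation can be sketched as follows: the cut-off c(TH, n, α) is the (1 − α) quantile of the distribution of TH for an i.i.d. N(0, 1) sample (the function name is mine; nsim = 50000 matches the simulations reported below):

# simulate the cut-off value c(TH, n, alpha) under the standard normal model
cTH.sim <- function(n, alpha, nsim = 50000) {
  TH <- replicate(nsim, {
    x <- rnorm(n)
    max(abs(x - median(x))) / mad(x, const = 1)
  })
  quantile(TH, 1 - alpha)
}
# cTH.sim(27, 0.05) should be close to the value 5.78 used above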

Exercise 1.34. Explain the odd-even sample size behaviour of c(TH, n, 0.1) evident in Figure 4.

n     α = 0.1   α = 0.05   α = 0.01
3     15.40     32.54      158.97
4      7.11     10.19       21.82
5      7.84     11.58       26.42
6      6.18      8.07       13.53
7      6.78      9.17       16.28
8      5.84      7.23       11.98
9      5.99      7.45       12.16
10     5.58      6.77       10.35
11     5.68      6.92       11.06
12     5.37      6.47        9.34
13     5.52      6.58        9.38
14     5.28      6.16        8.46
15     5.44      6.40        8.66
16     5.25      6.12        8.17
17     5.36      6.20        8.42
18     5.21      6.01        7.91
19     5.26      6.02        8.00

Table 1: The cut-off values c(TH, n, α) of TH of (1.21) for n = 3(1)19 and for α = 0.1, 0.05 and 0.01.

[Figure 4: The values of c(TH, n, 0.1) of TH of (1.21) for n = 3(1)19. This is the second column of Table 1.]

For large values of n we try to determine the asymptotic behaviour of c(TH, n, α) and then consider the differences from the actual behaviour for large n. With luck these will tend to zero and exhibit a regular behaviour which allows them to be approximated by a simple function of n. This can be done using the linear regression module of R, and we finally obtain a simple formula for c(TH, n, α) for large n. The first step is to calculate the asymptotic behaviour of c(TH, n, α). If the sample X_n is i.i.d. N(0, 1) then med(X_n) ≈ 0 and MAD(X_n) ≈ 0.67449. This gives

\[ TH(X_n) \approx \max_i |X_i|/0.67449. \]

Now if x is chosen so that

\[ 1 - \alpha = P\left( \max_i |X_i| \le x \right) = P(|X_1| \le x)^n \]

we see that

\[ 2P(X_1 \le x) - 1 = P(|X_1| \le x) = (1 - \alpha)^{1/n} \]

so that

\[ c(TH, n, \alpha) \approx \Phi^{-1}\left( (1 + (1 - \alpha)^{1/n})/2 \right)/0.67449. \tag{1.22} \]

On putting

\[ \gamma(TH, n, \alpha) = 0.67449\, c(TH, n, \alpha)/\Phi^{-1}\left( (1 + (1 - \alpha)^{1/n})/2 \right) \tag{1.23} \]

we expect that lim_{n→∞} γ(TH, n, α) = 1. In fact, due to the variability of the median and the MAD, we expect that for finite n the value of γ(TH, n, α) should exceed one. It therefore seems reasonable to study log(γ(TH, n, α) − 1), which should tend to −∞ as n → ∞. The R command for Φ^{-1}(p) is

> qnorm(p)

so that Φ^{-1}((1 + (1 − α)^{1/n})/2) can be easily calculated. The next question is which values of n to use. As we expect the dependence to be in log(n) (experience!), we choose the values of n so that the values of log(n) are approximately equally spaced. In the following we put n = 20 · 1.25^k, k = 0(1)19. To test whether this is satisfactory, a small simulation with 1000 simulations for each value of n can be carried out. The following results are based on 50000 simulations for each value of n. The upper panel of Figure 5 shows the values of γ(TH, n, 0.1) plotted against those of log(n). As expected they tend to one as n increases. The lower panel of Figure 5 shows the values of log(γ(TH, n, 0.1) − 1) plotted against those of log(n). As expected they tend to −∞ as n → ∞. Luckily the dependence on log(n) is clearly almost linear. The line in the lower panel shows the regression line

\[ \log(\gamma(TH, n, 0.1) - 1) = 1.261 - 0.867 \log(n). \]

Putting all this together we get the following approximation for c(TH, n, 0.1), where we simplify the numerical values as the exact ones are not very important. We are only trying to identify outliers and this does not depend on the last decimal place. Bearing this in mind we have the approximation

\[ c(TH, n, 0.1) \approx 1.48\,(1 + 3.53\, n^{-0.87})\, \Phi^{-1}\left( (1 + 0.9^{1/n})/2 \right). \tag{1.24} \]

[Figure 5: The upper panel shows the values of γ(TH, n, 0.1) of (1.23) plotted against the values of log(n). The lower panel shows the values of log(γ(TH, n, 0.1) − 1) plotted against the values of log(n), together with the approximating linear regression.]

The corresponding results for c(TH, n, 0.05) and c(TH, n, 0.01) are

\[ c(TH, n, 0.05) \approx 1.48\,(1 + 4.65\, n^{-0.88})\, \Phi^{-1}\left( (1 + 0.95^{1/n})/2 \right), \tag{1.25} \]

\[ c(TH, n, 0.01) \approx 1.48\,(1 + 8.10\, n^{-0.93})\, \Phi^{-1}\left( (1 + 0.99^{1/n})/2 \right). \tag{1.26} \]
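The approximations (1.24)-(1.26) translate into a small R function (my wrapper; α restricted to the three values 0.1, 0.05 and 0.01):

# approximate cut-off c(TH, n, alpha) via (1.24)-(1.26)
cTH.approx <- function(n, alpha) {
  a <- c("0.1" = 3.53, "0.05" = 4.65, "0.01" = 8.10)   # constants of (1.24)-(1.26)
  b <- c("0.1" = 0.87, "0.05" = 0.88, "0.01" = 0.93)
  key <- as.character(alpha)
  1.48 * (1 + a[key] * n^(-b[key])) * qnorm((1 + (1 - alpha)^(1 / n)) / 2)
}
cTH.approx(27, 0.05)   # approximately 5.8, close to the simulated 5.78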

Exercise 1.35. Repeat the simulations and approximations for the outlier identifier based on the shorth and the length of the shortest half-interval:

\[ TSH(x_n) = \max_i |x_i - \mathrm{shrth}(x_n)|/\mathrm{lshrth}(x_n). \]

1.7 M-measures of location and scale

1.7.1 M-measures of location

So far we have only three measures of location: the mean, the median and the shorth. We now introduce a whole class of location measures, which we motivate by showing that the mean and the median are special cases. Let P_n denote the empirical measure of the sample x_n. Then if we use (1.9) with f(x) = ι(x) = x we see that

\[ \mathrm{mean}(x_n) = \int_{\mathbb{R}} \iota(x)\, dP_n(x) = \int_{\mathbb{R}} x\, dP_n(x). \]

From this it follows that

\[ \int_{\mathbb{R}} \iota(x - \mathrm{mean}(x_n))\, dP_n(x) = 0 \tag{1.27} \]

and in fact this defines the mean. By choosing a different function, namely the sign function sgn defined by

\[ \mathrm{sgn}(x) = \begin{cases} -1, & x < 0,\\ 0, & x = 0,\\ 1, & x > 0, \end{cases} \tag{1.28} \]

we see that the median satisfies

\[ \int_{\mathbb{R}} \mathrm{sgn}(x - \mathrm{med}(x_n))\, dP_n(x) = 0. \tag{1.29} \]

This leads to a general class of location measures T_L defined by

\[ \int_{\mathbb{R}} \psi(x - T_L(x_n))\, dP_n(x) = 0. \tag{1.30} \]

Definition 1.36. Any location measure T_L defined as a solution of (1.30) is called an M-measure or M-estimator of location.

M-measures were first introduced by Huber (1964) as a generalization of maximum likelihood estimates, obtained by breaking the link between the function ψ and the density f of the proposed model (see also Huber (1981)). Under appropriate conditions (1.30) always has a unique solution.

Theorem 1.37. Let ψ : R → R be an odd, continuous, bounded and strictly increasing function. Then the equation

\[ \int_{\mathbb{R}} \psi(x - m)\, dP_n(x) = 0 \]

has a unique solution m =: T_L(x_n) for every sample x_n.

Proof. The conditions placed on ψ guarantee that the limits lim_{x→±∞} ψ(x) = ψ(±∞) exist with −∞ < ψ(−∞) < 0 < ψ(∞) < ∞. The function

\[ \Psi(m) = \int_{\mathbb{R}} \psi(x - m)\, dP_n(x) \]

is continuous and strictly monotone decreasing with Ψ(−∞) = ψ(∞) > 0 and Ψ(∞) = ψ(−∞) < 0. It follows that there exists a uniquely defined m with Ψ(m) = 0, and this proves the theorem. □

Remark 1.38. Although it is generally assumed that ψ is an odd function, the theorem still holds if this is replaced by ψ(−∞) < 0 < ψ(∞).

Example 1.39. The function ψ_c given by

\[ \psi_c(x) = \frac{\exp(cx) - 1}{\exp(cx) + 1}, \qquad x \in \mathbb{R},\ c > 0, \]

satisfies the conditions of Theorem 1.37. We note

\[ \lim_{c\to 0} \psi_c(x)/c = x/2 \]

and

\[ \lim_{c\to\infty} \psi_c(x) = \mathrm{sgn}(x). \]

This will be our standard ψ-function, where we have the freedom to choose the tuning constant c.

To be of use one must be able to calculate the M-measure for data sets. The following bisection procedure can be shown to work (an R sketch is given after the list).

(a) Put m_0 = med(x_n) and s_0 = MAD(x_n).

(b) If Ψ(m_0) = 0 put T_L(x_n) = m_0 and terminate.

(c) If Ψ(m_0) < 0 put m_2 = m_0, choose k so that Ψ(m_0 − ks_0) > 0 and put m_1 = m_0 − ks_0.

(d) If Ψ(m_0) > 0 put m_1 = m_0, choose k so that Ψ(m_0 + ks_0) < 0 and put m_2 = m_0 + ks_0.

(e) Put m_3 = (m_1 + m_2)/2. If Ψ(m_3) < 0 put m_2 = m_3. If Ψ(m_3) > 0 put m_1 = m_3.

(f) Iterate (e) until m_2 − m_1 < 10^{−4} s_0/√n.

(g) Put T_L(x_n) = (m_1 + m_2)/2 and terminate.
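A minimal R sketch of steps (a)-(g) with the standard ψ-function of Example 1.39 (the function names are mine; it assumes MAD(x_n) > 0):

# psi of Example 1.39; note (exp(cx) - 1)/(exp(cx) + 1) = tanh(cx/2),
# and the tanh form avoids overflow for large arguments
psi <- function(x, c) tanh(c * x / 2)

# M-measure of location of (1.30), computed by the bisection (a)-(g)
TL <- function(x, c = 3) {
  Psi <- function(m) mean(psi(x - m, c))
  m0 <- median(x)
  s0 <- mad(x, const = 1)
  if (Psi(m0) == 0) return(m0)                  # step (b)
  m1 <- m0; m2 <- m0
  while (Psi(m1) <= 0) m1 <- m1 - s0            # step (c): ensure Psi(m1) > 0
  while (Psi(m2) >= 0) m2 <- m2 + s0            # step (d): ensure Psi(m2) < 0
  while (m2 - m1 >= 1e-4 * s0 / sqrt(length(x))) {
    m3 <- (m1 + m2) / 2                         # step (e): bisect
    if (Psi(m3) < 0) m2 <- m3 else m1 <- m3
  }
  (m1 + m2) / 2                                 # step (g)
}

Replacing x − m by (x − m)/MAD(x_n) in Psi gives the affinely equivariant version (1.32) used in Example 1.43 below.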

Exercise 1.40. Write an R-programme to calculate T_L for the ψ-function of Example 1.39.

Exercise 1.41. Show that if ψ satisfies the conditions of Theorem 1.37 then T_L is translation equivariant but not affine equivariant.

Exercise 1.42. Suppose that ψ satisfies the conditions of Theorem 1.37. Calculate the finite sample breakdown point ε*(T_L, x_n) of T_L.

To make T_L affine equivariant we must augment (1.30) by an auxiliary measure of scale T_S. This leads to solving

\[ \int_{\mathbb{R}} \psi\left( \frac{x - T_L(x_n)}{T_S(x_n)} \right) dP_n(x) = 0 \tag{1.31} \]

for T_L(x_n). A good auxiliary measure of scale is the MAD, which leads to

\[ \int_{\mathbb{R}} \psi\left( \frac{x - T_L(x_n)}{\mathrm{MAD}(x_n)} \right) dP_n(x) = 0. \tag{1.32} \]

Example 1.43. If we put c = 3 and use T_S = MAD, then the value of the M-measure of location for the copper data of Example 1.1 is 2.0224, which lies between the mean and the median.

1.7.2 M-measures of scale

M-measures of scale T_S are defined in a similar manner, as follows.

Definition 1.44. For a function χ and a sample x_n with empirical distribution P_n, any solution of

\[ \int_{\mathbb{R}} \chi\left( \frac{x}{T_S(x_n)} \right) dP_n(x) = 0 \tag{1.33} \]

is called an M-measure or M-estimator of scale.

Exercise 1.45. Suppose that the sample x_n has mean zero and put χ(x) = x² − 1. Show that T_S is the standard deviation of the sample.

Exercise 1.46. Suppose that the sample x_n has median zero and put χ(x) = sgn(|x| − 1). Show that T_S is the MAD of the sample.

Corresponding to Theorem 1.37 we have

Theorem 1.47. Let χ : R → R be an even, continuous and bounded function which is strictly increasing on R⁺ with χ(0) = −1 and χ(∞) = 1. Then the equation

\[ \int_{\mathbb{R}} \chi(x/s)\, dP_n(x) = 0 \]

has a unique solution s =: T_S(x_n) with 0 < T_S(x_n) < ∞ for every sample x_n with P_n({0}) < 1/2.

Proof. As P_n({0}) < 1/2 < 1 the function Ξ(s) = ∫_R χ(x/s) dP_n(x) is strictly monotone decreasing and continuous. Furthermore lim_{s→∞} Ξ(s) = −1 and lim_{s→0} Ξ(s) = −P_n({0}) + (1 − P_n({0})) > 0 as P_n({0}) < 1/2. Hence there exists a unique s =: T_S(x_n) with Ξ(s) = 0. □

Example 1.48. The function

\[ \chi(x) = \frac{x^4 - 1}{x^4 + 1} \]

satisfies the assumptions of the theorem and it will be our standard χ-function.

The calculation of an M-measure of scale follows the lines of that for M-measures of location (an R sketch is given after the list). We write

\[ \Xi(s) = \frac{1}{n}\sum_{i=1}^{n} \chi(x_i/s). \]

(a) Put s_0 = MAD(x_n).

(b) If Ξ(s_0) = 0 put T_S(x_n) = s_0 and terminate.

(c) If Ξ(s_0) < 0 put s_2 = s_0, choose k so that Ξ(s_0 2^{−k}) > 0 and put s_1 = s_0 2^{−k}.

(d) If Ξ(s_0) > 0 put s_1 = s_0, choose k so that Ξ(s_0 2^{k}) < 0 and put s_2 = s_0 2^{k}.

(e) Put s_3 = √(s_1 s_2). If Ξ(s_3) < 0 put s_2 = s_3. If Ξ(s_3) > 0 put s_1 = s_3.

(f) Iterate (e) until s_2/s_1 − 1 < 10^{−4}/√n.

(g) Put T_S(x_n) = √(s_1 s_2) and terminate.
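The corresponding R sketch for scale, with the standard χ-function of Example 1.48 (again my naming; it assumes the conditions of Theorem 1.47 hold, in particular MAD(x_n) > 0 and P_n({0}) < 1/2):

# chi of Example 1.48
chi <- function(x) (x^4 - 1) / (x^4 + 1)

# M-measure of scale of (1.33), computed by the multiplicative bisection (a)-(g)
TS <- function(x) {
  Xi <- function(s) mean(chi(x / s))
  s0 <- mad(x, const = 1)
  if (Xi(s0) == 0) return(s0)                    # step (b)
  s1 <- s0; s2 <- s0
  while (Xi(s1) <= 0) s1 <- s1 / 2               # step (c): ensure Xi(s1) > 0
  while (Xi(s2) >= 0) s2 <- s2 * 2               # step (d): ensure Xi(s2) < 0
  while (s2 / s1 - 1 >= 1e-4 / sqrt(length(x))) {
    s3 <- sqrt(s1 * s2)                          # step (e): geometric midpoint
    if (Xi(s3) < 0) s2 <- s3 else s1 <- s3
  }
  sqrt(s1 * s2)                                  # step (g)
}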

Exercise 1.49. Write an R-programme to calculate T_S for the χ-function of Example 1.48.

In order to make T_S affine equivariant we require a measure of location T_L and then solve

\[ \int_{\mathbb{R}} \chi\left( \frac{x - T_L(x_n)}{T_S(x_n)} \right) dP_n(x) = 0. \]

Rather than do this in an ad hoc manner, we now consider joint M-measures of location and scale.

1.7.3 Simultaneous M-measures of location and scale

In general location and scale functionals have to be estimated simultaneously. This leads to the following system of equations:

\[ \int_{\mathbb{R}} \psi\left( \frac{x - m}{s} \right) dP_n(x) = 0 \tag{1.34} \]

\[ \int_{\mathbb{R}} \chi\left( \frac{x - m}{s} \right) dP_n(x) = 0. \tag{1.35} \]

In order to guarantee that a solution exists we must place conditions on the functions ψ and χ as well as on the sample x_n.

(ψ1) ψ(−x) = −ψ(x) for all x ∈ R.
(ψ2) ψ is strictly increasing.
(ψ3) lim_{x→∞} ψ(x) = 1.
(ψ4) ψ is continuously differentiable with derivative ψ⁽¹⁾.

(χ1) χ(−x) = χ(x) for all x ∈ R.
(χ2) χ : R⁺ → [−1, 1] is strictly increasing.
(χ3) χ(0) = −1.
(χ4) lim_{x→∞} χ(x) = 1.
(χ5) χ is continuously differentiable with derivative χ⁽¹⁾.

(ψχ1) χ⁽¹⁾/ψ⁽¹⁾ : R⁺ → R⁺ is strictly increasing.

(x_n) Δ(x_n) < 1/2, where Δ(x_n) is defined by (1.19).

The existence of joint M-measures of location and scale is not trivial, but the following theorem holds.


Theorem 1.50. Under the above assumptions (ψi), i = 1, ..., 4, (χi), i = 1, ..., 5, (ψχ1) and (x_n), the equations (1.34) and (1.35) have a unique solution (m, s) =: (T_L(x_n), T_S(x_n)).

Proof. A proof may be found in Huber (1981). □

Example 1.51. We consider the ψ- and χ-functions

\[ \psi(x) = \psi(x, c) = \frac{\exp(cx) - 1}{\exp(cx) + 1}, \qquad \chi(x) = \frac{x^4 - 1}{x^4 + 1}. \]

They clearly fulfil all the conditions of Theorem 1.50 for all values of c > 0, apart possibly from the condition (ψχ1). Direct calculations show that this holds for c > 2.56149. This does not exclude existence and uniqueness for c < 2.56149, as the condition is only a sufficient one.

Exercise 1.52. Verify the calculation of Example 1.51.

An algorithm for calculating the solutions for a data set may be derived from the proof of the theorem given in Huber (1981). The following simple iteration scheme seems always to converge, although I have no proof (an R sketch is given after the list):

• Put s = s_1 = MAD(x_n) in (1.34) and solve for m = m_1.
• Put m = m_1 in (1.35) and solve for s = s_2.
• Put s = s_2 in (1.34) and solve for m = m_2.
• Put m = m_2 in (1.35) and solve for s = s_3.
• Iterate as indicated and terminate when

|m_{k+1} − m_k| + |s_{k+1} − s_k| < 10^{−4} s_k/√n.
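A sketch of this alternating scheme in R, reusing the routines TL and TS sketched above: (1.34) is solved for m with s fixed via m = s·TL(x/s) (using the translation equivariance of T_L on the rescaled data), and (1.35) is solved for s with m fixed via s = TS(x − m). The names solve.m, solve.s and TLS are mine.

solve.m <- function(x, s, c) s * TL(x / s, c)   # solve (1.34) for m, s fixed
solve.s <- function(x, m) TS(x - m)             # solve (1.35) for s, m fixed

TLS <- function(x, c = 3) {
  n <- length(x)
  s <- mad(x, const = 1)                        # s_1 = MAD(x_n)
  m <- solve.m(x, s, c)                         # m_1
  repeat {
    s.new <- solve.s(x, m)                      # next s
    m.new <- solve.m(x, s.new, c)               # next m
    if (abs(m.new - m) + abs(s.new - s) < 1e-4 * s / sqrt(n)) break
    m <- m.new
    s <- s.new
  }
  c(TL = m.new, TS = s.new)
}

With c = 3 this should reproduce the values of Example 1.53 below for the copper data.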

Example 1.53. We calculate the simultaneous location and scale functionals T_L and T_S for the copper data using the functions

\[ \psi(x) = \psi(x, c) = \frac{\exp(cx) - 1}{\exp(cx) + 1}, \qquad \chi(x) = \frac{x^4 - 1}{x^4 + 1} \]

with tuning parameter c = 3. We obtain T_L = 2.026956 (mean 2.015926, median 2.03) and T_S = 0.0674 (standard deviation 0.1140260, MAD 0.1).

Exercise 1.54. Verify the calculation of Example 1.53.

We now look at the breakdown behaviour of the joint M-functional defined by (1.34) and (1.35). For a sample x_n we set (m, s) = (T_L(x_n), T_S(x_n)), so the equations read

\[ \sum_{i=1}^{n} \psi\left( \frac{x_i - m}{s} \right) = 0 \tag{1.36} \]

\[ \sum_{i=1}^{n} \chi\left( \frac{x_i - m}{s} \right) = 0. \tag{1.37} \]

The situation is more complicated than before. The condition Δ(x_n) < 1/2 for the existence and uniqueness of a solution is only sufficient. This makes it difficult to decide whether a solution exists even if Δ(x_n) ≥ 1/2. Because of this we consider only the breakdown of the location part T_L(x_n). Suppose we can alter k observations in such a manner that we obtain a sequence of values ((m_l, s_l))_{l=1}^∞ of solutions of (1.36) and (1.37) with lim_{l→∞} m_l = ∞. We consider two cases.

Case 1: s_l remains bounded. For the n − k observations which are not altered, (x_i − m_l)/s_l tends to −∞. As ψ(−∞) = −1 and ψ ≤ 1, equation (1.36) implies −(n − k) + k ≥ 0 and hence k ≥ n/2.

Case 2: lim sup_{l→∞} s_l = ∞. We can find a subsequence with lim_{l'→∞} m_{l'}/s_{l'} = γ with 0 ≤ γ ≤ ∞. For the n − k observations which are not altered, (x_i − m_{l'})/s_{l'} tends to −γ. This implies (n − k)ψ(−γ) + k ≥ 0 and (n − k)χ(−γ) + k ≥ 0. As ψ is an odd function and χ an even function, we deduce ψ(γ) ≤ k/(n − k) and χ(γ) ≥ −k/(n − k). As γ ≥ 0 this implies

\[ \chi^{-1}(-k/(n - k)) \le \gamma \le \psi^{-1}(k/(n - k)). \]

Let ε_0 be the solution of

\[ \chi^{-1}(-\varepsilon_0/(1 - \varepsilon_0)) = \psi^{-1}(\varepsilon_0/(1 - \varepsilon_0)). \tag{1.38} \]

Then k/n ≥ ε_0. As ε_0 < 1/2 we see that the finite sample breakdown point satisfies ε* ≥ ⌈nε_0⌉/n.

We conclude ⌈nε_0⌉/n ≤ ε*(T_L, x_n) ≤ ⌊(n + 1)/2⌋/n. In fact one can go further and show that

\[ \lceil n\varepsilon_0 \rceil / n \le \varepsilon^*(T_L, x_n) \le \lceil n\varepsilon_0 + 1 \rceil / n \tag{1.39} \]

with ε_0 given by (1.38).

References

V. Barnett and T. Lewis. Outliers in Statistical Data. Wiley, Chichester, third edition, 1994.

P. L. Davies and U. Gather. The identification of multiple outliers (with discussion). Journal of the American Statistical Association, 88:782-801, 1993.

P. L. Davies and U. Gather. Breakdown and groups (with discussion). Annals of Statistics, 33:977-1035, 2005.

D. L. Donoho and P. J. Huber. The notion of breakdown point. In P. J. Bickel, K. A. Doksum, and J. L. Hodges, Jr., editors, A Festschrift for Erich L. Lehmann, pages 157-184, Belmont, California, 1983. Wadsworth.

F. R. Hampel. Contributions to the theory of robust estimation. PhD thesis, University of California, Berkeley, 1968.

F. R. Hampel. The breakdown points of the mean combined with some rejection rules. Technometrics, 27:95-107, 1985.

P. J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 35:73-101, 1964.

P. J. Huber. Robust Statistics. Wiley, New York, 1981.