
____________________________________________________________________________ IUT de Saint-Etienne – Département TC –J.F.Ferraris – Math – S2 – Stat2Var – Lessons – Rev2020 – page 1 / 14

SALES AND MARKETING Department

MATHEMATICS

2nd Semester

________ Bivariate statistics ________

LESSONS

Online document: http://jff-dut-tc.weebly.com section DUT Maths S2.


TABLE OF CONTENTS

LESSONS

1 Introduction, vocabulary
    1.1 Aims
    1.2 Formatting
    1.3 Scatter plot

2 Chi-square independence testing
    2.1 The special case of a Chi-square independence testing
    2.2 Methodology
    2.3 Independence in a 2x2 table
    2.4 Some clarification on Chi-square law

3 Fitting: Mayer’s method and moving means
    3.1 Moving means
    3.2 Purpose of linear fitting
    3.3 Mayer’s method

4 Linear fitting: least square method
    4.1 Parameters of a bivariate series
    4.2 Least square method
    4.3 Linear correlation coefficient

5 Non-linear fitting: variable change

6 Statistical prediction
    6.1 Point estimate
    6.2 Confidence interval


LESSONS

1 Introduction, vocabulary

1.1 Aims

Two characters will be studied simultaneously on each individual in a population of size n, creating two variables (lists of values) X and Y.

Aims:
* highlight a relationship between both characters: their correlation;
* model this correlation by a mathematical function: regression;
* use this model for forecasting purposes: prediction, with an associated confidence level;
* test the hypothesis that X and Y are not related.

If a cause-and-effect relationship is to be studied, X will represent the cause and will be called the explanatory variable, and Y will represent the effect and will be called the explained variable.

1.2 Formatting

From one individual (no. i), an observation will be written down as an ordered pair of values (xi ; yi).

There are two possible ways to display the data series, depending on the situation:

* bivariate data series given in lists

e.g.: relationship between the quantity of spread fertilizer and the harvested production

  plot no.   X: fertilizer (kg/ha)   Y: harvest (q/ha)
  1          150                     46
  2          80                      37
  3          120                     46
  4          220                     51
  5          100                     43

e.g. of a time series: annual advertising expense of a company

  X: year      2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017
  Y: expense   41    60    55    66    87    61    90    95    82    120   125   118

* bivariate data series + frequencies: contingency table

e.g.: relationship between age and visual acuity, data collected from 200 people

                 X: age
  Y: acuity      20    40    50    60
  3/10           1     5     10    20
  6/10           8     12    25    18
  9/10           55    26    14    6


1.3 Scatter plot

Every statistical series with two variables can be graphically represented by a point cloud, each variable being plotted on its own axis.

* series in lists: a pair (xi ; yi) corresponds to one individual and to one point.

second example of section 1.2:

[Figure: scatter plot of the advertising expense (k€) against the year (year 1: 2006)]

* series with contingency: a pair (xi ; yi) generally corresponds to more than one individual (freq ≥ 1) and to an object whose size is an increasing function of the associated frequency.

third example of section 1.2:

[Figure: scatter plot of acuity against age, the size of each point increasing with its frequency]


2 Chi-square independence testing

A statistical test consists in deciding whether a hypothesis made about the population can or cannot be rejected, on the basis of the results obtained from a sample. This hypothesis is called the "null hypothesis", H0.

If the decision leads to a rejection of H0, this is done with a certain risk of error, the probability of which is called the "significance level", or sometimes "risk threshold", of the test, and is noted α. (It should not be confused with the p-value of the test, which is the smallest significance level at which the observed sample would lead to rejecting H0.)

2.1 The special case of a Chi-square independence testing:

A study crosses two quantitative or qualitative variables (in the example of the next tutorial: sex and relationship to tobacco), whose interdependence within a population is to be assessed, based solely on the frequency distribution obtained from a sample of respondents.

In the case of independence (H0), the theoretical answers are supposed to be distributed by keeping the subtotals found from the sample (e.g.: a certain number of men and a certain number of women were interviewed, possibly different numbers) and in proportion to these subtotals.

The test involves calculating the deviation of the observed distribution from this theoretical one, a deviation noted "χ²calc" (pronounced "calculated Chi-square"), and then deciding whether this deviation is abnormally large or not. It has been proved that a population in which two variables are independent usually gives samples with a slight deviation (due to the random nature of the sample selection), but rarely a large deviation.

2.2 Methodology

n observations are conducted: n individuals are evaluated on two variables X and Y.

The variable X takes r different values, and Y takes k different values.

The null hypothesis H0 is by convention: the variables are independent.

The test compares reality to what perfect independence would have shown.

We can reject H0 if the set of observations is "too far" from the theoretical distribution.

1. Calculation of the observed χ²

* table of observations on n individuals

             Y1         Y2         …    Yk         total X
  X1         obs11      obs12      …    obs1k      total X1
  X2         obs21      obs22      …    obs2k      total X2
  …          …          …          …    …          …
  Xr         obsr1      obsr2      …    obsrk      total Xr
  total Y    total Y1   total Y2   …    total Yk   n

* table of the theoretical distribution (independence)

This second table is built from the first, taking back every subtotal, then calculating each frequency in proportion to these subtotals and to the general total n: the theoretical frequency of the cell (Xi, Yj) is (total Xi × total Yj) / n.

* calculation of χ²calc (global difference between obs and th):

$$\chi^2_{calc} = \sum_{table} \frac{(obs - th)^2}{th}$$

2. Rejection area

The χ² variable represents the infinitely many χ² values that could be obtained from all possible samples, under the null hypothesis. Its probability distribution is a law of the same name, determined by its number of degrees of freedom (dof): dof = (r - 1)(k - 1)

To each possible χ² value (in [0 ; +∞[) corresponds a probability "α" that a sample would exceed it.

In an exercise, when α is given, we can read the value of the corresponding χ²lim in the table.

3. Comparison and decision

If χ²calc > χ²lim, then we are allowed to reject H0 (the independence), with a risk α of being wrong.
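The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `chi_square_independence` is an assumption of this example, not part of the lesson, and the comparison with χ²lim is left to the reader since it requires the χ² table. The data used is the age / visual acuity contingency table of section 1.2.

```python
def chi_square_independence(observed):
    """Return (chi2_calc, dof, theoretical table) for a table of observed counts.

    `observed` is a list of rows; the name and signature are illustrative only.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Theoretical count under independence: (row subtotal x column subtotal) / n
    theoretical = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    # Sum over the whole table of (obs - th)^2 / th
    chi2_calc = sum(
        (obs - th) ** 2 / th
        for obs_row, th_row in zip(observed, theoretical)
        for obs, th in zip(obs_row, th_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2_calc, dof, theoretical


# Age / visual acuity table of section 1.2 (rows: acuity 3/10, 6/10, 9/10; columns: ages)
acuity_table = [
    [1, 5, 10, 20],
    [8, 12, 25, 18],
    [55, 26, 14, 6],
]
chi2_calc, dof, theoretical = chi_square_independence(acuity_table)
print(f"chi2_calc = {chi2_calc:.2f} with {dof} degrees of freedom")
```

The value printed for χ²calc is then compared with the χ²lim read in the table for the chosen α and the same number of degrees of freedom.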


2.3 Independence in a 2x2 table

(from: ENFA - Bulletin du GRES n°9 – février 2000)

Let's have a look at the tools that are available to conduct a two-character independence test for a 2 x 2 table

(two qualitative variables each with two modalities - for example: male/female for one and smoking/non-smoking

for the other).

Let us take the example of YATES (1934) quoted in [M.G. KENDALL and A. STUART The advanced theory of statistics

Griffin 1960]. We consider a sample of 42 children, of whom 20 were breastfed and 22 bottle-fed. The arrangement

of the teeth of these children was observed.

                       Normal dentition   Poorly implanted dentition   margin frequencies
  Breastfed (S)        4                  16                           20
  Bottle-fed (B)       1                  21                           22
  margin frequencies   5                  37                           42

The question is whether this sample alone can establish a link in the population between the way a baby is fed

and the quality of his or her dentition. This issue is addressed by an independence test.

2.3.1 Independence Chi-square test

The null hypothesis is "there is independence between the two characters" (mode of feeding and tooth

implantation).

The methodology of this test consists first of all in calculating the distance between the observed sample and the average sample that would be drawn from a population satisfying the null hypothesis. In order for the two tables to be comparable, the marginal numbers (also called margins, i.e. subtotals) must be identical (i.e. the margin frequencies in the table are fixed).

The table of "theoretical" frequencies (in fact: those of the average sample mentioned above) is:

                       Normal dentition   Poorly implanted dentition   margin frequencies
  Breastfed (S)        2.38095238         17.6190476                   20
  Bottle-fed (B)       2.61904762         19.3809524                   22
  margin frequencies   5                  37                           42

After comparison with the observed sample, this gives the following partial and total chi-2:

                       Normal dentition   Poorly implanted dentition
  Breastfed (S)        1.10095238         0.14877735
  Bottle-fed (B)       1.0008658          0.13525214

  Total chi-2: 2.38584767

This chi-square value (2.386) calculated, for 1 dof, corresponds to a significance level higher than 10%.

The chi-square law tells us more precisely that a chi-square of 2.386 corresponds to a p-value of 12.24% (in

other words: in a population where our two variables are independent, there is a 12.24% chance that a sample

with the same subtotals is as different or more different from the average sample).
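For 1 degree of freedom, this p-value can be checked without a table. The sketch below relies on the fact that a χ² variable with 1 dof is the square of a standard normal variable, so its upper tail can be written with the complementary error function of Python's `math` module; the function name `chi2_p_value_1dof` is an assumption of this example, and the shortcut does not apply to other numbers of degrees of freedom.

```python
import math


def chi2_p_value_1dof(chi2_calc):
    """Probability that a chi-square variable with 1 dof exceeds chi2_calc.

    With 1 dof, chi2 = U^2 where U is standard normal, hence
    P(chi2 > x) = P(|U| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi2_calc / 2))


p_value = chi2_p_value_1dof(2.38584767)
print(f"p-value = {p_value:.4f}")
```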

But here this would pose a problem, because the theoretical numbers are "too small", in the sense that,

according to textbooks, the Chi-2 test is only applicable if the theoretical numbers are all greater than or equal

to 5 (by the way, one may ask the question: why 5?).

This result of 12.24% is derived from the continuous Chi-square law, which is only an approximation of the

reality that is discrete here (for example, the "breastfed/normal dentition" frequency can only be 0, 1, 2, 3, 4

or 5, which is a "too discrete" situation to be effectively followed "closely" by a continuous law).

Section 2.3.2 below solves the problem.


2.3.2 The Exact Approach: The Fisher Exact Test

[R. A. FISHER Les méthodes expérimentales PUF 1947]

If the margin frequencies are fixed, then there are six different possible tables:

  0 20     1 19     2 18     3 17     4 16     5 15
  5 17     4 18     3 19     2 20     1 21     0 22

The question that arises then is to calculate, under the hypothesis of independence of the two characters, the

probability of appearance of each of the tables. It should be noted that, since the marginal numbers are fixed,

in order to fill in a table it is sufficient to know the number in the first row and first column.

The independence hypothesis can be interpreted as follows: of the 42 children, 20 are breastfed and 22 are

bottle-fed. If the mode of feeding has no influence on the dentition, then the 5 children with normal dentition

are distributed according to the proportions of the two modes of feeding.

Let's randomly choose 20 babies from 42 and call the "normal dentition" event a success. The number of

successes is described by the hypergeometric distribution H(42, 5, 20).

The probability of k successes (k between 0 and 5) is $P(k) = \dfrac{C_5^k \times C_{37}^{20-k}}{C_{42}^{20}}$, where $C_n^p$ denotes the number of combinations of p elements among n.

The calculation, for each of these 6 values, leads to the following results:

  Value in first row, first column   0        1        2        3        4        5
  Probability                        0.0310   0.1719   0.3440   0.3096   0.1253   0.0182

Let's go back to the first data table of the sample. If the null hypothesis is true, then the probability of obtaining

such a table (k = 4) or a table more distant from a proportionality table (k = 5) is 0.1435. The null hypothesis

can therefore only be rejected at a risk threshold greater than or equal to 14.35% (compared to the 12.24%

given by the Chi-square law), which is too high compared to the risk thresholds conventionally used (generally:

5% maximum).
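The hypergeometric probabilities and the tail probability used above can be recomputed with exact binomial coefficients. This is a minimal sketch, assuming the parameters H(42, 5, 20) of the example; the function name `hypergeometric_probability` is illustrative only, and the last printed digit may differ from the table above because of rounding.

```python
from math import comb


def hypergeometric_probability(k, population=42, successes=5, draws=20):
    """P(k successes) when `draws` individuals are chosen among `population`,
    `successes` of which have the character of interest (normal dentition)."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))


probabilities = [hypergeometric_probability(k) for k in range(6)]
for k, p in enumerate(probabilities):
    print(f"k = {k}: probability {p:.4f}")

# Probability of the observed table (k = 4) or of a more extreme one (k = 5)
tail = probabilities[4] + probabilities[5]
print(f"P(k >= 4) = {tail:.4f}")
```

The same list gives P(k = 0) + P(k = 5), which stays just under 5% and explains the decision rule stated for a 5% risk.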

To be more complete, we can say that for a 5% risk, we have the following decision rule:

  Value in first row, first column   0                             1, 2, 3, 4                        5
  Decision                           rejection of the hypothesis   non-rejection of the hypothesis   rejection of the hypothesis

P. DAGNELIE in [Theoretical and Applied Statistics Volume II De Boeck 1998] states that: "Despite these objections, like many authors, we still recommend the use of this test for small samples". The objections relate to the very strong assumption that the margins are fixed.

"The processing of frequencies by a χ² is a useful approximation in practice because of the relative simplicity of the calculations. The exact treatment, more time-consuming, but necessary in case of doubt, shows the true nature of the inferences suggested by the method of χ²."

Probability of obtaining each of the six possible tables (first row / second row):

  0 20 / 5 17 : 0.0310        1 19 / 4 18 : 0.1719
  2 18 / 3 19 : 0.3440        3 17 / 2 20 : 0.3096
  4 16 / 1 21 : 0.1253        5 15 / 0 22 : 0.0182


2.3.3 Additional comments

1°) A quote from M.J. Moroney in [Understanding the Statistics Marabout 1970]:

"A simple mathematical distribution can be perfectly well chosen because of its simplicity, whereas it fits the facts less well than a more complex distribution, provided it fits our purpose well enough. A man going on a trip may prefer to take a sketch with him rather than a headquarters map, because a sketch that is accurate enough and simpler to follow better suits his needs."

The statistic of χ² is not the best fit for the previous independence test. Let us recall that the distribution of χ²

is continuous whereas the calculated chi-square can only take a finite number of values, but it is very simple

to use and sufficient in the sense of the author of the quotation.

2°) Instead of the term independence, some authors prefer the term association. The term association should be understood in the sense: "is having bad teeth more associated with bottle-fed children than with breastfed children?". In order to measure the degree of association of two characteristics each having two modalities, various coefficients have been proposed, such as the YULE association coefficient and the FORBES-MARGALEF association coefficient.

Let's take a look at the formal table:

                                 Presence of the character A   Absence of the character A
  Presence of the character B    a                             c
  Absence of the character B     b                             d

• The coefficient of association in the sense of YULE (1900) is noted Q and by definition: $Q = \dfrac{ad - bc}{ad + bc}$.

Note that the numerator of this formula shows the quantity $ad - bc$, the difference between the cross products of the formal table, which cancels out if it is a proportion table, i.e. when there is independence of the two characters.

Moreover, Q is between –1 and 1.

If Q = 1, then bc = 0. If, for example, b = 0, it means that if the character A is present, then B too (associated

characters).

If Q = –1, then ad = 0. If, for example, a = 0, it means that the presence of the character A leads to the absence

of B (dissociated characters).

• The FORBES coefficient is defined by $\dfrac{a(a+b+c+d)}{(a+b)(a+c)}$.

Its definition is based on a frequentist approach and on the idea that if two non-zero real numbers are equal, then their quotient is equal to 1. The probability (inferred from the observations) that an individual has both character A and character B is equal to $\dfrac{a}{a+b+c+d}$. If the two characters are independent (in the sense of probabilities), then the probability that an individual has both character A and character B is equal to the product of their probabilities; this probability (inferred from the observations) is equal to $\dfrac{(a+b)(a+c)}{(a+b+c+d)^2}$.

Therefore, if the two characters are independent, the quotient of these two observed probabilities must be close to 1; this quotient is equal to $\dfrac{a(a+b+c+d)}{(a+b)(a+c)}$.

By looking at the two previous probabilities, you will recognise the observed number and the theoretical number of the first cell (they are equal to them up to a factor of $(a+b+c+d)$!).

3°) In R.A. FISHER's book and in books intended for commercial studies (e.g. [Y. FOURNIS Les études de marché

Dunod 1995]), there is another way of calculating the χ² observed.

Let us take back the latest table and name $n_1$, $n_2$, $m_1$, $m_2$ the margin frequencies, n being the total $a+b+c+d$.

Thus, the value of the observed χ² is equal to $\dfrac{(ad - bc)^2\, n}{n_1 n_2 m_1 m_2}$, a formula that is easy to implement and automate.

Note again the presence of the term $ad - bc$ in the numerator, as for the YULE association coefficient.
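The three 2 x 2 formulas of this section are indeed easy to automate. The sketch below applies them to the dentition table, taking B = "breastfed" (rows) and A = "normal dentition" (columns), so that a = 4, c = 16, b = 1 and d = 21; this assignment and the function names are assumptions made for the illustration.

```python
def yule_q(a, b, c, d):
    """YULE association coefficient Q = (ad - bc) / (ad + bc)."""
    return (a * d - b * c) / (a * d + b * c)


def forbes_coefficient(a, b, c, d):
    """FORBES coefficient a(a+b+c+d) / ((a+b)(a+c)): observed / theoretical count."""
    n = a + b + c + d
    return a * n / ((a + b) * (a + c))


def chi2_2x2(a, b, c, d):
    """Observed chi-square of a 2 x 2 table: (ad - bc)^2 n / (n1 n2 m1 m2)."""
    n = a + b + c + d
    # Margin frequencies: row totals (a + c, b + d) and column totals (a + b, c + d)
    return (a * d - b * c) ** 2 * n / ((a + c) * (b + d) * (a + b) * (c + d))


# Formal table layout: first row (a, c), second row (b, d)
a, c = 4, 16   # breastfed: normal / poorly implanted dentition
b, d = 1, 21   # bottle-fed: normal / poorly implanted dentition

print(f"Yule Q   = {yule_q(a, b, c, d):.4f}")
print(f"Forbes   = {forbes_coefficient(a, b, c, d):.4f}")
print(f"chi2 obs = {chi2_2x2(a, b, c, d):.5f}")
```

The χ² obtained this way can be compared with the total found in section 2.3.1 from the partial chi-2 values.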


2.4 Some clarification on Chi-2 law

2.4.1 Definition

A Chi-2 law with d degrees of freedom is the continuous distribution of a variable, often noted K, defined as the sum of the squares of d independent random variables Ui of the standard normal law:

$$\text{If } U_i \sim N(0, 1), \text{ then } K = \sum_{i=1}^{d} U_i^2 \sim \chi^2(d)$$

(Like the exponential law and others, this law belongs to the group of "gamma" laws – Γ – which we will not talk about here; let us simply mention that the law $\chi^2(d)$ is the law $\Gamma\left(\frac{d}{2}, \frac{1}{2}\right)$.)

2.4.2 Parameters of the law $\chi^2(d)$

Mean: $d$        Standard deviation: $\sqrt{2d}$        Mode: $d - 2$, if $d \geq 2$

The median depends on d in a more complex way:

  d                  1      2      3      4      5      d ≥ 6
  median (approx.)   0.45   1.39   2.37   3.36   4.35   d − 0.66

2.4.3 Patterns of probability densities

* If d = 1 (blue), the density is decreasing on ]0 ; +∞[ and tends to infinity at zero.

* If d = 2 (green), it is also strictly decreasing but is equal to 0.5 at zero. (The law $\chi^2(2)$ is actually the exponential law with a parameter (intensity) of 0.5.)

* If d ≥ 3 (yellow: 3, red: 5, brown: 8), the density is first increasing then decreasing and reaches its peak at the abscissa d − 2 (mode).

[Figure: probability densities of the χ²(d) law for d = 1, 2, 3, 5 and 8]

When carrying out a χ² independence test, let us not forget that we rely on the law of the same name, which is continuous, to evaluate a discrete situation (we generally test numbers of responses or numbers of successes, therefore integers). This law can, in these cases, only give an approximation of the probabilities we are interested in.

2.4.4 Links with other laws (further study)

* The central limit theorem gives a good approximation of the law $\chi^2(d)$ by a normal law $N(d, \sqrt{2d})$ when d is "big enough" (criterion, here: d > 100).

* definition of a Student law: if $U \sim N(0, 1)$ and $K \sim \chi^2(d)$ are independent, then $T = \dfrac{U}{\sqrt{K/d}} \sim ST(d)$

* definition of a Fisher law: if $K_1 \sim \chi^2(d_1)$ and $K_2 \sim \chi^2(d_2)$ are independent, then $F = \dfrac{K_1/d_1}{K_2/d_2} \sim F(d_1, d_2)$


3 Fitting: Mayer’s method and moving means

3.1 Moving means

The moving means are most frequently used in the case of time series, the variable X represents time and

the variable Y a value that changes over time.

When the Y values show large oscillations through time, an overall upward or downward trend is hard to

detect. The moving means are there to provide an answer, by smoothing these oscillations.

Methodology:

* group successive values of Y in packets, always of the same number n (for example: take values three

by three, or four by four, etc.); this number is chosen according to the periodicity of seasonal

phenomena. When this periodicity is even, the moving average is calculated with one more value,

the two extreme observations being weighted by half;

* The next set consists of the previous one, in which the first value of Y is removed and the next one is

joined (sliding sets);

* The average value of Y is calculated in each set (providing a list of moving means), same for the

average value of X (providing an average location in time for each set);

* The corresponding points are plotted (graphically represented).

e.g.:

  X (quarters)                1    2    3    4    5    6    7    8
  Y (thousands of tourists)   58   22   13   36   60   19   14   33

Let’s create the list of the 4×4 moving means:

  X   3      4        5        6
  Y   32.5   32.375   32.125   31.875

This new list of values (supported by its graph) suggests a downward trend.

note:

* the first moving mean is the mean of the values n° 1 (coef 1/2), 2, 3, 4 and 5 (coef 1/2).

Here: (1/2+2+3+4+5/2)/4 = 3 for x and (58/2+22+13+36+60/2)/4 = 32.5 for y

* the second moving mean is the mean of the values n° 2 (coef 1/2), 3, 4, 5 and 6 (coef 1/2).

Here: (2/2+3+4+5+6/2)/4 = 4 for x and (22/2+13+36+60+19/2)/4 = 32.375 for y

* and so on…
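The calculation described in this note can be sketched as follows. The function name `centered_moving_means` is an assumption of this example; for an even period it uses one more value and weights the two extreme ones by half, as explained in the methodology, and for an odd period it takes the plain mean of each set.

```python
def centered_moving_means(values, period):
    """Moving means of `period` successive values.

    Odd period: plain mean of each sliding set.
    Even period: period + 1 values, the two extreme ones weighted by 1/2.
    """
    means = []
    if period % 2 == 1:
        for start in range(len(values) - period + 1):
            window = values[start:start + period]
            means.append(sum(window) / period)
    else:
        for start in range(len(values) - period):
            window = values[start:start + period + 1]
            total = window[0] / 2 + sum(window[1:-1]) + window[-1] / 2
            means.append(total / period)
    return means


quarters = [1, 2, 3, 4, 5, 6, 7, 8]
tourists = [58, 22, 13, 36, 60, 19, 14, 33]

moving_x = centered_moving_means(quarters, 4)   # average location in time of each set
moving_y = centered_moving_means(tourists, 4)   # moving means of Y
print(moving_x)
print(moving_y)
```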


3.2 Purpose of linear fitting

A point cloud may show a link between both variables if its points are apparently not scattered at random. In some cases, the cloud's shape may be elongated and relatively thin, with a fairly straight "directional axis" showing a trend. Can we find an axis, a straight line, that "follows" the whole cloud "as well as possible"?

Let’s say this line has already been drawn: (D) : y′ = ax + b.

To a given value xi are associated the value yi (ordinate of the point Mi in the cloud) and the value y′i = axi + b (on the line).

definition: we call residue the number ei = yi – y′i

The residue of a point Mi is then positive if this point is above the line and negative in the opposite situation.

Hence, we aim to find the line that "best minimises" the residues, the line that passes through the cloud as close as possible to the points. This way, we perform a linear fitting, or linear regression. Once found, this line is called the fitting line, trend line or regression line of the series.

3.3 Mayer’s method

Some residues are positive, the others are negative. Mayer's assumption is that the "best" line is the one that leads to a zero sum of residues (the negative residues offset the positive ones).

definition: we call Mayer’s principle the goal $\sum_{i=1}^{n} e_i = 0$

mathematical analysis:

$$\sum e_i = \sum (y_i - a x_i - b) = \sum y_i - a \sum x_i - nb$$

This sum is zero iff $\frac{1}{n}\sum y_i - a\,\frac{1}{n}\sum x_i - \frac{1}{n}\,nb = 0$ iff $\bar{y} - a\bar{x} - b = 0$

That is to say: for the global residue to cancel out, it is necessary and sufficient that the straight line contains the midpoint of the cloud, $G(\bar{x}, \bar{y})$. This property isn't sufficient in itself to make a Mayer's line unique, since the only requirement is to pass through one given point. There are an infinite number of straight lines giving a zero sum of residues!

Mayer’s method:

* Divide the cloud into two subclouds:

Both subclouds must contain the same number of points: n/2 if n is even, or (n+1)/2 on one side and

(n-1)/2 on the other side if n is odd. The abscissas x in the first subcloud must all be less than the

abscissas x in the second one;

* Calculate the coordinates of G1 and G2, mean points (midpoints) of both subclouds;

* Determine (if asked) the expression of the line (G1G2), Mayer’s line that will be chosen; draw it

note: It’s been proved that the mean point of the whole cloud, G, belongs to the line (G1G2) in any

case, and then that the latter meets Mayer’s principle.
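Mayer's method can be sketched as below, here applied to the fertilizer / harvest series of section 1.2. The function name `mayer_line` is illustrative, and the choice to put the extra point in the first subcloud when n is odd is an assumption of this sketch (the lesson allows either side). It also assumes that no two points on either side of the split share the same abscissa.

```python
def mayer_line(points):
    """Return (a, b) such that y' = a x + b is the line (G1G2) of Mayer's method."""
    ordered = sorted(points)            # increasing abscissas
    half = (len(ordered) + 1) // 2      # first subcloud gets (n+1)/2 points if n is odd
    first, second = ordered[:half], ordered[half:]

    def mean_point(cloud):
        # Mean point (midpoint) of a subcloud
        return (sum(x for x, _ in cloud) / len(cloud),
                sum(y for _, y in cloud) / len(cloud))

    (x1, y1), (x2, y2) = mean_point(first), mean_point(second)
    a = (y2 - y1) / (x2 - x1)
    b = y1 - a * x1
    return a, b


# (fertilizer in kg/ha, harvest in q/ha) for the five plots of section 1.2
fertilizer = [(150, 46), (80, 37), (120, 46), (220, 51), (100, 43)]
a, b = mayer_line(fertilizer)
residues = [y - (a * x + b) for x, y in fertilizer]
print(f"Mayer's line: y' = {a:.4f} x + {b:.4f}")
print(f"sum of residues = {sum(residues):.2e}")
```

The sum of residues printed at the end should be zero up to rounding errors, which is Mayer's principle.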


4 Linear fitting: least square method

4.1 Parameters of a bivariate series

4.1.1

The means of X and of Y are:

$\bar{x} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \dfrac{1}{n}\sum_{i=1}^{n} y_i$ without contingency (data series in lists – see section 1.2, examples 1 and 2);

$\bar{x} = \dfrac{1}{n}\sum_{i=1}^{r} n_i x_i$ and $\bar{y} = \dfrac{1}{n}\sum_{j=1}^{k} n_j y_j$ with contingency (frequencies gathered into a crossed table – section 1.2, example 3).

The special point $G(\bar{x}, \bar{y})$ is named mean point or midpoint of the cloud.

4.1.2

The variance of X and that of Y are easily obtained (manual calculations) by Koenig’s theorem:

$V(X) = \dfrac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$ and $V(Y) = \dfrac{1}{n}\sum_{i=1}^{n} y_i^2 - \bar{y}^2$ without contingency;

$V(X) = \dfrac{1}{n}\sum_{i=1}^{r} n_i x_i^2 - \bar{x}^2$ and $V(Y) = \dfrac{1}{n}\sum_{j=1}^{k} n_j y_j^2 - \bar{y}^2$ with contingency.

The standard deviations are still the square roots of the variances.

4.1.3

We name covariance of the pair (X,Y) the number: $\mathrm{Cov}(X,Y) = \dfrac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$.

This is a "common variance" between both variables, which is necessary to analyze their correlation.

Koenig’s theorem gives an easier way to calculate the covariance:

$\mathrm{Cov}(X,Y) = \dfrac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y}$ (without contingency) and $\mathrm{Cov}(X,Y) = \dfrac{1}{n}\sum_{i=1}^{r}\sum_{j=1}^{k} n_{ij}\, x_i y_j - \bar{x}\,\bar{y}$ (with contingency)

4.1.4

Using the calculator:

The means and standard deviations are given directly, in Stat mode.

Unfortunately, the calculator gives neither the variances nor the covariance.
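Since the calculator does not give them, the variances and the covariance can be obtained with Koenig's formulas in a few lines. This is a minimal sketch for a series given in lists (without contingency), applied to the fertilizer / harvest example of section 1.2; the function name `bivariate_parameters` is an assumption of this example.

```python
def bivariate_parameters(xs, ys):
    """Return (mean_x, mean_y, var_x, var_y, cov_xy) using Koenig's formulas."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum(x * x for x in xs) / n - mean_x ** 2      # mean of squares minus squared mean
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    return mean_x, mean_y, var_x, var_y, cov_xy


fertilizer = [150, 80, 120, 220, 100]   # X (kg/ha)
harvest = [46, 37, 46, 51, 43]          # Y (q/ha)

mean_x, mean_y, var_x, var_y, cov_xy = bivariate_parameters(fertilizer, harvest)
print(f"mean point G({mean_x}, {mean_y})")
print(f"V(X) = {var_x:.2f}, V(Y) = {var_y:.2f}, Cov(X,Y) = {cov_xy:.2f}")
```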


4.2 Least square method

The idea of this method is to square each residue, then to add these squares, and finally to say that the

"best" line is the one that minimizes this sum (obtain the smallest possible sum, considering the infinite

number of possible lines).

definition: We name least square principle the one that consists in finding a line such that $\sum_{i=1}^{n} e_i^2$ is minimum within the cloud (Gauss)

mathematical analysis: we set $P(a,b) = \sum (y_i - a x_i - b)^2$: bivariate polynomial.

There are two different ways to expand it:

$$P(a,b) = \sum \big((y_i - a x_i) - b\big)^2 = n b^2 - 2b \sum (y_i - a x_i) + \sum (y_i - a x_i)^2 \quad (1)$$

2nd degree trinomial with respect to b;

$$P(a,b) = \sum \big((y_i - b) - a x_i\big)^2 = a^2 \sum x_i^2 - 2a \Big(\sum x_i y_i - b \sum x_i\Big) + \sum (y_i - b)^2 \quad (2)$$

2nd degree trinomial with respect to a.

In this context, we can continue like this:

* consider a as a constant and b as a variable. P(a,b) (1) is minimum when its derivative with respect to b is zero (its leading coefficient, n, is positive), which leads to $b = \bar{y} - a\bar{x}$

* consider this latest value of b, and a as a variable. P(a,b) (2) is minimum when its derivative with respect to a is zero, which leads to $a = \dfrac{\frac{1}{n}\sum x_i y_i - \bar{x}\,\bar{y}}{\frac{1}{n}\sum x_i^2 - \bar{x}^2} = \dfrac{\mathrm{Cov}(X,Y)}{V(X)}$

Calculus amateurs can try to find these results!
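For readers who want to check their attempt, here is one possible route, written as a sketch with the partial derivatives of P(a,b) instead of the two trinomial expansions; it only uses the definitions of the means, of V(X) and of Cov(X,Y) given in section 4.1.

```latex
\begin{align*}
\frac{\partial P}{\partial b} &= -2\sum_{i=1}^{n} (y_i - a x_i - b) = 0
  \;\Longrightarrow\; nb = \sum y_i - a \sum x_i
  \;\Longrightarrow\; b = \bar{y} - a\bar{x} \\
\frac{\partial P}{\partial a} &= -2\sum_{i=1}^{n} x_i\,(y_i - a x_i - b) = 0
  \;\Longrightarrow\; \frac{1}{n}\sum x_i y_i - a\,\frac{1}{n}\sum x_i^2 - b\,\bar{x} = 0 \\
\text{with } b = \bar{y} - a\bar{x}: \quad
  & \frac{1}{n}\sum x_i y_i - \bar{x}\,\bar{y} = a\left(\frac{1}{n}\sum x_i^2 - \bar{x}^2\right)
  \;\Longrightarrow\; a = \frac{\mathrm{Cov}(X,Y)}{V(X)}
\end{align*}
```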

notes:

* such a value of b implies that the regression line passes through the mean point of the cloud, G; that is to say: it meets Mayer’s principle.

* This method leads to a unique line and is the most widely used.

least square method:

* Calculate the coefficients $a = \dfrac{\mathrm{Cov}(X,Y)}{V(X)}$ and $b = \bar{y} - a\bar{x}$ (you can get them on your calculator!)

* Write the expression of the Y on X regression line DY/X : y′ = ax + b
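The two steps of the method can be sketched as follows, again on the fertilizer / harvest series of section 1.2. The function name `least_squares_line` is an assumption of this example, and the covariance and variance are computed with Koenig's formulas for a series given in lists.

```python
def least_squares_line(xs, ys):
    """Return (a, b) of the Y on X regression line y' = a x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    a = cov_xy / var_x          # a = Cov(X,Y) / V(X)
    b = mean_y - a * mean_x     # the line passes through the mean point G
    return a, b


fertilizer = [150, 80, 120, 220, 100]   # X (kg/ha)
harvest = [46, 37, 46, 51, 43]          # Y (q/ha)

a, b = least_squares_line(fertilizer, harvest)
residues = [y - (a * x + b) for x, y in zip(fertilizer, harvest)]
print(f"regression line: y' = {a:.4f} x + {b:.4f}")
print(f"sum of squared residues = {sum(e * e for e in residues):.4f}")
```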


4.3 Linear correlation coefficient

A scatterplot shows a more or less strong link between two variables X and Y, sometimes displaying an elongated and almost straight cloud: in this case, a linear model is relevant. The purpose of the linear correlation coefficient is to evaluate the strength of a linear link by a number.

linear correlation coefficient between X and Y: $r = \dfrac{\mathrm{Cov}(X,Y)}{\sigma(X)\,\sigma(Y)}$

It can be proved that, whatever the data series, -1 ≤ r ≤ 1

(the capital R or the Greek letter ρ, "rho", are sometimes used for this coefficient)

On the calculator:

A calculator generally denotes it r… if it gives it at all! (it depends on the type of calculator).

Therefore, we will calculate it ourselves (which implies calculating the covariance first...).

Interpretation of its value:

The strongest the linear correlation is (cloud looking like a straight line), the closest to 1 is |r|.

CAUTION: THE RECIPROCAL IS NOT NECESSARILY TRUE!

A coefficient close to 1 can be obtained with a point cloud along a slightly curved axis, in a situation for

which the linear fit would not be relevant!

"positive correlation" : r is positive when Y overall increases with X

"negative correlation" : r is negative when Y overall decreases as X increases

0 ≤ |r| ≤ 0.5 : weak linear correlation, linear model inappropriate.

0.5 ≤ |r| ≤ 0.75 : moderate linear correlation, linear model not appropriate.

0.75 ≤ |r| ≤ 0.95 : acceptable linear correlation, but the linear model may not be the best one.

0.95 ≤ |r| ≤ 1 : strong linear correlation, the linear model may well be the best one.

Comments:

* are X and Y really linked?

If r is close to 1 (or -1), the points are nearly collinear (although they might follow a curve!). Nevertheless, that doesn't always mean that X and Y are actually related. E.g.: in France, from 1974 to 1981, the wedding rate decreased while the GDP (French: PIB) increased, so that the scatter plot of the two data sets is quasi-linear (fourth graph below). The linear correlation is mathematically very strong, but facts and studies show that there is no cause-and-effect relationship between the two variables! (After 1981, the subsequent points are no longer collinear with the previous ones at all.)

* linear correlation

r only measures a linear link. A correlation between X and Y may be very strong, but not in a linear way (curved). In that case, r may be far from 1 and -1, and the study has to be expanded (see II-4). Conversely, even when |r| is far from 1, the linear fit may still be better than any other model for the point cloud: see the first two examples below.
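Both pitfalls can be illustrated numerically. The sketch below uses made-up data (points lying exactly on a parabola) and a hypothetical helper `correlation` that applies the formula r = Cov(X, Y) / (σ(X) σ(Y)) with the 1/n convention.

```python
from math import sqrt
from statistics import fmean


def correlation(xs, ys):
    """Linear correlation coefficient r = Cov(X, Y) / (sigma(X) * sigma(Y))."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    cov_xy = fmean([x * y for x, y in zip(xs, ys)]) - x_bar * y_bar
    sigma_x = sqrt(fmean([x * x for x in xs]) - x_bar ** 2)
    sigma_y = sqrt(fmean([y * y for y in ys]) - y_bar ** 2)
    return cov_xy / (sigma_x * sigma_y)


# Pitfall 1: a clearly curved cloud (y = x^2 for x = 1..10) still gives
# an r close to 1, although a straight line is not the right model.
xs_curved = list(range(1, 11))
ys_curved = [x ** 2 for x in xs_curved]
r_curved = correlation(xs_curved, ys_curved)

# Pitfall 2: the same parabola on a symmetric range (x = -5..5) gives
# r = 0, although Y is entirely determined by X.
xs_sym = list(range(-5, 6))
ys_sym = [x ** 2 for x in xs_sym]
r_sym = correlation(xs_sym, ys_sym)

print(f"curved cloud:    r = {r_curved:.4f}")
print(f"symmetric cloud: r = {r_sym:.4f}")
```

In the first case r is about 0.97, above the 0.95 threshold; in the second it is 0. In both cases the scatter plot, not r, reveals the shape of the relationship.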


E.g.s (four scatter plots, each with its linear correlation coefficient):

* income (€) vs. duration in a company: r = 0.8449

* success rate vs. % of disadvantaged SPC: r = -0.7457

* unit margin (€/u) vs. quantity (thousands): r = 0.6438

* wedding rate through time: r = -0.9875

Once again, beware of the relevance of a linear fit: knowing r, a and b is not enough to justify representing a bivariate series by a straight line!

R. Tomassone, E. Lesquoy and C. Miller, in their remarkable book "La régression, nouveaux regards sur une ancienne méthode statistique" (Masson, 1983), present (p.21) the five series shown below.

It turns out that all five have, up to the third decimal place, the same linear correlation coefficient and the same least squares regression line coefficients (with slightly more variation for b); yet the five point clouds are very different!

(for info, for the series below: 0.785 < r < 0.786; 0.808 < a < 0.809; 0.519 < b < 0.524)


  X1      Y1     X2      Y2     X3      Y3     X4      Y4       X5      Y5
   7   5.535      7   0.113      7   7.399      7   3.864   13.715   5.654
   8   9.942      8   3.77       8   8.546      8   4.942   13.715   7.072
   9   4.249      9   7.426      9   8.468      9   7.504   13.715   8.491
  10   8.656     10   8.792     10   9.616     10   8.581   13.715   9.909
  12  10.737     12  12.688     12  10.685     12  12.221   13.715   9.909
  13  15.144     13  12.889     13  10.607     13   8.842   13.715   9.909
  14  13.939     14  14.253     14  10.529     14   9.919   13.715  11.327
  14   9.45      14  16.545     14  11.754     14  15.86    13.715  11.327
  15   7.124     15  15.62      15  11.676     15  13.967   13.715  12.746
  17  13.693     17  17.206     17  12.745     17  19.092   13.715  12.746
  18  18.1       18  16.281     18  13.893     18  17.198   13.715  12.746
  19  11.285     19  17.647     19  12.59      19  12.334   13.715  14.164
  19  21.365     19  14.21      19  15.04      19  19.761   13.715  15.582
  20  15.692     20  15.577     20  13.737     20  16.382   13.715  15.582
  21  18.977     21  14.652     21  14.884     21  18.945   13.715  17.001
  23  17.69      23  13.947     23  29.431     23  12.187   33.281  27.435

(Figures: scatter plots of series 1 to 5)
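As a quick check of the quoted ranges, the sketch below recomputes r, a and b for series 1 only (columns X1 and Y1 of the table). The helper name `summary` is hypothetical, and covariance, variance and standard deviations use the 1/n convention.

```python
from math import sqrt
from statistics import fmean

# Series 1 (columns X1 and Y1 of the table).
x1 = [7, 8, 9, 10, 12, 13, 14, 14, 15, 17, 18, 19, 19, 20, 21, 23]
y1 = [5.535, 9.942, 4.249, 8.656, 10.737, 15.144, 13.939, 9.45,
      7.124, 13.693, 18.1, 11.285, 21.365, 15.692, 18.977, 17.69]


def summary(xs, ys):
    """Return (r, a, b): correlation coefficient and Y-on-X line y' = a*x + b."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    cov_xy = fmean([x * y for x, y in zip(xs, ys)]) - x_bar * y_bar
    var_x = fmean([x * x for x in xs]) - x_bar ** 2
    var_y = fmean([y * y for y in ys]) - y_bar ** 2
    r = cov_xy / sqrt(var_x * var_y)
    a = cov_xy / var_x
    b = y_bar - a * x_bar
    return r, a, b


r, a, b = summary(x1, y1)
print(f"series 1: r = {r:.4f}, a = {a:.4f}, b = {b:.4f}")
```

The same function can be applied to the other four columns pairs to compare the coefficients, but only a plot shows how different the clouds are.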


5 Non-linear fitting: the variable change

A variable change may be performed when the points seem to follow a particular curve.

The function to consider will always be specified in the instructions of an exercise. It may be:

* a logarithm or exponential function

* a polynomial function

* a trigonometric function

* One of the variables X or Y (or both!) has to be replaced by a new one, denoted T for instance, following a given formula that allows it to be calculated from the original variable.

e.g.:

X 2 3 5 8

Y 9 13 28 70

As Y seems to vary roughly as X squared plus 5, we can define the variable change T = X².

We then build the following table, in which T replaces X:

T 4 9 25 64

Y 9 13 28 70

* We perform a linear regression on the pair (T, Y), respecting their order.

e.g.:

Here, the task is to determine the expression of the fitting line, y′ = at + b. If we are told to use the least square method, the coefficients a and b are given by the calculator: y′ = 1.02526 t + 3.856

* Finally, we can deduce the expression of a curve fitting the non-linear relationship between X and Y, simply by substituting the variable change back; we may draw this curve, if we are told to.

e.g.:

Since y′ = 1.02526 t + 3.856, we get: y′ = 1.02526 x² + 3.856 (expression of a parabola)
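The whole example can be reproduced as follows. This is a minimal sketch: the function name `least_squares_line` is hypothetical, and the coefficients are computed with the formulas a = Cov(T, Y) / V(T) and b = ȳ - a t̄ (1/n convention) instead of a calculator.

```python
from statistics import fmean

# Data of the example.
xs = [2, 3, 5, 8]
ys = [9, 13, 28, 70]

# Variable change T = X^2.
ts = [x ** 2 for x in xs]


def least_squares_line(us, vs):
    """Return (a, b) for the regression line v' = a*u + b."""
    u_bar, v_bar = fmean(us), fmean(vs)
    cov = fmean([u * v for u, v in zip(us, vs)]) - u_bar * v_bar
    var_u = fmean([u * u for u in us]) - u_bar ** 2
    a = cov / var_u
    return a, v_bar - a * u_bar


# Linear regression of Y on T, then back to X by substitution.
a, b = least_squares_line(ts, ys)
print(f"y' = {a:.5f} t + {b:.3f}")
print(f"y' = {a:.5f} x^2 + {b:.3f}")


def fitted_curve(x):
    """Value of the fitted parabola y' = a*x^2 + b at x."""
    return a * x ** 2 + b
```

The regression is always run on the transformed pair (T, Y); only the final expression is rewritten in terms of X.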


6 Statistical prediction

6.1 Point estimate

The fitting straight line (obtained with or without a variable change) makes it possible, through its expression, to estimate a value of the variable Y for a value of the variable X that was not observed (generally greater than those collected in the original series). In particular, if X represents time, it is possible to make a forecast.

e.g.: suppose the fitting line has the expression y′ = 0.85x + 22.

a. Point estimate of y for x0 = 10: y′0 = 0.85×10 + 22 = 30.5.

b. Point estimate of x for y0 = 39: x′0 = (39 - 22)/0.85 = 20.
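The same two computations, written as a short sketch. The function names `estimate_y` and `estimate_x` are hypothetical; the coefficients are those of the example.

```python
# Coefficients of the fitting line of the example: y' = 0.85 x + 22.
A, B = 0.85, 22.0


def estimate_y(x0, a=A, b=B):
    """Point estimate of y for a given x0, read on the fitting line."""
    return a * x0 + b


def estimate_x(y0, a=A, b=B):
    """Point estimate of x for a given y0, obtained by solving y0 = a*x + b."""
    return (y0 - b) / a


y_hat = estimate_y(10)
x_hat = estimate_x(39)
print(f"y'0 = {y_hat:.2f}")
print(f"x'0 = {x_hat:.2f}")
```

Note that estimating x from y this way simply inverts the Y-on-X line, as in the example.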

6.2 Confidence interval

We ought to step back and put the point estimate in perspective: depending on the noise (dispersion) of the point cloud, it is more or less trustworthy, i.e. it gives a more or less precise prediction.

Here, the new idea is to give an estimate as a range (interval) around the point estimate, rather than a single value, and to associate with it a probability (confidence level) that the unknown true value lies inside this range.

Ratio method (uses a linear model, estimates y from x):

1. For each value xi of the initial data set:

* calculate the values y'i from the expression of the regression line

* calculate the ratios zi = yi / y'i

* calculate the mean and standard deviation of the variable Z

2. Z is assumed to follow a normal distribution. Consequently:

95 % of Z values lie inside the interval $[\,\bar{z} - 1.96\,\sigma_Z \;;\; \bar{z} + 1.96\,\sigma_Z\,]$

99 % of Z values lie inside the interval $[\,\bar{z} - 2.58\,\sigma_Z \;;\; \bar{z} + 2.58\,\sigma_Z\,]$

3. Calculate the point estimate y'0 associated with the new given value x0, using the fitting line.

Now, we can predict the possible values of the unobserved y0 by an interval, as follows:

There is a 95% chance that y0 lies in $[\,y'_0\,(\bar{z} - 1.96\,\sigma_Z) \;;\; y'_0\,(\bar{z} + 1.96\,\sigma_Z)\,]$

There is a 99% chance that y0 lies in $[\,y'_0\,(\bar{z} - 2.58\,\sigma_Z) \;;\; y'_0\,(\bar{z} + 2.58\,\sigma_Z)\,]$

comments:

* this method is effective only for r > 0 (positive correlation)

* the probability (95%, 99%, etc.) is called the confidence level of the prediction.

Its complement (5%, 1%, etc.) is called the significance level.

* The width of such an interval reflects the uncertainty of the answer. It increases when:

. the confidence level increases,

. |r| decreases,

. the distance between x0 and the abscissas xi of the point cloud increases.
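The three steps of the method can be sketched as below. The data and line are those of the variable-change example of section 5, treated as the linear pair (T, Y) with y′ = 1.02526 t + 3.856; the new value t0 = 100 (that is, x0 = 10) and the function name `ratio_interval` are hypothetical. The standard deviation uses the 1/n convention, which is an assumption, and four points are far too few to justify the normality hypothesis: the numbers only illustrate the mechanics.

```python
from statistics import fmean, pstdev

# Linear pair (T, Y) and fitting line taken from the section 5 example.
ts = [4, 9, 25, 64]
ys = [9, 13, 28, 70]
a, b = 1.02526, 3.856


def ratio_interval(us, vs, a, b, u0, k=1.96):
    """Interval for the unobserved value at u0 (k = 1.96 for 95 %, 2.58 for 99 %)."""
    # Step 1: fitted values, ratios z_i = y_i / y'_i, their mean and std deviation.
    fitted = [a * u + b for u in us]
    ratios = [v / f for v, f in zip(vs, fitted)]
    z_bar = fmean(ratios)
    sigma_z = pstdev(ratios)  # population standard deviation (1/n)
    # Step 3: point estimate, then scale it by the bounds found for Z (step 2).
    y0_hat = a * u0 + b
    return y0_hat * (z_bar - k * sigma_z), y0_hat * (z_bar + k * sigma_z)


t0 = 100  # hypothetical new value (x0 = 10, so t0 = x0^2)
low95, high95 = ratio_interval(ts, ys, a, b, t0, k=1.96)
low99, high99 = ratio_interval(ts, ys, a, b, t0, k=2.58)
print(f"95 %: [{low95:.1f} ; {high95:.1f}]")
print(f"99 %: [{low99:.1f} ; {high99:.1f}]")
```

As expected, the 99 % interval is wider than the 95 % one, and both contain the point estimate y′0 = 106.382.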


IUT TC MATHEMATICS FORM FOR BIVARIATE STATISTICS

χ² law table

The table gives the values χ²lim such that p(χ² > χ²lim) = α

α α α α

(Figure: χ² density curve; the area to the right of χ²lim equals α, the area to its left equals 1 − α)