Class 3: Relationship Between Variables

SKEMA Ph.D programme

2010-2011

Lionel Nesta

Observatoire Français des Conjonctures Economiques


Qualitative × Qualitative

Qualitative × Quantitative

Quantitative × Quantitative

Which variables are we looking at?

Relationship Between Variables

ANOVA

ANOVA: ANalysis Of VAriance

ANOVA is a generalization of Student's t test.

Student's t test applies to two categories only:

H0: μ1 = μ2

H1: μ1 ≠ μ2

ANOVA is a method to test whether group means are equal or not.

H0: μ1 = μ2 = μ3 = ... = μn

H1: At least one mean differs significantly
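To illustrate in what sense ANOVA generalizes Student's t test, the following minimal sketch compares the two on the same two-group data. It assumes a pooled-variance t test; the function names `pooled_t` and `anova_f` and the illustrative figures are not part of the course material. With exactly two groups, the ANOVA F statistic should equal the square of the t statistic.

```python
from statistics import mean

def pooled_t(a, b):
    """Student's t statistic for two independent samples (pooled variance)."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def anova_f(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Illustrative figures for two groups (assumed for this sketch)
group_a = [19.6, 19.4, 21.9, 21.2, 24.6]
group_b = [23.7, 28.4, 28.5, 31.7, 37.0]

t = pooled_t(group_a, group_b)
f = anova_f([group_a, group_b])
print("t squared:", round(t ** 2, 6))
print("F        :", round(f, 6))
```

With three or more groups there is no single t statistic to compare, which is why the F ratio is used instead.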

ANOVA

The method takes its name from the fact that it is based on measures of variance. The F-statistic is a ratio comparing the variance due to group differences (explained variance) with the variance due to other phenomena (unexplained variance):

$$F = \frac{\text{explained variance}}{\text{unexplained variance}}$$

A higher F means more explanatory power, and thus more significance of the groups.

Revenues (in millions of US $)

          Sector 1   Sector 2   Sector 3
Firm 1      18.0       21.5       34.8
Firm 2      18.0       21.5       34.8
Firm 3      18.0       21.5       34.8
Firm 4      18.0       21.5       34.8
Firm 5      18.0       21.5       34.8

          Sector 1   Sector 2   Sector 3
Firm 1      18.0       18.0       18.0
Firm 2      21.5       21.5       21.5
Firm 3      25.0       25.0       25.0
Firm 4      28.7       28.7       28.7
Firm 5      34.8       34.8       34.8

          Sector 1   Sector 2   Sector 3
Firm 1      19.6       23.7       30.8
Firm 2      19.4       28.4       32.9
Firm 3      21.9       28.5       35.3
Firm 4      21.2       31.7       31.8
Firm 5      24.6       37.0       35.7

Do sectors differ significantly in their revenues?

H0 : μ1 = μ2 = μ3 = ... = μn

H1: At least one mean differs significantly.

ANOVA

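Total variance (total sum of squares) = within-group variance (within sum of squares) + between-group variance (between sum of squares):

$$\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}\right)^2}_{SS_{total}} = \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}_j\right)^2}_{SS_{within}\ \text{(residual)}} + \underbrace{\sum_{j=1}^{k} n_j\left(\bar{x}_j-\bar{x}\right)^2}_{SS_{between}}$$

Degrees of freedom: df = n - 1 (total), df = n - k (within), df = k - 1 (between).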

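This decomposition produces Fisher's statistic as follows:

$$F = \frac{SS_{between}/(k-1)}{SS_{within}/(N-k)} = \frac{\text{explained variance}}{\text{unexplained variance}} \sim F_{k-1,\,N-k}$$

where k - 1 is the numerator degrees of freedom (df num), N - k is the denominator degrees of freedom (df denom), and N is the total number of observations.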

Origin of variation       SS      d.f.    MSS     F-Stat   Prob>F
SS-between               379.1      2    189.6     17.7    0.0003
SS-within (residual)     132.5     12     11.0
SS-total                 511.6     14    36.54

The result tells me that I can reject the null hypothesis H0 with a 0.03% chance of rejecting H0 while H0 holds true (that is, of being wrong).

I WILL TAKE THE CHANCE!!!

The ANOVA decomposition on Revenues
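The decomposition can be checked by hand. The sketch below recomputes the sums of squares and the F ratio from the third revenue table (the one in which firms differ both within and between sectors), using only the Python standard library; the variable names are illustrative. Recomputed this way, SS-between and SS-within agree with the table (about 379.1 and 132.5), while the F ratio comes out at roughly 17.2, so the 17.7 reported in the table may reflect a rounding or transcription difference.

```python
from statistics import mean

# Third revenue table (millions of US $), one list per sector
sectors = [
    [19.6, 19.4, 21.9, 21.2, 24.6],  # Sector 1
    [23.7, 28.4, 28.5, 31.7, 37.0],  # Sector 2
    [30.8, 32.9, 35.3, 31.8, 35.7],  # Sector 3
]

n = sum(len(s) for s in sectors)   # total number of observations
k = len(sectors)                   # number of groups
grand_mean = sum(sum(s) for s in sectors) / n

# Sums of squares: total = within + between
ss_total = sum((x - grand_mean) ** 2 for s in sectors for x in s)
ss_within = sum((x - mean(s)) ** 2 for s in sectors for x in s)
ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in sectors)

# Mean squares and F ratio
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n - k)
f_stat = ms_between / ms_within

print(f"SS-between = {ss_between:.1f}, df = {k - 1}, MSS = {ms_between:.1f}")
print(f"SS-within  = {ss_within:.1f}, df = {n - k}, MSS = {ms_within:.1f}")
print(f"SS-total   = {ss_total:.1f}, df = {n - 1}")
print(f"F = {f_stat:.2f}")
```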

Comparison of Means Using Student t with STATA

We still use the same command, ttest:

ttest var1, by(varcat)

For example:

ttest lnassets, by(type)
ttest lnrd, by(year)
ttest lnrdi, by(type)

Beware! Unlike ANOVA, Student's t test can only be performed to compare two categories.

ANOVA under STATA

We use the command anova:

anova var1 varcat

For example:

anova lnassets isic
anova lnrd isic
anova lnrdi isic

anova cours titype

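. anova labour fid

                  Number of obs =    1634     R-squared     = 0.9304
                  Root MSE      = 26725.8     Adj R-squared = 0.9231

      Source |  Partial SS     df        MS            F     Prob > F
  -----------+-------------------------------------------------------
       Model |  1.4119e+13    154   9.1684e+10     128.36     0.0000
         fid |  1.4119e+13    154   9.1684e+10     128.36     0.0000
    Residual |  1.0564e+12   1479    714266318
  -----------+-------------------------------------------------------
       Total |  1.5176e+13   1633   9.2931e+09

Slide callouts: the Stata instruction (the command line), the sums of squares (Partial SS), the F-stat (F) and the p value (Prob > F).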

STATA Application: ANOVA

Anova Example in Published Paper

Verify with an ANOVA that US companies are larger than those from the rest of the world.

Are there systematic sectoral differences in terms of labour, R&D and sales?

Write out H0 and H1 for each variable.

SPSS menu: Analyze → Compare Means → One-Way ANOVA (French interface: Analyse → Comparer les moyennes → ANOVA à un facteur)

What do you conclude at 5% level?

What do you conclude at 1% level?

SPSS Application: ANOVA

SPSS Application: t test comparing means

Descriptives

                                                            95% confidence interval for the mean
Variable Group     N       Mean   Std. deviation  Std. error   Lower bound   Upper bound    Minimum    Maximum
rd       13       35   447.4501      182.4318      30.83661     384.78256    510.117613    182.0091   817.9253
rd       20       32   462.3145      310.5638      54.90044     350.34433    574.284688     19.5265   946.5801
rd       28      281   32416.80      157435.7      9391.827     13929.247    50904.3542     16.0008    1193810
rd       29       96   409.9650      453.3413      46.26895     318.10950    501.820453     11.1539   1665.716
rd       33      100   193.4619      97.58658     9.7586578     174.09856    212.825145     49.3978   558.6539
rd       35      153   8004.322      30796.25      2489.729     3085.3790    12923.2649     14.1116   184461.8
rd       36      173   1387.709      1264.239      96.11829     1197.9855   1577.432087    141.0070   5852.729
rd       37      208   17733.77      124017.6      8599.072     780.78382    34686.7595    123.0168    1664540
rd       38       74   77161.50      222879.1      25909.17     25524.608    128798.396    281.2427   851216.2
rd       48       45   1089.904      1240.178      184.8749     717.31279   1462.494371      1.0716   3790.107
rd       99      155   251.1483      167.9513      13.49017     224.49859    277.797952     27.8838   1432.072
rd       Total  1352   14903.52      103262.3      2808.364     9394.2945    20412.7510      1.0716    1664540

labour   13       55   50230.05     26169.055      3528.635      43155.57      57304.54       13588     104000
labour   20       64  133708.02     96812.548     12101.569     109524.96     157891.07       20000     308000
labour   28      306   55764.62     43392.780      2480.600      50883.36      60645.87        3619     181176
labour   29       99   63445.73     45073.200      4530.027      54456.04      72435.42        2662     145787
labour   33      120   36001.37     36324.601      3315.967      29435.42      42567.31        2998     149644
labour   35      161  101231.85     95716.749      7543.537      86334.11     116129.59        1508     403508
labour   36      177  128311.31      102126.3      7676.286     113161.90     143460.72       18200     417800
labour   37      280  140859.11      153239.3      9157.799     122831.96     158886.27         647     876000
labour   38       76   75601.54     42905.729      4921.625      65797.16      85405.92       11305     165000
labour   48       65  185022.20     81524.803     10111.907     164821.34     205223.06       30964     317100
labour   99      231   60497.76     42138.389      2772.502      55035.01      65960.51        1153     173000
labour   Total  1634   91298.87     96400.957      2384.818      86621.25      95976.50         647     876000

sales    13       55   41423.22      35721.57      4816.696     31766.325   51080.11179    5627.646   121962.6
sales    20       65   21827.52      15167.33      1881.276     18069.238   25585.80114    2590.539   52380.74
sales    28      309   565218.4       2146365      122102.5     324957.84    805478.883    2158.768   12400000
sales    29       99   29890.76      15579.40      1565.789     26783.498   32998.01180    9015.374   69895.68
sales    33      120   12803.84      6396.795      583.9448     11647.575   13960.11274    2814.375   31224.46
sales    35      161   821966.6       3180044      250622.6     327011.59    1316921.53     467.169   16600000
sales    36      178   22379.21      18921.53      1418.229     19580.397   25178.02485    1679.668   79085.95
sales    37      288   291520.4       1310460      77219.60     139531.82    443508.950      52.365    8071404
sales    38       77    1522011       3744994      426781.6     672001.30    2372019.91    4679.127   12400000
sales    48       67   23450.50      18731.51      2288.419     18881.521   28019.47136      38.080   81152.94
sales    99      231   14908.32      11406.94      750.5212     13429.539   16387.09100     262.905   56015.21
sales    Total  1650   318383.6       1713117      42174.03     235663.33    401103.930      38.080   16600000

ANOVA

                          Sum of squares     df    Mean square        F     Sig.
rd      Between groups       5.11E+011       10     5.11E+010      4.934    .000
        Within groups        1.39E+013     1341     1.04E+010
        Total                1.44E+013     1351
labour  Between groups       2.79E+012       10     2.79E+011     36.607    .000
        Within groups        1.24E+013     1623     7.63E+009
        Total                1.52E+013     1633
sales   Between groups       2.43E+014       10     2.43E+013      8.683    .000
        Within groups        4.60E+015     1639     2.80E+012
        Total                4.84E+015     1649


Chi-Square Independence Test


Introduction to Chi-Square

This part is devoted to the study of whether two qualitative (categorical) variables are independent:

H0: Independent: the two qualitative variables do not exhibit any systematic association.

H1: Dependent: the category of one qualitative variable is associated with the category of another qualitative variable in some systematic way which departs significantly from randomness.

The Four Steps Towards The Test

1. Build the cross tabulation to compute observed joint frequencies

2. Compute expected joint frequencies under the assumption of independence

3. Compute the Chi-square (χ²) distance between observed and expected joint frequencies

4. Compute the significance of the χ² distance and conclude on H0 and H1

1. Cross Tabulation

A cross tabulation displays the joint distribution of two or more variables. Cross tabulations are usually referred to as contingency tables.

A contingency table describes the distribution of two (or more) variables simultaneously. Each cell shows the number of respondents that gave a specific combination of responses, that is, each cell contains a single cross tabulation.

1. Cross Tabulation

We have data on two qualitative (categorical) dimensions and we wish to know whether they are related:

Region (AM, ASIA, EUR)

Type of company (DBF, LDF)






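. tabulate continent

  continent |      Freq.     Percent        Cum.
------------+-----------------------------------
       AMER |        263       61.02       61.02
        EUR |         51       11.83       72.85
         JP |        117       27.15      100.00
------------+-----------------------------------
      Total |        431      100.00

. tabulate type

           type |      Freq.     Percent        Cum.
----------------+-----------------------------------
 biotechnologie |        167       38.75       38.75
 pharmaceutique |        264       61.25      100.00
----------------+-----------------------------------
          Total |        431      100.00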

1. Cross Tabulation

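Crossing Region (AM, ASIA, EUR) × Type of company (DBF, LDF):

tabulate continent type

. tab continent type

           |         type
 continent | biotechno  pharmaceu |     Total
-----------+----------------------+----------
      AMER |       156        107 |       263
       EUR |        11         40 |        51
        JP |         0        117 |       117
-----------+----------------------+----------
     Total |       167        264 |       431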

2. Expected Joint Frequencies

In order to say something about the relationship between two categorical variables, it would be nice to produce expected (also called theoretical) frequencies under the assumption of independence between the two variables.

$$E_{ij} = \frac{\text{Total line}_i \times \text{Total column}_j}{\text{Overall sample size}}$$

tabulate continent type, expected

2. Expected Joint Frequencies

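. tabulate continent type, expected

Key: frequency
     expected frequency

           |         type
 continent | biotechno  pharmaceu |     Total
-----------+----------------------+----------
      AMER |       156        107 |       263
           |     101.9      161.1 |     263.0
-----------+----------------------+----------
       EUR |        11         40 |        51
           |      19.8       31.2 |      51.0
-----------+----------------------+----------
        JP |         0        117 |       117
           |      45.3       71.7 |     117.0
-----------+----------------------+----------
     Total |       167        264 |       431
           |     167.0      264.0 |     431.0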
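The expected frequencies can be reproduced directly from the margins of the observed table. The following minimal sketch (plain Python, with illustrative variable names) applies the rule "line total times column total, divided by the overall sample size" to the observed continent × type counts.

```python
# Observed joint frequencies: rows = continent, columns = type
row_labels = ["AMER", "EUR", "JP"]
col_labels = ["biotechnologie", "pharmaceutique"]
observed = [
    [156, 107],
    [11, 40],
    [0, 117],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)  # overall sample size

# Expected frequency under independence: row total * column total / n
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

for label, exp_row in zip(row_labels, expected):
    print(label, [round(e, 1) for e in exp_row])
```

The rounded values should match the second line of each cell in the Stata output.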

3. Computing the χ² statistic

We can now compare what we observe with what we should observe were the two variables independent. The larger the difference, the less independent the two variables. This difference is termed a Chi-square distance.

$$\chi^2 = \sum_{i}\sum_{j}\frac{\left(O_{ij}-E_{ij}\right)^2}{E_{ij}}$$

With a contingency table of n lines and m columns, the statistic follows a χ² distribution with (n-1)×(m-1) degrees of freedom, provided the lowest expected frequency is at least 5.

3. Computing the χ² statistics

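tabulate continent type, expected chi2

. tabulate continent type, expected chi2

Key: frequency
     expected frequency

           |         type
 continent | biotechno  pharmaceu |     Total
-----------+----------------------+----------
      AMER |       156        107 |       263
           |     101.9      161.1 |     263.0
-----------+----------------------+----------
       EUR |        11         40 |        51
           |      19.8       31.2 |      51.0
-----------+----------------------+----------
        JP |         0        117 |       117
           |      45.3       71.7 |     117.0
-----------+----------------------+----------
     Total |       167        264 |       431
           |     167.0      264.0 |     431.0

          Pearson chi2(2) = 127.2334   Pr = 0.000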
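As a check on the Stata output, the χ² distance can be recomputed from the observed counts. This is a minimal sketch in plain Python with illustrative names. The p-value line relies on a closed form that holds only for 2 degrees of freedom, where the upper-tail probability of a χ² variable equals exp(-x/2); for other degrees of freedom a statistical library would be needed.

```python
import math

# Observed joint frequencies: rows = AMER, EUR, JP; columns = biotech, pharma
observed = [
    [156, 107],
    [11, 40],
    [0, 117],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

# Chi-square distance: sum over all cells of (O - E)^2 / E
chi2 = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

# Degrees of freedom: (lines - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed-form upper-tail probability, valid for df == 2 only (assumption of this sketch)
p_value = math.exp(-chi2 / 2) if df == 2 else None

print(f"chi2({df}) = {chi2:.4f}")
print(f"smallest expected frequency = {min(min(r) for r in expected):.2f}")
print(f"p-value = {p_value:.3g}")
```

The smallest expected frequency is well above 5, so the condition for using the χ² distribution is met in this table.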

4. Conclusion on H0 versus H1

We reject H0 with a reported probability of being wrong of 0.000 (Pr = 0.000, that is, below 0.05%). I will take the chance, and I tentatively conclude that the type of companies and the regional origins are not independent.

Using our appreciative knowledge of biotechnology, this makes sense: biotechnology was first born in the USA, with European companies following, and Asian (i.e. Japanese) companies being mainly large pharmaceutical companies.

Most DBFs are found in the US, then in Europe. This is less true now.

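SPSS menu: Analyze → Descriptive Statistics → Crosstabs → Cells → Observed & Expected (French interface: Analyse → Statistiques descriptives → Tableaux croisés → Cellule → Observé & Théorique)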

2. SPSS : Expected Joint Frequencies

Crosstab: continent * type

                                        type
continent                         DBF        LDF       Total
AMER    Count                     156        107         263
        Expected count          101.9      161.1       263.0
        % within continent      59.3%      40.7%      100.0%
        % within type           93.4%      40.5%       61.0%
        % of total              36.2%      24.8%       61.0%
EUR     Count                      11         40          51
        Expected count           19.8       31.2        51.0
        % within continent      21.6%      78.4%      100.0%
        % within type            6.6%      15.2%       11.8%
        % of total               2.6%       9.3%       11.8%
JP      Count                       0        117         117
        Expected count           45.3       71.7       117.0
        % within continent        .0%     100.0%      100.0%
        % within type             .0%      44.3%       27.1%
        % of total                .0%      27.1%       27.1%
Total   Count                     167        264         431
        Expected count          167.0      264.0       431.0
        % within continent      38.7%      61.3%      100.0%
        % within type          100.0%     100.0%      100.0%
        % of total              38.7%      61.3%      100.0%

SPSS menu: Analyze → Descriptive Statistics → Crosstabs → Statistics → Chi-square (French interface: Analyse → Statistiques descriptives → Tableaux croisés → Statistique → Chi-deux)

Chi-Square Tests

                           Value       df    Asymptotic significance (2-sided)
Pearson Chi-Square        127.233a      2    .000
Likelihood Ratio          166.879       2    .000
N of valid cases          431

a. 0 cells (.0%) have an expected count less than 5. The minimum expected count is 19.76.

3. SPSS : Computing the χ² statistics

Correlations


Introduction to Correlations

This part is devoted to the study of whether, and the extent to which, two or more quantitative variables are related:

Positively correlated: the values of one variable "varying somewhat in step" with the values of another variable

Negatively correlated: the values of one continuous variable "varying somewhat in opposite step" with the values of another variable

Not correlated: the values of one continuous variable "varying randomly" when the values of another variable vary.

[Figure: Scatter Plot of R&D and Patents (log). Vertical axis: lpat_assets; horizontal axis: lrdi.]

Pearson's Linear Correlation Coefficient r

The Pearson product-moment correlation coefficient is a measure of the co-relation between two variables x and y.

Pearson's r reflects the intensity of the linear relationship between two variables. It ranges from +1 to -1:

$$-1 \le r_{x,y} \le 1$$

r near 1: positive correlation
r near -1: negative correlation
r near 0: no or poor correlation

$$r_{x,y} = \frac{Cov(x,y)}{\sigma_x\,\sigma_y} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

Cov(x,y): covariance between x and y

σx and σy: standard deviation of x and standard deviation of y

n: number of observations
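The definition of r translates almost line by line into code. The sketch below is a minimal illustration in plain Python; the function name `pearson_r` and the small data set are hypothetical and only serve to show a strongly positive and a strongly negative correlation.

```python
def pearson_r(x, y):
    """Pearson's r: sum of cross-products of deviations over the
    square root of the product of the sums of squared deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / (s_xx * s_yy) ** 0.5

# Hypothetical data (illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.1, 3.9, 6.2, 7.8, 10.1]      # rises with x
y_down = [-v for v in y_up]            # falls with x

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(round(r_up, 4), round(r_down, 4))
```

The factor 1/n (or 1/(n-1)) in the covariance and in the standard deviations cancels out, which is why the sketch works with plain sums.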


Pearson's Linear Correlation Coefficient r

corr lpat lassets lrd lrdi lpat_assets

. pwcorr lpat lassets lrd lrdi lpat_assets

             |     lpat  lassets      lrd     lrdi lpat_a~s
-------------+---------------------------------------------
        lpat |   1.0000
     lassets |   0.2071   1.0000
         lrd |   0.3167   0.9263   1.0000
        lrdi |   0.1450  -0.5905  -0.2428   1.0000
 lpat_assets |   0.3821  -0.8249  -0.6919   0.6416   1.0000

Is r significantly different from 0?

H0: r_{x,y} = 0

H1: r_{x,y} ≠ 0

$$t^* = \frac{r_{x,y}\,\sqrt{n-2}}{\sqrt{1-r_{x,y}^2}}$$

If |t*| > t with (n - 2) degrees of freedom and critical probability α (5%), we reject H0 and conclude that r is significantly different from 0.
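The test statistic is easy to compute by hand. The sketch below applies it to the weakest correlation in the Stata matrix, r = 0.1450 between lrdi and lpat. Two assumptions are made for illustration: the sample size n = 431 is taken from the observation count reported in the Spearman output, and the two-sided 5% critical value is approximated by 1.97, which is close to the Student value for about 429 degrees of freedom. The function name `corr_t_stat` is hypothetical.

```python
import math

def corr_t_stat(r, n):
    """t* = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = 0.1450                 # correlation between lrdi and lpat (from the output)
n = 431                    # assumed sample size (see lead-in)
CRITICAL_T_5PCT = 1.97     # approximate two-sided 5% critical value (assumption)

t_star = corr_t_stat(r, n)
reject_h0 = abs(t_star) > CRITICAL_T_5PCT

print(f"t* = {t_star:.3f} with {n - 2} degrees of freedom")
print("reject H0" if reject_h0 else "do not reject H0")
```

Even a correlation as small as 0.145 is judged significantly different from 0 here, because the sample is large.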


Pearson's Linear Correlation Coefficient r

pwcorr lpat lassets lrd lrdi lpat_assets, sig

. pwcorr lpat lassets lrd lrdi lpat_assets, sig

             |     lpat  lassets      lrd     lrdi lpat_a~s
-------------+---------------------------------------------
        lpat |   1.0000
             |
     lassets |   0.2071   1.0000
             |   0.0000
             |
         lrd |   0.3167   0.9263   1.0000
             |   0.0000   0.0000
             |
        lrdi |   0.1450  -0.5905  -0.2428   1.0000
             |   0.0025   0.0000   0.0000
             |
 lpat_assets |   0.3821  -0.8249  -0.6919   0.6416   1.0000
             |   0.0000   0.0000   0.0000   0.0000

Assumptions of Pearson’s r

There is a linear relationship between x and y

Both x and y are continuous random variables

Both variables are normally distributed

Equal differences between measurements represent equivalent intervals

We may want to relax (one of) these assumptions.


Spearman's Rank Correlation Coefficient ρ

Spearman's rank correlation is a non-parametric measure of the intensity of a correlation between two variables, without making any assumptions about the distribution of the variables, i.e. about the linearity, normality or scale of the relationship.

$$-1 \le \rho_{x,y} \le 1$$

ρ near 1: positive correlation
ρ near -1: negative correlation
ρ near 0: no or poor correlation

$$\rho_{x,y} = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n\,(n^2-1)}$$

d_i: the difference between the ranks of the paired values of x and y

n: number of observations

ρ is simply a special case of the Pearson product-moment coefficient in which the data are converted to ranks before calculating the coefficient.
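The claim that ρ is Pearson's r computed on ranks can be checked numerically. The sketch below is a minimal illustration in plain Python that assumes there are no tied values (with ties, average ranks are needed and the shortcut formula no longer matches exactly). The function names and the data are hypothetical.

```python
def ranks(values):
    """Ranks from 1 (smallest) to n; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(x, y):
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), d = difference in ranks."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def pearson_r(x, y):
    """Pearson's r from sums of deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    s_xx = sum((a - mx) ** 2 for a in x)
    s_yy = sum((b - my) ** 2 for b in y)
    return s_xy / (s_xx * s_yy) ** 0.5

# Hypothetical data: y rises with x in a non-linear way, with one inversion
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 8.0, 27.0, 50.0, 40.0, 216.0]

rho = spearman_rho(x, y)
r_on_ranks = pearson_r(ranks(x), ranks(y))
print(round(rho, 6), round(r_on_ranks, 6), round(pearson_r(x, y), 6))
```

On these data the rank-based coefficient stays close to 1 because the relationship is almost monotone, while Pearson's r on the raw values is pulled down by the non-linearity.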

Spearman’s Rank Correlation Coefficient ρ


spearman lpat lassets lrd lrdi lpat_assets

. spearman lpat lassets lrd lrdi lpat_assets
(obs=431)

             |     lpat  lassets      lrd     lrdi lpat_a~s
-------------+---------------------------------------------
        lpat |   1.0000
     lassets |   0.2257   1.0000
         lrd |   0.3202   0.9353   1.0000
        lrdi |   0.1172  -0.5564  -0.2919   1.0000
 lpat_assets |   0.3709  -0.8006  -0.6901   0.6093   1.0000

Spearman's Rank Correlation Coefficient ρ

spearman lpat lassets lrd lrdi lpat_assets, stats(rho p)

. spearman lpat lassets lrd lrdi lpat_assets, stats(rho p)
(obs=431)

Key: rho
     Sig. level

             |     lpat  lassets      lrd     lrdi lpat_a~s
-------------+---------------------------------------------
        lpat |   1.0000
             |
     lassets |   0.2257   1.0000
             |   0.0000
             |
         lrd |   0.3202   0.9353   1.0000
             |   0.0000   0.0000
             |
        lrdi |   0.1172  -0.5564  -0.2919   1.0000
             |   0.0150   0.0000   0.0000
             |
 lpat_assets |   0.3709  -0.8006  -0.6901   0.6093   1.0000
             |   0.0000   0.0000   0.0000   0.0000

Pearson’s r or Spearman’s ρ?

Relationship between tastes and levels of consumption on a large sample? (ρ)

Relationship between income and consumption on a large sample? (r)

Relationship between income and consumption on a small sample? Both (ρ) and (r)

SPSS menu: Analyze → Correlate → Bivariate (French interface: Analyse → Corrélation → Bivariée)

Click on "Pearson"

Correlations

                                     lnpatent   lnassets   lnrd_assets   lnpat_assets   lnrd
lnpatent      Pearson correlation      1         .217**      .146**        .389**       .326**
              Sig. (2-tailed)                    .000        .002          .000         .000
              N                        457       457         457           457          457
lnassets      Pearson correlation      .217**    1           -.588**       -.815**      .929**
              Sig. (2-tailed)          .000                  .000          .000         .000
              N                        457       457         457           457          457
lnrd_assets   Pearson correlation      .146**    -.588**     1             .642**       -.248**
              Sig. (2-tailed)          .002      .000                      .000         .000
              N                        457       457         457           457          457
lnpat_assets  Pearson correlation      .389**    -.815**     .642**        1            -.684**
              Sig. (2-tailed)          .000      .000        .000                       .000
              N                        457       457         457           457          457
lnrd          Pearson correlation      .326**    .929**      -.248**       -.684**      1
              Sig. (2-tailed)          .000      .000        .000          .000
              N                        457       457         457           457          457

**. Correlation is significant at the 0.01 level (2-tailed).


SPSS menu: Analyze → Correlate → Bivariate (French interface: Analyse → Corrélation → Bivariée)

Click on "Spearman"

Correlations

                                       lnpatent   lnassets   lnrd_assets   lnpat_assets   lnrd
lnpatent      Correlation coefficient    1.000     .243**      .130**        .385**       .335**
              Sig. (2-tailed)            .         .000        .005          .000         .000
              N                          457       457         457           457          457
lnassets      Correlation coefficient    .243**    1.000       -.536**       -.774**      .941**
              Sig. (2-tailed)            .000      .           .000          .000         .000
              N                          457       457         457           457          457
lnrd_assets   Correlation coefficient    .130**    -.536**     1.000         .604**       -.282**
              Sig. (2-tailed)            .005      .000        .             .000         .000
              N                          457       457         457           457          457
lnpat_assets  Correlation coefficient    .385**    -.774**     .604**        1.000        -.669**
              Sig. (2-tailed)            .000      .000        .000          .            .000
              N                          457       457         457           457          457
lnrd          Correlation coefficient    .335**    .941**      -.282**       -.669**      1.000
              Sig. (2-tailed)            .000      .000        .000          .000         .
              N                          457       457         457           457          457

**. Correlation is significant at the 0.01 level (2-tailed).
