introduction to statistical analysisbioinformatics-core-shared-training.github.io/...data types x 1...

Introduction to Statistical Analysis

Cancer Research UK – 5th of November 2018

D.-L. Couturier / M. Eldridge / M. Fernandes [Bioinformatics core]

Timeline

9:30 – Morning

I ⇠ 45mn Lecture: data type, summary statistics and graphical displaysI ⇠ 15mn Quiz

10:30 – 15mn Co↵ee & Tea break

I ⇠ 60mn Lecture: some statistical distributions + CLTI ⇠ 15mn Exercises & discussion

12:00 – Lunch break

13:00 – Afternoon

I ⇠ 45mn Lecture: One-sample location testI ⇠ 30mn Exercises with shiny app & discussion

I ⇠ 45mn Lecture: Two-sample location test

15:15 – 15mn Co↵ee & Tea break

I ⇠ 30mn Exercises with shiny app & discussion

16:15 – Group based exercisesI ⇠ 60mn

2

Grand Picture of Statistics

Population Sample

Data

(x1, x2, ..., xn)

Statistics

bµb�2

b⇡

Parameters

µ

�2

⇡

3

UK SMOKERS

Bias

MEASUREMENT ERROR

MISSING VALUES

c¢ ¢POEEFNFNNQEI.ge#sfuAt*INFERENCE

y

PROBABILITY OF@ No suitable == 1 = 0.5

ifmnotnaomsnoenhers } Test 5

Data Types

x1 x2 x3 · · · xn

Cancer status C �C �C · · · C

Nucleic acid sequence C T T · · · A

5-level pain score 3 1 5 · · · 4

# of daily admissions at A&E 16 23 12 · · · 17

Gene expression intensity 882.1 379.5 528.3 · · · 120.9

4

SUMMARY STATISTICS . MUTUAL EXCWSN

LOCATION VARIABILITY( typical value ) - RANKING

-

=FatIII⇐

.se#r.//EauioisI

- MODE- MEAN

- IQR- MEDIAN

- MAD ✓ /QUALITATIVE ( NOMINAL )

¥✓ x x

t¥n¥EI÷e if:/DISCRETE ✓ ✓ ✓

f continuous✓ ✓ ✓

QUANTITATIVE ( NUMERICAL ) HIGH

GRAPHICAL DISPLAYS

÷ARPLOT

- HISTOGRAM

- BOXPLOT

Summary statistics and plots for qualitative data

5-level answers of 21 patients to the question”How much did pain due to your ureteric stones interfere with yourday to day activities ?”:

3, 1, 5, 3, 1, 1, 1, 5, 1, 3, 4, 1, 1, 4, 5, 5, 5, 5, 5, 4, 4,

where

I 1 = ”Not at all”,

I 2 = ”A little bit”,

I 3 = ”Somewhat”,

I 4 = ”Quite a bit”,

I 5 = ”Very much”.

5

^

Tc = .§,I{xi=I where I(xi= c) = 1 if xi = c and

= O otherwise

7 -

7121 -

II.it:#t.HN2=1 °AT1 2 3 4 r

TWO - STAGE PROCESS :

A . .TN#fEnENcEYEFmoDE = YES

B : IF INTERFERENCE : INTERF . LEVEL MODE

= 5

Summary statistics and plots for quantative data

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5

● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●

Gene expression values of gene“CCND3 Cyclin D3” from 27 patientsdiagnosed with acute lymphoblastic leukaemia:

x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)0.46 1.11 1.28 1.33 1.37 1.52 1.78 1.81 1.82x(10) x(11) x(12) x(13) x(14) x(15) x(16) x(17) x(18)1.83 1.83 1.85 1.9 1.93 1.96 1.99 2.00 2.07x(19) x(20) x(21) x(22) x(23) x(24) x(25) x(26) x(27)2.11 2.18 2.18 2.31 2.34 2.37 2.45 2.59 2.77

6

LOCATION :

MEAN : µ^ = I = f §,

xi = 1.83

4¥ ) if " is °&= 1.93

Meenan : need = }or= { (x÷÷÷+D y u is era

g, if replaces sy

by 109,

→ impact on

mean

- > no impact on

melian.

0


0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5

● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●



6

VARIABILITYRANGE: Xlu ,

- 29 , ) =2. 77 .

0.46 =2.31

Variance :

^F={ E.

(79-

E)"

=

(0.46-1.83)-+4.11-1.835=-0.5u

st.

osev : E=

Fi = @= >-0¥MEDIAN ABS .

Dev :MID = med ( l x ; -

vied ( x ) ) )

( INTERQJARTIE RANGE : IOTR = 9^0#- 9^0.25

Robust To OUTLIERS

.

.


0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5

0.00.10.20.30.40.50.60.70.80.91.0

● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●



6

HISTOGRAMJYarea proportionate, .tg⇒"Frieden.to

"

=¥SPLITDATA RANGE#±gTEsnafter } 1 ° 4 ' 2 8 2 f. Into

^ 2 3 4 6 T 6 INTERVALS


0.0 0.5 1.0 1.5 2.0 2.5 3.0

● ● ●

● ● ● ●● ● ●●●●●● ●●●●● ●● ●● ●●● ● ● ●



6

#

* ,

Fk¥IE¥raeac.BOXPL= ' . STNMETRKAL

/€¥A 9%15.19:*uotsuitatle

Fyfe,aBjY€?BA"

ottersTotter

Two-sample case: independent versus paired samples

Permeability constants of a placental membrane at term (X) and between 12 to

26 weeks gestational age (Y).

1 2 3 4 5 6 7 8 9 10X 0.80 0.83 1.89 1.04 1.45 1.38 1.91 1.64 0.73 1.46Y 1.15 0.88 0.90 0.74 1.21

Hamilton depression scale factor measurements in 9 patients with mixed anxiety

and depression, taken at the first (X) and second (Y) visit after initiation of a

therapy (administration of a tranquilizer).

1 2 3 4 5 6 7 8 9X 1.83 0.50 1.62 2.48 1.68 1.88 1.55 3.06 1.30Y 0.88 0.65 0.60 2.05 1.06 1.29 1.06 3.14 1.29

Y�X �0.95 0.15 �1.02 �0.43 �0.62 �0.59 �0.49 0.08 �0.01

7

÷TRENT AND SAIE PARTICIPANTS

UNRELATED PARTICIPANTS MEASURED TW

:÷

TIi

- xi

mean of the { Elyseedifferences .

Quiz TimeSections 1 to 4

https://docs.google.com/forms/d/e/1FAIpQLScblQ_

-ISfSCGp_EIVPPI_mnrJHttaKxln8vVoyjJFvS8BL1w/viewform

8

Some parametric distributions: Binomial distribution

If

I n independent experiments,

I outcome of each experiment is dichotomous (success/failure),

I the probability of success ⇡ is the same for all experiments,

then,

I the number of successes out of n trials (experiments), Y =P

n

i=1 Xi,

follows a binomial distribution with parameters n and ⇡:

Y ⇠ Bin(n,⇡),

I the probability of observing exactly y successes out of n experiments, is

given by

P (Y = y|n,⇡) = n!(n� y)!y!

⇡y(1� ⇡)n�y

.

10

Some parametric distributions: Binomial distribution

If

I n independent experiments,

I outcome of each experiment is dichotomous (success/failure),

I the probability of success ⇡ is the same for all experiments,

then,

I the number of successes out of n trials (experiments), Y =P

n

i=1 Xi,

follows a binomial distribution with parameters n and ⇡:

Y ⇠ Bin(n,⇡),

I the probability of observing exactly y successes out of n experiments, is

given by

P (Y = y|n,⇡) = n!(n� y)!y!

⇡y(1� ⇡)n�y

.

Number of successes out of 50 experiments

prob

abilit

y

0.00

0.05

0.10

0.15

0.20

0.25

0 1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49

10

←,

Plt = ol YTIII ) = ¥#.0¥( e . o .os to = oas'

~ t.rs

Some parametric distributions: Poisson distribution

If, during a time interval or in a given area,

I events occur independently,

I at the same rate,

I and the probability of an event to occur in a small interval (area) is

proportional to the length of the interval (size of the area),

then,

I the number of events occurring in a fixed time interval or in a given area,

X, may be modelled by means of a Poisson distribution with parameter �:

X ⇠ Poisson(�),

I the probability of observing x during a fixed time interval or in a given area

is given by

P (X = x|�) = �xe��

x!.

11

# EVENTS PER # Hourly ADMISSION AT ANE

UNCT OF TIME ANDFOR # of DAILY ORGAN DONORS in THE VK

PER AREA

Some parametric distributions: Poisson distribution

If, during a time interval or in a given area,

I events occur independently,

I at the same rate,

I and the probability of an event to occur in a small interval (area) is

proportional to the length of the interval (size of the area),

then,

I the number of events occurring in a fixed time interval or in a given area,

X, may be modelled by means of a Poisson distribution with parameter �:

X ⇠ Poisson(�),

I the probability of observing x during a fixed time interval or in a given area

is given by

P (X = x|�) = �xe��

x!.

Number of chronic conditions per patient (US National Medical Expenditure Survey)

Probab

ility

0 1 2 3 4 5 6 7 8 9 10 11

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

11

-

mean an 5

variance

} = 1. 55

! rr

-

Some parametric distributions: Continuous distrib.D

ensi

ty

−10 −5 0 5 10 15 20

0.0

0.1

0.2

0.3

ex−Gaussianskew tGammaInverse GammaGaussian

12

Some parametric distributions: Normal distribution

X ⇠ N(µ,�2), fX(x) =1p2⇡�2

e�(x�µ)2

2�2

E[X] = µ, Var[X] = �2,

Z =X � µ

�⇠ N(0, 1), fZ(z) =

1p2⇡

e�x2

2 .

Probability density function, fZ(z), of a standard normal:

Density

−4 −3 −2 −1 0 1 2 3 4

0.0

0.1

0.2

0.3

0.4

13

÷ 6 g5%1.96


X ⇠ N(µ,�2), fX(x) =1p2⇡�2

e�(x�µ)2

2�2

E[X] = µ, Var[X] = �2,

Z =X � µ

�⇠ N(0, 1), fZ(z) =

1p2⇡

e�x2

2 .

Probability density function, fZ(z), of a standard normal:

99.73%

µ� 3� µ+ 3�

95.45%

µ� 2� µ+ 2�

68.27%µ� � µ+ �µ

0.0

0.1

0.2

0.3

0.4

13


X ⇠ N(µ,�2), fX(x) =1p2⇡�2

e�(x�µ)2

2�2

E[X] = µ, Var[X] = �2,

Z =X � µ

�⇠ N(0, 1), fZ(z) =

1p2⇡

e�x2

2 .

(i) Suitable modelling for a lot of variables

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5

0.00.10.20.30.40.50.60.70.80.91.0

● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●

13

^µ= 1.83

f2= 0.5

->

✓( 1.83,

0.5 )

[0.83

~ 95% 2.83


X ⇠ N(µ,�2), fX(x) =1p2⇡�2

e�(x�µ)2

2�2

E[X] = µ, Var[X] = �2,

Z =X � µ

�⇠ N(0, 1), fZ(z) =

1p2⇡

e�x2

2 .

(i) Suitable modelling for a lot of variables: IQ

99.73%

55 145

95.45%

70 130

68.27%85 115100

0.0000

0.0266

13


X ⇠ N(µ,�2), fX(x) =1p2⇡�2

e�(x�µ)2

2�2

E[X] = µ, Var[X] = �2,

Z =X � µ

�⇠ N(0, 1), fZ(z) =

1p2⇡

e�x2

2 .

(ii) Central limit theorem (Lindeberg-Levy CLT). Let (X1, ..., Xn) be n independent and identically distributed

(iid) random variables drawn from distributions of expectedvalues given by µ and finite variances given by �2,

. then

bµ = X =

Pni=1 Xi

nd! N

✓µ,

�2

n

◆.

If Xi ⇠ N(µ,�2), this result is true for all sample sizes.

13

Central limit theorem shiny app:Distribution of the meanhttp://bioinformatics.cruk.cam.ac.uk/apps/stats/central-limit-theorem/

14

#p⇐¥tuna%cIµegI ?

p

¥E

,

[0,;ozT

µ.gg z, [ a. ;q ]

✓

(;g3 " t.it?#_tµ¥*⇐

' s °o° Miao [9 ; az ] X

95% Confidence interval for µ, the population mean,when Xi ⇠ N(µ, �2)

I if X ⇠ N(µ,�2), then X ⇠ N

⇣µ,

�2

n

⌘,

I if X ⇠ N(µ,�2), then Z = X�µ

�⇠ N(0, 1),

P

< <

!= 0.95

15

ozu large and Xinii-4,4

- 1.96 Z 1.96

p( - a. 96 < II < i. 96 ) = o.gr

re

P( - i. 96Pa . I < I - n a 1.96ft -E) -.

o .9r

p( I -1.96 FE <

µ< E + 1.96¥ ) = o.gr



⇣µ,

�2

n

⌘,


�⇠ N(0, 1),

I if � unknown, then T = X�µ

s⇠ Stn�1.

P

< <

!= 0.95

Density

-4.303 4.30395%-2.228 2.22895%-2.042 2.04295%-1.96 1.9695%

X ⇠ N(0, 1)T ⇠ St30T ⇠ St10T ⇠ St2

fT (t|n� 1) =�(n�2

2 )�(n�1

2 )[(n�1)⇡]1/2�1+ t2

n�1

�n2

-5 -4 -3 -2 -1 0 1 2 3 4 5

0.0

0.1

0.2

0.3

0.4

15

of n large aus Xin iisln . ry

F-

ttnoatrlfnµ

It

tnioetrlsrnTito .97r ) = 2. oyz for 4=31

e>



⇣µ,

�2

n

⌘,


�⇠ N(0, 1),


s⇠ Stn�1.

P

< <

!= 0.95

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5

0.00.10.20.30.40.50.60.70.80.91.0

● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●

15

95% Confidence interval for µ, the population mean,when Xi ⇠ iid(µ, �2)

I CLT: Xd! N

⇣µ,

�2

n

⌘,


�⇠ N(0, 1),


s⇠ Stn�1.

P

< <

!= 0.95

Number of chronic conditions per patient (US National Medical Expenditure Survey: n = 4406, bµ = 1.55)

Probab

ility

0 1 2 3 4 5 6 7 8 9 10 11

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

16

Assuming Xi ~ Poisson

µ = £2 #1.55

1. rr. i. as

'f÷iµ 1.rrtr.at#s

.

95% Confidence interval for µ, the population mean,when Xi ⇠ iid(µ, �2)

I CLT: Xd! N

⇣µ,

�2

n

⌘,


�⇠ N(0, 1),


s⇠ Stn�1.

P

< <

!= 0.95

Number of chronic conditions per patient (US National Medical Expenditure Survey: n = 4406, bµ = 1.55)

Probab

ility

0 1 2 3 4 5 6 7 8 9 10 11

0.00

0.05

0.10

0.15

0.20

0.25

0.30

0.35

CINormal(µ, 0.95) = [1.5021; 1.5818]

CIStudent(µ, 0.95) = [1.5021; 1.5819]

CIBootstrap(µ, 0.95) = [1.5020; 1.5819]

CINB�GLM (µ, 0.95) = [1.5027; 1.5823]

16

95% Confidence interval for µY � µX , the di↵erencebetween population means

If we haveI Xi ⇠ iid(µX ,�2

X), i = 1, ..., nX ,I Yi ⇠ iid(µY ,�2

Y ), i = 1, ..., nY ,

thenI if �2

X = �2Y [Student’s t-test equation],

. CI (µY � µX , 0.95) = (Y �X)± t1�↵2 ,nX+nY �2sp

q1

nX+ 1

nY

where sp = (nX�1)s2X+(nY �1)s2YnX+nY �2 ,

I if �2X 6= �2

Y [Welch-Satterthwaite’s t-test equation],

. CI (µY � µX , 0.95) = (Y �X)± t1�↵2 ,df

qs2XnX

+s2YnY

, where

df =

✓s2X

nX+

s2Y

nY

◆2

✓s2X

nX

◆2

nX�1 +

✓s2Y

nY

◆2

nY �1

.

17

Central limit theorem shiny app:Coverage of Student’s asymptotic confidence intervalshttp://bioinformatics.cruk.cam.ac.uk/apps/stats/central-limit-theorem/

18

Quiz TimePractical 1

http://bioinformatics-core-shared-training.github.io/

IntroductionToStats/practical.html

19

PART II:

Parametric and non-parametric

one-sample location tests



Grand Picture of Statistics

Population Sample

Data

(x1, x2, ..., xn)

Statistics

bµb�2

b⇡

Parameters

µ

�2

⇡

20

Statistical hypothesis testing

A hypothesis test describes a phenomenon by means oftwo non-overlapping idealised models/descriptions:

I the null hypothesis H0, “generally assumed to be true until evidenceindicates otherwise”

I the alternative hypothesis H1.

The aim of the test is to reject the null hypothesis in favour of thealternative hypothesis, and conclude, with a probability ↵ of being wrong,that the idealised model/description of H1 is true.

Theory 1: Dieters lose more fat than the exercisers

Theory 2: There is no majority for Brexit now

Theory 3: Serum vitamin C is reduced in patients

22

. 5

Hi : Ty > 0.5

/* " ¥=°

Ho : µ ,= Me

Hi :

µ , > ME

(>

to : µp=µ¢Hi : Mp < up

Statistical hypothesis testing

Many options:

I One-sided versus two-sided tests,

I Exact versus asymptotic tests,

I Parametric versus non-parametric tests,

23

Parametric or non-parametric ?

T-test Outcome(s) normally distributed

Yes Mildly No

Sample size

Small

Medium

Large

Situations which may suggest the use of non-parametric statistics:I When there is a small sample size or very unequal groups,I When the data has notable outliers,I When one outcome has a distribution other than normal,I When the data are ordered with many ties or are rank ordered.

32

! %¥

.

Parametric location test

(One-sided) t-test

We test:

H0: µIQ = 100,H1: µIQ > 100.

We have Xi ⇠ N(µ,�2), i = 1, ..., n,

We know

I X ⇠ N

⇣µ,

�2

n

⌘,

I Z = X�µ�pn

⇠ N(0, 1),

Thus, if H0 is true, we have:

I Z = X�µ0�pn

⇠ N(0, 1).

Define the p-value:

I p� value = P (T > Tobs)

Density

−4 −3 −2 −1 0 1 2 3 4

0.0

0.1

0.2

0.3

0.4

99.73%

55 145

95.45%

70 130

68.27%85 115100

0.0000

0.0266

25

Statistical hypothesis testing4 possible outcomes

Conclude:I if p-value > ↵ ! do not reject H0.I if p-value < ↵ ! reject H0 in favour of H1.

Test Outcome

H0 not rejected H1 accepted

Unknown Truth H0 true 1� ↵ ↵

H1 true � 1� �

where

I ↵ is the type I error,

I � is the type II error.

25

HIV test

c- ) ( t )

f) FALSE Pos.

( + )FALSE NEG .

C)( > POWER

Function ofDESIGN a

Sample size ,

i , ,


(One-sided) binomial exact test

We test:

H0: ⇡ = 5%,

H1: ⇡ > 5%.

We have Xi ⇠ Bernoulli(⇡), i = 1, ..., n,

We know

I Y =P

n

i=1 Xi ⇠ Binomial (⇡, n) ,


I Y =P

n

i=1 Xi ⇠ Binomial (5%, n) ,

Define the p-value:

I p� value = P (Y > Yobs)

27


(One-sided) binomial exact test

We test:

H0: ⇡ = 5%,

H1: ⇡ > 5%.

We have Xi ⇠ Bernoulli(⇡), i = 1, ..., n,

We know

I Y =P

n

i=1 Xi ⇠ Binomial (⇡, n) ,


I Y =P

n

i=1 Xi ⇠ Binomial (5%, n) ,

Define the p-value:

I p� value = P (Y > Yobs)

Number of successes out of n experiments

Prob

abilit

y

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

0.0

0.1

0.2

0.3

0.4

n = 25n = 100

27

Non-parametric location test

Wilcoxon sign-rank test

A location model is assumed for Xi, i = 1, ..., n:

Xi = ✓ + ei,

where ei ⇠ iid(µe = 0,�2e).

Interest for H0: ✓ = ✓0 against H1: ✓ < ✓0 or ✓ 6= ✓0 or ✓ > ✓0.

Test statistics : W+ =

Pn

i=1 ◆(Xi � ✓0 > 0) Rank(|Xi � ✓0|).

Distribution of W under H0: W+

has no closed-form distribution.

Wilcoxon signed rank test

data: golub[1042, gol.fac == "ALL"]V = 268, p-value = 0.05847alternative hypothesis: true location is not equal to 1.7595 percent confidence interval:1.73868 2.09106sample estimates:(pseudo)median

1.926475

0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5

0.00.10.20.30.40.50.60.70.80.91.0

● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●

28

Introduction to Shiny Apps and Exercises

29

PART III:

Parametric and non-parametric

two-sample location tests



Two-sample case

Many options:

I One-sided versus two-sided tests,

I Exact versus asymptotic tests,

I Parametric versus non-parametric tests,

I Tests for paired versus independent data.

31

Parametric two-sample location test

Two-sample two-sided Student-s & Welch’s t-tests

●●●

●

Intensity expression of gene 'CCND3 Cyclin D3'

−0.5 0.0 0.5 1.0 1.5 2.0 2.5

Acute lymphoblasticleukemia (ALL)

n=27

Acute myeloidleukemia (AML)

n=11

We test H0: µY � µX = 0 against H1: µY � µX 6= 0.

We know:

I Student’s t-test [assume �2X

= �2Y]: (Y �X)�(µY �µX )

sp

q1

nX+ 1

nY

⇠ t1�↵2 ,nX+nY �2

I Welch’s t-test [assume �2X

6= �2Y]: (Y �X)�(µY �µX )r

s2X

nX+

s2Y

nY

⇠ t1�↵2 ,df

32



●●●

●


−0.5 0.0 0.5 1.0 1.5 2.0 2.5


n=27


n=11


We know:


= �2Y]: (Y �X)�(µY �µX )

sp

q1

nX+ 1

nY

⇠ t1�↵2 ,nX+nY �2


6= �2Y]: (Y �X)�(µY �µX )r

s2X

nX+

s2Y

nY

⇠ t1�↵2 ,df

Two Sample t-test

data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"]t = 6.7983, df = 36, p-value = 6.046e-08alternative hypothesis: true difference in means is not equal to 095 percent confidence interval:0.8829143 1.6336690sample estimates:mean of x mean of y1.8938826 0.6355909

32



●●●

●


−0.5 0.0 0.5 1.0 1.5 2.0 2.5


n=27


n=11


We know:


= �2Y]: (Y �X)�(µY �µX )

sp

q1

nX+ 1

nY

⇠ t1�↵2 ,nX+nY �2


6= �2Y]: (Y �X)�(µY �µX )r

s2X

nX+

s2Y

nY

⇠ t1�↵2 ,df

Welch Two Sample t-test

data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"]t = 6.3186, df = 16.118, p-value = 9.871e-06alternative hypothesis: true difference in means is not equal to 095 percent confidence interval:0.8363826 1.6802008sample estimates:mean of x mean of y1.8938826 0.6355909

32

Non-parametric two-sample location test

Mann-Whitney-Wilcoxon test

Let

I Xi ⇠ iid(µX ,�2), i = 1, ..., nX ,

I Yi ⇠ iid(µX + �,�2), i = 1, ..., nY .

Interest for H0: � = �0 against H1: � < �0 or � 6= �0 or � > �0.

Standardised test statistic: z =PnY

i=1 R(Yi)�[nY (nX+nY +1)/2]pnXnY (nX+nY +1)/12

,

where R(Yi) denotes the rank of Yi amongst the combined samples, i.e.,

amongst (X1, ..., XnX , Y1, ..., YnY ).

Distribution of Z under H0: Z ⇠ N(0, 1).

Implementation 1:statistic = -4.361334 , p-value = 1.292716e-05

Implementation 2:W = 284, p-value = 6.15e-07alternative hypothesis: true location shift is not equal to 095 percent confidence interval:0.89647 1.57023sample estimates:difference in location

1.21951

●●●

●


−0.5 0.0 0.5 1.0 1.5 2.0 2.5


n=27


n=11

33

Non-parametric two-sample location test

Mann-Whitney-Wilcoxon test

Let

I Xi ⇠ iid(µX ,�2), i = 1, ..., nX ,

I Yi ⇠ iid(µX + �,�2), i = 1, ..., nY .

Interest for H0: � = �0 against H1: � < �0 or � 6= �0 or � > �0.

Standardised test statistic: z =PnY

i=1 R(Yi)�[nY (nX+nY +1)/2]pnXnY (nX+nY +1)/12

,

where R(Yi) denotes the rank of Yi amongst the combined samples, i.e.,

amongst (X1, ..., XnX , Y1, ..., YnY ).

Distribution of Z under H0: Z ⇠ N(0, 1).

Implementation 1:statistic = -4.361334 , p-value = 1.292716e-05

Implementation 2:W = 284, p-value = 6.15e-07alternative hypothesis: true location shift is not equal to 095 percent confidence interval:0.89647 1.57023sample estimates:difference in location

1.21951

●●●

●


−0.5 0.0 0.5 1.0 1.5 2.0 2.5


n=27


n=11

Density

−4 −3 −2 −1 0 1 2 3 4

0.0

0.1

0.2

0.3

0.4

33

Two-sample proportion test

�2goodness-of-fit test

A trial to assess the e↵ectiveness of a new treatment versus a placebo in reducing

tumour size in patients with ovarian cancer:

Observed frequencies Binary outcome

Tumour did not shrink Tumour did shrink

Group Treatment 44 40 (84)

Placebo 24 16 (40)

(68) (56) (124)

I H0 : No association between treatment group and tumour shrinkage,

I H1 : Some association.

Expected frequencies under H0 Binary outcome


Group Treatment

84⇥68124 = 46.06 84⇥58

124 = 37.94

(84)

Placebo

40⇥68124 = 21.94 40⇥56

124 = 18.71

(40)

(68) (56) (124)

We have 2 categorical variables with a total of J = 4 cells (categories).

I H0 : ⇡j = ⇡j0 , j = 1, ..., J ,I H1 : ⇡j 6= ⇡j0 , j = 1, ..., J .

�2-test:

JPj=1

(Oj�Ej)2

Ej⇠ �

2(J � 1).

Pearson’s Chi-squared test with Yates’ continuity correction

data: MX-squared = 0.36474, df = 1, p-value = 0.5459

34








Placebo 24 16 (40)

(68) (56) (124)


I H1 : Some association.Expected frequencies under H0 Binary outcome


Group Treatment

84⇥68124 = 46.06 84⇥58

124 = 37.94

(84)

Placebo

40⇥68124 = 21.94 40⇥56

124 = 18.71

(40)

(68) (56) (124)


I H0 : ⇡j = ⇡j0 , j = 1, ..., J ,I H1 : ⇡j 6= ⇡j0 , j = 1, ..., J .

�2-test:

JPj=1

(Oj�Ej)2

Ej⇠ �

2(J � 1).



34








Placebo 24 16 (40)

(68) (56) (124)




Group Treatment 84⇥68124 = 46.06 84⇥58

124 = 37.94 (84)

Placebo 40⇥68124 = 21.94 40⇥56

124 = 18.71 (40)

(68) (56) (124)


I H0 : ⇡j = ⇡j0 , j = 1, ..., J ,I H1 : ⇡j 6= ⇡j0 , j = 1, ..., J .

�2-test:

JPj=1

(Oj�Ej)2

Ej⇠ �

2(J � 1).



34








Placebo 24 16 (40)

(68) (56) (124)




Group Treatment

84⇥68124 = 46.06 84⇥58

124 = 37.94

(84)

Placebo

40⇥68124 = 21.94 40⇥56

124 = 18.71

(40)

(68) (56) (124)


I H0 : ⇡j = ⇡j0 , j = 1, ..., J ,I H1 : ⇡j 6= ⇡j0 , j = 1, ..., J .

�2-test:

JPj=1

(Oj�Ej)2

Ej⇠ �

2(J � 1).




Fisher’s exact test of independence

�2goodness-of-fit test not suitable when

I n is small

I Ej < 5 for at least one cell.

Observed frequencies Variable 1

Category 1 Category 2

Variable 2 Category 1 a b (a+b)

Category 2 c d (c+d)

(a+c) (b+d) (a+b+c+d=n)

Fisher showed that, under H0 (independence),

P (observed table | H0) = P (X = a) and X ⇠ Hypergeometric(n, a+ c, a+ b).To compute the Fisher’s test:

I Define P (X = a) for all possible tables having the observed marginal

counts,

I Calculate the p� value by defining the percentage of these tables that get

a probability equal to or smaller than the one observed.

Fisher’s Exact Test for Count Data

data: Mp-value = 0.4471alternative hypothesis: true odds ratio is not equal to 195 percent confidence interval:0.3160593 1.6790135sample estimates:odds ratio0.7351707

35


Fisher’s exact test of independence

�2goodness-of-fit test not suitable when

I n is small

I Ej < 5 for at least one cell.




Placebo 24 16 (40)

(68) (56) (124)

Fisher showed that, under H0 (independence),

P (observed table | H0) = P (X = a) and X ⇠ Hypergeometric(n, a+ c, a+ b).To compute the Fisher’s test:

I Define P (X = a) for all possible tables having the observed marginal

counts,

I Calculate the p� value by defining the percentage of these tables that get

a probability equal to or smaller than the one observed.

Fisher’s Exact Test for Count Data

data: Mp-value = 0.4471alternative hypothesis: true odds ratio is not equal to 195 percent confidence interval:0.3160593 1.6790135sample estimates:odds ratio0.7351707

35

F-test of equality of variances

●●●

●


−0.5 0.0 0.5 1.0 1.5 2.0 2.5


n=27


n=11

We test H0: �2Y

= �2X

against H1: �2Y

6= �2X.

We know:

I F-test [assume Xi ⇠ N(µX ,�X) and Yi ⇠ N(µY ,�Y )]:s2Y

s2X

⇠ FnY �1,nX�1

36

F-test of equality of variances

●●●

●


−0.5 0.0 0.5 1.0 1.5 2.0 2.5


n=27


n=11

We test H0: �2Y

= �2X

against H1: �2Y

6= �2X.

We know:

I F-test [assume Xi ⇠ N(µX ,�X) and Yi ⇠ N(µY ,�Y )]:s2Y

s2X

⇠ FnY �1,nX�1

F test to compare two variances

data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"]F = 0.71164, num df = 26, denom df = 10, p-value = 0.4652alternative hypothesis: true ratio of variances is not equal to 195 percent confidence interval:0.2127735 1.8428387sample estimates:ratio of variances

0.7116441

36

Warning

Multiplicity correction

For each test, the probability of rejecting H0 (and accept H1) when H0 istrue equals ↵.

For k tests, the probability of rejecting H0 (and accept H1) at least 1 timewhen H0 is true, ↵k, is given by

↵k = 1� (1� ↵)k.

Thus, for ↵ = 0.05,I if k = 1, ↵1 = 1� (1� ↵)1 = 0.05,I if k = 2, ↵2 = 1� (1� ↵)2 = 0.0975,I if k = 10, ↵10 = 1� (1� ↵)10 = 0.4013.

Idea: change the level of each test so that ↵k = 0.05:

I Bonferroni correction : ↵ = ↵k

k ,

I Dunn-Sidak correction: ↵ = 1� (1� ↵k)1/k.

37

Warning

Non-parametric is not assumption free: Type I error

Simulate 2500 samples with

I Xi ⇠ Uniform(1.5, 2.5), i = 1, ..., nX ,

I Yi ⇠ Uniform(0, 4), i = 1, ..., nY ,

so that E[Xi] = E[Yi] = 2 (i.e., same mean, same median).

Assume

I Xi ⇠ iid(µX ,�2), i = 1, ..., nX ,

I Yi ⇠ iid(µX + �,�2), i = 1, ..., nY .

Test H0: � = �0 against H1: � 6= �0, at the 5% level, by means of

I Mann-Whitney-Wilcoxon test (MWW),

I T-test,

I Welch-test.

b↵ Tests

MWW Student’s t-test Welch’s test

Sample size nX = 200, nY = 70 0.145 0.202 0.055

nX = 20, nY = 7 0.148 0.240 0.062

38

Exercises

39

introduction to statistical analysisbioinformatics-core-shared-training.github.io/...data types x 1...

Documents