introduction to statistical analysisbioinformatics-core-shared-training.github.io/...data types x 1...
TRANSCRIPT
Introduction to Statistical Analysis
Cancer Research UK – 5th of November 2018
D.-L. Couturier / M. Eldridge / M. Fernandes [Bioinformatics core]
Timeline
9:30 – Morning
I ⇠ 45mn Lecture: data type, summary statistics and graphical displaysI ⇠ 15mn Quiz
10:30 – 15mn Co↵ee & Tea break
I ⇠ 60mn Lecture: some statistical distributions + CLTI ⇠ 15mn Exercises & discussion
12:00 – Lunch break
13:00 – Afternoon
I ⇠ 45mn Lecture: One-sample location testI ⇠ 30mn Exercises with shiny app & discussion
I ⇠ 45mn Lecture: Two-sample location test
15:15 – 15mn Co↵ee & Tea break
I ⇠ 30mn Exercises with shiny app & discussion
16:15 – Group based exercisesI ⇠ 60mn
2
Grand Picture of Statistics
Population Sample
Data
(x1, x2, ..., xn)
Statistics
bµb�2
b⇡
Parameters
µ
�2
⇡
3
UK SMOKERS
Bias
MEASUREMENT ERROR
MISSING VALUES
c¢ ¢POEEFNFNNQEI.ge#sfuAt*INFERENCE
y
PROBABILITY OF@ No suitable == 1 = 0.5
ifmnotnaomsnoenhers } Test 5
Data Types
x1 x2 x3 · · · xn
Cancer status C �C �C · · · C
Nucleic acid sequence C T T · · · A
5-level pain score 3 1 5 · · · 4
# of daily admissions at A&E 16 23 12 · · · 17
Gene expression intensity 882.1 379.5 528.3 · · · 120.9
4
SUMMARY STATISTICS . MUTUAL EXCWSN
LOCATION VARIABILITY( typical value ) - RANKING
-
=FatIII⇐
.se#r.//EauioisI
- MODE- MEAN
- IQR- MEDIAN
- MAD ✓ /QUALITATIVE ( NOMINAL )
¥✓ x x
t¥n¥EI÷e if:/DISCRETE ✓ ✓ ✓
f continuous✓ ✓ ✓
QUANTITATIVE ( NUMERICAL ) HIGH
GRAPHICAL DISPLAYS
÷ARPLOT
- HISTOGRAM
- BOXPLOT
Summary statistics and plots for qualitative data
5-level answers of 21 patients to the question”How much did pain due to your ureteric stones interfere with yourday to day activities ?”:
3, 1, 5, 3, 1, 1, 1, 5, 1, 3, 4, 1, 1, 4, 5, 5, 5, 5, 5, 4, 4,
where
I 1 = ”Not at all”,
I 2 = ”A little bit”,
I 3 = ”Somewhat”,
I 4 = ”Quite a bit”,
I 5 = ”Very much”.
5
^
Tc = .§,I{xi=I where I(xi= c) = 1 if xi = c and
= O otherwise
7 -
7121 -
II.it:#t.HN2=1 °AT1 2 3 4 r
TWO - STAGE PROCESS :
A . .TN#fEnENcEYEFmoDE = YES
B : IF INTERFERENCE : INTERF . LEVEL MODE
= 5
Summary statistics and plots for quantative data
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5
● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●
Gene expression values of gene“CCND3 Cyclin D3” from 27 patientsdiagnosed with acute lymphoblastic leukaemia:
x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)0.46 1.11 1.28 1.33 1.37 1.52 1.78 1.81 1.82x(10) x(11) x(12) x(13) x(14) x(15) x(16) x(17) x(18)1.83 1.83 1.85 1.9 1.93 1.96 1.99 2.00 2.07x(19) x(20) x(21) x(22) x(23) x(24) x(25) x(26) x(27)2.11 2.18 2.18 2.31 2.34 2.37 2.45 2.59 2.77
6
LOCATION :
MEAN : µ^ = I = f §,
xi = 1.83
4¥ ) if " is °&= 1.93
Meenan : need = }or= { (x÷÷÷+D y u is era
g, if replaces sy
by 109,
→ impact on
mean
- > no impact on
melian.
0
Summary statistics and plots for quantative data
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5
● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●
Gene expression values of gene“CCND3 Cyclin D3” from 27 patientsdiagnosed with acute lymphoblastic leukaemia:
x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)0.46 1.11 1.28 1.33 1.37 1.52 1.78 1.81 1.82x(10) x(11) x(12) x(13) x(14) x(15) x(16) x(17) x(18)1.83 1.83 1.85 1.9 1.93 1.96 1.99 2.00 2.07x(19) x(20) x(21) x(22) x(23) x(24) x(25) x(26) x(27)2.11 2.18 2.18 2.31 2.34 2.37 2.45 2.59 2.77
6
VARIABILITYRANGE: Xlu ,
- 29 , ) =2. 77 .
0.46 =2.31
Variance :
^F={ E.
(79-
E)"
=
(0.46-1.83)-+4.11-1.835=-0.5u
st.
osev : E=
Fi = @= >-0¥MEDIAN ABS .
Dev :MID = med ( l x ; -
vied ( x ) ) )
( INTERQJARTIE RANGE : IOTR = 9^0#- 9^0.25
Robust To OUTLIERS
.
.
Summary statistics and plots for quantative data
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5
0.00.10.20.30.40.50.60.70.80.91.0
● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●
Gene expression values of gene“CCND3 Cyclin D3” from 27 patientsdiagnosed with acute lymphoblastic leukaemia:
x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)0.46 1.11 1.28 1.33 1.37 1.52 1.78 1.81 1.82x(10) x(11) x(12) x(13) x(14) x(15) x(16) x(17) x(18)1.83 1.83 1.85 1.9 1.93 1.96 1.99 2.00 2.07x(19) x(20) x(21) x(22) x(23) x(24) x(25) x(26) x(27)2.11 2.18 2.18 2.31 2.34 2.37 2.45 2.59 2.77
6
HISTOGRAMJYarea proportionate, .tg⇒"Frieden.to
"
=¥SPLITDATA RANGE#±gTEsnafter } 1 ° 4 ' 2 8 2 f. Into
^ 2 3 4 6 T 6 INTERVALS
Summary statistics and plots for quantative data
0.0 0.5 1.0 1.5 2.0 2.5 3.0
● ● ●
● ● ● ●● ● ●●●●●● ●●●●● ●● ●● ●●● ● ● ●
Gene expression values of gene“CCND3 Cyclin D3” from 27 patientsdiagnosed with acute lymphoblastic leukaemia:
x(1) x(2) x(3) x(4) x(5) x(6) x(7) x(8) x(9)0.46 1.11 1.28 1.33 1.37 1.52 1.78 1.81 1.82x(10) x(11) x(12) x(13) x(14) x(15) x(16) x(17) x(18)1.83 1.83 1.85 1.9 1.93 1.96 1.99 2.00 2.07x(19) x(20) x(21) x(22) x(23) x(24) x(25) x(26) x(27)2.11 2.18 2.18 2.31 2.34 2.37 2.45 2.59 2.77
6
#
* ,
Fk¥IE¥raeac.BOXPL= ' . STNMETRKAL
/€¥A 9%15.19:*uotsuitatle
Fyfe,aBjY€?BA"
ottersTotter
Two-sample case: independent versus paired samples
Permeability constants of a placental membrane at term (X) and between 12 to
26 weeks gestational age (Y).
1 2 3 4 5 6 7 8 9 10X 0.80 0.83 1.89 1.04 1.45 1.38 1.91 1.64 0.73 1.46Y 1.15 0.88 0.90 0.74 1.21
Hamilton depression scale factor measurements in 9 patients with mixed anxiety
and depression, taken at the first (X) and second (Y) visit after initiation of a
therapy (administration of a tranquilizer).
1 2 3 4 5 6 7 8 9X 1.83 0.50 1.62 2.48 1.68 1.88 1.55 3.06 1.30Y 0.88 0.65 0.60 2.05 1.06 1.29 1.06 3.14 1.29
Y�X �0.95 0.15 �1.02 �0.43 �0.62 �0.59 �0.49 0.08 �0.01
7
÷TRENT AND SAIE PARTICIPANTS
UNRELATED PARTICIPANTS MEASURED TW
:÷
TIi
- xi
mean of the { Elyseedifferences .
Quiz TimeSections 1 to 4
https://docs.google.com/forms/d/e/1FAIpQLScblQ_
-ISfSCGp_EIVPPI_mnrJHttaKxln8vVoyjJFvS8BL1w/viewform
8
Some parametric distributions: Binomial distribution
If
I n independent experiments,
I outcome of each experiment is dichotomous (success/failure),
I the probability of success ⇡ is the same for all experiments,
then,
I the number of successes out of n trials (experiments), Y =P
n
i=1 Xi,
follows a binomial distribution with parameters n and ⇡:
Y ⇠ Bin(n,⇡),
I the probability of observing exactly y successes out of n experiments, is
given by
P (Y = y|n,⇡) = n!(n� y)!y!
⇡y(1� ⇡)n�y
.
10
Some parametric distributions: Binomial distribution
If
I n independent experiments,
I outcome of each experiment is dichotomous (success/failure),
I the probability of success ⇡ is the same for all experiments,
then,
I the number of successes out of n trials (experiments), Y =P
n
i=1 Xi,
follows a binomial distribution with parameters n and ⇡:
Y ⇠ Bin(n,⇡),
I the probability of observing exactly y successes out of n experiments, is
given by
P (Y = y|n,⇡) = n!(n� y)!y!
⇡y(1� ⇡)n�y
.
Number of successes out of 50 experiments
prob
abilit
y
0.00
0.05
0.10
0.15
0.20
0.25
0 1 2 3 4 5 6 7 8 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
10
←,
Plt = ol YTIII ) = ¥#.0¥( e . o .os to = oas'
~ t.rs
Some parametric distributions: Poisson distribution
If, during a time interval or in a given area,
I events occur independently,
I at the same rate,
I and the probability of an event to occur in a small interval (area) is
proportional to the length of the interval (size of the area),
then,
I the number of events occurring in a fixed time interval or in a given area,
X, may be modelled by means of a Poisson distribution with parameter �:
X ⇠ Poisson(�),
I the probability of observing x during a fixed time interval or in a given area
is given by
P (X = x|�) = �xe��
x!.
11
# EVENTS PER # Hourly ADMISSION AT ANE
UNCT OF TIME ANDFOR # of DAILY ORGAN DONORS in THE VK
PER AREA
Some parametric distributions: Poisson distribution
If, during a time interval or in a given area,
I events occur independently,
I at the same rate,
I and the probability of an event to occur in a small interval (area) is
proportional to the length of the interval (size of the area),
then,
I the number of events occurring in a fixed time interval or in a given area,
X, may be modelled by means of a Poisson distribution with parameter �:
X ⇠ Poisson(�),
I the probability of observing x during a fixed time interval or in a given area
is given by
P (X = x|�) = �xe��
x!.
Number of chronic conditions per patient (US National Medical Expenditure Survey)
Probab
ility
0 1 2 3 4 5 6 7 8 9 10 11
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
11
-
mean an 5
variance
} = 1. 55
! rr
-
Some parametric distributions: Continuous distrib.D
ensi
ty
−10 −5 0 5 10 15 20
0.0
0.1
0.2
0.3
ex−Gaussianskew tGammaInverse GammaGaussian
12
Some parametric distributions: Normal distribution
X ⇠ N(µ,�2), fX(x) =1p2⇡�2
e�(x�µ)2
2�2
E[X] = µ, Var[X] = �2,
Z =X � µ
�⇠ N(0, 1), fZ(z) =
1p2⇡
e�x2
2 .
Probability density function, fZ(z), of a standard normal:
Density
−4 −3 −2 −1 0 1 2 3 4
0.0
0.1
0.2
0.3
0.4
13
÷ 6 g5%1.96
Some parametric distributions: Normal distribution
X ⇠ N(µ,�2), fX(x) =1p2⇡�2
e�(x�µ)2
2�2
E[X] = µ, Var[X] = �2,
Z =X � µ
�⇠ N(0, 1), fZ(z) =
1p2⇡
e�x2
2 .
Probability density function, fZ(z), of a standard normal:
99.73%
µ� 3� µ+ 3�
95.45%
µ� 2� µ+ 2�
68.27%µ� � µ+ �µ
0.0
0.1
0.2
0.3
0.4
13
Some parametric distributions: Normal distribution
X ⇠ N(µ,�2), fX(x) =1p2⇡�2
e�(x�µ)2
2�2
E[X] = µ, Var[X] = �2,
Z =X � µ
�⇠ N(0, 1), fZ(z) =
1p2⇡
e�x2
2 .
(i) Suitable modelling for a lot of variables
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5
0.00.10.20.30.40.50.60.70.80.91.0
● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●
13
^µ= 1.83
f2= 0.5
->
✓( 1.83,
0.5 )
[0.83
~ 95% 2.83
Some parametric distributions: Normal distribution
X ⇠ N(µ,�2), fX(x) =1p2⇡�2
e�(x�µ)2
2�2
E[X] = µ, Var[X] = �2,
Z =X � µ
�⇠ N(0, 1), fZ(z) =
1p2⇡
e�x2
2 .
(i) Suitable modelling for a lot of variables: IQ
99.73%
55 145
95.45%
70 130
68.27%85 115100
0.0000
0.0266
13
Some parametric distributions: Normal distribution
X ⇠ N(µ,�2), fX(x) =1p2⇡�2
e�(x�µ)2
2�2
E[X] = µ, Var[X] = �2,
Z =X � µ
�⇠ N(0, 1), fZ(z) =
1p2⇡
e�x2
2 .
(ii) Central limit theorem (Lindeberg-Levy CLT). Let (X1, ..., Xn) be n independent and identically distributed
(iid) random variables drawn from distributions of expectedvalues given by µ and finite variances given by �2,
. then
bµ = X =
Pni=1 Xi
nd! N
✓µ,
�2
n
◆.
If Xi ⇠ N(µ,�2), this result is true for all sample sizes.
13
Central limit theorem shiny app:Distribution of the meanhttp://bioinformatics.cruk.cam.ac.uk/apps/stats/central-limit-theorem/
14
#p⇐¥tuna%cIµegI ?
p
¥E
,
[0,;ozT
µ.gg z, [ a. ;q ]
✓
(;g3 " t.it?#_tµ¥*⇐
' s °o° Miao [9 ; az ] X
95% Confidence interval for µ, the population mean,when Xi ⇠ N(µ, �2)
I if X ⇠ N(µ,�2), then X ⇠ N
⇣µ,
�2
n
⌘,
I if X ⇠ N(µ,�2), then Z = X�µ
�⇠ N(0, 1),
P
< <
!= 0.95
15
ozu large and Xinii-4,4
- 1.96 Z 1.96
p( - a. 96 < II < i. 96 ) = o.gr
re
P( - i. 96Pa . I < I - n a 1.96ft -E) -.
o .9r
p( I -1.96 FE <
µ< E + 1.96¥ ) = o.gr
95% Confidence interval for µ, the population mean,when Xi ⇠ N(µ, �2)
I if X ⇠ N(µ,�2), then X ⇠ N
⇣µ,
�2
n
⌘,
I if X ⇠ N(µ,�2), then Z = X�µ
�⇠ N(0, 1),
I if � unknown, then T = X�µ
s⇠ Stn�1.
P
< <
!= 0.95
Density
-4.303 4.30395%-2.228 2.22895%-2.042 2.04295%-1.96 1.9695%
X ⇠ N(0, 1)T ⇠ St30T ⇠ St10T ⇠ St2
fT (t|n� 1) =�(n�2
2 )�(n�1
2 )[(n�1)⇡]1/2�1+ t2
n�1
�n2
-5 -4 -3 -2 -1 0 1 2 3 4 5
0.0
0.1
0.2
0.3
0.4
15
of n large aus Xin iisln . ry
F-
ttnoatrlfnµ
It
tnioetrlsrnTito .97r ) = 2. oyz for 4=31
e>
95% Confidence interval for µ, the population mean,when Xi ⇠ N(µ, �2)
I if X ⇠ N(µ,�2), then X ⇠ N
⇣µ,
�2
n
⌘,
I if X ⇠ N(µ,�2), then Z = X�µ
�⇠ N(0, 1),
I if � unknown, then T = X�µ
s⇠ Stn�1.
P
< <
!= 0.95
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5
0.00.10.20.30.40.50.60.70.80.91.0
● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●
15
95% Confidence interval for µ, the population mean,when Xi ⇠ iid(µ, �2)
I CLT: Xd! N
⇣µ,
�2
n
⌘,
I if X ⇠ N(µ,�2), then Z = X�µ
�⇠ N(0, 1),
I if � unknown, then T = X�µ
s⇠ Stn�1.
P
< <
!= 0.95
Number of chronic conditions per patient (US National Medical Expenditure Survey: n = 4406, bµ = 1.55)
Probab
ility
0 1 2 3 4 5 6 7 8 9 10 11
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
16
Assuming Xi ~ Poisson
µ = £2 #1.55
1. rr. i. as
'f÷iµ 1.rrtr.at#s
.
95% Confidence interval for µ, the population mean,when Xi ⇠ iid(µ, �2)
I CLT: Xd! N
⇣µ,
�2
n
⌘,
I if X ⇠ N(µ,�2), then Z = X�µ
�⇠ N(0, 1),
I if � unknown, then T = X�µ
s⇠ Stn�1.
P
< <
!= 0.95
Number of chronic conditions per patient (US National Medical Expenditure Survey: n = 4406, bµ = 1.55)
Probab
ility
0 1 2 3 4 5 6 7 8 9 10 11
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
CINormal(µ, 0.95) = [1.5021; 1.5818]
CIStudent(µ, 0.95) = [1.5021; 1.5819]
CIBootstrap(µ, 0.95) = [1.5020; 1.5819]
CINB�GLM (µ, 0.95) = [1.5027; 1.5823]
16
95% Confidence interval for µY � µX , the di↵erencebetween population means
If we haveI Xi ⇠ iid(µX ,�2
X), i = 1, ..., nX ,I Yi ⇠ iid(µY ,�2
Y ), i = 1, ..., nY ,
thenI if �2
X = �2Y [Student’s t-test equation],
. CI (µY � µX , 0.95) = (Y �X)± t1�↵2 ,nX+nY �2sp
q1
nX+ 1
nY
where sp = (nX�1)s2X+(nY �1)s2YnX+nY �2 ,
I if �2X 6= �2
Y [Welch-Satterthwaite’s t-test equation],
. CI (µY � µX , 0.95) = (Y �X)± t1�↵2 ,df
qs2XnX
+s2YnY
, where
df =
✓s2X
nX+
s2Y
nY
◆2
✓s2X
nX
◆2
nX�1 +
✓s2Y
nY
◆2
nY �1
.
17
95% Confidence interval for µY � µX , the di↵erencebetween population means
If we haveI Xi ⇠ iid(µX ,�2
X), i = 1, ..., nX ,I Yi ⇠ iid(µY ,�2
Y ), i = 1, ..., nY ,
thenI if �2
X = �2Y [Student’s t-test equation],
. CI (µY � µX , 0.95) = (Y �X)± t1�↵2 ,nX+nY �2sp
q1
nX+ 1
nY
where sp = (nX�1)s2X+(nY �1)s2YnX+nY �2 ,
I if �2X 6= �2
Y [Welch-Satterthwaite’s t-test equation],
. CI (µY � µX , 0.95) = (Y �X)± t1�↵2 ,df
qs2XnX
+s2YnY
, where
df =
✓s2X
nX+
s2Y
nY
◆2
✓s2X
nX
◆2
nX�1 +
✓s2Y
nY
◆2
nY �1
.
17
Central limit theorem shiny app:Coverage of Student’s asymptotic confidence intervalshttp://bioinformatics.cruk.cam.ac.uk/apps/stats/central-limit-theorem/
18
Quiz TimePractical 1
http://bioinformatics-core-shared-training.github.io/
IntroductionToStats/practical.html
19
PART II:
Parametric and non-parametric
one-sample location tests
Cancer Research UK – 5th of November 2018
D.-L. Couturier / M. Eldridge / M. Fernandes [Bioinformatics core]
Grand Picture of Statistics
Population Sample
Data
(x1, x2, ..., xn)
Statistics
bµb�2
b⇡
Parameters
µ
�2
⇡
20
Statistical hypothesis testing
A hypothesis test describes a phenomenon by means oftwo non-overlapping idealised models/descriptions:
I the null hypothesis H0, “generally assumed to be true until evidenceindicates otherwise”
I the alternative hypothesis H1.
The aim of the test is to reject the null hypothesis in favour of thealternative hypothesis, and conclude, with a probability ↵ of being wrong,that the idealised model/description of H1 is true.
Theory 1: Dieters lose more fat than the exercisers
Theory 2: There is no majority for Brexit now
Theory 3: Serum vitamin C is reduced in patients
22
. 5
Hi : Ty > 0.5
/* " ¥=°
Ho : µ ,= Me
Hi :
µ , > ME
(>
to : µp=µ¢Hi : Mp < up
Statistical hypothesis testing
Many options:
I One-sided versus two-sided tests,
I Exact versus asymptotic tests,
I Parametric versus non-parametric tests,
23
Parametric or non-parametric ?
T-test Outcome(s) normally distributed
Yes Mildly No
Sample size
Small
Medium
Large
Situations which may suggest the use of non-parametric statistics:I When there is a small sample size or very unequal groups,I When the data has notable outliers,I When one outcome has a distribution other than normal,I When the data are ordered with many ties or are rank ordered.
32
! %¥
.
Parametric location test
(One-sided) t-test
We test:
H0: µIQ = 100,H1: µIQ > 100.
We have Xi ⇠ N(µ,�2), i = 1, ..., n,
We know
I X ⇠ N
⇣µ,
�2
n
⌘,
I Z = X�µ�pn
⇠ N(0, 1),
Thus, if H0 is true, we have:
I Z = X�µ0�pn
⇠ N(0, 1).
Define the p-value:
I p� value = P (T > Tobs)
Density
−4 −3 −2 −1 0 1 2 3 4
0.0
0.1
0.2
0.3
0.4
99.73%
55 145
95.45%
70 130
68.27%85 115100
0.0000
0.0266
25
Statistical hypothesis testing4 possible outcomes
Conclude:I if p-value > ↵ ! do not reject H0.I if p-value < ↵ ! reject H0 in favour of H1.
Test Outcome
H0 not rejected H1 accepted
Unknown Truth H0 true 1� ↵ ↵
H1 true � 1� �
where
I ↵ is the type I error,
I � is the type II error.
25
HIV test
c- ) ( t )
f) FALSE Pos.
( + )FALSE NEG .
C)( > POWER
Function ofDESIGN a
Sample size ,
i , ,
Parametric location test
(One-sided) binomial exact test
We test:
H0: ⇡ = 5%,
H1: ⇡ > 5%.
We have Xi ⇠ Bernoulli(⇡), i = 1, ..., n,
We know
I Y =P
n
i=1 Xi ⇠ Binomial (⇡, n) ,
Thus, if H0 is true, we have:
I Y =P
n
i=1 Xi ⇠ Binomial (5%, n) ,
Define the p-value:
I p� value = P (Y > Yobs)
27
Parametric location test
(One-sided) binomial exact test
We test:
H0: ⇡ = 5%,
H1: ⇡ > 5%.
We have Xi ⇠ Bernoulli(⇡), i = 1, ..., n,
We know
I Y =P
n
i=1 Xi ⇠ Binomial (⇡, n) ,
Thus, if H0 is true, we have:
I Y =P
n
i=1 Xi ⇠ Binomial (5%, n) ,
Define the p-value:
I p� value = P (Y > Yobs)
Number of successes out of n experiments
Prob
abilit
y
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0.0
0.1
0.2
0.3
0.4
n = 25n = 100
27
Non-parametric location test
Wilcoxon sign-rank test
A location model is assumed for Xi, i = 1, ..., n:
Xi = ✓ + ei,
where ei ⇠ iid(µe = 0,�2e).
Interest for H0: ✓ = ✓0 against H1: ✓ < ✓0 or ✓ 6= ✓0 or ✓ > ✓0.
Test statistics : W+ =
Pn
i=1 ◆(Xi � ✓0 > 0) Rank(|Xi � ✓0|).
Distribution of W under H0: W+
has no closed-form distribution.
Wilcoxon signed rank test
data: golub[1042, gol.fac == "ALL"]V = 268, p-value = 0.05847alternative hypothesis: true location is not equal to 1.7595 percent confidence interval:1.73868 2.09106sample estimates:(pseudo)median
1.926475
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5
0.00.10.20.30.40.50.60.70.80.91.0
● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●
28
Non-parametric location test
Wilcoxon sign-rank test
A location model is assumed for Xi, i = 1, ..., n:
Xi = ✓ + ei,
where ei ⇠ iid(µe = 0,�2e).
Interest for H0: ✓ = ✓0 against H1: ✓ < ✓0 or ✓ 6= ✓0 or ✓ > ✓0.
Test statistics : W+ =
Pn
i=1 ◆(Xi � ✓0 > 0) Rank(|Xi � ✓0|).
Distribution of W under H0: W+
has no closed-form distribution.
Wilcoxon signed rank test
data: golub[1042, gol.fac == "ALL"]V = 268, p-value = 0.05847alternative hypothesis: true location is not equal to 1.7595 percent confidence interval:1.73868 2.09106sample estimates:(pseudo)median
1.926475
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5
0.00.10.20.30.40.50.60.70.80.91.0
● ● ●●● ● ●●●●●●●●●●● ●● ●● ●●● ● ● ●
28
Introduction to Shiny Apps and Exercises
29
PART III:
Parametric and non-parametric
two-sample location tests
Cancer Research UK – 5th of November 2018
D.-L. Couturier / M. Eldridge / M. Fernandes [Bioinformatics core]
Two-sample case
Many options:
I One-sided versus two-sided tests,
I Exact versus asymptotic tests,
I Parametric versus non-parametric tests,
I Tests for paired versus independent data.
31
Parametric two-sample location test
Two-sample two-sided Student-s & Welch’s t-tests
●●●
●
Intensity expression of gene 'CCND3 Cyclin D3'
−0.5 0.0 0.5 1.0 1.5 2.0 2.5
Acute lymphoblasticleukemia (ALL)
n=27
Acute myeloidleukemia (AML)
n=11
We test H0: µY � µX = 0 against H1: µY � µX 6= 0.
We know:
I Student’s t-test [assume �2X
= �2Y]: (Y �X)�(µY �µX )
sp
q1
nX+ 1
nY
⇠ t1�↵2 ,nX+nY �2
I Welch’s t-test [assume �2X
6= �2Y]: (Y �X)�(µY �µX )r
s2X
nX+
s2Y
nY
⇠ t1�↵2 ,df
32
Parametric two-sample location test
Two-sample two-sided Student-s & Welch’s t-tests
●●●
●
Intensity expression of gene 'CCND3 Cyclin D3'
−0.5 0.0 0.5 1.0 1.5 2.0 2.5
Acute lymphoblasticleukemia (ALL)
n=27
Acute myeloidleukemia (AML)
n=11
We test H0: µY � µX = 0 against H1: µY � µX 6= 0.
We know:
I Student’s t-test [assume �2X
= �2Y]: (Y �X)�(µY �µX )
sp
q1
nX+ 1
nY
⇠ t1�↵2 ,nX+nY �2
I Welch’s t-test [assume �2X
6= �2Y]: (Y �X)�(µY �µX )r
s2X
nX+
s2Y
nY
⇠ t1�↵2 ,df
Two Sample t-test
data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"]t = 6.7983, df = 36, p-value = 6.046e-08alternative hypothesis: true difference in means is not equal to 095 percent confidence interval:0.8829143 1.6336690sample estimates:mean of x mean of y1.8938826 0.6355909
32
Parametric two-sample location test
Two-sample two-sided Student-s & Welch’s t-tests
●●●
●
Intensity expression of gene 'CCND3 Cyclin D3'
−0.5 0.0 0.5 1.0 1.5 2.0 2.5
Acute lymphoblasticleukemia (ALL)
n=27
Acute myeloidleukemia (AML)
n=11
We test H0: µY � µX = 0 against H1: µY � µX 6= 0.
We know:
I Student’s t-test [assume �2X
= �2Y]: (Y �X)�(µY �µX )
sp
q1
nX+ 1
nY
⇠ t1�↵2 ,nX+nY �2
I Welch’s t-test [assume �2X
6= �2Y]: (Y �X)�(µY �µX )r
s2X
nX+
s2Y
nY
⇠ t1�↵2 ,df
Welch Two Sample t-test
data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"]t = 6.3186, df = 16.118, p-value = 9.871e-06alternative hypothesis: true difference in means is not equal to 095 percent confidence interval:0.8363826 1.6802008sample estimates:mean of x mean of y1.8938826 0.6355909
32
Non-parametric two-sample location test
Mann-Whitney-Wilcoxon test
Let
I Xi ⇠ iid(µX ,�2), i = 1, ..., nX ,
I Yi ⇠ iid(µX + �,�2), i = 1, ..., nY .
Interest for H0: � = �0 against H1: � < �0 or � 6= �0 or � > �0.
Standardised test statistic: z =PnY
i=1 R(Yi)�[nY (nX+nY +1)/2]pnXnY (nX+nY +1)/12
,
where R(Yi) denotes the rank of Yi amongst the combined samples, i.e.,
amongst (X1, ..., XnX , Y1, ..., YnY ).
Distribution of Z under H0: Z ⇠ N(0, 1).
Implementation 1:statistic = -4.361334 , p-value = 1.292716e-05
Implementation 2:W = 284, p-value = 6.15e-07alternative hypothesis: true location shift is not equal to 095 percent confidence interval:0.89647 1.57023sample estimates:difference in location
1.21951
●●●
●
Intensity expression of gene 'CCND3 Cyclin D3'
−0.5 0.0 0.5 1.0 1.5 2.0 2.5
Acute lymphoblasticleukemia (ALL)
n=27
Acute myeloidleukemia (AML)
n=11
33
Non-parametric two-sample location test
Mann-Whitney-Wilcoxon test
Let
I Xi ⇠ iid(µX ,�2), i = 1, ..., nX ,
I Yi ⇠ iid(µX + �,�2), i = 1, ..., nY .
Interest for H0: � = �0 against H1: � < �0 or � 6= �0 or � > �0.
Standardised test statistic: z =PnY
i=1 R(Yi)�[nY (nX+nY +1)/2]pnXnY (nX+nY +1)/12
,
where R(Yi) denotes the rank of Yi amongst the combined samples, i.e.,
amongst (X1, ..., XnX , Y1, ..., YnY ).
Distribution of Z under H0: Z ⇠ N(0, 1).
Implementation 1:statistic = -4.361334 , p-value = 1.292716e-05
Implementation 2:W = 284, p-value = 6.15e-07alternative hypothesis: true location shift is not equal to 095 percent confidence interval:0.89647 1.57023sample estimates:difference in location
1.21951
●●●
●
Intensity expression of gene 'CCND3 Cyclin D3'
−0.5 0.0 0.5 1.0 1.5 2.0 2.5
Acute lymphoblasticleukemia (ALL)
n=27
Acute myeloidleukemia (AML)
n=11
Density
−4 −3 −2 −1 0 1 2 3 4
0.0
0.1
0.2
0.3
0.4
33
Two-sample proportion test
�2goodness-of-fit test
A trial to assess the e↵ectiveness of a new treatment versus a placebo in reducing
tumour size in patients with ovarian cancer:
Observed frequencies Binary outcome
Tumour did not shrink Tumour did shrink
Group Treatment 44 40 (84)
Placebo 24 16 (40)
(68) (56) (124)
I H0 : No association between treatment group and tumour shrinkage,
I H1 : Some association.
Expected frequencies under H0 Binary outcome
Tumour did not shrink Tumour did shrink
Group Treatment
84⇥68124 = 46.06 84⇥58
124 = 37.94
(84)
Placebo
40⇥68124 = 21.94 40⇥56
124 = 18.71
(40)
(68) (56) (124)
We have 2 categorical variables with a total of J = 4 cells (categories).
I H0 : ⇡j = ⇡j0 , j = 1, ..., J ,I H1 : ⇡j 6= ⇡j0 , j = 1, ..., J .
�2-test:
JPj=1
(Oj�Ej)2
Ej⇠ �
2(J � 1).
Pearson’s Chi-squared test with Yates’ continuity correction
data: MX-squared = 0.36474, df = 1, p-value = 0.5459
34
Two-sample proportion test
�2goodness-of-fit test
A trial to assess the e↵ectiveness of a new treatment versus a placebo in reducing
tumour size in patients with ovarian cancer:
Observed frequencies Binary outcome
Tumour did not shrink Tumour did shrink
Group Treatment 44 40 (84)
Placebo 24 16 (40)
(68) (56) (124)
I H0 : No association between treatment group and tumour shrinkage,
I H1 : Some association.Expected frequencies under H0 Binary outcome
Tumour did not shrink Tumour did shrink
Group Treatment
84⇥68124 = 46.06 84⇥58
124 = 37.94
(84)
Placebo
40⇥68124 = 21.94 40⇥56
124 = 18.71
(40)
(68) (56) (124)
We have 2 categorical variables with a total of J = 4 cells (categories).
I H0 : ⇡j = ⇡j0 , j = 1, ..., J ,I H1 : ⇡j 6= ⇡j0 , j = 1, ..., J .
�2-test:
JPj=1
(Oj�Ej)2
Ej⇠ �
2(J � 1).
Pearson’s Chi-squared test with Yates’ continuity correction
data: MX-squared = 0.36474, df = 1, p-value = 0.5459
34
Two-sample proportion test
�2goodness-of-fit test
A trial to assess the e↵ectiveness of a new treatment versus a placebo in reducing
tumour size in patients with ovarian cancer:
Observed frequencies Binary outcome
Tumour did not shrink Tumour did shrink
Group Treatment 44 40 (84)
Placebo 24 16 (40)
(68) (56) (124)
I H0 : No association between treatment group and tumour shrinkage,
I H1 : Some association.Expected frequencies under H0 Binary outcome
Tumour did not shrink Tumour did shrink
Group Treatment 84⇥68124 = 46.06 84⇥58
124 = 37.94 (84)
Placebo 40⇥68124 = 21.94 40⇥56
124 = 18.71 (40)
(68) (56) (124)
We have 2 categorical variables with a total of J = 4 cells (categories).
I H0 : ⇡j = ⇡j0 , j = 1, ..., J ,I H1 : ⇡j 6= ⇡j0 , j = 1, ..., J .
�2-test:
JPj=1
(Oj�Ej)2
Ej⇠ �
2(J � 1).
Pearson’s Chi-squared test with Yates’ continuity correction
data: MX-squared = 0.36474, df = 1, p-value = 0.5459
34
Two-sample proportion test
�2goodness-of-fit test
A trial to assess the e↵ectiveness of a new treatment versus a placebo in reducing
tumour size in patients with ovarian cancer:
Observed frequencies Binary outcome
Tumour did not shrink Tumour did shrink
Group Treatment 44 40 (84)
Placebo 24 16 (40)
(68) (56) (124)
I H0 : No association between treatment group and tumour shrinkage,
I H1 : Some association.Expected frequencies under H0 Binary outcome
Tumour did not shrink Tumour did shrink
Group Treatment
84⇥68124 = 46.06 84⇥58
124 = 37.94
(84)
Placebo
40⇥68124 = 21.94 40⇥56
124 = 18.71
(40)
(68) (56) (124)
We have 2 categorical variables with a total of J = 4 cells (categories).
I H0 : ⇡j = ⇡j0 , j = 1, ..., J ,I H1 : ⇡j 6= ⇡j0 , j = 1, ..., J .
�2-test:
JPj=1
(Oj�Ej)2
Ej⇠ �
2(J � 1).
Pearson’s Chi-squared test with Yates’ continuity correction
data: MX-squared = 0.36474, df = 1, p-value = 0.545934
Two-sample proportion test
Fisher’s exact test of independence
�2goodness-of-fit test not suitable when
I n is small
I Ej < 5 for at least one cell.
Observed frequencies Variable 1
Category 1 Category 2
Variable 2 Category 1 a b (a+b)
Category 2 c d (c+d)
(a+c) (b+d) (a+b+c+d=n)
Fisher showed that, under H0 (independence),
P (observed table | H0) = P (X = a) and X ⇠ Hypergeometric(n, a+ c, a+ b).To compute the Fisher’s test:
I Define P (X = a) for all possible tables having the observed marginal
counts,
I Calculate the p� value by defining the percentage of these tables that get
a probability equal to or smaller than the one observed.
Fisher’s Exact Test for Count Data
data: Mp-value = 0.4471alternative hypothesis: true odds ratio is not equal to 195 percent confidence interval:0.3160593 1.6790135sample estimates:odds ratio0.7351707
35
Two-sample proportion test
Fisher’s exact test of independence
�2goodness-of-fit test not suitable when
I n is small
I Ej < 5 for at least one cell.
Observed frequencies Binary outcome
Tumour did not shrink Tumour did shrink
Group Treatment 44 40 (84)
Placebo 24 16 (40)
(68) (56) (124)
Fisher showed that, under H0 (independence),
P (observed table | H0) = P (X = a) and X ⇠ Hypergeometric(n, a+ c, a+ b).To compute the Fisher’s test:
I Define P (X = a) for all possible tables having the observed marginal
counts,
I Calculate the p� value by defining the percentage of these tables that get
a probability equal to or smaller than the one observed.
Fisher’s Exact Test for Count Data
data: Mp-value = 0.4471alternative hypothesis: true odds ratio is not equal to 195 percent confidence interval:0.3160593 1.6790135sample estimates:odds ratio0.7351707
35
F-test of equality of variances
●●●
●
Intensity expression of gene 'CCND3 Cyclin D3'
−0.5 0.0 0.5 1.0 1.5 2.0 2.5
Acute lymphoblasticleukemia (ALL)
n=27
Acute myeloidleukemia (AML)
n=11
We test H0: �2Y
= �2X
against H1: �2Y
6= �2X.
We know:
I F-test [assume Xi ⇠ N(µX ,�X) and Yi ⇠ N(µY ,�Y )]:s2Y
s2X
⇠ FnY �1,nX�1
36
F-test of equality of variances
●●●
●
Intensity expression of gene 'CCND3 Cyclin D3'
−0.5 0.0 0.5 1.0 1.5 2.0 2.5
Acute lymphoblasticleukemia (ALL)
n=27
Acute myeloidleukemia (AML)
n=11
We test H0: �2Y
= �2X
against H1: �2Y
6= �2X.
We know:
I F-test [assume Xi ⇠ N(µX ,�X) and Yi ⇠ N(µY ,�Y )]:s2Y
s2X
⇠ FnY �1,nX�1
F test to compare two variances
data: golub[1042, gol.fac == "ALL"] and golub[1042, gol.fac == "AML"]F = 0.71164, num df = 26, denom df = 10, p-value = 0.4652alternative hypothesis: true ratio of variances is not equal to 195 percent confidence interval:0.2127735 1.8428387sample estimates:ratio of variances
0.7116441
36
Warning
Multiplicity correction
For each test, the probability of rejecting H0 (and accept H1) when H0 istrue equals ↵.
For k tests, the probability of rejecting H0 (and accept H1) at least 1 timewhen H0 is true, ↵k, is given by
↵k = 1� (1� ↵)k.
Thus, for ↵ = 0.05,I if k = 1, ↵1 = 1� (1� ↵)1 = 0.05,I if k = 2, ↵2 = 1� (1� ↵)2 = 0.0975,I if k = 10, ↵10 = 1� (1� ↵)10 = 0.4013.
Idea: change the level of each test so that ↵k = 0.05:
I Bonferroni correction : ↵ = ↵k
k ,
I Dunn-Sidak correction: ↵ = 1� (1� ↵k)1/k.
37
Warning
Non-parametric is not assumption free: Type I error
Simulate 2500 samples with
I Xi ⇠ Uniform(1.5, 2.5), i = 1, ..., nX ,
I Yi ⇠ Uniform(0, 4), i = 1, ..., nY ,
so that E[Xi] = E[Yi] = 2 (i.e., same mean, same median).
Assume
I Xi ⇠ iid(µX ,�2), i = 1, ..., nX ,
I Yi ⇠ iid(µX + �,�2), i = 1, ..., nY .
Test H0: � = �0 against H1: � 6= �0, at the 5% level, by means of
I Mann-Whitney-Wilcoxon test (MWW),
I T-test,
I Welch-test.
b↵ Tests
MWW Student’s t-test Welch’s test
Sample size nX = 200, nY = 70 0.145 0.202 0.055
nX = 20, nY = 7 0.148 0.240 0.062
38
Exercises
39