

STAT1012 Statistics for Life Sciences

Quick Revision Notes

Fall, 2019

LEUNG Man Fung, Heman

(Reference: lecture and tutorial notes)


Contents

I) Descriptive Statistics
   Central tendency
   Dispersion
   Graphical methods
II) Probability
   Notations
   Probability theory
   Conditional probability
III) Discrete Probability Distributions
   Discrete random variables
   Binomial distribution
   Poisson distribution
   Hypergeometric distribution (not required)
   Geometric distribution (not required)
   Negative binomial distribution (not required)
IV) Continuous Probability Distributions
   Continuous random variables
   Uniform distribution
   Normal distribution
   Some remarks (not required)
V) Point Estimation
   Sampling
   Point estimator
   Mean
   Variance
   Binomial proportion
   Poisson rate
VI) Interval Estimation
   Confidence interval
   Mean
   Variance
   Binomial proportion
   Poisson rate
VII) Hypothesis Testing
   Terminologies
   One sample z-test
   One sample t-test
   One sample chi-squared test
   One sample binomial proportion test
   Some remarks (not required)


I) Descriptive Statistics

Data type: Qualitative (special: Categorical), Quantitative (Discrete, Continuous)

Population: the whole set of entities of interest

Sample: a subset of the population

Central tendency

Sample mean: XΜ„_n = (1/n) Ξ£_{i=1}^n X_i

Sequential update property: XΜ„_n = (1/n)[(n βˆ’ 1)XΜ„_{nβˆ’1} + X_n]

Mode: the value with the greatest number of occurrences (may not be unique)

Median: the β€œmiddle” value, or the average of the two values closest to β€œmiddle” after sorting

Percentile: the p-th percentile (V_{p/100}) is a value such that p% of the data are less than or equal to V_{p/100}. In particular, upper quartile = V_{0.75}, median = V_{0.5}, lower quartile = V_{0.25}

Denote the sorted data by 𝑋(1), 𝑋(2), … , 𝑋(𝑛) where 𝑋(1) ≀ 𝑋(2) ≀ β‹― ≀ 𝑋(𝑛). This is equivalent to

saying that 𝑋(1) is the smallest, 𝑋(2) is the second smallest etc.

Median: V_{0.5} = X_{((n+1)/2)} if n is odd, or (1/2)[X_{(n/2)} + X_{(n/2+1)}] if n is even

Percentile: V_{p/100} = X_{(k)} where k = roundUp(np/100) if np/100 is not an integer

Otherwise, V_{p/100} = (1/2)[X_{(np/100)} + X_{(np/100+1)}]
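The percentile rules above can be sketched as a small Python helper (an illustration, not part of the course materials; the data values are made up):

```python
import math

def percentile(data, p):
    """p-th percentile per the notes' convention: V_{p/100} = X_(k) with
    k = roundUp(np/100) if np/100 is not an integer; otherwise the average
    of X_(np/100) and X_(np/100 + 1)."""
    x = sorted(data)                  # X_(1) <= X_(2) <= ... <= X_(n)
    n = len(x)
    pos = n * p / 100
    if pos != int(pos):               # np/100 is not an integer: round up
        return x[math.ceil(pos) - 1]  # 1-based rank -> 0-based index
    k = int(pos)
    return (x[k - 1] + x[k]) / 2      # average the two neighbouring order statistics

data = [3, 1, 4, 1, 5, 9, 2, 6]       # n = 8
median = percentile(data, 50)         # n even: averages X_(4) and X_(5)
iqr = percentile(data, 75) - percentile(data, 25)
```

For n = 8 the median averages the 4th and 5th smallest values, matching the even-n formula above.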

Dispersion

Symmetric: the left hand side of the distribution mirrors the right hand side

Unimodal: the mode is unique

Skewness: measure of asymmetry

Left-skewed (negatively skewed): mean < median, has a few extremely small values


Right-skewed (positively skewed): mean > median, has a few extremely large values

Symmetric β†’ mean = median (converse not true)

Symmetric + unimodal β†’ mean = median = mode (converse not true)

Range: maximum – minimum (𝑋(𝑛) βˆ’ 𝑋(1))

Interquartile range: 𝑉0.75 βˆ’ 𝑉0.25

Sample variance: SΒ² = (1/(nβˆ’1)) Ξ£_{i=1}^n (X_i βˆ’ XΜ„)Β², or equivalently SΒ² = (1/(nβˆ’1))[Ξ£_{i=1}^n X_iΒ² βˆ’ nXΜ„Β²]

Sample standard deviation: SD = √(SΒ²)
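The two forms of SΒ² are algebraically identical; a quick Python check on made-up data, with the stdlib `statistics` module for comparison:

```python
import math
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # illustrative data
n = len(x)
xbar = sum(x) / n                               # sample mean

# Definitional form: S^2 = (1/(n-1)) * sum (X_i - xbar)^2
s2_def = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
# Computational form: S^2 = (1/(n-1)) * (sum X_i^2 - n * xbar^2)
s2_comp = (sum(xi ** 2 for xi in x) - n * xbar ** 2) / (n - 1)

assert abs(s2_def - statistics.variance(x)) < 1e-9   # agrees with the stdlib
sd = math.sqrt(s2_def)                               # sample standard deviation
```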

Graphical methods

Bar graph: used for categorical data, shows the number of observations in each category

Histogram: used for quantitative data, shows the number of observations in each range

Stem-and-leaf plot: orders the data into a tree-like structure

Boxplot: shows 5 numbers (min, Q1, median, Q3, max) and helps locate outliers (as a rule of thumb, some people define outliers as values > Q3 + 1.5Γ—IQR or < Q1 βˆ’ 1.5Γ—IQR)


II) Probability

Notations

Sample space: the set of all possible outcomes, often denoted as Ξ©

Outcome: a possible type of occurrence

Event: any set of outcomes of interest, can be denoted as 𝐸 βŠ‚ Ξ©

Probability (of an event): denoted by 𝑃(𝐸), always lies between 0 and 1 (both inclusive)

P(E) = (# of outcomes in E) / (# of outcomes in Ξ©) (when all outcomes are equally likely)

Union: either A or B occurs, or both occur, denoted by A βˆͺ B (logically equivalent to OR)

Intersection: both A and B occur, denoted by 𝐴 ∩ 𝐡 (logically equivalent to AND)

Complement: A does not occur, denoted by 𝐴𝐢 (logically equivalent to NOT)

Commutativity: 𝐴 βˆͺ 𝐡 = 𝐡 βˆͺ 𝐴, 𝐴 ∩ 𝐡 = 𝐡 ∩ 𝐴

Associativity: (𝐴 βˆͺ 𝐡) βˆͺ 𝐢 = 𝐴 βˆͺ (𝐡 βˆͺ 𝐢), (𝐴 ∩ 𝐡) ∩ 𝐢 = 𝐴 ∩ (𝐡 ∩ 𝐢)

Distributive laws: (𝐴 ∩ 𝐡) βˆͺ 𝐢 = (𝐴 βˆͺ 𝐢) ∩ (𝐡 βˆͺ 𝐢), (𝐴 βˆͺ 𝐡) ∩ 𝐢 = (𝐴 ∩ 𝐢) βˆͺ (𝐡 ∩ 𝐢)

DeMorgan’s laws: (𝐴 βˆͺ 𝐡)𝐢 = 𝐴𝐢 ∩ 𝐡𝐢 , (𝐴 ∩ 𝐡)𝐢 = 𝐴𝐢 βˆͺ 𝐡𝐢

Probability theory

Mutually exclusive: A and B are mutually exclusive if 𝑃(𝐴 ∩ 𝐡) = 0 (cannot co-occur)

Independence: P(A ∩ B) = P(A)P(B) iff A and B are independent. If so, the pairs (A, B^C), (A^C, B) and (A^C, B^C) are independent as well

Addition law: 𝑃(𝐴 βˆͺ 𝐡) = 𝑃(𝐴) + 𝑃(𝐡) βˆ’ 𝑃(𝐴 ∩ 𝐡)

Multiplication law: if A_1, …, A_k are mutually independent, then P(A_1 ∩ … ∩ A_k) = P(A_1) Γ— … Γ— P(A_k)

Conditional probability

Conditional probability: P(B|A) = P(A ∩ B) / P(A); if P(B|A) = P(B), then A and B are independent


Relative risk: RR(B|A) = P(B|A) / P(B|A^C)

Total probability rule: 𝑃(𝐡) = 𝑃(𝐡|𝐴)𝑃(𝐴) + 𝑃(𝐡|𝐴𝐢)𝑃(𝐴𝐢)

Exhaustive: if 𝐴1, … , π΄π‘˜ are exhaustive, then 𝐴1 βˆͺ … βˆͺ π΄π‘˜ = Ξ© (at least one of them must occur)

Generalized total probability rule: let A_1, …, A_k be mutually exclusive and exhaustive events. For any event B, we have P(B) = Ξ£_{i=1}^k P(B|A_i)P(A_i)

Bayes' theorem: conditional probability + generalized total probability rule. Let A_1, …, A_k be mutually exclusive and exhaustive events. For any event B,

P(A_i|B) = P(A_i ∩ B) / P(B) = P(A_i)P(B|A_i) / P(B) = P(A_i)P(B|A_i) / Ξ£_{j=1}^k P(B|A_j)P(A_j)

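Bayes' theorem can be sketched with a classic diagnostic-test calculation (all numbers are hypothetical, chosen only for illustration):

```python
# Hypothetical diagnostic-test numbers (illustration only, not course data)
p_disease = 0.01                    # P(A1): prevalence
p_healthy = 1 - p_disease           # P(A2); A1, A2 are exclusive and exhaustive
p_pos_given_disease = 0.95          # P(B|A1): sensitivity
p_pos_given_healthy = 0.05          # P(B|A2): false-positive rate

# Generalized total probability rule: P(B) = sum_j P(B|A_j) P(A_j)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * p_healthy

# Bayes' theorem: P(A1|B) = P(A1) P(B|A1) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
```

Even with a sensitive test, a positive result here implies only about a 16% chance of disease, because the prior P(A1) is small.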


III) Discrete Probability Distributions

Random variables: numeric quantities that take different values with specified probabilities

Discrete random variable: a R.V. that takes values from a discrete set of numbers

Continuous random variable: a R.V. that takes values over an interval of numbers

Discrete random variables

Probability mass function: a pmf assigns a probability to each possible value x of the discrete

random variable X, denoted by 𝑓(π‘₯) = 𝑃(𝑋 = π‘₯)

βˆ‘ 𝑓(π‘₯𝑖)𝑛𝑖=1 = 1 (total probability rule)

Cumulative distribution function: a cdf gives the probability that X is less than or equal to the

value x, denoted by 𝐹(π‘₯) = 𝑃(𝑋 ≀ π‘₯)

Expected value: ΞΌ = E(X) = Ξ£_{i=1}^n x_i P(X = x_i) (the idea is "probability weighted average")

Variance: σ² = Var(X) = Ξ£_{i=1}^n (x_i βˆ’ ΞΌ)Β² P(X = x_i); alternatively Var(X) = E(XΒ²) βˆ’ [E(X)]Β²

Translation/rescale: 𝐸(π‘Žπ‘‹ + 𝑏) = π‘ŽπΈ(𝑋) + 𝑏, π‘‰π‘Žπ‘Ÿ(π‘Žπ‘‹ + 𝑏) = π‘Ž2π‘‰π‘Žπ‘Ÿ(𝑋)

Linearity of expectation: E(Ξ£_{i=1}^n X_i) = Ξ£_{i=1}^n E(X_i)
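A quick numeric check of the two variance formulas, using a fair die as the discrete random variable (a standard illustration, not course data):

```python
# Fair six-sided die as a discrete random variable
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

mu = sum(x * p for x, p in zip(values, probs))              # E(X)
ex2 = sum(x ** 2 * p for x, p in zip(values, probs))        # E(X^2)
var_def = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
var_alt = ex2 - mu ** 2                                     # E(X^2) - [E(X)]^2
```

Both routes give Var(X) = 35/12 β‰ˆ 2.92, as the identity promises.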

Binomial distribution

Factorial: 𝑛! = 𝑛 Γ— (𝑛 βˆ’ 1) Γ— … Γ— 1, note that 0! = 1

Permutation (order is important): P^n_k = n! / (nβˆ’k)!

Combination (order is not important): C^n_k = n! / [k!(nβˆ’k)!], also denoted as (n choose k)

Binomial distribution: probability distribution on the number of successes 𝑋 in 𝑛 independent

experiments, each experiment has a probability of success 𝑝, then 𝑋~𝐡(𝑛, 𝑝)

Pmf: P(X = x) = (n choose x) p^x (1βˆ’p)^{nβˆ’x} for x = 0, 1, 2, …, n

Mean: 𝐸(𝑋) = 𝑛𝑝

Variance: π‘‰π‘Žπ‘Ÿ(𝑋) = 𝑛𝑝(1 βˆ’ 𝑝)

Skewness: right-skewed if p<0.5, symmetric if p=0.5, left-skewed if p>0.5
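The binomial pmf, mean and variance can be verified numerically (n and p below are arbitrary illustrative values):

```python
from math import comb

n, p = 10, 0.3
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range((n + 1))]

total = sum(pmf)                                      # probabilities sum to 1
mean = sum(x * f for x, f in zip(range(n + 1), pmf))  # should equal n*p = 3
var = sum((x - mean) ** 2 * f for x, f in zip(range(n + 1), pmf))  # n*p*(1-p) = 2.1
```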


Poisson distribution

Poisson distribution: probability distribution on the number of occurrences X (usually of a rare event) over a period of time or space with rate ΞΌ; then X ~ Po(ΞΌ)

Pmf: P(X = x) = e^{βˆ’ΞΌ} ΞΌ^x / x! for x = 0, 1, 2, …

Mean: 𝐸(𝑋) = πœ‡

Variance: π‘‰π‘Žπ‘Ÿ(𝑋) = πœ‡

Skewness: right-skewed

Poisson limit theorem (Poisson approximation to binomial): if X ~ B(n, p) where n β‰₯ 20, p < 0.1 and np < 5, then X β‰ˆ Y ~ Po(ΞΌ) where ΞΌ = np
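The Poisson limit theorem can be checked numerically; n = 100, p = 0.02 (so n β‰₯ 20, p < 0.1, np = 2 < 5) are illustrative values:

```python
from math import comb, exp, factorial

n, p = 100, 0.02                # satisfies n >= 20, p < 0.1, np = 2 < 5
mu = n * p

def binom_pmf(x):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def pois_pmf(x):
    return exp(-mu) * mu ** x / factorial(x)

# Largest pointwise gap between the two pmfs over a modest range of x
max_gap = max(abs(binom_pmf(x) - pois_pmf(x)) for x in range(11))
```

The gap stays below 0.01 at every point, so the approximation is good in this regime.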

Hypergeometric distribution (not required)

Hypergeometric distribution: probability distribution on the number of successes X in n draws without replacement from a finite population of size N_1 + N_2 = N β‰₯ n that contains N_1 elements classified as successes; then X ~ Hypergeometric(N_1, N_2, n)

Pmf: P(X = x) = (N_1 choose x)(N_2 choose nβˆ’x) / (N choose n) for x = max(0, n βˆ’ N_2), …, min(n, N_1)

Mean: E(X) = n(N_1/N)

Variance: Var(X) = n(N_1/N)(N_2/N)((Nβˆ’n)/(Nβˆ’1))

Geometric distribution (not required)

Geometric distribution: probability distribution on the number of trials 𝑋 when the first success

occurs, each trial has a probability of success 𝑝, then 𝑋~πΊπ‘’π‘œ(𝑝)

Pmf: 𝑃(𝑋 = π‘₯) = (1 βˆ’ 𝑝)π‘₯βˆ’1𝑝 for π‘₯ = 1, 2, …

Mean: E(X) = 1/p

Variance: Var(X) = (1βˆ’p)/pΒ²

Memoryless: P(X > k + j | X > k) = P(X > j). The geometric distribution is the only discrete distribution with this property


Negative binomial distribution (not required)

Negative binomial distribution: probability distribution on the number of trials X until the r-th success occurs, each trial having a probability of success p; then X ~ NB(r, p)

Pmf: P(X = x) = (xβˆ’1 choose rβˆ’1)(1βˆ’p)^{xβˆ’r} p^r for x = r, r+1, …

Mean: E(X) = r/p

Variance: Var(X) = r(1βˆ’p)/pΒ²


IV) Continuous Probability Distributions

Continuous random variables

Probability density function: a pdf, denoted by f(x), specifies the probability of the random variable falling within a particular range of values via the area under its curve

𝑃(π‘Ž ≀ 𝑋 ≀ 𝑏) = ∫ 𝑓(π‘₯)𝑑π‘₯𝑏

π‘Ž, which is the area under the curve from a to b

𝑃(𝑋 = π‘Ž) = ∫ 𝑓(π‘₯)𝑑π‘₯π‘Ž

π‘Ž= 0 for all π‘Ž

∫ 𝑓(π‘₯)𝑑π‘₯∞

βˆ’βˆž= 1 (total probability rule)

Cumulative distribution function: a cdf gives the probability that X is less than or equal to the value x, denoted by F(x) = P(X ≀ x) = ∫_{βˆ’βˆž}^{x} f(t) dt

P(a ≀ X ≀ b) = ∫_a^b f(x) dx = F(b) βˆ’ F(a) (by the fundamental theorem of calculus)

Expected value: ΞΌ = E(X) = ∫_{βˆ’βˆž}^{∞} x f(x) dx

Variance: σ² = Var(X) = ∫_{βˆ’βˆž}^{∞} (x βˆ’ ΞΌ)Β² f(x) dx = ∫_{βˆ’βˆž}^{∞} xΒ² f(x) dx βˆ’ ΞΌΒ²

(Note: Calculus is NOT required in our course)

Uniform distribution

Uniform distribution: if 𝑋 follows uniform distribution on the interval [π‘Ž, 𝑏], then it has the same

probability density at any point in the interval and we denote it by 𝑋~π‘ˆ(π‘Ž, 𝑏)

Pdf: 𝑓(π‘₯) =1

π‘βˆ’π‘Ž for π‘Ž ≀ π‘₯ ≀ 𝑏, otherwise 0

Cdf: 𝐹(π‘₯) = ∫1

π‘βˆ’π‘Žπ‘‘π‘‘

π‘₯

π‘Ž= [

𝑑

π‘βˆ’π‘Ž]

π‘Ž

π‘₯

=π‘₯βˆ’π‘Ž

π‘βˆ’π‘Ž for π‘Ž ≀ π‘₯ ≀ 𝑏

Mean: E(X) = (a+b)/2

Variance: Var(X) = (bβˆ’a)Β²/12

Normal distribution

Normal distribution: if 𝑋 follows normal distribution with mean πœ‡ and variance 𝜎2 , then

𝑋~𝑁(πœ‡, 𝜎2), often used to represent continuous random variable with unknown distributions


Pdf: 𝑓(π‘₯) =1

√2πœ‹πœŽπ‘’

βˆ’1

2𝜎2(π‘₯βˆ’πœ‡)2

for βˆ’βˆž < π‘₯ < ∞

Shape: bell-shape, symmetric about the mean, unimodal

Standard normal distribution: 𝑍~𝑁(0,1)

Cdf of standard normal: denoted as Ξ¦(𝑧) = 𝑃(𝑍 ≀ 𝑧)

𝑃(π‘Ž ≀ 𝑍 ≀ 𝑏) = 𝑃(𝑍 ≀ 𝑏) βˆ’ 𝑃(𝑍 ≀ π‘Ž) = Ξ¦(𝑏) βˆ’ Ξ¦(π‘Ž)

Ξ¦(βˆ’π‘§) = 1 βˆ’ Ξ¦(𝑧) by symmetric property

Percentile of standard normal: Ξ¦(1.645) = 0.95, Ξ¦(1.96) = 0.975

Standardization: if X ~ N(ΞΌ, σ²), then (X βˆ’ ΞΌ)/Οƒ ~ N(0, 1)

P(a < X < b) = P((aβˆ’ΞΌ)/Οƒ < Z < (bβˆ’ΞΌ)/Οƒ) = Ξ¦((bβˆ’ΞΌ)/Οƒ) βˆ’ Ξ¦((aβˆ’ΞΌ)/Οƒ)

De Moivre–Laplace theorem (normal approximation to binomial): if X ~ B(n, p), then P(a < X < b) β‰ˆ P(a + 0.5 ≀ Y ≀ b βˆ’ 0.5) where Y ~ N(np, np(1βˆ’p)). The 0.5s are the continuity correction

Normal approximation to poisson: if 𝑋~π‘ƒπ‘œ(πœ†), 𝑃(𝑋 ≀ π‘Ž) β‰ˆ 𝑃(π‘Œ ≀ π‘Ž + 0.5) where π‘Œ~𝑁(πœ†, πœ†)
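These normal calculations can be sketched using the identity Ξ¦(z) = (1 + erf(z/√2))/2 to get the standard normal cdf from the standard library (the binomial parameters are illustrative):

```python
from math import comb, erf, sqrt

def phi(z):
    """Standard normal cdf via the error function: Phi(z) = (1 + erf(z/sqrt(2)))/2."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Percentiles tabulated in the notes
assert abs(phi(1.645) - 0.95) < 1e-3
assert abs(phi(1.96) - 0.975) < 1e-3

# Normal approximation with continuity correction: P(X <= a) for X ~ B(n, p)
n, p, a = 50, 0.4, 22                   # illustrative values
mu, sigma = n * p, sqrt(n * p * (1 - p))
exact = sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(a + 1))
approx = phi((a + 0.5 - mu) / sigma)    # Y ~ N(np, np(1-p)), P(Y <= a + 0.5)
```

The continuity-corrected approximation lands within a couple of percentage points of the exact binomial sum here.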

Some remarks (not required)

Statistical parameter: a numerical characteristic of a statistical population or a statistical model. We were given these numbers (e.g. p, Ξ», ΞΌ) in previous chapters, but in reality we do not know them. This leads to the next part of our course: Statistical Inference

Why approximation: one major reason is that calculating binomial probability involves

combination and large factorials are hard/costly to compute in previous centuries

Variance of sum: π‘‰π‘Žπ‘Ÿ(𝑋 + π‘Œ) = π‘‰π‘Žπ‘Ÿ(𝑋) + π‘‰π‘Žπ‘Ÿ(π‘Œ) + 2πΆπ‘œπ‘£(𝑋, π‘Œ)

Tower rule of expectation: 𝐸(𝑋) = 𝐸[𝐸(𝑋|π‘Œ)]

Law of total variance (EVE): π‘‰π‘Žπ‘Ÿ(𝑋) = 𝐸[π‘‰π‘Žπ‘Ÿ(𝑋|π‘Œ)] + π‘‰π‘Žπ‘Ÿ[𝐸(𝑋|π‘Œ)]

Sum of poisson: if 𝑋~π‘ƒπ‘œ(πœ†1), π‘Œ~π‘ƒπ‘œ(πœ†2) independently, then 𝑋 + π‘Œ~π‘ƒπ‘œ(πœ†1 + πœ†2)

Sum of normal: if X ~ N(ΞΌ_1, Οƒ_1Β²), Y ~ N(ΞΌ_2, Οƒ_2Β²) independently, then X + Y ~ N(ΞΌ_1 + ΞΌ_2, Οƒ_1Β² + Οƒ_2Β²)

Square of standard normal: if X ~ N(ΞΌ, σ²), then ZΒ² = [(X βˆ’ ΞΌ)/Οƒ]Β² ~ Ο‡_1Β²

Sum of chi square: if X ~ Ο‡_nΒ², Y ~ Ο‡_mΒ² independently, then X + Y ~ Ο‡_{n+m}Β²


V) Point Estimation

Statistical inference: the process of drawing conclusions from data that are subject to random variation

Estimation: estimate the values of specific population parameters based on the observed data

Hypothesis testing: test on whether the value of a population parameter is equal to some specific

value based on the observed data

Sampling

Sample: the data obtained after the experiments are performed, usually denoted by π‘₯1, … , π‘₯𝑛

Random sample: the data before the experiments are performed, usually denoted by 𝑋1, … , 𝑋𝑛

Non-probability sample: some elements of the population have no chance of being selected

Probability sample: all elements in the population have a known nonzero chance of being selected

Simple random sample: all elements in the population have the same probability of being selected

Systematic sample: elements are selected at regular intervals through a certain order

Stratified sample: all elements are classified into different strata and each stratum is sampled as an independent sub-population

Cluster sample: all elements are divided into different clusters and a simple random sample of clusters is selected

Coverage error: exists if some groups are excluded from the frame and have no chance of being

selected

Non-response error: people who do not respond may be different from those who do respond

Measurement error: due to weaknesses in question design, respondent error, and interviewer’s

impact on the respondent

Sampling error: Chance (luck of the draw) variation from sample to sample

Point estimator

Point estimator: a rule for calculating a single value to β€œbest guess” an unknown population

parameter of interest based on the observed data


(Note: the estimator ΞΈΜ‚(X) is random, the estimate ΞΈΜ‚(x) is fixed, and the estimand ΞΈ is the unknown parameter)

Unbiasedness: E(ΞΈΜ‚) = ΞΈ

Minimum variance: Var(ΞΈΜ‚) ≀ Var(θ̃) for every estimator θ̃ ∈ Θ

Independent and identically distributed (i.i.d.): an assumption where the random variables X_1, …, X_n are sampled such that they are independent and follow the same distribution

Central limit theorem (CLT, Lindeberg–LΓ©vy): let X_1, …, X_n be i.i.d. random variables with mean ΞΌ and finite variance σ²; then as n tends to infinity (> 30 in practice), XΜ„ β†’_d N(ΞΌ, σ²/n)

Mean

Estimand: πœƒ = πœ‡ = 𝐸(𝑋)

Sample mean (estimator): ΞΈΜ‚ = XΜ„ = (1/n) Ξ£_{i=1}^n X_i

Expectation: E(XΜ„) = (1/n) Ξ£_{i=1}^n E(X_i) = nΞΌ/n = ΞΌ (unbiased)

Variance: Var(XΜ„) = (1/nΒ²) Ξ£_{i=1}^n Var(X_i) = nσ²/nΒ² = σ²/n (by i.i.d.)

Distribution: XΜ„ ~ N(ΞΌ, σ²/n). If X_1, …, X_n ~ N(ΞΌ, σ²), this follows from the fact that a sum of independent normals is normal (remarks in section IV).

If X_1, …, X_n follow some other distribution, it follows from the CLT when n is large (usually > 30). Otherwise (n ≀ 30, normal population, Οƒ unknown) we have √n (XΜ„ βˆ’ ΞΌ)/S ~ t_{nβˆ’1}, where t_{nβˆ’1} is a Student's t-distribution with nβˆ’1 degrees of freedom.
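The sampling distribution of XΜ„ can be seen by simulation; this sketch draws from a U(0, 1) population (so ΞΌ = 0.5 and σ² = 1/12, an illustrative choice):

```python
import random
import statistics

random.seed(1)

# Population: U(0, 1), so mu = 0.5 and sigma^2 = 1/12
n, reps = 40, 2000
means = [statistics.fmean(random.random() for _ in range(n)) for _ in range(reps)]

emp_mean = statistics.fmean(means)     # should be close to mu = 0.5
emp_var = statistics.variance(means)   # should be close to sigma^2/n = 1/480
```

The empirical variance of the simulated sample means matches σ²/n, not σ², illustrating why averaging reduces spread.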


Variance

Estimand: πœƒ = 𝜎2 = π‘‰π‘Žπ‘Ÿ(𝑋)

Sample variance (estimator): ΞΈΜ‚ = SΒ² = (1/(nβˆ’1)) Ξ£_{i=1}^n (X_i βˆ’ XΜ„)Β²; if ΞΌ is known, S′² = (1/n) Ξ£_{i=1}^n (X_i βˆ’ ΞΌ)Β²

Expectation: 𝐸(𝑆2) = 𝜎2 (unbiased)

Variance: Var(SΒ²) = 2σ⁴/(nβˆ’1) (not required)

Distribution: (nβˆ’1)SΒ²/σ² ~ Ο‡_{nβˆ’1}Β² β‡’ SΒ² ~ [σ²/(nβˆ’1)] Ο‡_{nβˆ’1}Β² (right-skewed)

Binomial proportion

Estimand: πœƒ = 𝑝 = 𝐸(π‘Œ) where π‘Œ1, … , π‘Œπ‘›~𝐡(1, 𝑝) (similar to mean case)

Estimator: pΜ‚ = Θ² = (1/n) Ξ£_{i=1}^n Y_i

Expectation: E(pΜ‚) = (1/n) Ξ£_{i=1}^n E(Y_i) = np/n = p (unbiased)

Variance: Var(pΜ‚) = (1/nΒ²) Ξ£_{i=1}^n Var(Y_i) = np(1βˆ’p)/nΒ² = p(1βˆ’p)/n (by i.i.d.)

Distribution: npΜ‚ ~ B(n, p) because the number of successes is binomial. For n > 30 or npΜ‚qΜ‚ > 5, the normal approximation gives pΜ‚ ~ N(p, p(1βˆ’p)/n)

Poisson rate

Estimand: πœƒ = πœ† where 𝑋~π‘ƒπ‘œ(πœ†π‘‡) with 𝑇 as the total number of units

Estimator: Ξ»Μ‚ = X/T

Expectation: E(Ξ»Μ‚) = (1/T) E(X) = Ξ»T/T = Ξ» (unbiased)

Variance: Var(Ξ»Μ‚) = (1/TΒ²) Var(X) = Ξ»T/TΒ² = Ξ»/T (by i.i.d.)

Distribution: TΞ»Μ‚ = X ~ Po(Ξ»T) because the sampling distribution is Poisson. For n > 30 or Ξ»Μ‚T > 10, the normal approximation gives Ξ»Μ‚ ~ N(Ξ», Ξ»/T)


VI) Interval Estimation

Confidence interval

Confidence interval: an interval associated with a confidence level 1 βˆ’ 𝛼 that may contain the

true value of an unknown population parameter

Meaning of confidence level: in the long run, 100(1 βˆ’ 𝛼)% of all the confidence intervals that

can be constructed will contain the unknown true parameter (NOT the probability that an interval

will contain the parameter)

Elements of a confidence interval: {ΞΈΜ‚, c_Ξ±, se(ΞΈΜ‚)}, where ΞΈΜ‚ is the point estimate, c_Ξ± is the critical value from an asymptotic distribution under the confidence level 1 βˆ’ Ξ±, and se(ΞΈΜ‚) is the standard error of the point estimate

Mean

Confidence interval (Οƒ is known): XΜ„ Β± z_{Ξ±/2} Γ— Οƒ/√n

Confidence interval (Οƒ is unknown, n > 30): XΜ„ Β± z_{Ξ±/2} Γ— s/√n

Confidence interval (Οƒ is unknown, n ≀ 30): XΜ„ Β± t_{nβˆ’1,Ξ±/2} Γ— s/√n (differs in degrees of freedom)

Margin of error: E = c_Ξ± Γ— se(ΞΈΜ‚) (the width is 2E, which helps determine the sample size)

Critical values: the standard normal and t-distributions are symmetric around 0 β‡’ c_{1βˆ’Ξ±/2} = βˆ’c_{Ξ±/2}, so the two tail critical values have the same magnitude
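Putting the known-Οƒ interval into code (all numbers are illustrative; 1.96 is the Ξ± = 0.05 critical value from the notes):

```python
from math import sqrt

# 95% confidence interval for the mean with known sigma (illustrative numbers)
xbar, sigma, n = 5.2, 1.5, 36
z = 1.96                               # z_{alpha/2} for alpha = 0.05
half_width = z * sigma / sqrt(n)       # margin of error E
ci = (xbar - half_width, xbar + half_width)
```

The width 2E = 0.98 shrinks like 1/√n, which is how a target margin of error determines the required sample size.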


One-sided confidence interval: ΞΌ > XΜ„ βˆ’ z_{1βˆ’Ξ±} Γ— Οƒ/√n or ΞΌ < XΜ„ + z_{1βˆ’Ξ±} Γ— Οƒ/√n

(Note: this essentially adjusts the critical value, which arises naturally when we are not interested in the other bound, e.g. weight > 0, so a negative lower bound is of no interest)

Variance

Confidence interval (πœ‡ is unknown): ((π‘›βˆ’1)𝑠2

πœ’π‘›βˆ’1,1βˆ’

𝛼2

2 ,(π‘›βˆ’1)𝑠2

πœ’π‘›βˆ’1,

𝛼2

2 )

Confidence interval (πœ‡ is known): (𝑛𝑠′2

πœ’π‘›,1βˆ’

𝛼2

2 ,𝑛𝑠′2

πœ’π‘›,

𝛼2

2 ) = (βˆ‘ (π‘‹π‘–βˆ’πœ‡)2𝑛

𝑖=1

πœ’π‘›,1βˆ’

𝛼2

2 ,βˆ‘ (π‘‹π‘–βˆ’πœ‡)2𝑛

𝑖=1

πœ’π‘›,

𝛼2

2 ) (differs in d.f.)

Critical values: chi-squared distribution is not symmetric, so cannot simplify

Binomial proportion

Confidence interval (n > 30 or npΜ‚qΜ‚ > 5): pΜ‚ Β± z_{Ξ±/2} Γ— se(pΜ‚) β‰ˆ pΜ‚ Β± z_{Ξ±/2} Γ— √(pΜ‚(1βˆ’pΜ‚)/n)

(Note: the standard error here is an approximated version from the lecture notes)

Confidence interval (exact method): solve for p_L, p_U from P(X β‰₯ npΜ‚ | p = p_L) = Ξ±/2 and P(X ≀ npΜ‚ | p = p_U) = Ξ±/2, where X ~ B(n, p)

Poisson rate

Confidence interval (exact method): solve for Ξ»_L, Ξ»_U from P(X β‰₯ Ξ»Μ‚T | Ξ» = Ξ»_L) = Ξ±/2 and P(X ≀ Ξ»Μ‚T | Ξ» = Ξ»_U) = Ξ±/2, where X ~ Po(Ξ»T)

Confidence interval (bootstrap method): generate N samples of size m with replacement from X. Calculate the point estimate from each bootstrap sample. Sort the estimates; the bootstrap confidence interval is given by the corresponding percentiles.

(Note: the bootstrap is a very powerful method that can be applied to many statistical problems without a closed-form solution)
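A sketch of the percentile bootstrap for a mean (the data and N = 2000 resamples are illustrative choices, not course values):

```python
import random
import statistics

random.seed(0)
x = [2.1, 3.4, 1.8, 4.0, 2.9, 3.6, 2.2, 5.1, 3.3, 2.7]   # made-up observed sample

# Percentile bootstrap: resample with replacement N times, keep each mean
N = 2000
boot_means = sorted(
    statistics.fmean(random.choices(x, k=len(x))) for _ in range(N)
)
lo = boot_means[int(0.025 * N)]        # 2.5th percentile of the bootstrap means
hi = boot_means[int(0.975 * N) - 1]    # 97.5th percentile
```

The interval (lo, hi) brackets the observed sample mean; the same recipe works for medians, rates, or any other point estimate.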


VII) Hypothesis Testing

Terminologies

Statistical hypothesis: a claim (assumption) about a population parameter

Null hypothesis: 𝐻0, the hypothesis to be tested (default position)

Alternative hypothesis: H1, a hypothesis that challenges H0 (what we want to conclude)

Hypothesis testing: a procedure for making a decision about a hypothesis based on sample data. The idea is to first assume H0 is true. If the population under H0 is unlikely to generate the observed sample, then we can decide to reject H0 (and thus accept H1).

Test statistic: a quantity (a statistic) derived from the sample, used to perform the hypothesis test

Level of significance: 𝛼, defines the unlikely value of the sample if 𝐻0 is true

Critical value: cutoff values from the distribution of test statistic under 𝐻0 given 𝛼

p-value: the probability of obtaining a test statistic at least as extreme as the observed sample value, given that H0 is true

β€œAccept the null hypothesis”: if we fail to reject H0, we cannot accept it, because doing so violates the idea of proof by contradiction. It is possible that H0 is not true but we have not collected enough data to reject it

Type I error: 𝛼, reject 𝐻0 when 𝐻0 is true (false positive).

(Note: traditional statistical procedure controls type I error by the level of significance, so that’s

why both of them are 𝛼)


Type II error: 𝛽, do not reject 𝐻0 when 𝐻0 is false (false negative)

                    H0 is true                           H0 is false
Do not reject H0    Correct inference                    Type II error
                    (true negative, probability = 1βˆ’Ξ±)   (false negative, probability = Ξ²)
Reject H0           Type I error                         Correct inference
                    (false positive, probability = Ξ±)    (true positive, probability = 1βˆ’Ξ²)

One sample z-test

Assumption: known 𝜎, from normal distribution or of large size (𝑛 β‰₯ 30)

Hypothesis: (1) H0: ΞΌ = ΞΌ0 vs H1: ΞΌ β‰  ΞΌ0; or (2) H0: ΞΌ = ΞΌ0 vs H1: ΞΌ > ΞΌ0; or (3) H0: ΞΌ = ΞΌ0 vs H1: ΞΌ < ΞΌ0

Test statistic: z0 = (XΜ„ βˆ’ ΞΌ0)/(Οƒ/√n), Z0 ~ N(0, 1) under the null

Decision rule: reject if (1) |z0| > z_{1βˆ’Ξ±/2}; (2) z0 > z_{1βˆ’Ξ±}; (3) z0 < z_Ξ±

p-value: reject if p0 < Ξ±, where (1) p0 = P(|Z0| > |z0|) = 2P(Z0 > |z0|); (2) p0 = P(Z0 > z0); (3) p0 = P(Z0 < z0)
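The one sample z-test as a sketch in Python (all numbers illustrative; the standard normal cdf is computed via the error function):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cdf: Phi(z) = (1 + erf(z/sqrt(2)))/2."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Two-sided test of H0: mu = 50 with known sigma (illustrative numbers)
xbar, mu0, sigma, n, alpha = 51.3, 50.0, 4.0, 64, 0.05
z0 = (xbar - mu0) / (sigma / sqrt(n))      # test statistic
p_value = 2 * (1 - phi(abs(z0)))           # two-sided p-value
reject = p_value < alpha                   # compare with the significance level
```

Here z0 = 2.6 exceeds z_{0.975} = 1.96, so the decision rule and the p-value route agree: reject H0.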

One sample t-test

Assumption: unknown 𝜎

Hypothesis: (1) H0: ΞΌ = ΞΌ0 vs H1: ΞΌ β‰  ΞΌ0; or (2) H0: ΞΌ = ΞΌ0 vs H1: ΞΌ > ΞΌ0; or (3) H0: ΞΌ = ΞΌ0 vs H1: ΞΌ < ΞΌ0

Test statistic: t0 = (XΜ„ βˆ’ ΞΌ0)/(s/√n), T0 ~ t_{nβˆ’1} under the null

Decision rule: reject if (1) |t0| > t_{nβˆ’1,1βˆ’Ξ±/2}; (2) t0 > t_{nβˆ’1,1βˆ’Ξ±}; (3) t0 < t_{nβˆ’1,Ξ±}

One sample chi-squared test

Assumption: unknown 𝜎, from normal distribution


Hypothesis: (1) H0: σ² = Οƒ0Β² vs H1: σ² β‰  Οƒ0Β²; or (2) H0: σ² = Οƒ0Β² vs H1: σ² > Οƒ0Β²; or (3) H0: σ² = Οƒ0Β² vs H1: σ² < Οƒ0Β²

Test statistics: πœ’02 =

(π‘›βˆ’1)𝑠2

𝜎02 , Χ0

2~πœ’π‘›βˆ’12 under null

Decision rule: reject if (1) Ο‡0Β² > Ο‡_{nβˆ’1,1βˆ’Ξ±/2}Β² or Ο‡0Β² < Ο‡_{nβˆ’1,Ξ±/2}Β²; (2) Ο‡0Β² > Ο‡_{nβˆ’1,1βˆ’Ξ±}Β²; (3) Ο‡0Β² < Ο‡_{nβˆ’1,Ξ±}Β²

One sample binomial proportion test

Assumption: binomial sample with 𝑛 > 30 or 𝑛𝑝0π‘ž0 > 5

Hypothesis: (1) H0: p = p0 vs H1: p β‰  p0; or (2) H0: p = p0 vs H1: p > p0; or (3) H0: p = p0 vs H1: p < p0

Test statistic: z0 = (pΜ‚ βˆ’ p0)/√(p0(1βˆ’p0)/n), Z0 ~ N(0, 1) under the null

Decision rule: reject if (1) |z0| > z_{1βˆ’Ξ±/2}; (2) z0 > z_{1βˆ’Ξ±}; (3) z0 < z_Ξ±

Some remarks (not required)

Duality of confidence interval with hypothesis test: 𝐻0 is rejected at significance level 𝛼 if and

only if the corresponding confidence interval does not contain the value claimed by 𝐻0 with

confidence level 1 βˆ’ 𝛼 (true for common cases)

Power: P(reject H0 | H1 is true). As higher power implies a lower type II error, traditional procedures usually fix the type I error and search for tests with high power

Bayesian inference: most procedures in this course are frequentist procedures. Taking interval estimation as an example, if we want our interval to have probability 1 βˆ’ Ξ± of covering the unknown parameter, we should seek a credible interval from Bayesian inference instead (a confidence interval does not guarantee that). Consider taking more courses from our department if you are interested :)