Basic statistics
Introduction, Overview, Objectives
Volkert Siersma, Cristina Boschini, Thomas A. Gerds
Department of Biostatistics, University of Copenhagen
1 / 47
Learning objectives
1. The role of statistics for your research
2. Data structures and descriptive statistics:
   - categorical or binary or continuous?
   - normal or not?
   - paired or tied?
   - range, mean, standard deviation, quantiles
3. Understanding statistical conclusions:
   - standard error
   - parameter estimates, confidence intervals
   - testing statistical hypotheses, p-values, power
   - the meaning of "not statistically significant"
4. Basic statistical tools:
   - barplot, boxplot, plot of means, scatterplot, Kaplan-Meier plot
   - t-test, Wilcoxon rank test, χ2-test, Fisher's exact test
   - (multiple) linear regression, logistic regression
   - repeated measures ANOVA, survival analysis
2 / 47
Lectures
[30 Oct 2017] Introduction, p-value, power, confidence
[01 Nov 2017] Descriptive statistics, ANOVA
[06 Nov 2017] 2x2 tables, odds ratio, χ2 test
[08 Nov 2017] One and two sample tests for continuous outcome
[13 Nov 2017] Linear regression and correlation
[15 Nov 2017] Logistic regression
[20 Nov 2017] Multiple regression, confounding, interaction
[22 Nov 2017] Repeated measurements, longitudinal data
[27 Nov 2017] Survival analysis
[29 Nov 2017] Presentation of student projects (homework)
3 / 47
Course material:
4 / 47
Formalities
1. The lectures start at 8:15, the computer exercises at 12:00.
2. A homework assignment is handed out after lecture 4.
3. The homework assignment is turned in after lecture 8.
4. This highly ambitious course gives many ECTS points, if
   - you attend 80% of all teaching units (we count the signatures)
   - you present the results of your homework on the last course day.
5. You may work with a different statistical software.
6. Material is available at the course homepage:
   https://ifsv.sund.ku.dk/biostat/annualreport/index.php/Course:Basic_statistics_for_health_researchers_(English_course)-2017-Fall
5 / 47
Data
6 / 47
Study forms
Laboratory experiment: animal model, cell-culture
Controlled conditions

Randomized clinical trial: patients, inclusion criteria
Controlled treatment

Observational study: subjects, biopsies, registry data
Controlled by design (e.g., case-control design, cohort study)

In all study forms, data are collected on multiple study units, usually according to a pre-defined protocol.

The number of study units is the sample size.
7 / 47
Example: Patient characteristics 1
1 Lawrie et al. Journal of Cellular & Molecular Medicine, 13:1248-1260, 2009.
8 / 47
Data and variables
ID  Gender  Age  Stage  Extranodal involvement  Treatment
1   male    49   III    yes                     A
2   female  68   IV     yes                     B
3   female  48   I      no                      B
4   female  41   II     yes                     C
5   male    22   II     no                      A
... ...     ...  ...    ...                     ...
- Binary variable: occurs in one of two possible states; here: Gender and Extranodal involvement
- Categorical variable: values indicate membership in one of several possible categories; here: Stage (ordinal), Gender, Extranodal involvement, Treatment (nominal)
- Continuous variable: values on a continuous scale with a given unit; here: Age (years)
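The three variable types can be mirrored in R. The following is only a minimal sketch that re-types the five example rows by hand; the object name `patients` and the shortened column name `Extranodal` are assumptions made for illustration.

```r
# Hand-typed copy of the five example rows (illustration only)
patients <- data.frame(
  ID         = 1:5,
  Gender     = factor(c("male", "female", "female", "female", "male")),  # binary
  Age        = c(49, 68, 48, 41, 22),                                    # continuous (years)
  Stage      = factor(c("III", "IV", "I", "II", "II"),
                      levels = c("I", "II", "III", "IV"),
                      ordered = TRUE),                                   # ordinal
  Extranodal = factor(c("yes", "yes", "no", "yes", "no")),               # binary
  Treatment  = factor(c("A", "B", "B", "C", "A"))                        # nominal
)

str(patients)            # shows the type of each column
table(patients$Stage)    # frequencies for a categorical variable
summary(patients$Age)    # range, mean and quartiles for a continuous variable
```

Storing Stage as an ordered factor keeps the natural order I < II < III < IV in tables and plots.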
9 / 47
useR!
We operate the free statistical software R with the graphical user interface R-studio.
10 / 47
Project structure
\some_project_name
--- data
    +++ raw.xls (untouched as received)
    +++ work.csv (processed for data analysis)
--- scripts
    +++ process-data.R (read: raw.xls, write: work.csv)
    +++ manu1-main.R (read: work.csv, write: tables and figures)
    +++ sandbox.R (try code until it works)
--- figures
    +++ manu1-figure1.pdf (or .eps for journal)
    +++ manu1-figure2.pdf (or .eps for journal)
--- tables
    +++ manu1-table1.csv
    +++ manu1-table2.csv
--- report
    +++ manu1-analysis-report.Rmd
    +++ manuscript1.docx
11 / 47
Data management
Data file format. R can read:
- comma or semicolon separated text (csv) via data.table::fread
- Excel via library(xlsx)
- SAS via library(sas7bdat)
- stata via library(readstata13)
- SPSS via library(foreign)

Tips

- Export your Excel/SAS/stata/SPSS data to text (csv).
- Use an anonymous patient id. Remove patient names.
- Data are changing objects. Save your data as, e.g., cancerdata_yyyymmdd.csv
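A dated file name can be generated rather than typed. The sketch below uses base R only (read.csv and write.csv), although data.table::fread reads the same file; the small data set and the temporary folder are assumptions chosen so that the example is self-contained.

```r
# Made-up example data, written to a temporary folder (illustration only)
tmp <- tempdir()
raw <- data.frame(id = 1:3, age = c(49, 68, 48))
write.csv(raw, file.path(tmp, "cancerdata.csv"), row.names = FALSE)

# Read the text file back into R
cancer <- read.csv(file.path(tmp, "cancerdata.csv"))

# Save a dated copy following the pattern cancerdata_yyyymmdd.csv
stamp <- format(Sys.Date(), "%Y%m%d")
dated_file <- file.path(tmp, paste0("cancerdata_", stamp, ".csv"))
write.csv(cancer, dated_file, row.names = FALSE)
```

In a real project the files would live in the data folder of the project instead of a temporary folder.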
12 / 47
Data dictionary (a separate file)[Data sets] non-participants.csv, Stamdata_10092013.csv
[Last update] 2013-11-09
Baseline:
[Label] Age[Unit/levels] Years, Groups: <45,45-55,55-65,65-75,75-85
[Variable name] age
[Label] Civil status[Unit/levels] yes=living with spouse no=no spouse/living alone
[Variable name] spouse
. . .Primary outcome:
[Label] First discharge date[Unit/levels] date, day-month-year[Description] the date from which the patients are followed for 6 months
[Variable name] dis0
13 / 47
Exercise: create R-studio project
1. Download the file basstatana.zip from the course homepage and unzip it.
2. Move the unzipped directory "basstatana" to a reasonable place on your computer.
3. In R-studio create a project 2 (global menu). Use the existing directory basstatana.
4. Open and read the file README.html

2 https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects
14 / 47
R-studio: settings
There are 4 panes (rearrangeable):
- R Source file(s)
- Console(s)
- Environment/History
- Files/Plots/Packages/Help/Viewer

Keyboard shortcuts
- Navigate between panes
- Evaluate (run) R-code of the R Source file(s) in the Console(s)
15 / 47
R studio: R syntax
Example                Explanation

a=c(5,1,4)             set value of symbol a to vector 5,1,4
a <- c(5,1,4)
a                      show value of a
print(a)
f=function(x,y){x+y}   define function f with arguments x and y
f                      see function definition of f
f(x=3,y=1) or f(3,1)   apply function f
"bla"                  string
NA                     missing value
a==1                   test where a equals 1

http://tryr.codeschool.com/
16 / 47
R studio: data and data.table
Raw data files like data/Mice.csv can be opened from the file menu. We almost never want to do this.

To work with the data, they need to be read by R. The function fread from the data.table package is most convenient:

library(data.table)
mice <- fread('data/Mice.csv')
mice

    Id Dose BodyWeight UterusWeight
 1:  4  0.0      11.89       0.0090
 2: 19 50.0      12.78       0.0802
 3:  9  2.5      11.28       0.0515
 4: 14  7.5      11.92       0.0948
 5:  7  1.0      10.03       0.0189
 6: 18 50.0      13.24       0.0623
 7:  3  0.0      10.39       0.0069
 8: 17 50.0      13.29       0.1130
 9: 10  2.5      11.97       0.0560
10: 13  7.5      12.64       0.0833
11:  5  1.0      11.21       0.0295
12:  1  0.0      12.69       0.0118
13: 16  7.5      13.57       0.0780
14:  6  1.0      10.97       0.0264
15:  8  1.0      10.64       0.0242
16: 12  2.5      12.16       0.0514
17: 11  2.5      10.67       0.0449
18: 15  7.5      12.69       0.1017
19:  2  0.0      11.58       0.0088
20: 20 50.0      13.26       0.0912
17 / 47
R-studio: hints, errors and warnings
- Save R code in script files, but never save the workspace as .RData (Global options).
- While typing R code, press the Tab key all the time to get completions and to see what arguments are available.
- Warning messages sometimes indicate an error; sometimes they can be ignored.
- Warning and error messages are sometimes hard to understand, but they can often be googled.
- Use the Internet, and in particular Stack Overflow, to find help, syntax and examples.
18 / 47
R-studio: syntax errors
If you forget a closing parenthesis, e.g., when you run the R-code

mean(c(1,3,6)

then the Console will look like this:

> mean(c(1,3,6)
+

To solve this, put the cursor in the Console and press the Escape key, or type the missing parenthesis into the Console and press the RETURN key.

There are many other ways to f(,}[$#*ˆ the code up.
19 / 47
About programming
A PhD student should program the data analysis in order to be able to reproduce the results 5 years later.

Casey Patton: Programming is thinking, not typing.

Morgan Johansson: Programming is best done "in the zone", a (pleasant) state of mind where your focus on the task is absolute and everything seems easy. This is probably much like "the zone" for musicians and athletes.

Folklore: Sleeping on a problem can actually solve it.
20 / 47
Exercise: data preparation
1. In R-studio, open the project basstatana.
2. Open the R-script file scripts/prepareMiceData.R.
3. Read through the file line by line.
4. Open the Rmarkdown file demo-mice.Rmd.
5. Read through the file line by line and evaluate the R-chunks.
6. Export the document to html/word/pdf.
21 / 47
Statistical inference
22 / 47
The general principle of statistical inference
1. Research aims and hypotheses are formulated.
2. The study population is defined and data are collected (measured) on a random subset of all samples/individuals in the population.
3. The sampled data are analysed, but conclusions are drawn about the full study population.

It is crucial that the sampled samples/individuals are "representative" of the full study population.
23 / 47
Descriptive statistics describe the data that you have seen (what is in the yellow box).

[Figure: the population (white box, statistical inference) with the sample inside it (yellow box, descriptive statistics).]

Statistical inference answers research questions, formulated in terms of parameters, about the population (the white box). Inference is based on the data in the yellow box.
24 / 47
Example
Descriptive statistics
Of the 138 patients who survived an out-of-hospital cardiac arrest, 48 (34.8%) were female.

Statistical inference

The risk (95% confidence limits) of anoxic brain damage or admission to a nursing home 12 months after cardiac arrest was 13.3% (7.1% – 22.1%) in males and 10.1% (2.9% – 24.8%) in females (p>0.05).
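The difference between the two kinds of statements can be seen in R using the counts from the descriptive example. This is a minimal sketch: the exact (Clopper-Pearson) interval returned by binom.test is an assumed choice for illustration, not necessarily the method used in the study.

```r
# Counts from the descriptive statement above
n_survivors <- 138
n_female    <- 48

# Descriptive statistics: the percentage observed in the sample
pct_female <- round(100 * n_female / n_survivors, 1)
pct_female

# Statistical inference: a 95% confidence interval for the proportion
# of females in the population the sample was drawn from
ci <- binom.test(n_female, n_survivors)$conf.int
round(100 * ci, 1)
```

The percentage describes the 138 patients only; the confidence interval is a statement about the population.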
25 / 47
Flow chart example: not exactly a random subset of the population
26 / 47
Probability distributions
A variable is any measurement which is not constant in the study population, i.e., different study units may have different values.

The theoretical distribution of a variable describes the unknown probabilities for seeing a certain value in the population.

The empirical distribution of a variable describes the observed frequencies of the values in the sample.
27 / 47
Examples of empirical distributions
[Figure: histograms of Diabetes$age, Diabetes$height and Diabetes$ratio (y-axis: Frequency) and a barplot of Diabetes$gender (male, female).]
28 / 47
The empirical distribution of gender
table(Diabetes$gender)

female   male
   234    169
The theoretical distribution cannot be normal!
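A binary variable takes only two values, so its distribution is fully described by one proportion. The sketch below rebuilds a gender vector from the two counts in the table; the Diabetes data set itself is not used, and the object names are assumptions.

```r
# Rebuild a gender vector from the counts shown in the table (illustration only)
gender <- factor(rep(c("female", "male"), times = c(234, 169)))

counts <- table(gender)        # empirical distribution as frequencies
props  <- prop.table(counts)   # empirical distribution as proportions

counts
round(props, 3)
```

Calling barplot(counts) would draw these frequencies as a barplot.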
29 / 47
The empirical distribution of height
hist(Diabetes$height)
[Figure: histogram of Diabetes$height (x-axis: Diabetes$height, y-axis: Frequency).]

The theoretical distribution may well be normal.
30 / 47
Relation of variables: terminology
Dependent variable/outcome variable/phenotype:
The presumed effect
Independent variable/predictor variable/feature/covariate:
The presumed cause
Experimental or randomized study
- The values of the independent variable are under experimental control.
- All other variables that may impact the dependent variable are controlled.
31 / 47
Interpretation of a p-value
32 / 47
Abstract: Gloria-Bottini et al. (part 1)

The well-known relationship between low birth weight and allergies prompted us to investigate a possible pleiotropic effect of ACP1 on these conditions. ACP1 is a polymorphic enzyme that affects signal transduction of insulin and other growth factors, T-cell receptor signaling, and the regulation of flavoenzyme activity.

Our aim was to compare the relationship between ACP1 and allergy with the relationship between ACP1 and birth weight. We studied 299 subjects from the Caucasian population of England, 124 subjects from the Caucasian population of central Italy, and 302 healthy puerperae and their newborn babies from the same Caucasian populations.

ACP1 phenotype was determined by starch gel electrophoresis on RBC hemolysate and by DNA analysis.
33 / 47
Abstract: Gloria-Bottini et al. (part 2)
Results: Subjects with high ACP1 activity (ACP1 C,B phenotype) show a lower level of IgE compared to subjects with low ACP1 activity (p = 0.01).

The proportion of infants with a birth weight below the first quartile is lower among infants born to mothers with high ACP1 activity than among infants born to mothers with medium-low activity (p = 0.01).

The data suggest a protective effect of high-activity ACP1 C,B phenotype from low birth weight and from allergic manifestations after birth.
34 / 47
Statistical hypotheses
In statistics "testing a hypothesis" means deciding between two alternative explanations of a phenomenon – usually a yes/no question.

Example: The birth weight is associated with the ACP1 activity of the mother.

H0 Null hypothesis
The birth weight is not associated with the ACP1 activity level of the mother

H1 Alternative hypothesis
The birth weight is associated with the ACP1 activity level of the mother
35 / 47
The dilemma of applied statistics
In applied statistics we have to accept a certain probability that the conclusions drawn in a specific situation are false.

We distinguish two types of failure:
- False positive finding (controlled by the level of significance α)
- False negative finding (controlled by the power of the study, 1 − β, where β is the probability of a false negative finding)

Problem: Unfortunately, the lower the level of significance the lower the power and vice versa.

The solution is a compromise: Fix the level of significance, usually it is set at α = 5%, and then maximize the power.
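The trade-off between the two types of failure can be illustrated by simulation. The following is a minimal sketch under assumed settings that do not come from the course: two groups of 20 normally distributed observations, a true difference of one standard deviation, and 2000 simulated studies.

```r
set.seed(1)
alpha       <- 0.05
n_sim       <- 2000   # number of simulated studies (assumption)
n_per_group <- 20     # observations per group (assumption)

# Null hypothesis true: both groups come from the same distribution
p_null <- replicate(n_sim,
                    t.test(rnorm(n_per_group), rnorm(n_per_group))$p.value)
false_positive_rate <- mean(p_null < alpha)   # should be close to alpha

# Alternative true: the means differ by one standard deviation
p_alt <- replicate(n_sim,
                   t.test(rnorm(n_per_group),
                          rnorm(n_per_group, mean = 1))$p.value)
power_at_5 <- mean(p_alt < 0.05)   # estimated power at the 5% level
power_at_1 <- mean(p_alt < 0.01)   # a stricter level gives lower power

c(false_positive_rate, power_at_5, power_at_1)
```

Lowering the level of significance from 5% to 1% reduces the false positive rate, but the same simulated studies then detect the true difference less often.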
36 / 47
The p-value
The result of every statistical test is a decision between the null hypothesis and the alternative hypothesis.

The decision is usually reported in the form of a p-value:

- If p < 5% ⇒ there is evidence for the alternative hypothesis.
- If p > 5% ⇒ stay with the null hypothesis.
37 / 47
Statistically significant
The proportion of infants with a birth weight below the first quartile is lower among infants born to mothers with high ACP1 activity than among infants born to mothers with medium-low activity (p = 0.01).

The p-value is below the a priori fixed α-level of significance:

p = 0.01 < 0.05 = α

Thus, the birth weight is statistically significantly associated with the ACP1 activity of the mother.

Generally speaking, the p-value is the probability of observing the data (or more extreme data) when the null hypothesis is true.
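The comparison of a p-value with α can be sketched in R for a 2x2 table. The counts below are hypothetical and are NOT the data of Gloria-Bottini et al.; Fisher's exact test is an assumed choice, as the abstract does not state which test was used.

```r
# Hypothetical 2x2 table (made-up counts, for illustration only)
birth <- matrix(c(10, 40,
                  25, 35),
                nrow = 2, byrow = TRUE,
                dimnames = list(ACP1         = c("high", "medium-low"),
                                birth_weight = c("below Q1", "above Q1")))

test  <- fisher.test(birth)   # H0: no association between rows and columns
alpha <- 0.05                 # level of significance, fixed in advance

test$p.value
significant <- test$p.value < alpha
significant
```

The decision rule is the comparison in the last lines: the null hypothesis is rejected only if the p-value is below the level fixed before the data were analysed.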
38 / 47
In summary
We have quite different interpretations of the two kinds of results
- statistically significant result: the probability of wrongly rejecting a true null hypothesis (false positive finding) is small, at most 5% 3
- not statistically significant: the probability that the conclusion is wrong (false negative finding) can be high and depends on the power of the test/study.

3 The value 5% is chosen by us and does not depend on the sample size and other things.
39 / 47
Significance: details
If we see a statistically non-significant result the research hypothesis could still be correct and there may be a significant biological difference, but

- the sample size was too small
- the magnitude of the difference was too small
- the statistical test was not appropriate

If we see a statistically significant result the biological difference may still be non-significant. This will often happen when the sample size is very large.
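The effect of a very large sample size can be shown with simulated data. This is a minimal sketch under assumed numbers: one million observations per group and a true difference of only 0.01 standard deviations, which would rarely be biologically relevant.

```r
set.seed(2)
n <- 1e6   # very large sample size per group (assumption)

control <- rnorm(n, mean = 0,    sd = 1)
treated <- rnorm(n, mean = 0.01, sd = 1)   # tiny true difference (assumption)

res      <- t.test(treated, control)
est_diff <- mean(treated) - mean(control)

est_diff      # estimated difference: very small
res$p.value   # p-value: nevertheless far below 5%
```

The estimated difference and its confidence interval, not the p-value alone, show whether the effect is large enough to matter.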
40 / 47
Negative findings
Thus, a non-significant result may have different reasons: either the power is too low to detect a small effect, or there simply is no biomedically relevant effect.

If you really find a negative result, such as

- Wnt expression is not correlated with β-catenin dysregulation in Dupuytren's Disease
- Prior occupation by scirtid beetles does not affect mosquito and midge populations in treeholes

then it may still be important to publish it 4, 5

4 Journal of Negative Results http://www.jnr-eeb.org/
5 Journal of Negative Results in Biomedicine http://www.jnrbm.com
41 / 47
Planning power and sample size
42 / 47
Planning a (randomized) study:
1. Aim: formulate a research question
2. Outcome: associated effect, i.e., response to treatment
3. Study units: define the subjects or probes
4. Groups: divide the study units into two (treatment) groups
5. Expected results: make assumptions regarding the outcome in the two groups
   - case binary outcome (positive/negative): probability of positive outcome in both groups
   - case continuous outcome: difference in mean outcome between groups, standard deviation of outcome
6. Calculation:
   - fix the sample size and calculate the power for detecting a difference between the groups
   - fix the power and calculate the required sample size
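For a continuous outcome, both calculations in step 6 can be sketched with power.t.test from the stats package. The inputs (a mean difference of 1 unit, a standard deviation of 1 unit, 80% power) are assumptions chosen for illustration.

```r
# Fix the power and calculate the required sample size per group
res_n <- power.t.test(delta = 1, sd = 1, power = 0.8, sig.level = 0.05)
n_per_group <- ceiling(res_n$n)   # round up to whole study units
n_per_group

# Fix the sample size and calculate the power
res_power <- power.t.test(n = n_per_group, delta = 1, sd = 1,
                          sig.level = 0.05)
res_power$power
```

Exactly one of n, delta, power, sd and sig.level is left unspecified, and power.t.test solves for that quantity.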
43 / 47
Example: study on cancer treatment
- Can treatment reduce the tumor growth?
- Study type: randomized laboratory experiment
- Experimental units: mice, randomly divided into a treatment and a placebo group
- Measurements: tumor size after 6 weeks
- Outcome: whether the tumor size has doubled after 6 weeks, a binary variable (present, not present)

Question: What is the statistical power to detect a significant difference between the probability of tumor doubling in the treatment group (n=18) and the placebo group (n=18)?

Input parameters: the expected probability of tumor doubling is 80% in the placebo group, and it is expected to be reduced to 40% in the treatment group.
44 / 47
Power calculation for given sample size
power.prop.test(n=18,p1=0.4,p2=0.8)

     Two-sample comparison of proportions power calculation

              n = 18
             p1 = 0.4
             p2 = 0.8
      sig.level = 0.05
          power = 0.7041066
    alternative = two.sided

NOTE: n is number in *each* group

A 50% reduction of the risk of tumor doubling after 6 weeks (from 80% to 40%) can be detected with a power of 70.4%.
45 / 47
Sample size calculation for given power
power.prop.test(power=.8,p1=0.4,p2=0.8)

     Two-sample comparison of proportions power calculation

              n = 22.33013
             p1 = 0.4
             p2 = 0.8
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
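The calculated sample size is not a whole number, so in practice it is rounded up. A minimal sketch of that step follows; the variable names are assumptions.

```r
# Same calculation as above, with the result stored in an object
res <- power.prop.test(power = 0.8, p1 = 0.4, p2 = 0.8)

n_per_group <- ceiling(res$n)    # 22.33 is rounded up to 23 mice per group
n_total     <- 2 * n_per_group   # both groups together

n_per_group
n_total
```

Rounding up rather than down keeps the power at or above the planned 80%.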
46 / 47
Take home messages
- A statistical analysis may or may not provide evidence for a hypothesis.
- You have to get used to statistical terminology and the statistical way of thinking.
- It pays off to learn the R language and to become a programmer.
47 / 47