
Basic statistics: Introduction, Overview, Objectives

Volkert Siersma, Cristina Boschini, Thomas A. Gerds

Department of Biostatistics, University of Copenhagen

1 / 47

Learning objectives

1. The role of statistics for your research
2. Data structures and descriptive statistics:
   - categorical or binary or continuous?
   - normal or not?
   - paired or tied?
   - range, mean, standard deviation, quantiles
3. Understanding statistical conclusions:
   - standard error
   - parameter estimates, confidence intervals
   - testing statistical hypotheses, p-values, power
   - the meaning of "not statistically significant"
4. Basic statistical tools:
   - barplot, boxplot, plot of means, scatterplot, Kaplan-Meier plot
   - t-test, Wilcoxon rank test, χ²-test, Fisher's exact test
   - (multiple) linear regression, logistic regression
   - repeated measures ANOVA, survival analysis

2 / 47

Lectures

[30 Oct 2017] Introduction, p-value, power, confidence
[01 Nov 2017] Descriptive statistics, ANOVA
[06 Nov 2017] 2x2 tables, odds ratio, χ² test
[08 Nov 2017] One and two sample tests for continuous outcome
[13 Nov 2017] Linear regression and correlation
[15 Nov 2017] Logistic regression
[20 Nov 2017] Multiple regression, confounding, interaction
[22 Nov 2017] Repeated measurements, longitudinal data
[27 Nov 2017] Survival analysis
[29 Nov 2017] Presentation of student projects (homework)

3 / 47

Course material:

4 / 47

Formalities

1. The lectures start at 8:15, the computer exercises at 12:00.
2. A homework assignment is handed out after lecture 4.
3. The homework assignment is turned in after lecture 8.
4. This highly ambitious course gives many ECTS points, if
   - you attend 80% of all teaching units (we count the signatures)
   - you present the results of your homework on the last course day.
5. You may work with a different statistical software.
6. Material is available at the course homepage:
   https://ifsv.sund.ku.dk/biostat/annualreport/index.php/Course:Basic_statistics_for_health_researchers_(English_course)-2017-Fall

5 / 47

Data

6 / 47

Study forms

Laboratory experiment: animal model, cell-culture
Controlled conditions

Randomized clinical trial: patients, inclusion criteria
Controlled treatment

Observational study: subjects, biopsies, registry data
Controlled by design (e.g., case-control design, cohort study)

In all study forms, data are collected on multiple study units, usually according to a pre-defined protocol.

The number of study units is the sample size.

7 / 47

Example: Patient characteristics¹

¹ Lawrie et al., Journal of Cellular & Molecular Medicine, 13:1248-1260, 2009.

8 / 47

Data and variables

ID  Gender  Age  Stage  Extranodal involvement  Treatment
1   male    49   III    yes                     A
2   female  68   IV     yes                     B
3   female  48   I      no                      B
4   female  41   II     yes                     C
5   male    22   II     no                      A
... ...     ...  ...    ...                     ...

- Binary variable: occurs in one of two possible states; here: Gender and Extranodal involvement
- Categorical variable: values indicate membership in one of several possible categories; here: Stage (ordinal), Gender, Extranodal involvement, Treatment (nominal)
- Continuous variable: values on a continuous scale with a given unit; here: Age (years)
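The three variable types can be made explicit in R. The following is a minimal sketch that rebuilds the five rows of the table by hand; the object name `patients` and the shortened column name `Extranodal` are assumptions made for this illustration and do not come from the course material.

```r
# Hypothetical reconstruction of the five rows shown in the table above.
patients <- data.frame(
  ID         = 1:5,
  Gender     = c("male", "female", "female", "female", "male"),
  Age        = c(49, 68, 48, 41, 22),
  Stage      = c("III", "IV", "I", "II", "II"),
  Extranodal = c("yes", "yes", "no", "yes", "no"),
  Treatment  = c("A", "B", "B", "C", "A"),
  stringsAsFactors = FALSE
)

# Binary and nominal variables: unordered factors
patients$Gender     <- factor(patients$Gender)
patients$Extranodal <- factor(patients$Extranodal, levels = c("no", "yes"))
patients$Treatment  <- factor(patients$Treatment)

# Ordinal variable: ordered factor with explicit level order
patients$Stage <- factor(patients$Stage,
                         levels = c("I", "II", "III", "IV"),
                         ordered = TRUE)

str(patients)            # shows the type of every column
table(patients$Stage)    # frequencies for a categorical variable
summary(patients$Age)    # range, mean and quartiles for a continuous variable
```

Declaring the type up front matters because most R functions choose their summary or model based on whether a column is numeric, a factor, or an ordered factor.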

9 / 47

useR!

We operate the free statistical software R with the graphical user interface R-studio.

10 / 47

Project structure

\some_project_name
--- data
    +++ raw.xls (untouched as received)
    +++ work.csv (processed for data analysis)
--- scripts
    +++ process-data.R (read: raw.xls, write: work.csv)
    +++ manu1-main.R (read: work.csv, write: tables and figures)
    +++ sandbox.R (try code until it works)
--- figures
    +++ manu1-figure1.pdf (or .eps for journal)
    +++ manu1-figure2.pdf (or .eps for journal)
--- tables
    +++ manu1-table1.csv
    +++ manu1-table2.csv
--- report
    +++ manu1-analysis-report.Rmd
    +++ manuscript1.docx
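Such a folder skeleton can also be created from R itself. This is only a sketch: it builds the five sub-folders in a temporary directory so that nothing on your computer is overwritten, and the variable names `project` and `folders` are chosen for this illustration.

```r
# Sketch: create the folder skeleton in a temporary directory.
# Replace tempdir() by a real location for an actual project.
project <- file.path(tempdir(), "some_project_name")
folders <- c("data", "scripts", "figures", "tables", "report")

for (f in folders) {
  dir.create(file.path(project, f), recursive = TRUE, showWarnings = FALSE)
}

list.files(project)  # lists the sub-folders that now exist
```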

11 / 47

Data management

Data file format. R can read:

- comma or semicolon separated text (csv) via data.table::fread
- Excel via library(xlsx)
- SAS via library(sas7bdat)
- Stata via library(readstata13)
- SPSS via library(foreign)

Tips

- Export your Excel/SAS/Stata/SPSS data to text (csv).
- Use an anonymous patient id. Remove patient names.
- Data are changing objects. Save your data as, e.g., cancerdata_yyyymmdd.csv
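The date-stamped file name from the last tip can be generated automatically. The sketch below uses base R only and writes to a temporary directory; the toy data set `cancerdata` is invented for the illustration.

```r
# Hypothetical toy data; in practice this is your processed data set.
cancerdata <- data.frame(id = 1:3, age = c(49, 68, 48))

# Build a file name of the form cancerdata_yyyymmdd.csv from today's date.
stamp   <- format(Sys.Date(), "%Y%m%d")
outfile <- file.path(tempdir(), paste0("cancerdata_", stamp, ".csv"))

write.csv(cancerdata, outfile, row.names = FALSE)

# Reading the file back checks that nothing was lost on the way.
reread <- read.csv(outfile)
basename(outfile)
```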

12 / 47

Data dictionary (a separate file)

[Data sets] non-participants.csv, Stamdata_10092013.csv
[Last update] 2013-11-09

Baseline:

[Label] Age
[Unit/levels] Years, Groups: <45,45-55,55-65,65-75,75-85
[Variable name] age

[Label] Civil status
[Unit/levels] yes=living with spouse no=no spouse/living alone
[Variable name] spouse

...

Primary outcome:

[Label] First discharge date
[Unit/levels] date, day-month-year
[Description] the date from which the patients are followed for 6 months
[Variable name] dis0

13 / 47

Exercise: create R-studio project

1. Download the file basstatana.zip from the course homepage and unzip it.
2. Move the now unfolded directory "basstatana" to a reasonable place on your computer.
3. In R-studio create a project² (global menu). Use the existing directory basstatana.
4. Open and read the file README.html

² https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects

14 / 47

R-studio: settings

There are 4 panes (rearrangeable):

- R Source file(s)
- Console(s)
- Environment/History
- Files/Plots/Packages/Help/Viewer

Keyboard shortcuts

- Navigate between panes
- Evaluate (run) R-code of the R Source file(s) in the Console(s)

15 / 47

R studio: R syntax

Example                 Explanation

a=c(5,1,4)              set value of symbol a to vector 5,1,4
a <- c(5,1,4)
a                       show value of a
print(a)
f=function(x,y){x+y}    define function f with arguments x and y
f                       see function definition of f
f(x=3,y=1) or f(3,1)    apply function f
"bla"                   string
NA                      missing value
a==1                    test where a equals 1

http://tryr.codeschool.com/

16 / 47
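The syntax elements of the R syntax slide can be tried out in one short script. This is a sketch for experimenting in the console; the vector `b` and the calls to `which` and `mean` are additions for illustration and are not part of the slide.

```r
a <- c(5, 1, 4)                # assign a vector to the symbol a
f <- function(x, y) { x + y }  # define a function with two arguments

f(x = 3, y = 1)   # arguments matched by name
f(3, 1)           # arguments matched by position, same result

a == 1            # element-wise comparison: one TRUE/FALSE per element
which(a == 1)     # position(s) where the comparison is TRUE

b <- c(a, NA)     # NA marks a missing value
mean(b)           # the result is NA because one value is missing
mean(b, na.rm = TRUE)  # remove missing values before averaging
```

Note that a missing value propagates through most computations unless it is removed explicitly.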

R studio: data and data.table

Raw data files like data/Mice.csv can be opened from the file menu. We almost never want to do this.

To work with the data, they need to be read by R. The function fread from the data.table package is most convenient:

library(data.table)
mice <- fread('data/Mice.csv')
mice

    Id Dose BodyWeight UterusWeight
 1:  4  0.0      11.89       0.0090
 2: 19 50.0      12.78       0.0802
 3:  9  2.5      11.28       0.0515
 4: 14  7.5      11.92       0.0948
 5:  7  1.0      10.03       0.0189
 6: 18 50.0      13.24       0.0623
 7:  3  0.0      10.39       0.0069
 8: 17 50.0      13.29       0.1130
 9: 10  2.5      11.97       0.0560
10: 13  7.5      12.64       0.0833
11:  5  1.0      11.21       0.0295
12:  1  0.0      12.69       0.0118
13: 16  7.5      13.57       0.0780
14:  6  1.0      10.97       0.0264
15:  8  1.0      10.64       0.0242
16: 12  2.5      12.16       0.0514
17: 11  2.5      10.67       0.0449
18: 15  7.5      12.69       0.1017
19:  2  0.0      11.58       0.0088
20: 20 50.0      13.26       0.0912

17 / 47

R-studio: hints, errors and warnings

- Save R code in script files but never .RData (Global options)
- While typing R code press the Tab key all the time to get completions and to see what arguments are available
- Warning messages sometimes indicate an Error, sometimes they can be ignored.
- Warning and Error messages are sometimes hard to understand, but they can often be googled.
- Use the Internet and in particular stackoverflow to find help, syntax and examples

18 / 47

R-studio: syntax errors

If you forget to send a closing parenthesis, e.g., when you run the R-code

mean(c(1,3,6)

then the Console will look like this:

> mean(c(1,3,6)
+

To solve this, put the cursor in the Console and press the Escape key, or type the missing parenthesis into the Console and press the RETURN key.

There are many other ways to f(,}[$#*ˆ the code up.

19 / 47

About programming

A PhD student should program the data analysis in order to be able to reproduce the results 5 years later.

Casey Patton: Programming is thinking not typing

Morgan Johansson: Programming is best done "in the zone" - a (pleasant) state of mind where your focus on the task is absolute and everything seems easy. This is probably much like "the zone" for musicians and athletes.

Folklore: Sleeping with a problem can actually solve it

20 / 47

Exercise: data preparation

1. In R-studio, open the project basstatana.
2. Open the R-script file scripts/prepareMiceData.R.
3. Read through the file line by line.
4. Open the Rmarkdown file demo-mice.Rmd.
5. Read through the file line by line and evaluate the R-chunks.
6. Export the document to html/word/pdf.

21 / 47

Statistical inference

22 / 47

The general principle of statistical inference

1. Research aims and hypotheses are formulated.
2. The study population is defined and data are collected (measured) on a random subset of all samples/individuals in the population.
3. The sampled data are analysed, but conclusions are drawn about the full study population.

It is crucial that the sampled samples/individuals are "representative" of the full study population.

23 / 47

Descriptive statistics describe the data that you have seen (what is in the yellow box).

[Figure: a white box labelled "Population: statistical inference" and a yellow box labelled "Sample: descriptive statistics"]

Statistical inference answers research questions in parameterized complexity about the population (the white box). Inference is based on the data in the yellow box.

24 / 47

Example

Descriptive statistics

Of the 138 patients who survived an out-of-hospital cardiac arrest, 48 (34.8%) were female.

Statistical inference

The risk (95% confidence limits) of anoxic brain damage or admission to a nursing home 12 months after cardiac arrest was 13.3% (7.1% – 22.1%) in males and 10.1% (2.9% – 24.8%) in females (p>0.05).
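The difference between the two kinds of statement can be sketched in R with the counts from the descriptive sentence (48 females among 138 survivors). The percentage is a description of the sample; the confidence interval is an inferential statement about the population. The interval for the proportion of females is computed here only as an illustration and is not reported in the example above; the counts behind the reported risks are not given, so those intervals cannot be reproduced.

```r
n_total  <- 138   # patients who survived (from the example)
n_female <- 48    # of which female (from the example)

# Descriptive statistic: the observed proportion in the sample
prop_female <- n_female / n_total
round(100 * prop_female, 1)

# Statistical inference: exact (Clopper-Pearson) 95% confidence limits
# for the proportion of females in the population (illustration only)
ci <- binom.test(n_female, n_total)$conf.int
round(100 * ci, 1)
```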

25 / 47

Flow chart example: not exactly a random subset of the population

26 / 47

Probability distributions

A variable is any measurement which is not constant in the study population, i.e., different study units may have different values.

The theoretical distribution of a variable describes the unknown probabilities for seeing a certain value in the population.

The empirical distribution of a variable describes the observed frequencies of the values in the sample.
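The two notions can be compared in a small simulation. In the sketch below the theoretical distribution is known because we choose it ourselves (the probabilities 0.6 and 0.4 are hypothetical); in a real study it is unknown and only the empirical distribution is observed.

```r
set.seed(1)  # make the random sample reproducible

# Hypothetical theoretical distribution of a binary variable
theoretical <- c(female = 0.6, male = 0.4)

# Draw a sample of 200 study units from this population
x <- factor(sample(names(theoretical), size = 200, replace = TRUE,
                   prob = theoretical),
            levels = names(theoretical))

# Empirical distribution: observed relative frequencies in the sample
empirical <- prop.table(table(x))

rbind(theoretical = theoretical,
      empirical   = as.vector(empirical[names(theoretical)]))
```

The empirical frequencies are close to, but not identical with, the theoretical probabilities; the gap shrinks as the sample size grows.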

27 / 47

Examples of empirical distributions

[Figure: histograms of Diabetes$age, Diabetes$height and Diabetes$ratio (y-axis: frequency), and a barplot of Diabetes$gender (male, female)]

28 / 47

The empirical distribution of gender

table(Diabetes$gender)

female   male
   234    169

The theoretical distribution cannot be normal!

29 / 47

The empirical distribution of height

hist(Diabetes$height)

[Figure: histogram of Diabetes$height (x-axis: Diabetes$height, y-axis: frequency)]

The theoretical distribution may well be normal.

30 / 47

Relation of variables: terminology

Dependent variable/outcome variable/phenotype:

The presumed effect

Independent variable/predictor variable/feature/covariate:

The presumed cause

Experimental or randomized study

- The values of the independent variable are under experimental control.
- All other variables that may impact the dependent variable are controlled.

31 / 47

Interpretation of a p-value

32 / 47

Abstract: Gloria-Bottini et al (part 1)

The well-known relationship between low birth weight and allergies prompted us to investigate a possible pleiotropic effect of ACP1 on these conditions. ACP1 is a polymorphic enzyme that affects signal transduction of insulin and other growth factors, T-cell receptor signaling, and the regulation of flavoenzyme activity.

Our aim was to compare the relationship between ACP1 and allergy with the relationship between ACP1 and birth weight. We studied 299 subjects from the Caucasian population of England, 124 subjects from the Caucasian population of central Italy, and 302 healthy puerperae and their newborn babies from the same Caucasian populations.

ACP1 phenotype was determined by starch gel electrophoresis on RBC hemolysate and by DNA analysis.

33 / 47

Abstract: Gloria-Bottini et al. (part 2)

Results: Subjects with high ACP1 activity (ACP1 C,B phenotype) show a lower level of IgE compared to subjects with low ACP1 activity (p = 0.01).

The proportion of infants with a birth weight below the first quartile is lower among infants born to mothers with high ACP1 activity than among infants born to mothers with medium-low activity (p = 0.01).

The data suggest a protective effect of high-activity ACP1 C,B phenotype from low birth weight and from allergic manifestations after birth.

34 / 47

Statistical hypotheses

In statistics "testing a hypothesis" means deciding between two alternative explanations of a phenomenon – usually a yes/no question.

Example: The birth weight is associated with the ACP1 activity of the mother.

H0 Null hypothesis: The birth weight is not associated with the ACP1 activity level of the mother.

H1 Alternative hypothesis: The birth weight is associated with the ACP1 activity level of the mother.

35 / 47

The dilemma of applied statistics

In applied statistics we have to accept a certain probability that the conclusions drawn in a specific situation are false.

We distinguish two types of failure:

- False positive finding (its probability is controlled by the level of significance α)
- False negative finding (its probability β is controlled by the power of the study, which is 1 − β)

Problem: Unfortunately, for a given sample size, the lower the level of significance the lower the power and vice versa.

The solution is a compromise: Fix the level of significance, usually it is set at α = 5%, and then maximize the power.
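The trade-off between significance level and power can be made visible with the base R function power.prop.test. The sketch below borrows the sample size and the two probabilities from the tumor example later in this lecture (18 mice per group, 40% versus 80%); the choice of the three α levels is an assumption made for the illustration.

```r
# Three candidate significance levels, from lenient to strict
alphas <- c(0.10, 0.05, 0.01)

# Power for the same design at each significance level
powers <- sapply(alphas, function(a) {
  power.prop.test(n = 18, p1 = 0.4, p2 = 0.8, sig.level = a)$power
})

data.frame(alpha = alphas, power = round(powers, 3))
```

With the sample size held fixed, each stricter α gives a lower power, which is why α is fixed first and the power is then raised by other means, typically a larger sample.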

36 / 47

The p-value

The result of every statistical test is a decision between the null hypothesis and the alternative hypothesis.

The decision is usually reported in the form of a p-value:

- If p < 5% ⇒ there is evidence for the alternative hypothesis.
- If p > 5% ⇒ stay with the null hypothesis.

37 / 47

Statistically significant

The proportion of infants with a birth weight below the first quartile is lower among infants born to mothers with high ACP1 activity than among infants born to mothers with medium-low activity (p = 0.01).

The p-value is below the a priori fixed α-level of significance:

p = 0.01 < 0.05 = α

Thus, the birth weight is statistically significantly associated with the ACP1 activity of the mother.

Generally speaking, the p-value is the probability of observing the data (or more extreme data) when the null hypothesis is true.
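The definition "probability of the data or more extreme data under the null hypothesis" can be computed by hand in a simple case. The following sketch uses a hypothetical coin-tossing experiment (15 heads in 20 tosses, null hypothesis: the coin is fair), which is not part of the course material, and compares the manual calculation with the one-sided exact binomial test in base R.

```r
# Hypothetical data: 15 heads in 20 tosses; H0: probability of heads = 0.5
n <- 20
k <- 15

# p-value by definition: probability of 15, 16, ..., 20 heads under H0
p_manual <- sum(dbinom(k:n, size = n, prob = 0.5))

# The same p-value from the exact binomial test (one-sided alternative)
p_test <- binom.test(k, n, p = 0.5, alternative = "greater")$p.value

alpha <- 0.05
c(manual = p_manual, test = p_test)
p_manual < alpha   # TRUE means: statistically significant at level alpha
```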

38 / 47

In summary

We have quite different interpretations of the two kinds of results

- statistically significant result: the likelihood that the conclusion is wrong (false positive finding) is small, at most 5%³
- not statistically significant: the likelihood that the conclusion is wrong (false negative finding) can be high and depends on the power of the test/study.

³ The value 5% is chosen by us and does not depend on the sample size and other things.

39 / 47

Significance: details

If we see a statistically non-significant result the research hypothesis could still be correct and there may be a significant biological difference, but

- the sample size was too small
- the magnitude of the difference was too small
- the statistical test was not appropriate

If we see a statistically significant result the biological difference may still be non-significant. This will often happen when the sample size is very large.

40 / 47

Negative findings

Thus, "not significant" may have different reasons: either the power is too low to detect a too small effect, or there simply is no biomedically relevant effect.

If you really find a negative result, such as

- Wnt expression is not correlated with β-catenin dysregulation in Dupuytren's Disease
- Prior occupation by scirtid beetles does not affect mosquito and midge populations in treeholes

then it may still be important to publish it.⁴ ⁵

⁴ Journal of Negative Results http://www.jnr-eeb.org/
⁵ Journal of Negative Results in Biomedicine http://www.jnrbm.com

41 / 47

Planning power and sample size

42 / 47

Planning a (randomized) study:

1. Aim: formulate a research question
2. Outcome: associated effect, i.e., response to treatment
3. Study units: define the subjects or probes
4. Groups: divide the study units into two (treatment) groups
5. Expected results: make assumptions regarding the outcome in the two groups
   - case binary outcome (positive/negative): probability of positive outcome in both groups
   - case continuous outcome: difference in mean outcome between groups, standard deviation of outcome
6. Calculation:
   - fix the sample size and calculate the power for detecting a difference between the groups
   - fix the power and calculate the required sample size
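For the continuous-outcome case in step 5, both calculations of step 6 can be sketched with the base R function power.t.test. All input values below (a difference in means of 5 units, a standard deviation of 10 units, 30 units per group) are hypothetical and only serve to show which assumptions have to be supplied.

```r
# Hypothetical assumptions for a continuous outcome
delta <- 5    # expected difference in mean outcome between the groups
sdev  <- 10   # expected standard deviation of the outcome

# Fix the power (80%) and calculate the required sample size per group
res_n <- power.t.test(delta = delta, sd = sdev, power = 0.8, sig.level = 0.05)
ceiling(res_n$n)   # round up: a fraction of a study unit cannot be recruited

# Fix the sample size (30 per group) and calculate the power
res_p <- power.t.test(n = 30, delta = delta, sd = sdev, sig.level = 0.05)
res_p$power
```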

43 / 47

Example: study on cancer treatment

- Can treatment reduce the tumor growth?
- Study type: randomized laboratory experiment
- Experimental units: mice, randomly divided into a treatment and a placebo group
- Measurements: tumor size after 6 weeks
- Outcome: the tumor size has doubled after 6 weeks: binary variable (present, not present)

Question: What is the statistical power to detect a significant difference between the probability of tumor doubling in the treatment group (n=18) and the placebo group (n=18)?

Input parameters: the expected probability of tumor doubling is 80% in the placebo group, and it is expected to be reduced to 40% in the treatment group.

44 / 47

Power calculation for given sample size

power.prop.test(n=18,p1=0.4,p2=0.8)

     Two-sample comparison of proportions power calculation

              n = 18
             p1 = 0.4
             p2 = 0.8
      sig.level = 0.05
          power = 0.7041066
    alternative = two.sided

NOTE: n is number in *each* group

A 50% reduction of the risk of tumor doubling after 6 weeks can be detected with a power of 70.4%.

45 / 47

Sample size calculation for given power

power.prop.test(power=.8,p1=0.4,p2=0.8)

     Two-sample comparison of proportions power calculation

              n = 22.33013
             p1 = 0.4
             p2 = 0.8
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
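The calculated sample size is not a whole number, so in practice it is rounded up. The following sketch extracts the value from the result object and checks the power that is achieved with the rounded number; the variable names are chosen for this illustration.

```r
# Required sample size per group for 80% power (same call as above)
res <- power.prop.test(power = 0.8, p1 = 0.4, p2 = 0.8)

# Round up to the next whole mouse per group
n_per_group <- ceiling(res$n)
n_per_group

# Power actually achieved with the rounded sample size
achieved <- power.prop.test(n = n_per_group, p1 = 0.4, p2 = 0.8)$power
achieved
```

Rounding up rather than to the nearest integer guarantees that the planned power is at least reached.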

46 / 47

Take home messages

- A statistical analysis may or may not put evidence on a hypothesis.
- You have to get used to statistical terminology and the statistical way of thinking.
- It pays off to learn the R language and to become a programmer.

47 / 47