Post on 24-Dec-2015


Basic Data Analysis Using R

Xiao He


AGENDA

1. Data cleaning (e.g., missing values)

2. Descriptive statistics

3. t-tests

4. ANOVA

5. Linear regression

Data visualization


1. DATA CLEANING

NA and NaN:

1. NA (Not Available): missing values.

a). Represented in the form of NA, or in the form of <NA>.

2. NaN (Not a number): when an arithmetic operation returns a non-numeric result: e.g., in R, 0/0 gives you NaN.

NOTE: You may have coded missing values using other notations. For example, some researchers/programs use -99 or -999 to denote missing values. If that is the case, you need to check for missing values using other methods that won’t be covered here.
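The difference between NA and NaN can be checked directly in R. The sketch below uses made-up values; the vector names `x` and `score` are hypothetical, and the -99 recoding is only one simple way to handle such codes, not necessarily the approach the workshop has in mind.

```r
# Toy vector (made-up values) containing one NA and one NaN.
x <- c(4, NA, 0 / 0, 7)

is.na(x)    # TRUE for both the NA and the NaN element
is.nan(x)   # TRUE only for the NaN element

# Hypothetical case: missing values were coded as -99.
# Recode them to NA so that R's missing-value tools recognise them.
score <- c(3, -99, 5, -99, 2)
score[score == -99] <- NA
sum(is.na(score))   # number of missing values after recoding
```

Note that is.na() flags NaN as well, so a single is.na() check covers both kinds of missing value.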


3. How do we deal with NA and NaN?

a). Check which cases (rows) do NOT have NA or NaN:

complete.cases()

# Returns Booleans (TRUE, FALSE). TRUE means no missing value (in a given row), and FALSE means there is at least one missing value. Summing the result gives the number of complete cases.

b). Remove cases with NA or NaN:

na.omit()

# Returns a data object with NA and NaN removed. For data frames, an entire row will be removed if it contains NA or NaN.

c). More sophisticated ways of dealing with NA (covered by Addie’s workshop in two weeks)

Ex1.1: (refer to handout)

Ex1.2: (refer to handout)
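As a rough illustration of complete.cases() and na.omit(), here is a minimal sketch on a small made-up data frame. The name `df` and its columns are hypothetical and are not the handout's data.

```r
# Made-up data frame with missing values in two columns.
df <- data.frame(
  id     = 1:5,
  height = c(170, NA, 165, 180, NaN),
  pulse  = c(72, 80, NA, 64, 70)
)

# One logical value per row: TRUE if the row has no NA/NaN.
ok <- complete.cases(df)
ok
sum(ok)          # number of complete rows

# Drop every row that contains at least one NA or NaN.
df_clean <- na.omit(df)
nrow(df_clean)   # rows remaining after removal
```

Indexing with the logical vector, df[ok, ], keeps the same rows as na.omit(df).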


2. DESCRIPTIVE STATISTICS

1. Help you diagnose potential problems with data entry or collection:

a. Were any values entered incorrectly? e.g., a survey study using a 1–5 Likert scale, but when you checked the range of your data, you found that the maximum value in your dataset was 7.

b. Any strange responses? e.g., did your participants give you any odd responses?

2. Help you get a sense of how your data are distributed.

a. Extreme values (outliers)

b. Non-normality

c. Skewness

d. Unequal variance


Compute individual descriptive statistics:

1. Location:

mean(x, trim)

median(x)

2. Dispersion:

var(x)

sd(x)

range(x); min(x); max(x)

IQR(x)

Ex2.1: (refer to handout)

Ex2.2: (refer to handout)
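The location and dispersion functions above can be tried on a small made-up vector. The values and the name `x` are hypothetical. The `na.rm = TRUE` argument is needed here because the vector contains an NA; without it these functions return NA (or, for IQR(), an error).

```r
# Made-up ages with one extreme value (45) and one missing value.
x <- c(18, 19, 19, 20, 21, 22, 45, NA)

# Location
mean(x, na.rm = TRUE)               # ordinary mean, pulled up by 45
mean(x, trim = 0.2, na.rm = TRUE)   # drops 20% of values from each end first
median(x, na.rm = TRUE)

# Dispersion
var(x, na.rm = TRUE)
sd(x, na.rm = TRUE)
range(x, na.rm = TRUE)              # c(min, max)
min(x, na.rm = TRUE)
max(x, na.rm = TRUE)
IQR(x, na.rm = TRUE)
```

Comparing the ordinary mean with the trimmed mean and the median is a quick way to see how much a single extreme value influences the result.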


Compute a set of descriptive statistics:

1. Use the function summary().

2. Use the function describe() in the package `psych`.

Ex3.1: (refer to handout)

Ex3.2: (refer to handout)
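A minimal sketch of both approaches on a made-up data frame follows. The name `df` and its columns are hypothetical. The `psych` package is not part of base R, so the describe() call is guarded and only runs if the package is installed.

```r
# Made-up data frame with two numeric columns and one factor.
df <- data.frame(
  age    = c(18, 19, 21, 20, 22, 19),
  height = c(160, 172, 168, NA, 181, 175),
  sex    = factor(c("Female", "Male", "Female", "Female", "Male", "Male"))
)

# summary() gives min, quartiles, mean, max (and NA counts) for numeric
# columns, and level counts for factors.
s <- summary(df)
s

# describe() from psych adds n, sd, skew, kurtosis, se, and more.
# Only run it if psych is installed.
if (requireNamespace("psych", quietly = TRUE)) {
  print(psych::describe(df[, c("age", "height")]))
}
```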


3. T-TESTS

t.test()

t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = .95)

1. One-sample t-test:


Suppose someone hypothesized that the mean undergrad age was 19.75. Let’s test whether the mean age was significantly different from 19.75.

H0: mu = 19.75

H1: mu ≠ 19.75

Ex4.1: (refer to handout)
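A one-sample test of this kind might look like the sketch below. The ages are made up for illustration and the name `age` is hypothetical; the handout's exercise uses its own data.

```r
# Made-up undergraduate ages.
age <- c(18.5, 19.2, 20.1, 21.0, 19.8, 18.9, 22.3, 20.5, 19.1, 20.7)

# Two-sided one-sample t-test of H0: mu = 19.75.
res <- t.test(age, mu = 19.75)
res

# Individual pieces of the result can be extracted by name.
res$statistic   # t value
res$parameter   # degrees of freedom (n - 1)
res$p.value
res$conf.int    # 95% confidence interval for the mean
```

For a directional hypothesis, set `alternative = "less"` or `alternative = "greater"` instead of the default `"two.sided"`.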


2. Independent t-test:

Let’s test whether the mean height of female students is significantly different from the mean height of male students.

H0: muFemale = muMale

H1: muFemale ≠ muMale

Ex4.2: (refer to handout)
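With the x/y interface, the two groups are passed as separate vectors. The sketch below uses made-up heights; the names `female` and `male` are hypothetical.

```r
# Made-up heights (cm) for two independent groups.
female <- c(160, 165, 158, 170, 163, 167, 161, 166)
male   <- c(175, 180, 172, 178, 169, 183, 176, 174)

# Default: Welch's t-test, which does not assume equal variances
# (var.equal = FALSE).
res_welch <- t.test(female, male)
res_welch

# Classic pooled-variance t-test, which assumes equal variances.
res_pooled <- t.test(female, male, var.equal = TRUE)
res_pooled
```

The two versions differ in their degrees of freedom: the pooled test uses n1 + n2 - 2, while the Welch test uses an approximation that is usually not a whole number.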


The independent t-test can also be specified with a formula:

t.test(formula, data, …)

formula = Y ~ X

Y: outcome variable (e.g., height)

X: 2-level grouping variable (e.g., Sex)

Ex4.3: (refer to handout)
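The formula interface expects the outcome and the grouping variable to be columns of one data frame. A minimal sketch, assuming a made-up data frame called `students` (hypothetical name and values):

```r
# Made-up data in "long" format: one row per student.
students <- data.frame(
  height = c(160, 165, 158, 170, 163, 167,
             175, 180, 172, 178, 169, 183),
  sex    = factor(rep(c("Female", "Male"), each = 6))
)

# Outcome on the left of ~, two-level grouping variable on the right.
res <- t.test(height ~ sex, data = students)
res

res$estimate   # the two group means
```

This gives the same test as passing the two groups as separate x and y vectors; it is simply more convenient when the data are already stored in a data frame.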


3. Paired t-test:

Let’s test whether the mean writing-hand span (Wr.Hnd) and the mean non-writing-hand span (NW.Hnd) differ significantly.

H0: muWr.Hnd = muNW.Hnd

H1: muWr.Hnd ≠ muNW.Hnd

Ex4.4: (refer to handout)
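For a paired test, the two vectors must have the same length and the same order, one pair per person. The sketch below uses made-up hand spans; the vector names `wr` and `nw` are hypothetical stand-ins for the Wr.Hnd and NW.Hnd columns.

```r
# Made-up hand spans (cm); element i of each vector is the same person.
wr <- c(18.5, 19.5, 18.0, 18.8, 20.0, 17.5, 21.0, 19.0)
nw <- c(18.0, 20.5, 17.8, 18.9, 20.0, 17.7, 20.8, 18.5)

# Paired t-test of H0: mean difference = 0.
res_paired <- t.test(wr, nw, paired = TRUE)
res_paired

# Equivalent one-sample t-test on the within-person differences.
res_diff <- t.test(wr - nw, mu = 0)
res_diff$statistic
```

The equivalence with the one-sample test on the differences is the reason the paired test has n - 1 degrees of freedom, where n is the number of pairs.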


4. ANOVA

aov()

aov(formula, data)

1. One-way ANOVA:

Suppose we are interested in whether the mean pulse rate differs among people with different exercise levels.

H0: muNone = muSome = muFreq

H1: Not all group means are equal.

Ex5.1: (refer to handout)
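A one-way ANOVA along these lines could be sketched as follows. The data frame `pulse_df` and its values are made up for illustration; only the group labels None, Some, and Freq come from the hypotheses above.

```r
# Made-up pulse rates for three exercise groups, four people each.
pulse_df <- data.frame(
  pulse = c(80, 76, 84, 78,
            74, 72, 70, 75,
            66, 68, 64, 70),
  exer  = factor(rep(c("None", "Some", "Freq"), each = 4))
)

# Fit the model: outcome ~ grouping factor.
fit <- aov(pulse ~ exer, data = pulse_df)

# ANOVA table with Df, Sum Sq, Mean Sq, F value, and p-value.
summary(fit)
```

The grouping variable should be a factor (or a character vector); a numeric grouping variable would be treated as a continuous predictor with a single degree of freedom.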


We will import a new dataset and will use it for the next exercise.

hsb2 <- read.table("http://www.ats.ucla.edu/stat/r/faq/hsb2.csv", sep=",", header=TRUE)


2. Two-way ANOVA:

The formula/model for a factorial ANOVA (taking a two-way design with an interaction as an example) is specified as follows:

Y ~ X1 * X2

which is equivalent to

Y ~ X1 + X2 + X1:X2

Suppose we are interested in the main effects of race and schtyp (school type), as well as the interaction effect between the two variables, on read. The formula/model can be specified as:

read ~ as.factor(race) * as.factor(schtyp)

Ex5.2: (refer to handout)

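Why do we wrap race and schtyp in as.factor()? Both are stored as numeric codes, and as.factor() makes R treat them as categorical grouping variables rather than as continuous predictors.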
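The sketch below illustrates the formula shorthand and the effect of as.factor() on a small made-up data frame. The name `scores` and all values are hypothetical; the column names mirror read, race, and schtyp, but this is not the hsb2 data.

```r
# Made-up data: race coded 1-4 and schtyp coded 1-2, both numeric,
# with two observations in every race-by-schtyp cell.
scores <- data.frame(
  read   = c(52, 47, 60, 55, 44, 50, 63, 58,
             49, 53, 57, 61, 46, 51, 59, 54),
  race   = rep(c(1, 2, 3, 4), times = 4),
  schtyp = rep(c(1, 2), each = 8)
)

# X1 * X2 expands to X1 + X2 + X1:X2, so these two fits are identical.
fit_star <- aov(read ~ as.factor(race) * as.factor(schtyp), data = scores)
fit_long <- aov(read ~ as.factor(race) + as.factor(schtyp) +
                  as.factor(race):as.factor(schtyp), data = scores)
all.equal(coef(fit_star), coef(fit_long))

summary(fit_star)   # race has 3 df, schtyp 1 df, interaction 3 df

# Without as.factor(), the numeric codes are treated as continuous
# predictors, so each term gets only 1 df. This is not an ANOVA on groups.
fit_numeric <- aov(read ~ race * schtyp, data = scores)
summary(fit_numeric)
```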


Note: In R, by default, ANOVA results are based on Type 1 (“sequential”) Sum of Squares, in which each term is adjusted only for the terms that precede it in the formula. Some other programs report Type 3 SS (e.g., SPSS) or both Type 1 and Type 3 (e.g., SAS). The Type 3 SS for each term is calculated given that all other terms are in the model.
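One practical consequence of sequential SS is that, with unbalanced data, the order of terms in the formula changes the results. The sketch below shows this on a made-up unbalanced data frame (`d`, `a`, `b`, and `y` are hypothetical names). The Type 3 call relies on Anova() from the `car` package, which is not part of base R, so it is guarded and only runs if the package is installed.

```r
# Made-up unbalanced data: cell counts differ across a-by-b combinations.
d <- data.frame(
  y = c(10, 12, 11, 15, 14, 18, 20, 17, 22, 13),
  a = factor(c("a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2", "a2", "a2")),
  b = factor(c("b1", "b1", "b1", "b2", "b1", "b2", "b2", "b2", "b2", "b1"))
)

# Type 1 (sequential) SS: each term is adjusted only for earlier terms.
tab_ab <- anova(lm(y ~ a + b, data = d))   # a first, then b given a
tab_ba <- anova(lm(y ~ b + a, data = d))   # b first, then a given b
tab_ab
tab_ba

# Type 3 SS (each term adjusted for all others), if car is available.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::Anova(lm(y ~ a + b, data = d), type = 3))
}
```

The two sequential tables split the explained variation differently between a and b, but their sums of squares add up to the same total.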


Let’s import a new dataset:

expenditure <- read.table("http://dornsife.usc.edu/assets/sites/210/docs/GC3/educationExpenditure.txt", sep=",", header=TRUE)


5. LINEAR REGRESSION

lm()

lm(formula, data)

education: Per-capita education expenditures, dollars.

income: Per-capita income, dollars.

young: Proportion under 18, per 1000.

urban: Proportion urban, per 1000.

1. Simple linear regression

formula = Y ~ X

Suppose we are interested in the regression of per-capita education expenditure on per-capita income. The model should be specified as:

formula = education ~ income

Ex6.1: (refer to handout)
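A simple regression with this formula might be fitted as in the sketch below. The data frame `expenditure_toy` and its values are made up so that the example runs without downloading the workshop's file; only the column names follow the variable list above.

```r
# Made-up values for eight hypothetical states.
expenditure_toy <- data.frame(
  education = c(189, 169, 230, 168, 180, 193, 261, 214),
  income    = c(2824, 3259, 3072, 3835, 3549, 4256, 4151, 3954)
)

# Regress education on income: outcome ~ predictor.
fit_simple <- lm(education ~ income, data = expenditure_toy)

coef(fit_simple)      # intercept and slope
summary(fit_simple)   # coefficient table, R-squared, overall F-test
```

The slope is the estimated change in per-capita education expenditure for a one-dollar increase in per-capita income.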


2. Multiple regression

formula = Y ~ X1 + X2 + … + Xp

Suppose we are interested in the regression of education on income, young, and urban. The model should be specified as:

formula = education ~ income + young + urban

Ex6.2: (refer to handout)
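Adding predictors only requires extending the right-hand side of the formula. A minimal sketch, again on a made-up data frame (`expenditure_toy` and its values are hypothetical, not the workshop's file):

```r
# Made-up values for eight hypothetical states.
expenditure_toy <- data.frame(
  education = c(189, 169, 230, 168, 180, 193, 261, 214),
  income    = c(2824, 3259, 3072, 3835, 3549, 4256, 4151, 3954),
  young     = c(351, 346, 348, 335, 327, 341, 326, 333),
  urban     = c(508, 564, 322, 846, 871, 774, 856, 889)
)

# Multiple regression: predictors are joined with +.
fit_multi <- lm(education ~ income + young + urban, data = expenditure_toy)

coef(fit_multi)      # one intercept plus one slope per predictor
summary(fit_multi)   # each slope is adjusted for the other predictors
```

Each coefficient is interpreted as the expected change in education for a one-unit change in that predictor with the other predictors held constant.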


Thanks!