
UCLA Department of Statistics
Statistical Consulting Center

Introductory Statistics with R

Mine Çetinkaya

February 1, 2010

Outline

1 Preliminaries

2 Data sets

3 Descriptive Statistics

4 Probability Models

5 Hypothesis Testing and Confidence Intervals

6 Linear Regression

7 Online Resources for R

8 Upcoming Mini-Courses

9 Exercises

1 Preliminaries

Software Installation

Installing R on a Mac

1 Go to http://cran.r-project.org/ and select MacOS X.

2 Select to download the latest version: 2.11.0 (2010-04-22).

3 Install and open. The R window should look like this:

[Figure: screenshot of the R console window]

R Help

For help with any function in R, put a question mark before the function name to see which arguments to use, examples, and background information.

?plot

2 Data sets

Loading data into R

Loading a data set into R:

survey = read.table("http://www.stat.ucla.edu/~mine/students_survey_2008.txt", header = TRUE, sep = "\t")

Displaying the dimensions of the data set:

dim(survey)

[1] 1325 29
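The header and sep arguments can be tried without a network connection. A minimal sketch, assuming a small made-up tab-delimited sample (the three rows and the object names sample_text and mini are hypothetical, not part of the survey file):

```r
# Hypothetical tab-delimited text standing in for the survey file
sample_text <- "gender\thand\tageinmonths\nfemale\tleft\t228\nmale\tright\t235\nfemale\tright\t243"

# header = TRUE takes the first line as column names;
# sep = "\t" splits fields on tab characters
mini <- read.table(text = sample_text, header = TRUE, sep = "\t")

dim(mini)    # number of rows, then number of columns
names(mini)  # column names taken from the header line
```

The same two arguments apply when the first argument is a file path or URL, as in the call above.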

Viewing data sets in R

Displaying the first 3 rows and 5 columns of the data set:

survey[1:3, 1:5]

  gender  hand eyecolor glasses california
1 female  left    hazel     yes        yes
2   male right    brown      no         no
3 female right    brown     yes        yes

Displaying the variable names in the data set:

names(survey)

 [1] "gender"     "hand"       "eyecolor"   "glasses"     "california"
 [6] "birthmonth" "birthday"   "birthyear"  "ageinmonths" "height"
[11] "graduate"   "oncampus"   "time"       "walk"        "hsclass"
...

Attaching / detaching data frames in R

Attaching the variables in a data set:

attach(survey)

The following object(s) are masked from package:datasets :

sleep

The warning tells us that the data frame we attached contains a column named sleep, which is also the name of an object in the datasets package. If you type:

sleep

the object with that name in the data frame will be found before another object with the same name that is lower in the search() path. Thus, your object is "masking" the other. To detach a data frame, i.e. remove it from the search() path of available R objects, use detach() with the name of the data frame. We won't do that now.

detach(survey)
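Masking can be avoided by not attaching at all. A minimal sketch, assuming a small made-up data frame (toy and its values are hypothetical) that also has a column named sleep:

```r
# Hypothetical data frame with a column whose name clashes with datasets::sleep
toy <- data.frame(sleep = c(6, 7.5, 8), height = c(64, 70, 68))

# Refer to a column explicitly with $ ...
mean_sleep <- mean(toy$sleep)

# ... or evaluate an expression inside the data frame with with()
mean_height <- with(toy, mean(height))

# The built-in data set is untouched and still reachable by its full name
dim(datasets::sleep)
```

Nothing is added to the search() path this way, so there is nothing to detach afterwards.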

3 Descriptive Statistics

Variable classes

Displaying the class of a variable:

class(instructor)

[1] "factor"

Changing the class of a variable:

instructor = as.character(instructor)
class(instructor)

[1] "character"
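Converting a factor whose labels are numbers needs one extra step. A sketch with made-up values (the vector f is hypothetical): as.numeric applied directly to a factor returns the internal level codes, so the usual route goes through as.character first.

```r
f <- factor(c("10", "20", "10", "30"))
class(f)     # "factor"
levels(f)    # the distinct labels, in sorted order

codes  <- as.numeric(f)                 # internal level codes: 1 2 1 3
values <- as.numeric(as.character(f))   # the labels as numbers: 10 20 10 30
```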

Displaying categorical data

Tables

Tables are useful for displaying the distribution of categorical variables.

table(gender)

gender
female   male
   882    443

Contingency tables

Contingency tables display two categorical variables at a time.

table(gender, hand)

        hand
gender   ambidextrous left right
  female            9   67   806
  male             11   45   387
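Counts are often easier to compare as proportions. A sketch that re-enters the counts printed above as a matrix (the object name tab is an assumption) and converts them to row proportions with prop.table:

```r
tab <- matrix(c(9, 67, 806,
                11, 45, 387),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("female", "male"),
                              hand = c("ambidextrous", "left", "right")))

row_totals <- margin.table(tab, 1)     # 882 females, 443 males
row_props  <- prop.table(tab, 1)       # proportion of each hand within each gender
round(row_props, 3)
```

With the survey data attached, prop.table(table(gender, hand), 1) should give the same proportions directly.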

Frequency bar plots

Display counts of each category next to each other for easy comparison.

barplot(table(gender), main = "Barplot of Gender")

[Figure: bar plot titled "Barplot of Gender"; bars for female and male; vertical axis shows counts from 0 to 600]

Relative frequency bar plots

Display relative proportions of each category.

barplot(table(gender) / length(gender), main = "Relative Frequency \n Barplot of Gender")

[Figure: bar plot titled "Relative Frequency Barplot of Gender"; bars for female and male; vertical axis from 0.0 to 0.6]

Segmented bar charts

Display two categorical variables at a time.

barplot(table(gender, hand), col = c("skyblue", "blue"), main = "Segmented Bar Plot \n of Gender")
legend("topleft", c("females", "males"), col = c("skyblue", "blue"), pch = 16, inset = 0.05)

[Figure: stacked bar plot titled "Segmented Bar Plot of Gender"; one bar each for ambidextrous, left, and right; vertical axis from 0 to 800; legend: females, males]

Pie charts

Pie charts display counts as percentages of individuals in each category.

pct = round(table(gender) / length(gender) * 100)
lbls = paste(names(table(gender)), "\n", "%", pct)
pie(table(gender), labels = lbls)

[Figure: pie chart with slices labeled "female % 67" and "male % 33"]

Displaying quantitative data

Histograms

Display the number of cases in each bin.

hist(ageinmonths, main = "Histogram of Age in Months")

[Figure: histogram titled "Histogram of Age in Months"; horizontal axis ageinmonths from 200 to 350; vertical axis Frequency from 0 to 400]

Relative frequency histograms

Display the distribution on a density scale: with freq = FALSE the bar areas (not the heights) give the proportion of cases in each bin, and the areas sum to 1.

hist(ageinmonths, freq = FALSE, main = "Relative Frequency \n Histogram of Age in Months", xlab = "Age in Months")

[Figure: histogram titled "Relative Frequency Histogram of Age in Months"; horizontal axis Age in Months from 200 to 350; vertical axis Density from 0.000 to 0.030]

Stem-and-Leaf Plots

Preserve individual data values.

stem(ageinmonths)

The decimal point is 1 digit(s) to the right of the |

20 | 48

21 | 004444555566666666666666666777777777778888888888889999999999999999

22 | 00000000000000000000000000111111111111111122222222222222222222333333+258

23 | 00000000000000000000000000000000000000000000001111111111111111111111+379

24 | 00000000000000000000000000000000000000000000111111111111111111111111+170

25 | 00000000000001111111111111112222222222222222222223333333344444444445+24

26 | 000000000001111111111222222333334444444444556666778889

27 | 00111222222344566789

28 | 01334558888

29 | 0004569

30 | 267

31 | 02257

32 | 44

33 | 5

34 | 89

35 | 3

Boxplots

boxplot(ageinmonths, main = "Boxplot of Age in Months")

[Figure: box plot titled "Boxplot of Age in Months"; axis from 200 to 350]

Five Number Summary (Min, Q1, Median, Q3, Max):

fivenum(ageinmonths)

[1] 204 228 235 243 353
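The five-number summary also shows where boxplot draws the line between whiskers and outlying points. A sketch using the values printed above and the conventional 1.5 × IQR rule, which is boxplot's default (range = 1.5); the variable names are assumptions:

```r
five <- c(204, 228, 235, 243, 353)   # Min, Q1, Median, Q3, Max from fivenum()

iqr <- five[4] - five[2]             # 243 - 228 = 15
lower_fence <- five[2] - 1.5 * iqr   # 205.5
upper_fence <- five[4] + 1.5 * iqr   # 265.5

c(lower_fence, upper_fence)
```

Ages below 205.5 or above 265.5 months fall outside the fences and are drawn as individual points, which includes both the minimum (204) and the long upper tail.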

Describing distributions numerically

Summary

Categorical variables:

summary(hand)

ambidextrous         left        right
          20          112         1193

Quantitative variables:

summary(ageinmonths)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  204.0   228.0   235.0   237.8   243.0   353.0

Measures of center

Mean (arithmetic average):

mean(ageinmonths)

[1] 237.8309

Median (value that divides the histogram into two equal areas):

median(ageinmonths)

[1] 235

Mode (the most frequent value), for discrete data:

as.numeric(names(sort(table(ageinmonths), decreasing = TRUE))[1])

[1] 228
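The one-liner above can be wrapped in a small helper. A sketch, assuming a hypothetical function name stat_mode (base R's mode() reports an object's storage mode, not the statistical mode); ties are resolved by taking the smallest of the most frequent values:

```r
stat_mode <- function(x) {
  counts <- table(x)                        # frequency of each distinct value, sorted by value
  top <- names(counts)[which.max(counts)]   # label of the first maximum
  as.numeric(top)                           # assumes the data are numeric
}

stat_mode(c(228, 235, 228, 240, 235, 228))
```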

Mode (alternative)

To find the mode, you may also use the Mode function in the prettyR package.

install.packages("prettyR")
library(prettyR)
Mode(ageinmonths)

[1] "228"

Adding measures to plots

Adding mean and median to a histogram.

hist(ageinmonths, main = "Histogram of Age in Months")
abline(v = mean(ageinmonths), col = "blue")
abline(v = median(ageinmonths), col = "green")
legend("topright", c("Mean", "Median"), pch = 16, col = c("blue", "green"))

[Figure: histogram titled "Histogram of Age in Months" with vertical lines at the mean and the median; horizontal axis ageinmonths from 200 to 350; vertical axis Frequency from 0 to 400; legend: Mean, Median]

Measures of spread

Range (Min, Max):

range(ageinmonths)

[1] 204 353

IQR:

IQR(ageinmonths)

[1] 15

Standard deviation:

sd(ageinmonths)

[1] 16.03965

4 Probability Models

Geometric

Geometric distribution

If the probability of success is 0.35, what is the probability that the first success will be on the 5th trial?

dgeom(4, 0.35)

[1] 0.06247719

The first argument of dgeom is the number of failures before the first success, so a first success on the 5th trial corresponds to 4 failures.

Note: dgeom gives the density (or probability mass function for discrete variables), pgeom gives the distribution function, qgeom gives the quantile function, and rgeom generates random deviates. The same prefixes apply to the functions used for Binomial, Poisson and Normal calculations.
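The four prefixes can be checked against the formula. A sketch for the geometric case: with success probability p, the chance of x failures before the first success is (1 − p)^x · p. The variable names and the seed are assumptions.

```r
p <- 0.35

by_formula <- (1 - p)^4 * p     # four failures, then a success
by_dgeom   <- dgeom(4, p)       # same quantity from the d function

# pgeom is the cumulative version: first success on or before the 5th trial
by_pgeom <- pgeom(4, p)
by_sum   <- sum(dgeom(0:4, p))

# qgeom inverts pgeom; rgeom draws random values
median_failures <- qgeom(0.5, p)
set.seed(1)                     # arbitrary seed, only for reproducibility
draws <- rgeom(5, p)
```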

Binomial

Binomial distribution

If the probability of success is 0.35, what is the probability of

3 successes in 5 trials?

dbinom(3, 5, 0.35)

[1] 0.1811469

at least 3 successes in 5 trials?

sum(dbinom(3:5, 5, 0.35))

[1] 0.2351694
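The same upper-tail probability can be obtained from the distribution function. A sketch showing equivalent forms (the variable names are assumptions):

```r
by_sum   <- sum(dbinom(3:5, 5, 0.35))                 # P(X = 3) + P(X = 4) + P(X = 5)
by_compl <- 1 - pbinom(2, 5, 0.35)                    # 1 - P(X <= 2)
by_upper <- pbinom(2, 5, 0.35, lower.tail = FALSE)    # P(X > 2) directly

c(by_sum, by_compl, by_upper)
```

Note the cutoff of 2 rather than 3: pbinom(2, ...) covers 0, 1 and 2 successes, so its complement is "at least 3".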

Poisson

Poisson distribution

The number of traffic accidents per week in a small city has a Poisson distribution with mean equal to 3. What is the probability of

two accidents in a week?

dpois(2, 3)

[1] 0.2240418

at most one accident in a week?

sum(dpois(0:1, 3))

[1] 0.1991483

Normal

Normal distribution

Scores on an exam are distributed normally with a mean of 65 and a standard deviation of 12. What percentage of the students have scores

below 50?

pnorm(50, 65, 12)

[1] 0.1056498

between 50 and 70?

pnorm(70, 65, 12) - pnorm(50, 65, 12)

[1] 0.5558891

Normal distribution (cont.)

What is the 90th percentile of the score distribution?

qnorm(0.90, 65, 12)

[1] 80.37862
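These calculations can also be done on the standard normal scale. A sketch using z = (x − mean) / sd with the exam parameters above (the variable names are assumptions):

```r
mu <- 65
sigma <- 12

z50 <- (50 - mu) / sigma                 # -1.25
below_50_direct <- pnorm(50, mu, sigma)
below_50_z      <- pnorm(z50)            # standard normal: mean 0, sd 1

# Percentiles convert back with x = mu + z * sigma
p90_direct <- qnorm(0.90, mu, sigma)
p90_z      <- mu + qnorm(0.90) * sigma
```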

5 Hypothesis Testing and Confidence Intervals

One sample means

Hypothesis testing for one sample means

Is there evidence to suggest that the average age in months for Stats 10 students is more than 235 months? Use α = 0.05.

sample100 = sample(1:1325, 100, replace = FALSE)
survey.sub = survey[sample100, ]
t.test(survey.sub$ageinmonths, alternative = "greater", mu = 235, conf.level = 0.95)

One Sample t-test

data: survey.sub$ageinmonths
t = 1.5922, df = 99, p-value = 0.05726
alternative hypothesis: true mean is greater than 235
95 percent confidence interval:
 234.9118      Inf
sample estimates:
mean of x
   237.06

Confidence intervals for one sample means

The t.test function prints out a confidence interval as well. However, this function returns a one-sided interval when the alternative is "greater" or "less".

When alternative = "greater" is chosen, the lower confidence bound is calculated and the upper bound is given as Inf by default.

When alternative = "less" is chosen, the upper confidence bound is calculated and the lower bound is given as -Inf by default.

When alternative = "two.sided" is chosen, both the upper and the lower confidence bounds are calculated.

Confidence intervals for one sample means (cont.)

t.test(survey.sub$ageinmonths, alternative = "two.sided", mu = 235, conf.level = 0.90)

One Sample t-test

data: survey.sub$ageinmonths
t = 1.5922, df = 99, p-value = 0.1145
alternative hypothesis: true mean is not equal to 235
90 percent confidence interval:
 234.9118 239.2082
sample estimates:
mean of x
   237.06

Note that we changed the confidence level to 0.90 in order to correspond to a one-sided hypothesis test with α = 0.05.
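This correspondence can be checked numerically. A sketch with simulated ages (the vector x, its parameters, and the seed are made up and are not the survey data): the lower bound of the one-sided 95% interval equals the lower bound of the two-sided 90% interval.

```r
set.seed(42)                             # arbitrary seed
x <- rnorm(100, mean = 237, sd = 16)     # hypothetical sample of 100 ages

one_sided <- t.test(x, alternative = "greater", mu = 235, conf.level = 0.95)
two_sided <- t.test(x, alternative = "two.sided", mu = 235, conf.level = 0.90)

one_sided$conf.int[1:2]    # lower bound, then Inf
two_sided$conf.int[1:2]    # same lower bound, finite upper bound
```

Both intervals put 5% in the lower tail, which is why the lower bounds agree.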

Confidence intervals for one sample means (cont.)

Alternative calculation of confidence interval:

onesample.mean.ci = function(x, conf.level) {
  # critical t value for a two-sided interval
  tstar = -qt(p = (1 - conf.level) / 2, df = length(x) - 1)
  xbar = mean(x)
  # standard error of the sample mean
  sexbar = sd(x) / sqrt(length(x))
  cilower = xbar - tstar * sexbar
  ciupper = xbar + tstar * sexbar
  return(c(cilower, ciupper))
}
onesample.mean.ci(survey.sub$ageinmonths, 0.90)

[1] 234.9118 239.2082

Two sample means

Hypothesis testing and CI for two sample means

Is there a difference between the ages of females and males? Construct a 95% confidence interval for the difference between the average ages of females and males.

t.test(survey.sub$ageinmonths[survey.sub$gender == "female"],
       survey.sub$ageinmonths[survey.sub$gender == "male"],
       alternative = "two.sided", conf.level = 0.95)

Welch Two Sample t-test

data: survey.sub$ageinmonths[survey.sub$gender == "female"] and survey.sub$ageinmonths[survey.sub$gender == "male"]
t = 1.25, df = 95.736, p-value = 0.2143
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.765100  7.768572
sample estimates:
mean of x mean of y
 238.1406  235.1389

One sample proportions

Hypothesis testing for one sample proportions

64 out of 100 students in a random sample are females. Is there evidence to suggest that the population proportion of females is less than 65%? Use a 90% confidence level.

prop.test(64, 100, p = 0.65, alternative = "less", conf.level = 0.90)

1-sample proportions test with continuity correction

data: 64 out of 100, null probability 0.65
X-squared = 0.011, df = 1, p-value = 0.4583
alternative hypothesis: true p is less than 0.65
90 percent confidence interval:
 0.0000000 0.7035286
sample estimates:
   p
0.64
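The test statistic can be reproduced by hand. A sketch of the continuity-corrected calculation for one sample, X² = (|x − n·p0| − 0.5)² / (n·p0·(1 − p0)), compared with what prop.test returns; the variable names are assumptions:

```r
x <- 64; n <- 100; p0 <- 0.65

x2 <- (abs(x - n * p0) - 0.5)^2 / (n * p0 * (1 - p0))   # 0.25 / 22.75
z  <- sign(x / n - p0) * sqrt(x2)                       # signed root for a one-sided test
p_less <- pnorm(z)                                      # P(Z <= z) for alternative = "less"

res <- prop.test(x, n, p = p0, alternative = "less", conf.level = 0.90)
c(by_hand = x2, prop_test = unname(res$statistic))
c(by_hand = p_less, prop_test = res$p.value)
```

The sample proportion (0.64) is barely below 0.65, so z is close to 0 and the one-sided p-value is close to 0.5.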

Confidence intervals for one sample proportions

Just like t.test, the prop.test function will calculate both the upper and the lower bounds of the confidence interval only when alternative = "two.sided" is chosen. Otherwise a lower bound of 0 or an upper bound of 1 is produced.

prop.test(64, 100, p = 0.65, alternative = "two.sided", conf.level = 0.80)

1-sample proportions test with continuity correction

data: 64 out of 100, null probability 0.65
X-squared = 0.011, df = 1, p-value = 0.9165
alternative hypothesis: true p is not equal to 0.65
80 percent confidence interval:
 0.5715825 0.7035286
sample estimates:
   p
0.64

Two sample proportions

Hypothesis testing and CI for two sample proportions

54 out of 64 females and 32 out of 36 males are right handed. Is there evidence to suggest that the proportions of males and females who are right handed are different?

prop.test(c(54, 32), c(64, 36))

2-sample test for equality of proportions with continuity correction

data: c(54, 32) out of c(64, 36)
X-squared = 0.1051, df = 1, p-value = 0.7458
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.2026789  0.1124012
sample estimates:
   prop 1    prop 2
0.8437500 0.8888889

6 Linear Regression

Scatterplots, Association, and Correlation

Scatterplots

Is there an association between amount of alcohol consumed and maximum speed?

plot(speed ~ alcohol, main = "Scatterplot of Speed vs. Alcohol", pch = 20, cex = 0.5)

[Figure: scatterplot titled "Scatterplot of Speed vs. Alcohol"; horizontal axis alcohol from 0 to 80; vertical axis speed from 0 to 150]

Correlation

cor(alcohol, speed, use = "pairwise.complete.obs")

[1] 0.2309745

Simple Linear Regression

Build a linear regression model predicting speed from alcohol.

summary(lm(speed ~ alcohol))

Call:
lm(formula = speed ~ alcohol)

Residuals:
    Min      1Q  Median      3Q     Max
-90.769  -8.725   1.275  11.275  91.541

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  88.7248     0.6511 136.261   <2e-16 ***
alcohol       0.9469     0.1108   8.549   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.83 on 1297 degrees of freedom
  (26 observations deleted due to missingness)
Multiple R-squared: 0.05335, Adjusted R-squared: 0.05262
F-statistic: 73.09 on 1 and 1297 DF, p-value: < 2.2e-16
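In simple linear regression, Multiple R-squared is the square of the correlation: 0.2309745² ≈ 0.05335, matching the output above. A sketch with simulated data (alcohol_sim, speed_sim, the coefficients used to generate them, and the seed are made up) that checks this identity and uses the fitted line for a prediction:

```r
set.seed(10)                                    # arbitrary seed
alcohol_sim <- runif(200, min = 0, max = 30)    # hypothetical predictor
speed_sim   <- 90 + 1 * alcohol_sim + rnorm(200, sd = 20)

fit <- lm(speed_sim ~ alcohol_sim)

r  <- cor(alcohol_sim, speed_sim)
r2 <- summary(fit)$r.squared
c(r_squared_from_cor = r^2, r_squared_from_lm = r2)

# Predicted response at a chosen predictor value: intercept + slope * value
b <- coef(fit)
pred_by_hand <- unname(b[1] + b[2] * 10)
pred_by_fun  <- unname(predict(fit, newdata = data.frame(alcohol_sim = 10)))
```

Applied to the coefficients printed above, the same arithmetic gives a predicted speed of 88.7248 + 0.9469 × 10 = 98.1938 for alcohol = 10.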

7 Online Resources for R

Download R: http://cran.stat.ucla.edu/

Search Engine for R: rseek.org

R Reference Card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf

UCLA Statistics Information Portal: http://info.stat.ucla.edu/grad/

UCLA Statistical Consulting Center: http://scc.stat.ucla.edu


8 Upcoming Mini-Courses

May 5, R Stats II: Linear Regression

May 10, R Stats III: Nonlinear Regression

May 12, LaTeX V: Creating Vector Graphics in LaTeX

For a schedule of all mini-courses offered please visit http://scc.stat.ucla.edu/mini-courses.

Thank you

Any questions?

9 Exercises

1 Construct side-by-side box plots for the distribution of the amount of time it takes students to get to class (time) by their means of transportation (walk).

2 Usually younger students live on campus and older students live off campus. Is there evidence to suggest this trend in this data set? (Use a random sample of 100 students and α = 0.05.)

3 Calculate a 90% confidence interval for the difference between the average ages of students who live on campus and off campus.

Solution to Exercise 1

boxplot(time ~ walk, main = "Time to get to class \n by type of transportation")

[Figure: side-by-side box plots titled "Time to get to class by type of transportation"; vertical axis Minutes from 0 to 150; categories: bicycle, bus, car (by yourself), carpool, motorcycle, other, segway, skateboard, walk]

Solution to Exercise 2

t.test(survey.sub$ageinmonths[survey.sub$oncampus == "yes"],
       survey.sub$ageinmonths[survey.sub$oncampus == "no"],
       alternative = "less", conf.level = 0.95)

Welch Two Sample t-test

data: survey.sub$ageinmonths[survey.sub$oncampus == "yes"] and survey.sub$ageinmonths[survey.sub$oncampus == "no"]
t = -5.3322, df = 34.867, p-value = 2.964e-06
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -10.85376
sample estimates:
mean of x mean of y
 232.6111  248.5000

Solution to Exercise 3

t.test(survey.sub$ageinmonths[survey.sub$oncampus == "yes"],
       survey.sub$ageinmonths[survey.sub$oncampus == "no"],
       alternative = "two.sided", conf.level = 0.90)

Welch Two Sample t-test

data: survey.sub$ageinmonths[survey.sub$oncampus == "yes"] and survey.sub$ageinmonths[survey.sub$oncampus == "no"]
t = -5.3322, df = 34.867, p-value = 5.929e-06
alternative hypothesis: true difference in means is not equal to 0
90 percent confidence interval:
 -20.92402 -10.85376
sample estimates:
mean of x mean of y
 232.6111  248.5000
