> integrating r into > introductory statisticsmc301/talks/integrate_r_jsm2012.pdf · >...

56
> integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew Bray - UCLA JSM 2012 - August 2, 2012

Upload: others

Post on 28-Jun-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> integrating R into > introductory statistics

Mine Çetinkaya-Rundel - Duke UniversityAndrew Bray - UCLA

JSM 2012 - August 2, 2012

Page 2: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> experience

Page 3: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> STA 101 AT DUKE

Page 4: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> first course in statistics for non-majors, mostly social sciences majors

> STA 101 AT DUKE

Page 5: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> first course in statistics for non-majors, mostly social sciences majors

> 80-120 students in lecture, 25 students / lab section

> STA 101 AT DUKE

Page 6: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> first course in statistics for non-majors, mostly social sciences majors

> 80-120 students in lecture, 25 students / lab section

> weekly lab sessions using R> labs designed for an interdisciplinary introductory course,

can be modified for discipline-specific courses> can also be used in a first data-analysis course for stats

majors, ideally by reducing step-by-step instructions

> STA 101 AT DUKE

Page 7: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> R

Page 8: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

unlike most software designed specifically for courses at this level, R is> free and open-source> powerful and flexible> relevant beyond the introductory statistics

classroom

> WHY R?

Page 9: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> WHY NOT R?

Page 10: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> perceived challenge of teaching programming in addition to teaching statistical concepts> labs and activities that try to find the right balance

of standard and custom functions> consistent syntax highlighting helps

> WHY NOT R?

Page 11: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> perceived challenge of teaching programming in addition to teaching statistical concepts> labs and activities that try to find the right balance

of standard and custom functions> consistent syntax highlighting helps

> working with a command line tends to be more intimidating than traditional GUI based tools> GUI tools also have a learning curve > a user-friendly IDE (like RStudio)

> WHY NOT R?

Page 12: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> RSTUDIO

Page 13: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> what it helps resolve:> loading and viewing data > saving code> code history> workspace organization> plot history

> RSTUDIO

Page 14: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> what it helps resolve:> loading and viewing data > saving code> code history> workspace organization> plot history

> what still remains a challenge:> working with a command line

> RSTUDIO

Page 15: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> balance

Page 16: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> BALANCE

Page 17: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> teach coding as a way of introducing/reinforcing concepts, especially those that are otherwise difficult to convey without computation> simulations

> sampling distributions > confidence levels

> bootstrapping > randomization tests > ...

> BALANCE

Page 18: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> teach coding as a way of introducing/reinforcing concepts, especially those that are otherwise difficult to convey without computation> simulations

> sampling distributions > confidence levels

> bootstrapping > randomization tests > ...

> hide implementation issues that are outside the scope of the course

> BALANCE

Page 19: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> teach coding as a way of introducing/reinforcing concepts, especially those that are otherwise difficult to convey without computation> simulations

> sampling distributions > confidence levels

> bootstrapping > randomization tests > ...

> hide implementation issues that are outside the scope of the course

> minimize coding for repeated mechanics that can be unified

> BALANCE

Page 20: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

concept: confidence levelsresample from the population many times and construct many confidence intervals (loops)

> TEACH

Page 21: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

concept: confidence levelsresample from the population many times and construct many confidence intervals (loops)

> TEACH

Lab 4B: Foundations for statistical inference - Confidence levels

Sampling from Ames, Iowa

If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation isthere in sizes of houses?”. If you have access to only a sample of the population, as is often the case, thetask becomes more complicated. What is your best guess for the typical size if you only know the sizes ofseveral dozen houses? This sort of situation requires that you use your sample to make inference on whatyour population looks like.

The data

In the previous lab we looked at the population data of houses from Ames, Iowa. Let’s start with loadingthat data set.

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")

load("ames.RData")

In this lab we’ll start with just a sample from the population, which is a more realistic situation. Specifically,this is a simple random sample of size 60. Note that the data set has information on many variables onthese houses, but for the first portion of the lab we’ll focus on the size of the house, represented by thevariable Gr.Liv.Area.

population <- ames$Gr.Liv.Area

sample <- sample(population, 60)

Exercise 1 Describe the distribution of your sample. What would you say is the “typical” sizewithin your sample? Also state precisely what you interpreted “typical” to mean.

Exercise 2 Now compare your distribution to your neighbor’s. Do they look similar? Are theyidentical? Why, or why not?

Confidence intervals

One of the most common ways to describe the typical or central value of a distribution is to use the mean.In this case we can calculate the mean age size our sample:

sample_mean <- mean(sample)

Return for a moment to the question that first motivated this lab: based on this sample, what can weinfer about the population? More specifically, what is our best estimate of the mean size of the houses in

This is a product of OpenIntro that is released under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported(http://creativecommons.org/licenses/by-nc-nd/3.0/). This lab was written for OpenIntro by Andrew Bray and Mine Cetinkaya-Rundel.

1

Page 22: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

concept: confidence levelsresample from the population many times and construct many confidence intervals (loops)

> TEACHHere is the rough outline:

(1) Obtain a random sample.

(2) Calculate its means and standard deviation.

(3) Use these statistics to calculate a confidence interval.

(4) Repeat steps (1)-(3) 50 times.

But before we do all of this, we need to first create empty vectors where we can save the means andstandard deviations that will be calculated in step (2). And while we’re at it, let’s also store the desiredsample size as n.

samp_mean <- rep(NA, 50)

samp_sd <- rep(NA, 50)

n <- 60

Now we’re ready for the loop where we obtain 50 random samples and quickly calculate and save theirmeans and standard deviations.

for (i in 1:50) {

samp <- sample(population, n) # obtain a sample of size n = 60 from the population

samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean

samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd

}

Lastly, we construct the confidence intervals.

lower <- samp_mean - 1.96 * samp_sd/sqrt(n)

upper <- samp_mean + 1.96 * samp_sd/sqrt(n)

Lower bounds of these 50 confidence intervals are stored in lower, and the upper bounds are in upper.Let’s view the first interval.

c(lower[1], upper[1])

But browsing these intervals one-by-one will get tedious...

3

Lab 4B: Foundations for statistical inference - Confidence levels

Sampling from Ames, Iowa

If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation isthere in sizes of houses?”. If you have access to only a sample of the population, as is often the case, thetask becomes more complicated. What is your best guess for the typical size if you only know the sizes ofseveral dozen houses? This sort of situation requires that you use your sample to make inference on whatyour population looks like.

The data

In the previous lab we looked at the population data of houses from Ames, Iowa. Let’s start with loadingthat data set.

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")

load("ames.RData")

In this lab we’ll start with just a sample from the population, which is a more realistic situation. Specifically,this is a simple random sample of size 60. Note that the data set has information on many variables onthese houses, but for the first portion of the lab we’ll focus on the size of the house, represented by thevariable Gr.Liv.Area.

population <- ames$Gr.Liv.Area

sample <- sample(population, 60)

Exercise 1 Describe the distribution of your sample. What would you say is the “typical” sizewithin your sample? Also state precisely what you interpreted “typical” to mean.

Exercise 2 Now compare your distribution to your neighbor’s. Do they look similar? Are theyidentical? Why, or why not?

Confidence intervals

One of the most common ways to describe the typical or central value of a distribution is to use the mean.In this case we can calculate the mean age size our sample:

sample_mean <- mean(sample)

Return for a moment to the question that first motivated this lab: based on this sample, what can weinfer about the population? More specifically, what is our best estimate of the mean size of the houses in

This is a product of OpenIntro that is released under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported(http://creativecommons.org/licenses/by-nc-nd/3.0/). This lab was written for OpenIntro by Andrew Bray and Mine Cetinkaya-Rundel.

1

Page 23: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

concept: confidence levelsresample from the population many times and construct many confidence intervals (loops)

> TEACHHere is the rough outline:

(1) Obtain a random sample.

(2) Calculate its means and standard deviation.

(3) Use these statistics to calculate a confidence interval.

(4) Repeat steps (1)-(3) 50 times.

But before we do all of this, we need to first create empty vectors where we can save the means andstandard deviations that will be calculated in step (2). And while we’re at it, let’s also store the desiredsample size as n.

samp_mean <- rep(NA, 50)

samp_sd <- rep(NA, 50)

n <- 60

Now we’re ready for the loop where we obtain 50 random samples and quickly calculate and save theirmeans and standard deviations.

for (i in 1:50) {

samp <- sample(population, n) # obtain a sample of size n = 60 from the population

samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean

samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd

}

Lastly, we construct the confidence intervals.

lower <- samp_mean - 1.96 * samp_sd/sqrt(n)

upper <- samp_mean + 1.96 * samp_sd/sqrt(n)

Lower bounds of these 50 confidence intervals are stored in lower, and the upper bounds are in upper.Let’s view the first interval.

c(lower[1], upper[1])

But browsing these intervals one-by-one will get tedious...

3

Here is the rough outline:

(1) Obtain a random sample.

(2) Calculate its means and standard deviation.

(3) Use these statistics to calculate a confidence interval.

(4) Repeat steps (1)-(3) 50 times.

But before we do all of this, we need to first create empty vectors where we can save the means andstandard deviations that will be calculated in step (2). And while we’re at it, let’s also store the desiredsample size as n.

samp_mean <- rep(NA, 50)

samp_sd <- rep(NA, 50)

n <- 60

Now we’re ready for the loop where we obtain 50 random samples and quickly calculate and save theirmeans and standard deviations.

for (i in 1:50) {

samp <- sample(population, n) # obtain a sample of size n = 60 from the population

samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean

samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd

}

Lastly, we construct the confidence intervals.

lower <- samp_mean - 1.96 * samp_sd/sqrt(n)

upper <- samp_mean + 1.96 * samp_sd/sqrt(n)

Lower bounds of these 50 confidence intervals are stored in lower, and the upper bounds are in upper.Let’s view the first interval.

c(lower[1], upper[1])

But browsing these intervals one-by-one will get tedious...

3

Lab 4B: Foundations for statistical inference - Confidence levels

Sampling from Ames, Iowa

If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation isthere in sizes of houses?”. If you have access to only a sample of the population, as is often the case, thetask becomes more complicated. What is your best guess for the typical size if you only know the sizes ofseveral dozen houses? This sort of situation requires that you use your sample to make inference on whatyour population looks like.

The data

In the previous lab we looked at the population data of houses from Ames, Iowa. Let’s start with loadingthat data set.

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")

load("ames.RData")

In this lab we’ll start with just a sample from the population, which is a more realistic situation. Specifically,this is a simple random sample of size 60. Note that the data set has information on many variables onthese houses, but for the first portion of the lab we’ll focus on the size of the house, represented by thevariable Gr.Liv.Area.

population <- ames$Gr.Liv.Area

sample <- sample(population, 60)

Exercise 1 Describe the distribution of your sample. What would you say is the “typical” sizewithin your sample? Also state precisely what you interpreted “typical” to mean.

Exercise 2 Now compare your distribution to your neighbor’s. Do they look similar? Are theyidentical? Why, or why not?

Confidence intervals

One of the most common ways to describe the typical or central value of a distribution is to use the mean.In this case we can calculate the mean age size our sample:

sample_mean <- mean(sample)

Return for a moment to the question that first motivated this lab: based on this sample, what can weinfer about the population? More specifically, what is our best estimate of the mean size of the houses in

This is a product of OpenIntro that is released under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported(http://creativecommons.org/licenses/by-nc-nd/3.0/). This lab was written for OpenIntro by Andrew Bray and Mine Cetinkaya-Rundel.

1

Page 24: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

concept: confidence levelsresample from the population many times and construct many confidence intervals (loops)

> TEACHHere is the rough outline:

(1) Obtain a random sample.

(2) Calculate its means and standard deviation.

(3) Use these statistics to calculate a confidence interval.

(4) Repeat steps (1)-(3) 50 times.

But before we do all of this, we need to first create empty vectors where we can save the means andstandard deviations that will be calculated in step (2). And while we’re at it, let’s also store the desiredsample size as n.

samp_mean <- rep(NA, 50)

samp_sd <- rep(NA, 50)

n <- 60

Now we’re ready for the loop where we obtain 50 random samples and quickly calculate and save theirmeans and standard deviations.

for (i in 1:50) {

samp <- sample(population, n) # obtain a sample of size n = 60 from the population

samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean

samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd

}

Lastly, we construct the confidence intervals.

lower <- samp_mean - 1.96 * samp_sd/sqrt(n)

upper <- samp_mean + 1.96 * samp_sd/sqrt(n)

Lower bounds of these 50 confidence intervals are stored in lower, and the upper bounds are in upper.Let’s view the first interval.

c(lower[1], upper[1])

But browsing these intervals one-by-one will get tedious...

3

Here is the rough outline:

(1) Obtain a random sample.

(2) Calculate its means and standard deviation.

(3) Use these statistics to calculate a confidence interval.

(4) Repeat steps (1)-(3) 50 times.

But before we do all of this, we need to first create empty vectors where we can save the means andstandard deviations that will be calculated in step (2). And while we’re at it, let’s also store the desiredsample size as n.

samp_mean <- rep(NA, 50)

samp_sd <- rep(NA, 50)

n <- 60

Now we’re ready for the loop where we obtain 50 random samples and quickly calculate and save theirmeans and standard deviations.

for (i in 1:50) {

samp <- sample(population, n) # obtain a sample of size n = 60 from the population

samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean

samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd

}

Lastly, we construct the confidence intervals.

lower <- samp_mean - 1.96 * samp_sd/sqrt(n)

upper <- samp_mean + 1.96 * samp_sd/sqrt(n)

Lower bounds of these 50 confidence intervals are stored in lower, and the upper bounds are in upper.Let’s view the first interval.

c(lower[1], upper[1])

But browsing these intervals one-by-one will get tedious...

3

Here is the rough outline:

(1) Obtain a random sample.

(2) Calculate its means and standard deviation.

(3) Use these statistics to calculate a confidence interval.

(4) Repeat steps (1)-(3) 50 times.

But before we do all of this, we need to first create empty vectors where we can save the means andstandard deviations that will be calculated in step (2). And while we’re at it, let’s also store the desiredsample size as n.

samp_mean <- rep(NA, 50)

samp_sd <- rep(NA, 50)

n <- 60

Now we’re ready for the loop where we obtain 50 random samples and quickly calculate and save theirmeans and standard deviations.

for (i in 1:50) {

samp <- sample(population, n) # obtain a sample of size n = 60 from the population

samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean

samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd

}

Lastly, we construct the confidence intervals.

lower <- samp_mean - 1.96 * samp_sd/sqrt(n)

upper <- samp_mean + 1.96 * samp_sd/sqrt(n)

Lower bounds of these 50 confidence intervals are stored in lower, and the upper bounds are in upper.Let’s view the first interval.

c(lower[1], upper[1])

But browsing these intervals one-by-one will get tedious...

3

Lab 4B: Foundations for statistical inference - Confidence levels

Sampling from Ames, Iowa

If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation isthere in sizes of houses?”. If you have access to only a sample of the population, as is often the case, thetask becomes more complicated. What is your best guess for the typical size if you only know the sizes ofseveral dozen houses? This sort of situation requires that you use your sample to make inference on whatyour population looks like.

The data

In the previous lab we looked at the population data of houses from Ames, Iowa. Let’s start with loadingthat data set.

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")

load("ames.RData")

In this lab we’ll start with just a sample from the population, which is a more realistic situation. Specifically,this is a simple random sample of size 60. Note that the data set has information on many variables onthese houses, but for the first portion of the lab we’ll focus on the size of the house, represented by thevariable Gr.Liv.Area.

population <- ames$Gr.Liv.Area

sample <- sample(population, 60)

Exercise 1 Describe the distribution of your sample. What would you say is the “typical” sizewithin your sample? Also state precisely what you interpreted “typical” to mean.

Exercise 2 Now compare your distribution to your neighbor’s. Do they look similar? Are theyidentical? Why, or why not?

Confidence intervals

One of the most common ways to describe the typical or central value of a distribution is to use the mean.In this case we can calculate the mean age size our sample:

sample_mean <- mean(sample)

Return for a moment to the question that first motivated this lab: based on this sample, what can weinfer about the population? More specifically, what is our best estimate of the mean size of the houses in

This is a product of OpenIntro that is released under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported(http://creativecommons.org/licenses/by-nc-nd/3.0/). This lab was written for OpenIntro by Andrew Bray and Mine Cetinkaya-Rundel.

1

Page 25: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> HIDE

concept: confidence levelsplot these confidence intervals and highlight those that do not contain the true population parameter (custom function)

Page 26: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> HIDE

concept: confidence levelsplot these confidence intervals and highlight those that do not contain the true population parameter (custom function)

On your own

1. Using the following custom function, plot all intervals. What proportion of your confidence intervalsinclude the true population mean? Is this proportion exactly equal to the confidence level? If no,explain why.†

plot_ci(lower, upper, mean(population))

2. Pick a confidence level of your choosing. What is the appropriate critical value?

3. Calculate 50 confidence intervals at this confidence level. You do not need to obtain new samples,simply calculate new intervals based on the samples you have already collected. Using the plot ci

function plot all intervals and calculate the proportion of intervals that include the true populationmean. How does this percentage compare to the confidence level you picked?

4. What concepts from the textbook are covered in this lab? What concepts, if any, are not covered inthe textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs,or homework problems? Be specific in your answer.

†This figure should look familiar, if not, see Section 4.2.2.

4

Page 27: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> HIDEConfidence levels

Resample from the population many times and construct manyconfidence intervals (loops)Plot these confidence intervals and highlight those that do notcontain the true population parameter (custom function)

plot.ci(lower, upper, mean(pop))

mu = 1499.6904

Cetinkaya-Rundel, Bray Integrating R into Introductory Statistics useR! 2012 - June 13, 2012 6 / 15

concept: confidence levelsplot these confidence intervals and highlight those that do not contain the true population parameter (custom function)

On your own

1. Using the following custom function, plot all intervals. What proportion of your confidence intervalsinclude the true population mean? Is this proportion exactly equal to the confidence level? If no,explain why.†

plot_ci(lower, upper, mean(population))

2. Pick a confidence level of your choosing. What is the appropriate critical value?

3. Calculate 50 confidence intervals at this confidence level. You do not need to obtain new samples,simply calculate new intervals based on the samples you have already collected. Using the plot ci

function plot all intervals and calculate the proportion of intervals that include the true populationmean. How does this percentage compare to the confidence level you picked?

4. What concepts from the textbook are covered in this lab? What concepts, if any, are not covered inthe textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs,or homework problems? Be specific in your answer.

†This figure should look familiar, if not, see Section 4.2.2.

4

Page 28: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> MINIMIZE

concept: statistical inference

Page 29: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> MINIMIZE

> traditional curriculum for an introductory statistics course includes various statistical inference techniques

concept: statistical inference

Page 30: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> MINIMIZE

> traditional curriculum for an introductory statistics course includes various statistical inference techniques

> when introduced as disconnected topic these can be overwhelming to students

concept: statistical inference

Page 31: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> MINIMIZE

> traditional curriculum for an introductory statistics course includes various statistical inference techniques

> when introduced as disconnected topic these can be overwhelming to students

> to help unify inferential concepts, use one function that does it all, but still requires students to think about the nature of the data and encourages them to conduct exploratory data analysis

concept: statistical inference

Page 32: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> EXAMPLE

concept: statistical inferencefunction: inference() - theoretical and simulation based inference

Page 33: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

inference <- function(data, group = NULL, est = c("mean", "median", "proportion"),

success = NULL, order = NULL, nsim = 10000, conflevel = 0.95, null = NULL,

alternative = c("less", "greater", "twosided"), type = c("ci", "ht"),

method = c("theoretical", "simulation"), drawlines = "yes", simdist = FALSE){

...

}

3

> EXAMPLE

concept: statistical inferencefunction: inference() - theoretical and simulation based inference

Page 34: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

inference <- function(data, group = NULL, est = c("mean", "median", "proportion"),

success = NULL, order = NULL, nsim = 10000, conflevel = 0.95, null = NULL,

alternative = c("less", "greater", "twosided"), type = c("ci", "ht"),

method = c("theoretical", "simulation"), drawlines = "yes", simdist = FALSE){

...

}

3

> data: response variable, quantitative or categorical

> group: explanatory variable, categorical for grouping (optional)

> type: confidence interval or hypothesis test

> method: theoretical or simulation

> ...

> EXAMPLE

concept: statistical inferencefunction: inference() - theoretical and simulation based inference

Page 35: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

question: compare birth weights of babies born to smoker and nonsmoker mothers (source: north carolina births)

> USE IN LABS

Page 36: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

Exercise 2 Make a side-by-side boxplot of habit and weight. What does the plot tell us aboutthe relationship between the two variables habit and weight?

The side-by-side box plots show how the medians of the two distributions compare, and we can alsocompare the means of the distributions. The following command gives the mean weights of babies bornto smoker and non-smoker mothers.

by(nc$weight, nc$habit, mean)

There is clearly an observed difference, but is this difference statistically significant? In order to answerthis question we need to conduct a hypothesis test.

Exercise 3 Check if the conditions necessary for inference are satisfied? Note that you willneed to obtain sample sizes to check the conditions.

Exercise 4 Write the hypotheses for testing if the average weights of babies born to smokerand non-smoker mothers are different.

Next, we introduce a new function, inference, that we will use for conducting hypothesis tests andconstructing confidence intervals.

inference(data = nc$weight, group = nc$habit, est = "mean", type = "ht", null = 0,

alternative = "twosided", method = "theoretical")

Let’s pause for a moment to go through the arguments of this custom function.

• The first argument is data, this is the response variable that we are interested in: weight

• The second argument is the grouping variable, group, this is the variable that we use to split the datainto two groups, smokers and nonsmokers: habit.

• The third argument (est) is the parameter we’re interested in: mean (other options are median, orproportion.)

• Next we decide on the type of inference we want: a hypothesis test (ht) or a confidence interval (ci).

• When doing a hypothesis test we also need to supply the null value, which in this case is 0, sincethe null hypothesis sets the two population means equal to each other.

• The alternative hypothesis can be less, greater, twosided.

• Lastly, the method of inference can be theoretical or simulation based.

Exercise 5 Change the type argument to ci construct a confidence interval for the differencebetween the weights of babies born to smoker and non-smoker mothers.

By default the function reports an interval for (µnonsmoker

� µsmoker

), we can easily change this order:

inference(data = nc$weight, group = nc$habit, est = "mean", type = "ci", null = 0,

alternative = "twosided", method = "theoretical", order = c("smoker", "nonsmoker"))

2

input:

question: compare birth weights of babies born to smoker and nonsmoker mothers (source: north carolina births)

> USE IN LABS

Page 37: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

Exercise 2 Make a side-by-side boxplot of habit and weight. What does the plot tell us aboutthe relationship between the two variables habit and weight?

The side-by-side box plots show how the medians of the two distributions compare, and we can alsocompare the means of the distributions. The following command gives the mean weights of babies bornto smoker and non-smoker mothers.

by(nc$weight, nc$habit, mean)

There is clearly an observed difference, but is this difference statistically significant? In order to answerthis question we need to conduct a hypothesis test.

Exercise 3 Check if the conditions necessary for inference are satisfied? Note that you willneed to obtain sample sizes to check the conditions.

Exercise 4 Write the hypotheses for testing if the average weights of babies born to smokerand non-smoker mothers are different.

Next, we introduce a new function, inference, that we will use for conducting hypothesis tests andconstructing confidence intervals.

inference(data = nc$weight, group = nc$habit, est = "mean", type = "ht", null = 0,

alternative = "twosided", method = "theoretical")

Let’s pause for a moment to go through the arguments of this custom function.

• The first argument is data, this is the response variable that we are interested in: weight

• The second argument is the grouping variable, group, this is the variable that we use to split the datainto two groups, smokers and nonsmokers: habit.

• The third argument (est) is the parameter we’re interested in: mean (other options are median, orproportion.)

• Next we decide on the type of inference we want: a hypothesis test (ht) or a confidence interval (ci).

• When doing a hypothesis test we also need to supply the null value, which in this case is 0, sincethe null hypothesis sets the two population means equal to each other.

• The alternative hypothesis can be less, greater, twosided.

• Lastly, the method of inference can be theoretical or simulation based.

Exercise 5 Change the type argument to ci construct a confidence interval for the differencebetween the weights of babies born to smoker and non-smoker mothers.

By default the function reports an interval for (µnonsmoker

� µsmoker

), we can easily change this order:

inference(data = nc$weight, group = nc$habit, est = "mean", type = "ci", null = 0,

alternative = "twosided", method = "theoretical", order = c("smoker", "nonsmoker"))

2

input:

nonsmoker smoker

24

68

1012

-0.32 0 0.32nonsmoker smoker

24

68

1012

-0.32 0 0.32

One quantitative and one categorical variableDifference between two meansn_nonsmoker = 873 ; n_smoker = 126 Observed difference between means = 0.3155H0: mu_nonsmoker - mu_smoker = 0 HA: mu_nonsmoker - mu_smoker != 0 Standard error = 0.134 Test statistic: Z = 2.359 p-value: 0.0184

output:

question: compare birth weights of babies born to smoker and nonsmoker mothers (source: north carolina births)

> USE IN LABS

Page 38: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

question: compare numbers of sexual partners of males and females (source: national survey of family growth)

> USE IN PROJECTS

Page 39: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

question: compare numbers of sexual partners of males and females (source: national survey of family growth)

input:inference(data = partners, group = gender, type = "ci", est = "mean", method =

"theoretical")

3

> USE IN PROJECTS

Page 40: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

question: compare numbers of sexual partners of males and females (source: national survey of family growth)

input:inference(data = partners, group = gender, type = "ci", est = "mean", method =

"theoretical")

3

output:

female male

01

23

45

67

One quantitative and one categorical variableDifference between two meansn_female = 12190 ; n_male = 10397 Observed difference between means = -0.432Standard error = 0.0361 95 % Confidence interval = ( -0.5 , -0.36 )

> USE IN PROJECTS

Page 41: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> resources

Page 43: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> LAB 2 - PROBABILITY

Page 44: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> LAB 8 - MULTIPLE REGRESSION

Page 45: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> reactions

Page 46: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> STUDENT REACTIONS - LABS

positive:> ``I like them. I feel like in the real world we’ll be using software

to do stats, so I’m glad we’re learning how to use it.’’

> ``I LOVE the labs. They really help cement basic statistic ideas, and I especially love that you can finish them in class.’’

> ``The labs are a lot of fun. It’s great being able to create our own simulations and watch R Studio calculate everything. I also enjoy learning some code.’’

negative:> ``The labs are alright. Sometimes I feel like I’m just plugging in

stuff and I feel disconnected from what I’m really doing. It’s also frustrating when the code doesn’t work.’’

> ``Wish other students focused more.’’

Page 47: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> STUDENT REACTIONS - R

positive:> ``Super useful and powerful software. It’s exciting to be introduced to it.

Once again, don’t always feel comfortable writing code/ understanding what I’m doing.’’

> ``I like it! I kind of know MATLAB, which has helped with the coding a bit, but it’s a little more intuitive/easier, and very helpful.’’

> ``I am not a computer person at all, but I find RStudio very easy to use.’’> ``I like it better than STATA which we used for [another class]. The user

interface is easy and there is plenty of help for it online. Overall, it’s pretty good.’’

negative:> ``I am not a fan of coding in general. I used Python before and RStudio is

better (for me) than Python was, but I am not a fan of either.’’> ``Easy to use, language is not too hard to understand although error

messages could be more informative.’’> ``I don’t think RStudio will have any use to me outside of this class.’’

Page 48: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> also...

Page 49: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> ADDITIONAL CONSIDERATIONS

Page 50: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> ADDITIONAL CONSIDERATIONS

> Labs should be fully integrated with the curriculum

Page 51: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> ADDITIONAL CONSIDERATIONS

> Labs should be fully integrated with the curriculum``What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs, or homework problems? Be specific in your answer.’’

Page 52: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> ADDITIONAL CONSIDERATIONS

> Labs should be fully integrated with the curriculum``What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs, or homework problems? Be specific in your answer.’’

> Works best in a classroom environment where students can collaborate with each other and get immediate support

Page 53: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> ADDITIONAL CONSIDERATIONS

> Labs should be fully integrated with the curriculum``What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs, or homework problems? Be specific in your answer.’’

> Works best in a classroom environment where students can collaborate with each other and get immediate support

``Collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.”

Page 54: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> ADDITIONAL CONSIDERATIONS

> Labs should be fully integrated with the curriculum``What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, discussion section, previous labs, or homework problems? Be specific in your answer.’’

> Works best in a classroom environment where students can collaborate with each other and get immediate support

``Collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.”

> TAs need to be familiar and comfortable with the material

Page 55: > integrating R into > introductory statisticsmc301/talks/Integrate_R_JSM2012.pdf · > integrating R into > introductory statistics Mine Çetinkaya-Rundel - Duke University Andrew

> thank you