
Upload: eric-hutchinson

Post on 27-Dec-2015


Stat 111 - Lecture 1 - Introduction 1

Welcome to Statistics 111

Alex Braunstein

The goal of this course is to develop basic tools for data analysis, probability and statistical methods. Key topics covered in the course include exploratory data analysis, regression, probability, estimation, and hypothesis testing.

Syllabus Notes: Website

• All handouts will be available on the website:

http://stat.wharton.upenn.edu/~braunsf/stat111.html

• The website also contains my contact information

• The website has a link for getting a Wharton class account if you are not a Wharton student
  • Helpful if you want to use Wharton computer labs

Syllabus Notes: Homework

• Homework will be handed out at the beginning of every week
  • About 5 homework assignments in all

• Homework is to be submitted at the beginning of class on Mondays

• You are encouraged to work together on homework, but each assignment is to be completed separately and handed in individually
  • Do not copy from another person

• No late homework will be accepted!
  • Late homework will get a score of zero, without exception
  • Your lowest homework grade is not included in the final grade

Syllabus Notes: Midterm Exam

• The midterm will be held on the following date:

Monday, June 15th (in class)

• No makeup midterm examination!
  • A missed midterm exam counts as a zero score
  • Consider taking this class in the fall or spring if you cannot attend the midterm!


Student Questionnaire

• Fill out a questionnaire and hand it in before the break

• I will try to incorporate some of the subjects that interest you into future lectures

Course Overview

[Diagram: course flow chart with the boxes Collecting Data; Exploring Data; Probability Intro.; Inference; Comparing Variables (Means, Proportions); Relationships between Variables (Regression, Contingency Tables). The boxes carry numeric labels.]

Out in public: You do statistics?!?

• I hated that class in college!

• That was the most boring class ever!

• Lame.

Big Picture Ideas

• Statistics is all about uncertainty
  • Focus as much on what we don’t know (or haven’t observed) as on what we know

• Formulating the question that we want to answer is often the most difficult part

• Statistics is part mathematics, part roll-up-your-sleeves-and-get-thinking

Science and Skepticism

• We always need to be cautious about conclusions based on data
  • Possible sources of bias and confounding?
  • How might things have gone wrong?

• A little bit of skepticism is a good thing!


Statistical Modeling

• Inference: using mathematical models of uncertainty to answer questions

• Connect probability concepts to our data

• We cannot make claims without using models and making assumptions

• Are the assumptions reasonable?


After the break

• Collecting Data: Design of Experiments

• Sections 3.1-3.2 in Moore, McCabe and Craig

• First couple of classes will not involve much math at all, but we will get into lots of data analysis after that!


Break!

• Hand in questionnaire

• 5 minutes


Stat 111 - Lecture 2 - Experiments


Outline for Second Half of Lecture

• Introduction to Experiments
• Sources of Bias in Experiments
• Techniques for Avoiding Bias
  • Matching
  • Randomization
  • Block Designs
  • Blinding and Double-Blinding
• Experiments vs. Observational Studies
• Association vs. Causation

Experiments

• Used to address a specific question
• Often used to examine causal effects
• E.g. medical trials, education interventions

[Diagram: Population, then Experimental Units, which split into a Treatment Group (Treatment, then Result) and a Control Group (No Treatment, then Result). The stages carry the labels 1 to 4.]

• Can we just look at the difference in results to get the causal effect of the treatment?

• Depends on whether the experiment was done well
  • Many possible sources of bias in the design of experiments

Sources of Bias

• An experiment or study is biased if it systematically favors a particular outcome

1. Subjects are not representative of the population
2. Treatment and control groups are inherently different on some lurking or confounding variable
3. Subjects are influenced by knowing they are in treatment or control groups
4. Evaluator of outcomes is influenced by knowing which subjects are in treatment or control groups

[Diagram: Population, then Experimental Units, which split into a Treatment Group (Treatment, then Result) and a Control Group (No Treatment, then Result). The stages carry the labels 1 to 4.]

Bias 1: Non-representative units

• If your subjects are not representative of the population, you won’t be able to generalize the results even if the experiment is well done

• Example:
  • Treatment group: High Level NICUs
  • Control group: Low Level NICUs
  • Problem: classification of NICUs differs from state to state, so a hospital that qualifies as a high level NICU in one state might not in another
  • Observed differences between the groups cannot be generalized from one state to another

Bias 2: Confounding/Lurking Variables

• Treatment group and control group are different on some variable that also influences the outcome

• A confounding variable means that we can’t attribute the difference in outcomes to just the treatment
  • Part of the difference may be due to the confounding variable, not the treatment

• Simple example: a breast cancer drug trial where only women receive the treatment and only men receive the control
  • Gender becomes a confounding variable
  • Are treatment vs. control outcomes different due to the treatment or due to gender differences between groups?
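The confounding idea can be illustrated with a small sketch. It assumes a softer version of the example, where each group contains both genders but in unequal proportions; every record and the helper name `group_mean` are hypothetical and not from the course. The outcomes are constructed so that the treatment has no effect within either gender, yet the raw comparison of groups still shows a difference.

```python
from statistics import mean

# Hypothetical records: (group, gender, outcome). All numbers are made up.
# Within each gender, treated and control subjects have identical outcomes,
# but the treatment group is mostly female and the control group mostly male.
records = [
    ("treatment", "F", 8), ("treatment", "F", 8), ("treatment", "F", 8),
    ("treatment", "M", 4),
    ("control", "F", 8),
    ("control", "M", 4), ("control", "M", 4), ("control", "M", 4),
]

def group_mean(rows, group, gender=None):
    """Mean outcome for one group, optionally restricted to one gender."""
    values = [y for g, s, y in rows if g == group and (gender is None or s == gender)]
    return mean(values)

# Naive comparison: ignores gender entirely.
naive_diff = group_mean(records, "treatment") - group_mean(records, "control")

# Comparison within each gender: holds the confounder fixed.
within_gender_diff = {
    s: group_mean(records, "treatment", s) - group_mean(records, "control", s)
    for s in ("F", "M")
}

print("naive difference:", naive_diff)
print("difference within gender:", within_gender_diff)
```

In the slide's extreme case (only women treated, only men in control) the within-gender comparison is not even possible, which is why the two explanations cannot be separated there.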

Bias 3: Subject knows treatment assignment

• A subject’s outcome is influenced by knowing that he/she is in a treatment or control group
  • E.g. drug trials: patients improve just because they think they are receiving the drug

• Solution: blinded experiment with placebo
  • Placebo appears to be the treatment, so all subjects (treatment and control) don’t know their true treatment assignment
  • Control subjects may still show slightly improved outcomes; this is often called “the placebo effect”


Bias 4: Evaluator knows treatment assignment

• Person evaluating outcome (eg. doctor in drug trial) may also be influenced by knowing who receives treatment

• Not a problem if outcome is something indisputable, such as death!

• This is a problem for more subjective measures like pain reduction or results from social programs

• Solution: double-blinded experiment where neither subjects nor evaluators know treatment assignments

Association vs Causation

• In the presence of a confounding variable, we can only conclude there is an association between treatment and outcome, not causation

Examples: “Reporters are stupid”

• Children who watch many hours of TV get lower grades in school on average than those who watch less TV
  • Does this mean that TV causes poor grades?
  • What are potential confounding variables?

• People who use artificial sweeteners in place of sugar tend to be heavier than people who use sugar
  • Does this mean that sweeteners cause weight gain?
  • What is probably happening here?

One solution: Matching

• Make sure that treatment and control groups are very similar on observed variables like race, gender, age, etc.
  • Block designs: divide subjects into blocks with similar observed variables before dividing them into treatment vs. control

• Special case: Matched Pairs
  • Subjects are matched up into pairs, then one member of each pair gets treatment and the other gets control

• Example: Dandruff experiment
  • Treatment applied to one side and control to the other side of the head
  • No reason to expect a difference between sides except for treatment

Another Solution: Randomization

• Problem with matching is that you usually cannot match on unobserved characteristics (e.g. genetics)
  • E.g. cholesterol drug trial: can’t match treatment and control groups on genetic predisposition for high cholesterol

• Randomly assign subjects to treatment or control

• Random assignment should lead to groups that are similar or balanced on both observed and unobserved confounding variables

• Example: student questionnaire earlier in class - each form you filled out was randomly assigned either a 1 or 2
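As a rough illustration of random assignment, here is a minimal sketch. The function name `randomize`, the subject labels, and the seed are hypothetical; the equal split into two halves is an assumption, not something the course specifies.

```python
import random

def randomize(subjects, seed=None):
    """Shuffle the subjects and split them into two halves (treatment, control)."""
    rng = random.Random(seed)      # a seed makes the assignment reproducible
    shuffled = list(subjects)      # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject labels.
subjects = [f"subject_{i}" for i in range(1, 21)]
treatment, control = randomize(subjects, seed=111)

print("treatment:", treatment)
print("control:  ", control)
```

Because the shuffle ignores every characteristic of the subjects, observed and unobserved variables alike have no systematic way to line up with the treatment assignment.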

Randomization of In-Class Survey

• Check to see if groups are balanced:

[Table: columns Variable, Treatment, Control; rows Average Height, Average Shoe Size, Average Number of Siblings; cell values not recorded]

• There are differences, but are they “significant”?
  • Later on in the course, we will be able to answer questions like this

• Of course, we can’t check the balance for unobserved variables… we just have to trust the randomization process
  • This is why good science needs to be replicable
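A balance check of this kind can be sketched as follows. The responses are invented for illustration (the actual class data are not recorded here), and the field names and the helper `balance_table` are hypothetical. Groups are labeled 1 and 2, matching the numbers randomly assigned to the forms.

```python
from statistics import mean

# Hypothetical questionnaire responses; every value is invented.
responses = [
    {"group": 1, "height_in": 70, "shoe_size": 10, "siblings": 1},
    {"group": 1, "height_in": 64, "shoe_size": 7, "siblings": 2},
    {"group": 1, "height_in": 67, "shoe_size": 10, "siblings": 0},
    {"group": 2, "height_in": 66, "shoe_size": 8, "siblings": 2},
    {"group": 2, "height_in": 72, "shoe_size": 11, "siblings": 1},
    {"group": 2, "height_in": 69, "shoe_size": 8, "siblings": 3},
]

def balance_table(rows, variables, group_key="group"):
    """Return {variable: {group: average of that variable within the group}}."""
    groups = sorted({row[group_key] for row in rows})
    return {
        var: {g: mean([row[var] for row in rows if row[group_key] == g]) for g in groups}
        for var in variables
    }

table = balance_table(responses, ["height_in", "shoe_size", "siblings"])
for variable, by_group in table.items():
    print(variable, by_group)
```

Whether a gap between the two columns is larger than chance alone would produce is the "significance" question that the course takes up later.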

Even Better: Randomization + Matching

• Randomization generally leads to treatment and control groups that are evenly balanced, but you can still get unlucky and get unbalanced groups

• Example: randomly placing 20 people (10 males, 10 females) into treatment and control groups
  • How many males will end up in the treatment group?
  • Ideally, we would have 5 males in the treatment group and 5 males in the control group (balanced)
  • However, there is a chance of getting 9 males in the treatment group and 1 male in the control group (unbalanced)
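The chance of each split can be worked out by counting. This sketch assumes the 20 people are divided into two groups of exactly 10, which the example does not state explicitly; under that assumption the number of males in the treatment group follows a hypergeometric distribution. The function name is hypothetical.

```python
from math import comb

def prob_males_in_treatment(k, males=10, females=10, treatment_size=10):
    """P(exactly k males in the treatment group) under simple random assignment.

    Counts the ways to pick k of the males and (treatment_size - k) of the
    females, divided by all ways to pick the treatment group.
    """
    total = males + females
    ways = comb(males, k) * comb(females, treatment_size - k)
    return ways / comb(total, treatment_size)

p_balanced = prob_males_in_treatment(5)   # the ideal 5/5 split
p_nine = prob_males_in_treatment(9)       # 9 males in treatment, 1 in control
# Any split at least as lopsided as 9 to 1, in either direction.
p_lopsided = sum(prob_males_in_treatment(k) for k in (0, 1, 9, 10))

print(f"P(5 males in treatment) = {p_balanced:.4f}")
print(f"P(9 males in treatment) = {p_nine:.6f}")
print(f"P(split as extreme as 9 to 1) = {p_lopsided:.6f}")
```

Under these assumptions the perfectly balanced split happens only about a third of the time, while the 9 to 1 split is rare (well under 1 in 1,000) but not impossible.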

Even Better: Randomization + Matching

• Randomized Blocks: randomize within blocks of observed variables
  • Example: divide up subjects into males and females first, then randomly assign treatment or control to subjects in each group separately
  • Guarantees that an equal number of males end up in the treatment group and the control group (same with females)

• Randomized Matched Pairs: randomly decide which member of each pair gets treatment vs. control
  • Example: for each head in the dandruff experiment, randomly assign which side of the head gets dandruff shampoo vs. control
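Both designs can be sketched in a few lines. The function names, the subject tuples, and the seed are hypothetical, and the sketch assumes each block has an even number of subjects so that it can be split exactly in half.

```python
import random
from collections import defaultdict

def randomized_blocks(subjects, block_of, seed=None):
    """Randomly assign treatment/control separately within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for subject in subjects:
        blocks[block_of(subject)].append(subject)
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)               # randomize order within the block
        half = len(members) // 2
        for subject in members[:half]:
            assignment[subject] = "treatment"
        for subject in members[half:]:
            assignment[subject] = "control"
    return assignment

def randomized_matched_pairs(pairs, seed=None):
    """For each pair, flip a coin to decide which member gets the treatment."""
    rng = random.Random(seed)
    result = []
    for first, second in pairs:
        if rng.random() < 0.5:
            first, second = second, first
        result.append({"treatment": first, "control": second})
    return result

# Hypothetical subjects: 10 males and 10 females, blocked on gender.
subjects = [("M", i) for i in range(10)] + [("F", i) for i in range(10)]
assignment = randomized_blocks(subjects, block_of=lambda s: s[0], seed=111)
males_treated = sum(1 for s, arm in assignment.items() if s[0] == "M" and arm == "treatment")
females_treated = sum(1 for s, arm in assignment.items() if s[0] == "F" and arm == "treatment")
print("males treated:", males_treated, "females treated:", females_treated)

# Hypothetical dandruff setup: each head contributes a (left, right) pair.
heads = [(f"head_{i}_left", f"head_{i}_right") for i in range(1, 4)]
print(randomized_matched_pairs(heads, seed=111))
```

Blocking removes the chance of a lopsided gender split entirely, while the coin flip within each pair keeps the choice of side out of the experimenter's hands.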

Experiments vs. Observational Studies

• Often, we want the causal effect of some treatment, but our data are from an observational study
  • Observational studies examine effects of some variable but without the advantages of a controlled experiment
  • No treatment is assigned by the researcher in observational studies

• Example: health effects of smoking
  • Unethical to randomly impose a treatment
  • Could there be some confounding variable that explains health differences between smokers and non-smokers?

• Very risky to make causal statements from observational data, since we cannot avoid bias!

Health Effects of Chocolate

• Report to European Society of Sexual Medicine:
  • 153 Italian women filled out sexual function questionnaires
  • “Intriguing correlation”: sexual function/desire significantly greater among chocolate-eaters
  • Observational study: association does not imply causation!
  • Confounding: average age is 35 among frequent chocolate-eaters, compared with 40.4 in the non-chocolate group

Next Class - Lecture 2

• Collecting Data:
  • Surveys and Sampling
  • Graphical summaries of a single variable

• Moore, McCabe and Craig: Sections 3.3 and 1.1