Lecture 4: Ways to get data into SAS, some practice programming, review of statistical concepts

Post on 21-Jan-2016

Lecture 4

• Ways to get data into SAS

• Some practice programming

• Review of statistical concepts

Getting data into SAS

• DATALINES statement
– Data is contained within a data step

• INFILE statement
– Data contained in separate file

• PROC IMPORT
– Data contained in separate file

* List Directed Input: Reading data values separated by spaces.;

DATA bp;
  INFILE DATALINES;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C 84 138 93 143
D 89 150 91 140
A 78 116 100 162
A . . 86 155
C 81 145 86 140
;
RUN;

TITLE 'Data Separated by Spaces';
PROC PRINT DATA=bp;
RUN;

Obs  clinic  dbp6  sbp6  dbpbl  sbpbl

  1    C       84   138     93    143
  2    D       89   150     91    140
  3    A       78   116    100    162
  4    A        .     .     86    155
  5    C       81   145     86    140

* List Directed Input: Reading data values separated by commas;

DATA bp;
  INFILE DATALINES DLM = ',';
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,.,.,86,155
C,81,145,86,140
;
RUN;

TITLE 'Data separated by a comma';
PROC PRINT DATA=bp;
RUN;

* List Directed Input: Reading data values from a .csv type file;

DATA bp;
  INFILE DATALINES DLM = ',' DSD;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,,,86,155
C,81,145,86,140
;

TITLE 'Reading in Data using the DSD Option';
PROC PRINT DATA=bp;
RUN;
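To see why the DSD option matters for the row with empty fields, the sketch below reads the same two rows with and without it. The dataset names no_dsd and with_dsd are made up for this illustration, and MISSOVER is an added assumption (not used in the lecture code) so that a short row produces missing values instead of pulling values from the next line.

```sas
/* Without DSD: consecutive commas are treated as a single delimiter */
DATA no_dsd;
  INFILE DATALINES DLM = ',' MISSOVER;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
A,78,116,100,162
A,,,86,155
;
RUN;

/* With DSD: each pair of consecutive commas marks a missing value */
DATA with_dsd;
  INFILE DATALINES DLM = ',' DSD MISSOVER;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
A,78,116,100,162
A,,,86,155
;
RUN;

TITLE 'Without DSD';
PROC PRINT DATA=no_dsd;
RUN;

TITLE 'With DSD';
PROC PRINT DATA=with_dsd;
RUN;
```

In the no_dsd listing the second row should show 86 and 155 shifted into dbp6 and sbp6, with dbpbl and sbpbl missing; in the with_dsd listing they stay in dbpbl and sbpbl.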

* List Directed Input: Reading data values separated by tabs (.txt files);

* The values below are separated by tab characters and the fourth row has two empty fields;
DATA bp;
  INFILE DATALINES DLM = '09'x DSD;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C	84	138	93	143
D	89	150	91	140
A	78	116	100	162
A			86	155
C	81	145	86	140
;

TITLE 'Reading in Data separated by a tab';
PROC PRINT DATA=bp;
RUN;

* Reading data from an external file;

DATA bp;
  INFILE '/home/ph5415/data/bp.csv' DSD FIRSTOBS = 2;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
RUN;

TITLE 'Reading in Data from an External File';
PROC PRINT DATA=bp;
RUN;

Content of bp.csv:

clinic,dbp6,sbp6,dbpbl,sbpbl
C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,,,86,155
C,81,145,86,140

*Using PROC IMPORT to read in data ;

PROC IMPORT DATAFILE='/home/ph5415/data/bp.csv' OUT = bp
     DBMS = csv REPLACE;
  GETNAMES = yes;
RUN;

TITLE 'Reading in Data Using PROC IMPORT';

PROC PRINT DATA=bp;
RUN;

PROC CONTENTS DATA=bp;
RUN;

The CONTENTS Procedure

Data Set Name:  WORK.BP                           Observations:          5
Member Type:    DATA                              Variables:             5
Engine:         V8                                Indexes:               0
Created:        18:15 Tuesday, January 25, 2005   Observation Length:    40
Last Modified:  18:15 Tuesday, January 25, 2005   Deleted Observations:  0
Protection:                                       Compressed:            NO
Data Set Type:                                    Sorted:                NO
Label:

-----Alphabetic List of Variables and Attributes-----

#    Variable    Type    Len    Pos
-----------------------------------
1    clinic      Char      8     32
2    dbp6        Num       8      0
4    dbpbl       Num       8     16
3    sbp6        Num       8      8
5    sbpbl       Num       8     24

Some Definitions

• Statistics: The art and science of collecting, analyzing, presenting, and interpreting numerical data.

• Data: Facts and figures that are analyzed

• Dataset: All the data collected for a study

• Elements: Units on which data is collected
– People, companies, schools, households

• Variables: Characteristics measured on elements
– People (height, weight)
– Company (number of employees)
– Schools (percentage of students who graduate in 5 years)
– Households (number of computers owned)

Informal Definition

• Statistics: A scientific way to gain information about something you do not know

Start With Research Question

• What is the proportion of persons without health insurance in Minnesota?

• Do newer BP medications prevent heart disease compared to older medications?

• What is the relationship between grade point average and SAT scores?

• Do persons who eat more fruits and vegetables have a lower risk of developing colon cancer?

• Does the program DARE reduce the risk of young persons trying drugs?

Statistics

1) Start with question

2) Design study and collect data

3) Compute summary data to assess question

4) Make conclusions (inference)

Statistical Inference

• Estimation (Chapter 4)

• Hypothesis Testing (Chapter 5)
– Comparing population proportions (Chap 6)
– Comparing population means (Chap 7)

Common Parameters to Estimate

Parameter            Parameter Description

$\mu$                Mean of population
$\pi$                Proportion with a certain trait
$\rho$               Correlation between 2 variables
$\mu_1 - \mu_2$      Difference between 2 means
$\pi_1 - \pi_2$      Difference between 2 proportions
$\sigma$             Population standard deviation

Statistical Inference

1) Population with mean $\mu$ = ?

2) A simple random sample of n elements is selected from the population.

3) The sample data provide a value for the sample mean $\bar{x}$.

4) The value of $\bar{x}$ is used to make inferences about the value of $\mu$.

Sampling

• Sample: a subset of the target population
(usually a simple random sample: each possible sample of size n has an equal probability of being selected)

• Different samples yield different estimates

• Goal: to understand the population parameter (the “true value”)
– It’s usually not possible to measure the population value
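The point that different samples yield different estimates can be sketched in SAS. The example below treats the five bp records from earlier in the lecture as a tiny stand-in population, draws several simple random samples with PROC SURVEYSELECT (part of SAS/STAT), and prints the sample mean of baseline systolic BP for each draw. The sample size, number of replicates, seed, and the output name bp_samples are arbitrary choices for illustration.

```sas
/* Stand-in population: the five bp records used earlier in the lecture */
DATA bp;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C 84 138 93 143
D 89 150 91 140
A 78 116 100 162
A . . 86 155
C 81 145 86 140
;
RUN;

/* Draw 4 simple random samples of 3 records each (seed chosen arbitrarily) */
PROC SURVEYSELECT DATA=bp OUT=bp_samples
     METHOD=SRS SAMPSIZE=3 REPS=4 SEED=5415;
RUN;

/* One sample mean per replicate; the means generally differ from draw to draw */
TITLE 'Sample Means from Repeated Simple Random Samples';
PROC MEANS DATA=bp_samples N MEAN;
  BY Replicate;
  VAR sbpbl;
RUN;
```

Changing the SEED value gives a different set of samples, and so a different set of estimates of the same population mean.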

Point Estimate

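Parameter            Point Estimate

$\mu$                $\bar{x}$ (sample mean)
$\pi$                $p$ (sample proportion)
$\rho$               $r$ (sample correlation)
$\mu_1 - \mu_2$      $\bar{x}_1 - \bar{x}_2$ (difference between 2 sample means)
$\pi_1 - \pi_2$      $p_1 - p_2$ (difference between 2 sample proportions)
$\sigma$             $s$ (sample standard deviation)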

Interval Estimation

In general, confidence intervals are of the form:

$\text{estimate} \pm 1.96 \times SE$

SE = standard error of your estimate

Estimate = mean, proportion, regression coefficient, odds ratio...

1.96 = multiplier for a 95% CI based on the normal distribution

Estimation

“What is the average total cholesterol level for MN residents?”

Random sample of cholesterol levels

sample mean = sum of values / number of observations

$\bar{X} = \frac{\sum X}{n}$

Estimates the population mean: $\mu$

Estimation

“What is the average total cholesterol level for MN residents?”

sample standard deviation:

$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$

$s$ estimates the population standard deviation: $\sigma$
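As a quick check of the two formulas, PROC MEANS can report the sample mean and sample standard deviation directly. The sketch below uses the baseline systolic BP values (sbpbl) from the bp data earlier in the lecture rather than cholesterol data, which the lecture does not provide.

```sas
/* The five bp records used earlier in the lecture */
DATA bp;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C 84 138 93 143
D 89 150 91 140
A 78 116 100 162
A . . 86 155
C 81 145 86 140
;
RUN;

/* N, sample mean, and sample standard deviation (n - 1 in the denominator) */
TITLE 'Sample Mean and Standard Deviation of Baseline SBP';
PROC MEANS DATA=bp N MEAN STD;
  VAR sbpbl;
RUN;
```

By hand, the five values sum to 740, so the mean is 148; the squared deviations sum to 398, so $s = \sqrt{398/4} \approx 9.97$, which should match the PROC MEANS output.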

Confidence Interval Example

Suppose a sample of n = 100 with

mean = 215 mg/dL, standard deviation = 20 mg/dL

95% CI = $\bar{X} \pm 1.96 \, s/\sqrt{n}$

= (215 - 1.96*20/10, 215 + 1.96*20/10) = (211.08, 218.92), approximately (211, 219)

$s/\sqrt{n}$ = standard error of the mean
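The same arithmetic can be done in a DATA step. This is a minimal sketch using the numbers from the example; the variable names are chosen here for illustration.

```sas
/* 95% CI for the mean from summary statistics: n = 100, mean = 215, SD = 20 */
DATA _NULL_;
  n    = 100;
  xbar = 215;
  s    = 20;
  se    = s / SQRT(n);          /* standard error of the mean */
  lower = xbar - 1.96 * se;     /* lower 95% confidence limit */
  upper = xbar + 1.96 * se;     /* upper 95% confidence limit */
  PUT se= lower= upper=;        /* results are written to the SAS log */
RUN;
```

The log should show se=2 lower=211.08 upper=218.92.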

Properties of Confidence Intervals

• As sample size increases, the CI gets narrower
– If you could sample the whole population, $\bar{X}$ would equal $\mu$

• Can use different levels of confidence
– 90, 95, 99% common
– More confidence means a wider interval, so a 90% CI is narrower than a 99% CI

• Changes with population standard deviation
– More variable population means a wider interval
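These properties can be seen numerically. The sketch below tabulates the half-width of a normal-based CI, $z \times s/\sqrt{n}$, for a few sample sizes and confidence levels. The standard deviation of 20 is borrowed from the cholesterol example; the sample sizes and the dataset name ci_width are arbitrary choices for illustration.

```sas
/* Half-width of a normal-based CI for several sample sizes and confidence levels */
DATA ci_width;
  s = 20;                                  /* SD taken from the cholesterol example */
  DO n = 25, 100, 400;
    DO conf = 0.90, 0.95, 0.99;
      z = PROBIT(1 - (1 - conf) / 2);      /* normal quantile, about 1.96 for 95% */
      half_width = z * s / SQRT(n);
      OUTPUT;
    END;
  END;
RUN;

TITLE 'CI Half-Width by Sample Size and Confidence Level';
PROC PRINT DATA=ci_width NOOBS;
  FORMAT z 6.3 half_width 6.2;
RUN;
```

Reading down the listing, the half-width shrinks as n grows and widens as the confidence level rises; quadrupling n halves the width.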

Caution with Confidence Intervals

– Data should be from a random sample

– More complicated sampling requires different methods
• Example: multistage or stratified sampling

– Outliers can cause problems

– Non-normal data can change the confidence level
• Skewed data a big problem

– Bias not accounted for
• Non-responders
• Target and sampled population different

95% Confidence Intervals with SAS

1) Construct from output

estimate +/- 1.96*SE

2) Provided automatically by some procedures

PROC MEANS DATA = STUDENTS LCLM UCLM;
  VAR AGE;
RUN;
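A variant of the second approach is sketched below on the bp data from earlier in the lecture, since the STUDENTS dataset is not shown. CLM requests both confidence limits at once, and ALPHA= sets the confidence level (ALPHA=0.10 gives a 90% CI; the default 0.05 gives 95%). Note that PROC MEANS bases these limits on the t distribution, so for small samples they will be wider than estimate +/- 1.96*SE.

```sas
/* The five bp records used earlier in the lecture */
DATA bp;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C 84 138 93 143
D 89 150 91 140
A 78 116 100 162
A . . 86 155
C 81 145 86 140
;
RUN;

/* N, mean, standard error, and 90% confidence limits for baseline SBP */
TITLE '90% Confidence Interval for Mean Baseline SBP';
PROC MEANS DATA=bp N MEAN STDERR CLM ALPHA=0.10;
  VAR sbpbl;
RUN;
```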
