1 creating and tweaking data hrp223 – 2010 october 24, 2011 copyright © 1999-2011 leland stanford...

47
1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This presentation is protected by copyright law and international treaties. Unauthorized reproduction of this presentation, or any portion of it, may result in severe civil and criminal penalties and will be prosecuted to maximum extent possible under the law.

Post on 19-Dec-2015

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

1

Creating and Tweaking Data

HRP223 – 2010October 24, 2011

Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved.Warning: This presentation is protected by copyright law and international treaties. Unauthorized reproduction of this presentation, or any portion of it, may result in severe civil and criminal penalties and will be prosecuted to maximum extent possible under the law.

Page 2: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

2

Topics

• Creating data with loops in data steps• Creating variables• Modifying variables

Page 3: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

3

Making 100 records

Once you have the 100 records you can add in details based on the value of dude.

You can easily add in a random number for everyone’s height and make half the people

male.

Click here to add new variables.

Page 4: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

4

Case Clauses

• You can add new variables using functions and simple assignment statements inside case-when-else-end phrases within the SQL.

This is what we are building. Notice the computed column, the

name of the new node and the name of the new data set.

Click Preview to see the SQL code.

Page 5: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

5

ComputedAt the end of the process we want a character variable with dudes

1 to 50 to be male and 51 to 100 to be female and anything else labeled as *** BAD SEX ***.

1st 2nd

Page 6: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

6

Page 7: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

7

Be sure to specify a character column if you

are making strings of characters.

Always specify a value for when it does not know

what value to use.

Add in the rules on how to replace.

Page 8: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

8

Page 9: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

9

Click Modify Task and click Computed Column

to add in the random height.

Page 10: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

10

Display 5 digits including the . Two are after the decimal

To change how the new values appear click here.

Page 11: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

11

You can find functions here. Use OnLineDoc to find more information.

rand("normal", 67, 3)

Page 12: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

12

More Complex Variables

• You can compute a new value for different levels of a existing variable. Say I want to add 2 inches to all the males.

• Open the with height and gender• Click computed columns• Recode the sex column• Specify it is a numeric column• Replace female with the height variable• Replace male with the height variable + 2

Page 13: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

13

Page 14: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

14

Nested Loops

• How to create data.– Use loops. – Be sure to include an end with every do.– Include an output inside the innermost loop.• If you forget the output, the only time it will write a

record to the new dataset is at the end of the data step.

Page 15: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

15

Advanced Expressions

• If you get sick of clicking you can write complex case statements yourself using:

• Computed Column• Advanced Expression• Type the case logic casewhen (treat = "Placebo") then rand("norm", 10, 1)when (treat = "Low") then rand("norm", 19, 2)when (treat = "High") then rand("norm", 20, 2)else .E end

Page 16: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

16

Page 17: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

17

Fixing Bad Values

• You will eventually need to fix bad data.– Say you want to set Placebo5 to be a score of 10.

Name the node and output.

Select the variables that are not

modified.

Page 18: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

18

Fixing Bad Values

• Tell it to compute a column and choose either Recode column or do a case-when-else-end statement in an Advanced expression.

Page 19: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

19

Page 20: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

20

To get a better look click validate

Page 21: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

21

Hang on to this syntax

case when (logic check) then new valueelse originalVariableend

Page 22: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

22

Collapsing Groups

• Often you will have a categorical variable and you will want to reduce the number of groups.– High Dose and Low Dose are the same as being on

a drug.• You can create a new variable or just use a

custom format to change how the values appear.

Page 23: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

23

Adding a New Column

• Choose Computed Column and recode a column.

Page 24: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

24

Adding a User Defined Format

Here we are changing characters to appear as other

characters.

Page 25: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

25

1

2

3

4

Repeat until you have filled in all the values you want to appear differently.

Page 26: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

26

Using Formats

• The formats are not automatically associated with any variables. You need to tell SAS to apply the format when it is creating a dataset or when it is processing a variable.

• Some processing nodes do better if you have assigned the format in a previous step.

Page 27: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

27

Select the variable that needs the

format and click properties.

Click Change… and then pick the User

Defined format.

Click Change… and then pick the User

Defined format.

Page 28: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

28

Same Information Formatted

Page 29: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

29

Combining

• When you have data in two tables, you need to tell SQL how the two tables are related to each other.– Typically you have a subject ID number in both

files. The variable that can be used to link information is called the key.

Page 30: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

30

Demographics

Response to TreatmentHere the two tables have different variables (except ID) and they are in a different sort order.

We want the favorite color merged in to see if it is related to response to treatment.

Page 31: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

31

Merging

• Merging is trivially easy with EG. Choose a table and do the Query Builder…. And push the Join Tables button.

Page 32: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

32

Double click on the dividing lines to make the columns wide

enough to read.

Page 33: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

33

Notice the name t1. In the SQL statements, variables from this

table will have the prefix t1.

This table will be referred to as t2.

It noticed that the two tables have the common variable ID. Therefore it is going to match records that have a common

value in ID.

Double click the link for details.

Page 34: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

34

Joins

• You will typically do inner joins and left joins.– Inner Joins: select the marching records– Left Joins: select all records on the left side and

any records that match on the right.

Page 35: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

35

Inner Joins

• Inner Joins are useful when you want to keep the information from the tables, if and only if, there are matches in both tables.– Here you keep the records where you have

demographic and response to treatment information on people.

Page 36: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

36

Left Joins

• Left joins are useful when you have a table with everybody on the left side of the join and not everyone has records in the right table.– A typical example has the left side with the IDs of

everyone in a family and the right table has information on diagnoses. Not everyone is sick so you want to keep all the IDs on the left and add in diagnoses where you can.

Page 37: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

37

Typical Left Join

Notice the numeric variable is formatted to

display with words.

Page 38: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

38

Page 39: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

39

Coalesce

• The previous example leaves NULL for the people who are disease free. You probably want to list the rest as healthy.

• The coalesce function returns the first non-missing value. – Coalesce works on numeric lists.– Coalesce works on character lists.

Page 40: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

40

Page 41: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

41

Coalesce

• If you are using left joins from multiple tables, coalesce can be really useful.– Say you have people who have reported disease,

other people have verified disease and the rest are assumed to be healthy. You can coalesce an indicator variable from the verified table and reported table and call everybody else healthy.

Page 42: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

42

If the tables have indicator variables, once the tables are linked, the coalesce function is easy:COALESCEC(t3.status2 , t2.status1, "Healthy"))

Page 43: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

43

No indicator variables?

• If the tables you are coalescing do not have indicator variables, just make them as part of the query by adding a column which has the ID in the child tables (e.g., reported and verified) recoded to a word like “reported” or “verified”.

Page 44: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

44

The two new indicator columns.

Page 45: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

45

Coalesce the new columns

• Once the new columns are created, create a new variable using the Advanced expression option for a new computed column. Then do coalesce on the new variables. Double click on the new variables and it will insert the code.

Page 46: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

46

After double clicking the ver variable the

code is inserted.

Don’t forget the comma before double clicking

the rep variable.

After inserting reported and verified, put in

another comma and the “healthy” option.

Page 47: 1 Creating and Tweaking Data HRP223 – 2010 October 24, 2011 Copyright © 1999-2011 Leland Stanford Junior University. All rights reserved. Warning: This

47