chapter 9 cleansing and augmenting the data xiaogang su department of statistics university of...

Post on 14-Jan-2016

219 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Chapter 9 Cleansing and Augmenting

the Data

Xiaogang Su

Department of Statistics

University of Central Florida

Section 9.1 Cleansing and Augmenting the Data

Objectives

• Modify the data using the functions

— UPCASE

— COMPBL

— TRANWRD.

Scenario

The secretary at Statistics Department maintains some faculty information. The data set contains information about all faculty in Fall 2001.

...

UPCASE Function

The UPCASE function converts all of the letters in the data value into uppercase.

var=UPCASE(var);

country=upcase(country);COUNTRY

france

BEFORE

COUNTRY

FRANCE

AFTER

Example:

...

COMPBL Function

The COMPBL function compresses multiple consecutive blanks in a data value into one blank. Since the length of a variable is set at compilation, the resulting data value is padded with blanks.

Example:

var=COMPBL(var);

name=compbl(name);NAME

DE PABLOS

BEFORE

NAME

DE PABLOS

AFTER

...

TRANWRD Function

The TRANWRD function replaces all occurrences of a pattern of characters in a data value with another pattern of characters.

Example:

var=TRANWRD(var, target, replacement);

name=tranwrd(name,’Miss’, ’Ms’);NAME

Miss. Joy Ho

BEFORE

Ms. Joy Ho

AFTER

NAME

...

Cleansing Existing Data

This demonstration illustrates

• changing Country to be always uppercase for USA and only the first letter be upper case for other counties

• removing multiple blanks from a last name

Program: pg2-ch9-ex01.sas

Section 9.2 Creating New Character Variables

Objectives

• Modify character data values.

• Specifically, use the

SCAN function

TRIM function

concatenation

operator

SUBSTR function

...

ccutchin@mail.ucf.edulguo@mail.ucf.eduhoffman@pegasus.cc.ucf.edumorij@pegasus.cc.ucf.edu

ScenarioUniversity has two email systems “pegasus” and “mail”. The e-mail address structures are in the form of

<Firstinitial>Lastname@pegasus.cc.ucf.edu

Or <Firstinitial>Lastname@mail.ucf.edu

...

Refresher Quiz!

1. What are the two phases of the DATA step?

Compilation and Execution

2. What are the three required attributes of all variables?

Name, Type, and Length

3. In what phase are these attributes assigned?

Compilation

...

SUBSTR Function

The SUBSTR function extracts a portion of the character data value based on how many characters are designated for retrieval.

var1=SUBSTR(var, start ,<number of chars>);

NAME

Dorothy Ename1 = substr(name,1,3); NAME1

Dor

...

Retrieve First Initial

The first step is to retrieve the initial letter of an employee’s first name.

first_initial = substr(emp_firstname,1,1);

EMP_FIRSTNAME

Dorothy E

FIRST_INITIAL

D

...

Retrieve Middle Initial

The second step is to retrieve the middle initial.

Problems:

• Not all middle initials are in the same location, so you cannot use the SUBSTR function.

• Not all employees have a middle initial.

lastname firstname email MILLS DOROTHY E DEMILLS@IA.COM BOWER EILEEN A. EABOWER@IA.COM READING TONY R. TRREADING@IA.COM JUDD CAROL A. CAJUDD@IA.COM ANDERSON CHRISTOPHER CANDERSON@IA.COM

...

SCAN Function

The SCAN function extracts a portion of the character data value based on what word-number to retrieve.

Example:

var1=SCAN(var, word-number, <delimiter(s)>);

NAME

Dorothy Ename1=scan(name,2,' ');

NAME1

E

...

Retrieve Middle Initial

middle_initial=substr(scan(emp_firstname,2,' '),1,1);

Example:

emp_firstname = Dorothy E

Second word = 'E'

...

Retrieve Middle Initial

middle_initial=substr(scan(emp_firstname,2,' '),1,1);

Example:

emp_firstname = Dorothy E

1 character = E

Second word = 'E'

Retrieve Middle Initial

middle_initial=substr(scan(emp_firstname,2,' '),1,1);

Example:

emp_firstname = Eileen Alma

Second word = 'Alma'

...

Retrieve Middle Initial

middle_initial=substr(scan(emp_firstname,2,' '),1,1);

Example:

emp_firstname = Eileen Alma

1 character = A

Second word = 'Alma'

...

Retrieve Middle Initial

middle_initial=substr(scan(emp_firstname,2,' '),1,1);

Example:

emp_firstname = Hanna

Second word = missing

...

When you put the e-mail address together, separate assignments must be made for those with middle initials and for those without.

Retrieve Middle Initial

middle_initial=substr(scan(emp_firstname,2,' '),1,1);

Example:

emp_firstname = Hanna

1 character = missing

...

Concatenation Operator

The third step is to put the first initial, middle initial, and last name together.

The concatenation operator joins character data values together.

var2 = var !! var1;

Concatenation Operator

Example: newname=name !! name1;

Compilation

NAME1

$

6

NEWNAME

$

15+

NAME

$

9

...

Concatenation Operator

Example: newname=name !! name1;

Dorothy E

2 spaces

NEWNAME

$

15

Dorothy E

Execution

NAME

$

9

NAME1

$

6

+

...

TRIM Function

The TRIM function removes trailing blanks from a character data value during execution.

var2=TRIM(var) !! var1;

TRIM Function

Example: newname=trim(name) !! name1;

NEWNAME

$

15

NAME1

$

6

NAME

$

9

Compilation

+

...

TRIM Function

Example: newname=trim(name) !! name1;

0 spaces

7 spaces

NEWNAME

$

15

DorothyEDorothy E

Execution

NAME

$

9

NAME1

$

6

+

...

TRIM Function

Example: newname=trim(name) !! ‘ ‘ !! name1;

1 space

7 spaces

NEWNAME

$

16

Dorothy EDorothy E

Execution

NAME

$

9

NAME1

$

6

+ +space

...

Pegasus, Mail, or Other email system

This demonstration illustrates putting all the pieces together.

Program: pg2-ch9-ex02.sas

Section 9.3 Creating New Numeric Variables

Objectives

• Modify numeric data values.

• Specifically, use the

SUM Function

ROUND Function

INT Function

MEAN Function

Scenario

International Airlines will provide a retroactive, prorated bonus to employees for their years of service up to and including a maximum of ten years of past company service.

sum_ avg_

EMP_ID bonus bonus

E00002 18956.98 2708

E00003 60211.06 12042

E00004 4218.99 4219

E00005 17142.96 1905

E00007 20361.21 2909

The variable names that containthe bonuses areBONUS1, BONUS2,and so on.

Implied Variable Lists

Because the variable names have the same prefix word followed by a numbered suffix, you can use an implied implied variable listvariable list to refer to a group of variables.

The syntax may vary slightly based on the statement in which the implied variable list is used, but the basic syntax is

varA-varN

where A is a starting number, and N is an ending number. The numbers must be consecutive.

The SUM Function

The SUM function

• adds variable values together

• ignores missing values.

var=SUM(var1, var2, …, varN);

var=sum(var1, var2, var3);

Example:

VAR

18

VAR1

12

VAR2

.

VAR3

6

...

The SUM Function

When you use an implied variable list, use the keyword OF in front of the first variable name to prevent subtraction from occurring.

Example:

var=SUM(OF var1-varN);

var=sum(of var1-var3);VAR1

12

VAR2

.

VAR3

6

VAR

18

The MEAN Function

The MEAN function returns the arithmetic mean (average) and ignores missing values.

Example:

var=MEAN(var1, var2,…,varN);

var=mean(of var1-var3);VAR

9

VAR1

12

VAR2

.

VAR3

6

...

The ROUND Function

The ROUND function returns a value rounded to the nearest round-off unit. If round-off unit is not provided, the variable is rounded to the nearest integer.

Any number or fractional value can be used as a round-off unit.

var=ROUND(var1,<round-off unit>);

The ROUND Function

Examples:

VAR1A

12

VAR2A

43

VAR3A

6.5

VAR4A

-6.5

VAR1

12.12

VAR2

42.54

VAR3

6.472

VAR4

-6.47

var1a = round(var1);var2a = round(var2);var3a = round(var3,.1);var4a = round(var4,.1);

The INT Function

The INT function returns the integer portion of an argument.

var=INT(var1);

The INT Function

var1a = int(var1);var2a = int(var2);var3a = int(var3);var4a = int(var4);

Examples:

VAR1

12.12

VAR2

42.54

VAR3

6.472

VAR4

-6.47

VAR1A

12

VAR2A

42

VAR3A

6

VAR4A

-6

Section 9.4 Creating New Numerical Variables Based on Dates

Objectives

• Modify numeric date values.

• Specifically, use

SAS time and datetime constants

the INTNX function

the TODAY function

the YEAR function

...

Scenario

New employees are eligible for benefits on the first of the month following their start dates.

benefit_ emp_hire_ benefit_Obs date date LASTNAME FIRSTNAME days 1 05/01/1999 01APR1999 ANDERSON SANDRA 3342 06/01/1999 10MAY1999 BARBOUR DAVID 3033 02/01/2000 01JAN2000 BELL MICHAEL 584 08/01/1999 18JUL1999 BENTZ ALICE 2425 02/01/2000 03JAN2000 BOLCH AMY 58

Refresher Quiz!

1. What date is the starting point from which all SAS dates are based?

2. What option takes effect if only a two-digit year is specified?

January 1, 1960

YEARCUTOFF=

3. How do you specify a SAS date constant for September 12, 2008? '12sep2008'd

...

SAS Times

Just as SAS has a starting point for dates, it also has a starting point for times.

Time is measured as the number of seconds since midnight.

'hh:mm<:ss.s>'T

SAS Datetimes

SAS datetimes are a combination of dates and times, and are measured as the number of seconds since January 1, 1960.

'ddmmmyyyy:hh:mm<:ss.s>'DT

The INTNX Function

The INTNX function advances a date, time, or datetime value by a given interval, and returns a date, time, or datetime value.

Example:

var1=intnx('year',var,1);VAR

0

SAS date for 01JAN1960

VAR1

366

SAS date for 01JAN1961

var=INTNX('interval',start-from,increment);

...

The TODAY Function

The TODAY function returns the current date as a SAS date.

Example:

Assume today’s date is September 12, 2008.

var1=today(); VAR1

17787

SAS date for 12SEP2008

var=TODAY();

...

Other DATE Functions

There are other functions that return values from SAS dates.

var=YEAR(SAS-date);

var=MONTH(SAS-date);

var=DAY(SAS-date);

Other DATE Functions

Example:

Assume today’s date is September 12, 2008.

THISYEAR=YEAR(TODAY());

THISDAY

12

THISMONTH=MONTH(TODAY());

THISDAY=DAY(TODAY());

THISMONTH

9

THISYEAR

2008

...

Calculating Eligibility Date

This demonstration illustrates putting all the pieces together.

Program: pg2-ch9-ex03.sas

top related