migrate 09s 1

Upload: ahmed-eid

Post on 02-Jun-2018

222 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/11/2019 Migrate 09S 1

    1/12

    Intro Data Migrate Manage Mod Advanced Upcoming

    UCLA Department of StatisticsStatistical Consulting Center

    Migrating to R for SAS/SPSS/Stata UsersVivian Lew

    [email protected]

    April 19, 2009

    Vivian Lew [email protected] to R for SAS/SPSS/Stata Users UCLA SCC

    Intro Data Migrate Manage Mod Advanced Upcoming

    Outline

    1 Introduction

    2 About Data

    3 Convert to R

    4 Managing Data

    5 Variable Modification

    6 Advanced Management

    7 Upcoming Mini-Courses

    Vivian Lew [email protected] to R for SAS/SPSS/Stata Users UCLA SCC

    Intro Data Migrate Manage Mod Advanced Upcoming

    1 IntroductionWhat are SAS (1966), Stata (1985), and SPSS (1968)?What is R (1997)?AssumptionsPhilosophy

    2 About Data

    Getting Data Into RDatasets in SAS, Stata, & SPSS3 Convert to R

    The Easy WayClean: R Only

    4 Managing DataSubscripts & AttachingSubsetting (Conditioning) and SamplingSaving

    5 Variable ModificationGeneratingRecoding an Existing Variable

    6 Advanced ManagementRestructuring Data Frames

    Special Scalars7 Upcoming Mini-Courses

    Vivian Lew [email protected]

    Migrating to R for SAS/SPSS/Stata Users UCLA SCC

    Intro Data Migrate Manage Mod Advanced Upcoming

    What are SAS (1966), Stata (1985), and SPSS (1968)?

    What are SAS (1966),Stata (1985), and SPSS (1968)?

    Statistical packages allow you to read raw data and transformit into a proprietary file (sas7bdat, dta, sav files)

    Provide statistical and graphical procedures to help youanalyze data.

    Provide a management system to store output from statisticalprocedures for processing in other procedures, or to let youcustomize printed output.

    Provide a macro (internal programming) language which usecommands

    Provide a matrix language to add new functions andprocedures (SAS/IML and SPSS Matrix).

    Vivian Lew [email protected]

    Migrating to R for SAS/SPSS/Stata Users UCLA SCC

  • 8/11/2019 Migrate 09S 1

    2/12

    Intro Data Migrate Manage Mod Advanced Upcoming

    What is R (1997)?

    What is R (1997)?

    R is a statistical environment a fully planned and coherent

    system, rather than collection of specific and inflexible tools whichhave been added in increments

    It runs on a variety of platforms including Windows, Unix andMacOS.

    It provides an unparalleled platform for programming newstatistical methods in a straightforward manner.

    It contains advanced statistical routines not yet available inother packages.

    It has state-of-the-art graphics capabilities.

    Its efficient with its resources and it is free!

    Vivian Lew [email protected] to R for SAS/SPSS/Stata Users UCLA SCC

    Intro Data Migrate Manage Mod Advanced Upcoming

    Assumptions

    Assumptions About Your Knowledge

    Assume that you have used a commercial stat package like

    SAS/STATA/SPSS

    Assume you have had the opportunity to read data into a

    commercial stat package

    Assume you have written data from commercial stat packageand generated output

    Assume you have managed data (e.g., merging datasets) with

    a stat package

    Assume you have attempted to generate descriptive statisticsor estimated models too

    Vivian Lew [email protected] to R for SAS/SPSS/Stata Users UCLA SCC

    Intro Data Migrate Manage Mod Advanced Upcoming

    Philosophy

    Some Basic Differences

    SAS/STATA & SPSS generate considerable output, R gives

    minimal outputSAS & SPSS commands are not case sensitive, R is case

    sensitive so foo, FOO, fOO etc. are represent different thingsin R

    R allows the continuous modification of data, SAS will not

    (SPSS & Stata will)

    R Basic functions are available by default. Other functions aremade available as needed. SAS and SPSS include all functionsat the time of installation.

    Vivian Lew [email protected]

    Migrating to R for SAS/SPSS/Stata Users UCLA SCC

    Intro Data Migrate Manage Mod Advanced Upcoming

    Philosophy

    Basic R Components

    Data Object (a container): Primary data object is the vector

    an ordered collection of elementsR functions perform operations on vectorcontents

    Vectors can be specialized (e.g., character,logical)Objects can be collections of vectors

    R functions : operate on an Objects contents. Functions areprocessing instructions

    Functions work differently on different objects

    Functions are built-in or user written

    Vivian Lew [email protected]

    Migrating to R for SAS/SPSS/Stata Users UCLA SCC

  • 8/11/2019 Migrate 09S 1

    3/12

    Intro Data Migrate Manage Mod Advanced Upcoming

    Philosophy

    R command hierarchy

    R script: a collection of R commands similar to a Stata Do file or aSAS program file. Typically calls one or more existing Rfunctions in a script.

    R function: a self-contained programming object similar to commands

    in Stata/SAS/SPSSR program: A collection of functions for solutions you intend to reuse

    and/or share with others. Similar to SAS and Stata Macroprogramming.

    Rpackage: A expert level R program. Tested and documented toCRAN standards. Downloadable as a self-installing andconfiguring module. Contains information ondependencies (on other R modules). SAS does not allowthis, Stata allows the use of Ado files written by users.

    Lets examine a little R script

    Vivian Lew [email protected] to R for SAS/SPSS/Stata Users UCLA SCC

    Intro Data Migrate Manage Mod Advanced Upcoming

    1 IntroductionWhat are SAS (1966), Stata (1985), and SPSS (1968)?What is R (1997)?AssumptionsPhilosophy

    2 About DataGetting Data Into RDatasets in SAS, Stata, & SPSS

    3 Convert to R

    The Easy WayClean: R Only

    4 Managing DataSubscripts & AttachingSubsetting (Conditioning) and SamplingSaving

    5 Variable ModificationGeneratingRecoding an Existing Variable

    6 Advanced ManagementRestructuring Data FramesSpecial Scalars

    7 Upcoming Mini-Courses

    Vivian Lew [email protected] to R for SAS/SPSS/Stata Users UCLA SCC

    Intro Data Migrate Manage Mod Advanced Upcoming

    Getting Data Into R

    Key Characteristics ofData Objects in R

    Every Data Object has Attributes and they define behavior

    Class: ManyR objects have a class attribute, a character

    vectorgiving the names of the classes from which theobject inherits. If the object does not have a classattribute, it hasan implicit class, e.g., matrix or

    array

    Mode: a character string giving the desired mode or storagemode (type) of the object. Commonly: numeric,character, and logical (e.g., TRUE/FALSE), but also

    list

    Length: Number of elements in the object

    Having problems with a data object? Its probably being used

    incorrectly

    Vivian Lew [email protected]

    Migrating to R for SAS/SPSS/Stata Users UCLA SCC

    Intro Data Migrate Manage Mod Advanced Upcoming

    Getting Data Into R

    Common Data Objects

    Vector: a data structure consisting of elements that are accessedby indexing

    List: a generic vector (can be mixed mode)

    Array: A collection of items that are randomly accessible byintegers, the index. Array objects cannot be mixed mode.

    Matrix: A two-dimensional array. By convention, the first index isthe row, andthe second index is the column. ALLcolumns in a matrix must have the same mode(e.g.,numeric) and the samelength.

    Dataframe: A generalizedmatrix, different columns can have different

    modes (e.g., numeric, character, factor). This is similar to

    SAS, Stata, and SPSS datasets.

    Vivian Lew [email protected]

    Migrating to R for SAS/SPSS/Stata Users UCLA SCC

  • 8/11/2019 Migrate 09S 1

    4/12

    Intro Data Migrate Manage Mod Advanced Upcoming

    Datasets inSAS, Stata, & SPSS

    Datasets in SAS, Stata, & SPSS

    Many data analyses involve the dataset a collection ofvalues that function as a single unit. Example, we survey the

    students enrolled in a basic statistics course, for each student

    we have gender, age, year in school, major, etc. For eachstudent we may have values for each of these variables.

    If we were to store the data in a matrix in R (rows =

    observations, columns=variables) a matrix would want all ofthe data to have the same mode, so it would force the numericvariables to be storedas character (this creates problems)

    The solution is to store the data as a data frame.

    Vivian Lew [email protected] to R for SAS/SPSS/Stata Users UCLA SCC

    Intro Data Migrate Manage Mod Advanced Upcoming

    1 IntroductionWhat are SAS (1966), Stata (1985), and SPSS (1968)?What is R (1997)?AssumptionsPhilosophy

    2 About DataGetting Data Into RDatasets in SAS, Stata, & SPSS

    3 Convert to R

    The Easy WayClean: R Only

    4 Managing DataSubscripts & AttachingSubsetting (Conditioning) and SamplingSaving

    5 Variable ModificationGeneratingRecoding an Existing Variable

    6 Advanced ManagementRestructuring Data FramesSpecial Scalars

    7 Upcoming Mini-Courses

    Vivian Lew [email protected] to R for SAS/SPSS/Stata Users UCLA SCC

    Intro Data Migrate Manage Mod Advanced Upcoming

    The Easy Way

    Convert to CSV then toR

    The quickest way between commercial Stat Package datasets

    and R is to use the Stat Package software to convert anexisting dataset to a comma separated values file (.csv) andtransform it into an R object by using function read.csv

    mydata1

  • 8/11/2019 Migrate 09S 1

    5/12

    Intro Data Migrate Manage Mod Advanced Upcoming

    Clean: R Only

    BeyondCSV

    Suppose instead you do not have access to software other than R but need to

    convert the following: aStata dataset and a SAS dataset. R will need some

    additional packages to make this happen. It seems like a hassle to go this

    route, but consider the following:

    A minimum SAS install requires around 2gb of disk space now, a minimum R

    install might only require 30mb or so and you add on packages as needed.Thesetwo packages are particularly handy if you are d oing a lot of work

    between R andother software.

    1 l i b r a r y ( f o r e ig n ) \ # L o a d t h e n e e de d p a c ka g e s .

    2 li b r a r y ( Hmi sc ) \ # L o a d t h e n e e de d p a c ka g e s .

    3 my d a t a 2

  • 8/11/2019 Migrate 09S 1

    6/12

    Intro Data Migrate Manage Mod Advanced Upcoming

    Subscripts & Attaching

    Extracting Data from Data Frames

    In the data frame, the columns represent variables and the rowsare observations. One way to analyze the data is to work withthe variables (columns).

    1 Use its name, e.g., summary(eq2$MAG) eq2 is the data

    frame, and MAG is the header of one of its columns. Dollarsign ($) is an operator that accesses that column.

    2 Use subscripts, e.g., summary(eq2[4]) will work for dataframes or summary(eq2[,4]) works generally to extract the 4thcolumn MAG

    3 Use attach() it essentially splits the data frame into variablesand allows use the names directly (no need for $), for example:

    attach(eq2)

    summary(MAG)

    Vivian Lew [email protected] to R for SAS/SPSS/Stata Users UCLA SCC

    Intro Data Migrate Manage Mod Advanced Upcoming

    Subscripts & Attaching

    Extracting Rows from Data Frames

    In the data frame, the rows are observations. Another way to

    analyze the data is to work with a combination of specificobservations and variables.

    1 Use subscripts, e.g., summary(eq2[1:10,]), notice the commais telling us we want all the columns, but only rows 1:10

    2 Use subscripts, e.g., summary(eq2[1:10,4]), now itsobservations 1 through 10 and variable 4 (which is MAG)

    3 Even with attach, understanding subscripts is VERY helpful,e.g., summary(MAG[1:10])

    Vivian Lew [email protected] to R for SAS/SPSS/Stata Users UCLA SCC

    Intro Data Migrate Manage Mod Advanced Upcoming

    Subsetting (Conditioning) and Sampling

    Subsetting (Conditioning or Keeping)

    A common task is to create a smaller data set (data frame) basedupon the values of a variable. For example, we might only want

    large earthquakes, function which() is quick:

    1 e q2 [ M AG > 4 , ]

    2 B I G O N E S 4 , ]

    3 b i g o n e s 4 , 1 : 4]

    4 s t r ( B I G O N E S )

    5 s t r ( b i g o n e s )

    Vivian Lew [email protected]

    Migrating to R for SAS/SPSS/Stata Users UCLA SCC

    Intro Data Migrate Manage Mod Advanced Upcoming

    Subsetting (Conditioning) and Sampling

    Subsetting (Conditioning or Keeping)

    For more complicated data extraction from data frames, use thesubset() function because it is designed to handle more complexconditioning. Note that subset will always want to create a newdata frame.

    1 small1 < - s u b se t ( e q2 , M AG < 1 , s e le c t =c ( M A G ,

    L AT , L ON , D AT E ) )

    2 s t r ( s m a l l 1 )

    3 small2 < - s u b se t ( e q2 , M AG < 1 , s e le c t =- c ( M A G ,

    L AT , L ON , D AT E ) )

    4 s t r ( s m a l l 2 )

    5 small3 < - s u b se t ( e q 2 , M A G < 3 \ & D E PT H < 5 .9 ,

    s e l e c t = c ( MA G , L AT , L ON , D EP TH ) )

    6 s t r ( s m a l l 3 )

    7 s u m m a r y ( s m a l l 3 )

    Vivian Lew [email protected]

    Migrating to R for SAS/SPSS/Stata Users UCLA SCC

  • 8/11/2019 Migrate 09S 1

    7/12

    Intro Data Migrate Manage Mod Advanced Upcoming

    Subsetting (Conditioning) and Sampling

    Sampling from a DataFrame

    Drawing randomsamples from a data frame is simple but can befrustrating, this is agood example of why it matters that wereworking with a data frame and the fact that function sample was

    written with vectors in mind:

    1 n o t w h a t i w a n t e d

  • 8/11/2019 Migrate 09S 1

    8/12

    Intro Data Migrate Manage Mod Advanced Upcoming

    Generating

    Completely New Variable

    Suppose you wish to number each observation, a simple one is tojust use its row numb er. We can then combine the new column

    vector with the existing data frame.1 ID

  • 8/11/2019 Migrate 09S 1

    9/12

    Intro Data Migrate Manage Mod Advanced Upcoming

    Recoding anExisting Variable

    Factorsin Detail contdAnother way

    1 MAG 2

  • 8/11/2019 Migrate 09S 1

    10/12

    Intro Data Migrate Manage Mod Advanced Upcoming

    Restructuring Data Frames

    MergingMerging is concatenating columns with the addition of matching on an identifiervariable.

    1 team

  • 8/11/2019 Migrate 09S 1

    11/12

    Intro Data Migrate Manage Mod Advanced Upcoming

    Special Scalars

    Dates

    Dates are represented as the number of days since 1970-01-01 (Unix Epoch),with negative valuesfor dates prior to January 1, 1970.

    They are always printed following the rules of the current Gregorian calendar.

    A variable might be named date and may look like a date, but it may not berecognized as a date by R.

    The solution is to convert datevariables to an R date. Once that is done,mathematical operations can be performed.

    attach(eq2)

    summary(DATE)

    is(DATE)

    newdate

  • 8/11/2019 Migrate 09S 1

    12/12

    Intro Data Migrate Manage Mod Advanced Upcoming

    Upcoming Mini-Courses

    Next week:

    Migrating to R for SAS/SPSS/Stata Users (April 21, Tuesday)

    Time Series in R (April 23, Thursday)Following week:

    Basic LaTeX Mac/Windows (April 28, Tuesday)Basic R (April 30, Thursday)

    For a schedule of all mini-courses offered please visithttp:// scc.stat.ucla.edu/mini-courses .

    Vivian Lew [email protected]

    Migrating to R for SAS/SPSS/Stata Users UCLA SCC

    Intro Data Migrate Manage Mod Advanced Upcoming

    Our Survey

    PLEASE Follow this link and take our brief surveyhttp://scc.stat.ucla.edu/survey

    It will help us improve this course, thank you.

    Vivian Lew [email protected]

    Migrating to R for SAS/SPSS/Stata Users UCLA SCC

    http://scc.stat.ucla.edu/mini-courseshttp://scc.stat.ucla.edu/mini-courses