Notes on Using R

Data Analysis and Data Management Using R

(Version 0.98)

Tom Backer Johnsen

Faculty of Psychology, University of Bergen

March 28, 2008


Copyright © 2008 by Tom Backer Johnsen
http://www.galton.uib.no/johnsen

ISBN 82-91713-40-5

Universitetet i Bergen
Det psykologiske fakultet
Christies gt. 12
5015 Bergen, Norway
Tel: +47 55 58 31 90
Fax: +47 55 58 98 79
URL: http://www.uib.no/psyfa/isp


Contents

Preface
    Overview
    Caution
    Acknowledgment

1 Introduction
    1.1 What is R?
    1.2 Inexperienced users
    1.3 Experienced users
    1.4 Finally

2 Starting and Stopping R
    2.1 Opening a session
    2.2 For the impatient
    2.3 Closing a session
    2.4 Documentation
    2.5 Getting help
    2.6 Fine-tuning the installation

3 Basic stuff
    3.1 Simple expressions
    3.2 Vectors
    3.3 Simple univariate plots
    3.4 Functions
    3.5 Objects
        3.5.1 Naming objects
        3.5.2 Object contents
    3.6 Other information on the session
    3.7 The “Workspace”
        3.7.1 Listing all objects in the workspace
        3.7.2 Deleting objects in the workspace
    3.8 Directories and the workspace
    3.9 Basic rules for commands

4 Data sets / Frames
    4.1 Sources of data sets
    4.2 Data frames
    4.3 Entering data
        4.3.1 Variable types in frames


    4.4 Reading data frames
        4.4.1 Reading frames from text files
        4.4.2 Selecting the data file in a window
        4.4.3 Reading data from the clipboard
        4.4.4 Reading data from the net
    4.5 Inspecting and editing data
    4.6 Missing data
        4.6.1 Identification of missing values
        4.6.2 What to do with missing data
        4.6.3 Advanced handling of missing data
    4.7 Attaching data sets
    4.8 Detaching data sets
    4.9 Use of the with() function

5 Data analysis
    5.1 Classification of techniques
    5.2 Univariate statistics
        5.2.1 Simple counts
        5.2.2 Continuous measurements
        5.2.3 Computing SS
        5.2.4 Plots
    5.3 Bivariate techniques
        5.3.1 Correlations
        5.3.2 Scatterplots
        5.3.3 t-test, independent means
        5.3.4 t-test, dependent means
        5.3.5 Two-way frequency tables
    5.4 Multivariate techniques
        5.4.1 Factor analysis
        5.4.2 Rotation
        5.4.3 Principal component analysis
        5.4.4 Principal factor analysis
        5.4.5 Final comments on factor analysis
        5.4.6 Reliability and Item analysis
        5.4.7 Reliability
        5.4.8 Item Analysis
        5.4.9 Factorial Analysis of Variance
        5.4.10 Multiple regression
    5.5 On Differences and Similarities
        5.5.1 For the Courageous

6 Resampling, Permutations and Bootstrapping
    6.1 Resampling
        6.1.1 Permutation (Randomization) tests
        6.1.2 Bootstrapping
        6.1.3 Using the “boot” package
        6.1.4 Random numbers
        6.1.5 Final Comments

7 Data Management


    7.1 Handling data sets
        7.1.1 Editing data frames
        7.1.2 List the data set
        7.1.3 Other useful commands
        7.1.4 Selecting subsets of columns in a frame
        7.1.5 Row subsets
        7.1.6 Repeated measures
    7.2 Handling commands
        7.2.1 Saving commands
        7.2.2 Using saved commands
    7.3 Transformations
    7.4 Input and Output
        7.4.1 File Names
        7.4.2 Input
    7.5 More on workspaces
    7.6 Transfer of output to MS Word
        7.6.1 Table 1
        7.6.2 Table 2
        7.6.3 Conclusion
    7.7 Final comments: The “Power of Plain Text”

8 Scripts, functions and R
    8.1 Writing functions
        8.1.1 Editing functions
        8.1.2 Sample function 1: “Hello world”
        8.1.3 General comments on functions
        8.1.4 Sample function 2: Compute an SS value
        8.1.5 Sample function 3: Improved version of the SS function
        8.1.6 Things to remember
        8.1.7 Sample function 4: Administrative tasks
    8.2 Scripts
        8.2.1 Sample script 1: Saving data frames
        8.2.2 Sample script 2: Simple computations
        8.2.3 Sample script 3: Formatted output
        8.2.4 Sample script 4: ANOVA with simulated data
        8.2.5 Sample script 5: A more general version
        8.2.6 Nested scripts

A Data Transfer
    A.1 Why is this important?
    A.2 The “audit trail”
    A.3 Data Sources
    A.4 Data Transfer
        A.4.1 Manual data entry
        A.4.2 Scanning forms
        A.4.3 Filters
        A.4.4 Checks
        A.4.5 The final data set
    A.5 Final comments


B Installation and Fine-tuning
    B.1 Installation
    B.2 Editors
        B.2.1 Tinn-R
        B.2.2 WinEdt and RWinEdt
        B.2.3 Notepad Plus
        B.2.4 vim
    B.3 GUI Interfaces to R
        B.3.1 R Commander
        B.3.2 SciView-R
    B.4 Installing packages
        B.4.1 Normal installation
        B.4.2 Updating packages
        B.4.3 Failed installation

C Other tools
        C.0.4 Spreadsheets
        C.0.5 Managing text
        C.0.6 Bibliographies
        C.0.7 Presentations
        C.0.8 Portable Applications
        C.0.9 Combining Statistical Output and Authoring

Reference card
References


List of Tables

4.1 Data stored in a text file
4.2 Clipboard data
4.3 Missing observations in frame
5.1 Rearranged sleep
5.2 Data for Item Analysis
5.3 The elements used for Cronbach’s alpha
6.1 Permutations of four values
7.1 File “output.txt”
7.2 Means, Standard Deviations and Intercorrelations
7.3 Regression Analysis Summary
8.1 File Input.data
8.2 Simple data set
8.3 Contents of “Tiny.txt”

List of Figures

2.1 R Opening Window
2.2 Closing R
2.3 Sample Help Window
3.1 1000 normally distributed random values
5.1 Histogram with density estimate
5.2 Scatterplot with regression line
5.3 Scree plot
7.1 Editing the attitude data set
A.1 The data entry loop

Preface

For quite some time I and several colleagues of mine have been looking for alternatives to the ‘standard’ statistical packages like SPSS and Statistica used by most of our colleagues in the social sciences. This search has been triggered by several trends we have seen in the past few years.

• Suppliers of commercial statistical software have become more and more paranoid about the installation of their software with respect to licenses, installation periods, etc. Consequently, I cannot be sure that I, as a researcher, at all times have access to one of the essential tools of our trade, a set of statistical routines. For instance, one piece of mathematical software I had installed on my portable required that I was logged on to the Internet in order to use it. It was immediately uninstalled. As I am writing this, my installations of both Statistica and SPSS will not work due to a missing ‘time code’; my licenses for the programs are only valid for one year at a time.

• The price of the licenses has gone up, which, together with increased pressure on the funding of universities, increases the risk of running into potentially interesting collaborators who simply cannot afford to use these tools. We have had that problem for a long time with researchers from the third world; now the problem is spreading into the west as well. In other words, if a data file in SPSS format (.sav) is sent to a colleague, you cannot be sure that it can be read by the recipient without problems. The solution is simple: drop SPSS and use a more general format instead, one based on “plain text”.

• In addition, and as a matter of principle, all parts of the research process should be open to peer review. This includes both documentation of details about the data analysis itself and, at least in principle, the software used for the data analysis as well. In contrast, commercial and proprietary programs like the ones mentioned above do not encourage documentation of all steps in the data analysis. Of course, it is possible to use scripts in these programs, but this is not enforced in any way, and it seems that few users actually do so. In addition, the source for these programs is definitely not open for scrutiny. This is problematic, both for the people selling the products and the researchers who insist on using them, as well as for the readers of what the same researchers produce. To reiterate: all parts of the research process should be open to peer review.

When using “open source” software, these problems are at least reduced. There are several alternatives for doing statistical operations in that “world”, but one of them turns out to be far superior in very many respects compared to the standard ones. This is a system simply called “R”. It is also based on a different type of user interface, a command or “script” oriented approach, which for experienced users is far superior to what is popularly preferred.

The standard (base) version of R as installed by default is very powerful in itself. In addition, there is a very large number of additions (so-called “packages”) which may be added to the system for all kinds of more specialized techniques. In other words, you may tailor your system to fit your own needs. If you have programming skills you may even add your own components in various ways.

Now, if you have a piece of software which is open, available at no cost, and customizable, with an immensely rich set of available methods and procedures which seem to be optimal for both novices and experienced users, what do you do? The conclusion is simple: you make the change.

In any case, R is obviously a statistical tool worth exploring. As a latecomer, I have just started that process for myself, and writing this document is part of that exploration.

Overview

For one thing, the focus in this document is on elementary data analysis rather than statistics as such. In addition, the focus is more on how things may be done than on why one might want to do a particular analysis. In other words, there is little methodology and little statistics. I will assume that basic information about statistics and methodology is known to the reader and available from other sources. In addition, the scope is not limited to R alone. In the three appendices I touch on themes that are for the most part less oriented toward R and more toward “project management” as such. This includes tools for entering large amounts of data, as well as other useful tools needed for the production of high quality typesetting of documents.

The main problem has been to decide on what to include in the text and the sequence in which to treat the various themes. After all, a text structured like a book is very much a linear affair, at least in respect to what follows what. The only thing the author can do to break the strict sequence of the text is to include aids like a table of contents, an index, and a lot of cross references to other relevant parts of the text as direct references or footnotes. In any case, you have to make a decision on what you regard as a sensible sequence. That has given me a lot of headaches. I finally decided on the following:

• Of the first two chapters, the “Introduction” and “Starting and stopping R”, the first includes some background information and the second the basics on starting and stopping the system, as well as pointing out where to find documentation and help in the installation. The second chapter also includes something for the really impatient user.

• The chapter on “Basic stuff” contains fundamental information on expressions, functions and objects as well as the essentials of a “workspace”. It ends with a summary of the basic rules for commands.

• Data sets / frames: One can start using and experimenting with the system using nothing but the built-in data sets. But the real fun starts with the analysis of “real” data, either data generated according to one’s own specifications or “your” data in an empirical sense. Here the basics of data management are described.

With an initial coverage of these themes, the foundation for a central theme of the document, data analysis, has been laid:

• Data analysis: Here six types of methods are described: (a) simple univariate statistics, (b) some simple bivariate operations, (c) variants of factor analysis, (d) reliability and item analysis, (e) factorial analysis of variance and finally, (f) multiple regression. This subject is discussed in chapter 5.

• Following the chapter on data analysis, the next chapter is oriented towards modern and more general alternatives to the classical tests of significance, a class of techniques that fall into the general category called resampling techniques. These are “computer intensive” methods in the sense that they are for all practical purposes impossible to perform without computers. It should be added that these techniques are not present in the so-called “standard” packages, but are easily accessible in R.

However, every time you start working on a new project, it implies much more than the ability to start a data analysis of a particular type. That part is often quite simple in itself, often almost trivial compared with the necessary preparations for the data analysis. The really important parts of a project do not involve the statistical tool directly, but include activities like data collection (constructing forms, scales etc.), transfer of the data into a format you can use (e.g. data entry), and strategies for getting the results you want into documents in the correct format.

In addition, after you have obtained the results you need, you have to transform the results from the software into a more readable format, i.e. generate proper tables and figures for a paper or report. These parts need to be documented as well. So, in addition to the focus leading towards data analysis in the chapters mentioned above, some other subjects need to be covered, at least briefly, towards the end:

• The first of these additional themes is a discussion in appendix A on the proper way to enter data sets with a minimum of errors, if possible using dedicated software. Here the idea of maintaining an “audit trail” for the project is introduced. This is a subject that is becoming increasingly important for several reasons. You have to be able to transfer the recorded data to a format suitable for reading into R and at the same time be able to record what has been done, i.e. to document operations. So, more information on data transfer is needed.

• This will require information about data handling beyond what is covered in chapter 4, e.g. the ability to read data stored in other formats and, after possible transformations, to store them in a suitable format (as well as a record of what was done).

For the needed transformations it is impossible to supply more than a few suggestions. Again, the main thing is to be able to document what you actually have done, and that is where the next part is important.

In general, it is not a trivial matter to transfer output from a statistical program to your favorite word processor. For one thing, SPSS and Statistica generate too much output while R by default generates too little. In both cases the elements are in the wrong sequence for the APA standard. The part called “Transfer of output to MS Word” covers the transfer of output from R to a properly formatted table in a document formatted according to the APA Publication Manual (Association, 2001) and (Nicol & Pexman, 1999). That part may be of interest to users of other statistical systems as well.

• One very important aspect of the R language is that it is command oriented. Normally, the commands are entered one by one at the keyboard. But it is also very useful to have sets of R commands (“scripts”) stored and run from simple text files or, alternatively, as functions stored in the workspace. This is a great advantage in respect to documentation of your project as well as reusing sets of commands. This subject is briefly touched upon in chapter 8.

• Finally, the last chapter in the third part is called “Fine-tuning your installation”, which covers the basics of the installation process itself, as well as alternatives in respect to GUI interfaces and editors for the R system. That section also includes some very brief comments on other software that may be useful.


Font conventions

This document is set in a font called ‘Palatino Roman’, and that holds for most of the text. However, when commands for the R language and function names are printed, a monospaced ‘Courier’ type of font is used.

Caution

This document is at best a very short introduction to a tool of vast capabilities, covering only a few themes. For an impression of how superficial this introduction really is, consider the most comprehensive textbook on R I know of. This is (Crawley, 2007), which runs to 942 pages, a massive book packed with information and examples. And that one is still far from complete. One quite large subject I know a little about is Social Network Analysis, or SNA. There are a large number of useful functions in R for analysis of that type of data (e.g. in the package called sna) which are not covered in that book at all. There are probably many more such subjects.

So, as an introductory text this is naturally quite superficial and only covers what I consider the most important aspects of the language. It can, as a matter of course, become much, much better. The next version will (hopefully) be better than this one. In addition, I welcome any suggestions regarding errors and/or improvements.

Acknowledgment

Professor Hans-Rüdiger Pfister from the University of Lüneburg, Germany has been very generous with many suggestions about the use of R, as well as some of the material presented here, especially the very nice example on factorial ANOVA in chapter 5, which is also used in the sample scripts in the discussion of that subject in chapter 8.


Chapter 1

Introduction

The purpose of this chapter is simple: to explain what the R language is, to cover some aspects of what is called “open source” software, and to comment on the needs of different types of users.

1.1 What is R?

R¹ is a computer language oriented towards statistical computing, similar to the S language originally developed at Bell Laboratories in the US. This language has been implemented in a first-class piece of software which is nice to use both for learning statistics and for the problems of more experienced researchers. One reason is simple: in contrast with more conventional packages like SPSS and Statistica, you have to know something about what you are doing when using R. And that is not a bad thing, at least for researchers.

To quote from the introduction to one of the manuals for the system (included when the system is downloaded and installed):

... R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things it has:

• an effective data handling and storage facility,

• a suite of operators for calculations on single values, arrays, and matrices,

• a large, coherent, integrated collection of intermediate tools for data analysis,

• graphical facilities for data analysis and display of results either directly on the screen or stored on file for later use.

And:

• a well developed, simple and effective programming language (called ‘S’) which includes conditionals, loops, user defined recursive functions and input and output facilities. (Indeed, most of the system supplied functions are themselves written in the S language.)

The term “environment” is intended to characterize it as a fully planned and coherent system, rather than a collection of very specific and often somewhat inflexible tools, as is frequently the case with other software for data analysis.

R is very much a vehicle for newly developing methods of interactive data analysis. It has developed rapidly, and has been extended by a large collection of packages. However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.

¹ The language for controlling S is said to be essentially the same as R, and it is claimed that most procedures available for S may be used in R without modification. Also, one may see references to the “S language” even when R is discussed. In other words, one may regard S and R as two different implementations of the same language.

My feeling is that this quote represents a series of understatements. R is in many ways much better suited for research purposes than the commonly used commercial products. For one thing, it has to a very large extent been created by scientists for scientists. In particular, the graphics functions in R are really very advanced and flexible.

Compared with the popular GUI (Graphical User Interface) systems, some of the characteristics of R are:

• It is command oriented, controlled by entering commands at a console or “window” (a special screen). Persons who are used to a conventional GUI system may regard this as an old-fashioned, slow and cumbersome interface. That is true, at least to begin with and for novices. However, once you know the basic commands, it is much faster and easier to use than any GUI you can think of. The real gain, though, is in flexibility. You have many more degrees of freedom than in conventional statistical packages controlled by menus and the mouse (MM type programs). The wonderful part of GUIs is what has been called WYSIWYG (What You See Is What You Get); the drawback is simply the reverse: WYSIAYG (What You See Is ALL You Get).

• The commands may be stored in a file for later editing and reuse. This is very handy for documenting anything beyond a trivial set of data analysis operations in a research project, as well as for testing out variants of the same analysis.

• It is object oriented, so you can have anything from a minimal amount of output (the default) to a lot more if you so wish.

• It is a very good example of an “open source” program. This means that it is downloadable for free from the net, and the same holds for a very large number (hundreds) of “packages” for more specialized functions which can be added to your version of the system. If you want to (but you do not have to, of course), you are even able to read the source to see what any part of the system really does. In other words, this type of software is subjected to “peer review” like any other part of a research process. In contrast, the code for conventional packages is definitely not open for inspection and may have (and probably has) errors no researcher will ever know about.

• When first installed, it has a “base”: a basic set of functions and data sets, enough for most users. This “base” can be expanded by downloading “packages” of your own selection (among hundreds of alternatives) covering YOUR particular needs. The available packages cover everything from general functions to highly specialized themes. In other words, you may tailor the system to your own requirements.

And according to (Verzani, 2004):

R is excellent software to use while first learning statistics. It provides a coherent, flexible system for data analysis that can be extended as needed.



It is perfectly possible to start the students with the bare basics without having them cope with a confusing menu structure that includes much more than even normally experienced users would ever need. You only have to observe the problems a group of fresh users have with their first encounters with a so-called “user friendly” “point-and-click” program to have serious doubts about the “user friendliness” of the same programs. It can even be argued that these programs encourage the learning of habits that are very much less than optimal for researchers.

1.2 Inexperienced users

Users not acquainted with conventional programs should have no problems other than learning about statistics at the same time as getting acquainted with R. They have no preconceptions to get rid of, and the main problem is to get them hooked on to the optimal starting point and to have a reasonable progression from that point.

1.3 Experienced users

Users experienced with conventional statistical systems like SPSS and Statistica may have some handicaps when starting with R, for a number of reasons. For one thing, the menu structure in these programs includes much more than most users will ever need. The initial problem with these systems is therefore to learn what can (or should) be ignored, rather than what is actually needed and useful for the problem at hand. Furthermore, conventional statistical systems are rather inflexible in respect to how the data for the analysis are organized, where the main unit for the analysis is a “data set”. In other words, with other systems you tend to work with one (often very large) data set or file which contains everything you conceivably might need. In contrast, with R, you are encouraged to work with subsets of your complete data set, each containing what you need for one particular analysis or a class of related techniques. In particular, my understanding is that the concept of a “data set” in conventional systems is quite rigid.

In any case, the closest equivalent to the “data set” concept in R is a “data frame” which is almost the same as the conventional “data set”, but somewhat looser. The differences in terminology may also need getting used to, and in a few cases things are very different.² One could (or rather should) regard R (or S) as a programming language dedicated to statistical and mathematical operations, since all the elements of more conventional programming languages are there as well (loops, conditional execution, functions etc.), although I have to a large extent avoided that part of the system in this exposition.

1.4 Finally

To quote (Crawley, 2005):

Learning R is not easy, but you will not regret investing the effort to master the basics.

2 For instance: The basic rule in conventional packages is that all observations from the same individual or “unit” are placed in the same row of the data matrix. With R (and S), the basic unit seems to be the variable, or perhaps even the observation as such. This shows up in the use of repeated observations, where all observations of the same variable are assumed to be in the same vector or column, with additional “factors” identifying the repeat for each observation and the individual case to which the observation belongs. The effect is that imported data of this type from these systems may have to be rearranged before analysis; a simple import is not sufficient. See section 7.1.6 on page 72 for more details.
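The difference in layout described in this footnote may be easier to see with a tiny example. The following is only a sketch with made-up data: the names wide, long, id, time and score are my own choices for the illustration, and the rearrangement is done by hand with rep () and c () rather than with any specialized function.

```r
# Hypothetical data: three persons measured on two occasions,
# one row per person (the conventional layout)
wide <- data.frame (id = 1:3,
                    t1 = c (5, 6, 7),
                    t2 = c (6, 8, 9))

# The layout R often expects: one row per observation, with
# factors identifying the person and the occasion
long <- data.frame (id    = factor (rep (wide$id, times = 2)),
                    time  = factor (rep (c ("t1", "t2"), each = nrow (wide))),
                    score = c (wide$t1, wide$t2))
long
```

The six rows of long contain exactly the same information as the three rows of wide, only arranged differently.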



Chapter 2

Starting and Stopping R

This chapter covers little more than the really elementary information on how to start and stop the system. In addition, some help is presented for the really impatient readers who can generate some instructive output with only a few keystrokes.

An additional subject is to present the basics of getting help from the system, plus information on where more documentation for the system may be located.

If the program is not installed already, the first thing to do if you want to use the program is of course to get hold of it. See the “Installation and fine-tuning”, part B.1 on page 107.

2.1 Opening a session

Figure 2.1: R Opening Window

After an installation of R on Windows, you will probably find an icon on the desktop, a blue “R”. Click that once to start the program and the screen will look like the one in Figure 2.1. An alternative is to click on a copy of a file called “.Rdata” (or a shortcut to a file with that name). See part 3.7: “The workspace” on page 15 below for details. When the program is started, the current workspace is loaded and a “prompt” appears (normally a “>”). This is an invitation to type something (a command). If you do so and end the line by pressing the “Enter” key, the program responds with some output and a new prompt appears. In other words, R is an “interactive language”.

2.2 For the impatient

Whenever I come across a new program, I always want to have a try at generating some results more or less immediately. I always assume that others do the same thing.

With statistical software, the first question is: How do I get some data into the system? With R, that is simple. You already have some data at your fingertips once the program is started. So, there are a number of data sets that may be used without any operations; they are part of the installation.

The next question is: How do I get some results? Enter the following command when your computer displays the window shown in Figure 2.1 (R Opening Window):

> mean (attitude)

As a result, you would see:

    rating complaints privileges   learning     raises   critical    advance
  64.63333   66.60000   53.13333   56.36667   64.63333   74.76667   42.93333

What does this mean? First, there is a data set (a “frame” in R terminology) called “attitude” which is part of the installation of R and is available at all times. Details about this data set are obtained by entering the name of the data set with a ? in front, i.e. ?attitude (there are also a large number of other data sets, which are listed if you enter the command data ()). You have access to a list of the variable names in the “attitude” data set by a names () command, i.e.:

> names (attitude)
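A note for readers running a current version of R: as far as I can tell, calling mean () directly on a data frame no longer returns the column means there (it gives NA with a warning instead). The sketch below obtains the same numbers with colMeans (), and with sapply () as an alternative. The object names xbar and xbar2 are only examples.

```r
# Column means of the built-in attitude data frame
xbar <- colMeans (attitude)
print (xbar)

# The same result, computed by applying mean () to each column in turn
xbar2 <- sapply (attitude, mean)
print (all.equal (xbar, xbar2))
```

Both versions return a named vector with one mean per column, which can be used in the same way as the result discussed in the text.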

The function mean () returns an “object” which contains the means for the columns in the data set named in the call on the function. It is possible to assign a name to the object (or rather the other way around), as in the next command:

> xbar <- mean (attitude)



> barplot(xbar)

And you get a nice-looking bar plot of the variable means in a separate window. Not perfect: some of the longer variable names underneath the bars have been omitted for some reason, and you may for instance want to add some colors, but it is a good start towards something that might be useful. Also, note that you can reuse commands: use the “up” and “down” arrows to locate a command you would like to use again, possibly edit it, and press the Enter key. This is handy for correcting errors in commands without having to retype the whole line, or for trying out variants of commands.
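The two cosmetic problems mentioned here can usually be handled with arguments to barplot () itself. A minimal sketch, assuming a current version of R (hence colMeans () rather than mean ()); the colour, the text size and the axis label are arbitrary choices of mine:

```r
xbar <- colMeans (attitude)

# las = 2 turns the labels so that long names are less likely to be dropped,
# cex.names shrinks them slightly, and col adds some colour
mids <- barplot (xbar, las = 2, cex.names = 0.8, col = "lightblue",
                 ylab = "Mean")
```

The value returned by barplot () (stored in mids here) holds the positions of the bars on the horizontal axis, which is handy if you want to add something to the plot later.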

2.3 Closing a session

When you want to end the session, either close the window or type the command q () at the prompt and then press the Enter key. In both cases the window shown in Figure 2.2: “Closing R” appears.

Figure 2.2: Closing R

If the “Yes” button is clicked, the workspace containing all the “objects” that have been generated in this session (and possibly those from previous sessions) is saved in the “active directory”. This information is stored in a file called .Rdata. When the system is started for the first time, this file is placed in the same directory as the program itself by default. This is not recommended in any case. With some installations (large networks etc.) you may not even be permitted to do so. The solution is to use the option “Change directory” found in the “File” menu before closing down and then pick a “working directory” within your private “area”. If you then look in that location with Windows Explorer you will find the .Rdata file there. Double-click the file name to start the program.

The general theme of “workspaces” is discussed in more detail in part 3.7 on page 15.

2.4 Documentation

In general, the documentation for the system is very nice. When installed, several documents in PDF format are included and are available from the Help menu item in the main form seen when the program is started. These include several manuals, of which two are potentially useful for novices:

• An Introduction to R: This is not very introductory in my eyes, but may be useful nevertheless, especially after you have had some experience with the language.

• R Data Import/Export: This covers most of the options in respect to importing data into the system, some very advanced.

Documentation of the more technical kind and not for beginners includes:

• R language definition: For more experienced users and for people who have some background in programming, this is a very interesting document. After using R for a while, it is recommended to at least look at this one now and then.

Parts of the same documentation are also available in quite readable HTML format from the same menu, nice if you do not have a PDF reader installed.




There are also a number of very active mailing lists. For details first click Help and then R project home page when you are logged on to the net. On that page locate the “Mailing lists” entry in the list on the left side. For beginners, follow the instructions on the “R-help” part of the page. It is always instructive to read the questions and the answers in this list; the top people in a number of fields are very active there, and the tolerance for newbies is high. But, read the “Posting guide” before posting a question to the list!

Apart from the documentation included with the installation and the mailing lists, there are a large number of downloadable documents that cover many different fields. A list of “Contributed documentation” is maintained on the main site for the R project:

http://cran.r-project.org/

Click on the “Contributed” item in the list on the left. These documents are in general very good (some are in reality early versions of printed books in PDF format). One of the more useful ones from my point of view is “Notes on the use of R for psychology experiments and questionnaires” by Jonathan Baron and Yuelin Li, but there are several others that may be of interest for psychologists as well. Another nice downloadable document is called “R for beginners” and is found at:

http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf

The author (Emmanuel Paradis) goes into some aspects of R in a quite detailed manner, but the exposition is clear. The chapter on graphics is very nice, and so is the chapter on programming with R. Well worth looking at.

In addition to the manuals included with the installation and downloadable material from the Internet there are a number of published books as well, which may be useful. Of the latter type, (Dalgaard, 2002) is a very readable introduction and strongly recommended; the same holds for (Verzani, 2004). Read the Dalgaard one first, it is very instructive. Books like (Everitt, 2005), (Crawley, 2005), and (Maindonald & Braun, 2003) are more advanced and intended for more experienced users. The real and more recent “bible” (more than 950 pages, packed with information) is (Crawley, 2007), an extremely nice book, quite expensive, but really very nice and very useful. A reference with a more narrow focus oriented towards resampling as covered in part 6.1 is (Good, 2005). Also very good.

However, books like the ones mentioned above may omit entire fields of research. One example of the latter is what is called “Social Network Analysis”², which really is a rapidly growing field of interest for (among others) researchers within organizational and social psychology. In other words, given your particular interests you may have to look closely at the downloadable packages on CRAN and their documentation and at the same time be aware that variations in the terminology used in the documentation may not be obvious. Posing polite questions at the user list for R may be a very good start.

Another source of information that is potentially useful is a “Wiki” for R:

http://wiki.r-project.org

2.5 Getting help

When using the system, help on any function may be obtained either by writing the name of the function preceded by a "?", e.g.:

2 I recently attended a conference for the INSNA organization (called “Sunbelt” conferences). A very large part of the contributions I saw there used R for the graphics (one third?) and very many had used LaTeX for the generation of the slides.



> ?lm

Or alternatively:

> help (lm)

The lm () function covers what is also called multiple regression (in general: linear models) as well as analysis of variance. The commands above result in the opening of a new window (Figure 2.3) with the documentation for that function, covering all the options and an example. Note: there is much more information in that particular window than is shown here, you need to scroll! If you want to look at examples of the usage of a particular function, use the example () command. This is especially nice for the graphical functions, where there are many options. Try example (dotchart), example (plot), example (contour).

Figure 2.3: Sample Help Window

In addition to the example () function, there are a number of demo ()’s. See demo (graphics), demo (persp) and demo (image).

As mentioned above, help on the pre-installed data sets (frames) is handled in the same manner, e.g.

> ?attitude

To get information on installed packages you use the library () command, e.g.:

> library (help=utils)

Where "utils" is one of the many packages installed by default.

There is also a function called RSiteSearch () in the R language itself that can be used to search for terms on the net. So, if you enter the command RSiteSearch ("repeated measures") (and are connected to the net) you get a very large number of links which include the term entered as the argument. There are also a number of options (see the help for the function) that can limit the search somewhat.
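When you do not know the exact name of a function, two further tools may help; as far as I know both are part of the standard installation. apropos () lists the names of available objects matching a pattern, and help.search () looks through the installed documentation. A small sketch, where the search terms are just examples:

```r
# Names of all visible objects containing the string "mean"
hits <- apropos ("mean")
print (hits)

# Search the installed help pages for a phrase (may take a moment):
# help.search ("linear models")
```

The list returned by apropos () depends on which packages are loaded, so the exact output will vary from one installation to another.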

2.6 Fine-tuning the installation

You may very well want to change the installation in several ways, e.g. to install a better editor than Notepad, or perhaps install one of several possible GUI interfaces for R, maybe add packages for access to more specialized techniques, etc. See part B.1 on page 107 for details about these subjects.



Chapter 3

Basic stuff

This chapter covers the bare basics of R, just to give you an initial idea of what the system can do. The subjects include:

• Using R as an overgrown calculator and simple expressions.

• Basic functions.

• A very basic discussion of “objects”, a very important aspect of R.

3.1 Simple expressions

Let us start with some really simple-minded stuff without using any data sets. Type the following at the > prompt:

> 2 + 2

And the answer or response appears on the next line:

[1] 4

The conclusion so far is that the system can add. It can do more. In general, this type of operation works for almost any type of formula, where the basic operators are +, -, /, *, and ^ (plus, minus, division, multiplication, and raising to a power). All the normal rules for expressions hold, e.g. multiplication and division are done before addition and subtraction. Parentheses may be used to control the sequence of the operations.

Other operators are available, like relational and logical operators, plus matrix operations. Functions like log (), exp (), sin (), cos (), tan (), sqrt () etc. have the normal meaning, e.g.:

> log10 (2)

[1] 0.30103

The latter is what is called a “function”, which in this case returns the logarithm base 10 of 2, again a single value. However, sometimes R prints values that look more impressive than they really are, e.g.:

> sin (pi)

[1] 1.224606e-16


This result is written in so-called “scientific notation”. With a negative exponent (-16), this result means: “Move the decimal point 16 places to the left”; in other words, this corresponds to the value 0.0000000000000001224606. For all practical purposes this value is equal to zero (not exactly, but very close). This kind of result is not uncommon when using computers; such results may be hidden in some way or another, but they are there anyhow, behind the scenes so to speak. R is explicit in this respect.
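The rules for expressions, and the “almost zero” result, can be tried out directly. This is just a small sketch; the name s is my own choice, and the use of all.equal () and round () is one possible way (not the only one) of treating such tiny values as zero.

```r
2 + 3 * 4          # multiplication before addition: 14
(2 + 3) * 4        # parentheses change the order: 20
2 ^ 3              # raising to a power: 8

s <- sin (pi)
s == 0                          # FALSE: the value is tiny, but not zero
isTRUE (all.equal (s, 0))       # TRUE: equal within a small tolerance
round (s, 10)                   # 0
```

The practical lesson is that results from this kind of computation should not be compared with == when exact equality cannot be expected.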

3.2 Vectors

In contrast with single values, sometimes one wants to operate on or with a group of values at the same time, e.g. all the observations on one variable in a data matrix. Imagine data for a sample of 5 persons, where the values for one of the variables consist of the numbers 1 to 5. Anticipating the discussion on “objects”, we can generate such a vector with the c () function and assign a name to it:

> x <- c (1, 2, 3, 4, 5)

> x

[1] 1 2 3 4 5

This is a vector, a collection of values of the same type. This is not a very advanced one, but we can do arithmetical operations on such vectors just as easily as for simple (single) values. We can for instance print this vector with the constant 10 added to all the values in the vector:

> x + 10

[1] 11 12 13 14 15

Or print the squares of all the values:

> x ^ 2

[1] 1 4 9 16 25

Logical expressions return one of two possible values, either TRUE or FALSE. An example might be:

> x < 3

[1] TRUE TRUE FALSE FALSE FALSE

Regard the x < 3 part as a statement “x is less than 3”. Each of the five values in the vector x is then examined and compared with the constant 3. The two smallest values are smaller than 3, so the first two values printed are TRUE while the remainder are FALSE.
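The operations in this section can be collected in one small script. The last three lines go one step further than the text and are only meant as a hint of what a logical vector can be used for; the name small is arbitrary.

```r
x <- c (1, 2, 3, 4, 5)

x + 10            # 11 12 13 14 15
x ^ 2             # 1 4 9 16 25

small <- x < 3    # TRUE TRUE FALSE FALSE FALSE
sum (small)       # number of TRUE values: 2
x [small]         # the values for which the statement is true: 1 2
```

Using a logical vector inside the square brackets picks out the values for which the statement is TRUE.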



3.3 Simple univariate plots

However, a very different kind of result is obtained by typing the command (also a function):

> plot (rnorm (1000))

And then press Enter. In this case a new window opens with something like Figure 3.1.

The command generating this plot is really a two-step affair: first a vector (an ordered collection of values) containing 1000 random numbers is generated, where the values have a normal distribution with a mean of zero and standard deviation of one (by the call on the rnorm () function). Then each of the 1000 values in this vector is plotted (the plot () command) with the value on the vertical axis and the number of the value within the sequence on the horizontal axis. Since the values tend towards a mean of zero, the points are concentrated along a horizontal zone in the middle of the plot with fewer points at the upper and lower extremes. If you want to use this or any other figure in a document or presentation, right-click the window for the figure and a menu appears with some alternatives for a “copy” operation.
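The two steps can also be written out separately, which makes it possible to look at the numbers before plotting them. A small sketch; the name z and the call to set.seed () are my additions (the seed is arbitrary and only there to make the example repeatable).

```r
set.seed (123)              # arbitrary seed, only to make the example repeatable

z <- rnorm (1000)           # step 1: 1000 values from a standard normal distribution
mean (z)                    # should be close to 0
sd (z)                      # should be close to 1

plot (z)                    # step 2: each value against its position in the sequence
```

Without the call to set.seed () you get a new set of random values, and therefore a slightly different plot, every time.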

3.4 Functions

The use of log10 (), rnorm () and plot () in the examples above represents the use of functions in the R language. In some cases (as with log10 ()) the result is returned as a single value, in other cases the result may be a vector of a specified length (as with the call on rnorm () or the call on c ()). An example of the results from an even more complex function in this respect is lm () which returns all the results from a multiple regression stored in an object (a linear model, see part 5.4 below on this subject). Some functions in R are “magical” in the sense that the result obtained from them depends on what is thrown at them.

Figure 3.1: 1000 normally distributed random values

The plot () function is typical in this respect. If the argument is a single vector, the result is a plot like the one in Figure 3.1; if there are two vectors as arguments (or a data frame with two columns) you will get a standard scatterplot similar to 5.2 on page 35; and if the argument refers to a data frame with more than two variables (columns), you will get a combined plot with subplots consisting of scatter plots for all pairs of columns.

The function summary () is similar, where the obtained output is dependent on the nature of the object used as the argument to the function.

There are a number of other plotting procedures worth looking into, one is hist () which is used for the generation of histograms, another is boxplot () which generates normal box plots. All of them have different options for tailoring the plots to what you may want.
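How the output of summary () changes with the type of argument can be seen by handing it three different kinds of objects. This is only a sketch: the regression of rating on complaints is an arbitrary example, and the names s1, s2, s3 and fit are mine.

```r
s1 <- summary (attitude$rating)                    # one numeric vector
s2 <- summary (attitude)                           # a whole data frame
fit <- lm (rating ~ complaints, data = attitude)   # an arbitrary linear model
s3 <- summary (fit)                                # a fitted model

print (s1)     # minimum, quartiles, mean and maximum
print (s2)     # the same kind of information for every column
print (s3)     # coefficients, standard errors, R-squared etc.
```

The same function name is used in all three cases; R picks the appropriate version based on the kind of object it receives.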

3.5 Objects

Most of the operations above are not really useful, for the simple reason that the results for the most part are not kept anywhere and therefore cannot be used at a later stage in the session. In order to be able to do so, the results have to be stored in some manner. The trick is to assign the results of an operation to a name:

> a <- 5 + 1




or:

> b x



For obvious reasons, operators like +, -, / and * are not permitted in an object name, nor are blanks or quotes permitted (use a period or an underline (_) instead). In general, the “name” of an object contains a “pointer” or “address” to where the object is stored in the memory of the machine, which means that it is a simple matter to change the object that a particular name is assigned to. Just enter a new command with an assignment to that name. On the other hand, this also means that it is easy to lose something by assigning a new object to a name that is already a pointer to something.

Certain one-letter names should be avoided, such as “c”, “q” or “t”, since there are useful functions with those names (c (): combine values into a vector or list, q (): quit the session, and t (): transpose a matrix, respectively). Some names are reserved as well, like function, break, if, TRUE, and FALSE; they are all part of the R programming language. You should also avoid using object names that are names of functions you use.

There are other conventions in the use of names. Very often n is used for the length of a vector or the number of cases in a sample, x and y are normally used as symbols for data vectors (the latter for dependent variables), and i and j are often used for indices, i.e. for “numbering” or “referring” to things.

Also note that the case of letters in a name is important (as in all Unix/Linux based systems):

An object called “ab” is not the same as one called “Ab”, “AB”, or “aB”.
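A few of the points about names can be checked at the prompt. The object names below are made up for the illustration.

```r
ab <- 1
Ab <- 2
ab == Ab          # FALSE: these are two different objects

n <- 10
n <- "ten"        # the name n now refers to the string; the value 10 is gone
n

my.score <- 5     # a period may be used inside a name
my_score <- 6     # and so may an underline
```

Note that R gives no warning when n is assigned a new value, so it is up to you to keep track of what the names refer to.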

3.5.2 Object contents

Since the actual contents of an object may vary from a simple value to a large number of different types of information, e.g. the results from a multiple regression (part 5.4.10) or a factor analysis (part 5.4.4), it may be useful to be able to inspect what elements are included in the object. This is achieved with the str () function. Use that function with the name of the object you are interested in, and you get a list of the contents of the object.
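As an illustration, str () can be applied both to a data frame and to the result from a regression. The model below is an arbitrary example, and the argument max.level = 1 is my choice to keep the listing short.

```r
str (attitude)                       # a data frame: one line per column

fit <- lm (rating ~ complaints, data = attitude)
names (fit)                          # the names of the elements in the result
str (fit, max.level = 1)             # overview without all the details
fit$coefficients                     # one element, picked out by name
```

Once you know the name of an element, the $ operator gives direct access to it.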

3.6 Other information on the session

There are other functions that might be useful, giving information on the session and what is available. In addition to the ls () function mentioned above, another one is sessionInfo (). The command data () lists all the data sets or frames included in the session. The command library () provides a list of all the installed packages, which are therefore available for use. See also the part on Installing packages in part B.4 on page 109 below.
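The same information can also be picked up as ordinary objects, which is convenient when the full listings are too long to read. This is a sketch under the assumption that the standard packages "datasets" and "utils" are installed (they normally are); the names info, sets and pkgs are mine.

```r
info <- sessionInfo ()
info$R.version$version.string        # which version of R is running

sets <- data (package = "datasets")$results[, "Item"]
head (sets)                          # names of some of the included data sets

pkgs <- rownames (installed.packages ())
head (pkgs)                          # names of some of the installed packages
```

The exact output naturally depends on the installation.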

3.7 The “Workspace”

In addition to the “object” concept, the “workspace” concept is very important when using R. When you start the system by clicking on the .Rdata file in a directory (or a shortcut to a file with that name), you start the system and at the same time load all the objects stored in that workspace, i.e. the contents of the .Rdata file. So, if you ended the previous session by clicking on the “Yes” button in the final window (see Figure 2.2: Closing R on page 7), all the objects in that workspace were saved, and you can continue the next session where you left off.

In other words, just as an object can be regarded as a container for information on something (including where the information comes from), the workspace may be in itself regarded as a container for objects. Every time you start R, you start the system in the state that you saved that particular workspace in.

This is one of the major differences between R and more conventional systems for statistical computations. With systems like SPSS you normally start a session with one data set alone and the system has no memory of what has been done before, nor does it have the tools for maintaining such a memory of operations. In contrast, with R, a workspace may contain any number of data sets (frames), results stored in objects, functions, etc. accumulated over several sessions. This is very nice, but potentially problematic as well. If a workspace contains a lot of objects with obscure names you may run into problems. So, assigning meaningful names to objects is important. See part 3.5.1 on page 14 above.

For that reason it is also good practice to keep things apart that should be kept apart, i.e. to have workspaces in separate catalogs or directories for each project you are working on.

More information on this subject is found in part 7.5 on page 77 below.

3.7.1 Listing all objects in the workspace

In order to get a list of all the names used for objects in a workspace, the command ls () is used. Alternatively, the command objects () can be used. In the GUI interface, there is an option for this operation in the “Misc” menu entry as well.

3.7.2 Deleting objects in the workspace

Specific objects in the workspace can be removed with the rm () command, e.g.:

> rm (height, weight)

All the objects in the workspace can also be removed with the same command combined with a list:

> rm (list=ls ())

Strictly speaking, you should use rm (list = ls (all.names = TRUE)) instead, since the simple version of the ls () command does not list variables or objects with names starting with a period. Normally, this should not be necessary, for the simple reason that names starting with a period should not be used by normal users in any case. Alternatively (if you are using a Windows machine), there is a “Remove all objects” entry in the “Misc” menu on the GUI interface.

However, this is a dangerous operation, for the simple reason that it is likely that you will want to keep some objects. So list them before removing anything.
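A cautious way of cleaning up might look like the sketch below. The objects height, weight and keep.me are made up for the example, and the last command is left as a comment on purpose, since it removes everything except the one object named.

```r
height  <- c (180, 165, 172)
weight  <- c (80, 60, 70)
keep.me <- "results I still need"

ls ()                          # look before you delete
rm (height, weight)            # remove two specific objects

exists ("height")              # FALSE: the object is gone
exists ("keep.me")             # TRUE: this one is still there

# Remove everything except keep.me (use with care):
# rm (list = setdiff (ls (), "keep.me"))
```

The setdiff () call simply takes the list of all names and drops the ones you want to keep before the list is handed to rm ().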

3.8 Directories and the workspace

By default, R reads files from and writes files to the “active directory”, and this is the catalog or directory where the program is started. More precisely, this directory is the one where the .Rdata file used for starting the program resides. Since it is perfectly OK (and often very useful) to operate with more than one workspace (i.e. different .Rdata files) it may be convenient to have separate directories for each “project”, each with a .Rdata file. This is how to do it:

Look in the “File” menu and select “Change dir.”, and when you have located the directory you want to use with that option, save your workspace there with the “Save Workspace” option. One other useful operation is to simplify the use of that workspace: locate the .Rdata file using the “File Explorer”, right-click on the file name and create a shortcut to the file. Drag the shortcut to the desktop, and give it a suitable name.
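The same operations can be done with commands instead of menus, as far as I know with the same effect. In the sketch below the project directory is a hypothetical one placed in a temporary location, just so that the example can be run without touching your own files; in real use you would give the path to your own project directory. Note that R itself writes the file name as .RData; on Windows the case of the letters makes no difference.

```r
old  <- getwd ()                              # remember the current directory
proj <- file.path (tempdir (), "project1")    # hypothetical project directory
dir.create (proj)

setwd (proj)                                  # corresponds to "Change dir."
scores <- c (12, 15, 9)                       # some object worth keeping
save.image ()                                 # corresponds to "Save Workspace"
file.exists (".RData")                        # the workspace file is now here

setwd (old)                                   # go back to where we started
```

Storing these commands in a script is one way of documenting where the workspace for a project is kept.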



3.9 Basic rules for commands

The basic rules for commands are:

• Blank lines are ignored.

• The parts of a line following a “#” are also ignored; this is handy for adding comments or explanatory text (annotations) in scripts, functions and data files.

• Strings are enclosed in either double quotes, e.g. "Extroversion score", or single quotes, e.g. 'Extroversion score'. The two types work the same way, but you have to be consistent. You cannot start a string with a double quote and close it with a single quote.

• Blanks are also ignored within commands, with the exception of blanks within strings. It is recommended to use blanks to increase readability. However, there is one very specific situation where blanks outside strings matter. If you write x
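The rules in this list can be tried out in a few lines. One case I am sure of where a blank outside a string changes the meaning is inside the assignment operator, so that is the case used in this sketch; the names label1, label2 and x are arbitrary.

```r
# Everything after a "#" is ignored

label1 <- "Extroversion score"     # double quotes
label2 <- 'Extroversion score'     # single quotes
identical (label1, label2)         # TRUE: the two strings are the same

x <- 5
x<-3          # read as an assignment: x is now 3
x < -3        # read as a comparison: is x less than -3? FALSE
x             # still 3
```

Writing the assignment with blanks on both sides of the operator, as in the first assignment to x, avoids this kind of ambiguity.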



Chapter 4

Data sets / Frames

The subject of this chapter is a special case of the object concept introduced in the last part of the previous chapter, a type of object, the data frame.

Handling data is a large subject in itself, and therefore only the basic input operations plus a discussion of missing data are covered in this chapter. More details on this subject are found in the chapter on Project Management.

Apart from the data sets that are included with R when it was installed (e.g. the attitude data set used above), you will probably want to get hold of some data of your own at a very early stage. Therefore, learning the basics of how to handle data sets or “frames” as they are called in the R context has to be one of the first and really very important things to learn.

4.1 Sources of data sets

For any type of analysis, some data are needed. In this context, there are three possible sources.One may:

• Read data sets from file (see part 4.4 on page 22), where there are a number of differentalternatives, both in respect to where the data sets are read from, as well as how the dataare formatted. When read, the data set(s) may be saved in your workspace, if so, they areimmediately available the next time you start R.

The reading of data sets from files is part of the normal use of R, usually performed with some variant of the read.table () function. You may obtain a list of the objects (which includes any data sets or frames) in your workspace with the objects () function. Alternatively, you may use the data () function to obtain a list of the data frames installed by default.

• Alternatively, use any of the data sets included when the base version of R is installed, or added when other packages are added to the installation. These data sets are available at all times, like the attitude data set used above. Unlike the attitude data set, some data sets are part of packages. Then you need to use the library () command to get hold of them, but they are nevertheless available with a minimal effort. Few of these data sets come from fields related to psychology, but they may nevertheless be useful, at least for demonstrations.

• In addition, one may generate data with specific properties in a number of different ways. This is a very powerful aspect of R, and very useful in many contexts, not the least for exploring what the different techniques actually do. This approach is used in the discussion of ANOVA in part 5.4.9 on page 49 as well as a script version of this procedure in part 8.2.3 on page 95.

Or, any combination of these three.[1]

All three approaches are used in the chapter on data analysis. The important point to keep in mind is that once data have been read into memory, they can be saved in the workspace (and you may have more than one workspace, see part 3.7 on page 15 and 7.5 on page 77). In that case the data sets are available at all times once that particular workspace has been opened.
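The three sources listed above can be sketched in a few lines. This is only a minimal illustration and not part of the original notes: the object names simdata, scores and group are invented for the example, and the generated values have no substantive meaning.

```r
# 1. Objects (including data frames) currently in the workspace
objects()

# 2. A data set that comes with the base installation
data(attitude)              # makes sure 'attitude' is loaded
dim(attitude)               # number of rows and columns

# Data sets in an add-on package become available after library(),
# e.g. library(MASS); data(package = "MASS") then lists them.

# 3. Generated data with known properties (all names are hypothetical)
set.seed(1)                                   # reproducible random numbers
scores  <- rnorm(100, mean = 50, sd = 10)     # 100 normally distributed values
group   <- rep(c("a", "b"), each = 50)        # two groups of 50 cases
simdata <- data.frame(group = group, score = scores)
```

Saving the workspace afterwards keeps simdata available in later sessions, in the same way as data read from a file.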

These data sets are all in the public domain, and are commonly used for demonstrations. One of the data sets in the list is called “attitude”, as mentioned above, and is used in a number of contexts here. Information about this data set is obtained in the same manner as other types of help:

> ?attitude

This opens a window with information on the data set. The same holds for any of the other data sets included in the installation.

4.2 Data frames

A frame (i.e. a data set in the R sense) is an object with rows and columns, one row for each person or case (except the first one), and one column for each variable. The first (top) row will usually contain the names for the variables, and (less often) the first (leftmost) column contains names for the cases.

The concept of a data frame in R is the closest approximation to the “data set” as used in conventional statistical programs. It differs from an R type “vector” or a “matrix” in the sense that the values in columns within the frame may be of different types, but the data within one column are always of the same type.

In principle, one should distinguish between the representation of the frame (e.g. normally as a text file) and the frame itself as stored in the workspace. An example of a text type of representation of a frame is seen in the box containing figure 4.3.

There are a number of different ways to import a frame (data set). The most basic ones are to read the data set from a text file or from the clipboard. There is also the possibility of reading (importing) data from other types of files, including SPSS type .SAV files. For more details beyond what is mentioned here, see the “R Data Import/Export” manual, available from the Help menu on the R interface. The most important of the functions are included in the “foreign” package (library).

Normally, a row contains all the data from one subject and a column all observations on the same variable, although this convention is somewhat less strict in R (one exception is a repeated measures type of analysis). The columns must have unique names, where the name is either created when the frame is read from file with one of the functions of the read.xxx () type, or added afterwards with the names () function combined with the c () function.
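As a minimal sketch of these points, the lines below build a small frame with columns of different types and then add the column names afterwards with names () and c (). The object name frame is hypothetical, and the values are simply borrowed from the first three rows of Table 4.1.

```r
# A small frame with two numeric columns and one text column
frame <- data.frame(c(148, 115, 107),
                    c(7.0, 12.5, 1.5),
                    c("male", "male", "female"))

# Assign unique column names afterwards with names() and c()
names(frame) <- c("hori", "path", "gender")

str(frame)      # shows the type of each column
```

Within each column all values have the same type, while the types differ between columns, which is exactly what a matrix cannot offer.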

[1] Actually, there is at least one other category, very general statistical techniques based on using real data, but with variations on where each observation belongs. One class within this category is called “randomization” or “permutation” tests (Edgington, 1995), where some of the values in an analysis are systematically shuffled; another is “bootstrapping” (Efron & Tibshirani, 1993), where each observation is kept in place but where the same results are computed for a large number of subsamples. A third variant is “jackknifing” (Efron, 1979; Efron & Tibshirani, 1993), which explores the effect of systematic exclusion of a subset of the cases in an analysis. A common label for these techniques is “resampling” or “computer intensive”. Relevant R packages are boot and bootstrap. See part 6.1 on page 60.



4.3 Entering data

My preferred tool for manual data entry when the data sets are really small is a spreadsheet (Excel[2], Gnumeric, or Calc from OpenOffice). Write the variable names as the first row in the sheet, and corresponding values for each case in the rows below.

Table 4.1: Data stored in a text file

#This a test data set
hori h1 h2 v1 v2 path gender
148  29 29 8  5  7.0  male
115  19 27 3  5  12.5 male
107  15 27 4  6  1.5  female
134  21 29 4  6  2.0  male
.......
105  24 20 1  3  1.0  female
96   21 22 3  4  25.0 female
129  22 28 2  4  8.0  male
85   18 20 4  4  18.0 male

However, if the project involves anything more than a (very) small number of variables and cases, it is strongly recommended to use a dedicated tool for data entry. A recommended strategy is to maintain a complete data set in one file, and to split the data set into several subsets of variables for each type of analysis as needed. See the part A on “Data Transfer” on page 99.

The basic rules are:

• Avoid having blanks (spaces) in the variable names or in the variable values. If you need more than one word, separate the words with periods or underscores, e.g. Neo.Score or Tom_Johnsen.

• All the values of the same variable must go in the same column.

And of course:

• All the values from the same person must (normally) be in the same row. There is one important exception, when you have a “repeated measures” type of design. This is where R differs from programs like SPSS and Statistica. The structure of the data file used with this type of design is covered in section 7.1.6 on page 72.

The nice thing with spreadsheets is that you can have several related data sets (perhaps from the same project) in different sheets in the same file, and that it is easy to transfer the data in each sheet to R, either using the clipboard (which also works from other sources, see below), or via a “plain text” file.

In the latter case you may have to export the spreadsheet as a text file. With the relevant sheet active, select File -> Save as. In the next window select the .CSV option in the “Save as type” dropdown list and change the name of the file if that is needed. Alternatively, use the “tab delimited” option when saving the file.

The generated file will have the contents of the spreadsheet stored as text, where all the values on one row are separated with semicolons “;”.

However, depending on the setup of your system (imposed by the so-called “locale”) [3], the values in the file may be separated with commas. Also, some locales use commas as decimal separators, which may cause problems. Therefore it may be necessary either to change some options in your setup of Windows or to edit your file with NotePad or a similar editing program oriented towards “plain text”. This does NOT include the files generated by programs like MS Word. For most versions of MS Word the .doc file formats contain a lot of junk that is not relevant to the text content of the file (and probably irrelevant in respect to anything else as well). Also note that files with the extension .CSV will by default be opened by the spreadsheet program, which brings you (partially) back to where you started. It may therefore be an advantage to change the file extension into something that does not invoke anything other than an editor.

[2] Using Excel may be problematic if the data set has more than 256 variables (the upper limit for the number of columns in that program, at least for the version I am using at the moment). In that case the solution is to use another spreadsheet program (there are several free ones available; a very popular one is called “Gnumeric”: http://www.gnome.org/projects/gnumeric/, or the Calc program in OpenOffice, and yes, both are able to read and write Excel files).

[3] Some locales (including the Norwegian one which most of my students are subjected to) use commas as decimal separators in text files. This is in my opinion “against nature” for researchers and should be changed. Permanently. Open the Control panel for Windows and find the icon for “Regional and Language options”. On the “Regional options” tab, click on the “Customize” button. My advice is to set the decimal separator to a period and the list separator to a semicolon.

4.3.1 Variable types in frames

One of the differences between R vectors and matrices on the one hand and data frames on the other, is that the columns in a frame do not have to be of the same type. The basic types are:

• Numeric, a value represented as a number. This covers anything represented by numbers: nominal, ordinal, or higher levels of measurement.

• A string, e.g. a name.

• A “category” or “factor” type variable, normally used as an independent variable in multiple regression, in ANOVA designs, or as category names in frequency tables. They are assumed to have a limited set of possible values, either represented as numbers or text, e.g. codes like “0” and “1” or words like “Male” and “Female” for gender. In general, this type may be simply categorical or ordered (ordinal).

However, all the observations within a column are assumed to be of the same type.

In respect to the “factor” type, R is normally quite tolerant in respect to what kind of values are accepted as factors in an analysis. So a variable or data column in a frame that contains a series of 1’s and 2’s (or 20’s and 30’s for that matter) works just as well as a variable or column containing strings like “YES” or “NO”. However, in some contexts you have to be explicit about telling the function that the vector is a factor, especially when the values are not a simple dichotomy. This is especially important in respect to multiple regression and ANOVA types of analysis.
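The difference between leaving a numerically coded variable as it is and declaring it explicitly as a factor can be sketched as follows. The data are invented for the illustration, and the names group, score, fit.numeric and fit.factor are hypothetical.

```r
# Hypothetical data: 'group' is coded with the numbers 1, 2 and 3
group <- c(1, 1, 2, 2, 3, 3)
score <- c(10, 12, 15, 17, 30, 28)

# Treated as numeric, 'group' enters the model as a single slope
fit.numeric <- lm(score ~ group)

# Declared as a factor, 'group' is treated as three separate categories
fit.factor <- lm(score ~ factor(group))

length(coef(fit.numeric))   # 2: intercept and slope
length(coef(fit.factor))    # 3: intercept and two group contrasts
```

With only two codes (a simple dichotomy) the two models would have the same number of coefficients, which is why the distinction matters most when there are three or more categories.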

4.4 Reading data frames

Given that there are a number of different ways of storing potential frames in memory or files, R permits a large number of options in respect to importing frames, including reading data from “foreign” formats like SPSS or Excel. Only two are covered here: reading text files (in two variants) and reading data from the clipboard.

4.4.1 Reading frames from text files

There are several methods of creating frames for data analysis. One of the most common methods is some variation on the read.table () command for reading data from simple text files, for instance:

> mydata



Note: If you have a very large number of variables in your data set, some editors may generate problems for the read.table () operation due to line wrapping. It is therefore advisable to carefully check the data matrix after it has been read by R with the edit () or the fix () commands. In particular, check that the number of rows in the data set is correct, and have a look at the values for the last (rightmost) variables. That is in any case wise to do whenever a data set is transferred from one format to another. In particular, check that the “missing observations” have come across correctly.

Here it is assumed that the data are stored in a simple type of file called “small.data”, looking something like the one in Table 4.3. With this command it is assumed that the elements (variable names in the first row (line) and data values in the successive rows or lines) are separated with blanks (spaces) or tabs. If they are separated by something else, e.g. semicolons (which could be the case if the data set was saved from a spreadsheet program as a type CSV file), you have to use:

> mydata mydata z
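A minimal sketch of the two variants of reading a text file described in this section is given below. It is an illustration under stated assumptions rather than the exact commands of these notes: the two example files are created in a temporary directory so that the lines can be run as they stand, the file name small.csv and the object name mydata2 are invented, and only a few of the columns of Table 4.1 are used. The arguments header, sep and dec are standard arguments of read.table ().

```r
# Create two small example files in a temporary directory
blankfile <- file.path(tempdir(), "small.data")
writeLines(c("hori h1 path gender",
             "148 29 7.0 male",
             "115 19 12.5 male",
             "107 15 1.5 female"), blankfile)

csvfile <- file.path(tempdir(), "small.csv")      # hypothetical file name
writeLines(c("hori;h1;path;gender",
             "148;29;7,0;male",
             "115;19;12,5;male",
             "107;15;1,5;female"), csvfile)

# Variable names in the first line, values separated by blanks or tabs
mydata <- read.table(blankfile, header = TRUE)

# Values separated by semicolons, with commas as decimal separators
mydata2 <- read.table(csvfile, header = TRUE, sep = ";", dec = ",")

str(mydata)
str(mydata2)
```

If the decimal commas are not declared with dec, the column path is not read as numeric, which is one of the things worth checking after any import.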




Table 4.2: Clipboard data

# Data on clipboard
A B C
1 1 29
1 2 16
1 3 55
2 1 198
2 2 107
2 3 181

The data set is then in the workspace with the name "z". Very simple, very efficient, and works from almost any source: a file in plain text, a page on the web, a spreadsheet, any statistical program I can think of. If you want to save the data, you of course have to at least save the workspace when you exit the system. Alternatively, and more permanently, save the data set to file with a write.table () command as described in part 8.2.1 on page 91 below.

Missing observations are signalled by the text NA as the value. As with any import of data to R it is important to check that these values are coded correctly.
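The steps above can be sketched as follows. Because the contents of a clipboard cannot be assumed in an example, a text string with the layout of Table 4.2 stands in for it, with one value replaced by NA for the illustration. On Windows the real clipboard can be read with "clipboard" as the file name. The names cliptext, outfile and z2 are hypothetical.

```r
# Stand-in for the clipboard: the layout of Table 4.2, one value set to NA
# (on Windows: z <- read.table("clipboard", header = TRUE))
cliptext <- "A B C
1 1 29
1 2 16
1 3 NA
2 1 198
2 2 107
2 3 181"
z <- read.table(text = cliptext, header = TRUE)

# Check that missing observations came across correctly
colSums(is.na(z))          # number of NA values in each column

# Save the frame permanently as a text file and read it back
outfile <- file.path(tempdir(), "z.data")
write.table(z, file = outfile, row.names = FALSE, quote = FALSE)
z2 <- read.table(outfile, header = TRUE)
```

Counting the NA values per column directly after the import is a quick way of confirming that the missing observations were coded as expected.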

4.4.4 Reading data from the net

If the data you are interested in are available from some location on the net as a text file, you can replace the name of the file in the examples above with the URL for the file, e.g.:

> z
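A minimal sketch of reading from a URL follows. To keep the example runnable without a network connection it addresses a temporary local file through a file:// URL, assuming a Unix-style absolute path. The web address in the final comment is purely hypothetical, as are the names localfile and fileurl.

```r
# Build a small file and address it through a file:// URL
localfile <- tempfile(fileext = ".txt")
writeLines(c("A B C", "1 1 29", "1 2 16"), localfile)
fileurl <- paste0("file://", normalizePath(localfile))

# The URL simply takes the place of the file name
z <- read.table(fileurl, header = TRUE)

# With a real web address the call has the same form, for instance:
# z <- read.table("http://www.example.org/data/small.data", header = TRUE)
```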



that may be used. We can either:

• “Fill in” or “plug” all or a subset of the holes in the data set with more or less plausible values. This is called imputing, and there are many alternative strategies for doing this. R is very strong in this respect.

Statistically speaking, this strategy is problematic if the values to be imputed are not “missing completely at random” (often abbreviated to “MCAR”).

Alternatively we may:

• Filter or ignore parts of the data set when doing the analysis, or operate with subsets of the rows in the data set.

There are two main variants of the latter
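The two general strategies, imputing and filtering, can be sketched with a small invented frame. Replacing each missing value with the column mean is used here only because it is the simplest possible imputation rule; it is an assumption made for the illustration, not a method recommended in these notes. The names d, imputed and complete are hypothetical.

```r
# Hypothetical frame with some missing observations
d <- data.frame(x = c(1, 2, NA, 4, 5),
                y = c(10, NA, 30, 40, 50))

# Strategy 1: impute, here by plugging in the mean of the observed values
imputed <- d
imputed$x[is.na(imputed$x)] <- mean(d$x, na.rm = TRUE)
imputed$y[is.na(imputed$y)] <- mean(d$y, na.rm = TRUE)

# Strategy 2: filter, keeping only the rows without any missing values
complete <- na.omit(d)            # same rows as d[complete.cases(d), ]

nrow(complete)                    # 3 rows remain
```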
