Using R With Multivariate Statistics
Randall E. Schumacker
University of Alabama
SAGE
Los Angeles | London | New Delhi | Singapore | Washington DC
FOR INFORMATION:
SAGE Publications, Inc.
2455 Teller Road
Thousand Oaks, California 91320
E-mail: [email protected]
SAGE Publications Ltd.
1 Oliver's Yard
55 City Road
London EC1Y 1SP
United Kingdom
SAGE Publications India Pvt. Ltd.
B1/11 Mohan Cooperative Industrial Area
Mathura Road, New Delhi 110 044
India
SAGE Publications Asia-Pacific Pte. Ltd.
3 Church Street
#10-04 Samsung Hub
Singapore 049483
Acquisitions Editor: Vicki Knight
Editorial Assistant: Yvonne McDuffee
eLearning Editor: Katie Bierach
Production Editor: Kelly DeRosa
Copy Editor: QuADS Prepress (P) Ltd.
Typesetter: C&M Digitals (P) Ltd.
Proofreader: Jennifer Grubba
Indexer: Michael Ferreira
Cover Designer: Michelle Kenny
Marketing Manager: Nicole Elliott
Copyright © 2016 by SAGE Publications, Inc.
All rights reserved. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.
All trademarks depicted within this book, including trademarks appearing as part of a screenshot, figure, or other image are included solely for the purpose of illustration and are the property of their respective holders. The use of the trademarks in no way indicates any relationship with, or endorsement by, the holders of said trademarks. SPSS is a registered trademark of International Business Machines Corporation.
Printed in the United States of America
Library of Congress Cataloging-in-Publication Data
Schumacker, Randall E.
Using R with multivariate statistics : a primer / Randall E. Schumacker, University of Alabama, Tuscaloosa.
pages cm
Includes bibliographical references and index.
ISBN 978-1-4833-7796-4 (pbk. : alk. paper)
1. Multivariate analysis--Data processing. 2. R (Computer program language) 3. Statistics--Data processing. I. Title.
QA278.S37 2016
519.5'3502855133--dc23 2015011814
This book is printed on acid-free paper.
15 16 17 18 19 10 9 8 7 6 5 4 3 2 1
Brief Contents
Preface
Acknowledgments
About the Author
1. Introduction and Overview
2. Multivariate Statistics: Issues and Assumptions
3. Hotelling's T²: A Two-Group Multivariate Analysis
4. Multivariate Analysis of Variance
5. Multivariate Analysis of Covariance
6. Multivariate Repeated Measures
7. Discriminant Analysis
8. Canonical Correlation
9. Exploratory Factor Analysis
10. Principal Components Analysis
11. Multidimensional Scaling
12. Structural Equation Modeling
Statistical Tables
Chapter Answers
R Installation and Usage
R Packages, Functions, Data Sets, and Script Files
Index
SAGE was founded in 1965 by Sara Miller McCune to support the dissemination of usable knowledge by publishing
innovative and high-quality research and teaching content.
Today, we publish more than 850 journals, including those
of more than 300 learned societies, more than 800 new
books per year, and a growing range of library products
including archives, data, case studies, reports, conference
highlights, and video. SAGE remains majority-owned by our
founder, and after Sara's lifetime will become owned by a
charitable trust that secures our continued independence.
Detailed Contents
Preface
Acknowledgments
About the Author
1. Introduction and Overview
    Background
    Persons of Interest
    Factors Affecting Statistics
    R Software
    Web Resources
    References
2. Multivariate Statistics: Issues and Assumptions
    Issues
    Assumptions
    Normality
    Determinant of a Matrix
    Equality of Variance-Covariance Matrix
    Box M Test
    SPSS Check
    Summary
    Web Resources
    References
3. Hotelling's T²: A Two-Group Multivariate Analysis
    Overview
    Assumptions
    Univariate Versus Multivariate Hypothesis
    Statistical Significance
    Practical Examples Using R
    Single Sample
    Two Independent Group Mean Difference
    Two Groups (Paired) Dependent Variable Mean Difference
    Power and Effect Size
    A Priori Power Estimation
    Effect Size Measures
    Reporting and Interpreting
    Summary
    Exercises
    Web Resources
    References
4. Multivariate Analysis of Variance
    MANOVA Assumptions
    Independent Observations
    Normality
    Equal Variance-Covariance Matrices
    Summary
    MANOVA Example: One-Way Design
    MANOVA Example: Factorial Design
    Effect Size
    Reporting and Interpreting
    Summary
    Exercises
    Web Resources
    References
5. Multivariate Analysis of Covariance
    Assumptions
    Multivariate Analysis of Covariance
    MANCOVA Example
    Dependent Variable: Adjusted Means
    Reporting and Interpreting
    Propensity Score Matching
    Summary
    Web Resources
    References
6. Multivariate Repeated Measures
    Assumptions
    Advantages of Repeated Measure Design
    Multivariate Repeated Measure Examples
    Single Dependent Variable
    Several Dependent Variables: Profile Analysis
    Doubly Multivariate Repeated Measures
    Reporting and Interpreting Results
    Summary
    Exercises
    Web Resources
    References
7. Discriminant Analysis
    Overview
    Assumptions
    Dichotomous Dependent Variable
    Box M Test
    Classification Summary
    Chi-Square Test
    Polytomous Dependent Variable
    Box M Test
    Classification Summary
    Chi-Square Test
    Effect Size
    Reporting and Interpreting
    Summary
    Exercises
    Web Resources
    References
8. Canonical Correlation
    Overview
    Assumptions
    R Packages
    CCA Package
    yacca Package
    Canonical Correlation Example
    Effect Size
    Reporting and Interpreting
    Summary
    Exercises
    Web Resources
    References
9. Exploratory Factor Analysis
    Overview
    Types of Factor Analysis
    Assumptions
    Factor Analysis Versus Principal Components Analysis
    EFA Example
    R Packages
    Data Set Input
    Sample Size Adequacy
    Number of Factors and Factor Loadings
    Factor Rotation and Extraction: Orthogonal Versus Oblique Factors
    Factor Scores
    Graphical Display
    Reporting and Interpreting
    Summary
    Exercises
    Web Resources
    References
    Appendix: Attitudes Toward Educational Research Scale
10. Principal Components Analysis
    Overview
    Assumptions
    Bartlett Test (Sphericity)
    KMO Test (Sampling Adequacy)
    Determinant of Correlation Matrix
    Basics of Principal Components Analysis
    Principal Component Scores
    Principal Component Example
    R Packages
    Data Set
    Assumptions
    Number of Components
    Reporting and Interpreting
    Summary
    Exercises
    Web Resources
    References
11. Multidimensional Scaling
    Overview
    Assumptions
    Proximity Matrix
    MDS Model
    MDS Analysis
    Sample Size
    Variable Scaling
    Number of Dimensions
    R Packages
    Goodness-of-Fit Index
    MDS Metric Example
    MDS Nonmetric Example
    Reporting and Interpreting Results
    Summary
    Exercises
    Web Resources
    References
12. Structural Equation Modeling
    Overview
    Assumptions
    Multivariate Normality
    Positive Definite Matrix
    Equal Variance-Covariance Matrices
    Correlation Versus Covariance Matrix
    Basic Correlation and Covariance Functions
    Matrix Input Functions
    Reference Scaling in SEM Models
    R Packages
    Finding R Packages and Functions
    SEM Packages
    CFA Models
    Basic Model
    Multiple Group Model
    Structural Equation Models
    Basic SEM Model
    Longitudinal SEM Models
    Reporting and Interpreting Results
    Summary
    Exercises
    Web Resources
    References
Statistical Tables
    Table 1: Areas Under the Normal Curve (z Scores)
    Table 2: Distribution of t for Given Probability Levels
    Table 3: Distribution of r for Given Probability Levels
    Table 4: Distribution of Chi-Square for Given Probability Levels
    Table 5: The F Distribution for Given Probability Levels (.05 Level)
    Table 6: The Distribution of F for Given Probability Levels (.01 Level)
    Table 7: Distribution of Hartley F for Given Probability Levels
Chapter Answers
R Installation and Usage
R Packages, Functions, Data Sets, and Script Files
Index
Preface
The book Using R With Multivariate Statistics was written to supplement existing full textbooks on the various multivariate statistical methods. The multivariate statistics books provide a more in-depth coverage of the methods presented in this book, but without the use of R software. The R code is provided for some of the data set examples in the multivariate statistics books listed below. It is hoped that students can run the examples in R and compare the results with those in the books that used SAS, IBM® SPSS® Statistics*, or STATA statistics packages. The advantage of R is that it is free and runs on Windows, Mac, and Linux operating systems.
The full textbooks also provide a more in-depth discussion of the assumptions and issues, as well as data analysis and interpretation of the results using SPSS, SAS, and/or STATA. The several multivariate statistics books I consulted and referenced are as follows:
• Afifi, A., Clark, V., & May, S. (2004). Computer-aided multivariate analysis (4th ed.). Boca Raton, FL: Chapman & Hall/CRC Press.
• Hair, J. F., Jr., Black, W. C., Babin, B. J., & Anderson, R. E. (2010). Multivariate data analysis (7th ed.). Upper Saddle River, NJ: Prentice Hall.
• Meyers, L. S., Gamst, G., & Guarino, A. J. (2013). Applied multivariate research: Design and interpretation (2nd ed.). Thousand Oaks, CA: Sage.
• Raykov, T., & Marcoulides, G. A. (2008). An introduction to applied multivariate analysis. New York, NY: Routledge (Taylor & Francis Group).
• Stevens, J. P. (2009). Applied multivariate statistics for the social sciences (5th ed.). New York, NY: Routledge (Taylor & Francis Group).
• Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston, MA: Allyn & Bacon.
*SPSS is a registered trademark of International Business Machines Corporation.
This book was written to provide researchers with access to the free R software when conducting multivariate statistical analysis. There are many packages and functions available, which can be overwhelming, so I have collected some of the widely used packages and functions for the multivariate methods in the book. Many of the popular multivariate statistics books will provide a more complete treatment of the topics covered in this book along with SAS and/or SPSS solutions. I am hopeful that this book will provide a good supplemental coverage of topics in multivariate books and permit faculty and students to run R software analyses. The R software permits the end users to customize programs to provide the type of analysis and output they desire. The R commands can be saved in a script file for future use, can be readily shared, and can provide the user control over the analytic steps and algorithms used. The advantages of using R software are many, including the following:
• Free software
• The ability to customize statistical analysis
• Control over analytic steps and algorithms used
• Available on Windows, Mac, and Linux operating systems
• Multitude of packages and functions to conduct analytics
• Documentation and reference guides available
Data Sets
The multivariate textbooks listed above have numerous examples and data sets available either in their book or on the publishers' website. There are also numerous data sets available for statistical analysis in R, which can be viewed by using the following R command(s):
> data() # alphabetical list of data sets
or
> data(package = .packages(all.available = TRUE))  # data sets listed in various R packages
or
> library(help = "datasets")  # alphabetical list of data in the R datasets package
Or, you can enter the following URL to obtain a list:
http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
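The same list can also be inspected programmatically. The following is a minimal sketch, assuming only base R; the variable name ds is chosen here for illustration and is not from the book.

```r
# List the data sets shipped with the base "datasets" package.
# data() returns an object whose $results component is a character
# matrix with columns "Package", "LibPath", "Item", and "Title".
ds <- data(package = "datasets")$results

# Show the name and description of the first few data sets.
head(ds[, c("Item", "Title")])

# Check whether a particular data set (here, iris) is in the list.
"iris" %in% ds[, "Item"]
```

Searching the Item or Title column in this way can be quicker than scrolling through the full listing when you already know part of a data set's name.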
The type of data set we would generally want is one that contains a set of continuous dependent variables and a set of continuous independent variables. The correlation of the two linear sets of variables is the basis for conducting many of the multivariate statistics covered in the book.
The input and use of the data sets are generally provided with a brief explanation and example in R code. Overall, the use of the data sets can be enhanced by taking the time to study an R tutorial located at
http://ww2.coastal.edu/kingw/statistics/R-tutorials/dataframes.html
The following R commands are helpful in understanding the data set, where the data set name is specified for each function; in this example, iris.
> library(psych)      # describe() is not in base R; the psych package is assumed here
> help(iris)          # information on iris data set
> describe(iris)      # descriptive statistics on variables in iris data set
> iris                # list the iris data
> head(iris, n = 10)  # print first ten record lines in iris data set
> tail(iris, n = 10)  # print last ten record lines in iris data set
Input Data Files
There are many ways to input data files, depending on how the data are coded (Schumacker, 2014). You may wish to use Notepad to initially view a data file. Commercial software packages have their own formats (SPSS: *.sav; SAS: *.sas; EXCEL: *.xls; etc.). A data file may be formatted with commas, semicolons, tabs, or spaces between the data values. Each file type requires specifying the separation type between data values using the sep argument in one of the following R functions that reads the data file:
read.csv    # read a comma- or semicolon-separated file
read.delim  # read a tab-delimited file
read.table  # read a space-separated file
The separation types in the sep argument are as follows:
sep = ","   # comma-separated file
sep = ";"   # semicolon-separated file
sep = "\t"  # tab-delimited file
sep = " "   # space-separated file
You can find out more about reading in data files with different separation types using > ?read.table.
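To see how the separator choices behave, the sketch below writes one small data frame to three temporary files and reads each back. It assumes only base R; the data frame df and the file names are made up for illustration.

```r
# Write the same small data frame to temporary files using three
# different separators, then read each file back in.
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

csv_file   <- tempfile(fileext = ".csv")
tab_file   <- tempfile(fileext = ".txt")
space_file <- tempfile(fileext = ".dat")

write.table(df, csv_file,   sep = ",",  row.names = FALSE, quote = FALSE)
write.table(df, tab_file,   sep = "\t", row.names = FALSE, quote = FALSE)
write.table(df, space_file, sep = " ",  row.names = FALSE, quote = FALSE)

# read.csv defaults to sep = "," and read.delim to sep = "\t";
# read.table needs header = TRUE to treat the first line as names.
from_csv   <- read.csv(csv_file)
from_tab   <- read.delim(tab_file)
from_space <- read.table(space_file, header = TRUE, sep = " ")

# All three should reproduce the original data frame.
all.equal(from_csv, df)
all.equal(from_tab, df)
all.equal(from_space, df)
```

If a file is read with the wrong separator, R typically returns a single column containing whole lines, which is a quick sign that the sep value needs to change.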
A useful approach for finding and reading data sets on your computer is to embed the file.choose() function. This opens a dialog window and permits a search of your folders for the data set. Click on the data set, and it is read into R. The R command would be as follows:
> mydata = read.table(file.choose(), header = TRUE, sep = " ")
This command would find a data file with variable names on the first line (header = TRUE) and a space between the data values.
Many statistical methods use a correlation or covariance matrix. Some use a partial correlation or partial covariance matrix. The correlation and covariance matrices are computed by using the following commands, respectively:
> cor(mydata)
> cov(mydata)
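As a concrete illustration, the sketch below applies both functions to the four numeric columns of the built-in iris data set, used here in place of mydata, and checks the relationship between the two matrices with the base R function cov2cor().

```r
# Use the four numeric measurements in iris (drop the Species factor).
x <- iris[, 1:4]

R <- cor(x)   # 4 x 4 correlation matrix
S <- cov(x)   # 4 x 4 variance-covariance matrix

round(R, 2)

# A correlation matrix is a covariance matrix rescaled by the
# standard deviations, so cov2cor(S) should reproduce R.
all.equal(cov2cor(S), R)
```

Both functions require numeric input, which is why the Species column is dropped before calling them.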
The corpcor package has two useful functions that permit conversion in either direction: from correlation to partial correlation, or from partial correlation to correlation. This also applies to covariance matrices; in this example the matrix is mymatrix.
> library(corpcor)
> cor2pcor(mymatrix)  # compute partial correlation matrix from correlation/covariance matrix
> pcor2cor(mymatrix)  # convert partial correlation or partial covariance matrix to a correlation matrix
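For readers who want to see what the conversion does, here is a minimal base R sketch of the standard inverse-matrix formula for partial correlations, applied to the iris measurements. It is not the corpcor implementation; the comparison with corpcor runs only if that package happens to be installed.

```r
R <- cor(iris[, 1:4])

# Partial correlations from the inverse of the correlation matrix:
# rescale the inverse to unit diagonal and flip the sign of the
# off-diagonal elements.
P <- -cov2cor(solve(R))
diag(P) <- 1
round(P, 2)

# If corpcor is installed, its cor2pcor() should agree, and
# pcor2cor() should recover the original correlation matrix.
if (requireNamespace("corpcor", quietly = TRUE)) {
  print(all.equal(corpcor::cor2pcor(R), P, check.attributes = FALSE))
  print(all.equal(corpcor::pcor2cor(P), R, check.attributes = FALSE))
}
```

Each off-diagonal entry of P is the correlation between two variables after the remaining variables have been held constant.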
A chi-square test of whether two correlation matrices are equal is conducted using the following R commands.
> library(psych)
> cortest(R1, R2, n1, n2, cor = FALSE)  # input 2 correlation matrices and sample sizes
Also, this function permits testing whether a single correlation matrix is an identity matrix.
> cortest(R1, R2 = NULL, n1, n2 = NULL, cor = FALSE)  # input 1 correlation matrix and sample size
You will find these functions very useful when running multivariate statistical analyses.
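The idea of testing a single correlation matrix against an identity matrix can be sketched in base R with Bartlett's test of sphericity. Note that this is a different statistic from the one cortest() computes, and the function name bartlett_sphericity is hypothetical; it is shown only to make the logic of such a test concrete.

```r
# Bartlett's test of sphericity (a sketch in base R).
# H0: the population correlation matrix is an identity matrix.
# R is a p x p correlation matrix and n is the sample size.
bartlett_sphericity <- function(R, n) {
  p     <- ncol(R)
  chisq <- -((n - 1) - (2 * p + 5) / 6) * log(det(R))
  df    <- p * (p - 1) / 2
  c(chisq = chisq, df = df,
    p.value = pchisq(chisq, df, lower.tail = FALSE))
}

x <- iris[, 1:4]
bartlett_sphericity(cor(x), n = nrow(x))
```

A small p value indicates that the variables are correlated enough to reject the identity-matrix hypothesis. The psych package also provides cortest.bartlett() for this test.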
R Packages
The multivariate statistical analyses require the use of certain R packages. In the appendix, for each chapter, I have compiled a list of the R packages, functions, data sets, and R script files I used to conduct the analyses. This should provide a handy reference guide. You can also obtain a list of packages by
> ??packages
Information about a specific R package can be obtained by
> help(package="psych")
I recommend using the options in the pull-down menu whenever possible. The options include installing, loading, and updating packages. You can also issue individual commands for these options:
> install.packages()
> update.packages()
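In a script, it can be convenient to install only the packages that are missing. The sketch below assumes base R only; the helper name missing_packages is hypothetical, and the install line is left commented out because it needs internet access.

```r
# Return the subset of package names that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("psych", "corpcor")   # packages mentioned in this preface
to_install <- missing_packages(needed)

# Uncomment to install whatever is missing (requires internet access).
# if (length(to_install) > 0) install.packages(to_install)
```

Once a package is installed, library(psych) or a similar call loads it for the current session.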
You may receive a notice that a particular package runs under a certain version of R. When this occurs, simply uninstall your current version of R in the Control Panel, and then install the newer version of R from the website (http://www.r-project.org/).
[Screenshot: the R GUI Packages pull-down menu, with the options Load package..., Set CRAN mirror..., Select repositories..., Install package(s)..., Update packages..., and Install package(s) from local zip files...]
There are two very important additions to the R software package. After installing R, either of these can make your use of R much easier, especially in organizing files and packages. The two software products are RCommander and RStudio. You will need to decide which one fits your needs. These are considered graphical user interfaces, which means they come with pull-down menus and dialog windows displaying various types of information. They can be downloaded from the following websites:
http://www.rcommander.com/
http://www.rstudio.com/
Acknowledgments
The photographs of eminent statisticians who influenced the field of multivariate statistics were provided by living individuals and/or common sources on the Internet. The biographies were compiled from excerpts of common source Internet materials, comments in various textbooks, flyers, and conference pamphlets. I would like to suggest sources for additional information about eminent statisticians that may be of interest to scholars and academicians. First, Wikipedia (http://www.wikipedia.org/) provides contributed information on individuals in many different languages around the globe, including a list of many founders of statistics (http://en.wikipedia.org/wiki/Founders_of_statistics). The American Statistical Association (www.amstat.org) supports a website with biographies and links to many other statistical societies. The World of Statistics (www.worldofstatistics.org) provides a website with famous statisticians' biographies and/or links to reference sources. A list of famous statisticians can be found on Wikipedia (http://en.wikipedia.org/wiki/List_of_statisticians). Simply search Google and you will find websites about famous statisticians. Any errors or omissions in the biographies are unintentional and are my responsibility, not the publisher's.
SAGE Publications would like to thank the following reviewers:
Xiaofen Keating, The University of Texas at Austin
Richard Feinn, Southern Connecticut State University
James Alan Fox, Northeastern University
Thomas H. Short, John Carroll University
Jianmin Guan, University of Texas at San Antonio
Edward D. Gailey, Fairmont State University
Prathiba Natesan, University of North Texas
David E. Drew, Claremont Graduate University
Camille L. Bryant, Columbus State University
Darrell Rudmann, Shawnee State University
Jann W. MacInnes, University of Florida
Tamara A. Hamai, California State University, Dominguez Hills
Weihua Fan, University of Houston
About the Author
Randall E. Schumacker is Professor of Educational Research at The University of Alabama. He has written and coedited several books, including A Beginner's Guide to Structural Equation Modeling (4th ed.), Advanced Structural Equation Modeling: Issues and Techniques, Interaction and Non-Linear Effects in Structural Equation Modeling, New Developments and Techniques in Structural Equation Modeling, Understanding Statistical Concepts Using S-PLUS, Understanding Statistics Using R, and Learning Statistics Using R.
He was the founder and is now Emeritus Editor of Structural Equation Modeling: A Multidisciplinary Journal, and he established the Structural Equation Modeling Special Interest Group within the American Educational Research Association. He is also the Emeritus Editor of Multiple Linear Regression Viewpoints, the oldest journal sponsored by the American Educational Research Association (Multiple Linear Regression: General Linear Model Special Interest Group).
He has conducted international and national workshops, has served on the editorial board of several journals, and currently pursues his research interests in measurement, statistics, and structural equation modeling. He was the 1996 recipient of the Outstanding Scholar Award and the 1998 recipient of the Charn Oswachoke International Award. In 2010, he launched the DecisionKit App for the iPhone, iPad, and iTouch, which can assist researchers in making decisions about which measurement, research design, or statistic to use in their research projects. In 2011, he received the Apple iPad Award, and in 2012, he received the CIT Faculty Technology Award at the University of Alabama. In 2013, he received the McCrory Faculty Excellence in Research Award from the College of Education at the University of Alabama. In 2014, he was the recipient of the Structural Equation Modeling Service Award at the American Educational Research Association.