R: An Open Source Statistical Environment
8.4.2008, MSIS 2008, Luxembourg: Valentin Todorov
Valentin Todorov
UNIDO
MSIS 2008 (Luxembourg, 7-9 April 2008)
Outline
• Introduction: the R Platform and Availability
• R Learning Curve (is R hard to learn)
• R Extensibility (R Packages)
• R and the others (Interfaces)
• R Graphics
• R for Time series
• R for Survey Analysis
• R and the Outliers (Robust Statistics in R)
• More R features (WEB, Missing data, OOP, GUI)
• Summary and Conclusions
What is R
• R is “a system for statistical computation and graphics. It provides, among other things, a programming language, high-level graphics, interfaces to other languages and debugging facilities”
• Developed after the S language and environment
– S was developed at Bell Labs (John Chambers et al.)
– S-Plus: a value-added implementation of the S language (Insightful Corporation)
– much code written for S runs unaltered under R
• Significantly influenced by Scheme, a Lisp dialect
What is R
• Ihaka and Gentleman, University of Auckland (New Zealand)
– 1993: a preliminary version of R
– 1995: released under the GNU General Public License
– Now: R-core team consisting of 17 members, including John Chambers
• R provides a wide variety of statistical (linear and non-linear modelling, classical statistical tests, time-series analysis, classification, clustering, robust methods and many more) and graphical techniques
• R is available as Free Software under the terms of the GNU General Public License (GPL).
R Extensibility (R Packages)
• One of the most important features of R is its extensibility by creating packages of functions and data.
• The R package system provides a framework for developing, documenting, and testing extension code.
• Packages can include R code, documentation, data and foreign code written in C or Fortran.
• Packages are distributed through the CRAN repository – http://cran.r-project.org - currently more than 1300 packages covering a wide variety of statistical methods and algorithms. ‘base’ and ‘recommended’ packages are included in all binary distributions.
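As a minimal sketch of how packages are used in practice (assuming a standard R installation; the package name `survey` is only an illustrative choice), the following lists the 'base' and 'recommended' packages found in the local installation and shows the usual check-install-load pattern. The `install.packages()` call is left commented out because it needs network access to a CRAN mirror.

```r
# List the packages of priority 'base' and 'recommended' in this installation.
base_pkgs <- rownames(installed.packages(priority = "base"))
recommended_pkgs <- rownames(installed.packages(priority = "recommended"))
print(base_pkgs)
print(recommended_pkgs)

# Typical pattern for a contributed package from CRAN.
# "survey" is just an example name chosen for this sketch.
pkg <- "survey"
if (!requireNamespace(pkg, quietly = TRUE)) {
  # install.packages(pkg)  # uncomment to download from CRAN (needs network)
  message("Package '", pkg, "' is not installed")
} else {
  library(pkg, character.only = TRUE)
}
```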
R and the Others (R Interfaces)
• Reading and writing data (text files, XML, spreadsheet-like data, e.g. Excel)
• Read and write data formats of SAS, S-Plus, SPSS, STATA, Systat, Octave – package foreign
• Emulation of Matlab – package matlab
• Communication with RDBMS – ROracle, RMySQL, RSQLite, RmSQL, RPgSQL, RODBC – large data sets, concurrency
• Package filehash – a simple key-value style database, the data are stored on disk but are handled like data sets
• Can use compiled native code in C, C++, Fortran, Java
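A small sketch of the data interfaces listed above, assuming only base R and the recommended package foreign: a data frame is written to a text (CSV) file and read back, and then the same round trip is done through the Stata file format. The temporary file names and the choice of the built-in `iris` and `mtcars` data sets are illustrative.

```r
# Round trip through a plain text (CSV) file using base R only.
csv_file <- tempfile(fileext = ".csv")
write.csv(iris, csv_file, row.names = FALSE)
iris_back <- read.csv(csv_file, stringsAsFactors = TRUE)
str(iris_back)

# Round trip through the Stata format with the recommended package 'foreign'
# (skipped if the package is not available).
if (requireNamespace("foreign", quietly = TRUE)) {
  dta_file <- tempfile(fileext = ".dta")
  foreign::write.dta(mtcars, dta_file)
  mtcars_back <- foreign::read.dta(dta_file)
  print(head(mtcars_back))
}
```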
R Graphics
• One of the most important strengths of R – simple exploratory graphics as well as well-designed publication quality plots.
• The graphics can include mathematical symbols and formulae where needed.
• Can produce graphics in many formats:
– On screen
– PS and PDF for including in LaTeX and pdfLaTeX or for distribution
– PNG or JPEG for the Web
– On Windows, metafiles for Word, PowerPoint, etc.
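The points above can be sketched with base graphics: a histogram is written to a PDF file and annotated with a mathematical expression. This is only an illustration; the output file name is a temporary file, and the other devices mentioned on the slide would be opened with `png()`, `jpeg()` or `postscript()` instead of `pdf()`.

```r
# Open a PDF device; png(), jpeg() or postscript() are used the same way.
out_pdf <- tempfile(fileext = ".pdf")
pdf(out_pdf, width = 6, height = 4)

# Exploratory plot: density histogram of a built-in variable.
hist(iris$Sepal.Width, freq = FALSE, main = "Histogram", xlab = "Sepal.Width")

# Overlay the normal density with the sample mean and standard deviation.
m <- mean(iris$Sepal.Width)
s <- sd(iris$Sepal.Width)
curve(dnorm(x, mean = m, sd = s), add = TRUE, lty = 2)

# Mathematical annotation in the legend (plotmath expression).
legend("topright", legend = expression(N(bar(x), s^2)), lty = 2, bty = "n")

# Close the device so that the file is written to disk.
dev.off()
```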
R Graphics: basic and multipanel plots (trellis)
[Figure: panels for the iris data: Histogram of Sepal.Width (density scale); Boxplot by species (setosa, versicolor); scatter plot of Sepal.Width against Sepal.Length; Bagplot; Normal Q-Q Plot of Sepal.Width against norm quantiles; Scatter Plot Matrix of SepalLength, SepalWidth and PetalLength for setosa, versicolor and virginica ("Three Varieties of Iris")]
R Graphics: parallel plot and coplot
[Figure: coplot of lat against long, given depth; parallel plot of SepalLength, SepalWidth and PetalLength (scaled from Min to Max) for setosa, versicolor and virginica ("Three Varieties of Iris")]
R for Time Series
• Package stats
– classical time series modeling tools – arima() for Box-Jenkins type analysis
– structural time series – StructTS()
– filtering and decomposition – decompose() and HoltWinters()
• Package forecast – additional forecast methods and graphical tools
• Analyzing monthly or lower frequency time series: TRAMO/SEATS and X-12-ARIMA, accessible through the Gretl library
• Task View Econometrics: http://cran.r-project.org/web/views/Econometrics.html
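The decomposition and filtering tools from package stats named above can be tried on data sets that ship with R. The following is a minimal sketch; the choice of the `co2` and `Nile` series and of a 12-month forecast horizon is an assumption made for illustration.

```r
# 'co2' is a monthly series from the 'datasets' package.
dec <- decompose(co2)            # classical seasonal decomposition
plot(dec)                        # trend, seasonal and random components

# Holt-Winters exponential smoothing and a 12-step-ahead forecast.
hw <- HoltWinters(co2)
fc <- predict(hw, n.ahead = 12)
print(fc)

# Structural time series: a local level model for the annual 'Nile' series.
sts <- StructTS(Nile, type = "level")
print(sts)
```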
R for Time Series: Example
• Fitting an ARIMA model to a univariate time series with arima() and using tsdiag() to plot the time series diagnostics
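A sketch of this workflow, assuming the built-in `lh` series and an AR(1) specification (the slide does not say which series or model order was used):

```r
# Fit an ARIMA(1,0,0) model to the built-in 'lh' series.
fit <- arima(lh, order = c(1, 0, 0))
print(fit)

# Diagnostic plots: standardized residuals, residual ACF,
# and p-values of the Ljung-Box statistic.
tsdiag(fit)

# Forecast six steps ahead; 'pred' holds point forecasts, 'se' standard errors.
fc <- predict(fit, n.ahead = 6)
print(fc$pred)
```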
R for Survey Analysis
• Complex survey samples are usually analysed by specialized software packages: SUDAAN, Bascula 4 (Statistics Netherlands), etc.
• STATA provides much more comprehensive support for analysing survey data than SAS and SPSS and could successfully compete with the specialized packages
R for Survey Analysis
• R – package survey - http://faculty.washington.edu/tlumley/survey/
– stratification, clustering, possibly multistage sampling, unequal sampling probabilities or weights; multistage stratified random sampling with or without replacement
– Summary statistics: means, totals, ratios, quantiles, contingency tables, regression models, for the whole sample and for domains
– Variances by Taylor linearization or by replicate weights (BRR, jack-knife, bootstrap, or user-supplied)
– Graphics: histograms, hexbin scatterplots, smoothers
• Other packages: pps, sampling, sampfling
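A short sketch of the survey package, assuming it has been installed from CRAN (the code is skipped otherwise). It uses the `apiclus1` example data that ships with the package; the choice of variables is illustrative only.

```r
if (requireNamespace("survey", quietly = TRUE)) {
  library(survey)
  data(api)  # example data sets shipped with 'survey', including apiclus1

  # One-stage cluster sample: clusters in dnum, weights in pw,
  # finite population correction in fpc.
  dclus1 <- svydesign(id = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

  # Summary statistics with variances by Taylor linearization.
  est_mean <- svymean(~api00, dclus1)
  est_total <- svytotal(~enroll, dclus1)
  print(est_mean)
  print(est_total)

  # Estimates for domains (here: school type).
  by_type <- svyby(~api00, ~stype, dclus1, svymean)
  print(by_type)

  # The same mean with jackknife replicate weights.
  rclus1 <- as.svrepdesign(dclus1, type = "JK1")
  est_mean_jk <- svymean(~api00, rclus1)
  print(est_mean_jk)
}
```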
R and the Outliers (Robust Statistics in R)
• What are Outliers
– atypical observations which are inconsistent with the rest of the data or deviate from the postulated model
– may arise through contamination, errors in data gathering, or misspecification of the model
– classical statistical methods are very sensitive to such data
• What are Robust methods
– Produce reasonable results even when one or more outliers may appear in the data
– Robust regression – robustbase
– Robust multivariate methods – rrcov, robustbase
– Robust time series analysis – robust-ts
R and the Outliers: Example
• Example: Wages and Hours - http://lib.stat.cmu.edu/DASL/
– a national sample of 6000 households with a male head earning less than $15,000 annually in 1966; 9 independent variables; classified into 39 demographic groups
– estimate y = the labour supply (average hours) from the available data; for the example we will consider only one variable, x = average age of the respondents: $y = \beta_0 + \beta_1 x$
– We will fit an Ordinary Least Squares (OLS) model and a robust Least Trimmed Squares (LTS) model
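The comparison of an OLS fit with an LTS fit can be sketched as follows. This is not the Wages and Hours analysis from the slides: as an assumption for illustration it uses the `phones` data from the recommended package MASS, which contains a well-known group of outliers, and it uses `MASS::lqs()` with `method = "lts"` rather than the LTS implementations in robustbase or rrcov.

```r
library(MASS)

# 'phones' is a list with components 'year' and 'calls'.
phones_df <- as.data.frame(phones)

# Classical fit: ordinary least squares.
ols <- lm(calls ~ year, data = phones_df)

# Robust fit: least trimmed squares via MASS::lqs().
set.seed(1)  # lqs() may use random subsampling on larger problems
lts <- lqs(calls ~ year, data = phones_df, method = "lts")

print(coef(ols))
print(coef(lts))

# Plot the data with both fitted lines.
plot(calls ~ year, data = phones_df)
abline(coef = coef(ols), lty = 2)  # OLS, pulled towards the outliers
abline(coef = coef(lts), lty = 1)  # LTS, follows the bulk of the data
legend("topleft", legend = c("OLS", "LTS"), lty = c(2, 1), bty = "n")
```

In this sketch the OLS slope is inflated by the outlying years, while the LTS line is determined by the majority of the observations.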
R and the Outliers: Example OLS
[Figure: (a) scatter plot of HRS against AGE; (b) standardized LS residuals against Index, with marks at -2.5 and 2.5 and the labelled observation 19]
R and the Outliers: Example LTS
[Figure: (c) scatter plot of HRS against AGE; (d) standardized LTS residuals against Index, with marks at -2.5 and 2.5 and the labelled observations 34, 32, 29, 4 and 5]
R and the Outliers: Example Covariance
• Maronna & Yohai (1998)
• rrcov: data set maryo
• A bivariate data set with $n = 20$, mean $(0, 0)$ and covariance matrix $S = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}$
• Sample correlation: 0.81
• Interchange the largest and smallest value in the first coordinate
• The sample correlation becomes 0.05
[Figure: TOLERANCE ELLIPSE (97.5%) for the clean and the contaminated data; labelled observations 9 and 19]
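The effect described on this slide can be reproduced in spirit with a small sketch. It does not use the maryo data set: as an assumption for illustration, it builds a deterministic toy data set of 20 strongly correlated points, interchanges the largest and smallest value in the first coordinate, and compares the classical correlation with a robust one from `MASS::cov.rob()` (minimum covariance determinant), which stands in for the estimators in rrcov.

```r
library(MASS)

# Deterministic toy data (NOT the maryo data): 20 strongly correlated points.
n <- 20
x <- seq(-2, 2, length.out = n)
y <- x + 0.3 * sin(seq_len(n))
clean <- cbind(x = x, y = y)

# Contaminate: interchange the largest and smallest value of the first coordinate.
contaminated <- clean
i_max <- which.max(contaminated[, "x"])
i_min <- which.min(contaminated[, "x"])
contaminated[c(i_max, i_min), "x"] <- contaminated[c(i_min, i_max), "x"]

# Classical sample correlation before and after contamination.
r_clean <- cor(clean)[1, 2]
r_contam <- cor(contaminated)[1, 2]

# Robust correlation from the minimum covariance determinant estimator.
set.seed(1)
mcd <- cov.rob(contaminated, method = "mcd", cor = TRUE)
r_mcd <- mcd$cor[1, 2]

print(c(clean = r_clean, contaminated = r_contam, robust = r_mcd))
```

Two interchanged values are enough to pull the classical correlation down considerably, while the robust estimate stays close to the value for the clean data.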
More R…
• R and the WEB - several projects that provide possibilities to use R over the WEB
• R and the Missing – advanced missing value handling
– mvnmle: ML estimation for multivariate data with missing values
– mitools: Tools for multiple imputation of missing data
– mice: Multivariate Imputation by Chained Equations
– EMV: Estimation of Missing Values for a Data Matrix
– VIM: provides methods for the visualisation as well as imputation of missing data
• R Objects – R is an Object Oriented language (however in a quite different sense from C++, Java, C#)
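To illustrate in what sense R's object orientation differs from C++, Java or C#, here is a minimal sketch of the S3 system, one of R's object systems: methods belong to generic functions rather than to classes, and dispatch is based on the class attribute of the argument. The generic `area` and the classes `circle` and `rectangle` are hypothetical names invented for this example.

```r
# A generic function: dispatches on the class of its first argument.
area <- function(shape, ...) UseMethod("area")

# Methods are ordinary functions named <generic>.<class>.
area.circle <- function(shape, ...) pi * shape$r^2
area.rectangle <- function(shape, ...) shape$w * shape$h

# Objects are plain lists carrying a class attribute.
unit_circle <- structure(list(r = 1), class = "circle")
rect_2x3 <- structure(list(w = 2, h = 3), class = "rectangle")

print(area(unit_circle))
print(area(rect_2x3))

# Built-in generics work the same way: summary() dispatches on class "lm".
fit <- lm(dist ~ speed, data = cars)
print(class(fit))
print(class(summary(fit)))
```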
More R…
• R GUI
– R Commander: a basic statistics GUI, consisting of a window containing several menus, buttons, and information fields
– SciViews: a suite of companion applications for Windows
• R and SDMX
• R Reports
– package xtable: coerce data to LaTeX and HTML tables
– Sweave: a framework for mixing text and R code for automatic report generation
Summary
• Output Management System
– SAS/SPSS: it is rarely used for routine work
– R: output is easily passed from one function to another to do further processing and to obtain more results
• Macro Language
– SAS/SPSS: a special language with its own syntax. The new functions are not run in the same way as the built-in procedures
– R itself is a programming language
• Matrix Language
– SAS/SPSS: a special language with its own syntax
– R is a vector and matrix based language complemented by additional packages: Matrix, SparseM
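Two of the points above can be sketched in a few lines, using the built-in `cars` data as an illustrative assumption: the result of one function is an ordinary object that is passed on to other functions, and the same regression can be written directly in matrix notation.

```r
# Output as objects: the fitted model is passed on for further processing.
fit <- lm(dist ~ speed, data = cars)
coef_table <- coef(summary(fit))                 # matrix of estimates and tests
slope_se <- coef_table["speed", "Std. Error"]    # pick out a single number
print(coef_table)
print(slope_se)

# Matrix language: the same OLS estimate from the normal equations,
# beta = (X'X)^{-1} X'y.
X <- cbind(intercept = 1, speed = cars$speed)
y <- cars$dist
beta_hat <- solve(crossprod(X), crossprod(X, y))
print(beta_hat)
```

Both routes give the same coefficients; `lm()` uses a QR decomposition internally, which is numerically preferable to forming X'X explicitly.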
Summary (cont.)
• Publishing results
– SAS/SPSS: Cut and paste to a word processor or exporting to a file
– R: produce LaTeX output (including graphics) using, for example, Sweave
• Data size
– SAS/SPSS: Limited by the size of the disk
– R: Limited by the size of the RAM; (not trivial) usage of databases for large data sets is possible
• Data structure
– SAS/SPSS: Rectangular data set
– R: Rectangular data frame, vector, list
Summary (cont.)
• Interface to other programming languages
– SAS/SPSS: Not available
– R: R can be easily mixed with Fortran, C, C++ and Java
• Source code
– SAS/SPSS: Not available
– R: the source code of R itself as well as of its packages is a part of the distribution