R: An Open Source Statistical Environment

8.4.2008, MSIS 2008, Luxembourg: Valentin Todorov

Valentin Todorov

UNIDO

[email protected]

MSIS 2008 (Luxembourg, 7-9 April 2008)

Outline

• Introduction: the R Platform and Availability
• R Learning Curve (is R hard to learn?)
• R Extensibility (R Packages)
• R and the Others (Interfaces)
• R Graphics
• R for Time Series
• R for Survey Analysis
• R and the Outliers (Robust Statistics in R)
• More R features (Web, Missing data, OOP, GUI)
• Summary and Conclusions

What is R

• R is “a system for statistical computation and graphics. It provides, among other things, a programming language, high-level graphics, interfaces to other languages and debugging facilities”

• Developed after the S language and environment
– S was developed at Bell Labs (John Chambers et al.)
– S-Plus: a value-added implementation of the S language (Insightful Corporation)
– much code written for S runs unaltered under R

• Significantly influenced by Scheme, a Lisp dialect

What is R

• Ihaka and Gentleman, University of Auckland (New Zealand)
– 1993: a preliminary version of R
– 1995: released under the GNU General Public License
– Now: R-core team consisting of 17 members, including John Chambers

• R provides a wide variety of statistical (linear and non-linear modelling, classical statistical tests, time-series analysis, classification, clustering, robust methods and many more) and graphical techniques

• R is available as Free Software under the terms of the GNU General Public License (GPL).

R Extensibility (R Packages)

• One of the most important features of R is its extensibility by creating packages of functions and data.

• The R package system provides a framework for developing, documenting, and testing extension code.

• Packages can include R code, documentation, data and foreign code written in C or Fortran.

• Packages are distributed through the CRAN repository (http://cran.r-project.org), currently more than 1300 packages covering a wide variety of statistical methods and algorithms. The ‘base’ and ‘recommended’ packages are included in all binary distributions.
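As a minimal sketch of how the package system looks from the R prompt, the snippet below lists the ‘base’ and ‘recommended’ packages found in the local installation and attaches one of them. It assumes a standard binary distribution (so that MASS is present); the commented install.packages() call uses rrcov purely as an illustration of fetching an add-on package from CRAN.

```r
# List the packages that ship with a standard R distribution.
ip <- installed.packages(priority = c("base", "recommended"))
print(table(ip[, "Priority"]))

# Attach a recommended package (assumed present in a standard install).
library(MASS)

# Add-on packages are fetched from CRAN on demand, for example:
# install.packages("rrcov")
```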

R and the Others (R Interfaces)

• Reading and writing data (text files, XML, spreadsheet-like data, e.g. Excel)

• Read and write data formats of SAS, S-Plus, SPSS, STATA, Systat, Octave: package foreign

• Emulation of Matlab: package matlab

• Communication with RDBMS (large data sets, concurrency): ROracle, RMySQL, RSQLite, RmSQL, RPgSQL, RODBC

• Package filehash: a simple key-value style database; the data are stored on disk but are handled like data sets

• Can use compiled native code in C, C++, Fortran, Java
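The simplest of these interfaces, reading and writing text files, can be sketched with base R alone. The example round-trips the built-in iris data through a temporary CSV file. The commented read.spss() line shows how package foreign would typically be called; the file name there is hypothetical.

```r
# Write a built-in data set to a text file and read it back.
csv_file <- tempfile(fileext = ".csv")
write.csv(iris, csv_file, row.names = FALSE)
iris2 <- read.csv(csv_file, stringsAsFactors = TRUE)
str(iris2)

# Formats of other statistical systems are handled by package foreign
# (hypothetical file name):
# library(foreign); dat <- read.spss("survey.sav", to.data.frame = TRUE)
```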

R Graphics

• One of the most important strengths of R – simple exploratory graphics as well as well-designed publication quality plots.

• The graphics can include mathematical symbols and formulae where needed.

• Can produce graphics in many formats:
– On screen
– PS and PDF for including in LaTeX and pdfLaTeX or for distribution
– PNG or JPEG for the Web
– On Windows, metafiles for Word, PowerPoint, etc.
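A small sketch of the points above: the same plot, with a mathematical formula in its title, is written to a PDF and to a PostScript file. The helper name draw() and the choice of the iris data are assumptions made for the illustration; bitmap devices such as png() work the same way but depend on how R was built.

```r
# One plotting function, reused for several output devices.
draw <- function() {
  hist(iris$Sepal.Width, freq = FALSE, xlab = "Sepal.Width",
       main = expression(bar(x) == sum(frac(x[i], n), i == 1, n)))
}

pdf_file <- tempfile(fileext = ".pdf")
pdf(pdf_file, width = 6, height = 4)   # PDF, e.g. for pdfLaTeX
draw()
dev.off()

ps_file <- tempfile(fileext = ".ps")
postscript(ps_file)                    # PostScript, e.g. for LaTeX
draw()
dev.off()
```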

R Graphics: basic and multipanel plots (trellis)

[Figure: iris data. Panels: Histogram (Sepal.Width, density scale); Boxplot by species (setosa, versicolor); Bagplot (Sepal.Length vs. Sepal.Width); Normal Q-Q Plot (norm quantiles vs. Sepal.Width); trellis Scatter Plot Matrix of SepalLength, SepalWidth and PetalLength for setosa, versicolor and virginica, titled "Three Varieties of Iris".]
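Plots of this kind can be approximated with a few lines of base graphics and lattice. This is only a sketch under stated assumptions: the bagplot needs an add-on package and is replaced here by a plain scatter plot, and the output goes to a temporary PDF file so that the code runs without a display.

```r
library(lattice)  # recommended package, provides trellis graphics

out <- tempfile(fileext = ".pdf")
pdf(out)

# Basic plots on a 2 x 2 grid
op <- par(mfrow = c(2, 2))
hist(iris$Sepal.Width, freq = FALSE, main = "Histogram", xlab = "Sepal.Width")
boxplot(Sepal.Length ~ Species, data = iris, main = "Boxplot")
plot(Sepal.Width ~ Sepal.Length, data = iris, main = "Scatter plot")
qqnorm(iris$Sepal.Width)
qqline(iris$Sepal.Width)
par(op)

# Multipanel (trellis) scatter plot matrix, one panel per species
print(splom(~ iris[1:3] | Species, data = iris,
            main = "Three Varieties of Iris"))
dev.off()
```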

R Graphics: parallel plot and coplot

[Figure: left, coplot of lat against long, given depth (depth scale 100 to 600); right, parallel plot of SepalLength, SepalWidth and PetalLength (Min to Max) for setosa, versicolor and virginica, titled "Three Varieties of Iris".]
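Both plot types are available without add-on packages. The sketch below assumes that the coplot on the slide is based on the built-in quakes data (its variables lat, long and depth match the axis labels) and uses lattice's parallelplot() for the parallel coordinate display.

```r
library(lattice)

out <- tempfile(fileext = ".pdf")
pdf(out)

# Conditioning plot: earthquake locations given depth
coplot(lat ~ long | depth, data = quakes)

# Parallel coordinate plot of three iris measurements by species
print(parallelplot(~ iris[1:3] | Species, data = iris))

dev.off()
```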

R for Time Series

• Package stats
– classical time series modeling tools: arima() for Box-Jenkins type analysis
– structural time series: StructTS()
– filtering and decomposition: decompose() and HoltWinters()

• Package forecast: additional forecast methods and graphical tools

• Analyzing monthly or lower frequency time series: TRAMO/SEATS and X-12-ARIMA, accessible through the Gretl library

• Task View Econometrics: http://cran.r-project.org/web/views/Econometrics.html
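The stats functions named above can be tried on data sets that ship with R. The following sketch uses the built-in co2 and UKgas series; the choice of series and of a 12-month forecast horizon is an assumption made for the illustration.

```r
# Classical decomposition into trend, seasonal and random components
dec <- decompose(co2)

# Holt-Winters exponential smoothing and a 12-month-ahead forecast
hw <- HoltWinters(co2)
fc <- predict(hw, n.ahead = 12)

# Basic structural model fitted by maximum likelihood
st <- StructTS(log10(UKgas), type = "BSM")
print(st)
```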

R for Time Series: Example

• Fitting an ARIMA model to a univariate time series with arima() and plotting time series diagnostics with tsdiag()
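A minimal version of such an example, assuming the built-in lh series (luteinizing hormone) and an AR(1) specification, neither of which is taken from the slide:

```r
# AR(1) model for the built-in 'lh' series
fit <- arima(lh, order = c(1, 0, 0))
print(fit)

# Diagnostics: standardized residuals, residual ACF, Ljung-Box p-values
out <- tempfile(fileext = ".pdf")
pdf(out)
tsdiag(fit)
dev.off()
```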

R for Survey Analysis

• Complex survey samples are usually analysed by specialized software packages: SUDAAN, Bascula 4 (Statistics Netherlands), etc.

• STATA provides much more comprehensive support for analysing survey data than SAS and SPSS and could successfully compete with the specialized packages

R for Survey Analysis

• R package survey: http://faculty.washington.edu/tlumley/survey/

– stratification, clustering, possibly multistage sampling, unequal sampling probabilities or weights; multistage stratified random sampling with or without replacement

– Summary statistics: means, totals, ratios, quantiles, contingency tables, regression models, for the whole sample and for domains

– Variances by Taylor linearization or by replicate weights (BRR, jack-knife, bootstrap, or user-supplied)

– Graphics: histograms, hexbin scatterplots, smoothers

• Other packages: pps, sampling, sampfling
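A short sketch of the survey package in use, assuming it has been installed from CRAN (it is not part of the base distribution, so the code is guarded). It uses the api example data shipped with the package to declare a stratified design and to estimate a mean for the whole sample and for domains.

```r
if (requireNamespace("survey", quietly = TRUE)) {
  library(survey)
  data(api)  # example data shipped with the survey package

  # Stratified sample: strata = school type; weights and finite
  # population correction are taken from the data
  dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                      fpc = ~fpc, data = apistrat)

  est <- svymean(~api00, dstrat)                     # whole sample
  by_type <- svyby(~api00, ~stype, dstrat, svymean)  # domains
  print(est)
  print(by_type)
} else {
  est <- NULL
  message("Package 'survey' is not installed; skipping the example.")
}
```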

R and the Outliers (Robust Statistics in R)

• What are outliers?
– atypical observations which are inconsistent with the rest of the data or deviate from the postulated model
– may arise through contamination, errors in data gathering, or misspecification of the model
– classical statistical methods are very sensitive to such data

• What are robust methods?
– produce reasonable results even when one or more outliers may appear in the data
– Robust regression: robustbase
– Robust multivariate methods: rrcov, robustbase
– Robust time series analysis: robust-ts

R and the Outliers: Example

• Example: Wages and Hours (http://lib.stat.cmu.edu/DASL/)
– a national sample of 6000 households with a male head earning less than $15,000 annually in 1966; 9 independent variables; classified into 39 demographic groups
– estimate y = the labour supply (average hours) from the available data; for the example we will consider only one variable, x = average age of the respondents: $y = \beta_0 + \beta_1 x$
– we will fit an Ordinary Least Squares (OLS) model and a robust Least Trimmed Squares (LTS) model
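The contrast between the two fits can be sketched on simulated data. Everything below is an assumption made for the illustration: the data are artificial (not the Wages and Hours data), and the LTS fit comes from lqs() in the recommended package MASS rather than from robustbase, so that the code runs on a standard installation.

```r
library(MASS)  # recommended package; lqs() provides an LTS fit

set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30, sd = 0.5)  # true slope is 0.5
y[26:30] <- y[26:30] - 20               # five artificial outliers

ols <- lm(y ~ x)                        # classical least squares
lts <- lqs(y ~ x, method = "lts")       # least trimmed squares

print(rbind(OLS = coef(ols), LTS = coef(lts)))
```

The outliers pull the OLS slope away from 0.5, while the LTS fit follows the bulk of the data.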

R and the Outliers: Example OLS

[Figure: (a) HRS against AGE; (b) standardized LS residuals against index, with reference lines at -2.5 and 2.5; observation 19 is labelled.]

R and the Outliers: Example LTS

[Figure: (c) HRS against AGE; (d) standardized LTS residuals against index, with reference lines at -2.5 and 2.5; labelled observations: 4, 5, 29, 32, 34.]

R and the Outliers: Example Covariance

• Maronna & Yohai (1998)

• rrcov: data set maryo

• A bivariate data set with $n = 20$, $\mu = (0, 0)^\top$, $\Sigma = \begin{pmatrix} 1 & 0.8 \\ 0.8 & 1 \end{pmatrix}$

• sample correlation: 0.81

• interchange the largest and smallest value in the first coordinate

• the sample correlation becomes 0.05

[Figure: 97.5% tolerance ellipses for the clean and the contaminated data; observations 9 and 19 are labelled.]
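The effect can be reproduced in spirit with simulated data. This is a sketch under stated assumptions: the sample is drawn afresh with mvrnorm() (it is not the maryo data set, so the numbers will differ from 0.81 and 0.05), and the robust correlation comes from cov.rob() in the recommended package MASS rather than from rrcov.

```r
library(MASS)  # mvrnorm() and cov.rob()

set.seed(123)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2)
x <- mvrnorm(20, mu = c(0, 0), Sigma = Sigma)
r_clean <- cor(x)[1, 2]

# Interchange the largest and smallest value of the first coordinate
xc <- x
i_max <- which.max(x[, 1])
i_min <- which.min(x[, 1])
xc[c(i_max, i_min), 1] <- x[c(i_min, i_max), 1]
r_cont <- cor(xc)[1, 2]

# Robust correlation based on the Minimum Covariance Determinant
r_rob <- cov.rob(xc, cor = TRUE, method = "mcd")$cor[1, 2]

print(c(clean = r_clean, contaminated = r_cont, robust = r_rob))
```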

More R…

• R and the Web: several projects that provide possibilities to use R over the Web

• R and the Missing: advanced missing value handling
– mvnmle: ML estimation for multivariate data with missing values
– mitools: tools for multiple imputation of missing data
– mice: Multivariate Imputation by Chained Equations
– EMV: Estimation of Missing Values for a Data Matrix
– VIM: provides methods for the visualisation as well as imputation of missing data

• R Objects: R is an object-oriented language (however in a quite different sense from C++, Java, C#)
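The "different sense" of object orientation can be sketched with R's S3 system, where methods belong to generic functions rather than to classes. The class names circle and rectangle and the generic area() are hypothetical, introduced only for this illustration.

```r
# Constructors: a list tagged with a class attribute
circle <- function(r) structure(list(r = r), class = "circle")
rectangle <- function(w, h) structure(list(w = w, h = h), class = "rectangle")

# Generic function: dispatches on the class of its first argument
area <- function(shape, ...) UseMethod("area")
area.circle <- function(shape, ...) pi * shape$r^2
area.rectangle <- function(shape, ...) shape$w * shape$h

# Existing generics such as print() are extended in the same way
print.circle <- function(x, ...) {
  cat("circle with radius", x$r, "\n")
  invisible(x)
}

shapes <- list(circle(1), rectangle(2, 3))
areas <- sapply(shapes, area)
print(circle(1))
```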

More R…

• R GUI
– R Commander: a basic statistics GUI, consisting of a window containing several menus, buttons, and information fields
– SciViews: a suite of companion applications for Windows

• R and SDMX

• R Reports
– package xtable: coerce data to LaTeX and HTML tables
– Sweave: a framework for mixing text and R code for automatic report generation
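How Sweave mixes text and code can be sketched with a tiny source file written to a temporary location. The document content is invented for the illustration; Sweave() itself ships with R (in package utils), and the output argument used here is passed on to its LaTeX driver.

```r
# A tiny Sweave source: LaTeX text with an R code chunk and \Sexpr{}
rnw <- tempfile(fileext = ".Rnw")
tex <- tempfile(fileext = ".tex")
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "<<>>=",
  "summary(cars$speed)",
  "@",
  "The mean speed is \\Sexpr{mean(cars$speed)}.",
  "\\end{document}"
), rnw)

# Run the code and weave its output into a LaTeX file
Sweave(rnw, output = tex, quiet = TRUE)
tex_lines <- readLines(tex)
```

The resulting .tex file contains the chunk's printed output and the computed mean in place of the \Sexpr{} call, ready to be compiled with LaTeX.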

Summary

• Output Management System
– SAS/SPSS: it is rarely used for routine work
– R: output is easily passed from one function to another to do further processing and to obtain more results

• Macro Language
– SAS/SPSS: a special language with its own syntax. The new functions are not run in the same way as the built-in procedures
– R: R itself is a programming language

• Matrix Language
– SAS/SPSS: a special language with its own syntax
– R: a vector and matrix based language complemented by additional packages: Matrix, SparseM
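Two of these points, passing output between functions and working in matrix notation, can be sketched together with the built-in cars data (the choice of data and model is an assumption made for the illustration):

```r
# The output of one function is an object that feeds the next one
fit <- lm(dist ~ speed, data = cars)
coefs <- coef(summary(fit))   # matrix of estimates, SEs, t and p values
slope_se <- coefs["speed", "Std. Error"]

# The same estimates computed directly in matrix notation
X <- cbind(1, cars$speed)
y <- cars$dist
beta <- solve(t(X) %*% X, t(X) %*% y)

print(coefs)
print(drop(beta))
```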

Summary (cont.)

• Publishing results
– SAS/SPSS: cut and paste to a word processor or exporting to a file
– R: produce LaTeX output (including graphics) using, for example, Sweave

• Data size
– SAS/SPSS: limited by the size of the disk
– R: limited by the size of the RAM; (not trivial) usage of databases for large data sets is possible

• Data structure
– SAS/SPSS: rectangular data set
– R: rectangular data frame, vector, list

Summary (cont.)

• Interface to other programming languages
– SAS/SPSS: not available
– R: R can be easily mixed with Fortran, C, C++ and Java

• Source code
– SAS/SPSS: not available
– R: the source code of R itself as well as of its packages is a part of the distribution
