“working with s-plus”

“WORKING WITH S-PLUS”

Michael Elliott

Assistant Professor of Biostatistics

University of Michigan School of Public Health

2

Overview

• S-Plus is a commercial statistical software package– Plots or other graphics– Vector and matrix computation– Non-parametric estimation such as kernel density estimation and

cubic spline regression – Simulations.

• Importing and exporting data• Executing standard statistical procedures• Graphics• Computation• Robust statistics• R: an improved(?) alternative to S-Plus• Where to go for further help

3

•Can import or export ASCII data plus most major database management and statistical package data files (dBASE, Lotus, Excel, SAS, SPSS, Stata)

•S-plus creates data objects:–Scalar

–Vector

–Matrix

–Array

–Factor (vector of categorical variables)

–Frame (matrix of different types)

–List (general collection of different types)

Importing and Exporting Data

4

Example dataset

• 1,000 young, first-time-licensed drivers in Michigan

Age when Time N of Time to

ID licensed licensed Gender Offenses 1st Offense

001 17.5 3.4 0 1 1.0

002 16.1 6.2 0 1 3.3

. . . . . .

. . . . . .

. . . . . .

5

Two ways to import and export data:

• Use menu-driven commands

• Use batch commands in the “Commands Window”

6

Menu-Driven Import/Export

• Click on File

– Import Data

• From File

• Fill in location and file type information

7

Batch Command Import/Export

• Use the read.table or scan commands– Scan is a quick way to read in a small vector– Read.table generally works better for importing 2-

dimensional tables:

splus_read.table(file="h:/splus/splus.txt",na.strings='.',col.names=c('id',"age","tol","male","noff","ttof"))

Generally better to manage complex hierarchical data structures in, e.g., SAS and generate flat files for S-plus.

8

Basic operations

• The usual operators are used: also ^ for exponentiation, %% for modulo, and _ or <- for assignment.

• c(1,2,5) creates a 3x1 vector with the elements 1, 2, and 5; c(0:10) creates an 11x1 vector with elements 0 through 10; rep(1,10) creates a 10x1 vector of 1.

• matrix(c(1,1,2,2),2,2) creates a 2x2 matrix with the 1st column being 1s and the 2nd column 2s.

• What would array(rep(c(1,1,2,2),2),dim=c(2,2,2) generate?

• To extract a portion of a vector, use name[a:b]; similarly, to read a row or column of a matrix, use name[a,] or name[,b] respectively.

9

• Logical vectors can be generated by conditional

statements:

>tf_(c(1,2,5)<=c(2,4,1))

>tf

[1] T T F

• Since these vectors have the property that T=1, F=0, they are very useful. For example, they could construct a linear spline design matrix:

yi=0+1ti+2 (ti-1)+ + 3 (ti-2)+ +i

X_cbind(t,c((t-theta1)*(t>theta1)),

c((t-theta2)*(t>theta2)))

10

• sort(object) provides the values in object in alphanumeric

order

> sort(c(3,2,1,5))

[1] 1 2 3 5

• order(object) provides the location of elements in object in alphanumeric order

> order(c(3,2,1,5))

[1] 3 2 1 4

11

Common statistical procedures

• S-plus commands are functions of the form

function(parameter1,parameter2,…)

• Descriptive functions– summary(object)

– hist(object)

– stem(object)

• Univariate and bivariate statistical tests– t.test(object,mu=value)

– chisq.test(object)

12

• Regression models– Linear regression– Analysis of variance– Generalized linear models

• Survival analysis– Nonparametric– Parametric– Semiparametic (Cox proportional hazard model)

• Other– Random effects models– Factor analysis

13

• Example fitting a (Gaussian) linear regression model:

yi=0+1xi1+2xi2+i where yi is the age at time of license,

xi1 is the number of offenses, and xi2 is the total time of license for the ith subject. The ’s are fixed but unknown parameters, and the are (also unknown) normally distributed error terms.

• The function lm() provides the least-square (maximum likelihood) estimates of together with numerous related diagnostics:

myreg_lm(age~noff+tol,data=splus)

14

• myreg is a list containing a number of items– myreg$coefficients

– myreg$residuals

– myreg$fitted.values

– see help(lm.object) for a complete list of available items

• Use summary.lm(myreg) to obtain conventional regression analysis output.

15

• summary.lm(object) also provides useful items:– cov.unscaled: a p x p matrix of (unscaled) covariances of the

regression coefficents

– sigma: the square root of the estimated variance of the random error

> summary.lm(myreg)$sigma*summary.lm(myreg)$cov.unscaled

(Intercept) noff tol

(Intercept) 0.01584473684 0.00005315618 -0.00250628666

noff 0.00005315618 0.00014034184 -0.00004702984

tol -0.00250628666 -0.00004702984 0.00042518954

– r.squared: the multiple R-squared statistic for the model.

– fstatistic: numeric vector of length three giving the F test for the regression

– see help(summary.lm) for a complete list of available items

16

ANOVA:

aov(formula, data = <name of data>, projections = F, contrasts = NULL, ...)

GLM: glm(formula, family = gaussian, data = <data>,

weights = <weight vector>, na.action = na.fail, contrasts = NULL, ...)

Survival:KM: kaplanMeier(Surv(time,cens), data= <name of data>, weights=NULL, na.action=na.fail, se.fit = T, conf.interval = "log", coverage = 0.95, …)

G-rho family: survdiff(Surv(ttof,cens)~male,rho=<rho>) Cox PH: coxph(Surv(time,cens), method= <ties>, …)

17

Graphics

• S-plus easily produces nice graphics, with fine control available for little effort.

• Plot age at time of license versus number of offenses:

tol_splus[,2]

ttof_splus[,6]

plot(tol,ttof,xlab=“Age of time of license”,

ylab="Time to First Offense")

18

• You can easily control a variety of features, including

– Labels (xlab=“xxxx”, ylab=“yyyy”)

– Plotting symbol (pch=‘symbol’, pch=n)

– Axis type (log=“x”, log=“y”, log=“xy”)

– Legends (legend(x,y,legend=…,pch=…,lty=…) )

– Titles (title())

– Size. (pty=“s”, mfrow=c(2,2))

• It is also relatively easy to produce – multiple graphs on a single page (par)

– Overlays of graphs (par)

– Barplots (barplot)

– Scatterplot matricies (brush)

– Contour plots (contour)

– 3-D plots (use menu)

19

Matrix Functions and Other Computations• Vector and matrix computation is very simply in S-plus.

• To obtain the product of a scalar with a vector or matrix, use ‘*’; this also will give you element-by-element multiplication of two equal-size vectors, matricies, or arrays:

>c(1,2,3)*c(4,5,6)

[1] 4 10 18

• Vector or matrix multiplication can be done using %*%:

• Transpose using t().

1 2 2 3 10 13A= B= A%*%B=

2 4 4 5 20 26

20

• solve() inverts a square matrix or solves a set of linear equations:– solve(A)%*%A=I

– A%*%solve(A,b)=b

21

• chol() returns the Choleski decomposition of a non-negative symmetric matrix (note this is the transpose of the usual lower-triangular form):– t(chol(A))%*%chol(A)=A

22

• QR decomposition of an n x p matrix into an n x n orthonormal matrix Q and an n x p upper triangular matrix R is given by qr.Q(qr(A)) and qr.R(qr(A)).

23

• eigen() calculates the eigenvalues and eigenvectors of a square matrix: for A k x k square symmetric matrix, eigen(A)$vector and eigen(A)$values yield a k x k matrix and k x 1 vector such that, for each element j in eigen(A)$values and each column ej in eigen(A)$vector:

Aej=jej <-> A=jejejT , where ej

Tej=1 and eiTej=0 if i<>j.

24

• Numerical optimization procedures are also available:

nlmin(f, x, …)

– f is an S-PLUS function which takes as its only argument a real vector of length p.

– x is a vector of length p used as the starting point for the optimizer.

Also nlminb for constrained optimization, nlregb for minimizing sums of squares of nonlinear functions subject to bound-constrained parameters, and nls for general non-linear regression.

25

User-Defined Functions

• You may define your own functions using the function() function:

mydet_function(A,B){

detA_Re(prod(eigen(A)$values))

detB_Re(prod(eigen(B)$values))

detAB_detA*detB

return(detA,detB,detAB)

}

26

Robust Statistics

• S-plus’ “niche” market is its extensive package of robust statistical procedures.

27

• As a measure of central tendency, the mean is highly

sensitive to skewed or outlying observations, whereas the median is unaffected, but less optimal if the distribution is truly symmetric.– A simple compromise between the median and mean is the

trimmed mean, which is the mean of the 1-2 central part of the distribution. This is implemented in S-plus via the trim parameter in the mean function: mean(x,trim=).

– A major subspeciality in statistics is devoted to this area and S-plus has the function location() which can be used to implement them.

28

• Similarly, for the linear regression model

yi=xi+i

the least-square parameter estimate of that solve

minb (yi-xib)2 are not as robust to outliers as the median (L1) estimator minb |yi-xib|.

• Fit in S-plus using l1fit(X,y) where X is the design matrix and y the vector of dependent observations.– note this is the reverse of the usual order: design variables first,

then outcome.

29

• Kernel density estimation provides a probability density function estimate from the histogram of observations:

density(object,kernel=<>,bandwidth=<>)

kernel=kernel-density smoother:

"box" - a rectangular box (the default).

"triangle" - a triangle (a box convolved with itself).

"parzen" - the parzen function (a box convolved with a triangle).

"normal" - the gaussian density function.

bandwidth=the kernel bandwidth smoothing parameter.

30

• Robust and additive regression models extend the

assumption of a linear relationship between E(Y) and X to the more general case of a smooth non-linear relationship.

loess(formula,data=<dataset>,span=<>,degree=<1,2>,…)span=width over which tricube weights calculateddegree=X linear or quadraticplot using plot.loess(loess(…))

smooth.spline(x, y, df = <>, spar = <>, all.knots = <T,F>,df.offset = 0, penalty = 1)

Parameters allow knot choice and smoothing parameters to be pre-set or data-driven.

31

Simulations• A number of built-in procedures are available to generate

random data.– rnorm(n,mu=0,sd=1) generates n N(0,1) random variables.

– qnorm(p,mu=0,sd=1) and pnorm(q,mu=0,sd=1) gives q such that P(X<q)=p for X~N(0,1).

– dnorm(x,mu=0,sd=1) gives the PDF of the N(0,1) distribution at x.

– Similar procedures are available for 19 other standard distributions.

• The for and if commands implement loops and conditional execution of statements.

• Together these procedures allow for easy implementation of simulations.

32

betalb_rep(0,200)betaub_rep(0,200)count_0for(i in 1:200){x_runif(30,0,10)y_1+2*x+rnorm(30)myreg_lm(y~x)betalb[i]_myreg$coefficients[2]-

qt(.975,28)*sqrt(deviance(myreg)/28)/sqrt(29*var(x))betaub[i]_myreg$coefficients[2]+

qt(.975,28)*sqrt(deviance(myreg)/28)/sqrt(29*var(x))if (betalb[i]<2&&betaub[i]>2) count_count+1}

33

Could delete the if statement and just compute

100-(sum(betalb>2)+sum(betaub<2))/2

34

Memory allocation in S-Plus

• In loops, RAM is not released until the loop is completed.– In large simulations, later iterations will slow relative to earlier

iterations.

– This problem is made worse in nested loops (loops within loops).

• Try to avoid loops, and especially nested loops, whenever possible– Use vector and matrix manipulations, instead.

– Have yik, i=1,…,n, k=1,…,ni. Want yi.=Σyik.

– Could loop across i and k, or be clever and use cumsum:

cumsum(y)[c(n1,…, nn)]-c(0,cumsum(y)[c(n1,…,nn-1)])

35

RAn alternative to S-Plus is R.

• Part of the GNU ("GNU's Not Unix") constellation of free Linux-based software.

• Runs under Linux, UNIX and Windows.

• Big difference: dynamic allocation of memory

• Can run loops without chewing up memory.

36

• Most basic code is virtually interchangeable. A few differences:– Generalized linear models are implemented differently in R. (The

same functionality is available but the components have different names.)

– Missing data usually leads to failure in S unless the “na.omit” is used; "na.omit" is the default in R.

– No-intercept models are defined by y~x+0 instead of y~x-1.

– Write out “TRUE” and “FALSE” instead of using “T” and “F”.

• More advanced functions (linear and non-linear mixed models, smoothing splines, adaptive quadrature, SEMs) have packages that need to be downloaded.

37

Where to go to get help

• This has been a very brief introduction.• The on-line help menu in S-plus is very useful.• Venables and Ripley (1997) Modern Applied Statistics

with S-Plus (2nd edition).• Websites:

Splus:– www.mathsoft.com– Statlib site: lib.stat.cmu/S/

R:– http://www.R-project.org/ – http://www.gnu.org/gnu/the-gnu-project.html

See http://cceb.med.upenn.edu/elliott/ for this presentation.

http://www.mathsoft.com/

http://cceb.med.upenn.edu/elliott/

“working with s-plus”

Documents

export data

data objects

export ascii data

timelicensed drivers

total time of license

linear spline design

5tf1 t t fsince

number of offenses