“working with s-plus”

37
“WORKING WITH S-PLUS” Michael Elliott Assistant Professor of Biostatistics University of Michigan School of Public Health

Upload: merrill

Post on 12-Jan-2016

27 views

Category:

Documents


0 download

DESCRIPTION

“WORKING WITH S-PLUS”. Michael Elliott Assistant Professor of Biostatistics University of Michigan School of Public Health. Overview. S-Plus is a commercial statistical software package Plots or other graphics Vector and matrix computation - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: “WORKING WITH S-PLUS”

“WORKING WITH S-PLUS”

Michael Elliott

Assistant Professor of Biostatistics

University of Michigan School of Public Health

Page 2: “WORKING WITH S-PLUS”

2

Overview

• S-Plus is a commercial statistical software package– Plots or other graphics– Vector and matrix computation– Non-parametric estimation such as kernel density estimation and

cubic spline regression – Simulations.

• Importing and exporting data• Executing standard statistical procedures• Graphics• Computation• Robust statistics• R: an improved(?) alternative to S-Plus• Where to go for further help

Page 3: “WORKING WITH S-PLUS”

3

•Can import or export ASCII data plus most major database management and statistical package data files (dBASE, Lotus, Excel, SAS, SPSS, Stata)

•S-plus creates data objects:–Scalar

–Vector

–Matrix

–Array

–Factor (vector of categorical variables)

–Frame (matrix of different types)

–List (general collection of different types)

Importing and Exporting Data

Page 4: “WORKING WITH S-PLUS”

4

Example dataset

• 1,000 young, first-time-licensed drivers in Michigan

Age when Time N of Time to

ID licensed licensed Gender Offenses 1st Offense

001 17.5 3.4 0 1 1.0

002 16.1 6.2 0 1 3.3

. . . . . .

. . . . . .

. . . . . .

Page 5: “WORKING WITH S-PLUS”

5

Two ways to import and export data:

• Use menu-driven commands

• Use batch commands in the “Commands Window”

Page 6: “WORKING WITH S-PLUS”

6

Menu-Driven Import/Export

• Click on File

– Import Data

• From File

• Fill in location and file type information

Page 7: “WORKING WITH S-PLUS”

7

Batch Command Import/Export

• Use the read.table or scan commands– Scan is a quick way to read in a small vector– Read.table generally works better for importing 2-

dimensional tables:

splus_read.table(file="h:/splus/splus.txt",na.strings='.',col.names=c('id',"age","tol","male","noff","ttof"))

Generally better to manage complex hierarchical data structures in, e.g., SAS and generate flat files for S-plus.

Page 8: “WORKING WITH S-PLUS”

8

Basic operations

• The usual operators are used: also ^ for exponentiation, %% for modulo, and _ or <- for assignment.

• c(1,2,5) creates a 3x1 vector with the elements 1, 2, and 5; c(0:10) creates an 11x1 vector with elements 0 through 10; rep(1,10) creates a 10x1 vector of 1.

• matrix(c(1,1,2,2),2,2) creates a 2x2 matrix with the 1st column being 1s and the 2nd column 2s.

• What would array(rep(c(1,1,2,2),2),dim=c(2,2,2) generate?

• To extract a portion of a vector, use name[a:b]; similarly, to read a row or column of a matrix, use name[a,] or name[,b] respectively.

Page 9: “WORKING WITH S-PLUS”

9

• Logical vectors can be generated by conditional

statements:

>tf_(c(1,2,5)<=c(2,4,1))

>tf

[1] T T F

• Since these vectors have the property that T=1, F=0, they are very useful. For example, they could construct a linear spline design matrix:

yi=0+1ti+2 (ti-1)+ + 3 (ti-2)+ +i

X_cbind(t,c((t-theta1)*(t>theta1)),

c((t-theta2)*(t>theta2)))

Page 10: “WORKING WITH S-PLUS”

10

• sort(object) provides the values in object in alphanumeric

order

> sort(c(3,2,1,5))

[1] 1 2 3 5

• order(object) provides the location of elements in object in alphanumeric order

> order(c(3,2,1,5))

[1] 3 2 1 4

Page 11: “WORKING WITH S-PLUS”

11

Common statistical procedures

• S-plus commands are functions of the form

function(parameter1,parameter2,…)

• Descriptive functions– summary(object)

– hist(object)

– stem(object)

• Univariate and bivariate statistical tests– t.test(object,mu=value)

– chisq.test(object)

Page 12: “WORKING WITH S-PLUS”

12

• Regression models– Linear regression– Analysis of variance– Generalized linear models

• Survival analysis– Nonparametric– Parametric– Semiparametic (Cox proportional hazard model)

• Other– Random effects models– Factor analysis

Page 13: “WORKING WITH S-PLUS”

13

• Example fitting a (Gaussian) linear regression model:

yi=0+1xi1+2xi2+i where yi is the age at time of license,

xi1 is the number of offenses, and xi2 is the total time of license for the ith subject. The ’s are fixed but unknown parameters, and the are (also unknown) normally distributed error terms.

• The function lm() provides the least-square (maximum likelihood) estimates of together with numerous related diagnostics:

myreg_lm(age~noff+tol,data=splus)

Page 14: “WORKING WITH S-PLUS”

14

• myreg is a list containing a number of items– myreg$coefficients

– myreg$residuals

– myreg$fitted.values

– see help(lm.object) for a complete list of available items

• Use summary.lm(myreg) to obtain conventional regression analysis output.

Page 15: “WORKING WITH S-PLUS”

15

• summary.lm(object) also provides useful items:– cov.unscaled: a p x p matrix of (unscaled) covariances of the

regression coefficents

– sigma: the square root of the estimated variance of the random error

> summary.lm(myreg)$sigma*summary.lm(myreg)$cov.unscaled

(Intercept) noff tol

(Intercept) 0.01584473684 0.00005315618 -0.00250628666

noff 0.00005315618 0.00014034184 -0.00004702984

tol -0.00250628666 -0.00004702984 0.00042518954

– r.squared: the multiple R-squared statistic for the model.

– fstatistic: numeric vector of length three giving the F test for the regression

– see help(summary.lm) for a complete list of available items

Page 16: “WORKING WITH S-PLUS”

16

ANOVA:

aov(formula, data = <name of data>, projections = F, contrasts = NULL, ...)

GLM: glm(formula, family = gaussian, data = <data>,

weights = <weight vector>, na.action = na.fail, contrasts = NULL, ...)

Survival:KM: kaplanMeier(Surv(time,cens), data= <name of data>, weights=NULL, na.action=na.fail, se.fit = T, conf.interval = "log", coverage = 0.95, …)

G-rho family: survdiff(Surv(ttof,cens)~male,rho=<rho>) Cox PH: coxph(Surv(time,cens), method= <ties>, …)

Page 17: “WORKING WITH S-PLUS”

17

Graphics

• S-plus easily produces nice graphics, with fine control available for little effort.

• Plot age at time of license versus number of offenses:

tol_splus[,2]

ttof_splus[,6]

plot(tol,ttof,xlab=“Age of time of license”,

ylab="Time to First Offense")

Page 18: “WORKING WITH S-PLUS”

18

• You can easily control a variety of features, including

– Labels (xlab=“xxxx”, ylab=“yyyy”)

– Plotting symbol (pch=‘symbol’, pch=n)

– Axis type (log=“x”, log=“y”, log=“xy”)

– Legends (legend(x,y,legend=…,pch=…,lty=…) )

– Titles (title())

– Size. (pty=“s”, mfrow=c(2,2))

• It is also relatively easy to produce – multiple graphs on a single page (par)

– Overlays of graphs (par)

– Barplots (barplot)

– Scatterplot matricies (brush)

– Contour plots (contour)

– 3-D plots (use menu)

Page 19: “WORKING WITH S-PLUS”

19

Matrix Functions and Other Computations• Vector and matrix computation is very simply in S-plus.

• To obtain the product of a scalar with a vector or matrix, use ‘*’; this also will give you element-by-element multiplication of two equal-size vectors, matricies, or arrays:

>c(1,2,3)*c(4,5,6)

[1] 4 10 18

• Vector or matrix multiplication can be done using %*%:

• Transpose using t().

1 2 2 3 10 13A= B= A%*%B=

2 4 4 5 20 26

Page 20: “WORKING WITH S-PLUS”

20

• solve() inverts a square matrix or solves a set of linear equations:– solve(A)%*%A=I

– A%*%solve(A,b)=b

Page 21: “WORKING WITH S-PLUS”

21

• chol() returns the Choleski decomposition of a non-negative symmetric matrix (note this is the transpose of the usual lower-triangular form):– t(chol(A))%*%chol(A)=A

Page 22: “WORKING WITH S-PLUS”

22

• QR decomposition of an n x p matrix into an n x n orthonormal matrix Q and an n x p upper triangular matrix R is given by qr.Q(qr(A)) and qr.R(qr(A)).

Page 23: “WORKING WITH S-PLUS”

23

• eigen() calculates the eigenvalues and eigenvectors of a square matrix: for A k x k square symmetric matrix, eigen(A)$vector and eigen(A)$values yield a k x k matrix and k x 1 vector such that, for each element j in eigen(A)$values and each column ej in eigen(A)$vector:

Aej=jej <-> A=jejejT , where ej

Tej=1 and eiTej=0 if i<>j.

Page 24: “WORKING WITH S-PLUS”

24

• Numerical optimization procedures are also available:

nlmin(f, x, …)

– f is an S-PLUS function which takes as its only argument a real vector of length p.

– x is a vector of length p used as the starting point for the optimizer.

Also nlminb for constrained optimization, nlregb for minimizing sums of squares of nonlinear functions subject to bound-constrained parameters, and nls for general non-linear regression.

Page 25: “WORKING WITH S-PLUS”

25

User-Defined Functions

• You may define your own functions using the function() function:

mydet_function(A,B){

detA_Re(prod(eigen(A)$values))

detB_Re(prod(eigen(B)$values))

detAB_detA*detB

return(detA,detB,detAB)

}

Page 26: “WORKING WITH S-PLUS”

26

Robust Statistics

• S-plus’ “niche” market is its extensive package of robust statistical procedures.

Page 27: “WORKING WITH S-PLUS”

27

• As a measure of central tendency, the mean is highly

sensitive to skewed or outlying observations, whereas the median is unaffected, but less optimal if the distribution is truly symmetric.– A simple compromise between the median and mean is the

trimmed mean, which is the mean of the 1-2 central part of the distribution. This is implemented in S-plus via the trim parameter in the mean function: mean(x,trim=).

– A major subspeciality in statistics is devoted to this area and S-plus has the function location() which can be used to implement them.

Page 28: “WORKING WITH S-PLUS”

28

• Similarly, for the linear regression model

yi=xi+i

the least-square parameter estimate of that solve

minb (yi-xib)2 are not as robust to outliers as the median (L1) estimator minb |yi-xib|.

• Fit in S-plus using l1fit(X,y) where X is the design matrix and y the vector of dependent observations.– note this is the reverse of the usual order: design variables first,

then outcome.

Page 29: “WORKING WITH S-PLUS”

29

• Kernel density estimation provides a probability density function estimate from the histogram of observations:

density(object,kernel=<>,bandwidth=<>)

kernel=kernel-density smoother:

"box" - a rectangular box (the default).

"triangle" - a triangle (a box convolved with itself).

"parzen" - the parzen function (a box convolved with a triangle).

"normal" - the gaussian density function.

bandwidth=the kernel bandwidth smoothing parameter.

Page 30: “WORKING WITH S-PLUS”

30

• Robust and additive regression models extend the

assumption of a linear relationship between E(Y) and X to the more general case of a smooth non-linear relationship.

loess(formula,data=<dataset>,span=<>,degree=<1,2>,…)span=width over which tricube weights calculateddegree=X linear or quadraticplot using plot.loess(loess(…))

smooth.spline(x, y, df = <>, spar = <>, all.knots = <T,F>,df.offset = 0, penalty = 1)

Parameters allow knot choice and smoothing parameters to be pre-set or data-driven.

Page 31: “WORKING WITH S-PLUS”

31

Simulations• A number of built-in procedures are available to generate

random data.– rnorm(n,mu=0,sd=1) generates n N(0,1) random variables.

– qnorm(p,mu=0,sd=1) and pnorm(q,mu=0,sd=1) gives q such that P(X<q)=p for X~N(0,1).

– dnorm(x,mu=0,sd=1) gives the PDF of the N(0,1) distribution at x.

– Similar procedures are available for 19 other standard distributions.

• The for and if commands implement loops and conditional execution of statements.

• Together these procedures allow for easy implementation of simulations.

Page 32: “WORKING WITH S-PLUS”

32

betalb_rep(0,200)betaub_rep(0,200)count_0for(i in 1:200){x_runif(30,0,10)y_1+2*x+rnorm(30)myreg_lm(y~x)betalb[i]_myreg$coefficients[2]-

qt(.975,28)*sqrt(deviance(myreg)/28)/sqrt(29*var(x))betaub[i]_myreg$coefficients[2]+

qt(.975,28)*sqrt(deviance(myreg)/28)/sqrt(29*var(x))if (betalb[i]<2&&betaub[i]>2) count_count+1}

Page 33: “WORKING WITH S-PLUS”

33

Could delete the if statement and just compute

100-(sum(betalb>2)+sum(betaub<2))/2

Page 34: “WORKING WITH S-PLUS”

34

Memory allocation in S-Plus

• In loops, RAM is not released until the loop is completed.– In large simulations, later iterations will slow relative to earlier

iterations.

– This problem is made worse in nested loops (loops within loops).

• Try to avoid loops, and especially nested loops, whenever possible– Use vector and matrix manipulations, instead.

– Have yik, i=1,…,n, k=1,…,ni. Want yi.=Σyik.

– Could loop across i and k, or be clever and use cumsum:

cumsum(y)[c(n1,…, nn)]-c(0,cumsum(y)[c(n1,…,nn-1)])

Page 35: “WORKING WITH S-PLUS”

35

RAn alternative to S-Plus is R.

• Part of the GNU ("GNU's Not Unix") constellation of free Linux-based software.

• Runs under Linux, UNIX and Windows.

• Big difference: dynamic allocation of memory

• Can run loops without chewing up memory.

Page 36: “WORKING WITH S-PLUS”

36

• Most basic code is virtually interchangeable. A few differences:– Generalized linear models are implemented differently in R. (The

same functionality is available but the components have different names.)

– Missing data usually leads to failure in S unless the “na.omit” is used; "na.omit" is the default in R.

– No-intercept models are defined by y~x+0 instead of y~x-1.

– Write out “TRUE” and “FALSE” instead of using “T” and “F”.

• More advanced functions (linear and non-linear mixed models, smoothing splines, adaptive quadrature, SEMs) have packages that need to be downloaded.

Page 37: “WORKING WITH S-PLUS”

37

Where to go to get help

• This has been a very brief introduction.• The on-line help menu in S-plus is very useful.• Venables and Ripley (1997) Modern Applied Statistics

with S-Plus (2nd edition).• Websites:

Splus:– www.mathsoft.com– Statlib site: lib.stat.cmu/S/

R:– http://www.R-project.org/ – http://www.gnu.org/gnu/the-gnu-project.html

See http://cceb.med.upenn.edu/elliott/ for this presentation.