Graphics with R - University of Michigan
Graphical capabilities
One of the strengths of the S language is graphics.
Simple, exploratory graphics are easy to produce.
Publication quality graphics can be created.
Several device drivers are available including:
On-screen graphics - either Windows, or X11, or Macintosh
postscript - PostScript graphics commands
pdf - Adobe Portable Document Format
png - PNG bitmap device (like .gif but free of software patents)
jpeg - JPEG bitmap
WMF - Windows meta-file (Windows only)
Graphics with R – p.2/37
Declaring graphics devices
The on-screen devices are the most commonly used. For publication-quality graphics the postscript, pdf, or WMF devices are preferred because they produce scalable images. Use bitmap devices only when there is no alternative.
The preferred sequence is to specify a graphics device then call graphics functions. If you do not specify a device first, the on-screen device is started.
A contributed R package called lattice provides Trellis graphics functions. When using lattice it is important to declare the device using trellis.device before issuing graphics commands.
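The declare-then-draw sequence described above can be sketched as follows. This is a minimal sketch and not part of the original slides: the file name device-demo.pdf and the 6 by 4 inch size are arbitrary choices, and the file is written to the session's temporary directory.

```r
# Hypothetical output file in the session's temporary directory
out_file <- file.path(tempdir(), "device-demo.pdf")

pdf(out_file, width = 6, height = 4)   # 1. declare the device (size in inches)
plot(cars, main = "cars data")         # 2. call graphics functions
dev.off()                              # 3. close the device so the file is completed

file.exists(out_file)                  # check that the file was written
```

Forgetting the final dev.off() is a common reason for an empty or unreadable output file.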
Types of graphics functions
High-level - functions such as plot, hist, boxplot, or pairs that produce an entire plot or initialize a plot.
Low-level - functions that add to an existing plot created with a high-level plotting function. Examples are points, lines, text, axis, ...
Trellis functions - functions such as xyplot, bwplot, or histogram that can produce an entire multipanel display in a single call.
After creating a new plot with a high-level plotting function, you can add to the plot by making calls to low-level plotting functions. You cannot, however, do this after a trellis function call.
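As an illustration of the high-level/low-level split, the sketch below sets up a plot with one high-level call and then builds it up with low-level calls. The sine curve is an arbitrary choice made for this example, not something taken from the slides.

```r
x <- seq(0, 2 * pi, length.out = 50)

# High-level call: sets up the axes and title; type = "n" draws no data yet
plot(x, sin(x), type = "n", las = 1,
     main = "High-level call, then low-level additions")

# Low-level calls: each one adds to the existing plot
lines(x, sin(x), col = "blue")                       # a curve
idx <- c(1, 25, 50)
points(x[idx], sin(x[idx]), pch = 16)                # a few highlighted points
abline(h = 0, lty = 2)                               # a dashed reference line
text(pi, 0.5, "sin(x)")                              # a text label
```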
An example
It is common to illustrate the central-limit effect by computing means of samples from a nonsymmetric distribution and showing that the distribution of the mean tends to a normal distribution as the sample size increases. This is best illustrated with graphical displays.
> # generate the samples as a matrix
> rmt <- matrix(rexp(1000 * 16), nrow = 16)
> mns <-                                  # Apply the mean function to columns
+   cbind(rmt[ 1, ],                      # means of samples of 1
+         apply(rmt[ 1:4, ], 2, mean),    # means of samples of 4
+         apply(rmt[ 1:16, ], 2, mean)    # means of samples of 16
+   )
> meds <-                                 # Apply the median function to columns
+   cbind(rmt[ 1, ],                      # medians of samples of 1
+         apply(rmt[ 1:4, ], 2, median),  # medians of samples of 4
+         apply(rmt[ 1:16, ], 2, median)  # medians of samples of 16
+   )
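Before plotting, a quick numerical check can confirm that the simulated means behave as expected. The sketch below is an addition to the slides: it assumes a fixed seed for reproducibility and uses colMeans as a shortcut for apply(m, 2, mean). Because rexp draws from an exponential distribution with rate 1 by default (standard deviation 1), the standard deviation of a mean of n values should be close to 1/sqrt(n).

```r
set.seed(1)                                   # assumption: fixed seed for reproducibility
rmt <- matrix(rexp(1000 * 16), nrow = 16)     # 16 draws in each of 1000 columns
mns <- cbind(rmt[1, ],                        # means of samples of 1
             colMeans(rmt[1:4, ]),            # means of samples of 4
             colMeans(rmt[1:16, ]))           # means of samples of 16

n <- c(1, 4, 16)
rbind(observed = apply(mns, 2, sd),           # spread of the simulated means
      theory   = 1 / sqrt(n))                 # 1, 0.5, 0.25
```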
Using high-level plotting functions
> hist(mns[, 1]) # a histogram of the means of samples of 1
[Figure: histogram titled "Histogram of mns[, 1]"; x axis: mns[, 1]; y axis: Frequency.]
Enhancing high-level plots
> hist(mns[,2], main = "Means of samples of size 4",
+      xlab = "Size 4 means", las = 1)   # las sets the axis label style
[Figure: histogram titled "Means of samples of size 4"; x axis: Size 4 means; y axis: Frequency.]
Using low-level graphics functions
> hist(mns[,3], main = "Means of samples of size 16",
+      xlab = "Size 16 means", las = 1, col = "darkred", prob = TRUE)
> lines(density(mns[,3]), col = "blue")
[Figure: histogram titled "Means of samples of size 16" with an overlaid density curve; x axis: Size 16 means; y axis: Density.]
Using lattice graphics
> library(lattice)
> histogram(~ mns | ssz, data = data.frame(mns = c(mns),
+           ssz = gl(3, 1000, labels = c("1", "4", "16"))),
+           layout = c(3, 1), main = "Histograms of means by sample size")
[Figure: "Histograms of means by sample size"; three panels for sample sizes 1, 4, and 16; x axis: mns; y axis: Percent of Total.]
Using the formula-data specification
Most of the high-level R graphics functions allow a formula-data specification for the plot. In trellis-style graphics from the lattice package the formula-data specification is the only way to specify a plot.
A formula in S is indicated by the ~ character. Because formulas are used to specify statistical models, this character is often read as "is modelled as". The second argument in a formula-data specification is usually a data frame with variables corresponding to the names in the formula.
To use the formula-data specification, first construct a data frame with the data to be plotted. The preferred form is to "stack" all the data into a single column with accompanying columns that indicate the groups of observations.
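A small illustration of stacking followed by a formula-data call is sketched below. The numbers are made up purely for the example, and the use of the base function stack is one possible way to do the stacking, not necessarily the one the slides have in mind.

```r
# Made-up measurements in "wide" form: one column per group
wide <- data.frame(a = c(1.2, 0.7, 3.1), b = c(2.4, 2.0, 2.9))

# stack() returns two columns: values (the data) and ind (a factor naming the group)
long <- stack(wide)
str(long)

# Formula-data specification: "values is modelled as ind"
boxplot(values ~ ind, data = long, las = 1)
```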
Stacking the simulation data
Recall the simulated data of means and medians of samples of size 1, 4, and 16 from an exponential distribution. We arrange this into a data frame with 6000 rows and three columns - the simulated data, the sample size being simulated, and an indicator of mean or median.
The gl function can be used to generate patterned data like the sample size and the type of simulation.
> alldat <- data.frame(sim = c(mns, meds),
+     ssz = gl(3, 1000, length = 6000, labels = c("1", "4", "16")),
+     type = gl(2, 3000, labels = c("Mean", "Median")))
> str(alldat)
'data.frame': 6000 obs. of 3 variables:
 $ sim : num 0.934 0.200 1.866 1.074 1.283 ...
 $ ssz : Factor w/ 3 levels "1","4","16": 1 1 1 1 1 1 1 1 1 1 ...
 $ type: Factor w/ 2 levels "Mean","Median": 1 1 1 1 1 1 1 1 1 1 ...
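To see how the arguments of gl shape the pattern, a few small calls may help. This is an illustrative sketch added here; the first two calls use toy sizes, and the last repeats the ssz construction so the group counts can be checked.

```r
gl(3, 2)                    # three levels, each repeated twice: 1 1 2 2 3 3
gl(2, 1, length = 6)        # the pattern 1 2 is recycled to length 6

# Same construction as the ssz column: blocks of 1000, recycled over 6000 rows
ssz <- gl(3, 1000, length = 6000, labels = c("1", "4", "16"))
table(ssz)                  # 2000 observations at each sample size
```

The blocks of 1000 line up with c(mns, meds) because c() unrolls each 1000 by 3 matrix column by column.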
Formulas specifying plots
The general form of a formula specifying a plot is
y ~ x | g
where y is assigned to the vertical axis, x is assigned to the horizontal axis, and g is a grouping factor or expression.
For special cases like the histogram, the vertical axis is pre-specified. In these cases we use a one-sided formula where y is omitted.
In trellis graphics functions the grouping expression can include multiple factors separated by an arithmetic operator - often the * because this indicates "crossing" the factors. For example,
histogram(~ sim | ssz * type, data = alldat,
          layout = c(3, 2),
          main = "Histograms of means and medians by sample size")
Example of multiple grouping factors
[Figure: "Histograms of means and medians by sample size"; six panels (sample sizes 1, 4, and 16 crossed with Mean and Median); x axis: sim; y axis: Percent of Total.]
Boxplots
Another way of comparing distributions of sample data is with a boxplot (high-level graphics) or bwplot (lattice).
bwplot(ssz ~ sim | type, data = alldat,
       main = "Boxplots of means and medians by sample size")
[Figure: "Boxplots of means and medians by sample size"; panels Mean and Median side by side; horizontal axis: sim; one box for each sample size 1, 4, and 16.]
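For comparison, a rough high-level counterpart using boxplot is sketched below. It is an addition to the slides: it regenerates a small version of the simulated means with an arbitrary seed so that the block runs on its own, and it shows only the means rather than both panels.

```r
set.seed(2)                                        # assumption: any seed will do
rmt <- matrix(rexp(1000 * 16), nrow = 16)
dat <- data.frame(
  sim = c(rmt[1, ], colMeans(rmt[1:4, ]), colMeans(rmt[1:16, ])),
  ssz = gl(3, 1000, labels = c("1", "4", "16"))
)

# horizontal = TRUE puts sim on the horizontal axis, as in the bwplot display
bp <- boxplot(sim ~ ssz, data = dat, horizontal = TRUE, las = 1,
              xlab = "sim", ylab = "Sample size",
              main = "Boxplots of means by sample size")
bp$n                                               # observations behind each box
```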
Changing the lattice layout
The layout argument in lattice is used to rearrange the panels. It is a two-component or three-component integer vector in the order (columns, rows, pages) (not (rows, columns, pages)).
bwplot(ssz ~ sim | type, data = alldat, layout = c(1,2),
       main = "Boxplots of means and medians by sample size")
[Figure: "Boxplots of means and medians by sample size"; panels Mean and Median stacked in one column; horizontal axis: sim; one box for each sample size 1, 4, and 16.]
Scatter plots
A basic data plot is the scatter plot for x-y data. Recall the Formaldehyde data:
> data(Formaldehyde)
> plot(optden ~ carb, data = Formaldehyde, xlab = "Carbohydrate (ml)",
+      ylab = "Optical Density", main = "Formaldehyde data", col = 4,
+      las = 1)
[Figure: scatter plot titled "Formaldehyde data"; x axis: Carbohydrate (ml); y axis: Optical Density.]
Common additional graphics parameters
The call to plot on the previous slide included arguments
xlab - x axis label
ylab - y axis label
main - main title on the plot
col - color of the plotted points
las - axis label style
Although not required it is common to use these arguments.
Usually it is easy to get a first cut at a data plot, which may be sufficient for exploratory graphics. Presentation graphics can require considerable experimentation with different settings, involving many cycles of editing and rerunning the code. Creating a file of R code that can be edited and rerun using R CMD BATCH is a good idea.
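A script for this edit-and-rerun cycle might look like the sketch below. The script name make-figure.R and the output name formaldehyde.pdf are hypothetical; the plot call is the one from the Formaldehyde scatter plot.

```r
# make-figure.R (hypothetical file name). Edit the settings, then rerun from
# the shell with:  R CMD BATCH make-figure.R
fig_file <- "formaldehyde.pdf"            # hypothetical output name, written to the working directory

pdf(fig_file, width = 5, height = 4)      # declare the device first
plot(optden ~ carb, data = Formaldehyde,
     xlab = "Carbohydrate (ml)", ylab = "Optical Density",
     main = "Formaldehyde data", col = 4, las = 1)
dev.off()                                 # close the device to finish the file
```

Because the device is declared in the script, every rerun overwrites the same file, which makes it easy to compare the effect of each change.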
Adding lines to plots
We often use scatter plots to picture relationships between variables then proceed to fit statistical models representing these relationships. The lines and abline functions can be used to add lines to a data plot depicting
scatterplot smoothers - a smooth curve generated from the y versus x data. This is added to enhance visualization of the y ~ x relationship.
predictions from fitted models - a fitted simple linear regression model can be added directly to a plot with abline. For more complicated models, use predict to create (x, y) pairs to add to the plot with lines.
> abline(fm1 <- lm(optden ~ carb, data = Formaldehyde))
> data(cars); plot(cars, xlab = "Speed (mph)",
+      ylab = "Stopping distance (ft)", las = 1, main = "cars data")
> lines(lowess(cars$speed, cars$dist, f = 2/3, iter = 3), col = "red")
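The predict-then-lines approach for a model that abline cannot draw can be sketched as follows. The quadratic model is an assumption chosen only to stand in for a "more complicated model"; the slides do not fit it.

```r
# Assumption: a quadratic in speed stands in for a "more complicated model"
fm2 <- lm(dist ~ speed + I(speed^2), data = cars)

# A grid of x values spanning the observed speeds
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 100))
grid$dist <- predict(fm2, newdata = grid)      # fitted y at each grid x

plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     las = 1, main = "cars data")
lines(grid$speed, grid$dist, col = "blue")     # add the fitted curve
```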
Adding a fitted model
[Figure: "Formaldehyde data" scatter plot with the fitted regression line added; x axis: Carbohydrate (ml); y axis: Optical Density.]
Adding a smoothed curve
[Figure: "cars data" scatter plot with the lowess curve added; x axis: Speed (mph); y axis: Stopping distance (ft).]
Logarithmic axes
The plot of the stopping distance versus speed for the cars data shows a curvilinear relationship. It also hints at increasing variance in the stopping distance as the speed (and the mean stopping distance) increase.
In cases like this a logarithmic transformation can stabilize the variance and perhaps produce a simpler relationship. The argument log is used to request logarithmic axes. It can take the values "x", "y", or "xy".
plot(cars, xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     las = 1, log = "xy", main = "cars data (logarithmic scales)")
lines(lowess(cars$speed, cars$dist, f = 2/3, iter = 3), col = "red")
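If the relationship does look simpler on the logarithmic scale, a straight-line fit on that scale can be drawn on the same axes. The sketch below is an addition to the slides and assumes a simple linear model in log(speed); note that low-level functions still take coordinates in the original units when log axes are in effect.

```r
fm_log <- lm(log(dist) ~ log(speed), data = cars)   # straight line on the log-log scale

plot(cars, log = "xy", las = 1,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "cars data (logarithmic scales)")

# Coordinates are supplied in mph and ft; the log scaling is applied by the plot
speed_grid <- seq(min(cars$speed), max(cars$speed), length.out = 50)
fitted_dist <- exp(predict(fm_log, newdata = data.frame(speed = speed_grid)))
lines(speed_grid, fitted_dist, col = "blue")

coef(fm_log)                                        # intercept and slope on the log scale
```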
Cars data - logarithmic scales
[Figure: "cars data (logarithmic scales)" scatter plot with the lowess curve; x axis: Speed (mph); y axis: Stopping distance (ft); both axes logarithmic.]
Scatterplots with lattice
The lattice equivalent to plot is xyplot. You must use a formula-data specification with xyplot. Also, you must complete the plot in a single function call. Instead of using multiple calls to plot points then add scatterplot smoothers, etc., you describe all the actions for creating the plot in a panel function passed as the panel argument.
> xyplot(optden ~ carb, data = Formaldehyde, xlab = "Carbohydrate (ml)",
+        ylab = "Optical density", main = "Default panel function")
> xyplot(optden ~ carb, data = Formaldehyde, xlab = "Carbohydrate (ml)",
+        ylab = "Optical density",
+        panel = function(x, y) {
+            panel.xyplot(x,y); panel.lmline(x,y)
+        }
+ )
> xyplot(dist ~ speed, data = cars, xlab = "Speed (mph)",
+        ylab = "Stopping distance (ft)",
+        panel = function(x, y) {
+            panel.xyplot(x,y); panel.loess(x,y)
+        }
+ )
> xyplot(log(dist) ~ log(speed), data = cars, xlab = "log(Speed) (log(mph))",
+        ylab = "log(Stopping distance) (log(ft))",
+        panel = function(x, y) {
+            panel.xyplot(x,y); panel.loess(x,y)
+        }
+ )
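For the two most common additions, a regression line and a loess curve, the default panel function offers a shorthand through its type argument. The sketch below is an alternative to the custom panel functions above, not the approach used in the slides; the object names p_lm and p_lo are arbitrary.

```r
library(lattice)

# type = "p" draws points, "r" adds a least-squares line, "smooth" adds a loess curve
p_lm <- xyplot(optden ~ carb, data = Formaldehyde, type = c("p", "r"),
               xlab = "Carbohydrate (ml)", ylab = "Optical density")
p_lo <- xyplot(dist ~ speed, data = cars, type = c("p", "smooth"),
               xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# A lattice call returns a "trellis" object; the plot is drawn when it is printed
print(p_lm)
print(p_lo)
```

A custom panel function remains the more general tool, since it can combine any of the panel.* building blocks.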
Formaldehyde xyplot with default panel
[Figure: "Default panel function"; scatter plot of the Formaldehyde data; x axis: Carbohydrate (ml); y axis: Optical density.]
Formaldehyde xyplot with custom panel
[Figure: scatter plot of the Formaldehyde data with the fitted regression line; x axis: Carbohydrate (ml); y axis: Optical density.]
Cars xyplot with custom panel
[Figure: scatter plot of the cars data with a loess curve; x axis: Speed (mph); y axis: Stopping distance (ft).]
Cars xyplot on the logarithmic scale
[Figure: scatter plot of the cars data on the logarithmic scale with a loess curve; x axis: log(Speed) (log(mph)); y axis: log(Stopping distance) (log(ft)).]
There is a scales = list(log = TRUE) argument for xyplot but currently it has no effect.
xyplots conditioned on a factor
One of the most powerful uses of trellis-style graphics is comparing patterns of responses across different groups. For example, the Oxboys data in the nlme package gives the height (cm) of some boys from Oxford, England. Each subject was measured several times throughout his adolescence. The time of measurement was converted to an arbitrary scale centered at the midpoint of the data.
We wish to compare the growth patterns for these boys. To facilitate comparison, the data for each boy is plotted in a separate panel but the scale of the axes is the same for each panel.
> data(Oxboys, package = "nlme")
> xyplot(height ~ age | Subject, data = Oxboys, ylab = "Height (cm)")
> xyplot(height ~ age | Subject, data = Oxboys, ylab = "Height (cm)",
+        aspect = "xy",    # calculate an optimal aspect ratio
+        panel = function(x,y) {
+            panel.grid(); panel.xyplot(x,y)
+        }
+ )
Oxboys data, standard aspect ratio
[Figure: height versus age with one panel per subject (26 panels), default aspect ratio; x axis: age; y axis: Height (cm).]
Oxboys data, "xy" aspect ratio
[Figure: height versus age with one panel per subject (26 panels), aspect = "xy", with grid lines; x axis: age; y axis: Height (cm).]
Scatterplot matrices
Scatterplot matrices are helpful in initial exploration of multivariate data. These provide scatterplots of all pairs of variables in the data and are produced by either the high-level graphics function pairs or the lattice function splom.
Both pairs and splom allow a panel function to be specified. Check the documentation for details.
> data(USArrests)
> pairs(USArrests)
> splom(~ USArrests)
> splom(~ USArrests,
+       panel = function(x,y) {
+           panel.xyplot(x,y); panel.loess(x,y)
+       }
+ )
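The slides show a panel function only for splom; a corresponding sketch for pairs is given below. panel.smooth is a panel function supplied with the graphics package, while panel_with_line is a hypothetical helper written for this example.

```r
# Built-in panel function: points plus a lowess curve in every panel
pairs(USArrests, panel = panel.smooth,
      main = "USArrests with a smoother in each panel")

# Hypothetical custom panel function: points plus a least-squares line
panel_with_line <- function(x, y, ...) {
  points(x, y, ...)
  abline(lm(y ~ x), col = "red")
}
pairs(USArrests, panel = panel_with_line,
      main = "USArrests with a fitted line in each panel")
```

Unlike a lattice panel function, a pairs panel function uses ordinary low-level graphics calls such as points and abline.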
Pairs plot of USArrests data
[Figure: pairs plot of the USArrests variables Murder, Assault, UrbanPop, and Rape.]
Default splom plot of USArrests data
[Figure: default splom plot of the USArrests variables Murder, Assault, UrbanPop, and Rape.]
Enhanced splom plot of USArrests data
[Figure: splom plot of the USArrests variables Murder, Assault, UrbanPop, and Rape with a loess curve in each panel.]
Multiple figures per page
We have seen the use of the lattice package to produce multipanel plots where all the panels are related in some way. Occasionally we wish to put multiple unrelated figures on a page. In the high-level graphics, the par function can be used to set the graphics parameters mfrow (multiple figures created row-wise) or mfcol (multiple figures created columnwise) to do this.
> data(Formaldehyde)
> fm1 <- lm(optden ~ carb, data = Formaldehyde)
> par(mfrow = c(1,2))   # the next plot command produces 4 figures
> plot(fm1)
Note - Creating an attractive multipanel plot in this way is difficult. The default settings for the axis spacing, etc., do not work well and usually require adjustment. If your multipanel plot can be created in the lattice style, use that instead. When browsing statistics journals one often sees figures created with par(mfrow = ) that should instead have been done with trellis/lattice.
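When par(mfrow = ) is used, it is usually worth saving and restoring the previous settings so that later plots are not affected. The sketch below is one common idiom, added here as an illustration; the 2 by 2 grid is an arbitrary choice that puts all four diagnostic plots on one page.

```r
fm1 <- lm(optden ~ carb, data = Formaldehyde)

op <- par(mfrow = c(2, 2))   # 2 x 2 grid filled row-wise; op holds the old settings
plot(fm1)                    # the four diagnostic plots share one page
par(op)                      # restore the previous layout

par("mfrow")                 # the layout in effect after restoring
```

On a fresh device the restored layout is a single figure per page.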
First two diagnostic plots
[Figure: diagnostic plots "Residuals vs Fitted" (x axis: Fitted values; y axis: Residuals) and "Normal Q-Q plot" (x axis: Theoretical Quantiles; y axis: Standardized residuals) for lm(formula = optden ~ carb, data = Formaldehyde).]
Second two diagnostic plots
[Figure: diagnostic plots "Scale-Location plot" (x axis: Fitted values; y axis: Standardized residuals) and "Cook's distance plot" (x axis: Obs. number; y axis: Cook's distance) for lm(formula = optden ~ carb, data = Formaldehyde).]