
  • R Tutorial

    Kelly Black Department of Mathematics and Computer Science Clarkson University

    Contents:

    1. Input

    2. Basic Data Types

    3. Basic Operations and Numerical Descriptions

    4. Basic Probability Distributions

    5. Basic Plots

    6. Intermediate Plotting

    7. Indexing Into Vectors

    8. Linear Least Squares Regression

    9. Calculating Confidence Intervals

    10. Calculating p Values

    11. Calculating the Power of a Test

    12. Two Way Tables

    13. Data Management

    14. Scripting

    15. Time Data Types

    16. Case Study: Working Through a HW Problem

    17. Case Study II: A JAMA Paper on Cholesterol


  • 1. Input

    Contents

    Assignment Reading a CSV file Brief Note on Fixed Width Files

    Here we explore how to define a data set in an R session. Only two commands are explored. The first is for simple assignment of data, and the second is for reading in a data file. There are many ways to read data into an R session, but we focus on just two to keep it simple.

    1.1. Assignment

The most straightforward way to store a list of numbers is through an assignment using the c command. (c stands for combine.) The idea is that a list of numbers is stored under a given name, and the name is used to refer to the data. A list is specified with the c command, and assignment is specified with the "<-" operator:

> bubba <- c(3,5,7,9)

When you enter this command you should not see any output except a new command line. The command creates a list of numbers called bubba. To see what numbers are included in bubba, type bubba at the prompt and press the enter key:

> bubba
[1] 3 5 7 9

    If you wish to work with one of the numbers you can get access to it using the variable and then square brackets indicating which number:

> bubba[2]
[1] 5
> bubba[1]
[1] 3
> bubba[0]
numeric(0)
> bubba[3]
[1] 7
> bubba[4]
[1] 9

Notice that the first entry is referred to as entry number 1, and the zero entry indicates how the data is stored (here numeric(0), an empty numeric vector). You can store strings using either single or double quotes, and you can store real numbers.
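As a small sketch of these points (the variable names below are made up, not part of the tutorial's data), strings are stored with c just like numbers and are indexed the same way:

```r
# Strings work with either quote style; indexing starts at 1.
words <- c('hello', "there")   # single and double quotes both work
reals <- c(3.2, 5.1)           # real numbers store the same way
words[1]                       # first entry: "hello"
reals[2]                       # second entry: 5.1
```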

  • You now have a list of numbers and are ready to explore. In the chapters that follow we will examine the basic operations in R that will allow you to do some of the analyses required in class.

    1.2. Reading a CSV file

    Unfortunately, it is rare to have just a few data points that you do not mind typing in at the prompt. It is much more common to have a lot of data points with complicated relationships. Here we will examine how to read a data set from a file using the read.csv function but first discuss the format of a data file.

    We assume that the data file is in the format called comma separated values (csv). That is, each line contains a row of values which can be numbers or letters, and each value is separated by a comma. We also assume that the very first row contains a list of labels. The idea is that the labels in the top row are used to refer to the different columns of values.

    First we read a very short, somewhat silly, data file. The data file is called simple.csv and has three columns of data and six rows. The three columns are labeled trial, mass, and velocity. We can pretend that each row comes from an observation during one of two trials labeled A and B. A copy of the data file is shown below and is created in defiance of Werner Heisenberg:

simple.csv

"trial","mass","velocity"
"A",10,12
"A",11,14
"B",5,8
"B",6,10
"A",10.5,13
"B",7,11

The command to read the data file is read.csv. We have to give the command at least one argument, but we will give three arguments to indicate how the command can be used in different situations. The first argument is the name of the file. The second argument indicates whether or not the first row is a set of labels. The third argument indicates that there is a comma between each value on each line. The following command will read in the data and assign it to a variable called heisenberg:

> heisenberg <- read.csv(file="simple.csv", head=TRUE, sep=",")
> heisenberg
  trial mass velocity
1     A 10.0       12
2     A 11.0       14
3     B  5.0        8
4     B  6.0       10
5     A 10.5       13
6     B  7.0       11
> summary(heisenberg)
 trial      mass          velocity
 A:3   Min.   : 5.00   Min.   : 8.00
 B:3   1st Qu.: 6.25   1st Qu.:10.25
       Median : 8.50   Median :11.50
       Mean   : 8.25   Mean   :11.33
       3rd Qu.:10.38   3rd Qu.:12.75
       Max.   :11.00   Max.   :14.00

(Note that if you are using a Microsoft system the file naming convention is different from what we use here. If you want to use a backslash it needs to be escaped, i.e. use two backslashes together, "\\". Also you can specify which folder to use by clicking on the File option in the main menu and choosing the option to set your working directory.)
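As a minimal sketch of the escaping rule (the path below is hypothetical, not a file from this tutorial):

```r
# A backslash inside an R string must be escaped, so a Windows path
# is written with doubled backslashes (this path is made up):
path <- "C:\\data\\simple.csv"
cat(path, "\n")    # the stored string contains single backslashes
nchar("\\")        # an escaped backslash counts as one character
# Forward slashes also work on Windows and need no escaping:
# heisenberg <- read.csv("C:/data/simple.csv")
```

Forward slashes sidestep the escaping issue entirely, which is why many R references prefer them even on Windows.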

    To get more information on the different options available you can use the help command:

    > help(read.csv)

    If R is not finding the file you are trying to read then it may be looking in the wrong folder/directory. If you are using the graphical interface you can change the working directory from the file menu. If you are not sure what files are in the current working directory you can use the dir() command to list the files and the getwd() command to determine the current working directory:

> dir()
[1] "fixedWidth.dat" "simple.csv"     "trees91.csv"    "trees91.wk1"
[5] "w1.dat"
> getwd()
[1] "/home/black/write/class/stat/stat383-13F/dat"

    The variable heisenberg contains the three columns of data. Each column is assigned a name based on the header (the first line in the file). You can now access each individual column using a $ to separate the two names:

> heisenberg$trial
[1] A A B B A B
Levels: A B
> heisenberg$mass
[1] 10.0 11.0  5.0  6.0 10.5  7.0
> heisenberg$velocity
[1] 12 14  8 10 13 11

    If you are not sure what columns are contained in the variable you can use the names command:

> names(heisenberg)
[1] "trial"    "mass"     "velocity"

We will look at another example which is used throughout this tutorial: the data found in a spreadsheet located at http://cdiac.ornl.gov/ftp/ndp061a/trees91.wk1 . A description of the data file is located at http://cdiac.ornl.gov/ftp/ndp061a/ndp061a.txt . The original data is given in an Excel spreadsheet. It has been converted into a csv file, trees91.csv , by deleting the top set of rows and saving it as a csv file, which is a save option within Excel. (You should save the file on your computer.) It is a good idea to open this file in a spreadsheet and look at it. This will help you make sense of how R stores the data.

The data is used to indicate an estimate of biomass of ponderosa pine in a study performed by Dale W. Johnson, J. Timothy Ball, and Roger F. Walker who are associated with the Biological Sciences Center, Desert Research Institute, P.O. Box 60220, Reno, NV 89506 and the Environmental and Resource Sciences College of Agriculture, University of Nevada, Reno, NV 89512. The data consists of 54 lines, and each line represents an observation. Each observation includes 28 different measurements and markers for a given tree. For example, the first number in each row is either 1, 2, 3, or 4, signifying a different level of exposure to carbon dioxide. The sixth number in every row is an estimate of the biomass of the stems of a tree. Note that the very first line in the file is a list of labels used for the different columns of data.

The data can be read into a variable called tree using the read.csv command:

> tree <- read.csv(file="trees91.csv", head=TRUE, sep=",")
> attributes(tree)
$names
 [1] "C"      "N"      "CHBR"   "REP"    "LFBM"   "STBM"   "RTBM"   "LFNCC"
 [9] "STNCC"  "RTNCC"  "LFBCC"  "STBCC"  "RTBCC"  "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC"  "STKCC"  "RTKCC"  "LFMGCC" "STMGCC" "RTMGCC" "LFPCC"  "STPCC"
[25] "RTPCC"  "LFSCC"  "STSCC"  "RTSCC"
$class
[1] "data.frame"
$row.names
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
[31] "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42" "43" "44" "45"
[46] "46" "47" "48" "49" "50" "51" "52" "53" "54"

The first thing that R stores is a list of names which refer to each column of the data. For example, the first column is called C and the second column is called N. The variable tree is of type data.frame. Finally, the rows are numbered consecutively from 1 to 54, and each column has 54 numbers in it.

    If you know that a variable is a data frame but are not sure what labels are used to refer to the different columns you can use the names command:

> names(tree)
 [1] "C"      "N"      "CHBR"   "REP"    "LFBM"   "STBM"   "RTBM"   "LFNCC"
 [9] "STNCC"  "RTNCC"  "LFBCC"  "STBCC"  "RTBCC"  "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC"  "STKCC"  "RTKCC"  "LFMGCC" "STMGCC" "RTMGCC" "LFPCC"  "STPCC"
[25] "RTPCC"  "LFSCC"  "STSCC"  "RTSCC"

    If you want to work with the data in one of the columns you give the name of the data frame, a $ sign, and the label assigned to the column. For example, the first column in tree can be called using tree$C:

> tree$C
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4

    1.3. Brief Note on Fixed Width Files

    There are many ways to read data using R. We only give two examples, direct assignment and reading csv files. However, another way deserves a brief mention. It is common to come across data that is organized in flat files and delimited at preset locations on each line. This is often called a fixed width file.

The command to deal with these kinds of files is read.fwf. It is only briefly demonstrated here. If you would like more information on how to use this command enter the following:

    > help(read.fwf)

The read.fwf command requires at least two arguments. The first is the name of the file, and the second is a list of numbers that gives the width of each column in the data file. A negative number in the list indicates that the column should be skipped. Here we give the command to read the data file fixedWidth.dat . In this data file there are three columns. The first column is 17 characters wide, the second column is 15 characters wide, and the last column is 7 characters wide. In the example below the first column is skipped (note the negative width), and we use the optional col.names option to specify the names of the remaining columns:

> a = read.fwf('fixedWidth.dat', widths=c(-17,15,7), col.names=c('temp','offices'))
> a
  temp offices
1 17.0      35
2 18.0     117
3 17.5      19
4 17.5      28

  • 2. Basic Data Types

    Contents

    Variable Types Tables

    We look at some of the ways that R can store and organize data. This is a basic introduction to a small subset of the different data types recognized by R and is not comprehensive in any sense. The main goal is to demonstrate the different kinds of information R can handle. It is assumed that you know how to enter data or read data files which is covered in the first chapter.

    2.1. Variable Types

    2.1.1. Numbers

    The way to work with real numbers has already been covered in the first chapter and is briefly discussed here. The most basic way to store a number is to make an assignment of a single number:

> a <- 3

The "<-" tells R to take the number to the right of the symbol and store it in the variable whose name is given on the left. When you make an assignment R does not print out any output. To see what value a variable holds, type its name on a line and press the enter key:

> a
[1] 3

    This allows you to do all sorts of basic operations and save the numbers:

> b <- sqrt(a*a+3)
> b
[1] 3.464102

    If you want to get a list of the variables that you have defined in a particular session you can list them all using the ls command:

> ls()
[1] "a" "b"

    You are not limited to just saving a single number. You can create a list (also called a vector) using the c command:

> a <- c(1,2,3,4,5)
> a
[1] 1 2 3 4 5
> a+1
[1] 2 3 4 5 6
> mean(a)
[1] 3
> var(a)
[1] 2.5

    You can get access to particular entries in the vector in the following manner:

> a <- c(1,2,3,4,5)
> a[1]
[1] 1
> a[2]
[1] 2
> a[0]
numeric(0)
> a[5]
[1] 5
> a[6]
[1] NA

    Note that the zero entry is used to indicate how the data is stored. The first entry in the vector is the first number, and if you try to get a number past the last number you get NA.

Examples of the sort of operations you can do on vectors are given in a later chapter.

    To initialize a list of numbers the numeric command can be used. For example, to create a list of 10 numbers, initialized to zero, use the following command:

> a <- numeric(10)
> a
 [1] 0 0 0 0 0 0 0 0 0 0

If you wish to determine the data type used for a variable, use the typeof command:

> typeof(a)
[1] "double"
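To illustrate (a small sketch; the values here are arbitrary and not from the tutorial), typeof distinguishes the basic storage types:

```r
# typeof reports the internal storage type of a value.
typeof(3.5)       # "double"  - real numbers
typeof(3L)        # "integer" - the L suffix forces integer storage
typeof("hello")   # "character"
typeof(TRUE)      # "logical"
```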

    2.1.2. Strings

    You are not limited to just storing numbers. You can also store strings. A string is specified by using quotes. Both single and double quotes will work:

> a <- "hello"
> a
[1] "hello"
> b <- c("hello","there")
> b
[1] "hello" "there"
> b[1]
[1] "hello"

The name of the type given to strings is character:

> typeof(a)
[1] "character"

A vector of empty strings can be created using the character command:

> a = character(20)
> a
 [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

    2.1.3. Factors

    Another important way R can store data is as a factor. Often times an experiment includes trials for different levels of some explanatory variable. For example, when looking at the impact of carbon dioxide on the growth rate of a tree you might try to observe how different trees grow when exposed to different preset concentrations of carbon dioxide. The different levels are also called factors.

    Assuming you know how to read in a file, we will look at the data file given in the first chapter. Several of the variables in the file are factors:

> summary(tree$CHBR)
 A1  A2  A3  A4  A5  A6  A7  B1  B2  B3  B4  B5  B6  B7  C1  C2  C3  C4  C5  C6
  3   1   1   3   1   3   1   1   3   3   3   3   3   3   1   3   1   3   1   1
 C7 CL6 CL7  D1  D2  D3  D4  D5  D6  D7
  1   1   1   1   1   3   1   1   1   1

Because the set of options given in the data file corresponding to the CHBR column are not all numbers, R automatically assumes that it is a factor. When you use summary on a factor it does not print out the five point summary; rather, it prints out the possible values and the frequency with which they occur.

In this data set several of the columns are factors, but the researchers used numbers to indicate the different levels. For example, the first column, labeled C, is a factor. Each tree was grown in an environment with one of four different possible levels of carbon dioxide. The researchers quite sensibly labeled these four environments as 1, 2, 3, and 4. Unfortunately, R cannot determine that these are factors and must assume that they are regular numbers.

    This is a common problem and there is a way to tell R to treat the C column as a set of factors. You specify that a variable is a factor using the factor command. In the following example we convert tree$C into a factor:

> tree$C
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
> summary(tree$C)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   2.000   2.519   3.000   4.000
> tree$C <- factor(tree$C)
> tree$C
 [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
[39] 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4
Levels: 1 2 3 4
> summary(tree$C)
 1  2  3  4
 8 23 10 13
> levels(tree$C)

  • [1] "1" "2" "3" "4"

Once a vector is converted into a set of factors R treats it in a different manner than when it is a set of numbers. A set of factors has a discrete set of possible values, and it does not make sense to try to find averages or other numerical descriptions. One thing that is important is the number of times that each factor appears, called their frequencies, which is printed using the summary command.
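A short sketch of this behavior, using made-up data rather than the tree data set: summary on a factor counts every declared level, even a level that never occurs:

```r
# factor() can declare levels explicitly; summary counts each one.
direction <- factor(c("up","down","up"), levels=c("up","down","flat"))
summary(direction)    # up: 2, down: 1, flat: 0
levels(direction)     # "up" "down" "flat"
```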

    2.1.4. Data Frames

    Another way that information is stored is in data frames. This is a way to take many vectors of different types and store them in the same variable. The vectors can be of all different types. For example, a data frame may contain many lists, and each list might be a list of factors, strings, or numbers.

    There are different ways to create and manipulate data frames. Most are beyond the scope of this introduction. They are only mentioned here to offer a more complete description. Please see the first chapter for more information on data frames.

    One example of how to create a data frame is given below:

> a <- c(1,2,3,4)
> b <- c(2,4,6,8)
> levels <- factor(c("A","B","A","B"))
> bubba <- data.frame(first=a, second=b, f=levels)
> bubba
  first second f
1     1      2 A
2     2      4 B
3     3      6 A
4     4      8 B
> summary(bubba)
     first          second    f
 Min.   :1.00   Min.   :2.0   A:2
 1st Qu.:1.75   1st Qu.:3.5   B:2
 Median :2.50   Median :5.0
 Mean   :2.50   Mean   :5.0
 3rd Qu.:3.25   3rd Qu.:6.5
 Max.   :4.00   Max.   :8.0
> bubba$first
[1] 1 2 3 4
> bubba$second
[1] 2 4 6 8
> bubba$f
[1] A B A B
Levels: A B

    2.1.5. Logical

    Another important data type is the logical type. There are two predefined variables, TRUE and FALSE:

> a = TRUE
> typeof(a)
[1] "logical"
> b = FALSE
> typeof(b)
[1] "logical"

    The standard logical operators can be used:

< less than

> greater than

<= less than or equal

>= greater than or equal

== equal to

!= not equal to

| entry wise or

|| or

! not

& entry wise and

&& and

xor(a,b) exclusive or

    Note that there is a difference between operators that act on entries within a vector and the whole vector:

> a = c(TRUE,FALSE)
> b = c(FALSE,FALSE)
> a|b
[1]  TRUE FALSE
> a||b
[1] TRUE
> xor(a,b)
[1]  TRUE FALSE

    There are a large number of functions that test to determine the type of a variable. For example the is.numeric function can determine if a variable is numeric:

> a = c(1,2,3)
> is.numeric(a)
[1] TRUE
> is.factor(a)
[1] FALSE

  • 2.2. Tables

    Another common way to store information is in a table. Here we look at how to define both one way and two way tables. We only look at how to create and define tables; the functions used in the analysis of proportions are examined in another chapter.

    2.2.1. One Way Tables

The first example is for a one way table. One way tables are not the most interesting example, but it is a good place to start. One way to create a table is using the table command. The argument it takes is a vector of factors, and it calculates the frequency with which each factor occurs. Here is an example of how to create a one way table:

> a <- factor(c("A","A","B","A","B","B","C","A","C"))
> results <- table(a)
> results
a
A B C
4 3 2
> attributes(results)
$dim
[1] 3
$dimnames
$dimnames$a
[1] "A" "B" "C"
$class
[1] "table"
> summary(results)
Number of cases in table: 9
Number of factors: 1

    If you know the number of occurrences for each factor then it is possible to create the table directly, but the process is, unfortunately, a bit more convoluted. There is an easier way to define one-way tables (a table with one row), but it does not extend easily to two-way tables (tables with more than one row). You must first create a matrix of numbers. A matrix is like a vector in that it is a list of numbers, but it is different in that you can have both rows and columns of numbers. For example, in our example above the number of occurrences of A is 4, the number of occurrences of B is 3, and the number of occurrences of C is 2. We will create one row of numbers. The first column contains a 4, the second column contains a 3, and the third column contains a 2:

> occur <- matrix(c(4,3,2), ncol=3, byrow=TRUE)
> occur
     [,1] [,2] [,3]
[1,]    4    3    2

At this point the variable occur is a matrix with one row and three columns of numbers. To dress it up and use it as a table we would like to give it labels for each column just like in the previous example. Once that is done we convert the matrix to a table using the as.table command:

> colnames(occur) <- c("A","B","C")
> occur
     A B C
[1,] 4 3 2
> occur <- as.table(occur)
> occur
  A B C
A 4 3 2
> attributes(occur)
$dim
[1] 1 3
$dimnames
$dimnames[[1]]
[1] "A"
$dimnames[[2]]
[1] "A" "B" "C"
$class
[1] "table"

    2.2.2. Two Way Tables

If you want to add rows to your table just add another vector to the argument of the table command. In the example below we have two questions. In the first question the responses are labeled Never, Sometimes, or Always. In the second question the responses are labeled Yes, No, or Maybe. The vectors a and b contain the responses for each measurement. The third item in a is how the third person responded to the first question, and the third item in b is how the third person responded to the second question.

> a <- c("Always","Always","Never","Never","Sometimes","Sometimes","Sometimes","Sometimes")
> b <- c("Maybe","Maybe","No","Yes","Maybe","Maybe","No","Yes")
> results <- table(a,b)
> results
           b
a           Maybe No Yes
  Always        2  0   0
  Never         0  1   1
  Sometimes     2  1   1

The table command allows us to do a very quick calculation, and we can immediately see that two people who said Sometimes to the first question also said Maybe to the second question.

Just as in the case with one-way tables it is possible to manually enter two way tables. The procedure is exactly the same as above except that we now have more than one row. We give a brief example below to demonstrate how to enter a two-way table that includes the breakdown of a group of people by both their gender and whether or not they smoke. You enter all of the data as one long list but tell R to break it up into some number of columns:

> sexsmoke <- matrix(c(70,120,65,140), ncol=2, byrow=TRUE)
> rownames(sexsmoke) <- c("male","female")
> colnames(sexsmoke) <- c("smoke","nosmoke")
> sexsmoke <- as.table(sexsmoke)
> sexsmoke
       smoke nosmoke
male      70     120
female    65     140

The matrix command creates a two by two matrix. The byrow=TRUE option indicates that the numbers are filled in across the rows first, and the ncol=2 option indicates that there are two columns.
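The effect of byrow can be sketched with made-up numbers (these values are for illustration only):

```r
# byrow controls fill order when building a matrix from a flat list.
m1 <- matrix(c(1,2,3,4,5,6), ncol=2, byrow=TRUE)   # rows filled first
m2 <- matrix(c(1,2,3,4,5,6), ncol=2, byrow=FALSE)  # columns filled first
m1[1,]   # first row is 1 2
m2[1,]   # first row is 1 4
```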

  • 3. Basic Operations and Numerical Descriptions

    Contents

    Basic Operations Basic Numerical Descriptions Operations on Vectors

    We look at some of the basic operations that you can perform on lists of numbers. It is assumed that you know how to enter data or read data files which is covered in the first chapter, and you know about the basic data types.

    3.1. Basic Operations

Once you have a vector (or a list of numbers) in memory most basic operations are available. Most of the basic operations will act on a whole vector and can be used to quickly perform a large number of calculations with a single command. There is one thing to note: if you perform an operation on more than one vector it is often necessary that the vectors all contain the same number of entries.

    Here we first define a vector which we will call a and will look at how to add and subtract constant numbers from all of the numbers in the vector. First, the vector will contain the numbers 1, 2, 3, and 4. We then see how to add 5 to each of the numbers, subtract 10 from each of the numbers, multiply each number by 4, and divide each number by 5.

> a <- c(1,2,3,4)
> a
[1] 1 2 3 4
> a + 5
[1] 6 7 8 9
> a - 10
[1] -9 -8 -7 -6
> a*4
[1]  4  8 12 16
> a/5
[1] 0.2 0.4 0.6 0.8

    We can save the results in another vector called b:

> b <- a - 10
> b
[1] -9 -8 -7 -6

    If you want to take the square root, find e raised to each number, the logarithm, etc., then the usual commands can be used:

> sqrt(a)
[1] 1.000000 1.414214 1.732051 2.000000

> exp(a)
[1]  2.718282  7.389056 20.085537 54.598150
> log(a)
[1] 0.0000000 0.6931472 1.0986123 1.3862944
> exp(log(a))
[1] 1 2 3 4

    By combining operations and using parentheses you can make more complicated expressions:

> c <- (a + sqrt(a))/(exp(2)+1)
> c
[1] 0.2384058 0.4069842 0.5640743 0.7152175

    Note that you can do the same operations with vector arguments. For example to add the elements in vector a to the elements in vector b use the following command:

> a + b
[1] -8 -6 -4 -2

    The operation is performed on an element by element basis. Note this is true for almost all of the basic functions. So you can bring together all kinds of complicated expressions:

> a*b
[1]  -9 -16 -21 -24
> a/b
[1] -0.1111111 -0.2500000 -0.4285714 -0.6666667
> (a+3)/(sqrt(1-b)*2-1)
[1] 0.7512364 1.0000000 1.2884234 1.6311303

    You need to be careful of one thing. When you do operations on vectors they are performed on an element by element basis. One ramification of this is that all of the vectors in an expression must be the same length. If the lengths of the vectors differ then you may get an error message, or worse, a warning message and unpredictable results:

> a <- c(1,2,3)
> b <- c(10,11,12,13)
> a+b
[1] 11 13 15 14
Warning message:
longer object length
        is not a multiple of shorter object length in: a + b
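One more wrinkle worth knowing: when the longer length is an exact multiple of the shorter one, R recycles the shorter vector silently and gives no warning at all. A small sketch with made-up numbers:

```r
# The shorter vector c(10,20) is recycled three times with no warning.
long  <- c(1,2,3,4,5,6)
short <- c(10,20)
long + short     # 11 22 13 24 15 26
```

This silent recycling is convenient but can hide mistakes, so it is worth checking vector lengths when results look odd.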

    As you work in R and create new vectors it can be easy to lose track of what variables you have defined. To get a list of all of the variables that have been defined use the ls() command:

> ls()
[1] "a"            "b"            "bubba"        "c"            "last.warning"
[6] "tree"         "trees"

    Finally, you should keep in mind that the basic operations almost always work on an element by element basis. There are rare exceptions to this general rule. For example, if you look at the minimum of two vectors using the min command you will get the minimum of all of the numbers. There is a special command, called pmin, that may be the command you want in some circumstances:

> a <- c(-1,-2,3,4)
> b <- c(1,2,-3,-4)
> min(a,b)
[1] -4
> pmin(a,b)
[1] -1 -2 -3 -4

    3.2. Basic Numerical Descriptions

    Given a vector of numbers there are some basic commands to make it easier to get some of the basic numerical descriptions of a set of numbers. Here we assume that you can read in the tree data that was discussed in a previous chapter. It is assumed that it is stored in a variable called tree:

> tree <- read.csv(file="trees91.csv", head=TRUE, sep=",")
> names(tree)
 [1] "C"      "N"      "CHBR"   "REP"    "LFBM"   "STBM"   "RTBM"   "LFNCC"
 [9] "STNCC"  "RTNCC"  "LFBCC"  "STBCC"  "RTBCC"  "LFCACC" "STCACC" "RTCACC"
[17] "LFKCC"  "STKCC"  "RTKCC"  "LFMGCC" "STMGCC" "RTMGCC" "LFPCC"  "STPCC"
[25] "RTPCC"  "LFSCC"  "STSCC"  "RTSCC"

    Each column in the data frame can be accessed as a vector. For example the numbers associated with the leaf biomass (LFBM) can be found using tree$LFBM:

> tree$LFBM
 [1] 0.430 0.400 0.450 0.820 0.520 1.320 0.900 1.180 0.480 0.210 0.270 0.310
[13] 0.650 0.180 0.520 0.300 0.580 0.480 0.580 0.580 0.410 0.480 1.760 1.210
[25] 1.180 0.830 1.220 0.770 1.020 0.130 0.680 0.610 0.700 0.820 0.760 0.770
[37] 1.690 1.480 0.740 1.240 1.120 0.750 0.390 0.870 0.410 0.560 0.550 0.670
[49] 1.260 0.965 0.840 0.970 1.070 1.220

    The following commands can be used to get the mean, median, quantiles, minimum, maximum, variance, and standard deviation of a set of numbers:

> mean(tree$LFBM)
[1] 0.7649074
> median(tree$LFBM)
[1] 0.72
> quantile(tree$LFBM)
    0%    25%    50%    75%   100%
0.1300 0.4800 0.7200 1.0075 1.7600
> min(tree$LFBM)
[1] 0.13
> max(tree$LFBM)
[1] 1.76
> var(tree$LFBM)
[1] 0.1429382
> sd(tree$LFBM)
[1] 0.3780717

    Finally, the summary command will print out the min, max, mean, median, and quantiles:

> summary(tree$LFBM)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.1300  0.4800  0.7200  0.7649  1.0080  1.7600

    The summary command is especially nice because if you give it a data frame it will print out the summary for every vector in the data frame:

> summary(tree)
       C               N          CHBR          REP             LFBM
 Min.   :1.000   Min.   :1.000   A1     : 3   Min.   : 1.00   Min.   :0.1300
 1st Qu.:2.000   1st Qu.:1.000   A4     : 3   1st Qu.: 9.00   1st Qu.:0.4800
 Median :2.000   Median :2.000   A6     : 3   Median :14.00   Median :0.7200
 Mean   :2.519   Mean   :1.926   B2     : 3   Mean   :13.05   Mean   :0.7649
 3rd Qu.:3.000   3rd Qu.:3.000   B3     : 3   3rd Qu.:20.00   3rd Qu.:1.0075
 Max.   :4.000   Max.   :3.000   B4     : 3   Max.   :20.00   Max.   :1.7600
                                 (Other):36   NA's   :11.00
      STBM             RTBM            LFNCC           STNCC
 Min.   :0.0300   Min.   :0.1200   Min.   :0.880   Min.   :0.3700
 1st Qu.:0.1900   1st Qu.:0.2825   1st Qu.:1.312   1st Qu.:0.6400
 Median :0.2450   Median :0.4450   Median :1.550   Median :0.7850
 Mean   :0.2883   Mean   :0.4662   Mean   :1.560   Mean   :0.7872
 3rd Qu.:0.3800   3rd Qu.:0.5500   3rd Qu.:1.788   3rd Qu.:0.9350
 Max.   :0.7200   Max.   :1.5100   Max.   :2.760   Max.   :1.2900
     RTNCC            LFBCC           STBCC           RTBCC
 Min.   :0.4700   Min.   :25.00   Min.   :14.00   Min.   :15.00
 1st Qu.:0.6000   1st Qu.:34.00   1st Qu.:17.00   1st Qu.:19.00
 Median :0.7500   Median :37.00   Median :18.00   Median :20.00
 Mean   :0.7394   Mean   :36.96   Mean   :18.80   Mean   :21.43
 3rd Qu.:0.8100   3rd Qu.:41.00   3rd Qu.:20.00   3rd Qu.:23.00
 Max.   :1.5500   Max.   :48.00   Max.   :27.00   Max.   :41.00
     LFCACC           STCACC           RTCACC           LFKCC
 Min.   :0.2100   Min.   :0.1300   Min.   :0.1100   Min.   :0.6500
 1st Qu.:0.2600   1st Qu.:0.1600   1st Qu.:0.1600   1st Qu.:0.8100
 Median :0.2900   Median :0.1700   Median :0.1650   Median :0.9000
 Mean   :0.2869   Mean   :0.1774   Mean   :0.1654   Mean   :0.9053
 3rd Qu.:0.3100   3rd Qu.:0.1875   3rd Qu.:0.1700   3rd Qu.:0.9900
 Max.   :0.3600   Max.   :0.2400   Max.   :0.2400   Max.   :1.1800
 NA's   :1.0000
     STKCC           RTKCC           LFMGCC           STMGCC
 Min.   :0.870   Min.   :0.330   Min.   :0.0700   Min.   :0.100
 1st Qu.:0.940   1st Qu.:0.400   1st Qu.:0.1000   1st Qu.:0.110
 Median :1.055   Median :0.475   Median :0.1200   Median :0.130
 Mean   :1.105   Mean   :0.473   Mean   :0.1109   Mean   :0.135
 3rd Qu.:1.210   3rd Qu.:0.520   3rd Qu.:0.1300   3rd Qu.:0.150
 Max.   :1.520   Max.   :0.640   Max.   :0.1400   Max.   :0.190
     RTMGCC            LFPCC            STPCC            RTPCC
 Min.   :0.04000   Min.   :0.1500   Min.   :0.1500   Min.   :0.1000
 1st Qu.:0.06000   1st Qu.:0.2000   1st Qu.:0.2200   1st Qu.:0.1300
 Median :0.07000   Median :0.2400   Median :0.2800   Median :0.1450
 Mean   :0.06648   Mean   :0.2381   Mean   :0.2707   Mean   :0.1465
 3rd Qu.:0.07000   3rd Qu.:0.2700   3rd Qu.:0.3175   3rd Qu.:0.1600
 Max.   :0.09000   Max.   :0.3100   Max.   :0.4100   Max.   :0.2100
     LFSCC            STSCC            RTSCC
 Min.   :0.0900   Min.   :0.1400   Min.   :0.0900
 1st Qu.:0.1325   1st Qu.:0.1600   1st Qu.:0.1200
 Median :0.1600   Median :0.1800   Median :0.1300
 Mean   :0.1661   Mean   :0.1817   Mean   :0.1298
 3rd Qu.:0.1875   3rd Qu.:0.2000   3rd Qu.:0.1475
 Max.   :0.2600   Max.   :0.2800   Max.   :0.1700

    3.3. Operations on Vectors

    Here we look at some commonly used commands that perform operations on lists. The commands include the sort, min, max, and sum commands. First, the sort command can sort the given vector in either ascending or descending order:

> a = c(2,4,6,3,1,5)
> b = sort(a)
> c = sort(a,decreasing = TRUE)
> a
[1] 2 4 6 3 1 5
> b
[1] 1 2 3 4 5 6
> c
[1] 6 5 4 3 2 1

    The min and the max commands find the minimum and the maximum numbers in the vector:

> min(a)
[1] 1
> max(a)
[1] 6

    Finally, the sum command adds up the numbers in the vector:

> sum(a)
[1] 21

  • 4. Basic Probability Distributions

    Contents

    The Normal Distribution The t Distribution The Binomial Distribution The Chi-Squared Distribution

We look at some of the basic operations associated with probability distributions. There are a large number of probability distributions available, but we only look at a few. If you would like to know what distributions are available you can do a search using the command help.search("distribution").

Here we give details about the commands associated with the normal distribution and briefly mention the commands for the other distributions. The functions for the different distributions are very similar; the differences are noted below.

    For this chapter it is assumed that you know how to enter data which is covered in the previous chapters.

    To get a full list of the distributions available in R you can use the following command:

    help(Distributions)

    For every distribution there are four commands. The commands for each distribution are prepended with a letter to indicate the functionality:

    d returns the height of the probability density function

p returns the cumulative distribution function

q returns the inverse cumulative distribution function (quantiles)

    r returns randomly generated numbers
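As a quick sketch of how the four prefixes fit together, using the standard normal distribution (the variable names below are made up): p and q are inverses of one another, and the empirical proportions of an r sample approximate p:

```r
# p and q undo each other: qnorm finds the value whose cumulative
# probability is 0.3, and pnorm maps it back.
pnorm(qnorm(0.3))            # recovers 0.3

# A random sample's empirical proportion below 1 approximates pnorm(1).
set.seed(42)                 # make the random draw reproducible
samples <- rnorm(10000)
mean(samples < 1)            # close to pnorm(1), about 0.84
```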

    4.1. The Normal Distribution

    There are four functions that can be used to generate the values associated with the normal distribution. You can get a full list of them and their options using the help command:

    > help(Normal)

The first function we look at is dnorm. Given a set of values it returns the height of the probability density function at each point. If you only give the points it assumes you want to use a mean of zero and a standard deviation of one. There are options to use different values for the mean and standard deviation, though:

    > dnorm(0)
    [1] 0.3989423
    > dnorm(0)*sqrt(2*pi)
    [1] 1
    > dnorm(0,mean=4)
    [1] 0.0001338302
    > dnorm(0,mean=4,sd=10)
    [1] 0.03682701
    > v <- c(0,1,2)
    > dnorm(v)
    [1] 0.39894228 0.24197072 0.05399097
    > x <- seq(-20,20,by=.1)
    > y <- dnorm(x)
    > plot(x,y)
    > y <- dnorm(x,mean=2.5,sd=0.1)
    > plot(x,y)

    The second function we examine is pnorm. Given a number or a list it computes the probability that a normally distributed random number will be less than that number. This function also goes by the rather ominous title of the Cumulative Distribution Function. It accepts the same options as dnorm:

    > pnorm(0)
    [1] 0.5
    > pnorm(1)
    [1] 0.8413447
    > pnorm(0,mean=2)
    [1] 0.02275013
    > pnorm(0,mean=2,sd=3)
    [1] 0.2524925
    > v <- c(0,1,2)
    > pnorm(v)
    [1] 0.5000000 0.8413447 0.9772499
    > x <- seq(-20,20,by=.1)
    > y <- pnorm(x)
    > plot(x,y)
    > y <- pnorm(x,mean=3,sd=4)
    > plot(x,y)

    If you wish to find the probability that a number is larger than the given number you can use the lower.tail option:

    > pnorm(0,lower.tail=FALSE)
    [1] 0.5
    > pnorm(1,lower.tail=FALSE)
    [1] 0.1586553
    > pnorm(0,mean=2,lower.tail=FALSE)
    [1] 0.9772499

    The next function we look at is qnorm which is the inverse of pnorm. The idea behind qnorm is that you give it a probability, and it returns the number whose cumulative distribution matches the probability. For example, if you have a normally distributed random variable with mean zero and standard deviation one, then if you give the function a probability it returns the associated Z-score:

    > qnorm(0.5)
    [1] 0
    > qnorm(0.5,mean=1)
    [1] 1
    > qnorm(0.5,mean=1,sd=2)
    [1] 1
    > qnorm(0.5,mean=2,sd=2)
    [1] 2
    > qnorm(0.5,mean=2,sd=4)
    [1] 2
    > qnorm(0.25,mean=2,sd=2)
    [1] 0.6510205
    > qnorm(0.333)
    [1] -0.4316442
    > qnorm(0.333,sd=3)
    [1] -1.294933
    > qnorm(0.75,mean=5,sd=2)
    [1] 6.34898
    > v = c(0.1,0.3,0.75)
    > qnorm(v)
    [1] -1.2815516 -0.5244005  0.6744898
    > x <- seq(0,1,by=.05)
    > y <- qnorm(x)
    > plot(x,y)
    > y <- qnorm(x,mean=3,sd=2)
    > plot(x,y)
    > y <- qnorm(x,mean=3,sd=0.1)
    > plot(x,y)

    The last function we examine is the rnorm function which can generate random numbers whose distribution is normal. The argument that you give it is the number of random numbers that you want, and it has optional arguments to specify the mean and standard deviation:

    > rnorm(4)
    [1]  1.2387271 -0.2323259 -1.2003081 -1.6718483
    > rnorm(4,mean=3)
    [1] 2.633080 3.617486 2.038861 2.601933
    > rnorm(4,mean=3,sd=3)
    [1] 4.580556 2.974903 4.756097 6.395894
    > rnorm(4,mean=3,sd=3)
    [1]  3.000852  3.714180 10.032021  3.295667
    > y <- rnorm(200)
    > hist(y)
    > y <- rnorm(200,mean=-2)
    > hist(y)
    > y <- rnorm(200,mean=-2,sd=4)
    > hist(y)
    > qqnorm(y)
    > qqline(y)

    4.2. The t Distribution

    There are four functions that can be used to generate the values associated with the t distribution. You can get a full list of them and their options using the help command:

    > help(TDist)

    These commands work just like the commands for the normal distribution. One difference is that the commands assume that the values are normalized to mean zero and standard

    deviation one, so you have to use a little algebra to use these functions in practice. The other difference is that you have to specify the number of degrees of freedom. The commands follow the same kind of naming convention, and the names of the commands are dt, pt, qt, and rt.
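As a sketch of the algebra involved, the following converts a sample into a t statistic by hand before calling pt; the data and the hypothesized mean mu0 here are made up purely for illustration:

```r
# Hypothetical data; pt assumes a standardized value, so the
# t statistic has to be formed manually from the sample.
x   <- c(2.1, 1.8, 2.5, 1.9, 2.2)
mu0 <- 1                                         # hypothesized mean (made up)
t   <- (mean(x) - mu0)/(sd(x)/sqrt(length(x)))   # standardize the sample mean
p   <- 2*pt(-abs(t), df=length(x)-1)             # two-sided tail probability
cat("t =", t, " p =", p, "\n")
```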

    A few examples are given below to show how to use the different commands. First we have the distribution function, dt:

    > x <- seq(-20,20,by=.5)
    > y <- dt(x,df=10)
    > plot(x,y)
    > y <- dt(x,df=50)
    > plot(x,y)

    Next we have the cumulative probability distribution function:

    > pt(-3,df=10)
    [1] 0.006671828
    > pt(3,df=10)
    [1] 0.9933282
    > 1-pt(3,df=10)
    [1] 0.006671828
    > pt(3,df=20)
    [1] 0.996462
    > x = c(-3,-4,-2,-1)
    > pt((mean(x)-2)/sd(x),df=20)
    [1] 0.001165548
    > pt((mean(x)-2)/sd(x),df=40)
    [1] 0.000603064

    Next we have the inverse cumulative probability distribution function:

    > qt(0.05,df=10)
    [1] -1.812461
    > qt(0.95,df=10)
    [1] 1.812461
    > qt(0.05,df=20)
    [1] -1.724718
    > qt(0.95,df=20)
    [1] 1.724718
    > v <- c(0.005,0.025,0.05)
    > qt(v,df=253)
    [1] -2.595401 -1.969385 -1.650899
    > qt(v,df=25)
    [1] -2.787436 -2.059539 -1.708141

    Finally random numbers can be generated according to the t distribution:

    > rt(3,df=10)
    [1] 0.9440930 2.1734365 0.6785262
    > rt(3,df=20)
    [1]  0.1043300 -1.4682198  0.0715013
    > rt(3,df=20)
    [1]  0.8023832 -0.4759780 -1.0546125

    4.3. The Binomial Distribution

    There are four functions that can be used to generate the values associated with the binomial distribution. You can get a full list of them and their options using the help command:

    > help(Binomial)

    These commands work just like the commands for the normal distribution. The binomial distribution requires two extra parameters, the number of trials and the probability of success for a single trial. The commands follow the same kind of naming convention, and the names of the commands are dbinom, pbinom, qbinom, and rbinom.
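The d and p versions are related in a simple way: the cumulative probability is the running total of the individual probabilities. A quick sanity check, using 50 trials with success probability 0.5:

```r
# pbinom(24,50,0.5) is the sum of dbinom(k,50,0.5) for k = 0,...,24
individual <- dbinom(0:24, size=50, prob=0.5)  # P(X = 0), ..., P(X = 24)
cumulative <- pbinom(24, size=50, prob=0.5)    # P(X <= 24)
cat(sum(individual), "versus", cumulative, "\n")
```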

    A few examples are given below to show how to use the different commands. First we have the distribution function, dbinom:

    > x <- seq(0,50,by=1)
    > y <- dbinom(x,50,0.2)
    > plot(x,y)
    > y <- dbinom(x,50,0.6)
    > plot(x,y)
    > x <- seq(0,100,by=1)
    > y <- dbinom(x,100,0.6)
    > plot(x,y)

    Next we have the cumulative probability distribution function:

    > pbinom(24,50,0.5)
    [1] 0.4438624
    > pbinom(25,50,0.5)
    [1] 0.5561376
    > pbinom(25,51,0.5)
    [1] 0.5
    > pbinom(26,51,0.5)
    [1] 0.610116
    > pbinom(25,50,0.5)
    [1] 0.5561376
    > pbinom(25,50,0.25)
    [1] 0.999962
    > pbinom(25,500,0.25)
    [1] 4.955658e-33

    Next we have the inverse cumulative probability distribution function:

    > qbinom(0.5,51,1/2)
    [1] 25
    > qbinom(0.25,51,1/2)
    [1] 23
    > pbinom(23,51,1/2)
    [1] 0.2879247
    > pbinom(22,51,1/2)
    [1] 0.200531

    Finally random numbers can be generated according to the binomial distribution:

    > rbinom(5,100,.2)
    [1] 30 23 21 19 18
    > rbinom(5,100,.7)
    [1] 66 66 58 68 63

    4.4. The Chi-Squared Distribution

    There are four functions that can be used to generate the values associated with the Chi-Squared distribution. You can get a full list of them and their options using the help command:

    > help(Chisquare)

    These commands work just like the commands for the normal distribution. The first difference is that it is assumed that you have normalized the value so no mean can be specified. The other difference is that you have to specify the number of degrees of freedom. The commands follow the same kind of naming convention, and the names of the commands are dchisq, pchisq, qchisq, and rchisq.
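In practice these functions are most often used to convert a test statistic into an upper-tail p-value, either by subtracting from one or with the lower.tail option; the statistic below is made up for illustration:

```r
# Upper-tail probability for a hypothetical chi-squared statistic
# with 10 degrees of freedom.
stat <- 18.3
p1 <- 1 - pchisq(stat, df=10)
p2 <- pchisq(stat, df=10, lower.tail=FALSE)  # same value, avoids cancellation
cat("p1 =", p1, " p2 =", p2, "\n")
```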

    A few examples are given below to show how to use the different commands. First we have the distribution function, dchisq:

    > x <- seq(-20,20,by=.5)
    > y <- dchisq(x,df=10)
    > plot(x,y)
    > y <- dchisq(x,df=12)
    > plot(x,y)

    Next we have the cumulative probability distribution function:

    > pchisq(2,df=10)
    [1] 0.003659847
    > pchisq(3,df=10)
    [1] 0.01857594
    > 1-pchisq(3,df=10)
    [1] 0.981424
    > pchisq(3,df=20)
    [1] 4.097501e-06
    > x = c(2,4,5,6)
    > pchisq(x,df=20)
    [1] 1.114255e-07 4.649808e-05 2.773521e-04 1.102488e-03

    Next we have the inverse cumulative probability distribution function:

    > qchisq(0.05,df=10)
    [1] 3.940299
    > qchisq(0.95,df=10)
    [1] 18.30704
    > qchisq(0.05,df=20)
    [1] 10.85081
    > qchisq(0.95,df=20)
    [1] 31.41043
    > v <- c(0.005,0.025,0.05)
    > qchisq(v,df=253)
    [1] 198.8161 210.8355 217.1713
    > qchisq(v,df=25)
    [1] 10.51965 13.11972 14.61141

    Finally random numbers can be generated according to the Chi-Squared distribution:

    > rchisq(3,df=10)
    [1] 16.80075 20.28412 12.39099
    > rchisq(3,df=20)
    [1] 17.838878  8.591936 17.486372
    > rchisq(3,df=20)
    [1] 11.19279 23.86907 24.81251

    5. Basic Plots

    Contents

    Strip Charts Histograms Boxplots Scatter Plots Normal QQ Plots

    We look at some of the ways R can display information graphically. This is an introduction to the basic plotting commands. It is assumed that you know how to enter data or read data files, which is covered in the first chapter, and it is assumed that you are familiar with the different data types.

    In each of the topics that follow it is assumed that two different data sets, w1.dat and trees91.csv have been read and defined using the same variables as in the first chapter. Both of these data sets come from the study discussed on the web site given in the first chapter. We assume that they are read using read.csv into variables w1 and tree:

    > w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
    > names(w1)
    [1] "vals"
    > tree <- read.csv(file="trees91.csv",sep=",",head=TRUE)
    > names(tree)
     [1] "C"      "N"      "CHBR"   "REP"    "LFBM"   "STBM"   "RTBM"   "LFNCC"
     [9] "STNCC"  "RTNCC"  "LFBCC"  "STBCC"  "RTBCC"  "LFCACC" "STCACC" "RTCACC"
    [17] "LFKCC"  "STKCC"  "RTKCC"  "LFMGCC" "STMGCC" "RTMGCC" "LFPCC"  "STPCC"
    [25] "RTPCC"  "LFSCC"  "STSCC"  "RTSCC"

    5.1. Strip Charts

    A strip chart is the most basic type of plot available. It plots the data in order along a line with each data point represented as a box. Here we provide examples using the w1 data frame mentioned at the top of this page, and the one column of the data is w1$vals.

    To create a strip chart of this data use the stripchart command:

    > help(stripchart)
    > stripchart(w1$vals)

    Strip Chart

    This is the most basic possible strip chart. The stripchart() command takes many of the standard plot() options for labeling and annotation.

    As you can see this is about as bare bones as you can get. There is no title and there are no axis labels. It only shows how the data looks if you were to put it all along one line and mark out a box at each point. If you would prefer to see which points are repeated you can specify that repeated points be stacked:

    > stripchart(w1$vals,method="stack")

    A variation on this is to have the boxes moved up and down so that there is more separation between them:

    > stripchart(w1$vals,method="jitter")

    If you do not want the boxes plotting in the horizontal direction you can plot them in the vertical direction:

    > stripchart(w1$vals,vertical=TRUE)
    > stripchart(w1$vals,vertical=TRUE,method="jitter")

    Since you should always annotate your plots, there are many different ways to add titles and labels. One way is within the stripchart command itself:

    > stripchart(w1$vals,method="stack", main='Leaf BioMass in High CO2 Environment', xlab='BioMass of Leaves')

    If you have a plot already and want to add a title, you can use the title command:

    > title('Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves')

    Note that this simply adds the title and labels and will write over the top of any titles or labels you already have.

    5.2. Histograms

    A histogram is a very common plot. It plots the frequencies with which the data falls within certain ranges. Here we provide examples using the w1 data frame mentioned at the top of this page, and the one column of data is w1$vals.

    To plot a histogram of the data use the hist command:

    > hist(w1$vals)
    > hist(w1$vals,main="Distribution of w1",xlab="w1")

    Histogram Options

    Many of the basic plot commands accept the same options. The help(hist) command will give you the options specific to the hist command, and help(plot) lists additional options shared by many plotting commands. Experiment with different options to see what you can do.

    As you can see R will automatically calculate the intervals to use. There are many options to determine how to break up the intervals. Here we look at just one way, varying the domain size and number of breaks. If you would like to know more about the other options check out the help page:

    > help(hist)

    You can specify the number of breaks to use using the breaks option. Here we look at the histogram for various numbers of breaks:

    > hist(w1$vals,breaks=2)
    > hist(w1$vals,breaks=4)
    > hist(w1$vals,breaks=6)
    > hist(w1$vals,breaks=8)
    > hist(w1$vals,breaks=12)

    You can also vary the size of the domain using the xlim option. This option takes a vector with two entries in it, the left value and the right value:

    > hist(w1$vals,breaks=12,xlim=c(0,10))
    > hist(w1$vals,breaks=12,xlim=c(-1,2))
    > hist(w1$vals,breaks=12,xlim=c(0,2))
    > hist(w1$vals,breaks=12,xlim=c(1,1.3))
    > hist(w1$vals,breaks=12,xlim=c(0.9,1.3))

    The options for adding titles and labels are exactly the same as for strip charts. You should always annotate your plots and there are many different ways to add titles and labels. One way is within the hist command itself:

    > hist(w1$vals, main='Leaf BioMass in High CO2 Environment', xlab='BioMass of Leaves')

    If you have a plot already and want to change or add a title, you can use the title command:

    > title('Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves')

    Note that this simply adds the title and labels and will write over the top of any titles or labels you already have.

    It is not uncommon to add other kinds of plots to a histogram. For example, one of the options to the stripchart command is to add it to a plot that has already been drawn. For example, you might want to have a histogram with the strip chart drawn across the top. The addition of the strip chart might give you a better idea of the density of the data:

    > hist(w1$vals,main='Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves',ylim=c(0,16))
    > stripchart(w1$vals,add=TRUE,at=15.5)

    5.3. Boxplots

    A boxplot provides a graphical view of the median, quartiles, maximum, and minimum of a data set. Here we provide examples using two different data sets. The first is the w1 data frame mentioned at the top of this page, and the one column of data is w1$vals. The second is the tree data frame from the trees91.csv data file which is also mentioned at the top of the page.

    We first use the w1 data set and look at the boxplot of this data set:

    > boxplot(w1$vals)

    Again, this is a very plain graph, and the title and labels can be specified in exactly the same way as in the stripchart and hist commands:

    > boxplot(w1$vals, main='Leaf BioMass in High CO2 Environment', ylab='BioMass of Leaves')

    Note that the default orientation is to plot the boxplot vertically. Because of this we used the ylab option to specify the axis label. There are a large number of options for this command. To see more of the options see the help page:

    > help(boxplot)

    As an example you can specify that the boxplot be plotted horizontally by specifying the horizontal option:

    > boxplot(w1$vals, main='Leaf BioMass in High CO2 Environment', xlab='BioMass of Leaves', horizontal=TRUE)

    The option to plot the box plot horizontally can be put to good use to display a box plot on the same image as a histogram. You need to specify the add option, specify where to put the box plot using the at option, and turn off the addition of axes using the axes option:

    > hist(w1$vals,main='Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves',ylim=c(0,16))
    > boxplot(w1$vals,horizontal=TRUE,at=15.5,add=TRUE,axes=FALSE)

    If you are feeling really crazy you can take a histogram and add a box plot and a strip chart:

    > hist(w1$vals,main='Leaf BioMass in High CO2 Environment',xlab='BioMass of Leaves',ylim=c(0,16))
    > boxplot(w1$vals,horizontal=TRUE,at=16,add=TRUE,axes=FALSE)
    > stripchart(w1$vals,add=TRUE,at=15)

    Some people shell out good money to have this much fun.

    For the second part on boxplots we will look at the second data frame, tree, which comes from the trees91.csv file. To reiterate the discussion at the top of this page and the discussion in the data types chapter, we need to specify which columns are factors:

    > tree <- read.csv(file="trees91.csv",sep=",",head=TRUE)
    > tree$C <- factor(tree$C)
    > tree$N <- factor(tree$N)
    > boxplot(tree$STBM,
              main='Stem BioMass in Different CO2 Environments',
              ylab='BioMass of Stems')

    That plot does not tell the whole story. It is for all of the trees, but the trees were grown in different kinds of environments. The boxplot command can be used to plot a separate box plot for each level. In this case the data is held in tree$STBM, and the different levels are stored as factors in tree$C. The command to create different boxplots is the following:

    boxplot(tree$STBM~tree$C)

    Note that for the level called 2 there are four outliers which are plotted as little circles. There are many options to annotate your plot including different labels for each level. Please use the help(boxplot) command for more information.

    5.4. Scatter Plots

    A scatter plot provides a graphical view of the relationship between two sets of numbers. Here we provide examples using the tree data frame from the trees91.csv data file which is mentioned at the top of the page. In particular we look at the relationship between the stem biomass (tree$STBM) and the leaf biomass (tree$LFBM).

    The command to plot each pair of points as an x-coordinate and a y-coordinate is plot:

    > plot(tree$STBM,tree$LFBM)

    It appears that there is a strong positive association between the biomass in the stems of a tree and the leaves of the tree. It appears to be a linear relationship. In fact, the correlation between these two sets of observations is quite high:

    > cor(tree$STBM,tree$LFBM)
    [1] 0.911595

    Getting back to the plot, you should always annotate your graphs. The title and labels can be specified in exactly the same way as with the other plotting commands:

    > plot(tree$STBM,tree$LFBM, main="Relationship Between Stem and Leaf Biomass", xlab="Stem Biomass", ylab="Leaf Biomass")

    5.5. Normal QQ Plots

    The final type of plot that we look at is the normal quantile plot. This plot is used to determine whether your data is close to being normally distributed. The plot cannot prove that the data is normally distributed, but it can help you rule out normality when the points clearly depart from the reference line. Here we provide examples using the w1 data frame mentioned at the top of this page, and the one column of data is w1$vals.

    The command to generate a normal quantile plot is qqnorm. You can give it one argument, the univariate data set of interest:

    > qqnorm(w1$vals)

    You can annotate the plot in exactly the same way as all of the other plotting commands given here:

    > qqnorm(w1$vals, main="Normal Q-Q Plot of the Leaf Biomass", xlab="Theoretical Quantiles of the Leaf Biomass", ylab="Sample Quantiles of the Leaf Biomass")

    After you create the normal quantile plot you can also add the theoretical line that the data should fall on if they were normally distributed:

    > qqline(w1$vals)

    In this example you should see that the data is not quite normally distributed. There are a few outliers, and it does not match up at the tails of the distribution.
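For contrast, you can apply the same two commands to data that is clearly not normal; the exponential sample below is just an arbitrary skewed example:

```r
# Quantile plot of skewed data: the points bend away from the qqline,
# most visibly in the right tail.
skewed <- rexp(100)   # arbitrary skewed sample for comparison
qqnorm(skewed, main="Normal Q-Q Plot of Exponential Data")
qqline(skewed)
```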

    6. Intermediate Plotting

    Contents

    Continuous Data Discrete Data Miscellaneous Options

    We look at some more options for plotting, and we assume that you are familiar with the basic plotting commands (Basic Plots). A variety of subjects is covered, ranging from plotting options to the formatting of plots.

    In many of the examples below we use some of R's commands to generate random numbers according to various distributions. The chapter is divided into three sections. The focus of the first section is on graphing continuous data. The focus of the second section is on graphing discrete data. The third section offers some miscellaneous options that are useful in a variety of contexts.

    6.1. Continuous Data

    Contents

    Multiple Data Sets on One Plot Error Bars Adding Noise (jitter) Multiple Graphs on One Image Density Plots Pairwise Relationships Shaded Regions Plotting a Surface

    In the examples below a data set is defined using R's normally distributed random number generator.

    > x <- rnorm(10,sd=5,mean=20)
    > y <- 2.5*x + rnorm(10,sd=9,mean=0)
    > cor(x,y)
    [1] 0.7400576

    6.1.1. Multiple Data Sets on One Plot

    One common task is to plot multiple data sets on the same plot. In many situations the way to do this is to create the initial plot and then add additional information to the plot. For example, to plot bivariate data the plot command is used to initialize and create the plot. The points command can then be used to add additional data sets to the plot.

    First define a set of normally distributed random numbers and then plot them. (This same data set is used throughout the examples below.)

    > x <- rnorm(10,sd=5,mean=20)
    > y <- 2.5*x + rnorm(10,sd=9,mean=0)
    > cor(x,y)
    [1] 0.7400576
    > plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff")
    > x1 <- runif(8,15,25)
    > y1 <- 2.5*x1 + rnorm(8,sd=6)
    > points(x1,y1,col=2)

    Note that in the previous example, the colour for the second set of data points is set using the col option. You can try different numbers to see what colours are available. For most installations there are at least eight options from 1 to 8. Also note that in the example above the points are plotted as circles. The symbol that is used can be changed using the pch option.

    > x2 <- runif(8,15,25)
    > y2 <- 2.5*x2 + rnorm(8,sd=6)
    > points(x2,y2,col=3,pch=2)

    Again, try different numbers to see the various options. Another helpful option is to add a legend. This can be done with the legend command. The options for the command, in order, are the x and y coordinates on the plot to place the legend followed by a list of labels to use. There are a large number of other options so use help(legend) to see more options. For example a list of colors can be given with the col option, and a list of symbols can be given with the pch option.

    > plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff")
    > points(x1,y1,col=2,pch=3)
    > points(x2,y2,col=4,pch=5)
    > legend(14,70,c("Original","one","two"),col=c(1,2,4),pch=c(1,3,5))

    Figure 1.

    The three data sets displayed on the same graph.

    Another common task is to change the limits of the axes to change the size of the plotting area. This is achieved using the xlim and ylim options in the plot command. Both options take a vector of length two that have the minimum and maximum values.

    > plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff",xlim=c(0,30),ylim=c(0,100))
    > points(x1,y1,col=2,pch=3)
    > points(x2,y2,col=4,pch=5)
    > legend(14,70,c("Original","one","two"),col=c(1,2,4),pch=c(1,3,5))

    6.1.2. Error Bars

    Another common task is to add error bars to a set of data points. This can be accomplished using the arrows command. The arrows command takes two pairs of coordinates, that is two pairs of x and y values. The command then draws a line between each pair and adds an arrow head with a given length and angle.

    > plot(x,y,xlab="Independent",ylab="Dependent",main="Random Stuff")
    > xHigh <- x
    > yHigh <- y + abs(rnorm(10,sd=3.5))
    > xLow <- x
    > yLow <- y - abs(rnorm(10,sd=3.1))
    > arrows(xHigh,yHigh,xLow,yLow,col=2,angle=90,length=0.1,code=3)

    Figure 2.

    A data set with error bars added.

    Note that the option code is used to specify where the bars are drawn. Its value can be 1, 2, or 3. If code is 1 the bars are drawn at pairs given in the first argument. If code is 2 the bars are drawn at the pairs given in the second argument. If code is 3 the bars are drawn at both.

    6.1.3. Adding Noise (jitter)

    In the previous example a little bit of noise was added to the pairs to produce an artificial offset. This is a common thing to do for making plots. A simpler way to accomplish this is to use the jitter command.

    > numberWhite <- rhyper(30,4,5,3)
    > numberChipped <- rhyper(30,2,7,3)
    > par(mfrow=c(1,2))
    > plot(numberWhite,numberChipped,xlab="Number White Marbles Drawn",
           ylab="Number Chipped Marbles Drawn",main="Pulling Marbles")
    > plot(jitter(numberWhite),jitter(numberChipped),xlab="Number White Marbles Drawn",
           ylab="Number Chipped Marbles Drawn",main="Pulling Marbles With Jitter")

    Figure 3.

    Points with noise added using the jitter command.

    6.1.4. Multiple Graphs on One Image

    Note that a new command was used in the previous example. The par command can be used to set different parameters. In the example above the mfrow was set. The plots are arranged in an array where the default number of rows and columns is one. The mfrow parameter is a vector with two entries. The first entry is the number of rows of images. The second entry is the number of columns. In the example above the plots were arranged in one row with two plots across.

    > par(mfrow=c(2,3))
    > boxplot(numberWhite,main="first plot")
    > boxplot(numberChipped,main="second plot")
    > plot(jitter(numberWhite),jitter(numberChipped),xlab="Number White Marbles Drawn",
           ylab="Number Chipped Marbles Drawn",main="Pulling Marbles With Jitter")
    > hist(numberWhite,main="fourth plot")
    > hist(numberChipped,main="fifth plot")
    > mosaicplot(table(numberWhite,numberChipped),main="sixth plot")

    Figure 4.

    An array of plots using the par command.

    6.1.5. Density Plots

    There are times when you do not want to plot specific points but wish to plot a density. This can be done using the smoothScatter command.

    > numberWhite <- rhyper(400,4,5,3)
    > numberChipped <- rhyper(400,2,7,3)
    > smoothScatter(numberWhite,numberChipped,
                    xlab="White Marbles",ylab="Chipped Marbles",main="Drawing Marbles")

    Figure 5.

    The smoothScatter command can be used to plot densities.

    Note that the previous example may benefit by superimposing a grid to help delimit the points of interest. This can be done using the grid command.

    > numberWhite <- rhyper(400,4,5,3)
    > numberChipped <- rhyper(400,2,7,3)
    > smoothScatter(numberWhite,numberChipped,
                    xlab="White Marbles",ylab="Chipped Marbles",main="Drawing Marbles")
    > grid(4,3)

    6.1.6. Pairwise Relationships

    There are times that you want to explore a large number of relationships. A number of relationships can be plotted at one time using the pairs command. The idea is that you give it a matrix or a data frame, and the command will create a scatter plot of all combinations of the data.

    > uData <- rnorm(20)
    > vData <- rnorm(20,mean=5)
    > wData <- uData + 2*vData + rnorm(20,sd=0.5)
    > xData <- -2*uData + rnorm(20,sd=0.1)
    > yData <- 3*vData + rnorm(20,sd=2.5)
    > d <- data.frame(u=uData,v=vData,w=wData,x=xData,y=yData)
    > pairs(d)

    Figure 5.

    Using pairs to produce all permutations of a set of relationships on one graph.

    6.1.7. Shaded Regions

    A shaded region can be plotted using the polygon command. The polygon command takes a pair of vectors, x and y, and shades the region enclosed by the coordinate pairs. In the example below a blue square is drawn. The vertices are defined starting from the lower left. Five pairs of points are given because the starting point and the ending point are the same.

    > x = c(-1,1,1,-1,-1)
    > y = c(-1,-1,1,1,-1)
    > plot(x,y)
    > polygon(x,y,col='blue')

    A more complicated example is given below. In this example the rejection region for a right-sided hypothesis test is plotted, and it is shaded in red. A set of custom axes is constructed, and symbols are plotted using the expression command.

    > stdDev <- 0.75
    > x <- seq(-5,5,by=0.01)
    > y <- dnorm(x,sd=stdDev)
    > right <- qnorm(0.95,sd=stdDev)
    > plot(x,y,type="l",xaxt="n",ylab="p",
           xlab=expression(paste('Assumed Distribution of ',bar(x))),
           axes=FALSE,ylim=c(0,max(y)*1.05),xlim=c(min(x),max(x)),
           frame.plot=FALSE)
    > axis(1,at=c(-5,right,0,5),pos=c(0,0),
           labels=c(expression(' '),expression(bar(x)[cr]),expression(mu[0]),expression(' ')))
    > axis(2)
    > xReject <- seq(right,5,by=0.01)
    > yReject <- dnorm(xReject,sd=stdDev)
    > polygon(c(xReject,xReject[length(xReject)],xReject[1]),
              c(yReject,0,0),col='red')

    Figure 6.

    Using polygon to produce a shaded region.

    The axes are drawn separately. This is done by suppressing the axes in the plot command and then drawing the horizontal axis with a separate call to the axis command. Also note that the expression command is used to plot a Greek character and also produce subscripts.

    6.1.8. Plotting a Surface

    Finally, a brief example of how to plot a surface is given. The persp command will plot a surface with a specified perspective. In the example, a grid is defined by multiplying a row and column vector to give the x and then the y values for a grid. Once that is done a sine function is specified on the grid, and the persp command is used to plot it.

    > x <- seq(0,2*pi,by=pi/100)
    > y <- x
    > xg <- (x*0+1) %*% t(y)
    > yg <- x %*% t(y*0+1)
    > f <- sin(xg+yg)
    > persp(x,y,f,theta=-10,phi=40)

    The %*% notation is used to perform matrix multiplication.
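As a standalone illustration of the operator (the vector here is arbitrary), a column vector times a transposed vector yields a matrix of all pairwise products:

```r
# %*% performs matrix multiplication; a plain vector on the left of %*%
# against a row vector on the right is treated as a column vector.
a <- c(1, 2, 3)
g <- a %*% t(a)   # 3x3 matrix with g[i,j] = a[i]*a[j]
print(g)
```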

    6.2. Discrete Data

    Contents

    Barplot Mosaic Plot

    In the examples below a data set is defined using R's hypergeometric random number generator.

    > numberWhite <- rhyper(30,4,5,3)
    > numberChipped <- rhyper(30,2,7,3)
    > numberWhite <- as.factor(numberWhite)
    > plot(numberWhite)

    In this case R will produce a barplot. The barplot command can also be used to create a barplot. The barplot command requires a vector of heights, though, and you cannot simply give it the raw data. The frequencies for the barplot command can be easily calculated using the table command.

    > numberWhite <- rhyper(30,4,5,3)
    > totals <- table(numberWhite)
    > totals
    numberWhite
     0  1  2  3
     4 13 11  2
    > barplot(totals,main="Number Draws",ylab="Frequency",xlab="Draws")

    In the previous example the barplot command is used to set the title for the plot and the labels for the axes. The labels on the ticks for the horizontal axis are automatically generated using the labels on the table. You can change the labels by setting the row names of the table.

    > rownames(totals) <- c("none","one","two","three")
    > totals
    numberWhite
     none   one   two three
        4    13    11     2
    > barplot(totals,main="Number Draws",ylab="Frequency",xlab="Draws")

    The order of the frequencies is the same as the order in the table. If you change the order in the table it will change the way it appears in the barplot. For example, if you wish to arrange the frequencies in descending order you can use the sort command with the decreasing option set to TRUE.

    > barplot(sort(totals,decreasing=TRUE),main="Number Draws",ylab="Frequency",xlab="Draws")

    The indexing features of R can be used to change the order of the frequencies manually.

    > totals
    numberWhite
     none   one   two three
        4    13    11     2
    > sort(totals,decreasing=TRUE)
    numberWhite
      one   two  none three
       13    11     4     2
    > totals[c(3,1,4,2)]
    numberWhite
      two  none three   one
       11     4     2    13
    > barplot(totals[c(3,1,4,2)])

    The barplot command returns the horizontal locations of the bars. Using the locations and putting together the previous ideas a Pareto Chart can be constructed.

    > xLoc = barplot(sort(totals,decreasing=TRUE),main="Number Draws",
            ylab="Frequency",xlab="Draws",ylim=c(0,sum(totals)+2))
    > points(xLoc,cumsum(sort(totals,decreasing=TRUE)),type='p',col=2)
    > points(xLoc,cumsum(sort(totals,decreasing=TRUE)),type='l')

    6.2.2. Mosaic Plot

    Mosaic plots are used to display proportions for tables that are divided into two or more conditional distributions. Here we focus on two way tables to keep things simpler. It is assumed that you are familiar with using tables in R (see the section on two way tables for more information: Two Way Tables).

    Here we will use a made up data set, primarily to make it easier to figure out what R is doing. The fictitious data set is defined below. The idea is that sixteen children of age eight are interviewed. They are asked two questions. The first question is, "Do you believe in Santa Claus?" If they say that they do then the term belief is recorded, otherwise the term no belief is recorded. The second question is whether they have an older brother, an older sister, or no older sibling. (We are keeping it simple here!) The answers that are recorded are older brother, older sister, or no older sibling.

    > santa <- data.frame(belief=c('no belief','no belief','no belief','no belief',
    +         'belief','belief','belief','belief',
    +         'belief','belief','no belief','no belief',
    +         'belief','belief','no belief','no belief'),
    +     sibling=c('older brother','older brother','older brother','older sister',
    +         'no older sibling','no older sibling','no older sibling','older sister',
    +         'older brother','older sister','older brother','older sister',
    +         'no older sibling','older sister','older brother','no older sibling'))
    > santa
          belief          sibling
    1  no belief    older brother
    2  no belief    older brother
    3  no belief    older brother
    4  no belief     older sister
    5     belief no older sibling
    6     belief no older sibling
    7     belief no older sibling
    8     belief     older sister
    9     belief    older brother
    10    belief     older sister
    11 no belief    older brother
    12 no belief     older sister
    13    belief no older sibling
    14    belief     older sister
    15 no belief    older brother
    16 no belief no older sibling
    > summary(santa)
          belief              sibling
     belief   :8   no older sibling:5
     no belief:8   older brother   :6
                   older sister    :5

    The data is given as strings, so R will automatically treat them as categorical data, and the data types are factors. If you plot the individual data sets, the plot command will default to producing barplots.

    > plot(santa$belief)
    > plot(santa$sibling)

    If you provide both data sets it will automatically produce a mosaic plot which demonstrates the relative frequencies in terms of the resulting areas.

    > plot(santa$sibling,santa$belief)
    > plot(santa$belief,santa$sibling)

    The mosaicplot command can also be called directly:

    > totals = table(santa$belief,santa$sibling)
    > totals
                no older sibling older brother older sister
      belief                   4             1            3
      no belief                1             5            2
    > mosaicplot(totals,main="Older Brothers are Jerks",
            xlab="Belief in Santa Claus",ylab="Older Sibling")

    The colours of the plot can be specified by setting the col argument. The argument is a vector of colours used for the rows. See Figure 7 for an example.

    > mosaicplot(totals,main="Older Brothers are Jerks", xlab="Belief in Santa Claus",ylab="Older Sibling", col=c(2,3,4))

    Figure 7.

    Example of a mosaic plot with colours.

    The labels and the order that they appear in the plot can be changed in exactly the same way as given in the examples for barplot above.

    > rownames(totals)
    [1] "belief"    "no belief"
    > colnames(totals)
    [1] "no older sibling" "older brother"    "older sister"
    > rownames(totals) <- c("Believes","Does not Believe")
    > colnames(totals) <- c("No Older","Older Brother","Older Sister")
    > totals
                     No Older Older Brother Older Sister
    Believes                4             1            3
    Does not Believe        1             5            2
    > mosaicplot(totals,main="Older Brothers are Jerks",
                 xlab="Belief in Santa Claus",ylab="Older Sibling")

    When changing the order keep in mind that the table is a two dimensional array. The indices must include both rows and columns, and the transpose command (t) can be used to switch how it is plotted with respect to the vertical and horizontal axes.

    > totals
                     No Older Older Brother Older Sister
    Believes                4             1            3
    Does not Believe        1             5            2
    > totals[c(2,1),c(2,3,1)]
                     Older Brother Older Sister No Older
    Does not Believe             5            2        1
    Believes                     1            3        4
    > mosaicplot(totals[c(2,1),c(2,3,1)],main="Older Brothers are Jerks",
                 xlab="Belief in Santa Claus",ylab="Older Sibling",col=c(2,3,4))
    > mosaicplot(t(totals),main="Older Brothers are Jerks",
                 ylab="Belief in Santa Claus",xlab="Older Sibling",col=c(2,3))

    6.3. Miscellaneous Options

    Contents

    Multiple Representations On One Plot Multiple Windows Print To A File Annotation and Formatting

    The previous examples only provide a slight hint at what is possible. Here we give some examples that provide a demonstration of the way the different commands can be combined and the options that allow them to be used together.

    6.3.1. Multiple Representations On One Plot

    First, an example of a histogram with an approximation of the density function is given. In addition to the density function a horizontal boxplot is added to the plot with a rug representation of the data on the horizontal axis. The horizontal bounds on the histogram will be specified. The boxplot must be added to the histogram, and it will be raised above the histogram.

    > x = rexp(20,rate=4)
    > hist(x,ylim=c(0,18),main="This Are An Histogram",xlab="X")
    > boxplot(x,at=16,horizontal=TRUE,add=TRUE)
    > rug(x,side=1)
    > d = density(x)
    > points(d,type='l',col=3)

    6.3.2. Multiple Windows

    The dev commands allow you to create and manipulate multiple graphics windows. You can create new windows using the dev.new() command, and you can choose which one to make active using the dev.set() command. The dev.list(), dev.next(), and dev.prev() commands can be used to list the graphical devices that are available and to step through them.

    In the following example three devices are created. They are listed, and different plots are created on the different devices.

    > dev.new()
    > dev.new()
    > dev.new()
    > dev.list()
    X11cairo X11cairo X11cairo
           2        3        4
    > dev.set(3)
    X11cairo
           3
    > x = rnorm(20)
    > hist(x)
    > dev.set(2)
    X11cairo
           2
    > boxplot(x)
    > dev.set(4)
    X11cairo
           4
    > qqnorm(x)
    > qqline(x)
    > dev.next()
    X11cairo
           2
    > dev.set(dev.next())
    X11cairo
           2
    > plot(density(x))

    6.3.3. Print To A File

    There are a couple ways to print a plot to a file. It is important to be able to work with graphics devices as shown in the previous subsection (Multiple Windows). The first way explored is to use the dev.print command. This command will print a copy of the currently active device, and the format is defined by the device argument.

    In the example below, the current window is printed to a png file called hist.png that is 200 pixels wide.

    > x = rnorm(100)
    > hist(x)
    > dev.print(device=png,width=200,"hist.png")

    To find out what devices are available on your system use the help command.

    > help(device)

    Another way to print to a file is to create a device in the same way as the graphical devices were created in the previous section. Once the device is created, the various plot commands are given, and then the device is turned off to write the results to a file.

    > png(file="hist.png")
    > hist(x)
    > rug(x,side=1)
    > dev.off()
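    The same open-plot-close pattern works for the other file devices. A brief sketch writing to PDF instead (the filename here is our own choice):

```r
# Open a pdf device, plot into it, and close it to flush the file to disk.
pdf(file="hist.pdf")
x <- rnorm(100)
hist(x)
rug(x, side=1)
dev.off()
```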

    6.3.4. Annotation and Formatting

    Basic annotation can be performed with the regular plotting commands. For example, there are options to specify labels on axes as well as titles. More options are available using the axis command.

    Most of the primary plotting commands have an option to turn off the generation of the axes using the axes=FALSE option. The axes can then be added using the axis command, which allows for a greater number of options.

    In the example below a bivariate set of random numbers is generated and plotted as a scatter plot. The axes are added, but the horizontal axis is located in the center of the data rather than at the bottom of the figure. Note that the horizontal and vertical axes are added separately and are specified using the first argument to the command. (Use help(axis) for a full list of options.)

    > x <- rnorm(10,mean=0,sd=4)
    > y <- 3*x + rnorm(10,mean=0,sd=2)
    > summary(y)
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    -17.9800  -9.0060   0.7057  -1.2060   8.2600  10.9200
    > plot(x,y,axes=FALSE,col=2)
    > axis(1,pos=c(0,0),at=seq(-7,5,by=1))
    > axis(2,pos=c(0,0),at=seq(-18,11,by=2))

    In the previous example the at option is used to specify the tick marks.

    When using the plot command the default behavior is to draw an axis as well as draw a box around the plotting area. The drawing of the box can be suppressed using the bty option. The value can be o, l, 7, c, u, ], or n. (The lines drawn roughly look like the letter given except for n which draws no lines.) The box can be drawn later using the box command as well.

    > x <- rnorm(10,mean=0,sd=4)
    > y <- 3*x + rnorm(10,mean=0,sd=2)
    > plot(x,y,bty="7")
    > plot(x,y,bty="n")
    > box(lty=3)

    The par command can be used to set the default values for various parameters. A couple are given below. In the example below the default background is set to grey, no box will be drawn around the window, and the margins for the axes will be twice the normal size.

    > par(bty="l")
    > par(bg="gray")
    > par(mex=2)
    > x <- rnorm(10)
    > y <- rnorm(10)
    > plot(x,y)

    Another common task is to place a text string on the plot. The text command takes a coordinate and a label, and it places the label at the given coordinate. The text command has options for setting the offset, size, font, and other options. In the example below the label numbers! is placed on the plot. Use help(text) to see more options.

    > x <- rnorm(10)
    > y <- rnorm(10)
    > plot(x,y)
    > text(-1,-2,"numbers!")

    The default text command will cut off any characters outside of the plot area. This behavior can be overridden using the xpd option.

    > x <- rnorm(10)
    > y <- rnorm(10)
    > plot(x,y)
    > text(-7,-2,"outside the area",xpd=TRUE)

    7. Indexing Into Vectors

    Contents

    Indexing With Logicals Not Available or Missing Values Indices With Logical Expression

    Given a vector of data one common task is to isolate particular entries or censor items that meet some criteria. Here we show how to use R's indexing notation to pick out specific items within a vector.

    7.1. Indexing With Logicals

    We first give an example of how to select specific items in a vector. The first step is to define a vector of data, and the second step is to define a vector made up of logical values. When the vector of logical values is used as the index into the vector of data values, only the items corresponding to TRUE entries are returned:

    > a <- c(1,2,3,4,5)
    > b <- c(TRUE,FALSE,FALSE,TRUE,FALSE)
    > a[b]
    [1] 1 4
    > max(a[b])
    [1] 4
    > sum(a[b])
    [1] 5
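    A related command is which, which returns the positions of the TRUE entries rather than the values they select. A short sketch, where a and b are assumed to be a data vector and a logical vector as in the example above:

```r
# which() converts a logical index into the positions it selects.
a <- c(1, 2, 3, 4, 5)
b <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
which(b)       # positions of the TRUE entries
a[which(b)]    # the same values as a[b]
```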

    7.2. Not Available or Missing Values

    One common problem is data entries that are marked NA or not available. There is a predefined variable called NA that can be used to indicate missing information. The problem with this is that some functions throw an error if one of the entries in the data is NA. Some functions allow you to ignore the missing values through special options:

    > a <- c(1,2,3,4,NA)
    > a
    [1]  1  2  3  4 NA
    > sum(a)
    [1] NA
    > sum(a,na.rm=TRUE)
    [1] 10

    There are other times, though, when this option is not available, or you simply want to censor them. The is.na function can be used to determine which items are not available. The logical not operator in R is the ! symbol. When used with the indexing notation the items within a vector that are NA can be easily removed:

    > a <- c(1,2,3,4,NA)
    > is.na(a)
    [1] FALSE FALSE FALSE FALSE  TRUE
    > !is.na(a)
    [1]  TRUE  TRUE  TRUE  TRUE FALSE
    > a[!is.na(a)]
    [1] 1 2 3 4
    > b <- a[!is.na(a)]
    > b
    [1] 1 2 3 4
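    An equivalent censoring can be done with the na.omit command. A short sketch, using the same small vector:

```r
# na.omit drops NA entries; it also attaches bookkeeping attributes,
# which as.numeric strips off again.
a <- c(1, 2, 3, 4, NA)
b <- as.numeric(na.omit(a))
b
```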

    7.3. Indices With Logical Expression

    Any logical expression can be used as an index, which opens up a wide range of possibilities. You can use this to remove or focus on entries that match specific criteria. For example, you might want to remove all entries at or above a certain value:

    > a = c(6,2,5,3,8,2)
    > a
    [1] 6 2 5 3 8 2
    > b = a[a<6]
    > b
    [1] 2 5 3 2

    For another example, suppose you want to join together the values that match two different factors in another vector:

    > d = data.frame(one=as.factor(c('a','a','b','b','c','c')),
                     two=c(1,2,3,4,5,6))
    > d
      one two
    1   a   1
    2   a   2
    3   b   3
    4   b   4
    5   c   5
    6   c   6
    > both = d$two[(d$one=='a') | (d$one=='b')]
    > both
    [1] 1 2 3 4

    Note that a single | was used in the previous example. There is a difference between || and |. A single bar performs the comparison term by term across the vectors, while a double bar looks only at the first entry of each vector and evaluates to a single TRUE or FALSE result. (Recent versions of R signal an error if || or && is given operands of length greater than one.)

    > (c(TRUE,TRUE))|(c(FALSE,TRUE))
    [1] TRUE TRUE
    > (c(TRUE,TRUE))||(c(FALSE,TRUE))
    [1] TRUE
    > (c(TRUE,TRUE))&(c(FALSE,TRUE))
    [1] FALSE  TRUE
    > (c(TRUE,TRUE))&&(c(FALSE,TRUE))
    [1] FALSE
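    If a single TRUE/FALSE summary of a vector comparison is what you actually want, the any and all commands make that intent explicit. A brief sketch:

```r
# any() is TRUE if at least one entry is TRUE; all() only if every entry is.
v <- c(TRUE, TRUE) | c(FALSE, TRUE)   # term-by-term: TRUE TRUE
any(v)
all(v)
```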

    8. Linear Least Squares Regression

    Here we look at the most basic linear least squares regression. The main purpose is to provide an example of the basic commands. It is assumed that you know how to enter data or read data files which is covered in the first chapter, and it is assumed that you are familiar with the different data types.

    We will examine the interest rate for four year car loans, and the data that we use comes from the U.S. Federal Reserve's mean rates. We are looking at and plotting means. This, of course, is a very bad thing because it removes a lot of the variance and is misleading. The only reason that we are working with the data in this way is to provide an example of linear regression that does not use too many data points. Do not try this without a professional near you, and if a professional is not near you do not tell anybody you did this. They will laugh at you. People are mean, especially professionals.

    The first thing to do is to specify the data. Here there are only five pairs of numbers so we can enter them in manually. Each of the five pairs consists of a year and the mean interest rate:

    > year <- c(2000,2001,2002,2003,2004)
    > rate <- c(9.34,8.50,7.62,6.93,6.60)
    > plot(year,rate,
           main="Commercial Banks Interest Rate for 4 Year Car Loan",
           sub="http://www.federalreserve.gov/releases/g19/20050805/")
    > cor(year,rate)
    [1] -0.9880813

    At this point we should be excited because associations that strong never happen in the real world unless you cook the books or work with averaged data. The next question is what straight line comes closest to the data? In this case we will use least squares regression as one way to determine the line.

    Before we can find the least square regression line we have to make some decisions. First we have to decide which is the explanatory and which is the response variable. Here, we arbitrarily pick the explanatory variable to be the year, and the response variable is the interest rate. This was chosen because it seems like the interest rate might change in time rather than time changing as the interest rate changes. (We could be wrong, finance is very confusing.)

    The command to perform the least square regression is the lm command. The command has many options, but we will keep it simple and not explore them here. If you are interested use the help(lm) command to learn more. Instead the only option we examine is the one necessary argument which specifies the relationship.

    Since we specified that the interest rate is the response variable and the year is the explanatory variable this means that the regression line can be written in slope-intercept form:

    rate=(slope)year+(intercept)

    The way that this relationship is defined in the lm command is that you write the vector containing the response variable, a tilde (~), and a vector containing the explanatory variable:

    > fit <- lm(rate ~ year)
    > fit

    Call:
    lm(formula = rate ~ year)

    Coefficients:
    (Intercept)         year
       1419.208       -0.705

    When you make the call to lm it returns a variable with a lot of information in it. If you are just learning about least squares regression you are probably only interested in two things at this point, the slope and the y-intercept. If you just type the name of the variable returned by lm it will print out this minimal information to the screen. (See above.)

    If you would like to know what else is stored in the variable you can use the attributes command:

    > attributes(fit)
    $names
     [1] "coefficients"  "residuals"     "effects"       "rank"
     [5] "fitted.values" "assign"        "qr"            "df.residual"
     [9] "xlevels"       "call"          "terms"         "model"

    $class
    [1] "lm"

    One of the things you should notice is the coefficients variable within fit. You can print out the y-intercept and slope by accessing this part of the variable:

    > fit$coefficients[1]
    (Intercept)
       1419.208
    > fit$coefficients[[1]]
    [1] 1419.208
    > fit$coefficients[2]
      year
    -0.705
    > fit$coefficients[[2]]
    [1] -0.705

    Note that if you just want to get the number you should use two square braces. So if you want to get an estimate of the interest rate in the year 2015 you can use the formula for a line:

    > fit$coefficients[[2]]*2015+fit$coefficients[[1]]
    [1] -1.367

    So if you just wait long enough, the banks will pay you to take a car!
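    The same extrapolation can be done with the predict function, which avoids typing the line formula by hand. A sketch, assuming the year and rate vectors from above:

```r
# predict() evaluates the fitted line at new explanatory values.
year <- c(2000, 2001, 2002, 2003, 2004)
rate <- c(9.34, 8.50, 7.62, 6.93, 6.60)
fit <- lm(rate ~ year)
predict(fit, newdata = data.frame(year = 2015))
```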

    A better use for this formula would be to calculate the residuals and plot them:

    > res <- rate - (fit$coefficients[[2]]*year + fit$coefficients[[1]])
    > res
    [1]  0.132 -0.003 -0.178 -0.163  0.212
    > plot(year,res)

    That is a bit messy, but fortunately there are easier ways to get the residuals. Two other ways are shown below:

    > residuals(fit)
         1      2      3      4      5
     0.132 -0.003 -0.178 -0.163  0.212
    > fit$residuals
         1      2      3      4      5
     0.132 -0.003 -0.178 -0.163  0.212
    > plot(year,fit$residuals)

    If you want to plot the regression line on the same plot as your scatter plot you can use the abline function along with your variable fit:

    > plot(year,rate,
           main="Commercial Banks Interest Rate for 4 Year Car Loan",
           sub="http://www.federalreserve.gov/releases/g19/20050805/")
    > abline(fit)

    Finally, as a teaser for the kinds of analyses you might see later, you can get the results of an F-test by asking R for a summary of the fit variable:

    > summary(fit)

    Call:
    lm(formula = rate ~ year)

    Residuals:
         1      2      3      4      5
     0.132 -0.003 -0.178 -0.163  0.212

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept) 1419.20800  126.94957   11.18  0.00153 **
    year          -0.70500    0.06341  -11.12  0.00156 **
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.2005 on 3 degrees of freedom
    Multiple R-Squared: 0.9763,     Adjusted R-squared: 0.9684
    F-statistic: 123.6 on 1 and 3 DF,  p-value: 0.001559
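    If you need one of those numbers in a later calculation, the summary object can be indexed rather than read off the screen. A sketch, assuming the year and rate data above:

```r
# Pulling values out of the summary object programmatically.
year <- c(2000, 2001, 2002, 2003, 2004)
rate <- c(9.34, 8.50, 7.62, 6.93, 6.60)
fit <- lm(rate ~ year)
s <- summary(fit)
s$r.squared                          # Multiple R-squared
s$coefficients["year", "Pr(>|t|)"]   # p value for the slope
```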

    9. Calculating Confidence Intervals

    Contents

    Calculating a Confidence Interval From a Normal Distribution Calculating a Confidence Interval From a t Distribution Calculating Many Confidence Intervals From a t Distribution

    Here we look at some examples of calculating confidence intervals. The examples are for both normal and t distributions. We assume that you can enter data and know the commands associated with basic probability. Note that an easier way to calculate confidence intervals using the t.test command is discussed in section The Easy Way.

    9.1. Calculating a Confidence Interval From a Normal Distribution

    Here we will look at a fictitious example. We will make some assumptions for what we might find in an experiment and find the resulting confidence interval using a normal distribution. Here we assume that the sample mean is 5, the standard deviation is 2, and the sample size is 20. In the example below we will use a 95% confidence level and wish to find the confidence interval. The commands to find the confidence interval in R are the following:

    > a <- 5
    > s <- 2
    > n <- 20
    > error <- qnorm(0.975)*s/sqrt(n)
    > left <- a-error
    > right <- a+error
    > left
    [1] 4.123477
    > right
    [1] 5.876523

    The true mean has a probability of 95% of being in the interval between 4.12 and 5.88 assuming that the original random variable is normally distributed, and the samples are independent.
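    The three steps above can be bundled into a small helper so that other means and sample sizes can be tried quickly. A sketch, where the function name normCI is our own invention rather than a built-in command:

```r
# A minimal helper wrapping the normal-based confidence interval;
# normCI is a hypothetical name, not part of base R.
normCI <- function(xbar, s, n, conf=0.95) {
  error <- qnorm(1 - (1-conf)/2) * s / sqrt(n)
  c(lower = xbar - error, upper = xbar + error)
}
normCI(5, 2, 20)   # the example from the text
```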

    9.2. Calculating a Confidence Interval From a t Distribution

    Calculating the confidence interval when using a t-test is similar to using a normal distribution. The only difference is that we use the command associated with the t-distribution rather than the normal distribution. Here we repeat the procedures above, but we will assume that we are working with a sample standard deviation rather than an exact standard deviation.

    Again we assume that the sample mean is 5, the sample standard deviation is 2, and the sample size is 20. We use a 95% confidence level and wish to find the confidence interval. The commands to find the confidence interval in R are the following:

    > a <- 5
    > s <- 2
    > n <- 20
    > error <- qt(0.975,df=n-1)*s/sqrt(n)
    > left <- a-error
    > right <- a+error
    > left
    [1] 4.063971
    > right
    [1] 5.936029

    The true mean has a probability of 95% of being in the interval between 4.06 and 5.94 assuming that the original random variable is normally distributed, and the samples are independent.

    We now look at an example where we have a univariate data set and want to find the 95% confidence interval for the mean. In this example we use one of the data sets given in the data input chapter. We use the w1.dat data set:

    > w1 <- read.csv(file="w1.dat",sep=",",head=TRUE)
    > summary(w1)
          vals
     Min.   :0.130
     1st Qu.:0.480
     Median :0.720
     Mean   :0.765
     3rd Qu.:1.008
     Max.   :1.760
    > length(w1$vals)
    [1] 54
    > mean(w1$vals)
    [1] 0.765
    > sd(w1$vals)
    [1] 0.3781222

    We can now calculate an error for the mean:

    > error <- qt(0.975,df=length(w1$vals)-1)*sd(w1$vals)/sqrt(length(w1$vals))
    > error
    [1] 0.1032075

    The confidence interval is found by adding and subtracting the error from the mean:

    > left <- mean(w1$vals)-error
    > right <- mean(w1$vals)+error
    > left
    [1] 0.6617925
    > right
    [1] 0.8682075

    There is a 95% probability that the true mean is between 0.66 and 0.87 assuming that the original random variable is normally distributed, and the samples are independent.
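    As noted at the start of the chapter, the t.test command performs this entire calculation in one step. A minimal sketch, using simulated data as a stand-in for w1.dat (the seed, mean, and standard deviation below are our own choices, picked to mimic that data set):

```r
# t.test reports the 95% confidence interval for the mean directly;
# the data here are simulated stand-ins for the w1 data set.
set.seed(42)                          # hypothetical seed, for reproducibility
vals <- rnorm(54, mean=0.765, sd=0.378)
ci <- t.test(vals)$conf.int           # default conf.level is 0.95
ci
```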

    9.3. Calculating Many Confidence Intervals From a t Distribution

    Suppose that you want to find the confidence intervals for many tests. This is a common task and most software packages will allow you to do this.

    We have three different sets of results:

    Comparison 1

                  Mean   Std. Dev.   Number (pop.)
        Group I   10     3           300
        Group II  10.5   2.5         230

    Comparison 2

                  Mean   Std. Dev.   Number (pop.)
        Group I   12     4           210
        Group II  13     5.3         340

    Comparison 3

                  Mean   Std. Dev.   Number (pop.)
        Group I   30     4.5         420
        Group II  28.5   3           400

    For each of these comparisons we want to calculate the associated confidence interval for the difference of the means. For each comparison there are two groups: group one is the group whose results are in the first row of each comparison above, and group two is the group whose results are in the second row. Before we can do that we must first compute a standard error and a t-score. We will find general formulae, which is necessary in order to do all three calculations at once.

    We assume that the means for the first group are defined in a variable called m1. The means for the second group are defined in a variable called m2. The standard deviations for the first group are in a variable called sd1. The standard deviations for the second group are in a variable called sd2. The number of samples for the first group are in a variable called num1. Finally, the number of samples for the second group are in a variable called num2.

    With these definitions the standard error is the square root of (sd1^2)/num1 + (sd2^2)/num2. The R commands to do this can be found below:

    > m1 <- c(10,12,30)
    > m2 <- c(10.5,13,28.5)
    > sd1 <- c(3,4,4.5)
    > sd2 <- c(2.5,5.3,3)
    > num1 <- c(300,210,420)
    > num2 <- c(230,340,400)
    > se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)
    > error <- qt(0.975,df=pmin(num1,num2)-1)*se
    > m1
    [1] 10 12 30
    > m2
    [1] 10.5 13.0 28.5
    > sd1
    [1] 3.0 4.0 4.5
    > sd2
    [1] 2.5 5.3 3.0
    > num1
    [1] 300 210 420
    > num2
    [1] 230 340 400
    > se
    [1] 0.2391107 0.3985074 0.2659216
    > error
    [1] 0.4711382 0.7856092 0.5227825

    Now we need to define the confidence interval around the assumed differences. Just as in the case of finding the p values in the previous chapter we have to use the pmin command to get the number of degrees of freedom. In this case the null hypotheses are for a difference of zero, and we use a 95% confidence interval.
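    Carrying that through, a sketch of the remaining step under the assumptions just stated (a null difference of zero, 95% confidence), using the same numbers as the text:

```r
# Completing the calculation: the interval around the difference of the
# means for each of the three comparisons.
m1 <- c(10, 12, 30);      m2 <- c(10.5, 13, 28.5)
sd1 <- c(3, 4, 4.5);      sd2 <- c(2.5, 5.3, 3)
num1 <- c(300, 210, 420); num2 <- c(230, 340, 400)
se <- sqrt(sd1^2/num1 + sd2^2/num2)
error <- qt(0.975, df=pmin(num1, num2)-1)*se
left  <- (m1-m2) - error   # lower ends of the three intervals
right <- (m1-m2) + error   # upper ends
```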