programming with r
Introduction to R Programming
-
Programming with R
1
-
Some General Programming Guidelines
1. Understand the problem.
2. Work out a general idea of how to solve it.
3. Translate your idea into a detailed implementation.
4. Check: Does it work?
Is it good enough?
If yes, you are done!
If no, go back to step 2.
2
-
Example
We wish to write a program which will sort a vector of
integers into increasing order.
3
-
Understand the Problem
Start with a specific case, usually simple, but not too simple.
Sometimes, you might try to solve the problem on your own,
without the computer.
Consider sorting the vector consisting of the elements
3,5,24,6,2,4,13,1.
4
-
Understand the Problem
Our goal is to write a function called bubblesort() for which
we could do the following:
> x <- c(3, 5, 24, 6, 2, 4, 13, 1)
> bubblesort(x)
[1]  1  2  3  4  5  6 13 24
-
Work out a General Idea
A first idea might be to find where the smallest value is, and
record it.
Repeat, with the remaining values, recording the smallest
value each time.
Repeat ...
This might be time-consuming.
6
-
Work out a General Idea
An alternative idea: compare successive pairs of values, starting
at the beginning of the vector, and running through to the end.
Swap pairs if they are out of order.
Try using this idea on 2,1,4,3,0, for example.
After running through it, you should end up with 1,2,3,0,4.
This method doesn't give the solution directly.
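A single pass of this pairwise idea can be sketched in R as follows. The function name one_pass is hypothetical and does not appear in the original slides; it only illustrates the compare-and-swap step.

```r
# Hypothetical helper: one left-to-right pass of adjacent comparisons
one_pass <- function(x) {
  if (length(x) < 2) return(x)   # nothing to compare
  for (i in 1:(length(x) - 1)) {
    if (x[i] > x[i + 1]) {       # pair out of order: swap it
      save <- x[i]
      x[i] <- x[i + 1]
      x[i + 1] <- save
    }
  }
  x
}

one_pass(c(2, 1, 4, 3, 0))       # gives 1 2 3 0 4
```

The largest value has reached the end, but the vector as a whole is not yet sorted.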
7
-
Work out a General Idea
In checking the alternative idea, notice that the largest value
always lands at the end of the new vector. (Can you prove to
yourself that this should always happen?)
This means that we can sort the vector by starting at the
beginning of the vector and going through all adjacent pairs.
Then repeat this procedure for all but the last value, and so
on.
8
-
Detailed Implementation
At this point, we need to address specific coding questions.
e.g. How do we swap x[i] and x[i+1]?
Here is a way to swap the value of x[3] with that of x[4]:
> save <- x[3]
> x[3] <- x[4]
> x[4] <- save
-
Detailed Implementation
Note that you should not over-write the value of x[3] with the
value of x[4] before its old value has been saved in another
place; otherwise, you will not be able to assign that value to
x[4].
10
-
Detailed Implementation
We are now ready to write the code:
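bubblesort <- function(x) {
  for (last in length(x):2) {
    for (first in 1:(last - 1)) {
      if (x[first] > x[first + 1]) { # swap the pair
        save <- x[first]
        x[first] <- x[first + 1]
        x[first + 1] <- save
      }
    }
  }
  return(x)
}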
-
Check
Always begin testing your code on simple examples to
identify obvious bugs.
> bubblesort(c(2, 1))
[1] 1 2
> bubblesort(c(2, 24, 3, 4, 5, 13, 6, 1))
[1]  1  2  3  4  5  6 13 24
12
-
Check
Try the code on several other numeric vectors. What is the
output when the input vector has length 1?
> bubblesort(1)
Error in if (x[first] > x[first + 1]) { : missing value where
TRUE/FALSE needed
13
-
Check
The problem is that when length(x) == 1, the value of last
will take on the values 1:2, rather than no values at all.
This doesn't require a redesign of the function; we can fix
it by handling this as a special case at the beginning of our
function:
14
-
Check
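bubblesort <- function(x) {
  if (length(x) < 2) return(x) # special case: nothing to sort
  for (last in length(x):2) {
    for (first in 1:(last - 1)) {
      if (x[first] > x[first + 1]) { # swap the pair
        save <- x[first]
        x[first] <- x[first + 1]
        x[first + 1] <- save
      }
    }
  }
  return(x)
}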
-
Check
Test the new version:
> bubblesort(1)
[1] 1
16
-
Top-down design
Working out the detailed implementation of a program can appear to be a daunting task. The key to making it manageable is to break it down into smaller pieces which you know how to solve.
One strategy for doing that is known as top-down design. Top-down design is similar to outlining an essay before filling in the details:
1. Write out the whole program in a small number (1-5) of steps.
2. Expand each step into a small number of steps.
3. Keep going until you have a program.
17
-
Example: Merge Sort
The sort algorithm just described is known as a bubble sort.
The bubble sort is easy to program and is efficient when the
vector x is short, but when x is longer, more efficient
methods are available.
One of these is known as a merge sort.
The general idea of a merge sort is to split the vector into two
halves, sort each half, and then merge the two halves.
18
-
Example: Merge Sort
During the merge, we only need to compare the first elements
of each sorted half to decide which is the smallest value
overall.
Remove that value from its half; then the second value
becomes the smallest remaining value in this half, and we can
proceed to put the two parts together into one sorted vector.
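The merge step on its own can be sketched as below. The helper name merge_sorted is hypothetical, and the sketch assumes both inputs are already sorted in increasing order.

```r
# Hypothetical helper: merge two vectors that are each already sorted
merge_sorted <- function(y, z) {
  result <- c()
  while (length(y) > 0 && length(z) > 0) {
    if (y[1] < z[1]) {           # smallest remaining value is in y
      result <- c(result, y[1])
      y <- y[-1]
    } else {                     # smallest remaining value is in z
      result <- c(result, z[1])
      z <- z[-1]
    }
  }
  c(result, y, z)                # at most one of y, z is non-empty here
}

merge_sorted(c(6, 8), c(4, 7))   # gives 4 6 7 8
```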
19
-
Example: Merge Sort
So how do we do the initial sorting of each half?
We could use a bubble sort, but a more elegant procedure is
to use a merge sort on each of them.
This is an idea called recursion.
The mergesort() function which we will write below can
make calls to itself.
Because of variable scoping, new copies of all of the local
variables will be created each time it is called, and the
different calls will not interfere with each other.
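The point about local variables can be seen in a small recursive function. This is an illustrative sketch; the names countdown and local_copy are hypothetical.

```r
countdown <- function(n) {
  local_copy <- n                 # each call gets its own local_copy
  if (n > 0) countdown(n - 1)     # inner calls create their own copies
  local_copy                      # still the value this call started with
}

countdown(3)                      # returns 3, not 0
```

Although the inner calls set their own local_copy to 2, 1 and 0, the outer call still sees 3.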
-
Understanding the idea
It is often worthwhile to consider small numerical examples
in order to ensure that we understand the basic idea of the
algorithm, before we proceed to designing it in detail.
For example, suppose x is [8,6,7,4], and we want to
construct a sorted result r.
Then our merge sort would proceed as follows:
21
-
Understanding the idea
1. Split x into two parts: y <- [8,6], z <- [7,4]
2. Sort y and z: y <- [6,8], z <- [4,7]
3. Merge y and z:
(a) Compare y[1] = 6 and z[1] = 4: r[1] <- 4; Remove z[1]; z is now [7].
(b) Compare y[1] = 6 and z[1] = 7: r[2] <- 6; Remove y[1]; y is now [8].
(c) Compare y[1] = 8 and z[1] = 7: r[3] <- 7; Remove z[1]; z is now empty.
(d) Append remaining values of y onto r: r[4] <- 8
4. Return r = [4,6,7,8]
22
-
Translating into code
It is helpful to think of the translation process as a stepwise process of refining a program until it works.
We begin with a general statement, and gradually expand each part.
We will use a double comment marker ## to mark descriptive lines that still need expansion. We will number these comments so that we can refer to them in the slides; in practice, you would probably not find this necessary.
After expanding, we will change to the usual comment marker to leave our description in place.
23
-
Initial Steps
We start with just one aim, which we can use as our first descriptive line:
## 1. Use a merge sort to sort a vector
We will gradually expand upon previous steps, adding in
detail as we go.
An expansion of step 1 follows from recognizing that we need
an input vector x which will be processed by a function that
we are naming mergesort.
Somehow, we will sort this vector.
-
Initial Steps
In the end, we want the output to be returned:
# 1. Use a merge sort to sort a vector
mergesort
-
Breaking Down one of the Steps
We now expand step 2, noting how the merge sort algorithm
proceeds:
# 1. Use a merge sort to sort a vector
mergesort
-
Breaking Down Substeps
Each substep of the above needs to be expanded. First, we
expand step 2.1.
# 2.1: split x in half
len
-
Caution: check your code
x
-
Check your code
x
-
Check your code
x
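One way to carry out such a check is to try the split on vectors of several lengths. The sketch below assumes the split uses integer division (%/%), which may differ from the original listing; the helper name split_half is hypothetical.

```r
# Hypothetical helper: split x into a first half y and a second half z
split_half <- function(x) {
  len <- length(x)
  half <- len %/% 2               # assumption: integer division
  list(y = x[1:half], z = x[(half + 1):len])
}

split_half(1:4)   # y is 1 2, z is 3 4
split_half(1:5)   # y is 1 2, z is 3 4 5
split_half(7)     # y is 7 and z is 7: the single element is duplicated
```

The last call shows why very short vectors need separate treatment: 1:0 counts down in R, so the single element ends up in both halves.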
-
Caution: Boundary Cases can be Different
Be careful with edge cases; usually, we expect to sort a
vector containing more than one element, but our sort
function should be able to handle the simple problem of sorting a
single element.
The code above does not handle len < 2 properly.
We must try again, fixing step 2.1. The solution is simple: if
the length of x is 0 or 1, our function should simply return x.
Otherwise, we proceed to split x and sort as above. This
affects code outside of step 2.1, so we need to correct our
outline.
-
Revised Program
Here is the new outline, including the new step 2.1:
# 1. Use a merge sort to sort a vector
mergesort
-
Revised Program
# 2: sort x into result
# 2.1: split x in half
y
-
Further Expansion
Step 2.2 is very easy to expand, because we can make use of
our mergesort() function, even though we haven't written
it yet!
The key idea is to remember that we are not executing the
code at this point, we are designing it.
We should assume our design will eventually be successful,
and we will be able to make use of the fruits of our labour.
34
-
Further Expansion
So step 2.2 becomes
# 2.2: sort y and z
y <- mergesort(y)
z <- mergesort(z)
-
Further Expansion
Step 2.3 is more complicated, so let's take it slowly.
We know that we will need a result vector, but let's describe
the rest of the process before we code it.
We repeat the whole function here, including this expansion
and the expansion of step 2.2:
36
-
Further Expansion
# 1. Use a merge sort to sort a vector
mergesort
-
Further Expansion
# 2: sort x into result
# 2.1: split x in half
y
-
Further Expansion
Steps 2.3.2 and 2.3.3 both depend on the test of which of y[1] and z[1] is smaller.
> # 1. Use a merge sort to sort a vector
> mergesort <- function(x) {
-
Further Expansion
+ while (min(length(y), length(z)) > 0) {
+   # 2.3.2: put the smallest first element on the end
+   # 2.3.3: remove it from y or z
+   if (y[1] < z[1]) {
+     result <- c(result, y[1])
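Putting the pieces of the outline together, a complete function might look like the following sketch. It assumes an early return for vectors of length 0 or 1 and integer division for the split; details may differ from the authors' final listing.

```r
# Sketch of a complete merge sort, following the numbered outline
mergesort <- function(x) {
  len <- length(x)
  if (len < 2) return(x)               # boundary case: nothing to sort
  half <- len %/% 2                    # assumption: integer division
  # 2.1 and 2.2: split x in half, then sort each half recursively
  y <- mergesort(x[1:half])
  z <- mergesort(x[(half + 1):len])
  # 2.3: merge y and z into a sorted result
  result <- c()
  while (min(length(y), length(z)) > 0) {
    if (y[1] < z[1]) {                 # 2.3.2 and 2.3.3
      result <- c(result, y[1])
      y <- y[-1]
    } else {
      result <- c(result, z[1])
      z <- z[-1]
    }
  }
  c(result, y, z)                      # append whatever is left over
}

mergesort(c(8, 6, 7, 4))               # gives 4 6 7 8
```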
-
Debugging and Maintenance
Computer errors are called bugs.
Removing these errors from a program is called debugging.
Debugging is difficult, and one of our goals is to write
programs that don't have bugs in them: but sometimes we make
mistakes.
41
-
Debugging and Maintenance
We have found that the following five steps help us to find
and fix bugs in our own programs:
1. Recognize that a bug exists.
2. Make the bug reproducible.
3. Identify the cause of the bug.
4. Fix the error and test.
5. Look for similar errors.
We will consider each of these in turn.
42
-
Recognizing that a bug exists
Sometimes this is easy; if the program doesn't work, there is
a bug. However, in other cases the program seems to work,
but the output is incorrect, or the program works for some
inputs, but not for others.
A bug causing this kind of error is much more difficult to
recognize.
There are several strategies to make it easier.
43
-
Recognizing that a bug exists
First, follow the advice given earlier, and break up your
program into simple, self-contained functions.
Document their inputs and outputs.
Within the function, test that the inputs obey your
assumptions about them, and think of test inputs where you can see
at a glance whether the outputs match your expectations.
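One way to test assumptions about inputs is with stopifnot(). The function below is a hypothetical illustration, not code from the slides.

```r
# Hypothetical example: check assumptions about the input before using it
sample_mean <- function(x) {
  stopifnot(is.numeric(x), length(x) > 0, !anyNA(x))
  sum(x) / length(x)
}

sample_mean(c(2, 4, 6))   # easy to verify by eye: 4
```

Calling sample_mean("a") stops immediately with an error naming the failed condition, instead of producing a confusing result later on.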
44
-
Recognizing that a bug exists
In some situations, it may be worthwhile writing two versions
of a function: one that may be too slow to use in practice,
but which you are sure is right, and another that is faster but
harder to be sure about.
Test that both versions produce the same output in all
situations.
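As a sketch of this strategy, the slow version below repeatedly extracts the smallest remaining value, and base R's sort() stands in for the faster version. The name slow_sort and the choice of test inputs are assumptions made for illustration.

```r
# Slow but simple: repeatedly pull out the smallest remaining value
slow_sort <- function(x) {
  result <- c()
  while (length(x) > 0) {
    i <- which.min(x)
    result <- c(result, x[i])
    x <- x[-i]
  }
  result
}

set.seed(1)                          # make the random test inputs reproducible
for (n in 0:20) {
  x <- sample(100, n, replace = TRUE)
  stopifnot(all(slow_sort(x) == sort(x)))   # both versions must agree
}
```

The loop includes n = 0, so the empty vector is compared along with the ordinary cases.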
45
-
Recognizing that a bug exists
When errors only occur for certain inputs, our experience
shows that those are often what are called edge cases:
situations which are right on the boundary between legal and
illegal inputs.
Test those! For example, test what happens when you try a
vector of length zero, test very large or very small values, etc.
46
-
Make the bug reproducible
Before you can fix a bug, you need to know where things are
going wrong. This is much easier if you know how to trigger
the bug.
Bugs that only appear unpredictably are extremely difficult
to fix. The good news is that for the most part computers are
predictable: if you give them the same inputs, they give you
the same outputs.
The difficulty is in working out what the necessary inputs are.
47
-
Make the bug reproducible
For example, a common mistake in programming is to
misspell the name of a variable.
Normally this results in an immediate error message, but
sometimes you accidentally choose a variable that actually does
exist.
Then you'll probably get the wrong answer, and the answer
you get may appear to be random, because it depends on the
value in some unrelated variable.
48
-
Make the bug reproducible
The key to tracking down this sort of problem is to work hard to make the error reproducible.
Simplify things as much as possible: start a new empty R session, and see if you can reproduce it.
Once you can reproduce the error, you will eventually be able to track it down.
Some programs do random simulations.
For those, you can make the simulations reproducible by setting the value of the random number seed at the start.
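Setting the seed can be sketched with set.seed(); the seed value 123 here is arbitrary.

```r
set.seed(123)          # the seed value 123 is arbitrary
a <- runif(3)
set.seed(123)          # resetting the seed replays the same random stream
b <- runif(3)
identical(a, b)        # TRUE
```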
49
-
Identify the cause of the bug
When you have confirmed that a bug exists, the next step is
to identify its cause.
If your program has stopped with an error, read the error
messages.
Try to understand them as well as you can.
50
-
Trouble-shooting
The simplest way to do this is to edit your functions to add
statements like this:
cat("In cv, x=", x, "\n")
This will print the value of x, identifying where the message
is coming from. The "\n" at the end tells R to go to a new
line after printing.
51
-
Trouble-shooting
You may want to use print() rather than cat() to take
advantage of its formatting, but remember that it can only print
one thing at a time, so you would likely use it as
cat("In cv, x=\n")
print(x)
52
-
Trouble-shooting
Another way to understand what is going wrong in a small
function is to simulate it by hand.
Act as you think R would act, and write down the values of all
variables as the function progresses.
53
-
Fixing errors and testing
Once you have identified the bug in your program, you need
to fix it.
Try to fix it in such a way that you don't cause a different
problem.
Then test what you've done.
You should put together tests that include the input you know
reproduces the error, as well as edge cases, and
anything else you can think of.
54
-
The debug() Function
Rather than using cat() or print() for debugging, R allows
you to call the function debug(). Once a function has been
marked with debug(), R will pause execution whenever it is
called, and allow you to examine (or change!) local variables,
or execute any other R command, inside the evaluation
environment of the function.
55
-
The debug() Function
Commands to use with debug() are
n - next; execute the next line of code, single-stepping through the function
c - continue; let the function continue running
Q - quit the debugger
You mark function f for debugging using debug(f), and then
the browser will be called when you enter the function. Turn
off debugging using undebug(f).
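The marking and unmarking can be sketched without entering the browser by using isdebugged(). The function f below is a hypothetical example.

```r
f <- function(x) {
  y <- x^2
  y + 1
}

debug(f)            # mark f for debugging
isdebugged(f)       # TRUE: a call such as f(2) would now open the browser
undebug(f)          # turn debugging off again
isdebugged(f)       # FALSE
f(2)                # runs normally and returns 5
```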
56
-
Example: Constructing and Debugging a Function
We will write and debug a function which will compute a
confidence interval for the true mean of a population, based on a
random sample of size n using the formula
$$\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$$
where $\bar{x}$ is the sample mean, $s$ is the sample standard
deviation, and the $t$ value is the $100(1 - \alpha/2)$th percentile of the t distribution on $n - 1$ degrees of freedom.
57
-
Writing a Confidence Interval Function
Our goal is to write a function which will take an input vector x,
such as some male heights:
> x <- c(170, 185, 177, 160)
-
Writing a Confidence Interval Function
ci ci(x) # this should print out a 95%
# confidence interval for the true mean
59
-
Solving the Problem
The confidence interval formula requires:
the sample mean, which we can compute with mean(x)
the sample standard deviation (sd(x))
the t percentiles (qt(c(alpha/2, 1-alpha/2), df))
the square root of n (sqrt(n))
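Each of these ingredients can be computed on its own before they are combined. The sketch below uses the height values that appear in the test output of this deck and assumes alpha = 0.05.

```r
x <- c(170, 185, 177, 160)   # height values from the test output
alpha <- 0.05                # assumption: a 95% interval
n <- length(x)

mean(x)                                  # sample mean: 173
sd(x)                                    # sample standard deviation
qt(c(alpha / 2, 1 - alpha / 2), n - 1)   # t percentiles on n - 1 df
sqrt(n)                                  # 2
```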
60
-
Implementing the Solution
Here is a first attempt at implementing the solution to the
problem:
ci
-
Testing the Function
Here is a first test for our ci function. Use the data vector of
heights:
> x <- c(170, 185, 177, 160)
> ci(x)
Error in qt(p, df, lower.tail, log.p) :
Non-numeric argument to mathematical function
Something is wrong. One of the arguments to qt is incorrect.
62
-
Looking for the Error
We can add a print statement immediately before the call to qt:
ci
-
Looking for the Error
> ci(x)
$alpha
[1] 0.05
$df
function (x, df1, df2, ncp, log = FALSE)
{
if (missing(ncp))
.Internal(df(x, df1, df2, log))
else .Internal(dnf(x, df1, df2, ncp, log))
}
Error in qt(p, df, lower.tail, log.p) :
Non-numeric argument to mathematical function
The df argument to qt should be set to n-1.
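The printed value shows what went wrong: when no variable called df has been defined, R finds the built-in function df (the density of the F distribution) instead. A short check, assuming a sample of size 4:

```r
is.function(df)        # TRUE: df is the F density function in the stats package
n <- 4                 # assumed sample size
qt(0.975, n - 1)       # a numeric degrees-of-freedom argument works
```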
-
Another Attempt
ci
-
Another Attempt
ci
-
Checking the Boundary Case
Although we should not compute confidence intervals for
sample sizes less than 2, it might happen by accident:
> ci(3)
[1] NA NA
Warning message:
In qt(p, df, lower.tail, log.p) : NaNs produced
67
-
Checking the Boundary Case
Again, we can handle this boundary case with an if statement.
ci
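A version with such a check might look like the sketch below. Returning c(NA, NA) for samples of size less than 2, and the default alpha = 0.05, are assumptions rather than the authors' stated choices.

```r
# Sketch of a confidence interval function with a boundary check
ci <- function(x, alpha = 0.05) {
  n <- length(x)
  if (n < 2) return(c(NA, NA))          # assumption: no interval for n < 2
  tval <- qt(c(alpha / 2, 1 - alpha / 2), n - 1)
  mean(x) + tval * sd(x) / sqrt(n)      # lower and upper limits
}

ci(c(170, 185, 177, 160))   # approximately 156.11 189.89
ci(3)                       # NA NA, without a warning
```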
-
Our Function Can Now be Used Elsewhere
Now that ci() is a function that is known to work on numeric
vectors of any length, we can call it from other functions.
For example, the following function uses the ci() function to compute confidence intervals for all vectors in a matrix or list, as well as for a single vector.
CI
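A wrapper of this kind could be sketched as follows. Both functions here are illustrative: ci() repeats the assumed definition with a boundary check, and converting a matrix with data.frame() is an assumption suggested by the X1, X2, ... column names in the test output.

```r
# Assumed definition of ci(), repeated so the sketch is self-contained
ci <- function(x, alpha = 0.05) {
  n <- length(x)
  if (n < 2) return(c(NA, NA))
  tval <- qt(c(alpha / 2, 1 - alpha / 2), n - 1)
  mean(x) + tval * sd(x) / sqrt(n)
}

# Hypothetical wrapper: apply ci() to a vector, or to each column or element
CI <- function(x, alpha = 0.05) {
  if (is.matrix(x)) x <- data.frame(x)       # columns become X1, X2, ...
  if (is.list(x)) sapply(x, ci, alpha = alpha) else ci(x, alpha)
}

CI(c(170, 185, 177, 160))
CI(list(x = c(170, 185, 177, 160), y = c(149, 155, 162, 158, 154)))
```

For a list or matrix the result is a matrix with one column per vector, lower limits in the first row and upper limits in the second.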
-
Testing the CI Function
This function needs to be tested on vectors, lists (including
data frames) and matrices:
> x # a vector
[1] 170 185 177 160
> CI(x)
[1] 156.11 189.89
70
-
Testing with a Matrix
> xy # a matrix
           [,1]        [,2]        [,3]        [,4]        [,5]
[1,] -0.2799555  0.49433909 -0.76405054  0.30727532 -1.35506713
[2,]  0.7972663 -0.79788501 -0.31684602 -0.63859843 -1.18923510
[3,]  0.8847079 -0.03282889  0.21370405  1.01534939  0.29377499
[4,] -0.2333586  1.46200042 -0.02394421 -0.08885412 -0.08092837
> CI(xy)
             X1        X2         X3         X4         X5
[1,] -0.7182852 -1.228931 -0.8927856 -0.9584125 -1.8770214
[2,]  1.3026153  1.791744  0.4472173  1.2559986  0.7112936
71
-
Testing with a List
> xy3 # a list
$x
[1] 170 185 177 160
$y
[1] 149 155 162 158 154
$z
[1] 170 185 177 160
> CI(xy3)
          x        y      z
[1,] 156.11 149.6065 156.11
[2,] 189.89 161.5935 189.89
72