Programming with R


Some General Programming Guidelines

1. Understand the problem.

2. Work out a general idea of how to solve it.

3. Translate your idea into a detailed implementation.

4. Check: Does it work?

Is it good enough?

If yes, you are done!

If no, go back to step 2.


Example

We wish to write a program which will sort a vector of integers into increasing order.

Understand the Problem

Start with a specific case, usually simple, but not too simple.

Sometimes, you might try to solve the problem on your own,

without the computer.

Consider sorting the vector consisting of the elements

3,5,24,6,2,4,13,1.


Understand the Problem

Our goal is to write a function called bubblesort() for which

we could do the following:

> x <- c(3, 5, 24, 6, 2, 4, 13, 1)
> bubblesort(x)    # should give 1 2 3 4 5 6 13 24

Work out a General Idea

A first idea might be to find where the smallest value is, and

record it.

Repeat, with the remaining values, recording the smallest

value each time.

Repeat ...

This might be time-consuming.
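This first idea can be sketched as follows. The function name selection_sort and its internals are illustrative assumptions, not code from the slides. Each pass scans all of the remaining values, which is why the approach might be time-consuming.

```r
# A sketch of the "repeatedly record the smallest value" idea.
# selection_sort is a hypothetical name used only for illustration.
selection_sort <- function(x) {
  result <- x[0]                      # empty vector of the same type as x
  while (length(x) > 0) {
    smallest <- which.min(x)          # position of the (first) smallest value
    result <- c(result, x[smallest])  # record it
    x <- x[-smallest]                 # carry on with the remaining values
  }
  result
}

selection_sort(c(3, 5, 24, 6, 2, 4, 13, 1))
```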

An alternative idea: compare successive pairs of values, starting at the beginning of the vector, and running through to the end.

Swap pairs if they are out of order.

Try using this idea on 2,1,4,3,0, for example.

After running through it, you should end up with 1,2,3,0,4.

This method doesn't give the solution directly.
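A single pass of this idea might be sketched as below. The helper name one_pass is an assumption made for illustration; the swap uses a temporary variable.

```r
# One left-to-right pass over adjacent pairs, swapping any pair out of order.
# one_pass is a hypothetical helper name.
one_pass <- function(x) {
  for (i in seq_len(max(length(x) - 1, 0))) {
    if (x[i] > x[i + 1]) {
      save <- x[i]          # keep the old value before over-writing it
      x[i] <- x[i + 1]
      x[i + 1] <- save
    }
  }
  x
}

one_pass(c(2, 1, 4, 3, 0))   # gives 1 2 3 0 4
```

The largest value, 4, has moved to the end, but the rest of the vector is not yet sorted.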

In checking the alternative idea, notice that the largest value always lands at the end of the new vector. (Can you prove to yourself that this should always happen?)

This means that we can sort the vector by starting at the beginning and going through all adjacent pairs, swapping where needed.

Then repeat this procedure for all but the last value, and so on.


Detailed Implementation

At this point, we need to address specific coding questions.

e.g. How do we swap x[i] and x[i+1]?

Here is a way to swap the value of x[3] with that of x[4]:

> save <- x[3]
> x[3] <- x[4]
> x[4] <- save


Note that you should not over-write the value of x[3] with the

value of x[4] before its old value has been saved in another

place; otherwise, you will not be able to assign that value to

x[4].



We are now ready to write the code:

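bubblesort <- function(x) {
    # x is initially the input vector and will be
    # modified to form the output
    for (last in length(x):2) {
        for (first in 1:(last - 1)) {
            if (x[first] > x[first + 1]) {    # swap the pair
                save <- x[first]
                x[first] <- x[first + 1]
                x[first + 1] <- save
            }
        }
    }
    return(x)
}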

Check

Always begin testing your code on simple examples to identify obvious bugs.

> bubblesort(c(2, 1))

[1] 1 2

> bubblesort(c(2, 24, 3, 4, 5, 13, 6, 1))

[1] 1 2 3 4 5 6 13 24


Check

Try the code on several other numeric vectors. What is the

output when the input vector has length 1?

> bubblesort(1)

Error in if (x[first] > x[first + 1]) { : missing value where

TRUE/FALSE needed


Check

The problem is that when length(x) == 1, the value of last

will take on the values 1:2, rather than no values at all.
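The following lines illustrate the failure in isolation, assuming the outer loop of bubblesort() iterates over length(x):2 (an assumption consistent with the error message above).

```r
x <- 1                       # a vector of length 1
last_values <- length(x):2   # the colon operator counts upward from 1 to 2 here
last_values                  # 1 2, not an empty sequence
x[2]                         # NA: there is no second element
x[1] > x[2]                  # NA, which if() cannot treat as TRUE or FALSE
```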

This doesn't require a redesign of the function; we can fix it by handling this as a special case at the beginning of our function:

Check

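bubblesort <- function(x) {
    # x is initially the input vector and will be
    # modified to form the output
    if (length(x) < 2) return(x)    # nothing to sort
    for (last in length(x):2) {
        for (first in 1:(last - 1)) {
            if (x[first] > x[first + 1]) {    # swap the pair
                save <- x[first]
                x[first] <- x[first + 1]
                x[first + 1] <- save
            }
        }
    }
    return(x)
}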

Check

Test the new version:

> bubblesort(1)

[1] 1


Top-down design

Working out the detailed implementation of a program can appear to be a daunting task. The key to making it manageable is to break it down into smaller pieces which you know how to solve.

One strategy for doing that is known as top-down design. Top-down design is similar to outlining an essay before filling in the details:

1. Write out the whole program in a small number (1-5) of steps.

2. Expand each step into a small number of steps.

3. Keep going until you have a program.

Example Merge Sort

The sort algorithm just described is known as a bubble sort.

The bubble sort is easy to program and is efficient when the vector x is short, but when x is longer, more efficient methods are available.

One of these is known as a merge sort.

The general idea of a merge sort is to split the vector into two

halves, sort each half, and then merge the two halves.


Example Merge Sort

During the merge, we only need to compare the first elements of each sorted half to decide which is the smallest value overall.

Remove that value from its half; then the second value becomes the smallest remaining value in this half, and we can proceed to put the two parts together into one sorted vector.

Example Merge Sort

So how do we do the initial sorting of each half?

We could use a bubble sort, but a more elegant procedure is

to use a merge sort on each of them.

This is an idea called recursion.

The mergesort() function which we will write below can

make calls to itself.

Because of variable scoping, new copies of all of the local variables will be created each time it is called, and the different calls will not interfere with each other.
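A small recursive function, unrelated to sorting, illustrates this behaviour. The function sum_to is a hypothetical example and not part of the slides; it checks that its own local variable is untouched after the inner calls have returned.

```r
# sum_to(n) adds up the integers from 1 to n by calling itself.
sum_to <- function(n) {
  mine <- n                 # local variable belonging to this call only
  if (n <= 0) return(0)
  below <- sum_to(n - 1)    # inner calls create their own copies of `mine`
  stopifnot(mine == n)      # our copy is unchanged by the inner calls
  mine + below
}

sum_to(4)   # 10
```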

Understanding the idea

It is often worthwhile to consider small numerical examples

in order to ensure that we understand the basic idea of the

algorithm, before we proceed to designing it in detail.

For example, suppose x is [8,6,7,4], and we want to construct a sorted result r.

Then our merge sort would proceed as follows:

Understanding the idea

1. Split x into two parts: y <- [8,6], z <- [7,4]

2. Sort y and z: y <- [6,8], z <- [4,7]

3. Merge y and z:

(a) Compare y1 = 6 and z1 = 4: r1 <- 4; remove z1; z is now [7].

(b) Compare y1 = 6 and z1 = 7: r2 <- 6; remove y1; y is now [8].

(c) Compare y1 = 8 and z1 = 7: r3 <- 7; remove z1; z is now empty.

(d) Append remaining values of y onto r: r4 <- 8

4. Return r = [4,6,7,8]
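The merge in step 3 can be sketched as a standalone helper. The name merge_sorted and the use of position counters i, j and k (instead of physically removing elements from y and z) are assumptions made for this illustration; both inputs must already be sorted.

```r
# merge_sorted is a hypothetical helper: merge two sorted numeric vectors.
merge_sorted <- function(y, z) {
  r <- numeric(length(y) + length(z))
  i <- 1; j <- 1; k <- 1              # next unused position in y, z and r
  while (i <= length(y) && j <= length(z)) {
    if (y[i] < z[j]) {                # compare the current first elements
      r[k] <- y[i]
      i <- i + 1
    } else {
      r[k] <- z[j]
      j <- j + 1
    }
    k <- k + 1
  }
  # one input is used up; append whatever remains of the other
  if (i <= length(y)) r[k:length(r)] <- y[i:length(y)]
  if (j <= length(z)) r[k:length(r)] <- z[j:length(z)]
  r
}

merge_sorted(c(6, 8), c(4, 7))   # 4 6 7 8
```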


Translating into code

It is helpful to think of the translation process as a stepwise process of refining a program until it works.

We begin with a general statement, and gradually expand each part.

We will use a double comment marker ## to mark descriptive lines that still need expansion. We will number these comments so that we can refer to them in the slides; in practice, you would probably not find this necessary.

After expanding, we will change to the usual comment marker to leave our description in place.

Initial Steps

We start with just one aim, which we can use as our first descriptive line:

## 1. Use a merge sort to sort a vector

We will gradually expand upon previous steps, adding in detail as we go.

An expansion of step 1 follows from recognizing that we need an input vector x which will be processed by a function that we are naming mergesort.

Somehow, we will sort this vector.

Initial Steps

In the end, we want the output to be returned:

# 1. Use a merge sort to sort a vector

mergesort <- function (x) {
    ## 2: sort x into result
    return (result)
}

Breaking Down one of the Steps

We now expand step 2, noting how the merge sort algorithm

proceeds:


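# 1. Use a merge sort to sort a vector
mergesort <- function (x) {
    # 2: sort x into result
    ## 2.1: split x in half
    ## 2.2: sort each half
    ## 2.3: merge the 2 sorted parts into a sorted result
    return (result)
}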

Breaking Down Substeps

Each substep of the above needs to be expanded. First, we

expand step 2.1.

# 2.1: split x in half

len <- length(x)
y <- x[1:(len %/% 2)]
z <- x[(len %/% 2 + 1):len]

Caution: check your code

x

Check your code

x

Caution: Boundary Cases can be Different

Be careful with edge cases; usually, we expect to sort a vector containing more than one element, but our sort function should be able to handle the simple problem of sorting a single element.

The code above does not handle len < 2 properly.

We must try again, fixing step 2.1. The solution is simple: if the length of x is 0 or 1, our function should simply return x. Otherwise, we proceed to split x and sort as above. This affects code outside of step 2.1, so we need to correct our outline.
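To see why short vectors need special handling, consider a vector of length 1. The sketch below assumes that step 2.1 splits x with the index ranges 1:(len %/% 2) and (len %/% 2 + 1):len; that indexing is an assumption about the split code.

```r
x <- 5                          # a vector of length 1
len <- length(x)
idx_y <- 1:(len %/% 2)          # 1:0 counts down, giving 1 0
idx_z <- (len %/% 2 + 1):len    # 1:1, giving 1
y <- x[idx_y]                   # index 0 is ignored, so y is 5
z <- x[idx_z]                   # z is also 5
```

Both "halves" are copies of x, so a recursive sort of y and z would never reach a smaller problem.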

Revised Program

Here is the new outline, including the new step 2.1:


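# 1. Use a merge sort to sort a vector
mergesort <- function (x) {
    # Check for a vector that doesn't need sorting
    len <- length(x)
    if (len < 2) result <- x
    else {
        # 2: sort x into result
        # 2.1: split x in half
        y <- x[1:(len %/% 2)]
        z <- x[(len %/% 2 + 1):len]
        ## 2.2: sort y and z
        ## 2.3: merge y and z into a sorted result
    }
    return(result)
}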

Further Expansion

Step 2.2 is very easy to expand, because we can make use of our mergesort() function, even though we haven't written it yet!

The key idea is to remember that we are not executing the

code at this point, we are designing it.

We should assume our design will eventually be successful,

and we will be able to make use of the fruits of our labour.


Further Expansion

So step 2.2 becomes

# 2.2: sort y and z

y <- mergesort(y)
z <- mergesort(z)

Further Expansion

Step 2.3 is more complicated, so let's take it slowly.

We know that we will need a result vector, but let's describe the rest of the process before we code it.

We repeat the whole function here, including this expansion and the expansion of step 2.2:

Further Expansion


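# 1. Use a merge sort to sort a vector
mergesort <- function (x) {
    # Check for a vector that doesn't need sorting
    len <- length(x)
    if (len < 2) result <- x
    else {
        # 2: sort x into result
        # 2.1: split x in half
        y <- x[1:(len %/% 2)]
        z <- x[(len %/% 2 + 1):len]
        # 2.2: sort y and z
        y <- mergesort(y)
        z <- mergesort(z)
        # 2.3: merge y and z into a sorted result
        result <- c()
        ## 2.3.1: while (some are left in both piles)
        ## 2.3.2: put the smallest first element on the end
        ## 2.3.3: remove it from y or z
        ## 2.3.4: put the leftovers onto the end of result
    }
    return(result)
}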

Further Expansion

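Steps 2.3.2 and 2.3.3 both depend on the test of which of y[1] and z[1] is smaller.

> # 1. Use a merge sort to sort a vector
> mergesort <- function (x) {
+     # Check for a vector that doesn't need sorting
+     len <- length(x)
+     if (len < 2) result <- x
+     else {
+         # 2: sort x into result
+         # 2.1: split x in half
+         y <- x[1:(len %/% 2)]
+         z <- x[(len %/% 2 + 1):len]
+         # 2.2: sort y and z
+         y <- mergesort(y)
+         z <- mergesort(z)
+         # 2.3: merge y and z into a sorted result
+         result <- c()
+         # 2.3.1: while (some are left in both piles)
+         while (min(length(y), length(z)) > 0) {
+             # 2.3.2: put the smallest first element on the end
+             # 2.3.3: remove it from y or z
+             if (y[1] < z[1]) {
+                 result <- c(result, y[1])
+                 y <- y[-1]
+             } else {
+                 result <- c(result, z[1])
+                 z <- z[-1]
+             }
+         }
+         # 2.3.4: put the leftovers onto the end of result
+         if (length(y) > 0)
+             result <- c(result, y)
+         else
+             result <- c(result, z)
+     }
+     return(result)
+ }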

Debugging and Maintenance

Computer errors are called bugs.

Removing these errors from a program is called debugging.

Debugging is difficult, and one of our goals is to write programs that don't have bugs in them; but sometimes we make mistakes.

Debugging and Maintenance

We have found that the following five steps help us to find

and fix bugs in our own programs:

1. Recognize that a bug exists.

2. Make the bug reproducible.

3. Identify the cause of the bug.

4. Fix the error and test.

5. Look for similar errors.

We will consider each of these in turn.


Recognizing that a bug exists

Sometimes this is easy; if the program doesn't work, there is

a bug. However, in other cases the program seems to work,

but the output is incorrect, or the program works for some

inputs, but not for others.

A bug causing this kind of error is much more difficult to

recognize.

There are several strategies to make it easier.

First, follow the advice given earlier, and break up your program into simple, self-contained functions.

Document their inputs and outputs.

Within the function, test that the inputs obey your assumptions about them, and think of test inputs where you can see at a glance whether the outputs match your expectations.
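One way to test the inputs is to call stopifnot() at the top of the function. The function sample_mean below is a hypothetical example chosen because its output is easy to check by eye; the particular assumptions it tests are illustrative.

```r
# sample_mean is a hypothetical function used to illustrate input checks.
# Input:  x, a non-empty numeric vector with no missing values
# Output: a single number, the arithmetic mean of x
sample_mean <- function(x) {
  stopifnot(is.numeric(x), length(x) > 0, !anyNA(x))
  sum(x) / length(x)
}

sample_mean(c(1, 2, 3))   # easy to verify at a glance: 2
```

A call such as sample_mean("a") now stops immediately with an error, instead of failing later in a less obvious place.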



In some situations, it may be worthwhile writing two versions

of a function: one that may be too slow to use in practice,

but which you are sure is right, and another that is faster but

harder to be sure about.

Test that both versions produce the same output in all situations.
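As a sketch of this strategy, a simple bubble sort can play the role of the slow but trusted version, with R's built-in sort() standing in for the faster one. The name slow_sort, the input sizes, and the seed are assumptions made for this illustration.

```r
# slow_sort: a simple bubble sort that is easy to reason about
# (hypothetical name, written only for this sketch).
slow_sort <- function(x) {
  if (length(x) < 2) return(x)
  for (last in length(x):2) {
    for (first in 1:(last - 1)) {
      if (x[first] > x[first + 1]) {
        save <- x[first]
        x[first] <- x[first + 1]
        x[first + 1] <- save
      }
    }
  }
  x
}

set.seed(1)                             # make the random test inputs reproducible
for (n in 0:20) {                       # includes the edge cases n = 0 and n = 1
  x <- sample(100, n, replace = TRUE)   # random integers, possibly with ties
  stopifnot(identical(slow_sort(x), sort(x)))
}
```

No finite set of tests covers all situations, but a range of lengths, including very short vectors, catches many mistakes.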



When errors only occur for certain inputs, our experience

shows that those are often what are called edge cases:

situations which are right on the boundary between legal and

illegal inputs.

Test those! For example, test what happens when you try a

vector of length zero, test very large or very small values, etc.


Make the bug reproducible

Before you can fix a bug, you need to know where things are

going wrong. This is much easier if you know how to trigger

the bug.

Bugs that only appear unpredictably are extremely difficult

to fix. The good news is that for the most part computers are

predictable: if you give them the same inputs, they give you

the same outputs.

The difficulty is in working out what the necessary inputs are.



For example, a common mistake in programming is to misspell the name of a variable.

Normally this results in an immediate error message, but sometimes you accidentally choose a variable that actually does exist.

Then you'll probably get the wrong answer, and the answer you get may appear to be random, because it depends on the value in some unrelated variable.


The key to tracking down this sort of problem is to work hard to make the error reproducible.

Simplify things as much as possible: start a new empty R session, and see if you can reproduce it.

Once you can reproduce the error, you will eventually be able to track it down.

Some programs do random simulations.

For those, you can make the simulations reproducible by setting the value of the random number seed at the start.
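A minimal sketch of setting the seed with set.seed(); the seed value 2024 is arbitrary.

```r
set.seed(2024)           # the seed value itself is arbitrary
first_run <- rnorm(5)

set.seed(2024)           # resetting the same seed replays the same draws
second_run <- rnorm(5)

identical(first_run, second_run)   # TRUE
```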


Identify the cause of the bug

When you have confirmed that a bug exists, the next step is

to identify its cause.

If your program has stopped with an error, read the error messages.

Try to understand them as well as you can.

Trouble-shooting

If the messages are not enough, look at the values of variables while the program runs. The simplest way to do this is to edit your functions to add statements like this:

cat("In cv, x=", x, "\n")

This will print the value of x, identifying where the message

is coming from. The "\n" at the end tells R to go to a new

line after printing.
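In context, such a statement might sit at the top of the function being investigated. The slides show only the cat() line; the body of cv below (a coefficient of variation) is a guess made for illustration.

```r
# cv is assumed here to compute a coefficient of variation.
cv <- function(x) {
  cat("In cv, x=", x, "\n")   # temporary debugging output
  sd(x) / mean(x)
}

result <- cv(c(2, 4, 6))      # prints: In cv, x= 2 4 6
result                        # 0.5
```

Remember to remove the cat() call once the bug has been found.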


Trouble-shooting

You may want to use print() rather than cat() to take advantage of its formatting, but remember that it can only print one thing at a time, so you would likely use it as

cat("In cv, x=\n")

print(x)


Trouble-shooting

Another way to understand what is going wrong in a small

function is to simulate it by hand.

Act as you think R would act, and write down the values of all

variables as the function progresses.


Fixing errors and testing

Once you have identified the bug in your program, you need

to fix it.

Try to fix it in such a way that you don't cause a different problem.

Then test what you've done.

You should put together tests that include the inputs you know will reproduce the error, as well as edge cases, and anything else you can think of.

The debug() Function

Rather than using cat() or print() for debugging, R allows you to call the function debug(). This will pause execution of your function, and allow you to examine (or change!) local variables, or execute any other R command, inside the evaluation environment of the function.

The debug() Function

Commands to use with debug() are

n - next; execute the next line of code, single-stepping through the function

c - continue; let the function continue running

Q - quit the debugger

You mark function f for debugging using debug(f), and then

the browser will be called when you enter the function. Turn

off debugging using undebug(f).
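A short sketch of marking and unmarking a function. The function f here is a made-up example; isdebugged() reports whether the debugging flag is currently set. Calling f() while it is marked would open the browser in an interactive session, so the sketch does not do that.

```r
f <- function(x) {
  y <- x^2
  y + 1
}

debug(f)         # mark f: the next call to f() will open the browser
isdebugged(f)    # TRUE
undebug(f)       # remove the mark
isdebugged(f)    # FALSE
f(2)             # runs normally again, giving 5
```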


Example Constructing and Debugging a Function

We will write and debug a function which will compute a confidence interval for the true mean of a population, based on a random sample of size n using the formula

$$\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$$

where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, and the t value is the $1 - \alpha/2$ percentile of the t distribution on $n - 1$ degrees of freedom.

Writing a Confidence Interval Function

Our goal is to write a function which will take input like x

such as some male heights:

> x <- c(170, 185, 177, 160)

Writing a Confidence Interval Function

> ci(x)    # this should print out a 95%

         # confidence interval for the true mean

Solving the Problem

The confidence interval formula requires:

- the sample mean, which we can compute with mean(x)

- the sample standard deviation (sd(x))

- the t percentiles (qt(c(alpha/2, 1-alpha/2), df))

- the square root of n (sqrt(n))

Implementing the Solution

Here is a first attempt at implementing the solution to the

problem:

ci

Testing the Function

Here is a first test for our ci function. Use the data vector of

heights:

> x <- c(170, 185, 177, 160)

> ci(x)

Error in qt(p, df, lower.tail, log.p) :

Non-numeric argument to mathematical function

Something is wrong. One of the arguments to qt is incorrect.


Looking for the Error

We can add a print statement immediately before the call to qt:

ci

Looking for the Error

> ci(x)

$alpha

[1] 0.05

$df

function (x, df1, df2, ncp, log = FALSE)

{

if (missing(ncp))

.Internal(df(x, df1, df2, log))

else .Internal(dnf(x, df1, df2, ncp, log))

}

Error in qt(p, df, lower.tail, log.p) :

Non-numeric argument to mathematical function

The df argument to qt should be set to n-1.
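The printed value of df is a function because no variable named df had been defined, so R found its built-in F density function of the same name. The sketch below reproduces the failure in isolation; it assumes a session in which no variable called df exists.

```r
alpha <- 0.05
is.function(df)   # TRUE: df refers to the F density function

# Passing that function to qt() fails; tryCatch() returns the message as text.
outcome <- tryCatch(qt(1 - alpha / 2, df),
                    error = function(e) conditionMessage(e))
outcome
```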

Another Attempt

ci
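One possible corrected version is sketched below. The argument name alpha, its default of 0.05, and the layout of the body are assumptions; the use of mean(), sd(), qt() and sqrt() with n - 1 degrees of freedom follows the slides.

```r
# A sketch of ci() with the degrees of freedom set to n - 1.
ci <- function(x, alpha = 0.05) {
  n <- length(x)
  tvals <- qt(c(alpha / 2, 1 - alpha / 2), df = n - 1)  # lower, upper t percentiles
  mean(x) + tvals * sd(x) / sqrt(n)
}

x <- c(170, 185, 177, 160)
round(ci(x), 2)   # 156.11 189.89
```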

Checking the Boundary Case

Although we should not compute confidence intervals for

sample sizes less than 2, it might happen by accident:

> ci(3)

[1] NA NA

Warning message:

In qt(p, df, lower.tail, log.p) : NaNs produced


Checking the Boundary Case

Again, we can handle this boundary case with an if statement.

ci
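A sketch of such a guard follows. Returning a pair of missing values without a warning is an assumption about the desired behaviour; stopping with an error message would be an equally reasonable choice.

```r
# ci() with a guard for samples that are too small (n < 2).
ci <- function(x, alpha = 0.05) {
  n <- length(x)
  if (n < 2) {
    return(c(NA_real_, NA_real_))   # assumption: report missing limits quietly
  }
  tvals <- qt(c(alpha / 2, 1 - alpha / 2), df = n - 1)
  mean(x) + tvals * sd(x) / sqrt(n)
}

ci(3)                         # NA NA, with no warning
ci(c(170, 185, 177, 160))     # unchanged for ordinary samples
```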

Our Function Can Now be Used Elsewhere

Now that ci() is a function that is known to work on numeric

vectors of any length, we can call it from other functions.

For example, the following function uses the ci() function to compute confidence intervals for all vectors in a matrix or list, as well as for a single vector.

CI
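A wrapper of this kind might be sketched as follows. The body of CI() is an assumption: a matrix is converted with data.frame() so that each column becomes a list element, and sapply() applies ci() to each element. A compact copy of ci() is included so that the sketch runs on its own.

```r
# Compact copy of ci() (assumed form) so that this sketch is self-contained.
ci <- function(x, alpha = 0.05) {
  n <- length(x)
  if (n < 2) return(c(NA_real_, NA_real_))
  mean(x) + qt(c(alpha / 2, 1 - alpha / 2), df = n - 1) * sd(x) / sqrt(n)
}

# CI: apply ci() to a vector, to each column of a matrix, or to each list element.
CI <- function(x, alpha = 0.05) {
  if (is.matrix(x)) x <- data.frame(x)   # columns become list elements X1, X2, ...
  if (is.list(x)) {
    sapply(x, ci, alpha = alpha)         # one column of limits per element
  } else {
    ci(x, alpha)
  }
}

xy3 <- list(x = c(170, 185, 177, 160),
            y = c(149, 155, 162, 158, 154),
            z = c(170, 185, 177, 160))
CI(xy3)
```

For a list or matrix the result is a matrix with the lower limits in row 1 and the upper limits in row 2.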

Testing the CI Function

This function needs to be tested on vectors, lists (including

data frames) and matrices:

> x # a vector

[1] 170 185 177 160

> CI(x)

[1] 156.11 189.89


Testing with a Matrix

> xy # a matrix

[,1] [,2] [,3] [,4] [,5]

[1,] -0.2799555 0.49433909 -0.76405054 0.30727532 -1.35506713

[2,] 0.7972663 -0.79788501 -0.31684602 -0.63859843 -1.18923510

[3,] 0.8847079 -0.03282889 0.21370405 1.01534939 0.29377499

[4,] -0.2333586 1.46200042 -0.02394421 -0.08885412 -0.08092837

> CI(xy)

X1 X2 X3 X4 X5

[1,] -0.7182852 -1.228931 -0.8927856 -0.9584125 -1.8770214

[2,] 1.3026153 1.791744 0.4472173 1.2559986 0.7112936


Testing with a List

> xy3 # a list

$x

[1] 170 185 177 160

$y

[1] 149 155 162 158 154

$z

[1] 170 185 177 160

> CI(xy3)

x y z

[1,] 156.11 149.6065 156.11

[2,] 189.89 161.5935 189.89
