
  • An introduction to R

    Jerome Goudet

    2015-06-12

    Contents

    Practical 1. A first R tutorial: win the lottery

        Prerequisites
        Introduction
        Questions/Tutorial

    Practical 2: Data manipulation in R: vectors, matrices, lists and data frames

        Introduction
        Questions

    Practical 3: Basic graphing functions, scatter plot and regression

        Introduction
        Questions

    Practical 4: Functions in R

        Introduction
        Questions

    Practical 5: Statistical distributions and functions in R

        Introduction
        Questions

    Practical 6: Tests in R

        Introduction
        Questions

    Practical 7: Packages

  • Practical 1. A first R tutorial: win the lottery

    Prerequisites

    To connect with your own laptop, pick the guest-unil network, connect to the internet and enter the following guest pass:

    INTROR

    Moodle site

    http://moodle2.unil.ch/enrol/instances.php?id=3333

    Enrolment key: IntroR14 (to be confirmed)

    What you must have installed: RStudio, from http://www.rstudio.com/

    Introduction

    This course:

    The goal is not to learn statistics (I expect you to know your basic stats), but to learn how to use R to do stats and other things. The focus will be on loading and manipulating data, writing functions, plotting graphs, using packages and searching for information.

    How it will be run: each half day starts with a short introduction (half an hour to one hour at most), followed by a practical. Four assistants will be here to help you during the practicals. Use them!

    Assessment: active participation and doing the exercises. If you fail to do so, there will be a test on Thursday afternoon.

    Program

    1. Monday am: Introductory tutorial
    2. Monday pm: Data manipulation in R: vectors, matrices, lists and data frames (room 2107)
    3. Tuesday am: Basic graphing functions
    4. Tuesday pm: Functions in R
    5. Wednesday am: Functions in R
    6. Wednesday pm: Stats functions / bootstrap & randomization
    7. Thursday am: Stats function bestiary
    8. Thursday pm: Some packages (ade4, ...) / test if necessary

    Mornings: 9am-12pm. Afternoons: 1pm-5pm.

    Documentation and setting things up

    What is R: http://stat.ethz.ch/CRAN/doc/FAQ/R-FAQ.html

    R editors:

    Why an editor? Syntax highlighting, plus it saves re-typing commands...

    1. For all systems: the integrated environment RStudio (http://www.rstudio.com/). Highly recommended (i.e., install it if not done yet and use it for this course). We will use R Markdown (http://rmarkdown.rstudio.com/) to write notes and produce documents. A short video about RStudio: http://player.vimeo.com/video/97166163?api=1&player_id=player_1

  • 2. For Windows: Tinn-R, Notepad++

    3. For Linux: ESS (Emacs Speaks Statistics), distributed as an Emacs package, Vim, ...

    4. For Mac: TextMate or the online editor

    5. A more extensive list: http://www.sciviews.org/_rgui/projects/Editors.html

    Installing and enhancing R

    1. Binaries for Windows, Mac OS X and several flavours of Linux, as well as the source code, are available from http://stat.ethz.ch/CRAN

    2. Extra packages to extend R are available from the same URL. Currently (Thursday May 28th, 2015), the CRAN package repository features 6692 available packages.

    3. Many packages are also available from GitHub (development versions).

    4. R home page for further information etc. (bookmark this link!): http://www.r-project.org/

    5. Search engine specific to R: http://www.rseek.org/

    6. Much is available on CRAN in the form of tutorials, FAQs, etc.

    7. Books: a lot are listed at http://www.r-project.org/doc/bib/R-books.html

    Questions/Tutorial

    One of the best ways to get acquainted with R is to use it to help you understand a particular set of data. So let us consider data from a lottery, where you might be motivated to perform data analysis. Readers are invited to work through the following familiarization session and see what happens. First-time users may not yet understand every detail, but the best plan is to type what you see and observe what happens as a result.

    This chapter is mainly based on Becker, Chambers and Wilks’s book (1988, The New S Language, Chapter 1).

    The specific data we will look at concerns the New Jersey Pick-It Lottery. Our data is for 254 drawings just after the lottery was started, from May, 1975 to March, 1976. Pick-It is a parimutuel game, meaning that the winners share a fraction of the money taken in for the particular drawing. Each ticket costs fifty cents and at the time of purchase the player picks a three-digit number ranging from 000 to 999. The money bet during the day is placed in a prize pool and anyone who picked the winning number shares equally in the pool. The data is in the file lottery.txt

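    It can be loaded in R using the following command:

    data <- read.table("lottery.txt", header=TRUE) # lottery.txt must be in the working directory
    head(data) # print the first six drawings

    ##   number payoffs
    ## 1    810   190.0
    ## 2    156   120.5
    ## 3    140   285.5
    ## 4    542   184.0
    ## 5    507   384.5
    ## 6    972   324.5

    The data available gives for each drawing the winning number and the payoff for a winning ticket. The winning numbers and the corresponding payoffs are: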

    data$number # print the winning numbers

    ## [1] 810 156 140 542 507 972 431 981 865 499 20 123 356 15 11 160 507
    ## [18] 779 286 268 698 640 136 854 69 199 413 192 602 987 112 245 174 913
    ## [35] 828 539 434 357 178 198 406 79 34 89 257 662 524 809 527 257 8
    ## [52] 446 440 781 615 231 580 987 391 267 808 258 479 516 964 742 537 275
    ## [69] 112 230 310 335 238 294 854 309 26 960 200 604 841 659 735 105 254
    ## [86] 117 751 781 937 20 348 653 410 468 77 921 314 683 0 963 122 18
    ## [103] 827 661 918 110 767 761 305 485 8 808 648 508 684 879 67 282 928
    ## [120] 733 518 441 661 219 310 771 906 235 396 223 695 499 42 230 623 300
    ## [137] 380 646 553 182 158 744 894 689 978 314 337 226 106 299 947 896 863
    ## [154] 239 180 764 849 87 975 92 701 402 1 884 750 236 395 999 744 714
    ## [171] 253 711 863 496 214 430 107 781 954 941 416 243 480 111 47 691 616
    ## [188] 253 477 11 114 133 293 812 197 358 7 996 842 255 374 693 383 99
    ## [205] 474 333 467 515 357 694 919 424 274 913 919 245 964 472 935 434 170
    ## [222] 300 476 528 403 677 559 187 652 319 582 541 16 981 158 945 72 167
    ## [239] 77 185 209 893 346 515 555 858 434 541 411 109 761 767 597 479

    data$payoffs # print the payoffs

    ## [1] 190.0 120.5 285.5 184.0 384.5 324.5 114.0 506.5 290.0 869.5 668.5
    ## [12] 83.0 188.0 449.0 289.5 212.0 466.0 548.5 260.0 300.5 556.5 371.5
    ## [23] 112.5 254.5 368.0 510.0 102.0 206.5 261.5 361.0 167.5 187.0 146.5
    ## [34] 205.0 348.5 283.5 447.0 102.5 219.0 292.5 343.0 332.5 532.5 445.5
    ## [45] 127.0 557.5 203.5 373.5 142.0 230.5 482.5 512.5 330.0 273.0 171.0
    ## [56] 178.0 463.5 476.0 290.0 176.0 195.0 159.5 296.0 177.5 406.0 182.0
    ## [67] 164.5 137.0 191.0 298.0 110.0 353.0 192.5 308.5 287.0 203.5 377.5
    ## [78] 211.5 342.0 259.0 231.0 348.0 159.0 130.5 176.0 128.5 159.0 290.0
    ## [89] 335.0 514.0 191.0 304.5 167.0 257.0 640.0 142.0 146.0 356.0 96.0
    ## [100] 295.0 237.0 312.5 215.0 442.5 127.0 127.0 756.0 228.5 132.0 256.0
    ## [111] 374.5 262.5 286.5 264.0 380.5 357.5 478.5 511.5 218.0 353.0 162.5
    ## [122] 184.0 548.0 166.5 147.5 240.0 386.0 130.5 287.5 230.0 480.5 247.5
    ## [133] 380.0 238.5 237.5 214.5 394.5 416.5 392.5 244.5 202.0 371.5 553.0
    ## [144] 293.5 295.0 178.0 334.5 226.0 194.0 388.5 353.0 404.0 348.0 163.5
    ## [155] 216.5 283.0 388.5 567.5 250.5 478.0 267.5 326.5 369.0 512.5 341.0
    ## [166] 188.5 386.0 239.0 480.5 105.0 227.0 130.5 384.5 294.5 154.0 324.0
    ## [177] 116.0 229.0 301.5 334.0 143.5 212.0 448.0 126.5 417.5 276.5 303.0
    ## [188] 211.0 373.0 209.5 207.5 195.0 317.0 170.5 230.0 143.0 361.0 452.0
    ## [199] 260.5 308.5 206.0 256.5 291.0 421.5 295.5 119.5 268.5 221.0 151.5
    ## [210] 314.5 313.5 323.5 204.0 241.0 637.0 214.0 348.0 191.5 384.0 220.0
    ## [221] 285.5 335.0 251.5 131.5 328.0 392.0 509.0 235.5 249.5 129.5 303.0
    ## [232] 201.5 365.0 346.5 210.5 334.0 376.5 215.5 312.0 239.5 221.0 388.0
    ## [243] 154.5 268.5 127.0 537.5 427.5 272.0 197.0 167.5 292.0 170.0 486.5
    ## [254] 262.0

  • From these instructions we notice that, for the first drawing, the winning number was 810 and it paid $190.00 to each winning ticket holder. In what follows we will examine the data. Numerical summaries provide a statistical synopsis of the data in tabular form. One function that produces them is summary. The following displays a summary of the lottery payoffs:

    summary(data$payoffs)

    ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    ##    83.0   194.2   270.2   290.4   364.0   869.5

    We read from this that the mean payoff was $290.4, that the payoffs ranged from $83 to $869.5 and that 50% of all payoffs lay between $194.2 and $364. The quantiles of a data set x can also be computed with quantile(x, probs), where probs is a vector of probabilities between 0 and 1. For example,

    quantile(data$payoffs, c(.25, .75))

    ##    25%    75%
    ## 194.25 364.00

    This means that 25% of the payoffs are less than $194.25, and 75% of the payoffs are less than $364. A better way to understand the data is to look at it graphically. With RStudio, a graphics window is already open, but if you are using classic R and want a new one under Windows, just type:

    windows()

    A separate window should appear on your screen. If you have a Mac, type:

    quartz()

    or, under Unix/Linux:

    X11()

    To detect long-term irregularities in our data, we will look at the winning numbers to see if they appear to be chosen at random. To do so we can produce a histogram of the lottery numbers:

    hist(data$number)

    [Figure: Histogram of data$number. x-axis: data$number (0 to 1000); y-axis: Frequency (0 to 30).]

    The histogram should show on your screen. Since there are 254 drawings and 10 bars, the count in each bar should be approximately 25. It looks fairly flat, no need to inform a jury. Of course, most of our attention will probably be directed at the payoffs. Elementary probabilistic reasoning tells us that a single number we pick has a 1 in 1000 chance of winning. If we play many times, we expect about 1 winning number per 1000 plays. Since a ticket costs fifty cents, 1000 plays will cost $500, so a winning ticket has to pay more than $500 for us to come out ahead. A histogram of the payoffs:

    hist(data$payoffs)

    [Figure: Histogram of data$payoffs. x-axis: data$payoffs (0 to 800); y-axis: Frequency (0 to 80).]
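    The break-even reasoning (a ticket costs fifty cents and one win is expected per 1000 plays) can be sketched in a few lines. The vector payoffs.sample below holds only the first eleven payoffs printed earlier, used as a stand-in so that the block runs without lottery.txt; the names ticket.price, n.plays, cost and prop.above are illustrative assumptions, not part of the course material.

```r
ticket.price <- 0.5              # dollars per ticket
n.plays <- 1000                  # plays needed for one expected win
cost <- ticket.price * n.plays   # total stake for those plays
cost

# first eleven payoffs from the listing above, as a small stand-in sample
payoffs.sample <- c(190.0, 120.5, 285.5, 184.0, 384.5, 324.5,
                    114.0, 506.5, 290.0, 869.5, 668.5)

# proportion of these payoffs that exceed the cost of 1000 plays
prop.above <- mean(payoffs.sample > cost)
prop.above
```

    With the full data frame loaded, the same idea would read mean(data$payoffs > 500).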

    The figure shows that payoffs range from less than $100 to more than $800. The bulk of the payoffs are between $100 and $400, but there were a number of payoffs larger than $500. Perhaps we have a chance. The widely varying payoffs are primarily due to the parimutuel betting in the lottery: if you win when few others win, you will get a large payoff. If you are unlucky enough to win along with many others, the payoff may be relatively small.

    Let us see what the largest and the smallest payoffs and corresponding winning numbers were:

    max(data$payoffs) # the largest payoff

    ## [1] 869.5

    data$number[data$payoffs==max(data$payoffs)]

    ## [1] 499

    min(data$payoffs) # the smallest payoff

    ## [1] 83

    data$number[data$payoffs==min(data$payoffs)]

    ## [1] 123
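    An alternative way to find these two drawings is sketched below with which.max and which.min, which return the position of the extreme value rather than the value itself. The data frame lottery.sample is an assumption made for illustration: it contains only the first twelve drawings printed above, so that the block runs without lottery.txt.

```r
# first twelve drawings from the listing above, as a stand-in for the full data
lottery.sample <- data.frame(
  number  = c(810, 156, 140, 542, 507, 972, 431, 981, 865, 499, 20, 123),
  payoffs = c(190.0, 120.5, 285.5, 184.0, 384.5, 324.5,
              114.0, 506.5, 290.0, 869.5, 668.5, 83.0)
)

i.max <- which.max(lottery.sample$payoffs)  # position of the largest payoff
i.min <- which.min(lottery.sample$payoffs)  # position of the smallest payoff

lottery.sample[c(i.max, i.min), ]           # number and payoff of both drawings
```

    Note that which.max and which.min return only the first position in case of ties, whereas the comparison with == used above returns every drawing that reaches the extreme value.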

  • Winners who bet on ‘123’ must have been disappointed; $83 is not a very large payoff. On the other hand, $869.50 is very nice. Since the winning numbers and the payoffs come in pairs, a number and a payoff for each drawing, we can produce a scatterplot of the data to see if there is any relationship between the payoff and the winning number. R provides a generic plotting function, plot, which produces different kinds of plots depending on the data passed to it. In its most common use, it produces a scatterplot of two numeric objects:

    plot(data$number, data$payoffs)

    [Figure: Scatterplot of data$payoffs (y-axis) against data$number (x-axis, 0 to 1000).]

    What do you see in the graph? Does the payoff seem to depend on the position of the winning number? Perhaps it would help to add a ‘middle’ line that follows the overall pattern of the data:

    plot(data$number, data$payoffs)
    lines(lowess(data$number, data$payoffs, f=.2))

    [Figure: Scatterplot of data$payoffs against data$number, with the smooth curve superimposed.]

    This command superimposes a smooth curve on the winning number and payoff scatterplot. Can you see the interesting characteristics now?

    There are substantially higher payoffs for numbers with a leading zero, meaning fewer people bet on these numbers. Perhaps that reflects people’s reluctance to think of numbers with leading zeros. After all, no one writes $010 on a ten dollar check! Also note that, except for the numbers with leading zeros, payoffs seem to increase as the winning number increases.

    It would be interesting to see exactly what numbers correspond to the large payoffs. Fortunately, with an interactive graphical input device, we can do that by simply pointing at the ‘outliers’:

    plot(data$number, data$payoffs)
    lines(lowess(data$number, data$payoffs, f=.2))
    identify(data$number, data$payoffs, data$number) # might not work on all platforms

    [Figure: Scatterplot of data$payoffs against data$number, with the smooth curve superimposed.]

    ## integer(0)

    To identify a payoff with its corresponding winning number, just click on a point using the left mouse button. Once you have pointed out the ‘outliers’, press the escape key. Can you see the pattern in the numbers with very high payoffs?

    As a hint, the lottery has a mode of betting called ‘combination bets’, where players win if the digits in their number appear in any order (ticket 123 would win on 321, 231). The pattern in the numbers is that most of the numbers with high payoffs have duplicate digits. This results from the fact that payoffs for the numbers with duplicate digits are not shared with combination betters, and thus are higher.
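    Whether a winning number has duplicate digits can also be checked in code. The helper below is a minimal sketch, not part of the original tutorial; the function name has.duplicate.digit is an assumption chosen for illustration. Numbers are padded to three digits first, so that 20 is treated as ‘020’.

```r
# TRUE if the three-digit form of a winning number repeats a digit
has.duplicate.digit <- function(number) {
  padded <- sprintf("%03d", as.integer(number))   # keep leading zeros: 20 -> "020"
  digits <- strsplit(padded, "")                  # one character vector per number
  sapply(digits, function(d) any(duplicated(d)))
}

has.duplicate.digit(c(499, 123, 20, 810))
```

    Applied to the full data, something like data$number[has.duplicate.digit(data$number)] would list the winning numbers with repeated digits.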

    Another way to look for ‘outliers’ is to make a boxplot of the data. We noticed before that the payoffs seem to depend on the first digit of the winning number. So it would be interesting to draw boxes for the ten subsets of payoffs in a single plot to study this phenomenon graphically. Rather than extracting each set separately, we use the R function split to create a list, where each element of the list gives all of the payoffs that correspond to a particular first digit of the winning number. The boxplot function will draw a box for each element in the list.

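    digit <- data$number %/% 100 # first digit of each winning number (0 for numbers below 100)
    boxplot(split(data$payoffs, digit), xlab="First Digit of Winning Number", ylab="Payoff")

    [Figure: Boxplots of Payoff (y-axis) for each First Digit of Winning Number (x-axis, 0 to 9).]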

    The box in a boxplot contains the middle half of the data; the whiskers extending from the box reach to the most extreme non-outlier; outlying points are plotted individually. Notice the high payoffs for the first box. The graphic also shows us that it is rare for a payoff to exceed $500. So, place your bet if you enjoy gambling. Do not expect to win.
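    The grouping step behind the boxplot can be tried on a handful of drawings. This is a sketch under stated assumptions: the two vectors hold only the first twelve drawings printed earlier, and the names first.digit and by.digit are illustrative. Integer division by 100 is one way (among others) to obtain the first digit.

```r
# twelve drawings from the listing above, as a stand-in for the full data
number  <- c(810, 156, 140, 542, 507, 972, 431, 981, 865, 499, 20, 123)
payoffs <- c(190.0, 120.5, 285.5, 184.0, 384.5, 324.5,
             114.0, 506.5, 290.0, 869.5, 668.5, 83.0)

first.digit <- number %/% 100             # 20 is "020", so its first digit is 0
by.digit <- split(payoffs, first.digit)   # list: one vector of payoffs per digit

names(by.digit)                           # digits present in this sample
sapply(by.digit, median)                  # median payoff for each first digit
```

    Passing by.digit to boxplot would draw one box per element of the list.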

  • Practical 2: Data manipulation in R: vectors, matrices, lists and data frames

    Introduction

    You should be able to answer the questions at the bottom easily after having read and understood the text below. A piece of advice: type the examples in R, it helps! The basic objects in R are vectors, matrices, lists and data frames.

    Vectors

    R operates on named data structures. The simplest structure is a vector, which is a single entity consisting of an ordered collection of numbers. For instance,

    x.vector

  • y.vector

  • x.vector[3]

    ## [1] 7.8

    and the instruction

    x.vector[1:3]

    ## [1] 5.1 6.3 7.8

    extracts the first 3 elements of the vector. Two useful functions are sum and prod, which return the sum and the product of the elements of a vector:

    sum(x.vector)

    ## [1] 39

    prod(x.vector)

    ## [1] 24472.46
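    The definition of x.vector is not reproduced above. A vector consistent with every output shown (third element 7.8, first three elements 5.1, 6.3 and 7.8, sum 39, product 24472.46) is sketched below; the last two values are an assumption inferred from the sum and the product, and their order is a guess.

```r
x.vector <- c(5.1, 6.3, 7.8, 9.3, 10.5)  # assumed values, chosen to match the outputs above

x.vector[3]      # third element
x.vector[1:3]    # first three elements
x.vector[-1]     # everything except the first element
sum(x.vector)    # sum of all elements
prod(x.vector)   # product of all elements
```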

    Vectors can be made of logical elements:

    c(TRUE,FALSE,TRUE,TRUE,FALSE,TRUE)

    ## [1] TRUE FALSE TRUE TRUE FALSE TRUE

    c(T,F,T,T,F,T)

    ## [1] TRUE FALSE TRUE TRUE FALSE TRUE

    where T stands for TRUE and F for FALSE. Beware, it is dangerous to use these shortcuts: unlike TRUE and FALSE, T and F are ordinary variables that can be overwritten.
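    The danger of the T and F shortcuts can be seen in a short sketch. The variable names shortcut and safe are illustrative assumptions.

```r
T <- 0                        # T is an ordinary variable and can be reassigned
shortcut <- c(T, F, T)        # now a numeric vector of zeros, not a logical vector
shortcut
is.logical(shortcut)          # FALSE

safe <- c(TRUE, FALSE, TRUE)  # TRUE and FALSE are reserved words and cannot be reassigned
is.logical(safe)              # TRUE

rm(T)                         # remove our variable so that T means TRUE again
```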

    We can also have character strings

    c("This","is","my","first","R","course")

    ## [1] "This" "is" "my" "first" "R" "course"

    For a sequence of consecutive integers, use ‘:’:

    1:10

    ## [1] 1 2 3 4 5 6 7 8 9 10

    If you want it in steps of 2, type:


  • 2*1:10

    ## [1] 2 4 6 8 10 12 14 16 18 20

    This works because ‘:’ has higher precedence than ‘*’. To obtain the reverse sequence:

    10:1

    ## [1] 10 9 8 7 6 5 4 3 2 1
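    The precedence of ‘:’ over arithmetic operators is a common source of surprises. The lines below are a small sketch comparing expressions with and without parentheses.

```r
2 * 1:10      # ':' is evaluated first, so this is 2 * (1:10)
(2 * 1):10    # forcing the multiplication first gives 2, 3, ..., 10
1:10 - 1      # same rule for '-': (1:10) - 1, i.e. 0 to 9
1:(10 - 1)    # 1 to 9
```

    When in doubt, add parentheses.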

    A more general function to obtain a sequence is the function seq (try ?seq):

    seq(-2*pi,3*pi,pi/4)

    ## [1] -6.2831853 -5.4977871 -4.7123890 -3.9269908 -3.1415927 -2.3561945
    ## [7] -1.5707963 -0.7853982 0.0000000 0.7853982 1.5707963 2.3561945
    ## [13] 3.1415927 3.9269908 4.7123890 5.4977871 6.2831853 7.0685835
    ## [19] 7.8539816 8.6393798 9.4247780
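    seq can be driven either by a step size or by the number of values wanted. The sketch below names the arguments explicitly; by.step and by.count are illustrative variable names.

```r
by.step  <- seq(-2 * pi, 3 * pi, by = pi / 4)   # fixed step, as in the call above
by.count <- seq(0, 1, length.out = 5)           # fixed number of values instead

length(by.step)   # number of values produced
by.count          # 0.00 0.25 0.50 0.75 1.00
```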

    And another useful function is rep:

    rep(1,10)

    ## [1] 1 1 1 1 1 1 1 1 1 1

    rep(1:5,2)

    ## [1] 1 2 3 4 5 1 2 3 4 5

    Two other useful functions are ‘sort’ and ‘order’:

    y
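    A minimal sketch of sort and order follows. The values of y are invented for illustration and are not those of the original example.

```r
y <- c(3.2, 1.5, 4.8, 2.1)   # invented values, for illustration only

sort(y)                      # the values in increasing order
order(y)                     # the positions that would sort y: 2 4 1 3
y[order(y)]                  # same result as sort(y)
sort(y, decreasing = TRUE)   # decreasing order
```

    order is the more flexible of the two: the positions it returns can be used to reorder another vector, or the rows of a data frame, in the same way.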

  • m1

    ##      [,1] [,2] [,3] [,4]
    ## [1,]    1    4    7   10
    ## [2,]    2    5    8   11
    ## [3,]    3    6    9   12
    ## [4,]    1    2    3    4
    ## [5,]    5    6    7    8
    ## [6,]    9   10   11   12

    cbind(m1,m2)

    ##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
    ## [1,]    1    4    7   10    1    2    3    4
    ## [2,]    2    5    8   11    5    6    7    8
    ## [3,]    3    6    9   12    9   10   11   12

    Elements from a matrix are extracted using square brackets:

    m1[3,2]

    ## [1] 6

    m1[2,3]

    ## [1] 8

    We can access a full row

    m1[2,]

    ## [1] 2 5 8 11

    or column:

    m1[,2]

    ## [1] 4 5 6

    We can also remove a row or a column from a matrix:

    m1[-2,]

    ##      [,1] [,2] [,3] [,4]
    ## [1,]    1    4    7   10
    ## [2,]    3    6    9   12

    m1[,-2]

    ##      [,1] [,2] [,3]
    ## [1,]    1    7   10
    ## [2,]    2    8   11
    ## [3,]    3    9   12

    Elementwise multiplication of two matrices can be carried out with ‘*’:

    m1*m2

    ##      [,1] [,2] [,3] [,4]
    ## [1,]    1    8   21   40
    ## [2,]   10   30   56   88
    ## [3,]   27   60   99  144

    And the classic matrix product is obtained using ‘%*%’:

    m1 %*% t(m2)

    ##      [,1] [,2] [,3]
    ## [1,]   70  158  246
    ## [2,]   80  184  288
    ## [3,]   90  210  330

    t(m1) %*% m2

    ##      [,1] [,2] [,3] [,4]
    ## [1,]   38   44   50   56
    ## [2,]   83   98  113  128
    ## [3,]  128  152  176  200
    ## [4,]  173  206  239  272

    where ‘t’ stands for transpose (by the way, why don’t we take the product m1 %*% m2?).
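    The question about m1 %*% m2 can be explored safely with tryCatch, which captures the error instead of stopping. This is a sketch assuming m1 and m2 are the 3 by 4 matrices reconstructed from the outputs above; the variable name result is illustrative.

```r
m1 <- matrix(1:12, nrow = 3)                 # 3 x 4
m2 <- matrix(1:12, nrow = 3, byrow = TRUE)   # 3 x 4

# a matrix product needs ncol(m1) == nrow(m2); here 4 != 3, so R signals an error
result <- tryCatch(m1 %*% m2, error = function(e) conditionMessage(e))
result                # the error message, as a character string

dim(m1 %*% t(m2))     # (3 x 4) times (4 x 3) gives 3 x 3
dim(t(m1) %*% m2)     # (4 x 3) times (3 x 4) gives 4 x 4
```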

    Lists

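    A list in R is an object containing an ordered collection of objects, its components. Objects in a list do not have to be of the same type. A new list is created using ‘list’:

    (my.list <- list("first element"=TRUE, "second element"="Coucou", third=c(4,3), fourth=9.99))

    ## $`first element`
    ## [1] TRUE
    ##
    ## $`second element`
    ## [1] "Coucou"
    ##
    ## $third
    ## [1] 4 3
    ##
    ## $fourth
    ## [1] 9.99

  • To get to a specific element of a list, double square brackets ‘[[]]’ are used: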

    my.list[[2]]

    ## [1] "Coucou"

    Note that to obtain several elements, we use single square brackets:

    my.list[1:3]

    ## $`first element`
    ## [1] TRUE
    ##
    ## $`second element`
    ## [1] "Coucou"
    ##
    ## $third
    ## [1] 4 3

    Last, it is convenient to call an element in a list by its name rather than its position, using ‘$’ followed by the first distinguishing characters of the element’s name:

    my.list$fir

    ## [1] TRUE

    (here, the first 2 letters would have been sufficient)

    my.list$s #second element

    ## [1] "Coucou"
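    Partial matching with ‘$’ is convenient at the console but can be fragile. The sketch below contrasts it with ‘[[ ]]’, which matches names exactly by default. The list is rebuilt here from the names and values printed in this section so that the block is self-contained.

```r
my.list <- list("first element" = TRUE, "second element" = "Coucou",
                third = c(4, 3), fourth = 9.99)

my.list$fi                       # unique partial match on "first element": TRUE
my.list$f                        # ambiguous ("first element" or "fourth"): NULL
my.list[["fi"]]                  # [[ ]] requires the exact name by default: NULL
my.list[["fi", exact = FALSE]]   # partial matching on request: TRUE
```

    In scripts, spelling out the full name avoids silent NULL results when a new element with a similar name is added later.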

    The names of the elements in a list can be obtained with ‘names’:

    names(my.list)

    ## [1] "first element" "second element" "third" "fourth"

    and the internal structure of a list can be obtained with ‘str’:

    str(my.list)

    ## List of 4
    ##  $ first element : logi TRUE
    ##  $ second element: chr "Coucou"
    ##  $ third         : num [1:2] 4 3
    ##  $ fourth        : num 9.99

  • Data frames

    Data frames can be considered as lists of data vectors of the same length, but not necessarily of the same type (contrary to matrices). The main idea behind data frames is to group data corresponding to a unit of observation in a single structure. It’s similar in concept to your familiar Excel spreadsheet. Elements from a data frame can be accessed in the same way as those of a list or of a matrix:

    a

  • dataf$D[dataf$D==TRUE]

    ## [1] TRUE TRUE TRUE TRUE TRUE TRUE

    To select the elements of A for which D is equal to TRUE:

    dataf$A[dataf$D==TRUE]

    ## [1] 1 2 3 4 5 6

    And to select all the rows of the data frame for which D is TRUE, with all their columns:

    dataf[dataf$D==TRUE,] #why the comma?

    ##   A  B C    D
    ## 1 1 12 a TRUE
    ## 2 2 11 b TRUE
    ## 3 3 10 c TRUE
    ## 4 4  9 d TRUE
    ## 5 5  8 e TRUE
    ## 6 6  7 f TRUE
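    Row and column selection in a data frame can be combined in one call. The sketch below uses an invented data frame (the original dataf is not reproduced above, so the two extra rows with D equal to FALSE are an assumption made for illustration).

```r
dataf <- data.frame(A = 1:8, B = 12:5, C = letters[1:8],
                    D = c(rep(TRUE, 6), FALSE, FALSE))  # invented example data

dataf[dataf$D, ]              # rows where D is TRUE, all columns
dataf[dataf$D, c("A", "C")]   # same rows, only columns A and C
dataf[!dataf$D, "B"]          # column B for the rows where D is FALSE
subset(dataf, D & A > 3)      # subset() is an alternative for row selection
```

    The part before the comma selects rows and the part after it selects columns; leaving one side empty keeps everything on that side.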

    An extremely useful command is read.table, which loads a text file into the environment and stores it as a data frame.

    If, for instance, you want to read the content of the file ‘monfichier.txt’ in R, you just need to type the command:

    my.data

  • write.table(my.data,"mydata.txt",row.names=FALSE,quote=FALSE,sep="\t")
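    A complete write and read round trip can be sketched as follows. The data frame and the temporary file are invented for illustration; the write.table options are those used above, and the read.table options (header=TRUE and sep) are chosen to match the file just written.

```r
my.data <- data.frame(id = 1:3, species = c("a", "b", "c"),
                      weight = c(10.5, 12.1, 9.8))   # invented example data

out.file <- tempfile(fileext = ".txt")   # temporary file, so nothing is overwritten
write.table(my.data, out.file, row.names = FALSE, quote = FALSE, sep = "\t")

# header = TRUE: the first line holds the column names; sep must match the file
reloaded <- read.table(out.file, header = TRUE, sep = "\t")
str(reloaded)
```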

    Questions

    Answer questions 0 to 12 below:

    0. Why should I avoid calling objects c, T or F?

    1. Clean your R environment by issuing the instruction rm(list=ls())

    2. For the following numbers: 7.3, 6.8, 0.005, 9, 12, 2.4, 18.9, 0.9

    (a) find their mean;
    (b) subtract it from all the numbers;
    (c) get their square root;
    (d) print the numbers that are larger than their square root.

    3. Use seq and rep to generate the following vectors:

    (a) 1; 2; 3; 4; 1; 2; 3; 4; 1; 2; 3; 4; 1; 2; 3; 4;
    (b) 10; 10; 10; 9; 9; 9; 8; 8; 8; 7; 7; 7;
    (c) 1; 2; 2; 3; 3; 3; 4; 4; 4; 4; 5; 5; 5; 5; 5;
    (d) 1; 1; 3; 3; 5; 5; 7; 7; 9; 9;

    4. Create the following matrix, as simply as possible:

    3 1 1 1
    1 3 1 1
    1 1 3 1
    1 1 1 3

    4a. Multiply this matrix elementwise with itself. Get its matrix product by itself.

    5. Solve the following system (have a look at solve):

    2x + 3y - z = 4
    5x - 10y + 2z = 0
    x + y - 4z = 5

    6. Create a list mylist with the following elements:

    + numbers 1 to 5. Call it nb1
    + numbers 1 to 50 by 0.375 steps. Call it nb3
    + the string "My mummy told me". Call it char1
    + a vector made of the strings "apples", "bananas", "pears". Call it char2
    + the boolean vector FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE. Call it bool

    7. What’s the list’s length? Calculate the mean of the first element; of the second; of the first and second.

    8. Paste the third and fourth elements of the list (use paste). Which instruction would give the sentence “My mummy told me: apples, bananas, pears” (be careful, punctuation is important!)?

    9. Create a data frame with the data contained in the file parus1.txt located at http://www2.unil.ch/popgen/teaching/R11

    10. Create a vector data.fly containing the observations from alldat for which treat is mouche. Add to data the element: “puce”, 12.7300

    11. Try loading the file alien.txt (from http://www2.unil.ch/popgen/teaching/R11). What is wrong with this file? Fix it.

    12. Try loading the file class2005.xls. What needs to be done before you can load it into R?

  • Practical 3: Basic graphing functions, scatter plot and regression

    Introduction

    Plots

    We have already met some of the graphic functions in R (hist, boxplot, ...). We will now look at the plot function in more detail.

    If you are not using RStudio, you first need to open a graphics window: on Windows, use windows(); on Mac, use quartz(); on Unix/Linux, use X11().

    Plots can also be sent directly to files, without being displayed on screen. Available devices are pdf(), postscript(), png() and jpeg(), among others. Once you have opened such a device, you need to close it with the function dev.off(). PDF and PostScript files can have multiple pages, but not the others.
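    Sending plots to a file follows an open, draw, close pattern. The sketch below writes a two-page PDF to a temporary file; the file name and the two plots are invented for illustration.

```r
plot.file <- tempfile(fileext = ".pdf")   # temporary file name, for illustration

pdf(plot.file)            # open the device: nothing appears on screen
plot(1:10, (1:10)^2)      # first page
plot(1:10, sqrt(1:10))    # second page (pdf supports several pages)
dev.off()                 # close the device; the file is only complete after this

file.exists(plot.file)
```

    Forgetting dev.off() is a common cause of empty or unreadable output files.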

    For drawing any function, use plot:

    x

    [Figure: f(x) (y-axis) plotted against x (x-axis, -10 to 10).]

    x

    [Figure: exp(f(x)) (y-axis, 0.0 to 0.8) plotted against x (x-axis, 0 to 4).]

    x

    [Figure: exp(f(x)) (y-axis, 0.0 to 0.8) plotted against x (x-axis, 0 to 4).]

    By default, plot draws points. Using the type option, we can specify whether we want points and lines (type="b"), a line (type="l") or nothing (type="n"):

    x

    [Figure: (x^3 + 2 * x^2 - 3 * x + 1)/(2 * x^2 + 5 * x - 12) (y-axis) plotted against x (x-axis, -10 to 10).]

    plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),type="l")

    [Figure: (x^3 + 2 * x^2 - 3 * x + 1)/(2 * x^2 + 5 * x - 12) (y-axis) plotted against x (x-axis, -10 to 10), drawn as a line.]

    plot(x,(x^3+2*x^2-3*x+1)/(2*x^2+5*x-12),type="n")

    [Figure: axes for (x^3 + 2 * x^2 - 3 * x + 1)/(2 * x^2 + 5 * x - 12) against x (-10 to 10), with nothing drawn inside.]

    If you want to see the 2 graphs in one window, you can specify the number of panels to show, using the par function (see ?par):

    par(mfrow=c(1,2))
    x

    [Figure: two panels side by side, each showing (x^3 + 2 * x^2 - 3 * x + 1)/(2 * x^2 + 5 * x - 12) (y-axis) against x (x-axis, -10 to 10).]

    par(mfrow=c(1,1))#to go back to one panel

    We can also add several functions to the same graph. After a first call to plot, which draws the first element and sets up the axes, labels, etc., you can use the functions lines, text or points:

    x

    [Figure: f(x) (y-axis, 0 to 10) plotted against x (x-axis, 0.0 to 2.0); one of the added series is drawn with the character 3.]

    Note that successive calls to plot erase each other in turn: each new call replaces the previous graph.
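    The difference between starting a new graph and adding to the current one can be sketched as follows. The functions drawn are invented for illustration, and pdf(NULL) is used only so that the block runs without opening a window or creating a file.

```r
pdf(NULL)                                   # draw without a window and without writing a file
x <- seq(0, 2, by = 0.05)

plot(x, exp(x), type = "l", ylab = "f(x)")  # first call: sets up axes and labels
lines(x, x^2, lty = 2)                      # adds a dashed curve to the same graph
points(x, x^3, pch = 3)                     # adds points
text(0.5, 6, "three functions")             # adds a label at position (0.5, 6)

dev.off()                                   # close the device
```

    Only the first call uses plot; calling plot again for the second curve would start a fresh graph.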

    plot is of course useful for showing a scatter plot of 2 variables, but it can also be used for a single vector:

    a

    [Figure: a (y-axis, -2 to 2) plotted against Index (x-axis, 0 to 100).]

    Other useful plots are histograms, bar plots and pie charts (for categorical data):

    hist(a)

    [Figure: Histogram of a. x-axis: a (-2 to 3); y-axis: Frequency (0 to 20).]

    categ

    [Figure: bar plot of the counts (0 to 40) for the categories black, red and white.]

    pie(tcateg,col=c("black","red","white"))

    [Figure: pie chart with the slices black, red and white.]

    The list of colors predefined in R is given by colors().

    For a tour of R’s graphics abilities, type:

    demo(graphics)

    Variance, covariances, correlations

    Variances, covariances and correlations are obtained using var, cov and cor respectively. These functions can be used to obtain the correlation between 2 vectors, or between all variables (columns) of a data frame:

    cor(x,y);cor(d);

    These functions don’t like missing data. If you have such data, you need to specify how to handle them. You can either remove every row containing a missing observation (use="complete.obs"), or drop missing values only for the pairs of variables in which they turn up (use="pairwise.complete.obs"). Compare the results of the following instructions:

    x

    cor(d, use="pairwise.complete.obs")

    ##    x          y          z
    ## x  1  1.0000000 -1.0000000
    ## y  1  1.0000000 -0.9561118
    ## z -1 -0.9561118  1.0000000
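    The effect of the use argument can be tried on a small data frame with missing values. The data below are invented for illustration (the original d is not reproduced above), so the numbers differ from the output shown.

```r
d <- data.frame(x = c(1, 2, 3, 4, NA),
                y = c(2, 4, 6, NA, 10),
                z = c(5, 3, NA, 2, 1))     # invented data with missing values

cor(d)                                     # default: pairs involving an NA give NA
cor(d, use = "complete.obs")               # only rows 1 and 2 have no NA at all
cor(d, use = "pairwise.complete.obs")      # each pair uses every row where both values are present
```

    With "complete.obs" only two rows remain here, so every correlation is exactly 1 or -1; with "pairwise.complete.obs" each pair keeps three rows, and the correlation between x and z is no longer exactly -1.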

    These functions return a single object (a number or a matrix). More complex functions, for instance lsfit (?lsfit) or lm (?lm), return a list of elements. To access each element individually, use $. For instance:

    fit
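    A minimal sketch of accessing the elements returned by lm follows. The data are invented for illustration and the original example may have used lsfit or different variables.

```r
x <- 1:10
y <- 3 + 2 * x + c(0.1, -0.2, 0.05, 0.3, -0.1, 0, 0.2, -0.3, 0.1, -0.15)  # invented data

fit <- lm(y ~ x)          # fit is a list with many elements
names(fit)                # includes "coefficients", "residuals", "fitted.values", ...
fit$coefficients          # intercept and slope
head(fit$residuals, 3)    # first three residuals
```

    The accessor functions coef(fit), residuals(fit) and fitted(fit) return the same elements.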

  • f. Produce a box plot of weight as a function of the smoking status and sexg. Produce a two panels graph (a panel for males and the other for females) of the scatterplot of

    weight versus height. Differentiate smokers from non smokers using a color argumenth. If you have the time and envy, explore the package ggplot2 and find the commands necessary to

    produce the previous plots with it.

    2. What is the correlation between weight and size in men? women? And the covariance?3. load the file http://www.unil.ch/popgen/teaching/R11/scatt.txt in a four panels windows (2 times 2)

    draw the scatterplot of y1,y2,y3 and y4 as a function of x. what relation between the ys and x do theyshow? What is the correlation between x and y1, y2, y3 and y4?

    4. Use abline to add a line that best predict the different ys as a function of x (you’ll have to redrawthe preceding graphs). Use the lsfit or the lm function , and its element coeff to obtain the leastsquare/linear model best fit.

    5. what are the 10th and 90th percentile of y5? plot the scatter diagram of y5 versus x. What type ofrelation exists between the 2? use lsfit and abline to produce the best linear fit line.

    6. Store the predicted value of y5 in a vector y5.pred. what are the residuals of the regression of y5 on x?store this in y5.res

    7. plot the residuals as a function of x, and comment the graph.8. do questions 6 & 7 for variable y4. Is showing residuals as a function of x useful? Why?


  • Practical 4: Functions in R

    Introduction

    R comes with many built-in functions (mean, var, cor, plot, lm, ...), and many more can be downloaded from http://www.r-project.org as packages. But you might (will) need to write your own functions to carry out a repetitive task. The basic concepts are very similar to those you’ve learned for other programming languages, such as C++, Java or Fortran. This practical should give a first brief feel for function writing in R.

    General structure of a function:

    my.function

  • my.var.ub

  • if (cond) {cons.expr1; cons.expr2} else {alt.expr1; alt.expr2} (same as before, but with several expressions for each situation. Note the curly brackets around the expressions.)

    for(i in seq) expr (do expr as many times as there are elements in seq).
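The two constructs can be combined inside a function. The sketch below uses a hypothetical function name, cum.sum.loop, and simply recomputes a running total with a loop; in practice the built-in cumsum does the same job.

```r
# cum.sum.loop is a hypothetical name; it returns the running total of x
cum.sum.loop <- function(x) {
  if (length(x) == 0) {
    return(numeric(0))            # nothing to add up
  } else {
    total <- numeric(length(x))   # vector that will hold the result
    running <- 0
    for (i in seq_along(x)) {     # one pass per element of x
      running <- running + x[i]
      total[i] <- running
    }
    return(total)
  }
}

cum.sum.loop(1:10)  # same values as cumsum(1:10)
```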

    x

  • ## 1 1
    ## 2 3
    ## 3 6
    ## 4 10
    ## 5 15
    ## 6 21
    ## 7 28
    ## 8 36
    ## 9 45
    ## 10 55

    #i

  • 5. Draw a histogram of sizes. Are the different size classes equifrequent?

    6. Plot the empirical cumulative distribution of sizes (use plot and sort). Using this plot, estimate the 30th and 70th percentiles of the distribution of sizes.

    7. Use a histogram, and qqnorm (?qqnorm) with qqline, to check whether sizes are normally distributed. Which is better? How would you test for normality?

    8. Use quantile to validate your approximation for question 6.

    9. What’s the mean weight? Estimate it in kg, then in g (you don’t need to use R to estimate it in g).

    10. What’s the weight variance in kg^2? In g^2? (You don’t need to use R to estimate it in g^2.)

    11. Rescale the variable weight so that it has mean 0 and sd 1. Why is this transformation useful? What are the 2.5th and 97.5th percentiles of the rescaled weight?


  • Practical 5: Statistical distributions and functions in R

    Introduction

    R has built-in functions to calculate the probability density, the cumulative distribution and its inverse (the quantile function) for several standard probability distributions. Furthermore, R can generate random values from these distributions. There are 4 functions for each distribution, starting with the following letters:

    1. d: for density, for instance 'dnorm' for the probability density of the normal distribution;
    2. p: for probability, e.g. 'pnorm';
    3. q: for quantile, e.g. 'qnorm';
    4. r: for random, e.g. 'rnorm'.
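As a quick illustration, the four prefixes can be tried on the standard normal distribution. The seed and the name r.draws are arbitrary choices made for this sketch.

```r
dnorm(0)             # density of the standard normal at 0, i.e. 1/sqrt(2*pi)
pnorm(0)             # P(X <= 0), one half by symmetry
qnorm(0.5)           # the median of the standard normal, 0
set.seed(42)         # arbitrary seed so that the draws can be reproduced
r.draws <- rnorm(5)  # five random draws
r.draws
```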

    The classical distributions are:

    • normal (see ?pnorm): central to all of statistics. Takes the mean and the standard deviation (not the variance) as parameters.
    • binomial (?pbinom): takes 2 parameters, n the number of trials (argument size) and p, the probability of a success (argument prob).
    • Poisson (?ppois): determined by a single parameter lambda.
    • geometric (?pgeom)
    • uniform (?punif)
    • chi-square (?pchisq)
    • F (?pf)
    • t (?pt)
    • ...

    These are the most common, but several others are also available:

    ?Distributions

    and even more here:

    http://cran.r-project.org/web/views/Distributions.html

    Let’s take 1000 random numbers from a normal with mean 0 and sd 1 (default parameters):

    a

  • [Figure: histogram of a (x axis: a, y axis: Frequency)]

    qqnorm(a);qqline(a) #best way to verify normality!

    [Figure: normal Q-Q plot of a (x axis: Theoretical Quantiles, y axis: Sample Quantiles)]

    Do the same, but with mean 4 and sd 5:

    b

  • [Figure: histogram of b (x axis: b, y axis: Frequency)]

    qqnorm(b);qqline(b)

    [Figure: normal Q-Q plot of b (x axis: Theoretical Quantiles, y axis: Sample Quantiles)]

    pnorm gives the probability that a normal deviate is less than a given number X, for instance -1.96:

    pnorm(-1.96)

    ## [1] 0.0249979

    pnorm(1.96) #why?

    ## [1] 0.9750021

    If a random variable follows a standard normal distribution, it has a 2.5% chance of being smaller than -1.96 (and since this distribution is symmetrical, it also has a 2.5% chance of being larger than 1.96).

    With qnorm, you do the reverse: it gives the value of the normal deviate below which a deviate falls with a given probability, here 2.5%:

    qnorm(0.025)

    ## [1] -1.959964
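The relation between pnorm and qnorm can be checked directly. This is a small sketch; the mean of 4 and sd of 5 in the last line are simply reused from the example above, and the name p.low is an arbitrary choice.

```r
p.low <- pnorm(-1.96)                  # lower tail probability, about 0.025
qnorm(p.low)                           # back to -1.96: qnorm is the inverse of pnorm
pnorm(1.96, lower.tail = FALSE)        # upper tail, equal to pnorm(-1.96) by symmetry
qnorm(c(0.025, 0.975))                 # bounds containing 95% of the deviates
pnorm(4 - 1.96 * 5, mean = 4, sd = 5)  # same tail probability for a normal(4, 5)
```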

    Now try:

    mean(a)

    ## [1] -0.04490363


  • sd(a)

    ## [1] 1.023454

    b

  • plot(0:10,dpois(0:10,2.4),type="h")

    [Figure: probabilities of a Poisson distribution with lambda = 2.4, drawn as vertical lines (x axis: 0:10, y axis: dpois(0:10, 2.4))]

    A particularly interesting distribution is that of p-values under the null hypothesis. It can be shown that they should follow a uniform distribution. Now that big data are available everywhere, we are often confronted with a very large number of p-values, and, by chance alone, some will fall below the classical 5% threshold. How can we decide which p-values are really outliers?

    pval

  • [Figure: histogram of pval (x axis: pval, from 0 to 1; y axis: Frequency)]

    Out of 100000 p-values, 4920 are significant at the 5% level. Is there something to worry about?

    n

  • [Figure: observed versus theoretical percentiles of the p-values. Left panel: untransformed (x axis: theo. perc., y axis: obs. perc.). Right panel: -log10 scale (x axis: theo. perc. (-log10), y axis: obs. perc. (-log10)).]

    Now imagine that 5 of the p-values are really, really small, less than 10^-6 say:

    pval[1:5]

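  • [Figure: observed versus theoretical percentiles of the p-values after adding the 5 very small values. Left panel: untransformed (x axis: theo. perc., y axis: obs. perc.). Right panel: -log10 scale (x axis: theo. perc. (-log10), y axis: obs. perc. (-log10)).]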

    The outlying values now appear clearly on the -log10 (right) panel, but not on the untransformed (left) panel. This type of plot is now commonly used to identify p-values that are outliers.
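A plot of this kind can be sketched as follows. This is not the tutorial's own code: the seed, the five planted p-values and the choice of i/(n+1) as theoretical percentiles are assumptions made for illustration.

```r
set.seed(123)                        # arbitrary seed
n <- 100000
pval <- runif(n)                     # p-values under the null hypothesis
pval[1:5] <- 10^(-(6:10))            # five hypothetical, very small p-values

theo <- (1:n) / (n + 1)              # expected percentiles of a uniform(0, 1)
obs <- sort(pval)                    # observed percentiles

par(mfrow = c(1, 2))
plot(theo, obs, xlab = "theo. perc.", ylab = "obs. perc.")
abline(0, 1)                         # line expected under uniformity
plot(-log10(theo), -log10(obs),
     xlab = "theo. perc. (-log10)", ylab = "obs. perc. (-log10)")
abline(0, 1)
par(mfrow = c(1, 1))
```

The -log10 transformation stretches the region close to 0, which is where the interesting p-values are located.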

    Another useful function, connected to the binomial distribution, is the function sample. It allows sampling with or without replacement from a vector (see ?sample). This function is at the heart of randomization and bootstrap tests.

    sample(1:10,replace=FALSE) #permutations of the elements

    ## [1] 7 4 5 9 8 3 10 6 2 1

    sample(0:99,size=200,replace=TRUE) #200 random draws of numbers between 0 & 99

    ## [1] 82 43 77 98 0 50 53 79 12 27 52 54 79 73 61 81 20 74 67 87 42 26 84
    ## [24] 77 8 34 80 86 86 3 62 99 26 38 84 16 90 46 14 40 68 16 53 49 17 44
    ## [47] 26 87 4 98 56 34 92 99 95 47 70 20 25 47 81 45 3 33 25 91 99 85 67
    ## [70] 3 70 90 24 57 29 4 19 56 49 72 9 81 36 2 37 51 42 20 99 79 7 33
    ## [93] 75 74 76 94 82 86 7 33 47 29 91 87 60 15 6 35 21 80 72 76 21 5 18
    ## [116] 73 89 62 0 5 14 48 18 67 83 76 38 66 0 23 28 73 81 61 48 5 52 48
    ## [139] 40 11 26 74 0 73 17 27 86 95 94 8 99 34 89 27 62 95 27 54 51 80 28
    ## [162] 21 96 9 2 89 79 41 41 19 78 94 52 17 84 51 72 49 36 21 50 95 61 89
    ## [185] 64 62 69 50 49 23 61 61 70 85 0 72 3 44 13 23

    sample(1:0,size=1000,replace=TRUE,prob=c(0.2,0.8))

    ## [1] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0

    53

  • ## [35] 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0## [69] 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1## [103] 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0 1 1 0 0 0 1 1 1 0 0## [137] 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 0## [171] 0 1 1 0 1 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1## [205] 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0## [239] 0 1 0 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0## [273] 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 0 0## [307] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0## [341] 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 0 1## [375] 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1 0 1 0 0## [409] 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0## [443] 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0## [477] 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 0 1 0## [511] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 1 0 1 0 0 0 0## [545] 0 1 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0## [579] 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0## [613] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 1 1## [647] 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0## [681] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1## [715] 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1## [749] 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0## [783] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0## [817] 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0## [851] 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1## [885] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0## [919] 0 0 0 0 1 0 
0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0## [953] 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0## [987] 0 0 0 1 0 0 0 0 0 0 0 0 0 0

    #same as
    rbinom(1000,size=1,prob=0.2)

    ## [1] 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0## [35] 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0## [69] 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1## [103] 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0## [137] 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0## [171] 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0## [205] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0## [239] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0## [273] 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0## [307] 0 0 1 1 0 1 1 0 0 0 0 0 1 0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1## [341] 1 0 0 1 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0## [375] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0## [409] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 1 0## [443] 0 1 0 1 0 1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0## [477] 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 1 1 1 0 0 0 1 0 0 1 0## [511] 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1## [545] 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0## [579] 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0## [613] 0 1 0 1 0 0 0 0 0 1 1 1 1 0 0 0 1 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0## [647] 0 0 1 1 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0## [681] 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 0 1 0 1 0

    54

  • ## [715] 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0## [749] 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0## [783] 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0## [817] 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0## [851] 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0## [885] 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0## [919] 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0## [953] 0 0 1 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1## [987] 0 0 0 0 0 1 0 0 1 1 0 0 0 0
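Rather than comparing long vectors of 0s and 1s by eye, the two approaches can be compared through the proportion of 1s they produce. In this sketch the seed, the names s1, s2 and v, and the small vector used for the bootstrap resample are assumptions made for illustration.

```r
set.seed(2015)                                 # arbitrary seed
s1 <- sample(1:0, size = 1000, replace = TRUE, prob = c(0.2, 0.8))
s2 <- rbinom(1000, size = 1, prob = 0.2)
mean(s1)                                       # proportion of 1s, close to 0.2
mean(s2)                                       # also close to 0.2
table(s1)                                      # counts of 0s and 1s

# A bootstrap resample of a small vector (values chosen for illustration)
v <- c(2.1, 3.4, 1.9, 5.0, 4.2)
sample(v, replace = TRUE)                      # same length, some values repeated
```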

    Questions

    1. Generate 254 random numbers drawn from a uniform distribution between 0 and 1000. Draw a histogram of these numbers. Does it look familiar? Go back to the lottery data set, and now think of a graphical way of verifying uniformity in the winning numbers (hint: have a look at qqplot). Can you think of a way of testing this?

    2. Imagine that the size of the male population in Switzerland follows a normal distribution with mean 1.72 m and sd 0.3 m.

    a. What is the probability that an individual from this population is 1.60 m or less?
    b. 1.80 m or less?
    c. 1.90 m or more?
    d. 2.00 m or more?
    e. What proportion of the male population should be between 1.50 and 1.80 meters tall?

    3. The greater white-toothed shrew, Crocidura russula, is a small anthropophilic insectivore living around here, and is studied by research groups in Lausanne. The population size is stable, meaning that the mean number of surviving offspring per female is 2. Suppose that the number of surviving offspring follows a Poisson distribution. What is the probability that a female has no surviving offspring? One surviving offspring? 4 or more? (Beware, there is a small trap here.)

    4. In several species, the probability p of dying is independent of age. The distribution of ages at death then follows a geometric distribution, $P(X = x) = p(1 - p)^x$. We are interested in a population where the probability of dying during a year is 0.15. You can use the geom functions (pgeom, rgeom, dgeom, qgeom) to answer the following:

    a. In this population, what is the probability of dying at 20?
    b. What is the probability of dying before 2?
    c. What is the probability of attaining 30?
    d. What is the mean age? (hint: think about the expected value)
    e. What is the variance of age?
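Before answering, it may help to check that R's geom functions use the same parameterisation as the formula in question 4, with x counted from 0. The value of p below is hypothetical and deliberately different from the one in the question.

```r
p <- 0.3                      # hypothetical value, not the one in the question
x <- 0:5
dgeom(x, prob = p)            # built-in probabilities P(X = x)
p * (1 - p)^x                 # the formula, evaluated by hand
pgeom(2, prob = p)            # P(X <= 2)
sum(dgeom(0:2, prob = p))     # same probability, obtained by summing
```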

    The following exercise is a good way of demonstrating the central limit theorem.

    5. Imagine that we are rolling a die. Using the function sample, generate 1000 die rolls and store them in a vector x1. How are these data distributed?


  • Now, roll two dice simultaneously 1000 times, and store the sum of the 2 displayed numbers in a vector x2. How are these data distributed? Do the same for 5 dice (in vector x5) and 20 dice (in vector x20). Plot on the same figure the normal quantile-quantile plots (qqnorm) obtained from these 4 experiments. Conclusions?

    [optional] You can do the same for other distributions, for instance the geometric distribution we’ve seen in the previous question.

    The following two exercises will show you how to use R to do simulations. To solve them, start by thinking about what type of information you’d like the function to produce, and then think about ways of generating it. You’ll have to create vectors or matrices and use loops.

    6. Write a function that simulates the evolution of the size of a finite hermaphroditic population with non-overlapping generations under demographic stochasticity. We will suppose that fertility follows a Poisson distribution. This function should take as arguments the initial population size, the fertility, the number of generations over which you want to follow population size, and the option of plotting the census size as a function of generation. It should output the population sizes through time at the end.

    7. Write a function drift that simulates the effect of random genetic drift. It should take as parameters the population size (fixed), the number of generations, the initial allele frequency, and a number of replicates. It should produce at the end a graph of the changes through time of allele frequencies in the different replicates.


  • Practical 6: Tests in R

    Introduction

    R has built-in functions for carrying out the typical statistical tests. We will go through them in this practical.

    We’ll use the following data set to illustrate some of these functions:

    dat

  • t.test(x, y = NULL,
           alternative = c("two.sided", "less", "greater"),
           mu = 0, paired = FALSE, var.equal = FALSE,
           conf.level = 0.95, ...)

    t.test(x,mu=0)
    t.test(x,y)
    t.test(x,y,paired=TRUE)

    examples:

    x

  • ## mean of x mean of y
    ## 0.1055108 0.7821701
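A minimal sketch of the three calls on simulated data. The two samples, their means and the seed are assumptions, not the tutorial's data, and the names one, two and pair are arbitrary.

```r
set.seed(10)                          # arbitrary seed
x <- rnorm(30, mean = 0, sd = 1)      # simulated sample 1
y <- rnorm(30, mean = 0.8, sd = 1)    # simulated sample 2

one <- t.test(x, mu = 0)              # one sample: is the mean of x equal to 0?
two <- t.test(x, y)                   # two samples, unequal variances by default
pair <- t.test(x, y, paired = TRUE)   # paired: tests the mean of x - y

two$estimate                          # the two sample means
two$p.value                           # p-value of the two-sample test
pair$estimate                         # mean of the differences
```

As with lsfit and lm, the result of t.test is a list, so its elements (estimate, p.value, conf.int, ...) are accessed with $.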

    Wilcoxon tests (non-parametric equivalents of t-tests):

    wilcox.test(x,y)
    wilcox.test(x,y,paired=TRUE)

    Simple bootstrap percentile confidence intervals:

    Several functions are available in package boot, but these intervals are simple to program.

    For instance, for a confidence interval of the mean:

    bootstrap

  • [Figure: two histograms, one of x (x axis: x, y axis: Frequency) and one of bx$bm (x axis: bx$bm, y axis: Frequency)]
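The percentile bootstrap for the mean can be sketched as follows. The function name boot.mean.ci, its arguments and the simulated data are hypothetical; this is one possible way of programming it, not necessarily the tutorial's own function.

```r
# boot.mean.ci is a hypothetical helper written for this sketch
boot.mean.ci <- function(x, nboot = 1000, conf = 0.95) {
  bm <- numeric(nboot)                        # will hold the bootstrap means
  for (i in 1:nboot) {
    bm[i] <- mean(sample(x, replace = TRUE))  # resample with replacement
  }
  alpha <- 1 - conf
  list(bm = bm, ci = quantile(bm, c(alpha / 2, 1 - alpha / 2)))
}

set.seed(3)                                   # arbitrary seed
x <- runif(254, 500, 1000)                    # simulated data
bx <- boot.mean.ci(x)
bx$ci                                         # percentile confidence interval
hist(bx$bm)                                   # distribution of the bootstrap means
```

The interval is read directly from the quantiles of the bootstrap means, so no distributional assumption is needed.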

    One way ANOVA:

    oneway.test(y~A,var.equal=TRUE)
    aov(y~A)

    testing homogeneity of variance:

    bartlett.test #strongly depends on the normality assumption

    Levene test: less sensitive to the normality assumption, hence preferred, but has to be programmed. The idea is simple: if variances are heterogeneous, this should show up in an ANOVA on the absolute values of the residuals:

    set.seed(12) #to make sure that you obtain the same results as me
    y1

  • #unlist is for transforming a list into a vector (provided the elements of the list
    #are of the same type)
    boxplot(res~a)
    boxplot(abs(res)~a) #see the difference in mean/median among the groups?
    summary(aov(abs(res)~a))

    ##             Df Sum Sq Mean Sq F value   Pr(>F)
    ## a            2  643.1   321.5   32.08 3.66e-11 ***
    ## Residuals   87  872.0    10.0
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    [Figure: three box plots across the groups 1, 2 and 3; the last two show res and abs(res)]

    The car library has a function for this, leveneTest (?leveneTest)
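The idea behind the Levene test can be sketched in base R as follows. The simulated groups, their standard deviations and the object names are assumptions made for this illustration.

```r
set.seed(12)                                    # arbitrary seed
a <- factor(rep(1:3, each = 30))                # three groups of 30 observations
y <- rnorm(90, mean = 0, sd = rep(c(1, 3, 6), each = 30))  # unequal sds

res <- residuals(lm(y ~ a))                     # deviations from the group means
lev <- anova(lm(abs(res) ~ a))                  # ANOVA on the absolute residuals
lev
bartlett.test(y ~ a)                            # Bartlett test, for comparison
```

A significant group effect on abs(res) indicates that the spread differs among groups.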

    Permutation tests (distribution free ANOVA)

    anova.perm

  • bf[i]=bf[nperm])/nperm,bf=bf))}#exampley
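A permutation test for a one-way ANOVA can be sketched as follows. The function perm.anova is a hypothetical name and is not the tutorial's anova.perm; it only assumes, as the fragment above suggests, that the observed F is stored as the last element and that the p-value is the proportion of F values at least as large as the observed one.

```r
# perm.anova is a hypothetical sketch, not the tutorial's anova.perm
perm.anova <- function(y, a, nperm = 1000) {
  f.stat <- function(resp) anova(lm(resp ~ a))[1, "F value"]
  bf <- numeric(nperm)
  for (i in 1:(nperm - 1)) {
    bf[i] <- f.stat(sample(y))     # F statistic after shuffling the response
  }
  bf[nperm] <- f.stat(y)           # observed F statistic, stored last
  list(p.val = sum(bf >= bf[nperm]) / nperm, bf = bf)
}

set.seed(5)                        # arbitrary seed
a <- factor(rep(1:3, each = 10))
y <- rnorm(30) + rep(c(0, 0, 3), each = 10)  # third group shifted upwards
out <- perm.anova(y, a, nperm = 200)
out$p.val
```

Because the observed value is included among the permutations, the smallest possible p-value is 1/nperm.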

  • x

  • y ~ x
    y ~ 1 + x
        Both imply the same simple linear regression model of y on x. The first has an implicit intercept term, and the second an explicit one.

    y ~ 0 + x
    y ~ -1 + x
    y ~ x - 1
        Simple linear regression of y on x through the origin (that is, without an intercept term).

    log(y) ~ x1 + x2
        Multiple regression of the transformed variable, log(y), on x1 and x2 (with an implicit intercept term).

    y ~ 1 + x + I(x^2)
        Polynomial regression of y on x of degree 2.

    y ~ A
        Single classification analysis of variance model of y, with classes determined by A.

    y ~ A + x
        Single classification analysis of covariance model of y, with classes determined by A, and with covariate x.

    y ~ A*B
    y ~ A + B + A:B
    y ~ A + B %in% A
    y ~ A/B
        Two factor non-additive model of y on A and B. The first two specify the same crossed classification and the last two specify the same nested classification.

    y ~ (A + B + C)^2
    y ~ A*B*C - A:B:C
        Three factor experiment but with a model containing main effects and two factor interactions only. Both formulae specify the same model.

    y ~ A * x
    y ~ A/x
    y ~ A/(1 + x) - 1
        Separate simple linear regression models of y on x within the levels of A, with different codings. The last form produces explicit estimates of as many different intercepts and slopes as there are levels in A.

    y ~ A*B + Error(C)
        An experiment with two treatment factors, A and B, and error strata determined by factor C. For example a split plot experiment, with whole plots (and hence also subplots), determined by factor C.

    cbind(y1,y2,y3) ~ A
        Formula for MANOVA.
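The effect of the intercept term in a formula can be checked on simulated data (the data and the seed below are assumptions made for illustration):

```r
set.seed(7)                         # arbitrary seed
x <- 1:25
y <- 2 + 0.5 * x + rnorm(25)        # simulated response

coef(lm(y ~ x))                     # implicit intercept
coef(lm(y ~ 1 + x))                 # explicit intercept, same model
coef(lm(y ~ 0 + x))                 # regression through the origin: slope only
coef(lm(y ~ 1 + x + I(x^2)))        # polynomial of degree 2: three coefficients
```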

    Linear models (Regression, multiple regressions, ANOVAs, ANCOVAs)

    The basic function for fitting ordinary multiple models is lm(), and a streamlined version of the call is as follows:


  • fitted.model

  • ## Analysis of Variance Table
    ##
    ## Model 1: y ~ x1 + x2
    ## Model 2: y ~ x1
    ##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
    ## 1     97 2597.0
    ## 2     98 2644.4 -1    -47.41 1.7708 0.1864

    par(mfrow=c(2,2)) #diagnostic plots, essential to check
    plot(fm1)

    [Figure: the four diagnostic plots produced by plot(fm1): Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage]

    par(mfrow=c(1,1))
    #note: order of variables matters for prop of variance explained (anova)!

    fm2b F)## x1 1 22270.4 22270.4 831.8164

  • anova(fm2b)

    ## Analysis of Variance Table
    ##
    ## Response: y
    ##           Df  Sum Sq Mean Sq  F value  Pr(>F)
    ## x2         1   178.4   178.4   6.6638 0.01134 *
    ## x1         1 22139.4 22139.4 826.9234 < 2e-16 ***
    ## Residuals 97  2597.0    26.8
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
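Why the order of the variables matters can be seen on a small simulated example with two correlated predictors. The data, the seed and the model names fa and fb are assumptions made for this sketch.

```r
set.seed(21)                          # arbitrary seed
x1 <- rnorm(100)
x2 <- 0.5 * x1 + rnorm(100)           # x2 is correlated with x1
y <- 3 * x1 + x2 + rnorm(100)

fa <- lm(y ~ x1 + x2)                 # x1 entered first
fb <- lm(y ~ x2 + x1)                 # x2 entered first
anova(fa)                             # sequential sums of squares
anova(fb)                             # different split, same residual line
coef(fa)                              # the estimates do not depend on the order
```

anova() on a single model attributes to each term the variation it explains after the terms listed before it, so correlated predictors share out the explained variation differently depending on their order.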

    Generic functions for extracting model information

    The value of lm() is a fitted model object; technically a list of results of class “lm”. Information about the fitted model can then be displayed, extracted, plotted and so on by using generic functions that orient themselves to objects of class “lm”. These include

    add1    deviance  formula  predict    step
    alias   drop1     kappa    print      summary
    anova   effects   labels   proj       vcov
    coef    family    plot     residuals

    A brief description of the most commonly used ones is given below.

    anova(object_1, object_2)
        Compare a submodel with an outer model and produce an analysis of variance table.

    summary(object)
        Print a comprehensive summary of the results of the regression analysis.

    coef(object)
        Extract the regression coefficient (matrix).
        Long form: coefficients(object).

    plot(object)
        Produce four plots, showing residuals, fitted values and some diagnostics.

    predict(object, newdata=data.frame)
        The data frame supplied must have variables specified with the same labels as the original. The value is a vector or matrix of predicted values corresponding to the determining variable values in data.frame.

    residuals(object)
        Extract the (matrix of) residuals, weighted as appropriate.
        Short form: resid(object).

    step(object)
        Select a suitable model by adding or dropping terms and preserving hierarchies. The model with the smallest value of AIC (Akaike’s An Information Criterion) discovered in the stepwise search is returned.
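A few of these generic functions applied to a simple fit. The data frame dat.sim, its columns and the seed are assumptions made for illustration.

```r
set.seed(8)                                     # arbitrary seed
dat.sim <- data.frame(x = runif(50, 0, 10))     # simulated predictor
dat.sim$y <- 1 + 2 * dat.sim$x + rnorm(50)      # simulated response

fm <- lm(y ~ x, data = dat.sim)
coef(fm)                                        # intercept and slope
head(resid(fm))                                 # first few residuals
pr <- predict(fm, newdata = data.frame(x = c(2, 5)))  # predictions for new x values
pr
summary(fm)$r.squared                           # one element of the summary
```

Note that the data frame given to newdata uses the same variable name, x, as the one used to fit the model.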

    Generalized linear models:

    glm(formula, family=gaussian)

    family can be one of:
    binomial(link = "logit")
    gaussian(link = "identity")
    Gamma(link = "inverse")
    inverse.gaussian(link = "1/mu^2")
    poisson(link = "log")
    quasi(link = "identity", variance = "constant")
    quasibinomial(link = "logit")
    quasipoisson(link = "log")
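As an example, a logistic regression on simulated 0/1 data could look like this. The data, the true coefficients, the seed and the object names are assumptions made for this sketch.

```r
set.seed(9)                                   # arbitrary seed
x <- runif(200, -2, 2)
prob <- plogis(-0.5 + 1.5 * x)                # true success probabilities
z <- rbinom(200, size = 1, prob = prob)       # simulated 0/1 response

gm <- glm(z ~ x, family = binomial(link = "logit"))
coef(gm)                                      # estimates on the logit scale
head(fitted(gm))                              # fitted probabilities
```

The generic functions described for lm objects (summary, coef, predict, anova, ...) also work on glm objects.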

    Mixed models (with a mixture of fixed and random effects): use package lme4:

    library(lme4)
    ?lmer

    Multivariate tests:

    manova(cbind(y1,y2,y3)~A)

    prcomp() for PCA, but better to use package ade4:

    library(ade4)
    ?dudi.pca


  • Questions

    Answer these questions using the file class2005.txt at URL:

    http://www2.unil.ch/popgen/teaching/R11/

    1. Average size of the class? For males? For females?
    2. Are the sizes normally distributed?
    3. Confidence interval for the average size?
    4. Correlation between weight and size?
    5. Can one predict the size of somebody based on their shoe size?
    6. Are eye and hair colors independent?
    7. Are smokers heavier than non-smokers?
    8. Does the answer to 7 depend on sex?
    9. Does size differ according to hair colour?
    10. Does the weight of boys and girls differ? Does this depend on their respective size?
    11. Does the smoking status depend on sex?

    Practical 7: Packages

    Part of the strength of R is the very large number of packages available to enhance it. However, this profusion has grown without much order, and it is often difficult to find what one is looking for. A notable exception is the Bioconductor suite of packages (http://www.bioconductor.org), which

    provides tools for the analysis and comprehension of high-throughput genomic data.

    Other specific areas have been loosely organized in task views, a list of which can be found at http://cran.r-project.org/web/views/

    To install a package, use install.packages(). For instance, to install ape, type

    install.packages("ape")

    You might find it useful to install the packages in a specific place rather than the default, either because you are working on a machine where you don’t have administrator rights, or because you don’t want to have to update everything each time you install a new version of R. One way is to create a .Renviron file in your home directory, where you specify R_LIBS=/your_path_to_Rlibs

    To load a package, use library(package), e.g.:

    library(ape)
    ?plot.phylo
    example(plot.phylo)
