R Programming for Life Scientists Version 2.0 Raymond R. Balise, Ph.D. Health Research and Policy Spectrum

Post on 19-Dec-2015

TRANSCRIPT

  • Slide 1
  • R Programming for Life Scientists Version 2.0 Raymond R. Balise, Ph.D. Health Research and Policy Spectrum
  • Slide 2
  • Roadmap What makes R different from the rest? Setting up R Types of data Working with collections of data Importing and exporting data Writing functions Graphics
  • Slide 3
  • When to Use R Shoestring budget Cutting edge statistics Developing your own or fine-tuning existing methods Local expertise
  • Slide 4
  • Programming Languages Procedural languages (C, Fortran, Cobol, Basic) use a model where the logic flows from the top of the page to the bottom, with goto jumps or calls to subroutines as needed. It is hard to encapsulate the code. Object-oriented languages (C++, Visual Basic, Java) involve creating objects and then operating on them.
  • Slide 5
  • R is Object Oriented (OO) You create objects: a vector of numbers, a graphic, etc. You call methods/functions to operate on the objects. Working with an OO language requires you to learn about special methods to create, access, modify, or destroy objects and their properties. R hides these processes. It helps a lot if you want to write new statistics and methods and is required for making new packages.
  • Slide 6
  • OO Example With R you write code in the editor which I will show you in a minute. You can create an object which holds a bunch of numbers (a vector, if you remember math) You can then use (aka call) a function (aka method) to operate on the object. The summary() function Create and display a numeric summary object The plot() function Create and display a graphic summary object
  • Slide 7
  • Make the ages object. Call the summary function. Call the plot function.
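
The three steps on this slide can be sketched roughly as follows. The slide's own values are not in the transcript, so the ages here are borrowed from the c() example later in the talk, and the name ageSummary is made up for illustration.

```r
# Make the ages object (values borrowed from a later slide; an assumption here).
ages = c(9, 11, 40, 41)

# Call the summary function and save its output as an object.
ageSummary = summary(ages)
ageSummary

# Call the plot function (uncomment in an interactive session to see the graphic).
# plot(ages)
```

Saving the summary as an object means you can pull single values out of it later, for example ageSummary[["Median"]].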
  • Slide 8
  • But wait, there's more! There is a lot of functionality built into R. It ships with libraries that do many different tasks. And you can download more. Map most of the USA. Activate the map datasets and functions.
  • Slide 9
  • But hold on. There is MORE! You can add options to the function calls to make them do fancy things like color. Or you can have one function act on the output of another function. And you can save output as objects!
  • Slide 10
  • Important Objects Vectors are ordered sets of values of the same type, such as numbers. Data frames are like database tables or spreadsheets.
  • Slide 11
  • Everything is an Object Mode: Every object in R has a mode, which describes how its data is stored: character, numeric, logical, function, complex, raw. Class: R objects can have a class which indicates what they are used for: character, numeric, logical, function, factor, data.frame, list, matrix, lm, table, and lots more. Some classes hold simple data elements (character, numeric, logical, factor). Some are structures to hold data: data.frame (columns of the same length with data, like a database), list (columns of data of different lengths), matrix (a grid of data, all the same type). Others hold output from summary/graphic procedures (lm, table).
  • Slide 12
  • Where to Get R R has two main websites. One describes the project: http://www.r-project.org/ The other has most of the stuff you want to download: http://cran.r-project.org/ Because the R project has people working all over the globe, the software download site is mirrored everywhere. The closest mirror is USA CA1 (aka UC Berkeley).
  • Slide 13
  • http://cran.cnr.berkeley.edu/ There is an R installer for all the common operating systems: cran.cnr.berkeley.edu/bin/windows/base/ cran.cnr.berkeley.edu/bin/macosx/ cran.cnr.berkeley.edu/bin/linux/ Each is basically self explanatory.
  • Slide 14
  • Slide 15
  • Installing on Windows Double click the installer and just push next until you get to this screen. Specify that you want to do customized startup. This will let you set up R to work with other programs nicely.
  • Slide 16
  • Customize Use these options, then hit Next> a bunch.
  • Slide 17
  • Type help.start() and push Enter to start the help. Type q() and push Enter to quit, but don't yet.
  • Slide 18
  • GUI Use the built in editor. Save or restore all the objects in use. Save or reload the code from the console. Keep all the text in the console for the session. Set the working directory to save objects.
  • Slide 19
  • GUI Edit existing data. Tweak the appearance of the console.
  • Slide 20
  • Rprofile.site If you have instructions that you always want run when R starts up, you can include them in the Rprofile.site file:
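
The slide's own Rprofile.site contents did not survive in the transcript. As a minimal sketch, assuming all you want is to set a default CRAN site so R stops asking you to pick a mirror, the file could contain something like this. The choice of setting is an assumption, not the author's file.

```r
# Hypothetical Rprofile.site contents; the slide's actual file is not in the transcript.
# Set the default CRAN repository for install.packages().
local({
  r = getOption("repos")
  r["CRAN"] = "http://cran.r-project.org/"
  options(repos = r)
})
```

Anything in this file runs for every user of that R installation, so keep it short.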
  • Slide 21
  • GUI Common commands. Show the add on packages currently accessible.
  • Slide 22
  • Packages in R User-supplied packages are typically found at one of three places: CRAN for all kinds of stuff, Omegahat for web-based statistics, and Bioconductor for genomic analysis. R packages update often. Your colleagues will recommend task-specific packages. Rcmdr is my favorite.
  • Slide 23
  • GUI Use a previously downloaded package. I type library(name) instead. USA (CA1) is closest to Stanford. Choose which set of packages to look at. See the HUGE list of packages. Update often!
  • Slide 24
  • GUI This is useful.
  • Slide 25
  • HTML help This is useful but not Google. This will not find information if you have not installed the packages.
  • Slide 26
  • Rseek.org is Google-driven I highly recommend it.
  • Slide 27
  • Mac Quick Help Search help for the word "map". Search for details on a function if the package is loaded and you know the function's name.
  • Slide 28
  • Windows Quick Help Search help for the word "map". Search help for the function named "map". Load the package.
  • Slide 29
  • Mac Install Download and double click the dmg file. Click customize and make sure Tcl/Tk is checked on.
  • Slide 30
  • X11 Some packages for R on the Mac (like Rcmdr) require X11 to be installed. I think it is part of the standard Leopard installation but was an option with Tiger. If you need it, try to install it off of the DVD that came with your machine because people have reported problems using the dmg files from Apple.com.
  • Slide 31
  • X11 and Add-on Packages To get add on packages, use this menu. You can click here to make sure X11 works.
  • Slide 32
  • Getting or Updating Packages Click Get List, click the package name, be sure install dependencies is checked on, then click install.
  • Slide 33
  • Instead of Point and Click You can also run this code to have Mac or Windows R download a list of packages: usefulPackages = c("car", "foreign", "hexbin", "gdata", "ggplot2", "gmodels", "gplots", "Hmisc", "reshape", "Rcmdr") install.packages(usefulPackages, dependencies = TRUE) Be sure to take note of any packages that do not install. marray, affy, Biobase, Rgraphviz were not available
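
To take note of which packages did not install, one option is to compare your list against what R reports as installed. This is a sketch; the name notInstalled is made up, and it does not download anything.

```r
# The package list from the slide.
usefulPackages = c("car", "foreign", "hexbin", "gdata", "ggplot2", "gmodels",
                   "gplots", "Hmisc", "reshape", "Rcmdr")

# installed.packages() returns a matrix with one row per installed package;
# the row names are the package names.
notInstalled = setdiff(usefulPackages, rownames(installed.packages()))
notInstalled   # character(0) means everything on the list is installed
```

Run it after install.packages() finishes and anything still listed needs another look.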
  • Slide 34
  • Your First Package I suggest you install the Rcmdr package first thing. Use the Install packages option on the package menu to download Rcmdr. To make it available for your R session, type: library(Rcmdr) CAPITALIZATION MATTERS! The first time you run it, it will ask you if it can download additional packages.
  • Slide 35
  • Slide 36
  • If you are on Windows, you can directly import Excel. On a Mac, you cannot directly import from Excel.
  • Slide 37
  • Hate Typing? Tab is your friend. It will auto-complete if it can or give you a list of functions that match what you have typed. It works very well on the Mac. In Windows, sometimes you need to type tab twice. In Windows, if you type tab after a ( it displays options for the function; on the Mac they just appear.
  • Slide 38
  • Before Analysis Rcmdr makes the most commonly done analyses easy or if you are told the names of the functions to use, writing the code is almost tolerable. Data manipulation is relatively difficult in R compared to other analysis and data management tools. You need to know how to manipulate the data objects.
  • Slide 39
  • Data Set Objects Vector: a bunch of data in a single row or column, all of the same type. Matrix: a row and column arrangement of data, all of the same type. Data frame: a row and column arrangement of data where columns can be of different types, like a good spreadsheet or relational database file. List: a very free-form structure, a grouping of different types of data.
  • Slide 40
  • Types of Data Vectors Numeric: integer, real, and complex are different types, but you will not need to pay attention to the details. NA means missing. NaN means not a number. String: characters of the alphabet. Logical: TRUE, FALSE, or NA.
  • Slide 41
  • Making a Vector R is case sensitive. # means ignore the rest of the line. ; means a new command follows. Surround the expression with () to display the result automatically. This will be VERY useful. Make a sequence: OneToThirty = seq(1, 30) OneToThirty # same as print(OneToThirty) oneToThirty = seq(1, 30, by = 2); oneToThirty x1to30 = 1:30 (x1to30 = 1:30)
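
Here is the sequence code from this slide laid out one command per line, as a sketch. The name odds is made up so the two sequences do not differ only by capitalization.

```r
OneToThirty = seq(1, 30)        # 1, 2, 3, ... 30
OneToThirty                     # same as print(OneToThirty)

odds = seq(1, 30, by = 2); odds # ; lets a second command follow on the same line

(x1to30 = 1:30)                 # the outer () assign and display in one step

# R is case sensitive: OneToThirty exists, but onetothirty does not.
exists("OneToThirty")
exists("onetothirty")
```

Note that by = 2 starts at 1, so the sequence stops at 29, not 30.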
  • Slide 42
  • Making Vectors With c() c stands for concatenate ages = c(9, 11, 40, 41) ; ages stooges = c("Larry", "Moe", "Curly", "Shemp"); stooges
  • Slide 43
  • Getting Details You can use is functions and length to get details on a vector. is.vector(ages) is.numeric(ages) is.logical(ages) length(ages)
  • Slide 44
  • Recycling and Vectorizing You can add one to all four ages. ages + c(1,1,1,1) If you provide a scalar, R will temporarily vectorize the 1 by recycling that value to match the length of the ages vector. ages + 1 It will recycle a series also. ages ages + c(1,2)
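
A quick sketch of recycling with the ages vector from the earlier slide, showing what R lines up with what:

```r
ages = c(9, 11, 40, 41)

ages + c(1, 1, 1, 1)   # element by element: 10 12 41 42
ages + 1               # the 1 is recycled four times: same answer
ages + c(1, 2)         # c(1, 2) is recycled to 1, 2, 1, 2: 10 13 41 43
```

If the longer length is not a multiple of the shorter one, for example ages + c(1, 2, 3), R still recycles but gives a warning.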
  • Slide 45
  • Naming Parts of a Vector You can assign names to the elements of a vector. This allows later access to the elements using the names instead of the position. names(ages) = stooges ages To erase them: names(ages) = NULL; ages Notice what happens when the lengths differ: stooges= c("Larry", "Moe", "Curly") names(ages) = stooges ages
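
What happens when the lengths differ can be seen in this short sketch: R does not recycle the names, it fills in the missing ones with NA.

```r
ages = c(9, 11, 40, 41)
stooges = c("Larry", "Moe", "Curly")   # only three names for four ages

names(ages) = stooges
ages
names(ages)                 # the fourth name is NA
ages[is.na(names(ages))]    # the element that did not get a name
```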
  • Slide 46
  • Attributes When you add names to things (objects) they acquire or change their names attribute. attributes(ages) When you strip off the names, the vector is left with no attributes. names(ages) = NULL attributes(ages)
  • Slide 47
  • Complex Objects A data frame is an object with many attributes. R ships with a lot of datasets. If you want one, run help.start(), click Packages, then datasets. esoph ?esoph attributes(esoph)
  • Slide 48
  • Getting at Parts of a Vector Specify the element number. heyMoe = ages[2] ; heyMoe Use negative numbers to drop everything except the element you want. ages[c(-1, -3, -4)] Specify a logical vector of TRUE and FALSE values. ages[c(FALSE, TRUE, FALSE, FALSE)]
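
The three ways of getting at Moe's age all return the same element, as this sketch shows. The names byPosition, byDropping, and byLogical are made up for illustration.

```r
ages = c(9, 11, 40, 41)

byPosition = ages[2]                            # keep element 2
byDropping = ages[c(-1, -3, -4)]                # drop elements 1, 3, and 4
byLogical  = ages[c(FALSE, TRUE, FALSE, FALSE)] # keep where TRUE

byPosition; byDropping; byLogical

ages[-2]   # negative numbers are handy for "everything except": 9 40 41
```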
  • Slide 49
  • Getting Parts with Names ages = c(9, 11, 40, 41) ; ages names(ages) = c("Larry", "Moe", "Curly", "Shemp") ages Specify the name. heyMoe = ages["Moe"]
  • Slide 50
  • Duplicate Names Selecting by name returns only the first match if there are duplicate names. names(ages)[4] = "Moe" ages heyMoe = ages["Moe"] heyMoe Use %in% to get all of the duplicates. names(ages) %in% "Moe" ages[names(ages) %in% "Moe"]
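
A self-contained sketch of the duplicate-name behavior, using made-up names firstMoe and allMoes:

```r
ages = c(9, 11, 40, 41)
names(ages) = c("Larry", "Moe", "Curly", "Moe")   # two elements named Moe

firstMoe = ages["Moe"]                  # only the first match
allMoes  = ages[names(ages) %in% "Moe"] # every match

firstMoe
allMoes
```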
  • Slide 51
  • Parts of a Data Frame You can select columns of a data frame just like you selected elements from a vector. booze = esoph["alcgp"] is.data.frame(booze) esoph[2] esoph[c(4,5)]
  • Slide 52
  • Choosing Records If you put a single item or series inside of the square brackets, R thinks you are requesting columns. If you want to get access to specific rows, you include a comma after the rows. blah[rows, columns] esoph[ 1, ] esoph[ 1:3, ]
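
Here is a sketch of the column and row selections on the esoph data frame that ships with R. The object names on the left are made up for illustration.

```r
boozeFrame  = esoph["alcgp"]   # no comma: a column, still a data frame
boozeVector = esoph$alcgp      # $ pulls the column out as a plain vector (a factor here)

firstThree = esoph[1:3, ]      # comma after the rows: rows 1 to 3, all columns
corner = esoph[1:3, c("ncases", "ncontrols")]   # rows and columns together

is.data.frame(boozeFrame)
firstThree
corner
```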
  • Slide 53
  • Smarter Access to a Vector You can use logic checks to find the record numbers in a vector which meet your criteria. ages < 21 which(ages < 21) You can then subset down your data to the records of interest using the [ ] subset operator. ages[which(ages < 21)] ages[ages < 21]
  • Slide 54
  • Smarter Access to a Data Frame You can find the records that pass a logic check inside a data frame. esoph["ncases"] > 0 esoph$ncases > 0 which(esoph$ncases > 0)
  • Slide 55
  • Subset a Data Frame Recall that you can select rows with frameName[rows,columns], and if you do not include a comma, R selects columns and keeps all records. which(esoph$ncases > 0) gives you a list of records which adhere to that rule. Therefore, the code below gives you a subset: esoph[ which(esoph$ncases > 0), ] or esoph[esoph$ncases > 0, ]
  • Slide 56
  • Subset Data Frames Easily If that logic is rough to think about, use the subset function. subset(esoph, ncases > 0)
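
The three ways of subsetting agree when there is no missing data, but they differ when the column has an NA. This sketch uses a tiny made-up data frame (visits is hypothetical, not one of the talk's datasets) to show it.

```r
# Hypothetical data with one missing count.
visits = data.frame(id = 1:5, ncases = c(0, 2, NA, 1, 0))

byLogical = visits[visits$ncases > 0, ]         # NA > 0 is NA, so you get a row of NAs
byWhich   = visits[which(visits$ncases > 0), ]  # which() ignores the NA
bySubset  = subset(visits, ncases > 0)          # subset() also ignores the NA

byLogical
byWhich
bySubset
```

If your data has missing values, which() or subset() is usually the safer choice.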
  • Slide 57
  • Choosing Values If you need specific values, you can use the & (and) or the | (or) operators to get the ordered set of TRUE and FALSE values. ages > 21 & ages < 41 ! means not !(ages > 21 & ages < 41) Notice that it is applying the one logic check to the vector of ages. How does it do that?
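
As a sketch of what is going on: each comparison recycles the single number across the whole ages vector, and & then combines the two TRUE/FALSE vectors element by element. The names overTwentyOne, underFortyOne, and inRange are made up.

```r
ages = c(9, 11, 40, 41)

overTwentyOne = ages > 21                   # 21 is recycled to match ages
underFortyOne = ages < 41
inRange = overTwentyOne & underFortyOne     # element-by-element "and"

inRange
!inRange        # ! flips every TRUE and FALSE
ages[inRange]   # use the result to pick out the values
```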
  • Slide 58
  • Math on Data Frame Columns You have seen how to do scalar and vector algebra. Algebra on a data frame is easy. names(esoph) esoph$total=esoph$ncases + esoph$ncontrols To see the end of the data frame, use tail() tail(esoph)
  • Slide 59
  • Comparing Against Vectors What happens when you try to compare a vector to a set of things? gender = c(NA, "Male", "Female", "Blue", "Female") gender == "Male" | gender == "Female" gender == c("Male", "Female") The last one uses recycling and gives wrong answers: R recycles the shorter vector to be the longer length, then does the comparison. Use the %in% operator if you want to compare as if you wrote a series of or statements. gender %in% c("Male", "Female")
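
Here are the three comparisons side by side as a sketch, with made-up names for the results. The recycled version is wrapped in suppressWarnings() only because R warns that 5 is not a multiple of 2.

```r
gender = c(NA, "Male", "Female", "Blue", "Female")

orVersion = gender == "Male" | gender == "Female"
recycled  = suppressWarnings(gender == c("Male", "Female"))  # compares to Male, Female, Male, Female, Male
inSet     = gender %in% c("Male", "Female")

orVersion   # NA  TRUE  TRUE FALSE  TRUE
recycled    # NA FALSE FALSE FALSE FALSE
inSet       # FALSE  TRUE  TRUE FALSE  TRUE
```

Notice one difference between the two correct versions: the or version keeps the missing value as NA, while %in% reports it as FALSE.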
  • Slide 60
  • Categorical Variables R makes a distinction between variables holding a bunch of characters from the alphabet and variables holding categorical information. If you have a classification/categorical variable, you want R to treat it as a factor or an ordered factor. Typical factors are treatment or gender. dose = c("low", "placebo", "high", "low") dose typeof(dose)
  • Slide 61
  • Factors To convert a character variable to a factor, use the as.factor function. doseF = as.factor(dose) typeof(doseF) class(doseF) Behind the scenes, the character variable is converted into numbers and the numbers are given character strings to display. In modern R the levels of the factor are ordered alphabetically and the first one is represented with the digit 1, the second is 2, etc. There are is. functions to check object types and as. functions to convert between types of objects.
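
A sketch of what is behind the scenes for the dose example. The ordered version at the end (doseOrdered) is an addition for illustration, assuming placebo < low < high is the order you want.

```r
dose = c("low", "placebo", "high", "low")
doseF = as.factor(dose)

typeof(doseF)      # "integer": stored as numbers
class(doseF)       # "factor": used as categories
levels(doseF)      # alphabetical: "high" "low" "placebo"
as.integer(doseF)  # the codes: 2 3 1 2

# For an ordered factor, say what the order is instead of taking alphabetical.
doseOrdered = factor(dose, levels = c("placebo", "low", "high"), ordered = TRUE)
doseOrdered < "high"
```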
  • Slide 62
  • Comparing Factors You can compare a factor vs. a constant value. doseF == "high" as.integer(doseF) == 1 Or you can compare vs. vectors (CAREFULLY). doseF == c("high", "low") Notice the wrong answer thanks to recycling. doseF %in% c("high", "low") R will stop you from comparing factors that have different categories. doseF2 = as.factor(c("blah", "placebo", "high", "low")) doseF == doseF2
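
The factor comparisons can be sketched like this. tryCatch() is used here only so the error from the last comparison is captured as text instead of stopping the script; the result names are made up.

```r
doseF = as.factor(c("low", "placebo", "high", "low"))

recycled = doseF == c("high", "low")    # compares to high, low, high, low
inSet    = doseF %in% c("high", "low")  # is each dose one of these?

recycled   # FALSE FALSE  TRUE  TRUE  (wrong for the first element)
inSet      #  TRUE FALSE  TRUE  TRUE

# Factors with different categories cannot be compared with ==.
doseF2 = as.factor(c("blah", "placebo", "high", "low"))
mismatch = tryCatch(doseF == doseF2, error = function(e) conditionMessage(e))
mismatch
```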
  • Slide 63
  • Recoding Factors Often you will want to regroup factor levels. amount=as.factor(c("placebo", "10mg", "5mg", "10mg")) levels(amount) regroup = list(none="placebo", some=c("5mg", "10mg")) levels(amount) = regroup amount The regrouping maps placebo to none, and 5mg and 10mg to some.
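
The regrouping code from this slide, one command per line, with a table() added at the end to check the result:

```r
amount = as.factor(c("placebo", "10mg", "5mg", "10mg"))
levels(amount)     # "10mg" "5mg" "placebo"

# New level name on the left, old level(s) on the right.
regroup = list(none = "placebo", some = c("5mg", "10mg"))
levels(amount) = regroup

amount             # none some some some
table(amount)      # 1 none, 3 some
```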
  • Slide 64
  • Numeric Factors If you have numeric factors, be careful converting from factors back to numbers. ID = c(1000, 1000, 1001, 2) IDf = factor(ID) as.integer(IDf) levels(IDf) numbersAgain = as.numeric(levels(IDf))[IDf]
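
This sketch shows why you need to be careful: as.integer() gives you the level codes, not the original numbers. The name codes is made up.

```r
ID = c(1000, 1000, 1001, 2)
IDf = factor(ID)

levels(IDf)               # "2" "1000" "1001"
codes = as.integer(IDf)   # 2 2 3 1: positions in the levels, NOT the IDs

# Look up each element's level, then turn the text back into a number.
numbersAgain = as.numeric(levels(IDf))[IDf]
numbersAgain              # 1000 1000 1001 2
```

as.numeric(as.character(IDf)) gives the same answer and may be easier to read.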
  • Slide 65
  • Recoding Values in a Vector R has functions like if and ifelse to process values. ifelse(ages < 21, "Young", "Old") ifelse(ages
  • Test Scores scores = read.table("c:\\blah\\walkerScores.txt", header = TRUE) rapply(scores, class) scores$CENTER = as.factor(scores$CENTER) scores$PAT = as.character(scores$PAT) rapply(scores, class) scores$isSick = ifelse(scores$SCORE > 0, 1, 0); library(car) (scores$SEV = with(scores, recode(SCORE, '0 = "None" ;1:30 = "Mild"; 31:69 = "Moderate"; 70:100 = "Severe"; else = "BAD DATA"'))) (scores$SEV = factor(scores$SEV, levels = c("None", "Mild", "Moderate", "Severe"), ordered = TRUE));
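
If you do not have the car package, the severity grouping can be approximated with cut() from base R. This is a swapped-in technique, not the slide's recode() call, and the scores below are made up because the walkerScores.txt file is not available here. It assumes whole-number scores from 0 to 100.

```r
# Hypothetical scores standing in for the SCORE column.
scores = data.frame(SCORE = c(0, 15, 30, 31, 69, 70, 100))

scores$isSick = ifelse(scores$SCORE > 0, 1, 0)

# Each interval excludes its left end and includes its right end:
# (-1,0] None, (0,30] Mild, (30,69] Moderate, (69,100] Severe.
scores$SEV = cut(scores$SCORE,
                 breaks = c(-1, 0, 30, 69, 100),
                 labels = c("None", "Mild", "Moderate", "Severe"),
                 ordered_result = TRUE)
scores
```

One difference from the recode() version: values outside the breaks become NA here instead of "BAD DATA".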
  • Slide 108
  • Common Plots are Easy attach(scores) #to avoid typing scores$ plot(SEV, main = "MainTitle", xlab = "xlab", ylab = "ylab") plot(SCORE) hist(SCORE) boxplot(SCORE) boxplot(SCORE ~ SEX, ylim = c(0,100)) detach(scores)
  • Slide 109
  • Graphics Tweaks mfrow is used to set number of rows and columns of graphics on a page
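
A minimal sketch of mfrow, putting two graphics side by side. The data is made up (fakeScores is hypothetical), and pdf(NULL) is used only so the code runs without opening a window or writing a file; drop that line and the dev.off() line in an interactive session.

```r
set.seed(1)
fakeScores = round(runif(50, 0, 100))   # hypothetical scores

pdf(NULL)                               # a graphics device that draws nowhere
oldPar = par(mfrow = c(1, 2))           # 1 row, 2 columns of graphics

hist(fakeScores, main = "Histogram")
boxplot(fakeScores, main = "Boxplot")

panelLayout = par("mfrow")              # check the current layout
par(oldPar)                             # put the old settings back
invisible(dev.off())
```

Saving the old settings and restoring them keeps the next graphic from being squeezed into half a page.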
  • Slide 110
  • Strip Charts for Small Datasets par(cex = 1.5) # big font with(Gad, stripchart(HAMA ~ DOSEGRP, xlab = "HAMA", pch = 16))
  • Slide 111
  • 3 Languages for the Price of 1 The graphics I have shown use the classic graphic methods. There are trellis plots from the lattice package that split the data into multiple panes automatically. ggplot2 uses a "grammar of graphics" approach (like SPSS).
  • Slide 112
  • Don't play with pie! library(lattice) trellis.par.set(list(fontsize=list(points=20))) trellis.par.set(list(fontsize=list(text=25))) dotplot(table(Gad$DOSEGRP), xlim = c(-1, 21)) The lattice package makes trellis graphics (I didn't make up these names!).
  • Slide 113
  • Typical lattice plot with banding to show subsets
  • Slide 114
  • trellis.par.set(list(fontsize=list(points=15))) trellis.par.set(list(fontsize=list(text=15))) EE
  • Slide 115
  • Basic plot + geometric details + adding details + adding more details + yet more details qplot(carat, price, data = diamonds, geom = c("point", "smooth"))
  • Slide 116
  • Use Rcmdr (R Commander) Rcmdr has A LOT of great graphics built into the point and click interface. library(Rcmdr) Look up my short course (5 talks) covering basic statistics to see how to code many graphics. www.stanford.edu/~balise/HowToDoBiostatistics.htm
  • Slide 117
  • You are Going to Need More Help Data Manipulation with R by Spector. A must-have book on how to read and write data with or without SQL, manipulate data with R, aggregate data, and reshape datasets easily. R Programming For Bioinformatics by Gentleman. A very good intermediate level book on how R object-oriented programming really works. The R Book or Statistical Computing by Crawley. These have nicely written intermediate level statistics, but the two books are highly redundant.
  • Slide 118
  • A couple of datasets used for this talk were from Glenn A. Walker's Common Statistical Methods for Clinical Research with SAS Examples. Buy Walker's book if you do clinical research and you use SAS. Data Analysis and Graphics Using R by Maindonald and Braun. I constantly use this book to figure out how to do graphics. Using R for Introductory Statistics by Verzani. This one has fantastic coverage of the R needed to do the common statistics. It also has a nearly useless index at the end of the book.
  • Slide 119
  • Biostatistics John Fox, the guy who made Rcmdr, is an excellent author and he provides an R-based supplement for his superb statistics book.
  • Slide 120
  • Spectrum If you are doing biomedical research and have questions we are here to help. Study design Analysis plan Power and sample size calculation (Limited availability help with SAS and R code) med.stanford.edu/spctrm/biostatistician.html Special thanks to 5F-X, Assemblage 23