Data Management and Statistical Analysis - Data Manipulation


Upload: vivay-salazar

Post on 22-Nov-2014



Introduction to R:

Data Manipulation and Statistical Analysis

Data Manipulation

Violeta I. Bartolome

Senior Associate Scientist-Biometrics

Crop Research Informatics Laboratory

International Rice Research Institute

Sample data set

mydata[3,4]


Selecting Variables

• Select variable Y1

o mydata["Y1"]

o mydata[,3]

o mydata[3]

o mydata[c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)]

o mydata[as.logical(c(0,0,1,0,0,0))]

o mydata[names(mydata)=="Y1"]

o mydata$Y1

To create a data frame containing Y1

myA <- mydata["Y1"]
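Not all of the forms above return the same kind of object: single-bracket selection with one index keeps a data frame, while `$` and `mydata[,3]` return a plain vector. A minimal sketch, assuming a made-up stand-in for the sample data (the values are invented; only the column names Site, Trt, Y1 to Y4 follow the slides):

```r
# Hypothetical stand-in for the slides' sample data (values invented)
mydata <- data.frame(
  Site = rep(c("A", "B", "C"), each = 3),
  Trt  = rep(1:3, times = 3),
  Y1 = 1:9, Y2 = 11:19, Y3 = 21:29, Y4 = 31:39
)

asDF  <- mydata["Y1"]               # one-column data frame
asVec <- mydata$Y1                  # plain vector, like mydata[, 3]
asDF2 <- mydata[, 3, drop = FALSE]  # drop = FALSE keeps the data frame

class(asDF)
class(asVec)
class(asDF2)
```

Use the data frame forms when the result will be passed to functions that expect columns with names, and the vector forms for direct arithmetic.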

• Select variables Y1, Y2, Y3, Y4

o mydata[c(3,4,5,6)]

o mydata[3:6]

o mydata[-c(1,2)]

o mydata[-I(1:2)] # I() marks its argument "as is"; mydata[-(1:2)] gives the same result

o mydata[c("Y1", "Y2", "Y3", "Y4")]

o mydata[c(FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)]

o mydata[as.logical(c(0,0,1,1,1,1))]

To create a data frame containing Y1, Y2, Y3, Y4

myB <- mydata[c(3,4,5,6)]


Selecting Variables

• Select variables Y1, Y2, Y3, Y4

o myB<-data.frame(mydata$Y1, mydata$Y2,mydata$Y3, mydata$Y4)

this gives the same values (with columns named Y1 to Y4 instead of mydata.Y1 to mydata.Y4) as

attach(mydata)

myB<-data.frame(Y1,Y2,Y3,Y4)

detach(mydata)

o myB<-subset(mydata, select=Y1:Y4)


Selecting Observations

• Select observation numbers 3 to 8

o mydata[3:8, ]

o mydata[-c(1,2), ] # same result only if mydata has exactly 8 rows

• Select observations of Site B

o mydata[mydata$Site=="B", ]

o subset(mydata,subset=Site=="B")

o mydata[which(mydata$Site=="B"),]

To create a data frame

myC <- mydata[mydata$Site=="B", ]
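The three ways of selecting Site B agree when Site has no missing values, but they differ when it does: a logical index containing NA produces an all-NA row, whereas `which()` and `subset()` drop it. A small sketch with invented data (the NA in Site is an assumption added for illustration):

```r
# Hypothetical data with one missing Site value
mydata <- data.frame(
  Site = c("A", "B", NA, "B"),
  Trt  = c(1, 1, 2, 2),
  Y1   = c(10, 20, 30, 40)
)

byLogical <- mydata[mydata$Site == "B", ]          # NA comparison adds an all-NA row
byWhich   <- mydata[which(mydata$Site == "B"), ]   # which() keeps only TRUE positions
bySubset  <- subset(mydata, subset = Site == "B")  # subset() also drops the NA case

nrow(byLogical)
nrow(byWhich)
nrow(bySubset)
```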

Selecting Observations

• Select observations of Sites A and B, and Trt 1 and 2

o attach(mydata)

mydata[(Site=="A" | Site=="B") & (Trt==1 | Trt==2), ]

detach(mydata)

o subset(mydata,subset=((Site=="A" | Site=="B") & (Trt==1 | Trt==2)))

o mydata[which((mydata$Site=="A" | mydata$Site=="B") & (mydata$Trt==1 | mydata$Trt==2)),]
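When a variable is compared against several values, the chain of `==` and `|` can be shortened with `%in%`. The sketch below is one possible alternative, not the slides' own method, and uses invented data:

```r
# Hypothetical data (values invented)
mydata <- data.frame(
  Site = c("A", "B", "C", "A", "B", "C"),
  Trt  = c(1, 2, 3, 3, 1, 2),
  Y1   = c(10, 20, 30, 40, 50, 60)
)

# Explicit form, as on the slide
longForm <- mydata[(mydata$Site == "A" | mydata$Site == "B") &
                   (mydata$Trt == 1 | mydata$Trt == 2), ]

# Same selection written with %in%
shortForm <- mydata[mydata$Site %in% c("A", "B") & mydata$Trt %in% c(1, 2), ]

shortForm
```

Unlike `==`, `%in%` never returns NA, so rows with a missing Site or Trt are simply not selected.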

Selecting Both Variables and Observations

• Data frame containing Site B and Y1-Y4

o myD<-mydata[4:6, 3:6] # assumes Site B occupies rows 4 to 6

o myD<-mydata[mydata$Site=="B", c("Y1","Y2","Y3","Y4")]

o myD<-subset(mydata,subset=Site=="B",select=Y1:Y4)

Hands-on

Transforming/Creating New Variables

• Using Numerical Expressions

o mydata$Y5 <- mydata$Y3

o mydata$Y6 <- 0

• Using Mathematical Operations (+, -, *, /, **)

o mydata$sum <-mydata$Y1+mydata$Y2+mydata$Y3+mydata$Y4

o attach(mydata)

mydata$sum<-Y1+Y2+Y3+Y4

detach(mydata)

o mydata<-transform(mydata, sum=Y1+Y2+Y3+Y4)

o With more than one transformation

mydata<-transform(mydata,
    sum=Y1+Y2+Y3+Y4,
    mean=(Y1+Y2+Y3+Y4)/4) # a column created in the same transform() call cannot be reused within it
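`transform()` evaluates all of its expressions against the columns that exist before the call, so a variable created in one argument is not visible to a later argument of the same call. Two possible workarounds are sketched below with invented data; the names `total` and `avg` are chosen here only to avoid masking the functions `sum()` and `mean()`:

```r
# Hypothetical data (values invented)
mydata <- data.frame(Y1 = c(1, 2), Y2 = c(3, 4), Y3 = c(5, 6), Y4 = c(7, 8))

# Option 1: two transform() calls; the second sees the column made by the first
step1 <- transform(mydata, total = Y1 + Y2 + Y3 + Y4)
step2 <- transform(step1, avg = total / 4)

# Option 2: within() runs its statements in order, so one call is enough
viaWithin <- within(mydata, {
  total <- Y1 + Y2 + Y3 + Y4
  avg   <- total / 4
})

step2
viaWithin
```

Both give the same values; `within()` may place the new columns in a different order.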


Transforming/Creating New Variables

• Using functions

o mydata$sqrtY3 <- sqrt(mydata$Y3)

o mydata$Y4 <- log10(mydata$Y4)


Missing data: using the na.rm option

• Consider the statement

o mydata$sumy<-mydata$Y1+mydata$Y2+mydata$Y3

Note: if any of the Ys is missing, sumy will be missing (NA)

• To get sum of non-missing observations

o myYs<-subset(mydata,select=c(Y1,Y2,Y3))

o mydata$sum<-rowSums(myYs,na.rm=TRUE)
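The difference between the two approaches can be seen on a small example. The data below are invented; note that with `na.rm=TRUE` a row in which every value is missing gets a sum of 0 rather than NA, so counting the non-missing values alongside may be useful:

```r
# Hypothetical data with missing values
mydata <- data.frame(Y1 = c(1, 2, NA), Y2 = c(4, NA, NA), Y3 = c(7, 8, NA))
myYs   <- mydata[c("Y1", "Y2", "Y3")]

plainSum <- mydata$Y1 + mydata$Y2 + mydata$Y3  # NA whenever any Y is NA
naRmSum  <- rowSums(myYs, na.rm = TRUE)        # sums whatever is present
nPresent <- rowSums(!is.na(myYs))              # number of non-missing values per row

plainSum
naRmSum
nPresent
```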


Missing data: using is.na()

• Selecting observations with at least one missing value

o missing <- subset(mydata,subset=(is.na(Y1)|is.na(Y2)|is.na(Y3)|is.na(Y4)))
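With many variables, listing one `is.na()` per column gets long. A possible shortcut, shown here as a sketch on invented data, is `complete.cases()`, which returns TRUE for rows that have no missing value in the columns it is given:

```r
# Hypothetical data with missing values
mydata <- data.frame(Y1 = c(1, NA, 3), Y2 = c(4, 5, 6),
                     Y3 = c(7, 8, NA), Y4 = c(1, 1, 1))
ys <- mydata[c("Y1", "Y2", "Y3", "Y4")]

missing  <- mydata[!complete.cases(ys), ]  # rows with at least one NA
complete <- mydata[complete.cases(ys), ]   # rows with no NA

missing
complete
```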


Keeping and Dropping Variables

• Create a copy of mydata

mysubset <- mydata

• Drop Y3 and Y4 from mysubset

mysubset$Y3 <- mysubset$Y4 <- NULL
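Another way to drop variables, sketched here as an alternative on invented data, is to keep every column whose name is not in a list of names to remove. This builds the subset in one step and leaves `mydata` unchanged:

```r
# Hypothetical data (values invented)
mydata <- data.frame(Site = c("A", "B"), Trt = c(1, 2),
                     Y1 = 1:2, Y2 = 3:4, Y3 = 5:6, Y4 = 7:8)

dropVars <- c("Y3", "Y4")                             # variables to drop
mysubset <- mydata[setdiff(names(mydata), dropVars)]  # keep all the others

names(mysubset)
```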


Renaming Variables

• Rename Y1-Y4 to X1-X4, respectively

o library(reshape)

mydata <- rename(mydata, c(Y1="X1"))

mydata <- rename(mydata, c(Y2="X2"))

mydata <- rename(mydata, c(Y3="X3"))

mydata <- rename(mydata, c(Y4="X4"))

o names(mydata) <- c("Site", "Trt", "X1", "X2", "X3", "X4")
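If the reshape package is not available, selected columns can also be renamed in base R without retyping every name. A minimal sketch on invented data, using `match()` to find the positions of the old names:

```r
# Hypothetical data (values invented)
mydata <- data.frame(Site = "A", Trt = 1, Y1 = 1, Y2 = 2, Y3 = 3, Y4 = 4)

oldNames <- c("Y1", "Y2", "Y3", "Y4")
newNames <- c("X1", "X2", "X3", "X4")

# Replace only the matched names; Site and Trt are left untouched
names(mydata)[match(oldNames, names(mydata))] <- newNames

names(mydata)
```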

Hands-on


Stacking/Concatenating Data Frames

• Data frame containing Site A only

attach(mydata)

A <- mydata[Site=="A", ]

• Data frame containing Site B only

B <- mydata[Site=="B", ]

• Combine the two data frames

both <- rbind(A,B)

detach(mydata)
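For data frames, `rbind()` matches columns by name rather than by position, so the two pieces need the same set of column names but not the same column order. A small sketch with invented data:

```r
# Hypothetical data frames with the same columns in a different order
A <- data.frame(Site = "A", Trt = 1:2, Y1 = c(5.0, 5.5))
B <- data.frame(Y1 = c(6.0, 6.5), Trt = 1:2, Site = "B")

both <- rbind(A, B)  # column order follows the first data frame

both
```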

Hands-on

Merging Data Frames

• Data frame containing Y1 and Y2

attach(mydata)

left <- mydata[c("Site","Trt","Y1","Y2")]

• Data frame containing Y3 and Y4

right <- mydata[c("Site","Trt","Y3","Y4")]

• Merge the two data frames

both <- merge(left, right,
    by=c("Site","Trt"))

detach(mydata)
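By default `merge()` keeps only the key combinations found in both data frames. When the two pieces do not cover the same Site and Trt combinations, `all=TRUE` keeps every combination and fills the gaps with NA. A sketch on invented data:

```r
# Hypothetical data frames with partly different Site/Trt combinations
left  <- data.frame(Site = c("A", "A", "B"), Trt = c(1, 2, 1), Y1 = c(10, 11, 12))
right <- data.frame(Site = c("A", "B", "B"), Trt = c(1, 1, 2), Y3 = c(20, 21, 22))

inner <- merge(left, right, by = c("Site", "Trt"))              # A-1 and B-1 only
outer <- merge(left, right, by = c("Site", "Trt"), all = TRUE)  # all four combinations

inner
outer
```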

Hands-on


Sorting Data Frames

• Sort by Trt and Site

mydataSorted <-mydata[order(mydata$Trt, mydata$Site), ]

Note: Default is ascending order. Prefix a numeric variable with a minus sign to get descending order

mydataSorted <-mydata[order(-mydata$Trt, mydata$Site), ]
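The minus sign cannot be applied directly to a character variable such as Site. One option, sketched below on invented data, is to negate `xtfrm()`, which converts the values to numbers that sort in the same order:

```r
# Hypothetical data (values invented)
mydata <- data.frame(Site = c("B", "A", "C", "A"), Trt = c(1, 2, 1, 1))

# Descending Site, then ascending Trt within Site
mydataSorted <- mydata[order(-xtfrm(mydata$Site), mydata$Trt), ]

mydataSorted
```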

Hands-on

Parallel to Serial

data.serial <- reshape(mydata,  # object to be reshaped
    varying=list(3:6),          # if >1 variable -- list(3:4,5:6)
    v.names="Y",                # v.names=c("Y","X")
    idvar=c("Site","Trt"),      # be used as rownames
    timevar="Rep",              # new variable to be created
    times=c(1:4),               # values of new variable
    direction="long")

data.serial


Parallel to Serial

row.names(data.serial) <- 1:NROW(data.serial)

data.serial

Change row names

idvar used as row names
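The parallel-to-serial steps can be tried end to end on a small data set. The sketch below assumes an invented `mydata` with two sites and two treatments; only the column layout (Site, Trt, Y1 to Y4) follows the slides:

```r
# Hypothetical wide data: one row per Site/Trt, one column per replicate
mydata <- data.frame(
  Site = rep(c("A", "B"), each = 2),
  Trt  = rep(1:2, times = 2),
  Y1 = c(1, 2, 3, 4),   Y2 = c(5, 6, 7, 8),
  Y3 = c(9, 10, 11, 12), Y4 = c(13, 14, 15, 16)
)

data.serial <- reshape(mydata,
    varying   = list(3:6),          # columns Y1 to Y4 become one variable
    v.names   = "Y",                # name of the stacked variable
    idvar     = c("Site", "Trt"),   # identifies the original rows
    timevar   = "Rep",              # new variable holding the replicate number
    times     = 1:4,
    direction = "long")

# Replace row names built from idvar and Rep with simple row numbers
row.names(data.serial) <- 1:NROW(data.serial)

dim(data.serial)
head(data.serial)
```

Each of the 4 original rows contributes 4 rows to the serial form, one per replicate.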


Parallel to Serial

Hands-on


Serial to Parallel

data.parallel <- reshape(serialdata,  # object to be reshaped
    v.names=c("yld","dm"),            # variables to be converted
    idvar=c("plot","date"),           # variables to be retained
    timevar="rep",                    # values of which will be affixed to column names
    drop=c("var1","var2"),            # variables to be removed from the reshaped data
    direction="wide")

data.parallel

Serial to Parallel

colnames(data.parallel) <- gsub("[.]", "", colnames(data.parallel))

data.parallel

Remove “.” from column names


Serial to Parallel

row.names(data.parallel) <- 1:NROW(data.parallel)

data.parallel

Change row names
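The serial-to-parallel steps can likewise be run on a toy data set. The slides do not show `serialdata`, so the one below is entirely invented; only its column names (plot, date, rep, yld, dm, var1, var2) are taken from the call above:

```r
# Hypothetical long data: one row per plot and replicate
serialdata <- data.frame(
  plot = rep(1:2, each = 2),
  date = "d1",
  rep  = rep(1:2, times = 2),
  yld  = c(4.1, 4.3, 5.0, 5.2),
  dm   = c(90, 92, 95, 96),
  var1 = 0,
  var2 = 0
)

data.parallel <- reshape(serialdata,
    v.names   = c("yld", "dm"),     # variables spread across replicates
    idvar     = c("plot", "date"),  # one output row per plot/date
    timevar   = "rep",              # replicate number is appended to the names
    drop      = c("var1", "var2"),  # left out of the result
    direction = "wide")

# Tidy up: "yld.1" becomes "yld1", and row names become 1, 2, ...
colnames(data.parallel)  <- gsub("[.]", "", colnames(data.parallel))
row.names(data.parallel) <- 1:NROW(data.parallel)

data.parallel
```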


Serial to Parallel

Hands-on


Aggregating data

• With only one response variable

meanY <- aggregate(data.serial$Y,
    by = list(data.serial$Site, data.serial$Trt),
    FUN=mean,
    na.rm=TRUE)  # gets statistics from nonmissing values

meanY

[Output panels: na.rm=TRUE and na.rm=FALSE]

Aggregating data

• With more than one response variable

Ys <- subset(mydata,select=Y1:Y4) # data frame of numerical variables

meanYs <- aggregate(Ys, by=list(mydata$Site), # grouping variable
    FUN=mean,                                 # function to be performed
    na.rm=TRUE)

meanYs
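`aggregate()` also has a formula interface, which keeps the variable names in the output. One point to watch, sketched below on invented data: by default the formula interface removes every row that has any missing value before computing, so `na.action=na.pass` is needed for `na.rm=TRUE` to act on each variable separately:

```r
# Hypothetical data with one missing Y1 value
mydata <- data.frame(
  Site = rep(c("A", "B"), each = 2),
  Trt  = rep(1:2, times = 2),
  Y1   = c(1, 3, 5, NA),
  Y2   = c(2, 4, 6, 8)
)

# Default: the row with the missing Y1 is dropped for Y2 as well
meanDefault <- aggregate(cbind(Y1, Y2) ~ Site, data = mydata, FUN = mean)

# na.pass keeps the row; na.rm = TRUE then skips NA within each variable
meanYs <- aggregate(cbind(Y1, Y2) ~ Site, data = mydata, FUN = mean,
                    na.rm = TRUE, na.action = na.pass)

meanDefault
meanYs
```

For Site B the mean of Y2 is 6 in the first result and 7 in the second, because the second keeps the Y2 value from the row where Y1 is missing.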

Hands-on


Please do the exercise.

Thank You.