Data Manipulation on R: Factor Manipulations, subset, sorting and Reshape
Abhik Seal, Indiana University School of Informatics and Computing (dsdht.wikispaces.com)

Post on 05-Dec-2014

Basic Manipulating Data
So far, we've covered how to read data in from various sources, such as files, the internet, and databases, and how to read various file formats. In this session we manipulate the data after reading it in, to make later processing easier.

Sorting and Ordering data
sort(x, decreasing=FALSE): 'sort (or order) a vector or factor (partially) into ascending or descending order.'
order(..., decreasing=FALSE): 'returns a permutation which rearranges its first argument into ascending or descending order, breaking ties by further arguments.'

x <- c(1,5,7,8,3,12,34,2)
sort(x)
## [1]  1  2  3  5  7  8 12 34

order(x)
## [1] 1 8 5 2 3 4 6 7
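The two functions are linked: indexing a vector by its own order() reproduces sort(). The sketch below illustrates this, together with tie-breaking by a second key; the vectors g and v are made up for illustration and are not from the slides.

```r
x <- c(1, 5, 7, 8, 3, 12, 34, 2)

# order() returns positions; indexing x by them gives the sorted vector
idx <- order(x)
same_as_sort <- identical(x[idx], sort(x))

# descending order
x_desc <- sort(x, decreasing = TRUE)

# ties in the first key (g) are broken by the second key (v)
g <- c("b", "a", "b", "a")
v <- c(2, 2, 1, 1)
tie_idx <- order(g, v)
```

Because order() returns positions rather than values, the same index can be reused to rearrange the rows of a whole data frame.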

Some examples of sorting and ordering
# sort by mpg (columns are referenced through mtcars$ so that no attach() is needed)
newdata <- mtcars[order(mtcars$mpg),]
head(newdata,3)
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4

# sort by mpg and cyl
newdata <- mtcars[order(mtcars$mpg, mtcars$cyl),]
head(newdata,3)
##                      mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0  0    3    4
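A common variation is to mix ascending and descending keys. One way to sketch this in base R is to negate a numeric key inside order(); the object names by_cyl_mpg and by_mpg_desc are chosen here for illustration.

```r
# sort by cyl ascending, then by mpg descending within each cyl group
by_cyl_mpg <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]
head(by_cyl_mpg[, c("mpg", "cyl")], 3)

# with() evaluates mpg inside mtcars, avoiding the repeated mtcars$ prefix
by_mpg_desc <- with(mtcars, mtcars[order(mpg, decreasing = TRUE), ])
head(by_mpg_desc[, c("mpg", "cyl")], 3)
```

Negation only works for numeric keys; for character keys, decreasing = TRUE applies to all keys at once.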

Ordering with plyr
library(plyr)
head(arrange(mtcars,mpg),3)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
## 2 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
## 3 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4

head(arrange(mtcars,desc(mpg)),3)
##    mpg cyl disp hp drat    wt  qsec vs am gear carb
## 1 33.9   4 71.1 65 4.22 1.835 19.90  1  1    4    1
## 2 32.4   4 78.7 66 4.08 2.200 19.47  1  1    4    1
## 3 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2

Subsetting data
set.seed(12345)
# create a data frame
X <- data.frame("A"=sample(1:10),"B"=sample(11:20),"C"=sample(21:30))
# add NA values
X <- X[sample(1:10),]
X$B[c(1,6,10)] = NA
head(X)
##     A  B  C
## 8   4 NA 27
## 1   8 11 25
## 2  10 12 23
## 5   3 13 24
## 3   7 16 28
## 10  5 NA 26

Basic data subsetting
# accessing only the first row
X[1,]
##   A  B  C
## 8 4 NA 27

# accessing only the first column
X[,1]
##  [1]  4  8 10  3  7  5  9  1  2  6

# accessing the first row and first column
X[1,1]
## [1] 4
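Rows and columns can also be selected by name, and a single selected column collapses to a plain vector unless drop = FALSE is given. A minimal sketch on a small, made-up data frame (df is not the X used in the slides):

```r
df <- data.frame(A = c(4, 8, 10), B = c(NA, 11, 12), C = c(27, 25, 23))

# selecting one column by name returns a vector
col_vec <- df[, "A"]

# drop = FALSE keeps the result as a one-column data frame
col_df <- df[, "A", drop = FALSE]

# first two rows, columns A and C only
block <- df[1:2, c("A", "C")]
```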

And/OR's
head(X[(X$A <=6 & X$C > 24),],3)
##    A  B  C
## 8  4 NA 27
## 10 5 NA 26
## 7  2 19 29

head(X[(X$A <=6 | X$C > 24),],3)
##   A  B  C
## 8 4 NA 27
## 1 8 11 25
## 5 3 13 24

Select non-NA values from the data frame
# select the rows without NA values in column B
# (is.na() tests for missing values; comparing against the string 'NA' does not)
head(X[!is.na(X$B),],4)
##    A  B  C
## 1  8 11 25
## 2 10 12 23
## 5  3 13 24
## 3  7 16 28

# select those rows where B > 11; which() drops the NA comparisons
head(X[which(X$B>11),],4)
##    A  B  C
## 2 10 12 23
## 5  3 13 24
## 3  7 16 28
## 4  9 20 30
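The reason which() appears in these examples is that a logical index containing NA produces all-NA rows. The sketch below compares the options on a small, made-up data frame (df is illustrative, not the X from the slides).

```r
df <- data.frame(A = 1:4, B = c(NA, 12, 15, NA))

# a raw logical index keeps one all-NA row for every NA comparison
raw_rows <- df[df$B > 11, ]

# which() keeps only the positions that are TRUE
which_rows <- df[which(df$B > 11), ]

# subset() also treats NA in the condition as FALSE
subset_rows <- subset(df, B > 11)

# rows where B is not missing
non_missing <- df[!is.na(df$B), ]
```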

# creating a data frame with 2 variables
data <- data.frame(x1=c(2,3,4,5,6),x2=c(5,6,7,8,1))
# storing the data frame and a vector together in a list
list_data <- list(dat=data,vec.obj=c(1,2,3))
list_data
## $dat
##   x1 x2
## 1  2  5
## 2  3  6
## 3  4  7
## 4  5  8
## 5  6  1
## 
## $vec.obj
## [1] 1 2 3

# accessing the second element of the list_data object
list_data[[2]]
## [1] 1 2 3
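List elements can be reached by position or by name, and single brackets behave differently from double brackets. A short sketch, using a smaller version of the list above:

```r
list_data <- list(dat = data.frame(x1 = c(2, 3), x2 = c(5, 6)),
                  vec.obj = c(1, 2, 3))

# [[ ]] and $ extract the element itself
by_position <- list_data[[2]]
by_name <- list_data$vec.obj

# [ ] returns a sub-list that still wraps the element
sub_list <- list_data[2]

# extraction can be chained to reach a column inside the data frame
x1_column <- list_data[["dat"]]$x1
```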

Factors
Factors are used to represent categorical data, and can also be used for ordinal data (i.e. categories that have an intrinsic ordering). Note that before R 4.0.0, functions like read.table() read character strings in as factors by default (stringsAsFactors = TRUE); since R 4.0.0 the default is FALSE.
'The function factor is used to encode a vector as a factor (the terms 'category' and 'enumerated type' are also used for factors). If argument ordered is TRUE, the factor levels are assumed to be ordered. For compatibility with S there is also a function ordered.'
is.factor, is.ordered, as.factor and as.ordered are the membership and coercion functions for these classes.
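As a small illustration of the ordered argument, the sketch below builds an ordinal factor; the variable name sizes and its level names are invented for the example.

```r
# levels= fixes the ordering; ordered = TRUE makes comparisons meaningful
sizes <- factor(c("low", "high", "medium", "low"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)

is_ord <- is.ordered(sizes)

# ordered factors support <, <=, > and >= against a level name
below_high <- sizes < "high"

# the underlying integer codes follow the level order
codes <- as.integer(sizes)
```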

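Factors
Suppose we have a vector of case-control status

cc=factor(c("case","case","case","control","control","control"))
cc
## [1] case    case    case    control control control
## Levels: case control

levels(cc)=c("control","case")
cc
## [1] control control control case    case    case   
## Levels: control case

Note that assigning to levels() relabels the existing levels rather than reordering them: every observation that was "case" is now labelled "control". To change only the order of the levels, pass levels= to factor(), as in the next example.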
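Assigning to levels() and reordering the levels are different operations, which the following sketch contrasts on a shorter vector (the object names relabelled, reordered and releveled are chosen for illustration).

```r
cc <- factor(c("case", "case", "control"))   # levels: case, control

# assigning to levels() renames level 1 and level 2: the data change meaning
relabelled <- cc
levels(relabelled) <- c("control", "case")

# factor(..., levels=) changes the level order but keeps each observation's label
reordered <- factor(cc, levels = c("control", "case"))

# relevel() moves one level to the front as the reference level
releveled <- relevel(cc, ref = "control")
```

Printing as.character(relabelled) next to as.character(reordered) makes the difference visible.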

Factors
Factors can be converted to numeric or character very easily

x=factor(c("case","case","case","control","control","control"),levels=c("control","case"))
as.character(x)
## [1] "case"    "case"    "case"    "control" "control" "control"

as.numeric(x)
## [1] 2 2 2 1 1 1
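One caveat when the labels themselves look like numbers: as.numeric() returns the internal level codes, not the labels. A minimal sketch with a made-up factor f:

```r
f <- factor(c("10", "20", "10", "5"))

# level codes: levels are sorted as strings, i.e. "10", "20", "5"
codes <- as.numeric(f)

# going through character first recovers the numeric labels
values <- as.numeric(as.character(f))
```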

Cut
Now that we know more about factors, cut() will make more sense:

x=1:100
cx=cut(x,breaks=c(0,10,25,50,100))
head(cx)
## [1] (0,10] (0,10] (0,10] (0,10] (0,10] (0,10]
## Levels: (0,10] (10,25] (25,50] (50,100]

table(cx)
## cx
##   (0,10]  (10,25]  (25,50] (50,100] 
##       10       15       25       50 

Cut
We can also leave off the labels

cx=cut(x,breaks=c(0,10,25,50,100),labels=FALSE)
head(cx)
## [1] 1 1 1 1 1 1

table(cx)
## cx
##  1  2  3  4 
## 10 15 25 50 

Cut
Values that fall outside the breaks become NA:

cx=cut(x,breaks=c(10,25,50),labels=FALSE)
head(cx)
## [1] NA NA NA NA NA NA

table(cx)
## cx
##  1  2 
## 15 25 

table(cx,useNA="ifany")
## cx
##    1    2 <NA> 
##   15   25   60 
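By default the intervals are open on the left, so a value equal to the lowest break is also NA. The sketch below shows include.lowest and custom labels; the vector scores and the label names are invented for illustration.

```r
scores <- c(0, 10, 25, 50, 100)
brks <- c(0, 10, 25, 50, 100)

# default intervals are (a, b], so the value 0 falls outside and becomes NA
default_bins <- cut(scores, breaks = brks)

# include.lowest = TRUE closes the first interval: [0,10]
closed_bins <- cut(scores, breaks = brks, include.lowest = TRUE)

# custom labels instead of interval notation
named_bins <- cut(scores, breaks = brks, include.lowest = TRUE,
                  labels = c("low", "medium", "high", "very high"))
```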

Adding to data frames
m1=matrix(1:9,nrow=3,ncol=3,byrow=FALSE)
m1
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

m2=matrix(1:9,nrow=3,ncol=3,byrow=TRUE)
m2
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Adding using cbind
You can add columns (or another matrix/data frame) to a data frame or matrix using cbind() ('column bind'). You can also add rows (or another matrix/data frame) using rbind() ('row bind'). Note that the vector you are adding has to have the same length as the number of rows (for cbind()) or the number of columns (for rbind()).

cbind(m1,m2)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7    1    2    3
## [2,]    2    5    8    4    5    6
## [3,]    3    6    9    7    8    9
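The same length rule applies when binding a single vector. A brief sketch of rbind() on a matrix and cbind() on a data frame; the data frame df and its score column are made up for the example.

```r
m1 <- matrix(1:9, nrow = 3, ncol = 3)

# rbind: the new row needs one value per column
m_rows <- rbind(m1, c(10, 11, 12))

# cbind on a data frame: the new column needs one value per row
df <- data.frame(id = 1:3)
df_cols <- cbind(df, score = c(0.5, 0.7, 0.9))
```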

Reshape data
A dataset can be laid out in long or wide format. In the long layout, multiple rows represent a single subject's record, whereas in the wide layout a single row represents a single subject's record. Some statistical analyses require wide data and others require long data, so we often need to reshape the data to meet the requirements of the analysis. Data reshaping is just a rearrangement of the form of the data; it does not change the content of the dataset. This section focuses on the melt and cast paradigm of reshaping datasets, which is implemented in the contributed package reshape. The package was later reimplemented under a new name, reshape2, which is much more time and memory efficient (see the paper Reshaping Data with the reshape Package by Wickham, available at http://www.jstatsoft.org/v21/i12/paper).

Wide data has a column for each variable. For example, this is wide-format data:

#   ozone   wind  temp
# 1 23.62 11.623 65.55
# 2 29.44 10.267 79.10
# 3 59.12  8.942 83.90
# 4 59.96  8.794 83.97

Data in long format

#    variable  value
# 1     ozone 23.615
# 2     ozone 29.444
# 3     ozone 59.115
# 4     ozone 59.962
# 5      wind 11.623
# 6      wind 10.267
# 7      wind  8.942
# 8      wind  8.794
# 9      temp 65.548
# 10     temp 79.100
# 11     temp 83.903
# 12     temp 83.968

reshape2 Package
"In reality, you need long-format data much more commonly than wide-format data. For example, ggplot2 requires long-format data, plyr requires long-format data, and most modelling functions (such as lm(), glm(), and gam()) require long-format data. But people often find it easier to record their data in wide format."

reshape2 is based around two key functions, melt and cast: melt takes wide-format data and melts it into long-format data; cast takes long-format data and casts it into wide-format data.

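Melt
library(reshape2)
# airquality ships with capitalised column names (Ozone, Solar.R, ...);
# lower-case them so that they match the output and the later id.vars
names(airquality) <- tolower(names(airquality))
head(airquality,2)
##   ozone solar.r wind temp month day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2

aql <- melt(airquality) # [a]ir [q]uality [l]ong format
head(aql,5)
##   variable value
## 1    ozone    41
## 2    ozone    36
## 3    ozone    12
## 4    ozone    18
## 5    ozone    NA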

By default, melt has assumed that all columns with numeric values are variables with values. Maybe here we want to know the values of ozone, solar.r, wind, and temp for each month and day. We can do that with melt by telling it that we want month and day to be "ID variables". ID variables are the variables that identify individual rows of data.

m <- melt(airquality, id.vars = c("month", "day"))
head(m,4)
##   month day variable value
## 1     5   1    ozone    41
## 2     5   2    ozone    36
## 3     5   3    ozone    12
## 4     5   4    ozone    18

melt also allows us to control the column names in the long-format data

m <- melt(airquality, id.vars = c("month", "day"),
          variable.name = "climate_variable",
          value.name = "climate_value")
head(m)
##   month day climate_variable climate_value
## 1     5   1            ozone            41
## 2     5   2            ozone            36
## 3     5   3            ozone            12
## 4     5   4            ozone            18
## 5     5   5            ozone            NA
## 6     5   6            ozone            28

Long- to wide-format data: the cast functions
In reshape2 there are multiple cast functions. Since you will most commonly work with data.frame objects, we'll explore the dcast function. (There is also acast to return a vector, matrix, or array.) dcast uses a formula to describe the shape of the data.

Here, we need to tell dcast that month and day are the ID variables.

m <- melt(airquality, id.vars = c("month", "day"))
aqw <- dcast(m, month + day ~ variable)
head(aqw)
##   month day ozone solar.r wind temp
## 1     5   1    41     190  7.4   67
## 2     5   2    36     118  8.0   72
## 3     5   3    12     149 12.6   74
## 4     5   4    18     313 11.5   62
## 5     5   5    NA      NA 14.3   56
## 6     5   6    28      NA 14.9   66

Besides re-arranging the columns, we've recovered our original data.
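When the formula leaves out an ID variable, several long rows map to one cell and dcast needs an aggregation function. The sketch below assumes reshape2 is installed and computes monthly means; the names aq and monthly are chosen for the example.

```r
library(reshape2)

aq <- airquality
names(aq) <- tolower(names(aq))   # lower-case names, as in the slides

m <- melt(aq, id.vars = c("month", "day"))

# one row per month; each cell is the mean of that variable in that month
# (na.rm = TRUE is passed through to mean())
monthly <- dcast(m, month ~ variable, fun.aggregate = mean, na.rm = TRUE)
monthly
```

Without fun.aggregate, dcast falls back to counting the rows in each cell, which is rarely what is wanted.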

Data Manipulation Using plyr
For large-scale data, we can split the dataset, perform the manipulation or analysis on each piece, and then combine the results into a single output again. Doing this kind of split with base R alone is not very efficient; to overcome this limitation, Wickham developed an R package called plyr in 2011, in which he efficiently implemented the split-apply-combine strategy. We can compare this strategy to the map-reduce strategy for processing large amounts of data.

In the coming slides I will give examples of the split-apply-combine strategy:

· Without loops
· With loops
· Using the plyr package

Without loops
I am using the iris dataset here

1. Split the iris dataset into three parts.

2. Remove the species name variable from the data.

3. Calculate the mean of each variable for the three different parts separately.

4. Combine the output into a single data frame.

# splitting the data and dropping the Species column (the split step)
iris.set <- iris[iris$Species=="setosa",-5]
iris.versi <- iris[iris$Species=="versicolor",-5]
iris.virg <- iris[iris$Species=="virginica",-5]
# calculating the mean for each piece (the apply step)
mean.set <- colMeans(iris.set)
mean.versi <- colMeans(iris.versi)
mean.virg <- colMeans(iris.virg)
# combining the output (the combine step)
mean.iris <- rbind(mean.set,mean.versi,mean.virg)
# giving row names so that the output can be easily understood
rownames(mean.iris) <- c("setosa","versicolor","virginica")

With Loops

NB: The split-apply-combine strategy requires each piece to be independent of the others. The strategy won't work if one piece depends on another.

mean.iris.loop <- NULL
for(species in unique(iris$Species)) {
  # split: rows of one species
  iris_sub <- iris[iris$Species==species,]
  # apply: column means without the Species column
  column_means <- colMeans(iris_sub[,-5])
  # combine: append the result as a new row
  mean.iris.loop <- rbind(mean.iris.loop,column_means)
}
# giving row names so that the output can be easily understood
rownames(mean.iris.loop) <- unique(iris$Species)

Using plyr
library(plyr)
ddply(iris,~Species,function(x) colMeans(x[,-which(colnames(x)=="Species")]))
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026

mean.iris.loop
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa            5.006       3.428        1.462       0.246
## versicolor        5.936       2.770        4.260       1.326
## virginica         6.588       2.974        5.552       2.026
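For comparison, base R can also express the three steps compactly with split() and sapply(), or in one call with aggregate(). This is an alternative sketch, not the approach used in the slides; the names pieces, means and agg are illustrative.

```r
# split: a list with one data frame per species (Species column dropped)
pieces <- split(iris[, -5], iris$Species)

# apply + combine: colMeans on each piece, transposed to one row per species
means <- t(sapply(pieces, colMeans))

# aggregate() performs all three steps through a formula interface
agg <- aggregate(. ~ Species, data = iris, FUN = mean)
```

Here means is a matrix with the species as row names, while agg is a data frame with a Species column, similar in shape to the ddply() result.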

Merging data frames
# Make a data frame mapping story numbers to titles
stories <- read.table(header=TRUE, text='
   storyid  title
    1       lions
    2      tigers
    3       bears
')

# Make another data frame with the data and story numbers (no titles)
data <- read.table(header=TRUE, text='
    subject storyid rating
          1       1    6.7
          1       2    4.5
          1       3    3.7
          2       2    3.3
          2       3    4.1
          2       1    5.2
')

Merge the two data frames

merge(stories, data, "storyid")
##   storyid  title subject rating
## 1       1  lions       1    6.7
## 2       1  lions       2    5.2
## 3       2 tigers       1    4.5
## 4       2 tigers       2    3.3
## 5       3  bears       1    3.7
## 6       3  bears       2    4.1

If the two data frames have different names for the columns you want to match on, the names can be specified:

# In this case, the column is named 'id' instead of storyid
stories2 <- read.table(header=TRUE, text='
   id       title
    1       lions
    2      tigers
    3       bears
')
merge(x=stories2, y=data, by.x="id", by.y="storyid")
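By default merge() keeps only keys present in both data frames. The all, all.x and all.y arguments keep unmatched rows as well, as sketched below on made-up data in which story 3 has no rating and story 4 has no title.

```r
stories <- data.frame(storyid = 1:3,
                      title = c("lions", "tigers", "bears"))
ratings <- data.frame(subject = c(1, 1, 2),
                      storyid = c(1, 2, 4),
                      rating = c(6.7, 4.5, 3.3))

# inner join (default): only storyid 1 and 2 appear in both
inner <- merge(stories, ratings, by = "storyid")

# left join: keep every story, with NA where there is no rating
left <- merge(stories, ratings, by = "storyid", all.x = TRUE)

# full outer join: keep unmatched rows from both sides
full <- merge(stories, ratings, by = "storyid", all = TRUE)
```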

Resources and Materials used

· Data Manipulation with R by Phil Spector
· Getting and Cleaning Data Coursera course
· plyr by Hadley Wickham
· Andrew Jaffe notes
· R cookbook
