Post on 19-Aug-2020

Vector and data frame indexing

Steve Bagley

somgen223.stanford.edu 1

More about vectors

Recycling rule

• When operating on multiple vectors of different lengths, R reuses the values of the shorter vector, wrapping around to its start, until it matches the length of the longer one.

• This is a common cause of confusion (and bugs), so be careful.

Recycling rule examples

c(1, 2) + c(3, 4) # length 2 + length 2
[1] 4 6
1 + c(3, 4) # length 1 + length 2
[1] 4 5
c(1, 2, 3, 4, 5) + c(3, 4) # length 5 + length 2
Warning in c(1, 2, 3, 4, 5) + c(3, 4): longer object length is not a multiple of
shorter object length
[1] 4 6 6 8 8
## which is as if you had typed (but without a warning):
c(1, 2, 3, 4, 5) + c(3, 4, 3, 4, 3) # length 5 + length 5
[1] 4 6 6 8 8
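
One way to guard against accidental recycling is to check lengths before operating. The sketch below is only an illustration: add_strict is a hypothetical helper, not something from the slides or from base R, and it allows recycling only when one argument has length 1.

```r
# Hypothetical helper (not from the slides): add two vectors, but refuse to
# recycle unless one of them has length 1.
add_strict <- function(a, b) {
  if (length(a) != length(b) && length(a) != 1 && length(b) != 1) {
    stop("lengths differ: ", length(a), " vs ", length(b))
  }
  a + b
}

add_strict(c(1, 2), c(3, 4))   # same length, element-wise sum
add_strict(1, c(3, 4))         # a single value is recycled on purpose
# add_strict(c(1, 2, 3, 4, 5), c(3, 4))  # would stop with an error
```

Turning the silent wrap-around into an error makes the mismatch visible at the point where it happens.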

Indexing a vector: positive integers index those elements of the vector

(x <- c(9, 12, 6, 10, 10, 16, 8, 4))
[1] 9 12 6 10 10 16 8 4
x[1]
[1] 9
x[2:4]
[1] 12 6 10
x[c(3, 1)]
[1] 6 9
index <- c(1, 1, 1, 2, 2, 3)
x[index]
[1] 9 9 9 12 12 6

• Indexing returns a new vector containing the selected elements. It does not change the original vector.
• Brackets [ ] are used for indexing.
• R starts counting vector indices from 1.
• You can index using a multi-element vector.
• The length of the result is the length of the index vector.

Indexing a vector: logical values pick those vector elements corresponding to TRUE

x
[1] 9 12 6 10 10 16 8 4
x >= 11
[1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
x[x >= 11]
[1] 12 16

• Logical values are either TRUE or FALSE.
• They are typically produced by using a comparison operator or similar test.
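
Logical tests can also be combined before indexing. The following is a small sketch using the same x; the variable name in_range is made up for this illustration.

```r
x <- c(9, 12, 6, 10, 10, 16, 8, 4)

# Combine two element-wise tests with & (and); | would mean "or".
in_range <- x >= 8 & x <= 12
x[in_range]

# which() converts the logical vector to the positions that are TRUE.
which(in_range)

# sum() counts the TRUE values, because TRUE is treated as 1.
sum(in_range)
```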

Indexing a vector: negative integers leave out those elements of the vector

x
[1] 9 12 6 10 10 16 8 4
x[1]
[1] 9
x[-1]
[1] 12 6 10 10 16 8 4
x[-length(x)]
[1] 9 12 6 10 10 16 8
x[c(-1, -length(x))]
[1] 12 6 10 10 16 8

• You can’t mix positive and negative vector indices in a single index expression; R will signal an error.

• What about using 0 as an index? A 0 in an index vector is ignored; x[0] on its own returns an empty vector.
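
The two edge cases above (mixed signs and zero) can be tried directly. This is a minimal sketch; tryCatch is used only so that the error is captured instead of stopping the script.

```r
x <- c(9, 12, 6, 10, 10, 16, 8, 4)

# Mixing positive and negative indices is an error; tryCatch() captures it
# and returns the string "error" instead.
result <- tryCatch(x[c(-1, 2)], error = function(e) "error")
result

# A zero inside an index vector is dropped, so this is the same as x[2].
x[c(0, 2)]

# A zero on its own selects nothing, giving a zero-length vector.
x[0]
length(x[0])
```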

Exercise: mean values

Using the vector x with values (9, 12, 6, 10, 10, 16, 8, 4)
• Select out those values greater than the mean.

Answer: mean values

x <- c(9, 12, 6, 10, 10, 16, 8, 4)
x > mean(x)
[1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE
x[x > mean(x)]
[1] 12 10 10 16

Indexing a vector with an out-of-bounds index

x
[1] 9 12 6 10 10 16 8 4
x[20]
[1] NA

• An out-of-bounds index does not cause an error.
• It returns NA.

Assigning to an out-of-bounds position

x
[1] 9 12 6 10 10 16 8 4
x[10] <- 333
x
[1] 9 12 6 10 10 16 8 4 NA 333

• Assigning to an out-of-bounds position creates that position and all the positions up to it; the positions in between are filled with NA.

Indexing a data frame

Set up data

(gene_exp1 <- read_csv(str_c(data_dir, "gene_exp1.csv")))
# A tibble: 3 x 3
  gene   control treatment
  <chr>    <dbl>     <dbl>
1 ABC123       0         1
2 DEF234      10         3
3 GKK7        12        13
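
The slides do not show data_dir or the CSV file itself. To follow along without them, the same table can be built by hand. This is a sketch that assumes the tibble package is installed; the values are copied from the printed output above.

```r
library(tibble)

# Assumption: these values are copied from the printed tibble in the slides,
# standing in for the contents of gene_exp1.csv.
gene_exp1 <- tibble(
  gene      = c("ABC123", "DEF234", "GKK7"),
  control   = c(0, 10, 12),
  treatment = c(1, 3, 13)
)
gene_exp1
```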

Turn a data frame column into a vector

gene_exp1$gene
[1] "ABC123" "DEF234" "GKK7"
gene_exp1$gene_name
Warning: Unknown or uninitialised column: `gene_name`.
NULL

• Use $ when you need to explicitly refer to the column by name.
• Note that using a non-existent name will issue a warning and return the value NULL. This is a common source of bugs.
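
Because a misspelled column name yields NULL rather than an error, it can help to check the name first. The sketch below assumes the tibble package; get_column is a hypothetical helper written for this illustration, and the tibble is rebuilt by hand from the values shown earlier.

```r
library(tibble)

# Assumption: values copied from the printed tibble in the slides.
gene_exp1 <- tibble(
  gene      = c("ABC123", "DEF234", "GKK7"),
  control   = c(0, 10, 12),
  treatment = c(1, 3, 13)
)

# Hypothetical helper: return a column as a vector, stopping with an error
# if the name is not one of the column names.
get_column <- function(data, name) {
  if (!name %in% names(data)) {
    stop("no column named '", name, "'")
  }
  data[[name]]
}

get_column(gene_exp1, "gene")
# get_column(gene_exp1, "gene_name")  # would stop with an error
```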

Select column(s) by name

gene_exp1[, "gene"]
# A tibble: 3 x 1
  gene
  <chr>
1 ABC123
2 DEF234
3 GKK7
gene_exp1[, c("treatment", "control")]
# A tibble: 3 x 2
  treatment control
      <dbl>   <dbl>
1         1       0
2         3      10
3        13      12

• Use [row, col] format for a data frame.
• You can leave out row or col.
• This returns a data frame, perhaps with only a single column.

Select column(s) by number

gene_exp1[, c(2, 1)]
# A tibble: 3 x 2
  control gene
    <dbl> <chr>
1       0 ABC123
2      10 DEF234
3      12 GKK7

• You can refer to columns by number, starting from 1.

Select rows or columns by number

z <- c(2, 3)
gene_exp1[z, ]
# A tibble: 2 x 3
  gene   control treatment
  <chr>    <dbl>     <dbl>
1 DEF234      10         3
2 GKK7        12        13
gene_exp1[, z]
# A tibble: 3 x 2
  control treatment
    <dbl>     <dbl>
1       0         1
2      10         3
3      12        13

Select row and column

gene_exp1[1, 2]
# A tibble: 1 x 1
  control
    <dbl>
1       0

• The result is a one-row, one-column data frame.
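
When the plain value is wanted rather than a one-cell data frame, one option is to pull the column out as a vector with [[ ]] and then index that vector. A sketch, assuming the tibble package and a hand-built copy of gene_exp1:

```r
library(tibble)

# Assumption: values copied from the printed tibble in the slides.
gene_exp1 <- tibble(
  gene      = c("ABC123", "DEF234", "GKK7"),
  control   = c(0, 10, 12),
  treatment = c(1, 3, 13)
)

cell <- gene_exp1[1, 2]   # still a data frame (a tibble)
is.data.frame(cell)
dim(cell)

# Extract the column as a vector, then pick its first element.
value <- gene_exp1[["control"]][1]
value
```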

Use of [[ ]]

## Use name explicitly
gene_exp1[["gene"]]
[1] "ABC123" "DEF234" "GKK7"
## Set a variable to the column name
col <- "treatment"
gene_exp1[[col]]
[1] 1 3 13

• [[ ]] returns a single data frame column as a vector.

Comparing $ and [[ ]]

x <- 2
df$x
df[[x]]
df[["x"]]

• The first expression returns the column named x.
• The second expression returns the second column, because x has the value 2.
• The third expression returns the column named x, using quotes around the column name.
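
The slides do not define df, so here is a concrete version to try. The data frame below is made up for illustration; its first column is named x and its second is named y, so the three expressions give visibly different results.

```r
# Assumption: a small made-up data frame; the slides do not define df.
df <- data.frame(
  x = c("a", "b", "c"),
  y = c(10, 20, 30),
  stringsAsFactors = FALSE
)
x <- 2

df$x       # the column named x
df[[x]]    # the second column (y), because the variable x holds 2
df[["x"]]  # the column named x again
```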

Factors (repeated from day 4)

• Factors are a powerful, but sometimes perplexing, way to work with discrete-valued data.

• The possible values of a factor are drawn from a finite set of alternatives or categories. Factors are often used in graphics and analysis for grouping.

• Example: encoding the sex of a human subject as either M or F and grouping by sex.

• Example: encoding the names of the fifty US states and grouping by state.

• Note that many measured values are better represented not as factors but as either integers (such as for counting) or floating-point (real-valued) numbers. Example: number of subjects, weight.
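
A short base R sketch of the M/F example; the values are made up for illustration.

```r
# Made-up example values for illustration.
sex <- factor(c("M", "F", "F", "M", "F"))

levels(sex)      # the distinct categories, sorted by default
table(sex)       # counts per category, a simple form of grouping
as.integer(sex)  # the underlying integer codes that index into the levels
```

The integer codes show why a factor is more than a character vector: each value is stored as a position in the list of levels.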

Set up data

gene_tall <- gather(gene_exp1, condition, expression_level, control:treatment)
(gene_tall2 <- mutate(gene_tall, condition = as.factor(condition)))
# A tibble: 6 x 3
  gene   condition expression_level
  <chr>  <fct>                <dbl>
1 ABC123 control                  0
2 DEF234 control                 10
3 GKK7   control                 12
4 ABC123 treatment                1
5 DEF234 treatment                3
6 GKK7   treatment               13

• <fct> means that the column type is factor.

Plot

gene_tall2 %>%
  ggplot(aes(condition, expression_level)) +
  geom_point(aes(color = gene))

[Plot: expression_level (y-axis) against condition (x-axis: control, treatment), with points colored by gene (ABC123, DEF234, GKK7).]

• Note the order of values on the x-axis: it comes from the order of the levels of the factor: “control,” then “treatment”.

• By default this will be alphabetical order.

What are the levels?

gene_tall2$condition
[1] control control control treatment treatment treatment
Levels: control treatment
levels(gene_tall2$condition)
[1] "control" "treatment"

• A factor is a type of vector, so has a similar print representation.
• It is augmented by the second line, which lists the levels in order.
• The levels function returns the levels explicitly.

How to change the order of the levels

gene_tall2$condition
[1] control control control treatment treatment treatment
Levels: control treatment
fct_relevel(gene_tall2$condition, "treatment", "control")
[1] control control control treatment treatment treatment
Levels: treatment control

• Note the values are unchanged.
• Note the order of the levels is changed.
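
For comparison, the same reordering can be sketched in base R by rebuilding the factor with an explicit levels argument. This is an alternative to fct_relevel, not what the slides use; the vector condition is rebuilt by hand from the values shown above.

```r
# Values copied from the printed factor above.
condition <- factor(c("control", "control", "control",
                      "treatment", "treatment", "treatment"))
levels(condition)

# Base R alternative to fct_relevel(): rebuild the factor with explicit levels.
releveled <- factor(condition, levels = c("treatment", "control"))
levels(releveled)

# The values themselves are unchanged; only the level order differs.
identical(as.character(condition), as.character(releveled))
```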

Update the data frame with the new levels

gene_tall2 <- gene_tall2 %>%
  mutate(condition = fct_relevel(condition, "treatment", "control"))

New plot

gene_tall2 %>%
  ggplot(aes(condition, expression_level)) +
  geom_point(aes(color = gene))

[Plot: expression_level (y-axis) against condition (x-axis: treatment, control), with points colored by gene (ABC123, DEF234, GKK7).]

• The order on the x-axis reflects the new factor level order.

How to change the factor values

gene_tall2$condition
[1] control control control treatment treatment treatment
Levels: treatment control
fct_recode(gene_tall2$condition, ctrl = "control", trt = "treatment")
[1] ctrl ctrl ctrl trt trt trt
Levels: trt ctrl

• You might need shorter values for graph labels.
• In fct_recode, write each argument as new_value = "old_value".
• Note the order of the levels stays the same.
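
A self-contained sketch of the recoding step, assuming the forcats package is installed. The factor is rebuilt by hand with the level order (treatment, control) shown above.

```r
library(forcats)

# Values and level order copied from the printed factor above.
condition <- factor(
  c("control", "control", "control", "treatment", "treatment", "treatment"),
  levels = c("treatment", "control")
)

# Each argument has the form new_name = "old_name".
short <- fct_recode(condition, ctrl = "control", trt = "treatment")

as.character(short)  # the values now use the short names
levels(short)        # same level order as before, with the new names
```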

New plot

gene_tall2 <- gene_tall2 %>%
  mutate(condition = fct_recode(condition, ctrl = "control", trt = "treatment"))
gene_tall2 %>%
  ggplot(aes(condition, expression_level)) +
  geom_point(aes(color = gene))

[Plot: expression_level (y-axis) against condition (x-axis: trt, ctrl), with points colored by gene (ABC123, DEF234, GKK7).]

• The x-axis labels now show the recoded values; their order still follows the factor level order.

Reading

• Read: Chapter 15, “Factors,” in R for Data Science.
