I recently used the function spread() with a factor. spread() is defined in Hadley Wickham’s package tidyr. It converts a data.frame from long format to wide format.

Here’s an example showing my problem:

library(tidyr)
set.seed(1)
data.df <- data.frame(year=rep(1:5,3), blood=rep(c("A", "B", "0"), 5), count=as.integer(runif(15)*20))
data.df$blood <- factor(data.df$blood, levels=c("A", "B", "AB", "0"))

data.df
##    year blood count
## 1     1     A     5
## 2     2     B     7
## 3     3     0    11
## 4     4     A    18
## 5     5     B     4
## 6     1     0    17
## 7     2     A    18
## 8     3     B    13
## 9     4     0    12
## 10    5     A     1
## 11    1     B     4
## 12    2     0     3
## 13    3     A    13
## 14    4     B     7
## 15    5     0    15
str(data.df)
## 'data.frame':	15 obs. of  3 variables:
##  $ year : int  1 2 3 4 5 1 2 3 4 5 ...
##  $ blood: Factor w/ 4 levels "A","B","AB","0": 1 2 4 1 2 4 1 2 4 1 ...
##  $ count: int  5 7 11 18 4 17 18 13 12 1 ...

This layout of data.df is called long format: one column contains the key (here: blood) and another column contains the value (here: count). The other layout, called wide format, has one column for each possible value of the key column, and each of those columns holds the corresponding values.

Converting from wide to long and back is done either with melt() and dcast() from package reshape2 or with gather() and spread() from package tidyr.

I prefer tidyr because I like to use Hadley’s dplyr, too.
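As a rough sketch of that round trip with the tidyr pair, here is a tiny made-up data set (long.df, wide.df and back.df are names invented for this illustration, not part of the example that follows):

```r
library(tidyr)

# Hypothetical toy data in long format: one row per (year, blood) pair
long.df <- data.frame(
  year  = rep(1:2, each = 2),
  blood = rep(c("A", "B"), 2),
  count = c(5L, 4L, 18L, 7L),
  stringsAsFactors = FALSE
)

# Long -> wide: one column per value of the key column `blood`
wide.df <- spread(long.df, blood, count)
wide.df

# Wide -> long again: collect every column except `year` into key/value pairs
back.df <- gather(wide.df, blood, count, -year)
back.df
```

The rows come back in a different order (all of A first, then all of B), but the information is the same.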

But here I lose the unused key value “AB” (or so I thought):

spread(data.df, blood, count)
##   year  A  B  0
## 1    1  5  4 17
## 2    2 18  7  3
## 3    3 13 13 11
## 4    4 18  7 12
## 5    5  1  4 15

As you can see, the possible value “AB” of the factor blood doesn’t occur in this data.frame, so we don’t get a column “AB”.
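A quick way to check which levels are declared and which are actually in use is plain base R. This is only a sketch on a standalone factor built like the blood column:

```r
# Same factor as in the example: four declared levels, three in use
blood <- factor(rep(c("A", "B", "0"), 5), levels = c("A", "B", "AB", "0"))

levels(blood)              # all declared levels, including "AB"
table(blood)               # counts per level; "AB" appears with a count of 0
levels(droplevels(blood))  # only the levels that actually occur
```

So the level is still stored in the factor; it just has no rows, which is why spread() produces no column for it by default.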

I thought this would be a nice to-do for me to enhance the package, so I read line by line through how Hadley implemented it. I learned a lot doing so. But the main thing I learned was that Hadley has already implemented this feature: set the parameter drop to FALSE and the unused levels of a factor won’t be dropped 😉

spread(data.df, blood, count, drop=FALSE, fill=0)
##   year  A  B AB  0
## 1    1  5  4  0 17
## 2    2 18  7  0  3
## 3    3 13 13  0 11
## 4    4 18  7  0 12
## 5    5  1  4  0 15

fill=0 ensures that missing values, including the whole new column “AB”, are set to zero instead of NA.
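To see what fill changes, here is a minimal sketch that runs the same call with and without it (wide.na and wide.zero are names made up for this illustration):

```r
library(tidyr)

# Rebuild the example data from above
set.seed(1)
data.df <- data.frame(year = rep(1:5, 3), blood = rep(c("A", "B", "0"), 5), count = as.integer(runif(15) * 20))
data.df$blood <- factor(data.df$blood, levels = c("A", "B", "AB", "0"))

# drop = FALSE keeps the unused level; without fill the new column holds NA
wide.na <- spread(data.df, blood, count, drop = FALSE)
wide.na$AB

# fill = 0 replaces those NAs with zero
wide.zero <- spread(data.df, blood, count, drop = FALSE, fill = 0)
wide.zero$AB
```

Whether zero or NA is the better choice depends on the data: for counts like these, zero is a sensible reading of "no observations", while for measurements NA is often the more honest value.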

So the most important lesson was, once again: RTFM (read the friendly manual)! The second lesson was that Hadley is doing a great job with all his packages. Thanks a lot!!