When someone asks me what the main advantage of R over Python is, I almost always
answer: “It’s dplyr! The way you can handle data.frames is a dream.”
Pythonista: “But pandas also has DataFrames. They are built to resemble their
counterparts in R.”
Me: “That’s true. But they only match the functionality of plain R. R has since
taken several steps ahead with dplyr.”
But dplyr is built mainly for interactive data exploration, so it’s very easy to
select, mutate, group and summarize your data.frame (or tibble).
The reason is non-standard evaluation (NSE); see Hadley Wickham’s book
Advanced R for more. NSE occurs when you use a column name without any quoting.
But when it comes to programming, it gets a little more complicated.
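To make the idea of NSE a bit more concrete, here is a minimal base-R sketch. The helper `show_expr()` is a hypothetical name and not part of dplyr; it only illustrates that a function can capture the code the caller typed instead of its value.

```r
# Hypothetical helper: returns the code the caller typed, not its value.
show_expr <- function(x) {
  deparse(substitute(x))
}

# `height` is not defined anywhere, yet the call works,
# because the argument is captured and never evaluated.
show_expr(height)

# With standard evaluation the same symbol would have to exist:
# print(height)  # would fail: object 'height' not found
```

dplyr verbs do something similar with column names, which is why they can be typed without quotes.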
So let’s look at an example:
```r
library(dplyr, warn.conflicts = FALSE)
options(dplyr.summarise.inform = FALSE)
head(starwars)
```

```
## # A tibble: 6 x 14
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
## 1 Luke…    172    77 blond      fair       blue            19   male  mascu…
## 2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu…
## 3 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu…
## 4 Dart…    202   136 none       white      yellow          41.9 male  mascu…
## 5 Leia…    150    49 brown      light      brown           19   fema… femin…
## 6 Owen…    178   120 brown, gr… light      blue            52   male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
```
```r
starwars %>%
  group_by(gender) %>%
  summarise(maximum = max(height, na.rm = TRUE))
```

```
## # A tibble: 3 x 2
##   gender    maximum
##   <chr>       <int>
## 1 feminine      213
## 2 masculine     264
## 3 <NA>          183
```
Grouping by custom values within a function
Now let’s imagine we want to group by several variables and also compute the
maximum of several variables.
So we write a function.
```r
max_by <- function(data, var, by) {
  data %>%
    group_by(by) %>%
    summarise(maximum = max(var, na.rm = TRUE))
}
```
But using it fails with an error:
```r
starwars %>% max_by(height, by = gender)
```

```
## Error: Must group by variables found in `.data`.
## * Column `by` is not found.
```
So R doesn’t find the column `by`. That’s correct: our data `starwars` doesn’t
contain such a column. R should have used the value of `by` as a column name.
So we must tell R to do so:
```r
max_by <- function(data, var, by) {
  data %>%
    group_by(!!enquo(by)) %>%
    summarise(maximum = max(!!enquo(var), na.rm = TRUE))
}
```
As you can see, we use `!!enquo()`: `enquo()` captures the expression the caller
passed, and `!!` injects it into the dplyr call.
So let’s call this function.
```r
starwars %>% max_by(height, by = gender)
```

```
## # A tibble: 3 x 2
##   gender    maximum
##   <chr>       <int>
## 1 feminine      213
## 2 masculine     264
## 3 <NA>          183
```
Great, this works as expected!
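To see what `enquo()` actually hands over, the following sketch inspects the captured object directly. The helper `capture()` is a hypothetical name used only for illustration; the other functions are rlang's own helpers, and the tiny data frame is made up.

```r
library(rlang)

# Hypothetical helper: capture the argument as a quosure
# (an expression plus the environment it came from).
capture <- function(x) {
  enquo(x)
}

q <- capture(height)

is_quosure(q)     # the argument was captured, not evaluated
as_label(q)       # the captured expression as a string
quo_get_expr(q)   # the captured expression as a symbol

# Evaluating the quosure against some data looks the column up there.
# The data frame below is invented for this example.
eval_tidy(q, data = data.frame(height = c(150, 172)))
```

This lookup in the data is roughly what `group_by()` and `summarise()` do once `!!` has injected the quosure.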
But since rlang 0.4.0 there is a more user-friendly way.
```r
max_by <- function(data, var, by) {
  data %>%
    group_by({{ by }}) %>%
    summarise(maximum = max({{ var }}, na.rm = TRUE))
}

starwars %>% max_by(height, by = gender)
```

```
## # A tibble: 3 x 2
##   gender    maximum
##   <chr>       <int>
## 1 feminine      213
## 2 masculine     264
## 3 <NA>          183
```
As you can see, we use the curly-curly brackets `{{ }}` instead of `!!enquo()`.
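Because `{{ }}` forwards whatever expression the caller supplied, a function written this way should also accept computed expressions, not only bare column names. A small sketch, assuming dplyr 1.0 or later; `max_by_cc` is a hypothetical name chosen so it does not clash with `max_by`.

```r
library(dplyr, warn.conflicts = FALSE)

# Same idea as max_by() with curly-curly, under a hypothetical name.
max_by_cc <- function(data, var, by) {
  data %>%
    group_by({{ by }}) %>%
    summarise(maximum = max({{ var }}, na.rm = TRUE), .groups = "drop")
}

# A bare column name, as in the text:
res_cm <- starwars %>% max_by_cc(height, by = gender)

# An expression built from a column (height in metres instead of cm):
res_m <- starwars %>% max_by_cc(height / 100, by = gender)
```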
Calling with a String
But what about calling the function `max_by` with strings instead of symbols?
This can happen when the arguments we pass to the function are computed
elsewhere.
So let’s try this:
```r
height_var <- 'height'
by_var <- 'gender'
starwars %>% max_by(height_var, by = by_var)
```

```
## Error: Must group by variables found in `.data`.
## * Column `by_var` is not found.
```
So this doesn’t work.
So let’s redefine our function:
```r
max_by <- function(data, var, by) {
  data %>%
    group_by(.data[[by]]) %>%
    summarise(maximum = max(.data[[var]], na.rm = TRUE))
}
```
and call it:
```r
height_var <- 'height'
by_var <- 'gender'
starwars %>% max_by(height_var, by = by_var)
```

```
## # A tibble: 3 x 2
##   gender    maximum
##   <chr>       <int>
## 1 feminine      213
## 2 masculine     264
## 3 <NA>          183
```
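The `.data[[ ]]` pronoun takes one string at a time. For the case mentioned earlier, several grouping variables and several summarised variables, one possible approach is `across()` together with `all_of()`. This is a sketch assuming dplyr 1.0 or later; `max_by_chr` is a hypothetical name. Groups in which a column has only missing values yield `-Inf` plus a warning, which is standard `max()` behaviour.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical variant: `vars` and `by` are character vectors.
max_by_chr <- function(data, vars, by) {
  data %>%
    group_by(across(all_of(by))) %>%
    summarise(
      # apply max() to every column named in `vars`
      across(all_of(vars), function(x) max(x, na.rm = TRUE)),
      .groups = "drop"
    )
}

res <- starwars %>%
  max_by_chr(vars = c("height", "mass"), by = c("gender", "sex"))
```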
Setting new column by a parameter value
Now we also want to configure the name of the new column that holds the maximum:
```r
max_by <- function(data, var, by, colname) {
  data %>%
    group_by(.data[[by]]) %>%
    summarise(!!colname := max(.data[[var]], na.rm = TRUE))
}

starwars %>% max_by("height", by = "gender", colname = "my_column")
```

```
## # A tibble: 3 x 2
##   gender    my_column
##   <chr>         <int>
## 1 feminine        213
## 2 masculine       264
## 3 <NA>            183
```
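As an alternative to `!!colname :=`, recent versions of rlang and dplyr also accept glue-style strings on the left-hand side of `:=`. The sketch below assumes such versions are installed; `max_by_named` and `max_by_auto` are hypothetical names.

```r
library(dplyr, warn.conflicts = FALSE)

# String inputs: "{colname}" is interpolated like a glue string.
max_by_named <- function(data, var, by, colname) {
  data %>%
    group_by(.data[[by]]) %>%
    summarise("{colname}" := max(.data[[var]], na.rm = TRUE), .groups = "drop")
}

# Symbol inputs: "max_{{ var }}" derives the name from the argument itself.
max_by_auto <- function(data, var, by) {
  data %>%
    group_by({{ by }}) %>%
    summarise("max_{{ var }}" := max({{ var }}, na.rm = TRUE), .groups = "drop")
}

res_named <- starwars %>%
  max_by_named("height", by = "gender", colname = "my_column")
res_auto <- starwars %>% max_by_auto(height, by = gender)
```

The second form saves the caller from passing a name at all, at the price of less control over it.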
Using multiple parameters
That’s not the end. We can even pass several function/value combinations:
```r
summarise_by <- function(data, ..., by) {
  data %>%
    group_by({{ by }}) %>%
    summarise(...)
}
```

```r
starwars %>%
  summarise_by(
    average = mean(height, na.rm = TRUE),
    maximum = max(height, na.rm = TRUE),
    by = gender
  )
```

```
## # A tibble: 3 x 3
##   gender    average maximum
##   <chr>       <dbl>   <int>
## 1 feminine     165.     213
## 2 masculine    177.     264
## 3 <NA>         181.     183
```
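The function above still groups by a single column. One way to allow several grouping columns while keeping the `...` interface is to wrap the embraced argument in `across()`. This is a sketch assuming dplyr 1.0 or later; `summarise_by_multi` is a hypothetical name.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical variant: `by` may be a tidyselect expression such as c(gender, sex).
summarise_by_multi <- function(data, ..., by) {
  data %>%
    group_by(across({{ by }})) %>%
    summarise(..., .groups = "drop")
}

res <- starwars %>%
  summarise_by_multi(
    average = mean(height, na.rm = TRUE),
    maximum = max(height, na.rm = TRUE),
    by = c(gender, sex)
  )
```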
Thanks to
This article is the result of a talk I held at Campus useR Group Frankfurt.