dplyr for programming

When someone asks me what the main advantage of R over Python is, I almost always answer: “It’s dplyr! The way you can handle data.frames is a dream.”

Pythonista: “But pandas also has DataFrames. They are built to resemble their counterparts in R.”

Me: “That’s true. But they only match the functionality of base R. R has since taken several steps ahead with dplyr.”

But dplyr is built mainly for interactive data exploration. So it’s very easy to select, mutate, group and summarize your data.frame (or tibble). The reason is non-standard evaluation (NSE); see Hadley Wickham’s book Advanced R for more. NSE occurs when you use a column name without any quoting.
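
To make the difference concrete, here is a minimal sketch that contrasts the two styles on the starwars data. The variable names tall_base and tall_dplyr are made up for this illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Standard evaluation in base R: the column is addressed via $ and a quoted name
tall_base <- starwars[!is.na(starwars$height) & starwars$height > 200, "name"]

# Non-standard evaluation in dplyr: `height` and `name` are bare column names
# that dplyr looks up inside the data frame
tall_dplyr <- starwars %>%
  filter(height > 200) %>%
  select(name)

# Both approaches pick the same characters
identical(tall_base$name, tall_dplyr$name)
```

The dplyr version reads more naturally, but the bare names are exactly what causes trouble once the code moves into a function.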

But when it comes to programming, it gets a little more complicated.

So let’s look at an example:

library(dplyr, warn.conflicts = FALSE)
options(dplyr.summarise.inform = FALSE)

head(starwars)

## # A tibble: 6 x 14
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
## 1 Luke…    172    77 blond      fair       blue            19   male  mascu…
## 2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu…
## 3 R2-D2     96    32 <NA>       white, bl… red             33   none  mascu…
## 4 Dart…    202   136 none       white      yellow          41.9 male  mascu…
## 5 Leia…    150    49 brown      light      brown           19   fema… femin…
## 6 Owen…    178   120 brown, gr… light      blue            52   male  mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

starwars %>%
  group_by(gender) %>%
  summarise(maximum = max(height, na.rm = TRUE))

## # A tibble: 3 x 2
##   gender    maximum
##   <chr>       <int>
## 1 feminine      213
## 2 masculine     264
## 3 <NA>          183

Grouping by custom values within a function

Now let’s imagine we want to group by several variables and also compute the maximum of several variables.

So we write a function.

max_by <- function(data, var, by) { 
  data %>%
    group_by(by) %>%
    summarise(maximum = max(var, na.rm = TRUE)) 
}

But using it fails with an error:

starwars %>% max_by(height, by = gender)

## Error: Must group by variables found in `.data`.
## * Column `by` is not found.

So R can’t find a column named by. That’s correct: our starwars data doesn’t contain such a column. R should instead have used the value of by as the column name.

So we must tell R to do so:

max_by <- function(data, var, by) { 
  data %>%
    group_by(!!enquo(by)) %>%
    summarise(maximum = max(!!enquo(var), na.rm = TRUE)) 
}

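As you can see, we use !!enquo(): enquo() captures the expression the caller passed for the argument without evaluating it, and !! inserts (unquotes) that expression into the dplyr call.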
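
The two steps can be looked at separately. The following is a minimal sketch: show_captured() and max_by_steps() are hypothetical helpers that are not part of the original talk, and rlang::as_label() is used only to print what was captured.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: shows what enquo() captured
show_captured <- function(x) {
  captured <- enquo(x)       # the caller's expression, not yet evaluated
  rlang::as_label(captured)  # convert the captured expression to a string
}

label <- show_captured(height)
label
#> [1] "height"

# Two-step spelling of max_by(): capture first, unquote later
max_by_steps <- function(data, var, by) {
  var <- enquo(var)
  by <- enquo(by)
  data %>%
    group_by(!!by) %>%
    summarise(maximum = max(!!var, na.rm = TRUE))
}

res_steps <- starwars %>% max_by_steps(height, by = gender)
```

Note that height does not exist as a variable outside of starwars; show_captured() still works because the argument is never evaluated, only captured.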

So let’s call this function.

starwars %>% max_by(height, by = gender)

## # A tibble: 3 x 2
##   gender    maximum
##   <chr>       <int>
## 1 feminine      213
## 2 masculine     264
## 3 <NA>          183

Great, this works as expected!

But since rlang 0.4.0 there is a more user-friendly way:

max_by <- function(data, var, by) { 
  data %>%
    group_by({{ by }}) %>%
    summarise(maximum = max({{ var }}, na.rm = TRUE))
}

starwars %>% max_by(height, by = gender)

## # A tibble: 3 x 2
##   gender    maximum
##   <chr>       <int>
## 1 feminine      213
## 2 masculine     264
## 3 <NA>          183

As you can see we use the curly-curly brackets {{ }} instead of !!enquo().
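
Because {{ }} forwards whatever expression the caller wrote, the argument does not have to be a bare column name. A minimal sketch, with the max_by() from above repeated so the block runs on its own; the result names res_cm and res_m are made up for the illustration.

```r
library(dplyr, warn.conflicts = FALSE)
options(dplyr.summarise.inform = FALSE)

max_by <- function(data, var, by) {
  data %>%
    group_by({{ by }}) %>%
    summarise(maximum = max({{ var }}, na.rm = TRUE))
}

# A bare column name, as before
res_cm <- starwars %>% max_by(height, by = gender)

# A computed expression: height converted from centimetres to metres
res_m <- starwars %>% max_by(height / 100, by = gender)

# The maxima differ only by the factor 100
all.equal(res_m$maximum, res_cm$maximum / 100)
```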

Calling with a String

But what about calling the function max_by with strings instead of symbols? This can happen when the parameters we pass to the function are computed in some other way.

So let’s try this:

height_var <- 'height'
by_var <- 'gender'

starwars %>% max_by(height_var, by = by_var)

## Error: Must group by variables found in `.data`.
## * Column `by_var` is not found.

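So this doesn’t work: {{ }} passes the symbol by_var on to group_by(), which then looks for a column with that name instead of using the string stored in it.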

So let’s redefine our function:

max_by <- function(data, var, by) { 
  data %>%
    group_by(.data[[by]]) %>%
    summarise(maximum = max(.data[[var]], na.rm = TRUE)) 
}

and call it:

height_var <- 'height'
by_var <- 'gender'
starwars %>% max_by(height_var, by = by_var)

## # A tibble: 3 x 2
##   gender    maximum
##   <chr>       <int>
## 1 feminine      213
## 2 masculine     264
## 3 <NA>          183
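
The .data[[ ]] pronoun takes exactly one column name. If the strings arrive as character vectors with several names, one possible approach is across() together with all_of(). This is a sketch under assumptions: max_by_chr() and the toy data are invented for the illustration, and across() requires dplyr 1.0.0 or later.

```r
library(dplyr, warn.conflicts = FALSE)
options(dplyr.summarise.inform = FALSE)

# Hypothetical variant of max_by(): `vars` and `by` are character vectors
max_by_chr <- function(data, vars, by) {
  data %>%
    group_by(across(all_of(by))) %>%
    summarise(across(all_of(vars), ~ max(.x, na.rm = TRUE)))
}

# Made-up toy data, small enough to check by hand
toy <- tibble(
  team   = c("a", "a", "b", "b"),
  year   = c(1, 2, 1, 1),
  points = c(3, 5, 2, NA),
  fouls  = c(1, 0, 4, 2)
)

res_chr <- toy %>% max_by_chr(c("points", "fouls"), by = c("team", "year"))
res_chr
```

The result has one row per team/year combination and keeps the original column names for the maxima.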

Setting new column by a parameter value

Now we also want to configure the name of the new column that holds the maximum:

max_by <- function(data, var, by, colname) { 
  data %>%
    group_by(.data[[by]]) %>%
    summarise(!!colname := max(.data[[var]], na.rm = TRUE)) 
}


starwars %>% max_by("height", by = "gender", colname = "my_column")

## # A tibble: 3 x 2
##   gender    my_column
##   <chr>         <int>
## 1 feminine        213
## 2 masculine       264
## 3 <NA>            183
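
As an alternative to !!colname, newer versions of rlang also accept a glue-style string on the left-hand side of :=. The sketch below assumes dplyr 1.0.0 or later with a matching rlang; max_by_prefix() and its prefix argument are hypothetical names for this illustration.

```r
library(dplyr, warn.conflicts = FALSE)
options(dplyr.summarise.inform = FALSE)

# Hypothetical variant: the column name is built from a prefix and the variable name
max_by_prefix <- function(data, var, by, prefix = "max") {
  data %>%
    group_by(.data[[by]]) %>%
    summarise("{prefix}_{var}" := max(.data[[var]], na.rm = TRUE))
}

res_prefix <- starwars %>% max_by_prefix("height", by = "gender")
names(res_prefix)
```

With these arguments the new column should be called max_height, so the caller does not have to invent a name for every variable.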

Using multiple parameters

That’s not the end. We can even pass several function/value combinations through the dots (...):

summarise_by <- function(data, ..., by) { 
  data %>%
    group_by({{ by }}) %>%
    summarise(...) 
}

starwars %>% 
  summarise_by(
    average = mean(height, na.rm = TRUE), 
    maximum = max(height, na.rm = TRUE), 
    by = gender
  )

## # A tibble: 3 x 3
##   gender    average maximum
##   <chr>       <dbl>   <int>
## 1 feminine     165.     213
## 2 masculine    177.     264
## 3 <NA>         181.     183
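
The dots can just as well carry the grouping variables instead of the summaries. A minimal sketch, assuming a hypothetical max_by_dots() in which everything after var is forwarded to group_by():

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical variant: all arguments after `var` are treated as grouping variables
max_by_dots <- function(data, var, ...) {
  data %>%
    group_by(...) %>%
    summarise(maximum = max({{ var }}, na.rm = TRUE), .groups = "drop")
}

res_dots <- starwars %>% max_by_dots(height, sex, gender)
names(res_dots)
#> [1] "sex"     "gender"  "maximum"
```

Forwarding ... needs no {{ }} or enquo() at all; .groups = "drop" is set here so that the result comes back ungrouped.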

Thanks to

Most of the examples are taken from this article at tidyverse.org: https://www.tidyverse.org/articles/2019/06/rlang-0-4-0/
Documentation of rlang: https://rlang.r-lib.org/index.html
Cheatsheet: https://github.com/rstudio/cheatsheets/blob/master/tidyeval.pdf
Tidyeval: https://tidyeval.tidyverse.org/

This article is the result of a talk I gave at the Campus useR Group Frankfurt.
