I’m working on an R-package to access the data of a web service.
So I have to handle large data I get back from an API call. The result is
encoded in JSON-format which itself results in a large list of lists (of lists).
But I want to convert these lists of lists into a data.frame or tibble.
Sounds easy …. Here are the caveats I came across.
Some sample data
Let’s first create some sample data. As I mentioned before the result of an API-call
is a list of lists. The inner lists contain lists, too.
So here’s a function that generates such data.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
|
options(tidyverse.quiet = TRUE)
library(tidyverse)
set.seed(21)
list_entry <- function(dummy) {
list(
sample(letters, 1),
sample(1:100, 1),
list(paste0("Sublist ", round(runif(1, 0, 100))),
paste0("Sublist ", round(runif(1, 0, 100)))
),
runif(1, 0, 100)
)
}
|
Let’s build a small list and another one we can use for performance measurement.
1
2
3
4
5
|
n_small <- 5
list_of_list <-lapply(as.list(1:n_small), list_entry)
n_large <- 100000
long_list_of_list <-lapply(as.list(1:n_large), list_entry)
|
Tidyverse
First we will use the tidyverse
functions to convert the data.
Naive Way
1
|
list_of_list %>% as_tibble()
|
1
|
## Error: Columns 1, 2, 3, 4, and 5 must be named.
|
Uhh, the error occurs because the list does not contain named values.
So we need to generate (or repair) the names.
1
|
list_of_list %>% as_tibble(.name_repair = "universal")
|
1
2
3
4
5
6
|
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
|
1
2
3
4
5
6
7
|
## # A tibble: 4 x 5
## ...1 ...2 ...3 ...4 ...5
## <list> <list> <list> <list> <list>
## 1 <chr [1]> <chr [1]> <chr [1]> <chr [1]> <chr [1]>
## 2 <int [1]> <int [1]> <int [1]> <int [1]> <int [1]>
## 3 <list [2]> <list [2]> <list [2]> <list [2]> <list [2]>
## 4 <dbl [1]> <dbl [1]> <dbl [1]> <dbl [1]> <dbl [1]>
|
Okay, that’s the wrong way. It generates the transposed version of what we want.
So use a version of all of these fantastic map
-function from the purrr
-package:
(Read here for other
fantastic stuff of purrr.)
1
|
list_of_list %>% map_dfr(as_tibble, .name_repair = "universal")
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
|
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
|
1
2
3
4
5
6
7
8
9
10
11
12
13
|
## # A tibble: 10 x 4
## ...1 ...2 ...3 ...4
## <chr> <int> <list> <dbl>
## 1 a 3 <chr [1]> 91.9
## 2 a 3 <chr [1]> 91.9
## 3 p 21 <chr [1]> 66.8
## 4 p 21 <chr [1]> 66.8
## 5 j 94 <chr [1]> 52.5
## 6 j 94 <chr [1]> 52.5
## 7 b 67 <chr [1]> 77.6
## 8 b 67 <chr [1]> 77.6
## 9 j 46 <chr [1]> 77.8
## 10 j 46 <chr [1]> 77.8
|
Look’s nice. But the original list consists of 5 rows. The result above is
twice as long. So what happend? as_tibble
generates for each entry of the list
of the inner lists a row. That’s not what I’ve expected. I would prefer to get a
another column for each entry.
Flatten the Inner List
When we apply as.data.frame
to an inner list the list of each inner list is
converted into two columns:
1
2
3
4
5
|
column_names <- letters[1:5]
list_of_list %>%
map(function(x) {x <- as.data.frame(x); colnames(x) <- column_names; x}) %>%
bind_rows() %>%
as_tibble()
|
1
2
3
4
5
6
7
8
|
## # A tibble: 5 x 5
## a b c d e
## <chr> <int> <chr> <chr> <dbl>
## 1 a 3 Sublist 18 Sublist 96 91.9
## 2 p 21 Sublist 99 Sublist 85 66.8
## 3 j 94 Sublist 17 Sublist 4 52.5
## 4 b 67 Sublist 1 Sublist 61 77.6
## 5 j 46 Sublist 79 Sublist 57 77.8
|
Yeah, that’s great. That’s what we want. We can even do it with map_dfr
:
1
2
3
|
column_names <- letters[1:5]
list_of_list %>% map_dfr(function(x) {x <- as.data.frame(x); colnames(x) <- column_names; x}) %>%
as_tibble()
|
1
2
3
4
5
6
7
8
|
## # A tibble: 5 x 5
## a b c d e
## <chr> <int> <chr> <chr> <dbl>
## 1 a 3 Sublist 18 Sublist 96 91.9
## 2 p 21 Sublist 99 Sublist 85 66.8
## 3 j 94 Sublist 17 Sublist 4 52.5
## 4 b 67 Sublist 1 Sublist 61 77.6
## 5 j 46 Sublist 79 Sublist 57 77.8
|
But how performant is this code. Let’s check it with the long_list_of_list
and
measere the time with the package tictoc
.
1
2
3
4
5
6
|
library(tictoc)
tic()
column_names <- letters[1:5]
long_list_of_list %>% map_dfr(function(x) {x <- as.data.frame(x); colnames(x) <- column_names; x}) %>%
as_tibble()
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
## # A tibble: 100,000 x 5
## a b c d e
## <chr> <int> <chr> <chr> <dbl>
## 1 w 71 Sublist 93 Sublist 51 74.5
## 2 p 98 Sublist 65 Sublist 34 50.8
## 3 a 49 Sublist 51 Sublist 6 15.1
## 4 u 92 Sublist 77 Sublist 41 87.0
## 5 e 91 Sublist 63 Sublist 83 13.0
## 6 b 68 Sublist 7 Sublist 3 92.2
## 7 x 63 Sublist 13 Sublist 82 92.3
## 8 l 50 Sublist 53 Sublist 97 52.8
## 9 m 85 Sublist 92 Sublist 23 92.8
## 10 t 53 Sublist 99 Sublist 6 86.3
## # … with 99,990 more rows
|
WHAT? One minute for 100,000 rows? That’s long. What can we change?
1
2
3
4
5
|
library(tictoc)
tic()
column_names <- letters[1:5]
long_list_of_list %>% map_dfr(function(x) {x = unlist(x); x = set_names(x, column_names); x}) %>%
as_tibble()
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
## # A tibble: 100,000 x 5
## a b c d e
## <chr> <chr> <chr> <chr> <chr>
## 1 w 71 Sublist 93 Sublist 51 74.5060192188248
## 2 p 98 Sublist 65 Sublist 34 50.765589158982
## 3 a 49 Sublist 51 Sublist 6 15.1016460731626
## 4 u 92 Sublist 77 Sublist 41 87.0233365567401
## 5 e 91 Sublist 63 Sublist 83 13.0497927311808
## 6 b 68 Sublist 7 Sublist 3 92.2474714694545
## 7 x 63 Sublist 13 Sublist 82 92.2626771498471
## 8 l 50 Sublist 53 Sublist 97 52.8331008506939
## 9 m 85 Sublist 92 Sublist 23 92.7664576098323
## 10 t 53 Sublist 99 Sublist 6 86.2593436148018
## # … with 99,990 more rows
|
Three seconds – that’s better!
But the downside is we lost the different types of the columns. All columns
are converted to character
.
Don’t lose the types
The solution is still within purrr
: We can use the functionflatten
:
1
2
3
4
5
|
library(tictoc)
tic()
column_names <- letters[1:5]
long_list_of_list %>% map_dfr(function(x) {x = flatten(x); x = set_names(x, column_names); x}) %>%
as_tibble()
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
## # A tibble: 100,000 x 5
## a b c d e
## <chr> <int> <chr> <chr> <dbl>
## 1 w 71 Sublist 93 Sublist 51 74.5
## 2 p 98 Sublist 65 Sublist 34 50.8
## 3 a 49 Sublist 51 Sublist 6 15.1
## 4 u 92 Sublist 77 Sublist 41 87.0
## 5 e 91 Sublist 63 Sublist 83 13.0
## 6 b 68 Sublist 7 Sublist 3 92.2
## 7 x 63 Sublist 13 Sublist 82 92.3
## 8 l 50 Sublist 53 Sublist 97 52.8
## 9 m 85 Sublist 92 Sublist 23 92.8
## 10 t 53 Sublist 99 Sublist 6 86.3
## # … with 99,990 more rows
|