Lately I tried to visualize an hierarchy with Tableau Desktop. The problem was
that the hierarchy had a variable depth because it was tree-based.
Each row had an id and a parent_id. Normally hierarchies in Tableau
are defined by pulling some fields together, such as product category, product
group and product id.
Handling tree-based hierarchies seems to be a lot more complex. I found a plugin
at https://github.com/tableau/extension-hierarchy-navigator-sandboxed but this
only works online.
So I asked myself how I can handle this using R. I found the R-package
data.tree
at https://github.com/gluc/data.tree. I want to describe how I use
this package to preprocess my data.
Some tree-based data
Let’s create some tree-based data.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
|
options(tidyverse.quiet = TRUE)
library(tidyverse)
geography <- tribble(
~id, ~parent_id, ~area, ~some_additional_data,
1, NA, "Europe", "That's a continent",
2, 1, "Germany", "That's a country",
3, 1, "France", "Oh, yeah, another country",
4, 1, "Denmark", "Oh dear, how many contries do you know",
5, 2, "Hesse", "ah, that's called Bundesland",
6, 5, "Wiesbaden", "and that's the capital of Hesse",
7, 2, "Berlin", "that's the capital of Germany and also a Bundesland",
8, 5, "Main-Kinzig-Kreis", "that's a Landkreis",
9, 8, "Gelnhausen", "and the Landkreis does have a capital called Kreisstadt",
10, 3, "Paris", "The city of Love",
11, 2, "Bavaria", "another Bundesland",
12, 11, "Munich", "and its capital"
)
geography
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
|
## # A tibble: 12 x 4
## id parent_id area some_additional_data
## <dbl> <dbl> <chr> <chr>
## 1 1 NA Europe That's a continent
## 2 2 1 Germany That's a country
## 3 3 1 France Oh, yeah, another country
## 4 4 1 Denmark Oh dear, how many contries do you know
## 5 5 2 Hesse ah, that's called Bundesland
## 6 6 5 Wiesbaden and that's the capital of Hesse
## 7 7 2 Berlin that's the capital of Germany and also a Bun…
## 8 8 5 Main-Kinzig-Kr… that's a Landkreis
## 9 9 8 Gelnhausen and the Landkreis does have a capital called…
## 10 10 3 Paris The city of Love
## 11 11 2 Bavaria another Bundesland
## 12 12 11 Munich and its capital
|
The Task
So the first two columns id
and parent_id
define the tree. How can we convert
this format into a format where the hierarchy consists of n
columns with
the highest level in column Level_1
, the second highest in Level_2
and so on?
The code should not depend on a given level of depth.
data.tree
Let’s take a look at data.tree
:
1
2
3
|
library(data.tree)
FromDataFrameNetwork(geography)
|
That doesn’t look promising. But wait, let’s replace NA
by 0
and for further
investigation let’s save id
in another column called my_id
:
1
2
3
4
|
geography %>%
replace_na(list(parent_id = 0)) %>%
mutate(my_id = id) %>%
FromDataFrameNetwork()
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
## levelName
## 1 0
## 2 °--1
## 3 ¦--2
## 4 ¦ ¦--5
## 5 ¦ ¦ ¦--6
## 6 ¦ ¦ °--8
## 7 ¦ ¦ °--9
## 8 ¦ ¦--7
## 9 ¦ °--11
## 10 ¦ °--12
## 11 ¦--3
## 12 ¦ °--10
## 13 °--4
|
Oh great, that’s the tree structure we had in mind.
Now let’s build a data.frame
:
1
2
3
4
5
|
geography %>%
replace_na(list(parent_id = 0)) %>%
mutate(my_id = id) %>%
FromDataFrameNetwork() %>%
ToDataFrameTree("level", "area", "my_id")
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
## levelName level area my_id
## 1 0 1 <NA> NA
## 2 °--1 2 Europe 1
## 3 ¦--2 3 Germany 2
## 4 ¦ ¦--5 4 Hesse 5
## 5 ¦ ¦ ¦--6 5 Wiesbaden 6
## 6 ¦ ¦ °--8 5 Main-Kinzig-Kreis 8
## 7 ¦ ¦ °--9 6 Gelnhausen 9
## 8 ¦ ¦--7 4 Berlin 7
## 9 ¦ °--11 4 Bavaria 11
## 10 ¦ °--12 5 Munich 12
## 11 ¦--3 3 France 3
## 12 ¦ °--10 4 Paris 10
## 13 °--4 3 Denmark 4
|
Now we have the level of the hierarchy stored in column level
and the value or
name of the level in area
.
Pivoting this data.frame
should do the rest:
1
2
3
4
5
6
7
8
|
geography %>%
replace_na(list(parent_id = 0)) %>%
mutate(my_id = id) %>%
FromDataFrameNetwork() %>%
ToDataFrameTree("level", "area", "my_id") %>%
select(-levelName) %>%
mutate(level = glue::glue("Level_{level}")) %>%
pivot_wider(names_from = level, values_from = area)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
## # A tibble: 13 x 7
## my_id Level_1 Level_2 Level_3 Level_4 Level_5 Level_6
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 NA <NA> <NA> <NA> <NA> <NA> <NA>
## 2 1 <NA> Europe <NA> <NA> <NA> <NA>
## 3 2 <NA> <NA> Germany <NA> <NA> <NA>
## 4 5 <NA> <NA> <NA> Hesse <NA> <NA>
## 5 6 <NA> <NA> <NA> <NA> Wiesbaden <NA>
## 6 8 <NA> <NA> <NA> <NA> Main-Kinzig-Kreis <NA>
## 7 9 <NA> <NA> <NA> <NA> <NA> Gelnhausen
## 8 7 <NA> <NA> <NA> Berlin <NA> <NA>
## 9 11 <NA> <NA> <NA> Bavaria <NA> <NA>
## 10 12 <NA> <NA> <NA> <NA> Munich <NA>
## 11 3 <NA> <NA> France <NA> <NA> <NA>
## 12 10 <NA> <NA> <NA> Paris <NA> <NA>
## 13 4 <NA> <NA> Denmark <NA> <NA> <NA>
|
As we can see the first level Level_1
is always empty but we will need it later.
Let’s solve this by
reducing each level before pivoting.
1
2
3
4
5
6
7
8
9
10
11
|
geopgraphy_data.frame <- geography %>%
replace_na(list(parent_id = 0)) %>%
mutate(my_id = id) %>%
FromDataFrameNetwork() %>%
ToDataFrameTree("level", "area", "my_id") %>%
select(-levelName) %>%
mutate(level = level -1) %>%
mutate(level = glue::glue("Level_{level}")) %>%
pivot_wider(names_from = level, values_from = area)
geopgraphy_data.frame
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
## # A tibble: 13 x 7
## my_id Level_0 Level_1 Level_2 Level_3 Level_4 Level_5
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 NA <NA> <NA> <NA> <NA> <NA> <NA>
## 2 1 <NA> Europe <NA> <NA> <NA> <NA>
## 3 2 <NA> <NA> Germany <NA> <NA> <NA>
## 4 5 <NA> <NA> <NA> Hesse <NA> <NA>
## 5 6 <NA> <NA> <NA> <NA> Wiesbaden <NA>
## 6 8 <NA> <NA> <NA> <NA> Main-Kinzig-Kreis <NA>
## 7 9 <NA> <NA> <NA> <NA> <NA> Gelnhausen
## 8 7 <NA> <NA> <NA> Berlin <NA> <NA>
## 9 11 <NA> <NA> <NA> Bavaria <NA> <NA>
## 10 12 <NA> <NA> <NA> <NA> Munich <NA>
## 11 3 <NA> <NA> France <NA> <NA> <NA>
## 12 10 <NA> <NA> <NA> Paris <NA> <NA>
## 13 4 <NA> <NA> Denmark <NA> <NA> <NA>
|
That’s nice.
Filling missing values of the hierarchy levels
Now we need to fill the missing value by using the right value of the line
above. But wait, we have to group by each level and do this for each level.
That’s a great example for using purrr::reduce
.
We apply a function fill_level
for each level and put the result as
argument of the next call with next level.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
|
# group by the parent level of the given level_name and fill the level
fill_level <- function(data, level_name) {
grouping_level_id <- max(as.numeric(str_extract(level_name, "[0-9]+")) - 1, 0)
previous_level <- glue::glue("Level_{grouping_level_id}")
data %>%
group_by(.data[[previous_level]]) %>%
fill({{ level_name }}, .direction = "down")
}
# get all level names
level_names <- data.frame(names = names(geopgraphy_data.frame)) %>%
filter(grepl("Level_", names)) %>%
pull(names)
# use reduce to apply fill_level for each level to the result of the previous
# result
geopgraphy_data.frame <- reduce(level_names, fill_level, .init = geopgraphy_data.frame) %>%
select(starts_with("Level_"), everything()) %>%
select(-Level_0)
geopgraphy_data.frame
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
|
## # A tibble: 13 x 6
## # Groups: Level_4 [4]
## Level_1 Level_2 Level_3 Level_4 Level_5 my_id
## <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 <NA> <NA> <NA> <NA> <NA> NA
## 2 Europe <NA> <NA> <NA> <NA> 1
## 3 Europe Germany <NA> <NA> <NA> 2
## 4 Europe Germany Hesse <NA> <NA> 5
## 5 Europe Germany Hesse Wiesbaden <NA> 6
## 6 Europe Germany Hesse Main-Kinzig-Kreis <NA> 8
## 7 Europe Germany Hesse Main-Kinzig-Kreis Gelnhausen 9
## 8 Europe Germany Berlin <NA> <NA> 7
## 9 Europe Germany Bavaria <NA> <NA> 11
## 10 Europe Germany Bavaria Munich <NA> 12
## 11 Europe France <NA> <NA> <NA> 3
## 12 Europe France Paris <NA> <NA> 10
## 13 Europe Denmark <NA> <NA> <NA> 4
|
Adding other fancy columns
Now we can add all the other columns our original data has, currently just
some_additional_data
:
1
2
3
|
geopgraphy_data.frame %>%
left_join(geography %>% select(-parent_id, -area), by = c('my_id' = 'id')) %>%
filter(! is.na(Level_1)) # get rid of this line with all NAs
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
## # A tibble: 12 x 7
## # Groups: Level_4 [4]
## Level_1 Level_2 Level_3 Level_4 Level_5 my_id some_additional_data
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 Europe <NA> <NA> <NA> <NA> 1 That's a continent
## 2 Europe Germany <NA> <NA> <NA> 2 That's a country
## 3 Europe Germany Hesse <NA> <NA> 5 ah, that's called Bundesla…
## 4 Europe Germany Hesse Wiesbaden <NA> 6 and that's the capital of …
## 5 Europe Germany Hesse Main-Kinz… <NA> 8 that's a Landkreis
## 6 Europe Germany Hesse Main-Kinz… Gelnhau… 9 and the Landkreis does hav…
## 7 Europe Germany Berlin <NA> <NA> 7 that's the capital of Germ…
## 8 Europe Germany Bavaria <NA> <NA> 11 another Bundesland
## 9 Europe Germany Bavaria Munich <NA> 12 and its capital
## 10 Europe France <NA> <NA> <NA> 3 Oh, yeah, another country
## 11 Europe France Paris <NA> <NA> 10 The city of Love
## 12 Europe Denmark <NA> <NA> <NA> 4 Oh dear, how many contries…
|
So now we have a data.frame
we can use in Tableau. Our script doesn’t depend
on the overall depth of the hierarchy or the depth of one special branch.
Maybe there’s a simpler way to build this data.frame
. If so, let me know, please. :-)
Update (2021-08-06)
I’ve got an interesting feedback from Fabio. He’s got another much more elegant
way to convert the data.
He uses the pathString
attribute of a data.tree
.
Therefor he converts the orginal data with to self-joins to get rid of the ids
and replacing them with their names:
1
2
3
4
5
6
7
8
9
|
www <- geography %>%
replace_na(list(parent_id = 0)) %>%
left_join(geography %>% select(id, area), by="id", suffix=c("", "_id")) %>%
left_join(geography %>% select(id, area), by=c("parent_id"="id"), suffix=c("", "_pid")) %>%
replace_na(list(area_pid = "World")) %>%
mutate(my_id = id) %>%
select(area_id, area_pid, area, my_id) %>%
FromDataFrameNetwork()
www
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
## levelName
## 1 World
## 2 °--Europe
## 3 ¦--Germany
## 4 ¦ ¦--Hesse
## 5 ¦ ¦ ¦--Wiesbaden
## 6 ¦ ¦ °--Main-Kinzig-Kreis
## 7 ¦ ¦ °--Gelnhausen
## 8 ¦ ¦--Berlin
## 9 ¦ °--Bavaria
## 10 ¦ °--Munich
## 11 ¦--France
## 12 ¦ °--Paris
## 13 °--Denmark
|
Here is now the trick with pathString
:
1
|
ToDataFrameTree(www, "pathString", "my_id")
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
|
## levelName
## 1 World
## 2 °--Europe
## 3 ¦--Germany
## 4 ¦ ¦--Hesse
## 5 ¦ ¦ ¦--Wiesbaden
## 6 ¦ ¦ °--Main-Kinzig-Kreis
## 7 ¦ ¦ °--Gelnhausen
## 8 ¦ ¦--Berlin
## 9 ¦ °--Bavaria
## 10 ¦ °--Munich
## 11 ¦--France
## 12 ¦ °--Paris
## 13 °--Denmark
## pathString my_id
## 1 World NA
## 2 World/Europe 1
## 3 World/Europe/Germany 2
## 4 World/Europe/Germany/Hesse 5
## 5 World/Europe/Germany/Hesse/Wiesbaden 6
## 6 World/Europe/Germany/Hesse/Main-Kinzig-Kreis 8
## 7 World/Europe/Germany/Hesse/Main-Kinzig-Kreis/Gelnhausen 9
## 8 World/Europe/Germany/Berlin 7
## 9 World/Europe/Germany/Bavaria 11
## 10 World/Europe/Germany/Bavaria/Munich 12
## 11 World/Europe/France 3
## 12 World/Europe/France/Paris 10
## 13 World/Europe/Denmark 4
|
As you can see pathString
contains the path from top to bottom for the each
entry. Now we str_split
and unnest_wider
(what a great function :-)) these
values and get the similar as above.
The last step is another join to rejoin the other columns
(here only some_additional_data
).
1
2
3
4
|
ToDataFrameTree(www, "pathString", "my_id") %>%
mutate(path=stringr::str_split(pathString, "/")) %>%
unnest_wider(path, names_sep="_Level_") %>%
left_join(geography %>% select(-parent_id, -area), by = c('my_id' = 'id'))
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
|
## # A tibble: 13 x 10
## levelName pathString my_id path_Level_1 path_Level_2 path_Level_3
## <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 "World … World NA World <NA> <NA>
## 2 " °--Europe … World/Europe 1 World Europe <NA>
## 3 " ¦--Germ… World/Europe/Ger… 2 World Europe Germany
## 4 " ¦ ¦--… World/Europe/Ger… 5 World Europe Germany
## 5 " ¦ ¦ … World/Europe/Ger… 6 World Europe Germany
## 6 " ¦ ¦ … World/Europe/Ger… 8 World Europe Germany
## 7 " ¦ ¦ … World/Europe/Ger… 9 World Europe Germany
## 8 " ¦ ¦--… World/Europe/Ger… 7 World Europe Germany
## 9 " ¦ °--… World/Europe/Ger… 11 World Europe Germany
## 10 " ¦ … World/Europe/Ger… 12 World Europe Germany
## 11 " ¦--Fran… World/Europe/Fra… 3 World Europe France
## 12 " ¦ °--… World/Europe/Fra… 10 World Europe France
## 13 " °--Denm… World/Europe/Den… 4 World Europe Denmark
## # … with 4 more variables: path_Level_4 <chr>, path_Level_5 <chr>,
## # path_Level_6 <chr>, some_additional_data <chr>
|
Thanks to Fabio for the great feedback!