Sometimes I want to combine categories that don’t have much impact on my analysis. The best way to do this is with one of the forcats::fct_lump*() functions.
But I often struggle to remember how to use weights to rank the categories. That’s because the main use case of fct_lump() is a factor (or character) vector with many values: it keeps the n most frequent values and combines the rest as “Other”.
Example
Let’s look at an example:
library(tidyverse)
set.seed(42)
values <- sample(letters[1:10], 40, replace = TRUE)
values
## [1] "a" "e" "a" "i" "j" "d" "b" "j" "a" "h" "g" "d" "i" "e" "d" "j" "b" "c" "i"
## [20] "i" "d" "e" "e" "d" "b" "h" "c" "j" "a" "j" "h" "f" "j" "h" "d" "d" "f" "b"
## [39] "e" "d"
So values is a vector with 40 letters.
Now we want to see the 5 most used letters and combine all other letters as “other”.
forcats::fct_lump(values, 5)
## [1] a e a i j d b j a h Other d
## [13] i e d j b Other i i d e e d
## [25] b h Other j a j h Other j h d d
## [37] Other b e d
## Levels: a b d e h i j Other
forcats::fct_lump(values, 5) %>% table()
## .
## a b d e h i j Other
## 4 4 8 5 4 4 6 5
Because of ties, more than 5 letters are kept: a, b, h and i each appear four times and share rank 4. But that’s okay. The ties.method argument controls how ties are handled.
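As a rough sketch of how tie handling changes the result, the snippet below rebuilds a vector with the same letter counts as the printed values vector (the counts are read off the output above, so the sketch does not depend on the random seed) and compares the default ties.method = "min" with ties.method = "first".

```r
library(forcats)

# Letter counts read off the values vector printed above:
# a = 4, b = 4, c = 2, d = 8, e = 5, f = 2, g = 1, h = 4, i = 4, j = 6
values <- rep(letters[1:10], times = c(4, 4, 2, 8, 5, 2, 1, 4, 4, 6))

# Default ties.method = "min": the four letters tied at rank 4 are all kept
lumped_min <- fct_lump_n(values, n = 5)
levels(lumped_min)

# ties.method = "first": ties are broken by level order,
# so exactly 5 letters survive and h and i move into "Other"
lumped_first <- fct_lump_n(values, n = 5, ties.method = "first")
levels(lumped_first)
table(lumped_first)
```

With "first", which of the tied letters survive depends only on their position in the level order, so this is mainly useful when a hard cap on the number of categories matters more than which tied category is kept.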
Weights
Instead of simply counting values, it’s also possible to use weights. But in my typical use cases I have to compute those weights first.
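A minimal sketch of what the w argument does, using a made-up vector x and made-up weights (neither is part of the data in this post): every value gets a weight, the weights are added up per level, and the levels are ranked by that sum instead of by their frequency.

```r
library(forcats)

# Hypothetical example: four values, each occurring once
x <- c("a", "b", "c", "d")
weights <- c(10, 1, 5, 2)

# Without weights every level has count 1, so all levels tie and nothing is lumped
unweighted <- fct_lump_n(x, n = 2)
levels(unweighted)

# With weights, "a" (10) and "c" (5) rank highest; "b" and "d" become "Other"
weighted <- fct_lump_n(x, n = 2, w = weights)
weighted
```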
Here’s my case.
Let’s say I’m analyzing the pageviews of a website per browser, so I get one value for each day and browser.
set.seed(42)
# Five days of simulated pageviews per browser: three main browsers ...
chrome <- tibble(day = 1:5, useragent = "chrome", pageviews = round(rnorm(5, 1000, 100)))
firefox <- tibble(day = 1:5, useragent = "firefox", pageviews = round(rnorm(5, 600, 100)))
edge <- tibble(day = 1:5, useragent = "edge", pageviews = round(rnorm(5, 600, 100)))
# ... and three obscure ones with little traffic
junk_1 <- tibble(day = 1:5, useragent = "junk 1", pageviews = round(rnorm(5, 100, 20)))
junk_2 <- tibble(day = 1:5, useragent = "junk 2", pageviews = round(rnorm(5, 100, 20)))
junk_3 <- tibble(day = 1:5, useragent = "junk 3", pageviews = round(rnorm(5, 100, 20)))
# Stack everything into one long table
data <- bind_rows(chrome, firefox, edge, junk_1, junk_2, junk_3)
data %>%
  arrange(day) %>%
  head()
## # A tibble: 6 × 3
## day useragent pageviews
## <int> <chr> <dbl>
## 1 1 chrome 1137
## 2 1 firefox 589
## 3 1 edge 730
## 4 1 junk 1 113
## 5 1 junk 2 94
## 6 1 junk 3 91
For each day there are several different browsers. Here I have three main browsers (chrome, firefox and edge) and three obscure ones (called junk 1 to junk 3).
I want to combine the obscure ones as “Other” because I’m only interested in the top three browsers.
To do that, I rank the browsers by their total pageviews and then lump them.
data_lumped <- data %>%
  group_by(useragent) %>%
  mutate(browser_total = sum(pageviews)) %>%
  ungroup() %>%
  mutate(
    ua = fct_lump_n(f = useragent, n = 3, w = browser_total)
  ) %>%
  arrange(day, ua)
data_lumped
## # A tibble: 30 × 5
## day useragent pageviews browser_total ua
## <int> <chr> <dbl> <dbl> <fct>
## 1 1 chrome 1137 5220 chrome
## 2 1 edge 730 3179 edge
## 3 1 firefox 589 3327 firefox
## 4 1 junk 1 113 431 Other
## 5 1 junk 2 94 517 Other
## 6 1 junk 3 91 447 Other
## 7 2 chrome 944 5220 chrome
## 8 2 edge 829 3179 edge
## 9 2 firefox 751 3327 firefox
## 10 2 junk 1 94 431 Other
## # ℹ 20 more rows
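One thing worth knowing here: fct_lump_n() adds up the weights per level itself, so passing the raw pageviews as w should give the same lumping without the extra browser_total column. The sketch below checks this on a small made-up table (toy and its numbers are hypothetical, not the data from this post).

```r
library(dplyr)
library(forcats)

# Hypothetical toy data: two days, four browsers
toy <- tibble(
  day = rep(1:2, each = 4),
  useragent = rep(c("chrome", "firefox", "junk 1", "junk 2"), times = 2),
  pageviews = c(1000, 600, 100, 90, 950, 650, 80, 110)
)

# Approach from the post: precompute the total per browser, then use it as weight
with_total <- toy %>%
  group_by(useragent) %>%
  mutate(browser_total = sum(pageviews)) %>%
  ungroup() %>%
  mutate(ua = fct_lump_n(useragent, n = 2, w = browser_total))

# Shortcut: let fct_lump_n() sum the row-level pageviews per browser
direct <- toy %>%
  mutate(ua = fct_lump_n(useragent, n = 2, w = pageviews))

identical(with_total$ua, direct$ua)
```

Both approaches agree here because every browser has the same number of rows. With a precomputed total as weight, that total is counted once per row, so browsers with more rows would get proportionally more weight; using the raw pageviews avoids that.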
Now it’s simple grouping and summarizing:
data_lumped %>%
  group_by(day, ua) %>%
  summarise(pageviews = sum(pageviews), .groups = "drop") %>%
  arrange(day, ua)
## # A tibble: 20 × 3
## day ua pageviews
## <int> <fct> <dbl>
## 1 1 chrome 1137
## 2 1 edge 730
## 3 1 firefox 589
## 4 1 Other 298
## 5 2 chrome 944
## 6 2 edge 829
## 7 2 firefox 751
## 8 2 Other 253
## 9 3 chrome 1036
## 10 3 edge 461
## 11 3 firefox 591
## 12 3 Other 209
## 13 4 chrome 1063
## 14 4 edge 572
## 15 4 firefox 802
## 16 4 Other 284
## 17 5 chrome 1040
## 18 5 edge 587
## 19 5 firefox 594
## 20 5 Other 351
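The grouping and summarizing step could also be written more compactly with dplyr::count() and its wt argument, which sums a column per group. This is only a sketch on a small made-up table (toy is hypothetical, not the data from this post):

```r
library(dplyr)
library(forcats)

# Hypothetical toy data: two days, four browsers
toy <- tibble(
  day = rep(1:2, each = 4),
  useragent = rep(c("chrome", "firefox", "junk 1", "junk 2"), times = 2),
  pageviews = c(1000, 600, 100, 90, 950, 650, 80, 110)
)

# Lump to the top 2 browsers by pageviews, then sum pageviews per day and lumped browser
per_day <- toy %>%
  mutate(ua = fct_lump_n(useragent, n = 2, w = pageviews)) %>%
  count(day, ua, wt = pageviews, name = "pageviews")

per_day
```

A quick sanity check after lumping is that the overall total is unchanged: sum(per_day$pageviews) should equal sum(toy$pageviews).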