{forcats} demo by Allison Horst (Bren)

In this session, Allison did a few examples using functions in the {forcats} package to reorder, relevel and lump together factor levels!

The functions shown in examples were:

  • fct_reorder(): reorder factor levels by values of another variable
  • fct_relevel(): manually reorder factor levels
  • fct_infreq(): reorder factor levels by the number of observations in each level
  • fct_lump(): lump infrequent factor levels together into a single level

Here are several examples using the starwars dataset from dplyr. Have more fun with factors by letting {forcats} help!

Making a simplified dataset for examples

Starting from starwars, I create a subset with the species variable recast as a factor, only keeping variables name, species, and height and removing any cases (rows) where either species or height is missing (NA):

# Attach the tidyverse
library(tidyverse)

# Check out starwars data
glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia O…
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, …
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77…
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", …
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", …
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue"…
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0,…
## $ sex        <chr> "male", "none", "none", "male", "female", "male", "female"…
## $ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femin…
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "…
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Hum…
## $ films      <list> [<"The Empire Strikes Back", "Revenge of the Sith", "Retu…
## $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "I…
## $ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1…
# Wrangling & simplifying
sw_fct <- starwars %>% 
  mutate(species = factor(species)) %>% 
  select(name, species, height) %>% 
  drop_na(species, height) 

# Check the class and levels of species variable
class(sw_fct$species)
## [1] "factor"
levels(sw_fct$species)
##  [1] "Aleena"         "Besalisk"       "Cerean"         "Chagrian"      
##  [5] "Clawdite"       "Droid"          "Dug"            "Ewok"          
##  [9] "Geonosian"      "Gungan"         "Human"          "Hutt"          
## [13] "Iktotchi"       "Kaleesh"        "Kaminoan"       "Kel Dor"       
## [17] "Mirialan"       "Mon Calamari"   "Muun"           "Nautolan"      
## [21] "Neimodian"      "Pau'an"         "Quermian"       "Rodian"        
## [25] "Skakoan"        "Sullustan"      "Tholothian"     "Togruta"       
## [29] "Toong"          "Toydarian"      "Trandoshan"     "Twi'lek"       
## [33] "Vulptereen"     "Wookiee"        "Xexto"          "Yoda's species"
## [37] "Zabrak"

We can see from the levels() output above that the default order of factor levels is alphabetical. But often we want factor levels to be ordered by frequency, or value, or some other manual specification.
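
The default ordering is easy to see on a tiny vector. This is a minimal sketch using a hypothetical `sizes` vector (made up for illustration, not part of starwars) and base R's factor() only:

```r
# Hypothetical toy vector, made up for illustration (not from starwars)
sizes <- c("small", "large", "medium", "small")

# factor() sorts the unique values to build the levels,
# so the order of appearance in the data does not matter
default_levels <- levels(factor(sizes))
default_levels
## [1] "large"  "medium" "small"

# Passing `levels =` explicitly is the base R way to override that order
manual_levels <- levels(factor(sizes, levels = c("small", "medium", "large")))
manual_levels
## [1] "small"  "medium" "large"
```

Typing out every level by hand gets tedious with 37 species, which is where the {forcats} helpers come in.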

Below are several ways to update factor levels.

fct_reorder(): Reordering a factor by another variable

Example: I want to reorder species factor levels by the maximum value of height in each level.

by_max_height <- sw_fct %>% 
  mutate(species = fct_reorder(species, height, .fun = max))

levels(by_max_height$species) # Ta-da.
##  [1] "Yoda's species" "Aleena"         "Ewok"           "Vulptereen"    
##  [5] "Dug"            "Xexto"          "Toydarian"      "Sullustan"     
##  [9] "Toong"          "Clawdite"       "Mirialan"       "Rodian"        
## [13] "Hutt"           "Zabrak"         "Togruta"        "Mon Calamari"  
## [17] "Twi'lek"        "Geonosian"      "Tholothian"     "Iktotchi"      
## [21] "Kel Dor"        "Trandoshan"     "Muun"           "Neimodian"     
## [25] "Skakoan"        "Chagrian"       "Nautolan"       "Besalisk"      
## [29] "Cerean"         "Droid"          "Human"          "Pau'an"        
## [33] "Kaleesh"        "Gungan"         "Kaminoan"       "Wookiee"       
## [37] "Quermian"
# Check: this matches the order we get by finding the maximum height for each level
sw_fct %>% 
  group_by(species) %>% 
  summarize(max_height = max(height)) %>% 
  arrange(max_height)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 37 x 2
##    species        max_height
##    <fct>               <int>
##  1 Yoda's species         66
##  2 Aleena                 79
##  3 Ewok                   88
##  4 Vulptereen             94
##  5 Dug                   112
##  6 Xexto                 122
##  7 Toydarian             137
##  8 Sullustan             160
##  9 Toong                 163
## 10 Clawdite              168
## # … with 27 more rows
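
The choice of summary function matters. The sketch below uses hypothetical toy vectors (`grp` and `val`, made up for illustration) to compare the default summary function of fct_reorder(), which is median(), with max(), and to show the `.desc` argument for reversing the order:

```r
library(forcats)

# Hypothetical toy data, made up for illustration
grp <- factor(c("a", "a", "b", "b", "c", "c"))
val <- c(1, 10, 6, 7, 8, 9)

# Default summary function is median(): a = 5.5, b = 6.5, c = 8.5
by_median <- levels(fct_reorder(grp, val))
by_median
## [1] "a" "b" "c"

# With max(): b = 7, c = 9, a = 10
by_max <- levels(fct_reorder(grp, val, .fun = max))
by_max
## [1] "b" "c" "a"

# .desc = TRUE puts the level with the largest summary value first
by_max_desc <- levels(fct_reorder(grp, val, .fun = max, .desc = TRUE))
by_max_desc
## [1] "a" "c" "b"
```

Level "a" has the smallest median but the largest maximum, so it moves from first to last when the summary function changes.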

fct_relevel(): Reorder factor levels by hand

Sometimes (e.g. when updating a reference level for a model) we’ll want to manually set the order of factor levels. Use fct_relevel() to move the levels you name to the front, in the order you list them. If you name fewer levels than the factor has, the remaining levels follow in their existing order (alphabetical here, because that is the default order of species).

Example: Make the first three levels Ewok, Droid, and Wookiee.

ewok_droid_wookiee <- sw_fct %>% 
  mutate(species = fct_relevel(species, "Ewok","Droid","Wookiee"))

# Verify the new level order in species
levels(ewok_droid_wookiee$species)
##  [1] "Ewok"           "Droid"          "Wookiee"        "Aleena"        
##  [5] "Besalisk"       "Cerean"         "Chagrian"       "Clawdite"      
##  [9] "Dug"            "Geonosian"      "Gungan"         "Human"         
## [13] "Hutt"           "Iktotchi"       "Kaleesh"        "Kaminoan"      
## [17] "Kel Dor"        "Mirialan"       "Mon Calamari"   "Muun"          
## [21] "Nautolan"       "Neimodian"      "Pau'an"         "Quermian"      
## [25] "Rodian"         "Skakoan"        "Sullustan"      "Tholothian"    
## [29] "Togruta"        "Toong"          "Toydarian"      "Trandoshan"    
## [33] "Twi'lek"        "Vulptereen"     "Xexto"          "Yoda's species"
## [37] "Zabrak"
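
To see what fct_relevel() does with the levels you do not name, it helps to start from a factor whose levels are not alphabetical. This is a minimal sketch on a hypothetical toy factor `f` (made up for illustration); it also shows the `after` argument, which places the named levels somewhere other than the front:

```r
library(forcats)

# Hypothetical toy factor whose levels are deliberately NOT alphabetical
f <- factor(c("w", "x", "y", "z"), levels = c("z", "y", "x", "w"))

# Move "w" to the front; the unnamed levels keep their existing order
to_front <- levels(fct_relevel(f, "w"))
to_front
## [1] "w" "z" "y" "x"

# after = Inf moves the named level to the end instead
to_end <- levels(fct_relevel(f, "z", after = Inf))
to_end
## [1] "y" "x" "w" "z"
```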

fct_infreq(): Reordering a factor by the frequency of values

To reorder factor levels by the number of observations in each level, use fct_infreq(). By default, the level with the most observations becomes the first level.

Example: reorder species factor levels based on the number of observations in each group.

# Reassign factor levels based on frequency of observations within each level
sw_by_freq <- sw_fct %>% 
  mutate(species = fct_infreq(species))

# See the new levels
levels(sw_by_freq$species)
##  [1] "Human"          "Droid"          "Gungan"         "Kaminoan"      
##  [5] "Mirialan"       "Twi'lek"        "Wookiee"        "Zabrak"        
##  [9] "Aleena"         "Besalisk"       "Cerean"         "Chagrian"      
## [13] "Clawdite"       "Dug"            "Ewok"           "Geonosian"     
## [17] "Hutt"           "Iktotchi"       "Kaleesh"        "Kel Dor"       
## [21] "Mon Calamari"   "Muun"           "Nautolan"       "Neimodian"     
## [25] "Pau'an"         "Quermian"       "Rodian"         "Skakoan"       
## [29] "Sullustan"      "Tholothian"     "Togruta"        "Toong"         
## [33] "Toydarian"      "Trandoshan"     "Vulptereen"     "Xexto"         
## [37] "Yoda's species"
# Verify that this lines up with counts by species (shows first 5 rows)
sw_fct %>% 
  count(species) %>% 
  arrange(desc(n)) %>% 
  head(5)
## # A tibble: 5 x 2
##   species      n
##   <fct>    <int>
## 1 Human       31
## 2 Droid        5
## 3 Gungan       3
## 4 Kaminoan     2
## 5 Mirialan     2
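
If you would rather have the least frequent level first, one option is to wrap the result in fct_rev(), which reverses the level order. A minimal sketch on a hypothetical toy factor `animals` (made up for illustration):

```r
library(forcats)

# Hypothetical toy factor: ant appears 2 times, bee 3 times, cow 1 time
animals <- factor(c("ant", "bee", "bee", "cow", "bee", "ant"))

# Most frequent level first
most_first <- levels(fct_infreq(animals))
most_first
## [1] "bee" "ant" "cow"

# Reverse the order so the least frequent level comes first
least_first <- levels(fct_rev(fct_infreq(animals)))
least_first
## [1] "cow" "ant" "bee"
```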

fct_lump() options: Collapsing the least/most frequent levels of a factor into “Other”

There are several variations of fct_lump() (not all are included here).

fct_lump_n(): Lump together all levels except the n most frequent

# Lump together all groups except the three with the highest number of observations; everything else gets put in level "Other":
sw_lump_n <- sw_fct %>% 
  mutate(species = fct_lump_n(species, 3))

levels(sw_lump_n$species)
## [1] "Droid"  "Gungan" "Human"  "Other"
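
The name of the lumped level can be changed with the `other_level` argument. A minimal sketch on a hypothetical toy factor `f` (made up for illustration, with counts a = 3, b = 2, c = 1, d = 1):

```r
library(forcats)

# Hypothetical toy factor: a appears 3 times, b 2 times, c and d once each
f <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# Keep the 2 most frequent levels; everything else becomes "Other"
top2 <- levels(fct_lump_n(f, n = 2))
top2
## [1] "a"     "b"     "Other"

# Same lumping, but with a custom name for the lumped level
top2_rare <- levels(fct_lump_n(f, n = 2, other_level = "Rare"))
top2_rare
## [1] "a"    "b"    "Rare"
```

In both cases the lumped level is placed last.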

fct_lump_min(): Lump together any levels that appear fewer than min times

# Lump together any levels with FEWER than 2 observations in the group into "Other":
sw_lump_min <- sw_fct %>% 
  mutate(species = fct_lump_min(species, 2))

levels(sw_lump_min$species)
## [1] "Droid"    "Gungan"   "Human"    "Kaminoan" "Mirialan" "Twi'lek"  "Wookiee" 
## [8] "Zabrak"   "Other"
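
A quick way to check what landed in "Other" is to tabulate the lumped factor. A minimal sketch on a hypothetical toy factor `f` (made up for illustration), using base R's table() for the counts:

```r
library(forcats)

# Hypothetical toy factor: a appears 3 times, b 2 times, c and d once each
f <- factor(c("a", "a", "a", "b", "b", "c", "d"))

# Levels with fewer than 2 observations (c and d) are lumped into "Other"
lumped <- fct_lump_min(f, min = 2)
levels(lumped)
## [1] "a"     "b"     "Other"

# "Other" now holds the two observations that used to be c and d
lumped_counts <- as.vector(table(lumped))
lumped_counts
## [1] 3 2 2
```

The threshold is inclusive: a level with exactly `min` observations (here b, with 2) is kept.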

That’s it for today! See more at forcats.tidyverse.org!
