Skip to contents

This function lets the user reduce categorical values in a vector. It is tidyverse friendly for use on pipelines

Usage

categ_reducer(
  df,
  var,
  nmin = 0,
  pmin = 0,
  pcummax = 100,
  top = NA,
  pvalue_max = 1,
  cor_var = "tag",
  limit = 20,
  other_label = "other",
  ...
)

Arguments

df

Categorical Vector

var

Variable. Which variable do you wish to reduce?

nmin

Integer. Number of minimum times a value is repeated

pmin

Numerical. Percentage of minimum times a value is repeated

pcummax

Numerical. Top cumulative percentage of most repeated values

top

Integer. Keep the n most frequently repeated values

pvalue_max

Numeric (0-1]. Max pvalue categories

cor_var

Character. If pvalue_max < 1, you must define which column name will be compared with (numerical or binary).

limit

Integer. Limit one hot encoding to the n most frequent values of each column. Set to NA to ignore argument.

other_label

Character. With which text do you wish to replace the filtered values with?

...

Additional parameters.

Value

data.frame df on which var has been transformed

Examples

data(dft) # Titanic dataset
categ_reducer(dft, Embarked, top = 2) %>% freqs(Embarked)
#> # A tibble: 3 × 5
#>   Embarked     n     p order  pcum
#>   <chr>    <int> <dbl> <int> <dbl>
#> 1 S          644 72.3      1  72.3
#> 2 C          168 18.9      2  91.1
#> 3 other       79  8.87     3 100. 
categ_reducer(dft, Ticket, nmin = 7, other_label = "Other Ticket") %>% freqs(Ticket)
#> # A tibble: 4 × 5
#>   Ticket           n     p order  pcum
#>   <chr>        <int> <dbl> <int> <dbl>
#> 1 Other Ticket   870 97.6      1  97.6
#> 2 1601             7  0.79     2  98.4
#> 3 347082           7  0.79     3  99.2
#> 4 CA. 2343         7  0.79     4 100. 
categ_reducer(dft, Ticket, pvalue_max = 0.05, cor_var = "Survived") %>% freqs(Ticket)
#> Warning: Not a valid input: cor_var_temp was transformed or does not exist.
#>   >> Automatically using 'cor_var_temp_TRUE'
#> # A tibble: 6 × 5
#>   Ticket       n     p order  pcum
#>   <chr>    <int> <dbl> <int> <dbl>
#> 1 other      866 97.2      1  97.2
#> 2 347082       7  0.79     2  98.0
#> 3 CA. 2343     7  0.79     3  98.8
#> 4 113760       4  0.45     4  99.2
#> 5 2666         4  0.45     5  99.7
#> 6 110152       3  0.34     6 100.