This function balances a given data.frame by resampling it so that the values of a binary feature appear at a given rate relative to each other.
Arguments
- df
Vector or Dataframe. Contains different variables in each column, separated by a specific character
- var
Variable. Which variable should we use to re-sample the dataset?
- rate
Numeric. How many X do we need for every Y? Default: 1. If there are more than 2 unique values, rate will represent a percentage of the number of rows
- target
Character. If binary, which value should be reduced? If left as
"auto", the most frequent value will be reduced.
- seed
Numeric. Seed to replicate results and obtain the same values
- quiet
Boolean. Keep quiet? If not, informative messages will be shown.
Value
data.frame. Reduced, resampled data.frame in which the values of the
selected variable appear at the requested rate.
See also
Other Data Wrangling:
categ_reducer(),
cleanText(),
date_cuts(),
date_feats(),
file_name(),
formatHTML(),
holidays(),
impute(),
left(),
normalize(),
num_abbr(),
ohe_commas(),
ohse(),
quants(),
removenacols(),
replaceall(),
replacefactor(),
textFeats(),
textTokenizer(),
vector2text(),
year_month(),
zerovar()
Examples
data(dft) # Titanic dataset
df <- balance_data(dft, Survived, rate = 1)
#> Resampled from: 549 v 342
#> Reducing size for label: FALSE
#> New label distribution: 342 v 342
df <- balance_data(dft, .data$Survived, rate = 0.5, target = "TRUE")
#> Resampled from: 549 v 342
#> Reducing size for label: TRUE
#> New label distribution: 549 v 274
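
To make the roles of rate and target more concrete, here is a minimal base R sketch of down-sampling one class of a binary column. It is not the package's implementation: downsample_sketch() and the toy data are hypothetical, and the rule of keeping floor(rate * n) rows of the reduced class is an assumption inferred from the example output above (549 * 0.5 gives 274).

```r
# Hypothetical helper, not part of the package: reduce one class of a binary
# column so it has about `rate` rows for every row of the other class.
downsample_sketch <- function(df, var, rate = 1, target = "auto", seed = 0) {
  values <- as.character(df[[var]])
  tab <- table(values)
  counts <- setNames(as.integer(tab), names(tab))
  stopifnot(length(counts) == 2) # this sketch only handles binary variables

  # "auto" picks the most frequent value as the one to reduce
  if (target == "auto") target <- names(counts)[which.max(counts)]
  other <- setdiff(names(counts), target)

  # Assumed rule: keep floor(rate * n_other) rows of the target class,
  # never more than are available
  n_keep <- min(counts[[target]], floor(rate * counts[[other]]))

  set.seed(seed) # same seed, same sampled rows
  target_rows <- which(values == target)
  keep <- target_rows[sample.int(length(target_rows), n_keep)]

  # Return all rows of the other class plus the sampled target rows
  df[sort(c(which(values == other), keep)), , drop = FALSE]
}

# Toy data: 3 TRUE rows and 7 FALSE rows
toy <- data.frame(id = 1:10, flag = rep(c(TRUE, FALSE), times = c(3, 7)))
balanced <- downsample_sketch(toy, "flag", rate = 1)
table(balanced$flag) # both classes now have 3 rows
```

With target = "auto" the sketch reduces the majority class (FALSE here); passing target explicitly, as in the second example above, reduces the named class instead.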
