This function balances a given data.frame by resampling it so that the values of a binary feature appear at a given rate.
Arguments
- df
Vector or Dataframe. Contains different variables in each column, separated by a specific character
- var
Variable. Which variable should be used to re-sample the dataset?
- rate
Numeric. How many X do we need for every Y? Default: 1. If there are more than 2 unique values, rate is interpreted as a percentage of the number of rows
- target
Character. If binary, which value should be reduced? If left as "auto", the most frequent value will be reduced.
- seed
Numeric. Seed to replicate and obtain same values
- quiet
Boolean. Keep quiet? If not, messages will be printed
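The interaction between rate and target in the binary case can be illustrated with a small base R sketch. This is not the package's implementation: `downsample_binary()` and the `toy` data are hypothetical names introduced here, and the rounding rule is an assumption chosen because it is consistent with the label counts printed in the Examples section (342 * 0.5 = 171 and 549 * 0.1 rounded to 55).

```r
# Hypothetical helper, NOT the package's code: down-sample one value of a
# binary column so that n(target) is roughly rate * n(other value).
downsample_binary <- function(df, var, rate = 1, target = "auto", seed = 0) {
  values <- as.character(df[[var]])
  counts <- lengths(split(values, values))  # named integer vector of counts
  stopifnot(length(counts) == 2)
  # Assumption: "auto" reduces the most frequent value
  if (target == "auto") target <- names(counts)[which.max(counts)]
  stopifnot(target %in% names(counts))
  other <- setdiff(names(counts), target)
  # Rows of the target value to keep, never more than are available
  n_keep <- min(counts[[target]], round(counts[[other]] * rate))
  set.seed(seed)
  target_rows <- which(values == target)
  kept_target <- target_rows[sample.int(length(target_rows), n_keep)]
  df[sort(c(which(values == other), kept_target)), , drop = FALSE]
}

# Toy data: 8 FALSE rows and 4 TRUE rows
toy <- data.frame(id = 1:12, y = rep(c(TRUE, FALSE), times = c(4, 8)))
out <- downsample_binary(toy, "y", rate = 0.5)
table(out$y)
```

With rate = 0.5 and target left as "auto", the sketch reduces FALSE (the most frequent value) to half the number of TRUE rows, so the toy result keeps 2 FALSE rows and all 4 TRUE rows.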
Value
data.frame. Reduced sampled data.frame following the rate of appearance of a specific variable.
See also
Other Data Wrangling: categ_reducer(), cleanText(), date_cuts(), date_feats(), file_name(), formatHTML(), holidays(), impute(), left(), normalize(), num_abbr(), ohe_commas(), ohse(), quants(), removenacols(), replaceall(), replacefactor(), textFeats(), textTokenizer(), vector2text(), year_month(), zerovar()
Examples
data(dft) # Titanic dataset
df <- balance_data(dft, Survived, rate = 0.5)
#> Resampled from: 549 v 342
#> Reducing size for label: FALSE
#> New label distribution: 171 v 342
df <- balance_data(dft, .data$Survived, rate = 0.1, target = "TRUE")
#> Resampled from: 549 v 342
#> Reducing size for label: TRUE
#> New label distribution: 549 v 55