This function balances a given data.frame by resampling it so that the two values of a binary variable appear at a given rate relative to each other.

Usage

balance_data(df, variable, rate = 1, target = "auto", seed = 0, quiet = FALSE)

Arguments

df

Vector or Dataframe. Contains different variables in each column, separated by a specific character

variable

Variable. Which variable should be used to resample the dataset?

rate

Numeric. How many X do we need for every Y? Default: 1. If the variable has more than 2 unique values, rate instead represents the proportion of rows to keep.

target

Character. If the variable is binary, which value should be reduced? If left as "auto", the most frequent value will be reduced.

seed

Numeric. Seed used to make the resampling reproducible.

quiet

Boolean. Keep quiet? If FALSE, messages will be printed.

Value

data.frame. Down-sampled data.frame in which the values of the chosen variable follow the requested rate.
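
When the variable has more than 2 unique values, the rate argument is described as a share of the number of rows. One possible reading of that behaviour is plain random sampling of round(rate * nrow(df)) rows. The sketch below uses base R only; sample_fraction() and the toy data are hypothetical names made up for illustration and are not part of the package, and the rounding rule is an assumption.

```r
# Hypothetical helper (not part of the package): keep a random share of rows.
# Assumption: "rate" is a proportion in [0, 1] and the row count is rounded.
sample_fraction <- function(df, rate = 1, seed = 0) {
  stopifnot(rate >= 0, rate <= 1)
  set.seed(seed)                                 # make the draw reproducible
  n_keep <- round(rate * nrow(df))               # number of rows to keep
  rows <- sort(sample.int(nrow(df), n_keep))     # sample without replacement
  df[rows, , drop = FALSE]
}

# Synthetic data with three classes (not the Titanic dataset)
toy <- data.frame(id = 1:200, Pclass = rep(1:3, times = c(50, 40, 110)))
out <- sample_fraction(toy, rate = 0.25)
nrow(out)
#> [1] 50
```

Note that this reading does not stratify by class, so the class proportions in the result only match the original ones approximately.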

Examples

data(dft) # Titanic dataset
df <- balance_data(dft, Survived, rate = 0.5)
#> Resampled from: 549 v 342
#> Reducing size for label: FALSE
#> New label distribution: 171 v 342
df <- balance_data(dft, .data$Survived, rate = 0.1, target = "TRUE")
#> Resampled from: 549 v 342
#> Reducing size for label: TRUE
#> New label distribution: 549 v 55
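
The counts in the messages above are consistent with the reduced label keeping round(rate * n) rows, where n is the size of the other label (0.5 * 342 = 171, and 0.1 * 549 rounds to 55). The following is a minimal base R sketch of that idea under this assumption. downsample_binary() is a hypothetical helper written for illustration, not the package's implementation, and the toy data only mirrors the class sizes shown in the messages.

```r
# Hypothetical helper (not the package's implementation): down-sample one
# value of a binary column so it keeps round(rate * n_other) rows.
downsample_binary <- function(df, column, rate = 1, target = "auto", seed = 0) {
  set.seed(seed)
  tab <- table(df[[column]])
  counts <- setNames(as.integer(tab), names(tab))
  stopifnot(length(counts) == 2)                 # binary variables only
  # "auto" reduces the most frequent value
  if (target == "auto") target <- names(counts)[which.max(counts)]
  other <- setdiff(names(counts), target)
  # Never ask for more rows than the reduced label actually has
  n_keep <- min(counts[[target]], round(rate * counts[[other]]))
  values <- as.character(df[[column]])
  target_rows <- which(values == target)
  other_rows <- which(values == other)
  keep <- c(target_rows[sample.int(length(target_rows), n_keep)], other_rows)
  df[sort(keep), , drop = FALSE]
}

# Synthetic data with the same class sizes as the messages above (549 v 342)
toy <- data.frame(id = 1:891,
                  Survived = rep(c(FALSE, TRUE), times = c(549, 342)))
a <- downsample_binary(toy, "Survived", rate = 0.5)
table(a$Survived)   # FALSE: 171, TRUE: 342
b <- downsample_binary(toy, "Survived", rate = 0.1, target = "TRUE")
table(b$Survived)   # FALSE: 549, TRUE: 55
```

Rows of the label that is not reduced are kept untouched, so only the reduced label depends on the seed.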