This function automatically transforms a dataframe with categorical columns into a numerical one using the one hot encoding technique.
Usage
ohse(
df,
redundant = FALSE,
drop = TRUE,
ignore = NULL,
dates = FALSE,
holidays = FALSE,
country = "Venezuela",
currency_pair = NA,
trim = 0,
limit = 10,
variance = 0.9,
other_label = "OTHER",
sep = "_",
quiet = FALSE,
...
)
Arguments
- df
Dataframe
- redundant
Boolean. Should we keep redundant columns? That is, if a column has only two different values, should we keep both new columns? If set to NULL, only binary variables will drop redundant columns.
- drop
Boolean. Automatically drop some useless features?
- ignore
Vector or character. Which columns should be ignored?
- dates
Boolean. Do you want the function to create more features out of the date/time columns?
- holidays
Boolean. Include holidays as new columns?
- country
Character or vector. For which countries should the holidays be included?
- currency_pair
Character. Which currency exchange do you wish to get the history from? For example, USD/COP, EUR/USD...
- trim
Integer. Trim names until the nth character.
- limit
Integer. Limit one hot encoding to the n most frequent values of each column. Set to NA to ignore this argument.
- variance
Numeric. Drop columns with more than n variance. Range: 0-1. For example, if a variable contains 91 unique values out of 100 observations, the column will be suppressed if this value is set to 0.9.
- other_label
Character. Which text do you wish to replace the filtered values with?
- sep
Character. Separator string.
- quiet
Boolean. Keep quiet? If not, informative messages will be shown.
- ...
Additional parameters.
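
The limit, other_label, sep and variance arguments above can be illustrated with a small base R sketch. This is not the package's implementation: cap_levels(), one_hot() and unique_share() are hypothetical helpers written only for this illustration. It also assumes that "variance" means the share of distinct values among observations, as the 91 out of 100 example suggests.

```r
# Base R sketch only; these helpers are hypothetical and NOT part of the package.

# Keep the `limit` most frequent values and relabel everything else.
cap_levels <- function(x, limit = 10, other_label = "OTHER") {
  x <- as.character(x)
  tab <- sort(table(x), decreasing = TRUE)
  top <- head(names(tab), limit)
  ifelse(x %in% top, x, other_label)
}

# One 0/1 column per distinct value, named <name><sep><value>.
one_hot <- function(x, name, sep = "_") {
  lv <- sort(unique(x))
  out <- outer(x, lv, "==") * 1
  colnames(out) <- paste(name, lv, sep = sep)
  as.data.frame(out)
}

# Assumed reading of `variance`: distinct values divided by observations.
unique_share <- function(x) length(unique(x)) / length(x)

city <- c("A", "A", "A", "B", "B", "C", "D")
capped <- cap_levels(city, limit = 2)   # "C" and "D" become "OTHER"
encoded <- one_hot(capped, "city")      # columns: city_A, city_B, city_OTHER

# 91 distinct values out of 100 observations, as in the argument description
id <- c(sprintf("id%03d", 1:91), rep("id001", 9))
too_varied <- unique_share(id) > 0.9    # would be flagged at variance = 0.9
```

Each row of encoded sums to 1, so any one of its columns can be derived from the others. That is the sense in which a column is redundant for the redundant argument.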
Value
data.frame in which all features are numerical, either by nature or after being transformed with one hot encoding.
See also
Other Data Wrangling:
balance_data(),
categ_reducer(),
cleanText(),
date_cuts(),
date_feats(),
file_name(),
formatHTML(),
holidays(),
impute(),
left(),
normalize(),
num_abbr(),
ohe_commas(),
quants(),
removenacols(),
replaceall(),
replacefactor(),
textFeats(),
textTokenizer(),
vector2text(),
year_month(),
zerovar()
Other Feature Engineering:
date_feats(),
holidays()
Other One Hot Encoding:
date_feats(),
holidays(),
ohe_commas()
Examples
data(dft)
dft <- dft[, c(2, 3, 5, 9, 11)]
ohse(dft, limit = 3) %>% head(3)
#> >>> One Hot Encoding applied to 3 variables: 'Pclass', 'Embarked', 'Survived'
#> # A tibble: 3 × 8
#> Age Fare Survived_TRUE Pclass_1 Pclass_2 Embarked_C Embarked_OTHER
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 22 7.25 0 0 0 0 0
#> 2 38 71.3 1 1 0 1 0
#> 3 26 7.92 1 0 0 0 0
#> # ℹ 1 more variable: Embarked_Q <dbl>
ohse(dft, limit = 3, redundant = NULL) %>% head(3)
#> >>> One Hot Encoding applied to 3 variables: 'Pclass', 'Embarked', 'Survived'
#> # A tibble: 3 × 10
#> Age Fare Survived_TRUE Pclass_1 Pclass_2 Pclass_3 Embarked_C Embarked_OTHER
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 22 7.25 0 0 0 1 0 0
#> 2 38 71.3 1 1 0 0 1 0
#> 3 26 7.92 1 0 0 1 0 0
#> # ℹ 2 more variables: Embarked_Q <dbl>, Embarked_S <dbl>
# Getting rid of columns with no (or too much) variance
dft$no_variance1 <- 0
dft$no_variance2 <- c("A", rep("B", nrow(dft) - 1))
dft$no_variance3 <- as.character(rnorm(nrow(dft)))
dft$no_variance4 <- c(rep("A", 20), round(rnorm(nrow(dft) - 20), 4))
ohse(dft, limit = 3) %>% head(3)
#> >>> One Hot Encoding applied to 4 variables: 'Pclass', 'Embarked', 'Survived', 'no_variance2'
#> # A tibble: 3 × 11
#> Age Fare no_variance3 no_variance4 Survived_TRUE no_variance2_B Pclass_1
#> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 22 7.25 0.153373117836… A 0 0 0
#> 2 38 71.3 -1.13813693701… A 1 1 1
#> 3 26 7.92 1.253814921069… A 1 1 0
#> # ℹ 4 more variables: Pclass_2 <dbl>, Embarked_C <dbl>, Embarked_OTHER <dbl>,
#> # Embarked_Q <dbl>
