This function lets the user apply one hot encoding to variables that contain comma-separated values.
Arguments
- df
Dataframe. May contain one or more columns with comma-separated values, which will be split into one hot encoded columns.
- ...
Variables. Which variables to split into new columns?
- sep
Character. Which regular expression separates the elements?
- noval
Character. Text used to label missing values (it appears as NoVal in the examples).
- remove
Boolean. Remove original variables?
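Because sep is described as a regular expression, separators such as + need escaping, as in the sep = "\\+" example further down. A minimal base R illustration of that point, using strsplit() rather than anything taken from the package internals:

```r
# Base R illustration only; ohe_commas() may split values differently internally.
z <- c("AA+BB+AA", "AA", "BB+AA")

# "+" is a regex quantifier, so it must be escaped to match a literal plus sign.
parts <- strsplit(z, "\\+")

# fixed = TRUE treats the separator as a literal string instead of a regex.
parts_fixed <- strsplit(z, "+", fixed = TRUE)
```

Both calls should split the strings on the literal plus sign; the escaped form is the one that matches how sep is passed in the examples.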
Value
data.frame in which all features are numerical by nature or transformed with one hot encoding.
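To make the transformation concrete, here is a rough base R sketch of one hot encoding a single comma-separated vector. The helper ohe_sketch() and its prefix argument are hypothetical names used only for illustration; this is not the implementation used by ohe_commas(), and its handling of edge cases may differ.

```r
# Hypothetical helper for illustration; not part of the package.
ohe_sketch <- function(values, sep = ",", noval = "NoVal", prefix = "x") {
  # Split each element on the separator (a regular expression) and trim whitespace.
  parts <- lapply(strsplit(as.character(values), sep), trimws)
  # Treat missing entries as a single "no value" level.
  parts[is.na(values)] <- list(noval)
  # Collect the distinct levels in order of first appearance.
  lvls <- unique(unlist(parts))
  # One logical column per level: TRUE when the row contains that level.
  out <- lapply(lvls, function(lv) {
    vapply(parts, function(p) lv %in% p, logical(1))
  })
  names(out) <- paste(prefix, lvls, sep = "_")
  data.frame(out, check.names = FALSE)
}

x <- c("AA, D", "AA,B", "B, D", "A,D,B", NA)
res <- ohe_sketch(x)
```

With this input, the sketch yields one logical column per distinct element plus one for missing values, which is the same shape as the x_ columns in the first example below.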
See also
Other Data Wrangling:
balance_data(),
categ_reducer(),
cleanText(),
date_cuts(),
date_feats(),
file_name(),
formatHTML(),
holidays(),
impute(),
left(),
normalize(),
num_abbr(),
ohse(),
quants(),
removenacols(),
replaceall(),
replacefactor(),
textFeats(),
textTokenizer(),
vector2text(),
year_month(),
zerovar()
Other One Hot Encoding:
date_feats(),
holidays(),
ohse()
Examples
df <- data.frame(
  id = c(1:5),
  x = c("AA, D", "AA,B", "B, D", "A,D,B", NA),
  z = c("AA+BB+AA", "AA", "BB, AA", NA, "BB+AA")
)
ohe_commas(df, x, remove = TRUE)
#> # A tibble: 5 × 7
#>      id z        x_AA  x_D   x_B   x_A   x_NoVal
#>   <int> <chr>    <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1     1 AA+BB+AA TRUE  TRUE  FALSE FALSE FALSE
#> 2     2 AA       TRUE  FALSE TRUE  FALSE FALSE
#> 3     3 BB, AA   FALSE TRUE  TRUE  FALSE FALSE
#> 4     4 NA       FALSE TRUE  TRUE  TRUE  FALSE
#> 5     5 BB+AA    FALSE FALSE FALSE FALSE TRUE
ohe_commas(df, z, sep = "\\+")
#> # A tibble: 5 × 6
#>      id x     z        z_AA  z_BB  `z_AA, AA, BB, AA, NoVal, BB`
#>   <int> <chr> <chr>    <lgl> <lgl> <lgl>
#> 1     1 AA, D AA+BB+AA TRUE  TRUE  FALSE
#> 2     2 AA,B  AA       TRUE  FALSE FALSE
#> 3     3 B, D  BB, AA   FALSE FALSE FALSE
#> 4     4 A,D,B NA       FALSE FALSE FALSE
#> 5     5 NA    BB+AA    TRUE  TRUE  FALSE
ohe_commas(df, x, z)
#> # A tibble: 5 × 13
#>      id x     z        x_AA  z_D   x_B   z_A   x_NoVal `x_AA+BB+AA` z_AA  x_BB
#>   <int> <chr> <chr>    <lgl> <lgl> <lgl> <lgl> <lgl>   <lgl>        <lgl> <lgl>
#> 1     1 AA, D AA+BB+AA TRUE  TRUE  FALSE FALSE FALSE   TRUE         FALSE FALSE
#> 2     2 AA,B  AA       TRUE  FALSE TRUE  FALSE FALSE   FALSE        TRUE  FALSE
#> 3     3 B, D  BB, AA   FALSE TRUE  TRUE  FALSE FALSE   FALSE        TRUE  TRUE
#> 4     4 A,D,B NA       FALSE TRUE  TRUE  TRUE  FALSE   FALSE        FALSE FALSE
#> 5     5 NA    BB+AA    FALSE FALSE FALSE FALSE TRUE    FALSE        FALSE FALSE
#> # ℹ 2 more variables: z_NoVal <lgl>, `x_BB+AA` <lgl>
