This function lets the user apply one-hot encoding to variables that contain comma-separated values.
Arguments
- df
Dataframe. May contain one or more columns with comma-separated values, which will be split into one-hot encoded columns.
- ...
Variables. Which variables to split into new columns?
- sep
Character. Which regular expression separates the elements?
- noval
Character. Text used to label missing (NA) values; it appears as NoVal in the examples.
- remove
Boolean. Remove original variables?
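Because sep is described as a regular expression, characters with a special regex meaning, such as +, need escaping. The sketch below uses base R's strsplit() purely to illustrate that point; it is not taken from the ohe_commas() source, and the variable names are made up for the illustration.

```r
# Illustration only: base R string splitting, not the ohe_commas() internals.

# A comma followed by optional whitespace, written as a regular expression
comma_parts <- strsplit("AA, D", ",\\s*")[[1]]

# "+" is a regex quantifier, so it is escaped to match a literal plus sign
plus_parts <- strsplit("AA+BB+AA", "\\+")[[1]]

# In base R, fixed = TRUE treats the separator as a literal string instead
# (this is a strsplit() option, not an ohe_commas() argument)
plus_fixed <- strsplit("AA+BB+AA", "+", fixed = TRUE)[[1]]

print(comma_parts)
print(plus_parts)
print(plus_fixed)
```

The second and third calls return the same elements; the escaped form is the one that corresponds to the sep = "\\+" call in the Examples.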
Value
data.frame in which all features are either numerical by nature or transformed with one-hot encoding.
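To make the transformation concrete, here is a minimal base R sketch of one-hot encoding a single comma-separated vector. The helper ohe_sketch() is hypothetical and written only for this illustration; it is not part of the package and may differ from how ohe_commas() is implemented (for example, it does not prefix column names with the variable name).

```r
# Hypothetical helper for illustration; not the package's implementation.
ohe_sketch <- function(x, sep = ",", noval = "NoVal") {
  x <- as.character(x)
  x[is.na(x)] <- noval                       # label missing entries
  parts <- lapply(strsplit(x, sep), trimws)  # split on the regex, trim spaces
  lvls <- unique(unlist(parts))              # one column per distinct element
  out <- data.frame(row.names = seq_along(x))
  for (l in lvls) {
    # TRUE where the element occurs in that row's split values
    out[[l]] <- vapply(parts, function(p) l %in% p, logical(1))
  }
  out
}

res <- ohe_sketch(c("AA, D", "AA,B", "B, D", "A,D,B", NA))
print(res)
```

Each distinct element becomes a logical column, and the NA row is flagged only in the NoVal column.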
See also
Other Data Wrangling: balance_data(), categ_reducer(), cleanText(), date_cuts(), date_feats(), file_name(), formatHTML(), holidays(), impute(), left(), normalize(), num_abbr(), ohse(), quants(), removenacols(), replaceall(), replacefactor(), textFeats(), textTokenizer(), vector2text(), year_month(), zerovar()
Other One Hot Encoding: date_feats(), holidays(), ohse()
Examples
df <- data.frame(
  id = c(1:5),
  x = c("AA, D", "AA,B", "B, D", "A,D,B", NA),
  z = c("AA+BB+AA", "AA", "BB, AA", NA, "BB+AA")
)
# One-hot encode x and drop the original column
ohe_commas(df, x, remove = TRUE)
#> # A tibble: 5 × 7
#>      id z        x_AA  x_D   x_B   x_A   x_NoVal
#>   <int> <chr>    <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1     1 AA+BB+AA TRUE  TRUE  FALSE FALSE FALSE
#> 2     2 AA       TRUE  FALSE TRUE  FALSE FALSE
#> 3     3 BB, AA   FALSE TRUE  TRUE  FALSE FALSE
#> 4     4 NA       FALSE TRUE  TRUE  TRUE  FALSE
#> 5     5 BB+AA    FALSE FALSE FALSE FALSE TRUE
# Split z on a literal plus sign (escaped because sep is a regular expression)
ohe_commas(df, z, sep = "\\+")
#> # A tibble: 5 × 6
#>      id x     z        z_AA  z_BB  `z_AA, AA, BB, AA, NoVal, BB`
#>   <int> <chr> <chr>    <lgl> <lgl> <lgl>
#> 1     1 AA, D AA+BB+AA TRUE  TRUE  FALSE
#> 2     2 AA,B  AA       TRUE  FALSE FALSE
#> 3     3 B, D  BB, AA   FALSE FALSE FALSE
#> 4     4 A,D,B NA       FALSE FALSE FALSE
#> 5     5 NA    BB+AA    TRUE  TRUE  FALSE
# Pass several variables in a single call
ohe_commas(df, x, z)
#> # A tibble: 5 × 13
#>      id x     z        x_AA  z_D   x_B   z_A   x_NoVal `x_AA+BB+AA` z_AA  x_BB
#>   <int> <chr> <chr>    <lgl> <lgl> <lgl> <lgl> <lgl>   <lgl>        <lgl> <lgl>
#> 1     1 AA, D AA+BB+AA TRUE  TRUE  FALSE FALSE FALSE   TRUE         FALSE FALSE
#> 2     2 AA,B  AA       TRUE  FALSE TRUE  FALSE FALSE   FALSE        TRUE  FALSE
#> 3     3 B, D  BB, AA   FALSE TRUE  TRUE  FALSE FALSE   FALSE        TRUE  TRUE
#> 4     4 A,D,B NA       FALSE TRUE  TRUE  TRUE  FALSE   FALSE        FALSE FALSE
#> 5     5 NA    BB+AA    FALSE FALSE FALSE FALSE TRUE    FALSE        FALSE FALSE
#> # ℹ 2 more variables: z_NoVal <lgl>, `x_BB+AA` <lgl>