Pre-process your data before training a model. This is the preliminary step
in the h2o_automl() function's pipeline. It is also available on its own for
other use cases, such as when you want to use any other framework, library,
or custom algorithm.
Usage
model_preprocess(
  df,
  y = "tag",
  ignore = NULL,
  train_test = NA,
  split = 0.7,
  weight = NULL,
  target = "auto",
  balance = FALSE,
  impute = FALSE,
  no_outliers = TRUE,
  unique_train = TRUE,
  center = FALSE,
  scale = FALSE,
  thresh = 10,
  seed = 0,
  quiet = FALSE
)
Arguments
- df
Dataframe. Dataframe containing all your data, including the dependent variable labeled as 'tag'. If you want to define which variable should be used instead, use the y parameter.
- y
Character. Column name for dependent variable or response.
- ignore
Character vector. Columns the model should be forced to ignore.
- train_test
Character. If needed, the name of the column in df containing 'test' and 'train' values used to split the data.
- split
Numeric. Value between 0 and 1 used to split the data into train/test datasets. The value is the proportion assigned to the training set. Set it to 1 to train with all available data and test with the same data (cross-validation will still be used when training). If train_test is set, this value is overwritten with the actual split rate.
- weight
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.
- target
Value. Which is your target positive value? If set to 'auto', the target with the largest mean(score) will be selected. Change the value to override it. Only used for binary categorical models.
- balance
Boolean. Auto-balance train dataset with under-sampling?
- impute
Boolean. Fill NA values with MICE?
- no_outliers
Boolean/Numeric. Remove y's outliers from the dataset? Removes values that are farther than n standard deviations from the dependent variable's mean (Z-score). Set to TRUE for the default (3) or to a numeric value to set a different multiplier.
- unique_train
Boolean. Keep only unique row observations for training data?
- center, scale
Boolean. Using the base function scale, do you wish to center and/or scale all numerical values?
- thresh
Integer. Threshold for selecting classification or regression models: the number of unique values in 'tag' that separates the two (more than: regression; less than: classification).
- seed
Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models is used because max_time is resource limited.
- quiet
Boolean. Quiet all messages, warnings, recommendations?
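The thresh, no_outliers, and balance arguments describe simple rules that are easy to reproduce outside the package. The following is a minimal base R sketch of those rules, not the package's implementation. The helper names guess_model_type, drop_outliers_z, and undersample are hypothetical, and treating a count exactly equal to thresh as classification is an assumption, since the description leaves that case open.

```r
# Hypothetical helpers illustrating the documented rules; not lares internals.

# thresh: few unique values suggest classification, many suggest regression.
# Assumption: a count exactly equal to thresh is treated as classification.
guess_model_type <- function(y, thresh = 10) {
  if (length(unique(y)) > thresh) "Regression" else "Classification"
}

# no_outliers: drop rows whose y is more than n standard deviations from its mean.
drop_outliers_z <- function(df, y, n = 3) {
  z <- (df[[y]] - mean(df[[y]], na.rm = TRUE)) / sd(df[[y]], na.rm = TRUE)
  df[is.na(z) | abs(z) <= n, , drop = FALSE]
}

# balance: under-sample every class down to the size of the smallest class.
undersample <- function(df, y, seed = 0) {
  set.seed(seed)
  n_min <- min(table(df[[y]]))
  rows <- unlist(lapply(split(seq_len(nrow(df)), df[[y]]), function(i) {
    i[sample.int(length(i), n_min)]
  }))
  df[sort(rows), , drop = FALSE]
}

# Toy data: 14 rows of class "a", 6 of class "b", and one extreme value.
toy <- data.frame(
  label = c(rep("a", 14), rep("b", 6)),
  value = c(1:19, 1000)
)

guess_model_type(toy$label)     # two unique values
guess_model_type(seq_len(50))   # fifty unique values
trimmed <- drop_outliers_z(toy, "value")
balanced <- undersample(toy, "label")
table(balanced$label)
```

Under-sampling discards rows from the larger classes, which is why the first example below reports using only part of the training data after balancing.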
Value
List. Contains the original data.frame (df), an index identifying which observations will be part of the train dataset (train_index), and which model type should be used (model_type).
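When using the result with another framework, a typical next step is to separate the train and test rows. The sketch below uses a hand-built stand-in for the returned list so that it runs without the package. The element names (df, train_index, model_type) follow the description above and may differ in the real object, and train_index is assumed to hold integer row positions; inspect a real result with names() and str() before relying on either assumption.

```r
# Stand-in for the list returned by model_preprocess(); built by hand for illustration.
# Assumptions: elements are named df, train_index, model_type, and
# train_index holds integer row positions.
set.seed(0)
toy <- data.frame(tag = rnorm(10), x = runif(10))
prep <- list(
  df = toy,
  train_index = sort(sample.int(nrow(toy), 7)),
  model_type = "Regression"
)

# Separate train and test rows using the index.
train <- prep$df[prep$train_index, , drop = FALSE]
test <- prep$df[-prep$train_index, , drop = FALSE]

# Fit any model on the training rows; a linear model is only a placeholder here.
if (prep$model_type == "Regression") {
  fit <- lm(tag ~ x, data = train)
  preds <- predict(fit, newdata = test)
}
```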
See also
Other Machine Learning: ROC(), conf_mat(), export_results(), gain_lift(), h2o_automl(), h2o_predict_MOJO(), h2o_selectmodel(), impute(), iter_seeds(), lasso_vars(), model_metrics(), msplit()
Examples
data(dft) # Titanic dataset
model_preprocess(dft, "Survived", balance = TRUE)
#> - DEPENDENT VARIABLE: Survived
#> - MODEL TYPE: Classification
#> # A tibble: 2 × 5
#> tag n p order pcum
#> <lgl> <int> <dbl> <int> <dbl>
#> 1 FALSE 549 61.6 1 61.6
#> 2 TRUE 342 38.4 2 100
#> - MISSINGS: The following variables contain missing observations: Age (19.87%). Consider using the impute parameter.
#> - CATEGORICALS: There are 5 non-numerical features. Consider using ohse() or equivalent prior to encode categorical variables.
#> >>> Splitting data: train = 0.7 && test = 0.3
#> train_size test_size
#> 623 268
#> - BALANCE: Training set balanced: 244 observations for each (2) category; using 78.33% of training data
model_preprocess(dft, "Fare", split = 0.5, scale = TRUE)
#> - DEPENDENT VARIABLE: Fare
#> - MODEL TYPE: Regression
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00 7.91 14.45 32.20 31.00 512.33
#> - MISSINGS: The following variables contain missing observations: Age (19.87%). Consider using the impute parameter.
#> - CATEGORICALS: There are 6 non-numerical features. Consider using ohse() or equivalent prior to encode categorical variables.
#> - TRANSFORMATIONS: All numerical features (4) were scaled
#> >>> Splitting data: train = 0.5 && test = 0.5
#> train_size test_size
#> 435 436
model_preprocess(dft, "Pclass", ignore = c("Fare", "Cabin"))
#> - DEPENDENT VARIABLE: Pclass
#> - MODEL TYPE: Classification
#> # A tibble: 3 × 5
#> tag n p order pcum
#> <fct> <int> <dbl> <int> <dbl>
#> 1 n_3 491 55.1 1 55.1
#> 2 n_1 216 24.2 2 79.4
#> 3 n_2 184 20.6 3 100
#> - MISSINGS: The following variables contain missing observations: Age (19.87%). Consider using the impute parameter.
#> - SKIPPED: Ignored variables for training models: 'Fare', 'Cabin'
#> - CATEGORICALS: There are 4 non-numerical features. Consider using ohse() or equivalent prior to encode categorical variables.
#> >>> Splitting data: train = 0.7 && test = 0.3
#> train_size test_size
#> 623 268
model_preprocess(dft, "Pclass", quiet = TRUE)
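The center and scale arguments are described as using the base function scale on all numerical values. As a rough illustration of what that means, here is a base R sketch that standardizes every numeric column and leaves the rest untouched. The helper scale_numeric is hypothetical, and whether the package excludes the dependent variable from scaling is not stated in this page, so this sketch simply scales every numeric column it finds.

```r
# Hypothetical helper: center and scale numeric columns with base::scale().
# Non-numeric columns are returned unchanged. Not the package's own code.
scale_numeric <- function(df, center = TRUE, scale = TRUE) {
  num <- vapply(df, is.numeric, logical(1))
  df[num] <- lapply(df[num], function(col) {
    # base::scale() returns a one-column matrix; flatten it back to a vector.
    as.numeric(base::scale(col, center = center, scale = scale))
  })
  df
}

toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30), g = c("x", "y", "z"))
scaled <- scale_numeric(toy)
colMeans(scaled[c("a", "b")])   # means of the scaled columns
sapply(scaled[c("a", "b")], sd) # standard deviations of the scaled columns
```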