Pre-process your data before training a model. This is the prior step
on the `h2o_automl()`

function's pipeline. Enabling for
other use cases when wanting too use any other framework, library,
or custom algorithm.

## Usage

```
model_preprocess(
df,
y = "tag",
ignore = NULL,
train_test = NA,
split = 0.7,
weight = NULL,
target = "auto",
balance = FALSE,
impute = FALSE,
no_outliers = TRUE,
unique_train = TRUE,
center = FALSE,
scale = FALSE,
thresh = 10,
seed = 0,
quiet = FALSE
)
```

## Arguments

- df
Dataframe. Dataframe containing all your data, including the dependent variable labeled as

`'tag'`

. If you want to define which variable should be used instead, use the`y`

parameter.- y
Character. Column name for dependent variable or response.

- ignore
Character vector. Force columns for the model to ignore

- train_test
Character. If needed,

`df`

's column name with 'test' and 'train' values to split- split
Numeric. Value between 0 and 1 to split as train/test datasets. Value is for training set. Set value to 1 to train with all available data and test with same data (cross-validation will still be used when training). If

`train_test`

is set, value will be overwritten with its real split rate.- weight
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.

- target
Value. Which is your target positive value? If set to

`'auto'`

, the target with largest`mean(score)`

will be selected. Change the value to overwrite. Only used when binary categorical model.- balance
Boolean. Auto-balance train dataset with under-sampling?

- impute
Boolean. Fill

`NA`

values with MICE?- no_outliers
Boolean/Numeric. Remove

`y`

's outliers from the dataset? Will remove those values that are farther than n standard deviations from the dependent variable's mean (Z-score). Set to`TRUE`

for default (3) or numeric to set a different multiplier.- unique_train
Boolean. Keep only unique row observations for training data?

- center, scale
Boolean. Using the base function scale, do you wish to center and/or scale all numerical values?

- thresh
Integer. Threshold for selecting binary or regression models: this number is the threshold of unique values we should have in

`'tag'`

(more than: regression; less than: classification)- seed
Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models is used because max_time is resource limited.

- quiet
Boolean. Quiet all messages, warnings, recommendations?

## Value

List. Contains original data.frame `df`

, an index
to identify which observations with be part of the train dataset
`train_index`

, and which model type should be `model_type`

.

## See also

Other Machine Learning:
`ROC()`

,
`conf_mat()`

,
`export_results()`

,
`gain_lift()`

,
`h2o_automl()`

,
`h2o_predict_MOJO()`

,
`h2o_selectmodel()`

,
`impute()`

,
`iter_seeds()`

,
`lasso_vars()`

,
`model_metrics()`

,
`msplit()`

## Examples

```
data(dft) # Titanic dataset
model_preprocess(dft, "Survived", balance = TRUE)
#> - DEPENDENT VARIABLE: Survived
#> - MODEL TYPE: Classification
#> # A tibble: 2 × 5
#> tag n p order pcum
#> <lgl> <int> <dbl> <int> <dbl>
#> 1 FALSE 549 61.6 1 61.6
#> 2 TRUE 342 38.4 2 100
#> - MISSINGS: The following variables contain missing observations: Age (19.87%). Consider using the impute parameter.
#> - CATEGORICALS: There are 5 non-numerical features. Consider using ohse() or equivalent prior to encode categorical variables.
#> >>> Splitting data: train = 0.7 && test = 0.3
#> train_size test_size
#> 623 268
#> - BALANCE: Training set balanced: 244 observations for each (2) category; using 78.33% of training data
model_preprocess(dft, "Fare", split = 0.5, scale = TRUE)
#> - DEPENDENT VARIABLE: Fare
#> - MODEL TYPE: Regression
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00 7.91 14.45 32.20 31.00 512.33
#> - MISSINGS: The following variables contain missing observations: Age (19.87%). Consider using the impute parameter.
#> - CATEGORICALS: There are 6 non-numerical features. Consider using ohse() or equivalent prior to encode categorical variables.
#> - TRANSFORMATIONS: All numerical features (4) were scaled
#> >>> Splitting data: train = 0.5 && test = 0.5
#> train_size test_size
#> 435 436
model_preprocess(dft, "Pclass", ignore = c("Fare", "Cabin"))
#> - DEPENDENT VARIABLE: Pclass
#> - MODEL TYPE: Classification
#> # A tibble: 3 × 5
#> tag n p order pcum
#> <fct> <int> <dbl> <int> <dbl>
#> 1 n_3 491 55.1 1 55.1
#> 2 n_1 216 24.2 2 79.4
#> 3 n_2 184 20.6 3 100
#> - MISSINGS: The following variables contain missing observations: Age (19.87%). Consider using the impute parameter.
#> - SKIPPED: Ignored variables for training models: 'Fare', 'Cabin'
#> - CATEGORICALS: There are 4 non-numerical features. Consider using ohse() or equivalent prior to encode categorical variables.
#> >>> Splitting data: train = 0.7 && test = 0.3
#> train_size test_size
#> 623 268
model_preprocess(dft, "Pclass", quiet = TRUE)
```