
Pre-process your data before training a model. This is the step that precedes model training in the h2o_automl() pipeline, exposed as a standalone function so you can reuse the same pre-processing with any other framework, library, or custom algorithm.

Usage

model_preprocess(
  df,
  y = "tag",
  ignore = NULL,
  train_test = NA,
  split = 0.7,
  weight = NULL,
  target = "auto",
  balance = FALSE,
  impute = FALSE,
  no_outliers = TRUE,
  unique_train = TRUE,
  center = FALSE,
  scale = FALSE,
  thresh = 10,
  seed = 0,
  quiet = FALSE
)

Arguments

df

Dataframe. Dataframe containing all your data, including the dependent variable labeled as 'tag'. If you want to define which variable should be used instead, use the y parameter.

y

Character. Column name for dependent variable or response.

ignore

Character vector. Columns to force the model to ignore.

train_test

Character. If needed, name of the column in df containing 'train' and 'test' values used to split the data.

split

Numeric. Value between 0 and 1 used to split the data into train/test datasets; the value is the proportion assigned to the training set. Set to 1 to train with all available data and test on the same data (cross-validation will still be used when training). If train_test is set, this value will be overwritten with the actual split rate.

weight

Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.

target

Value. Which is your positive target value? If set to 'auto', the value with the largest mean(score) will be selected. Change the value to overwrite. Only used in binary classification models.

balance

Boolean. Auto-balance train dataset with under-sampling?

impute

Boolean. Fill NA values with MICE?

no_outliers

Boolean/Numeric. Remove y's outliers from the dataset? Outliers are values farther than n standard deviations from the dependent variable's mean (Z-score). Set to TRUE for the default multiplier (3), or pass a numeric value to use a different multiplier; see the sketch after this argument list.

unique_train

Boolean. Keep only unique row observations for training data?

center, scale

Boolean. Using the base function scale, do you wish to center and/or scale all numerical values?

thresh

Integer. Threshold for selecting between classification and regression models: the number of unique values in 'tag' is compared against this threshold (more: regression; fewer: classification).

seed

Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models is used because max_time is resource limited.

quiet

Boolean. Quiet all messages, warnings, recommendations?
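
The Z-score rule described for no_outliers can be illustrated with a minimal sketch. This is an approximation of the behavior documented above, not the function's internal code; n = 3 is the documented default multiplier, and the dft dataset is the one used in the Examples below.

# Flag values farther than n standard deviations from the mean (Z-score rule)
z_keep <- function(y, n = 3) abs(y - mean(y, na.rm = TRUE)) <= n * sd(y, na.rm = TRUE)
data(dft)
sum(!z_keep(dft$Fare), na.rm = TRUE) # rows the default rule would drop when y = "Fare"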

Value

List. Contains the original data.frame df, an index train_index identifying which observations will be part of the train dataset, and the inferred model type model_type.
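
To reuse the pre-processed output with another framework, the returned elements can be combined into explicit train/test sets. A minimal sketch, assuming the list elements are accessible as df, train_index (an integer row index), and model_type as described above; run names() on the result to confirm the exact element names in your installed version.

prep <- model_preprocess(dft, "Survived", quiet = TRUE)
names(prep)                           # confirm element names first
train <- prep$df[prep$train_index, ]  # assumed element name and integer index
test <- prep$df[-prep$train_index, ]
prep$model_type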

Examples

data(dft) # Titanic dataset

model_preprocess(dft, "Survived", balance = TRUE)
#> - DEPENDENT VARIABLE: Survived
#> - MODEL TYPE: Classification
#> # A tibble: 2 × 5
#>   tag       n     p order  pcum
#>   <lgl> <int> <dbl> <int> <dbl>
#> 1 FALSE   549  61.6     1  61.6
#> 2 TRUE    342  38.4     2 100  
#> - MISSINGS: The following variables contain missing observations: Age (19.87%). Consider using the impute parameter.
#> - CATEGORICALS: There are 5 non-numerical features. Consider using ohse() or equivalent prior to encode categorical variables.
#> >>> Splitting data: train = 0.7 &&  test = 0.3
#> train_size  test_size 
#>        623        268 
#> - BALANCE: Training set balanced: 244 observations for each (2) category; using 78.33% of training data

model_preprocess(dft, "Fare", split = 0.5, scale = TRUE)
#> - DEPENDENT VARIABLE: Fare
#> - MODEL TYPE: Regression
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    0.00    7.91   14.45   32.20   31.00  512.33 
#> - MISSINGS: The following variables contain missing observations: Age (19.87%). Consider using the impute parameter.
#> - CATEGORICALS: There are 6 non-numerical features. Consider using ohse() or equivalent prior to encode categorical variables.
#> - TRANSFORMATIONS: All numerical features (4) were scaled
#> >>> Splitting data: train = 0.5 &&  test = 0.5
#> train_size  test_size 
#>        435        436 

model_preprocess(dft, "Pclass", ignore = c("Fare", "Cabin"))
#> - DEPENDENT VARIABLE: Pclass
#> - MODEL TYPE: Classification
#> # A tibble: 3 × 5
#>   tag       n     p order  pcum
#>   <fct> <int> <dbl> <int> <dbl>
#> 1 n_3     491  55.1     1  55.1
#> 2 n_1     216  24.2     2  79.4
#> 3 n_2     184  20.6     3 100  
#> - MISSINGS: The following variables contain missing observations: Age (19.87%). Consider using the impute parameter.
#> - SKIPPED: Ignored variables for training models: 'Fare', 'Cabin'
#> - CATEGORICALS: There are 4 non-numerical features. Consider using ohse() or equivalent prior to encode categorical variables.
#> >>> Splitting data: train = 0.7 &&  test = 0.3
#> train_size  test_size 
#>        623        268 

model_preprocess(dft, "Pclass", quiet = TRUE)
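
A further illustrative call (output omitted; not part of the package's rendered examples) combining several of the arguments above; impute = TRUE fills NAs with MICE as documented, so it assumes the mice package is installed.

model_preprocess(dft, "Fare", impute = TRUE, no_outliers = 2.5, scale = TRUE, quiet = TRUE)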