This function lets the user create a robust and fast model, using
H2O's AutoML function. The result is a list with the best model,
its parameters, datasets, performance metrics, variable
importance, and plots. Read more about the h2o_automl()
pipeline here.
Usage
h2o_automl(
df,
y = "tag",
ignore = NULL,
train_test = NA,
split = 0.7,
weight = NULL,
target = "auto",
balance = FALSE,
impute = FALSE,
no_outliers = TRUE,
unique_train = TRUE,
center = FALSE,
scale = FALSE,
thresh = 10,
seed = 0,
nfolds = 5,
max_models = 3,
max_time = 10 * 60,
start_clean = FALSE,
exclude_algos = c("StackedEnsemble", "DeepLearning"),
include_algos = NULL,
plots = TRUE,
alarm = TRUE,
quiet = FALSE,
print = TRUE,
save = FALSE,
subdir = NA,
project = "AutoML Results",
verbosity = NULL,
...
)
# S3 method for class 'h2o_automl'
plot(x, ...)
# S3 method for class 'h2o_automl'
print(x, importance = TRUE, ...)
Arguments
- df
Dataframe. Dataframe containing all your data, including the dependent variable labeled as 'tag'. If you want to define which variable should be used instead, use the y parameter.
- y
Variable or Character. Name of the dependent variable or response.
- ignore
Character vector. Columns that the model must ignore.
- train_test
Character. If needed, the name of the column in df that holds the 'test' and 'train' values to split by.
- split
Numeric. Value between 0 and 1 used to split the data into train/test datasets; the value is the share assigned to the training set. Set it to 1 to train with all available data and test with the same data (cross-validation will still be used when training). If train_test is set, the value will be overwritten with the real split rate found in that column.
- weight
Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed.
- target
Value. Which is your target positive value? If set to 'auto', the target with the largest mean(score) will be selected. Change the value to overwrite. Only used for binary categorical models.
- balance
Boolean. Auto-balance train dataset with under-sampling?
- impute
Boolean. Fill NA values with MICE?
- no_outliers
Boolean/Numeric. Remove y's outliers from the dataset? Will remove those values that are farther than n standard deviations from the dependent variable's mean (Z-score). Set to TRUE for the default (3) or to a numeric value to set a different multiplier.
- unique_train
Boolean. Keep only unique row observations for training data?
- center, scale
Boolean. Using the base function scale, do you wish to center and/or scale all numerical values?
- thresh
Integer. Threshold for selecting classification or regression models: the number of unique values in 'tag' (more than: regression; less than: classification).
- seed
Integer. Set a seed for reproducibility. AutoML can only guarantee reproducibility if max_models is used because max_time is resource limited.
- nfolds
Number of folds for k-fold cross-validation. Must be >= 2; defaults to 5. Use 0 to disable cross-validation; this will also disable Stacked Ensemble (thus decreasing the overall model performance).
- max_models, max_time
Numeric. Maximum number of models and maximum number of seconds you wish the function to iterate. Note that max_models guarantees reproducibility and max_time does not (because it depends entirely on your machine's computational characteristics).
- start_clean
Boolean. Erase everything in the current h2o instance before we start to train models? You may want to keep other models or not. To group results into a custom common AutoML project, you may use the project_name argument.
- exclude_algos, include_algos
Vector of character strings. Algorithms to skip or include during the model-building phase. Set to NULL to ignore. When both are defined, only include_algos will be used.
- plots
Boolean. Create plot objects?
- alarm
Boolean. Ping (sound) when done. Requires beepr.
- quiet
Boolean. Quiet all messages, warnings, recommendations?
- print
Boolean. Print summary when process ends?
- save
Boolean. Do you wish to save/export results into your working directory?
- subdir
Character. In which directory do you wish to save the results? Working directory as default.
- project
Character. Your project's name
- verbosity
Optional. Verbosity of the backend messages printed during training. Must be one of NULL (live log disabled), "debug", "info", "warn", "error". As shown in Usage, this function defaults to NULL.
- ...
Additional parameters passed to h2o::h2o.automl.
- x
h2o_automl object
- importance
Boolean. Print important variables?
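Two of the preprocessing rules described above, the Z-score filter behind no_outliers and the unique-value rule behind thresh, can be sketched in base R. This is an illustration only and not the package's internal code: drop_outliers() and guess_model_type() are hypothetical helper names, keeping rows with a missing response is an assumption, and so is treating a unique-value count equal to thresh as classification (the description above does not cover that case).

```r
# Hypothetical helpers for illustration; not part of the package.

# Keep rows whose response lies within n standard deviations of its mean.
# Assumption: rows with a missing response are kept rather than dropped.
drop_outliers <- function(df, y, n = 3) {
  z <- (df[[y]] - mean(df[[y]], na.rm = TRUE)) / sd(df[[y]], na.rm = TRUE)
  df[is.na(z) | abs(z) <= n, , drop = FALSE]
}

# Guess the model type from the number of unique response values.
# Assumption: a count equal to thresh is treated as classification.
guess_model_type <- function(values, thresh = 10) {
  if (length(unique(values)) > thresh) "regression" else "classification"
}

set.seed(0)
toy <- data.frame(tag = c(rnorm(50), 100), x = 1:51) # one extreme value
clean <- drop_outliers(toy, "tag")                   # drops the row with tag = 100

nrow(clean)
guess_model_type(clean$tag)          # many unique values: "regression"
guess_model_type(c("a", "b", "a"))   # few unique values: "classification"
```

Passing a numeric value as n mirrors setting no_outliers to a number instead of TRUE.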
Value
List. Trained model, predicted scores and datasets used, performance
metrics, parameters, importance data.frame, seed, and plots when plots=TRUE.
List of algorithms
- DRF
Distributed Random Forest, including Random Forest (RF) and Extremely-Randomized Trees (XRT)
- GLM
Generalized Linear Model
- XGBoost
eXtreme Gradient Boosting
- GBM
Gradient Boosting Machine
- DeepLearning
Fully-connected multi-layer artificial neural network
- StackedEnsemble
Stacked Ensemble
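The rule that include_algos wins over exclude_algos when both are defined can be sketched as a small base R helper. resolve_algos() is a hypothetical name used for illustration only; the actual selection is handled by h2o, and this sketch assumes the six algorithm names listed above are the only candidates.

```r
# Hypothetical helper for illustration; the real selection happens inside h2o.
resolve_algos <- function(include_algos = NULL, exclude_algos = NULL) {
  all_algos <- c("DRF", "GLM", "XGBoost", "GBM", "DeepLearning", "StackedEnsemble")
  # When include_algos is given, exclude_algos is not considered at all.
  if (!is.null(include_algos)) {
    return(intersect(all_algos, include_algos))
  }
  # Otherwise drop the excluded algorithms (NULL excludes nothing).
  setdiff(all_algos, exclude_algos)
}

# Excluding the two algorithms that Usage excludes by default:
resolve_algos(exclude_algos = c("StackedEnsemble", "DeepLearning"))

# Both defined: only include_algos is used.
resolve_algos(include_algos = "GBM", exclude_algos = "GBM")
```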
Methods
- print
Use print method to print models stats and summary
- plot
Use plot method to plot results using mplot_full()
See also
Other Machine Learning: ROC(), conf_mat(), export_results(), gain_lift(), h2o_predict_MOJO(), h2o_selectmodel(), impute(), iter_seeds(), lasso_vars(), model_metrics(), model_preprocess(), msplit()
Examples
if (FALSE) { # \dontrun{
# CRAN
data(dft) # Titanic dataset
dft <- subset(dft, select = -c(Ticket, PassengerId, Cabin))
# Classification: Binomial - 2 Classes
r <- h2o_automl(dft, y = Survived, max_models = 1, impute = FALSE, target = "TRUE", alarm = FALSE)
# Let's see all the stuff we have inside:
lapply(r, names)
# Classification: Multi-Categorical - 3 Classes
r <- h2o_automl(dft, Pclass, ignore = c("Fare", "Cabin"), max_time = 30, plots = FALSE)
# Regression: Continuous Values
r <- h2o_automl(dft, y = "Fare", ignore = c("Pclass"), exclude_algos = NULL, quiet = TRUE)
print(r)
# WITH PRE-DEFINED TRAIN/TEST DATAFRAMES
splits <- msplit(dft, size = 0.8)
splits$train$split <- "train"
splits$test$split <- "test"
df <- rbind(splits$train, splits$test)
r <- h2o_automl(df, "Survived", max_models = 1, train_test = "split")
} # }
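The pre-defined train/test example above builds its split column with msplit(). As a rough alternative, the same kind of column can be built with base R alone. This is a minimal sketch: the toy data frame, the 0.7 rate, and the column name "split" are illustrative assumptions, and the final call is left commented out because it needs a working h2o installation.

```r
set.seed(0) # fixed seed so the same rows are sampled on every run

# Toy data for illustration only.
toy <- data.frame(tag = rep(c(TRUE, FALSE), 50), x = seq_len(100))

# Sample 70% of the row indices for training; the rest become test rows.
n_train <- round(0.7 * nrow(toy))
train_rows <- sample(nrow(toy), n_train)
toy$split <- ifelse(seq_len(nrow(toy)) %in% train_rows, "train", "test")

table(toy$split)

# The column name would then be passed through train_test, for example:
# r <- h2o_automl(toy, y = "tag", max_models = 1, train_test = "split")
```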