This function correlates a whole dataframe, applying One Hot Smart
Encoding (ohse) to transform non-numerical features.
Note that it automatically drops columns
with fewer than 3 non-missing values and warns the user.
Usage
corr(
df,
method = "pearson",
use = "pairwise.complete.obs",
pvalue = FALSE,
padjust = NULL,
half = FALSE,
dec = 6,
ignore = NULL,
dummy = TRUE,
redundant = NULL,
logs = FALSE,
limit = 10,
top = NA,
...
)
Arguments
- df
Dataframe. It may contain non-numerical columns: they will be filtered.
- method
Character. Any of: c("pearson", "kendall", "spearman").
- use
Character. Method for computing covariances in the presence of missing values. Check
stats::cor for options.
- pvalue
Boolean. If TRUE, returns a list with the correlations and the statistical significance (p-value) of each one.
- padjust
Character. NULL to skip or any of
p.adjust.methods to calculate adjusted p-values for multiple comparisons using p.adjust().
- half
Boolean. Return only half of the matrix? The redundant symmetrical correlations will be
NA.
- dec
Integer. Number of decimals to round correlations and p-values.
- ignore
Character vector. Which columns should be ignored?
- dummy
Boolean. Should One Hot (Smart) Encoding (
ohse()) be applied to categorical columns?
- redundant
Boolean. Should we keep redundant columns? That is, if a column has only two different values, should we keep both new columns? If set to
NULL, only binary variables will dump redundant columns.
- logs
Boolean. Calculate log(x)+1 for numerical columns?
- limit
Integer. Limit one hot encoding to the n most frequent values of each column. Set to
NA to ignore argument.
- top
Integer. Select top N most relevant variables? Filtered and sorted by mean of each variable's correlations.
- ...
Additional parameters passed to
ohse, corr, and/or cor.test.
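The pvalue, padjust, half and dec arguments can be approximated with base R alone. The sketch below is not the package's implementation: pairwise_cor() is a hypothetical helper, it assumes an all-numeric dataframe, and it rounds only the correlations. It relies on stats::cor(), stats::cor.test() and stats::p.adjust().

```r
# Hypothetical helper (an assumption for illustration, not lares code)
pairwise_cor <- function(df, method = "pearson", padjust = NULL,
                         half = FALSE, dec = 6) {
  stopifnot(all(vapply(df, is.numeric, logical(1))))
  n <- ncol(df)
  # Correlations using every pair of complete observations
  cors <- round(stats::cor(df, method = method, use = "pairwise.complete.obs"), dec)
  # One cor.test() per pair of columns; the diagonal stays at 0
  pvals <- matrix(0, n, n, dimnames = dimnames(cors))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (i < j) {
        p <- stats::cor.test(df[[i]], df[[j]], method = method)$p.value
        pvals[i, j] <- pvals[j, i] <- p
      }
    }
  }
  # Optional multiple-comparison adjustment over the unique pairs
  if (!is.null(padjust)) {
    up <- upper.tri(pvals)
    pvals[up] <- stats::p.adjust(pvals[up], method = padjust)
    low <- lower.tri(pvals)
    pvals[low] <- t(pvals)[low] # mirror the adjusted values
  }
  # Blank the diagonal and the redundant upper triangle
  if (half) {
    cors[upper.tri(cors, diag = TRUE)] <- NA
    pvals[upper.tri(pvals, diag = TRUE)] <- NA
  }
  list(cor = cors, pvalue = pvals)
}

res <- pairwise_cor(mtcars[, c("mpg", "wt", "hp")], padjust = "holm", half = TRUE)
res$cor
```

Adjusting only the unique pairs avoids counting each comparison twice; whether corr() does the same is not stated in this page.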
Value
data.frame. Square dimensions (N x N), holding the
correlation between every pair of columns/variables in df. Notice
that when using ohse() you may get more dimensions.
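To see why encoding adds dimensions, here is a minimal sketch that uses stats::model.matrix() as a stand-in for ohse() (an assumption: ohse() has its own naming and redundancy rules). The toy data are made up for illustration.

```r
# Made-up toy data: one numeric column and one categorical column
toy <- data.frame(
  age = c(22, 38, 26, 35, 54, 2),
  pclass = factor(c("3", "1", "3", "1", "2", "3"))
)

# One indicator column per level (no intercept, no reference level dropped)
dummies <- stats::model.matrix(~ pclass - 1, data = toy)
encoded <- cbind(age = toy$age, dummies)

# Two input columns become a 4 x 4 correlation matrix
m <- stats::cor(encoded)
dim(m)
```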
See also
Other Calculus:
dist2d(),
model_metrics(),
quants()
Other Correlations:
corr_cross(),
corr_var()
Examples
data(dft) # Titanic dataset
df <- dft[, 2:5]
# Correlation matrix (without redundancy)
corr(df, half = TRUE)
#>                     Age Survived_TRUE  Sex_male  Pclass_1 Pclass_2 Pclass_3
#> Age                  NA            NA        NA        NA       NA       NA
#> Survived_TRUE -0.077221            NA        NA        NA       NA       NA
#> Sex_male       0.093254     -0.543351        NA        NA       NA       NA
#> Pclass_1       0.348941      0.285904 -0.098013        NA       NA       NA
#> Pclass_2       0.006954      0.093349 -0.064746 -0.288585       NA       NA
#> Pclass_3      -0.312271     -0.322308  0.137143 -0.626738 -0.56521       NA
# Ignore specific column
corr(df, ignore = "Pclass")
#>                     Age Survived_TRUE  Sex_male
#> Age            1.000000     -0.077221  0.093254
#> Survived_TRUE -0.077221      1.000000 -0.543351
#> Sex_male       0.093254     -0.543351  1.000000
# Calculate p-values as well
corr(df, pvalue = TRUE, limit = 1)
#> $cor
#>                     Age Survived_TRUE  Sex_male  Pclass_3 Pclass_OTHER
#> Age            1.000000     -0.077221  0.093254 -0.312271     0.312271
#> Survived_TRUE -0.077221      1.000000 -0.543351 -0.322308     0.322308
#> Sex_male       0.093254     -0.543351  1.000000  0.137143    -0.137143
#> Pclass_3      -0.312271     -0.322308  0.137143  1.000000    -1.000000
#> Pclass_OTHER   0.312271      0.322308 -0.137143 -1.000000     1.000000
#>
#> $pvalue
#>                        Age Survived_TRUE     Sex_male     Pclass_3 Pclass_OTHER
#> Age           0.000000e+00  3.912465e-02 1.267130e-02 1.295594e-17 1.295594e-17
#> Survived_TRUE 3.912465e-02  0.000000e+00 1.406066e-69 5.510281e-23 5.510281e-23
#> Sex_male      1.267130e-02  1.406066e-69 0.000000e+00 4.002500e-05 4.002500e-05
#> Pclass_3      1.295594e-17  5.510281e-23 4.002500e-05 0.000000e+00 0.000000e+00
#> Pclass_OTHER  1.295594e-17  5.510281e-23 4.002500e-05 0.000000e+00 0.000000e+00
#>
# Test when no more than 2 non-missing values
df$trash <- c(1, rep(NA, nrow(df) - 1))
# and another method...
corr(df, method = "spearman")
#> Warning: Dropped columns with less than 3 non-missing values: 'trash'
#>                     Age Survived_TRUE  Sex_male  Pclass_1  Pclass_2  Pclass_3
#> Age            1.000000     -0.052565  0.083330  0.333881  0.031291 -0.319907
#> Survived_TRUE -0.052565      1.000000 -0.543351  0.285904  0.093349 -0.322308
#> Sex_male       0.083330     -0.543351  1.000000 -0.098013 -0.064746  0.137143
#> Pclass_1       0.333881      0.285904 -0.098013  1.000000 -0.288585 -0.626738
#> Pclass_2       0.031291      0.093349 -0.064746 -0.288585  1.000000 -0.565210
#> Pclass_3      -0.319907     -0.322308  0.137143 -0.626738 -0.565210  1.000000
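The column dropping seen in the last example can be approximated in base R. This is only a sketch of the idea: drop_sparse_cols() and its min_obs parameter are hypothetical names, and the toy data are made up.

```r
# Hypothetical helper (an assumption for illustration, not lares code)
drop_sparse_cols <- function(df, min_obs = 3) {
  # Count non-missing values per column
  n_obs <- colSums(!is.na(df))
  sparse <- names(n_obs)[n_obs < min_obs]
  if (length(sparse) > 0) {
    warning(
      "Dropped columns with less than ", min_obs, " non-missing values: ",
      paste0("'", sparse, "'", collapse = ", "),
      call. = FALSE
    )
  }
  # Keep only columns with enough observations to correlate
  df[, n_obs >= min_obs, drop = FALSE]
}

toy <- data.frame(a = c(1, 2, 3, 4), b = c(4, 3, 2, 1), trash = c(1, NA, NA, NA))
clean <- drop_sparse_cols(toy) # warns about 'trash'
names(clean)
```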
