Feature selection with random forests
So far, we've looked at several feature selection techniques, such as regularization, stepwise selection, and recursive feature elimination. I now want to introduce an effective feature selection method for classification problems that uses random forests through the Boruta package. The following paper explains how the method identifies all of the relevant features: Kursa M., Rudnicki W. (2010), Feature Selection with the Boruta Package, Journal of Statistical Software, 36(11), 1 - 13.
What I'll do here is provide an overview of the algorithm and then apply it to the simulated dataset. I've found it to be highly effective at eliminating unimportant features, but be advised it can be computationally intensive. However, it's usually time well spent.
At a high level, the algorithm creates shadow attributes by copying all of the input features and shuffling the values within each copy, which breaks any relationship with the response. Then, a random forest model is built on the original and shadow features together, and a Z-score of the mean accuracy loss is computed for each feature, including the shadow ones. Features with Z-scores significantly higher than that of the best shadow attribute are deemed important, and those with significantly lower Z-scores are deemed unimportant. The shadow attributes are then removed, features that have already been decided are no longer tested, and the process repeats until every feature has been assigned a decision or the maximum number of random forest runs, which you can specify, is reached. When the algorithm completes, each of the original features is labeled as confirmed, tentative, or rejected. You must decide whether to include the tentative features in further modeling. Depending on your situation, you have some options:
- Change the random seed and rerun the methodology multiple (k) times and select only those features that are confirmed in all of the k runs
- Divide your data (training data) into k folds, run separate iterations on each fold, and select those features which are confirmed for all of the k folds
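To make the shadow-attribute idea described above more concrete, here is a minimal base R sketch of just the shuffling step. The helper make_shadows() and the toy data are my own illustration and aren't part of the Boruta package, which handles this internally:

```r
# Illustrative sketch only: make_shadows() is a hypothetical helper,
# not a function from the Boruta package.
make_shadows <- function(x) {
  # Permute each column independently so that any relationship between
  # the column and the response (or other columns) is broken.
  shadows <- as.data.frame(
    lapply(x, function(col) col[sample.int(length(col))])
  )
  names(shadows) <- paste0("shadow_", names(x))
  # Original features followed by their shuffled copies
  cbind(x, shadows)
}

set.seed(123)
toy <- data.frame(a = 1:10, b = seq(0.1, 1, by = 0.1))
toy_shadow <- make_shadows(toy)
names(toy_shadow)
```

Each shadow column holds exactly the same values as its original, only in a different order, so any importance a random forest assigns to it is due to chance. That is what makes the shadows a useful yardstick for the real features.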
Note that all of this can be done with just a few lines of code. To get started, load the simulated data, sim_df, again. We'll create train and test sets as before:
> sim_df$y <- as.factor(sim_df$y)
> set.seed(1066)
> index <- caret::createDataPartition(sim_df$y, p = 0.7, list = FALSE)
> train <- sim_df[index, ]
> test <- sim_df[-index, ]
To run the algorithm, you just need to call the Boruta() function from the Boruta package and give it a formula. Keep in mind that the labels must be a factor or the algorithm won't work. If you want to track the progress of the algorithm, specify doTrace = 1, but I shall forgo that option in the following. Also, don't forget to set the random seed:
> set.seed(5150)
> rf_fs <- Boruta::Boruta(y ~ ., data = train)
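Since the result depends on the seed, the first option mentioned earlier, rerunning with several seeds and keeping only the features confirmed every time, could be sketched roughly as follows. The helper names keep_common() and confirmed_every_run() are hypothetical, and I'm assuming the data has a factor response named y, as ours does:

```r
# Hypothetical helper: features that appear in every set of a list
keep_common <- function(feature_sets) {
  Reduce(intersect, feature_sets)
}

# Hypothetical helper: run Boruta once per seed and keep only the features
# confirmed in all runs. Assumes the Boruta package is installed and that
# `data` contains a factor column named y.
confirmed_every_run <- function(data, seeds) {
  runs <- lapply(seeds, function(s) {
    set.seed(s)
    fit <- Boruta::Boruta(y ~ ., data = data)
    Boruta::getSelectedAttributes(fit, withTentative = FALSE)
  })
  keep_common(runs)
}

# Toy illustration of the intersection step, with made-up feature names
toy_runs <- list(c("x1", "x2", "x3"), c("x2", "x3"), c("x3", "x2", "x4"))
common <- keep_common(toy_runs)
common
```

On the real data, a call such as confirmed_every_run(train, seeds = c(1, 2, 3)) would take roughly three times as long as a single run, so budget your computing time accordingly.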
As mentioned, this can be computationally intensive. Here's how long it took on my old-fashioned laptop:
> rf_fs$timeTaken
Time difference of 22.15982 mins
I ran the same code on a high-powered workstation, and it finished in under three minutes.
A simple table will provide the count of the final importance decision. We see that the algorithm rejects five features and selects 11:
> table(rf_fs$finalDecision)
Tentative Confirmed  Rejected
        0        11         5
Using these results, it's simple to create a new dataframe with our selected features. We start out using the getSelectedAttributes() function to capture the feature names. In this example, let's only select those that are confirmed. If we wanted to include confirmed and tentative, we just specify withTentative = TRUE in the function:
> fnames <- Boruta::getSelectedAttributes(rf_fs) #withTentative = TRUE
> fnames
[1] "TwoFactor1" "TwoFactor2" "Linear2" "Linear3" "Linear4" "Linear5"
[7] "Linear6" "Nonlinear1" "Nonlinear2" "Nonlinear3" "random1"
Using the feature names, we create our subset of the data:
> boruta_train <- train[, colnames(train) %in% fnames]
> boruta_train$y <- train$y
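As a small variation, you could select the columns by name and bring the response along in one step. Here is a sketch on a made-up data frame; toy_train and toy_fnames are placeholders, not objects from our analysis:

```r
# Made-up data standing in for the training set
toy_train <- data.frame(
  Linear1 = 1:4,
  Linear2 = 5:8,
  random1 = 9:12,
  y = factor(c(0, 1, 0, 1))
)
# Made-up stand-in for the names returned by getSelectedAttributes()
toy_fnames <- c("Linear2", "random1")

# Indexing by name raises an error if a selected name is missing, and
# drop = FALSE keeps a data frame even when only one column is requested.
toy_subset <- toy_train[, c(toy_fnames, "y"), drop = FALSE]
names(toy_subset)
```

The design choice here is mainly defensive: indexing by name fails loudly on a misspelled feature instead of silently skipping it.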
We'll go ahead now and build a random forest model with the selected features and see how it performs:
> boruta_fit <- randomForest::randomForest(y ~ ., data = boruta_train)
> boruta_pred <- predict(boruta_fit, type = "prob", newdata = test)
> boruta_pred <- boruta_pred[, 2]
> ytest <- as.numeric(ifelse(test$y == "1", 1, 0))
> MLmetrics::AUC(boruta_pred, ytest)
[1] 0.9604841
> MLmetrics::LogLoss(boruta_pred, ytest)
[1] 0.2704204
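If you'd like to see what these two metrics measure, here is a minimal base R sketch that computes them by hand on a toy example. The functions auc_by_rank() and log_loss() are my own illustrations using the standard textbook formulas; I'd expect them to agree closely with the MLmetrics functions, but treat that as an assumption to verify rather than a guarantee:

```r
# AUC from the rank-sum (Mann-Whitney) formulation.
# prob: predicted probability of the positive class; truth: 0/1 labels.
auc_by_rank <- function(prob, truth) {
  n_pos <- sum(truth == 1)
  n_neg <- sum(truth == 0)
  r <- rank(prob)  # average ranks are used for tied predictions
  (sum(r[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Binary log loss; eps clips probabilities away from 0 and 1 to avoid log(0).
log_loss <- function(prob, truth, eps = 1e-15) {
  p <- pmin(pmax(prob, eps), 1 - eps)
  -mean(truth * log(p) + (1 - truth) * log(1 - p))
}

# Toy predictions in which every positive outranks every negative
toy_prob <- c(0.9, 0.8, 0.35, 0.2)
toy_truth <- c(1, 1, 0, 0)
toy_auc <- auc_by_rank(toy_prob, toy_truth)
toy_ll <- log_loss(toy_prob, toy_truth)
```

AUC only cares about how the predictions are ordered, whereas log loss also penalizes probabilities that are confident and wrong, which is why it's worth looking at both.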
This is impressive performance compared with the results from Chapter 4, Advanced Feature Selection in Linear Models. I think this example serves as a good validation of the technique. Go get some computing horsepower and start using it!