Fortunately, XGBoost implements the scikit-learn API, so tuning its hyperparameters is very easy.
I assume that you have already preprocessed the dataset and split it into training and test sets, so I will focus only on the tuning part.
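If you still need that split, here is a minimal sketch using scikit-learn's train_test_split (make_classification generates synthetic data here as a stand-in for your own preprocessed dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data, standing in for your own dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out a test set; the grid search below should run on the training split only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```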
First, we have to import the XGBoost classifier and GridSearchCV from scikit-learn.
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
After that, we have to specify the constant parameters of the classifier. We need the objective. In this case, I use the "binary:logistic" function because I train a classifier that handles only two classes. Additionally, I specify the number of threads to speed up the training, and the seed for the random number generator, to get the same results in every run.
estimator = XGBClassifier(
    objective='binary:logistic',
    nthread=4,
    seed=42
)
In the next step, I have to specify the tunable parameters and the range of values.
parameters = {
    'max_depth': range(2, 10, 1),
    'n_estimators': range(60, 220, 40),
    'learning_rate': [0.1, 0.01, 0.05]
}
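As a sanity check, this grid expands to 8 × 4 × 3 = 96 candidate combinations (8 depths, 4 estimator counts, 3 learning rates), which is exactly the candidate count reported in the training log. You can count the combinations yourself:

```python
from itertools import product

parameters = {
    'max_depth': range(2, 10, 1),        # 8 values
    'n_estimators': range(60, 220, 40),  # 4 values: 60, 100, 140, 180
    'learning_rate': [0.1, 0.01, 0.05]   # 3 values
}

# The grid search tries the Cartesian product of all parameter values
n_candidates = len(list(product(*parameters.values())))
print(n_candidates)  # 96
```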
In the last setup step, I configure the GridSearchCV object. I choose the best hyperparameters using the ROC AUC metric to compare the results of 10-fold cross-validation.
grid_search = GridSearchCV(
    estimator=estimator,
    param_grid=parameters,
    scoring='roc_auc',
    n_jobs=10,
    cv=10,
    verbose=True
)
Now, we can run the search on the training data (fitting on the full dataset would leak the test set into hyperparameter selection).
grid_search.fit(X_train, y_train)
Here are the results:
Fitting 10 folds for each of 96 candidates, totalling 960 fits
[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done 30 tasks | elapsed: 11.0s
[Parallel(n_jobs=10)]: Done 180 tasks | elapsed: 40.1s
[Parallel(n_jobs=10)]: Done 430 tasks | elapsed: 1.7min
[Parallel(n_jobs=10)]: Done 780 tasks | elapsed: 3.1min
[Parallel(n_jobs=10)]: Done 960 out of 960 | elapsed: 4.0min finished
The best_estimator_ attribute contains the best model found by the grid search, refitted on the full training data:
grid_search.best_estimator_
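Beyond best_estimator_, the fitted object also exposes best_params_ (the winning configuration) and best_score_ (its mean cross-validated ROC AUC). A self-contained sketch, using DecisionTreeClassifier and synthetic data as stand-ins for XGBClassifier and your dataset (both classifiers follow the same scikit-learn API):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data and model standing in for your dataset and XGBClassifier
X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={'max_depth': [2, 3, 4]},
    scoring='roc_auc',
    cv=5,
)
grid_search.fit(X_train, y_train)

# The winning configuration and its mean cross-validated ROC AUC
print(grid_search.best_params_)
print(grid_search.best_score_)

# best_estimator_ is a fitted model, usable like any scikit-learn estimator
y_pred = grid_search.best_estimator_.predict(X_test)
```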