Fortunately, XGBoost implements the scikit-learn estimator API, so we can tune its hyperparameters with scikit-learn's own tools.
I assume that you have already preprocessed the dataset and split it into training and test sets, so I will focus only on the tuning part.
First, we import the XGBClassifier from XGBoost and GridSearchCV from scikit-learn.
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
After that, we have to specify the constant parameters of the classifier. First, the objective: I use the "binary:logistic" function because I am training a classifier which handles only two classes. Additionally, I specify the number of threads to speed up the training, and the seed of the random number generator, to get the same results in every run.
estimator = XGBClassifier(
    objective='binary:logistic',
    nthread=4,
    seed=42
)
In the next step, I have to specify the tunable parameters and the range of values to try for each of them.
parameters = {
    'max_depth': range(2, 10, 1),
    'n_estimators': range(60, 220, 40),
    'learning_rate': [0.1, 0.01, 0.05]
}
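Before launching an expensive search, it is worth checking how many configurations the grid defines; multiplying the sizes of the value lists above gives the candidate count that will appear in the training log:

```python
# Number of values for each tunable parameter defined above
n_max_depth = len(range(2, 10, 1))        # 8 values: 2, 3, ..., 9
n_estimators = len(range(60, 220, 40))    # 4 values: 60, 100, 140, 180
n_learning_rate = len([0.1, 0.01, 0.05])  # 3 values

# Total candidate configurations the grid search will evaluate
n_candidates = n_max_depth * n_estimators * n_learning_rate
print(n_candidates)  # 96
```

With 10-fold cross-validation, each candidate is fitted 10 times, for 960 fits in total.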
In the last setup step, I configure the GridSearchCV object. I choose the best hyperparameters using the ROC AUC metric to compare the results of 10-fold cross-validation.
grid_search = GridSearchCV(
    estimator=estimator,
    param_grid=parameters,
    scoring='roc_auc',
    n_jobs=10,
    cv=10,
    verbose=True
)
Now we can run the search on the training data.
grid_search.fit(X, Y)
Here are the results:
Fitting 10 folds for each of 96 candidates, totalling 960 fits
[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done 30 tasks | elapsed: 11.0s
[Parallel(n_jobs=10)]: Done 180 tasks | elapsed: 40.1s
[Parallel(n_jobs=10)]: Done 430 tasks | elapsed: 1.7min
[Parallel(n_jobs=10)]: Done 780 tasks | elapsed: 3.1min
[Parallel(n_jobs=10)]: Done 960 out of 960 | elapsed: 4.0min finished
The best_estimator_ field contains the best model found by the grid search.
grid_search.best_estimator_