Hotel Booking Cancellation

Mudesir Suleyman
Jul 31, 2020

Abstract

Booking cancellations have an impact on demand management decisions in the hotel industry. Cancellations limit the production of accurate forecasts, a critical tool for revenue management performance. To circumvent the problems caused by booking cancellations, hotels implement rigid cancellation policies and overbooking strategies, which can themselves have a negative influence on revenue and reputation.

Objective
This study aims to assess the importance of each feature in predicting hotel reservation cancellations.

Overview

In tourism and travel-related industries, most of the research on Revenue Management demand forecasting and prediction problems employs data from the aviation industry, in the format known as the Passenger Name Record (PNR), which the aviation industry developed. However, the remaining tourism and travel industries, such as hospitality, cruising, and theme parks, have different requirements and particularities that cannot be fully explored without industry-specific data. Hence, two hotel datasets with demand data are shared to help overcome this limitation.

The datasets now made available were collected aiming at the development of prediction models to classify a hotel booking’s likelihood to be canceled. Nevertheless, due to the characteristics of the variables included in these datasets, their use goes beyond this cancellation prediction problem.

Not all variables in these datasets come from the booking database tables. Some come from other tables, and some are engineered from different variables from different tables. A detailed description of each variable is offered in the following section.

Dataset

The data consists of two datasets with hotel demand data. One of the hotels is a resort hotel and the other is a city hotel. Both datasets share the same structure, with 31 variables describing the 40,060 observations of the resort hotel and the 79,330 observations of the city hotel. Each observation represents a hotel booking.

Targets

As mentioned above, the main objective of this paper is to assess how important each feature is for predicting the cancellation of a hotel booking. Of the existing features, ‘is_canceled’ is the best candidate for the target variable. Its distribution is roughly 63% not canceled and 37% canceled.
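The class split quoted above also fixes the score that any model has to beat. A minimal sketch, assuming a toy list of labels built to mirror the 63% / 37% split rather than the real 119,390 bookings, shows how the majority-class baseline is computed:

```python
from collections import Counter

# Hypothetical toy labels that mirror the reported 63% / 37% split
# (0 = not canceled, 1 = canceled). This is not the real dataset.
is_canceled = [0] * 63 + [1] * 37

counts = Counter(is_canceled)
majority_class, majority_count = counts.most_common(1)[0]

# Always predicting the majority class is right this often.
baseline_accuracy = majority_count / len(is_canceled)

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```

On the real data, the same calculation would be applied to the ‘is_canceled’ column.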

Methodology and model fit

There are two classes, so this is a binary classification problem. The majority class occurs with 63% frequency, so the data is not too imbalanced and I can use accuracy as my evaluation metric. I fit a random forest model and tuned its hyperparameters to improve performance. After hyperparameter tuning, the best validation accuracy is 85%.
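The fitting and tuning step can be sketched roughly as follows. This is an illustration, not the exact code behind the 85% figure: it assumes scikit-learn, uses synthetic data from `make_classification` in place of the booking data, and the search space is hypothetical because the tried values are not listed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the booking data (assumption): 1,000 rows, 8 features,
# with roughly the same 63% / 37% class balance as the real target.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.63, 0.37], random_state=42,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hypothetical search space, chosen only for illustration.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5, 10],
}

# Randomized search with 3-fold cross-validation, scored by accuracy.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_train, y_train)

# Score the best model on the held-out validation split.
val_accuracy = search.best_estimator_.score(X_val, y_val)
print("best params:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"validation accuracy: {val_accuracy:.3f}")
```

The numbers this prints describe the synthetic data only and say nothing about the hotel bookings.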

Feature importances

The main objective of this paper is to find out which features are most related to booking cancellation and which of them matter most. The model's default feature importance is very fast to compute, but it does not consider all the necessary feature combinations. According to it, the top five features are lead time, deposit type, arrival date day of the month, arrival date week number, and total number of special requests.

Feature importance based on impurity reduction is fast, whereas drop-column importance is the most reliable but very costly to compute. Permutation importance is a good compromise between the two. Permutation importance, computed here with the eli5 library, randomly permutes the values of each feature and measures how much the model's score drops; the features that are strongly related to our target variable are highlighted in green. We can see that the top five features are deposit type, total number of special requests, market segment, lead time, and previous cancellations.
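To make the permutation idea concrete, here is a minimal pure-Python sketch of the technique itself, not of eli5's implementation. The data, the feature names, and the rule-based `predict` function (a stand-in for a fitted classifier) are all hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(labels)

def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy when one column at a time is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild every row with only this column's value replaced.
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: column 0 plays the role of lead time and fully
# determines the label; column 1 is noise unrelated to the label.
rows = [[lead_time, lead_time % 7] for lead_time in range(200)]
labels = [1 if lead_time > 100 else 0 for lead_time, _ in rows]

def predict(row):
    """Stand-in for a fitted classifier (assumption): a simple threshold rule."""
    return 1 if row[0] > 100 else 0

importances = permutation_importance(predict, rows, labels)
for name, score in zip(["lead_time", "noise"], importances):
    print(f"{name}: {score:.3f}")
```

Shuffling the informative column destroys its link to the label, so accuracy falls, while shuffling the noise column leaves the predictions unchanged and its importance stays at zero.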

Results and Interpretation

After tuning and refitting the model, I get a best validation accuracy of 85%, against a baseline of about 63%. The partial dependence plot (PDP) in the figure below shows that when the lead time is short and the number of special requests is high, the probability of cancellation is lower, and vice versa.
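A partial dependence curve is computed by forcing one feature to a fixed value in every row, averaging the model's predictions, and repeating over a grid of values. The sketch below shows that calculation in plain Python. The `predict_proba` function is a hypothetical stand-in for the fitted model, with coefficients chosen by hand to mimic the direction described above; it is not derived from the booking data.

```python
import math

def predict_proba(row):
    """Hypothetical stand-in for a model's cancellation probability."""
    lead_time, special_requests = row
    score = 0.01 * lead_time - 0.8 * special_requests - 0.5
    return 1.0 / (1.0 + math.exp(-score))

def partial_dependence(predict, rows, col, grid):
    """Average prediction when column `col` is set to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[col] = value  # overwrite one feature, keep the rest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

# Hypothetical toy bookings: [lead_time in days, number of special requests].
rows = [[10, 0], [45, 1], [120, 0], [200, 2], [300, 1]]

lead_time_curve = partial_dependence(predict_proba, rows, 0, [0, 100, 200, 300])
requests_curve = partial_dependence(predict_proba, rows, 1, [0, 1, 2, 3])

print("lead time:", [round(p, 3) for p in lead_time_curve])
print("special requests:", [round(p, 3) for p in requests_curve])
```

With these assumed coefficients the lead-time curve rises and the special-requests curve falls, which is the pattern a PDP would show for a model that behaves as described.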

In addition to the PDP, we use SHAP to explain individual predictions. In the two figures below, the features that increase the probability of cancellation are shown in red, and most of them also appear among the top features selected above. These values describe a single observation.
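SHAP values are based on Shapley values, which split the gap between one prediction and a reference prediction across the features. The sketch below computes exact Shapley values by enumerating feature subsets for a tiny hypothetical model. It illustrates the underlying definition only: the SHAP library uses background data and faster algorithms, and the `predict_proba` function, the booking `x`, and the `baseline` row are all assumptions made for the example.

```python
import math
from itertools import combinations

def predict_proba(row):
    """Hypothetical stand-in for a model's cancellation probability."""
    lead_time, special_requests, previous_cancellations = row
    score = (0.01 * lead_time - 0.8 * special_requests
             + 1.0 * previous_cancellations - 0.5)
    return 1.0 / (1.0 + math.exp(-score))

def shapley_values(predict, x, baseline):
    """Exact Shapley values of x relative to a single baseline row."""
    n = len(x)

    def value(subset):
        # Features in `subset` come from x, all others from the baseline.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in combinations(others, size):
                without_i = set(subset)
                phi += weight * (value(without_i | {i}) - value(without_i))
        phis.append(phi)
    return phis

x = [200, 2, 1]        # hypothetical booking to explain
baseline = [50, 1, 0]  # hypothetical reference booking

phis = shapley_values(predict_proba, x, baseline)
names = ["lead_time", "special_requests", "previous_cancellations"]
for name, phi in zip(names, phis):
    print(f"{name}: {phi:+.3f}")
print(f"sum: {sum(phis):+.3f}")
print(f"f(x) - f(baseline): {predict_proba(x) - predict_proba(baseline):+.3f}")
```

Positive values correspond to the red features that push a prediction toward cancellation, negative values push away from it, and the values always sum to the difference between the two predictions.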

Conclusion

The features most related to the target variable are clear from the results above, and fitting the model with tuning and boosting tools shows how the validation accuracy improved. Hotels that implement rigid cancellation policies and overbooking strategies, which can also have a negative influence on revenue and reputation, can use the features identified above, which most influence booking cancellation, to guide those policies.
