Business Problem:
The restaurant business isn’t always as charming as it appears from the outside. With unexpected problems popping up all the time, running a thriving restaurant is no cakewalk. Restaurants live on their customers and on the impression those customers form of the service. Hence, restaurants need to estimate the daily customer count in order to serve customers better and to manage their staff and resources.
This forecasting problem is specific to Japanese restaurants, specifically those registered in the database of Recruit Holdings. Recruit Holdings has unique access to key datasets that could make automated future customer prediction possible. To solve this problem, we will use historical reservation and visit data to predict the visitor count on future dates.
Why use Machine Learning?
One could argue that visitor forecasting can be solved with simple statistics, or that a restaurant owner has a general idea of daily visitor trends and can predict accurately with practice. In both cases, experience and historical data are necessary to arrive at a number. Neither is available for a new restaurant with no history. And what about other factors, such as unexpected weather changes or the competition in the area? An ML model can take all these factors into account before producing a number, even for restaurants with little or no historical data. With ML, we can tackle this problem with more confidence than with either of the approaches above.
Data Source
As mentioned earlier, the problem is specific to the Japanese restaurants registered in the Recruit Holdings database.
Recruit Holdings owns
- Hot Pepper Gourmet (a restaurant review service)
- AirREGI (a restaurant point of sales service)
- Restaurant Board (reservation log management software)
To account for local weather conditions, we will use local weather information from another source.
For this problem, we will be predicting future visitors for the restaurants in the AirREGI database.
The given data spans from Jan 1st 2016 to May 30 2017. Of this, the data from Jan 1st 2016 to April 22nd 2017 (3rd week) is used for training, and the rest for testing (prediction).
Understanding the Data
Here is a simple diagram explaining the given data and their relation:
Most of the data features are self-explanatory. If you are in a hurry, skip to the next section; otherwise, read on for a detailed explanation.
From the Kaggle page:
This is a relational dataset from two systems. Each file is prefaced with the source (either air_ or hpg_) to indicate its origin. Each restaurant has a unique air_store_id and hpg_store_id. Note that not all restaurants are covered by both systems and that you have been provided data beyond the restaurants for which you must forecast. Latitudes and Longitudes are not exact to discourage de-identification of restaurants.
air_reserve.csv
This file contains reservations made in the air system. Note that the reserve_datetime indicates the time when the reservation was created, whereas the visit_datetime is the time in the future where the visit will occur.
air_store_id — the restaurant’s id in the air system
visit_datetime — the time of the reservation
reserve_datetime — the time the reservation was made
reserve_visitors — the number of visitors for that reservation
hpg_reserve.csv
This file contains reservations made in the hpg system.
hpg_store_id — the restaurant’s id in the hpg system
visit_datetime — the time of the reservation
reserve_datetime — the time the reservation was made
reserve_visitors — the number of visitors for that reservation
air_store_info.csv
This file contains information about select air restaurants. Column names and contents are self-explanatory.
air_store_id — Unique ID for AIR registered restaurants
air_genre_name — Type of restaurant
air_area_name — area name of the restaurant
latitude
longitude
hpg_store_info.csv
This file contains information about select hpg restaurants. Column names and contents are self-explanatory.
hpg_store_id — Unique ID for the restaurants in HPG database
hpg_genre_name — Type of restaurant
hpg_area_name — area name of the restaurant
latitude
longitude
Note: latitude and longitude are the latitude and longitude of the area to which the store belongs
store_id_relation.csv
This file allows you to join select restaurants that have both the air and hpg system.
hpg_store_id — Unique ID for the restaurants in HPG database
air_store_id — Unique ID for AIR registered restaurants
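As a minimal sketch of how this relation file can be used, the snippet below maps HPG reservations onto AIR store ids with a pandas merge. The small in-memory frames stand in for the real CSV files, and the store ids and values in them are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for store_id_relation.csv and hpg_reserve.csv (ids are made up).
relation = pd.DataFrame({
    "air_store_id": ["air_001", "air_002"],
    "hpg_store_id": ["hpg_101", "hpg_102"],
})
hpg_reserve = pd.DataFrame({
    "hpg_store_id": ["hpg_101", "hpg_101", "hpg_999"],
    "visit_datetime": pd.to_datetime(
        ["2017-01-05 19:00", "2017-01-05 20:00", "2017-01-06 18:00"]
    ),
    "reserve_visitors": [2, 4, 3],
})

# An inner join keeps only HPG reservations for restaurants that also exist in AIR.
hpg_in_air = hpg_reserve.merge(relation, on="hpg_store_id", how="inner")

# Total reserved visitors per AIR store and visit date.
hpg_in_air["visit_date"] = hpg_in_air["visit_datetime"].dt.date
daily = hpg_in_air.groupby(
    ["air_store_id", "visit_date"], as_index=False
)["reserve_visitors"].sum()
print(daily)
```

The reservation for hpg_999 is dropped because that store has no AIR counterpart, which mirrors the note that not all restaurants are covered by both systems.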
air_visit_data.csv
This file contains historical visit data for air restaurants.
air_store_id — Unique ID for AIR stores
visit_date — the date
visitors — the number of visitors to the restaurant on the date
sample_submission.csv
This file shows a submission in the correct format, including the days for which you must forecast.
id — the id is formed by concatenating the air_store_id and visit_date with an underscore
visitors- the number of visitors forecasted for the store and date combination
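Because the air_store_id itself contains an underscore, splitting a submission id back into its parts has to be done on the last underscore. A small sketch, using a made-up store id and hypothetical helper names:

```python
from typing import Tuple


def make_submission_id(air_store_id: str, visit_date: str) -> str:
    """Join store id and date with an underscore, as in sample_submission.csv."""
    return f"{air_store_id}_{visit_date}"


def split_submission_id(submission_id: str) -> Tuple[str, str]:
    """Split on the LAST underscore, since the store id contains one too."""
    store_id, visit_date = submission_id.rsplit("_", 1)
    return store_id, visit_date


# The store id below is invented for illustration.
example_id = make_submission_id("air_0123456789abcdef", "2017-04-23")
print(example_id)
print(split_submission_id(example_id))
```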
date_info.csv
This file gives basic information about the calendar dates in the dataset.
calendar_date
day_of_week
holiday_flg — is the day a holiday in Japan
Metric
From an ML perspective, this is a straightforward regression problem, meaning we could use any of MSE, RMSE and MAE. However, for this problem, we will use RMSLE, which stands for Root Mean Squared Logarithmic Error. It is the same as RMSE, except that it is computed on log(1 + value) of both the prediction and the actual value.
Why RMSLE? [1]
Remember, we are trying to help the restaurants be well prepared for their visitors. If they are underprepared, which happens when the prediction is lower than the actual number, they run short on resources, and that means a bad dining experience for the customers. On the other hand, if a restaurant is over-prepared (say it bought resources for 20 customers and only 15 showed up), it can still store the extra resources for the next day, and no customer leaves unhappy.
This is exactly why RMSLE matters: it penalizes underpredictions more heavily than over-predictions.
Consider the following predictions:
Actual = [180, 270, 200, 280, 400, 180, 270, 200, 280]
pred = [140, 240, 180, 270, 400, 190, 290, 230, 320]
Now see the plot of RMSE and RMSLE for the above
RMSE penalizes under- and over-prediction equally, whereas RMSLE penalizes underprediction more heavily than over-prediction.
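To check this numerically, here is a minimal sketch in plain Python. It splits the example arrays into the under-predicted entries (the first four) and the over-predicted entries (the last four); both groups contain the same absolute errors (10, 20, 30 and 40). The helper names rmse and rmsle are my own.

```python
import math

actual = [180, 270, 200, 280, 400, 180, 270, 200, 280]
pred = [140, 240, 180, 270, 400, 190, 290, 230, 320]


def rmse(y_true, y_pred):
    """Root mean squared error on the raw values."""
    squared = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def rmsle(y_true, y_pred):
    """Root mean squared error on log(1 + value)."""
    squared = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# First four entries are predicted too low, last four too high.
under_true, under_pred = actual[:4], pred[:4]
over_true, over_pred = actual[5:], pred[5:]

print("RMSE  under:", rmse(under_true, under_pred), "over:", rmse(over_true, over_pred))
print("RMSLE under:", rmsle(under_true, under_pred), "over:", rmsle(over_true, over_pred))
```

The two RMSE values come out identical, while the RMSLE of the under-predicted group is the larger of the two.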
EDA
Now that we have the problem definition and its evaluation metric sorted, let’s dig into the data to find patterns and observations.
Visitor feature trend across time/days:
- The visitor feature exhibits seasonality across the data. Also, around July 2016 there is a sharp rise in visitors, which could be due to more restaurants being added to the database.
- The seasonality appears to repeat from one week to the next.
- On Jan 1st 2017, there is a sharp dip in visitors, since most restaurants are closed on New Year’s Day.
Average visitors by the day of the week and month:
- We observed seasonality in the visitors variable and assumed it to be weekly. The above plot confirms that assumption.
- As the weekend approaches, visitors rise across the restaurants. People tend to dine out on weekends.
- Month-wise, most months have consistent average visitors, except August and September. December has the highest visitor count, which again could be due to New Year’s Eve.
- Also, there is a significant difference between the mean and median values, which could be due to the presence of some outliers.
Visitor variable distribution:
- From the histogram, we can see that some restaurants have visitor counts in the range of 100 to 800. The box plot confirms the same.
- This could happen on New Year’s Eve, or it could be a wrong entry in the database. Whatever the case, those numbers do not describe the everyday visitor trend. Hence, they can be removed using a 95% confidence interval or the 1.5 × IQR rule.
Visitors stats on weekdays and weekends:
- We earlier saw that weekends are more popular among visitors. The above plot confirms the same.
- Weekends, on average, have over 13% more visitors than weekdays.
Visitors stats on holidays:
- We saw earlier that weekends are popular among visitors since those are non-working days.
- The above plot shows the visitor trend on holidays in general. Here too, we see a difference of about 6% in visitors between holidays and working days.
Visitors during Golden Week:
- It is mentioned that April 29 to May 5 is Golden Week, during which only May 1st and 2nd are working days and the rest are holidays.
- In the above plot, we can see that the weekly seasonality of visitors breaks during Golden Week. Visitors are high throughout this week.
- Since our test data spans these dates, we may want to adjust our predictions to accommodate this pattern.
Visitors stats if next day is holiday/working:
- It’s common for employees to relax, and maybe dine out, if the next day is a holiday.
- Again, visitor counts are higher when the next day is a holiday than when the next day is a working day.
- We can also confirm this on the weekly plot, where Friday has more visitors than the other working days.
Reservation Trend across time/days:
- From the plot, we see that AIR reservations show a flat line between 2016–07 and 2016–11, which is missing data.
- Both HPG and AIR reservations are high during December, presumably to beat the rush of New Year’s Eve and the festivals.
- As with visitors, reservations are minimal on Jan 1st 2017.
- Reservations are higher in 2017 than in 2016. This could be because reserving seats was less popular in 2016 and became more popular in 2017.
Visitors v/s Reservations:
- Visitors and reservations move together: visitor counts rise and fall roughly in proportion to reservations, especially in 2017.
- Although walking in directly seems to be the popular choice, visitors are high when reservations are high. Around 2016–11, spikes in visitors coincide with spikes in reservations.
Reservation Hour:
- In both the AIR and HPG systems, reservations are highest for dinner time.
- Also, the AIR system is popular among visitors for making reservations.
Genre Popularity:
- Of the 14 genres (types of restaurant), “Izakaya” seems to be the most popular among visitors. A quick Google search reveals it to be a Japanese bar that serves alcohol and snacks.
- Cafe/Sweets is the next go-to choice for Japanese people. Now we know which type of restaurant has higher earnings if we ever decide to open a restaurant in Japan ;-)
- International cuisine has the fewest visitors.
Restaurant Count Area-wise/Genre-wise:
- “Daimyo” has the highest number of restaurants, followed mostly by Tokyo areas. Restaurants in these areas can expect at least a baseline number of visitors thanks to the popularity of the area. Competition is also high in those areas.
- We saw earlier that the “Izakaya” and “Cafe/Sweets” genres are popular among visitors, and that is reflected in the restaurant count too. Both genres have a high number of restaurants across Japan.
Restaurants count across Japan:
- From the map, Fukuoka has the highest concentration of restaurants in a single place, hence the high number of visitors.
- While the Tokyo area has a high number of restaurants overall, they form separate clusters spread across different parts of Tokyo. Hence the visitors are spread out.
- Area name could be very important information in determining the visitors.
- Osaka is the second most popular, and again its restaurants are spread out.
- Also, we can see that the more restaurants an area has, the higher its visitor count. The area-wise restaurant count could be useful information.
EDA summary:
- The visitors variable is periodic, following a weekly pattern with a rise in visitors on weekends.
- The visit data has a sharp spike during July 2016, which might be the effect of additional restaurants being added to the database.
- Since most restaurants are closed on New Year’s Day, there is a sharp dip in visitors on 2017–01–01.
- On a weekly basis, visitors are higher on weekends than on weekdays.
- Month-wise, December has more visitors than the other months, owing to festivals and New Year’s Eve.
- In general, visitors are higher on holidays than on working days.
- Restaurants see a rise in visitors if the next day is a holiday rather than a working day. This trend is also visible within the week, where Friday has more visitors than other working days since Saturday is part of the weekend.
- Reservation numbers follow a trend similar to visitors. However, many customers prefer to walk in, as is evident from the large gap between reservations and visitors.
- Golden Week (April 29th to May 5th), which falls inside the 2017 test period, is an exception and has to be handled separately.
- The AIR reservation system has missing data from 2016–07 to 2016–11. No explanation is given for this missing data.
- In both the AIR and HPG reservation systems, reservations are low before 2016–11, followed by a sudden rise.
- The majority of reservations are made for dinner time.
- The alcohol-and-snacks genre, known as ‘Izakaya’ in Japan, is the most popular among customers.
- Next to Izakaya, Cafe/Sweets is the most popular, possibly because office workers stop by cafes for morning and evening coffee.
- ‘Fukuoka’ is the most popular area for restaurants, though it has marginally fewer restaurants than Tokyo. Tokyo, however, has its restaurants spread across clusters, which in turn spreads out the visitors. Tokyo is the next most popular area among consumers.
- Osaka, though it contains the second most restaurants, has marginally fewer visitors than the other two.
- The ‘Izakaya’ genre has a presence in the majority of locations, followed by Cafe/Sweets.
- The majority of genres have consistent customers, except the Western food, International Cuisine, Asian and Karaoke genres.
- International Cuisine/Asian/Karaoke Party are the least popular. Interestingly, Party is the least popular of all. Japan is known for its long working hours, which might be one of the reasons.
Existing Solutions:
8th place Kaggle solution: [2]
The air visit data here is resampled by day, so that missing dates are filled with zero visits.
Calendar information is used to find if the next/prev day of the current day is a holiday. This is useful since it can boost/lower the visitor number.
Weather information is retrieved from https://www.kaggle.com/huntermcgushion/rrv-weather-data, which provides weather data for each store’s area.
Average precipitation and average temperatures from all weather stations are the two main features extracted from weather data.
The visitor count (target) has a lot of outliers, especially unusually high counts around New Year’s Eve. Assuming the visit count per restaurant follows a normal distribution, any value beyond 2.4 times the standard deviation is treated as an outlier and replaced with the largest value that is within that limit.
I.e., values > 2.4 * visitCount.std() = max(values < 2.4 * visitCount.std())
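The capping rule can be sketched with pandas as follows. This is an illustration, not the original solution's code: measuring the deviation from the per-store mean, the helper name cap_outliers and the toy numbers are all my assumptions.

```python
import pandas as pd


def cap_outliers(visitors: pd.Series, n_std: float = 2.4) -> pd.Series:
    """Replace values far above the mean with the largest non-outlier value."""
    # Assumption: "beyond 2.4 std" is measured as distance above the mean.
    is_outlier = (visitors - visitors.mean()) > n_std * visitors.std()
    cap = visitors[~is_outlier].max()
    return visitors.where(~is_outlier, cap)


# Toy data: one store with a single extreme day (values are made up).
visits = pd.DataFrame({
    "air_store_id": ["air_001"] * 10,
    "visitors": [20, 22, 19, 25, 21, 23, 18, 24, 20, 200],
})

# Apply the rule separately for each restaurant.
visits["visitors_capped"] = (
    visits.groupby("air_store_id")["visitors"].transform(cap_outliers)
)
print(visits)
```

Here the day with 200 visitors is pulled down to 25, the largest of the remaining values, while the other days are left unchanged.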
Day of the month is used as a feature to mark payday. Paydays are usually followed by dining out.
Exponentially weighted means (EWM) are used to capture the trend in the time series for numerical features. The alpha (weight) value is determined by optimization.
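A rough idea of such a feature, using the pandas ewm API on made-up data. The alpha value here is picked by hand purely for illustration, and shifting the series by one day so that a day's feature only uses earlier days is my own assumption, not something taken from the solution.

```python
import pandas as pd

# Toy daily visitor counts for one store (values are made up).
visitors = pd.Series(
    [10, 12, 30, 11, 13],
    index=pd.date_range("2017-01-01", periods=5, freq="D"),
)

alpha = 0.5  # illustrative only; the solution finds alpha by optimization

# shift(1) keeps the current day's target out of its own feature.
# With adjust=False the recursion is: ewm[t] = (1 - alpha) * ewm[t-1] + alpha * x[t].
ewm_feature = visitors.shift(1).ewm(alpha=alpha, adjust=False).mean()
print(ewm_feature)
```

A larger alpha makes the feature react faster to recent days, while a smaller alpha gives a smoother, slower-moving trend.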
Simple statistics such as mean, median, standard deviation, variance, count, maximum and minimum are added.
LightGBM is used for modelling, achieving an RMSLE of 0.50775.
6th place Kaggle solution: [3]
The dataset is from official data, and weather-data is used from another source.
The gap in hours between the visit time and the time the reservation was made is used as a feature. It is divided into 5 categories: less than 12 hrs, 12–36, 37–59, 60–85, and 85+.
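One way to build such a bucketed gap feature is pd.cut. The sketch below uses made-up timestamps; the exact handling of the bin edges (left-closed intervals) and the bucket labels are my assumptions, since the write-up only lists the ranges.

```python
import numpy as np
import pandas as pd

# Toy reservations (timestamps are made up).
reserve = pd.DataFrame({
    "visit_datetime": pd.to_datetime(
        ["2017-01-05 19:00", "2017-01-05 19:00", "2017-01-10 12:00", "2017-01-10 18:00"]
    ),
    "reserve_datetime": pd.to_datetime(
        ["2017-01-05 10:00", "2017-01-04 09:00", "2017-01-08 00:00", "2017-01-02 18:00"]
    ),
})

# Gap in hours between making the reservation and the visit itself.
gap = reserve["visit_datetime"] - reserve["reserve_datetime"]
reserve["gap_hours"] = gap.dt.total_seconds() / 3600

# Left-closed bins: [0, 12), [12, 37), [37, 60), [60, 85), [85, inf).
bins = [0, 12, 37, 60, 85, np.inf]
labels = ["<12h", "12-36h", "37-59h", "60-85h", "85h+"]
reserve["gap_bucket"] = pd.cut(
    reserve["gap_hours"], bins=bins, labels=labels, right=False
)
print(reserve[["gap_hours", "gap_bucket"]])
```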
The mean, median, max, min of visitors to restaurants grouped by working and non-working days is taken as a feature.
The mean visitor count grouped by month is taken as a feature.
Similarly, the mean visitor count grouped by week is taken as a feature.
Temperature and precipitation from weather data are taken as a feature.
XGBoost is used for modelling, achieving an RMSLE of 0.50710.
Potential Improvement Features:
1. Restaurant count in the area
This feature helps quantify the competition in the area. The higher the restaurant count, the higher the chance that visitors are distributed among them.
2. Genre count in the same area
We saw that each genre has its own visitor trend. Hence, the count of same-genre restaurants in the area could aid the prediction.
The first approach with FE:
- Since it is a time series problem, we will combine both train and test data to create features.
- We will combine features from the distributed data:
  - Store information from air_store_info.csv
  - Date information from date_info.csv
  - Reservation information from both HPG and AIR
  - Weather information from another source
- Cap the outliers in the visitors data using the 1.5 × IQR rule. If a visitor value falls beyond the 1.5 × IQR range, we replace it with the boundary of that range. From now on, we will use the capped visitors data for generating other features.
- Group the data by area name to get the count of restaurants in the restaurant’s area. (area competition)
- Group the data by area name and genre name to get the count of same-genre restaurants in the restaurant’s area. (genre competition)
- Get the number of visitors arriving via reservations from both HPG and AIR data for the restaurant.
- Get the number of reservations made for the restaurant for the visit date.
- In addition to the holiday flag, we will create a second holiday flag feature that is True if the day is a holiday or a weekend, and False otherwise. We need this information since visitors are commonly high on holidays.
- Date features describing the day, week number, month, and year.
- Mean/median visitors on a monthly basis for each store.
- Statistics of visitors for the restaurant grouped by the second holiday flag feature.
- Statistics of visitors for the restaurant grouped by the day of the week.
- As we saw earlier, many restaurants form a cluster around the same area/prefecture. Extract the prefectures and sub-prefectures from the area names.
- We will use log-transformed visitors data to train and predict.
- Use RFE/Decision Tree feature importance to eliminate features and select the important ones.
- Finally, using the selected features, we will train several regression models and pick the best one.
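A few of the steps above can be sketched with pandas as follows. This is a simplified illustration on made-up data, not the full pipeline: capping per store (rather than globally), the helper name cap_iqr, and the new column names are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for air_visit_data.csv, air_store_info.csv and date_info.csv.
dates = pd.to_datetime(
    ["2017-01-02", "2017-01-03", "2017-01-04", "2017-01-07", "2017-01-09"]
)
visits = pd.DataFrame({
    "air_store_id": ["air_001"] * 5 + ["air_002"] * 5,
    "visit_date": list(dates) * 2,
    "visitors": [10, 12, 11, 13, 90, 5, 6, 7, 6, 5],
})
stores = pd.DataFrame({
    "air_store_id": ["air_001", "air_002", "air_003"],
    "air_genre_name": ["Izakaya", "Izakaya", "Cafe/Sweets"],
    "air_area_name": ["Tokyo-to Shibuya-ku"] * 3,
})
date_info = pd.DataFrame({"calendar_date": dates, "holiday_flg": [1, 1, 0, 0, 1]})


def cap_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# 1. Cap outliers (assumption: applied per store).
visits["visitors_capped"] = (
    visits.groupby("air_store_id")["visitors"].transform(cap_iqr)
)

# 2. Area competition and genre competition counts.
stores["area_restaurant_count"] = (
    stores.groupby("air_area_name")["air_store_id"].transform("count")
)
stores["area_genre_count"] = (
    stores.groupby(["air_area_name", "air_genre_name"])["air_store_id"].transform("count")
)

# 3. Second holiday flag: True on a holiday or a weekend (Sat=5, Sun=6).
date_info["is_day_off"] = (date_info["holiday_flg"] == 1) | (
    date_info["calendar_date"].dt.dayofweek >= 5
)

# 4. Combine everything and log-transform the target.
features = visits.merge(stores, on="air_store_id", how="left").merge(
    date_info, left_on="visit_date", right_on="calendar_date", how="left"
)
features["log_visitors"] = np.log1p(features["visitors_capped"])
print(features.head())
```

Since the model is trained on log1p of the visitors, predictions have to be mapped back with np.expm1 before they are reported as visitor counts.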
Features selected using RFE:
Using RFECV, features with rank 1 were selected for modelling.
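For readers unfamiliar with RFECV, here is a small sketch using scikit-learn on synthetic data. The decision tree estimator, the scoring choice and the feature names f0 to f4 are assumptions for illustration; the write-up does not state the exact configuration that was used.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: only the first two columns drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Recursively drop the least important feature, scoring each subset with CV.
selector = RFECV(
    estimator=DecisionTreeRegressor(max_depth=5, random_state=0),
    step=1,
    cv=3,
    scoring="neg_mean_squared_error",
)
selector.fit(X, y)

# Features kept by RFECV are exactly those with rank 1.
feature_names = np.array(["f0", "f1", "f2", "f3", "f4"])
selected = feature_names[selector.ranking_ == 1]
print("rank per feature:", selector.ranking_)
print("selected:", list(selected))
```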
Modelling: ML
1. KNN Regression
Hyperparameters:
n_neighbors
Tuning result:
n_neighbors = 19
The RMSLE score using KNN regression is 0.51938
2. Decision Tree Regression:
Hyperparameters:
max_depth
min_samples_split
min_samples_leaf
Tuning result:
max_depth=10
min_samples_leaf=50
min_samples_split=500
The RMSLE score using DT regression is 0.50471
3. Linear Regression
Hyperparameters:
alpha: L2 penalty value
Tuning result:
alpha=0.01
The RMSLE score using Linear regression is 0.51276
4. Random Forest
Hyperparameters:
max_depth
n_estimators
Tuning result:
max_depth=10
n_estimators=500
The RMSLE score using Random Forest regression is 0.50035
5. XGBoost
Hyperparameters:
learning_rate
colsample_bytree
min_child_weight
max_depth
subsample
Tuning result:
learning_rate = 0.01
min_child_weight = 0.8
subsample = 0.7
colsample_bytree = 0.5
max_depth = 8
The RMSLE score using XGBoost is 0.47826
6. AdaBoost
Hyperparameters:
n_estimators
learning_rate
Tuning result:
base_estimator=DecisionTreeRegressor(max_depth=10)
learning_rate=0.01
n_estimators=100
The RMSLE score using AdaBoost is 0.50148
Modelling: DL
1. MLP
Keras Tuner is used to tune the architecture, and the tuned model is used to predict the visitors.
The RMSLE score using MLP is 0.51662
2. LSTM
Again, Keras Tuner is used to tune the LSTM architecture, and the tuned model is used to predict the visitors.
The RMSLE score using LSTM regression is 0.51728
3. Conv1D
Again, Keras Tuner is used to tune the Conv1D-based architecture, and the tuned model is used to predict the visitors.
The RMSLE score using Conv1D regression is 0.52300
Score Comparison:
From the above scores, XGBoost yields the best score.
Kaggle Submission scores
On the Kaggle scores too, XGBoost yields the best score of all.
Improvements:
- Provide restaurants with a live feed of expected visitors in the next hour/minute. The live feed could help them organise their schedule, resources and staff.
- Use the adjustment method found by Kaggle user 30CrMnSiA to handle the Golden Week predictions. [5]
References:
[1] https://medium.com/analytics-vidhya/root-mean-square-log-error-rmse-vs-rmlse-935c6cc1802a
[2] https://github.com/MaxHalford/kaggle-recruit-restaurant/blob/master/Solution.ipynb
[3] https://github.com/anki1909/Recruit-Restaurant-Visitor-Forecasting
[4] https://www.appliedaicourse.com/
[5] https://www.kaggle.com/c/recruit-restaurant-visitor-forecasting/discussion/49100
Collaborate/Contact:
- If you have ideas for other features to improve this model, would like to collaborate on other projects, or have any suggestions for me, feel free to contact me on LinkedIn.
- The full, ready-to-use codebase can be found in my GitHub repo.