Machine learning: the model behind funda's Waardecheck

May 18, 2021 · 5 min read

At funda, we continuously develop new products to help consumers in the journey of buying or selling their house. But how do these products arise? In this article, Data Scientist Edwin Thoen tells you all about the origin of the Waardecheck and the model behind it.

If you are a homeowner, your house is not just a place to live in, it is also an investment. When the housing market goes up or down, so does the value of your property. At funda we knew that homeowners used our website to keep track of the asking prices of houses in their neighborhood, and from this tried to assess what the current value of their own house would be if they were to put it on the market. In 2018, we decided to create a product for this group: the Waardecheck (Value check). It automatically translates recent sales prices into an estimate of the value of your house, taking into account its unique characteristics. As it turns out, the Waardecheck is frequently used and appreciated by our users: it is used over half a million times a year and the overall satisfaction rate is 86%.

First steps towards testing the product
It goes without saying that working on a product like this is really exciting for a data scientist. Applying machine learning to a well-defined problem, with the result being directly customer-facing, is about as good as it gets. During the development of the product in general, and the underlying model in particular, we went through a number of stages to create the Waardecheck as you see it now.

See also: How does funda actually work technically?

At funda we are huge fans of the Agile workflow. So, we first built a Minimum Viable Product, brought it live for a selected audience and learned from feedback and interactions with the product. This implied we also needed to build an MVP prediction engine, just good enough to be exposed on the website. To create this first version, we did some minimal data cleaning, split the data by province and trained an XGBoost model on each of these datasets. As features, we only used house characteristics that were readily available in public databases. This way we could calculate the estimates for each house in the Netherlands in advance; when someone used the product, we just did a simple lookup of the estimate of their home.
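A minimal sketch of what such a per-province setup could look like, assuming a pandas DataFrame with one row per sold house (all file and column names here are hypothetical):

```python
import pandas as pd
import xgboost as xgb

# Hypothetical training data: one row per sold house, with the
# province, publicly available characteristics and the sale price.
sales = pd.read_csv("sales.csv")
features = ["living_area_m2", "plot_area_m2", "build_year", "n_rooms"]

# Train one gradient-boosted model per province.
models = {}
for province, group in sales.groupby("province"):
    model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6)
    model.fit(group[features], group["price"])
    models[province] = model

# Pre-compute an estimate for every known house, so the website only
# needs a simple lookup when a user checks their address.
houses = pd.read_csv("all_houses.csv")
houses["estimate"] = houses.groupby("province", group_keys=False).apply(
    lambda g: pd.Series(models[g.name].predict(g[features]), index=g.index)
)
```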

Since user feedback was very promising, we decided to further improve the product. The major focus point was giving better estimates. XGBoost was very convenient for delivering a first version, because it handles nonlinear relationships out of the box and requires minimal data preparation. Unfortunately, it did not perform well for all regions. Although the Netherlands is a small country, it has large regional differences in house prices, even at a very local level.

A second challenge was the inclusion of the temporal component: how do we capture the market? House prices rise and fall over time; this is the very reason homeowners want an up-to-date estimate of their house's value. We made several attempts to include 'the market' in the XGBoost approach, but nothing was really satisfactory. In the end we settled on a feature that counted the days since the first date in our training set, which did the job for the time being.
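That market feature amounted to a single derived column; a sketch, with hypothetical file and column names:

```python
import pandas as pd

# Hypothetical sales data with a parsed sale date per transaction.
sales = pd.read_csv("sales.csv", parse_dates=["sale_date"])

# Crude market feature for the XGBoost version: the number of days
# between each sale and the first date in the training set.
first_day = sales["sale_date"].min()
sales["days_since_start"] = (sales["sale_date"] - first_day).dt.days
```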

Going Bayesian
Off-the-shelf machine learning models like XGBoost are convenient, but typically require a substantial amount of data to create stable predictions. To get those numbers, we trained at the province level, which could not adequately account for regional differences. We did not see a clear perspective for improvement in this direction, so we decided to try our luck with a completely different approach: hierarchical Bayesian models, set up locally. We felt that we had more control over the regional component when setting up a statistical model, instead of tuning the hyperparameters of a black-box method. Using a hierarchical structure, we could let the data speak if there were a lot of sales in an area; if there were not, we would still have a stable estimate, albeit with a somewhat larger interval around it.
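The article does not name a specific library, but a minimal sketch of such partial pooling in PyMC, with synthetic stand-in data, could look like this:

```python
import numpy as np
import pymc as pm

# Synthetic stand-in data: n sales, k features, n_muni municipalities.
rng = np.random.default_rng(0)
n, k, n_muni = 500, 3, 20
X = rng.normal(size=(n, k))               # standardized house features
muni_idx = rng.integers(n_muni, size=n)   # municipality of each sale
y = rng.normal(size=n)                    # standardized log sale prices

with pm.Model():
    # National-level coefficients act as the pooling mean.
    mu_beta = pm.Normal("mu_beta", 0.0, 1.0, shape=k)
    sigma_beta = pm.HalfNormal("sigma_beta", 1.0, shape=k)

    # Municipality-level coefficients: with many local sales they
    # follow the local data; with few, they shrink toward mu_beta,
    # trading a somewhat wider interval for a stable estimate.
    beta = pm.Normal("beta", mu=mu_beta, sigma=sigma_beta, shape=(n_muni, k))
    intercept = pm.Normal("intercept", 0.0, 1.0, shape=n_muni)
    sigma = pm.HalfNormal("sigma", 1.0)

    mu = intercept[muni_idx] + (beta[muni_idx] * X).sum(axis=1)
    pm.Normal("log_price", mu=mu, sigma=sigma, observed=y)

    idata = pm.sample()
```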

After some tinkering, we also found a natural solution for modelling the market. Instead of using an explicit feature, we incorporated it as an implicit variable by training a new model for every month, using the prior-likelihood-posterior nature of Bayesian modelling. The first half year in our training set was used to train the initial models, giving us a starting set of posteriors for each local model. Subsequently, we used the posteriors of a previous month as priors for the next, updated with that month's data to arrive at the new posteriors. This gave us a model for each month, for every municipality.
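A sketch of that monthly loop for a single municipality, under the simplifying assumption (not necessarily funda's exact implementation) that each posterior is summarized by per-coefficient moments that parameterize the next month's Normal priors:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
k = 4                                          # number of coefficients
prior_mu, prior_sd = np.zeros(k), np.ones(k)   # from the initial fit

for month in range(12):
    # Synthetic stand-in for one municipality's sales in this month.
    X_m = rng.normal(size=(40, k))
    y_m = X_m @ np.array([0.5, 0.3, -0.2, 0.1]) + rng.normal(0.0, 0.2, 40)

    with pm.Model():
        # Last month's posterior serves as this month's prior.
        beta = pm.Normal("beta", mu=prior_mu, sigma=prior_sd, shape=k)
        sigma = pm.HalfNormal("sigma", 1.0)
        pm.Normal("y", mu=pm.math.dot(X_m, beta), sigma=sigma, observed=y_m)
        idata = pm.sample(progressbar=False)

    # Summarize the posterior per coefficient; these moments become
    # next month's priors.
    draws = idata.posterior["beta"].stack(s=("chain", "draw"))
    prior_mu = draws.mean("s").values
    prior_sd = draws.std("s").values
```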

This updating alone did not model the market. As any Bayesian can tell you, continuously updating prior to posterior will give you the exact same result as training on all data at once. Then why go through this trouble? Well, we did not just go from posterior to prior month to month. Before training on a new month, we discounted a little bit of the historical information by multiplying the standard deviation of the posterior by a factor just above one. This made the prior a little less informative, and thus made that month's data slightly more dominant in creating the posterior. The further a data point is in the past, the more often it is discounted in the monthly updates. This way the most recent data points are the most dominant in the current estimates.
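In the sketch above, this amounts to one extra line at the top of the loop, with a hypothetical factor:

```python
DISCOUNT = 1.05  # hypothetical factor just above one

# Before fitting on a new month, widen the prior: information from
# n months back has then been inflated by DISCOUNT ** n, so recent
# sales dominate the current posterior.
prior_sd = prior_sd * DISCOUNT
```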

See also: How we validated flutter at scale for funda

The Bayesian paradigm enabled us to create a model that was flexible enough to follow the market, yet stable enough to guard against overfitting.

This approach does require a lot more manual labour from the data scientist, such as creating cubic splines out of features to model nonlinear relationships. Moreover, it demands good bookkeeping, because we have to keep track of thousands and thousands of models. But once it was in place, the results in terms of model performance were vastly superior to those of the black-box machine learning approach.
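As an illustration of that spline step, a sketch using patsy (one possible tool; the article does not name one) to expand a single feature into a natural cubic spline basis:

```python
import numpy as np
from patsy import dmatrix

# Hypothetical feature: living area in square metres.
living_area = np.random.default_rng(2).uniform(30, 300, size=500)

# Expand it into a natural cubic spline basis so a linear Bayesian
# model can still capture a nonlinear price-size relationship.
spline_basis = dmatrix(
    "cr(x, df=4) - 1", {"x": living_area}, return_type="dataframe"
)
```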

Exposing the model, not the result
A final challenge we had to tackle was calculating the estimates on the fly as the user entered the home details, instead of pre-calculating the estimates for every house in the Netherlands. This was needed to incorporate features whose values are not publicly available into the model, and thereby improve the estimations. Moreover, the variables that are publicly available proved to be inaccurate for many houses, yielding incorrect estimates; we wanted to enable the user to adjust these prefilled values. To allow on-the-fly estimates, we expose the full MCMC approximations of the posteriors of the latest model in an API. As soon as users fill in the specifics of their houses, these values are sent to the API, which multiplies the feature vector with the matrix of approximations. This results in a vector that approximates the posterior prediction for the house. On this vector, we calculate the median as well as a lower and an upper bound of the uncertainty interval, which are returned by the API and subsequently shown to the user.
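A minimal sketch of what that API calculation could look like (the interval bounds shown here are hypothetical percentiles):

```python
import numpy as np

def estimate(features: np.ndarray, draws: np.ndarray) -> dict:
    """Hypothetical handler. `draws` holds the stored MCMC
    approximations of the posterior: one row per coefficient,
    one column per sample (shape k x n_samples)."""
    # Feature vector times draw matrix: a vector approximating the
    # posterior prediction for this particular house.
    predictions = features @ draws               # shape (n_samples,)
    lower, median, upper = np.percentile(predictions, [10, 50, 90])
    return {"lower": lower, "estimate": median, "upper": upper}
```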

Although the black-box machine learning approach did prove to be useful for some regions in the Netherlands, we felt we could not fully use it to build a successful product. It's true that the Bayesian approach was more laborious to implement and created a more complex system, but at the same time it gave us much greater control over the model and allowed us to get creative. Most importantly, it delivers superior predictions and thus a better product. We had a lot of fun working on the Waardecheck and learned a great deal along the way. It is funda's first machine-learning-driven product and we are honored and proud that we could work on it.
