DEGREE PROJECT IN THE BUILT ENVIRONMENT, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

Property Valuations by Machine Learning and Hedonic Pricing Models
A Case Study on Swedish Residential Property

KANHA TEANG
YIRAN LU

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ARCHITECTURE AND THE BUILT ENVIRONMENT



Master of Science Thesis

Title: Property Valuation by Machine Learning and Hedonic Pricing Models: A Case Study on Swedish Residential Property
Author: Kanha Teang & Yiran Lu
Department: Department of Real Estate and Construction Management
Master Thesis number: TRITA-ABE-MBT-21419
Supervisor: Bertram Steininger
Keywords: Real estate valuation, Machine learning, Hedonic Pricing Models, Random Forest, Stockholm

Abstract

Property valuation is a critical concept for a variety of applications in the real estate market, such as transactions, taxes, investments, and mortgages. However, there is little consensus on which method is best for estimating property value. This paper investigates and compares the differences in Stockholm residential property valuation results between parametric hedonic pricing models (HPM), including linear and log-linear regression models, and Random Forest (RF) as the machine learning algorithm. The data consist of 114,293 arm's-length transactions of tenant-owned apartments between January 2005 and December 2014. The same variables are applied to both the HPM regression models and RF. Two techniques are adopted for splitting the data into training and testing datasets: random splitting and splitting based on the transaction years. These datasets are used to train and test all the models. The performance of each model is evaluated with four indicators: R-squared, MSE, RMSE, and MAPE.

The results from both data-splitting schemes show that the accuracy of Random Forest is the highest among the models. The discussion points out the causes of the models' performance changes when applied to datasets obtained from the different data-splitting techniques. Limitations are also pointed out at the end of the study for future improvements.


Examensarbete

Titel: Fastighetsvärderingar efter maskininlärning och hedoniska prissättningsmodeller: En fallstudie om svenska bostadsfastigheter
Författare: Kanha Teang & Yiran Lu
Institution: Institutionen för Fastigheter och Byggande
Examensarbete nummer: TRITA-ABE-MBT-21419
Handledare: Bertram Steininger
Nyckelord: Fastighetsvärderingar, Maskininlärning, Hedoniska prissättningsmodeller, Random Forest, Stockholm

Abstract

Fastighetsvärdering är ett kritiskt koncept för en mängd olika tillämpningar på fastighetsmarknaden, som transaktioner, skatter, investeringar och inteckningar. Det finns dock liten konsensus om vilken metod som är bäst för att uppskatta fastighetsvärdet. Denna uppsats syftar till att undersöka och jämföra skillnaderna i Stockholms fastighetsvärderingsresultat mellan parametriska hedoniska prissättningsmodeller (HPM), inklusive linjära och log-linjära regressionsmodeller, och Random Forest (RF) som maskininlärningsalgoritm. Datamaterialet består av 114 293 armlängdstransaktioner av bostadsrättslägenheter från januari 2005 till december 2014. Samma variabler tillämpas på både HPM-regressionsmodellerna och RF. Två tekniker används för uppdelning av data i tränings- och testdatamängder: slumpmässig uppdelning och uppdelning baserad på transaktionsåren. Dessa datamängder används för att träna och testa alla modeller. Prestandautvärderingen av varje modell baseras på fyra resultatindikatorer: R-kvadrat, MSE, RMSE och MAPE.

Resultaten från båda uppdelningsförhållandena har visat att noggrannheten hos Random Forest är den högsta bland modellerna. Diskussionen pekar på orsakerna till modellernas prestandaförändringar när de tillämpas på olika datamängder erhållna från olika datauppdelningstekniker. Begränsningar påpekas också i slutet av studien för framtida förbättringar.


Acknowledgment

We would like to express our deepest gratitude to all the people who have supported us along the way. First and foremost, our most sincere thanks go to our supervisor Bertram Steininger, who provided a great deal of guidance during this master thesis and helped with the data collection. His valuable advice inspired us and showed us how to improve our work. Without his support, we could not have made it this far.

Secondly, thanks to our family and friends who always sent us encouragement and support; their love and company along the way mean a lot to us. We would also like to thank all the teachers and colleagues who shared their help and knowledge during our master program.

Lastly, thanks again to all the people who have helped us; we wish you all a bright future.

Kanha Teang & Yiran Lu Stockholm, June 2021


Contents

1. Introduction
1.1 Aim of the research
2. Literature Review
2.1 Overview of the Swedish Residential Property
2.2 Traditional Property Valuation
2.3 Advanced Property Valuation
2.3.1 Hedonic Pricing Model
2.3.1.1 Strengths and Weaknesses of Hedonic Pricing Model
2.3.2 Machine Learning for Property Valuation
2.3.2.1 Strengths and Weaknesses of Machine Learning in Property Valuation
2.4 Random Forest
3. Methods
3.1 Hedonic Pricing Model
3.2 Random Forest
3.3 Performance Indicators
4. Data
4.1 Variables Selection
4.2 Data Source
4.3 Data Description
4.4 Data Setting
5. Results
5.1 Results from Random Data Splitting
5.1.1 Multiple linear regression
5.1.2 Log-linear regression
5.1.3 Random Forest
5.2 Results from data split by years
5.2.1 Multiple linear regression
5.2.2 Log-linear regression
5.2.3 Random Forest
6. Discussion
6.1 Models comparison: randomly splitting the data
6.2 Models comparison: splitting the data by years
6.3 General discussion
7. Conclusion
8. Limitation
References


1. Introduction

Property valuation is the estimation of a property's market value, which is significantly important for decision-making in real estate investment, transactions, development, taxation, and credit lending. Considering these applications, the quality and accuracy of appraisals are critically important. In the traditional type of property valuation, appraisers estimate the value based on their opinion and judgment (Abidoye et al. 2019). To ensure quality, appraisers are required to follow the professional, technical, and performance standards regulated by governments and global professional bodies such as the Royal Institution of Chartered Surveyors (RICS). However, several studies have discussed and shown that appraisers introduce biases, including anchoring the value to recent transaction prices while paying less attention to current market conditions (Diaz III & Wolverton 1998), making judgments on the property value based on their own opinions (Gallimore 1996), and being influenced by clients (Diaz & Hansz 1997). In turn, this increases the inaccuracy of the valuation. Advanced valuation methods involving multi-regression methods and big data, such as the hedonic pricing model and artificial intelligence, have been claimed to improve valuation accuracy (Diaz & Hansz 1997; Yacim & Boshoff 2014; Yilmazer & Kocaman 2020).

The hedonic pricing model is a conventional regression model developed by Lancaster (1966) and Rosen (1974) (Wing & Chin 2003). The theory is based on consumer demand: the characteristics of a good, rather than the good itself, are the main drivers of consumption (Čeh et al. 2018). In this case, structural, locational, and environmental attributes are the characteristics of a property that drive consumer demand, so the model is broadly applicable to predicting housing prices (Hong, Choi & Kim 2020).

Another advanced method is machine learning. Machine learning is an application of artificial intelligence that trains a computer to learn and recognize patterns in input data in order to improve its performance automatically (Jordan & Mitchell 2015). Well-known machine learning algorithms applied to property valuation are artificial neural networks (ANN), support vector machines (SVM), random forest (RF), gradient boosting machines (GBM), and boosted trees (BT). However, studies have found that the RF technique performs better than the other algorithms mentioned above in property value estimation (Antipov & Pokryshevskaya 2012; Dellstad 2018; Geltner & Mei 1995; Ho, Tang & Wong 2021; Masías et al. 2016; Thanh Noi & Kappas 2018).


Additionally, it is suitable for, and performs better in, mass residential appraisal (Antipov & Pokryshevskaya 2012). A related comparison of the performance of these two advanced methods, HPM and RF, was conducted by Hong, Choi & Kim (2020). That study chose the conventional ordinary least squares linear regression model as the hedonic regression model and compared its performance in the mass appraisal of residential property in Korea. The property values predicted by RF deviated from the market value by approximately 5% on average, while those of the HPM deviated by approximately 20%. However, there is a limited number of studies comparing RF with other regression models, such as log-linear ordinary least squares, as the hedonic model. Follain and Malpezzi (1980) claim that the log-linear regression model performs better than the linear regression model in predicting property value. Rather than a comparative study between these two valuation methods, Dellstad (2018) adopted three ML techniques, RF, ANN, and SVM, for Swedish commercial property valuation. That study also showed that ML produces a lower error percentage than valuations of commercial property performed by appraisers at a Swedish company. However, there is still insufficient evidence that ML outperforms in residential property valuation.

1.1 Aim of the research

Having learned from the above limitations, this paper aims to compare the performance of the random forest algorithm with parametric hedonic pricing models, both linear and log-linear regression models, as valuation methods for Swedish residential property. The study focuses on tenant-owned apartments in Stockholm city as the selected data.

2. Literature review

2.1 Overview of the Swedish residential property

As of 2019, with a population of 10 million people, Sweden had the 12th-highest GDP in the world (OECD 2021). According to Statistics Sweden (2021), the total population will increase to 12 million people by 2030, with approximately 25% of the population residing in the capital city of Stockholm. The growth of the population leads to increasing demand in the housing market.


However, the low supply of newly constructed residential housing has been blamed as one of the causes of the sharp rise in housing prices, which have increased 2.5 times since 1995 (Emanuelsson 2015). Other factors contributing to increasing housing prices and limiting the housing supply include strict regulation of the planning and development of new construction, high construction and land costs, regulation of the rental market, and high quality requirements (Boverket 2016; Emanuelsson 2015). Additionally, the ease with which households can access mortgages at low interest rates also raises the demand for residential property (European Commission 2018). The risk that property developers react negatively to tightening conditions on pre-sale financing could also contribute to a reduction of the housing supply (International Monetary Fund 2019). To tackle these problems, the Swedish government has introduced several policies to support the development of residential property (European Commission 2020). In 2016, through Sweden's National Board of Housing, Building and Planning (Boverket), the government released 22 measures, including providing a credit guarantee to banks of up to 90% of the project value for developers seeking financing. The measures also considered selling necessary blocks of public land to private companies, simplifying the process of planning and building activities, and advising municipalities not to introduce additional regulations beyond the Planning and Building Act (European Commission 2018). The government even proposed temporary building permits in 2017 (European Commission 2020).

2.2 Traditional property valuation

Valuation plays an essential role in the estimation of the market value of a property. The purposes of valuation are to support financial reports, transactions, tax reports, mortgages, and lending decisions (International Valuation Standards Council 2019, p. 4). There are two categories: traditional valuation and advanced valuation (Abidoye Rotimi et al. 2019; Pagourtzi et al. 2003; Yacim & Boshoff 2014). Traditional valuation relies on the appraiser's opinion to assess the value of the property manually (Abidoye Rotimi et al. 2019), while advanced valuation uses mathematical models with the property's information and attributes to estimate the market value (Pagourtzi et al. 2003). As traditional valuation is based on the appraiser's judgment and opinion, RICS has defined it as:


An opinion of the value of an asset or liability on a stated basis, at a specified date. Unless limitations are agreed in the terms of engagement this will be provided after an inspection and any further investigations and inquiries that are appropriate, having regard to the nature of the asset and the purpose of the valuation (RICS 2019, p. 10).

Each property type requires a relevant and suitable valuation approach and method. There are three main approaches in traditional valuation: the market approach, income approach, and cost approach (International Valuation Standards Council 2019). Each approach consists of different methods. Taken together, the methods comprise the comparable method, income method, profit method, development or residual method, contract method, multiple regression method, and stepwise regression method (Pagourtzi et al. 2003). According to the International Valuation Standards Council (2019) (Red Book), the market approach assesses the value of the considered property by comparing it to the value of recently transacted properties with similar attributes, including type, location, and age. Valuation by the income approach is based on converting future cash flows to present value (discounted cash flow); this approach is suitable for income-producing assets such as commercial property. Lastly, the cost approach derives the value of the property from the cost of replacement or reproduction of a property with the same utility, with deductions for deterioration and aging of the building. Lentz and Wang (1998) claimed property valuation is an art rather than a science, involving the selection of suitable methods and variables. Accordingly, appraisers are expected to have a number of strengths in property valuation, including:

• Considering neighborhood characteristics within the area of the property, which are obtained from comparable properties.

• Considering the impacts of externality variables to which the property value is sensitive (Lentz & Wang 1998; Murdoch, Singh & Thayer 1993). For example, a nuclear plant located in a neighboring community makes an area undesirable to live in and hence has a negative effect on the property value.

• Comparing and making a judgment on the property's value based on the most recent transactions or time variable (RICS 2019).

However, the drawbacks of valuation by appraisers are lower accuracy (Yacim & Boshoff 2014; Zurada, Levitan & Guan 2006) and less reliable results due to bias (Diaz III & Wolverton 1998; Diaz & Hansz 1997).


2.3 Advanced property valuation

The distinction between advanced valuation and traditional valuation is the amount of input data used to estimate the market value of the property. Traditional valuation deals with one or a few property valuations, while mass appraisal usually works on a huge amount of property information (Yilmazer & Kocaman 2020). Advanced valuation includes the hedonic pricing model (HPM), spatial analysis, the autoregressive integrated moving average (ARIMA), and approaches using artificial intelligence through machine learning algorithms such as artificial neural networks (ANN), expert systems, fuzzy logic (Pagourtzi et al. 2003), and random forest (RF) (Antipov & Pokryshevskaya 2012).

2.3.1 Hedonic pricing model

The hedonic pricing model is a regression model that has been widely employed in property valuation for a long time (Annamoradnejad et al. 2019). Its theoretical framework was developed by Lancaster (1966) and Rosen (1974), who claimed that the value of a good is the total accumulation of the implicit values of each of the good's characteristics. The hedonic pricing model mostly uses the Ordinary Least Squares (OLS) regression approach, since this approach is easy to compute (Bao & Wan 2007).

2.3.1.1 Strengths and weaknesses of the Hedonic Pricing Model

One advantage of the hedonic pricing model is that it is simple to predict price and easy to explain each regression coefficient (Hong et al. 2020). However, the model is criticized for its strict assumptions. It is built directly on household preferences and strict housing assumptions, which require perfect competition and market equilibrium (Hong, Choi & Kim 2020). The same conclusion was drawn by Fan et al. (2006), who noted criticism of the hedonic regression approach concerning its model assumptions and prediction requirements. Thus, the hedonic model's comparative simplicity comes at the cost of simplifying the complexity of reality, and there is a need for other models that may be harder to implement but can better explain the complicated relationships found in practice.


2.3.2 Machine learning for property valuation

According to Simon et al. (2016), machine learning is a part of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. It is a method of training machines to construct algorithms that learn and generate estimations based on the provided datasets. Machine learning is thus a great alternative for solving prediction problems that are hard to deal with manually. There are two kinds of tasks in this method: supervised and unsupervised. In supervised learning, the program is trained on a pre-determined, labeled dataset and should be able to generate a prediction when new data comes in. In unsupervised learning, unlabeled data are given, and the training program aims at seeking relationships and patterns in the data. Machine learning is particularly applicable to complicated data patterns (Ngiam & Khor 2019). There have been extensive applications of different machine learning models across fields such as computer engineering, medicine, and business (Singh et al. 2007). Various machine learning models have also been developed in the real estate field, where they are important for estimating housing prices (Jamil et al. 2020).

Mass appraisal is a concept closely related to property valuation using machine learning models. The International Association of Assessing Officers defines mass appraisal as the process of valuing groups of properties as of a given date using common data, standardized methods, and statistical testing (Eckert et al. 1990). Zhou et al. (2018) explained that mass appraisal in real estate can also refer to automatic valuation, where geographical information techniques, mathematical statistics, and computer technology are used to create a model that appraises the market values of a large number of properties.
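The supervised/unsupervised distinction described above can be made concrete with a short sketch. This is an illustration in Python with scikit-learn (the thesis itself worked in RStudio), and the apartment data below are invented, hypothetical numbers rather than the study's dataset:

```python
# Supervised vs. unsupervised learning on made-up apartment data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
size = rng.uniform(30, 120, 200)              # living area in m^2 (hypothetical)
age = rng.uniform(0, 80, 200)                 # building age in years (hypothetical)
price = 60_000 * size - 5_000 * age + rng.normal(0, 50_000, 200)

X = np.column_stack([size, age])

# Supervised: trained on a pre-determined, labeled dataset (prices),
# then asked to predict the label for new, unseen data.
model = LinearRegression().fit(X, price)
predicted = model.predict([[75.0, 20.0]])     # predicted price for a new apartment

# Unsupervised: only unlabeled data are given; the program seeks
# structure, here grouping apartments by size and age.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

In the supervised case the model is judged by how close `predicted` is to true labels; in the unsupervised case there are no labels, and the output is a grouping of the observations.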
A number of methods have already been applied in real estate mass appraisal. Antipov & Pokryshevskaya (2012) stated that many studies traditionally employed parametric regression methods for analysis, while others chose nonparametric regressions. Apart from that, machine learning methods also have a long history of application in mass appraisal. As machine learning techniques gain more attention in mass appraisal, more care is required to understand their strengths and weaknesses in order to apply them well when appraising property values.


2.3.2.1 Strengths and weaknesses of Machine Learning in property valuation

Substantial studies have already demonstrated the effectiveness and high accuracy that machine learning algorithms can bring to real estate valuation; however, criticisms remain. Antipov & Pokryshevskaya (2012) compared ten machine learning algorithms for mass appraisal in the Saint Petersburg residential market and successfully demonstrated the advantages of RF and its effectiveness in price prediction. However, they also pointed out some limitations RF may have. For example, the method requires an optimal setting for how many variables are sufficient for each tree. Ho et al. (2020) studied property price prediction in Hong Kong using Support Vector Regression (SVR), Random Forest (RF), and Gradient Boosting Machine (GBM). Their results indicated that both RF and GBM have lower prediction errors and can predict housing prices more accurately than SVR. They also highlighted limitations of machine learning algorithms. First, there are always different features/variables that can be selected within a model, and researchers have to choose suitable features carefully. Second, the results generated by conventional methods are easier to interpret than those of machine learning algorithms. Lastly, conventional estimation methods like the hedonic pricing model are usually less time-consuming to compute than machine learning algorithms. Thus, even though machine learning is an advanced technology that can solve extremely complex problems, its limitations cannot be ignored. Even though machine learning algorithms have excellent estimation capabilities in property valuation, conventional methods will not be replaced, because under some circumstances, such as tight time constraints or a lack of time to test which features are most suitable in a model, machine learning algorithms may not be the best choice.

2.4 Random Forest

Random Forest is a successful supervised machine learning algorithm developed by Breiman (2001). The algorithm is an ensemble of decision trees and is suitable for both classification and regression problems: regression deals with quantitative prediction, while classification is for qualitative prediction. Having its origin in the decision tree, it obtains predictions by learning from the observed patterns of the input data. Each terminal node, representing one decision, is referred to as one leaf of a tree, and each tree is independent. What distinguishes RF from a single decision tree, however, is that it builds a specific number of trees on random subsets of the data and averages the trees' outputs to create the final result.

Previous studies have explored the performance of RF in real estate mass appraisal, and some of them compared it to more traditional valuation methods. Masías et al. (2016) applied Random Forest (RF), Support Vector Machine (SVM), Neural Networks (NN), and multiple linear regression (LR) to predict housing prices in Santiago; the results showed that RF outperformed the other three in terms of accuracy, while LR scored better than NN. Yilmazer & Kocaman (2020) applied multiple regression analysis (MRA) and random forest to part of the Ankara commercial property market; the results showed that RF performed slightly better than MRA in explanation rate and average deviation. Čeh et al. (2018) also used random forest and multiple linear regression in the Ljubljana housing market, and their results indicated that RF significantly outperformed MRA. Levantesi & Piscopo (2020) applied Random Forest to the London real estate market to analyze the importance of variables to the housing price. They also compared the prediction results of Random Forest (RF) and a Generalized Linear Model (GLM); the results showed that the prediction performance of RF is better than that of GLM. Antipov & Pokryshevskaya (2012) hold a positive view of Random Forest in mass appraisal and point out several benefits this method can bring to property valuation: Random Forest performs better in classification compared with other machine learning algorithms like neural networks, and the method can effectively handle missing values and categorical variables with many levels. Other strengths of RF are that it is not sensitive to outliers because of bagging, and that it can reduce overfitting problems. Yilmazer & Kocaman (2020) stated that Random Forest can successfully detect both linear and non-linear relationships between the dependent variable and independent variables. In addition, RF can automatically help with selecting which variables are most suitable in an analytical case. However, even though RF has proved its strong performance in housing price prediction, as reviewed in the previous studies, it has some disadvantages regarding variable selection, such as how many and which variables would be suitable for a model.
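The averaging mechanism described above is easy to verify in code. The sketch below uses Python and scikit-learn (rather than the R environment used in the thesis) on synthetic, hypothetical data, and checks that a regression forest's prediction is the mean of its individual trees' predictions:

```python
# A random forest regression prediction is the average of its trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (300, 3))               # e.g. scaled size, age, floor (hypothetical)
y = 5 * X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.1, 300)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

x_new = np.array([[0.5, 0.3, 0.8]])
forest_pred = rf.predict(x_new)[0]

# Each fitted tree is available in rf.estimators_; averaging their
# individual predictions reproduces the forest's output.
tree_preds = [tree.predict(x_new)[0] for tree in rf.estimators_]
assert np.isclose(forest_pred, np.mean(tree_preds))
```

Because each tree is trained on a bootstrap sample of the data, the individual `tree_preds` differ, which is what makes their average more stable than any single tree.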


3. Methods

Having learned the strengths and weaknesses of hedonic pricing models and machine learning models for mass appraisal, we select a few models under HPM and ML and compare these different kinds of models in a practical case. Random forest has a relatively high performance in property valuation. As mentioned by Antipov & Pokryshevskaya (2012), there is still an insufficient number of comparison studies among machine learning algorithms for mass appraisal. They compared ten methods for property valuation and showed that RF performs better than the other nine. This is in line with other studies that also compare RF with other valuation methods. Hedonic pricing models are the other methods selected for this study: the conventional parametric linear regression and log-linear regression models will be applied to the property valuation. These regression models illustrate the relationship between apartment prices and apartment characteristics. The hedonic pricing model can effectively reflect how each independent variable matters to the property price. However, the selected hedonic models cannot perfectly capture other, non-linear relationships between housing prices and input variables. Additionally, once the dataset is huge, complicated relationships might exist among different variables. Therefore, to supplement them, random forest is employed, owing to its excellent performance in dealing with both linear and non-linear relationships and its capability to dig into the complex relationships in large datasets. To estimate all of the models, we have adopted RStudio as the statistical programming environment.
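The overall study design can be outlined in code. The thesis carried out its analysis in RStudio; the sketch below is an equivalent outline in Python on synthetic data, with hypothetical variable names rather than the actual transaction dataset. The same features feed a linear model, a log-linear model, and a random forest, under both a random split and a split by transaction year, and each model is scored with the four indicators (R-squared, MSE, RMSE, MAPE):

```python
# Outline of the comparison: three models, two splitting schemes,
# four performance indicators. All data here are synthetic.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(42)
n = 2_000
df = pd.DataFrame({
    "size": rng.uniform(20, 150, n),          # hypothetical features
    "age": rng.uniform(0, 80, n),
    "year": rng.integers(2005, 2015, n),      # transaction year, 2005-2014
})
df["price"] = (40_000 * df["size"] - 3_000 * df["age"]
               + 50_000 * (df["year"] - 2005)
               + rng.normal(0, 100_000, n))

features = ["size", "age", "year"]

def evaluate(y_true, y_pred):
    """The four indicators used in the study."""
    mse = mean_squared_error(y_true, y_pred)
    return {"R2": r2_score(y_true, y_pred),
            "MSE": mse,
            "RMSE": np.sqrt(mse),
            "MAPE": np.mean(np.abs((y_true - y_pred) / y_true)) * 100}

# Scheme 1: random split. Scheme 2: split by transaction year
# (train on 2005-2012, test on 2013-2014).
splits = {
    "random": train_test_split(df, test_size=0.2, random_state=0),
    "by_year": (df[df["year"] <= 2012], df[df["year"] > 2012]),
}

for name, (train, test) in splits.items():
    X_tr, X_te = train[features], test[features]
    y_tr, y_te = train["price"], test["price"]

    lin = LinearRegression().fit(X_tr, y_tr)                  # linear HPM
    loglin = LinearRegression().fit(X_tr, np.log(y_tr))       # log-linear HPM
    rf = RandomForestRegressor(n_estimators=100,
                               random_state=0).fit(X_tr, y_tr)

    results = {
        "linear": evaluate(y_te, lin.predict(X_te)),
        "log-linear": evaluate(y_te, np.exp(loglin.predict(X_te))),
        "random forest": evaluate(y_te, rf.predict(X_te)),
    }
```

Note that the log-linear model is trained on log prices, so its predictions are exponentiated before being scored on the same scale as the other two models.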

3.1 Hedonic pricing model

Čeh et al. (2018) state that Lancaster's theory of consumer demand is the theoretical basis for the hedonic model. Lancaster (1966) extended traditional consumer demand theory by assuming that the multiple characteristics of a good, rather than the good itself, drive consumption. His theory can also predict the impact on demand when one or several attributes of a commodity change (Marcin 1993). Rosen (1974) developed Lancaster's theory into the hedonic pricing model by assuming that a good provides utility through its unique characteristics. Each characteristic has an implicit price, and the hedonic price of the good is the sum of the implicit prices of all its characteristics. In our case, for a hedonic pricing model in the real estate sector, demand may attach to particular housing characteristics such as the age and size of the house and the distance to the city center. The same variables are applied to both the hedonic models and the random forest. The hedonic pricing model has a long history in real estate valuation, which views real estate as heterogeneous goods (Lancaster 1966): each property possesses attributes that distinguish it from other buildings. Because of this heterogeneity, hedonic models are the classic regression approach for examining the correlation between housing prices and housing characteristics (Levantesi et al. 2020). The selected characteristics are structural characteristics inherent to the house itself, locational characteristics, and neighborhood characteristics. The general form of the hedonic pricing model is then:

๐‘ƒ = ๐‘“(๐‘†, ๐ฟ, N)

where P is the sale price of the house; S denotes structural characteristics such as the age, size, and number of rooms; L refers to locational characteristics such as the distance to the CBD; and N denotes neighborhood characteristics, including the parking lot and floor level. Hedonic pricing models come in different forms: parametric, non-parametric, and semi-parametric (Owusu-Ansah 2011), each with a variety of functional specifications. For example, parametric models include multiple linear regression and log-linear regression; non-parametric models include the kernel regression method; and semi-parametric models include the Robinson-Stock model. Model selection should weigh the strengths and weaknesses of each model. According to Owusu-Ansah (2011), parametric models are extensively used in estimation because they are simple to apply and their results are easy to interpret, although they require strong assumptions. Non-parametric models do not impose such strong assumptions, but their computations are more complicated. To better compare the selected methods in our study, the parametric multiple linear and log-linear regressions are chosen because they can outperform RF in explaining the results simply and clearly. The multiple linear regression takes the form:

๐‘ƒ = ๐›ผ0 + ๐›ฝ๐‘– โˆ— ๐‘‹๐‘– + ๐œ€

Where ๐›ผ and ๐›ฝ are the coefficients, P represents the price per square meter of the apartment, X is the explanatory variables, i is the number of observations, ๐œ€ is the error term. The equation for log-linear regression is:

ln(๐‘ƒ) = ๐›ผ0 + ๐›ฝ๐‘– โˆ— ๐‘‹๐‘– + ๐œ€ Where ln(๐‘ƒ) is the logarithm form of the price per square meter.

3.2 Random Forest

Random forest (RF) is a supervised machine learning algorithm in the family of tree-based methods. Like other tree-based methods, RF segments the explanatory variables into subcategories (branches) so that a simple model can be observed and applied for prediction. RF is an ensemble of multiple decision trees combined with bagging, for both regression and classification (Ho 1995; Ho 1998). A decision tree is a decision-making technique whose model has the shape of a tree, with a root, branches, and leaves; it resembles an upside-down tree, starting from the root at the top and ending with the leaves at the bottom. The first split is the root, the decisions split from the root are called branches, and the final splits are called leaves. Starting from the root, the data split into two sub-branches representing two decisions on the same independent variable, and each branch then splits further on new variables. This continued splitting is referred to as the tree-growing process (Tsay & Chen 2018). The variable chosen for each split is the one that minimizes the sum of squared errors (SSE) (Tsay & Chen 2018), in the context of the decision probability (Magerman 1995): the variable producing the smallest SSE forms the root, the second-lowest forms the second split, and so on.

One problem of a decision tree with many leaves is over-fitting. Roelofs et al. (2019) define over-fitting as unwanted branches in the tree, or parts of the model, that do not help explain the final regression. Put simply, over-fitting occurs when the model learns all the details and noise in the training dataset and thus fits the training data almost perfectly; such a tightly fitted model generalizes poorly to the test dataset, which increases inaccuracy and the deviation of predictions from actual values. To avoid this problem in a single decision tree, pruning is applied to cut off unwanted branches.

Bagging stands for bootstrap aggregation and is known as a powerful tree-based method for reducing the variance of a model (Tsay & Chen 2018). Bagging can efficiently decrease bias and variance at the same time, increase accuracy, and stabilize the decision tree model (CFI n.d.). Because single decision trees often over-fit and yield a high root mean squared error (RMSE), bagging is a powerful tool to address these problems (Tsay & Chen 2018). Instead of pruning, the bagging process produces multiple regression trees by drawing bootstrap samples (selected at random, with replacement) of approximately two-thirds of the training dataset; each bootstrap sample yields a predicted (bagged) tree, and the average over all bagged trees is the bagging result. Random forest is similar to bagging, but differs in the tree-growing process (Tsay & Chen 2018).
The purpose of this tweak in the tree-building process is to minimize the correlation between trees. As in bagging, RF draws bootstrap samples of about two-thirds of the observations in the training dataset to build each tree. However, each split of a tree branch is made on a variable chosen at random from the explanatory variables, rather than always on the strongest variable or the one with the smallest SSE. Averaging these decision trees yields an accurate and reliable model, the random forest regression model (Friedman et al. 2001).


The equation for the random forest prediction is:

ŷᵢ = (1/N) Σₙ₌₁ᴺ fₙ(xᵢ)

where the training set is X = x₁, x₂, …, xᵢ with corresponding responses Y = y₁, y₂, …, yᵢ. Bagging repeatedly (N times) picks a random group of units with replacement from the training set and trains a regression tree fₙ on each sample. After training is completed, an unseen unit xᵢ is forecasted by averaging the predictions of the N regression trees.
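As a hedged illustration of this averaging (on synthetic data, not the study's transactions), the sketch below contrasts bagging with random forest in scikit-learn: both average bootstrap-trained trees, but RF additionally restricts each split to a random subset of features (`max_features`), and its prediction is exactly the mean over its trees.

```python
# Minimal sketch: bagging vs. random forest, and the ensemble average.
import numpy as np
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(500, 4))
y = 2 * X[:, 0] + 5 * np.sin(X[:, 1]) + rng.normal(0, 0.5, 500)  # synthetic target

# Bagging: bootstrap samples, every feature available at every split
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                          random_state=1).fit(X, y)
# Random forest: bootstrap samples + random feature subset per split
forest = RandomForestRegressor(n_estimators=100, max_features="sqrt",
                               random_state=1).fit(X, y)

# The RF prediction equals the average of its individual trees: (1/N) sum f_n(x)
x_new = X[:1]
manual = np.mean([tree.predict(x_new)[0] for tree in forest.estimators_])
print(manual, forest.predict(x_new)[0])  # the two values match
```

Averaging by hand over `forest.estimators_` reproduces `forest.predict`, which makes the ensemble equation above concrete.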

3.3 Performance indicators

To measure and compare the predictive performance of the random forest and the hedonic pricing models, four performance evaluation metrics are selected: R-squared (R²), mean squared error (MSE), root mean squared error (RMSE), and mean absolute percentage error (MAPE). R² is commonly used to measure how well a model fits a dataset; a higher value is normally favored, since more of the variation in the dependent variable is then explained by the explanatory variables. The formula for R² is:

๐‘…2 = 1 โˆ’๐‘ˆ๐‘›๐‘’๐‘ฅ๐‘๐‘™๐‘Ž๐‘–๐‘›๐‘’๐‘‘ ๐‘ฃ๐‘Ž๐‘Ÿ๐‘–๐‘Ž๐‘ก๐‘–๐‘œ๐‘›

๐‘‡๐‘œ๐‘ก๐‘œ๐‘Ž๐‘™ ๐‘ฃ๐‘Ž๐‘Ÿ๐‘–๐‘Ž๐‘ก๐‘–๐‘œ๐‘›

Mean squared error (MSE) measures the prediction error by averaging all squared errors. The error means the difference between the predicted value and the actual value. A lower value of MSE indicates that the model fits better.

MSE = (1/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)²

Root mean squared error (RMSE) measures the accuracy of the prediction by computing the standard deviation of all errors. A lower RMSE implies the model has a better prediction performance.


RMSE = √( (1/n) Σᵢ₌₁ⁿ (ŷᵢ − yᵢ)² )

Mean absolute percentage error (MAPE) measures the prediction error by percentage. It is the average of the absolute ratio of the prediction error to the actual value. A lower MAPE value is better.

๐‘€๐ด๐‘ƒ๐ธ =1๐‘›

โˆ‘ |๏ฟฝฬ‚๏ฟฝ๐‘– โˆ’ ๐‘ฆ๐‘–

๐‘ฆ๐‘–|

๐‘›

๐‘–=1

where ŷᵢ is the predicted value of a property, yᵢ is the observed value of a property, and n refers to the number of observations in the dataset.
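The four indicators are straightforward to compute; the sketch below implements them with NumPy on a few hypothetical observed and predicted prices (the numbers are invented for illustration only).

```python
# Sketch of the four performance indicators used in this study.
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1 - ss_res / ss_tot

def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred - y_true) / y_true))

# Hypothetical observed vs. predicted prices (SEK/sqm)
y_true = np.array([30000.0, 45000.0, 25000.0])
y_pred = np.array([32000.0, 44000.0, 26000.0])
print(r2(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

A higher R² and lower MSE, RMSE, and MAPE all indicate a better-performing model, which is how the result tables later in the study are read.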

4. Data

4.1 Variables Selection

According to Statistics Sweden (2019), residential dwelling buildings in Sweden fall into three ownership-based categories: one- or two-dwelling buildings, multi-dwelling buildings, and special housing. 42% of the population live in one- or two-dwelling buildings, 51% in multi-dwelling buildings, and only 2% in special housing. One- or two-dwelling buildings refer to privately owned houses such as detached, semi-detached, and linked houses, and flats. Multi-dwelling buildings consist of three or more apartments in one building; within this category, rental apartments make up 58% and tenant-owned apartments (including condominiums) 42%. Rental apartments are commonly owned by private landlords and rented to tenants for long periods. A tenant-owned (cooperative) apartment is a form of tenure in which the whole apartment building is owned by a housing cooperative company, a municipal housing company, or a Swedish joint-stock company. The owner of a cooperative apartment has the right to use the property, including living in it, selling it, and renovating its interior, while the cooperative company is responsible for exterior maintenance. Residents of cooperative apartments pay monthly fees to the cooperative company, covering the maintenance cost and the interest payment on their share of the cooperative company's loan. This study focuses on data for tenant-owned apartments, the most common form of housing in Stockholm, with a 57% share of total multi-dwelling buildings (Statistics Sweden 2019).

The property attributes chosen as independent variables for the regression models in this paper follow the variables commonly applied to both machine learning methods and HPM models in previous research. The comparison study between HPM and an artificial neural network in Nigeria by Abidoye and Chan (2018) included the transaction price, location, number of bedrooms, number of bathrooms, property type, parking space, building age, number of floors, availability of a security fence, and availability of a sea view. Another study comparing HPM and random forest in South Korea by Hong and Choi included price, proximity to various buildings and entertainment places, living area, floor level, number of buildings in the apartment complex, parking space, and floor area ratio; they also considered macro variables such as GDP, the GDP growth rate, and the mortgage interest rate. Similar to these studies, we cover three categories of variables: structural, locational, and neighborhood. The structural variables include the living area of the apartment, floor level, number of rooms, age of the property, monthly building maintenance fee, balcony, and elevator; balcony and elevator are dummy variables equal to 1 if the apartment has a balcony/elevator. Among the locational variables, distance to the city center is an important driver of housing prices; the specific city-center location we use is Stockholm City Hall.
Since the dataset includes longitudes and latitudes, coordinates can be formed for all properties. The Bing Maps API, used together with Excel, measures the distance between the coordinates of Stockholm City Hall and each property. As the neighborhood variable, we use the total number of stories of the building.
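The thesis computed distances via the Bing Maps API in Excel; as an alternative sketch, the great-circle (haversine) distance to Stockholm City Hall can be computed directly from the coordinates. The City Hall coordinates and the sample property below are approximate/hypothetical.

```python
# Haversine sketch: property-to-city-hall distance from raw coordinates.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

CITY_HALL = (59.3274, 18.0543)  # approximate Stockholm City Hall coordinates
dist = haversine_km(59.3600, 18.0000, *CITY_HALL)  # hypothetical property
print(round(dist, 2))  # distance in km
```

Unlike a routing API, the haversine formula gives straight-line rather than road distance, which is usually sufficient for a hedonic distance-to-CBD variable.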

4.2. Data source

All data in this study were provided by Svensk Mäklarstatistik, Sweden's leading independent provider of housing price statistics.


4.3. Data Description

The data for this study cover 114,293 arm's-length transactions of tenant-owned apartments in Stockholm, Sweden, between January 2005 and December 2014. Several data cleaning steps produced this final dataset. First, we removed observations with missing information. Second, we removed outliers in the price variable, living area, number of rooms, distance to the Stockholm city center, and monthly fee. The dependent variable, price, is the apartment price per square meter. Figure 1 illustrates the prices of the apartment transactions; the distribution is right-skewed. To improve the standard deviation (STD) and the sum of squared errors (SSE), we excluded properties priced above 90,000 SEK per square meter, which are considered to belong to speculative market segments.

Figure 1. Histogram of Stockholm Apartment Price (transaction between 2005-2014)
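The cleaning steps described above can be sketched in pandas as follows; the column names and the five toy rows are hypothetical, not the Svensk Mäklarstatistik data.

```python
# Minimal pandas sketch (assumed column names) of the cleaning steps:
# drop rows with missing information, then drop the speculative price tail.
import pandas as pd

df = pd.DataFrame({
    "price_sqm": [35000, 120000, 28000, None, 91000],
    "living_area": [60, 55, None, 70, 80],
})

df = df.dropna()                   # 1) remove observations with missing info
df = df[df["price_sqm"] <= 90000]  # 2) exclude prices above 90,000 SEK/sqm
print(len(df))  # -> 1
```

In the actual study the same logic would also be applied to the living area, number of rooms, distance, and monthly fee outliers.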


Table 1 lists all the variables with their descriptions and descriptive statistics. The average apartment price is 33,722 SEK per square meter, with a standard deviation of 17,604.65 SEK per square meter; this high standard deviation indicates a wide spread of housing values around the average price. On average, the apartments have 2.4 rooms and a living area of 64.65 square meters. The cooperative apartments in Stockholm carry an average monthly fee of approximately 3,349 SEK. Around 55.2% and 8.7% of the properties have elevators and balconies, respectively. The mean building age, measured from construction to the contract year, is about 51.25 years. The buildings have around 4.3 stories in total, and the average apartment is situated on about the third floor. Lastly, based on the coordinate calculations from the property locations to the Stockholm city center (Stockholm City Hall), the apartments lie on average about 13.27 kilometers from the city center.

Table 1. Descriptive statistics of regression variables

| Variable | Description                                  | Mean       | Std. Dev. | Min | Max    |
|----------|----------------------------------------------|------------|-----------|-----|--------|
| PRICE    | Price per square meter (SEK/SQM)             | 33,722.000 | 17,604.65 | 410 | 90,000 |
| FL       | Floor level                                  | 2.605      | 1.93      | -3  | 23     |
| STOR     | Number of total stories within the building  | 4.322      | 2.50      | 0   | 33     |
| LA       | Living area of the condominium (SQM)         | 64.650     | 24.59     | 10  | 200    |
| NOR      | Number of rooms                              | 2.411      | 1.00      | 1   | 6      |
| ELEV     | Dummy: elevator                              | 0.552      | 0.49      | 0   | 1      |
| BALC     | Dummy: balcony                               | 0.087      | 0.28      | 0   | 1      |
| AGE      | Age of the residential property              | 51.250     | 31.90     | 0   | 676    |
| FEE      | Monthly fee (SEK)                            | 3,349.000  | 1,348.27  | 211 | 12,214 |
| DIST     | Distance to the city center (kilometers)     | 13.270     | 12.39     | 0   | 79     |

Total number of observations: 114,293


4.4. Data setting

The data for this study are divided into two sets, one for training the regression models and one for testing them. We refer to the data used for training as the training dataset and the data used for testing as the testing dataset; both are applied to the hedonic pricing models and the random forest. Furthermore, this study adopts two different techniques for splitting the data into training and testing sets: random splitting and splitting by year.

The random split shuffles the data and randomly assigns 70% of the observations to the training dataset and 30% to the testing dataset. As in the previous studies by Bergadano et al. (2019) and Trawiński et al. (2017), this setup aims to eliminate problems caused by order and position dependencies.

The other technique splits the dataset into training and testing sets by year. As this study does not include macroeconomic variables, which change from year to year, splitting by year can help capture such year-dependent effects, as argued by Hong, Choi & Kim (2020). Additionally, Renigier-Biłozor and Wiśniewski (2012) classify variables such as the unemployment rate, population growth, household consumption, and net national income as significant considerations for residential price indices. Instead of dividing the ten years of data from 2005 to 2014 into ten separate sets, we form two: a training dataset covering 2005 to 2011 and a testing dataset covering 2012 to 2014.
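The two splitting schemes can be sketched as follows; the column names and the simulated data are hypothetical, and only the split logic mirrors the study's setup (random 70/30 vs. 2005-2011 training and 2012-2014 testing).

```python
# Sketch of the two data-splitting techniques used in this study.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "year": rng.integers(2005, 2015, 1000),           # transaction year 2005-2014
    "price_sqm": rng.uniform(10000, 90000, 1000),
})

# 1) Random split: shuffle, then take 70% training / 30% testing
shuffled = df.sample(frac=1.0, random_state=42)
cut = int(0.7 * len(shuffled))
train_rand, test_rand = shuffled.iloc[:cut], shuffled.iloc[cut:]

# 2) Temporal split by transaction year
train_year = df[df["year"] <= 2011]
test_year = df[df["year"] >= 2012]

print(len(train_rand), len(test_rand), len(train_year), len(test_year))
```

Note that the temporal split does not produce a fixed 70/30 ratio; its proportions depend on how many transactions fall in each period.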

The motivation for adopting these two data-splitting techniques is therefore twofold: to see how the performance of the different models differs, and to illustrate which way of splitting the data is more suitable.


5. Result

5.1 Results from Random Data Splitting

5.1.1 Multiple linear regression

The multiple linear regression is the simplest and most popular of the hedonic models, and its overall results are straightforward to interpret. The exact specification with the chosen variables is:

๐‘ƒ๐‘Ÿ๐‘–๐‘๐‘’ = ๐›ผ0 + ๐›ฝ1 โˆ— ๐น๐ฟ + ๐›ฝ2 โˆ— ๐‘†๐‘‡๐‘‚๐‘… + ๐›ฝ3 โˆ— ๐ฟ๐ด + ๐›ฝ4 โˆ— ๐‘๐‘‚๐‘… + ๐›ฝ5 โˆ— ๐ธ๐ฟ๐ธ๐‘‰ + ๐›ฝ6

โˆ— ๐ต๐ด๐ฟ๐ถ + ๐›ฝ7 โˆ— ๐ด๐บ๐ธ + ๐›ฝ8 โˆ— ๐น๐ธ๐ธ + ๐›ฝ9 โˆ— ๐ท๐ผ๐‘†๐‘‡ + ๐œ€ Table 2 is the result of the linear regression model by using the training dataset. To evaluate the performance of the regression model, we test the model by applying the testing dataset and compare its performance. Table 3 present the performance of the multiple linear regression. Table 2. Training results summary of the multiple linear regression

| Variable    | Coefficient | Std. Error | t value | p-value      |
|-------------|-------------|------------|---------|--------------|
| (Intercept) | 34,550.000  | 201.400    | 171.56  | < 2e-16 ***  |
| FL          | 781.800     | 29.180     | 26.79   | < 2e-16 ***  |
| STOR        | -489.100    | 25.160     | -19.44  | < 2e-16 ***  |
| LA          | 13.010      | 4.313      | 3.02    | 0.00257 **   |
| NOR         | 820.300     | 97.040     | 8.45    | < 2e-16 ***  |
| ELEV        | 8,283.000   | 104.500    | 79.23   | < 2e-16 ***  |
| BALC        | -1,085.000  | 150.300    | -7.22   | 5.26e-13 *** |
| AGE         | 186.900     | 1.589      | 117.59  | < 2e-16 ***  |
| FEE         | -2.956      | 0.055      | -53.63  | < 2e-16 ***  |
| DIST        | -570.700    | 3.740      | -152.59 | < 2e-16 ***  |

Number of observations: 82,224
R-squared: 0.5431


Table 3. Summary of the multiple linear regression's performance indicators

|          | R-squared | MSE         | RMSE      | MAPE   |
|----------|-----------|-------------|-----------|--------|
| Training | 0.5431    | 144,318,140 | 12,013.25 | 0.2700 |
| Testing  | 0.5386    | 135,926,042 | 11,658.73 | 0.3964 |

The results in Table 3 show that the R-squared of the model estimated on the training data is around 0.5431, meaning the model explains approximately 54.31% of the variation in price. Normally, the higher the R-squared, the better the regression model fits the dataset. Applying the model to the testing dataset, the R-squared decreases slightly from 0.5431 to 0.5386.

However, R-squared is not the only important indicator for a multiple linear regression. The mean squared error captures the deviation of the predicted price from the actual price; the smaller the MSE, the better the model. From Table 3, the MSE falls from 144,318,140 on the training dataset to 135,926,042 on the testing dataset, so the model predicts apartment prices even slightly better out of sample. Similarly for RMSE, the closer the value is to zero, the smaller the error between predicted and actual prices; in Table 3, the RMSE on the testing dataset is similar to that on the training dataset, indicating that the regression model also transfers well to other datasets.

MAPE is another criterion for validating model performance: it indicates the model's prediction error in percentage terms, here the percentage by which predicted apartment prices deviate from actual prices. Table 3 shows that the model produces a 27% error in predicting apartment prices on the training data; the smaller the MAPE, the smaller the deviation from actual values and the better the model. The MAPE obtained by applying the multiple linear model to the testing dataset is 39.64%, so the model is less accurate on the testing data than on the training data.


Regarding the importance of the variables, Table 2 shows that all explanatory variables in the model are statistically significant in explaining the apartment price, as indicated by the asterisks; more asterisks mean a higher confidence level for rejecting the hypothesis that the variable has no influence on the dependent variable. The floor level is positively significant: the higher the apartment's floor, the higher the price per square meter. With a coefficient of 781.8, each additional floor level increases the price by 781.8 SEK per square meter. The living area affects the price positively, as expected: a larger living space leads to a higher price. The number of rooms and the age of the building also raise the price; one additional room or one additional year of age increases the apartment price by around 820.3 SEK/SQM and 186.9 SEK/SQM, respectively. The number of building stories, the monthly fee, and the distance variable are negatively significant: more stories in the building, a higher monthly fee, or a location farther from the city center all lead to a lower apartment price. For the dummy variables, an elevator in the building increases the value of an apartment, being associated with an 8,283 SEK/SQM higher price on average. As for the balcony, surprisingly, it has a negative effect on the price, indicating that a balcony is not a preference in this sample.

5.1.2 Log-linear regression

After testing the multiple linear model, the R-squared does not reach a high level. Since there is no prior assumption about which model best fits a dataset, other hedonic models should be tried to see how the results change. The log-linear form is the second hedonic model, applying the same variables to the same training and testing datasets. The only additional step in the log-linear model is transforming the apartment price into natural logarithmic form. The equation for the log-linear model with the explanatory variables is:


๐‘™๐‘›(๐‘ƒ๐‘Ÿ๐‘–๐‘๐‘’) = ๐›ผ0 + ๐›ฝ1 โˆ— ๐น๐ฟ + ๐›ฝ2 โˆ— ๐‘†๐‘‡๐‘‚๐‘… + ๐›ฝ3 โˆ— ๐ฟ๐ด + ๐›ฝ4 โˆ— ๐‘๐‘‚๐‘… + ๐›ฝ5 โˆ— ๐ธ๐ฟ๐ธ๐‘‰+ ๐›ฝ6 โˆ— ๐ต๐ด๐ฟ๐ถ + ๐›ฝ7 โˆ— ๐ด๐บ๐ธ + ๐›ฝ8 โˆ— ๐น๐ธ๐ธ + ๐›ฝ9 โˆ— ๐ท๐ผ๐‘†๐‘‡ + ๐œ€

The same steps used for the multiple linear regression are also applied to the log-linear regression. Table 4 reports the log-linear regression model estimated on the training dataset, and Table 5 presents its performance indicators.

Table 4. Training results summary from the log-linear regression

| Variable    | Coefficient | Std. Error | t value   | p-value      |
|-------------|-------------|------------|-----------|--------------|
| (Intercept) | 10.5700     | 0.006755   | 1,564.978 | < 2e-16 ***  |
| FL          | 0.0223      | 0.000979   | 22.747    | < 2e-16 ***  |
| STOR        | -0.0131     | 0.000844   | -15.560   | < 2e-16 ***  |
| LA          | 0.0010      | 0.000145   | 6.986     | 2.86e-12 *** |
| NOR         | 0.0341      | 0.003255   | 10.474    | < 2e-16 ***  |
| ELEV        | 0.1948      | 0.003507   | 55.548    | < 2e-16 ***  |
| BALC        | -0.0173     | 0.005040   | -3.441    | 0.000579 *** |
| AGE         | 0.0043      | 0.000053   | 80.006    | < 2e-16 ***  |
| FEE         | -0.0001     | 0.000002   | -74.083   | < 2e-16 ***  |
| DIST        | -0.0237     | 0.000126   | -189.248  | < 2e-16 ***  |

Number of observations: 82,224
R-squared: 0.5595

Table 5. Summary of the log-linear regression's performance indicators

|          | R-squared | MSE         | RMSE      | MAPE   |
|----------|-----------|-------------|-----------|--------|
| Training | 0.5595    | 148,817,623 | 12,199.08 | 0.3452 |
| Testing  | 0.5588    | 143,514,135 | 11,979.74 | 0.3113 |

Table 5 shows that the R-squared of the log-linear regression reaches 0.5595 on the training dataset. Applied to the testing dataset, the R-squared changes only marginally, from 0.5595 to 0.5588. Overall, the explanatory power is strengthened by almost 2 percentage points compared with the linear model on the same testing dataset.

The MSE and RMSE on the testing dataset are both lower than on the training dataset. The reduction in MSE from training to testing indicates better prediction performance on the testing set, and the lower testing RMSE likewise means less error between the predicted and actual prices in the testing set than in the training set.

For MAPE, the percentage error is about 34.52% on the training set and 31.13% on the testing set, so the model generates more accurate results on the testing dataset than on the training dataset.

According to Table 4, all independent variables are statistically significant for the logarithm of the price per square meter, the same conclusion as in the multiple linear form; the selection of variables therefore appears reasonable, since their statistically significant effects on the price are confirmed.

The coefficients for floor level, living area, number of rooms, elevator, and building age keep the positive signs they had in the linear regression. The price per square meter grows by about 2.2% for each additional floor level. One additional square meter of living area raises the price by about 0.1%, while one more room raises it by about 3.4%. For the elevator dummy, a building with an elevator adds almost 19.5% to an apartment's value. One additional year of building age raises the price by about 0.43%.

The building stories, balcony, monthly fee, and distance coefficients remain negative, as in the linear regression. One additional story reduces the price by about 1.3%, and an apartment with a balcony is about 1.7% cheaper in general. Each additional SEK of monthly fee decreases the price by about 0.01%, and each kilometer of distance from the city center lowers the price by about 2.37%.

5.1.3 Random Forest

Random forest is a decision-tree-based method whose strengths are reducing the correlation between trees and improving prediction accuracy. As a supervised machine learning algorithm run in R, random forest requires a specific package, randomForest, which this study uses to learn the regression model from the training dataset. The package defaults to 500 trees; the number of trees can be adjusted up or down to improve prediction accuracy, and in this analysis it was increased to 1,000. However, increasing the number of trees to improve accuracy and R-squared can also cause over-fitting: the regression learns the training dataset too strictly and fits it too closely, producing a high R-squared there but performing less accurately on other datasets, in this case the testing dataset. We therefore settled on 1,000 trees to avoid large differences between the R-squared values produced on the training and testing data.

According to Table 6, the R-squared of the RF regression is 82.88% on the training dataset and 83.16% on the testing dataset, showing that RF performs remarkably well: the nine independent variables explain up to 82.88% of the price variation in the training data. Moreover, the R-squared on the testing dataset is even higher than on the training dataset, so there is no over-fitting problem caused by the chosen number of trees. Similarly, the MSE and RMSE differ very little between the training and testing data, and predictions on the testing dataset deviate even less from actual prices than on the training dataset. The MAPEs of the RF regression model are low, under 20% on both the training and testing datasets.

Table 6. Summary of the Random Forest performance indicators

|          | R-squared | MSE        | RMSE     | MAPE   |
|----------|-----------|------------|----------|--------|
| Training | 0.8288    | 54,069,896 | 7,353.22 | 0.1915 |
| Testing  | 0.8316    | 49,617,536 | 7,096.54 | 0.1771 |

Another useful feature of random forest is its measure of how important each variable is in explaining the dependent variable, apartment price. The importance level is specified by the percentage increase in mean squared error (%IncMSE), i.e., the percentage by which prediction error would increase if a variable were omitted. For example, if the %IncMSE of the elevator (ELEV) were 120%, omitting the elevator variable would increase the regression's error by up to 120%. Hence, the higher the %IncMSE, the more important the variable is to the regression model. The scatterplot in Figure 2 ranks the variables from most to least important as DIST, AGE, FEE, STOR, LA, NOR, FL, ELEV, and BALC.

Figure 2. The scatterplot of the variables important level based on %IncMSE
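The %IncMSE measure in R's randomForest package is a permutation-based importance; a comparable hedged sketch in Python (synthetic data, hypothetical features, not the study's variables) uses scikit-learn's `permutation_importance`.

```python
# Sketch of permutation-based variable importance, analogous to %IncMSE.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(800, 3))
# Feature 0 dominates the target; feature 2 is pure noise
y = 10 * X[:, 0] + 1 * X[:, 1] + rng.normal(0, 0.1, 800)

rf = RandomForestRegressor(n_estimators=200, random_state=3).fit(X, y)
imp = permutation_importance(rf, X, y, n_repeats=10, random_state=3)

# Rank features by mean importance, most important first
ranking = np.argsort(imp.importances_mean)[::-1]
print(ranking)  # feature 0 should rank first
```

Permuting an important variable's values destroys its information and degrades the model's score; the size of that degradation is the importance, mirroring the "omit a variable, error rises" intuition behind %IncMSE.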

5.2 Results from data split by years

5.2.1 Multiple linear regression

Summarizing the multiple linear regression results obtained with the new training dataset, Table 7 shows that the coefficients of the independent variables have the same signs as in the earlier results from the randomly split data, and all independent variables remain statistically significant for the dependent variable. The R-squared of the model trained on this training data reaches 57.79%, a stronger explanatory power than the R-squared of the multiple linear model under the random-splitting method.


Table 8 summarizes the performance of the multiple linear regression. Comparing the results obtained on the training dataset with those on the testing dataset, the testing results show a much lower R-squared and higher MSE, RMSE, and MAPE. Hence, the model's performance drops significantly from training to testing. Table 7. Training results summary of the multiple linear regression

Variables      Coefficients   Std. Error   t value    Pr(>|t|)
(Intercept)    31,790.0       182.10       174.56     < 2e-16 ***
FL             775.7          26.30        29.49      < 2e-16 ***
STOR           -555.0         22.69        -24.46     < 2e-16 ***
LA             46.2           3.89         11.87      0.000000111 **
NOR            464.6          87.53        5.31       < 2e-16 ***
ELEV           7,748.0        93.43        82.93      < 2e-12 ***
BALC           -971.9         139.90       -6.95      3.7e-12 ***
AGE            178.9          1.46         122.62     < 2e-16 ***
FEE            -3.1           499.10       -62.59     < 2e-16 ***
DIST           -550.7         3.43         -160.74    < 2e-16 ***

Number of observations: 81,114
R-squared: 0.5779

Table 8. Summary of the multiple linear regression's performance indicators

Dataset     R-square    MSE            RMSE        MAPE
Training    0.5779      115,161,247    10,731.32   0.1579
Testing     0.4008      232,250,203    15,239.76   0.3101

5.2.2 Log-linear regression

Table 9 presents the results of the log-linear model on the training dataset. The coefficients keep the same negative/positive signs as the log-linear model estimated on the randomly split dataset, and all independent variables remain statistically significant in explaining the dependent variable. The R-squared is almost 60% for the training dataset, a minor improvement over the multiple linear regression. As for the performance indicators, Table 10 shows the same problem as Table 8: the testing dataset fails to maintain the level of performance achieved on the training dataset. With a much lower R-squared and higher MSE, RMSE, and MAPE, the model performs significantly worse on the testing dataset. Table 9. Training results summary of the log-linear regression

Variables      Coefficients   Std. Error   t value      Pr(>|t|)
(Intercept)    10.5000        0.006210     1,586.422    < 2e-16 ***
FL             0.0236         0.000956     24.707       < 2e-16 ***
STOR           -0.0156        0.000825     -18.870      < 2e-16 ***
LA             0.0023         0.000142     15.989       < 2e-16 ***
NOR            0.0284         0.003182     8.916        < 2e-16 ***
ELEV           0.1945         0.003397     57.254       < 2e-16 ***
BALC           -0.0231        0.005085     -4.533       0.00000582 ***
AGE            0.0043         0.000053     81.867       < 2e-16 ***
FEE            -0.0002        0.000002     -87.769      < 2e-16 ***
DIST           -0.0249        0.000125     -199.753     < 2e-16 ***

Number of observations: 81,114
R-squared: 0.5953

Table 10. Summary of the log-linear regression's performance indicators

Dataset     R-square    MSE            RMSE        MAPE
Training    0.5953      118,194,436    10,871.73   0.3238
Testing     0.2393      294,845,012    17,171.05   0.2989
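Because the log-linear model predicts log prices, its fitted values must be back-transformed with the exponential before indicators such as MAPE can be compared with the linear model on the price scale. A minimal Python sketch on synthetic data (the coefficients and the single predictor LA are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic hedonic data: price depends log-linearly on living area (LA)
la = rng.uniform(20, 120, size=200)
log_price = 10.5 + 0.002 * la + rng.normal(scale=0.05, size=200)
price = np.exp(log_price)

# Fit log(price) = b0 + b1 * LA by ordinary least squares
A = np.column_stack([np.ones_like(la), la])
coef, *_ = np.linalg.lstsq(A, np.log(price), rcond=None)

# Back-transform fitted values to the price level before computing MAPE
pred_price = np.exp(A @ coef)
mape = np.mean(np.abs((price - pred_price) / price))
```

Skipping the back-transformation would compute the error on log prices, which is not comparable with the MAPE of a model fitted on price levels.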


5.2.3 Random Forest

The summary of the Random Forest results in Table 11 shows that the R-squared of the regression model on the training dataset is considerably high, at 85.88%. However, once the model is tested on the testing data, the R-squared decreases significantly to 66.97%. Additionally, the other performance indicators, MSE, RMSE, and MAPE, are all higher in the testing results than in the training results. These indicators show that Random Forest loses accuracy in predicting apartment values when applied to a dataset from different years. Table 11. Summary of the Random Forest performance indicators

Dataset     R-square    MSE            RMSE       MAPE
Training    0.8588      38,536,807     6,207.80   0.1704
Testing     0.6697      128,042,867    7,096.54   0.2103

However, the ranking of variable importance based on %IncMSE remains unchanged: DIST, AGE, FEE, STOR, LA, NOR, FL, ELEV, and BALC, from the most to the least important variable.

6. Discussion

6.1 Models comparison: randomly splitting the data

As the purpose of this study is to investigate the performance of the HPM and Random Forest under different data-splitting techniques, it is useful to compare the models' performance results directly. We first look at the results of the random-split technique. The performance indicators of the hedonic models and RF for the Stockholm housing market are shown in Table 12. Starting with the training dataset, there is a considerable gap between Random Forest and the two hedonic pricing models: Random Forest obtains a significantly higher R-squared, more than 20 percentage points above that of the multiple linear regression and the log-linear regression. Similarly, the MAPE, MSE, and RMSE of Random Forest are significantly smaller than those of the other two models, indicating that Random Forest predicts with a much smaller deviation from the actual prices. Moving to the testing set, Random Forest remains the best performer, producing the highest R-squared and the lowest MSE, RMSE, and MAPE among the three models. Moreover, all four indicators for Random Forest improve on the testing dataset, which suggests that its built-in protection against over-fitting works effectively. In contrast, the testing R-squared of both hedonic models decreases slightly. For the log-linear regression, the other three indicators improve on the testing dataset despite the lower R-squared; for the linear regression, RMSE and MSE decrease slightly compared with the training dataset, but MAPE increases from 27% to 39.64%, meaning the model is less accurate on the testing dataset than on the training data. In general, Random Forest performs best on both the training and testing datasets. Averaged over the training and prediction results, RF produces an R-squared approximately 29 percentage points higher than the linear model and 29.40 points higher than the log-linear model. Additionally, the RF regression's prediction error is on average 14.89 percentage points lower than the linear model's and 14 points lower than the log-linear model's. Hence, this method can generate reliable predictions of housing prices with less deviation between predicted and real prices. Moreover, it effectively mitigates the over-fitting problem during training, so the trained regression can be trusted on other datasets as well.


Table 12. Comparison of Hedonic models and Random Forest

                                            Train                                        Test
Valuation Methods                           MSE          RMSE       MAPE    R2 (%)       MSE          RMSE       MAPE    R2 (%)
Random Forest                               54,069,896   7,353.22   0.1915  82.88        49,617,536   7,096.53   0.1771  83.16
Parametric Hedonic Models:
  Linear Regression Model                   144,318,140  12,013.25  0.2700  54.31        135,926,042  11,658.73  0.3964  53.86
  Log-linear Regression Model               148,817,623  12,199.08  0.3452  55.95        143,514,135  11,979.74  0.3113  55.88
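The four indicators compared in Tables 12 and 13 are all functions of the prediction residuals. A compact Python definition, with toy prices standing in for the study's data (the function name and inputs are illustrative):

```python
import numpy as np

def performance_indicators(actual, predicted):
    """Compute the four indicators used in this study: R2, MSE, RMSE, MAPE."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = actual - predicted
    mse = np.mean(residuals ** 2)                  # mean squared error
    rmse = np.sqrt(mse)                            # root mean squared error
    mape = np.mean(np.abs(residuals / actual))     # mean absolute percentage error
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot                       # coefficient of determination
    return {"R2": r2, "MSE": mse, "RMSE": rmse, "MAPE": mape}

# Toy example with made-up apartment prices
metrics = performance_indicators([50000, 62000, 45000], [52000, 60000, 46000])
```

MSE and RMSE are scale-dependent (here in squared and plain price units), while MAPE and R-squared are scale-free, which is why the tables report all four.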

6.2 Models comparison: splitting the data by years

Based on the results of the regression models from the year-based split presented in Table 13, RF still outperforms the other two regression models, the linear and log-linear regressions. Averaged over the trained and tested regressions, RF's R-squared is approximately 27 percentage points higher than the linear model's and 16 points higher than the log-linear model's. Additionally, RF has a noticeably small MSE and RMSE. Regarding the MAPE of the trained and tested regressions, RF produces on average 4.3 percentage points less error than the linear model and 12 points less than the log-linear regression. In the training results, however, the linear model has the lowest MAPE at 0.1579, while RF and log-linear reach 0.1704 and 0.3238 respectively; that is, the linear regression predicts apartment values with a 15.79% error, compared to 17.04% for RF and 32.38% for log-linear. Since the gap between the linear model and RF is small (approximately 1.5 percentage points), RF can still be considered a good model for the training dataset. Turning to the results on the testing dataset, RF again outperforms the other two regressions, with the highest R-squared and the smallest MSE, RMSE, and MAPE. Even though the R-squared of the RF regression decreases from 85.88% on the training dataset to 66.97% on the testing dataset, it remains larger than the R-squared of the hedonic models. The decrease in R-squared of all the regression models from training to testing could be explained by differences in macroeconomic performance between the period selected for training (2005 to 2011) and testing (2012 to 2014). As discussed by Hong, Choi & Kim (2020) and Renigier-Biłozor and Wiśniewski (2012), segregating the data by year implicitly embeds macroeconomic conditions in the regression models; hence, the performance of the regression models will not differ significantly between training and testing only if the two datasets cover years with similar economic performance. Table 13. Comparison of Hedonic models and Random Forest

                                            Train                                        Test
Valuation Methods                           MSE          RMSE       MAPE    R2 (%)       MSE          RMSE       MAPE    R2 (%)
Random Forest                               38,536,807   6,207.80   0.1704  85.88        128,042,867  7,096.54   0.2103  66.97
Parametric Hedonic Models:
  Linear Regression Model                   115,161,247  10,731.32  0.1579  57.79        232,250,203  15,239.76  0.3101  40.08
  Log-linear Regression Model               118,194,436  10,871.73  0.3238  59.53        294,845,012  17,171.05  0.2989  23.93

6.3 General discussion

Tables 12 and 13 share a common finding: regardless of how the dataset is split into training and testing sets, Random Forest remains the strongest of the three regression models. In Table 12, however, Random Forest performs better on the testing dataset than on the training dataset, with slight improvements in all four indicators, while the R-squared of the two hedonic models decreases slightly. In Table 13, all indicators of all three models drop significantly from training to testing. The reason behind this might be changes in macroeconomic performance, such as GDP growth, inflation rates, interest rates, and the unemployment rate, across the years covered by the training and testing datasets.


7. Conclusion

As the development of artificial intelligence through machine learning spreads across industries, there are hopes of minimizing the errors and biases of property valuations made by appraisers. Academics have discussed and compared the performance of many machine-learning algorithms to identify the most suitable ones for property valuation, and recent studies have singled out Random Forest as an algorithm that performs noticeably well. However, comparison studies between Random Forest and established property-valuation methods such as the hedonic pricing model are scarce for Swedish residential property. This study compares the performance of Random Forest with parametric hedonic pricing models, both multiple linear and log-linear, for valuing Stockholm tenant-owned apartments. To investigate the performance thoroughly, we test the regression models under two ways of splitting the data into training and testing sets: random splitting and year-based splitting. The results identify the Random Forest model as the best performer under both splitting methods. Under the first method, which randomly selects 70% of the data for training and 30% for testing, RF produces an R-squared 29 and 29.40 percentage points higher than the linear and log-linear models respectively; additionally, RF's average prediction error is 14.89 percentage points below the linear model's and 14 points below the log-linear model's. Even when the data-splitting technique is changed to gauge how sensitive the three models are to omitted macroeconomic variables, Random Forest does not suffer much. With the split based on transaction year, the RF regression's average R-squared exceeds that of the linear and log-linear models by about 27 and 16 percentage points respectively, and under the same conditions RF produces a prediction error about 4.3 percentage points lower than the linear model's and 12 points lower than the log-linear model's.
In sum, Random Forest is the most suitable of the three regression models, able to detect the non-linear and complex relationships between variables. As a result, it produces a higher R-squared and a lower variance and error between the predicted apartment prices and the actual prices.


8. Limitation

There is room to improve the hedonic pricing model, especially in the selection of additional explanatory variables. These explanatory variables should to some extent be locally grounded; for example, year dummies could be defined around the occurrence of notable historical events. Also, since some variables are highly correlated with each other, constructing interaction variables is an important step when applying the hedonic model to improve estimation performance. This study used two techniques for splitting the data, the second of which splits them by years. Since splitting by years reduces performance from the training to the testing dataset in all models, beyond the economic reasons behind this result there is still room for improvement by grouping the years differently. Future research could focus on how to arrange the data into year groups, for example by expanding or shrinking the total period of years, or by separating the training and testing datasets in other ways that make the grouping more meaningful.


References

Abidoye Rotimi, B., Junge, M., Lam Terence, Y.M., Oyedokun Tunbosun, B. & Tipping Malvern, L. 2019. Property valuation methods in practice: evidence from Australia. Property Management, 37(5), pp. 701-718, DOI 10.1108/PM-04-2019-0018, <https://doi.org/10.1108/PM-04-2019-0018>.

Andersson, F. & Landberg, R. 2005. Real Estate Appraisal: A Study of Real Estate Appraisers in Sweden.

Annamoradnejad, R., Annamoradnejad, I., Safarrad, T. & Habibi, J. 2019, April. Using Web Mining in the Analysis of Housing Prices: A Case study of Tehran. In 2019 5th International Conference on Web Research (ICWR), pp. 55-60. IEEE.

Antipov, E.A. & Pokryshevskaya, E.B. 2012. Mass appraisal of residential apartments: An application of Random forest for valuation and a CART-based approach for model diagnostics. Expert Systems with Applications, 39(2), pp. 1772-1778.

Bao, H. & Wan, A. 2007. Improved estimators of hedonic housing price models. Journal of Real Estate Research, 29(3), pp. 267-302.

Bergadano, F., Bertilone, R., Paolotti, D. & Ruffo, G. 2019. Learning real estate automated valuation models from heterogeneous data sources. arXiv preprint arXiv:1909.00704.

Breiman, L. 2001. Random forests. Machine Learning, 45(1), pp. 5-32.

Čeh, M., Kilibarda, M., Lisec, A. & Bajat, B. 2018. Estimating the performance of random forest versus multiple regression for predicting prices of the apartments. ISPRS International Journal of Geo-Information, 7(5), p. 168.

Dellstad, M. 2018. Comparing Three Machine Learning Algorithms in the Task of Appraising Commercial Real Estate.

Diaz, J. & Hansz, J.A. 1997. How valuers use the value opinions of others. Journal of Property Valuation and Investment, 15(3), pp. 256-260.


Diaz III, J. & Wolverton, M.L. 1998. A longitudinal examination of the appraisal smoothing hypothesis. Real Estate Economics, 26(2), pp.349-358.

Eckert, J.K., Gloudemans, R.J. & Almy, R.R. (eds.) 1990. Property appraisal and assessment administration. International Association of Assessing Officers.

Emanuelsson, R. 2015. Supply of housing in Sweden. Sveriges Riksbank Economic Review, 2(2), <http://archive.riksbank.se/Documents/Rapporter/POV/2015/2015_2/rap_pov_artikel_3_150917_eng.pdf>.

European Commission. 2018. European Construction Sector Observatory, Country profile Sweden. (2021)1321497, European Commission, <https://ec.europa.eu/docsroom/documents/23752/attachments/1/translations/en/renditions/...>.

European Commission 2020, 'European Construction Sector Observatory, Country profile Sweden', (2021)1321497, European Commission, <https://ec.europa.eu/docsroom/documents/40293/attachments/1/translations/en/renditions/pdf>.

Fan, G.Z., Ong, S.E. & Koh, H.C 2006. Determinants of house price: A decision tree approach. Urban Studies, 43(12), pp.2301-2315.

Follain, J.R. & Malpezzi, S. 1980. Dissecting housing value and rent: Estimates of hedonic indexes for thirty-nine large SMSAs. Urban Institute Press, 249

Friedman, J., Hastie, T. & Tibshirani, R. 2001. The elements of statistical learning. 1(10). New York: Springer series in statistics.

Gallimore, P. 1996. Confirmation bias in the valuation process: a test for corroborating evidence. Journal of Property Research, 13(4), pp.261-273.

Geltner, D. & Mei, J. 1995. The present value model with time-varying discount rates: Implications for commercial property valuation and investment decisions. The Journal of Real Estate Finance and Economics, 11(2), pp.119-135.


Ho, T.K. 1995, August. Random decision forests. Proceedings of 3rd International Conference on Document Analysis and Recognition, 1, pp. 278-282. IEEE.

Ho, T.K. 1998. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), pp. 832-844.

Ho, W.K., Tang, B.S. & Wong, S.W. 2020. Predicting property prices with machine learning algorithms. Journal of Property Research, pp. 1-23.

Hong, J., Choi, H. & Kim, W.S. 2020. A house price valuation based on the random forest approach: the mass appraisal of residential property in South Korea. International Journal of Strategic Property Management, 24(3), pp. 140-152.

International Monetary Fund. 2019. Sweden: 2019 Article IV Consultation-Press Release, Staff Report and Statement by the Executive Director for Sweden. International Monetary Fund, Washington D.C.

International Valuation Standards Council. 2019. International Valuation Standards: General Standards - IVS 105 Valuation Approaches and Methods. London, UK.

Jamil, S., Mohd, T., Masrom, S. & Ab Rahim, N. 2020, July. Machine Learning Price Prediction on Green Building Prices. In 2020 IEEE Symposium on Industrial Electronics & Applications (ISIEA), pp. 1-6. IEEE.

Jordan, M.I. & Mitchell, T.M. 2015. Machine learning: Trends, perspectives, and prospects. Science, 349(6245), pp. 255-260.

Lancaster, K.J. 1966. A new approach to consumer theory. Journal of Political Economy, 74(2), pp. 132-157.

Lentz, G.H. & Wang, K. 1998. Residential Appraisal and the Lending Process: A Survey of Issues. The Journal of Real Estate Research, 15(1/2), pp. 11-39, JSTOR, <http://www.jstor.org.focus.lib.kth.se/stable/24886869>.

Levantesi, S. & Piscopo, G. 2020. The Importance of Economic Variables on London Real Estate Market: A Random Forest Approach. Risks, 8(4), p. 112.


Magerman, D.M. 1995. Statistical decision-tree models for parsing. arXiv preprint cmp-lg/9504030.

Marcin, T.C. 1993. A Characteristics Model Approach to Demand Analysis for Wood Composites.

Masías, V.H., Valle, M.A., Crespo, F., Crespo, R., Vargas, A. & Laengle, S. 2016. Property valuation using machine learning algorithms: A study in a Metropolitan-Area of Chile. In Selection at the AMSE Conferences-2016, pp. 97.

Murdoch, J.C., Singh, H. & Thayer, M. 1993. The impact of natural hazards on housing values: the Loma Prieta earthquake. Real Estate Economics, 21(2), pp. 167-184.

Ngiam, K.Y. & Khor, W. 2019. Big data and machine learning algorithms for health-care delivery. The Lancet Oncology, 20(5), pp. 262-273.

Owusu-Ansah, A. 2011. A review of hedonic pricing models in housing research. Journal of International Real Estate and Construction Studies, 1(1), pp. 19.

OECD. 2021. Gross domestic product (GDP) (indicator). DOI 10.1787/dc2f7aec-en.

Pagourtzi, E., Assimakopoulos, V., Hatzichristos, T. & French, N. 2003. Real estate appraisal: a review of valuation methods. Journal of Property Investment & Finance, 21(4), pp. 383-401, DOI 10.1108/14635780310483656.

Paper, C. & Mas, H. 2016. Property Valuation using Machine Learning Algorithms: A Study in a Metropolitan-Area of Chile. International Conference on Modeling and Simulation.

Renigier-Biłozor, M. & Wiśniewski, R. 2012. The impact of macroeconomic factors on residential property prices indices in Europe. Aestimum, pp. 149-66.

RICS. 2019. RICS Valuation - Global Standards. The Royal Institution of Chartered Surveyors (RICS), London, UK.

Roelofs, R., Shankar, V., Recht, B., Fridovich-Keil, S., Hardt, M., Miller, J. et al. 2019. A meta-analysis of overfitting in machine learning. Advances in Neural Information Processing Systems, 32, pp. 9179-9189.

Rosen, S. 1974. Hedonic prices and implicit markets: product differentiation in pure competition. Journal of political economy, 82(1), pp.34-55.

Simon, A., Deo, M.S., Venkatesan, S. & Babu, D.R. 2016. An overview of machine learning and its applications. International Journal of Electrical Sciences & Engineering, 1(1), pp.22-24.

Singh, Y., Bhatia, P.K. & Sangwan, O. 2007. A review of studies on machine learning techniques. International Journal of Computer Science and Security, 1(1), pp.70-84.

Song, H.S. & Wilhelmsson, M. 2010. Improved price index for condominiums. Journal of Property Research, 27(1), pp. 39-60.

Thanh Noi, P. & Kappas, M. 2018. Comparison of random forest, k-nearest neighbor, and support vector machine classifiers for land cover classification using Sentinel-2 imagery. Sensors, 18(1), pp. 18.

Trawiński, B., Telec, Z., Krasnoborski, J., Piwowarczyk, M., Talaga, M., Lasota, T. et al. 2017. Comparison of expert algorithms with machine learning models for real estate appraisal. IEEE, pp. 51-54.

Tsay, R.S. & Chen, R. 2018. Nonlinear time series analysis. 891. John Wiley & Sons.

Wing, C.K. & Chin, T. 2003. A Critical Review of Literature on the Hedonic Price Model. International Journal for Housing Science and Its Applications, 27, pp.145-165.

Yacim, J.A. & Boshoff, D. 2014. Mass appraisal of properties. pp.15-19.

Yilmazer, S. & Kocaman, S. 2020. A mass appraisal assessment study using machine learning based on multiple regression and random forest. Land Use Policy, 99, p.104889.

Zhou, G., Ji, Y., Chen, X. & Zhang, F. 2018. Artificial Neural Networks and the Mass Appraisal of Real Estate. International Journal of Online Engineering, 14(3).

Źróbek, S., Kucharska-Stasiak, E., Trojanek, M., Adamiczka, J., Budzyński, T., Cellmer, R., Dąbrowski, J., Jasińska, E., Preweda, E. & Sajnóg, N. 2014. Current problems of valuation and real estate management by value. Croatian Information Technology Society, GIS Forum.

Zurada, J.M., Levitan, A.S. & Guan, J. 2006. Non-conventional approaches to property value assessment. Journal of Applied Business Research (JABR), 22(3).

Online sources:

CFI n.d., What is Bagging (Bootstrap Aggregation)? Viewed 08 February 2021, <https://corporatefinanceinstitute.com/resources/knowledge/other/bagging-bootstrap-aggregation/>

SCB 2021, Population in the country, counties and municipalities on December 31, 2020 and population change in January โ€“ December 2020. Total, viewed 01 March 2021, <http://www.scb.se/be0101-en>.

Statistics Sweden 2019, 'Dwelling Stock 2019-12-31', viewed 02 March 2021, <http://www.scb.se/bo0104-en>.


TRITA-ABE-MBT-21419