Predicting House Prices Using Machine Learning Basics

A data-driven project studying residential homes in Ames, Iowa

Zhenli Jin
4 min read · Jul 4, 2021
Photo by Dhruv Mehra on Unsplash

Introduction

When people picture their dream house, the hardest part is often not the living area or whether there is a pool, but the price. An accurate estimate of a house's price is therefore valuable: it gives prospective home buyers a reference point to compare against the list price before they make their decision.

For this reason, we use the Ames, Iowa housing dataset compiled by Dean De Cock to study house prices and the factors that contribute to them.

The original data was split into a training set and a test set. The training set contains 1460 rows and 81 columns, and the test set contains 1459 rows and 80 columns. The column missing from the test set, SalePrice, is what we want to predict.

Business Question

We start by studying two business questions.

1. What story can we tell from our target variable SalePrice?

Figure 1: Histogram of Sale Prices

Unlike the everyday examples of the normal distribution, the sale prices do not appear to be normally distributed: the histogram is right skewed. This is in line with our expectation that the more expensive a house is, the fewer people can afford it. We therefore expect some outliers at the high end.
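One way to put a number on "right skewed" is the sample skewness, which is positive when the right tail is longer. The sketch below is only an illustration: it uses synthetic log-normal prices rather than the Ames data, and the helper name `sample_skewness` is our own.

```python
import math
import random


def sample_skewness(values):
    """Biased Fisher-Pearson skewness: third central moment / variance**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Synthetic, made-up prices (NOT the Ames data): log-normal samples are right skewed.
rng = random.Random(0)
prices = [math.exp(rng.gauss(12.0, 0.4)) for _ in range(2000)]

skew = sample_skewness(prices)
print(f"skewness = {skew:.2f}")  # a positive value indicates a long right tail
```

A value near zero would suggest a roughly symmetric distribution; the further above zero, the heavier the right tail.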

Figure 2: Boxplot of Sale Prices

A boxplot displays the distribution of the data using a five-number summary: minimum, first quartile, median, third quartile, and maximum. In the boxplot above, we notice several outliers, drawn as small diamonds.
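To make the boxplot's ingredients concrete, here is a minimal sketch that computes the five-number summary and flags outliers. It assumes the common 1.5 × IQR rule for marking points beyond the whiskers; the prices are made up and the function names are ours.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up sale prices in thousands of dollars (NOT the Ames data).
prices = [100, 120, 130, 140, 150, 160, 170, 180, 200, 750]

print(five_number_summary(prices))
print(iqr_outliers(prices))
```

Here only the 750 entry falls outside the fences, which is the kind of point a boxplot would draw as a separate marker.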

2. What features do people care about most when they buy a house?

We know that price matters most when people decide whether to purchase a house. Now we are interested in which features are most strongly related to the sale price. We will focus on the numerical variables in this part.

Figure 3: Heatmap of Correlations

Correlation describes a statistical relationship between two variables. In other words, we are interested in whether there is a linear trend between them. From the above heatmap of correlations with the sale price, we notice that the predictor GrLivArea has the highest correlation, 0.71, which makes sense because it is the above-ground living area in square feet. People do care about the living area when buying a house, and bigger houses are usually more expensive.
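For readers who want to see what a single cell of such a heatmap measures, the following sketch computes the Pearson correlation coefficient by hand. The living areas and prices are invented for illustration and are not rows from the dataset; `pearson_r` is our own helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values (NOT the Ames data): living area in square feet, price in dollars.
living_area = [850, 1200, 1500, 1710, 2100, 2600]
sale_price = [105000, 140000, 181000, 208500, 250000, 320000]

r = pearson_r(living_area, sale_price)
print(f"r = {r:.3f}")
```

A value close to 1 means the points lie near an upward-sloping line, close to -1 a downward-sloping line, and close to 0 no linear trend.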

Figure 4: Scatterplot of Sale Prices vs. Living Area Above Ground

The figure on the left side shows that there is a strong linear trend between the living area above ground and the sale price.

In addition, we notice two outliers at the bottom right. It is unusual for such a big house to have such a low price.

Modeling

Now we turn to the question: how accurately can we predict house sale prices? We will build a basic multiple linear regression model to predict the sale prices. Note that in a data analysis process, data cleansing and pre-processing usually take the most time.

We first handle the missing values and remove potential outliers. Then we combine some related features into a single new variable. In addition, we consider whether polynomial terms are needed in the regression.
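The pre-processing steps above can be sketched roughly as follows. This is an illustration under our own assumptions, not the exact procedure from the notebook: the median fill, the combined `TotalSF` feature, and the squared term are hypothetical choices, and the rows are made up.

```python
import statistics

# Made-up rows (NOT the Ames data); None marks a missing value.
rows = [
    {"LotFrontage": 65.0, "TotalBsmtSF": 856, "1stFlrSF": 856, "2ndFlrSF": 854},
    {"LotFrontage": None, "TotalBsmtSF": 1262, "1stFlrSF": 1262, "2ndFlrSF": 0},
    {"LotFrontage": 68.0, "TotalBsmtSF": 920, "1stFlrSF": 920, "2ndFlrSF": 866},
]


def fill_missing_with_median(rows, column):
    """Replace None in `column` with the median of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    median = statistics.median(observed)
    return [{**r, column: median if r[column] is None else r[column]} for r in rows]


def add_engineered_features(rows):
    """Add a combined area feature and its square (a simple polynomial term)."""
    out = []
    for r in rows:
        total_sf = r["TotalBsmtSF"] + r["1stFlrSF"] + r["2ndFlrSF"]  # hypothetical feature
        out.append({**r, "TotalSF": total_sf, "TotalSF_sq": total_sf ** 2})
    return out


cleaned = fill_missing_with_median(rows, "LotFrontage")
features = add_engineered_features(cleaned)
print(features[1])
```

Whether the squared term actually helps is something to check against held-out data rather than assume.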

Meanwhile, we deal with the skewness of the response variable by taking the log of SalePrice.
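A small sketch of what the log transform does, using made-up prices: it pulls in the long right tail, and predictions made on the log scale have to be mapped back with the exponential before they are reported as prices. The `tail_ratio` helper is our own rough measure of asymmetry, not a standard statistic.

```python
import math
import statistics


def tail_ratio(values):
    """Distance from median to max, relative to distance from min to median."""
    med = statistics.median(values)
    return (max(values) - med) / (med - min(values))


# Made-up sale prices in dollars (NOT the Ames data).
sale_prices = [129500, 163000, 180000, 214000, 755000]

# Transform the target to the log scale before fitting a model.
log_prices = [math.log(p) for p in sale_prices]

# A model trained on log prices predicts log prices; exp maps them back to dollars.
recovered = [math.exp(v) for v in log_prices]

print(f"raw tail ratio: {tail_ratio(sale_prices):.2f}")
print(f"log tail ratio: {tail_ratio(log_prices):.2f}")
```

The ratio shrinks after the transform, which is the sense in which the log makes the target less skewed.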

In the end, we submit a CSV file of predictions on the Kaggle competition page to get our score.

Model Result

Figure 5: Score

Note that this is not a strong score, which reminds us that our model still has limitations and room for improvement.
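For context on how to read the score: to our understanding, the Kaggle House Prices competition evaluates submissions by the root mean squared error between the logarithm of the predicted price and the logarithm of the actual price, so lower is better. The sketch below shows that calculation on made-up numbers; the function name `rmse_log` is ours.

```python
import math


def rmse_log(actual, predicted):
    """Root mean squared error between log(predicted) and log(actual)."""
    squared_errors = [
        (math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices (NOT real predictions from the model in this article).
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]

score = rmse_log(actual, predicted)
print(f"score = {score:.4f}")
```

Because the errors are taken on the log scale, being off by 10% on a cheap house costs about as much as being off by 10% on an expensive one.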

Conclusion

In this article, we studied the house prices in Ames, Iowa and how to predict the price given some related features.

  1. We studied our target variable, SalePrice, and it showed that few people can afford very expensive houses.
  2. We studied the correlations and concluded that living area and garage area are the two features people care about most when buying a house.
  3. We handled the missing data, encoded the data properly, factorized categorical variables, and finally built a multiple linear regression to predict the house prices. See the detailed process in the notebook on GitHub.

For more information about this project, see the link to my GitHub here.
