Machine Learning

>>>Predictor App Link<<<

This section documents how the Condo Price Predictor app was developed, from raw real estate transaction data to a deployed predictive model. It walks through the stages of data analysis, model development, and decision-making that led to the app.

Tool: Streamlit

GitHub Repo

Singapore Condo Price Predictor App

Figure: Machine Learning Model - Streamlit App

The Condo Price Predictor app was built in four steps: data loading and preprocessing, model training, model evaluation and selection, and app deployment.

1. Data Loading and Preprocessing

Import the necessary packages: pandas, sklearn.preprocessing (OneHotEncoder, StandardScaler), sklearn.model_selection (train_test_split), sklearn.compose (ColumnTransformer), sklearn.pipeline (Pipeline), and joblib.
Load the dataset.
Perform basic cleaning (optional, as the data was already cleaned at an earlier stage), i.e., handle missing data and convert “sale date” to datetime.
Extract the “year” and “month” columns from the sale date.
Select the features (predictors) and the target (the value to predict):
1. features = ['Project Name', 'Area (SQFT)', 'Postal Code', 'Year', 'Month']
2. target = 'Transacted Price ($)'
Encoding and Scaling:
1. numeric_features = ['Area (SQFT)', 'Year', 'Month']
2. categorical_features = ['Project Name', 'Postal Code']
Define X and y for machine learning:
1. X = data[features]
2. y = data[target]
Apply ColumnTransformer to handle the categorical and numerical features.
Split the dataset into training and test sets (X_train, X_test, y_train, y_test) with test_size=0.2.
Create the preprocessing and training pipeline.
Save the fitted preprocessor using joblib.
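The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the repo: the ten-row sample, the 'Sale Date' column name, the choice to fit the preprocessor on the training split only, the random_state, and the output file name preprocessor.joblib are all assumptions made for the example.

```python
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up sample standing in for the real dataset (all values are illustrative).
data = pd.DataFrame({
    "Project Name": ["A", "B", "A", "C", "B", "C", "A", "B", "C", "A"],
    "Area (SQFT)": [700, 950, 1200, 640, 1010, 880, 1500, 760, 1100, 930],
    "Postal Code": ["100001", "200002", "100001", "300003", "200002",
                    "300003", "100001", "200002", "300003", "100001"],
    "Sale Date": ["2021-01-15", "2021-03-02", "2021-06-20", "2021-09-11", "2022-01-05",
                  "2022-04-18", "2022-07-30", "2022-10-09", "2023-02-14", "2023-05-27"],
    "Transacted Price ($)": [1.10e6, 1.45e6, 1.90e6, 0.98e6, 1.60e6,
                             1.35e6, 2.40e6, 1.20e6, 1.75e6, 1.50e6],
})

# Basic cleaning: parse the sale date (column name assumed) and drop incomplete rows.
data["Sale Date"] = pd.to_datetime(data["Sale Date"])
data = data.dropna()

# Extract the year and month columns.
data["Year"] = data["Sale Date"].dt.year
data["Month"] = data["Sale Date"].dt.month

features = ["Project Name", "Area (SQFT)", "Postal Code", "Year", "Month"]
target = "Transacted Price ($)"
numeric_features = ["Area (SQFT)", "Year", "Month"]
categorical_features = ["Project Name", "Postal Code"]

X = data[features]
y = data[target]

# Scale the numeric columns and one-hot encode the categorical ones.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ]
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training split only (an assumption), then save for reuse in the app.
preprocessor.fit(X_train)
joblib.dump(preprocessor, "preprocessor.joblib")  # hypothetical file name
```

Setting handle_unknown="ignore" lets the encoder accept a project name or postal code that was not seen during fitting, which matters once users type their own inputs into the app.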

2. Model Training

The following models are foundational in the field of machine learning and offer a range of approaches for tackling regression tasks, from simple linear relationships to complex, non-linear data patterns.

1. Linear Regression - Linear Regression is a fundamental statistical approach for modeling the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the input variables and the target. This model is widely used for prediction and forecasting where the target is continuous and its relationship with the predictors is approximately linear. (James et al., 2013)

2. Random Forest Regressor - Random Forest Regressor is an ensemble learning method. It operates by constructing a multitude of decision trees at training time and outputting the average prediction of the individual trees. This model is effective for complex datasets and helps in reducing overfitting by averaging multiple decision trees. (Breiman, 2001)

3. Gradient Boosting Regressor - Gradient Boosting Regressor is an ensemble technique that builds the model in a stage-wise fashion. Each new tree is fitted to the residuals (errors) of the trees before it, and the trees are combined in an additive model. This makes it a powerful approach for reducing bias, while settings such as the learning rate and tree depth help control variance. (Friedman, 2001)

4. Support Vector Regressor (SVR) - Support Vector Regressor is a type of Support Vector Machine used for regression tasks. It fits a function so that as many data points as possible lie within a user-specified margin (epsilon) of it, and errors smaller than that margin are ignored. SVR can efficiently perform non-linear regression using what is called the kernel trick. (Drucker et al., 1997)
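As a rough sketch of how these four models could be trained side by side, the example below fits each one inside a scikit-learn Pipeline. The synthetic data from make_regression, the StandardScaler stand-in for the real preprocessor, and all hyperparameters are assumptions for illustration and are not the settings used for the app.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data; the real app uses the condo transaction dataset.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The four candidate regressors (hyperparameters are illustrative defaults).
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=0),
    "Support Vector Regressor": SVR(kernel="rbf"),
}

fitted = {}
for name, model in models.items():
    # Each pipeline pairs a preprocessing step with one model.
    pipeline = Pipeline([("scale", StandardScaler()), ("model", model)])
    pipeline.fit(X_train, y_train)
    fitted[name] = pipeline
    # Pipeline.score reports R^2 on the held-out test set for regressors.
    print(f"{name}: test R^2 = {pipeline.score(X_test, y_test):.3f}")
```

Keeping every model behind the same pipeline interface means the evaluation code does not need to know which model it is scoring.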

3. Model Evaluation and Selection

Evaluation

R² is a statistic that indicates the proportion of variance in the dependent variable that is predictable from the independent variables. It provides a measure of how well observed outcomes are replicated by the model. An R² of 1 indicates perfect prediction, 0 indicates that the model does no better than a mean prediction, and a negative value indicates worse than a mean prediction. As quoted from Investopedia, “In finance, an R-squared above 0.7 would generally be seen as showing a high level of correlation, whereas a measure below 0.4 would show a low correlation. This is not a hard rule, however, and will depend on the specific analysis.” (Investopedia, n.d.)

The Standardised Root Mean Square Error (RMSE) is a measure of the average error in the predictions, adjusted by the standard deviation of the dependent variable, making it relative and more comparable across different datasets. The closer this value is to 0, the better the model's predictions are.
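To make the two metrics concrete, here is a small sketch that computes both from their definitions using only the standard library. It assumes the standardised RMSE is the RMSE divided by the population standard deviation of the actual values, which is one common reading of the description above; the prices are made up.

```python
import math
from statistics import mean, pstdev


def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def standardised_rmse(y_true, y_pred):
    """RMSE divided by the standard deviation of the actual values (assumed definition)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(mean(squared_errors))
    return rmse / pstdev(y_true)


# Hypothetical actual and predicted prices, for illustration only.
actual = [1_100_000, 1_450_000, 1_900_000, 980_000]
predicted = [1_150_000, 1_400_000, 1_800_000, 1_050_000]

print(f"R^2: {r_squared(actual, predicted):.3f}")
print(f"Standardised RMSE: {standardised_rmse(actual, predicted):.3f}")
```

Under this definition the two metrics are linked on the same data: the squared standardised RMSE equals 1 minus R², so a model that always predicts the mean scores an R² of 0 and a standardised RMSE of 1.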

Selection

Random Forest yielded the highest R² score and the lowest standardised RMSE, so it was evaluated to be the best model. However, its complexity demands more computing resources (the pickle file is 1.6 GB, while Streamlit provides only 1 GB of storage), so we selected the second-best model, Linear Regression, for deployment in the predictor app.
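The size difference behind this decision can be illustrated with a short sketch that compares the pickled size of the two model types. The synthetic data, the n_estimators value, and the reading of the 1 GB limit as 1024³ bytes are assumptions; the actual sizes depend on the real dataset and settings.

```python
import pickle

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (the real models were trained on the condo dataset).
X, y = make_regression(n_samples=500, n_features=5, noise=5.0, random_state=0)


def pickled_size_bytes(model):
    """Return the number of bytes the fitted model occupies when pickled."""
    return len(pickle.dumps(model))


linear_size = pickled_size_bytes(LinearRegression().fit(X, y))
forest_size = pickled_size_bytes(
    RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
)

# Storage limit from the text, interpreted here as 1 GiB (assumption).
STORAGE_LIMIT_BYTES = 1 * 1024**3

for name, size in [("Linear Regression", linear_size), ("Random Forest", forest_size)]:
    fits = size <= STORAGE_LIMIT_BYTES
    print(f"{name}: {size / 1024:.1f} KiB, fits within limit: {fits}")
```

A linear model stores only its coefficients, whereas a random forest stores every node of every tree, so its file grows with the number of trees and the size of the training data.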

4. App Deployment

The linear regression model, trained on the dataset, is saved as a pickle file. The app is then built and deployed in these steps:

Load the preprocessor and model files.
Code the components of the app, i.e., dropdown selection, input, slider, and predict button (refer to the GitHub repo).
Upload the .py file along with its dependencies to GitHub and link the repository to Streamlit to deploy the app.

References