
Data

This section describes the dataset and how we processed the data for subsequent analyses.


Data Source: Realis - Condo Transaction Data spanning from 1 Jan 2005 to 30 Nov 2023


Tool: Jupyter Notebook


GitHub Repo

Data Description

This is a dataset of condo transactions in Singapore from 2005 to 2023. It is extensive and detailed, covering various aspects of property sales. After removing en-bloc transactions (which may skew the analyses) and properties with 60-year leaseholds, we arrived at the final dataset used for machine learning. Its key characteristics are:

​

  1. Data Volume: The dataset consists of 369,674 rows and 24 columns.

  2. Time Period: It spans the years from 2005 to 2023.

  3. Columns: Key columns include 'Project Name', 'Transacted Price ($)', 'Area (SQFT)', 'Unit Price ($ PSF)', 'Sale Date', 'Address', 'Type of Sale', 'Type of Area', and various others relating to property details and transaction specifics.

  4. Price Statistics: Transacted prices vary widely, with an average of approximately $1.22 million and a standard deviation of $526,050.

  5. Area Statistics: The size of condos (in SQFT) also shows considerable variation, with an average area of about 1,063 SQFT.

  6. Unit Price Statistics: The unit price per square foot (PSF) averages around $1,242, with a standard deviation of $526.

  7. Diversity of Projects and Locations: The dataset includes transactions from 2,514 unique projects and 255,854 unique addresses, reflecting the diversity of the property market in Singapore.

  8. Types of Sales: Transactions are categorised into 'Resale', 'New Sale', and 'Sub Sale'.

  9. Planning Regions: The transactions are spread across various planning regions including Central, North East, East, West, and North regions of Singapore.

  10. Postal Districts: The dataset encompasses 27 unique postal districts, indicating a wide geographical coverage.

 

This dataset provides a comprehensive overview of the condo market in Singapore, valuable for buyers, sellers, and researchers interested in real estate trends and valuations over an 18-year period.

The following steps were taken for data cleaning and processing:

1.     Importing Libraries and Reading the File:

  • Necessary libraries like NumPy and pandas are imported.

  • The dataset is read using pandas.
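A minimal sketch of this step, assuming the export is a CSV; the file name below is a placeholder, not the actual file from the repo:

import numpy as np
import pandas as pd

# Placeholder file name; substitute the actual Realis export.
df = pd.read_csv("realis_condo_2005_2023.csv")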

2.     Displaying Basic Information:

  • The info() method is used to display basic information about the dataset, such as the number of entries, column names, and data types.

3.     Sampling the Data:

  • A sample of 5 rows from the dataset is displayed using sample(5) to get a quick overview of the data.

4.     Checking Data Types:

  • The data types of each column in the dataset are checked using dtypes.
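The three inspection steps above (2 to 4) map directly onto standard pandas calls; a sketch, continuing with the df read in earlier:

df.info()      # number of entries, column names, dtypes
df.sample(5)   # five random rows for a quick look
df.dtypes      # data type of each column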

5.     Filtering and Dropping Rows and Columns:

  • Rows where "Project Name" is "N.A." are removed.

  • Rows where "Number of Units" is greater than 1 are removed.

  • The "Number of Units" column is dropped from the dataset.

  • Rows with null values in "Tenure" or "Postal Code" are removed.
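A sketch of the filtering logic, using the column names listed in the data description:

# Drop unnamed projects and multi-unit (bulk) transactions.
df = df[df["Project Name"] != "N.A."]
df = df[df["Number of Units"] <= 1]
df = df.drop(columns=["Number of Units"])

# Drop rows missing the fields needed in later steps.
df = df.dropna(subset=["Tenure", "Postal Code"])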

6.     Checking for Null Values:

  • The dataset is checked for null values in each column using isna().sum().

7.     Removing Duplicate Rows:

  • Duplicate rows in the dataset are identified and removed, keeping only the first occurrence.
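Both checks are one-liners in pandas:

df.isna().sum()                        # null count per column
df = df.drop_duplicates(keep="first")  # drop exact duplicate rows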

8.     Identifying and Counting Placeholder Values:

  • The presence of placeholder values (specifically the "-" character) in each column is checked and counted.

  • The results are sorted to identify which columns have the most occurrences of "-" (acting as a placeholder for missing or irrelevant data).

9.     Removing Rows with Placeholder Values:

  • Rows with dashes in any column are removed from the dataset.
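A sketch of the placeholder scan and removal, covering steps 8 and 9:

# Count "-" placeholders per column, most affected columns first.
(df == "-").sum().sort_values(ascending=False)

# Drop any row containing a "-" in any column.
df = df[~(df == "-").any(axis=1)]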

10.  Converting Data Types and Removing Commas:

  • Specific columns, such as "Transacted Price ($)", "Area (SQFT)", "Unit Price ($ PSF)", "Area (SQM)", and "Unit Price ($ PSM)", are identified for conversion to numeric types.

  • Commas in these columns are removed, and the columns are converted to float data types.
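A sketch of the numeric conversion; the comma removal is needed because values such as "1,250,000" were read in as strings:

num_cols = ["Transacted Price ($)", "Area (SQFT)", "Unit Price ($ PSF)",
            "Area (SQM)", "Unit Price ($ PSM)"]
for col in num_cols:
    df[col] = df[col].astype(str).str.replace(",", "").astype(float)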

11.  Formatting Postal Codes:

  • The 'Postal Code' column is converted to a string format.

  • Postal codes are formatted to ensure a 6-digit standard, including leading zeros if necessary.

  • This is done using a lambda function that checks for non-null values and formats them accordingly.
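A sketch of the formatting step, assuming pandas read the codes in as numbers and dropped their leading zeros:

# Left-pad to Singapore's 6-digit standard, e.g. 18956 -> "018956".
# Null codes were already dropped in step 5, but guard anyway.
df["Postal Code"] = df["Postal Code"].apply(
    lambda x: str(int(x)).zfill(6) if pd.notna(x) else x
)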

12.  Converting 'Sale Date' to Datetime:

  • The 'Sale Date' column, initially in string format, is converted to the datetime data type.

  • The conversion uses the to_datetime function from pandas, with error handling set to 'coerce', which replaces any invalid parsing with NaT (not a time).
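The conversion in pandas:

df["Sale Date"] = pd.to_datetime(df["Sale Date"], errors="coerce")
df["Sale Date"].isna().sum()   # rows whose dates failed to parse (NaT)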

13.  Categorising Tenure:

  • A custom function categorize_tenure is defined to categorise the 'Tenure' field into 'Freehold' (including very long leases of 800 - 1000 years) and 'Leasehold' (80 - 150 years).

  • The function is applied to the 'Tenure' column, creating a new column 'Tenure_cat'.
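A sketch of categorize_tenure, assuming Realis-style tenure strings such as "99 yrs lease commencing from 1995" and "999 yrs lease commencing from 1885":

def categorize_tenure(tenure):
    # Treat very long leases (800 - 1000 yrs) as effectively freehold.
    if tenure == "Freehold":
        return "Freehold"
    years = int(tenure.split(" ")[0])   # "99 yrs lease ..." -> 99
    return "Freehold" if years >= 800 else "Leasehold"

df["Tenure_cat"] = df["Tenure"].apply(categorize_tenure)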

14.  Removing Specific Tenure Rows:

  • Rows where 'Tenure' starts with "60 yrs" are identified and removed from the dataset.

15.  Modifying 'Tenure' Column:

  • A custom function modify_tenure is used to modify the 'Tenure' column.

  • This function extracts the last four characters of the tenure value if it is not 'Freehold'.

  • The 'Tenure' column is renamed to 'Tenure_start'.

  • The extracted four-character values in 'Tenure_start' (the lease start years) are converted to a numeric type, while 'Freehold' entries are kept as is.
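A sketch covering steps 14 and 15, under the same assumption about the tenure string layout:

# Step 14: drop the 60-year leaseholds excluded from this study.
df = df[~df["Tenure"].str.startswith("60 yrs")]

# Step 15: reduce each tenure to its lease start year.
def modify_tenure(tenure):
    # "99 yrs lease commencing from 1995" -> "1995"; "Freehold" unchanged
    return tenure if tenure == "Freehold" else tenure[-4:]

df["Tenure"] = df["Tenure"].apply(modify_tenure)
df = df.rename(columns={"Tenure": "Tenure_start"})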

16.  Extracting Year from 'Sale Date':

  • With 'Sale Date' already in datetime format from step 12, the year is extracted for further analysis.
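This is a single accessor call; 'Sale Year' is an assumed column name:

df["Sale Year"] = df["Sale Date"].dt.year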

17.  Numbering repeated transactions on the same property:

  • We identified properties that were transacted more than once, numbered those transactions in chronological order, and computed the P/L ($) and P/L (%) for each transaction as a measure of profitability (sketched below).

  • Empty/NaN cells in the new P/L columns (first-time sales have no prior price) are replaced with 0.
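A sketch of the profitability computation; the property key ('Address') and the new column names ('Txn No', 'P/L ($)', 'P/L (%)') are assumptions here:

# Order repeat sales of the same address chronologically.
df = df.sort_values(["Address", "Sale Date"])

# Number each transaction within an address: 1, 2, 3, ...
df["Txn No"] = df.groupby("Address").cumcount() + 1

# P/L relative to the previous transaction of the same property.
prev_price = df.groupby("Address")["Transacted Price ($)"].shift(1)
df["P/L ($)"] = df["Transacted Price ($)"] - prev_price
df["P/L (%)"] = df["P/L ($)"] / prev_price * 100

# First sales have no prior price; replace the resulting NaNs with 0.
df[["P/L ($)", "P/L (%)"]] = df[["P/L ($)", "P/L (%)"]].fillna(0)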

18.  Analysis and Visualisation:

  • The dataset undergoes various grouping and analysis steps, including calculating average prices, property age, and profit/loss.

  • Visualisations such as line plots and heatmaps are created using libraries like matplotlib and seaborn.
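A sketch of two such visualisations; the 'Planning Region' column name is an assumption based on the data description:

import matplotlib.pyplot as plt
import seaborn as sns

# Average PSF per year as a line plot.
df.groupby("Sale Year")["Unit Price ($ PSF)"].mean().plot(
    kind="line", marker="o", title="Average PSF by Year"
)
plt.show()

# Heatmap of average PSF by planning region and year.
pivot = df.pivot_table(index="Planning Region", columns="Sale Year",
                       values="Unit Price ($ PSF)", aggfunc="mean")
sns.heatmap(pivot, cmap="viridis")
plt.show()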

Data Head